I suppose I earned "expert" the hard way.
Torzdf is correct about the PCIe lanes.
The cards need to communicate with each other, so performance is limited by the slowest card.
On typical mainboards the first PCIe slot will run at 16x if the others are not populated.
If the 'first' 2 are populated, both cards will communicate at 8x usually, and this is just fine.
After that, mainboards begin to vary a lot.
The 3rd and 4th slots may only be hardwired for 4x or 1x.
PCIe 4x is getting a bit slow for distributed training.
If one card is running at 1x, all of them will be waiting on the slowest card.
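To put rough numbers on that, here is a minimal sketch of the theoretical per-link bandwidth. The function names are just mine for illustration, and real-world throughput will be lower than these figures because of protocol overhead. It only uses the published signalling rates and encodings for each PCIe generation.

```python
# Approximate theoretical PCIe bandwidth per lane, one direction, in MB/s.
# Gen 1 and 2 use 8b/10b encoding; Gen 3 and 4 use 128b/130b.
PER_LANE_MBPS = {
    1: 2500 * 8 / 10 / 8,      # 2.5 GT/s -> 250 MB/s
    2: 5000 * 8 / 10 / 8,      # 5 GT/s   -> 500 MB/s
    3: 8000 * 128 / 130 / 8,   # 8 GT/s   -> ~985 MB/s
    4: 16000 * 128 / 130 / 8,  # 16 GT/s  -> ~1969 MB/s
}


def link_bandwidth_mbps(gen, lanes):
    """Theoretical one-way bandwidth of a single card's link."""
    return PER_LANE_MBPS[gen] * lanes


def bottleneck_mbps(links):
    """Every card waits on the slowest link, so take the minimum.

    `links` is a list of (generation, lane_count) tuples, one per card.
    """
    return min(link_bandwidth_mbps(gen, lanes) for gen, lanes in links)


# Two cards sharing the first two slots at Gen 3 8x each.
print(bottleneck_mbps([(3, 8), (3, 8)]))
# One card at Gen 3 16x, one on a Gen 1 1x riser.
print(bottleneck_mbps([(3, 16), (1, 1)]))
```

On paper a Gen 3 16x link is roughly 63 times faster than a Gen 1 1x link, which is why one slow slot drags the whole setup down.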
For funzies, I tried a 2070 on PCIe Gen 3 @ 16x plus a 1060 on PCIe Gen 1 @ 1x (via a PCIe riser card, like the ones used in bitcoin mining rigs).
Training took about 3 hours to start, and was slower than the 1060 training solo.
Once it started, I think I was getting 1 iteration every 2-3 seconds.
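If you want to see what your own cards actually negotiated, nvidia-smi can report the current link generation and width. Below is a small sketch that wraps that query. The helper names are mine, it assumes nvidia-smi is on your PATH, and it has only been written against the `csv,noheader` output format.

```python
import subprocess

# Query fields supported by `nvidia-smi --query-gpu`.
QUERY = "name,pcie.link.gen.current,pcie.link.width.current"


def parse_links(csv_text):
    """Parse `csv,noheader` output into one dict per GPU."""
    links = []
    for line in csv_text.strip().splitlines():
        # Split from the right so a comma in the card name cannot break parsing.
        name, gen, width = [field.strip() for field in line.rsplit(",", 2)]
        links.append({"name": name, "gen": int(gen), "width": int(width)})
    return links


def query_links():
    """Run nvidia-smi and return the parsed link info for every GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_links(result.stdout)
```

Call `query_links()` while the cards are busy if you can, since an idle card may report a lower link generation than it uses under load.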