A follow-up, if I may: why does it take so long to start? It stays at this point for quite a long time...
09/30/2020 22:31:50 INFO batch_all_reduce: 102 all-reduces with algorithm = nccl, num_packs = 1
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:31:51 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
09/30/2020 22:32:06 INFO batch_all_reduce: 102 all-reduces with algorithm = nccl, num_packs = 1
Then sometimes it fails, or just stays here and never starts training. Is there really that much overhead in getting two cards to work together compared to just one?
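For context, here is a minimal sketch of the kind of setup that produces a log like the one above. The model is a stand-in (my real one evidently has enough trainable variables to account for the 102 all-reduces); the `NcclAllReduce(num_packs=1)` argument just mirrors the "algorithm = nccl, num_packs = 1" line, and is the default cross-device op on a multi-GPU Linux box anyway:

```python
import tensorflow as tf

# Two local GPUs, synchronous data parallelism. NcclAllReduce with
# num_packs=1 matches the "algorithm = nccl, num_packs = 1" log line.
strategy = tf.distribute.MirroredStrategy(
    devices=["/gpu:0", "/gpu:1"],
    cross_device_ops=tf.distribute.NcclAllReduce(num_packs=1),
)

with strategy.scope():
    # Stand-in model: each trainable variable produces one all-reduce
    # at graph-build time, which is where the count in the log comes from.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
```

If the NCCL initialization/handshake is what is slow here, would switching the cross-device op, e.g. to `tf.distribute.HierarchicalCopyAllReduce()`, be expected to start faster, even at the cost of some per-step throughput?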