Hello,
First off, thanks for this really interesting toolset - this is my first time trying it out.
System basics:
OS: Windows 10 Pro
Motherboard: ASUS Pro WS X570-ACE
RAM: 64 GB (DDR4)
GPUs: 2 x ASUS ROG Strix RTX3090-O24G
FaceSwap version: updated today via the FS menu option (06/07/2022)
I followed the Extract guide (i.e., prepare the sources and targets, extract, align, make manual alignment adjustments, prepare for training), then followed the Training guide to set up the Dlight model with batch size = 10, both Initializers enabled, and the Distributed tensor-processing option checked.
Both GPUs were enabled for every FaceSwap operation up to Training, where I disabled one at the start because the docs say at least one of the Initializers is incompatible with multi-GPU setups. Once initialization was complete and training had started on the single GPU, I stopped training, re-enabled the second GPU, and tried to restart training.
Unfortunately, with both GPUs enabled, training simply exits; with either GPU on its own, training runs as expected. Am I missing a basic incompatibility, perhaps?
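To help rule out my TensorFlow install, here is a minimal standalone test I can run outside FaceSwap. It's just a sketch (assumes TensorFlow 2.x; the toy model and data are made up for illustration) that exercises the same MirroredStrategy + hierarchical_copy + mixed_float16 path that shows up in the logs below:

# Minimal multi-GPU sanity check (sketch, TF 2.x assumed): exercises
# MirroredStrategy with hierarchical_copy all-reduce plus mixed_float16,
# independent of FaceSwap. Toy model/data are for illustration only.
import numpy as np
import tensorflow as tf

tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Same cross-device op the failing log reports ("algorithm = hierarchical_copy")
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1, dtype="float32"),  # keep the output in fp32
    ])
    model.compile(optimizer="adam", loss="mse")

# Random data; if this trains across both GPUs, the TF install is likely fine
x = np.random.rand(1024, 32).astype("float32")
y = np.random.rand(1024, 1).astype("float32")
model.fit(x, y, batch_size=64, epochs=2, verbose=2)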
Sample output for a single GPU (training continues):
Loading...
Setting Faceswap backend to NVIDIA
06/07/2022 14:51:41 INFO Log level set to: INFO
06/07/2022 14:51:44 INFO Model A Directory: '<path>\sourceA' (5000 images)
06/07/2022 14:51:44 INFO Model B Directory: '<path>\sourceB' (5324 images)
06/07/2022 14:51:44 INFO Training data directory: <path>\modelAB
06/07/2022 14:51:44 INFO ===================================================
06/07/2022 14:51:44 INFO Starting
06/07/2022 14:51:44 INFO ===================================================
06/07/2022 14:51:45 INFO Loading data, this may take a while...
06/07/2022 14:51:45 INFO Loading Model from Dlight plugin...
06/07/2022 14:51:45 INFO Using configuration saved in state file
06/07/2022 14:51:45 INFO Enabling Mixed Precision Training.
06/07/2022 14:51:45 INFO Mixed precision compatibility check (mixed_float16): OK
Your GPUs will likely run quickly with dtype policy mixed_float16 as they all have compute capability of at least 7.0
06/07/2022 14:51:46 INFO Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
06/07/2022 14:51:46 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:51:46 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:51:46 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:51:46 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:51:46 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:51:46 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:51:47 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:51:47 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:51:47 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:51:47 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:51:48 INFO Loaded model from disk: '<path>\dlight.h5'
06/07/2022 14:51:48 INFO Loading Trainer from Original plugin...
06/07/2022 14:52:21 INFO [Saved models] - Average loss since last save: face_a: 0.08094, face_b: 0.09512
. . .
Sample output for both GPUs (training exits):
Loading...
Setting Faceswap backend to NVIDIA
06/07/2022 14:50:18 INFO Log level set to: INFO
06/07/2022 14:50:20 INFO Model A Directory: '<path>\sourceA' (6000 images)
06/07/2022 14:50:20 INFO Model B Directory: '<path>\sourceB' (5324 images)
06/07/2022 14:50:20 INFO Training data directory: <path>\modelAB
06/07/2022 14:50:20 INFO ===================================================
06/07/2022 14:50:20 INFO Starting
06/07/2022 14:50:20 INFO ===================================================
06/07/2022 14:50:21 INFO Loading data, this may take a while...
06/07/2022 14:50:21 INFO Loading Model from Dlight plugin...
06/07/2022 14:50:21 INFO Using configuration saved in state file
06/07/2022 14:50:21 INFO Enabling Mixed Precision Training.
06/07/2022 14:50:22 INFO Mixed precision compatibility check (mixed_float16): OK
Your GPUs will likely run quickly with dtype policy mixed_float16 as they all have compute capability of at least 7.0
06/07/2022 14:50:23 INFO Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')
06/07/2022 14:50:24 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:50:24 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:50:24 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:50:24 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:50:24 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:50:24 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:50:24 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:50:24 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:50:24 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:50:24 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:50:26 INFO Loaded model from disk: '<path>\dlight.h5'
06/07/2022 14:50:26 INFO Loading Trainer from Original plugin...
06/07/2022 14:50:33 INFO batch_all_reduce: 114 all-reduces with algorithm = hierarchical_copy, num_packs = 1
06/07/2022 14:50:44 INFO batch_all_reduce: 114 all-reduces with algorithm = hierarchical_copy, num_packs = 1
Process exited.
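For reference, a quick way I can confirm TensorFlow sees both cards (standard TF 2.x calls; the device details include compute capability):

import tensorflow as tf

# List every GPU TensorFlow can see, with its name and device details
for gpu in tf.config.list_physical_devices("GPU"):
    print(gpu.name, tf.config.experimental.get_device_details(gpu))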
Thoughts welcome.
- wader