Hello,
First off, thanks for this really interesting toolset - this is my first time trying it out.
System basics:
OS: Windows 10 Pro
Motherboard: ASUS Pro WS X570-ACE
RAM: 64 GB (DDR4)
GPUs: 2 x ASUS ROG Strix RTX3090-O24G
FaceSwap version: updated today via the FS menu option (06/07/2022)
I followed the Extract guide (i.e., prepare the sources and targets, extract, align, make manual alignment adjustments, prepare for training), then followed the Training guide to set up the Dlight model with batch size = 10, both Initializers enabled, and the Distributed tensor-processing option checked.
Both GPUs were enabled for every FaceSwap operation up to Training, where I disabled one at the start because the docs say at least one of the Initializers is incompatible with multi-GPU setups. Once initialization was complete and training had started on the single GPU, I stopped training, re-enabled the second GPU, and tried to restart training.
Unfortunately, with both GPUs enabled, training simply exits; with either GPU on its own, training runs as expected. Am I missing a basic incompatibility, perhaps?
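To help rule out my TensorFlow install, here is a minimal standalone test I can run outside FaceSwap. It's just a sketch (assumes TensorFlow 2.x; the toy model and data are made up for illustration) that exercises the same MirroredStrategy + hierarchical_copy + mixed_float16 path that shows up in the logs below:

# Minimal multi-GPU sanity check (sketch, TF 2.x assumed): exercises
# MirroredStrategy with hierarchical_copy all-reduce plus mixed_float16,
# independent of FaceSwap. Toy model/data are for illustration only.
import numpy as np
import tensorflow as tf

tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Same cross-device op the failing log reports ("algorithm = hierarchical_copy")
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1, dtype="float32"),  # keep the output in fp32
    ])
    model.compile(optimizer="adam", loss="mse")

# Random data; if this trains across both GPUs, the TF install is likely fine
x = np.random.rand(1024, 32).astype("float32")
y = np.random.rand(1024, 1).astype("float32")
model.fit(x, y, batch_size=64, epochs=2, verbose=2)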
Sample output for a single GPU (training continues):
Loading...
Setting Faceswap backend to NVIDIA
06/07/2022 14:51:41 INFO Log level set to: INFO
06/07/2022 14:51:44 INFO Model A Directory: '<path>\sourceA' (5000 images)
06/07/2022 14:51:44 INFO Model B Directory: '<path>\sourceB' (5324 images)
06/07/2022 14:51:44 INFO Training data directory: <path>\modelAB
06/07/2022 14:51:44 INFO ===================================================
06/07/2022 14:51:44 INFO Starting
06/07/2022 14:51:44 INFO ===================================================
06/07/2022 14:51:45 INFO Loading data, this may take a while...
06/07/2022 14:51:45 INFO Loading Model from Dlight plugin...
06/07/2022 14:51:45 INFO Using configuration saved in state file
06/07/2022 14:51:45 INFO Enabling Mixed Precision Training.
06/07/2022 14:51:45 INFO Mixed precision compatibility check (mixed_float16): OK
Your GPUs will likely run quickly with dtype policy mixed_float16 as they all have compute capability of at least 7.0
06/07/2022 14:51:46 INFO Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
06/07/2022 14:51:46 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:51:46 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:51:46 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:51:46 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:51:46 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:51:46 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:51:47 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:51:47 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:51:47 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:51:47 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:51:48 INFO Loaded model from disk: '<path>\dlight.h5'
06/07/2022 14:51:48 INFO Loading Trainer from Original plugin...
06/07/2022 14:52:21 INFO [Saved models] - Average loss since last save: face_a: 0.08094, face_b: 0.09512
. . .
Sample output for both GPUs (training exits):
Loading...
Setting Faceswap backend to NVIDIA
06/07/2022 14:50:18 INFO Log level set to: INFO
06/07/2022 14:50:20 INFO Model A Directory: '<path>\sourceA' (6000 images)
06/07/2022 14:50:20 INFO Model B Directory: '<path>\sourceB' (5324 images)
06/07/2022 14:50:20 INFO Training data directory: <path>\modelAB
06/07/2022 14:50:20 INFO ===================================================
06/07/2022 14:50:20 INFO Starting
06/07/2022 14:50:20 INFO ===================================================
06/07/2022 14:50:21 INFO Loading data, this may take a while...
06/07/2022 14:50:21 INFO Loading Model from Dlight plugin...
06/07/2022 14:50:21 INFO Using configuration saved in state file
06/07/2022 14:50:21 INFO Enabling Mixed Precision Training.
06/07/2022 14:50:22 INFO Mixed precision compatibility check (mixed_float16): OK
Your GPUs will likely run quickly with dtype policy mixed_float16 as they all have compute capability of at least 7.0
06/07/2022 14:50:23 INFO Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')
06/07/2022 14:50:24 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:50:24 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:50:24 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:50:24 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:50:24 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:50:24 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:50:24 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:50:24 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:50:24 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:50:24 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
06/07/2022 14:50:26 INFO Loaded model from disk: '<path>\dlight.h5'
06/07/2022 14:50:26 INFO Loading Trainer from Original plugin...
06/07/2022 14:50:33 INFO batch_all_reduce: 114 all-reduces with algorithm = hierarchical_copy, num_packs = 1
06/07/2022 14:50:44 INFO batch_all_reduce: 114 all-reduces with algorithm = hierarchical_copy, num_packs = 1
Process exited.
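For reference, a quick way I can confirm TensorFlow sees both cards (standard TF 2.x calls; the device details include compute capability):

import tensorflow as tf

# List every GPU TensorFlow can see, with its name and device details
for gpu in tf.config.list_physical_devices("GPU"):
    print(gpu.name, tf.config.experimental.get_device_details(gpu))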
Thoughts welcome.
- wader