2019-08-29 16:19:54.889887: E tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 12.32G (13231885056 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-08-29 16:19:55.732980: E tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 11.09G (11908696064 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-08-29 16:19:56.563352: E tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 9.98G (10717826048 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
The machine has an NVIDIA Tesla V100 (16 GB), so I'm not sure why it's failing to allocate. Any help would be appreciated.
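From what I understand, those back-off allocations are TensorFlow pre-grabbing most of the card up front rather than the model actually needing 12 GB. In a plain TF 1.x/Keras script you can ask it to grow the allocation on demand instead; this is just the standard TensorFlow session config, not a faceswap-specific setting, so treat it as a sketch of the general idea:

# Standard TensorFlow 1.x / Keras session setup (illustrative only, not a faceswap option):
# let TensorFlow allocate GPU memory as it is needed instead of reserving it all up front.
import tensorflow as tf
from keras import backend as K

config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True                      # grow the allocation on demand
# config.gpu_options.per_process_gpu_memory_fraction = 0.9  # or cap it at ~90% of the card
K.set_session(tf.compat.v1.Session(config=config))          # hand the session to Keras before building the model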
Unfortunately I don't think it even gets to the point of generating a crash report, since Python itself crashes (unless I'm looking in the wrong spot? Should it be in the install directory, e.g. C:\Users\[user]\faceswap?).
Here's a Windows event log entry for the Python crash:
08/29/2019 16:19:49 MainProcess MainThread logger log_setup INFO Log level set to: INFO
08/29/2019 16:19:51 MainProcess MainThread train get_images INFO Model A Directory: C:\Users\[user]\Desktop\faceswap\faces\bjm
08/29/2019 16:19:51 MainProcess MainThread train get_images INFO Model B Directory: C:\Users\[user]\Desktop\faceswap\faces\jrt
08/29/2019 16:19:51 MainProcess MainThread train process INFO Training data directory: C:\Users\[user]\Desktop\faceswap\models
08/29/2019 16:19:51 MainProcess MainThread train monitor INFO ===================================================
08/29/2019 16:19:51 MainProcess MainThread train monitor INFO Starting
08/29/2019 16:19:51 MainProcess MainThread train monitor INFO Press 'Terminate' to save and quit
08/29/2019 16:19:51 MainProcess MainThread train monitor INFO ===================================================
08/29/2019 16:19:52 MainProcess training_0 train training INFO Loading data, this may take a while...
08/29/2019 16:19:52 MainProcess training_0 plugin_loader _import INFO Loading Model from Original plugin...
08/29/2019 16:19:52 MainProcess training_0 _base load WARNING No existing state file found. Generating.
08/29/2019 16:19:52 MainProcess training_0 deprecation_wrapper __getattr__ WARNING From C:\Users\[user]\MiniConda3\envs\faceswap\lib\site-packages\keras\backend\tensorflow_backend.py:74: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.\n
08/29/2019 16:19:52 MainProcess training_0 deprecation_wrapper __getattr__ WARNING From C:\Users\[user]\MiniConda3\envs\faceswap\lib\site-packages\keras\backend\tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.\n
08/29/2019 16:19:52 MainProcess training_0 deprecation_wrapper __getattr__ WARNING From C:\Users\[user]\MiniConda3\envs\faceswap\lib\site-packages\keras\backend\tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.\n
08/29/2019 16:19:52 MainProcess training_0 deprecation_wrapper __getattr__ WARNING From C:\Users\[user]\MiniConda3\envs\faceswap\lib\site-packages\keras\backend\tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.\n
08/29/2019 16:19:52 MainProcess training_0 deprecation_wrapper __getattr__ WARNING From C:\Users\[user]\MiniConda3\envs\faceswap\lib\site-packages\keras\backend\tensorflow_backend.py:181: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.\n
I get that, I'm just curious why. I'm using the default settings (Original trainer, batch size 32; I also tried 16). It runs fine on my local PC with a GTX 1070, but I decided to try a cloud instance with a more powerful GPU (and more VRAM).
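One thing I can check before launching training is whether another process is already holding VRAM on the card. Below is a small sketch around nvidia-smi (which ships with the NVIDIA driver), using only its standard query flags; it's just an illustration, nothing faceswap-specific:

# Sketch: report per-GPU memory before starting training (assumes nvidia-smi is on PATH).
import subprocess

result = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=name,memory.total,memory.used,memory.free",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True)
print(result.stdout)  # one CSV line per GPU, e.g. "Tesla V100-SXM2-16GB, 16160 MiB, 305 MiB, 15855 MiB"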
Unfortunately this information is insufficient; we'd really need a full log of the crash, or of the whole run, to diagnose it.
The only thing I can say is that it's probably the system itself: a 4-core CPU (probably only dual-core with hyper-threading) and 16 GB of system RAM (with 14 GB free) is simply not enough to push a V100 in any way, shape or form. In addition, V100 drivers on Windows are not very robust.
For advice on CPUs, please see our hardware guide at https://faceswap.dev/forum/viewtopic.php?f=16&t=10 . I believe your current setup is at fault, but I'd really need a crash log to see why.
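If you want to confirm what the cloud instance is actually giving you, a quick check of core count and free RAM from inside the faceswap environment would help. The sketch below assumes the psutil package is installed in that environment; it's just a convenient way to read the numbers, not a faceswap command:

# Sketch: report physical/logical cores and system RAM (assumes psutil is installed).
import psutil

print("Physical cores:", psutil.cpu_count(logical=False))
print("Logical cores: ", psutil.cpu_count(logical=True))
mem = psutil.virtual_memory()
print("RAM total: %.1f GB, available: %.1f GB" % (mem.total / 2**30, mem.available / 2**30))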