Issue using dual P106-100s when training

If training is failing to start, and you are not receiving an error message telling you what to do, tell us about it here


Forum rules

Read the FAQs and search the forum before posting a new topic.

This forum is for reporting errors with the Training process. If you want to get tips, or better understand the Training process, then you should look in the Training Discussion forum.

Please mark any answers that fixed your problems so others can find the solutions.

Locked
ericpan0513
Posts: 23
Joined: Wed Jul 22, 2020 3:34 am
Answers: 0
Has thanked: 6 times

Issue using dual P106-100s when training

Post by ericpan0513 »

I also had the issue where the window pops up and quickly disappears, and I also fixed it by reinstalling Miniconda and Faceswap.
I have some questions about the new multi-GPU function.
After the update, how can we choose how many GPUs to use? Or if I enable the "distributed" option, does it just use all of the GPUs it detects?

And there's another problem: I enabled the distributed option with two P106-100 GPUs (just wanted to see whether they work with the new multi-GPU strategy). However, one GPU was 100% loaded while the other only got 6%. What's more, the training speed dropped from 27 EGs/s (1 GPU) to 4 EGs/s (2 GPUs). Do you know what's going on?
Thanks.

abigflea
Posts: 182
Joined: Sat Feb 22, 2020 10:59 pm
Answers: 2
Has thanked: 20 times
Been thanked: 63 times

Re: Can't use 2 GPUs after latest Faceswap update

Post by abigflea »

Were you using those P106-100s before? Those mining cards have an odd internal architecture.

Distributed enables multi-GPU training, and it will use the GPUs you have not excluded.
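
If you want to limit which cards it grabs, one trick (just a rough sketch, not official Faceswap docs) is to hide the spare card from TensorFlow with the CUDA_VISIBLE_DEVICES environment variable before anything CUDA-related loads, for example:

Code: Select all

import os

# Hedged sketch: keep only GPU index 0 visible (indices as reported by nvidia-smi).
# This must be set before TensorFlow is imported or initialised.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf

# TensorFlow (and therefore Faceswap) should now only see the card you left visible.
print(tf.config.list_physical_devices("GPU"))

Setting the variable in the terminal you launch Faceswap from has the same effect. I think newer builds also have an exclude-GPUs setting, but check your own build for the exact name.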

Are you on Linux or Windows?
What was your GPU setup before, and what is it now?
Can you post the crash log?

:o I dunno what I'm doing :shock:
2X RTX 3090 : RTX 3080 : RTX: 2060 : 2x RTX 2080 Super : Ghetto 1060

ericpan0513
Posts: 23
Joined: Wed Jul 22, 2020 3:34 am
Answers: 0
Has thanked: 6 times

Re: Can't use 2 GPUs after latest Faceswap update

Post by ericpan0513 »

Code: Select all

============ System Information ============
encoding:            cp950
git_branch:          master
git_commits:         0a25dff model.config - Make convert batchsize a user configurable option. 45d6995 bugfix - Extract - VGG Clear Mask - Fix for TF2. baa2866 bugfix - Update Dependencies - Avoid constantly trying to redownload Tensorflow. 9c5568f Bugfix - Models.dfl_h128. f897562 Set minimum python version to 3.7
gpu_cuda:            No global version found. Check Conda packages for Conda Cuda
gpu_cudnn:           No global version found. Check Conda packages for Conda cuDNN
gpu_devices:         GPU_0: P106-100, GPU_1: P106-100
gpu_devices_active:  GPU_0, GPU_1
gpu_driver:          432.00
gpu_vram:            GPU_0: 6077MB, GPU_1: 6077MB
os_machine:          AMD64
os_platform:         Windows-10-10.0.18362-SP0
os_release:          10
py_command:          C:\Users\Guan Yi\faceswap/faceswap.py gui
py_conda_version:    conda 4.8.4
py_implementation:   CPython
py_version:          3.7.7
py_virtual_env:      True
sys_cores:           4
sys_processor:       Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
sys_ram:             Total: 16275MB, Available: 10177MB, Used: 6097MB, Free: 10177MB

=============== Pip Packages ===============


============== Conda Packages ==============
# packages in environment at C:\Users\Guan Yi\MiniConda3\envs\faceswap:
#
# Name                    Version                   Build  Channel
absl-py                   0.9.0                    pypi_0    pypi
astunparse                1.6.3                    pypi_0    pypi
blas                      1.0                         mkl  
ca-certificates           2020.6.24                     0
cachetools                4.1.1                    pypi_0    pypi
certifi                   2020.6.20                py37_0
chardet                   3.0.4                    pypi_0    pypi
cudatoolkit               10.1.243             h74a9793_0
cudnn                     7.6.5                cuda10.1_0
cycler                    0.10.0                   py37_0
fastcluster               1.1.26          py37h9b59f54_1    conda-forge
ffmpeg                    4.3.1                ha925a31_0    conda-forge
ffmpy                     0.2.3                    pypi_0    pypi
freetype                  2.10.2               hd328e21_0
gast                      0.3.3                    pypi_0    pypi
git                       2.23.0               h6bb4b03_0
google-auth               1.20.1                   pypi_0    pypi
google-auth-oauthlib      0.4.1                    pypi_0    pypi
google-pasta              0.2.0                    pypi_0    pypi
grpcio                    1.31.0                   pypi_0    pypi
h5py                      2.10.0                   pypi_0    pypi
icc_rt                    2019.0.0             h0cc432a_1
icu                       58.2                 ha925a31_3
idna                      2.10                     pypi_0    pypi
imageio                   2.9.0                      py_0
imageio-ffmpeg            0.4.2                      py_0    conda-forge
importlib-metadata        1.7.0                    pypi_0    pypi
intel-openmp              2020.1                      216
joblib                    0.16.0                     py_0
jpeg                      9b                   hb83a4c4_2
keras-preprocessing       1.1.2                    pypi_0    pypi
kiwisolver                1.2.0           py37h74a9793_0
libpng                    1.6.37               h2a8f88b_0
libtiff                   4.1.0                h56a325e_1
lz4-c                     1.9.2                h62dcd97_1
markdown                  3.2.2                    pypi_0    pypi
matplotlib                3.2.2                         0
matplotlib-base           3.2.2           py37h64f37c6_0
mkl                       2020.1                      216
mkl-service               2.3.0           py37hb782905_0
mkl_fft                   1.1.0           py37h45dec08_0
mkl_random                1.1.1           py37h47e9c7a_0
numpy                     1.19.1          py37h5510c5b_0
numpy-base                1.19.1          py37ha3acd2a_0
nvidia-ml-py3             7.352.1                  pypi_0    pypi
oauthlib                  3.1.0                    pypi_0    pypi
olefile                   0.46                     py37_0
opencv-python             4.4.0.42                 pypi_0    pypi
openssl                   1.1.1g               he774522_1
opt-einsum                3.3.0                    pypi_0    pypi
pathlib                   1.0.1                    py37_2
pillow                    7.2.0           py37hcc1f983_0
pip                       20.2.2                   py37_0
protobuf                  3.13.0                   pypi_0    pypi
psutil                    5.7.0           py37he774522_0
pyasn1                    0.4.8                    pypi_0    pypi
pyasn1-modules            0.2.8                    pypi_0    pypi
pyparsing                 2.4.7                      py_0
pyqt                      5.9.2           py37h6538335_2
python                    3.7.7                h81c818b_4
python-dateutil           2.8.1                      py_0
python_abi                3.7                    1_cp37m    conda-forge
pywin32                   227             py37he774522_1
qt                        5.9.7           vc14h73c81de_0
requests                  2.24.0                   pypi_0    pypi
requests-oauthlib         1.3.0                    pypi_0    pypi
rsa                       4.6                      pypi_0    pypi
scikit-learn              0.23.1          py37h25d0782_0
scipy                     1.4.1                    pypi_0    pypi
setuptools                49.6.0                   py37_0
sip                       4.19.8          py37h6538335_0
six                       1.15.0                     py_0
sqlite                    3.32.3               h2a8f88b_0
tensorboard               2.2.2                    pypi_0    pypi
tensorboard-plugin-wit    1.7.0                    pypi_0    pypi
tensorflow-gpu            2.2.0                    pypi_0    pypi
tensorflow-gpu-estimator  2.2.0                    pypi_0    pypi
termcolor                 1.1.0                    pypi_0    pypi
threadpoolctl             2.1.0             pyh5ca1d4c_0
tk                        8.6.10               he774522_0
tornado                   6.0.4           py37he774522_1
tqdm                      4.48.2                     py_0
urllib3                   1.25.10                  pypi_0    pypi
vc                        14.1                 h0510ff6_4
vs2015_runtime            14.16.27012          hf0eaf9b_3
werkzeug                  1.0.1                    pypi_0    pypi
wheel                     0.34.2                   py37_0
wincertstore              0.2                      py37_0
wrapt                     1.12.1                   pypi_0    pypi
xz                        5.2.5                h62dcd97_0
zipp                      3.1.0                    pypi_0    pypi
zlib                      1.2.11               h62dcd97_4
zstd                      1.4.5                h04227a9_0

================= Configs ==================

--------- .faceswap ---------
backend: nvidia

--------- convert.ini ---------

[color.color_transfer]
clip: True
preserve_paper: True

[color.manual_balance]
colorspace: HSV
balance_1: 0.0
balance_2: 0.0
balance_3: 0.0
contrast: 0.0
brightness: 0.0

[color.match_hist]
threshold: 99.0

[mask.box_blend]
type: gaussian
distance: 11.0
radius: 5.0
passes: 1

[mask.mask_blend]
type: normalized
kernel_size: 3
passes: 4
threshold: 4
erosion: 0.0

[scaling.sharpen]
method: unsharp_mask
amount: 150
radius: 0.3
threshold: 5.0

[writer.ffmpeg]
container: mp4
codec: libx264
crf: 23
preset: medium
tune: none
profile: auto
level: auto
skip_mux: False

[writer.gif]
fps: 25
loop: 0
palettesize: 256
subrectangles: False

[writer.opencv]
format: png
draw_transparent: False
jpg_quality: 75
png_compress_level: 3

[writer.pillow]
format: png
draw_transparent: False
optimize: False
gif_interlace: True
jpg_quality: 75
png_compress_level: 3
tif_compression: tiff_deflate

--------- extract.ini ---------

[global]
allow_growth: False

[align.fan]
batch-size: 12

[detect.cv2_dnn]
confidence: 50

[detect.mtcnn]
minsize: 20
threshold_1: 0.6
threshold_2: 0.7
threshold_3: 0.7
scalefactor: 0.709
batch-size: 8

[detect.s3fd]
confidence: 70
batch-size: 4

[mask.unet_dfl]
batch-size: 8

[mask.vgg_clear]
batch-size: 6

[mask.vgg_obstructed]
batch-size: 2

--------- gui.ini ---------

[global]
fullscreen: False
tab: extract
options_panel_width: 30
console_panel_height: 20
icon_size: 14
font: default
font_size: 9
autosave_last_session: prompt
timeout: 120
auto_load_model_stats: True

--------- train.ini ---------

[global]
coverage: 100.0
mask_type: vgg-obstructed
mask_blur_kernel: 3
mask_threshold: 4
learn_mask: False
penalized_mask_loss: True
loss_function: mae
icnr_init: False
conv_aware_init: False
optimizer: adam
learning_rate: 5e-05
reflect_padding: False
allow_growth: False
mixed_precision: False
convert_batchsize: 16

[model.dfl_h128]
lowmem: False

[model.dfl_sae]
input_size: 256
clipnorm: True
architecture: liae
autoencoder_dims: 0
encoder_dims: 42
decoder_dims: 21
multiscale_decoder: False

[model.dlight]
features: best
details: good
output_size: 384

[model.original]
lowmem: False

[model.realface]
input_size: 64
output_size: 128
dense_nodes: 1536
complexity_encoder: 128
complexity_decoder: 512

[model.unbalanced]
input_size: 128
lowmem: False
clipnorm: True
nodes: 1024
complexity_encoder: 128
complexity_decoder_a: 384
complexity_decoder_b: 512

[model.villain]
lowmem: False

[trainer.original]
preview_images: 14
zoom_amount: 5
rotation_range: 10
shift_range: 5
flip_chance: 50
color_lightness: 30
color_ab: 8
color_clahe_chance: 50
color_clahe_max_size: 4

This is my system info. However, I couldn't find a crash log (or I don't know where to look).
I'm using Windows 10.
A single P106-100 works fine, even better than a 1060. However, when distributed, it gets worse and one of the cards shows nearly 0% load (according to GPU-Z).
By the way, how can I upload pictures? Can't I just paste them from the web?

abigflea
Posts: 182
Joined: Sat Feb 22, 2020 10:59 pm
Answers: 2
Has thanked: 20 times
Been thanked: 63 times

Re: Can't use 2 GPUs after latest Faceswap update

Post by abigflea »

I'm currently doing a clean install of Faceswap on a very clean Windows 10.
Let me see if I can replicate the issue.

The mining cards seem to be fine solo, but they may be problematic otherwise, so 'use at your own risk'.
I have two and will see what happens. I just need to do some testing, including pulling out my current GPU. Give me a bit.

:o I dunno what I'm doing :shock:
2X RTX 3090 : RTX 3080 : RTX: 2060 : 2x RTX 2080 Super : Ghetto 1060

abigflea
Posts: 182
Joined: Sat Feb 22, 2020 10:59 pm
Answers: 2
Has thanked: 20 times
Been thanked: 63 times

Re: Can't use 2 GPUs after latest Faceswap update

Post by abigflea »

ericpan0513 wrote: Thu Aug 20, 2020 6:52 am

And there's another problem: I enabled the distributed option with two P106-100 GPUs (just wanted to see whether they work with the new multi-GPU strategy). However, one GPU was 100% loaded while the other only got 6%. What's more, the training speed dropped from 27 EGs/s (1 GPU) to 4 EGs/s (2 GPUs). Do you know what's going on?
Thanks.

I haven't forgotten you, ericpan0513. I'm pulling GPU cards now and will start testing your situation, which is likely different.
A bit of info I need: do you have any of your GPUs connected through a 1x PCIe extender?
Or are they all plugged directly into your mainboard?
Can you pull up GPU-Z and tell me the reported bus interface and number of shaders? (Mining cards sometimes do odd things here.)
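
If it's easier, you can also pull the bus width and load straight from Python with the nvidia-ml-py3 package that Faceswap installs. Treat this as a rough sketch, and note NVML does not report shader counts, so GPU-Z is still needed for that part:

Code: Select all

import pynvml

# Rough NVML query sketch (nvidia-ml-py3 / pynvml). Shader counts are not
# exposed through NVML, so this only covers the PCIe link and current load.
pynvml.nvmlInit()
for idx in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(idx)
    name = pynvml.nvmlDeviceGetName(handle)
    if isinstance(name, bytes):  # older bindings return bytes
        name = name.decode()
    width = pynvml.nvmlDeviceGetCurrPcieLinkWidth(handle)
    gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(handle)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    print(f"GPU {idx}: {name} - PCIe Gen{gen} x{width} - load {util.gpu}%")
pynvml.nvmlShutdown()

A card that GPU-Z reports as x16 but shows up here as x1 (or the other way around) would be a useful clue.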

:o I dunno what I'm doing :shock:
2X RTX 3090 : RTX 3080 : RTX: 2060 : 2x RTX 2080 Super : Ghetto 1060

ericpan0513
Posts: 23
Joined: Wed Jul 22, 2020 3:34 am
Answers: 0
Has thanked: 6 times

Re: Issue using dual P106-100s when training

Post by ericpan0513 »

I've got one GPU plugged into the mainboard, and the other is connected through a 1x PCIe extender.
I have tried training on each GPU individually (on the mainboard and through the extender), and they both worked fine, so maybe it isn't about how they are connected? I've also moved the one on the extender to different slots, but it still has the same issue.

Here are the two log files for the two GPUs, recorded by GPU-Z.
The weird thing is that although the GPU on the extender shows 100% load, its temperature doesn't go up, which maybe means it isn't actually doing any work? Usually when I train on one GPU, the temperature goes up to 65°C. The CPU isn't fully loaded either, so training isn't falling back to it. I'm just confused.

extender log.txt (90.15 KiB)
mainboard log.txt (88.69 KiB)

Hope we can figure it out. Thanks for helping.

abigflea
Posts: 182
Joined: Sat Feb 22, 2020 10:59 pm
Answers: 2
Has thanked: 20 times
Been thanked: 63 times

Re: Issue using dual P106-100s when training

Post by abigflea »

Those logs don't show me the number of shaders on the P106s. That can be a huge tell that something is amiss.
Maybe post a screenshot of just GPU-Z.

Those 1x PCIe connectors seem to cause some other issues I can't pin down; it comes down to how the Nvidia drivers work on Linux and Windows. I don't think the devs are in the mood to rewrite the Nvidia drivers and TensorFlow from the ground up.
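
For a rough idea of why a 1x riser hurts distributed training in particular: with MirroredStrategy the cards have to swap (all-reduce) their gradients over PCIe every step, and a Gen3 x1 link is roughly 1 GB/s against about 16 GB/s for x16. A back-of-envelope sketch, where the parameter count is an invented example rather than a DFL-SAE measurement:

Code: Select all

# Back-of-envelope gradient-sync time per training step over PCIe.
# The parameter count is an illustrative assumption, not a real model size.
params = 60e6                       # assumed trainable parameters
sync_bytes = params * 4             # float32 gradients exchanged each step

for link, bytes_per_sec in (("PCIe 3.0 x1", 0.985e9), ("PCIe 3.0 x16", 15.75e9)):
    print(f"{link}: ~{sync_bytes / bytes_per_sec * 1000:.0f} ms per sync")

Hundreds of milliseconds of overhead per step on an x1 riser would easily swallow whatever the second card contributes, which would fit the slowdown you're seeing.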

FYI, I did test with my custom cards, also P106s. They would get about 1.1 EGs/s, which is way 'better' than distributed was in FS 1. Technically they shouldn't work at all.
Although, just like you, individually they work just fine! They get 8 EGs/s each on my typical DFL-SAE model.
A single 1070 gets 20 EGs/s.

Current Nvidia 452 drivers, and Faceswap updated as of 12 hours ago.

Anyway, a screenshot of GPU-Z, and maybe the faceswap.log or crash log generated when you start up, would be handy for seeing what's up. There may still be a chance you can use the new, more efficient FS.

Yes, I tested a lot of different hardware and software configs today.

:o I dunno what I'm doing :shock:
2X RTX 3090 : RTX 3080 : RTX: 2060 : 2x RTX 2080 Super : Ghetto 1060

ericpan0513
Posts: 23
Joined: Wed Jul 22, 2020 3:34 am
Answers: 0
Has thanked: 6 times

Re: Issue using dual P106-100s when training

Post by ericpan0513 »

It's 1280 unified shaders.
The starting log:

Code: Select all

Loading...
Setting Faceswap backend to NVIDIA
08/21/2020 11:47:23 INFO     Log level set to: INFO
08/21/2020 11:47:25 INFO     Model A Directory: C:\Users\Guan Yi\Desktop\train\Trump_hd\Trump_hd_ex
08/21/2020 11:47:25 INFO     Model B Directory: C:\Users\Guan Yi\Desktop\train\Albert_hd\albert_hd_ex
08/21/2020 11:47:25 INFO     Training data directory: C:\Users\Guan Yi\Desktop\train\Model1
08/21/2020 11:47:25 INFO     ===================================================
08/21/2020 11:47:25 INFO       Starting
08/21/2020 11:47:25 INFO       Press 'Stop' to save and quit
08/21/2020 11:47:25 INFO     ===================================================
08/21/2020 11:47:26 INFO     Loading data, this may take a while...
08/21/2020 11:47:26 INFO     Loading Model from Dfl_Sae plugin...
08/21/2020 11:47:26 INFO     Using configuration saved in state file
08/21/2020 11:47:27 INFO     Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')
08/21/2020 11:47:34 INFO     Loaded model from disk: 'C:\Users\Guan Yi\Desktop\train\Model1\dfl_sae.h5'
08/21/2020 11:47:34 WARNING  Clipnorm has been selected, but is unsupported when using distributed or mixed_precision training, so has been disabled. If you wish to enable clipnorm, then you must disable these options.
08/21/2020 11:47:34 INFO     Loading Trainer from Original plugin...

Reading training images (A):   0%|          | 0/3707 [00:00<?, ?it/s]
Reading training images (A):   1%|          | 43/3707 [00:00<00:09, 371.24it/s]
Reading training images (A):  10%|▉         | 369/3707 [00:00<00:06, 505.41it/s]
Reading training images (A):  13%|█▎        | 474/3707 [00:00<00:06, 534.17it/s]
Reading training images (A):  26%|██▌       | 950/3707 [00:00<00:03, 728.00it/s]
Reading training images (A):  31%|███       | 1157/3707 [00:00<00:02, 903.46it/s]
Reading training images (A):  37%|███▋      | 1364/3707 [00:00<00:02, 1069.82it/s]
Reading training images (A):  43%|████▎     | 1580/3707 [00:00<00:01, 1260.21it/s]
Reading training images (A):  48%|████▊     | 1792/3707 [00:00<00:01, 1405.39it/s]
Reading training images (A):  55%|█████▍    | 2033/3707 [00:01<00:01, 1605.49it/s]
Reading training images (A):  61%|██████    | 2250/3707 [00:01<00:00, 1740.40it/s]
Reading training images (A):  66%|██████▋   | 2465/3707 [00:01<00:00, 1844.67it/s]
Reading training images (A):  73%|███████▎  | 2690/3707 [00:01<00:00, 1948.71it/s]
Reading training images (A):  79%|███████▊  | 2912/3707 [00:01<00:00, 2021.50it/s]
Reading training images (A):  85%|████████▍ | 3137/3707 [00:01<00:00, 2083.43it/s]
Reading training images (A):  91%|█████████ | 3358/3707 [00:01<00:00, 2058.76it/s]
Reading training images (A):  96%|█████████▋| 3573/3707 [00:01<00:00, 1827.56it/s]


Reading training images (B):   0%|          | 0/4045 [00:00<?, ?it/s]
Reading training images (B):   1%|          | 36/4045 [00:00<00:12, 322.14it/s]
Reading training images (B):   6%|▋         | 254/4045 [00:00<00:08, 432.74it/s]
Reading training images (B):  13%|█▎        | 545/4045 [00:00<00:06, 578.81it/s]
Reading training images (B):  20%|█▉        | 796/4045 [00:00<00:04, 752.34it/s]
Reading training images (B):  24%|██▍       | 964/4045 [00:00<00:03, 898.44it/s]
Reading training images (B):  31%|███       | 1241/4045 [00:00<00:02, 1126.47it/s]
Reading training images (B):  36%|███▌      | 1457/4045 [00:00<00:01, 1314.68it/s]
Reading training images (B):  41%|████      | 1662/4045 [00:00<00:01, 1328.80it/s]
Reading training images (B):  48%|████▊     | 1954/4045 [00:00<00:01, 1587.83it/s]
Reading training images (B):  55%|█████▍    | 2218/4045 [00:01<00:01, 1802.56it/s]
Reading training images (B):  61%|██████    | 2469/4045 [00:01<00:00, 1967.89it/s]
Reading training images (B):  67%|██████▋   | 2706/4045 [00:01<00:00, 1675.41it/s]
Reading training images (B):  77%|███████▋  | 3105/4045 [00:01<00:00, 2002.57it/s]
Reading training images (B):  83%|████████▎ | 3358/4045 [00:01<00:00, 1844.54it/s]
Reading training images (B):  92%|█████████▏| 3711/4045 [00:01<00:00, 2151.79it/s]
Reading training images (B):  98%|█████████▊| 3973/4045 [00:01<00:00, 2107.07it/s]
08/21/2020 11:47:38 INFO     Reading alignments from: 'C:\Users\Guan Yi\Desktop\train\Trump_hd\trump_hd_alignments.fsa'
08/21/2020 11:47:38 INFO     Reading alignments from: 'C:\Users\Guan Yi\Desktop\train\Albert_hd\albert_hd_alignments.fsa'
08/21/2020 11:47:39 WARNING  169 alignments have been removed as their corresponding faces do not exist in the input folder for side B. Run in verbose mode if you wish to see which alignments have been excluded.
08/21/2020 11:47:40 INFO     batch_all_reduce: 78 all-reduces with algorithm = hierarchical_copy, num_packs = 1
08/21/2020 11:47:46 INFO     Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
08/21/2020 11:47:46 INFO     Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
08/21/2020 11:47:46 INFO     Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
08/21/2020 11:47:46 INFO     Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
08/21/2020 11:47:46 INFO     Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
08/21/2020 11:47:46 INFO     Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
08/21/2020 11:47:48 INFO     batch_all_reduce: 78 all-reduces with algorithm = hierarchical_copy, num_packs = 1
08/21/2020 11:47:49 INFO     Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
08/21/2020 11:47:49 INFO     Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
08/21/2020 11:47:49 INFO     Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
08/21/2020 11:47:49 INFO     Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

08/21/2020 11:48:10 INFO     [Saved models] - Average loss since last save: face_a: 0.00428, face_b: 0.00553

Is

Code: Select all

Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')

a problem? They are both under the same replica. Or is that just the way it works?

Attachments
info.png (54.58 KiB)
abigflea
Posts: 182
Joined: Sat Feb 22, 2020 10:59 pm
Answers: 2
Has thanked: 20 times
Been thanked: 63 times

Re: Issue using dual P106-100s when training

Post by abigflea »

Well, I suspect this will work. *scratches head*
The "MirroredStrategy with devices" line is correct. No problem there: the replica:0 in the device path just refers to the single worker process, and both GPUs sit under it as device:GPU:0 and device:GPU:1.
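
For reference, here's roughly what that strategy amounts to in plain TensorFlow 2. This is a minimal sketch with a toy model and assumes two visible GPUs; it is not Faceswap's actual DFL-SAE code:

Code: Select all

import tensorflow as tf

# Minimal sketch of what the log line describes, assuming TensorFlow 2.x
# and two visible GPUs. The tiny Dense model is a stand-in, not DFL-SAE.
strategy = tf.distribute.MirroredStrategy(
    devices=["/gpu:0", "/gpu:1"],
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())

with strategy.scope():
    # Variables built inside the scope are mirrored onto both GPUs; each step
    # the per-GPU gradients are all-reduced before the weights are updated.
    model = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu"),
                                 tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mae")

print("Replicas in sync:", strategy.num_replicas_in_sync)  # expect 2

The hierarchical_copy all-reduces in your log correspond to that cross_device_ops choice, so that part of the output looks normal too.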

Update your Nvidia drivers to the current version first.
I am using this one: https://www.nvidia.com/download/driverR ... 3246/en-us

If that doesn't work, let's go through the typical steps.

  1. Force-update Windows.

  2. Follow this to remove any possible conflicts with CUDA, Conda, or Python:
    app.php/faqpage?sid=8a113082dbf6d2351b3 ... e7b0b#f1r1

  3. Then install this (it can't hurt; a missing DLL got me the other day):
    Microsoft Visual C++ Redistributable for Visual Studio 2015, 2017 and 2019

  4. Then run a fresh copy of the installer:
    https://github.com/deepfakes/faceswap/releases/latest/download/faceswap_setup_x64.exe

Everything should be nice and fresh, and we will go from there.

:o I dunno what I'm doing :shock:
2X RTX 3090 : RTX 3080 : RTX: 2060 : 2x RTX 2080 Super : Ghetto 1060

ericpan0513
Posts: 23
Joined: Wed Jul 22, 2020 3:34 am
Answers: 0
Has thanked: 6 times

Re: Issue using dual P106-100s when training

Post by ericpan0513 »

OK, thanks a lot for your help!
I won't be able to use this computer over the weekend, so I will probably try this next week.
If it works, I will reply here again. :D

abigflea
Posts: 182
Joined: Sat Feb 22, 2020 10:59 pm
Answers: 2
Has thanked: 20 times
Been thanked: 63 times

Re: Issue using dual P106-100s when training

Post by abigflea »

I'll be happy to hear what happens.

:o I dunno what I'm doing :shock:
2X RTX 3090 : RTX 3080 : RTX: 2060 : 2x RTX 2080 Super : Ghetto 1060

abigflea
Posts: 182
Joined: Sat Feb 22, 2020 10:59 pm
Answers: 2
Has thanked: 20 times
Been thanked: 63 times

Re: Issue using dual P106-100s when training

Post by abigflea »

It has occurred to me: have you checked the Distributed box?

:o I dunno what I'm doing :shock:
2X RTX 3090 : RTX 3080 : RTX: 2060 : 2x RTX 2080 Super : Ghetto 1060

ericpan0513
Posts: 23
Joined: Wed Jul 22, 2020 3:34 am
Answers: 0
Has thanked: 6 times

Re: Issue using dual P106-100s when training

Post by ericpan0513 »

abigflea wrote: Fri Aug 21, 2020 8:34 am

Well, I suspect this will work. *scratches head*
The "MirroredStrategy with devices" line is correct. No problem there: the replica:0 in the device path just refers to the single worker process, and both GPUs sit under it as device:GPU:0 and device:GPU:1.

Update your Nvidia drivers to the current version first.
I am using this one: https://www.nvidia.com/download/driverR ... 3246/en-us

If that doesn't work, let's go through the typical steps.

  1. Force-update Windows.

  2. Follow this to remove any possible conflicts with CUDA, Conda, or Python:
    app.php/faqpage?sid=8a113082dbf6d2351b3 ... e7b0b#f1r1

  3. Then install this (it can't hurt; a missing DLL got me the other day):
    Microsoft Visual C++ Redistributable for Visual Studio 2015, 2017 and 2019

  4. Then run a fresh copy of the installer:
    https://github.com/deepfakes/faceswap/releases/latest/download/faceswap_setup_x64.exe

Everything should be nice and fresh, and we will go from there.

I couldn't update the Nvidia drivers; I think it's because the P106 is too old or something: "Graphics driver could not find compatible graphics hardware."
I've finished the other steps, but it's still not working. Too bad. Thanks for helping, though.
And yes, my issue only happens when I check the Distributed box: the training speed goes down (compared to one GPU alone), with one GPU at 100% load but its temperature not going up, while the other sits at 6% load, which is really weird.
Anyway, thanks for helping me. Maybe I shouldn't use P106-100s for multi-GPU training. I've already spent a few weeks trying to fix this. :(

Locked