Issue using dual P106-100s when training
Posted: Thu Aug 20, 2020 6:52 am
by ericpan0513
I also had the issue where the window pops up and quickly disappears, and I likewise fixed it by reinstalling Miniconda and Faceswap.
I have some questions about the new multi-GPU function.
After the update, how can we choose how many GPUs to use? Or, if I enable the "distributed" option, does it just use all of the GPUs it detects?
And there's another problem: I enabled the distributed option with two P106-100 GPUs (just to see whether they work with the new multi-GPU strategy). However, one GPU was 100% loaded while the other only reached 6%. What's more, the training speed dropped from 27 EGs/s (1 GPU) to 4 EGs/s (2 GPUs). Do you know what's going on?
Thanks.
Re: Can't use 2 GPUs after latest Faceswap update
Posted: Thu Aug 20, 2020 8:37 am
by abigflea
Were you using those P106-100s before? Those mining cards have an odd internal architecture.
Distributed enables multi-GPU training, and it will use the GPUs you have not excluded (see the sketch at the end of this post).
Are you on Linux or Windows?
What is your GPU setup before and now?
Can you post the crash log?
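On the exclusion point: if you want to limit which GPUs TensorFlow can see at all, the standard approach is the CUDA_VISIBLE_DEVICES environment variable. A minimal sketch, assuming a two-GPU machine (this is a general TensorFlow technique, not a Faceswap-specific option):
Code: Select all
# Illustrative sketch: restrict which GPUs TensorFlow can see.
# CUDA_VISIBLE_DEVICES must be set before TensorFlow is imported/initialised.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # expose only GPU 0; "0,1" exposes both

import tensorflow as tf

# TensorFlow now enumerates only the devices left visible above.
print(tf.config.list_physical_devices("GPU"))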
Re: Can't use 2 GPUs after latest Faceswap update
Posted: Thu Aug 20, 2020 10:23 am
by ericpan0513
Code: Select all
============ System Information ============
encoding: cp950
git_branch: master
git_commits: 0a25dff model.config - Make convert batchsize a user configurable option. 45d6995 bugfix - Extract - VGG Clear Mask - Fix for TF2. baa2866 bugfix - Update Dependencies - Avoid constantly trying to redownload Tensorflow. 9c5568f Bugfix - Models.dfl_h128. f897562 Set minimum python version to 3.7
gpu_cuda: No global version found. Check Conda packages for Conda Cuda
gpu_cudnn: No global version found. Check Conda packages for Conda cuDNN
gpu_devices: GPU_0: P106-100, GPU_1: P106-100
gpu_devices_active: GPU_0, GPU_1
gpu_driver: 432.00
gpu_vram: GPU_0: 6077MB, GPU_1: 6077MB
os_machine: AMD64
os_platform: Windows-10-10.0.18362-SP0
os_release: 10
py_command: C:\Users\Guan Yi\faceswap/faceswap.py gui
py_conda_version: conda 4.8.4
py_implementation: CPython
py_version: 3.7.7
py_virtual_env: True
sys_cores: 4
sys_processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
sys_ram: Total: 16275MB, Available: 10177MB, Used: 6097MB, Free: 10177MB
=============== Pip Packages ===============
============== Conda Packages ==============
# packages in environment at C:\Users\Guan Yi\MiniConda3\envs\faceswap:
#
# Name Version Build Channel
absl-py 0.9.0 pypi_0 pypi
astunparse 1.6.3 pypi_0 pypi
blas 1.0 mkl
ca-certificates 2020.6.24 0
cachetools 4.1.1 pypi_0 pypi
certifi 2020.6.20 py37_0
chardet 3.0.4 pypi_0 pypi
cudatoolkit 10.1.243 h74a9793_0
cudnn 7.6.5 cuda10.1_0
cycler 0.10.0 py37_0
fastcluster 1.1.26 py37h9b59f54_1 conda-forge
ffmpeg 4.3.1 ha925a31_0 conda-forge
ffmpy 0.2.3 pypi_0 pypi
freetype 2.10.2 hd328e21_0
gast 0.3.3 pypi_0 pypi
git 2.23.0 h6bb4b03_0
google-auth 1.20.1 pypi_0 pypi
google-auth-oauthlib 0.4.1 pypi_0 pypi
google-pasta 0.2.0 pypi_0 pypi
grpcio 1.31.0 pypi_0 pypi
h5py 2.10.0 pypi_0 pypi
icc_rt 2019.0.0 h0cc432a_1
icu 58.2 ha925a31_3
idna 2.10 pypi_0 pypi
imageio 2.9.0 py_0
imageio-ffmpeg 0.4.2 py_0 conda-forge
importlib-metadata 1.7.0 pypi_0 pypi
intel-openmp 2020.1 216
joblib 0.16.0 py_0
jpeg 9b hb83a4c4_2
keras-preprocessing 1.1.2 pypi_0 pypi
kiwisolver 1.2.0 py37h74a9793_0
libpng 1.6.37 h2a8f88b_0
libtiff 4.1.0 h56a325e_1
lz4-c 1.9.2 h62dcd97_1
markdown 3.2.2 pypi_0 pypi
matplotlib 3.2.2 0
matplotlib-base 3.2.2 py37h64f37c6_0
mkl 2020.1 216
mkl-service 2.3.0 py37hb782905_0
mkl_fft 1.1.0 py37h45dec08_0
mkl_random 1.1.1 py37h47e9c7a_0
numpy 1.19.1 py37h5510c5b_0
numpy-base 1.19.1 py37ha3acd2a_0
nvidia-ml-py3 7.352.1 pypi_0 pypi
oauthlib 3.1.0 pypi_0 pypi
olefile 0.46 py37_0
opencv-python 4.4.0.42 pypi_0 pypi
openssl 1.1.1g he774522_1
opt-einsum 3.3.0 pypi_0 pypi
pathlib 1.0.1 py37_2
pillow 7.2.0 py37hcc1f983_0
pip 20.2.2 py37_0
protobuf 3.13.0 pypi_0 pypi
psutil 5.7.0 py37he774522_0
pyasn1 0.4.8 pypi_0 pypi
pyasn1-modules 0.2.8 pypi_0 pypi
pyparsing 2.4.7 py_0
pyqt 5.9.2 py37h6538335_2
python 3.7.7 h81c818b_4
python-dateutil 2.8.1 py_0
python_abi 3.7 1_cp37m conda-forge
pywin32 227 py37he774522_1
qt 5.9.7 vc14h73c81de_0
requests 2.24.0 pypi_0 pypi
requests-oauthlib 1.3.0 pypi_0 pypi
rsa 4.6 pypi_0 pypi
scikit-learn 0.23.1 py37h25d0782_0
scipy 1.4.1 pypi_0 pypi
setuptools 49.6.0 py37_0
sip 4.19.8 py37h6538335_0
six 1.15.0 py_0
sqlite 3.32.3 h2a8f88b_0
tensorboard 2.2.2 pypi_0 pypi
tensorboard-plugin-wit 1.7.0 pypi_0 pypi
tensorflow-gpu 2.2.0 pypi_0 pypi
tensorflow-gpu-estimator 2.2.0 pypi_0 pypi
termcolor 1.1.0 pypi_0 pypi
threadpoolctl 2.1.0 pyh5ca1d4c_0
tk 8.6.10 he774522_0
tornado 6.0.4 py37he774522_1
tqdm 4.48.2 py_0
urllib3 1.25.10 pypi_0 pypi
vc 14.1 h0510ff6_4
vs2015_runtime 14.16.27012 hf0eaf9b_3
werkzeug 1.0.1 pypi_0 pypi
wheel 0.34.2 py37_0
wincertstore 0.2 py37_0
wrapt 1.12.1 pypi_0 pypi
xz 5.2.5 h62dcd97_0
zipp 3.1.0 pypi_0 pypi
zlib 1.2.11 h62dcd97_4
zstd 1.4.5 h04227a9_0
================= Configs ==================
--------- .faceswap ---------
backend: nvidia
--------- convert.ini ---------
[color.color_transfer]
clip: True
preserve_paper: True
[color.manual_balance]
colorspace: HSV
balance_1: 0.0
balance_2: 0.0
balance_3: 0.0
contrast: 0.0
brightness: 0.0
[color.match_hist]
threshold: 99.0
[mask.box_blend]
type: gaussian
distance: 11.0
radius: 5.0
passes: 1
[mask.mask_blend]
type: normalized
kernel_size: 3
passes: 4
threshold: 4
erosion: 0.0
[scaling.sharpen]
method: unsharp_mask
amount: 150
radius: 0.3
threshold: 5.0
[writer.ffmpeg]
container: mp4
codec: libx264
crf: 23
preset: medium
tune: none
profile: auto
level: auto
skip_mux: False
[writer.gif]
fps: 25
loop: 0
palettesize: 256
subrectangles: False
[writer.opencv]
format: png
draw_transparent: False
jpg_quality: 75
png_compress_level: 3
[writer.pillow]
format: png
draw_transparent: False
optimize: False
gif_interlace: True
jpg_quality: 75
png_compress_level: 3
tif_compression: tiff_deflate
--------- extract.ini ---------
[global]
allow_growth: False
[align.fan]
batch-size: 12
[detect.cv2_dnn]
confidence: 50
[detect.mtcnn]
minsize: 20
threshold_1: 0.6
threshold_2: 0.7
threshold_3: 0.7
scalefactor: 0.709
batch-size: 8
[detect.s3fd]
confidence: 70
batch-size: 4
[mask.unet_dfl]
batch-size: 8
[mask.vgg_clear]
batch-size: 6
[mask.vgg_obstructed]
batch-size: 2
--------- gui.ini ---------
[global]
fullscreen: False
tab: extract
options_panel_width: 30
console_panel_height: 20
icon_size: 14
font: default
font_size: 9
autosave_last_session: prompt
timeout: 120
auto_load_model_stats: True
--------- train.ini ---------
[global]
coverage: 100.0
mask_type: vgg-obstructed
mask_blur_kernel: 3
mask_threshold: 4
learn_mask: False
penalized_mask_loss: True
loss_function: mae
icnr_init: False
conv_aware_init: False
optimizer: adam
learning_rate: 5e-05
reflect_padding: False
allow_growth: False
mixed_precision: False
convert_batchsize: 16
[model.dfl_h128]
lowmem: False
[model.dfl_sae]
input_size: 256
clipnorm: True
architecture: liae
autoencoder_dims: 0
encoder_dims: 42
decoder_dims: 21
multiscale_decoder: False
[model.dlight]
features: best
details: good
output_size: 384
[model.original]
lowmem: False
[model.realface]
input_size: 64
output_size: 128
dense_nodes: 1536
complexity_encoder: 128
complexity_decoder: 512
[model.unbalanced]
input_size: 128
lowmem: False
clipnorm: True
nodes: 1024
complexity_encoder: 128
complexity_decoder_a: 384
complexity_decoder_b: 512
[model.villain]
lowmem: False
[trainer.original]
preview_images: 14
zoom_amount: 5
rotation_range: 10
shift_range: 5
flip_chance: 50
color_lightness: 30
color_ab: 8
color_clahe_chance: 50
color_clahe_max_size: 4
This is my system info. However, I couldn't find a crash log (or I don't know where it would be).
I'm using Windows 10.
A single P106-100 works fine, even better than a 1060. However, when distributed, it gets worse and one of them shows nearly 0% load (per GPU-Z).
BTW, how can I upload pictures? Can't I just paste them in from the web?
Re: Can't use 2 GPUs after latest Faceswap update
Posted: Thu Aug 20, 2020 10:37 am
by abigflea
I'm currently doing a clean install of Faceswap on a very clean Win10.
Let me see if I can replicate the issue.
The mining cards seem to be fine solo, but they may be problematic and 'use at your own risk'.
I have 2 and will see what happens. I just need to do some testing, including pulling out my current GPU. Give me a bit.
Re: Can't use 2 GPUs after latest Faceswap update
Posted: Thu Aug 20, 2020 4:46 pm
by abigflea
ericpan0513 wrote: ↑Thu Aug 20, 2020 6:52 am
And there's another problem: I enabled the distributed option with two P106-100 GPUs (just to see whether they work with the new multi-GPU strategy). However, one GPU was 100% loaded while the other only reached 6%. What's more, the training speed dropped from 27 EGs/s (1 GPU) to 4 EGs/s (2 GPUs). Do you know what's going on?
Thanks.
I haven't forgotten you, ericpan0513. Pulling GPU cards now and will start testing your situation, which is likely different.
A bit of info I need: do you have any of your GPUs connected through a 1x PCIe extender?
Or are they all plugged directly into your mainboard?
Can you pull up GPU-Z and tell me the reported bus interface and number of shaders? (Mining cards sometimes do odd things here.)
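If GPU-Z is being awkward, a quick cross-check is to read load and temperature straight from NVML. A minimal sketch using the nvidia-ml-py3 package that is already in your Conda list above (purely illustrative; the sample count and interval are arbitrary):
Code: Select all
# Illustrative sketch: poll per-GPU utilisation and temperature via NVML.
import time
import pynvml

pynvml.nvmlInit()
try:
    count = pynvml.nvmlDeviceGetCount()
    for _ in range(10):                                   # ten samples, one second apart
        for idx in range(count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(idx)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):                   # older bindings return bytes
                name = name.decode("utf-8")
            util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            print(f"GPU {idx} ({name}): load {util}%  temp {temp}°C")
        time.sleep(1.0)
finally:
    pynvml.nvmlShutdown()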
Re: Issue using dual P106-100s when training
Posted: Fri Aug 21, 2020 2:45 am
by ericpan0513
I've got one GPU plugged into the mainboard, and the other is connected through a 1x PCIe extender.
I have tried training on each GPU individually (on the mainboard and through the extender), and they both worked fine, so maybe it isn't about how they are connected? I've also moved the one on the extender to different slots, but it still has the same issue.
Here are the two log files for the two GPUs, recorded by GPU-Z.
The weird thing is that although the GPU on the extender shows 100% load, its temperature doesn't go up, which maybe means it isn't actually doing any work? Usually when I train on one GPU, the temperature goes up to 65°C. The CPU also isn't fully loaded, so training isn't falling back to the CPU. I'm just confused.
Hope we can figure it out. Thanks for helping.
Re: Issue using dual P106-100s when training
Posted: Fri Aug 21, 2020 3:31 am
by abigflea
Those logs aren't showing me the number of shaders on the P106. That can be a huge tell that something is amiss.
Maybe post a screenshot of just GPU-Z.
Those 1x PCIe connectors seem to cause some other issues I can't pin down; it's down to how the Nvidia drivers work on Linux and Windows, and I don't think the devs are in the mood to rewrite the Nvidia drivers and TensorFlow from the ground up.
FYI, I did test with my own cards, also P106s. Distributed, they would get about 1.1 EGs/s, which is way 'better' than in FS1. Technically they shouldn't work at all.
Although, just like yours, individually they work just fine! They get 8 EGs/s each on my typical DFL-SAE model.
A single 1070 gets 20 EGs/s.
Current Nvidia 452 drivers, and Faceswap updated as of 12 hours ago.
Anyway, the screenshot of GPU-Z, and maybe the faceswap.log or crash log generated when you start up, could be handy to see what's up. There may still be a chance you can use the new, more efficient FS.
Yes, I tested a lot of different hardware and software configs today.
Re: Issue using dual P106-100s when training
Posted: Fri Aug 21, 2020 3:50 am
by ericpan0513
It's 1280 unified shaders.
The starting log:
Code: Select all
Loading...
Setting Faceswap backend to NVIDIA
08/21/2020 11:47:23 INFO Log level set to: INFO
08/21/2020 11:47:25 INFO Model A Directory: C:\Users\Guan Yi\Desktop\train\Trump_hd\Trump_hd_ex
08/21/2020 11:47:25 INFO Model B Directory: C:\Users\Guan Yi\Desktop\train\Albert_hd\albert_hd_ex
08/21/2020 11:47:25 INFO Training data directory: C:\Users\Guan Yi\Desktop\train\Model1
08/21/2020 11:47:25 INFO ===================================================
08/21/2020 11:47:25 INFO Starting
08/21/2020 11:47:25 INFO Press 'Stop' to save and quit
08/21/2020 11:47:25 INFO ===================================================
08/21/2020 11:47:26 INFO Loading data, this may take a while...
08/21/2020 11:47:26 INFO Loading Model from Dfl_Sae plugin...
08/21/2020 11:47:26 INFO Using configuration saved in state file
08/21/2020 11:47:27 INFO Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')
08/21/2020 11:47:34 INFO Loaded model from disk: 'C:\Users\Guan Yi\Desktop\train\Model1\dfl_sae.h5'
08/21/2020 11:47:34 WARNING Clipnorm has been selected, but is unsupported when using distributed or mixed_precision training, so has been disabled. If you wish to enable clipnorm, then you must disable these options.
08/21/2020 11:47:34 INFO Loading Trainer from Original plugin...
Reading training images (A): 0%| | 0/3707 [00:00<?, ?it/s]
Reading training images (A): 1%| | 43/3707 [00:00<00:09, 371.24it/s]
Reading training images (A): 10%|▉ | 369/3707 [00:00<00:06, 505.41it/s]
Reading training images (A): 13%|█▎ | 474/3707 [00:00<00:06, 534.17it/s]
Reading training images (A): 26%|██▌ | 950/3707 [00:00<00:03, 728.00it/s]
Reading training images (A): 31%|███ | 1157/3707 [00:00<00:02, 903.46it/s]
Reading training images (A): 37%|███▋ | 1364/3707 [00:00<00:02, 1069.82it/s]
Reading training images (A): 43%|████▎ | 1580/3707 [00:00<00:01, 1260.21it/s]
Reading training images (A): 48%|████▊ | 1792/3707 [00:00<00:01, 1405.39it/s]
Reading training images (A): 55%|█████▍ | 2033/3707 [00:01<00:01, 1605.49it/s]
Reading training images (A): 61%|██████ | 2250/3707 [00:01<00:00, 1740.40it/s]
Reading training images (A): 66%|██████▋ | 2465/3707 [00:01<00:00, 1844.67it/s]
Reading training images (A): 73%|███████▎ | 2690/3707 [00:01<00:00, 1948.71it/s]
Reading training images (A): 79%|███████▊ | 2912/3707 [00:01<00:00, 2021.50it/s]
Reading training images (A): 85%|████████▍ | 3137/3707 [00:01<00:00, 2083.43it/s]
Reading training images (A): 91%|█████████ | 3358/3707 [00:01<00:00, 2058.76it/s]
Reading training images (A): 96%|█████████▋| 3573/3707 [00:01<00:00, 1827.56it/s]
Reading training images (B): 0%| | 0/4045 [00:00<?, ?it/s]
Reading training images (B): 1%| | 36/4045 [00:00<00:12, 322.14it/s]
Reading training images (B): 6%|▋ | 254/4045 [00:00<00:08, 432.74it/s]
Reading training images (B): 13%|█▎ | 545/4045 [00:00<00:06, 578.81it/s]
Reading training images (B): 20%|█▉ | 796/4045 [00:00<00:04, 752.34it/s]
Reading training images (B): 24%|██▍ | 964/4045 [00:00<00:03, 898.44it/s]
Reading training images (B): 31%|███ | 1241/4045 [00:00<00:02, 1126.47it/s]
Reading training images (B): 36%|███▌ | 1457/4045 [00:00<00:01, 1314.68it/s]
Reading training images (B): 41%|████ | 1662/4045 [00:00<00:01, 1328.80it/s]
Reading training images (B): 48%|████▊ | 1954/4045 [00:00<00:01, 1587.83it/s]
Reading training images (B): 55%|█████▍ | 2218/4045 [00:01<00:01, 1802.56it/s]
Reading training images (B): 61%|██████ | 2469/4045 [00:01<00:00, 1967.89it/s]
Reading training images (B): 67%|██████▋ | 2706/4045 [00:01<00:00, 1675.41it/s]
Reading training images (B): 77%|███████▋ | 3105/4045 [00:01<00:00, 2002.57it/s]
Reading training images (B): 83%|████████▎ | 3358/4045 [00:01<00:00, 1844.54it/s]
Reading training images (B): 92%|█████████▏| 3711/4045 [00:01<00:00, 2151.79it/s]
Reading training images (B): 98%|█████████▊| 3973/4045 [00:01<00:00, 2107.07it/s]
08/21/2020 11:47:38 INFO Reading alignments from: 'C:\Users\Guan Yi\Desktop\train\Trump_hd\trump_hd_alignments.fsa'
08/21/2020 11:47:38 INFO Reading alignments from: 'C:\Users\Guan Yi\Desktop\train\Albert_hd\albert_hd_alignments.fsa'
08/21/2020 11:47:39 WARNING 169 alignments have been removed as their corresponding faces do not exist in the input folder for side B. Run in verbose mode if you wish to see which alignments have been excluded.
08/21/2020 11:47:40 INFO batch_all_reduce: 78 all-reduces with algorithm = hierarchical_copy, num_packs = 1
08/21/2020 11:47:46 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
08/21/2020 11:47:46 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
08/21/2020 11:47:46 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
08/21/2020 11:47:46 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
08/21/2020 11:47:46 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
08/21/2020 11:47:46 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
08/21/2020 11:47:48 INFO batch_all_reduce: 78 all-reduces with algorithm = hierarchical_copy, num_packs = 1
08/21/2020 11:47:49 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
08/21/2020 11:47:49 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
08/21/2020 11:47:49 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
08/21/2020 11:47:49 INFO Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
08/21/2020 11:48:10 INFO [Saved models] - Average loss since last save: face_a: 0.00428, face_b: 0.00553
Is
Code: Select all
Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')
a problem? They are both under the same replica. Or is that just the way it works?
Re: Issue using dual P106-100s when training
Posted: Fri Aug 21, 2020 8:34 am
by abigflea
Well, I suspect this will work. scratches head
the "MirroredStrategy with devices" is correct. No problem there.
First, update your NVIDIA drivers to the latest version.
I am using this one: https://www.nvidia.com/download/driverR ... 3246/en-us
If that doesn't work, let's go through the typical steps.
Force Update Windows
Follow this to remove any possible conflicts with CUDA, Conda, or Python.
app.php/faqpage?sid=8a113082dbf6d2351b3 ... e7b0b#f1r1
Then install this; it can't hurt. A missing DLL got me the other day.
Microsoft Visual C++ Redistributable for Visual Studio 2015, 2017 and 2019
Then run a fresh copy of the installer:
https://github.com/deepfakes/faceswap/releases/latest/download/faceswap_setup_x64.exe
Everything should be nice and fresh, and we will go from there.
Re: Issue using dual P106-100s when training
Posted: Fri Aug 21, 2020 9:12 am
by ericpan0513
OK, thanks a lot for your help!
I won't be able to use this computer over the weekend, so I will probably try this next week.
If it works, I will reply here again. 
Re: Issue using dual P106-100s when training
Posted: Fri Aug 21, 2020 9:15 am
by abigflea
I'll be happy to hear what happens.
Re: Issue using dual P106-100s when training
Posted: Fri Aug 21, 2020 2:02 pm
by abigflea
It has occurred to me: have you checked the distributed box?
Re: Issue using dual P106-100s when training
Posted: Mon Aug 24, 2020 3:26 am
by ericpan0513
I couldn't update the NVIDIA drivers; I think it's because the P106 is too old or something: "The graphics driver could not find compatible graphics hardware."
I've finished the other steps, but it's still not working. Too bad.
And yes, my situation only happens when I check the distributed box: the training speed goes down (compared to one GPU alone), with one GPU at 100% load but its temperature not rising, while the other sits at 6% load, which is really weird.
Anyway, thanks for helping me. Maybe I shouldn't use P106-100s for multi-GPU training. I've already spent a few weeks trying to fix this.