GPU not being recognized though it's definitely there (even visible in sysinfo)

Post by **Replicon** » Sat Jun 25, 2022 10:51 pm

Background: I run in the cloud, and my very old image recently stopped working because the s3fd models were moved around. That forced my hand, so I'm trying to set up NVIDIA drivers from scratch on a GCE instance with a T4.

It's actually pretty easy to bring up an instance and install the basic NVIDIA stuff. However, I can't find any global "cuda" libraries anywhere, and if I look for the uninstaller (or run the apt commands to uninstall it), nothing happens.

When I run nvidia-smi, it clearly shows that it sees the T4.

When I get around to running faceswap, it says "Setting Faceswap backend to NVIDIA" which is a good start.

... but when I start training, it outputs the following, and processes incredibly slowly:

Code: Select all

WARNING  Mixed precision compatibility check (mixed_float16): WARNING
The dtype policy mixed_float16 may run slowly because this machine does not have a GPU. Only Nvidia GPUs with compute capability of at least 7.0 run quickly with mixed_float16.
If you will use compatible GPU(s) not attached to this host, e.g. by running a multi-worker model, you can ignore this warning. This message will only be logged once

Here's my sysinfo, since that's what the cool kids are posting:

Note there's no global cuda found, which I think is the thing that most people get tripped up by. Also, in the sysinfo, it has no problem seeing that there IS a GPU, so... help?

Code: Select all

============ System Information ============
encoding:            UTF-8
git_branch:          Not Found
git_commits:         Not Found
gpu_cuda:            No global version found. Check Conda packages for Conda Cuda
gpu_cudnn:           No global version found. Check Conda packages for Conda cuDNN
gpu_devices:         GPU_0: Tesla T4
gpu_devices_active:  GPU_0
gpu_driver:          495.46
gpu_vram:            GPU_0: 15109MB
os_machine:          x86_64
os_platform:         Linux-5.13.0-1033-gcp-x86_64-with-glibc2.31
os_release:          5.13.0-1033-gcp
py_command:          /home/(((redacted_username)))/faceswap/faceswap.py
py_conda_version:    conda 4.12.0
py_implementation:   CPython
py_version:          3.9.12
py_virtual_env:      True
sys_cores:           4
sys_processor:       x86_64
sys_ram:             Total: 14992MB, Available: 14411MB, Used: 280MB, Free: 11950MB

=============== Pip Packages ===============
absl-py==1.1.0
astunparse==1.6.3
cachetools==5.2.0
certifi==2022.6.15
charset-normalizer==2.0.12
cloudpickle==2.1.0
cycler @ file:///tmp/build/80754af9/cycler_1637851556182/work
decorator==5.1.1
dm-tree==0.1.7
fastcluster @ file:///home/conda/feedstock_root/build_artifacts/fastcluster_1649783242764/work
ffmpy==0.2.3
flatbuffers==2.0
fonttools==4.25.0
gast==0.5.3
google-auth==2.8.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
grpcio==1.47.0
h5py==3.7.0
idna==3.3
imageio @ file:///tmp/build/80754af9/imageio_1617700267927/work
imageio-ffmpeg @ file:///home/conda/feedstock_root/build_artifacts/imageio-ffmpeg_1649960641006/work
importlib-metadata==4.12.0
joblib @ file:///tmp/build/80754af9/joblib_1635411271373/work
keras==2.8.0
Keras-Preprocessing==1.1.2
kiwisolver @ file:///opt/conda/conda-bld/kiwisolver_1653292039266/work
libclang==14.0.1
Markdown==3.3.7
matplotlib @ file:///tmp/build/80754af9/matplotlib-suite_1647441664166/work
mkl-fft==1.3.1
mkl-random @ file:///tmp/build/80754af9/mkl_random_1626186066731/work
mkl-service==2.4.0
munkres==1.1.4
numpy @ file:///opt/conda/conda-bld/numpy_and_numpy_base_1652801679809/work
nvidia-ml-py==11.510.69
oauthlib==3.2.0
opencv-python==4.6.0.66
opt-einsum==3.3.0
packaging @ file:///tmp/build/80754af9/packaging_1637314298585/work
Pillow==9.0.1
protobuf==3.19.4
psutil @ file:///tmp/build/80754af9/psutil_1612297992929/work
pyasn1==0.4.8
pyasn1-modules==0.2.8
pyparsing @ file:///tmp/build/80754af9/pyparsing_1635766073266/work
python-dateutil @ file:///tmp/build/80754af9/python-dateutil_1626374649649/work
requests==2.28.0
requests-oauthlib==1.3.1
rsa==4.8
scikit-learn @ file:///tmp/build/80754af9/scikit-learn_1642617106979/work
scipy @ file:///tmp/build/80754af9/scipy_1641555004408/work
sip==4.19.13
six @ file:///tmp/build/80754af9/six_1644875935023/work
tensorboard==2.8.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow-estimator==2.8.0
tensorflow-gpu==2.8.2
tensorflow-io-gcs-filesystem==0.26.0
tensorflow-probability==0.16.0
termcolor==1.1.0
threadpoolctl @ file:///Users/ktietz/demo/mc3/conda-bld/threadpoolctl_1629802263681/work
tornado @ file:///tmp/build/80754af9/tornado_1606942317143/work
tqdm @ file:///opt/conda/conda-bld/tqdm_1650891076910/work
typing_extensions @ file:///opt/conda/conda-bld/typing_extensions_1647553014482/work
urllib3==1.26.9
Werkzeug==2.1.2
wrapt==1.14.1
zipp==3.8.0

============== Conda Packages ==============
# packages in environment at /home/(((redacted_username)))/miniconda3:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main
_openmp_mutex             4.5                       1_gnu
brotlipy                  0.7.0           py39h27cfd23_1003
ca-certificates           2022.3.29            h06a4308_1
certifi                   2021.10.8        py39h06a4308_2
cffi                      1.15.0           py39hd667e15_1
charset-normalizer        2.0.4              pyhd3eb1b0_0
colorama                  0.4.4              pyhd3eb1b0_0
conda                     4.12.0           py39h06a4308_0
conda-content-trust       0.1.1              pyhd3eb1b0_0
conda-package-handling    1.8.1            py39h7f8727e_0
cryptography              36.0.0           py39h9ce1e76_0
idna                      3.3                pyhd3eb1b0_0
ld_impl_linux-64          2.35.1               h7274673_9
libffi                    3.3                  he6710b0_2
libgcc-ng                 9.3.0               h5101ec6_17
libgomp                   9.3.0               h5101ec6_17
libstdcxx-ng              9.3.0               hd4cf53a_17
ncurses                   6.3                  h7f8727e_2
openssl                   1.1.1n               h7f8727e_0
pip                       21.2.4           py39h06a4308_0
pycosat                   0.6.3            py39h27cfd23_0
pycparser                 2.21               pyhd3eb1b0_0
pyopenssl                 22.0.0             pyhd3eb1b0_0
pysocks                   1.7.1            py39h06a4308_0
python                    3.9.12               h12debd9_0
readline                  8.1.2                h7f8727e_1
requests                  2.27.1             pyhd3eb1b0_0
ruamel_yaml               0.15.100         py39h27cfd23_0
setuptools                61.2.0           py39h06a4308_0
six                       1.16.0             pyhd3eb1b0_1
sqlite                    3.38.2               hc218d9a_0
tk                        8.6.11               h1ccaba5_0
tqdm                      4.63.0             pyhd3eb1b0_0
tzdata                    2022a                hda174b7_0
urllib3                   1.26.8             pyhd3eb1b0_0
wheel                     0.37.1             pyhd3eb1b0_0
xz                        5.2.5                h7b6447c_0
yaml                      0.2.5                h7b6447c_0
zlib                      1.2.12               h7f8727e_1

================= Configs ==================
--------- convert.ini ---------

[color.manual_balance]
colorspace:               HSV
balance_1:                0.0
balance_2:                0.0
balance_3:                0.0
contrast:                 0.0
brightness:               0.0

[color.color_transfer]
clip:                     True
preserve_paper:           True

[color.match_hist]
threshold:                99.0

[writer.ffmpeg]
container:                mp4
codec:                    libx264
crf:                      23
preset:                   medium
tune:                     none
profile:                  auto
level:                    auto
skip_mux:                 False

[writer.gif]
fps:                      25
loop:                     0
palettesize:              256
subrectangles:            False

[writer.opencv]
format:                   png
draw_transparent:         False
jpg_quality:              75
png_compress_level:       3

[writer.pillow]
format:                   png
draw_transparent:         False
optimize:                 False
gif_interlace:            True
jpg_quality:              75
png_compress_level:       3
tif_compression:          tiff_deflate

[mask.mask_blend]
type:                     normalized
kernel_size:              3
passes:                   4
threshold:                4
erosion:                  0.0
erosion_top:              0.0
erosion_bottom:           0.0
erosion_left:             0.0
erosion_right:            0.0

[scaling.sharpen]
method:                   gaussian
amount:                   150
radius:                   0.3
threshold:                5.0

--------- train.ini ---------

[global]
centering:                face
coverage:                 87.5
icnr_init:                False
conv_aware_init:          False
optimizer:                adam
learning_rate:            5e-05
epsilon_exponent:         -7
autoclip:                 False
reflect_padding:          False
allow_growth:             False
mixed_precision:          False
nan_protection:           True
convert_batchsize:        16

[global.loss]
loss_function:            ssim
loss_function_2:          mse
loss_weight_2:            100
loss_function_3:          none
loss_weight_3:            0
loss_function_4:          none
loss_weight_4:            0
mask_loss_function:       mse
eye_multiplier:           3
mouth_multiplier:         2
penalized_mask_loss:      True
mask_type:                extended
mask_blur_kernel:         3
mask_threshold:           4
learn_mask:               False

[model.villain]
lowmem:                   False

[model.dlight]
features:                 best
details:                  good
output_size:              256

[model.original]
lowmem:                   False

[model.dfl_sae]
input_size:               128
architecture:             df
autoencoder_dims:         0
encoder_dims:             42
decoder_dims:             21
multiscale_decoder:       False

[model.unbalanced]
input_size:               128
lowmem:                   False
nodes:                    1024
complexity_encoder:       128
complexity_decoder_a:     384
complexity_decoder_b:     512

[model.phaze_a]
output_size:              128
shared_fc:                none
enable_gblock:            True
split_fc:                 True
split_gblock:             False
split_decoders:           False
enc_architecture:         fs_original
enc_scaling:              7
enc_load_weights:         True
bottleneck_type:          dense
bottleneck_norm:          none
bottleneck_size:          1024
bottleneck_in_encoder:    True
fc_depth:                 1
fc_min_filters:           1024
fc_max_filters:           1024
fc_dimensions:            4
fc_filter_slope:          -0.5
fc_dropout:               0.0
fc_upsampler:             upsample2d
fc_upsamples:             1
fc_upsample_filters:      512
fc_gblock_depth:          3
fc_gblock_min_nodes:      512
fc_gblock_max_nodes:      512
fc_gblock_filter_slope:   -0.5
fc_gblock_dropout:        0.0
dec_upscale_method:       subpixel
dec_upscales_in_fc:       0
dec_norm:                 none
dec_min_filters:          64
dec_max_filters:          512
dec_slope_mode:           full
dec_filter_slope:         -0.45
dec_res_blocks:           1
dec_output_kernel:        5
dec_gaussian:             True
dec_skip_last_residual:   True
freeze_layers:            keras_encoder
load_layers:              encoder
fs_original_depth:        4
fs_original_min_filters:  128
fs_original_max_filters:  1024
fs_original_use_alt:      False
mobilenet_width:          1.0
mobilenet_depth:          1
mobilenet_dropout:        0.001
mobilenet_minimalistic:   False

[model.dfl_h128]
lowmem:                   False

[model.dfaker]
output_size:              128

[model.realface]
input_size:               64
output_size:              128
dense_nodes:              1536
complexity_encoder:       128
complexity_decoder:       512

[trainer.original]
preview_images:           14
zoom_amount:              5
rotation_range:           10
shift_range:              5
flip_chance:              50
color_lightness:          30
color_ab:                 8
color_clahe_chance:       50
color_clahe_max_size:     4

--------- .faceswap ---------
backend:                  nvidia

--------- extract.ini ---------

[global]
allow_growth:             False

[detect.s3fd]
confidence:               70
batch-size:               4

[detect.mtcnn]
minsize:                  20
scalefactor:              0.709
batch-size:               8
threshold_1:              0.6
threshold_2:              0.7
threshold_3:              0.7

[detect.cv2_dnn]
confidence:               50

[mask.vgg_obstructed]
batch-size:               2

[mask.unet_dfl]
batch-size:               8

[mask.vgg_clear]
batch-size:               6

[mask.bisenet_fp]
batch-size:               8
weights:                  faceswap
include_ears:             False
include_hair:             False
include_glasses:          True

[align.fan]
batch-size:               12

One thing that's really strange is when I run "sudo nvidia-smi", it does display a CUDA version, and it's not the version installed by faceswap (which is 11.2 iirc)... But I can't find any other cuda installs, and apparently neither can faceswap sysinfo.

Code: Select all

Sat Jun 25 21:28:54 2022       

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46       Driver Version: 495.46       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   66C    P8    12W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Post by **torzdf** » Sun Jun 26, 2022 12:16 am

Replicon wrote: ↑Sat Jun 25, 2022 10:51 pm
Note there's no global cuda found, which I think is the thing that most people get tripped up by. Also, in the sysinfo, it has no problem seeing that there IS a GPU, so... help?

Yes, that's what you want. However, there should be a cudatoolkit and cudnn installed locally in your conda environment, but they are not there. Check for any install errors after running the install script

One thing that's really strange is when I run "sudo nvidia-smi", it does display a CUDA version, and it's not the version installed by faceswap (which is 11.2 iirc)... But I can't find any other cuda installs, and apparently neither can faceswap sysinfo.

This is normal. It's a bit misleading, but the version shown in Nvidia-SMI is the maximum supported Cuda version for the installed driver. It's not an indication of whether you have Cuda installed, nor which version it might be.
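To make that distinction concrete, here is a small sketch that pulls the two numbers out of the nvidia-smi banner and labels the CUDA figure for what it is: a ceiling set by the driver. The function name `parse_smi_header` and the dictionary keys are made up for illustration; only the banner text comes from the output posted above.

```python
import re


def parse_smi_header(header: str) -> dict:
    """Extract the driver version and the driver's CUDA ceiling from an nvidia-smi banner line."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", header)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", header)
    return {
        "driver_version": driver.group(1) if driver else None,
        # Highest CUDA runtime this driver can serve. It says nothing about
        # whether a CUDA toolkit is installed, or which version it would be.
        "max_supported_cuda": cuda.group(1) if cuda else None,
    }


banner = "| NVIDIA-SMI 495.46       Driver Version: 495.46       CUDA Version: 11.5     |"
info = parse_smi_header(banner)
print(info)
```

Read this way, a driver reporting 11.5 alongside a conda cudatoolkit of 11.2 is not a conflict, since the toolkit version is below the driver's ceiling.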

Post by **Replicon** » Sun Jun 26, 2022 4:14 am

torzdf wrote: ↑Sun Jun 26, 2022 12:16 am
Yes, that's what you want. However, there should be a cudatoolkit and cudnn installed locally in your conda environment, but they are not there. Check for any install errors after running the install script

How can you tell from the sysinfo that they are not there? The top level system info for my old environment (which still works for training, just not for extract) shows basically the same info.

(just the top level stuff from my old working environment)

Code: Select all

============ System Information ============
encoding:            UTF-8
git_branch:          Not Found
git_commits:         Not Found
gpu_cuda:            No global version found. Check Conda packages for Conda Cuda
gpu_cudnn:           No global version found. Check Conda packages for Conda cuDNN
gpu_devices:         GPU_0: Tesla T4
gpu_devices_active:  GPU_0
gpu_driver:          440.100
gpu_vram:            GPU_0: 15109MB
os_machine:          x86_64
os_platform:         Linux-5.4.0-1021-gcp-x86_64-with-glibc2.17
os_release:          5.4.0-1021-gcp
py_command:          /home/(((REDACT)))/faceswap/faceswap.py extract -i /home/(((REDACT)))/data_root/sources/x.mp4 -o /home/(((REDACT)))/data_root/extract/x -D s3fd -A fan -nm hist -rf 8 -min 0 -l 0.4 -sz 512 -een 1 -si 0 -L INFO
py_conda_version:    conda 4.10.3
py_implementation:   CPython
py_version:          3.8.11
py_virtual_env:      True
sys_cores:           4
sys_processor:       x86_64
sys_ram:             Total: 15005MB, Available: 14490MB, Used: 244MB, Free: 14468MB

I don't see any install errors. It unpacks cudatoolkit and cudnn and happily installs them.

Looking for the libraries locally, and comparing the two environments yields:

Old, working environment:

Code: Select all

$ find . -type d | egrep -e 'cuda|cudnn'
./miniconda3/pkgs/cudnn-7.6.5-cuda10.1_0
./miniconda3/pkgs/cudnn-7.6.5-cuda10.1_0/include
./miniconda3/pkgs/cudnn-7.6.5-cuda10.1_0/lib
./miniconda3/pkgs/cudnn-7.6.5-cuda10.1_0/info
./miniconda3/pkgs/cudnn-7.6.5-cuda10.1_0/info/test
./miniconda3/pkgs/cudnn-7.6.5-cuda10.1_0/info/licenses
./miniconda3/pkgs/tensorflow-base-2.2.0-gpu_py38h83e3d50_0/lib/python3.8/site-packages/tensorflow/include/tensorflow/stream_executor/cuda
./miniconda3/pkgs/tensorflow-base-2.2.0-gpu_py38h83e3d50_0/lib/python3.8/site-packages/tensorflow/include/external/local_config_cuda
./miniconda3/pkgs/tensorflow-base-2.2.0-gpu_py38h83e3d50_0/lib/python3.8/site-packages/tensorflow/include/external/local_config_cuda/cuda
./miniconda3/pkgs/tensorflow-base-2.2.0-gpu_py38h83e3d50_0/lib/python3.8/site-packages/tensorflow/include/external/local_config_cuda/cuda/cuda
./miniconda3/pkgs/cudatoolkit-10.1.243-h6bb024c_0
./miniconda3/pkgs/cudatoolkit-10.1.243-h6bb024c_0/lib
./miniconda3/pkgs/cudatoolkit-10.1.243-h6bb024c_0/info
./miniconda3/pkgs/cudatoolkit-10.1.243-h6bb024c_0/info/test
./miniconda3/pkgs/cudatoolkit-10.1.243-h6bb024c_0/info/recipe
./miniconda3/envs/faceswap/lib/python3.8/site-packages/tensorflow/include/tensorflow/stream_executor/cuda
./miniconda3/envs/faceswap/lib/python3.8/site-packages/tensorflow/include/external/local_config_cuda
./miniconda3/envs/faceswap/lib/python3.8/site-packages/tensorflow/include/external/local_config_cuda/cuda
./miniconda3/envs/faceswap/lib/python3.8/site-packages/tensorflow/include/external/local_config_cuda/cuda/cuda

New, broken environment:

Code: Select all

$ find . -type d | egrep -e 'cuda|cudnn'
./miniconda3/pkgs/cudnn-8.1.0.77-h90431f1_0
./miniconda3/pkgs/cudnn-8.1.0.77-h90431f1_0/bin
./miniconda3/pkgs/cudnn-8.1.0.77-h90431f1_0/include
./miniconda3/pkgs/cudnn-8.1.0.77-h90431f1_0/info
./miniconda3/pkgs/cudnn-8.1.0.77-h90431f1_0/info/test
./miniconda3/pkgs/cudnn-8.1.0.77-h90431f1_0/info/licenses
./miniconda3/pkgs/cudnn-8.1.0.77-h90431f1_0/info/recipe
./miniconda3/pkgs/cudnn-8.1.0.77-h90431f1_0/lib
./miniconda3/pkgs/cudatoolkit-11.2.2-hbe64b41_10
./miniconda3/pkgs/cudatoolkit-11.2.2-hbe64b41_10/bin
./miniconda3/pkgs/cudatoolkit-11.2.2-hbe64b41_10/info
./miniconda3/pkgs/cudatoolkit-11.2.2-hbe64b41_10/info/test
./miniconda3/pkgs/cudatoolkit-11.2.2-hbe64b41_10/info/licenses
./miniconda3/pkgs/cudatoolkit-11.2.2-hbe64b41_10/info/recipe
./miniconda3/pkgs/cudatoolkit-11.2.2-hbe64b41_10/lib
./miniconda3/envs/faceswap/lib/python3.9/site-packages/tensorflow/include/tensorflow/stream_executor/cuda
./miniconda3/envs/faceswap/lib/python3.9/site-packages/tensorflow/include/external/local_config_cuda
./miniconda3/envs/faceswap/lib/python3.9/site-packages/tensorflow/include/external/local_config_cuda/cuda
./miniconda3/envs/faceswap/lib/python3.9/site-packages/tensorflow/include/external/local_config_cuda/cuda/cuda

The only notable difference I am spotting is that "tensorflow-base" isn't there in the newer environment, but it definitely has tensorflow, if I add that to the search.

Post by **Replicon** » Sun Jun 26, 2022 4:24 am

Actually, I take some of that back: if I look in the miniconda3/pkgs directory, the old environment has several packages with tensorflow in their names, but the new one has none.

Looking at the install, it DID error out when trying to install tensorflow via conda, but then it succeeded when it fell back to using pip.

Is this expected? Could it be the root cause?

Code: Select all

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

  - tensorflow_probability[version='<0.17']

Current channels:

  - https://repo.anaconda.com/pkgs/main/linux-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/linux-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

https://anaconda.org

and use the search bar at the top of the page.


INFO    "tensorflow_probability<0.17" not available in Conda. Installing with pip
INFO    Installing tensorflow_probability<0.17
Collecting tensorflow_probability<0.17
  Downloading tensorflow_probability-0.16.0-py2.py3-none-any.whl (6.3 MB)
     |████████████████████████████████| 6.3 MB 7.7 MB/s

Post by **torzdf** » Sun Jun 26, 2022 10:04 am

Replicon wrote: ↑Sun Jun 26, 2022 4:14 am
How can you tell from the sysinfo that they are not there? The top level system info for my old environment (which still works for training, just not for extract) shows basically the same info.

I don't see any install errors. It unpacks cudatoolkit and cudnn and happily installs them.

They are not listed in your conda environment in your output system info. If they are not listed there then they are not installed. It should appear like so:

Code: Select all

charset-normalizer        2.0.12                   pypi_0    pypi
cloudpickle               2.0.0              pyhd3eb1b0_0
cudatoolkit               11.2.2               he111cf0_8    conda-forge
cudnn                     8.1.0.77             h90431f1_0    conda-forge
cycler                    0.11.0             pyhd3eb1b0_0
cytoolz                   0.11.0           py39h27cfd23_0

The tensorflow-probability thing is fine. That package is only available in pip
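That check can be scripted. The sketch below parses `conda list` style text, reports which of cudatoolkit and cudnn are missing, and also reads the header line that names the environment being listed. That header is worth a look: in the sysinfo at the top of this thread it reads `.../miniconda3`, not `.../miniconda3/envs/faceswap`. The helper names and the sample path `/home/user/miniconda3` are hypothetical, purely for illustration.

```python
import re


def parse_conda_list(text: str) -> dict:
    """Map package name -> version from `conda list` style output."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        fields = line.split()
        if len(fields) >= 2:
            packages[fields[0]] = fields[1]
    return packages


def missing_gpu_packages(text: str, required=("cudatoolkit", "cudnn")) -> list:
    """Return the required package names that do not appear in the listing."""
    installed = parse_conda_list(text)
    return [name for name in required if name not in installed]


def listed_environment(text: str):
    """Return the environment path named in the listing header, if present."""
    match = re.search(r"^# packages in environment at (.+):\s*$", text, re.MULTILINE)
    return match.group(1) if match else None


sample = """\
# packages in environment at /home/user/miniconda3:
#
# Name                    Version                   Build  Channel
cloudpickle               2.0.0              pyhd3eb1b0_0
cudatoolkit               11.2.2               he111cf0_8    conda-forge
cudnn                     8.1.0.77             h90431f1_0    conda-forge
"""

print(listed_environment(sample))
print(missing_gpu_packages(sample))
print(missing_gpu_packages("python  3.9.12  h12debd9_0"))
```

If the header names the base install rather than the faceswap environment, a missing cudatoolkit in the listing only tells you about the base install.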

Post by **Replicon** » Sun Jun 26, 2022 3:43 pm

torzdf wrote: ↑Sun Jun 26, 2022 10:04 am
They are not listed in your conda environment in your output system info. If they are not listed there then they are not installed. It should appear like so:

That is so strange, they are not listed in ANY of my environments, even the working one, or if I just run "conda list" on my personal desktop.

I know for a fact I saw those packages get installed while running the installer, progress bar and all.

I went into my environment and manually installed those versions of cudatoolkit and cudnn (had to add channel conda-forge, which it looks like setup.py is already accounting for when installing them).

I don't know much about conda, but it doesn't make sense to me that they wouldn't show up in the listing after faceswap installer installs them.

... Regardless, even after manually installing those two packages, I still get the "Mixed precision compatibility check" warning, and definitely no GPU usage. Where else can I look? Should I try slightly older versions of the nvidia driver? In my latest experiments, I was on 510.

Post by **Replicon** » Sun Jun 26, 2022 4:37 pm

I might be circling in on the issue, but I don't have any expertise with the libraries in play to know where to look next.

If I hack up faceswap.py to have:

Code: Select all

import tensorflow as tf
(...)
print(tf.config.list_physical_devices('GPU'))

on the broken install, I can see: "Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory"

I can also see it's failing to load libraries like "libcublasLt.so.11", so definitely a cuda issue of sorts.

The file versions don't seem to match:

Example error message: Could not load dynamic library 'libcudart.so.11.0';
File I have: ~/miniconda3/envs/faceswap/lib/libcudart.so.11.2.152

In the working environment, the major and minor versions match:

Message: Successfully opened dynamic library libcudart.so.10.1
File: miniconda3/envs/faceswap/lib/libcudart.so.10.1.243

I wonder if there's a bug somewhere that's making it see the wrong version, since I know there's some hackery in the code to override it based on tensorflow version.

My broken environment seems to have "tensorflow 2.8" packages, so cuda 11.2 is correct, but I'm not finding where it decides to look for 11.0. Maybe there's a rogue broken dependency somewhere.
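One way to dig into this is to check how the requested name resolves inside the environment's lib directory, and which directories the loader has been told to search. The sketch below rests on an assumption not confirmed anywhere in this thread: as far as I know, CUDA 11.x runtimes keep the soname `libcudart.so.11.0` and ship it as a symlink to the fully versioned file, in which case `11.0` next to `11.2.152` would not be a real mismatch and the question becomes whether that directory is searched at all. The helper names are hypothetical, and the demo builds a throwaway directory instead of touching a real install.

```python
import os
import tempfile
from pathlib import Path


def resolve_in_dir(lib_dir, soname):
    """Return the file name a soname resolves to inside lib_dir, or None if absent."""
    candidate = Path(lib_dir) / soname
    if candidate.exists():
        # Follow any symlink to see which versioned file backs the soname.
        return candidate.resolve().name
    return None


def dirs_searched(environ):
    """Return the extra library directories listed in LD_LIBRARY_PATH, if any."""
    return [p for p in environ.get("LD_LIBRARY_PATH", "").split(os.pathsep) if p]


# Demo layout (assumed, see above): a versioned file plus a soname symlink.
with tempfile.TemporaryDirectory() as tmp:
    real = Path(tmp) / "libcudart.so.11.2.152"
    real.touch()
    (Path(tmp) / "libcudart.so.11.0").symlink_to(real.name)
    resolved = resolve_in_dir(tmp, "libcudart.so.11.0")
    missing = resolve_in_dir(tmp, "libcudart.so.10.1")

print(resolved, missing)
print(dirs_searched(os.environ))
```

On a real machine you would point `resolve_in_dir` at `~/miniconda3/envs/faceswap/lib` and compare the result of `dirs_searched(os.environ)` between the working and broken setups.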

Post by **torzdf** » Sun Jun 26, 2022 4:42 pm

Unfortunately this is about as far as I can help. The fact that you are installing cudatoolkit and it isn't showing up suggests something weird with the setup on the cloud image, which is outside of anything I can help with. The fact that it does find cuda but reports linking to the wrong version backs this up, imho.

At this point, I would check which cuda/cudnn versions the installed tensorflow version supports and look to install the correct versions globally.

Post by **Replicon** » Mon Jun 27, 2022 9:03 pm

Thanks for the help, I figured it out!

All this time, I wasn't properly activating the conda environment. I was just running the python that lives inside the environment, without activating it first. I'm pretty sure I was doing that because I looked it up at some point and it used to work.

Anyway, through some combination of the move to python 3.9 changing how it loads stuff, and/or the library dependencies happening to match some default thing, it used to work.

I just had to add the following to my own tooling:

Code: Select all

source "${fs_conda_root}/etc/profile.d/conda.sh" activate && conda activate 'faceswap'

I can now get an image going from scratch (starting with a clean ubuntu 20.04 GCE instance). No old base image required!
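For anyone scripting a similar setup, a quick guard before launching faceswap can catch this early. The sketch below relies on `CONDA_DEFAULT_ENV` and `CONDA_PREFIX`, the variables `conda activate` sets in the shell; the function name `conda_env_status` and the example prefix path are made up for illustration.

```python
import os


def conda_env_status(expected, environ=None):
    """Report which conda environment (if any) is active in the given environment mapping."""
    environ = os.environ if environ is None else environ
    active = environ.get("CONDA_DEFAULT_ENV")  # name of the activated env, if any
    prefix = environ.get("CONDA_PREFIX")       # filesystem path of the activated env
    return {"active": active, "prefix": prefix, "ok": active == expected}


activated = conda_env_status(
    "faceswap",
    {
        "CONDA_DEFAULT_ENV": "faceswap",
        "CONDA_PREFIX": "/home/user/miniconda3/envs/faceswap",  # hypothetical path
    },
)
not_activated = conda_env_status("faceswap", {})

print(activated)
print(not_activated)
```

Calling `conda_env_status("faceswap")` with no mapping checks the current process, so tooling can bail out with a clear message when `ok` is false instead of silently training on the CPU.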

Faceswap Forum
