Preamble:
This was a small project by myself, a non-expert in either Python or Machine Learning (ML).
Nvidia is the king of ML; this isn't me attempting to deny that. This was done purely as an 'I wonder if'.
The code I've written is not the best and would require some cleanup before anybody who has been
programming in Python for more than a month saw it.
ROCm is a Linux-only software stack.
I have just started looking at DeepFakes and ML and have an AMD GPU, a position I'm sure people
other than myself have been in. Looking at the forums I saw multiple posts that essentially summed up
what I already knew: the comparison in performance was terrible. A large and immediate disadvantage
to some people who cannot or do not wish to buy another GPU.
RTX GPUs aside, I was very much of the belief that this was a software rather than a hardware issue.
I do not mean that the developers have done a 'bad' job of coding; I think FaceSwap is excellent and
the code I looked at was very easy to follow. What I do mean is that the entire software stack is biased
towards CUDA, which shouldn't surprise anyone. Jen-Hsun Huang stated 10 years ago that Nvidia is a
software company.
My belief came from looking at software requirements: CUDA or OpenCL...
Because Nvidia has been pushing CUDA for the past 10+ years, they stopped their OpenCL support at v1.2, while
AMD and Intel continued to support up to v2.2 and v2.0 respectively. What caught my attention was
that a program would be written in OpenCL v1.2, which looked to me like a way of staying compatible
with Nvidia's OpenCL, but would then also have a separate CUDA path. Why not drop Nvidia OpenCL
support?
Now I do not know what the benefits of migrating from OpenCL v1.2 to v2.0+ are, if there are any at
all. It was a very ignorant observation and I'm sure there's someone reading this who knows far
better than I and can say why this is. It doesn't matter though; I'm merely explaining why I believed
what I did, whether it was based on false pretenses or not.
AMD has been trying to catch up in the ML game; how well they are doing is something you can
decide for yourself. In the attempt to catch up they have built the Radeon Open Compute stack
(ROCm). Part of this software stack is MIOpen, a low-level ML library, and a new runtime API, HIP.
In short, HIP is a CUDA clone designed to allow easy migration from CUDA: perform a find+replace
from 'cudaFunction()' to 'hipFunction()' and you're 90% done. If you're interested, give it a google.
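To show how mechanical that rename is, here is a toy Python sketch of my own (not AMD's actual
tooling; their hipify scripts handle far more than a simple rename, such as types and kernel launches):

import re

# Toy illustration of the HIP "find+replace" idea: rename CUDA runtime calls
# in a source string to their HIP equivalents.
CUDA_SOURCE = """
    float *d_buf;
    cudaMalloc(&d_buf, size);
    cudaMemcpy(d_buf, h_buf, size, cudaMemcpyHostToDevice);
    cudaFree(d_buf);
"""

def naive_hipify(src):
    # cudaMalloc -> hipMalloc, cudaMemcpyHostToDevice -> hipMemcpyHostToDevice, etc.
    return re.sub(r"\bcuda([A-Z]\w*)", r"hip\1", src)

print(naive_hipify(CUDA_SOURCE))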
This has allowed (I think this is the reason, anyway) TensorFlow to be built on top of MIOpen so
that TensorFlow can run on AMD GPUs.
For those of you that aren't aware, FaceSwap uses PlaidML for AMD and TensorFlow for Nvidia.
PlaidML is an OpenCL v1.2 ML library but does not conform to my previous 'v1.2 and CUDA' rant.
This is one of the ways the FaceSwap developers have managed to let you use any hardware
for ML, and that is something we should thank them for. I now have another option though...
What would be the performance difference between PlaidML and TensorFlow on AMD?
If one is written with OpenCL and the other with low-level primitives, there must be a difference.
I booted up a fresh Linux environment, installed the ROCm stack with the modified TensorFlow and
a recent git copy of FaceSwap. I then made the changes required so that FaceSwap would use
the same TensorFlow code path that it uses for Nvidia, and ran some tests.
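If you want to try the same setup, a quick way to sanity-check that the ROCm build of TensorFlow can
actually see the GPU, before touching FaceSwap at all, is something like this (assuming the
TensorFlow 1.x-style API that FaceSwap was using at the time):

# Quick check that the ROCm build of TensorFlow can see the AMD GPU.
import tensorflow as tf
from tensorflow.python.client import device_lib

print("TensorFlow version:", tf.__version__)
print("GPU available:", tf.test.is_gpu_available())

# List every device TensorFlow has registered; the Vega 56 should appear
# as a GPU device if the ROCm stack is installed correctly.
for device in device_lib.list_local_devices():
    print(device.name, device.device_type)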
All of these tests were run on the same data using the FaceSwap default configs, except for Dlight.
I had to change the Dlight output size to 128 in order to get any meaningful data.
Notes for these tables:
ROCm - Linux OS, TensorFlow code path.
PlaidML - Linux OS, PlaidML code path.
Windows - Windows 10, PlaidML code path.
EGs - the examples-per-second figure FaceSwap reports during training (higher is better).
OOM - Out of memory, see the notes after the tables.
Please allow for a slight margin of error; these things are never perfect.
Ryzen 1600 - 32GB RAM - AMD Vega 56
Model: Original
Batch size    ROCm EGs    PlaidML EGs    Windows EGs
16            35.7        32             31
32            63.1        42             42.3
64            97.7        48.7           48.7
128           110.5       48             49.3

Model: Dfaker
Batch size    ROCm EGs    PlaidML EGs    Windows EGs
16            20.5        9.8            10.8
20            23.6        9              10.6
32            35.7        OOM            11.3

Model: Dlight - Output 128px
Batch size    ROCm EGs    PlaidML EGs    Windows EGs
8             22          11.5           11.4
12            30.4        14.8           14.3
16            36.1        15.9           16.2
28            50.1        14.6           16.4

Model: RealFace
Batch size    ROCm EGs    PlaidML EGs    Windows EGs
8             10.6        OOM            4.6
16            17.4        OOM            5.1

Model: DFL-H128
Batch size    ROCm EGs    PlaidML EGs    Windows EGs
8             11.1        4.9            5.2
16            16          5.6            5.8
24            18.8        Error*         4.5
Interesting, right? There are some other points I'd like to make, but first I'd like to remind you that
this has only been run by me, on my (poor) dataset. I am also unable to test DFL-SAE as I get the
same error (marked Error* above) as I do with DFL-H128 at batch size 24. With SAE it's immediate;
with H128 it's hit and miss. I have yet to find out exactly what the problem is.
TensorFlow is very happy to throw an OOM error and quit. PlaidML, on the other hand, seems happy to
use system memory as well, swapping between it and VRAM. Unsurprisingly, this destroys performance.
It seems to be much more of a thing on Windows than on Linux, and I'm not sure whether that is down
to Vega's HBCC feature; Google will explain HBCC far better than I can. When I saw this happening I
just quit and noted it as an OOM. PlaidML does let me see how much VRAM I'm using, whereas
TensorFlow pre-allocates 96% and you can't see any deeper than that. A PlaidML OOM was obvious as
I'd see 100% VRAM allocation and GPU usage flipping between 0% and 30%. Windows 10 actually
exposes the copy operations.
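As an aside, that pre-allocation behaviour can be reined in if you want to watch real VRAM usage
yourself. A minimal sketch using the TensorFlow 1.x session options (how FaceSwap actually
configures its sessions internally may differ from this):

# Sketch: stop TensorFlow 1.x grabbing nearly all of the VRAM up front, so that
# real usage can be observed with external tools.
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True            # allocate only as needed
# or cap the fraction of VRAM TensorFlow may claim:
# config.gpu_options.per_process_gpu_memory_fraction = 0.5

session = tf.Session(config=config)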
Other interesting observations:
The ROCm stack (TensorFlow on MIOpen) used less VRAM at each batch size.
PlaidML on Windows vs Linux yielded essentially matching performance, except with RealFace.
I'd now like to take the chance to reiterate that this was a small interest project by a beginner.
This is not me attempting to change the world, crown AMD king, or demand FaceSwap change its
core code. This is for the interest of those who have read this far (congratulations, by the way) and
to see how this changes the performance gap between AMD and Nvidia.
Thank you for reading.