Interesting AMD performance using the ROCm stack

Post by more11o » Wed Apr 08, 2020 2:11 pm

Preamble:
This was a small project by me, a non-expert in both Python and Machine Learning (ML).
Nvidia is the king of ML and I'm not attempting to deny that. This was done purely as an 'I wonder if'.
The code I've written is not the best and would need some cleanup before anybody who has been programming
in Python for more than a month saw it.
ROCm is a Linux-only software stack.

I have just started looking at DeepFakes and ML and have an AMD GPU, a position I'm sure people
other than myself have been in. Looking at the forums I saw multiple posts that essentially summed up
what I already knew: the comparison in performance was terrible. That is a large and immediate disadvantage
for people who cannot or do not wish to buy another GPU.

RTX GPUs aside, I very much believed that this was a software rather than a hardware issue.
I do not mean that the developers have done a 'bad' job of coding; I think FaceSwap is excellent and
the code I looked at was very easy to follow. What I do mean is that the entire software stack is biased
towards CUDA, which shouldn't surprise anyone. Jen-Hsun Huang stated 10 years ago that Nvidia is a
software company.

My belief came from looking at software requirements: CUDA or OpenCL...
Because Nvidia has pushed CUDA for the past 10+ years, they stopped their OpenCL support at v1.2, while
AMD and Intel continued to support up to v2.2 and v2.0 respectively. What caught my attention was
that a program would be written in OpenCL v1.2, which looked to me like a way of staying compatible
with Nvidia's OpenCL, but would then also have a separate CUDA path. Why not drop Nvidia OpenCL
support?

Now, I do not know what the benefits of migrating from OpenCL v1.2 to v2.0+ are, if there are any at
all. It was a very ignorant observation and I'm sure there's someone reading this who knows far
better than I do and can say why this is. It doesn't matter though; I'm merely explaining why I believed
what I did, whether it was based on false premises or not.

AMD has been trying to catch up in the ML game; how well they are doing is something you can
decide for yourself. In the attempt to catch up they have built the Radeon Open Compute stack
(ROCm). Part of this software stack is MIOpen, a low-level ML library, and a new runtime API, "HIP".

In short, HIP is a CUDA clone designed to allow easy migration from CUDA: perform a find+replace
from 'cudaFunction()' to 'hipFunction()' and you're 90% done. If you're interested, give it a google.
This has allowed (I think this is the reason anyway) TensorFlow to be built on top of MIOpen so
that TensorFlow can run on AMD GPUs.
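
To make the find+replace idea concrete, here is a rough Python sketch. It is only an illustration: the function name naive_hipify and the sample buffer names are made up for this example, and it is not how AMD's own conversion tooling works internally. It simply swaps the 'cuda' prefix on runtime API names for 'hip', which happens to be correct for common calls such as cudaMalloc and cudaMemcpy.

```python
import re

# Matches the "cuda" prefix of CUDA runtime identifiers such as cudaMalloc
# or cudaMemcpyHostToDevice (prefix followed by an upper-case letter).
CUDA_PREFIX = re.compile(r"\bcuda(?=[A-Z])")


def naive_hipify(source: str) -> str:
    """Rename cudaXxx identifiers to hipXxx. Hypothetical helper, illustration only."""
    return CUDA_PREFIX.sub("hip", source)


# Sample CUDA host code; d_buf, h_buf and size are placeholder names.
cuda_snippet = """\
cudaMalloc(&d_buf, size);
cudaMemcpy(d_buf, h_buf, size, cudaMemcpyHostToDevice);
cudaDeviceSynchronize();
"""

hip_snippet = naive_hipify(cuda_snippet)
print(hip_snippet)
```

This deliberately ignores the remaining 10%: header names, library calls outside the runtime API and anything vendor specific still need attention by hand or by a proper porting tool.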

For those of you who aren't aware, FaceSwap uses PlaidML for AMD and TensorFlow for Nvidia.
PlaidML is an OpenCL v1.2 ML library but does not conform to my previous 'v1.2 and CUDA' rant.
This is one of the ways the FaceSwap developers have managed to allow you to use any hardware
for ML, and that is something we should thank them for. I now have another option though...

What would be the performance difference between PlaidML and TensorFlow on AMD?
If one is written with OpenCL and the other with low-level primitives, there must be a difference.

I booted up a fresh Linux environment, installed the ROCm stack with the modified TensorFlow and
a recent git copy of FaceSwap. I then made the changes required so that FaceSwap would use
the same TensorFlow code that it uses for Nvidia and ran some tests.

All of these tests were run on the same data using the FaceSwap default configs, except for Dlight.
I had to change the Dlight output to 128 in order to get any meaningful data.

Notes for this table:
ROCm - Linux OS, TensorFlow code path.
PlaidML - Linux OS, PlaidML code path.
Windows - Windows 10, PlaidML code path.
OOM - Out of memory *see notes after*

Please allow for a slight margin of error, these things are never perfect.
Ryzen 1600 - 32GB RAM - AMD Vega 56


Model: Original
Batch size:	ROCm EGs:	PlaidML EGs:	Windows EGs:
16		35.7		32		31
32		63.1		42		42.3
64		97.7		48.7 		48.7
128		110.5		48		49.3

Model: Dfaker
Batch size:	ROCm EGs:	PlaidML EGs:	Windows EGs:
16		20.5		9.8		10.8
20		23.6		9		10.6
32		35.7		OOM 		11.3

Model: Dlight - Output 128px
Batch size:	ROCm EGs:	PlaidML EGs:	Windows EGs:
8		22 		11.5		11.4
12		30.4		14.8		14.3
16		36.1		15.9		16.2
28		50.1		14.6		16.4

Model: RealFace
Batch size:	ROCm EGs:	PlaidML EGs:	Windows EGs:
8		10.6		OOM		4.6
16		17.4		OOM		5.1

Model: DFL-H128
Batch size:	ROCm EGs:	PlaidML EGs:	Windows EGs:
8		11.1		4.9		5.2
16		16		5.6		5.8
24		18.8		Error*		4.5

Interesting, right? There are some other points I'd like to make, but first I'd like to remind you that
this has only been run by me on my (poor) dataset. I am also unable to test DFL-SAE as I get the
same error as I do with DFL-H128 at batch size 24. With SAE it's immediate; with H128 it's hit and miss.
I have yet to find out exactly what the problem is.
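
For anyone who wants the tables above as ratios, here is a small Python sketch that divides the ROCm figure by each PlaidML figure, row by row. The numbers are copied straight from the tables; the names RESULTS, ratio and speedups are just for this example, and OOM/Error entries are stored as None and reported as n/a. On these figures the ROCm lead runs from roughly 1.1x (Original, batch 16) to a little over 4x (DFL-H128, batch 24 against Windows).

```python
# (batch size, ROCm, PlaidML on Linux, PlaidML on Windows), in EGs as posted.
# None marks an OOM or Error entry in the original tables.
RESULTS = {
    "Original": [(16, 35.7, 32.0, 31.0), (32, 63.1, 42.0, 42.3),
                 (64, 97.7, 48.7, 48.7), (128, 110.5, 48.0, 49.3)],
    "Dfaker": [(16, 20.5, 9.8, 10.8), (20, 23.6, 9.0, 10.6),
               (32, 35.7, None, 11.3)],
    "Dlight (128px)": [(8, 22.0, 11.5, 11.4), (12, 30.4, 14.8, 14.3),
                       (16, 36.1, 15.9, 16.2), (28, 50.1, 14.6, 16.4)],
    "RealFace": [(8, 10.6, None, 4.6), (16, 17.4, None, 5.1)],
    "DFL-H128": [(8, 11.1, 4.9, 5.2), (16, 16.0, 5.6, 5.8),
                 (24, 18.8, None, 4.5)],
}


def ratio(rocm, other):
    """ROCm rate divided by the other rate, or None if the other run failed."""
    return None if other is None else rocm / other


def speedups(results):
    """Map each model to a list of (batch, vs PlaidML/Linux, vs PlaidML/Windows)."""
    return {
        model: [(batch, ratio(rocm, linux), ratio(rocm, windows))
                for batch, rocm, linux, windows in rows]
        for model, rows in results.items()
    }


def fmt(value):
    return "n/a" if value is None else f"{value:.2f}x"


for model, rows in speedups(RESULTS).items():
    print(model)
    for batch, vs_linux, vs_windows in rows:
        print(f"  batch {batch:>3}: vs PlaidML/Linux {fmt(vs_linux)}, "
              f"vs PlaidML/Windows {fmt(vs_windows)}")
```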

TensorFlow is very happy to throw an OOM error and quit. PlaidML seems to be happy using system
memory too and swapping between this and VRAM. Unsurprisingly, this destroys performance.
This seems to be much more of a thing on Windows than Linux and I'm not sure if it's because of Vega's
HBCC feature. Google will explain HBCC far better than I can. When I saw this happening I just quit and
noted it as an OOM. PlaidML does allow me to see how much VRAM I'm using, whereas TensorFlow
pre-allocates 96% and you can't see any deeper than that. A PlaidML OOM was obvious as I'd see 100%
VRAM allocation and GPU usage flipping between 0% and 30%. Windows 10 actually exposes the copy
operations.

Other interesting observations:
The ROCm stack (TensorFlow on MIOpen) used less VRAM for each batch.
PlaidML on Windows vs Linux yielded essentially matching performance except with RealFace.

I'd now like to take the chance to reiterate that this was a small interest project by a beginner.
This is not me attempting to change the world, crown AMD king or demand FaceSwap change its
core code. This is for the interest of those who have read this far (congratulations by the way) and
to see how this changes the performance gap between AMD and Nvidia.

Thank you for reading.

Post by torzdf » Wed Apr 08, 2020 4:12 pm

Thanks for this informative post.

There is no question that Tensorflow compiled with ROCm support will perform far better than plaidML with its OpenCL support, but we didn't go that way for a couple of reasons:
  1. As you alluded to, ROCm is only available on Linux, so for us and our desire to support all the major OSes, it rules ROCm out
  2. Tensorflow does not come pre-compiled with ROCm support. One of our main goals is to make complicated Machine Learning principles as easily accessible to all as possible. Instructing users on Tensorflow compilation is a definite no-no for that reason. We used to have DLib in our code (an incredibly useful library), but the barrier to entry that this library caused, with compilation issues, forced us to remove it.
That being said, if a user has the technical skills or the inclination to compile Tensorflow with ROCm support, then I would highly encourage this as the way forward, as they will get the best performance from their AMD card this way, and will have access to features which are just not available in plaidML.

As to your OOM issue, you can actually force Tensorflow to only allocate the VRAM it requires by enabling the "Allow Growth" parameter, which should give you some visibility on the amount of VRAM that is actually being used.
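
At the TensorFlow level, this kind of on-demand allocation can be requested roughly as in the sketch below. It assumes the TF 2.x API names and assumes the ROCm build honours the TF_FORCE_GPU_ALLOW_GROWTH environment variable the same way the stock build does, which is not confirmed here. The helper name enable_memory_growth is made up for this example, and how FaceSwap wires up its own option internally is not shown.

```python
import os

# Assumption: the ROCm build of TensorFlow honours this variable like the
# stock build. It has to be set before TensorFlow is first imported.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"


def enable_memory_growth():
    """Ask TensorFlow to grow GPU memory on demand instead of pre-allocating."""
    # Imported lazily so the environment variable above is already in place.
    import tensorflow as tf

    gpus = tf.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        # Must run before TensorFlow has initialised the GPU.
        tf.config.experimental.set_memory_growth(gpu, True)
    return gpus
```

Call enable_memory_growth() once at startup, before any model is built. With growth enabled, the VRAM figure reported by the usual GPU monitoring tools should track what training actually uses rather than sitting at a fixed pre-allocation.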

Post by more11o » Wed Apr 08, 2020 6:04 pm

Thank you.

It was a very long-winded way of showing what you summed up, and those so inclined now have an idea of how much more performance they may be able to get. I have since managed to get DFL-SAE working and so far tests have shown the performance improvements to be in line with the tests above.

To your point about why you ruled ROCm out: not only do I understand, I entirely agree. I would add that this is also far too niche to support. FaceSwap's ability to run on any hardware and any OS I wish is one of the things I liked.

TensorFlow does now have compiled ROCm pip packages for up to and including 2.1.1, but I don't feel this lowers the barrier to entry much. Whilst installing the compiled ROCm stack is also fairly easy, it isn't without its own issues, and at least one library needed to get this all working requires building from source.

I think it also goes without saying that if anybody else does attempt this, support will be limited and you'll have to be happy getting your own hands dirty.

If anybody is running an AMD card on a Linux machine and wishes to try this out, I'm happy to send you the modified source and help get it running as much as I can, with the following caveats:

- As above, I'm of limited help past install, and to expect support from here is unreasonable.
- Unless you are running Ubuntu 18.04, this is going to require a fresh OS install or being happy to compile ROCm from source.
- You need to be fairly happy in a Linux environment and with Python.
- This may not work for you at the end.

Post by Soulless » Thu Apr 16, 2020 5:05 pm

Hello, I installed ROCm and Tensorflow and tested that they work, but I can't configure the faceswap app. Can you please send me the sources? Thank you.

Post by more11o » Sat Apr 18, 2020 4:09 am

The modified source is on my Google drive here.

If you're using a virtual env then you'll want to activate that first; I prefer Miniconda over Anaconda.

Run setup.py and say no to PlaidML and yes to ROCm. It will run as the normal setup would and pull all the dependencies, then all you have to do is run faceswap.py gui.

Post by Soulless » Thu Apr 23, 2020 4:42 pm

Google says that you deleted the file. Can you restore it? Thank you.


Post by Soulless » Sun Apr 26, 2020 1:02 pm

Thank you so much.

Post by calipheron » Thu May 14, 2020 7:44 pm

So I've just spent the last few hours trying to get this to work on a fresh install of Lubuntu 18.04
on an i7-4770 system with a pokey RX 460, which should be enough to get Original training going.
I managed to get faceswap to run and it even says ROCm is in use.
But it complains about MIOpen not being able to access / write to a file.
Training doesn't seem to work properly; it is incredibly slow.

ROCMINFO correctly lists the 460 as available, but it shows an error:
"Failed to get user name to check for video group membership"

This issue is listed in the ROCm github here:
https://github.com/RadeonOpenCompute/ROCm/issues/928

Tried everything I can, triple-checking that my username is in the video group.
Any suggestions would be appreciated. I have an 8GB RX580 that could train fairly well with ROCm, but anything more than Original under Windows is just horribly slow.
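
For anyone hitting the same rocminfo message, here is a minimal Python sketch of a check for video group membership. It uses only the standard library (grp, pwd, os); the function names are made up for this example and this is not what rocminfo itself does. It looks at two things separately: whether the account database lists the user in the group, and whether the group is active in the current login session, since a newly added group normally only takes effect after logging out and back in.

```python
import grp
import os
import pwd


def user_in_group(group_name, uid=None):
    """True/False per the account database, or None if user or group is unknown."""
    uid = os.getuid() if uid is None else uid
    try:
        user = pwd.getpwuid(uid)          # KeyError if the uid has no user name
        group = grp.getgrnam(group_name)  # KeyError if the group does not exist
    except KeyError:
        return None
    return user.pw_gid == group.gr_gid or user.pw_name in group.gr_mem


def group_active_in_session(group_name):
    """True/False for the running process, or None if the group is unknown."""
    try:
        gid = grp.getgrnam(group_name).gr_gid
    except KeyError:
        return None
    return gid == os.getgid() or gid in os.getgroups()


print("video group in database:", user_in_group("video"))
print("video group active in session:", group_active_in_session("video"))
```

A None from the first check means the user name itself could not be resolved, which is a different problem from simply not being in the group.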

Post by more11o » Fri May 15, 2020 8:38 am

That is unusual.

Which files is it unable to open?

Did you follow the instructions in 'install.md'? I probably should have made it more obvious that it had changed.

The initial part of training is slow as it has to 'build' everything and then run the training.

Please feel free to find me on Discord under the same username.

Edit: I don't believe that the RX 460 is properly supported on ROCm, and the RX 560/570 support is a little messy.

Post by calipheron » Fri May 15, 2020 9:56 am

Thanks for replying, I'll try my 580 instead and see if that behaves any differently.
