Any experience with preemptible resources and/or FUSE?

Want to use Faceswap in The Cloud? This is not directly supported by the Devs, but you may find community support here



Replicon
Posts: 38
Joined: Mon Mar 22, 2021 4:24 pm
Been thanked: 1 time


Post by Replicon »

I'd like to try moving my workflow to the cloud; it's getting to be too time-consuming on my personal machine (which is really old and can barely manage the Lightweight model at batch size 16).

I'd like to be as cost-effective as possible, so I've been thinking of using Preemptible instances/GPUs.

These are super cheap compared to their dedicated counterparts, but I need to engineer around the caveats: they can be preempted at any moment (from what I'm reading, there's maybe a 5-15% chance of that happening during an instance's lifetime), and their lifespan is at most 24 hours, so they'll get torn down on you no matter what. The 24-hour cap has the added benefit of protecting me from accidentally leaving an expensive resource on, only to get a crazy bill at the end of the month.

To work around the caveats, I was thinking of just mounting a storage bucket using FUSE.

This can be slower than interacting with a regular filesystem, because GCS objects are immutable, so a small update to a large file still requires basically re-uploading the entire file (e.g. a 400MB model file).

I figure, almost all of the work happens in memory, and it should only need to touch the disk when writing its state every 250 iterations (which I can crank up to 10000 or something), so there's a sweet spot where it will not get in the way too much. And that way, if I get preempted in the middle of the night, my state is saved in my bucket and the instance can safely disappear into the ether.
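A rough back-of-the-envelope for that sweet spot, as a sketch only. The 400MB model size and the 250 / 10000 save intervals come from this post; the iteration speed and upload bandwidth are made-up assumptions, and it pessimistically assumes training blocks for the whole re-upload on every save.

```python
# Rough estimate of how much wall-clock time goes to re-uploading the model
# on every save. Every default except the 400 MB model size is an assumption.

def save_overhead(save_interval, model_mb=400.0, upload_mb_per_s=50.0, iters_per_s=2.0):
    """Return the fraction of wall-clock time spent uploading the model."""
    train_s = save_interval / iters_per_s    # time spent training between saves
    upload_s = model_mb / upload_mb_per_s    # time to re-upload the whole object
    return upload_s / (train_s + upload_s)


def max_lost_minutes(save_interval, iters_per_s=2.0):
    """Worst-case training time lost if preempted just before a save."""
    return save_interval / iters_per_s / 60.0


for interval in (250, 1000, 10000):
    print(f"save every {interval:>5} iterations: "
          f"{save_overhead(interval):.1%} of time uploading, "
          f"up to {max_lost_minutes(interval):.0f} min of work lost on preemption")
```

The two numbers pull in opposite directions: a longer interval shrinks the upload overhead but grows the amount of training you can lose to a preemption.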

Has anyone played with this kind of approach before? Does it work, or should I just build a thing that periodically uploads to a bucket manually? I suppose I can test it locally; no need for the cloud to validate the proof of concept. :)


bryanlyon
Site Admin
Posts: 634
Joined: Fri Jul 12, 2019 12:49 am
Answers: 41
Location: San Francisco
Has thanked: 3 times
Been thanked: 161 times

Re: Any experience with preemptible resources and/or FUSE?

Post by bryanlyon »

If using AWS, just use their EBS storage, which keeps all data remotely in block storage. It's possible that Faceswap (FS) will be mid-save when the instance gets shut down; in that case, you can restore the backup and continue. GCE and Azure offer similar options for their preemptible instances.


Replicon
Posts: 38
Joined: Mon Mar 22, 2021 4:24 pm
Been thanked: 1 time

Re: Any experience with preemptible resources and/or FUSE?

Post by Replicon »

Thanks!

I just performed the experiment with FUSE this morning, and it is, indeed, totally unviable.

Training is extremely slow, as it touches the "disk" at ~5 QPS.

I tried adding the -nl flag, which at least eliminated the constant writing to the logs, but it's still reading from disk pretty constantly. GCS FUSE was a fun idea, but it won't work, because training is not quite as in-memory as I thought it might be.

Filestore would likely work, but it's hugely expensive for our purposes.

I could create and mount a persistent disk separately, like you mention. I just worry about it getting corrupted, because it may not have enough time to flush and unmount, if we have to wait for training to shut itself down.

The other thing I could do is just periodically copy the model to GCS during training. I assume the <trainer>.h5 and <trainer>_state.json (and maybe corresponding .bk files) are all that matter. Since they change rarely, and all in one shot, a basic "compare checksums" or "checksum before and after" or similar approach is good enough for a hobbyist. Might even be baked into gsutil today.
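The "checksum before and after" idea could look roughly like this in Python. It's a sketch, not anything Faceswap or gsutil ships: `upload_when_quiet` and the `upload` callable are hypothetical names, and `upload` stands in for whatever actually moves the file to GCS.

```python
import hashlib


def md5_of(path):
    """MD5 of a file, read in chunks so a large model file is not loaded whole."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def upload_when_quiet(path, upload, attempts=3):
    """Call upload(path); accept it only if the file did not change meanwhile.

    Returns the checksum of the version that was uploaded, or None if the
    file kept changing on every attempt.
    """
    for _ in range(attempts):
        before = md5_of(path)
        upload(path)                     # hypothetical: copies path somewhere remote
        if md5_of(path) == before:       # unchanged, so the upload saw a whole file
            return before
    return None
```

Since the model files change rarely and all in one shot, the retry should almost never fire; it only matters when a save lands in the middle of an upload.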


bryanlyon
Site Admin
Posts: 634
Joined: Fri Jul 12, 2019 12:49 am
Answers: 41
Location: San Francisco
Has thanked: 3 times
Been thanked: 161 times

Re: Any experience with preemptible resources and/or FUSE?

Post by bryanlyon »

You could set an inotify for any changes to the .h5 and copy the files over the network if a change is detected. Just don't overwrite on the remote until you can verify the copy was good and you'll be fine.
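A minimal sketch of that pattern, with two loud assumptions: it polls file timestamps instead of using inotify (the Python standard library has no inotify binding), and the "remote" is just a directory path, such as a mounted network drive. The function names are made up for illustration.

```python
import hashlib
import os
import shutil
from pathlib import Path


def md5_of(path):
    """MD5 of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_models(model_dir, seen):
    """Return .h5 files whose (mtime, size) differ from the last poll.

    `seen` is a dict the caller keeps between polls; it is updated in place.
    """
    changed = []
    for path in sorted(Path(model_dir).glob("*.h5")):
        stat = path.stat()
        stamp = (stat.st_mtime_ns, stat.st_size)
        if seen.get(path) != stamp:
            seen[path] = stamp
            changed.append(path)
    return changed


def publish_verified(src, remote_dir):
    """Copy src under a temporary name, verify it, then replace the old copy."""
    src, remote_dir = Path(src), Path(remote_dir)
    remote_dir.mkdir(parents=True, exist_ok=True)
    partial = remote_dir / (src.name + ".partial")
    shutil.copyfile(src, partial)
    if md5_of(partial) != md5_of(src):   # torn or corrupted copy: keep the old one
        partial.unlink()
        return False
    os.replace(partial, remote_dir / src.name)   # swap in only after verification
    return True
```

In use, you'd keep one `seen = {}` dict and call `changed_models` every 30 seconds or so, passing each changed file to `publish_verified`. The previous remote copy is never touched until the new one has checked out.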

The reason that training isn't great on a FUSE is the training data needs to be loaded/processed as the training goes on. Might be best to copy the data local and only keep the model itself on a remote drive.


Replicon
Posts: 38
Joined: Mon Mar 22, 2021 4:24 pm
Been thanked: 1 time

Re: Any experience with preemptible resources and/or FUSE?

Post by Replicon »

I wrote a little shell script that wraps gsutil and uploads the model to GCS periodically. I kinda bodged together a little ecosystem to make it super easy to export just the GPU-intensive parts of the work to a cloud instance.

To prevent corruption shenanigans (e.g. it being written as I read and upload), I'm first checksumming the model and using that checksum as a precondition on the upload, retrying as needed.
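For anyone wanting to try something similar, here is a guess at the shape of such a wrapper, written in Python rather than shell. It is not the actual script from this post. It reads "checksum as a precondition" as gsutil's `Content-MD5` header, which makes the service reject an upload whose bytes don't match the checksum sent with it; the function names, retry counts, and the bucket URL in the usage note are all made up.

```python
import base64
import hashlib
import subprocess
import time


def md5_base64(path):
    """Base64-encoded MD5 digest, the form the Content-MD5 header expects."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


def gsutil_command(path, dest_url):
    """Build a gsutil cp command that carries the file's current checksum."""
    return ["gsutil", "-h", f"Content-MD5:{md5_base64(path)}", "cp", str(path), dest_url]


def upload_with_retry(path, dest_url, attempts=5, wait_s=10.0, run=subprocess.run):
    """Retry until an upload goes through; True on success, False if all attempts fail.

    `run` is injectable so the logic can be exercised without gsutil installed.
    """
    for attempt in range(attempts):
        cmd = gsutil_command(path, dest_url)   # re-checksum on every attempt
        if run(cmd).returncode == 0:
            return True
        if attempt < attempts - 1:
            time.sleep(wait_s)                 # give an in-progress save time to finish
    return False
```

A call would look like `upload_with_retry("model.h5", "gs://example-bucket/model.h5")`, with both names being placeholders. If the model is rewritten between the checksum and the end of the upload, the checksums disagree, gsutil exits non-zero, and the next attempt starts over with a fresh checksum.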

