Do you know of any good AI tools for voice swapping?
I’m trying to swap my face for one of my friends’ faces and thought it would be cool if my voice could also be changed to sound like my friend’s voice.
This has been asked quite a lot. Strides are being made, but voice is a lot harder. At this moment in time, no one has made a decent AI that doesn't sound robotic.
My word is final
While a good AI audio swapping tool might not exist yet, does anyone have any recommendations for a good audio editor/mixer that can change lots of different aspects of a voice (or an mp3 file), so we could at least get a voice that's somewhat close? Kind of like changing the pitch to make it deeper, but taking it much further than just changing the pitch.
Edit: it doesn't have to be a program, a free online tool would be fine too.
There is currently a voice AI program online that lets you recreate various celebrity and video game voices, with varying degrees of success: https://fifteen.ai/app
Currently the voices they offer are the main cast of Portal, My Little Pony, The Stanley Parable, Team Fortress 2, Undertale, and the Tenth Doctor from Doctor Who. So realistically there's just one person you might deepfake. But they seem to be constantly adding more, so maybe one day they'll have more celebs.
Regarding the audio editor, you may want to look at Steinberg's SpectraLayers. This goes WAY further than most editors, which mostly only handle editing, mixing and special effects. SpectraLayers allows you to view your audio as a spectrum (FFT transformed, so frequency is on the vertical axis and intensity is color coded). You can view that spectrum isometrically, so that you see spectrum intensity "from above" and at any angle. You can then perform editing, including moving sounds in both time and frequency. For example, you can easily delete background noises (or even voices) with little or no impact on the rest of the audio. Even better, you can move a selection of several areas as a group to a different location "vertically", changing the pitch and frequency relationship of only those items without affecting the rest of the background. You can alter a voice by shifting the frequency or changing the intensity of the different harmonics within a voice. When you shift things in frequency, it automatically takes care of all the little details, such as logarithmic volume modification, so things sound right in their new location. One of their tutorials uses the example of reducing a horrible sounding guitar "string slide scratch" to a more pleasant sound by going in and lowering just the high intensity peaks of the harmonics with a partial eraser, without affecting any of the other sound.
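For anyone curious what that spectrum view boils down to, here is a minimal sketch in plain Python (standard library only). It is not how SpectraLayers works internally; it just slices a signal into short windowed frames and takes a Fourier transform of each, which gives the "frequency on the vertical axis, intensity as color" grid described above. The sample rate, frame size, hop size and the two test tones are all values I picked for illustration.

```python
import cmath
import math

RATE = 8000        # samples per second (assumed for this sketch)
FRAME_SIZE = 256   # samples per analysis frame (assumed)
HOP = 128          # step between frames (assumed)

def hann(n):
    """Hann window, used to soften the edges of each frame."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def dft_magnitudes(frame):
    """Naive DFT of one frame; returns magnitudes for the lower half of the bins."""
    n = len(frame)
    mags = []
    for k in range(n // 2):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags

def spectrogram(samples, frame_size=FRAME_SIZE, hop=HOP):
    """List of frames; each frame is a list of magnitudes indexed by frequency bin."""
    window = hann(frame_size)
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        chunk = [samples[start + i] * window[i] for i in range(frame_size)]
        frames.append(dft_magnitudes(chunk))
    return frames

def tone(freq_hz, seconds, rate=RATE):
    """Pure sine tone, used here as a stand-in for real audio."""
    return [math.sin(2 * math.pi * freq_hz * i / rate) for i in range(int(seconds * rate))]

# A 440 Hz tone followed by an 880 Hz tone.
signal = tone(440, 0.1) + tone(880, 0.1)
spec = spectrogram(signal)

# Frequency (in Hz) of the strongest bin in each frame.
bin_hz = RATE / FRAME_SIZE
peaks = [max(range(len(f)), key=f.__getitem__) * bin_hz for f in spec]
```

In `peaks`, the early frames sit at the bin nearest 440 Hz and the late frames at the bin nearest 880 Hz. Moving a selection "vertically" in an editor like this amounts to relocating energy between those bins before converting back to audio, which this sketch does not attempt.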
It's a pretty insane piece of software, and I suspect there is a hell of a lot of math going on in the background. The professional version (used by studios) is pricey, but I got the new SpectraLayers Elements version, which has way more capability than I'll ever use, for $80. I was hoping to use this at some point to see if I could grok voice dynamics well enough to develop an AI model for voice modification. Wow, way above my paygrade. As noted previously, voice seems to be far more complex than a simple face. BUT, if you want to do things like get rid of unwanted voices and noises without messing up the rest of the soundscape, this will do it. And it will also allow you to see exactly what the same words look like from different people, and allow you to mess around with them. It just won't convert speaker A to speaker B, which is what a lot of us would like to see someday.
BTW, I started to understand the complexity of audio when I looked at a single guitar string pluck in SpectraLayers. I expected to see a single frequency tone moving across the screen. Nope. Now I understand what they mean when someone talks about a guitar's "tone". There is the original string frequency, but then there are harmonics and overtones going all the way up the spectrum. Looking at them from an angle, you can see the different harmonics pulsing in intensity at different, apparently unrelated rates. After seeing how complex something as simple as a single note on a static guitar body was, I can only imagine the interactions between a constantly varying larynx frequency and the various nasal, throat and mouth cavities.
My dad worked with Bell Labs when they were working on speech synthesis back in the '70s, and I remember they were using various analog filters to represent the different parts of the vocal tract. So they would take a phoneme and then warp it through all of these analog filters to get the final result. Which sounded about as bad as the 15.ai site. I remember the huge breadboard singing "Daisy", like the HAL 9000 in 2001 when they were turning it off. I suspect that's where the movie got the idea from. We take TTS for granted now - I use the @VoiceAloud reader daily for listening to articles. But I have the utmost respect for the researchers who have worked on this tough nut over the years.
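To give a rough feel for the "source pushed through vocal tract filters" idea, here is a small digital sketch in plain Python. It is not a reconstruction of whatever Bell Labs actually built: the buzz source, the two-pole resonators standing in for the analog filters, and the formant frequencies and bandwidths (ballpark figures for an "ah"-like vowel) are all my own assumptions for illustration.

```python
import cmath
import math

RATE = 8000  # samples per second (assumed for this sketch)

def pulse_train(f0_hz, seconds, rate=RATE):
    """Crude stand-in for the larynx: one unit impulse per pitch period."""
    period = round(rate / f0_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(int(seconds * rate))]

def resonator(samples, center_hz, bandwidth_hz, rate=RATE):
    """Two-pole resonator, playing the role of one vocal tract filter."""
    r = math.exp(-math.pi * bandwidth_hz / rate)           # pole radius from bandwidth
    a1 = 2 * r * math.cos(2 * math.pi * center_hz / rate)  # feedback on y[n-1]
    a2 = -r * r                                            # feedback on y[n-2]
    out, y1, y2 = [], 0.0, 0.0
    for x in samples:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

def magnitude_at(samples, freq_hz, rate=RATE):
    """Strength of a single frequency in the signal (one-bin Fourier sum)."""
    acc = sum(s * cmath.exp(-2j * math.pi * freq_hz * i / rate)
              for i, s in enumerate(samples))
    return abs(acc) / len(samples)

# (center frequency, bandwidth) pairs in Hz: illustrative values only.
FORMANTS = [(700, 100), (1100, 120), (2500, 160)]

source = pulse_train(100, 0.25)   # 100 Hz buzz, a quarter of a second
vowel = source
for center, bandwidth in FORMANTS:
    vowel = resonator(vowel, center, bandwidth)  # filters applied one after another

near_formant = magnitude_at(vowel, 700)    # a harmonic sitting on the first formant
far_from_any = magnitude_at(vowel, 3800)   # a harmonic well away from all formants
```

The buzz on its own has equally strong harmonics all the way up. After the filters, harmonics near the formant frequencies are boosted and the rest are pushed down, and that reshaping is what makes a buzz read as a vowel. A real voice changes both the pitch and the filter settings continuously, which is a big part of why a fixed setup like this sounds so robotic.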
FOUND! It's not exactly what you were looking for (direct translation of speaker A to speaker B), but it's pretty close. LyreBird.AI, which is a part of Descript, allows you to train speaking voices for things like podcasts, and then output text using that voice. So once you set up (train) your voice, it basically allows you to edit a script, including sound effects, and then it will read the script back using your voice. You can use phonetic pronunciation to make words sound better, and you can set up the same voice with different speech styles (neutral, excited, depressed, etc). Finally, you can record live or upload an audio recording to create the script using speech recognition. And it sounds surprisingly good. A free subscription allows 3 hours of transcription per month. Good luck!
I tried the free version of the Descript app... OMG WHAT A DUD!!! Wasted 50 mins recording 50 voice samples, then they took about an hour to process the results (pretty quick by my estimate), and the end result was soooo robotic. Highly disappointed.
There have been a lot of conversations about adding voice swap to face swap, and most of the tools I've tested out there still give a pretty robotic output (including Lyrebird/Descript). So this is my call to arms for all you amazing geniuses: let's get together, let's ponder and let's build this shit out! I can be the guy who does the dirty work for you...
Message from moderator
Merged two posts
Voice swapping is a difficult problem to solve, as evidenced by the lack of decent solutions. Ultimately there is a lot more nuance in voice than there is in images.
My word is final