The first thing we ask Voicemod’s CEO and co-founder, Jaime Bosch, when he picks up the phone to talk about a new funding round is not something we’re accustomed to asking, but our question may become the norm in the generative AI future that’s fast-flying at us: Is this your real voice?
Bosch’s startup has been fiddling with audio effects for almost a decade, playing in the field of digital signal processing (DSP) — where its early focus was on creating fun ‘sound emoji’ effects and reactions for gamers to spice up their voice chats. And gamers do remain its main user-base (for now). But the audio field is being charged by developments in AI — which Voicemod’s team is hoping will lead to whole new use-cases and many more users for its tools.
So where DSP technology was about applying effects to a person’s (real) voice, developments in artificial intelligence are enabling startups like Voicemod to offer tools to create entirely synthesized (unreal) voices. And even the ability for users to ‘wear’ these voices in real-time — so they can speak with a voice that isn’t theirs. Think of it as the audio equivalent of a Snapchat lens or TikTok’s viral teenage filter or Reface’s celebrity face-swaps.
AI voice can even enable voice-shifting into another person’s (real) voice. And not just for talking about the weather or shooting the shit. But for what’s known as sing-to-sing voice conversion. Meaning you could get to sing in someone else’s voice — supercharging your karaoke game, say, by singing Bohemian Rhapsody as literally the voice of Freddie Mercury. And even switching between Mercury, May and Taylor, for the full mock opera effect if you have enough trained AI models (and microphones) on hand. Mamma-mia!
Artificial intelligence makes all this possible, even if legal and ethical questions may give pause for thought about rushing to unleash real-time voice-shifting upon a world that still relies plenty upon fixed identities. (Banks pushing customers to record ‘a unique voiceprint’ to use as a password definitely need to sit tf up and start listening.)
Voicemod acquired another audio effects startup last year, called Voctro Labs, whose technology Bosch says it’s working to blend with its own to create an amped up hybrid platform. The combo has already allowed it to expand what it offers — launching a text-to-song feature last December which lets you turn your own lyrics into a vocal composition using generative AI. He tells us more is on the way — including the aforementioned sing-to-sing feature.
Voctro’s tech may be familiar as it was involved in the development of a voice clone of musician Holly Herndon which appeared in a viral TED Talk last year, in which her AI voice could be heard duetting in real-time with the real voice of another musician (Pher). Which, well, if you haven’t already seen it is quite the visual-audio spectacle, as well as being a mouthful to explain. It’s also a taster of what Voicemod has coming to a keyboard near you.
“We’re definitely going to launch more products and more ways for people to express themselves with the generative AI technology,” Bosch tells us. “Not all Voctro Labs’ technologies are related to music, but they have a lot of technology related to singing, from this text-to-song technology to sing-to-sing technology in real time. So we have a lot of new projects and new products upcoming.
“We are going to strengthen our speech-to-speech AI real-time technology, because we are basically merging our technology with their technology. We’re basically creating a hybrid technology that will be better than ours, or there’s a mix of both… [So their sing-to-sing technology will be] combined with our DSP technology, that we could use to do autotune. So we could potentially help artists with their voice and on the tone. And so this is, this is gonna be really, really interesting.”
As well as providing direct-to-consumer/creator audio tools, Voicemod offers its technologies via SDK and APIs for third parties to integrate into their own products, from games and apps to hardware. So it’s set up to distribute its tech across the gamer-creator ecosystem and have demand come find it.
Generative AI-powered disruption in audio of course mirrors (in a non-exact fairground ‘crazy mirror’ kind of a way) developments we’re seeing happen elsewhere: Visually, to graphics and illustration, as a result of deep learning and the advent of prompt-based image generation interfaces (such as DALL-E and Stable Diffusion). Also to the written word, through the large language models that underpin generative AI chatbots like ChatGPT that can produce song lyrics or a whole essay on demand. And, indeed, in the case of musical composition — where Google recently showed off a prompt-based generative AI song composer which can apparently produce arrangements that match the musical vibe you describe (although it said it’s not releasing that particular generative AI model — but surely someone else will).
It’s clear that AI is bending the rules of what it’s possible for a single person to create. And, well, as with freedom, the open concept, this is both thrilling and terrifying. Because it’s what you do with it that counts.
The coming years are going to be all about finding out what people do with such powerful AI tools at their fingertips.
Voicemod is positioning itself to ride this wave by building a toolbox for creators to survive and thrive in a reality-bending future and across a range of use-cases — hence it’s talking in terms of sonic identity and voice avatars for the social metaverse (at the future-gaze-y end) but also just helping you sound your sparkling best on a work Zoom call. So a sort of audio make-up as it were. Apply as needed.
“Now suddenly everyone can become a creator,” predicts Bosch of the generative AI boon. “Everyone can come, basically, with no skill set. Or with no learnings on how to really craft those audios. They will be able to actually create those pieces of music. Songs. And this eventually evolves into, probably, even voices. So the ability to create voices.”
“This could potentially be something really viral for platforms like TikTok, or YouTube Shorts or Instagram… And this could eventually evolve into things like karaoke, for example. And be, I don’t know, part of game consoles, or things like that, for people to use this to entertain. And, if we go a step further — and it’s the technology getting better and better as we think it will be — this could potentially be a professional tool for people who want to create music. Or for people who want to create voices for movies or voices for games characters.
“We have a strong belief in user-generated content, and we are building tools for our users to start creating sounds and creating voices. And we will be putting technology in the hands of the users to create those [sounds]. And, eventually in the future, hopefully, they will go even to a professional level.”
So while — currently — in order for the startup to synthesize a whole voice it does still involve a team of sound engineers and designers, Bosch suggests generative AI will put that power in the hands of the individual — and it’ll happen soon; “in the near future”.
“I don’t know if we’ll be prompting. Now we’re in this wave of everything is done through prompts. I’m not sure if that will be the way or it will be more tools that will have AI technology embedded and we have user experiences that will make things a lot easier,” he adds. “But definitely what I see from generative AI in the audience but also in the management phase is that suddenly everyone can become a creator, which I think is really interesting.”
The birth of AI voice may not sound like amazing news for the employment prospects of sound engineers and designers (albeit, tech advances may simply create new requirements that just shift where their expertise is needed). But Bosch reckons that voice actors, at least, will still have a key role to play — emoting for AI. Since robot voices aren’t good at getting the pitch and intonation, or indeed emotion, right. It’s a voice clone without a soul, basically. (Or as Nick Cave might put it, AI voice lacks ‘its own blood, its own struggle, its own suffering’ — it lacks humanness.)
“I think that you will always need a human factor in your sample with these voices,” suggests Bosch. “You could have the best voice — of even a famous person — but what really comes is the impression. You still need a human to do the cadence on the words. You still need a human to do the rhythm, the tone. So [it’s not just that] I can speak normally and I will sound like a famous person — no, you don’t — you still need to act a little bit. So… I think human factor for expression is key.”
Might generative AI not be able to learn to emote as well, with the right human data-sets, and further dial up its mimicry so as to make us laugh or cry or love or hate on-demand too?
“Yeah. Well, we will see,” responds Bosch. “I’m not sure. I mean, as of today, for me AI is a tool to be used by humans. But yeah, we don’t know where this is going to evolve.”
Voicemod is gearing up for whatever phonic craziness lies ahead with a fresh tranche of funding. The 2014-founded startup has been revenue generating for years, via pro versions of its tools (its main product, Voicemod for Desktop, has had more than 40 million downloads to date, while Bosch says it has 3.3 million monthly active users), but it’s just closed $14.5 million in expansion funding, following an $8 million Series A back in summer 2020. Madrid-based Kfund’s growth fund Leadwind led the round, with participation from Minifund (Eros Resmini, former CMO at Discord) and Bitkraft Ventures.
“We’re super excited by what generative AI can do to all creative industries and more specifically audio, especially when it comes to enhancing and augmenting the job that creative people already do,” Jaime Novoa, partner at Kfund, tells TechCrunch. “In the past few months there’s been an explosion in generative AI in general and more specifically in audio but we think this is a phenomenon that’s just starting.
“What many of the cool technologies being launched to market lack are concrete and scalable business models attached to them, and Voicemod differentiates itself from the pack by having built a product used by millions of people on a daily basis and with significant revenue traction. We’re super excited about what Jaime and the rest of the Voicemod team have in the pipeline and what’s to come.”
Voicemod says the extra funds will be used to enhance the development of its real-time AI voice identity capabilities — and dial up its proposition for Gen Z, gamers, content creators, and professionals of all skill levels wanting tools to help them express themselves vocally in digital spaces.
Per Bosch, part of the reason it’s taking more funding now relates to the acquisition of Voctro Labs. Beyond that, he says it’s about making the most of the opportunities sparking off the Cambrian explosion in generative AI tools.
“We are in the middle of tremendous revolution in AI,” he says. “We want to be well funded in order to be able to develop technology but also to be able to deliver technology to users. So I think one of our competitive advantages is that we already have the market and the traction and we basically are able to put this in the hands of the users. And I want to make sure to have enough runway, also due to market conditions, to be able to put all of this in place. So it will be mainly focused… on building the next generation AI technology and putting it in the hands of the users and also building these creation tools for the users to create content.”
The first new tool will be landing next month — with a launch of Voicemod’s desktop product on macOS (currently it’s PC only). The goal is to evolve into a multi-platform product spanning all devices. “We’re also working on a creation tool mobile app that hopefully will see the light towards the beginning of next quarter. And, and yeah, some more stuff to come, hopefully,” Bosch adds.
He also tells us the startup is working on a watermarking technology which it hopes to launch in Q2 this year — to give platforms a way to be able to spot AI-generated voices in the wild.
Such a feature is likely to be a vital tool to counter all the possible negative use-cases (scams, fraud, manipulation, abuse, bullying, trolling etc etc) one could imagine humans coming up with for voice-shifting tools that let you sound exactly like someone you’re not.
“It’s an algorithm to watermark the audio,” explains Bosch. “Moderation is complicated because it really changes depending on the space… on which are the platforms where the audio is used, so we believe that the channel is the one that should own that moderation and what we are doing is we will be providing this watermarking system in order for them to be able to know if the audio is created via synthetic voice or is created by a real voice.”
“Every single new technology can be used for the good or for the bad,” he adds. “So we are of course putting some technology, some tools in place to be able to have more control around a misuse of this technology.”
On questions of licensing for training data, IP issues here are currently a grey area — as the law hasn’t caught up with developments in AI (let alone generative AI). That means startups operating in the space have to consider whether to make the most of total legal freedom to do whatever they want (and hope expensive consequences don’t come clanging down on them in short order), or tread more carefully and thoughtfully. (Other startups in the space include the likes of Voice AI, Koe and ElevenLabs.)
Bosch claims Voicemod is taking the latter approach — using (paid) voice actors to build up data-sets to train and hone its AI models. If it wants to make use of some original content he says the team will go to the IP provider and negotiate — and figure out what kind of licensing terms they’d be up for. (The generative AI boom is also a crazy-thrilling time to be an IP lawyer, clearly.)
“We are basically pioneering here,” he adds. “So a lot of things are without laws yet so we were trying to stick to our values, basically, and try to do the right thing. That’s our approach on the data [side]. But yeah, you’re completely right: there’s no ‘legal attachment’ to your voice, as of today… We own our fingerprint. You don’t own, like, whatever the fingerprint of your voice [is]. As of today.
“It sounds a little bit like science fiction but maybe, in the future, we will ‘own’ something related to our voice.”
For the record, Bosch was talking to me with his actual voice. The company’s real-time voice-shifting technology doesn’t yet work over mobile. But he says that’s coming too. So buckle up: The synthesized future is gonna be a screaming wild ride.