Deepfake Audio Detection Method Spots Fake Voices

Engineers have developed a deepfake audio detection method designed to sniff out increasingly realistic audio deepfakes. Unlike the more commonly known video deepfakes, which can reproduce the look and voice of an actual person, audio deepfakes use advanced AI software to almost perfectly replicate an actual human voice. As a result, these software-generated fake voices are increasingly used by con artists to trick unsuspecting victims.


Until recently most folks had never heard the term deepfake. Then a video of Tom Cruise that wasn’t actually a video of Tom Cruise shocked the world.

Since then, folks have become increasingly aware of this latest technological advancement and its obvious potential for deception. Perhaps unsurprisingly, opportunistic criminals have already begun to use audio deepfakes that can replicate the voice of someone familiar or even someone famous to run online and telephone cons.

Now, the team behind the new detection method says it has found a weakness in the AI-generated impersonators and has developed a set of custom algorithms to determine if the familiar voice on the other end of that call is the real thing or an artificial computerized construct.



To develop their audio deepfake spotting method, researchers Joel Frank and Lea Schönherr from the Horst Görtz Institute for IT Security at Ruhr-Universität Bochum assembled roughly 118,000 samples of synthesized voice recordings. To ensure their dataset represented a wide spectrum of audio synthesizing methods, they used six distinct audio generation tools to make the samples. All told, the research team ended up accumulating almost 196 hours of fake voice recordings in both English and Japanese.

“Such a dataset for audio deepfakes did not exist before,” explained Schönherr in the press release announcing the new detection method. “But in order to improve the methods for detecting fake audio files, you need all this material.”

Next, the researchers plotted the audio files as spectrograms and compared genuine recordings side by side with the deepfakes. Almost immediately, a pattern began to emerge. Specifically, the release explains, the side-by-side comparison “revealed subtle differences in the high frequencies between real and fake files.” These differences, the team states, were significant enough to allow them to determine which file is real and which is fake.

These spectrograms show the frequency distribution over time of a real (top) and a fake audio file (bottom). The subtle differences in the higher frequencies are marked with red circles. © RUB, Lehrstuhl für Systemsicherheit
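For readers curious how a high-frequency difference can be measured in a spectrogram, here is a minimal sketch in Python. It is not the researchers' algorithm. The function names (`spectrogram`, `high_band_energy_ratio`), the 4 kHz cutoff, and the two synthetic test tones are all assumptions made for illustration, and the toy signals stand in for real recordings only to show how a difference confined to the upper band becomes visible as a number. It uses only the standard library so it runs anywhere; a practical version would more likely rely on something like `scipy.signal.spectrogram`.

```python
import cmath
import math


def spectrogram(samples, frame_len=64, hop=32):
    """Return one list of magnitudes (bins 0..frame_len//2) per time frame."""
    # A Hann window limits leakage of a tone into distant frequency bins.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1
    # Precompute the DFT basis once; it is reused for every frame.
    basis = [[cmath.exp(-2j * math.pi * k * n / frame_len)
              for n in range(frame_len)] for k in range(n_bins)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        frames.append([abs(sum(x * b for x, b in zip(frame, row)))
                       for row in basis])
    return frames


def high_band_energy_ratio(samples, sample_rate, cutoff_hz,
                           frame_len=64, hop=32):
    """Fraction of spectral energy in bins at or above cutoff_hz."""
    spec = spectrogram(samples, frame_len, hop)
    bin_hz = sample_rate / frame_len           # width of one frequency bin
    first_high_bin = math.ceil(cutoff_hz / bin_hz)
    total = sum(m * m for frame in spec for m in frame)
    high = sum(m * m for frame in spec for m in frame[first_high_bin:])
    return high / total if total else 0.0


# Toy signals (assumptions, not real data): both contain a 440 Hz tone,
# but only signal_a also carries a quieter 6 kHz component.
sample_rate = 16_000
n_samples = 2048
signal_a = [math.sin(2 * math.pi * 440 * t / sample_rate)
            + 0.3 * math.sin(2 * math.pi * 6000 * t / sample_rate)
            for t in range(n_samples)]
signal_b = [math.sin(2 * math.pi * 440 * t / sample_rate)
            for t in range(n_samples)]

ratio_a = high_band_energy_ratio(signal_a, sample_rate, cutoff_hz=4000)
ratio_b = high_band_energy_ratio(signal_b, sample_rate, cutoff_hz=4000)
print(f"high-band energy share, signal_a: {ratio_a:.4f}")
print(f"high-band energy share, signal_b: {ratio_b:.4f}")
```

The two signals sound almost alike at low frequencies, yet their share of energy above the cutoff differs clearly. A real detector would have to learn which high-frequency patterns matter from labeled data rather than rely on a fixed threshold like this one.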


Based on their findings, which were presented at the December 7th, 2021, Conference on Neural Information Processing Systems, the two researchers say they have developed a set of algorithms which employ their method to distinguish between an actual human voice and one that has been synthesized.

However, they also caution in the release that, even with the detection software’s success, their work is not yet a fully realized system. Instead, they state, “these algorithms are designed as a starting point for other researchers to develop novel detection methods.”

Follow and connect with author Christopher Plain on Twitter: @plain_fiction