Research in progress: voice biometrics, work that will leave you speechless!
Today, a growing number of applications rely on voice and, more specifically, on voice recognition. This technology, in addition to presenting certain risks, requires a great deal of work from researchers around the world, including at EPITA through the Security & System Laboratory (LSE) working group dedicated to automatic speaker and language recognition (ESLR).
On the sidelines of the new video episode of the “Research in progress, EPITA Innovation Laboratory” series produced by the school, Reda Dehak, teacher-researcher on the Artificial Intelligence Team and voice biometrics specialist, took the floor to explain his job and the goals behind his work.
What is your definition of research?
Reda Dehak: To advance the current state of the art, in other words, to bring innovation to what we do best today in order to make improvements for the future. A society without research is like a day without a tomorrow, and a life without a future. Without research, it is impossible to do new things. And without innovation, we cannot move forward, only backward!
What is your field of research?
I work in a particular field linked to automatic speech processing, which we call “voice biometrics” in our jargon and which is known, in simpler terms, as speaker identification or voice recognition. The goal is to recognize an individual by their voice, in the same way that facial recognition identifies someone by their face. The voice we are trying to recognize comes from an audio or video recording, or from a phone call. As smartphones and mobile applications have become increasingly widespread, it is essential that we are able to correctly identify the people with whom we speak. It is therefore vital to secure these systems using the human voice, which is the only information available over these means of communication.
Can you give us an example of how voice biometrics may be used?
Traditionally, the most important uses are in legal investigations, when a person is wiretapped. In order to provide solid evidence, it is necessary to be able to identify the person who is speaking! However, research in this field does not yet allow for 100% certain verification of the speaker’s identity. Speech is subject to a great deal of variability, and when something keeps changing form, it becomes much more difficult to identify. Our goal is to improve the reliability of these systems and to provide increasingly powerful methods and algorithms to address these problems. This is a challenge because we do not speak in the same manner depending on our mood or our level of fatigue, for example! There is a second significant source of variability linked to the environment: one can speak in a quiet place or on the street with ambient noise, which adds complexity. There is a third source of variability related to recording: there are several types of microphones, each with a different impulse response. And there is a fourth related to transmission: sending speech signals over communication networks involves compression and therefore leads to loss. In short, all of these issues must be taken into account when developing the most accurate system possible. Moreover, speech conveys a great deal of other information and, in the laboratory, we work not only on speaker identification but also on identifying emotions and language…
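To make the environmental variability concrete, here is a minimal sketch of how ambient noise can be mixed into a clean signal at a chosen signal-to-noise ratio (SNR). It is an illustration only, not the laboratory's method: the sine wave standing in for speech, the Gaussian "street" noise, and the function names `mix_at_snr` and `measured_snr_db` are all assumptions made for this example.

```python
import math
import random


def mean_power(samples):
    """Average power of a list of samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    scaled = [gain * n for n in noise]
    mixed = [s + n for s, n in zip(speech, scaled)]
    return mixed, scaled


def measured_snr_db(speech, noise):
    """SNR in decibels between a clean signal and the noise added to it."""
    return 10 * math.log10(mean_power(speech) / mean_power(noise))


rng = random.Random(0)
# Hypothetical stand-ins: a sine wave for speech, Gaussian noise for the street.
speech = [math.sin(2 * math.pi * 0.01 * t) for t in range(1000)]
street = [rng.gauss(0, 1) for _ in range(1000)]

for snr in (30, 10, 0):
    noisy, scaled = mix_at_snr(speech, street, snr)
    print(snr, "dB requested ->", round(measured_snr_db(speech, scaled), 2), "dB measured")
```

A lower SNR means the noise is closer in power to the voice, so the same speaker produces a noticeably different signal in a quiet room and on a busy street.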
What technologies are used in voice biometrics?
Voice biometrics relies on spectral analysis of the voice to extract the most useful features, and on machine learning techniques (statistical models and deep neural networks) to model the voiceprint.
From a practical point of view, there are two kinds of technology. On the one hand, there are technologies based on “passphrases”: you need to say a specific phrase to be identified. This approach can be easily circumvented because a recording of the “passphrase”, when replayed, triggers identification. On the other hand, there are more secure technologies called “text independent”, which is what we are working on. This means that identification does not depend on the sentence spoken. However, now that audio generation systems are available, we are faced with potential identity theft. For example, there are voice conversion solutions that try to transform my “hello” into your “hello”. There are also fairly robust generation systems (WaveNet, for example) which, once they have a sufficient number of recordings of your voice in their database, are capable of generating audio from scratch that matches your voice: this is called deepfake audio! Although these new tools can be used for humorous imitations, as impersonators do, they can also be used maliciously to harm people.
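As a rough illustration of that pipeline, the sketch below extracts spectral features frame by frame, averages them into a toy "voiceprint", and compares two voiceprints with cosine similarity. It is a deliberately simplified stand-in: real systems use richer features and the statistical or neural models mentioned above. The synthetic "voices" (a few tones plus noise), the frame sizes, and every function name here are assumptions made for the example.

```python
import cmath
import math
import random


def log_spectrum(frame):
    """Log-magnitude spectrum of one frame, using a naive DFT."""
    n = len(frame)
    # A Hamming window reduces spectral leakage at the frame edges.
    windowed = [
        x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]
    spectrum = []
    for k in range(n // 2):  # keep the non-redundant half of the bins
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(windowed)
        )
        spectrum.append(math.log(abs(acc) + 1e-9))
    return spectrum


def voiceprint(signal, frame_len=64, hop=32):
    """Toy 'voiceprint': the mean log-spectrum over frames, centred on zero."""
    frames = [
        signal[start:start + frame_len]
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]
    spectra = [log_spectrum(f) for f in frames]
    mean_spec = [sum(col) / len(spectra) for col in zip(*spectra)]
    offset = sum(mean_spec) / len(mean_spec)
    return [v - offset for v in mean_spec]


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def synthetic_voice(freqs, rng, n=2048, noise=0.1):
    """Hypothetical stand-in for a recording: a few tones plus background noise."""
    phases = [rng.uniform(0, 2 * math.pi) for _ in freqs]
    return [
        sum(math.sin(2 * math.pi * f * t + p) for f, p in zip(freqs, phases))
        + rng.gauss(0, noise)
        for t in range(n)
    ]


rng = random.Random(0)
enrolled = voiceprint(synthetic_voice([0.05, 0.20], rng))
same_speaker = voiceprint(synthetic_voice([0.05, 0.20], rng))
other_speaker = voiceprint(synthetic_voice([0.10, 0.35], rng))

print("same speaker :", cosine_similarity(enrolled, same_speaker))
print("other speaker:", cosine_similarity(enrolled, other_speaker))
```

In a verification setting, the score would be compared with a threshold: set it too low and impostors are accepted, too high and genuine speakers are rejected.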
This is why we must continue our research to combat this new problem.
Are EPITA students also involved in your work?
As this field of research requires extremely advanced mathematical and statistical concepts, beyond what the EPITA curriculum offers, it is not easy to invite students to take part in our research projects. Nonetheless, we make sure that they are involved in the numerous subtasks! The goal is to ensure that they understand what research is and what is behind the systems on our smartphones: Siri, Google Assistant and Alexa, among others.
Do you work with other researchers outside of EPITA?
Of course, we are not alone in this field: there is an entire community in France and abroad that works on the subject. We regularly meet at conferences, workshops and competitions to exchange information about each other’s progress, build solid relationships and, sometimes, create promising collaborations. These interactions are essential because research is a long-term process. Moreover, as researchers, we do not immediately think about how our research will be used or the impact that it could have, because the results of our work are generally not applied for a year or more. A good example is voice biometrics: to this day, and despite all the progress made, it still does not constitute sufficient evidence to convict a person in a court of law in France. Of course, it may be used in a legal case, but not as the main piece of evidence, even though we are able to obtain results that are comparable to those of a fingerprint. At the end of the day, it is not the researcher’s job to implement what they find, simply because this requires other skills and, often, greater resources.
What are you most proud of as a researcher?
After all these years, I continue to be fascinated by automatic speech processing and am very happy to still be working on it! When you know that this field can be used to save lives by proving a defendant’s innocence or preventing an attack, it makes you feel useful. In any case, being a researcher is a true vocation, not simply a day job. You are constantly examining and documenting the world around you, even on vacation and on weekends, and sometimes you even work at night… It is as exciting as it is demanding. Personally, I can’t say how I came to make it my job: little by little, it simply became my calling. This is also the case for voice biometrics. I started becoming interested in the subject about ten years ago while working on machine learning, as it was one of the few areas where there was a lot of data available to develop machine learning systems.