
Voice recognition: How does it work exactly?



Like many people in recent years, you have witnessed the massive democratization of personal assistants. Seeing a friend or co-worker give strange orders or ask funny questions to a phone has never been as normal as it is today. As you may have noticed, we are in the era of cognitive technologies, and speech recognition is one of them.

Many people talk about it, but who really knows how it works? A hint: you, just after reading this article.

To gain insight into this process, it is necessary to understand what it is made of. A total of 5 technological building blocks make up the voice recognition process.

At the origin of this technology is the human voice.

The first step, which initiates the whole process, is commonly called the Wake-up Word (or Hot Word). It is not necessarily a voice command: it can also be a push button or another interaction between the user and the machine. The main purpose of this step is to activate voice recognition (TWU, which will be explained later), i.e. to “wake up” the system so that it can start recording. This matters all the more in the current context, where many people fear that technology will swallow up their privacy. Until the action has been performed or the necessary words have been spoken, speech recognition stays on standby and records nothing.
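The standby behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the wake-up phrase `hello assistant` and the function name `gate_utterances` are invented for the example, and a real system would detect the wake-up word in audio rather than in text.

```python
from typing import Iterable, List

# Hypothetical wake-up phrase, chosen only for this example.
WAKE_WORD = "hello assistant"


def gate_utterances(heard: Iterable[str], wake_word: str = WAKE_WORD) -> List[str]:
    """Return only the utterances spoken right after the wake-up word.

    Anything said while the system is on standby is discarded, never stored.
    """
    awake = False
    recorded: List[str] = []
    for utterance in heard:
        text = utterance.strip().lower()
        if not awake:
            # Standby: only check for the wake-up word and keep nothing else.
            awake = text == wake_word
            continue
        # Awake: record exactly one command, then go back to standby.
        recorded.append(utterance)
        awake = False
    return recorded


if __name__ == "__main__":
    stream = ["private chat", "hello assistant", "switch on the light", "more private chat"]
    print(gate_utterances(stream))
```

Returning to standby after a single command is a design choice made for this sketch; it keeps the amount of recorded speech as small as possible.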

Once the system is awake, it needs the user's speech to work with. The speech must first be recorded and digitized via the TWU: put simply, it must be recognized! At the end of this step, the voice has been translated into sound frequencies (much like music) that the system can interpret. To make these frequencies easier to understand, various processing operations are carried out:


  • Normalization, which smooths out peaks and valleys in the different frequencies in order to harmonize them.
  • Background noise suppression, which improves audio quality.
  • Division of the signal into phonemes (distinctive units, measured in thousandths of a second, that allow words to be distinguished from each other).
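As a rough illustration of these three operations, here is a minimal Python sketch working on a list of audio samples. It rests on several simplifying assumptions that are not in the article: normalization is done by peak scaling, noise suppression by a crude amplitude threshold, and the signal is cut into fixed-length frames of a few milliseconds rather than into true phonemes. All function names and parameters are hypothetical.

```python
from typing import List


def normalize(samples: List[float]) -> List[float]:
    """Peak-normalize: scale the signal so the loudest sample has magnitude 1.0."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silent or empty signal: nothing to scale
    return [s / peak for s in samples]


def noise_gate(samples: List[float], threshold: float = 0.05) -> List[float]:
    """Zero out very quiet samples, a crude stand-in for noise suppression."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]


def split_frames(samples: List[float], sample_rate: int, frame_ms: int = 10) -> List[List[float]]:
    """Cut the signal into consecutive frames lasting frame_ms milliseconds each."""
    size = max(1, sample_rate * frame_ms // 1000)
    return [samples[i:i + size] for i in range(0, len(samples), size)]


if __name__ == "__main__":
    raw = [0.5, -0.25, 0.01, 0.4, -0.5]
    cleaned = noise_gate(normalize(raw))
    print(split_frames(cleaned, sample_rate=1000, frame_ms=2))
```

Real systems use far more elaborate signal processing, but the order of the steps (level the signal, clean it, cut it into short units) is the same as in the list above.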


The frequencies can then be analyzed by a previously trained neural network (Deep Learning): an algorithm capable of analyzing a large amount of information and building a “database” of associations between frequencies and words. This makes it possible, particularly through statistical analysis, to match a frequency to the most common word, which is therefore in theory the most likely to be correct. For example, take the two phrases “a glass of water” and “a water worm”, which sound identical in French (“un verre d'eau” and “un ver d'eau”). The first one will be selected because “glass” is used far more often than “worm” with “water”.
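The statistical choice between two candidate transcriptions can be sketched as follows. This is a toy example, not a neural network: the usage counts are invented for illustration, and the table `PHRASE_COUNTS` and the function `most_likely` are hypothetical names.

```python
from typing import Dict, List

# Hypothetical usage counts, invented purely for illustration.
PHRASE_COUNTS: Dict[str, int] = {
    "a glass of water": 9500,
    "a water worm": 12,
}


def most_likely(candidates: List[str], counts: Dict[str, int] = PHRASE_COUNTS) -> str:
    """Pick the candidate transcription that was seen most often."""
    # Phrases missing from the table count as never seen (0 occurrences).
    return max(candidates, key=lambda phrase: counts.get(phrase, 0))


if __name__ == "__main__":
    print(most_likely(["a water worm", "a glass of water"]))
```

A trained model plays the role of the counts table here: it has learned from many examples which word sequences are common and which are rare.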


Once voice recognition and the various processing operations have been carried out, the data are sent directly to the NLP (Natural Language Processing) system. The main mission of this technology is to analyze the sentence and extract its meaning. To do this, it starts by splitting the sentence into words (“tokenization”) and associating a tag with each of them. These tags are “labels” affixed to each word in order to characterize it. For example, “I” will be defined as a first-person singular pronoun, “switch on” as the verb defining an action, “the” as the determiner referring to “light”, which is a common noun and also the direct object, and so on for each element of the sentence. Then come syntactic and semantic analysis, which model the sentence structure and capture the relationships between the different words.
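The labelling of each word can be illustrated with a tiny dictionary-based tagger. This is a sketch under strong assumptions: the lexicon is hand-written for the single example sentence, the tag names are made up, and real NLP systems rely on trained models rather than a lookup table.

```python
from typing import Dict, List, Tuple

# Tiny hand-written lexicon (hypothetical), covering only the example sentence.
LEXICON: Dict[str, str] = {
    "i": "PRONOUN",
    "switch on": "VERB",
    "the": "DETERMINER",
    "light": "NOUN",
}


def split_tokens(sentence: str, lexicon: Dict[str, str] = LEXICON) -> List[str]:
    """Split a sentence into tokens, keeping known two-word verbs together."""
    words = sentence.lower().split()
    tokens: List[str] = []
    i = 0
    while i < len(words):
        pair = " ".join(words[i:i + 2])
        if i + 1 < len(words) and pair in lexicon:
            tokens.append(pair)  # e.g. "switch on" stays one token
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens


def tag(sentence: str, lexicon: Dict[str, str] = LEXICON) -> List[Tuple[str, str]]:
    """Attach a label to every token; unknown tokens get the label UNKNOWN."""
    return [(tok, lexicon.get(tok, "UNKNOWN")) for tok in split_tokens(sentence, lexicon)]


if __name__ == "__main__":
    for token, label in tag("I switch on the light"):
        print(f"{token:10s} {label}")
```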

The importance of NLP lies in its ability to translate textual elements (i.e. words and sentences) into standardized orders (always in the same format) that can then be interpreted by the artificial intelligence.
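To make the idea of a standardized order concrete, here is a minimal Python sketch that turns a sentence into a dictionary with a fixed format. The format itself (`action` and `target` keys), the vocabulary and the function name `to_order` are all assumptions made for this example; the article does not describe the actual format used.

```python
from typing import Dict, Optional

# Hypothetical vocabulary: spoken phrases mapped to normalized action codes.
ACTIONS: Dict[str, str] = {"switch on": "turn_on", "switch off": "turn_off"}
DEVICES = ("heating", "light", "shutters", "television")


def to_order(sentence: str) -> Optional[Dict[str, str]]:
    """Translate a sentence into a standardized order, or None if not understood."""
    text = sentence.lower()
    words = text.split()
    # First action phrase found in the sentence, if any.
    action = next((code for phrase, code in ACTIONS.items() if phrase in text), None)
    # First known device mentioned as a whole word, if any.
    device = next((d for d in DEVICES if d in words), None)
    if action is None or device is None:
        return None
    return {"action": action, "target": device}  # always the same two keys


if __name__ == "__main__":
    print(to_order("I switch on the light"))
    print(to_order("Please switch off the television"))
```

Whatever the wording of the sentence, the result always has the same shape, which is what lets the next building block process it without caring how the user phrased the request.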


To carry out the stated order in practice, the AI is the centrepiece.

Artificial intelligences work in different ways, some more basic than others.

In the case of Vivoka, the AI developed over the past 5 years works by aggregating different elements:


  • Contexts (where is it? why? with and for whom?)
  • Information (objects, known users, current status of objects, stocks, schedules, etc.)
  • External services (access to APIs from external providers, for example to order a meal, get train schedules, search the internet, stream music, etc.)

The idea is to bring these different elements together and make links between them in order to obtain results that are relevant and effective.

Here is a (very basic) illustration of AI in home automation:

Context: home, controlling connected objects, for the users

Information: light, fridge, shutters, television (on), heating (26°)

External services: Weather, Wikipedia, SNCF
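The illustration above can be turned into a small Python sketch that brings the three kinds of elements together. Everything here is an assumption made for the example: the `Assistant` class, the order format (a dictionary with `action` and `target` keys), the `ask` action and the fake weather service are hypothetical and do not describe Vivoka's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Assistant:
    context: str                      # e.g. "home"
    devices: Dict[str, str]           # information: current status of each object
    services: Dict[str, Callable[[], str]] = field(default_factory=dict)  # external services

    def handle(self, order: Dict[str, str]) -> str:
        """Combine the order with known information and return feedback text."""
        target = order["target"]
        if order["action"] == "turn_on" and target in self.devices:
            if self.devices[target] == "on":
                return f"The {target} is already on."
            self.devices[target] = "on"
            return f"The {target} is now on."
        if order["action"] == "ask" and target in self.services:
            return self.services[target]()
        return "Sorry, I cannot do that here."


if __name__ == "__main__":
    home = Assistant(
        context="home",
        devices={"light": "off", "television": "on"},
        services={"weather": lambda: "Sunny"},  # stand-in for a real weather API
    )
    print(home.handle({"action": "turn_on", "target": "television"}))
    print(home.handle({"action": "turn_on", "target": "light"}))
    print(home.handle({"action": "ask", "target": "weather"}))
```

Because the assistant knows the television is already on, it can answer sensibly instead of blindly repeating the command, which is the benefit of linking the order to the available information.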


TTS (Text To Speech) concludes the process. It is the feedback from the AI, which can take the form of a sound, a voice or a displayed text, for example. This feedback communicates information to the user and completes the human-machine interface.

Once the cycle is complete, an individual can converse with the machine and give it orders. To summarize, the sentence is captured, then interpreted, then executed as an action that gives rise to feedback from the system (spoken or not). As the most experienced among you will have understood, this article explains a complex technology in a very simple way. The idea here is not to make you an expert in this field but to make you aware of how it works and how its parts fit together.

