
Voice recognition: how does it work with Vivoka?



Like many people in recent years, you have probably witnessed the rapid spread of personal assistants. Seeing a friend or co-worker give strange orders or ask funny questions to their phone has never been as normal as it is today. As you may have noticed, we are in the era of cognitive technologies, and voice recognition is one of them.

Many people talk about it, but who really knows how it works? A hint: you, just after reading this article.

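To understand this process, it helps to know what it is made of. Five technological blocks make up the voice recognition process: the wake-up word, speech-to-text (STT), natural language processing (NLP), artificial intelligence (AI) and text-to-speech (TTS).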


The origin of voice recognition is the human voice.


The first step, which initiates the whole process, is commonly called the wake-up word (or hot word). It is not necessarily a voice command: it can also be a push button or another interaction between the user and the machine. Its main purpose is to activate speech-to-text (STT, explained further on), i.e. to “wake up” the system so that it can start recording. This matters all the more in the current context, where many people are wary of technology for fear of having their privacy intruded upon. Until the user has performed the action or spoken the required words, speech recognition stays on standby and records nothing.
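To make the idea concrete, here is a minimal sketch of such a gate. It is an illustration only, not Vivoka's implementation: the `WakeGate` class and the phrase "hello vivoka" are hypothetical, and plain strings stand in for audio.

```python
from dataclasses import dataclass, field


@dataclass
class WakeGate:
    """Hypothetical gate: nothing is recorded until the system is woken up."""
    wake_word: str = "hello vivoka"   # assumed phrase, for illustration only
    awake: bool = False
    recorded: list = field(default_factory=list)

    def on_button_press(self) -> None:
        # A push button is a valid trigger too, not only a spoken phrase.
        self.awake = True

    def on_audio(self, heard: str) -> None:
        if not self.awake:
            # Standby: only check for the wake-up word, keep nothing.
            if self.wake_word in heard.lower():
                self.awake = True
            return
        # Awake: the utterance is kept for the speech-to-text step.
        self.recorded.append(heard)


gate = WakeGate()
gate.on_audio("what a nice day")      # ignored, the system is on standby
gate.on_audio("Hello Vivoka")         # wakes the system up
gate.on_audio("turn on the light")    # recorded
print(gate.recorded)                  # ['turn on the light']
```

The point of the design is that the standby branch returns without storing anything, which is what protects the user's privacy.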


Once the system is awake, the user’s speech has to be captured. The STT (Speech To Text) first records and digitizes it so that it can be recognized. During this step, the voice is recorded as sound frequencies (audio files, like music), which the system can interpret once they have been transcribed into text. To make these frequencies easier to interpret, several processing operations are carried out:


  • Normalization, which evens out the peaks and troughs in the different frequencies in order to harmonize them.
  • Background noise suppression, to improve audio quality.
  • Segmentation into phonemes (distinctive sound units, measured in thousandths of a second, that allow words to be distinguished from each other).
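The operations above can be sketched in a few lines of plain Python. This is a deliberately crude illustration under stated assumptions: the function names, the threshold and the frame length are all made up, the noise gate is far simpler than real noise suppression, and fixed-length frames are only a precursor to phonemes, not phonemes themselves.

```python
def normalize(samples, peak=1.0):
    """Scale the signal so that its loudest sample reaches `peak`."""
    loudest = max((abs(s) for s in samples), default=0.0)
    if loudest == 0.0:
        return list(samples)
    return [s * peak / loudest for s in samples]


def noise_gate(samples, threshold=0.05):
    """Crude noise suppression: silence samples quieter than `threshold`."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]


def split_frames(samples, sample_rate, frame_ms=25):
    """Cut the signal into fixed-length frames of `frame_ms` milliseconds."""
    size = max(1, sample_rate * frame_ms // 1000)
    return [samples[i:i + size] for i in range(0, len(samples), size)]


# Eight made-up samples; the tiny sample rate keeps the output readable.
raw = [0.01, 0.25, -0.5, 0.01, 0.125, -0.01, 0.375, 0.0]
clean = noise_gate(normalize(raw))
frames = split_frames(clean, sample_rate=160, frame_ms=25)
print(clean)    # [0.0, 0.5, -1.0, 0.0, 0.25, 0.0, 0.75, 0.0]
print(frames)   # two frames of four samples each
```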


Frequencies can be analyzed by a previously trained neural network (Deep Learning): an algorithm capable of analyzing a large amount of information and building a “database” of associations between frequencies and words. This makes it possible, notably through statistical analysis, to match a frequency pattern to the word most often associated with it, which is therefore, in theory, the most likely to be correct.
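The statistical idea can be illustrated with a simple counting table. This is not a neural network, only a toy stand-in for the “database” of associations; the pattern names and training pairs are invented for the example.

```python
from collections import Counter, defaultdict

# Toy "database": how often each acoustic pattern was paired with each word
# in made-up training data. Pattern names are placeholders, not real features.
training_pairs = [
    ("pattern_A", "light"), ("pattern_A", "light"), ("pattern_A", "night"),
    ("pattern_B", "turn"), ("pattern_B", "turn"), ("pattern_B", "burn"),
]

associations = defaultdict(Counter)
for pattern, word in training_pairs:
    associations[pattern][word] += 1


def most_likely_word(pattern):
    """Return the word most often associated with `pattern`, with its share."""
    counts = associations.get(pattern)
    if not counts:
        return None, 0.0
    word, hits = counts.most_common(1)[0]
    return word, hits / sum(counts.values())


word, share = most_likely_word("pattern_A")
print(word, round(share, 2))   # light 0.67
```

A trained network generalizes to patterns it has never seen, which a lookup table like this cannot do; the sketch only shows why “most common” is treated as “most likely”.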


Identify and understand the user’s intention. 


Once voice recognition is done, and the transcription and the various processing operations have been performed, the data are sent directly to the NLP (Natural Language Processing) system. The main mission of this technology is to analyze the sentence and extract its meaning. To do this, it starts by splitting the sentence into words (tokenization) and associating a tag with each of them (part-of-speech tagging). These tags are “labels” affixed to each word in order to characterize it. For example, in “I turn on the light”, “I” will be defined as a first-person singular pronoun, “turn on” as the verb defining an action, “the” as the determiner referring to “light”, which is a common noun and also the object of the verb, and so on for each element of the sentence. Then comes syntactic and semantic analysis, which models the sentence structure and the relationships between the different words.
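Here is a minimal sketch of the labelling step on the example sentence. It assumes a tiny hand-written lexicon (the `LEXICON` dictionary and its labels are invented for this article); real taggers are trained on large annotated corpora.

```python
# Tiny hand-written lexicon; a real tagger is trained on annotated text.
LEXICON = {
    "i": "pronoun (1st person singular)",
    "turn on": "verb (action)",
    "the": "determiner",
    "light": "common noun (object)",
}


def tokenize(sentence):
    """Split into lowercase tokens, keeping known two-word verbs together."""
    words = sentence.lower().strip(".!?").split()
    tokens, i = [], 0
    while i < len(words):
        pair = " ".join(words[i:i + 2])
        if i + 1 < len(words) and pair in LEXICON:
            tokens.append(pair)
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens


def tag(sentence):
    """Attach a label to every token; unknown words get 'unknown'."""
    return [(tok, LEXICON.get(tok, "unknown")) for tok in tokenize(sentence)]


for token, label in tag("I turn on the light."):
    print(f"{token:8} -> {label}")
```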


The importance of NLP lies in its ability to translate textual elements (i.e. words and sentences) into standardized orders (always in the same format) that can be interpreted by artificial intelligence.
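As an illustration of what “always in the same format” can mean, the sketch below maps two different phrasings to one identical order. The `Order` format, the action codes and the word lists are assumptions made for the example, not Vivoka's actual format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Order:
    """Assumed standard format: every request becomes (action, target)."""
    action: str
    target: str


# Different phrasings mapped to one canonical value (illustrative lists).
ACTIONS = {"turn on": "TURN_ON", "switch on": "TURN_ON",
           "turn off": "TURN_OFF", "switch off": "TURN_OFF"}
TARGETS = {"light": "lamp", "lamp": "lamp",
           "tv": "television", "television": "television"}


def to_order(text: str) -> Optional[Order]:
    """Return a standardized order, or None if the sentence is not understood."""
    sentence = text.lower()
    words = sentence.split()
    action = next((code for phrase, code in ACTIONS.items() if phrase in sentence), None)
    target = next((name for word, name in TARGETS.items() if word in words), None)
    if action is None or target is None:
        return None
    return Order(action, target)


print(to_order("I turn on the light"))        # Order(action='TURN_ON', target='lamp')
print(to_order("Please switch on the lamp"))  # Order(action='TURN_ON', target='lamp')
```

Whatever the wording, the next stage only ever has to handle one shape of data.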



Artificial intelligence, combined with voice recognition.


In a full voice recognition process, AI is the centerpiece: it is what produces the result of the stated request. Artificial intelligences work in different ways, some more basic than others. In the case of Vivoka, the AI developed over the past 5 years works by aggregating different elements:


  • Contexts (where is it? why? with and for whom?)
  • Information (objects, known users, current status of objects, stocks, schedules, etc.)
  • External services (access to APIs from external providers, for example to order a meal, check train schedules, search the internet, stream music, etc.)


The idea is to bring these different elements together and make links between them in order to obtain results that are relevant and effective. Here is a (very basic) illustration of AI in home automation:


Context: Home, control of connected objects, for users

Information: Lamp, Refrigerator, Shutters, Television (on), Heating (26°)

External services: Weather, Wikipedia, SNCF (French rail company)
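The illustration above can be turned into a small sketch of how the three elements might be combined. Everything here is an assumption made for the example: the `handle` function, the action codes, the "off" and "open" states, and the weather stand-in, which returns canned text instead of calling a real API.

```python
# State taken from the illustration; the "off"/"open" values and the
# weather function are invented for the sake of the example.
context = {"place": "home", "purpose": "control connected objects"}
information = {"lamp": "off", "refrigerator": "on", "shutters": "open",
               "television": "on", "heating": "26°"}


def fake_weather(city):
    """Stand-in for an external weather API; returns canned text."""
    return f"No live data for {city} (placeholder answer)"


services = {"weather": fake_weather}


def handle(action, target, argument=None):
    """Combine context, known objects and services to answer an order."""
    if action == "ASK" and target in services:
        return services[target](argument)
    if target not in information:
        return f"No object called '{target}' at {context['place']}"
    wanted = {"TURN_ON": "on", "TURN_OFF": "off"}.get(action)
    if wanted is None:
        return f"Unsupported action '{action}'"
    if information[target] == wanted:
        return f"The {target} is already {wanted}"
    information[target] = wanted
    return f"The {target} is now {wanted}"


print(handle("TURN_ON", "television"))    # The television is already on
print(handle("TURN_ON", "lamp"))          # The lamp is now on
print(handle("ASK", "weather", "Paris"))  # placeholder answer
```

Knowing the current status of each object is what lets the system answer “already on” instead of blindly repeating the action.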


Finally, the TTS (Text To Speech) concludes the process. It corresponds to the feedback that the AI gives the user via a synthesized voice. This voice communicates information back to the user and is the hallmark of a complete human-machine interface.

Once the cycle is complete, an individual can converse with the machine and give it orders. To summarize: the sentence is captured, then interpreted, then executed as an action that gives rise to feedback from the system (spoken or not).
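The whole cycle can be sketched end to end with one stand-in function per block. This is a toy under stated assumptions: every function is a placeholder written for this article, the wake-up phrase "hello vivoka" is hypothetical, and text strings play the role of audio.

```python
def speech_to_text(audio: str) -> str:
    # Stand-in for STT: the "audio" is already text in this sketch.
    return audio.lower().strip()


def understand(text: str):
    # Stand-in for NLP: returns a standardized (action, target) order.
    if "turn on" in text and "light" in text:
        return ("TURN_ON", "lamp")
    return None


def decide(order, state: dict) -> str:
    # Stand-in for the AI: applies the order to the known objects.
    if order is None:
        return "Sorry, I did not understand"
    action, target = order
    state[target] = "on" if action == "TURN_ON" else "off"
    return f"The {target} is now {state[target]}"


def text_to_speech(message: str) -> str:
    # Stand-in for TTS: a real system would synthesize audio here.
    return f"[voice] {message}"


def assistant_cycle(heard: str, state: dict, wake_word: str = "hello vivoka"):
    text = speech_to_text(heard)
    if not text.startswith(wake_word):
        return None                       # standby: nothing is processed
    request = text[len(wake_word):].strip(" ,")
    return text_to_speech(decide(understand(request), state))


home = {"lamp": "off"}
print(assistant_cycle("Hello Vivoka, turn on the light", home))
# [voice] The lamp is now on
```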


As the most experienced among you will have understood, this article explains a complex technology in a very simple way. The idea here is not to make you experts in this field but to give you a sense of how it works and how its parts fit together.