
Speech recognition: How does it work with Vivoka?


Speech recognition no longer needs to prove its popularity: almost everyone has a voice assistant, sometimes without even realizing it. Progress in this area is significant and has accelerated recently. This rapid growth is also diversifying voice applications, from personal assistants to solutions designed for a specific industry.

We are now in what is called the era of cognitive technologies, which includes Augmented Reality, Virtual Reality, visual recognition and… speech recognition!

“Generation Voice” may be best placed to grasp this technology, having been born in the midst of its expansion. Yet although many people talk about it, who really knows how it works?

Let’s get to the heart of the matter: how does speech recognition work?

*A quick aside: this article introduces the most common way voice technologies work today. Other methods do, of course, exist!*


“Strength in numbers”: the components of speech recognition.


As the title suggests, speech recognition relies on several complementary technologies from the same field. We will go through each of them in chronological order, from the moment the user speaks to the moment the order is carried out.




Wake-up Word, activate the system by voice.


The first step, which starts the entire process, is called the Wake-up Word (or “trigger word”). The main purpose of this building block is to activate, by voice, the recording of the spoken order that follows. It is a matter of “waking up” the system. There are other ways to trigger recording, but in our view keeping the interaction voice-only from end to end is essential to provide a seamless experience with the voice as the sole interface.

The Wake-up Word has several inherent benefits for the design of voice assistants. Today, a major concern about speech recognition is the protection of personal data when individuals are recorded. With the arrival of the GDPR (General Data Protection Regulation), attention to privacy has grown further, now that a legal framework regulates it.

That’s why the Wake-up Word is so important. Because it frames the voice recording phase, only the sentences intended as orders are recorded and analyzed to make the voice assistant work. To learn more about the Wake-up Word, we invite you to read our article on Google’s Wake-up Word and best practices for finding an effective one!
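
To make the idea concrete, here is a minimal sketch of this gating logic. It is only an illustration under strong assumptions: a real Wake-up Word detector works on audio with a small acoustic model, whereas here each chunk of the stream is already a string, and the names `WAKE_WORD`, `END_OF_COMMAND` and `capture_commands` are hypothetical.

```python
from typing import Iterable, List

WAKE_WORD = "hey assistant"   # hypothetical trigger phrase
END_OF_COMMAND = "<silence>"  # hypothetical marker for the end of the order


def capture_commands(chunks: Iterable[str]) -> List[List[str]]:
    """Keep only the chunks spoken between a wake word and the next silence."""
    commands: List[List[str]] = []
    current: List[str] = []
    listening = False
    for chunk in chunks:
        if not listening:
            # Anything heard before the wake word is dropped, never stored.
            if chunk == WAKE_WORD:
                listening = True
                current = []
        elif chunk == END_OF_COMMAND:
            # The order is complete: store it and go back to sleep.
            commands.append(current)
            listening = False
        else:
            current.append(chunk)
    return commands


stream = ["private chat", "hey assistant", "turn on", "the light",
          "<silence>", "more private chat"]
print(capture_commands(stream))
```

In this sketch, the private conversation before and after the order never reaches the list of recorded commands, which is the privacy benefit described above.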


The STT (Speech-to-Text), pick up and transcribe the voice into text.


Once speech recognition has been triggered by the Wake-up Word, the speech itself has to be processed. To do this, it must first be recorded and digitized by the STT (Speech-to-Text). During this step, the voice is captured as sound frequencies (in the form of audio files, like music or any other sound) that can be used later.

Depending on the listening environment, background noise may or may not be present. To improve the recording of these frequencies, and therefore their reliability, several processing operations can be applied:

  • Normalization, which evens out peaks and valleys in the signal to harmonize the whole.
  • Background noise suppression, which improves audio quality.
  • Segmentation into phonemes (distinctive sound units, measured in milliseconds, that make it possible to distinguish words from each other).
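
These operations can be sketched in a few lines. This is a simplified illustration, not how a production STT engine does it: the noise suppression is a crude threshold (a "noise gate"), and because real phoneme segmentation requires an acoustic model, the sketch cuts the signal into fixed-size frames instead. All function names and values are assumptions made for the example.

```python
from typing import List


def normalize(samples: List[float]) -> List[float]:
    """Peak normalization: scale so the loudest sample has amplitude 1.0."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    return [s / peak for s in samples]


def noise_gate(samples: List[float], threshold: float = 0.05) -> List[float]:
    """Crude noise suppression: zero out samples quieter than the threshold."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]


def split_frames(samples: List[float], frame_size: int) -> List[List[float]]:
    """Cut the signal into consecutive frames (the last one may be shorter)."""
    return [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]


raw = [0.0, 0.01, 0.25, -0.5, 0.02, 0.4, -0.01, 0.1]
clean = noise_gate(normalize(raw))
frames = split_frames(clean, frame_size=4)
print(frames)
```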

The frequencies, once recorded, can be analyzed to match phonemes to words or groups of words and form a text. This step can be done in different ways, but one approach is the state of the art today: Machine Learning. A subfield of this technology is called Deep Learning: an algorithm built on an artificial neural network analyzes a large amount of information and learns associations between frequencies and words. Each example adjusts the connections of the network, which are then used to deduce new matches. Thus, the more information there is, the more statistically accurate the model becomes, and the better it can take the general context into account to assign the best word based on the others already identified. Limiting STT errors is essential for effective speech recognition!
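
The role of context can be illustrated with something much simpler than a neural network. The sketch below swaps Deep Learning for plain word-pair counts (a bigram model), purely to show how the words already recognized help choose between candidates that sound alike. The corpus, `BIGRAMS` and `pick_word` are hypothetical.

```python
from collections import Counter
from typing import List

# Tiny hypothetical training corpus; real systems learn from far more text.
CORPUS = [
    "turn on the light",
    "turn off the light",
    "the light is on",
    "good night",
    "turn on the radio",
]

# Count how often each word follows another (bigram counts).
BIGRAMS: Counter = Counter()
for sentence in CORPUS:
    words = sentence.split()
    BIGRAMS.update(zip(words, words[1:]))


def pick_word(previous: str, candidates: List[str]) -> str:
    """Choose the candidate seen most often after the previous word."""
    return max(candidates, key=lambda w: BIGRAMS[(previous, w)])


print(pick_word("the", ["night", "light"]))
print(pick_word("good", ["night", "light"]))
```

With more sentences in the corpus, the counts become more reliable, which mirrors the point above: more data gives a statistically more accurate model.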


NLP (Natural Language Processing), understanding the intention.


Once the previous steps have been completed, the text data is sent directly to the NLP (Natural Language Processing) module. The main mission of this technology is to analyze the sentence and extract as much linguistic information as possible.

To do this, it starts by splitting the sentence into words (tokenization) and associating a tag with each of them (tagging). These tags are “labels” attached to each word to characterize it. For example, in “I turn on the light”, “I” is defined as a first-person singular pronoun, “turn on” as the verb describing an action, “the” as the determiner referring to “light”, which is a common noun and also the direct object, and so on for each element of the sentence.
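
The labeling step can be sketched with a simple lookup table. This is a toy example: real taggers learn their labels from annotated text and use context to resolve ambiguous words, and the `LEXICON` contents and tag names below are assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical hand-written lexicon; real taggers learn this from annotated text.
LEXICON: Dict[str, str] = {
    "i": "PRONOUN",
    "turn": "VERB",
    "on": "PARTICLE",
    "off": "PARTICLE",
    "the": "DETERMINER",
    "light": "NOUN",
}


def tokenize(sentence: str) -> List[str]:
    """Split on whitespace and strip basic punctuation."""
    return [w.strip(".,!?") for w in sentence.split() if w.strip(".,!?")]


def tag(sentence: str) -> List[Tuple[str, str]]:
    """Attach a label to every token; unknown words get the tag 'UNKNOWN'."""
    return [(tok, LEXICON.get(tok.lower(), "UNKNOWN")) for tok in tokenize(sentence)]


print(tag("I turn on the light."))
```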

Once these first elements have been identified, meaning must be given to the orders produced by speech recognition. To that end, two additional analyses are carried out.

First, syntactic analysis models the structure of the sentence. The aim is to identify the place of each word within the whole, as well as its position relative to the others, in order to understand how the words relate.

Finally, once the nature and position of the words have been established, semantic analysis seeks to understand their meaning, both individually and when assembled in the sentence, in order to characterize an overall intention.

The importance of NLP in speech recognition lies in its ability to translate text elements (words and sentences) into standardized orders, including meaning and intent, that can be interpreted and carried out by artificial intelligence.
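
As an illustration of what a “standardized order” might look like, here is a minimal rule-based sketch. It assumes a handful of hand-written trigger phrases and device names (`INTENT_PATTERNS`, `KNOWN_DEVICES`), whereas real NLP modules rely on the syntactic and semantic analyses described above.

```python
from typing import Dict, Optional

# Hypothetical rules mapping trigger phrases to intent names.
INTENT_PATTERNS: Dict[str, str] = {
    "turn on": "turn_on",
    "switch on": "turn_on",
    "turn off": "turn_off",
    "switch off": "turn_off",
}
KNOWN_DEVICES = ("light", "heater", "radio")  # hypothetical device names


def parse_order(text: str) -> Optional[Dict[str, str]]:
    """Return a standardized order, or None when no intent is recognized."""
    lowered = text.lower()
    intent = next((name for phrase, name in INTENT_PATTERNS.items()
                   if phrase in lowered), None)
    device = next((d for d in KNOWN_DEVICES if d in lowered), None)
    if intent is None or device is None:
        return None
    return {"intent": intent, "device": device}


print(parse_order("Turn on the light"))
print(parse_order("Please switch off the radio"))
print(parse_order("What time is it?"))
```

Whatever the wording of the sentence, the output always has the same shape, which is what lets the next component act on it.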




Artificial intelligence, combined with speech recognition.


To carry out the stated order in practice, AI (Artificial Intelligence) is the key element. Artificial intelligences work in different ways, some more basic than others. The main idea is to bring together several pieces of information, for example actions to be carried out, external or internal services to be called, or databases to be consulted.

In other words, artificial intelligence is the use case itself: the concrete action that results from the voice interface. Depending on the context of use and the nature of the order, the elements requested and the results returned will differ.


For example, in a home setting, the order “Turn on the light” could be represented as follows:

1. Query: “Turn on the light”

2. Context: Room of the house, Users, Lamp status: Off

3. External services: Access to the lamp APIs (application programming interfaces)
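
Here is a minimal sketch of how these three elements could fit together. The `Lamp` class stands in for an external lamp API and is purely hypothetical, as are `handle_order` and the shape of the order and context dictionaries.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Lamp:
    """Stand-in for an external lamp API (hypothetical)."""
    is_on: bool = False

    def switch(self, on: bool) -> None:
        self.is_on = on


def handle_order(order: Dict[str, str], context: Dict[str, Lamp]) -> str:
    """Use the context to pick the lamp, call the service, return feedback text."""
    lamp = context.get(order["room"])
    if lamp is None:
        return f"I found no lamp in the {order['room']}."
    wanted = order["intent"] == "turn_on"
    if lamp.is_on == wanted:
        # The context says nothing needs to change.
        return "The light is already " + ("on." if wanted else "off.")
    lamp.switch(wanted)
    return "The light is now " + ("on." if wanted else "off.")


context = {"living room": Lamp(is_on=False)}
order = {"intent": "turn_on", "room": "living room"}
print(handle_order(order, context))
print(handle_order(order, context))
```

The second call returns a different answer because the lamp status stored in the context has changed, which shows why the context matters as much as the query.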


In a less straightforward and more complex case, “How do I dress tomorrow?”, it could look like this:

1. Query: “How do I dress tomorrow?”

2. Context: User type, User’s clothing database, Recent purchases, Location, Calendar (appointment schedule, etc.)

3. Services: Weather API, Clothing recommendation service
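
This more complex case can be sketched in the same spirit, this time combining several sources of information. Everything here is an assumption made for illustration: `weather_forecast` returns fixed values instead of calling a real weather API, and the recommendation rules are deliberately simplistic.

```python
from typing import Dict, List


def weather_forecast(city: str) -> Dict[str, float]:
    """Stand-in for a weather API call; returns fixed hypothetical values."""
    return {"temperature_c": 8.0, "rain_probability": 0.7}


def recommend_outfit(wardrobe: List[str], forecast: Dict[str, float],
                     has_meeting: bool) -> List[str]:
    """Very small rule set combining weather, calendar, and wardrobe."""
    wanted = ["suit"] if has_meeting else ["jeans"]
    if forecast["temperature_c"] < 12:
        wanted.append("coat")
    if forecast["rain_probability"] > 0.5:
        wanted.append("umbrella")
    # Only suggest items the user actually owns.
    return [item for item in wanted if item in wardrobe]


location = "Paris"  # hypothetical user location
wardrobe = ["jeans", "suit", "coat", "umbrella", "t-shirt"]
forecast = weather_forecast(location)
print(recommend_outfit(wardrobe, forecast, has_meeting=True))
```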


The TTS (Text-to-Speech), the synthetic voice.


Finally, the TTS (Text-to-Speech) concludes the process. It delivers the feedback of the AI as a synthetic voice. In the same spirit as the Wake-up Word, it closes the speech recognition cycle by responding vocally, keeping the interface conversational from beginning to end.

TTS is useful because it communicates responses to the user, the mark of a complete human-machine interface and of a well-designed user experience. It is also an important dimension of Voice Marketing, because synthesized voices can be customized, as can the sentences they speak. Branding can therefore benefit greatly from it!




Once the cycle is complete, an individual can converse with a system and give it orders to carry out. To summarize, the sentence is captured, then interpreted, then executed as an action that gives rise to feedback from the system.
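
The whole cycle can be summarized as a chain of stages, each feeding the next. In this sketch every stage is a trivial stand-in written for the example (the “audio” is already a string, and the synthetic voice is just a prefix), so it shows only how the components are ordered, not how any of them really works.

```python
from typing import Callable, List

Stage = Callable[[str], str]


def speech_to_text(audio: str) -> str:
    # Stand-in: the "audio" is already a string in this sketch.
    return audio.lower().strip()


def understand(text: str) -> str:
    # Stand-in for NLP: recognize a single hard-coded order.
    return "turn_on:light" if "turn on" in text and "light" in text else "unknown"


def execute(order: str) -> str:
    # Stand-in for the AI and its services: produce the feedback sentence.
    if order == "turn_on:light":
        return "The light is now on."
    return "Sorry, I did not understand."


def text_to_speech(reply: str) -> str:
    # Stand-in for the synthetic voice.
    return f"[synthetic voice] {reply}"


PIPELINE: List[Stage] = [speech_to_text, understand, execute, text_to_speech]


def run(audio: str) -> str:
    """Pass the input through every stage in order."""
    result = audio
    for stage in PIPELINE:
        result = stage(result)
    return result


print(run("  Turn on the light "))
```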

As the most experienced among you will have understood, this article gives a very simple explanation of a very complex technology. It cannot be exhaustive either: use cases are too diverse to cover with a single explanation.

However, we have presented the state of the art in speech recognition. These methods are the most widely used today! To find the different components of speech recognition, you can go directly to voice-market.io to discover the best!