`

Speech Recognition : How it works and what it is made of

Favicon Vivoka Author

Written by Aurélien Chapuzet

Aurélien Chapuzet is the Digital Marketing Manager at Vivoka, leading content creation and marketing strategies.

19 February 2021

Speech recognition is a proven technology. Indeed, voice interfaces and voice assistants are now more powerful than ever and are developing in many fields. This exponential and continuous growth is leading to a diversification of speech recognition applications and related technologies.

Currently, we are in an era governed by cognitive technologies where we find for instance virtual or augmented reality, visual recognition and speech recognition!

However, even if the “Voice Generation” are the most apt to conceptualize this technology because they are born in the middle of its expansion, many people talk about it, but few really know how it works and what solutions are available.

And it is for this very reason that we propose you to discover speech recognition in detail through this article. Of course, this is just the basics to understand the field of speech technologies, other articles in our blog cover some topics in more detail.

 

“Strength in numbers”: the components of speech recognition

 

For the following explanations, we assume that “speech recognition” corresponds to a complete cycle of voice use.

Speech recognition is based on the complementarity between several technologies from the same field. To present all this, we will detail each of them chronologically, from the moment the individual speaks, until the order is carried out.

It should be noted that the technologies presented below can be used independently of each other and cover a wide range of applications. We will come back to this later.

 

The wake word, activate speech recognition, with voice

 

The first step that initiates the whole process is called the wake word. The main purpose of this first technology in the cycle is to activate the user’s voice to detect the voice command he or she wishes to perform.

Here, it is literally a matter of “waking up” the system. Although there are other ways of proceeding to trigger the listening, keeping the use of the voice throughout the cycle is, in our opinion, essential. Indeed, it allows us to propose a linear experience with voice as the only interface.

The trigger keyword inherently has several interests with respect to the design of voice assistants.

In our context, one of the main fears about speech recognition is the protection of personal data related to audio recording. With the recent appearance of the GDPR (General Data Protection Regulation), this fear regarding privacy and intimacy has been further amplified, leading to the creation of a treaty to regulate it.

This is why the trigger word is so important. By conditioning the voice recording phase with this action, as long as the trigger word has not been clearly identified, nothing is recorded theoretically. Yes, theoretically, because depending on the company’s data policy, everything is relative. To prevent this, embedded (offline) speech recognition is an alternative.

Once the activation is confirmed, only the sentences carrying the intent of the action to be performed will be recorded and analyzed to ensure the use case works.

To learn more about the Wake-up Word, we invite you to read our article on Google’s Wake-up Word and the best practices to find your own!

 

 

Speech to Text (STT), identifying and transcribing voice into text

 

Once speech recognition is initiated with the trigger word, it is necessary to exploit the voice. To do this, it is first essential to record and digitize it with Speech to Text technology (also known as automatic speech recognition).

During this stage, the voice is captured in sound frequencies (in the form of audio files, like music or any other noise) that can be used later.

Depending on the listening environment, sound pollution may or may not be present. In order to improve the recording of these frequencies and therefore their reliability, different treatments can be applied.

  • Normalization to remove peaks and valleys in the frequencies in order to harmonize the whole.
  • The removal of background noise to improve audio quality.
  • The cutting of segments into phonemes (which are distinctive units within frequencies, expressed in thousandths of a second, allowing words to be distinguished from one another.

The frequencies, once recorded, can be analyzed in order to associate each phoneme with a word or a group of words to constitute a text. This step can be done in different ways, but one method in particular is the state of the art today: Machine Learning.

A sub-part of this technology is called Deep Learning: an algorithm recreating a neural network, capable of analyzing a large amount of information and building a database listing the associations between frequencies and words. Thus, each association will create a neuron that will be used to deduce new correspondences.

Therefore, the more information there is, the more precise the model is statistically speaking and capable of taking into account the general context to assign the best word according to the others already defined.

Limiting STT errors is essential to obtain the most reliable information to proceed with the next steps.

 

 

NLP (Natural Language Processing), translating human language into machine language

 

Once the previous steps have been completed, the textual data is sent directly to the NLP (Natural Language Processing) module. The main purpose of this technology is to analyze the sentence and extract a maximum of linguistic data.

To do this, it starts by associating tags to the words of the sentence, this is called tokenization. These are actually “tags” that are applied to each word in order to characterize it. For example, “Open” will be defined as the verb defining an action, “the” as the determinant referring to “Voice Development Kit” which is a proper noun but also a COD etc… and this for each element of the sentence.

Once these first elements have been identified, it is necessary to give meaning to the orders resulting from the speech recognition. This is why two complementary analyses are performed.

First of all, syntactic analysis aims to model the structure of the sentence. It is a question here of identifying the place of the words within the whole but also their relative position compared to the others in order to understand their relations.

To complete and finish, the semantic analysis which, once the nature and the position of the words are found, will try to understand their meaning individually but also when they are assembled in the sentence in order to translate a general intention of the user.

The importance of NLP in speech recognition lies in its ability to translate textual elements (i.e. words and sentences) into normalized orders, including meaning and intent, that can be interpreted by the associated artificial intelligence and carried out.

 

Artificial intelligence, a necessary ally of speech recognition

 

First of all, artificial intelligence, although integrated in the process of the previous technologies, is not always essential to achieve the use cases. Indeed, if we are talking about connected technologies (i.e. Cloud), AI will be useful. Especially since the complexity of some use cases, especially the information to be correlated to produce them, makes it mandatory.

For example, it is sometimes necessary to compare several pieces of information with actions to be carried out, integrations of external or internal services or databases to be consulted.

In other words, artificial intelligence is the use case itself, the concrete action that will result from the voice interface. Depending on the context of use and the nature of the order, the elements requested and the results given will be different.

Let’s take a case in point. Vivoka has created a connected motorcycle helmet that allows to use functionalities with the voice.  Different uses are available, such as using GPS or music.

The request “Take me to a gas station on the way” will return a normalized command to the artificial intelligence with the user’s intention:

  1. Context: Vehicle fuel type, Price preference (affects distance travelled)
  2. External services: Call the API of the GPS solution provider
  3. Action to be performed: Keep the current route, add a step on the route

Here, the intelligence used by our system will submit information and a request to an external service that has a specialized intelligence to send us back the result to operate on the user.

AI is therefore a key component in many situations. However, for embedded functionalities (i.e. offline), the needs are less, being closer to the realization of simple commands such as navigation on an interface or the reporting of actions. It is a question here of having specific use cases that do not require consulting multiple information.

 

TTS (Text to Speech), voice to answer and inform the user

 

Finally, the TTS (Text-to-Speech) concludes the process. It corresponds to the feedback of the system which is expressed through a synthetic voice. In the same spirit as the Wake-up Word, it closes the speech recognition by answering vocally in order to keep the homogeneity of the conversational interface.

The voice synthesis is built from human voices and sounds diversified according to language, gender, age or mood. Synthetic voices are thus generated in real time to dictate words or sentences through a phonetic assembly.

This speech recognition technology is useful for communicating information to the user, a symbol of a complete human-machine interface and also of a well-designed user experience.

Similarly, it represents an important dimension of Voice Marketing because the synthesized voices can be customized to match the image of the brands that use it.

 

 

The different speech recognition solutions

 

The speech recognition market is a fast-moving environment. As use cases are constantly being born and reinvented with technological progress, the adoption of speech solutions is driving innovation and attracting many players.

Today on the market, we can count major categories of uses related to speech recognition. Among them, we can mention :

 

Voice assistants

 

We find the GAFAs and their multi-support virtual assistants (smart speaker, telephone, etc.) but also initiatives from other companies. The personalization of voice assistants is a trend on the fringe of the market domination by GAFA, where brands want to regain their technical governance.

For example, KSH and its connected motorcycle helmet are among those players with specific needs, both marketing and functional.

 

Professional voice interfaces

 

We are talking about productivity tools for employees. One of the fastest growing sectors is the supply chain with the pick-by-voice. This is a voice device that allows operators to use speech recognition to work more efficiently and safely (hands-free, concentration…). The voice commands are similar to reports of actions and confirmations of operations performed.

There are many possibilities for companies to gain in productivity. Some use cases already exist and others will be created.

 

Speech recognition software

 

Voice dictation, for example, is a tool that is already used by thousands of individuals, personally or professionally (DS Avocats for instance). It allows you to dictate text (whether emails or reports) at a rate of 180 words per minute, whereas manual input is on average 60 words per minute. The tool brings productivity and comfort to document creation through a voice transcription engine adapted to dictation.

 

 

Connected objects (Internet of Things IoT)

 

The IoT world is also fond of voice innovations. This often concerns navigation or device use functionalities. Whether it is home automation equipment or more specialized products such as connected mirrors, speech recognition promises great prospects.

 

As the more experienced among you will have understood, this article explains in a succinct and introductory way a complex technology and uses. Similarly, the tools we have presented are a specific design of speech technologies, not the norm, although they are the most common designs.

To learn more about speech recognition and its capabilities, we recommend you browse our blog for more information or contact us directly to discuss the matter!