Speech synthesis (TTS): why is it so important?

Today, let’s zoom in on one of the building blocks of voice technology. Widely used in the voice assistants we are all familiar with, text-to-speech (abbreviated TTS) is not a technology that analyzes speech; it produces it. Synthetic voices are generally the final stage of the processing chain and are becoming increasingly popular. Why? Because they play an important part in the overall “voice” experience. Here is why.


What is TTS? How does it work?


Speech synthesis (TTS) is the artificial production of human speech. Its main use, and the reason it was created, is to convert written text into spoken words automatically. How does it work?


Speech recognition systems start from phonemes (the smallest units of sound) to break up sentences. TTS works the other way around and starts from graphemes: the letters and groups of letters that transcribe a phoneme. In other words, the raw material is not sound but text. Synthesis usually happens in two steps. The first splits the text into sentences and words, breaks the words down into graphemes, and assigns a phonetic transcription (the pronunciation) to each group. Once these text/phonetic pairs have been identified, the second step converts the linguistic representation into sound, producing a voice that reads the text aloud.
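The two steps can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions: the grapheme table `G2P`, the phoneme labels and the tone frequencies are all invented for the example, and the "synthesis" step simply emits one sine tone per phoneme where a real engine would generate a voice.

```python
import math
import re

# Hypothetical toy table mapping graphemes to phoneme labels (far from complete).
G2P = {"sh": "SH", "ee": "IY", "s": "S", "e": "EH", "t": "T", "i": "IH"}

SAMPLE_RATE = 8000          # samples per second (assumed value)
SAMPLES_PER_PHONEME = 800   # 0.1 s of audio per phoneme (assumed value)

# Assumed stand-in for a voice: each phoneme gets its own tone frequency.
TONE_HZ = {p: 200 + 50 * k for k, p in enumerate(sorted(set(G2P.values())))}


def to_graphemes(word):
    """Split a word into known graphemes, trying two letters before one."""
    word = word.lower()
    graphemes, i = [], 0
    while i < len(word):
        for size in (2, 1):
            chunk = word[i:i + size]
            if chunk in G2P:
                graphemes.append(chunk)
                i += len(chunk)
                break
        else:
            i += 1  # skip letters the toy table does not cover
    return graphemes


def text_to_phonemes(text):
    """Step 1: sentences -> words -> graphemes -> phoneme labels."""
    phonemes = []
    for sentence in re.split(r"[.!?]+", text):
        for word in re.findall(r"[A-Za-z]+", sentence):
            phonemes.extend(G2P[g] for g in to_graphemes(word))
    return phonemes


def phonemes_to_samples(phonemes):
    """Step 2: turn each phoneme label into audio samples (a plain sine tone)."""
    samples = []
    for phoneme in phonemes:
        freq = TONE_HZ[phoneme]
        samples.extend(
            math.sin(2 * math.pi * freq * n / SAMPLE_RATE)
            for n in range(SAMPLES_PER_PHONEME)
        )
    return samples


phonemes = text_to_phonemes("She sees.")
audio = phonemes_to_samples(phonemes)
```

The point of the sketch is the separation of concerns: the first function only deals with text and pronunciation, the second only with sound, so either half could be swapped for something more realistic without touching the other.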


A word of caution! TTS should not be confused with voice response systems, such as those generally used in public transport. Those rely on a database containing a large number of voice messages recorded by one or more speakers. The messages are limited and contextual, and are played at key moments, such as a stop or a connection. This is much simpler than TTS, which actually synthesizes a voice for any text it is given. That does not mean TTS is never used in the transport sector!
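To make the difference concrete, here is a minimal sketch of how such a voice response system might look, assuming a hypothetical catalogue of clips (the event names and file paths are invented for illustration). It can only play what was recorded in advance.

```python
# Hypothetical catalogue of pre-recorded clips, keyed by (event, subject).
RECORDED_ANNOUNCEMENTS = {
    ("stop", "Central Station"): "clips/stop_central_station.wav",
    ("connection", "Line 2"): "clips/connection_line_2.wav",
}


def announce(event, subject):
    """Return the clip to play, or None when nothing was recorded for this case."""
    return RECORDED_ANNOUNCEMENTS.get((event, subject))


known = announce("stop", "Central Station")   # a recording exists
unknown = announce("stop", "New Harbor")      # no recording: the system stays silent
```

A TTS engine has no such limit, since it builds the audio from the text itself rather than looking it up.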


What is it used for?


TTS is found in a multitude of applications. It is worth knowing, however, that the technology was originally designed to help people with disabilities, especially the visually impaired, in their daily lives. The famous physicist Stephen Hawking, for example, used a speech synthesizer because of his severe disability to communicate with the people around him (you can try it directly on this link).


Since then, many use cases have emerged, some closer than others to the original purpose of TTS. In transport, as mentioned above, the technology is used to generate voice messages for all passengers, whether or not they have a disability. Traces of TTS are easy to find in everyday life. Another example is translation engines, which use it to play the pronunciation of the translated text alongside the written translation.


TTS is now widely used and growing more popular all the time! Have you guessed where? Here is a hint: Google, Amazon, Apple and Vivoka all have one!


Why is this technology essential for voice?


This brings us to the most telling use of TTS: voice assistants. Who today has never heard the voice of Siri, Alexa or the Google Assistant? True ambassadors of “voice”, these assistants all come with speech synthesis so that they can respond to the user. This is far from trivial! The aim is to strengthen the relationship between human and machine through a conversational link that works both ways. The user talks to the assistant and the assistant answers, as in a natural conversation between humans. This component matters more than we might imagine.


As with any innovation, adoption is generally a complex process, especially when the innovation disrupts established habits. The best way to gain acceptance for voice assistants was to offer new features that encourage their use, and also to improve the user experience by humanizing the technology. Synthesized voices gave each assistant an identity, making it possible to tell them apart and to regard them as entities in their own right.


Beyond being a simple feature (warning: the marketing lesson is about to begin), voices are now an integral part of brand image! Some even consider voice a pillar of branding in the making. For one thing, it replaces images, which are static by nature (and over-represented in the media), with more engaging messages: if a picture is worth a thousand words, is a voice worth a thousand pictures?


The other attraction for brands is the reach of voice assistants: the pool of assistants in use is already large and still growing. So isn’t it a good idea to enter the voice space with a voice of your own and reach such a large audience?


