Speech synthesis (TTS): how to use it and why is it so important?

Speech synthesis (also known as TTS, Text-to-Speech), unlike speech recognition, is not a technology that exploits the voice: it produces it. Synthetic voices are generally the final phase of the “voice assistant process” and are becoming increasingly popular, from YouTubers and Twitch streamers to what we support at Vivoka: complex enterprise projects.

Why are they attracting so much interest? Because they are a key component of the voice user experience (VUX). We’ll show you how.


What is speech synthesis? How does it work?


Speech synthesis (TTS) is defined as the artificial production of human voices. Its main use (and what prompted its creation) is the ability to convert text into speech automatically. How does it work?

Unlike speech recognition systems, which start from phonemes (the smallest units of sound) to segment sentences, TTS is based on what are known as graphemes: the letters and groups of letters that transcribe a phoneme. This means that the basic resource is not the sound but the text. The conversion is usually done in two steps.

The first step cuts the text into sentences and words (made up of our famous graphemes) and assigns a phonetic transcription, the pronunciation, to each of these groups. Once the different text/phonetic groups have been identified, the second step converts these linguistic representations into sound. In other words, the system reads these indications to produce a voice that speaks the information.
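As a rough illustration of the first step, the sketch below splits a text into sentences and words, then looks up a phonetic transcription for each word. The tiny `LEXICON` dictionary, its ARPAbet-style symbols and all function names are assumptions made for this example. Real engines rely on much larger pronunciation dictionaries and trained grapheme-to-phoneme models, and the second step (turning phonemes into audio) is left out entirely.

```python
import re

# Hypothetical toy pronunciation lexicon (ARPAbet-like symbols), for illustration only.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
    "good": ["G", "UH", "D"],
    "morning": ["M", "AO", "R", "N", "IH", "NG"],
}


def split_sentences(text):
    """Cut the text into sentences after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def split_words(sentence):
    """Lowercase a sentence and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", sentence.lower())


def to_phonemes(word):
    """Look the word up; unknown words are spelled letter by letter (a crude placeholder)."""
    return LEXICON.get(word, [ch.upper() for ch in word if ch.isalpha()])


def front_end(text):
    """Step 1: text -> list of sentences, each a list of (word, phonemes) pairs."""
    return [
        [(word, to_phonemes(word)) for word in split_words(sentence)]
        for sentence in split_sentences(text)
    ]


if __name__ == "__main__":
    for sentence in front_end("Hello world! Good morning."):
        print(sentence)
```

The output of `front_end` is the kind of text/phonetic grouping that a second stage, the actual sound generation, would consume.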


This is what voice synthesis can sound like: play the following SoundCloud file to hear the paragraph below read aloud.


[ Attention! Speech synthesis should not be confused with voice response systems, which are commonly used in public transport, for example. A voice response system relies on a database containing a large number of voice messages recorded by one or more operators. These messages, which are very limited and contextual, are played at key moments, for example at a stop or a connection. This operation is therefore much simpler than TTS, which actually synthesizes a voice for each text provided. This does not mean that TTS is not used in the transport sector! ]
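To make the distinction concrete, here is a minimal sketch contrasting the two approaches. The clip catalogue, file paths and function names are hypothetical, and `tts_request` only builds a request description: it stands in for a real engine and produces no audio.

```python
# Hypothetical catalogue of pre-recorded announcements (message key -> audio file).
RECORDED_CLIPS = {
    "next_stop_central": "clips/next_stop_central.wav",
    "mind_the_gap": "clips/mind_the_gap.wav",
}


def voice_response(message_key):
    """A voice response system can only play what was recorded in advance."""
    if message_key not in RECORDED_CLIPS:
        raise KeyError(f"no recording available for {message_key!r}")
    return RECORDED_CLIPS[message_key]


def tts_request(text):
    """A TTS engine accepts arbitrary text (placeholder: returns the request only)."""
    return {"text": text, "characters": len(text)}


if __name__ == "__main__":
    print(voice_response("mind_the_gap"))
    print(tts_request("Line 4 is delayed by 12 minutes."))
```

A message such as an unexpected delay has no entry in the catalogue, so the lookup fails, whereas the TTS path accepts any sentence.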


Speech synthesis can serve a wide range of purposes. For instance, we created a TTS voice with our Voice Development Kit to narrate its demo video. Take a look at it!



What uses does it have?


Speech synthesis can be found in a multitude of applications. However, it is important to know that this technology was originally designed to help people with disabilities (particularly the visually impaired) in their daily lives. For example, the famous physicist Stephen Hawking, because of his severe disability, used a TTS system to communicate with the people around him (you can try it directly at this link).

Since then, many use cases have been developed, some closer than others to the original purpose of TTS. For example, as mentioned above in the context of transport, this technology is used to generate voices that deliver messages to all passengers, whether or not they are disabled. Today it is very easy to find traces of TTS in our everyday uses. Another example can be found in language translation engines, which use this technology to suggest the pronunciation of the translated text and thus complement the written translation.

Another sector that integrates speech synthesis, in embedded systems or cloud applications, and continues to revolutionize uses is the broad field of IoT. Indeed, in a rapidly expanding market, smart devices are increasingly equipped with TTS, on the one hand to improve the user experience and on the other hand to improve the accessibility and intelligence of their interfaces. A strong example that continues to progress is that of household appliances: consumer products and robots are increasingly being equipped with a voice.



How to choose and integrate speech synthesis?


To choose the right speech synthesis (text-to-speech) solution, it is essential to take several criteria into account: the language spoken, the type of speaker, the quality of the voice and the supplier. With this information, it is easier to select a solution that meets your needs and constraints. Indeed, not all companies offering TTS have equivalent ranges, so it is very important to vet these partners carefully before you start. The language and the type of voice are also important criteria for the user experience: there must be consistency between the voice interface and what it should inspire.
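One way to apply these criteria is to treat them as filters over a catalogue of candidate voices. The sketch below is only an illustration: the `Voice` fields, the 1 to 5 quality score, the voice names and the supplier names are all invented for the example and do not describe any real offering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Voice:
    name: str
    language: str   # e.g. "en-US"
    speaker: str    # e.g. "female", "male", "child"
    quality: int    # subjective score from 1 (robotic) to 5 (natural)
    supplier: str


# Hypothetical catalogue, for illustration only.
CATALOGUE = [
    Voice("Ava", "en-US", "female", 5, "SupplierA"),
    Voice("Tom", "en-US", "male", 3, "SupplierB"),
    Voice("Lea", "fr-FR", "female", 4, "SupplierA"),
]


def shortlist(voices, language, speaker=None, min_quality=1, suppliers=None):
    """Keep the voices matching the language and any optional criteria given."""
    return [
        v for v in voices
        if v.language == language
        and (speaker is None or v.speaker == speaker)
        and v.quality >= min_quality
        and (suppliers is None or v.supplier in suppliers)
    ]


if __name__ == "__main__":
    for voice in shortlist(CATALOGUE, "en-US", min_quality=4):
        print(voice.name, voice.supplier)
```

In practice the quality score would come from listening tests with your own content rather than from a fixed number.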

On the integration side, speech synthesis technologies can also be deployed in the cloud, embedded (also known as “on-premise”) or in a hybrid of the two. Keep in mind that embedded solutions have technical limits, in terms of storage, that the cloud does not have; on the other hand, an embedded voice works no matter what happens, whereas the cloud needs a connection. These parameters should be weighed according to the nature of your project: in transport, for example, embedded is recommended to ensure continuous service.
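A hybrid setup can be pictured as "try the cloud, fall back to the embedded voice". The following sketch assumes two stand-in functions, `cloud_tts` and `embedded_tts`, which are hypothetical placeholders returning labelled strings instead of audio. The `online` flag simulates connectivity and is not part of any real API.

```python
def cloud_tts(text, online):
    """Stand-in for a cloud engine: fails when there is no connection."""
    if not online:
        raise ConnectionError("cloud TTS unreachable")
    return f"[cloud audio] {text}"


def embedded_tts(text):
    """Stand-in for an on-device engine: always available, no network needed."""
    return f"[embedded audio] {text}"


def synthesize(text, online=True):
    """Hybrid strategy: prefer the cloud, fall back to embedded when offline."""
    try:
        return cloud_tts(text, online)
    except ConnectionError:
        return embedded_tts(text)


if __name__ == "__main__":
    print(synthesize("Next stop: Central Station."))
    print(synthesize("Next stop: Central Station.", online=False))
```

For a service that must never go silent, such as transport announcements, the embedded path is the one that guarantees continuity.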

If you are looking for an embedded speech synthesis solution, we suggest you visit the page for the Voice Development Kit, our software development kit that gives you access to offline voice synthesis that can be easily configured and integrated.


Why is text-to-speech essential for voice?


Who today has never heard the voices of Siri, Alexa or the Google Assistant? True ambassadors of “voice”, these assistants have all been natively equipped with voice synthesis in order to respond to the user. This is not insignificant! It is precisely a matter of strengthening the relationship between human and machine through a conversational, reciprocal link. The user talks to the assistant and the assistant answers, as in a natural conversation between two or more humans. This component is more important than we imagine.

In fact, as with any innovation, the adoption process is generally complex, especially when it breaks with existing habits. The best way to gain acceptance for voice assistants was to offer new features that encourage their use, but also to improve the user experience as much as possible by humanising the technology. These synthesised voices gave each assistant an identity, making it possible to tell them apart and also to consider them as entities in their own right.

Beyond a simple feature (the marketing course is about to begin), voices are now an integral part of the brand image! Some people even consider voice to be an emerging pillar of branding. First of all, it replaces images, which by nature are static (and over-represented in the media), with more engaging messages: if a picture is worth a thousand words, is a voice worth a thousand pictures?

The other attraction for brands is that the pool of voice assistants is already large and keeps growing, so isn’t it a good idea to enter the voice space with your own voice to reach such a large audience?