Speech-to-text (or automatic speech recognition, ASR) and voice technologies in general have become indispensable features in upcoming products and services, and existing ones benefit from being updated with this new ability. Whether you already own a smart assistant or not, you probably know that it is on its way to becoming a must-have. Indeed, the adoption rate for smart speakers like Google Home or Alexa has boomed since 2018, making them one of the fastest-adopted technologies in history. Soon everybody will own one and each company will have its own branded assistant! In this post, we’ll “speak” about the technology that enables these voice assistants (and a ton of other things) to understand what we want them to do: speech-to-text.
Understanding Speech-to-text’s expansion
Speech-to-text is mandatory when you want to create almost anything voice-enabled. It is the step that allows a device to identify speech in a voice recording or audio file and transcribe it into text a machine can process. You may remember our blog post about speech recognition? It covered all the technologies needed to make it work; if you haven’t read it yet, you’ll find a good explanation there of what STT is and how it works!
Speech-to-text: mandatory in a voicefirst society
Do you ever speak to your pet? (Of course you do if you have one.) I bet you also speak to your inanimate objects, like your robot vacuum or your TV? See, speaking comes naturally to most human beings: it is our main medium of expression, and one of the first things you learn as a baby. It is more natural to speak your intent than to push buttons, and that is why people started building voice into their devices. Moreover, it is far easier for people with certain disabilities. The world is (finally) moving in a more accessible direction, and voice technology is helping it do so!
Did you say voicefirst?
VoiceFirst is a term used to refer to the increasing use of, and reliance on, voice-controlled technologies. Because speaking is more natural, they are adopted easily and quickly. Everything has become easier since the democratisation of personal computers, and even more so with smartphones and applications. In fact, the next generations will be less and less used to personal computers, as their usage declines in favour of smartphones. As we said just before, speaking is natural, so what is easier than that? Tech companies are leading the way to a voicefirst world. As an example, we recently read that Apple may be trying to train users to become comfortable speaking to Siri anywhere, at any time, through the absence of physical controls on its AirPods.
Why speech-to-text is essential
The majority of companies have their own voice assistant, or are looking to develop one. Indeed, voice transforms a basic experience into an interactive one. In a world where the experience matters as much as the product you buy, that can make a big difference. And what would a voice assistant (or any voice-enabled system) be without speech-to-text? You get it: nothing, because it is the basis of “understanding” for the machine. Thus, as voicefirst emerges, so does speech-to-text.
Speech-to-text: A complex technology to harness
Still, speech-to-text faces some difficulties. The technology isn’t perfect, and not everybody is ready to step into a voicefirst society. Let’s have a look at its limits.
People are not 100% used to it
Even if you are sufficiently at ease to speak to your pet or your devices alone at home, you may not be when you are out in public. Which I understand, even though I work for a voicetech company. In fact, few people are comfortable yet with the idea of speaking to non-human devices when others are around. Do you still find it strange to walk by a person on a call wearing only earbuds? Well, this is just a barrier to overcome, as many others have been before. With use, voice technology will enter our daily lives, and people will speak their voice commands out loud in the street to make their device play music, call a friend or whatever else, thanks to STT. That is why some companies, like Apple, are already working toward it.
Yet, voice technology won’t ever replace everything. Doing so would be counterproductive: it would force people to do something they don’t want to do. The best option is to use it in a multimodal environment, so people can use what they want, when they want, and voice remains a great and useful option.
Speech-to-text & privacy compliance
Another limit to adoption is privacy. Indeed, in order to “understand” what you want, the STT engine first has to collect and process your speech. As voice is biometric data that allows the speaker to be identified, it is protected by various data protection laws worldwide (GDPR, CCPA and the like), and its collection and processing are strictly monitored. On the one hand, professionals are looking for solutions that won’t expose them to legal risk, so they remain a bit hesitant.
On the other hand, individuals are concerned about how their personal data is collected and used. However, this is mainly a GAFAM problem. Cloud and hybrid solutions usually process data to improve their models, and that personal data can be leaked. That’s why most companies handling sensitive data choose embedded speech-to-text software like Vivoka’s. Completely offline software lets businesses deploy automatic speech recognition safely because it processes data locally: the data never leaves the device. We recommend reading our article on voice technology if you want to learn more.
The lack of accuracy can also be a drag: you cannot (yet) create a voice assistant able to understand and answer any kind of request perfectly. Everybody wants a continuous voice experience but, since that can’t be done well today, voice technology is usually implemented as a streamlined feature. Still, it already meets most current requirements, for voice picking and other hands-free kits for example.
It can be expensive, sometimes…
Business models differ between companies and solutions. Some offer pay-as-you-go packages, which can become hard to manage if you get tons of queries. Licensing models make costs easier to predict and match to your forecasts, but can lack scalability. Subscription is also an option, balancing scalability and cost control, but it isn’t suited to every company. All in all, this commercial complexity is something companies are not quite used to, especially with new technologies.
Apart from that, there is an overall strong resistance to change. The voice technology market hasn’t yet reached its peak, as only a few people are using it to its full potential. These are the main reasons why people still have trouble adopting voice technologies in their everyday lives and/or for business. But there is more. As I said before, the technology isn’t perfect and faces technical limitations as well.
Accuracy under any circumstance
First of all, you need to understand that speech-to-text accuracy depends on the information you give the system. In most cases, you will find that speech-to-text easily recognizes terms it is used to hearing. But as soon as you speak new words or expressions… well, it struggles to transcribe them, either because the model lacks specific training or because it is built on generic language models that have never seen them. Therefore, if you want to caption a TV news channel, catching every word will be difficult. Real language changes faster than models, so a model needs frequent retraining to stay truly accurate. In addition, most speech-to-text models are trained on insufficiently diverse sets of human voices, which can induce gender or ethnicity bias. Accents are also a source of concern, as words may be pronounced differently from one region of a single country to another…
That’s why hitting the expected accuracy level across a wide range of uses is still hard today; there is no perfect model yet. The best option is to narrow the expected user journey so the model only has to handle what it was designed for.
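To put a number on the accuracy we keep mentioning, the industry standard is the word error rate (WER): the fraction of reference words the engine substitutes, deletes, or inserts. Here is a minimal self-contained sketch (the sample sentences are made up for illustration):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length,
    computed as a Levenshtein edit distance over words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One substituted word out of five -> WER of 0.2 (20%)
print(wer("turn on the kitchen lights", "turn on the kitten lights"))  # 0.2
```

A 20% WER means roughly one word in five is wrong, which is why narrowing the vocabulary to the expected user journey matters so much.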
Hardware capabilities and footprint
Most speech-to-text applications are tied to the devices that host them. If you go for a cloud provider, lucky you: there is only an API to implement within your software (provided it works properly, but that’s another issue). In the world of embedded systems, which Vivoka knows well, hardware is everything, and so are its requirements. We usually face a duality: microcontrollers vs. microprocessors. One is cheap but has low specifications; the other is much more powerful, but so is its cost. Companies need to contend with this reality, and we, technology providers, have to bridge the gap. Even though speech models are getting smaller while hardware gains capacity (and flexibility to fit the software), the struggle remains. For example, Vivoka works on offering the best offline speech-to-text so that companies can develop their own voice assistant according to their needs and market specificities. And because it is embedded technology, we need to account for the resources available on the target devices and the footprint of the models.
How did speech-to-text evolve?
You do know that speech-to-text technology wasn’t born with smart assistants like Alexa, right?
You may have already heard about Audrey? It was one of the first projects resembling a speech-to-text engine, and it dates from 1952! After that came the IBM Shoebox and many others. In this part, we’ll talk about the evolution of speech-to-text technology.
Fields of application of speech to text
Speech recognition has come a long way since the 1950s. There were some big innovations in the ’70s and ’80s, but it was in the late ’90s and 2000s that the field really took off, thanks to Google and its data centres. From then on, huge amounts of data could be processed to match users’ queries, opening the way to more specialised usage, for companies for example.
How speech-to-text has been democratized
Speech-to-text has long been used in various domains, and these helped democratise commercial speech-to-text. In the early ’90s, STT was only used by the military and for speech research. Nuance launched its first Dragon Dictate in 1990 at a staggering price of $9,000, even though it wasn’t state-of-the-art STT yet; the process was very long and time-consuming back then. But in 1997 came Dragon NaturallySpeaking, which was much more powerful and far less expensive ($150). This helped democratise voice technology. The early adopters of voice technology were therefore:
- Customer services, with IVR, which had existed since the ’70s but was then considered too complex and expensive;
- And the supply chain, with voice picking systems, which appeared in the late ’90s.
Where is it usually used?
Today, several other kinds of professions use speech-to-text:
- Banking & finance;
- Marketing, with voice search;
- Consumer goods.
Some of them use it only for dictation, because it is more efficient and lets people focus on value-added tasks. In banking and finance, for instance, dictation frees agents from taking notes during meetings so that they can concentrate on work that demands more skill or human judgement. Others are on the verge of great innovations and want to implement voice technologies to optimise their processes to the maximum.
Emerging use of speech-to-text
In the current context, we can safely say that voice can and will be implemented in virtual reality. The use of VR headsets is all fun and games until you find out how limited your controls are. So what do we do? How do we play VR games? How do we send a document to somebody else in the Metaverse?
Speech to text in the Metaverse
Voice could be the ideal answer to these problems, adding more commands thanks to speech recognition and speech-to-text. With it, you wouldn’t need more physical controls than you have or can manage. It would simplify the user experience and even allow people to navigate the VR world without any handheld controller! Automatic speech recognition could also be used in advertising: keyword detection may be a way for Google Ads to reinvent itself in a world where typed searches become less and less common. Finally, in a digital world where everybody can express themselves orally, speech-to-text could help prevent abuse and hate speech.
Live shopping has also been a growing trend since the lockdowns. Democratised by the Chinese platform AliExpress, it has since been taken up by renowned brands and companies. Merchants or influencers host a live stream and present products to their viewers, who can buy them directly during the stream or on the merchant’s website, often with discounts. How can speech-to-text enhance this new shopping experience? It would improve accessibility through live transcription and translation. Going further, we could imagine asking the streamer questions from home without having to type. It could even become part of the Metaverse with VR shopping, deepening the experience through voice and hearing.
Evolution of habits: the voice commerce
Voice commerce is another emerging trend. It lets people make purchases using voice commands, with their smart assistant for example, which increases visibility and accessibility for a lot of businesses. It is set to change habits and become part of people’s routines. You’d be able to say something out loud such as “oh no, I’m running out of red beans for my vegan chilli” (or anything not necessarily related to food), and your assistant could remind you later or, even better, order some for you if you want!
Voice commerce is spreading today, and we hope speech-to-text will become a ubiquitous feature so that the world becomes more inclusive for people with disabilities or facing other difficulties. The Metaverse is heading that way; everything else should too.
What should you be aware of before choosing a STT solution?
We are here to help: we have drawn up a list of the main specifications to keep in mind when assessing a solution. Depending on your project, take the time to study each of them carefully to find the perfect match!
There are loads of STT solutions on the market, and one of the main elements that can help you choose is the business model. You may have noticed that the majority are cloud-based and often charge based on the volume you transcribe. Whether it is per hour or per request: the more you use it, the more expensive it gets. So if you know you will need a lot of volume, a perpetual license model may fit better. You pay for the license once and then, depending on your project, royalties may apply. For instance, the VDK is based on this model because it makes more sense for professionals to pay according to their needs. This model also lets companies manage their costs better, because they are fixed and can be forecasted. Speech recognition needs to integrate seamlessly, and that goes for costs too.
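The metered-vs-license trade-off boils down to a break-even volume. The sketch below uses made-up prices purely for illustration (they are not Vivoka’s or any cloud vendor’s actual rates):

```python
def metered_cost(hours: float, rate_per_hour: float) -> float:
    """Pay-as-you-go: you pay for every audio hour transcribed."""
    return hours * rate_per_hour

def license_cost(upfront: float, hours: float, royalty_per_hour: float = 0.0) -> float:
    """Perpetual license: one fixed fee, plus optional per-unit royalties."""
    return upfront + hours * royalty_per_hour

RATE = 1.50         # hypothetical $ per audio hour, metered cloud STT
UPFRONT = 20_000.0  # hypothetical one-time license fee, no royalties

# Above this volume, the fixed license is cheaper than metering
break_even = UPFRONT / RATE
print(f"break-even: {break_even:.0f} audio hours")

for hours in (5_000, 15_000, 30_000):
    print(hours, metered_cost(hours, RATE), license_cost(UPFRONT, hours))
```

With these assumed numbers, metering wins at low volumes and the license wins past the break-even point, which is exactly why high-volume projects tend toward licensing.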
Technology evolves rapidly, which means it can quickly become obsolete. To avoid this, make sure that upgrades and updates are included in your software supplier’s package. You may also want to check that support is available and qualified: most of the time you will have to buy credits to use it, so make sure you are not wasting your money! Beyond that, the solution you choose needs to be scalable. Whether it is the technology, the business model or anything else, make sure you can get more if you ever need to. It is important to protect your investment.
This one is closely linked to scalability. If you want your speech-to-text solution to last, you will need something you can customise to your needs and those of your customers. Know that not every ASR software can be customised, and you won’t be able to add a grammar to every model either. If your audience is very specific, you may need to add technical jargon and complex vocabulary to your solution.
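As a toy illustration of what a constrained command set buys you, here is a sketch of matching transcripts against a fixed “grammar” of voice-picking commands. Every intent name and pattern below is made up, and a real ASR grammar constrains the recognizer itself rather than post-processing the transcript as we do here:

```python
import re

# Hypothetical closed command set: intent -> allowed phrasings.
# Narrowing the vocabulary like this is what lets an embedded
# recognizer stay small and accurate.
COMMANDS = {
    "pick_confirm": re.compile(r"^(ok|confirm|check) (\d+)$"),
    "pick_skip":    re.compile(r"^skip (item|location)$"),
    "repeat":       re.compile(r"^(repeat|say again)$"),
}

def match_intent(transcript: str):
    """Map a normalized transcript to (intent, captured slots)."""
    text = transcript.lower().strip()
    for intent, pattern in COMMANDS.items():
        m = pattern.match(text)
        if m:
            return intent, m.groups()
    return None, ()

print(match_intent("Check 42"))   # ('pick_confirm', ('check', '42'))
print(match_intent("say again"))  # ('repeat', ('say again',))
```

Anything outside the command set simply fails to match, which is the desired behaviour for a warehouse device: better to ask the operator to repeat than to guess.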
Technical stack robustness
For an easy integration, check whether the solution you are considering comes with documentation, example code, and qualified support for the installation. There are other things to verify too, particularly if you are aiming for an embedded system: we recommend getting information on the operating systems and storage the solution requires, so you can make sure it is compatible with what you actually use.
Does your audience speak exclusively English? Or do you need several languages to make sure nobody is left out? And more importantly: are you SURE your audience won’t change? Each SDK provider offers a defined set of languages, and some are really restricted. Still from a scalability perspective, always check that the languages you need are available, or soon will be if they are nice-to-haves. Tech companies are continuously developing their offer, but they may not share your priorities. On top of that, you now know that accents can be a challenge for speech-to-text software, so test whether it handles several accents well. Most of the time, you will also find different versions of the same language, because there are specificities between American English and British English, for instance.
Finally, to find out whether the solution matches your needs, evaluate it. Take time to think about your use case(s) and, if possible, prototype them directly. That will show you whether it fits what you imagined and makes sense in your product’s or service’s workflow. Even if we are moving towards a voicefirst society and voice technology keeps gaining importance, you don’t want to put it in every device just because you can: the more you put voice commands into insignificant objects, the more you lessen their perceived utility. Moreover, as said before, you can’t expect an ASR to understand everything; the more you ask of it, the less accurate it gets. That is why, overall, you need to understand your use case. It is what will tell you which solution is best for you and for the use you are going to make of it.
Vivoka provides an offline ASR to meet the specific needs of companies. Thanks to grammars, the model is more accurate and has a lighter footprint, so you can easily integrate it into constrained devices. Moreover, the Studio lets you work with all voice technologies and offers undeniable ease of use. If you are curious about it or would like to discuss possible use cases for your business, feel free to request a demo!