According to Statista, the voice recognition market was valued at USD 10.7 billion in 2020. For a (relatively) new market, this figure is simply mind-boggling, and that isn't all.
This figure is expected to more than double by 2026: the total market value of voice/speech recognition apps and their accompanying screen readers is projected to reach a whopping USD 27.16 billion, and what makes it amazing is that this is a "conservative" figure.
The industry is going to explode, given the increasing popularity of IoT (Internet of Things) devices such as smart home assistants, and the technology's growing adoption in retail, healthcare, banking, and many other sectors.
Here is the guide I have written about speech and voice recognition apps, their trends, and the things to remember when building your own such application.
It is here that I must make the distinction between speech and voice recognition: speech recognition is when an engine detects and transcribes what a human is saying, and voice recognition is one step ahead of that: it also identifies who is speaking by telling different people's voices apart.
For the purposes of this article, I am going to focus on the latter, voice recognition.
So, how exactly does one go about building one of these things? Let’s find out.
Table of Contents
Step One: Choose Your App Type.
No, I do not mean choosing between speech and voice recognition — remember, all voice recognition apps are, by default, also speech recognition apps, but the reverse of that is not true.
What I mean to say is that voice recognition applications will generally fall into one of two categories:
- Speaker Dependent
- Speaker Independent
Speaker Dependent: Speaker dependent voice recognition applications store, analyze, and recognize the voice of a single person only, and require far less neural computation than speaker independent voice recognition apps.
The earlier voice recognition apps were all speaker dependent, as these take initial inputs from the user, memorize them, and then build up from there.
These apps ask the users to repeat certain templated words and phrases in a particular order, and then memorize the vocal range and pitch of the user accordingly.
These words and phrases are structured such that when broken down, their syllables can be engineered into a variety of new sentences, and this is where speaker dependent apps become number crunchers — and start to require exponentially more processing power as new sentences are added.
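To make the template idea concrete, here is a minimal, hypothetical sketch of how a speaker dependent app might compare a new utterance against a stored template using dynamic time warping (DTW), a classic matching technique. The feature sequences below are made-up stand-ins for real acoustic features, not output from any actual recognizer:

```python
def dtw_distance(template, utterance):
    """Dynamic time warping distance between two 1-D feature sequences."""
    n, m = len(template), len(utterance)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(template[i - 1] - utterance[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # skip a template frame
                                    cost[i][j - 1],      # skip an utterance frame
                                    cost[i - 1][j - 1])  # match both frames
    return cost[n][m]

# A slowed-down rendition of the stored phrase still matches perfectly,
# while a different phrase scores much worse.
stored = [1.0, 2.0, 3.0, 2.0, 1.0]
same_but_slow = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 2.0, 2.0, 1.0, 1.0]
different = [5.0, 5.0, 5.0, 5.0, 5.0]
match_score = dtw_distance(stored, same_but_slow)
mismatch_score = dtw_distance(stored, different)
```

DTW's tolerance for stretching is exactly why these apps can recognize the same user speaking faster or slower than during enrollment.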
Speaker Independent: On the other hand, speaker independent voice recognition applications do not require training in advance to recognize a voice, and they can also recognize the voices of multiple people, provided that the voices are not too similar.
Speaker independent apps use Fourier transformations or LPC (Linear Predictive Coding).
Fourier Transformations: These are mathematical conversions that transform (hence the name) a signal recorded over time into the frequencies that make it up. Thus, when the waveform of a person's voice is recorded, Fourier transforms break that waveform down into small frequency components that are easy for the computer to analyze.
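As a toy illustration of the idea (nothing like production audio code), a plain-Python discrete Fourier transform can pick the dominant frequency out of a sampled waveform:

```python
import math

def dft_magnitudes(samples, sample_rate, max_freq):
    """Magnitude of the discrete Fourier transform at each whole-number frequency."""
    mags = []
    for freq in range(max_freq + 1):
        re = sum(s * math.cos(2 * math.pi * freq * i / sample_rate)
                 for i, s in enumerate(samples))
        im = sum(-s * math.sin(2 * math.pi * freq * i / sample_rate)
                 for i, s in enumerate(samples))
        mags.append(math.hypot(re, im))
    return mags

# One second of a pure 50 Hz tone, sampled at 800 Hz.
rate = 800
wave = [math.sin(2 * math.pi * 50 * i / rate) for i in range(rate)]
mags = dft_magnitudes(wave, rate, 200)
dominant_hz = mags.index(max(mags))
```

Real systems use the fast Fourier transform (FFT) over short windows of audio, but the principle is the same: the waveform becomes a list of frequency strengths the computer can compare.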
Linear Predictive Coding: LPC models each speech sample as a linear combination of the samples that came before it (a predictive model: something that uses past data to predict future values). LPC is widely used for speech analysis and synthesis, and it is this technology that allows speaker independent apps to model the characteristics of a voice and, thus, differentiate between voices.
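Here is a minimal sketch of the core LPC computation: estimate the signal's autocorrelation, then solve for the predictor coefficients with the Levinson-Durbin recursion. Real systems run this on windowed frames of audio; to keep the sketch verifiable, we instead recover the coefficients of a synthetic signal generated by a known two-tap predictor:

```python
def autocorr(x, max_lag):
    """Autocorrelation of x at lags 0..max_lag."""
    return [sum(x[i] * x[i + k] for i in range(len(x) - k))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations; returns a with x[n] ~ -sum(a[k] * x[n-k])."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        a = [a[j] + k * a[i - j] if 1 <= j < i else a[j]
             for j in range(order + 1)]
        a[i] = k
        err *= (1.0 - k * k)
    return a

# Synthetic signal obeying x[n] = 1.3*x[n-1] - 0.6*x[n-2].
x = [1.0, 1.3]
for n in range(2, 512):
    x.append(1.3 * x[n - 1] - 0.6 * x[n - 2])
coeffs = levinson_durbin(autocorr(x, 2), 2)
# coeffs[1] and coeffs[2] recover the generating model (-1.3 and 0.6).
```

The recovered coefficients compactly describe the "filter" that shaped the signal, which for speech corresponds to the shape of the speaker's vocal tract.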
Find out more about Linear Predictive Coding and speech synthesis from the University of California, Santa Barbara.
Step Two: Decide Your Development Place.
Again, there are two main routes you can take at this point:
- On-Premise (In-House)
- Cloud-Based (Remote)
If budget is not an issue, then on-premise development is the road that most people take. However, I propose that the second option is the better one.
Forming a remote team, or even a partly remote team has myriad benefits.
For one, the costs are significantly reduced. According to Indeed.com, for example, the average software engineer earns a base salary of USD 94,026 per year.
However, if you choose to hire iOS or even Android developers remotely, you can get rates as low as USD 25 per hour.
That is not the end of the story either: the willingness to hire remote workers gives you access to a wealth of candidates that you wouldn’t have been able to work with otherwise, because of geographical constraints.
Step Three: Decide The End Use Case.
Steps one through three are not superfluous, I can assure you. These three are very likely the questions that your development team will ask you, even before you onboard them.
This is because each point where I have shown you a distinction (a fork in the paths you can take) requires specialized talent and relevant experience, which not all candidates will have.
Anyway, step three: answer the question, "What do you want your application to do?"
Depending on the end use case, your app(s) will use different APIs for the actual speech recognition part.
If it is a simple web voice recognition app, for instance, you could easily get away with the browser's built-in speech recognition (the Web Speech API) and a simple web server.
Step Four: Choose Your Voice API.
Now we’re really getting into the meat and potatoes of building your app.
Here are some of the APIs you can use. Most of them come with their own libraries, which is nice because that eliminates the need to code your voice recognition library from scratch.
I recommend picking one of the following (in no particular order):
- Bing Speech API — by MS Cognitive Services (Azure)
- Google Speech API
- Alexa Voice Service — By Amazon AWS
- Wit.ai — By Meta (Facebook)
If your app is going to be mainly mobile (iOS or Android) based, consider the options listed below, as their components are highly customizable and they cover several prerequisites that are vital to mobile app development.
- CMUSphinx or CMU PocketSphinx
- The Kaldi ASR Toolkit
- The Hidden Markov Model Toolkit (HTK) by The University Of Cambridge
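Whichever engine or API you pick, it pays to hide it behind a thin interface so you can swap vendors later without rewriting the app. Below is a hypothetical sketch of what such an abstraction might look like; the `Recognizer` and `FakeRecognizer` names are invented for illustration, and a real implementation would call a vendor SDK (Google, Azure, PocketSphinx, etc.) inside `transcribe`:

```python
from abc import ABC, abstractmethod

class Recognizer(ABC):
    """Vendor-neutral interface; each engine or cloud API gets its own subclass."""
    @abstractmethod
    def transcribe(self, audio: bytes) -> str: ...

class FakeRecognizer(Recognizer):
    """Stand-in used for local testing; returns a canned transcript."""
    def __init__(self, canned: str) -> None:
        self.canned = canned

    def transcribe(self, audio: bytes) -> str:
        return self.canned

class VoiceApp:
    """App logic depends only on the interface, never on a specific vendor."""
    def __init__(self, recognizer: Recognizer) -> None:
        self.recognizer = recognizer

    def handle(self, audio: bytes) -> str:
        text = self.recognizer.transcribe(audio)
        return text.strip().lower()  # normalize before downstream processing

app = VoiceApp(FakeRecognizer("  Turn On The Lights  "))
result = app.handle(b"\x00\x01")  # fake audio bytes
```

This kind of seam also makes the app testable without a microphone or a network connection, which your development team will thank you for.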
Step Five: Decide On Deployment Location.
Now we come to the question of where the application will be deployed. Again, you have reached a fork in the road.
Make your choice wisely:
- Cloud-based Deployment.
- Embedded Deployment.
Here is a table that summarizes the advantages and disadvantages of cloud-based versus embedded app development:

| | Cloud Development | Embedded Development |
|---|---|---|
| Pros | More efficient, and does not overload the device's memory. All calculations are done on powerful servers, and server uptime is usually flawless, so there are no worries about data loss. | Apps can run offline. There is very little delay/desync, as these apps run locally: no input lag from the device to the server, which can really be a major pain down the line, especially with larger applications. |
| Cons | Requires an internet connection to be maintained throughout the pre and post development process, and apps need to be heavily optimized to use the available bandwidth effectively. | Needs a lot of free disk space, plus a fair amount of processing power just to be usable: not good, just usable. |
Choose wisely, as one tends to get locked into the particular ecosystem, depending on which path you take.
Even after just a few short weeks of development, it becomes inefficient and impractical to switch from embedded to the cloud and vice versa, so spare a good deal of thought to this part of the process.
Step Six: Select Your Voice Recognition Stack.
Just like there are different levels of proficiency among musicians, ranging from hobbyists to maestros, there are different “tiers” to a voice recognition stack.
Each of these tiers is a bit costlier and a little harder to implement than the last, but brings manifold benefits with it.
The main “tiers” are:
- MEMS Microphones:
These are microphones built around a tiny Micro-Electro-Mechanical Systems (MEMS) sensor, hence the name. They are the easiest to implement, last extremely long, and can have an SNR (Signal-to-Noise Ratio) as high as 80 dBA. A good, cost-effective choice overall.
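For reference, that SNR figure is just the ratio of signal power to noise power expressed in decibels. A quick sanity check of what 80 dB means in amplitude terms:

```python
import math

def snr_db(signal_rms, noise_rms):
    """Signal-to-noise ratio in decibels, computed from RMS amplitudes."""
    return 20 * math.log10(signal_rms / noise_rms)

# 80 dB means the signal's amplitude is 10,000 times the noise floor.
ratio_80db = snr_db(1.0, 0.0001)
```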
- Microphone Array Algorithms
This just means arranging two or more (omnidirectional) MEMS microphones in a circuit so that all of them capture audio simultaneously; the separate streams are then combined for processing.
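One common way those simultaneous streams are combined is delay-and-sum beamforming: shift each microphone's stream to compensate for how late the sound arrived at it, then average. A toy sketch under simplifying assumptions (integer sample delays, delays already known):

```python
def delay_and_sum(channels, delays):
    """Align each channel by its known delay (in samples), then average them."""
    n = len(channels[0])
    out = []
    for i in range(n):
        total, count = 0.0, 0
        for ch, d in zip(channels, delays):
            if 0 <= i + d < len(ch):  # read the sample that arrived d samples late
                total += ch[i + d]
                count += 1
        out.append(total / count if count else 0.0)
    return out

# Microphone B hears the same source two samples later than microphone A;
# aligning and averaging recovers the source.
source = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 0.0]
mic_a = source[:]
mic_b = [0.0, 0.0] + source[:-2]
aligned = delay_and_sum([mic_a, mic_b], [0, 2])
```

With real microphones, averaging aligned channels reinforces the desired voice while uncorrelated background noise partially cancels, which is the whole point of the array.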
- ASR: This stands for "Automatic Speech Recognition", a technology that uses Machine Learning (ML) in conjunction with MEMS microphone(s). Its main purpose is to convert human speech into text, so any commercial-use app needs to start here. Siri, Alexa, and Google Assistant are all examples of ASR-powered apps.
- NLU: This stands for Natural Language Understanding, and involves breaking down human speech into a structured ontology. This is where things begin to get complicated, as neural networks come into play, with intent and entity recognition playing a big part in the overall coding.
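To give a feel for intent and entity recognition, here is a deliberately tiny rule-based sketch. Real NLU layers use trained models rather than regular expressions, and the intent names below are invented for illustration:

```python
import re

# Each hypothetical intent maps to a pattern with named entity groups.
INTENT_PATTERNS = {
    "set_timer": re.compile(r"\btimer for (?P<minutes>\d+) minutes?\b"),
    "play_music": re.compile(r"\bplay (?P<track>.+)$"),
}

def parse(utterance):
    """Return (intent, entities) for the first matching pattern."""
    text = utterance.lower().strip()
    for intent, pattern in INTENT_PATTERNS.items():
        match = pattern.search(text)
        if match:
            return intent, match.groupdict()
    return "unknown", {}

intent, entities = parse("Set a timer for 10 minutes")
```

A trained NLU model does the same job (label the intent, pull out the entities) but generalizes to phrasings no one wrote a rule for.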
- Cloud Orchestration
This method needs to be used in conjunction with two things called “skill routing” and “skill execution”, and covers aspects such as Microservices, Scalability, Load Balance, and permissions.
- NLG: Standing for Natural Language Generation, this is commonly found in AI chatbots and realization engines, and needs a proper team to implement.
- TTS: Lastly, we have the big one: Text-To-Speech generators, also known as speech synthesizers. It is this technology that allows your devices to "talk" back to you.
There are other features that can be implemented, such as deep learning, and of course a strong backend database is a must, but this has been my basic "how-to" manual for creating a voice recognition application.
Hopefully, this piece will be enough to get the ball rolling, and with that, I’ll wish you success with the building of your voice recognition app.