I have a client who needs an Android app that can recognize spoken commands. From what I understand, the built-in voice-to-text functionality actually sends audio to Google's servers, which then send back a text transcription. This is a major problem, as the voice data is extremely sensitive (unless the data is encrypted when it is sent to and from Google, but I doubt it is).
There are two options I can think of. The first is to convert speech to text on the Android device itself, though this seems like it would be an extremely expensive operation. The second is to have a local server convert the data for me (I could encrypt the voice data and the transcription in transit). Is this something CMU Sphinx could pull off? It may be worth noting that I will also have access to an Asterisk server, which could possibly assist with this (I don't know).
In reality, only ~200 words need to be recognized. I would prefer open-source/free software solutions, but I am also open to a commercial one (perhaps FlexT9). Ideally, I can send the audio stream somewhere, get back a String containing the text, and then parse and do other things with that String.
I haven't done much Android development or any speech recognition work in the past, so I'm hoping someone can at least point me in the right direction. Thanks!
CMUSphinx is an open-source speech recognition toolkit containing the tools, libraries, and data you need to build a speech application. You can learn more about CMUSphinx on the project website.
On Android you have several options to use CMUSphinx:
Recognize audio on the device. For that you can compile the Pocketsphinx engine for Android; for details see this blog post. A minimal on-device sketch follows this list.
Recognize audio on a server. There you can use either Pocketsphinx or Sphinx4. You can send the audio in compressed FLAC format, or extract speech recognition features on the device and send the feature stream to the server.
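For the on-device option, a minimal sketch based on the pocketsphinx-android demo might look like the following. The model directory, dictionary and keyword-file names are assumptions that depend on which model and command list you bundle with your app:

    import android.content.Context;
    import java.io.File;
    import java.io.IOException;

    import edu.cmu.pocketsphinx.Assets;
    import edu.cmu.pocketsphinx.Hypothesis;
    import edu.cmu.pocketsphinx.RecognitionListener;
    import edu.cmu.pocketsphinx.SpeechRecognizer;
    import edu.cmu.pocketsphinx.SpeechRecognizerSetup;

    public class CommandListener implements RecognitionListener {

        private static final String COMMAND_SEARCH = "commands";
        private SpeechRecognizer recognizer;

        public void start(Context context) throws IOException {
            // Copy the model files bundled under assets/sync to internal storage.
            Assets assets = new Assets(context);
            File assetsDir = assets.syncAssets();

            recognizer = SpeechRecognizerSetup.defaultSetup()
                    .setAcousticModel(new File(assetsDir, "en-us-ptm"))      // assumed model dir
                    .setDictionary(new File(assetsDir, "cmudict-en-us.dict")) // assumed dictionary
                    .getRecognizer();
            recognizer.addListener(this);

            // Keyword-spotting search over a small list of commands (assumed file name).
            recognizer.addKeywordSearch(COMMAND_SEARCH, new File(assetsDir, "commands.lst"));
            recognizer.startListening(COMMAND_SEARCH);
        }

        @Override
        public void onResult(Hypothesis hypothesis) {
            if (hypothesis != null) {
                String command = hypothesis.getHypstr();
                // Parse the recognized command string here.
            }
        }

        @Override public void onPartialResult(Hypothesis hypothesis) { }
        @Override public void onBeginningOfSpeech() { }
        @Override public void onEndOfSpeech() { }
        @Override public void onError(Exception e) { }
        @Override public void onTimeout() { }
    }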
CMUSphinx provides several acoustic models which enable you to recognize audio in a number of languages, such as English, French, Mandarin, German, Dutch, and Russian.
You can also improve the recognition result with adaptation tools.
If you have any questions on CMUSphinx you are welcome to ask in our community forums.
Closed source, but free, are the Microsoft speech engines. For some background, see "What is the difference between System.Speech.Recognition and Microsoft.Speech.Recognition?". For more background you can try https://stackoverflow.com/a/4217638/90236
The complete SDK for the Microsoft Server Speech Platform 11 is available at http://www.microsoft.com/download/en/details.aspx?id=27226. The speech engine is a free download.
Is it possible to integrate external TTS engine with Pepper Robot?
I want to integrate a third-party speech engine with the Pepper robot. Please guide me on this.
You can integrate an external TTS engine with Pepper, either offboard (like the services offered by IBM, MS Azure or Google) or onboard (ideally something in Java or Kotlin for Android Pepper, but anything is possible). If you have a specific technology in mind, please provide more details and we can give you a more precise answer.
Bear in mind that this may introduce latency in speech synthesis compared to the default text-to-speech engine.
Edit: sorry, I missed your Android tag. The APIs mentioned below only work on Pepper 2.5 (Choregraphe Pepper).
Alternatively, there are a number of different voices available on Pepper; perhaps one will suit your needs. Use the NAOqi API function ALTextToSpeech.getAvailableVoices to list the voice options, then ALTextToSpeech.setVoice to switch to one of them.
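With the NAOqi Java SDK for Pepper 2.5 that might look roughly like the sketch below. The helper-proxy class and connection pattern are assumptions based on the libqi-java bindings, and the robot address is a placeholder, so check both against your SDK version:

    import com.aldebaran.qi.Application;
    import com.aldebaran.qi.helper.proxies.ALTextToSpeech;

    import java.util.List;

    public class VoicePicker {
        public static void main(String[] args) throws Exception {
            // Placeholder robot address; replace with your Pepper's IP.
            String robotUrl = "tcp://192.168.1.10:9559";

            Application application = new Application(args, robotUrl);
            application.start(); // blocks until the session is connected

            ALTextToSpeech tts = new ALTextToSpeech(application.session());

            // List the installed voices, then switch to one of them.
            List<String> voices = tts.getAvailableVoices();
            System.out.println("Available voices: " + voices);
            tts.setVoice(voices.get(0));
            tts.say("Hello, this is my new voice.");
        }
    }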
I am new to speech recognition and Android, and I have a use case where I need to build an Android app which takes commands (a limited set, fewer than 100) from users and executes some logic. I have googled a bit and found the following options:
Use the Google Cloud Speech API
Use Android's built-in speech-to-text capability (is it different from the Google Cloud Speech API? If so, how?). Also, what are the pros and cons of using the offline mode of Android speech to text?
Use open-source speech recognition libraries like Kaldi or CMU Sphinx (it looked like they need a lot of effort in collecting and training data)
Can someone please suggest which of the above might best suit my use case?
I have a limited set of commands and speed matters the most to me.
I am really confused, hence this question. Thanks in advance.
Use the Google Cloud Speech API
Very expensive since you have to pay for every request.
Use Android's built-in speech-to-text capability (is it different from the Google Cloud Speech API? If so, how?). Also, what are the pros and cons of using the offline mode of Android speech to text?
The built-in API is OK to use. It is different from the cloud API and it is free, but it does not work offline transparently for the user. On the downside, it is slow and you cannot configure the vocabulary, so it will decode arbitrary words instead of your particular set of commands and will often confuse the required commands with other words in noise.
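For illustration, a minimal sketch of the built-in route with RecognizerIntent follows. Note that EXTRA_PREFER_OFFLINE (API 23+) is only a hint, and there is still no way to restrict the vocabulary to your command set:

    import android.app.Activity;
    import android.content.Intent;
    import android.speech.RecognizerIntent;

    import java.util.ArrayList;

    public class CommandActivity extends Activity {

        private static final int SPEECH_REQUEST = 1;

        private void promptForCommand() {
            Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
            intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                    RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
            // Hint to prefer the on-device recognizer (API 23+); not guaranteed.
            intent.putExtra(RecognizerIntent.EXTRA_PREFER_OFFLINE, true);
            startActivityForResult(intent, SPEECH_REQUEST);
        }

        @Override
        protected void onActivityResult(int requestCode, int resultCode, Intent data) {
            super.onActivityResult(requestCode, resultCode, data);
            if (requestCode == SPEECH_REQUEST && resultCode == RESULT_OK && data != null) {
                ArrayList<String> results =
                        data.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS);
                String spoken = results.get(0);
                // Match 'spoken' against your own command list here.
            }
        }
    }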
Use open-source speech recognition libraries like Kaldi or CMU Sphinx (it looked like they need a lot of effort in collecting and training data)
Proper development is always an effort.
Is it possible to have our Android app answer questions only through our Alexa custom skill, without the entire default behavior of Echo? For example, say I created a custom skill called "calculate". Can I make an Android app which uses the Alexa Voice Service API to answer only questions related to "calculate" and nothing else (i.e. no default behavior like weather or music)?
Why does the example app in the developer documentation say "companion app"? Do I need an Echo to use it? Can I not make an app which will answer questions but does not require an Echo?
Is it possible to get both text and audio as output using the Alexa API?
I appreciate any input. Any links and references are welcome.
The benefit of Alexa is its voice recognition abilities and its ability to choose an appropriate intent based on a voice interaction. If the skill is written with clearly defined intents, Alexa will be able to respond as you want. It may be that "calculate..." is too vague an intent for Alexa to differentiate.
Also, the useful bit is the skill you build. You define how things are calculated and what answer to give. Unless you are trying to leverage the voice recognition and AI, you might be better off with some other technology (and if you do need those things, then Wit.ai might be more useful to you: https://wit.ai/; it's a little more roll-your-own than Alexa).
Alexa Voice Services (AVS) is available in the US, but not in the UK or Germany until 2017 (and who knows when for other markets). AVS can be added to physical devices that have a speaker and microphone, so it is possible to use Alexa without an Echo or Echo Dot.
At its core, the input and output of Alexa apps are JSON (so text). Alexa parses the text response and speaks the appropriate part. I'm not sure that you can route this response in any way other than having it spoken. However, between the request and response sits the Lambda function (or native device function), so in addition to generating the response for Alexa, you could dump the response somewhere else at the same time, where it would be available outside of Alexa.
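For illustration, the response your skill sends back has roughly this shape (simplified; the spoken text here is just a placeholder). Since your Lambda builds this JSON, it can also log or forward the same text elsewhere before returning it:

    {
      "version": "1.0",
      "response": {
        "outputSpeech": {
          "type": "PlainText",
          "text": "Two plus two is four."
        },
        "shouldEndSession": true
      }
    }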
Is it possible to have our Android app answer questions only through our Alexa custom skill, without the entire default behavior of Echo? For example, say I created a custom skill called "calculate". Can I make an Android app which uses the Alexa Voice Service API to answer only questions related to "calculate" and nothing else (i.e. no default behavior like weather or music)?
Yes, it's possible to override the commands. First of all, create your custom skill using the Alexa Skills Kit, then use the Alexa application for Android or iOS.
In "Settings", go to your product (if it's an Echo/Dot) or to your Android/iOS application and enable your skill.
Why does the example app in the developer documentation say "companion app"? Do I need an Echo to use it? Can I not make an app which will answer questions but does not require an Echo?
In the documentation, the "companion app" context is only for using your own hardware as an Alexa device. Using the Login with Amazon SDK, the developer has to authorize the user and get a token from the Amazon server so that your hardware can communicate with the Alexa server.
Yes, you can make an Android or iOS app that talks to the Alexa server. The link below is to a well-developed library for this:
https://github.com/willblaschko/AlexaAndroid
Is it possible to get both text and audio as output using the Alexa API?
No, you will never get the text interpretation; you will only get the response from Alexa in the form of JSON.
I'm developing a voice command app and need to use speech to text in Android.
I want my app to work offline. That is currently possible only on Jelly Bean and later, and it requires downloading a huge database and keeping it on the device. But I don't require the whole database; I just want a few keywords for the conversions.
Is it possible to record .wav files on our own, associate each with a particular word, and then, when voice input is given, match the two voice tracks and recognize the corresponding word? So basically I want to make my own speech-to-text dictionary database. If yes, how can I achieve it?
You can try Pocketsphinx on Android:
http://cmusphinx.sourceforge.net/wiki/tutorialandroid
It allows you to look for keywords. The database size is about 5 MB now, but if you limit the keywords it can be reduced to about 500 KB. You can learn more about CMUSphinx from the website:
http://cmusphinx.sourceforge.net/wiki/tutorial
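As for the keyword spotting mentioned above, the keyword search is driven by a plain-text keyword file, one phrase per line with a detection threshold. The phrases and thresholds below are just illustrative values you would tune for your own commands:

    turn on the light /1e-20/
    turn off the light /1e-20/
    open the door /1e-25/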
Since you're developing for Android, why don't you just use Android's built-in voice recognition software as your own (unless it's to be a paid app)?
Creating .wav files yourself will prove difficult for people outside your vocal culture range; for example, someone with a different accent won't be able to use it.
So access Google's libraries for voice recognition.
Google has recently made great progress with their speech recognition software, which is used in several open-source products, e.g. Chromium Web Speech and Android hands-free texting. I would like to use their speech recognition as part of my server stack; however, I can't find much about it.
Is the speech recognition software available as a library or package? Alternatively, can I call Chromium from another program to transcribe an audio file to text?
The Web Speech APIs are designed to be used only in the context of either Chrome or Android. A lot of work goes on in the client, so there is no public server-to-server API that would just take an audio file and process it.
If you search GitHub you will find tools such as https://gist.github.com/alotaiba/1730160, but I am pretty certain that this method of access is not supported, endorsed, or guaranteed to keep working.
The method described at https://gist.github.com/alotaiba/1730160 does work for me. I use it on a daily basis in my home automation programs. I use a Python script to capture audio and determine whether it is useful audio or just noise; it then sends the little audio snippet to Google and returns the text, all in under a second! I have successfully integrated it into my programs, and if you google around you will find even more people who have as well.
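For what it's worth, the gist boils down to an HTTP POST of a FLAC snippet to an unofficial Google endpoint. Below is a rough Java sketch of that idea; the URL, query parameters and response fields are whatever the gist uses at the time, they are not a supported API and may change or stop working at any point:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Scanner;

    public class UnofficialSpeechClient {
        public static void main(String[] args) throws Exception {
            // A short FLAC snippet, e.g. recorded at 16 kHz (placeholder file name).
            byte[] flac = Files.readAllBytes(Paths.get("snippet.flac"));

            // Endpoint and parameters taken from the gist above; unofficial and unsupported.
            URL url = new URL("https://www.google.com/speech-api/v1/recognize?lang=en-US&client=chromium");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setDoOutput(true);
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "audio/x-flac; rate=16000");

            try (OutputStream out = conn.getOutputStream()) {
                out.write(flac);
            }

            // The response is JSON containing the recognized text; parse the field
            // the gist describes (e.g. "utterance") with your JSON library of choice.
            try (Scanner scanner = new Scanner(conn.getInputStream(), "UTF-8")) {
                while (scanner.hasNextLine()) {
                    System.out.println(scanner.nextLine());
                }
            }
        }
    }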