Google has recently made great progress with their speech recognition software, which is used in several open source products, e.g. Chromium Web Speech and Android Handsfree texting. I would like to use their speech recognition as part of my server stack, however I can't find much about it.
Is the speech recognition software available as a library or package? Alternatively, can I call Chromium from another program to transcribe an audio file to text?
The Web Speech APIs are designed to be used only in the context of Chrome or Android. A lot of work goes on in the client, so there is no public server-to-server API that would just take an audio file and process it.
If you search GitHub you will find tools such as https://gist.github.com/alotaiba/1730160, but I am pretty certain that this method of access is not supported or endorsed, and is not guaranteed to keep working.
The method previously stated at https://gist.github.com/alotaiba/1730160 does work for me. I use it on a daily basis in my home automation programs. I use a Python script to capture audio and decide whether it is useful audio or just noise; it then sends the short audio snippet to Google and gets the text back, all in under a second! I have successfully integrated it into my programs, and if you google around you will find even more people who have done so as well.
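To give an idea of the "useful audio or just noise" step, here is a minimal sketch of an energy-based check. It is not the script described above, only one common way to do it. It assumes the audio arrives as frames of 16-bit little-endian mono PCM; the names rms, speech_frames and make_frame and the threshold of 500 are made up for illustration and would need tuning for a real microphone.

```python
import math
import struct


def rms(frame):
    """Root-mean-square amplitude of a frame of 16-bit little-endian mono PCM."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack("<%dh" % count, frame[:count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def speech_frames(frames, threshold=500.0):
    """Keep only frames louder than the (hypothetical) noise threshold."""
    return [frame for frame in frames if rms(frame) > threshold]


def make_frame(amplitude, n=160):
    """Synthetic test frame: n samples alternating between +amplitude and -amplitude."""
    return struct.pack("<%dh" % n, *([amplitude, -amplitude] * (n // 2)))


quiet = make_frame(50)    # stands in for background noise
loud = make_frame(5000)   # stands in for speech
kept = speech_frames([quiet, loud, quiet])
```

Only the frames that pass the check would then be encoded and sent off for recognition.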
Related
I am new to speech recognition and Android, and I have a use case where I need to build an Android app that takes commands (a limited set, fewer than 100) from users and executes some logic. I have googled a bit and found that the following can be done:
Use the Google Cloud Speech API
Use Android's inbuilt speech-to-text capability (is it different from the Google Cloud Speech API? If so, how?). Also, what are the pros and cons of using the offline mode of Android speech-to-text?
Use open source speech recognition libraries like Kaldi or CMU Sphinx (it looked like they need a lot of effort in collecting and training the data)
Can someone please suggest which of the above might best suit my use case?
I have a limited set of commands and speed matters the most to me.
I am really confused, which is why I am asking this question. Thanks in advance.
Use the Google Cloud Speech API
Very expensive since you have to pay for every request.
Use Android's inbuilt speech-to-text capability (is it different from the Google Cloud Speech API? If so, how?). Also, what are the pros and cons of using the offline mode of Android speech-to-text?
The inbuilt API is OK to use. It is different from the cloud API, and it is free. It does not work offline transparently for the user, though. On the downside, it is slow and you cannot configure the vocabulary, so it will decode all words instead of a particular set of commands, and in noise it will often confuse the required commands with other words.
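Since the inbuilt recognizer cannot be restricted to a command vocabulary, one possible mitigation is to snap whatever text it returns to the closest known command afterwards. The following is only a sketch of that idea using fuzzy string matching from the Python standard library. The command list, the function name match_command and the cutoff of 0.6 are assumptions for illustration; on Android the same logic would have to be ported to Java or Kotlin.

```python
import difflib

# Hypothetical command set; a real app would list its own commands here.
COMMANDS = ["lights on", "lights off", "play music", "stop music"]


def match_command(transcript, commands=COMMANDS, cutoff=0.6):
    """Return the command most similar to the transcript, or None if nothing is close."""
    # Lowercase and collapse whitespace so formatting differences do not matter.
    normalized = " ".join(transcript.lower().split())
    matches = difflib.get_close_matches(normalized, commands, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

A slightly misheard transcript such as "light on" would still map to a known command, while unrelated speech returns None and can be ignored. Raising the cutoff makes the matching stricter.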
Use open source speech recognition libraries like Kaldi or CMU Sphinx (it looked like they need a lot of effort in collecting and training the data)
Proper development is always an effort.
Is it possible to intercept audio data using the Google+ Hangouts API? I am writing an app using G+ Hangouts for Android and I would like to process the audio. To be precise, I want to denoise speech and use speech-to-text (e.g. Google search, Sphinx) to make basic voice commands.
Because I have full control of the Android app, it doesn't matter to me whether I get a callback with audio data from the hangout or record audio using Android's AudioRecord and then somehow forward that data to Google Hangouts (though the latter solution would be better, because we can denoise on the Android device). Actually, I would be happy with any feasible workaround that may work at this stage of the API.
The Hangouts API is not going to help you develop this feature.
What you need is a platform-agnostic API for accessing Hangouts data. The Hangouts API is instead intended to solve a different problem: it allows you to write HTML/JavaScript applications that run inside the canvas of hangouts running in desktop web browsers.
One possible "workaround" that I'm currently investigating myself:
publish the hangout "on air"
get the youtube live id (available as of 2012-08-22, roughly... since Hangout API 1.2) ~ gapi.hangout.onair.getYouTubeLiveId() https://developers.google.com/+/hangouts/api/gapi.hangout.onair#gapi.hangout.onair.getYouTubeLiveId (note that this can only be grabbed by the host?)
grab http://www.youtube.com/watch?v=${LIVEID} // suggestion: look at youtube-dl: http://rg3.github.com/youtube-dl/documentation.html
and then use ffmpeg to process the flv
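The last two steps could be scripted roughly like this. This is an untested sketch that only builds the command lines: it assumes youtube-dl and ffmpeg are installed and on the PATH, the file names are placeholders, and converting to 16 kHz mono WAV is my assumption about what a speech recognizer would want as input, not something the Hangouts API requires.

```python
import urllib.parse


def watch_url(live_id):
    """Build the YouTube watch URL for a live ID from getYouTubeLiveId()."""
    return "http://www.youtube.com/watch?" + urllib.parse.urlencode({"v": live_id})


def download_command(live_id, out_path="hangout.flv"):
    """youtube-dl invocation that saves the stream to out_path (placeholder name)."""
    return ["youtube-dl", "-o", out_path, watch_url(live_id)]


def extract_audio_command(in_path="hangout.flv", out_path="hangout.wav"):
    """ffmpeg invocation: drop video (-vn), mix to mono (-ac 1), resample to 16 kHz (-ar)."""
    return ["ffmpeg", "-i", in_path, "-vn", "-ac", "1", "-ar", "16000", out_path]
```

Each list can be passed to subprocess.run(cmd, check=True) to actually execute the step.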
Information for this answer was primarily grabbed from "Downloading videos in flv format from youtube" and http://grokbase.com/t/gg/google-plus-developers/128fbteedb/google-hangout-api-url-to-youtube-stream-screenshot-and-hangout-topic
I have a client who needs an Android app that can recognize spoken commands. From what I understand, the built-in voice-to-text functionality actually sends data to Google's servers, which then send back a text translation. This is a major problem, as the voice data is extremely sensitive (unless the data is encrypted when it is sent to and from Google, but I doubt it is).
There are 2 options that I can think of. First is to convert speech-to-text on the Android, though this seems like it would be an extremely expensive operation. The second possibility is to have a local server convert the data for me (I could encrypt the voice data and the translation when it is being sent to and from). Is this something CMU Sphinx could pull off? It may be worth noting that I will also have access to an Asterisk server, which could possibly assist with this (I don't know).
In reality, there should only be ~200 words which will need to be recognized. I would prefer opensource/free software solutions however I am also open to a commercial solution (perhaps FlexT9). Ideally, I can send the audio stream somewhere, get back a String which is the text, and I can then parse and do other things with the String.
I haven't done much android or any speech recognition development in the past, so I'm hoping someone can at least point me in the right direction. Thanks!
CMUSphinx is an open source speech recognition toolkit you can use to build your application. It contains tools, libraries and data which will enable you to build a speech application. You can learn more about CMUSphinx on the website above.
On Android you have several options to use CMUSphinx:
Recognize audio on the device. For that you can compile the Pocketsphinx engine for Android. For details see this blog post.
Recognize audio on a server. As a server you can use either Pocketsphinx or Sphinx4. You can send audio in compressed FLAC format, or extract speech recognition features on the device and send the feature stream to the server.
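For the server option, the client side can be as simple as one HTTP POST carrying the FLAC data. Below is a rough sketch of that. The endpoint URL, the function names and the assumption that the server replies with plain UTF-8 text are all hypothetical, since how the server exposes Pocketsphinx or Sphinx4 is up to you. It is written in Python for brevity; an Android client would do the equivalent in Java.

```python
import urllib.request

# Hypothetical endpoint: replace with wherever your own recognition server listens.
SERVER_URL = "http://localhost:8080/recognize"


def build_upload_request(flac_bytes, url=SERVER_URL):
    """Build (but do not send) a POST request carrying FLAC-encoded audio."""
    return urllib.request.Request(
        url,
        data=flac_bytes,
        headers={"Content-Type": "audio/flac"},
        method="POST",
    )


def recognize(flac_bytes, url=SERVER_URL):
    """Send the audio and return the reply as text (assumes a UTF-8 text response)."""
    with urllib.request.urlopen(build_upload_request(flac_bytes, url)) as response:
        return response.read().decode("utf-8")
```

Given the sensitivity of the audio, the URL would normally be an https one in a real deployment.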
CMUSphinx provides several acoustic models that enable you to recognize audio in a number of languages, such as English, French, Mandarin, German, Dutch and Russian.
You can also improve the recognition result with adaptation tools.
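With only a couple of hundred words to recognize, it may also be worth restricting the recognizer to a fixed set of phrases. As far as I know, Pocketsphinx can load grammars written in JSGF, so here is a small sketch that generates such a grammar from a list of commands. The function name, the grammar name and the example commands are placeholders, and how the resulting file is handed to the decoder depends on the Pocketsphinx version and bindings you use.

```python
def jsgf_grammar(name, commands):
    """Return JSGF text with one public rule that accepts any of the given phrases."""
    if not commands:
        raise ValueError("at least one command is required")
    alternatives = " | ".join(commands)
    return "#JSGF V1.0;\ngrammar {0};\npublic <command> = {1};\n".format(name, alternatives)


# Placeholder commands; the real list would hold the application's own phrases.
grammar_text = jsgf_grammar("commands", ["call home", "hang up"])
```

The text can be written to a .gram file and given to the decoder in place of a general language model.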
If you have any questions on CMUSphinx you are welcome to ask in our community forums.
The Microsoft speech engines are closed source, but free. For some background see What is the difference between System.Speech.Recognition and Microsoft.Speech.Recognition?. For some more background you can try https://stackoverflow.com/a/4217638/90236
The complete SDK for the Microsoft Server Speech Platform 11 is available at http://www.microsoft.com/download/en/details.aspx?id=27226. The speech engine is a free download.
I am trying to get continuous voice input to work in my Android application. I tried using the built-in SpeechRecognizer Intent but it waits for the user to finish speaking before processing the words. This is not sufficient for me. I need the device to process the words while the user is still speaking.
I read that this is supported in Ice Cream Sandwich now. However, I did not find any API that allows me to access this feature. Does anyone know how this works now?
Thanks for your help!
I guess you heard about the new voice typing feature of Android 4.0. Take a look at this article.
You have to use an external library for it. The article says the library is designed for IME developers, though, and as far as I can see the result of voice recognition will appear in a registered IME through InputMethodService. You can also check the source of the library, because it is a project on Google Code.
I want to see the source code for the voice-enabled keyboard feature for Android.
Can someone tell me where to find the code?
I assume you're referring to the speech recognition feature demonstrated on the Nexus One with Android 2.1.
If this application is open sourced as part of Android, it will be posted on the Android Open Source Project website at https://android.googlesource.com.
However, Android 2.1 has not yet been posted; it should hopefully be available soon.
In the meantime, you could take a look at the source to the voice dialler application.
As far as I know this code is not currently planned to be open sourced -- it is owned by Google as part of their voice recognition server technology. The IME is a fork that Google made of the standard platform input method, adding voice search to it, much like other manufacturers make their own proprietary customizations.