I'm writing a game for Google Glass, but unfortunately the SpeechRecognizer API isn't available in the current builds of the Google Glass GDK.
So I've been thinking about implementing an algorithm for a very simple voice recognition.
Let's say I want to recognize only: "Yes" and "No".
Do you know of any example code or helpful resources to help me implement this?
Is it so hard that I should drop the idea and go with a big framework like CMUSphinx?
What about recognizing up, down, right, left, or the numbers from 1 to 10?
As far as I know, a common approach is to transform the signal into the frequency domain with a fast Fourier transform (FFT) and analyze it there. You also need a dictionary of spoken words to correlate frequencies against.
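If you want to try the roll-your-own route anyway, here is a minimal sketch of that idea; all class and method names are mine, purely illustrative. It computes a magnitude spectrum per recorded frame and correlates it against stored templates of "yes" and "no":

    // Illustrative sketch: FFT -> magnitude spectrum -> cosine similarity
    // against pre-recorded word templates. Frame length must be a power
    // of two. This ignores timing entirely, so treat it as a toy.
    public class SimpleSpectrumMatcher {

        // Naive recursive radix-2 FFT operating in place on re/im.
        static void fft(double[] re, double[] im) {
            int n = re.length;
            if (n == 1) return;
            double[] evenRe = new double[n / 2], evenIm = new double[n / 2];
            double[] oddRe = new double[n / 2], oddIm = new double[n / 2];
            for (int i = 0; i < n / 2; i++) {
                evenRe[i] = re[2 * i];     evenIm[i] = im[2 * i];
                oddRe[i]  = re[2 * i + 1]; oddIm[i]  = im[2 * i + 1];
            }
            fft(evenRe, evenIm);
            fft(oddRe, oddIm);
            for (int k = 0; k < n / 2; k++) {
                double ang = -2 * Math.PI * k / n;
                double wr = Math.cos(ang), wi = Math.sin(ang);
                double tr = wr * oddRe[k] - wi * oddIm[k];
                double ti = wr * oddIm[k] + wi * oddRe[k];
                re[k] = evenRe[k] + tr;         im[k] = evenIm[k] + ti;
                re[k + n / 2] = evenRe[k] - tr; im[k + n / 2] = evenIm[k] - ti;
            }
        }

        // Magnitude spectrum of one audio frame.
        static double[] magnitudeSpectrum(double[] frame) {
            double[] re = frame.clone();
            double[] im = new double[frame.length];
            fft(re, im);
            double[] mag = new double[frame.length / 2];
            for (int i = 0; i < mag.length; i++)
                mag[i] = Math.hypot(re[i], im[i]);
            return mag;
        }

        // Cosine similarity: closer to 1.0 means more alike.
        static double similarity(double[] a, double[] b) {
            double dot = 0, na = 0, nb = 0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                na += a[i] * a[i];
                nb += b[i] * b[i];
            }
            return dot / (Math.sqrt(na * nb) + 1e-12);
        }
    }

At runtime you would pick whichever stored template scores highest. For "yes"/"no" this can limp along; for digits and directions it degrades quickly, which is where the frameworks below earn their keep.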
Please see these links:
CMU Sphinx has a Java implementation.
David Wagner has a good article and a MATLAB implementation.
P.S. If you speak Russian, you might read this article: very simple, with Java examples.
P.P.S. Honestly, I have never used this framework, but if you have only superficial knowledge of speech recognition, the most robust and easiest way is to use an existing complete solution such as a framework or library; otherwise you will need to spend time reaching the necessary knowledge threshold. In that case, you can read this article.
I would like to implement offline voice recognition in my app, for two purposes:
For a small set of commands (play, stop, previous, next and a couple of others);
For a list of a few hundred bird names.
To implement (1), it seems to me a bad idea (slower and more resource-consuming) to use the full voice recognition power of Android. In my mind, it would be easier to tell my app to interpret only a few words; that is, to use my own dictionary, telling my app to "use only these 10 words".
To implement (2) is similar to (1), but with a few hundred instead of 10.
Does this make sense, and if so, is there an easy way to implement it? Is it worth it?
Thanks!
L.
You can implement your app using CMUSphinx on Android. The CMUSphinx tutorial is here:
http://cmusphinx.sourceforge.net/wiki/tutorial
The language models for recognizing a limited set of words are described here:
http://cmusphinx.sourceforge.net/wiki/tutoriallm
You can use keyword-spotting mode to recognize a few commands.
Pocketsphinx on Android is described here:
http://cmusphinx.sourceforge.net/wiki/tutorialandroid
The demonstration includes a way to switch recognition modes from 10 words to a few hundred words, as you intend.
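For context, a minimal sketch of how the pocketsphinx-android demo wires this up. Class and method names follow the demo at the time of writing and may differ between versions; the asset file names (model directory, .list, .gram) are illustrative:

    // One keyword-spotting search for the small command set, and a
    // grammar search constrained to the bird names.
    import java.io.File;
    import java.io.IOException;
    import edu.cmu.pocketsphinx.SpeechRecognizer;
    import edu.cmu.pocketsphinx.SpeechRecognizerSetup;

    public class CommandRecognizer {
        static final String COMMANDS_SEARCH = "commands"; // ~10 words
        static final String BIRDS_SEARCH = "birds";       // few hundred names

        public static SpeechRecognizer create(File assetsDir) throws IOException {
            SpeechRecognizer recognizer = SpeechRecognizerSetup.defaultSetup()
                    .setAcousticModel(new File(assetsDir, "en-us-ptm"))
                    .setDictionary(new File(assetsDir, "cmudict-en-us.dict"))
                    .getRecognizer();

            // Keyword spotting: one phrase per line with a detection
            // threshold, e.g. "play /1e-20/". Good for the command set.
            recognizer.addKeywordSearch(COMMANDS_SEARCH,
                    new File(assetsDir, "commands.list"));

            // A JSGF grammar (or a small language model) constrains the
            // second mode to the list of bird names.
            recognizer.addGrammarSearch(BIRDS_SEARCH,
                    new File(assetsDir, "birds.gram"));

            return recognizer;
        }
    }

Switching modes is then a matter of calling recognizer.stopListening() followed by recognizer.startListening(BIRDS_SEARCH).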
I have seen a lot of questions on this topic and read a lot of articles, but I still can't find the best solution for what I am looking for.
I want to build an app (Android/iOS/...whatever) which has this feature:
When the user writes down a text (using the keyboard), the app can recognize his speech against what he wrote with 99.9% accuracy. I don't mind if he has to record his voice first to improve accuracy... I want it to be "live" like Google's services, unlike Siri, which writes the text only after you finish talking.
I have found this site:
http://cmusphinx.sourceforge.net
and I wish to start working with it, but before starting I wanted to make sure it is the best way.
Can anyone give some advice?
Thanks.
*Edit: I don't mind building a new model for a new language if needed (it's not in English).
I mean, if you do some research you'll see that 99% accuracy in speech-to-text is only a very recent thing; an example is Nuance's Dragon.
High-accuracy speech-to-text can cost around $600 for a license. It's not an easy thing to create; you have to pay for high-accuracy speech-to-text libraries.
For what you're doing, though, a really good service I have used is Wit.ai. It's very accurate, and it's getting faster every week.
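If it helps, here is a minimal sketch of calling Wit.ai's speech endpoint over HTTPS. The endpoint, headers, and response shape reflect Wit.ai's docs as I remember them and may have changed, so verify against the current documentation:

    // Hedged sketch: POST a .wav file to Wit.ai's speech endpoint and
    // get back JSON containing the transcription.
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class WitSpeechClient {
        public static String transcribe(String wavPath, String accessToken) throws Exception {
            HttpURLConnection conn = (HttpURLConnection)
                    new URL("https://api.wit.ai/speech").openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Authorization", "Bearer " + accessToken);
            conn.setRequestProperty("Content-Type", "audio/wav");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(Files.readAllBytes(Paths.get(wavPath)));
            }
            try (InputStream in = conn.getInputStream()) {
                return new String(in.readAllBytes()); // JSON with the recognized text
            }
        }
    }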
Another possibility for you might be the AT&T speech engine (Watson), found here: http://developer.att.com/
They offer 1 million API calls per month for a low fee and allow you to customize the "library" you use to recognize speech. It might be what you are looking for, given your latest statements. You can try it for free, though it is throttled until you pay.
I want to record a dog bark, save the file, and compare it with several files containing different types of bark (warning bark, crying bark, etc.).
How could I do that comparison in order to get a match? What is the process to follow in this type of app?
Thank you for the tips.
There is no simple answer to your problem. However, for starters, you might look into how audio fingerprinting works. This paper, written by the creators of Shazam, is an excellent start:
http://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf
I'm not sure how well that approach would work for dog barking, but there are some concepts there that might prove useful.
Another thing to look into is how the FFT works. Here's a tutorial with code that I wrote for pitch tracking, which is one way to use the FFT. You are looking more at how the tone and pitch interact with the formant structure of a given dog, so parameters you'll want to derive might include the fundamental pitch (which, alone, might be enough to distinguish whining from other kinds of barks) and the ratio of the fundamental pitch to higher harmonics, which would help identify how aggressive the bark is (I'm guessing a bit here):
http://blog.bjornroche.com/2012/07/frequency-detection-using-fft-aka-pitch.html
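To make those parameters concrete, here is a hypothetical sketch (nothing here comes from the linked tutorial): take the strongest spectral peak as the fundamental and compare its magnitude to its first few harmonics. Real pitch trackers are considerably more robust than this:

    public class BarkFeatures {

        // Index of the strongest bin in a magnitude spectrum (skipping DC).
        static int peakBin(double[] mag) {
            int peak = 1;
            for (int i = 2; i < mag.length; i++)
                if (mag[i] > mag[peak]) peak = i;
            return peak;
        }

        // Convert a bin index to Hz: bin * sampleRate / fftSize.
        static double binToHz(int bin, double sampleRate, int fftSize) {
            return bin * sampleRate / fftSize;
        }

        // Ratio of fundamental energy to its 2nd..4th harmonics; a rough
        // proxy for how tonal (whiny) versus harsh (aggressive) a bark is.
        static double harmonicRatio(double[] mag) {
            int f0 = peakBin(mag);
            double harmonics = 1e-12; // avoid division by zero
            for (int h = 2; h <= 4 && h * f0 < mag.length; h++)
                harmonics += mag[h * f0];
            return mag[f0] / harmonics;
        }
    }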
Finally, you might want to do some research into basic speech recognition and speech processing, as there will be some overlap. Wikipedia will probably be enough to get you started.
EDIT: Oh, also: once you've identified some parameters to use for comparison, you'll need a way to compare the multiple parameters of your recording against your database of sounds with multiple parameters. I don't think the techniques in the Shazam article will work here. One thing you could try is logistic regression. There are other options, but this is probably the simplest.
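As a sketch of the logistic regression idea (the features and training setup here are purely illustrative): each bark becomes a small feature vector, and the model learns a weighted combination that maps to a probability:

    // Minimal logistic regression trained with gradient descent, as one
    // way to combine derived parameters (pitch, harmonic ratio, ...)
    // into a yes/no decision such as "warning bark or not".
    public class BarkClassifier {
        double[] w; // one weight per feature
        double b;   // bias

        void train(double[][] x, int[] y, int epochs, double lr) {
            w = new double[x[0].length];
            for (int e = 0; e < epochs; e++) {
                for (int i = 0; i < x.length; i++) {
                    double err = y[i] - predict(x[i]); // gradient of log-loss
                    for (int j = 0; j < w.length; j++) w[j] += lr * err * x[i][j];
                    b += lr * err;
                }
            }
        }

        double predict(double[] features) {
            double z = b;
            for (int j = 0; j < w.length; j++) z += w[j] * features[j];
            return 1.0 / (1.0 + Math.exp(-z)); // sigmoid -> probability
        }
    }

Train it on feature vectors from barks you have labeled by hand; predict() then returns the probability that a new bark belongs to the positive class.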
I'd check out the open-source musicg API, hosted on Google Code: http://code.google.com/p/musicg/
It's Java, so it works on Android, and it gives similarity metrics for two audio files.
But it's compatible only with .wav files.
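For example, the comparison itself is only a few lines. The method names below are taken from musicg's published examples, so verify them against the version you download:

    import com.musicg.wave.Wave;
    import com.musicg.fingerprint.FingerprintSimilarity;

    public class BarkCompare {
        public static void main(String[] args) {
            // File names are illustrative; inputs must be .wav.
            Wave recorded = new Wave("recorded_bark.wav");
            Wave template = new Wave("warning_bark.wav");
            FingerprintSimilarity similarity = recorded.getFingerprintSimilarity(template);
            System.out.println("score: " + similarity.getScore());
            System.out.println("similarity: " + similarity.getSimilarity());
        }
    }

You would run this against each template bark type and keep the best-scoring match.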
Is it possible to restrict the voice search widget to look for a match near a given set of words? For example, if I am using it to search over a list of names, it's not meaningful, as names are often "corrected" to ordinary words.
If you use third-party Android recognition from Nuance (the people behind DragonDictate), it supports a "grammar mode" in which you can somewhat restrict the phrases that will be recognized.
Importantly, if you add unusual names to a Custom Vocabulary, they SHOULD become recognizable (complex pronunciation issues aside).
You can find information if you dig through:
http://dragonmobile.nuancemobiledeveloper.com,
looking for 'Custom Vocabularies'. Grammar mode is essentially a special mode of custom vocabularies.
At the time of writing, there was a document here that made some mention of grammar mode:
http://dragonmobile.nuancemobiledeveloper.com/downloads/custom_vocabulary/Guide_to_Custom_Vocabularies_v1.5.pdf - It only really becomes clear when you try to progress in their provisioning web GUI.
You have to set up an account, and jump through other hoops, but there is a free tier. This is the only potential way I have found to constrain a recognition vocabulary.
Well, short of spinning up PocketSphinx, but that is still described as "research-grade" pre-alpha.
No, I don't work for Nuance. I'm not sure anyone does; they may have all been eaten by zombies. You would guess as much from reading their support forums. They never reply.
No, unfortunately this is not possible.
You could also look at recognition from AT&T.
They have a very feature-rich web API, including full grammar support. I only found out about it recently!
1,000,000 transactions per month for free. Generous!
Look for 'AT&T API Program'. Weird name.
Link at time of writing:
http://developer.att.com/apis/speech
Unfortunately for me, no Australian accent language models at the time of writing. Only US and UK English. Boooo.
EDIT: Some months after I wrote the above, AT&T retired the service mentioned. It seems everyone just wants a "dumbed down" API where you simply call a recognizer and it returns words. Sure, that is of course the holy grail, but a properly designed, constrained grammar will generally work better. As someone with speech skills, I find the minimalism of today's common speech APIs really frustrating...
I have seen several posts here about cocos2d-android, and my ambition to learn more about it led me to cocos2d-android-1 and a good example.
My analysis cannot find any significant benefit of using cocos2d instead of the usual 2D approach of SurfaceView and SurfaceHolder.Callback.
I would be thankful if anybody with expertise in cocos2d-android could explain the benefits of using it instead of the usual gaming approach.
Just by clicking on links, starting from the ones in the OP, I came across http://dan.clarke.name/2011/04/how-to-make-a-simple-android-game-with-cocos2d/ - which states the obvious answer you are looking for.
First of all, this is a 2D game engine. All of the physics and similar effects are just there for you; no need to re-implement them from scratch.
Secondly, this is actually a port of the iPhone game library of the same name - great news if you also plan on porting to iPhone. And thirdly, it is open source, meaning you can tweak anything accordingly.
I have noticed, however, that this is a pure Java library, so do not expect amazing performance. If performance is critical, google for something NDK-based, not SDK-based. I could not advise here, as gaming is not my thing.
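To illustrate the difference, here is roughly the shape of what that tutorial builds. Package and class names are from cocos2d-android-1 and may differ between forks; the sprite asset name is made up. Note that there is no hand-written draw loop or SurfaceHolder callback:

    import org.cocos2d.layers.CCLayer;
    import org.cocos2d.layers.CCScene;
    import org.cocos2d.nodes.CCSprite;
    import org.cocos2d.actions.interval.CCMoveTo;
    import org.cocos2d.types.CGPoint;

    public class GameLayer extends CCLayer {
        public static CCScene scene() {
            CCScene scene = CCScene.node();
            scene.addChild(new GameLayer());
            return scene;
        }

        protected GameLayer() {
            CCSprite player = CCSprite.sprite("player.png"); // illustrative asset
            player.setPosition(CGPoint.ccp(100, 100));
            addChild(player);
            // The engine runs the animation loop; no manual Canvas drawing.
            player.runAction(CCMoveTo.action(2.0f, CGPoint.ccp(300, 200)));
        }
    }

With a raw SurfaceView you would be writing the timing loop, the invalidation, and the interpolation for that move yourself.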