Continuous Speech Recognition Android - Without Gaps

I have an activity that implements RecognitionListener. To make recognition continuous, I start the listener again every time onEndOfSpeech() is called:
speech.startListening(recognizerIntent);
But it takes some time (around half a second) until listening resumes, so there is a half-second gap in which nothing is listening, and I miss words spoken during that window.
On the other hand, when I use Google's voice input to dictate messages instead of the keyboard, this gap does not exist. Meaning: there is a solution.
What is it?
Thanks
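For context, the restart pattern described in the question looks roughly like this (a minimal sketch; the speech and recognizerIntent names mirror the question, and the app needs the RECORD_AUDIO permission):

import android.app.Activity;
import android.content.Intent;
import android.os.Bundle;
import android.speech.RecognitionListener;
import android.speech.RecognizerIntent;
import android.speech.SpeechRecognizer;

public class ListenActivity extends Activity implements RecognitionListener {
    private SpeechRecognizer speech;
    private Intent recognizerIntent;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        speech = SpeechRecognizer.createSpeechRecognizer(this);
        speech.setRecognitionListener(this);
        recognizerIntent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
        recognizerIntent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
        speech.startListening(recognizerIntent);
    }

    @Override
    public void onEndOfSpeech() {
        // Restart immediately -- anything said during the ~0.5 s
        // re-initialization is lost, which is the gap in question.
        speech.startListening(recognizerIntent);
    }

    @Override public void onResults(Bundle results) { /* handle matches */ }
    @Override public void onError(int error) { speech.startListening(recognizerIntent); }
    @Override public void onReadyForSpeech(Bundle params) {}
    @Override public void onBeginningOfSpeech() {}
    @Override public void onRmsChanged(float rmsdB) {}
    @Override public void onBufferReceived(byte[] buffer) {}
    @Override public void onPartialResults(Bundle partialResults) {}
    @Override public void onEvent(int eventType, Bundle params) {}
}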

I recommend using CMUSphinx to recognize speech continuously. To achieve continuous recognition with the Google speech recognition API, you would have to resort to a restart loop in a background service, which takes too many resources and drains the device battery.
On the other hand, Pocketsphinx works really well. It's fast enough to spot a key phrase and recognize voice commands behind the lock screen without users touching their device. And it does all this offline. You can try the demo.
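A minimal keyphrase-spotting setup with pocketsphinx-android, adapted from that demo (the model and dictionary file names match the demo's synced assets and may differ in your copy; the enclosing class is assumed to implement edu.cmu.pocketsphinx.RecognitionListener):

import java.io.File;
import java.io.IOException;

import edu.cmu.pocketsphinx.SpeechRecognizer;
import edu.cmu.pocketsphinx.SpeechRecognizerSetup;

private static final String KWS_SEARCH = "wakeup";
private static final String KEYPHRASE = "oh mighty computer"; // the demo's phrase

private SpeechRecognizer recognizer;

private void setupRecognizer(File assetsDir) throws IOException {
    recognizer = SpeechRecognizerSetup.defaultSetup()
            .setAcousticModel(new File(assetsDir, "en-us-ptm"))
            .setDictionary(new File(assetsDir, "cmudict-en-us.dict"))
            .setKeywordThreshold(1e-45f) // lower => fewer false alarms
            .getRecognizer();
    recognizer.addListener(this);
    recognizer.addKeyphraseSearch(KWS_SEARCH, KEYPHRASE);
    recognizer.startListening(KWS_SEARCH); // spots the phrase fully offline
}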
If you really want to use Google's API, see this.

Try looking at a couple of other APIs:
The speech demo: it has source here, is discussed here, and can be operated from the CLI here.
You could use the full-duplex Google API (its rate is capped at 50 requests per day); a sketch of the related one-shot v2 endpoint follows this list.
Or, if you like that general idea, check IBM's Watson, discussed here.
IMO it's more complex, but not capped.
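As a rough illustration of that family of unofficial Google endpoints, here is a sketch against the one-shot v2 REST endpoint (the full-duplex variant streams over paired up/down connections instead; the endpoint, parameters, and the 50-request/day key quota are unofficial and can change or disappear; API_KEY is a placeholder for a key obtained via the Chromium dev group):

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Scanner;

public class SpeechApiV2Sketch {
    public static void main(String[] args) throws Exception {
        byte[] flac = Files.readAllBytes(Paths.get("utterance.flac")); // 16 kHz FLAC
        URL url = new URL("https://www.google.com/speech-api/v2/recognize"
                + "?output=json&lang=en-us&key=API_KEY"); // API_KEY: placeholder
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "audio/x-flac; rate=16000");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(flac);
        }
        try (InputStream in = conn.getInputStream();
             Scanner s = new Scanner(in).useDelimiter("\\A")) {
            // The service returns newline-separated JSON hypotheses.
            System.out.println(s.hasNext() ? s.next() : "");
        }
    }
}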

There are options like:
intent.putExtra(RecognizerIntent.EXTRA_SPEECH_INPUT_COMPLETE_SILENCE_LENGTH_MILLIS, 2000); // ms of silence before input is considered complete
or
intent.putExtra(RecognizerIntent.EXTRA_SPEECH_INPUT_POSSIBLY_COMPLETE_SILENCE_LENGTH_MILLIS, 2000);
These ceased to work on Jelly Bean and above, but they work on ICS and below; not sure whether that is intended or a bug!
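Putting it together, the extras go on the recognizer intent like this (a sketch; speech is the SpeechRecognizer from the question, and as noted the silence hints only take effect on ICS and below):

import android.content.Intent;
import android.speech.RecognizerIntent;

Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
        RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
// Ask the recognizer to wait ~2 s of silence before finishing.
intent.putExtra(RecognizerIntent.EXTRA_SPEECH_INPUT_COMPLETE_SILENCE_LENGTH_MILLIS, 2000);
intent.putExtra(RecognizerIntent.EXTRA_SPEECH_INPUT_POSSIBLY_COMPLETE_SILENCE_LENGTH_MILLIS, 2000);
speech.startListening(intent);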

Related

What is the most reliable TTS (Text-to-Speech) engine to use when developing for Android?

Our company is developing a full-featured app for blind people, which means it relies heavily on text-to-speech (TTS). We have noticed that the TTS voice simply stops speaking randomly. It usually works fine and we have no issues with speech, but once in a blue moon we get no voice output and the app doesn't know otherwise so it continues to work like usual, but without any voice. Users can still use the app for the most part, but they no longer hear the speech until the app is restarted and everything is reset.
Is there a reliable way to know if the voice fails to speak something?
I already use an utterance-completed listener to handle certain scenarios, but that makes no difference when the TTS simply doesn't output the speech. It's as if the voice "thinks" it said it, but we never hear it.
Is there an event we can capture that would be fired when the TTS engine tries to say something but fails?
In my experience, at the time of writing, the most reliable TTS engine is Google's own. This, of course, is a matter of opinion and not something Stack Overflow encourages. Currently, Google TTS is the only one that uses the latest Voice API correctly, whereas other engines crash, fail, or simply report incorrectly.
It's unfortunately all too common that a TTS Engine will believe it has spoken the utterance correctly, but hasn't.
To combat this, when I request speech, I schedule a Runnable that checks whether onDone() has been called within a period proportional to the length of the utterance. I also check that onDone() is not called too quickly, which would suggest the speech failed, unless the request was deliberately one of silence.
Those two checks let me show a toast to the user if there's an issue. Given that you are dealing with blind users, you will obviously have to find another way to communicate the problem; perhaps a series of short vibrations could denote an issue?
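A sketch of that watchdog idea (the timing constants and the words-per-minute estimate are illustrative, not part of any API; the listener must be registered with tts.setOnUtteranceProgressListener before speaking):

import android.os.Handler;
import android.os.Looper;
import android.speech.tts.TextToSpeech;
import android.speech.tts.UtteranceProgressListener;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SpeechWatchdog extends UtteranceProgressListener {

    private static final long MIN_PLAUSIBLE_MS = 300; // illustrative lower bound

    private final Handler handler = new Handler(Looper.getMainLooper());
    private final Map<String, Long> startTimes = new ConcurrentHashMap<>();

    public void speak(TextToSpeech tts, String text, String utteranceId) {
        startTimes.put(utteranceId, System.currentTimeMillis());
        tts.speak(text, TextToSpeech.QUEUE_ADD, null, utteranceId);

        // Rough upper bound from the text length: ~180 wpm plus a margin.
        long expectedMs = (text.split("\\s+").length * 60_000L / 180) + 3_000;
        handler.postDelayed(() -> {
            if (startTimes.containsKey(utteranceId)) {
                onSpeechFailure(utteranceId); // onDone() never arrived in time
            }
        }, expectedMs);
    }

    @Override public void onStart(String utteranceId) {}

    @Override
    public void onDone(String utteranceId) {
        Long started = startTimes.remove(utteranceId);
        if (started != null && System.currentTimeMillis() - started < MIN_PLAUSIBLE_MS) {
            // Finished implausibly fast for a non-silent request:
            // the engine probably skipped the utterance.
            onSpeechFailure(utteranceId);
        }
    }

    @Override public void onError(String utteranceId) { onSpeechFailure(utteranceId); }

    private void onSpeechFailure(String utteranceId) {
        // Notify the user non-visually here, e.g. a short vibration pattern.
    }
}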
Hope that helps.

Background service for voice command

I need a background service that launches my application by voice command, even when the screen is locked. For example, when I say "start", the screen should unlock and the application should launch automatically. I tried to make this code work: https://github.com/gast-lib/gast-lib/blob/master/library/src/root/gast/speech/activation/SpeechActivationService.java
but I don't know how to use it, or how to wire the service to the activity.
I recommend using CMUSphinx to recognize speech continuously. To achieve continuous recognition with the Google speech recognition API, you would have to resort to a restart loop in a background service, which takes too many resources and drains the device battery.
On the other hand, Pocketsphinx works really well. It's fast enough to spot a key phrase and recognize voice commands behind the lock screen without users touching their device. And it does all this offline. You can try the demo.
If you really want to use Google's API, see this, as I've demonstrated above.
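A rough sketch of the wiring, under the assumption that the service (like the linked SpeechActivationService) launches an activity once it hears the trigger word; MainActivity and onTriggerWordHeard are placeholder names:

import android.content.Intent;
import android.os.Bundle;
import android.view.WindowManager;

// Inside the speech-activation service, once the trigger word is recognized:
void onTriggerWordHeard() {
    Intent launch = new Intent(this, MainActivity.class);
    launch.addFlags(Intent.FLAG_ACTIVITY_NEW_TASK); // required when starting from a service
    startActivity(launch);
}

// In MainActivity.onCreate(), show the activity over the lock screen:
@Override
protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    getWindow().addFlags(WindowManager.LayoutParams.FLAG_SHOW_WHEN_LOCKED
            | WindowManager.LayoutParams.FLAG_TURN_SCREEN_ON
            | WindowManager.LayoutParams.FLAG_DISMISS_KEYGUARD);
}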

Real-time call transcription on Android

I am an Android developer living with hearing impairment, and I am currently exploring the option of making a speech-to-text app with the SpeechRecognizer API in Android. Closed-captioning telephones and InnoCaption are not available in my home country. A potential application would be captioning during telephone calls.
https://developer.android.com/reference/android/speech/SpeechRecognizer.html
The API is meant for capturing voice commands, not for real-time live transcription. I am even able to implement it as a service, but I constantly need to restart it after it has delivered a result or a partial result, which is not feasible in a conversational setting (words get lost while the service is restarting).
Do note that I don't need 100% accuracy for this app. Many hearing-impaired people find it helpful to have some context of the conversation to help them along, so I don't need comments about how this is not going to be accurate.
Is there a way to implement SpeechRecognizer in a continuous mode? I can create a TextView that constantly updates itself when new text is returned from the service. If this API is not what I should be looking at, is there any recommendation? I tested CMUSphinx, but found it so dependent on blocks of phrases/sentences that it is not likely to work for the kind of application I have in mind.
I am a deaf software developer, so I can chime in. I've been monitoring the state of the art of speech-to-text APIs, and they have now become good enough to provide operatorless relay/captioning services for CERTAIN kinds of phone conversations with people using a telephone in quiet settings. For example, I get 98% transcription accuracy with my spouse's voice using Apple Siri's realtime transcription (iOS 8).
I was able to jury-rig phone captioning by routing the sound out of one phone into a second iPhone on which I press the microphone button (on the pop-up keyboard), and I successfully captioned a telephone conversation with ~95% accuracy at 250 words per minute (faster than Sprint Captioned Telephone and Hamilton Captioned Telephone), at least until the 1-minute cutoff time.
Thus, I declare computer-based voice recognition practical for phone calls with family members (the kind you call frequently in quiet environments), where you can at least coach them to move to a quiet place so that captioning works properly (with >95% accuracy). Since iOS 8 was released, we REALLY need this, so we don't have to rely on relay operators or captioned telephones. Sprint Captioned Telephone lags badly during fast speech, while Apple Siri keeps up, so I can conduct more natural telephone conversations with my jury-rigged two-iOS-device Apple Siri "realtime captioned telephone" setup.
Some cellphones transmit audio in higher definition, so this works well between two iPhones (one iPhone's speaker piped into the other iPhone's Siri running in iOS 8 continuous mode). That assumes you're on G.722.2 (AMR-WB), e.g. when running two iPhones on the same carrier that supports the high-definition audio telephony standard. Piped through Siri it works roughly as well as speaking directly in front of the phone, for the same human voice (assuming the other end is speaking into the phone in a quiet environment).
Google and Apple need to open up their speech-to-text APIs to assistive applications, pronto, because operatorless telephone transcription is finally practical, at least when calling family members (good voices, coached to be in a quiet environment when receiving the call). The continuous-recognition time limit also needs to be removed in this situation.
Google's recognizer is not going to work well with telephone-quality audio anyway; you need to build a captioning service with CMUSphinx yourself.
You probably didn't configure CMUSphinx properly; it should be fine for large-vocabulary transcription. The one thing you should take care of is to use the telephony 8 kHz acoustic model (not the wideband model) together with a generic language model.
For the best accuracy it's probably worth moving processing to a server; you can set up a PBX to handle the calls and transcribe the audio there, instead of hoping to do everything on a limited device.
It is true that the SpeechRecognizer API documentation claims:
"The implementation of this API is likely to stream audio to remote servers to perform speech recognition. As such this API is not intended to be used for continuous recognition, which would consume a significant amount of battery and bandwidth."
This bit of text was added a year ago (https://android.googlesource.com/platform/frameworks/base/+/2921cee3048f7e64ba6645d50a1c1705ef9658f8), but no changes were made to the API at the time; it remained the same. Also, I don't really see anything specific to networking and battery drain in the API documentation itself. So, go ahead and implement a recognizer (maybe based on CMUSphinx) and make it accessible via this API.
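For reference, a custom engine plugs in by subclassing android.speech.RecognitionService; a skeleton might look like this (the class name is a placeholder, and the engine wiring inside the callbacks is left out):

import android.content.Intent;
import android.speech.RecognitionService;

// Register this service in the manifest with the
// android.speech.RecognitionService intent filter so apps can select it.
public class SphinxRecognitionService extends RecognitionService {

    @Override
    protected void onStartListening(Intent recognizerIntent, Callback listener) {
        // Start decoding here (e.g. feed mic audio to a CMUSphinx decoder) and
        // report hypotheses via listener.partialResults(...) / listener.results(...).
    }

    @Override
    protected void onStopListening(Callback listener) {
        // Stop capturing audio, but finish decoding what was already buffered.
    }

    @Override
    protected void onCancel(Callback listener) {
        // Abort recognition and release the microphone.
    }
}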

Voice Activity Detection in Android

I am writing an application that will behave similarly to the existing voice recognition, but will send the sound data to a proprietary web service to perform the speech-recognition part. I am using the standard MediaRecorder (AMR-NB encoded), which seems well suited to speech recognition. The only data it provides is the amplitude, via the getMaxAmplitude() method.
I am trying to detect when the person starts talking, so that when the person stops talking for about 2 seconds I can proceed to send the sound data to the web service. Right now I am using an amplitude threshold: if it goes over a value (e.g. 1500), I assume the person is speaking. My concern is that amplitude levels may vary by device (e.g. Nexus One vs. Droid), so I am looking for a more standard approach that can be derived from the amplitude values.
P.S.
I looked at graphing-amplitude but it doesn't provide a way to do it with just the amplitude.
Well, this might not be much help, but how about starting by measuring the ambient noise picked up by the device's microphone, and applying the threshold dynamically based on that? That way you would make it adaptable to different devices' microphones and to the environment the user is in at any given time.
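A sketch of that calibration idea, assuming the MediaRecorder from the question is already recording (the sample count, interval, and the 2x multiplier are arbitrary starting points to tune per device):

import android.media.MediaRecorder;

// Adaptive threshold around a measured noise floor, instead of a fixed 1500.
public class AdaptiveSpeechDetector {
    private final MediaRecorder recorder;
    private double noiseFloor;

    public AdaptiveSpeechDetector(MediaRecorder recorder) {
        this.recorder = recorder;
    }

    // Call while the user is silent (e.g. 20 samples, 50 ms apart = ~1 s)
    // to measure the ambient level of this particular device/environment.
    public void calibrate(int samples, long intervalMs) throws InterruptedException {
        long sum = 0;
        for (int i = 0; i < samples; i++) {
            sum += recorder.getMaxAmplitude(); // max amplitude since the last call
            Thread.sleep(intervalMs);
        }
        noiseFloor = (double) sum / samples;
    }

    // "Speaking" = well above the measured ambient level.
    public boolean isSpeaking() {
        return recorder.getMaxAmplitude() > noiseFloor * 2.0;
    }
}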
1500 is too low a number. Measuring the change in amplitude will work better.
However, it will still produce missed detections.
I fear the only way to solve this problem is to figure out how to recognize a simple word or tone, rather than simply detecting noise.
There are now multiple VAD libraries designed for Android. One of these is:
https://github.com/gkonovalov/android-vad
Most smartphones come with a proximity sensor, and Android has an API for using it. This would be adequate for the job you described: when the user moves the phone near their ear, the app can start recording. It should be easy enough.
Sensor class for android
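A minimal sketch of that proximity trigger (the start/stop recording hooks are placeholders):

import android.hardware.Sensor;
import android.hardware.SensorEvent;
import android.hardware.SensorEventListener;
import android.hardware.SensorManager;

public class ProximityRecorderTrigger implements SensorEventListener {
    private final SensorManager sensorManager;
    private final Sensor proximity;

    public ProximityRecorderTrigger(SensorManager sensorManager) {
        this.sensorManager = sensorManager;
        this.proximity = sensorManager.getDefaultSensor(Sensor.TYPE_PROXIMITY);
    }

    public void start() {
        sensorManager.registerListener(this, proximity, SensorManager.SENSOR_DELAY_NORMAL);
    }

    public void stop() {
        sensorManager.unregisterListener(this);
    }

    @Override
    public void onSensorChanged(SensorEvent event) {
        // Most proximity sensors report either "near" (a small value) or max range.
        if (event.values[0] < proximity.getMaximumRange()) {
            // Phone is near the ear: start recording (placeholder).
        } else {
            // Phone moved away: stop recording (placeholder).
        }
    }

    @Override public void onAccuracyChanged(Sensor sensor, int accuracy) {}
}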

Voice recognition with Android

Hey guys, I was wondering whether it is possible to transcribe audio without having to launch a recognizer intent (i.e. the dialog that says you are recording audio). I want to be able to recover the results of the voice recognition every 2 to 3 seconds or so, and I plan to use this with a bunch of ListViews. Is this possible? If so, any ideas? Thanks!
Edit: I forgot to mention that I am playing around with android.speech.SpeechRecognizer, but so far, in my implementation of the RecognitionListener interface, all I have been able to get from DDMS is that there is a client-side error; nothing else seems to be called. Also, is it essential that I implement a RecognitionService? I know that the example in the API is just that. If so, how would I create and use this service? Thanks again.
Speech recognition does not work in the Emulator. You need a device.
I just posted a working code skeleton in another thread:
Voice Recognition Commands Android
The speech recognizer can be triggered every few seconds without UI. You might need to write your own code to decide when it is a good time to record and when it is not (you get an audio buffer you could peek through), or you could do something in your own UI.
I think you could re-trigger it over and over again. Not sure it'd work perfectly but worth a try.
It is impossible in Android < 2.1, and probably in 2.2 as well.
When I asked a Google support person, he said, "Maybe you can figure out what packets are being sent and then just make a direct web call."
Wow.
