Optimizing Android's built-in text-to-speech

I've been using Android text-to-speech and I have it working well, but the voice never sounds as good as I would like it to. I understand it's never going to sound "natural", but does anyone have any suggestions, or know of any ways, to make it sound more human or at least more understandable?
I know you can adjust the pitch, rate, and language. Are there any other adjustments that can be made? Or even a better text-to-speech API that sounds better?

No, you can't do much to improve TTS quality. Even changing the pitch is risky: if you set it too low, the TTS sounds terrible.
The only way to get a better voice would be to use a non-Google service such as iSpeech.

Android text-to-speech supports multiple voices; the ones installed on your phone will vary depending on the model. One of my Nexus phones had excellent voices, but every (non-Nexus) phone since then has had poor ones. You can change the active voice and download more voices in Settings -> Language & [varies] -> Text-to-speech output. Google's range seems to be limited to poor-quality voices at the moment. You can also download paid voices and/or engines from the likes of SVOX, Pico, and Ivona.
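If you stay with the built-in engine, API 21+ also exposes the installed voices programmatically, so an app can pick the best-quality one instead of the default. A minimal sketch, assuming the `TextToSpeech` instance has already reported `SUCCESS` in its `OnInitListener` (the class and method names here are illustrative, not part of the API):

```java
import android.speech.tts.TextToSpeech;
import android.speech.tts.Voice;

class VoicePicker {
    // Pick the highest-quality installed (offline) voice and tune pitch/rate.
    static void pickBestVoice(TextToSpeech tts) {
        Voice best = null;
        if (tts.getVoices() != null) {            // some engines return null
            for (Voice v : tts.getVoices()) {
                // Prefer offline voices with the highest reported quality.
                if (!v.isNetworkConnectionRequired()
                        && (best == null || v.getQuality() > best.getQuality())) {
                    best = v;
                }
            }
        }
        if (best != null) tts.setVoice(best);
        tts.setPitch(1.0f);       // 1.0 is the default; deviate only slightly
        tts.setSpeechRate(0.9f);  // slightly slower often sounds clearer
        tts.speak("Hello", TextToSpeech.QUEUE_FLUSH, null, "utt1");
    }
}
```

Network voices (where `isNetworkConnectionRequired()` is true) generally sound better but need connectivity, so which branch to prefer is a design choice.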

Related

Real-time call transcription on Android

I am an Android developer living with hearing impairment, and I am currently exploring the option of making a speech-to-text app with the SpeechRecognizer API in Android. Closed-captioning telephones and InnoCaption are not available in my home country. A potential application would be captioning during telephone calls.
https://developer.android.com/reference/android/speech/SpeechRecognizer.html
The API is meant for capturing voice commands, not for real-time live transcription. I was even able to implement it as a service, but I constantly need to restart it after it has delivered a result or a partial result, which is not feasible in a conversational setting (words get lost while the service is restarting).
Do note that I don't need 100% accuracy for this app. Many hearing-impaired people find it helpful to have some context of the conversation to help them along, so I don't need comments about how this is not going to be accurate.
Is there a way to implement SpeechRecognizer in a continuous mode? I can create a TextView that constantly updates itself when new text is returned from the service. If this API is not what I should be looking at, is there any recommendation? I tested CMUSphinx, but found it so dependent on blocks of phrases/sentences that it is unlikely to work for the kind of application I have in mind.
I am a deaf software developer, so I can chime in. I've been monitoring the state of the art of speech-to-text APIs, and they have now become "good enough" to provide operatorless relay/captioning services for CERTAIN kinds of phone conversations with people using a telephone in quiet settings. For example, I get 98% transcription accuracy with my spouse's voice with Apple Siri's realtime transcription (iOS 8).
I was able to jerry-rig phone captioning by routing the sound out of one phone into a second iPhone, where I press the microphone button (on the pop-up keyboard), and successfully captioned a telephone conversation with ~95% accuracy at 250 words per minute (faster than Sprint Captioned Telephone and Hamilton Captioned Telephone), at least until the one-minute cutoff time.
Thus, I declare computer-based voice recognition practical for phone calls with family members (of the type you call frequently in quiet environments), where you can at least coach them to move to a quiet place so captioning works properly (with >95% accuracy). Since iOS 8 was released, we REALLY need this, so we don't have to rely on relay operators or captioning telephones. Sprint Captioned Telephone lags badly during fast speech, while Apple Siri keeps up, so I can conduct more natural telephone conversations with my jerry-rigged two-iOS-device Apple Siri "realtime Captioned Telephone" setup.
Some cellphones transmit audio in a higher-definition manner, so this works well between two iPhones (one iPhone's speaker piped into the other iPhone's Siri running in iOS 8 continuous mode). That assumes you're on G.722.2 (AMR-WB), as when running two iPhones on the same carrier supporting the high-definition audio telephony standard. It works perfectly when piped through Siri -- roughly as good as speaking in front of the phone, for the same human voice (assuming the other end is speaking into the phone in a quiet environment).
Google and Apple need to open up their speech-to-text APIs to assistive applications, pronto, because operatorless telephone transcription is finally practical, at least when calling family members (good voices, coached to be in a quiet environment when receiving the call). The continuous-recognition time limit also needs to be removed in this situation.
Google's recognizer is not going to work well with telephone-quality audio anyway; you will need to build a captioning service using CMUSphinx yourself.
You probably didn't configure CMUSphinx properly. It should be OK for large-vocabulary transcription; the main thing to take care of is to use the 8 kHz telephony acoustic model, not the wideband model, together with a generic language model.
For the best accuracy it's probably worth moving the processing to a server: you can set up a PBX to take the calls and transcribe the audio there, instead of hoping to do everything on a limited device.
It is true that the SpeechRecognizer API documentation claims that
The implementation of this API is likely to stream audio to remote
servers to perform speech recognition. As such this API is not
intended to be used for continuous recognition, which would consume a
significant amount of battery and bandwidth.
This bit of text was added a year ago (https://android.googlesource.com/platform/frameworks/base/+/2921cee3048f7e64ba6645d50a1c1705ef9658f8), but no changes were made to the API itself at the time. Also, I don't really see anything specific to networking or battery drain in the rest of the API documentation. So go ahead and implement a recognizer (maybe based on CMUSphinx) and make it accessible via this API.
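In the meantime, the closest workaround with the stock API is to restart the recognizer immediately from its own callbacks, accepting that some words are still lost in the gap between sessions. A rough sketch under that assumption (the class name is illustrative; the TextView updates are left as comments):

```java
import android.content.Context;
import android.content.Intent;
import android.os.Bundle;
import android.speech.RecognitionListener;
import android.speech.RecognizerIntent;
import android.speech.SpeechRecognizer;

// Quasi-continuous recognition: restart SpeechRecognizer as soon as a final
// result (or an error) arrives. Words spoken during the restart window are
// still dropped; that is a limitation of the API, not of this code.
class ContinuousListener implements RecognitionListener {
    private final SpeechRecognizer recognizer;
    private final Intent intent;

    ContinuousListener(Context context) {
        recognizer = SpeechRecognizer.createSpeechRecognizer(context);
        recognizer.setRecognitionListener(this);
        intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
        intent.putExtra(RecognizerIntent.EXTRA_PARTIAL_RESULTS, true);
    }

    void start() { recognizer.startListening(intent); }

    @Override public void onResults(Bundle results) {
        // Append results.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION)
        // to the TextView, then restart at once.
        start();
    }
    @Override public void onError(int error) { start(); }
    @Override public void onPartialResults(Bundle partial) {
        // Update the TextView with the partial hypothesis here.
    }
    // Remaining callbacks left empty for brevity.
    @Override public void onReadyForSpeech(Bundle params) {}
    @Override public void onBeginningOfSpeech() {}
    @Override public void onRmsChanged(float rmsdB) {}
    @Override public void onBufferReceived(byte[] buffer) {}
    @Override public void onEndOfSpeech() {}
    @Override public void onEvent(int eventType, Bundle params) {}
}
```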

Change voice during phone call android

I want to make an Android application that allows the user to change their voice during a phone call. For example, a man could change his voice to sound like a woman or a robot while talking on the phone, as a funny prank.
I have been working with Android's APIs and googling for some days, but still have no idea. Someone told me it is impossible, but I see some apps on Google Play that can do it:
https://play.google.com/store/apps/details?id=com.gridmob.android.funnycall
So I think there must be some way to do it.
I thought about recording and playing back audio using AudioTrack, but I have two more problems:
1. I cannot mute the original voice in the phone call, so the phone would not play only my processed sound.
2. Recording and processing introduce a long delay (slower than real-time).
Can anyone share a solution for this?
The app you linked isn't changing voices on the phone: it uses SIP (or something similar) to place a call through the authors' servers, and the voice changing happens there. That's why you only get a small number of free minutes before you have to pay them.
Yes, it uses a SIP server to do this processing. There are two reasons you cannot create an app that does this on the phone itself. First, the phone's in-call sound processing is locked down: you can't get at it because it is engineered in hardware, not software. A PC can do this because it uses a standard sound card whose output software can modify. Second, phone manufacturers are required to design their phones to a standard format, and there are laws that force these companies to make voice morphing impossible; it is against the law to impersonate someone you are not over a telephone network.
Hard way
You get the input voice, use speech recognition to detect the words, then use text-to-speech with your desired voice as the output.
Less hard way
Sound processing: changing frequencies, amplitude, etc.
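As a taste of the "less hard way", pitch can be shifted crudely by resampling the PCM buffer and playing the result back at the original sample rate: a factor above 1 raises the pitch (and shortens the clip), below 1 lowers it. This is plain Java with illustrative names, not an Android API, and real-time voice changers use more sophisticated techniques (e.g. phase vocoders) that preserve duration:

```java
// Naive pitch shift: resample 16-bit PCM with linear interpolation.
// Playing the output at the original sample rate shifts the pitch by `factor`.
public class PitchShift {
    public static short[] resample(short[] in, double factor) {
        int outLen = (int) (in.length / factor);
        short[] out = new short[outLen];
        for (int i = 0; i < outLen; i++) {
            double pos = i * factor;                    // position in the input
            int i0 = (int) pos;
            int i1 = Math.min(i0 + 1, in.length - 1);
            double frac = pos - i0;
            // Linear interpolation between the two nearest input samples.
            out[i] = (short) ((1 - frac) * in[i0] + frac * in[i1]);
        }
        return out;
    }
}
```

Note that even with this working, the two problems from the question remain: you still cannot inject the processed audio into the actual call path, and buffering adds delay.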

Appropriate audio capture and noise reduction

In my Android application I need to capture the user's speech from the microphone and then pass it to the server. Currently I use the MediaRecorder class. However, it doesn't satisfy my needs, because I want to make a glowing effect based on the current volume of the input sound, so I need an audio stream, or something like that, I guess. Currently I use the following:
this.recorder = new MediaRecorder();
this.recorder.setAudioSource(MediaRecorder.AudioSource.MIC);
this.recorder.setOutputFormat(MediaRecorder.OutputFormat.MPEG_4);
this.recorder.setAudioEncoder(MediaRecorder.AudioEncoder.AMR_NB);
this.recorder.setOutputFile(FILENAME);
I am targeting API level 7, so I don't see any other audio encoders besides AMR Narrow Band. Maybe that's the reason for the awful noise I hear in my recordings.
The second problem I am facing is poor sound quality and noise, which I want to reduce (cancel, suppress), because it is really awful, especially on my no-name Chinese tablet. This should be done server-side because, as far as I know, it requires a lot of resources, and not all modern gadgets (especially no-name Chinese tablets) can do it quickly. I am free to choose the server platform, so it can be ASP.NET, PHP, JSP, or whatever helps me make the sound better. Speaking of ASP.NET, I have come across a library called NAudio; maybe it can help me in some way. I know there is no noise-reduction solution built into the library, but I have found some examples of FFT and auto-correlation using it, so it may help.
To be honest, I have never worked with sound this closely before and have no idea where to start. I have googled a lot about noise-reduction techniques and code examples and found nothing. You guys are my last hope.
Thanks in advance.
Have a look at this article.
Long story short, it uses MediaRecorder.AudioSource.VOICE_RECOGNITION instead of AudioSource.MIC, which gave me really good results; background noise was reduced significantly.
The great thing about this solution is that it can be used with both the AudioRecord and MediaRecorder classes.
For audio capture you can use the AudioRecord class. This lets you record raw audio, i.e. you are not restricted to "narrow band", and you can also measure the volume.
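For the glowing effect, each buffer returned by AudioRecord.read() can be reduced to an RMS volume level. A small pure-Java sketch (the class and method names are illustrative, not from the Android API):

```java
// Computes the volume of a raw 16-bit PCM buffer, e.g. one filled by
// AudioRecord.read(buffer, 0, buffer.length).
public class VolumeMeter {
    // Root-mean-square amplitude of the first `len` samples,
    // in the same units as the samples themselves.
    public static double rms(short[] buf, int len) {
        long sum = 0;
        for (int i = 0; i < len; i++) {
            sum += (long) buf[i] * buf[i];
        }
        return Math.sqrt((double) sum / len);
    }

    // Normalize against 16-bit full scale to get a 0..1 level for UI effects.
    public static double level(short[] buf, int len) {
        return rms(buf, len) / 32768.0;
    }
}
```

Driving the glow from `level()` each time a buffer is read avoids the coarse, polling-based getMaxAmplitude() approach entirely.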
Many smartphones have two microphones: one is the MIC source you are using; the other, near the camera for video shooting, is CAMCORDER. You can get data from both of them to do noise reduction. There are many papers on audio noise reduction with multiple microphones.
Ref: http://developer.android.com/reference/android/media/MediaRecorder.AudioSource.html
https://www.google.com/search?q=noise+reduction+algorithm+with+two+mic

Detect the beginning of a sound or voice in Android

I would like to listen to the mic (I guess using AudioRecord) and perform some action the very moment a person starts to speak. I know I can buffer audio with AudioRecord, but how do I analyze it ?
Well, the difficult part will be getting the phone to recognize that it's a voice. You could set the voice recognition system as the input instead of the mic, which might be able to do that. I don't think so, though, because (I actually read all about this yesterday) the phone doesn't do the recognizing itself: it just opens a live stream (like a phone call) to Google's servers, and they do the recognizing.
Also, the information I have found so far points to the conclusion that Android does not support analysis of live audio from the mic. All the other apps that seem to be "live" are actually just taking a series of small samples and analyzing them quickly enough to seem live. A 500-millisecond sample every 300 milliseconds seems to be common.
Luckily, alongside my programming job I'm also a sound technician, so I can tell you that (if you are willing to put in the work) there is a way to detect an actual voice as opposed to just sound. Every voice is made up of a distinct set of frequency ratios that combine into the voice we hear; each voice's ratios remain fairly constant, while different voices have different ratios (which is why voice-based passwords work). So if you could take a sample, break it up into frequency bands of about 10 Hz each, and watch the amplitude of each, then when you got a frequency/amplitude pattern that looked like a voice instead of just white noise, you'd be in business. DOING that, however, doesn't seem easy at all. Something similar has been done before in the app SpectralView, which displays the audio spectrum broken up into bands.
Also, as you can see by using Voice Search, a voice also fluctuates a lot in loudness. You could look for that, but it wouldn't be as reliable.
In conclusion, how do you analyze it? Well, you would have to look for a pattern in the frequencies that looks like a voice. How do you do that? Well, to be honest, I don't know for sure. Sorry.
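One cheap first step, before any of the frequency analysis described above, is detecting when anything loud starts at all: scan the buffered audio in fixed-size frames and report the first frame whose RMS energy crosses a threshold. A pure-Java sketch with illustrative names (telling a voice apart from other sound would still need spectral analysis on top of this):

```java
// Naive onset detection on a 16-bit PCM buffer, e.g. one filled by AudioRecord.
public class OnsetDetector {
    // Returns the sample index where the first "loud" frame begins, or -1
    // if no frame's RMS energy exceeds the threshold.
    public static int firstLoudFrame(short[] pcm, int frameSize, double rmsThreshold) {
        for (int start = 0; start + frameSize <= pcm.length; start += frameSize) {
            long sum = 0;
            for (int i = start; i < start + frameSize; i++) {
                sum += (long) pcm[i] * pcm[i];
            }
            double rms = Math.sqrt((double) sum / frameSize);
            if (rms > rmsThreshold) return start;
        }
        return -1;
    }
}
```

With, say, a 16 kHz sample rate and a frame size of 480 samples, each frame covers 30 ms, which is roughly the granularity the "sampling every few hundred milliseconds" apps above work at.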

Voice Activity Detection in Android

I am writing an application that will behave similarly to the existing voice recognition, but will send the sound data to a proprietary web service to perform the speech-recognition part. I am using the standard MediaRecorder (AMR-NB encoded), which seems to be perfect for speech recognition. The only data it provides is the amplitude, via the getMaxAmplitude() method.
I am trying to detect when the person starts to talk, so that when the person stops talking for about 2 seconds I can proceed to send the sound data to the web service. Right now I am using a threshold for the amplitude: if it goes over a value (e.g. 1500), I assume the person is speaking. My concern is that amplitude levels may vary by device (e.g. Nexus One vs. Droid), so I am looking for a more standard approach that can be derived from the amplitude values.
P.S.
I looked at graphing amplitude, but it doesn't provide a way to do this with just the amplitude.
Well, this might not be of much help, but how about starting by measuring the offset noise captured by the device's microphone, and applying the threshold dynamically based on that? That way you would make it adaptable to different devices' microphones, and also to the environment the user is in at a given time.
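That calibration idea can be sketched in plain Java: average the first few amplitude readings (e.g. from getMaxAmplitude()) as the noise floor, then treat readings well above it as speech. The class name and the margin factor are illustrative assumptions to be tuned per device:

```java
// Adaptive speech threshold: calibrate against ambient noise instead of
// hard-coding a value like 1500.
public class AdaptiveThreshold {
    private final double marginFactor;     // how far above the floor counts as speech
    private final int calibrationTarget;   // readings used to estimate the floor
    private double noiseFloor = 0;
    private int calibrationSamples = 0;

    public AdaptiveThreshold(int calibrationTarget, double marginFactor) {
        this.calibrationTarget = calibrationTarget;
        this.marginFactor = marginFactor;
    }

    // Feed each amplitude reading; returns true once calibration is done AND
    // the reading is judged to be speech.
    public boolean isSpeech(int amplitude) {
        if (calibrationSamples < calibrationTarget) {
            // Running average of the ambient noise level.
            noiseFloor = (noiseFloor * calibrationSamples + amplitude)
                    / (calibrationSamples + 1);
            calibrationSamples++;
            return false;
        }
        return amplitude > noiseFloor * marginFactor;
    }

    public double threshold() { return noiseFloor * marginFactor; }
}
```

This assumes the first readings really are silence; a production version would re-calibrate during detected pauses so the floor tracks a changing environment.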
1500 is too low a number. Measuring the change in amplitude will work better.
However, it will still result in missed detections.
I fear the only way to solve this problem is to figure out how to recognize a simple word or tone rather than simply detect noise.
There are now multiple VAD libraries designed for Android. One of them is:
https://github.com/gkonovalov/android-vad
Most smartphones come with a proximity sensor, and Android has an API for using it. This would be adequate for the job you described: when the user moves the phone to their ear, you can code the app to start recording. It should be easy enough.
Sensor class for android
