Live phone Call caption / transcription (speech to text) on Android


This question is meant to help the hard-of-hearing community, so that people who cannot hear a phone call can read it instead.
Android 10 (API level 29) introduced the AudioPlaybackCaptureConfiguration API, which lets an app copy the audio being played by other apps.
Google has also implemented this on Pixel phones, as shown here - https://www.youtube.com/watch?v=7hb3p8LZIq8 . But it has a few limitations:
It supports only English. How can support for regional languages be enabled?
The current implementation converts voice to text using an on-device engine, i.e. the voice is not sent to a Google server (all processing happens offline on the phone itself), so accuracy is also low.
After reading a lot of posts here, it seems developers run into problems when trying to capture the caller's voice and then transcribe it, because of restrictions imposed by Google.
How to record internal audio on Android devices or record MediaPlayer Audio Stream?
Is there any way to capture the caller's voice (https://developer.android.com/guide/topics/media/playback-capture#allowing_playback_capture)? As in the YouTube video I shared above, Google must be capturing the caller's voice, with its offline engine converting that voice to text. So can we capture the caller's voice in some way and send it to a server API or to the Google Live Transcribe app (or whatever it is) for better accuracy, and then display the converted text on the screen in the language the user chooses?
I am also a developer, though not a mobile one, so some terminology may be wrong; please excuse that and share your suggestions.
Can we modify the Android source code itself to remove that limitation and achieve what we want, even if that requires building a custom Android OS?
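For the idea of sending captured audio to a server for transcription, a streaming service would typically be fed short frames of audio rather than one long file. The sketch below is a minimal plain-Java illustration, assuming the call audio has already been obtained as raw 16-bit PCM (obtaining it is the hard part this question asks about and is not shown). The class name PcmChunker, the 16 kHz mono format, and the 100 ms frame length are illustrative assumptions, not requirements of any particular API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: splits raw PCM audio into fixed-duration frames.
final class PcmChunker {

    // Number of bytes in one chunk of the given duration.
    static int bytesPerChunk(int sampleRate, int channels, int bitsPerSample, int chunkMillis) {
        int bytesPerSampleFrame = channels * bitsPerSample / 8;
        return sampleRate * chunkMillis / 1000 * bytesPerSampleFrame;
    }

    // Cuts the PCM buffer into consecutive chunks; the last one may be shorter.
    static List<byte[]> split(byte[] pcm, int chunkBytes) {
        if (chunkBytes <= 0) {
            throw new IllegalArgumentException("chunkBytes must be positive");
        }
        List<byte[]> chunks = new ArrayList<>();
        for (int offset = 0; offset < pcm.length; offset += chunkBytes) {
            int end = Math.min(pcm.length, offset + chunkBytes);
            chunks.add(Arrays.copyOfRange(pcm, offset, end));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Assumed format: 16 kHz, mono, 16-bit samples, 100 ms per chunk.
        int chunkBytes = bytesPerChunk(16_000, 1, 16, 100);
        byte[] oneSecondOfSilence = new byte[32_000]; // stand-in for captured audio
        List<byte[]> chunks = split(oneSecondOfSilence, chunkBytes);
        System.out.println(chunkBytes + " bytes per chunk, " + chunks.size() + " chunks");
    }
}
```

Each chunk could then be handed to whatever upload or recognition client is in use; fixed-size cuts ignore word boundaries, so this suits streaming recognizers better than per-file requests.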

Related

Voice Recognition with an audio file?

I am currently working on an app that needs to record audio within the app and then send the clip to Google for transcription.
Is there any way I can send an audio clip to be processed with speech to text?
Or is there any other way to convert that recording to text?

Google's Voice To Text API is not available publicly at the moment and there's no announcement on when it could become available. On Android you can use the system voice recognition feature, but it will only transcribe what it records by itself and you won't be able to feed it any audio file for processing.
For now, you either need to use other services like AT&T's, IBM's Watson, or Dragon Dictation (all are online), or maybe consider including CMU Sphinx in your app if you absolutely need an offline solution.

Why it is not possible to play an audio file on a voice call in android

This question might seem to repeat questions such as the following:
How to play an audio file on a voice call in android
Background Audio for a Call in Progress - Possible?
The answers to these questions suggest that it is not possible to play pre-recorded audio on a voice call in Android. I want to know why it is not possible. What is the limitation (hardware/software)? Is it really a limitation, or is it done on purpose? Can we alter the Android source code to make it possible?

I think this is a limitation, imposed for security reasons and restricted at the OS level.
Let's analyze the security threat first. If you were able to play custom audio files to the callee, a whole world of cons opens up: you could trick customer support, pretend to be someone else, give unauthorized purchase confirmations, and so on. For this reason, neither Android nor iOS allows this functionality.
On Android, you won't be able to do so in a programmatic way, simply because the current APIs won't allow you to do so. It is stated in the official documentation as well, as pointed out here. If you dig into the source code, you can probably enable this feature by accessing the microphone output during a phone call, but that would require running your custom version of Android. A good starting point would be the AudioTrack source, available here.
EDIT: a good example of an audio mod involves enabling the Nexus 5 earpiece as a second loudspeaker (requires root). Can be found here.

After thorough research, I have found that there is more than one limitation/hurdle to making this possible. These limitations exist at three different levels.
The first limitation is at the API level: there is no high-level API to play sound files into the conversation audio during a call, as mentioned in the official Android documentation.
The second limitation is at the Radio Interface Layer (RIL). RIL hands complete control of the call to the radio daemon (rild) in the native Linux layer, which in turn passes control to the vendor RIL. That means we cannot manipulate the voice call from the Android source code.
Even if we could remove these two limitations, we still might not be able to play an audio file into an ongoing voice call, because there is a third limitation. Every vendor has its own RIL library that communicates with the radio daemon (rild). Changing it would require the vendor RIL to be open source, which it usually is not: hardware vendors do not usually make their device driver code available.
A detailed discussion of this topic is available at this link.

This is software related, due to the prioritization of audio routing in Android.
Take a look at the CallManager, where you can dig into the method setAudioMode(). After the audio mode has been set to MODE_IN_COMMUNICATION, the following code is called:
audioManager.requestAudioFocusForCall(AudioManager.STREAM_VOICE_CALL,
AudioManager.AUDIOFOCUS_GAIN_TRANSIENT);
From this point on the telephony service has the highest priority and won't let any other audio play in parallel.

Note: You can play back the audio data only to the standard output device. Currently, that is the mobile device speaker or a Bluetooth headset. You cannot play sound files in the conversation audio during a call.
See official link
http://developer.android.com/guide/topics/media/mediaplayer.html

By implementing AudioManager.OnAudioFocusChangeListener you can observe audio focus changes. If music is playing in the background, the listener tells you when focus is lost or regained; whether to pause or keep playing is entirely in the developer's hands.
Some of the native music players on Android devices handle this: they stop the music when the call state is TelephonyManager.EXTRA_STATE_OFFHOOK. This scenario is also entirely in the developer's hands (whether to handle it or not); if the app does not handle it, both will play in parallel.

Audio Matching (Audio Fingerprinting)

I'm writing an Android app that lets the user record their voice through the microphone, save it in storage, and link it to specific content (like a Contact). Later, when the user says it again, the app should compare the new audio with the saved audio files and find the one that matches.
I searched a lot and found some libraries that do this online, like EchoPrint, which generates a fingerprint from recorded audio, sends it to an open-source server, and returns the result. But I need to do this offline.
Does anybody know of such a library?

If you are aiming to compare an old recording of a user with a new call as it comes in, audio fingerprinting solutions like Dejavu in Python on a server or Echoprint in C++ won't help you. They are for doing recognition and retrieval on recorded audio segments plus noise.
They cannot deal with the variability of the human voice. See an explanation here.
If that's the case, what you are referring to is speaker recognition, which is much harder and involves quite a bit of machine learning. It would be tough to do this for a large corpus of users (especially offline on a phone), but for determining between a couple users, it might be doable.

Below is a good library that is easy to use, but you need to convert your audio files to WAV format first.
https://code.google.com/p/musicg/
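The WAV conversion mentioned above can be fairly simple when the recording is already raw PCM, since a basic WAV file is the PCM data behind a 44-byte RIFF header. The following is a minimal sketch under that assumption; WavWriter and pcmToWav are hypothetical names, and compressed recordings (for example AAC or AMR output from MediaRecorder) would need to be decoded to PCM first, which is not shown.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: wraps raw little-endian PCM in a canonical WAV header.
final class WavWriter {
    static final int HEADER_SIZE = 44;

    static byte[] pcmToWav(byte[] pcm, int sampleRate, int channels, int bitsPerSample) {
        int blockAlign = channels * bitsPerSample / 8; // bytes per sample frame
        int byteRate = sampleRate * blockAlign;        // bytes per second
        ByteBuffer buf = ByteBuffer.allocate(HEADER_SIZE + pcm.length)
                .order(ByteOrder.LITTLE_ENDIAN);

        buf.put("RIFF".getBytes(StandardCharsets.US_ASCII));
        buf.putInt(36 + pcm.length);                   // file size minus the first 8 bytes
        buf.put("WAVE".getBytes(StandardCharsets.US_ASCII));

        buf.put("fmt ".getBytes(StandardCharsets.US_ASCII));
        buf.putInt(16);                                // size of the PCM fmt chunk
        buf.putShort((short) 1);                       // audio format 1 = uncompressed PCM
        buf.putShort((short) channels);
        buf.putInt(sampleRate);
        buf.putInt(byteRate);
        buf.putShort((short) blockAlign);
        buf.putShort((short) bitsPerSample);

        buf.put("data".getBytes(StandardCharsets.US_ASCII));
        buf.putInt(pcm.length);                        // size of the sample data
        buf.put(pcm);
        return buf.array();
    }

    public static void main(String[] args) {
        // One second of silence at an assumed 16 kHz, mono, 16-bit format.
        byte[] wav = pcmToWav(new byte[32_000], 16_000, 1, 16);
        System.out.println("WAV size in bytes: " + wav.length);
    }
}
```

The resulting byte array can be written to a .wav file and passed to a library that expects WAV input.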

How does Google Keep do Speech Recognition while saving the audio recording at the same time?

Android's SpeechRecognizer apparently doesn't allow you to record the input on which you're doing speech recognition into an audio file.
That is, either you record voice using a MediaRecorder (or AudioRecord for that matter) or you do Speech Recognition with a SpeechRecognizer, in which case the audio isn't recorded into a file (at least not one you can access); but you can't do both at the same time.
The question of how to achieve recording audio and doing speech recognition at the same time in Android has been asked several times, and the most popular "solution" is to record a flac file and use Google's unofficial Speech API which allows you to send a flac file via a POST request and obtain a json response with the transcription.
http://mikepultz.com/2011/03/accessing-google-speech-api-chrome-11/ (outdated Android version)
https://github.com/katchsvartanian/voiceRecognition/tree/master/VoiceRecognition
http://mikepultz.com/2013/07/google-speech-api-full-duplex-php-version/
That works pretty well but has a huge limitation, which is that it can't be used with files longer than about 10-15 seconds (the exact limit is not clear and may depend on file size or perhaps the number of words). This makes it unsuitable for my needs.
Also, slicing the audio file into smaller files is NOT a possible solution; even forgetting about the difficulties in properly splitting the file at the right positions (not in the middle of a word), many consecutive requests to the abovementioned web service api will randomly result in empty responses (Google says there's a usage limit of 50 requests per day, but as usual they don't disclose the details of the real usage limits which clearly restrict bursts of requests).
So, all this would seem to indicate that getting a transcription of speech while at the same time recording the input into an audio file in Android is IMPOSSIBLE.
HOWEVER, the Google Keep Android app does exactly that.
It allows you to speak, transcribes what you said into text, and saves both the text and the audio recording (well, it's not clear where it stores it, but you can replay it).
And it has no length limitation.
So the question is: DOES ANYBODY HAVE AN IDEA OF HOW GOOGLE KEEP DOES IT?
I would look at the source code but it doesn't seem to be available, is it?
I sniffed the packets Google Keep sends and receives while doing speech recognition, and it definitely does NOT use the speech api mentioned above. All the traffic is TLS and (from the outside) it looks pretty much the same as when you're using SpeechRecognizer.
So does perhaps a way exist to kind of "split" (i.e. duplicate, or multiplex) the microphone input stream into two streams, and feed one of them to a SpeechRecognizer and the other to a MediaRecorder?
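The "split the stream" idea above can be sketched in plain Java: read each buffer from the source once and write it to two sinks. This is only an illustration, assuming a recognizer that accepts raw audio bytes from the app (whether a given recognizer does is not established here); on Android the source would be something like buffers read from AudioRecord. AudioTee and its parameter names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper: duplicates one audio stream into two consumers.
final class AudioTee {

    // Copies the source to both sinks until end of stream; returns bytes copied.
    static long tee(InputStream source, OutputStream recorder, OutputStream recognizer,
                    int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int read;
        while ((read = source.read(buffer)) != -1) {
            recorder.write(buffer, 0, read);   // e.g. a file being saved
            recognizer.write(buffer, 0, read); // e.g. a pipe feeding a recognizer
            total += read;
        }
        recorder.flush();
        recognizer.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] fakeMicData = new byte[10_000]; // stand-in for microphone PCM
        ByteArrayOutputStream fileSink = new ByteArrayOutputStream();
        ByteArrayOutputStream recognizerSink = new ByteArrayOutputStream();
        long copied = tee(new ByteArrayInputStream(fakeMicData), fileSink, recognizerSink, 4096);
        System.out.println("Copied " + copied + " bytes to each sink");
    }
}
```

Because both sinks are written on the same thread, a slow consumer would stall the other; a real implementation would likely need buffering or a separate thread per consumer.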

Google Keep launches RecognizerIntent with certain undocumented extras and expects the resulting intent to contain the URI of the recorded audio. If RecognizerIntent is serviced by Google Voice Search then it all works out and Keep gets the audio.
See record/save audio from voice recognition intent for more information and a code sample that calls the recognizer in the same way as Keep (probably) does.
Note that this behavior is not part of Android. It's simply the current undocumented way of how two closed-source Google apps communicate with each other.

It uses onPartialResults(Bundle).
This event returns the text recognized so far while the speech is still being recorded.
It's also available on Xamarin.

What format is speech sent to cloud with Android speech recognition?

I'm building an app that includes speech recognition - I intend to use the Android speech recognition service or the voice typing functionality.
From what I have read, the speech is mostly processed in the cloud. The question I have is whether anyone knows what format the audio is sent to the cloud in. For example, is it something like WAV or MP3 or PCM, or is it likely to be something else entirely?
I admit this is mostly out of plain curiosity to know a bit more of what is going on behind the scenes. (But partly it also relates to an interest in the impacts of pre and post processing on recognition.)

Well, I've been looking for that info too, and the closest thing I could find was Google's speech recognition API for Chrome, which used the FLAC audio codec. I'm not sure if Android uses it too, but it is the closest thing I have found.
