Detect the beginning of a sound or voice in Android


I would like to listen to the mic (I guess using AudioRecord) and perform some action the very moment a person starts to speak. I know I can buffer audio with AudioRecord, but how do I analyze it?

Well, the difficult part will be getting the phone to recognize that it's voice. You can set the voice recognition system as the input, instead of the mic, which might be able to do that. I don't think so though, because (I actually read all about this yesterday) the phone doesn't actually do the recognizing, it just opens up a live stream (like a phone call) to the Google servers, and they do the recognizing.
Also, the information that I have found so far points to the conclusion that Android does not support analysis of live audio from the mic. All these other apps that seem to be "live" are actually just taking a bunch of small samples and analyzing them really quickly so that they seem live. A 500 millisecond sample every 300 milliseconds seems to be common.
Luckily, on the side of my programming job, I'm also a sound technician, so I can tell you that (if you were willing to put in the work) there is a way to detect actual voice as opposed to just sound. Every voice is split into a few distinct ratios of frequencies which all combine to make the voice we hear, and every voice's ratios remain pretty constant, while each individual voice's ratios are different (which is why voice-based passwords work). So, if you were able to take a sample, break it up into frequency bands of about 10 Hz each, and watch the amplitude of each, then once you got a frequency/amplitude pattern that looked similar to a voice instead of just "white noise", you'd be in business. DOING that, however, doesn't seem like it'd be easy at all. Something similar has been done before with the app called SpectralView, which displays the audio spectrum all broken up.
Also, as you can see by using the Voice Search, a voice also fluctuates a lot in how loud it is. You could look for that, but it wouldn't be as reliable.
In conclusion, how do you analyze it? Well, you would have to look for a pattern in the frequencies that looks like a voice. How do you do that? Well, to be honest, I don't know for sure. Sorry.
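One rough way to picture the "frequency pattern" idea is sketched below, assuming 16-bit mono PCM frames such as those AudioRecord delivers. It measures how much of a frame's spectral energy falls inside a band where most speech energy tends to sit. The class name BandEnergy, the method bandFraction and the 300 to 3400 Hz band used in the usage note are illustrative assumptions, not something from the answer above, and a high fraction alone does not prove the sound is a voice. The bin width is sampleRate / frame.length, so the roughly 10 Hz resolution mentioned above would need 800 samples per frame at 8 kHz.

```java
class BandEnergy {
    /**
     * Returns the fraction (0..1) of a frame's spectral energy that lies
     * between loHz and hiHz. Uses a naive DFT for clarity; a real app
     * would more likely use an FFT library.
     */
    static double bandFraction(short[] frame, int sampleRate, double loHz, double hiHz) {
        int n = frame.length;
        double inBand = 0;
        double total = 0;
        // Bins 1..n/2 cover everything above DC up to the Nyquist frequency.
        for (int k = 1; k <= n / 2; k++) {
            double re = 0;
            double im = 0;
            for (int t = 0; t < n; t++) {
                double angle = 2 * Math.PI * k * t / n;
                re += frame[t] * Math.cos(angle);
                im -= frame[t] * Math.sin(angle);
            }
            double power = re * re + im * im;
            double freq = (double) k * sampleRate / n; // centre frequency of bin k
            total += power;
            if (freq >= loHz && freq <= hiHz) {
                inBand += power;
            }
        }
        return total == 0 ? 0 : inBand / total;
    }
}
```

Calling BandEnergy.bandFraction(frame, 8000, 300, 3400) on successive frames and watching for the value to rise is one possible starting point; telling voice from other sounds in the same band would still need the per-band pattern matching described above.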

Related

Compare two voice in android

I am working on a voice messaging application, and I need to compare two voices. The flow is:
Register with the app by recording your voice
Send a voice message to another user by recording your voice, but first compare this voice to the recorded voice in the profile
It's for security purposes: I need to know whether the recorded message is from a specific user or not.
I tried :
Compare two sound in Android
http://www.dreamincode.net/forums/topic/274280-using-fft-to-compare-two-audio-files-and-then-realtime-comparison/
But I still don't understand how to do voice comparison.
Please share if anybody knows how to do this. I didn't find any sample.

Since you indicated it's for security purpose, I'd like to first share a few things on voice biometry :-)
The problem with authenticating someone is that you'd need to be sure he was actually there saying the things that were recorded... and that's a whole different level of complexity than merely comparing voice characteristics.
Algorithms extracting voice features from a sample and later calculating the distance between a new sample and the first one can easily be fooled by a recording made up by an attacker.
Since in your case there's a human recipient, creating a message made up of chopped words or sentences from random conversations is actually quite difficult and time consuming. But not completely impossible...
There is very capable software created for the music industry that will, e.g., take some voice audio input and make it sound (intonation and time wise) like a second audio sample (a guide, made by the fraudster). Vocalign Pro by SynchroArts does this to help get perfect backing vocal tracks. You could further tweak the audio by hand using other voice editing software and achieve an acceptable level of quality that wouldn't be immediately detected by the recipient.
Depending on what the attacker wants your user to say, the process complexity could range from an hour to a day provided he has all the recording material he wants...
To fight against this type of attack, you need to detect that the audio sample has been edited. Digital editing leaves unnatural traces, e.g. in the background noise surrounding the voice.
AFAICT, only the best commercial software achieves this level of security checking, but I can't tell how far it goes in detecting such edits.
From a pure security perspective, you'd also need to be sure the device was not compromised. So these voice verification checks should happen server side and not on the phone itself.
Please note these are general considerations and it all depends on what sort of security measures you actually need for your use case. My car alarm is certainly not unbreakable, but it helps raise the bar so that fewer attackers could potentially steal it...
Another thing to consider is that biometry is by definition a statistical process and it will yield a certain percentage of false positives and false negatives. By changing the acceptance threshold, you'll be able to lower one of them at the cost of raising the other.
Selecting an appropriate threshold will require you to have a fair amount of test data. Say, 1 minute of recording for each of at least 200 speakers to start getting a picture.
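The trade-off can be made concrete with a small sketch. It assumes the matcher returns a similarity score where higher means "more likely the same speaker"; the class ThresholdStats, its method names and any scores you feed it are hypothetical and not tied to a particular library.

```java
class ThresholdStats {
    /** Share of impostor attempts that would be wrongly accepted at this threshold. */
    static double falseAcceptRate(double[] impostorScores, double threshold) {
        if (impostorScores.length == 0) return 0;
        int accepted = 0;
        for (double s : impostorScores) {
            if (s >= threshold) accepted++;
        }
        return (double) accepted / impostorScores.length;
    }

    /** Share of genuine attempts that would be wrongly rejected at this threshold. */
    static double falseRejectRate(double[] genuineScores, double threshold) {
        if (genuineScores.length == 0) return 0;
        int rejected = 0;
        for (double s : genuineScores) {
            if (s < threshold) rejected++;
        }
        return (double) rejected / genuineScores.length;
    }
}
```

Sweeping the threshold over the scores from your test recordings and tabulating both rates shows where raising one starts to cost too much of the other for your use case.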
One more thing I think you'll need to consider is the inherent variability of the human voice. People may be sick which in some cases might render the voice unrecognizable. Also the emotional state might play a role: sadness or anger will yield different sounding voices...
And last but not least, the surrounding noise might pose a problem. Say the user enrolled while at home and later records a message while on the go in a busy city environment: the system might have trouble making sure it's actually the same person speaking. The signal to noise ratio is definitely going to be one of your main issues. Small tip: depending on the distance of the microphone to the mouth, the ratio will be quite different. You'll get far better results when the user puts the phone close to their face, like in a regular phone conversation, than when the user looks at the screen while recording the message.
Voice variability and signal to noise ratio are probably the main reasons behind false negative results.
Hopefully, you now have a better understanding of the challenges awaiting you and I can start sharing some pointers for open source and commercial libraries.
AFAIK, there are no open source libraries that include fraudster detection...
You may want to check Nuance Communications for the state of the art. There are plenty of other vendors, just check with Google; I only mentioned Nuance because of its reputation.
There is an OSS library called Alize (written in C++, under LGPL license) which uses features called MFCC (Mel Frequency Cepstral Coefficients). MFCC is known to bring excellent results. Expect a steep learning curve, as this software is aimed at researchers willing to improve the state of the art on this topic and the vocabulary used is very specific.
I wrote an OSS library named Recognito (Java, Apache 2.0) aimed at regular developers, so you should be able to test it in a matter of minutes. The lib is very young and I first focused on its API before improving the algorithms. The algorithm I use for the moment is called Linear Predictive Coding (LPC) and is known to bring good results (and I do have good results, provided recordings yield the same level of quality :-)). I'm currently in the process of releasing a new version including a likelihood coefficient in the match results. MFCC implementation is on the road map.
There is plenty of javadoc and the code should be very straightforward...
https://github.com/amaurycrickx/recognito
Recognito has a dependency on javax.sound packages for audio file handling. You may want to check this post for what it takes to use it in Android: Voice matching in android
Given many people need something for android, I'll do something about it in the near future instead of saying how one should modify the lib :-)
HTH

Appropriate audio capture and noise reduction

In my Android application I need to capture the user's speech from the microphone and then pass it to the server. Currently, I use the MediaRecorder class. However, it doesn't satisfy my needs, because I want to make a glowing effect based on the current volume of the input sound, so I need an AudioStream, or something like that, I guess. Currently, I use the following:
this.recorder = new MediaRecorder();
this.recorder.setAudioSource(MediaRecorder.AudioSource.MIC);
this.recorder.setOutputFormat(MediaRecorder.OutputFormat.MPEG_4);
this.recorder.setAudioEncoder(MediaRecorder.AudioEncoder.AMR_NB);
this.recorder.setOutputFile(FILENAME);
I am targeting API level 7, so I don't see any AudioEncoder other than AMR Narrow Band. Maybe that's the reason for the awful noise I hear in my recordings.
The second problem I am facing is poor sound quality and noise, so I want to reduce (cancel, suppress) it, because it is really awful, especially on my no-name Chinese tablet. This should be done server-side because, as far as I know, it requires a lot of resources, and not all modern gadgets (especially no-name Chinese tablets) can do it fast enough. I am free to choose which platform to use on the server, so it can be ASP.NET, PHP, JSP, or whatever helps me make the sound better. Speaking of ASP.NET, I have come across a library called NAudio; maybe it can help me in some way. I know that there is no noise reduction solution built into the library, but I have found some examples of FFT and auto-correlation using it, so it may help.
To be honest, I have never worked with sound this closely before and I have no idea where to start. I have googled a lot about noise reduction techniques and code examples and found nothing. You guys are my last hope.
Thanks in advance.

Have a look at this article.
Long story short, it uses MediaRecorder.AudioSource.VOICE_RECOGNITION instead of AudioSource.MIC, which gave me really good results; the background noise was reduced considerably.
The great thing about this solution is that it can be used with both the AudioRecord and MediaRecorder classes.

For audio capture you can use the AudioRecord class. This lets you record raw audio, i.e. you are not restricted to "narrow band" and you can also measure the volume.
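A minimal sketch of the volume measurement, assuming 16-bit PCM samples that an app would obtain from AudioRecord.read(short[], int, int). The Android capture loop is left out so the snippet runs as plain Java; the class LevelMeter, its method names and the -60 dB floor mentioned in the usage note are assumptions made for illustration.

```java
class LevelMeter {
    /** Root-mean-square amplitude of the first count samples in buf. */
    static double rms(short[] buf, int count) {
        if (count <= 0) return 0;
        double sum = 0;
        for (int i = 0; i < count; i++) {
            sum += (double) buf[i] * buf[i];
        }
        return Math.sqrt(sum / count);
    }

    /** Converts an RMS amplitude to decibels relative to 16-bit full scale. */
    static double toDbfs(double rms) {
        if (rms <= 0) return Double.NEGATIVE_INFINITY;
        return 20 * Math.log10(rms / 32768.0);
    }

    /** Maps a level between floorDb (a negative value) and 0 dBFS onto 0..1. */
    static double glow(double dbfs, double floorDb) {
        if (dbfs <= floorDb) return 0;
        if (dbfs >= 0) return 1;
        return (dbfs - floorDb) / -floorDb;
    }
}
```

For the glowing effect, something like LevelMeter.glow(LevelMeter.toDbfs(LevelMeter.rms(buf, n)), -60.0) after each read gives a value that can drive the alpha or radius of the glow; working in decibels tends to track perceived loudness better than the raw amplitude.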

Many smartphones have two microphones: one is the MIC you are using, and the other one, called CAMCORDER, is near the camera for video shooting. You can get data from both of them to do noise reduction. There are many papers talking about audio noise reduction with multiple microphones.
Ref: http://developer.android.com/reference/android/media/MediaRecorder.AudioSource.html
https://www.google.com/search?q=noise+reduction+algorithm+with+two+mic

Android Audio Analysis in Real-time

I have searched for this online, but am still a bit confused (as I'm sure others will be if they think of something like this). I'd like to preface by saying that this is not for homework and/or profit.
I wanted to create an app that could listen to your microwave as you prepare popcorn. It would work by sounding an alarm when there's a certain time interval between pops (say 5-6 seconds). Again, this is simply a project to keep me occupied - not for a class.
Either way, I'm having trouble trying to figure out how to analyze the audio intake in real-time. That is, I need a way to log the time when a "pop" occurs. So that you guys don't think I didn't do any research into the matter, I've checked out this SO question and have extensively searched the AudioRecord function list.
I'm thinking that I will probably have to do something with one of the versions of read() and then compare the recorded audio every 2 seconds or so to the recorded audio of a "pop" (i.e. if 70% or more of the byte[] audioData array is the same as that of a popping sound, then log the time). Can anyone with Android audio input experience let me know if I'm at least on the right track? This is not a question of me wanting you to code anything for me, but a question as to whether I'm on the correct track, and, if not, which direction I should head instead.

I think I have an easier way.
You could use the MediaRecorder's getMaxAmplitude method.
Anytime your recorder detects a big jump in amplitude, you have detected a corn pop!

Check out this code (ignore the playback part): Playing back sound coming from microphone in real-time
Basically the idea is that you will have to take the value of each 16-bit sample (which corresponds to the value of the wave at that time). Using the sampling rate, you can calculate the time between peaks in volume. I think that might accomplish what you want.
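The sample-index-to-time idea can be sketched as follows, assuming 16-bit mono PCM samples. The class PopTimer, the method peakTimes and its threshold and quiet-gap parameters are hypothetical names chosen for this example; the threshold would need tuning against real recordings.

```java
import java.util.ArrayList;
import java.util.List;

class PopTimer {
    /**
     * Returns the times (in seconds from the start of the buffer) at which a
     * new loud event begins. A sample counts as loud when its absolute value
     * exceeds threshold; a new event is only reported after at least
     * minGapSec of quiet, so one pop spanning several samples is counted once.
     */
    static List<Double> peakTimes(short[] samples, int sampleRate, int threshold, double minGapSec) {
        List<Double> times = new ArrayList<>();
        long minGap = (long) (minGapSec * sampleRate);
        long lastLoud = -minGap; // lets an event at index 0 be reported
        for (int i = 0; i < samples.length; i++) {
            if (Math.abs(samples[i]) > threshold) {
                if (i - lastLoud >= minGap) {
                    times.add((double) i / sampleRate); // sample index to seconds
                }
                lastLoud = i;
            }
        }
        return times;
    }
}
```

For the popcorn case, the app would compare the current time with the last reported pop and sound the alarm once the difference exceeds the chosen interval (5-6 seconds in the question).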

This may be a bit overkill, but there is a framework from the MIT Media Lab called funf: http://code.google.com/p/funf-open-sensing-framework/
They already created classes for audio input and some analysis (FFT and the like), also saving to files or uploading is implemented as far as I've seen, and they handle most of the sensors available on the phone.
You can also get inspired from the code they wrote, which I think is pretty good.

Voice Activity Detection in Android

I am writing an application that will behave similarly to the existing voice recognition but will send the sound data to a proprietary web service to perform the speech recognition part. I am using the standard MediaRecorder (which is AMR-NB encoded), which seems to be perfect for speech recognition. The only data provided by this is the amplitude, via the getMaxAmplitude() method.
I am trying to detect when the person starts to talk, so that when the person stops talking for about 2 seconds I can proceed to send the sound data to the web service. Right now I am using a threshold for the amplitude: if it goes over a value (e.g. 1500), then I assume the person is speaking. My concern is that the amplitude levels may vary by device (e.g. Nexus One vs. Droid), so I am looking for a more standard approach to this that can be derived from the amplitude values.
P.S.
I looked at graphing-amplitude but it doesn't provide a way to do it with just the amplitude.

Well, this might not be of much help but how about starting by measuring the offset noise captured by the microphone of the device by the application, and apply the threshold dynamically based on that? That way you would make it adaptable to the different devices' microphones and also to the environment the user is using it at, at a given time.
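A minimal sketch of such a dynamic threshold, assuming the amplitude values come from periodic calls to MediaRecorder.getMaxAmplitude(). The class AdaptiveVad and the ratio and smoothing parameters are assumptions made for illustration; the values used in the usage note are not tuned.

```java
class AdaptiveVad {
    private double noiseFloor;   // running estimate of the background level
    private final double ratio;  // how far above the floor counts as speech
    private final double alpha;  // smoothing factor for the floor, 0..1

    AdaptiveVad(double initialNoise, double ratio, double alpha) {
        this.noiseFloor = initialNoise;
        this.ratio = ratio;
        this.alpha = alpha;
    }

    /** Classifies one amplitude reading and, if it is not speech, adapts the floor. */
    boolean isSpeech(double amplitude) {
        boolean speech = amplitude > noiseFloor * ratio;
        if (!speech) {
            // Exponential moving average, updated only while nobody is talking,
            // so speech itself does not drag the floor upwards.
            noiseFloor = (1 - alpha) * noiseFloor + alpha * amplitude;
        }
        return speech;
    }

    double noiseFloor() {
        return noiseFloor;
    }
}
```

The initial floor should come from a short calibration period (for example, averaging the first few readings before the user is prompted to speak), since a floor of zero would classify every reading as speech. A detector created with new AdaptiveVad(calibratedLevel, 3.0, 0.1) then adapts to both the device's microphone and the current environment.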

1500 is too low a number. Measuring the change in amplitude will work better.
However, it will still result in missed detections.
I fear the only way to solve this problem is to figure out how to recognize a simple word or tone rather than simply detect noise.

There are now multiple VAD libraries designed for Android. One of these is:
https://github.com/gkonovalov/android-vad

Most of the smartphones come with a proximity sensor. Android has API for using these sensors. This would be adequate for the job you described. When the user moves the phone near to his ear, you can code the app to start recording. It should be easy enough.
Sensor class for android

Microphone input

I'm trying to build a gadget that detects pistol shots using Android. It's a part of a training aid for pistol shooters that tells how the shots are distributed in time and I use a HTC Tattoo for testing.
I use the MediaRecorder and its getMaxAmplitude method to get the highest amplitude during the last 1/100 s, but it does not work as expected: speech gives me values from getMaxAmplitude in the range from 0 to about 25000, while the pistol shots (or shouting!) only reach about 15000. With a sampling frequency of 8 kHz there should be some samples with a considerably higher level.
Does anyone know how these things work? Are there filters that are applied before registering the max amplitude? If so, is it hardware or software?
Thanks,
/George

It seems there's an AGC (Automatic Gain Control) filter in place. You should also be able to identify the shot by its frequency characteristics. I would expect it to show up across most of the audible spectrum, but get a spectrum analyzer (there are a few on the app market, like SpectralView) and try identifying the event by its frequency "signature" and amplitude. If you clap your hands, what do you get for max amplitude? You could also try covering the phone with something to muffle the sound, like a few layers of cloth.

It seems like AGC is in the media recorder. When I use AudioRecord I can detect shots using the amplitude even though it sometimes reacts on sounds other than shots. This is not a problem since the shooter usually doesn't make any other noise while shooting.
But I will do some FFT too to get it perfect :-)

Sounds like you figured out your AGC problem. One further suggestion: I'm not sure the FFT is the right tool for the job. You might have better detection and lower CPU use with a sliding power estimator.
e.g.
signal => square => moving average => peak detection
All of the above can be implemented very efficiently using fixed point math, which fits well with mobile android platforms.
You can find more info by searching for "Parseval's Theorem" and "CIC filter" (cascaded integrator comb)
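The square, moving average, peak detection chain can be sketched with integer math only, assuming 16-bit mono PCM samples. The class PowerDetector, its method names, the window length and the power threshold are illustrative assumptions; a plain running-sum boxcar average stands in here for the CIC filter mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

class PowerDetector {
    private final int[] squares; // circular buffer of the last N squared samples
    private long sum;            // running sum of the buffer contents
    private int pos;             // index of the oldest entry

    PowerDetector(int windowLength) {
        squares = new int[windowLength];
    }

    /** Feeds one sample and returns the moving average of the signal power. */
    long push(short sample) {
        int sq = sample * sample;  // at most 2^30, so it fits in an int
        sum += sq - squares[pos];  // add the newest value, drop the oldest
        squares[pos] = sq;
        pos = (pos + 1) % squares.length;
        return sum / squares.length;
    }

    /** Returns the sample indices where the averaged power rises above the threshold. */
    static List<Integer> detect(short[] samples, int windowLength, long powerThreshold) {
        PowerDetector detector = new PowerDetector(windowLength);
        List<Integer> hits = new ArrayList<>();
        boolean above = false;
        for (int i = 0; i < samples.length; i++) {
            boolean now = detector.push(samples[i]) > powerThreshold;
            if (now && !above) {
                hits.add(i); // rising edge: a new loud event starts here
            }
            above = now;
        }
        return hits;
    }
}
```

Each sample costs one multiply, one add, one subtract and one divide regardless of the window length, which is where the CPU saving over an FFT comes from.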

Sorry for the late response; I didn't see this question until I started searching for a different problem...
I have started an application to do what I think you're attempting. It's an audio-based lap timer (button to start/stop recording, and loud audio noises for lap setting). It's not finished, but might provide you with a decent base to get started.
Right now, it allows you to monitor the signal volume coming from the mic, and set the ambient noise amount. It's also using the new BSD license, so feel free to check out the code here: http://code.google.com/p/audio-timer/. It's set up to use the 1.5 API to include as many devices as possible.
It's not finished, in that it has two main issues:
The audio capture doesn't currently work for emulated devices because of the unsupported frequency requested
The timer functionality doesn't work yet - was focusing on getting the audio capture first.
I'm looking into the frequency support, but Android doesn't seem to have a way to find out which frequencies are supported without trial and error per-device.
I also have on my local dev machine some extra code to create a layout for the listview items to display "lap" information. Got sidetracked by the frequency problem though. But since the display and audio capture are pretty much done, using the system time to fill in the display values for timing information should be relatively straightforward, and then it shouldn't be too difficult to add the ability to export the data table to a CSV on the SD card.
Let me know if you want to join this project, or if you have any questions.
