For my task, our Android app needs to recognize a knock sound (a knock on the surface of the device) in order to open the app.
I tried several approaches, but they recognize only about 80% of knocks (sometimes I knock on the phone and it is not reported as a knock), and sometimes they recognize other sounds as a knock, such as the vowel 'a'.
Here are the 3 methods we used:
1. Recognize by high-pass filter:
2. Using the sum of magnitudes from 13 kHz to 18 kHz (refer to this article):
3. Using a library (refer to link)
All of these efforts recognize only about 80% of knock sounds, and sometimes they recognize other sounds as a knock.
I am not sure about the characteristics of a knock or how to recognize one precisely (it does recognize a knock when I clap the phone). Any help is greatly appreciated!
Recognize by high-pass filter
This has no relation to the properties of a knock.
Using the sum of magnitudes from 13 kHz to 18 kHz (refer to this article):
This is a reasonable direction, but you need to add more features, in particular the energy in nearby frames.
Using a library
Not relevant.
Your methods do not work reliably because they do not capture the properties of a knock. To detect a knock properly, you need to figure out what distinguishes it from other sounds:
A knock is very short in time.
Knock frequencies are in the higher part of the spectrum.
So you need to implement the following algorithm:
1. Split the audio into frames.
2. Compute the FFT of every frame.
3. Analyze the FFT of every frame and of its neighbor frames, and make sure of the following:
a) The spectral energy of the frame is concentrated in the upper part of the spectrum.
b) The energy of the frame is significantly higher than the energy of the neighbor frames.
Once you see both features, you can signal that a knock was detected.
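The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the frame size, and all three thresholds (`high_fraction`, `band_ratio`, `neighbor_ratio`) are hypothetical and untuned, and a naive DFT stands in for a real FFT routine to keep the example dependency-free.

```python
import cmath
import math


def frame_spectrum(frame):
    """Return squared DFT magnitudes for bins 0..N/2 (naive O(N^2) DFT)."""
    n = len(frame)
    bins = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        bins.append(abs(acc) ** 2)
    return bins


def detect_knocks(samples, frame_size=64, high_fraction=0.5,
                  band_ratio=0.6, neighbor_ratio=4.0):
    """Return indices of frames that look like a knock.

    All thresholds are illustrative assumptions, not tuned values.
    """
    # Step 1: split the audio into non-overlapping frames.
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples) - frame_size + 1, frame_size)]
    # Step 2: one spectrum per frame.
    spectra = [frame_spectrum(f) for f in frames]
    energies = [sum(s) for s in spectra]
    hits = []
    # Step 3: compare each frame with its two neighbors (edge frames are skipped).
    for i in range(1, len(frames) - 1):
        total = energies[i]
        if total == 0:
            continue
        split = int(len(spectra[i]) * high_fraction)
        high = sum(spectra[i][split:])
        neighbors = (energies[i - 1] + energies[i + 1]) / 2
        concentrated = high / total >= band_ratio                      # feature a)
        transient = total >= neighbor_ratio * max(neighbors, 1e-12)    # feature b)
        if concentrated and transient:
            hits.append(i)
    return hits
```

On a real device the thresholds would have to be chosen from recorded knocks and from the sounds that currently cause false positives; a sustained vowel should fail test b) because its neighbor frames carry similar energy.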
For reference, see the knock spectrogram:
A related algorithm with explanation is also covered here:
Given an audio stream, find when a door slams (sound pressure level calculation?)
If you want to discriminate further between sounds, for example to distinguish clicks and door slams from claps, then you might want to implement a classifier for the spectrum. You will need to collect more examples of claps and other sounds and apply a machine learning toolkit to the FFT values. An SVM should work reasonably well for this task.
Related
I am trying to develop a guitar game on the Android platform.
I need to do real-time pitch detection to get the frequency of a guitar chord or string.
I will get the input from the microphone and then analyze it (to determine which guitar string or chord is being played).
I have found two methods that I can use: one is YIN, the other is FFT.
Which method gives better performance and more exact results?
You need to first understand what 'pitch' really is (read the Wikipedia link below). When a single note is made on a guitar or piano, what we hear is not just one frequency of sound vibration, but a composite of multiple sound vibrations occurring at different mathematically related frequencies. The elements of this composite of vibrations at differing frequencies are referred to as harmonics or partials. For instance, if we press the Middle C key on the piano, the individual frequencies of the composite's harmonics will start at 261.6 Hz as the fundamental frequency, 523 Hz would be the 2nd Harmonic, 785 Hz would be the 3rd Harmonic, 1046 Hz would be the 4th Harmonic, etc. The later harmonics are integer multiples of the fundamental frequency, 261.6 Hz ( ex: 2 x 261.6 = 523, 3 x 261.6 = 785, 4 x 261.6 = 1046 ).
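As a small worked example of the arithmetic above, the harmonic series of a note can be listed by multiplying the fundamental by successive integers. The helper name `harmonics` is hypothetical, and 261.6 Hz is the Middle C value quoted in the text.

```python
def harmonics(fundamental_hz, count):
    """Return the first `count` harmonics (integer multiples) of a fundamental."""
    return [round(n * fundamental_hz, 1) for n in range(1, count + 1)]


# Middle C and its next three harmonics, as in the text (values in Hz).
print(harmonics(261.6, 4))
```

A pitch detector has to account for the fact that any of these partials, not only the fundamental, may be the strongest peak in the spectrum.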
Below, at GitHub.com, is the C++ source code for an unusual two-stage algorithm that I devised which can do Realtime Pitch Detection on polyphonic MP3 files while being played on Windows. This free application (PitchScope Player, available on web) is frequently used to detect the notes of a guitar or saxophone solo upon a MP3 recording. You could download the executable for Windows to see my algorithm at work on a mp3 file of your choosing. The algorithm is designed to detect the most dominant pitch (a musical note) at any given moment in time within a MP3 or WAV music file. Note onsets are accurately inferred by a change in the most dominant pitch (a musical note) at any given moment during the MP3 recording.
I use a modified DFT Logarithmic Transform (similar to an FFT) to first detect these possible Harmonics by looking for frequencies with peak levels (see diagram below). Because of the way that I gather data for my modified Log DFT, I do NOT have to apply a Windowing Function to the signal, nor do I have to overlap and add. And I have created the DFT so that its frequency channels are logarithmically located in order to directly align with the frequencies where harmonics are created by the notes on a guitar, saxophone, etc.
My Pitch Detection Algorithm is actually a two stage process: a) First the ScalePitch is detected ('ScalePitch' has 12 possible pitch values: {E, F, F#, G, G#, A, A#, B, C, C#, D, D#} ) b) and after ScalePitch is determined, then the Octave is calculated by examining all the harmonics for the 4 possible Octave-Candidate notes. The algorithm is designed to detect the most dominant pitch (a musical note) at any given moment in time within a polyphonic MP3 file. That usually corresponds to the notes of an instrumental solo. Those interested in the C++ source code for my Two Stage Pitch Detection algorithm might want to start at the Estimate_ScalePitch() function within the SPitchCalc.cpp file at GitHub.com.
https://github.com/CreativeDetectors/PitchScope_Player
https://en.wikipedia.org/wiki/Transcription_(music)#Pitch_detection
Below is the image of a Logarithmic DFT (created by my C++ software) for 3 seconds of a guitar solo on a polyphonic mp3 recording. It shows how the harmonics appear for individual notes on a guitar, while playing a solo. For each note on this Logarithmic DFT we can see its multiple harmonics extending vertically, because each harmonic will have the same time-width. After the Octave of the note is determined, then we know the frequency of the Fundamental.
The diagram below demonstrates the Octave Detection algorithm which I developed to pick the correct Octave-Candidate note (that is, the correct Fundamental), once the ScalePitch for that note has been determined. Those wishing to see that method in C++ should go to the Calc_Best_Octave_Candidate() function inside the file called FundCandidCalcer.cpp, which is contained in my source code at GitHub.
I am developing an Android app for recording sound. In my app I will display the SPL (Sound Pressure Level) in dB. During my research, I came across the claim that mobile hardware can only record sounds up to 110 dB. The reason given is that mobiles are designed for recording the human voice, which falls in the range of about 60 dB. So if I need to record sounds louder than 110 dB, how will the mobile hardware respond? Do I need to depend on external devices rather than the mobile itself? Please provide your comments.
Thanks & regards,
Siva.
Your question is in fact about the dynamic range of the audio input of a mobile phone - any value you record must be capable of being represented in the scale used to measure it.
There is an associated question of what the largest sound pressure level is that a particular phone can record, but this is ultimately limited by the dynamic range and the design of the transducer used. Any absolute measure is relative to a calibration point, which in digital audio systems is dB FSD (i.e. the ratio of the sample to the maximum value), yielding negative values.
The dynamic range in dB of an ideal PCM system is limited by quantisation noise and is related directly to the bit depth (Q) of the sample:
SQNR = 20*log10(2^Q) ≈ 6.02*Q dB
State-of-the-art ADCs used in pro-audio equipment typically have a 24-bit sample depth, giving an SQNR of 144 dB. It's worth noting that in silicon ADCs and DACs, thermal noise in the analogue section limits the real dynamic range to less than this, so the least significant bits might as well be random.
AFAIK, Android is using 16-bit PCM, which has a SQNR of 96dB. This is the same performance as the CD Audio standard. A SNR of 110dB wouldn't be bad for pro-audio equipment.
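The figures quoted here can be checked with a few lines of arithmetic. The sketch below is only illustrative: `sqnr_db` and `dbfs` are hypothetical helper names, and the full-scale value of 32768 assumes signed 16-bit PCM.

```python
import math


def sqnr_db(bit_depth):
    """Ideal signal-to-quantisation-noise ratio, 20*log10(2^Q), in dB."""
    return 20 * math.log10(2 ** bit_depth)


def dbfs(sample, full_scale=32768):
    """Level of one sample relative to full scale (0 dB at the maximum)."""
    if sample == 0:
        return float("-inf")
    return 20 * math.log10(abs(sample) / full_scale)


print(round(sqnr_db(16), 2))   # 16-bit PCM, about 96 dB
print(round(sqnr_db(24), 2))   # 24-bit PCM, about 144 dB
print(round(dbfs(16384), 2))   # half of full scale, about -6 dB
```

Note that `dbfs` yields a level relative to the converter's maximum, not an SPL; turning it into an absolute SPL needs the per-device calibration discussed further down.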
In practice, audio quality is rarely a headline feature of phones and most get nowhere near this. Most users use crappy headphones or the on-board speaker of their phone for voice calls and won't notice the difference. It's an obvious corner to cut from both a cost and power budget point of view for a phone manufacturer.
Additionally, good digital audio design is a black art. Factors such as decoupling of digital signals from ground and the physical proximity of analogue components come into play. You find that in tear-downs of Apple kit, they often place the codec right next to the headphone jack, and away from the main board of the system. Again, other cost-conscious manufacturers don't do this, and it'll degrade the dynamic range of the system.
In order to get meaningful measurements from the audio input you will need to disable both automatic gain control (AGC) and probably the HPF (high-pass filter, used to remove DC bias, and often set with Fc > 100 Hz for voice calls).
If your intention is to record absolute SPL, you will need to calibrate the audio system of the device to a set-point. There is no standardisation of this between manufacturers (or even devices from any given manufacturer). Unless you fancy doing this for the devices on the market (of which there are a lot), you'll never provide universally accurate measurements.
I'm really bad at FFT, and I now need to communicate from the headphone jack of my Android device to an Arduino. There is currently a library for Arduino (discussed in the blog post Real-time spectrum analyzer powered by Arduino) and one for Android too!
How should I start? How should I build audio signals which ultimately can be turned into FFTs and the Arduino can analyse the same using the library and I can actuate anything?
You are asking a very fuzzy question: "How should I build audio signals which ultimately can be turned into FFTs and the Arduino can analyse the same using the library and I can actuate anything?". I am going to help you think through the problem - asking yourself the right questions is essential to get any answers.
Presumably, your audio signals are "coming from somewhere" - i.e. they are sound. This means that you need to convert them into a stream of numbers first.
Problem #1: converting audio signal into a stream of numbers
This breaks down into three separate sub problems:
1. Getting the signal to the right amplitude
2. Choosing the sampling rate needed
3. Digitizing and storing the data for later processing
Items (1) and (3) are related, since you need to know how you are going to digitize the signal before you can choose the right amplitude. For example, if you have a microphone as your sound input source, you will need to amplify the signal (and maybe add some automatic gain control) before feeding it into an ADC (analog to digital converter) that has a 5 V input range, since the microphone may have an output in the mV range. Without more information about the hardware you are using, there's not a lot to add here. It sounds from your tag that you are trying to do that inside an Android device - in which case I wonder how you intend to move the digital signal to the Arduino (over USB?).
The second point, "choosing the sampling rate", is actually very important. A sound signal contains many different frequencies - think of them as keys on the piano. In order to detect a high frequency, you need to sample the signal "faster than it is changing". There is a formal theorem called "Nyquist's Theorem" that states that you have to sample at more than 2x the highest frequency that is present in your signal. Note - it's not just "that you are interested in", but "that is present". If you sample a high frequency signal with a low frequency sample clock, it will appear "aliased" - it will show up in your output as something completely different. So before you digitize a signal you have to decide what the frequencies of interest are, and remove all higher frequencies with a filter. Let's say you are interested in frequencies up to 500 Hz (about 1 octave above middle C on a piano). To give your filter a chance to work, you might choose to cut off all frequencies above 1 kHz (filters "roll off" - i.e. they increase in strength over a range of frequencies), and would sample at 2 kHz. This means you get 2000 samples per second, and you need to figure out where to put them on your Arduino (memory fills up quickly on the little board).
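To make the aliasing point concrete, here is a minimal sketch showing where a tone ends up after sampling. The function names are hypothetical; the 2 kHz sample rate is the one from the example above, and the 1500 Hz tone is an assumed input that slipped past the filter.

```python
import math


def aliased_frequency(tone_hz, sample_rate_hz):
    """Frequency at which a pure tone appears after sampling (folded into 0..fs/2)."""
    folded = tone_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


def sample_cosine(tone_hz, sample_rate_hz, count):
    """Take `count` samples of a unit-amplitude cosine at the given sample rate."""
    return [math.cos(2 * math.pi * tone_hz * n / sample_rate_hz) for n in range(count)]


# A 1500 Hz tone sampled at 2 kHz shows up at 500 Hz.
print(aliased_frequency(1500, 2000))

# The samples themselves match those of a genuine 500 Hz tone.
high = sample_cosine(1500, 2000, 16)
low = sample_cosine(500, 2000, 16)
print(all(abs(a - b) < 1e-9 for a, b in zip(high, low)))
```

Once the two sample streams are identical, no amount of later processing can tell them apart, which is why the filtering has to happen before the ADC.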
Problem #2: analyzing the signal
Assuming that you have somehow captured a digital signal, your next task is analyzing it. The FFT is basically some clever math that tells you, for a given sound sample, "what keys on the piano were hit, and how hard". It breaks the sound signal into a series of frequency "bins", and determines how much energy is in each bin (it also computes the phase, but let's keep it simple). So if the input of an FFT algorithm is a sound sample, the output is an array of values telling you what frequencies were present in the signal. This is approximate, since it will find the "nearest bin". Sticking with the same analogy - if you were hitting a piano that's out of tune, the algorithm won't return "out of tune", but rather "a bit of C, and a bit of C sharp", since it cannot actually measure anything in between. The accuracy of an FFT is determined by the sampling frequency (which gives you the upper limit on the frequency you can detect) and the sample length: the longer you "listen" to the sample, the more subtle the differences you can "hear". So you have another trade-off to consider: if your audio signal changes rapidly, you have to sample for a short time (to capture the quick changes); but if you need an accurate frequency, you have to sample for a long time. For example, if you are writing a Morse decoder, your sampling has to be short compared to a pause between "dits" and "dashes" - or they will slur together. Figuring out that a Morse tone is present is pretty easy though, since there will be a single tone (one bin in the FFT) that is much larger than the others.
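The trade-off between time resolution and frequency resolution can be put into numbers. In this sketch the helper names are hypothetical, and the 2 kHz sample rate and the two FFT lengths (256 and 2048 samples) are assumed values chosen for illustration.

```python
def bin_width_hz(sample_rate_hz, num_samples):
    """Spacing between adjacent FFT bins."""
    return sample_rate_hz / num_samples


def capture_seconds(sample_rate_hz, num_samples):
    """How long you have to 'listen' to fill one FFT input buffer."""
    return num_samples / sample_rate_hz


def nearest_bin(freq_hz, sample_rate_hz, num_samples):
    """Index of the bin that a given frequency falls into."""
    return round(freq_hz * num_samples / sample_rate_hz)


for n in (256, 2048):
    print(n, bin_width_hz(2000, n), capture_seconds(2000, n),
          nearest_bin(500, 2000, n), nearest_bin(503, 2000, n))
```

With 256 samples, 500 Hz and 503 Hz land in the same bin after only 0.128 s of listening; with 2048 samples they fall into different bins, but each measurement takes about a second.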
Exactly how you implement these things depends on your application. The third step, "doing something with it", requires you to decide what is a meaningful signal. Again, if you are making a Morse decoder, you would perhaps turn an LED ON when a single tone is present (one or two bins in the FFT have much bigger value than the mean of the others), and OFF when it is not (all noise - lots of bins with approximately the same size). But without a LOT more information from you, there's not much more one can say to help you.
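The Morse-style decision described above (one bin much larger than the mean of the others) might look like the sketch below. The function name and the `ratio` of 5 are assumptions made for illustration; the input is taken to be a list of FFT bin magnitudes that has already been computed.

```python
def tone_present(bin_magnitudes, ratio=5.0):
    """Return True if the largest bin clearly dominates the mean of the rest."""
    if len(bin_magnitudes) < 2:
        return False
    peak_index = max(range(len(bin_magnitudes)), key=bin_magnitudes.__getitem__)
    peak = bin_magnitudes[peak_index]
    others = [m for i, m in enumerate(bin_magnitudes) if i != peak_index]
    mean_others = sum(others) / len(others)
    return peak > ratio * mean_others


print(tone_present([1, 1, 20, 1, 1]))       # one dominant bin: LED on
print(tone_present([1.0, 1.2, 0.9, 1.1]))   # flat noise: LED off
```

If the tone sits between two bins, its energy is shared by both, so a more forgiving version could compare the sum of the two largest adjacent bins with the rest.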
You might learn a lot from reading the following articles:
http://www.arduinoos.com/2010/10/sound-capture/
http://www.arduinoos.com/2010/10/fast-fourier-transform-fft/
http://interface.khm.de/index.php/lab/experiments/frequency-measurement-library/
I am writing an application that will behave similarly to the existing voice recognition but will send the sound data to a proprietary web service to perform the speech recognition part. I am using the standard MediaRecorder (which is AMR-NB encoded), which seems to be perfect for speech recognition. The only data provided by this is the amplitude, via the getMaxAmplitude() method.
I am trying to detect when the person starts to talk, so that when the person stops talking for about 2 seconds I can proceed to send the sound data to the web service. Right now I am using a threshold for the amplitude: if it goes over a value (e.g. 1500), then I assume the person is speaking. My concern is that the amplitude levels may vary by device (e.g. Nexus One vs. Droid), so I am looking for a more standard approach to this that can be derived from the amplitude values.
P.S.
I looked at graphing-amplitude but it doesn't provide a way to do it with just the amplitude.
Well, this might not be of much help but how about starting by measuring the offset noise captured by the microphone of the device by the application, and apply the threshold dynamically based on that? That way you would make it adaptable to the different devices' microphones and also to the environment the user is using it at, at a given time.
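A rough sketch of that idea, working only on a list of amplitude readings such as those returned by polling getMaxAmplitude() at a fixed interval. Both function names, the `multiplier` of 3, and the `minimum` floor are assumptions for illustration; `quiet_needed` is the number of consecutive quiet readings that counts as "stopped talking" (for example, 20 readings at 10 readings per second for about 2 seconds).

```python
def calibrate_threshold(ambient_readings, multiplier=3.0, minimum=1.0):
    """Derive a speech threshold from readings taken while nobody is talking."""
    noise_floor = sum(ambient_readings) / len(ambient_readings)
    return max(noise_floor * multiplier, minimum)


def find_end_of_speech(readings, threshold, quiet_needed):
    """Return the index where speech is followed by enough quiet readings, else None."""
    speaking = False
    quiet = 0
    for i, amplitude in enumerate(readings):
        if amplitude > threshold:
            speaking = True
            quiet = 0          # any loud reading restarts the silence count
        elif speaking:
            quiet += 1
            if quiet >= quiet_needed:
                return i
    return None


threshold = calibrate_threshold([100, 120, 80, 100])
readings = [90, 110, 900, 1200, 100, 800, 90, 100, 110, 95]
print(threshold, find_end_of_speech(readings, threshold, quiet_needed=3))
```

Because the threshold is derived from the device's own ambient readings, the same code adapts to microphones with different sensitivity, though it will still trigger on any loud non-speech sound.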
1500 is too low a number. Measuring the change in amplitude will work better.
However, it will still result in missed detections.
I fear the only way to solve this problem is to figure out how to recognize a simple word or tone rather than simply detect noise.
There are now multiple VAD (voice activity detection) libraries designed for Android. One of these is:
https://github.com/gkonovalov/android-vad
Most of the smartphones come with a proximity sensor. Android has API for using these sensors. This would be adequate for the job you described. When the user moves the phone near to his ear, you can code the app to start recording. It should be easy enough.
Sensor class for android
I'm trying to build a gadget that detects pistol shots using Android. It's part of a training aid for pistol shooters that tells how the shots are distributed in time, and I use an HTC Tattoo for testing.
I use the MediaRecorder and its getMaxAmplitude method to get the highest amplitude during the last 1/100 s, but it does not work as expected; speech gives me values from getMaxAmplitude in the range from 0 to about 25000, while the pistol shots (or shouting!) only reach about 15000. With a sampling frequency of 8 kHz there should be some samples with a considerably higher level.
Does anyone know how these things work? Are there filters that are applied before registering the max amplitude? If so, is it hardware or software?
Thanks,
/George
It seems there's an AGC (Automatic Gain Control) filter in place. You should also be able to identify the shot by its frequency characteristics. I would expect it to show up across most of the audible spectrum, but get a spectrum analyzer (there are a few on the app market, like SpectralView) and try identifying the event by its frequency "signature" and amplitude. If you clap your hands what do you get for max amplitude? You could also try covering the phone with something to muffle the sound like a few layers of cloth
It seems like AGC is in the media recorder. When I use AudioRecord I can detect shots using the amplitude even though it sometimes reacts on sounds other than shots. This is not a problem since the shooter usually doesn't make any other noise while shooting.
But I will do some FFT too to get it perfect :-)
Sounds like you figured out your AGC problem. One further suggestion: I'm not sure the FFT is the right tool for the job. You might have better detection and lower CPU use with a sliding power estimator.
e.g.
signal => square => moving average => peak detection
All of the above can be implemented very efficiently using fixed-point math, which fits well with mobile Android platforms.
You can find more info by searching for "Parseval's Theorem" and "CIC filter" (cascaded integrator comb)
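The signal => square => moving average => peak detection chain could be sketched as follows. This is a minimal illustration, not a tuned detector: the function name, the window length, and the threshold are assumptions, and "peak detection" is simplified here to reporting the sample index at which the averaged power first rises above the threshold. With integer PCM samples the running sum stays in integer arithmetic, in the spirit of the fixed-point remark above.

```python
from collections import deque


def detect_onsets(samples, window=8, threshold=1000.0):
    """Return indices where the moving-average power rises above the threshold."""
    recent = deque()     # squared samples currently inside the window
    running = 0          # running sum of the squares in `recent`
    above = False
    onsets = []
    for i, s in enumerate(samples):
        sq = s * s                       # square
        recent.append(sq)
        running += sq
        if len(recent) > window:
            running -= recent.popleft()  # slide the window by one sample
        power = running / window         # moving average
        if power >= threshold and not above:
            onsets.append(i)             # rising edge: a new loud event starts
        above = power >= threshold
    return onsets


samples = [10] * 20 + [3000, -2500, 2000] + [10] * 20 + [4000] + [10] * 20
print(detect_onsets(samples, window=4, threshold=1_000_000))
```

Each sample costs one multiplication, one addition, and one subtraction, which is far less work than an FFT per frame.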
Sorry for the late response; I didn't see this question until I started searching for a different problem...
I have started an application to do what I think you're attempting. It's an audio-based lap timer (button to start/stop recording, and loud audio noises for lap setting). It's not finished, but might provide you with a decent base to get started.
Right now, it allows you to monitor the signal volume coming from the mic, and set the ambient noise amount. It's also using the new BSD license, so feel free to check out the code here: http://code.google.com/p/audio-timer/. It's set up to use the 1.5 API to include as many devices as possible.
It's not finished, in that it has two main issues:
The audio capture doesn't currently work for emulated devices because of the unsupported frequency requested
The timer functionality doesn't work yet - was focusing on getting the audio capture first.
I'm looking into the frequency support, but Android doesn't seem to have a way to find out which frequencies are supported without trial and error per-device.
I also have on my local dev machine some extra code to create a layout for the listview items to display "lap" information. Got sidetracked by the frequency problem though. But since the display and audio capture are pretty much done, using the system time to fill in the display values for timing information should be relatively straightforward, and then it shouldn't be too difficult to add the ability to export the data table to a CSV on the SD card.
Let me know if you want to join this project, or if you have any questions.