I am trying to do an FFT and extract high-frequency features on smartphones. It turns out to be too slow to do a full FFT on 44100 Hz sampled data on a smartphone, but downsampling the data will destroy the high-frequency information because of the Nyquist theorem. Is there a way to speed up the FFT while retaining the higher frequencies?
It is not clear whether you want to use the FFT information itself or whether the FFT is just a way to implement some filter.
For the first case you can subsample the data, i.e. run a highpass filter and then decimate (downsample) the sequence. Yes, there will be aliasing, but you can still map particular frequencies of the smaller FFT back to the original higher frequencies, as in the sketch below.
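A minimal sketch of that bookkeeping, assuming a decimation factor of 2 and a highpass that keeps only the top half of the band [fs/4, fs/2]; the class name and the 18 kHz example tone are illustrative, not from the question:

```java
// Map an FFT bin of a decimated sequence back to its original frequency.
// Assumes the signal was highpass-filtered to [fs/4, fs/2] and then
// decimated by 2, so the band folds into [0, fs/4] with spectral inversion.
public final class AliasMap {
    public static double originalFrequency(int bin, int fftLength, double fsOriginal) {
        double fsDecimated = fsOriginal / 2.0;           // rate after decimation by 2
        double fAliased = bin * fsDecimated / fftLength; // apparent frequency in [0, fs/4]
        // The top band folds down mirrored, so undo the inversion:
        return fsOriginal / 2.0 - fAliased;
    }

    public static void main(String[] args) {
        // An 18 kHz tone sampled at 44100 Hz, decimated by 2, appears at
        // 22050 - 18000 = 4050 Hz; the mapping recovers the original 18 kHz.
        double fs = 44100.0;
        int fftLength = 1024;
        int bin = (int) Math.round(4050.0 * fftLength / (fs / 2.0));
        System.out.println(originalFrequency(bin, fftLength, fs)); // ~18000 Hz
    }
}
```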
If it is filtering, the filter should be reasonably long before you get any benefit from transform-based filtering. Also, if you do this, make sure you read up on overlap-add and overlap-save filtering and do not go with the all too common "let's take the FFT, multiply with an 'ideal' response and then take an IFFT". That will in general not give the expected result, because multiplication in the frequency domain corresponds to circular rather than linear convolution (unless you expect a transfer function which is time-varying and different from the 'ideal' one).
I'm making an application in which I record audio directly from the cell phone's microphone. I save that recording and need to compare it with audio already stored on the device.
The recordings are of engine "noise"; the idea is that, given the new recording, the app tells us which of the saved cases it most resembles.
That is, I have two cases, a good engine and a damaged engine; when I finish recording, the app must say "this audio belongs to a damaged engine".
Reading around, I find that this is supposed to be done with artificial intelligence, which is really complex. I have also read that you can "decompose" the audio into a vector of numbers, or make comparisons via FFT, but I can't find much information about it, so I'd really appreciate your help.
The saved file type is .wav.
Comparing audio signals is a nontrivial task.
Audio is just a sequence of values (numbers), where the index represents time and the value is the amplitude (loudness) of the sound.
If you compare audio data like two arrays (sequences), element by element while iterating over the index, you would need luck to get anything reasonable. Instead, you need some transformation of the array that gives aggregated information about the sequence of numbers as a whole (for example, the spectrum of the signal).
There are mathematical tools for this task, for example the well-known Fourier transform that you mentioned, and the statistical tool of correlation (cross-correlation measures the likeness of two sequences of numbers).
The correlation method can be relatively simple: you just iterate over the two arrays of data and calculate their cross-correlation. But you pay for the simplicity with requirements on the initial quality (or preparation/normalization) of the signals: they should have similar duration. The value of the resulting correlation coefficient shows how much the two sequences differ, i.e. near 0 means completely different and near 1 means almost the same.
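A minimal sketch of such a comparison at zero lag, assuming the two recordings have already been trimmed to the same length (the method name is illustrative):

```java
// Zero-lag normalized cross-correlation of two equal-length signals.
// Returns a value in [-1, 1]; values near 1 mean the signals are very similar.
public static double normalizedCrossCorrelation(double[] a, double[] b) {
    if (a.length != b.length) {
        throw new IllegalArgumentException("signals must have the same length");
    }
    double meanA = 0, meanB = 0;
    for (int i = 0; i < a.length; i++) { meanA += a[i]; meanB += b[i]; }
    meanA /= a.length;
    meanB /= b.length;

    double num = 0, denA = 0, denB = 0;
    for (int i = 0; i < a.length; i++) {
        double da = a[i] - meanA;
        double db = b[i] - meanB;
        num  += da * db;   // covariance term
        denA += da * da;   // energy of a
        denB += db * db;   // energy of b
    }
    return num / Math.sqrt(denA * denB);
}
```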
Implementing the Fourier transform (FFT) is not a problem either: you can take a well-described algorithm and implement it yourself in any language without third-party libraries. It does the job very well.
The FT gives you the spectrum of the signal, i.e. another set of values: amplitudes per frequency (roughly, with frequency as the array index instead of time, as in the raw input signal). Now you can compare the two spectra almost like two arrays, iterating over the index (frequency), and then decide on their similarity: calculate the deltas and see whether they fall within some acceptance interval (or use more rigorous statistical methods, e.g. a correlation function).
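A minimal sketch of that bin-by-bin comparison, assuming two magnitude spectra of equal length (the normalization and distance measure here are one common choice, not the only one):

```java
// Compare two magnitude spectra bin by bin and report the mean absolute
// difference; smaller values mean more similar spectra. Normalizing each
// spectrum first makes the comparison insensitive to recording volume.
public static double spectrumDistance(double[] spectrumA, double[] spectrumB) {
    double sumA = 0, sumB = 0;
    for (int i = 0; i < spectrumA.length; i++) {
        sumA += spectrumA[i];
        sumB += spectrumB[i];
    }
    double distance = 0;
    for (int i = 0; i < spectrumA.length; i++) {
        distance += Math.abs(spectrumA[i] / sumA - spectrumB[i] / sumB);
    }
    return distance / spectrumA.length;
}
```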
As for noisy signals, the noise is usually subtracted from the given data set (but for that you need to know what kind of noise it is).
This is all part of the signal processing field, and if you're working on such a project you will need to learn more about it.
Bonus: a book for example
I am trying to use an FFT to decode Morse code, but I'm finding that when I examine the frequency bin/bucket I'm interested in, its absolute value varies quite significantly even when a constant tone is present. This makes it impossible for me to use rises and falls around a threshold and therefore to decode the Morse tones.
I've even tried the simple example that seems to be copied everywhere, but it also varies...
I can't work out what I'm doing wrong, and my maths is not good enough to understand all the formulas associated with the FFT.
I know it must be possible, but I can't find out how... can anyone help please?
Make sure you are using the magnitude of the FFT result, not just the real or imaginary component of a complex result.
In general, when a longer constant-amplitude sinusoid is fed to a series of shorter FFTs (a windowed STFT), the magnitude result will only be constant from frame to frame if the sinusoid is exactly integer periodic in the FFT length, e.g.
f_tone modulo (f_sampling_rate / FFT_length) == 0
For example, at a 44100 Hz sampling rate with an FFT length of 1024, the bin spacing is about 43.07 Hz, so a 430.7 Hz tone lands exactly on a bin, while a 440 Hz tone does not and its measured magnitude will fluctuate from frame to frame (spectral leakage).
If you are only interested in the magnitude at one selected tone frequency, the Goertzel algorithm is a more efficient filter than a full FFT. And, depending on the setup and length restrictions of your chosen FFT library, it may be easier to vary the length of a Goertzel filter to match your target tone frequency, as well as the time/frequency resolution trade-off you need.
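For example, a minimal Goertzel sketch (the method name is illustrative; it computes the magnitude of the bin nearest the target frequency over one block of samples):

```java
// Goertzel algorithm: compute the magnitude at one frequency directly,
// without a full FFT. Useful for detecting a single Morse tone.
public static double goertzelMagnitude(short[] samples, double targetFreq, double sampleRate) {
    int n = samples.length;
    // Round the target to the nearest bin for this block length.
    int k = (int) Math.round(n * targetFreq / sampleRate);
    double omega = 2.0 * Math.PI * k / n;
    double coeff = 2.0 * Math.cos(omega);

    double s0, s1 = 0, s2 = 0;
    for (int i = 0; i < n; i++) {
        s0 = samples[i] + coeff * s1 - s2;  // second-order IIR update
        s2 = s1;
        s1 = s0;
    }
    // Squared magnitude of the selected bin, then the magnitude itself.
    double power = s1 * s1 + s2 * s2 - coeff * s1 * s2;
    return Math.sqrt(power);
}
```

Thresholding this magnitude block by block gives the on/off keying needed for Morse decoding, and picking the block length so the tone falls near a bin center keeps the magnitude steady for a constant tone.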
I'm looking for ways to increase the position sample rate on an Android phone. How to get a higher sample rate has been asked before, about once a year here on SO.
But my question is more specific. By using the new Raw GNSS Measurements, would it be possible to get a higher sample rate if I use the raw data and calculate the position myself?
Maybe I have misunderstood the point of the raw GNSS data, but in my ignorance I'm thinking that a phone like the Pixel 2, which supports data from GPS, GLONASS, GALILEO, BeiDou & QZSS, should theoretically receive data much more frequently than 1 Hz. But the chip itself only calculates and sends positions to the system at a 1 Hz sample rate.
But since there is raw data from five positioning systems, it should be possible to get not only a higher sample rate but also more accuracy!?
So my question is whether it's possible, using the raw data, to get higher sample rates and better accuracy. Reading through the page above doesn't suggest much about it, and raw positioning data is not a specialty of mine.
The interval for raw GPS updates really depends on the internal GPS receiver's capabilities. No matter what features Android provides, it can't invent more samples than the receiver delivers.
Secondly, by supporting multiple satellite constellations, there is a higher chance that you will obtain a 3D fix, because the receiver has more satellites to choose from, but that is not guaranteed. For example, if you are driving in downtown Manhattan, N.Y., being surrounded by tall buildings will reduce satellite visibility across the board. Combining low-precision samples from multiple constellations to generate high-precision data would be quite complicated (I won't say impossible, though).
I don't know if modern receivers perform this sort of combination, so I typically do not assume they do. And relegating this complex computation to your application, via Raw GNSS Measurements, would be an interesting experiment...
I am trying to make an Android app that checks whether the recorded voice of a person is high frequency or not. I have completed the recording part but don't know how to proceed further.
After searching, I found that the FFT algorithm can be used, but the problem is how to get the array of values that must be passed as input to the algorithm.
Can anyone help please?
Assuming you have defined what "contains high frequency" means, and you merely need a measure of it (no need to visualize the frequency content in a graph), there is really no need to calculate the FFT.
I would calculate the RMS value of the signal (a measure of its total energy), then apply a low-pass filter to the data (in the time domain) and calculate the RMS value again on the filtered signal. The loss of energy is your measure of how much high-frequency content was responsible for your initial energy value.
REPLY TO COMMENT:
You need data in order to process it! Perhaps I don't understand your question: what do you wish to "get exact values of"? You have stated you "completed the recording part", so I assume you have the signal stored in memory. Now you need to calculate the total energy of the signal in order to either A) calculate the change of energy after filtering or B) compare the energy to some predefined hard-coded value (a bad idea, by the way).
Either way, this should be done in the time domain if all you want is a measure/value. As stated by Parseval's theorem, there is no need to perform CPU-intensive processing and move over to the frequency domain to calculate the energy of a signal: http://en.wikipedia.org/wiki/Parseval's_theorem
ELABORATION:
When you record the user's voice (collect data for your signal), you need to ensure the data is not lost and is properly stored in memory (in some array-type object), and that you have a reference to this array. Once the data is collected, you don't need to convert your signal into values; it is already stored as a sequence of values. Therefore, you are now ready to perform some calculation to get a measure of "how much high frequency there is"...
The RMS (root mean square) value is a standard way of measuring the total energy of a signal: you take the square root of the average of all values squared. See http://mathworld.wolfram.com/Root-Mean-Square.html
The RMS is quick and easy to calculate, but it gives you the energy of the total signal, low-frequency and high-frequency components together, and there is no way of knowing whether a high RMS value is due to a lot of high-frequency components or low-frequency ones. Therefore, I suggest you remove the high-frequency components and calculate the RMS value again to see how much the total energy changed in doing so, i.e. how much the high frequencies were responsible for the initial "raw" RMS value. Dividing the two values gives your high-frequency ratio measure... I'm not sure this is exactly what you want to do, but it's what I would do.
To perform low-pass filtering, you pick a cut-off frequency Fcut and say anything above it is considered "high", then apply a low-pass filter with the cut-off point set to Fcut; applying a filter is done in the time domain by means of convolution.
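A minimal sketch of that procedure (the moving-average low-pass filter used here is a deliberately crude placeholder for a properly designed FIR/IIR filter, and all names are illustrative):

```java
// Estimate how much of a signal's energy lies above a cut-off by comparing
// the RMS of the raw signal with the RMS after low-pass filtering.
public final class HighFreqMeasure {
    static double rms(double[] x) {
        double sum = 0;
        for (double v : x) sum += v * v;
        return Math.sqrt(sum / x.length);
    }

    // Very crude low-pass: a moving average whose length is derived from
    // the cut-off frequency. Only a stand-in for a real filter design.
    static double[] lowPass(double[] x, double fCut, double sampleRate) {
        int len = Math.max(1, (int) (sampleRate / fCut));
        double[] y = new double[x.length];
        double acc = 0;
        for (int i = 0; i < x.length; i++) {
            acc += x[i];
            if (i >= len) acc -= x[i - len];
            y[i] = acc / Math.min(i + 1, len);
        }
        return y;
    }

    static double highFreqRatio(double[] x, double fCut, double sampleRate) {
        double total = rms(x);
        double low = rms(lowPass(x, fCut, sampleRate));
        return 1.0 - low / total;  // ~0: mostly low frequency, ~1: mostly high
    }
}
```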
Usually people use the AudioRecord class. It delivers raw PCM data, on which you can then do your calculations.
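A minimal sketch of getting those array values with AudioRecord (the sample rate and buffer sizing are typical choices, not requirements, and the RECORD_AUDIO permission must be declared in the manifest):

```java
import android.media.AudioFormat;
import android.media.AudioRecord;
import android.media.MediaRecorder;

// Read one chunk of raw 16-bit PCM samples from the microphone.
short[] recordChunk() {
    int sampleRate = 44100;
    int minBuf = AudioRecord.getMinBufferSize(
            sampleRate, AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT);

    AudioRecord recorder = new AudioRecord(
            MediaRecorder.AudioSource.MIC, sampleRate,
            AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT,
            minBuf * 2);  // some headroom over the minimum

    short[] buffer = new short[minBuf];
    recorder.startRecording();
    int read = recorder.read(buffer, 0, buffer.length);  // blocking read
    // buffer[0..read) now holds raw PCM samples ready for RMS/FFT processing.
    recorder.stop();
    recorder.release();
    return buffer;
}
```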
So I'm trying to build an Android app which acts as a real-time audio analyzer, as a precursor to a project that will involve detecting and filtering out certain sounds.
I think I've got the basics of discrete Fourier transforms down; however, I'm not sure what the best parameters are for doing real-time frequency analysis.
I get the impression that under ideal conditions (unlimited computing power), I would take all the samples from the 44100 samples/sec PCM stream I'm getting from the AudioRecord class and put them through a 44100-element FIFO "window" (zero-padded to 2^16, and maybe with a tapering function?), running an FFT on the window every time a new sample came in. This would (I think) give me the spectrum for 0 to ~22 kHz, updated 44100 times per second.
It seems like this is not going to happen on a smartphone. The thing is, I'm not sure which parameters of the computation I should reduce in order to make it tractable on my Galaxy Nexus while still holding on to as much quality as possible. Eventually I would like to use an external microphone with better sensitivity.
I figure it will involve moving the window by more than one sample between FFTs, but I have no idea at what point this becomes more detrimental to accuracy/aliasing/whatever than just doing the FFT on a smaller window, or whether there is a third option I'm overlooking.
With the natively implemented KissFFT I'm using from libgdx, I seem to be able to do somewhere between 30 and 42 44100-element FFTs per 44100 samples and still have the app be responsive (meaning the buffer being filled by the thread doing AudioRecord.read() isn't filling up faster than the FFT thread can drain it).
So my questions are:
Could the performance I'm currently getting just be the best I'm going to get? Or does it seem like I must be doing something stupid, because much faster speeds are possible?
Is my approach to this at least fundamentally correct or am I barking entirely up the wrong tree?
I'd be happy to show any of my code if that would help answer my questions, but there's a lot of it so I figured I would do so selectively instead of posting it all.
if there is a third option I'm overlooking
Yes: doing both at the same time, reducing the FFT size as well as increasing the step size. In a comment you pointed out that you want to detect "sniffling/chewing with mouth". So what you want to do is similar to the typical task of speech recognition. There, you typically extract a feature vector in steps of 10 ms (with Fs = 44.1 kHz that means every 441 samples), and the signal window to transform is roughly double the step size, so 20 ms, i.e. 882 samples, which rounds up to an FFT size of 1024 (make sure you choose an FFT size which is a power of 2, because it is faster).
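A minimal sketch of that framing scheme, assuming the recorded samples are already in a short[] called signal (the method name is illustrative, and the FFT/feature-extraction step is left as a placeholder):

```java
// Slide a 1024-sample analysis window over the signal in 441-sample steps
// (10 ms at 44.1 kHz), Hamming-weighting each frame before the FFT.
void analyzeFrames(short[] signal) {
    int windowSize = 1024;
    int stepSize = 441;

    // Precompute the Hamming window coefficients.
    double[] hamming = new double[windowSize];
    for (int i = 0; i < windowSize; i++) {
        hamming[i] = 0.54 - 0.46 * Math.cos(2.0 * Math.PI * i / (windowSize - 1));
    }

    for (int start = 0; start + windowSize <= signal.length; start += stepSize) {
        float[] frame = new float[windowSize];
        for (int i = 0; i < windowSize; i++) {
            // "Windowing" = multiply each sample by the window value;
            // no accumulation involved (this is not filtering).
            frame[i] = (float) (signal[start + i] * hamming[i]);
        }
        // frame is now ready to be handed to the FFT / feature extraction.
    }
}
```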
Any increase in window size or reduction in step size increases the amount of data but mainly adds redundancy.
Additional hints:
#SztupY correctly pointed out that you need to "window" your signal prior to the FFT, typically with a Hamming window, as in the sketch above. (But this is not "filtering"; it is just multiplying each sample value by the corresponding window value, without accumulating the result.)
The raw FFT output is hardly suited to recognizing "sniffling/chewing with mouth"; a classical recognizer consists of HMMs or ANNs which process sequences of MFCCs and their deltas.
Could the performance I'm currently getting just be the best I'm going to get? Or does it seem like I must be doing something stupid, because much faster speeds are possible?
It's close to the best, but you are wasting all the CPU power estimating highly redundant data, leaving no CPU power for the recognizer.
Is my approach to this at least fundamentally correct or am I barking entirely up the wrong tree?
After considering my answer you might re-think your approach.