FFT frequency bucket amplitude varies even with constant tone applied - android

I am trying to use an FFT to decode Morse code, but I'm finding that when I examine the frequency bin/bucket I'm interested in, its absolute value varies quite significantly even when a constant tone is present. This makes it impossible for me to use the rise and fall of that value around a threshold, and therefore to decode the audio Morse.
I've even tried the simple example that seems to be copied everywhere, but it also varies...
I can't work out what I'm doing wrong, and my maths is not clever enough to understand all the formulas associated with FFT.
I know it must be possible, but I can't find out how... can anyone help, please?

Make sure you are using the magnitude of the FFT result, not just the real or imaginary component of a complex result.
In general, when a longer constant-amplitude sinusoid is fed to a series of shorter FFTs (a windowed STFT), the magnitude result will only be constant if the sinusoid completes an exact integer number of periods within the FFT length, e.g.
f_tone modulo (f_sampling_rate / FFT_length) == 0
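For the first point, here is a minimal sketch of the magnitude computation, assuming your FFT library returns interleaved real/imaginary pairs (layouts differ between libraries, so check yours; the method name is just illustrative):

    // Sketch: assumes fftOut holds interleaved complex output
    // [re0, im0, re1, im1, ...] of length 2 * fftLength. Some libraries
    // return separate real/imag arrays instead; adapt accordingly.
    static double[] magnitudes(double[] fftOut) {
        double[] mag = new double[fftOut.length / 2];
        for (int k = 0; k < mag.length; k++) {
            double re = fftOut[2 * k];
            double im = fftOut[2 * k + 1];
            mag[k] = Math.sqrt(re * re + im * im); // |X[k]|, not re or im alone
        }
        return mag;
    }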
If you are only interested in the magnitude of one selected tone frequency, the Goertzel algorithm would serve as a more efficient filter than a full FFT. And, depending on the setup and length restrictions required by your chosen FFT library, it may be easier to vary the length of a Goertzel to match the requirements for your target tone frequency, as well as the time/frequency resolution trade-off needed.
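A minimal Goertzel sketch for one target tone (the block length n is an assumption you would tune against your dot/dash timing, and the method name is mine):

    // Squared magnitude of one frequency over a block of n samples.
    // targetFreq and sampleRate are in Hz; requires samples.length >= n.
    static double goertzelPower(float[] samples, int n,
                                double targetFreq, double sampleRate) {
        double w = 2.0 * Math.PI * targetFreq / sampleRate;
        double coeff = 2.0 * Math.cos(w);
        double s0 = 0, s1 = 0, s2 = 0;
        for (int i = 0; i < n; i++) {
            s0 = samples[i] + coeff * s1 - s2; // Goertzel recurrence
            s2 = s1;
            s1 = s0;
        }
        // standard Goertzel power formula for the target bin
        return s1 * s1 + s2 * s2 - coeff * s1 * s2;
    }

Choosing n so that targetFreq is close to an integer multiple of sampleRate / n keeps the block-to-block ripple described above small.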

Related

Correct way to apply window function in FFT

I understand why I need to use window functions with an FFT. I recorded a sine wave (16-bit PCM format) from the mic into a byte array, then transformed it back into a sample array representing the sine wave with values in [-1, 1], i.e., the raw values divided by 32768. Do I need to apply the window to the array with the values in [-1, 1] (the divided one), or to the sample array before dividing by 32768? I looked for answers on SO and Google but couldn't find any explanation of the right way.
One of the properties of linear time-invariant (LTI) systems is that the result of a cascade of multiple LTI systems is the same regardless of the order in which the operations were done (at least in theory; in practice filters and such can have small non-linearities which can make the result slightly different depending on the order).
From a theoretical perspective, applying a constant scaling factor to all samples can be seen as such a linear-time-invariant system. For a specific computer implementation, the scaling can also be considered approximately linear-time-invariant, provided the scaling does not introduce significant losses of precision (e.g. by scaling the number to values near the floating point smallest representable value), nor distortions resulting from scaling values outside the supported range. In your case, simply dividing by 32768 is most likely not going to introduce significant distortions, and as such could be considered to be an (approximately) linear-time-invariant system.
Similarly, applying a window, which multiplies each sample by a different window value, is also a linear operation (strictly speaking it is time-varying rather than time-invariant, but a constant scaling commutes with any linear operation, so the ordering argument still holds).
Having established that you have such a cascade of linear operations, you can perform the scaling by 32768 either before or after applying the window.
P.S.: as Paul mentioned in the comments, you'd probably want to perform the conversion from 16-bit words to floating point (whether scaled or not) first if you are going to work with floating-point values afterward. Trying to perform the scaling in fixed-point arithmetic might prove more complex than necessary, and may be subject to the loss of precision I alluded to above if not done carefully.
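To illustrate the ordering point, here is a sketch that scales first and then windows (the Hann window is my choice for the example, not something prescribed by the question); swapping the two multiplications gives the same output up to floating-point rounding:

    // Convert 16-bit PCM to [-1, 1] floats and apply a Hann window.
    // Because both steps are linear, scaling before or after windowing
    // gives the same result (up to rounding).
    static float[] windowed(short[] pcm) {
        int n = pcm.length;
        float[] out = new float[n];
        for (int i = 0; i < n; i++) {
            double hann = 0.5 * (1.0 - Math.cos(2.0 * Math.PI * i / (n - 1)));
            out[i] = (float) ((pcm[i] / 32768.0) * hann); // scale, then window
        }
        return out;
    }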

Max Amplitude from PCM Buffers - Audio Android

I am trying to find maximum amplitude value from PCM Buffer.
My questions are-
1) I found that to get this value in dB, the formula is: ampl_dB = 20 * log10(abs(ampl) / 32767). Now, given that ampl is in the range -32768 to 32767, the value of log10(abs(ampl) / 32767) will always be negative or zero. So is this formula the correct one? Should I just negate the value of ampl_dB?
2) My values are coming out very high. Even for a normal song the maximum amplitude value is 32767, which doesn't seem correct. What are the usual amplitude values for a song?
3) I found another formula, ampl_dB = ampl / 2700. What is this 2700 for?
4) Is there any other way I can calculate the amplitude value?
Thanks
The formula you are using is correct. Keep in mind that dB is a relative measure: it compares a level against a reference level you choose. It therefore makes sense that the result is always negative (or zero), since the reference level used in the formula is the maximum PCM level. In other words, your dB value will always be at or below your maximum level (0 dB).
Regarding the values you're obtaining, it is quite normal to reach the maximum amplitude. If it is a commercial song, a common mastering practice is to boost the signal as much as possible. If it is a recording you made, it could have to do with the microphone's sensitivity and the sounds you're recording.
Finally, just to be clear, this has nothing to do with the sound pressure level at which the sound will be played back, since you're only looking at the differences in amplitude of a recorded signal.
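For completeness, here is a sketch of the formula from question 1 applied to a whole buffer (the guard against digital silence and the method name are my additions):

    // Peak level of a PCM buffer in dB relative to full scale (dBFS).
    // Returns values <= 0; 0 dBFS is the loudest representable sample.
    static double peakDbfs(short[] buffer) {
        int peak = 0;
        for (short s : buffer) {
            peak = Math.max(peak, Math.abs((int) s)); // int cast avoids abs(-32768) overflow
        }
        if (peak == 0) return Double.NEGATIVE_INFINITY; // digital silence
        return 20.0 * Math.log10(peak / 32767.0);
    }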

Extract only high frequency from FFT

I am trying to do an FFT and extract high-frequency features on smartphones. It turns out to be too slow to do a full FFT on 44100 Hz sampled data on smartphones, but downsampling the data will kill the high-frequency information because of the Nyquist theorem. Is there a way to speed up the FFT while retaining the higher frequencies?
It is not clear if you want to use the FFT information or if it is just a way to implement some filter.
For the first case you can subsample the data, i.e., run a highpass filter and then compress (downsample) the sequence. Yes, there will be aliasing, but you can still map particular frequencies from the FFT back to the original higher frequencies.
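As a sketch of that mapping for the simplest case, decimation by 2, assuming the highpass filter removed everything below fs/4 (the method name is illustrative):

    // After highpass filtering to keep only (fs/4, fs/2) and decimating
    // by 2, a component at original frequency f shows up aliased at
    // fs/2 - f. Map a bin of the short FFT back to the original frequency:
    static double binToOriginalFreq(int bin, int fftLength, double fsOriginal) {
        double fsDecimated = fsOriginal / 2.0;         // rate after decimation
        double fAlias = bin * fsDecimated / fftLength; // apparent frequency
        return fsOriginal / 2.0 - fAlias;              // undo the spectral fold
    }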
If it is filtering, the filter should be reasonably long before you get any benefit from transform-based filtering. Also, if you do this, make sure you read up on overlap-add and overlap-save filtering, and do not go with the all-too-common "let's take the FFT, multiply with an 'ideal' response and then take an IFFT". In general this will not give the expected result (unless you expect a transfer function which is time-varying and different from the 'ideal' one).

What does the content of the short[] array of AudioRecord.read() represent in Android [duplicate]

I am starting out with audio recording using my Android smartphone.
I successfully saved voice recordings to a PCM file. When I parse the data and print out the signed, 16-bit values, I can create a graph like the one below. However, I do not understand the amplitude values along the y-axis.
What exactly are the units for the amplitude values? The values are signed 16-bit, so they must range from -32K to +32K. But what do these values represent? Decibels?
If I use 8-bit values, then the values must range from -128 to +128. How would that get mapped to the volume/"loudness" of the 16-bit values? Would you just use a 16-to-1 quantisation mapping?
Why are there negative values? I would think that complete silence would result in values of 0.
If someone can point me to a website with information on what's being recorded, I would appreciate it. I found webpages on the PCM file format, but not what the data values are.
Think of the surface of the microphone. When it's silent, the surface is motionless at position zero. When you talk, you cause the air around your mouth to vibrate. Vibrations are spring-like, with movement in both directions: back and forth, up and down, in and out. The vibrations in the air cause the microphone surface to vibrate as well, i.e., to move up and down. When it moves down, that might be sampled as a positive value; when it moves up, that might be sampled as a negative value (or it could be the opposite). When you stop talking, the surface settles back to the zero position.
What numbers you get from your PCM recording data depend on the gain of the system. With common 16 bit samples, the range is from -32768 to 32767 for the largest possible excursion of a vibration that can be recorded without distortion, clipping or overflow. Usually the gain is set a bit lower so that the maximum values aren't right on the edge of distortion.
ADDED:
8-bit PCM audio is often an unsigned data type, with the range from 0..255, with a value of 128 indicating "silence". So you have to add/subtract this bias, as well as scale by about 256 to convert between 8-bit and 16-bit audio PCM waveforms.
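Here is a sketch of that bias-and-scale conversion, assuming the common unsigned 8-bit PCM layout:

    // Convert unsigned 8-bit PCM (0..255, 128 = silence) to signed 16-bit.
    static short[] pcm8to16(byte[] pcm8) {
        short[] out = new short[pcm8.length];
        for (int i = 0; i < pcm8.length; i++) {
            int unsigned = pcm8[i] & 0xFF;             // undo Java's signed byte
            out[i] = (short) ((unsigned - 128) * 256); // remove bias, scale up
        }
        return out;
    }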
The raw numbers are an artefact of the quantization process used to convert an analog audio signal into digital. It makes more sense to think of an audio signal as a vibration around 0, extending as far as +1 and -1 for maximum excursion of the signal. Outside that, you get clipping, which distorts the harmonics and sounds terrible.
However, computers don't work all that well with fractions, so 65,536 discrete integer levels (signed values from -32768 to 32767) are used to map that range. In most applications like this, +32767 is considered the maximum positive excursion of the microphone's or speaker's diaphragm. There is no correlation between a sample value and a sound pressure level, unless you start factoring in the characteristics of the recording (or playback) circuits.
(BTW, 16-bit audio is very standard and widely used. It is a good balance of signal-to-noise ratio and dynamic range. 8-bit is noisy unless you do some funky non-standard scaling.)
Lots of good answers here, but they don't directly address your questions in an easy to read way.
What exactly are the units for the amplitude values? The values are signed 16-bit, so they must range from -32K to +32K. But what do these values represent? Decibels?
The values have no unit. They simply represent a number that has come out of an analog-to-digital converter. The numbers from the A/D converter are a function of the microphone and pre-amplifier characteristics.
If I use 8-bit values, then the values must range from -128 to +128. How would that get mapped to the volume/"loudness" of the 16-bit values? Would you just use a 16-to-1 quantisation mapping?
I don't understand this question. If you are recording 8-bit audio, your values will be 8-bits. Are you converting 8-bit audio to 16-bit?
Why are there negative values? I would think that complete silence would result in values of 0.
The diaphragm on a microphone vibrates in both directions and as a result creates positive and negative voltages. A value of 0 is silence as it indicates that the diaphragm is not moving. See how microphones work
For more details on how sound is represented digitally, see here.
A small clarification to the answer above: it is the position of the diaphragm that is being recorded. Silence occurs when there is no vibration, i.e., no change in position. The vibration you are seeing is what pushes the air and creates changes in air pressure over time. The air is no longer being pushed at the top and bottom peaks of a vibration, so the peaks are the instants where the motion momentarily stops. The loudest part of the signal is where the position changes the fastest, somewhere between the peaks. The speed with which the diaphragm moves from one peak to another determines the amount of pressure generated by the diaphragm. When the top and bottom peaks are reduced to zero (or to some other value they share), there is no vibration and no sound at all. Likewise, as the diaphragm slows down so that there is more time between peaks, less sound pressure is generated or recorded.
I recommend the Yamaha Sound Reinforcement Handbook for more in-depth reading. An understanding of basic calculus would also help in understanding audio and vibration.
The 16-bit numbers are the A/D converter values from your microphone (you knew this). Know also that the amplifier between your microphone and the A/D converter has Automatic Gain Control (AGC). The AGC actively changes the amplification of the microphone signal to prevent too much voltage from hitting the A/D converter (usually < 2 volts DC). There is also DC decoupling, which biases the input signal to the middle of the A/D converter's range (say 1 volt DC).
So, when there is no sound hitting the microphone, the AGC amplifier sends a flat 1.0 V DC signal to the A/D converter. When sound waves hit the microphone, they create a corresponding AC voltage wave. The AGC amp takes the AC voltage wave, centers it at 1.0 V DC, and sends it to the A/D converter. The A/D samples the voltage (say 44,100 times per second) and spits out signed 16-bit values: -32,768 corresponds to 0.0 V DC and +32,767 to 2.0 V DC, so a value of +100 is about 1.003 V DC and -100 about 0.997 V DC at the A/D converter.
Positive values are above 1.0 V DC; negative values are below it.
Note that most audio systems apply a logarithmic curve to the audio wave so that the human ear can hear it better. In digital audio systems (with ADCs), digital signal processing puts this curve on the signal. DSP chips are big business; TI has made a fortune using them for all kinds of applications, not just audio processing. DSPs can perform very complicated math on a real-time stream of data that would choke an iPhone's ARM7 processor. Say you are sending 2 MHz pulses to an array of 256 ultrasound sensors/receivers: you get the idea.

processing human voice

I am trying to make an Android app that checks whether a person's recorded voice is high-frequency or not. I have completed the recording part but don't know how to proceed further.
After searching I found that FFT algorithm must be used but the problem is how to get the array values that must be passed as the input to the algorithm.
Can anyone help please?
Assuming you have defined what is meant by "contains high frequency", and you merely need a measure of this (no need to visualize the frequency content in a graph), there is really no need to calculate the FFT.
I would calculate the RMS values of the signal (a measure of the total energy), then apply a low-pass filter on the data (in the time domain) and calculate the RMS values again on the filtered signal. Comparing the loss of energy is your measure of how much high frequency content was responsible for your initial energy value.
REPLY TO COMMENT:
You need data in order to process it! Perhaps I don't understand your question: what do you wish to "get exact values of"? You have stated you "completed the recording part", so I assume you have the signal stored in memory. Now you need to calculate the total energy of the signal in order to either (A) calculate the change of energy after filtering, or (B) compare the energy to some predefined hard-coded value (a bad idea, by the way).
Either way, this should be done in the time domain if all you want is a measure/value. As stated by Parseval's theorem, there is no need to perform CPU-intensive processing in the frequency domain just to calculate the energy of a signal: http://en.wikipedia.org/wiki/Parseval's_theorem
ELABORATION:
When you record the user's voice (collect data for your signal), you need to ensure the data is not lost, that it is properly stored in memory (in some array-type object), and that you have a reference to this array. Once the data is collected, you don't need to convert your signal into values; it is already stored as a sequence of values. Therefore, you are now ready to perform some calculation to get a measure of "how much high frequency there is"...
The RMS (root mean square) value is a standardized way of measuring the total energy of a signal: you take the square root of the average of all values squared. See http://mathworld.wolfram.com/Root-Mean-Square.html
The RMS is quick and easy to calculate, but it gives you the energy of the total signal, low-frequency and high-frequency components together, and there is no way of knowing whether a high RMS value is due to a lot of high-frequency components or to low-frequency ones. Therefore, I suggest you remove the high-frequency components and calculate the RMS value again to see how much the total energy changed in doing so, i.e., how much the high frequencies were responsible for the initial "raw" RMS value. Dividing the two values gives your high-frequency ratio measure... I'm not sure this is exactly what you want, but it's what I would do.
In order to perform low-pass filtering you need to pick a cutoff frequency Fcut and say anything above it is considered "high", then apply a low-pass filter with the cutoff point set to Fcut. Applying a filter is done in the time domain by means of convolution.
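Here is a minimal sketch of the RMS-before-and-after idea, with a one-pole recursive low-pass standing in for a properly designed FIR filter (the filter choice, the ratio definition, and the names highFreqRatio, fcut are my assumptions for the example):

    // Fraction of the signal's RMS removed by a low-pass filter, as a
    // rough "high-frequency content" measure. Assumes a non-silent,
    // non-empty buffer. fcut and fs are in Hz.
    static double highFreqRatio(float[] x, double fcut, double fs) {
        double rc = 1.0 / (2.0 * Math.PI * fcut);     // one-pole RC constant
        double alpha = (1.0 / fs) / (rc + 1.0 / fs);  // smoothing factor
        double sumRaw = 0, sumLp = 0, y = 0;
        for (float v : x) {
            y += alpha * (v - y);  // one-pole low-pass, in the time domain
            sumRaw += v * v;
            sumLp += y * y;
        }
        double rmsRaw = Math.sqrt(sumRaw / x.length);
        double rmsLp = Math.sqrt(sumLp / x.length);
        return (rmsRaw - rmsLp) / rmsRaw; // RMS lost to filtering
    }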
Usually you use the AudioRecord class. It delivers raw PCM data, and you can then do your calculations on that data.
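A minimal AudioRecord sketch along those lines (44.1 kHz mono 16-bit PCM and the method name recordBlock are common choices of mine, not requirements):

    import android.media.AudioFormat;
    import android.media.AudioRecord;
    import android.media.MediaRecorder;

    // Requires the android.permission.RECORD_AUDIO permission.
    static short[] recordBlock() {
        int sampleRate = 44100;
        int minBuf = AudioRecord.getMinBufferSize(sampleRate,
                AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT);
        AudioRecord recorder = new AudioRecord(MediaRecorder.AudioSource.MIC,
                sampleRate, AudioFormat.CHANNEL_IN_MONO,
                AudioFormat.ENCODING_PCM_16BIT, minBuf * 2);
        short[] buffer = new short[minBuf];
        recorder.startRecording();
        recorder.read(buffer, 0, buffer.length); // blocking read of raw PCM
        recorder.stop();
        recorder.release();
        return buffer; // scale by /32768.0 for [-1, 1] analysis values
    }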
