I am trying to implement the GCC-PHAT algorithm on Android devices.
For the FFT I used this library.
The idea is to correlate two audio files (16-bit PCM mono) to find the delay between them. With Matlab it works perfectly.
My first problem is the FFT output: it gives numbers whose magnitude is higher than 32768. For example:
fft re -20830.895138576154
fft re -30639.569794501647
fft re -49850.48597621472
fft re -49335.28275604235
fft re -96060.94916529073
fft re -91409.17426504416
fft re -226903.051428709
Is there a way to normalize these numbers to an interval of [-1,1]?
The library's forward transform definition does match Matlab's, so you should get matching values after the forward transform (not that it is critical since G_PHAT does get normalized to [-1,1]).
However, the same cannot be said of the inverse transform. Indeed from
the code comments on inverseTransform:
This transform does not perform scaling, so the inverse is not a true inverse.
And from the library webpage:
This FFT does not perform any scaling. So for a vector of length n, after performing a transform and an inverse transform on it, the result will be the original vector multiplied by n (plus approximation errors).
So, to get values matching Matlab's FFT/IFFT implementation you would need to divide the result of the IFFT by n.
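For example, after calling the library's inverse transform on the real and imaginary arrays (the exact method names depend on the library), a simple rescaling step brings the result in line with Matlab's ifft. A minimal sketch:

// Rescale the unscaled inverse-FFT output so it matches Matlab's ifft().
// "re" and "im" are the real/imaginary arrays the library transformed in place.
static void scaleInverse(double[] re, double[] im) {
    int n = re.length;
    for (int i = 0; i < n; i++) {
        re[i] /= n;
        im[i] /= n;
    }
}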
I need to output audio from the left and right channels to the headphone jack, and connect the headphone jack to an oscilloscope. I can't get the correct audio waveform with Float.MAX_VALUE and Float.MIN_VALUE. For 16-bit audio the max/min sample is a short with a value of +/-32767, so you can assign values with Short.MAX_VALUE and Short.MIN_VALUE. But my audio is currently of type float, i.e. AudioFormat.ENCODING_PCM_FLOAT, and using Float.MAX_VALUE and Float.MIN_VALUE does not produce the correct audio waveform on the oscilloscope. The actual waveform has 0.4 milliseconds of noise before and after, but when I use 3.5f or -3.5f the shape of the waveform looks correct, although it doesn't reach the maximum. So what are the maximum and minimum audio values for the float type?
The actual waveform has 0.4 milliseconds of noise before and after; the correct waveform should have this shape. If set to 3.5f/-3.5f, the shape is correct but it doesn't reach the maximum.
From the docs:
...The implementation does not clip for sample values within the nominal range [-1.0f, 1.0f], provided that all gains in the audio pipeline are less than or equal to unity (1.0f), and in the absence of post-processing effects that could add energy, such as reverb. For the convenience of applications that compute samples using filters with non-unity gain, sample values +3 dB beyond the nominal range are permitted. However such values may eventually be limited or clipped, depending on various gains and later processing in the audio path. Therefore applications are encouraged to provide samples values within the nominal range.
A 3dB power increase corresponds to an increase in voltage of sqrt(2), or roughly 1.41. So, according to the documentation, your device may be able to handle -1.41 to 1.41, but note the caveat about clipping.
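In practice that means generating your float samples at +/-1.0f for full scale (rather than Float.MAX_VALUE), and clamping anything outside the nominal range; a minimal sketch:

// Clamp float PCM samples to the nominal [-1.0f, 1.0f] range before writing
// them to an AudioTrack configured with AudioFormat.ENCODING_PCM_FLOAT.
static void clampToNominal(float[] samples) {
    for (int i = 0; i < samples.length; i++) {
        if (samples[i] > 1.0f) samples[i] = 1.0f;
        else if (samples[i] < -1.0f) samples[i] = -1.0f;
    }
}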
What does forcing key frames mean?
As per the doc
-force_key_frames[:stream_specifier] expr:expr (output,per-stream)
Force key frames at the specified timestamps, more precisely at the first frames after each specified time. If the argument is prefixed with expr:, the string expr is interpreted like an expression and is evaluated for each frame. A key frame is forced in case the evaluation is non-zero.
Still, I am not able to understand what forcing key frames at a specified timestamp means and what its use is. I can see this command being used while segmenting video. What is its purpose there?
A typical video codec uses temporal compression i.e. most frames only store the difference with respect to earlier (and in some cases, future) frames. So, in order to decode these frames, earlier frames have to be referenced, in order to generate a full image. In short, keyframes are frames which don't rely on other frames for decoding, and which other frames rely on in order to get decoded.
If a video has to be cut or segmented, without transcoding (recompression), then the segmenting can only occur at keyframes, so that the first frame of a segment is a keyframe. If this were not the case, then the frames of a segment till the next keyframe could not be played.
An encoder like x264 typically generates keyframes only if it detects that a scene change has occurred*. This isn't conducive to segmentation, as the keyframes may be generated at irregular intervals. To ensure that segments of identical and predictable lengths can be made, the force_key_frames option can be used to ensure the desired keyframe placement.
-force_key_frames expr:gte(t,n_forced*5) forces a keyframe at t=5,10,15 seconds...
The GOP size option g is another method to ensure keyframe placement, e.g. -g 50 forces a keyframe every 50 frames.
*subject to minimum and maximum keyframe distance parameters.
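Putting the two together, a typical segmenting invocation might look like this (file names and the 5-second segment length are just placeholders):

ffmpeg -i input.mp4 -c:v libx264 -force_key_frames expr:gte(t,n_forced*5) \
       -f segment -segment_time 5 -reset_timestamps 1 -c:a copy out%03d.mp4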
I recorded an audio sample and I want to apply an FFT to it.
I did all the steps needed to use an FFT on Android, such as getting the JTransforms library and everything else required.
In the code, I first defined the FFT:
DoubleFFT_1D fft = new DoubleFFT_1D(1024);
Then, after reading the audio file (stored as PCM), I applied the FFT to it using the following call:
fft.complexForward(audio_file_in_double_format);
Here are my questions:
First of all, what is the number (1024) passed to the FFT constructor based on, and what does it mean?
Does it mean that the FFT will be applied to only 1024 samples?
And what will the output of the FFT function be? I know it will give complex numbers, so will the result be twice the length of the input?
I need help understanding how this FFT function works.
The code works fine for me, but I need to understand it, because I am feeding the whole audio file into the FFT function, which is a lot bigger than 1024 samples. So is it applying the FFT to only the first 1024 samples and ignoring the rest, or what?
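For reference, a minimal sketch of processing audio in 1024-sample frames with JTransforms; apart from DoubleFFT_1D and complexForward, every name here is a placeholder:

import org.jtransforms.fft.DoubleFFT_1D;
// (older JTransforms releases use the edu.emory.mathcs.jtransforms.fft package instead)

public class FftFrames {
    public static void main(String[] args) {
        int n = 1024;                               // transform length: samples per frame
        DoubleFFT_1D fft = new DoubleFFT_1D(n);

        double[] audio = new double[16 * n];        // placeholder for the decoded PCM samples

        // complexForward works on n complex values stored as [re0, im0, re1, im1, ...],
        // i.e. an array of length 2*n, and transforms it in place.
        for (int off = 0; off + n <= audio.length; off += n) {
            double[] frame = new double[2 * n];
            for (int i = 0; i < n; i++) {
                frame[2 * i] = audio[off + i];      // real part = audio sample
                frame[2 * i + 1] = 0.0;             // imaginary part = 0 for real audio
            }
            fft.complexForward(frame);              // frame now holds the spectrum of this block
        }
    }
}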
I'm using the Libgdx library to do an FFT on an accelerometer signal in an Android app.
I need my signal to be normalized, because I compute the dot product of two signals and I want its maximum value to be 1.
By "normalization" I mean that the Euclidean norm of the signal is 1.
(The Euclidean norm is the square root of the sum of the squares of the vector's components. Once I have its value, I normalize the signal by dividing every component of the vector by the norm.)
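In code, what I do is roughly this (plain Java, independent of the Libgdx FFT):

// Divide every component by the Euclidean norm so the result has norm 1.
static void normalize(float[] v) {
    double sumSq = 0;
    for (float x : v) sumSq += (double) x * x;
    double norm = Math.sqrt(sumSq);
    if (norm == 0) return;   // leave an all-zero signal untouched
    for (int i = 0; i < v.length; i++) v[i] /= norm;
}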
The dot product is taken in the frequency domain, so if I normalize the signal in the time domain, the frequency-domain representation is not Euclidean-normalized, and I would have to do the Euclidean normalization again.
(I already account for the 1/N scale factor after the FFT; I don't think it affects my problem, but maybe it does.)
What difference does it make if I do the Euclidean normalization both before and after the FFT, or only after the FFT?
EDIT 1: Consider also that the FFT in the Libgdx library is a complex DFT, and since my input signal is real, the output is symmetric between bins 0 to (N/2)-1 and N/2 to N-1.
I verified that Parseval's theorem holds if I apply no window (such as a Hamming window).
So, if I use only components 0 to (N/2)-1 of the signal, will I obtain a dot product between 0 and 1?
Hm, seems nobody is answering this. Not sure why, but I will chime in briefly.
Let f[n] be the signal, F[k] be the Fourier transformed version (obviously discrete).
By Parseval's theorem, we have that:
norm(f[n]) = (1/sqrt(N)) * norm(F[k])
where N is the number of samples. By homogeneity of Fourier transform, if g[n]=a f[n], then G[k] = a F[k].
Finally, combining these two, in order to get norm(F[k]) to be 1, what you need to do is divide by:
(1) norm(F[k]) = sqrt(N) * norm(f[n])
Either in the time or in the frequency domain.
Similarly, if you want norm(f[n]) to be 1, what you need to do is divide by:
(2) norm(f[n]) = (1/sqrt(N)) * norm(F[k])
And finally:
What difference does it make if I do the Euclidean normalization both before and after the FFT, or only after the FFT?
It does not make a difference whether you divide before or after, because the Fourier transform is linear (and the homogeneity property holds). However, if you want the time domain to have a norm of 1, then you should use the constant in (2). On the other hand, to get the frequency domain to have a norm of 1, you should use the constant in (1).
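If you want to sanity-check the sqrt(N) factor numerically, here is a small self-contained example (using a naive, unscaled DFT rather than any particular FFT library):

// Numerically check that norm(F) == sqrt(N) * norm(f) for an unscaled DFT.
public class ParsevalCheck {
    public static void main(String[] args) {
        double[] f = {1.0, -2.0, 0.5, 3.0};                 // arbitrary real signal
        int n = f.length;
        double[] re = new double[n], im = new double[n];
        for (int k = 0; k < n; k++) {                       // unscaled forward DFT
            for (int t = 0; t < n; t++) {
                double ang = -2 * Math.PI * k * t / n;
                re[k] += f[t] * Math.cos(ang);
                im[k] += f[t] * Math.sin(ang);
            }
        }
        double sumSqF = 0, sumSqf = 0;
        for (int i = 0; i < n; i++) {
            sumSqF += re[i] * re[i] + im[i] * im[i];
            sumSqf += f[i] * f[i];
        }
        // Both printed values should agree up to rounding error.
        System.out.println(Math.sqrt(sumSqF) + " ~= " + Math.sqrt(n) * Math.sqrt(sumSqf));
    }
}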
I am using the AudioRecord class to analyze raw PCM bytes as they come in from the mic.
That's working nicely. Now I need to convert the PCM bytes into decibels.
I have a formula that converts sound pressure in Pa into dB:
dB = 20 * log10(Pa / refPa)
So the question is: the bytes I am getting from the AudioRecord buffer, what are they? Amplitude, sound pressure in pascals, or something else?
I tried putting the values into the formula, but it comes back with very high dB values, so I don't think it's right.
thanks
Disclaimer: I know little about Android.
Your device is probably recording in mono at 44,100 samples per second (maybe less) using two bytes per sample. So your first step is to combine pairs of bytes in your original data into two-byte integers (I don't know how this is done in Android).
You can then compute the decibel value (relative to the peak) of each sample by first taking the normalized absolute value of the sample and passing it to your Db function:
float Db = 20 * log10(ABS(sampleVal) / 32768)
A value near the peak (e.g. +32767 or -32768) will have a Db value near 0. A value of 3277 (0.1) will have a Db value of -20; a value of 327 (.01) will have a Db value of -40 etc.
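For reference, a minimal sketch of both steps (combining byte pairs into 16-bit samples, then converting to dB relative to the peak); it assumes 16-bit mono PCM in the byte order AudioRecord delivers on the device, typically little-endian:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Convert raw 16-bit PCM bytes into dB values relative to full scale.
// A silent (zero) sample maps to negative infinity.
static double[] toDbFullScale(byte[] raw) {
    ByteBuffer bb = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
    double[] db = new double[raw.length / 2];
    for (int i = 0; i < db.length; i++) {
        short sample = bb.getShort();
        db[i] = 20.0 * Math.log10(Math.abs(sample) / 32768.0);
    }
    return db;
}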
The problem is likely the definition of the "reference" sound pressure at the mic. I have no idea what it would be or if it's available.
The only audio application I've ever used defined 0 dB as "full volume", i.e. when the samples were at plus or minus the max value (in unsigned 16 bits, that'd be 0 and 65535). To get this into dB I'd probably do something like this:
// assume input_sample is in the range 0 to 65535 (unsigned 16-bit)
sample = ((input_sample * 10.0) - 327675.0) / 327675.0   // same as (input_sample - 32767.5) / 32767.5, i.e. -1.0 to 1.0
db = 20.0 * log10(abs(sample))
I don't know if that's right, but it feels right to the mathematically challenged me. As the input_sample approaches the "middle", it'll look more and more like negative infinity.
Now that I think about it, though, if you want SPL or something similar, that might require different trickery, like doing an RMS evaluation between the zero crossings; again, something I could only guess at, because I have no idea how it really works.
The reference pressure in Leq (sound pressure level) calculations is 20 micro-Pascal (rms).
To measure absolute Leq levels, you need to calibrate your microphone using a calibrator. Most calibrators fit 1/2" or 1/4" microphone capsules, so I have my doubts about calibrating the microphone on an Android phone. Alternatively you may be able to use the microphone sensitivity (Pa/mV) and then calibrate the voltage level going into the ADC. Even less reliable results could be had from comparing the Android values with the measured sound level of a diffuse stationary sound field using a sound level meter.
Note that in Leq calculations you normally use the RMS values. A single sample's value doesn't mean much.
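To illustrate the RMS point, something like the following gives the level of a block of samples relative to digital full scale; turning it into an absolute Leq still requires the calibration offset discussed above:

// RMS level of a block of 16-bit samples, in dB relative to full scale.
// Add a calibration offset (measured against a known source) to get absolute dB SPL.
static double rmsDbFullScale(short[] block) {
    double sumSq = 0;
    for (short s : block) sumSq += (double) s * s;
    double rms = Math.sqrt(sumSq / block.length);
    return 20.0 * Math.log10(rms / 32768.0);
}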
I held my sound level meter right next to the mic on my Google Ion and went 'Woooooo!' and noted that clipping occurred at about 105 dB SPL. Hope this helps.
The units are whatever units are used for the reference reading. In the formula, the reading is divided by the reference reading, so the units cancel out and no longer matter.
In other words, decibels are a way of comparing two things; they are not an absolute measurement. When you see them used as if they were absolute, the comparison is with the quietest sound the average human can hear.
In our case, it is a comparison to the highest reading the device handles (thus, every other reading is negative, or less than the maximum).