I understand why I need to use window functions with the FFT. I recorded a sine wave (16-bit PCM format) from the mic into a byte array and converted it back to a sample array with values in [-1, 1] by dividing each sample by 32768. Do I need to apply the window to the array with values in [-1, 1] (the divided one), or do I need to apply it to the sample array without dividing by 32768? I searched SO and Google but couldn't find an explanation of the right way to do it.
One of the properties of linear time-invariant systems is that the result of a cascade of multiple such systems is the same regardless of the order in which the operations are performed (at least in theory; in practice, filters and the like can have small non-linearities that make the result slightly different depending on order).
From a theoretical perspective, applying a constant scaling factor to all samples can be seen as such a linear time-invariant system. For a specific computer implementation, the scaling can also be considered approximately linear time-invariant, provided it does not introduce significant loss of precision (e.g. by scaling values down to near the smallest representable floating-point value) or distortion from scaling values outside the supported range. In your case, simply dividing by 32768 is most likely not going to introduce significant distortion, and as such can be considered an (approximately) linear time-invariant system.
Similarly, applying a window, which multiplies each sample by a different window value, is another linear operation (strictly speaking it is time-varying rather than time-invariant, but it still commutes with a constant gain, since both are simply pointwise multiplications).
Having established that the two operations commute, you can perform the scaling by 32768 either before or after applying the window.
P.S.: as Paul mentioned in the comments, you'd probably want to perform the conversion from 16-bit words to floating point (whether scaled or not) first if you are going to work with floating-point values afterward. Trying to perform the scaling in fixed-point arithmetic might prove more complex than necessary, and may be subject to the loss of precision I alluded to above if not done carefully.
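For illustration, here is a minimal sketch (assuming a Hann window and samples coming in as 16-bit values); scaling by 32768 before or after the window gives the same result up to floating-point rounding:

    // Sketch: convert 16-bit PCM samples to floats and apply a Hann window.
    // The division by 32768 can equally be done before or after the window.
    static float[] windowedSamples(short[] pcm) {
        int n = pcm.length;
        float[] out = new float[n];
        for (int i = 0; i < n; i++) {
            // Hann window value for position i
            double w = 0.5 * (1.0 - Math.cos(2.0 * Math.PI * i / (n - 1)));
            out[i] = (float) (w * (pcm[i] / 32768.0)); // scale, then window (order is interchangeable)
        }
        return out;
    }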
I'm making an application in which I record audio directly from the phone's microphone, save that recording, and need to compare it with audio already stored on the device.
The recordings are of engine noise; the idea is that the app tells me which of the saved cases the new recording resembles.
That is, I have two cases, a good engine and a damaged engine, and when I finish recording it should say, for example, "this audio belongs to a damaged engine".
From what I've read, this is usually done with artificial intelligence, which is really complex. I've also read that you can "decompose" the audio into a vector of numbers or make comparisons via FFT, but I can't find much information about it, so I'd really appreciate your help.
The saved file type is .wav.
It's a nontrivial task to compare audio signals.
The audio is just a sequence of values (numbers), where the index represents time and the value is the amplitude (loudness) of the sound.
If you compare audio data like two arrays, element by element, iterating through the index, you'd be lucky to get anything reasonable. You need some transformation of the array that gives you aggregated information about the sequence of numbers as a whole (for example, the spectrum of the signal).
There are mathematical tools for this task, for example the well-known Fourier transform you mentioned, and the statistical tool of correlation (which measures how similar two sequences of numbers are).
The correlation method can be relatively simple: you just iterate over the arrays of data and calculate the correlation. But you pay for that simplicity with requirements on the initial quality (or preparation/normalization) of the signals; they should have similar duration. The value of the resulting correlation shows how much the two sequences differ, i.e. 0 means absolutely different and 1 means almost the same.
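As a rough illustration of the correlation approach (a normalized correlation at zero lag between two equal-length, already-aligned recordings; the names are illustrative, not from any particular library):

    // Normalized correlation of two equal-length signals: ~1 means very similar,
    // ~0 means unrelated. Both inputs should already be prepared/normalized.
    static double correlation(double[] a, double[] b) {
        double sumAB = 0, sumAA = 0, sumBB = 0;
        for (int i = 0; i < a.length; i++) {
            sumAB += a[i] * b[i];
            sumAA += a[i] * a[i];
            sumBB += b[i] * b[i];
        }
        return sumAB / Math.sqrt(sumAA * sumBB);
    }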
Implementing the Fourier transform (FFT) is not a problem either; you can take a well-described algorithm and implement it yourself in any language without third-party libraries. It does the job very well.
The FT gives you the spectrum of the signal, i.e. another set of values: amplitudes per frequency (roughly, frequency as the array index instead of time as in the raw input signal). Now you can compare the two spectra almost like two arrays, iterating through the index (frequency), and then decide on their similarity: calculate the deltas and see whether they fall within some acceptance interval (or use more rigorous statistical methods such as a correlation function).
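As a sketch of the spectrum-comparison idea (a naive O(N^2) DFT magnitude for clarity; a real implementation would use an FFT, as mentioned above):

    // Magnitude spectrum via a naive DFT (for illustration; use an FFT in practice).
    static double[] magnitudeSpectrum(double[] x) {
        int n = x.length;
        double[] mag = new double[n / 2];
        for (int k = 0; k < n / 2; k++) {
            double re = 0, im = 0;
            for (int t = 0; t < n; t++) {
                double angle = 2.0 * Math.PI * k * t / n;
                re += x[t] * Math.cos(angle);
                im -= x[t] * Math.sin(angle);
            }
            mag[k] = Math.sqrt(re * re + im * im);
        }
        return mag;
    }

    // Compare two spectra bin by bin, e.g. with a mean absolute difference.
    static double spectrumDistance(double[] s1, double[] s2) {
        double sum = 0;
        for (int k = 0; k < s1.length; k++) sum += Math.abs(s1[k] - s2[k]);
        return sum / s1.length;
    }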
As for noisy signals, the noise is usually subtracted from the given data set (but for this you need to know what kind of noise it is).
This is all part of the signal processing field, and if you're working on such a project you should learn more about it.
Bonus: a book for example
I am trying to use FFT to decode morse code, but I'm finding that when I examine the resulting frequency bin/bucket I'm interested in, the absolute value is varying quite significantly even when a constant tone is presented. This makes it impossible for me to use the rise and fall around a threshold and therefore decode audio morse.
I've even tried the simple example that seems to be copied everywhere, but it also varies...
I can't work out what I'm doing wrong, and my maths is not clever enough to understand all the formulas associated with FFT.
I know it must be possible, but I can't find out how... can anyone help please?
Make sure you are using the magnitude of the FFT result, not just the real or imaginary component of a complex result.
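In code that usually looks something like this (a minimal sketch, assuming re and im hold the complex components of the bin you are watching):

    double magnitude = Math.sqrt(re * re + im * im); // use this, not re or im alone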
In general, when a longer constant-amplitude sinusoid is fed to a series of shorter FFTs (a windowed STFT), the magnitude result will only be constant if the sinusoid is exactly integer-periodic in the FFT length, i.e.
f_tone modulo (f_sampling_rate / FFT_length) == 0
If you are only interested in the magnitude of one selected tone frequency, the Goertzel algorithm would serve as a more efficient filter than a full FFT. And, depending on the setup and length restrictions required by your chosen FFT library, it may be easier to vary the length of a Goertzel to match the requirements for your target tone frequency, as well as the time/frequency resolution trade-off needed.
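If you go the Goertzel route, a minimal sketch looks like this (the standard textbook form; pick the block length so your tone lands near a bin centre, per the condition above):

    // Goertzel filter: magnitude of a single frequency in a block of N samples.
    static double goertzelMagnitude(double[] samples, double toneHz, double sampleRateHz) {
        int n = samples.length;
        int k = (int) Math.round(n * toneHz / sampleRateHz); // nearest bin
        double omega = 2.0 * Math.PI * k / n;
        double coeff = 2.0 * Math.cos(omega);
        double sPrev = 0, sPrev2 = 0;
        for (double x : samples) {
            double s = x + coeff * sPrev - sPrev2;
            sPrev2 = sPrev;
            sPrev = s;
        }
        double power = sPrev * sPrev + sPrev2 * sPrev2 - coeff * sPrev * sPrev2;
        return Math.sqrt(Math.max(power, 0));
    }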
I am trying to do an FFT and extract high-frequency features on smartphones. It turns out to be too slow to do a full FFT on 44100 Hz sampled data on a smartphone, but downsampling will kill the high-frequency information because of the Nyquist theorem. Is there a way to speed up the FFT while retaining the higher frequencies?
It is not clear if you want to use the FFT information or if it is just a way to implement some filter.
For the first case you can subsample the data, i.e., run a highpass filter and then compress (downsample) the sequence. Yes, there will be aliasing, but you can still map particular frequencies from the FFT back to the original higher frequencies.
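As a rough sketch (with 44100 Hz input decimated by 2 to 22050 Hz; the highpass filter itself is assumed to have been applied separately and is not shown):

    // Assumes x has already been highpass-filtered above newRate/2 (11025 Hz here),
    // so everything left in the decimated signal is an alias of a high frequency.
    static double originalFrequency(int bin, int fftLength, double newRate) {
        double aliasHz = bin * newRate / fftLength; // frequency as seen after decimation
        return newRate - aliasHz;                   // folds back into [newRate/2, newRate]
    }

    // Decimate by 2: 44100 Hz -> 22050 Hz.
    static double[] decimateBy2(double[] x) {
        double[] y = new double[x.length / 2];
        for (int i = 0; i < y.length; i++) y[i] = x[2 * i];
        return y;
    }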
If it is filtering, the filter should be reasonably long before you get any benefit from transform-based filtering. Also, if you do this, make sure you read up on overlap-add and overlap-save filtering, and do not go with the all too common "let's take the FFT, multiply with an 'ideal' response and then take an IFFT". That will in general not give the expected result (unless you expect a transfer function which is time-varying and different from the 'ideal' one).
I'm writing a small piece of Renderscript to dynamically take an image and sort the pixels into 'buckets' based on each pixel's RGB values. The number of buckets could vary, so my instinct would be to create an ArrayList. This isn't possible within Renderscript, obviously, so I was wondering what the approach would be to creating a dynamic list of structs within the script. Any help greatly appreciated.
There's no clear answer to this. The problem is that dynamic memory management is anathema to platforms like RenderScript--it's slow, implies a lot of things about page tables and TLBs that may not be easy to guarantee from a given processor at an arbitrary time, and is almost never an efficient way to do what you want to do.
What the right alternative is depends entirely on what you're doing with the buckets after they're created. Do you need everything categorized without sorting everything into buckets? Just create a per-pixel mask (or use the alpha channel) and store the category alongside the pixel data. Do you have some upper bound on the size of each bucket? Allocate every bucket to be that size.
Sorry that this is open-ended, but memory management is one of those things that brings high-performance code to a screeching halt. Workarounds are necessary, but the right workaround varies in every case.
I'll try to answer your goal question of classifying pixel values, and not your title question of creating a dynamically-sized list of structs.
Without knowing much about your algorithm, I will frame my answer around two possible algorithms:
RGB Joint Histogram
Does not use neighboring pixel values.
Connected Component
Requires neighboring pixel values.
Requires a supporting data structure called "Disjoint set".
Common advice.
Both algorithms require a lot of memory per worker thread. Also, both algorithms are poorly adapted to the GPU because they require some kind of random memory access (see note). It is therefore likely that both will end up being executed on the CPU, so it is a good idea to reduce the number of "threads" to avoid multiplying the memory requirement.
Note: Non-coalesced (non-sequential) memory access - reads, writes, or both.
RGB Joint Histogram
The best way is to compute a joint color histogram using Renderscript, and then run your classification algorithm on the histogram instead (presumably on the CPU). After that, you can perform a final step of pixel-wise label assignment back in Renderscript.
The whole process is almost exactly the same as Tim Murray's Renderscript presentation in Google I/O 2013.
Link to recorded session (video)
Link to slides (PDF)
The joint color histogram will have to have a hard-coded size. For example, a 32x32x32 RGB joint histogram uses 32768 histogram bins. This allows 32 levels of shade for each channel; since each bin then spans 8 of the original 256 levels, the quantization error per channel is at most about +/- 4 levels.
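For instance, mapping a pixel to one of those 32768 bins only needs a few shifts per channel (a sketch in plain Java; the Renderscript kernel would do the equivalent per pixel):

    // Map 8-bit R, G, B values to a bin index in a 32x32x32 joint histogram.
    static int histogramBin(int r, int g, int b) {
        return ((r >> 3) << 10) | ((g >> 3) << 5) | (b >> 3); // 0 .. 32767
    }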
Connected Component
I have successfully implemented multi-threaded connected component labeling on Renderscript. Note that my implementation is limited to execution on CPU; it is not possible to execute my implementation on the GPU.
Prerequisites.
Understand the Union-Find algorithm (and its various theoretical parts, such as path-compression and ranking) and how connected-component labeling benefits from it.
Some design choices.
I use a 32-bit integer array, same size as the image, to store the "links".
Linking occurs in the same way as Union-Find, except that I do not have the benefit of ranking. This means the tree may become highly unbalanced, and therefore the path length may become long.
On the other hand, I perform path-compression at various steps of the algorithm, which counteracts the risk of suboptimal tree merging by shortening the paths (depths).
One small but important implementation detail.
The value stored in the integer array is essentially an encoding of the (x, y) coordinates of either (i) the pixel itself, if the pixel is its own root, or (ii) a different pixel that has the same label as the current pixel.
Steps.
The multi-threaded stage.
Divide the image into small tiles.
Inside each tile, compute the connected components, using label values local to that tile.
Perform path compression inside each tile.
Convert the label values into global coordinates and copy the tile's labels into the main result matrix.
The single-threaded stage.
Horizontal stitching.
Vertical stitching.
A global round of path-compression.
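For reference, the find/union core of such a Union-Find over a flat int array looks roughly like this (a sketch without ranking, matching the design choices above; path compression happens inside find):

    // labels[i] == i means pixel i is its own root; otherwise it points to its parent.
    static int find(int[] labels, int i) {
        int root = i;
        while (labels[root] != root) root = labels[root];
        // Path compression: point every node on the path directly at the root.
        while (labels[i] != root) {
            int next = labels[i];
            labels[i] = root;
            i = next;
        }
        return root;
    }

    static void union(int[] labels, int a, int b) {
        int ra = find(labels, a);
        int rb = find(labels, b);
        if (ra != rb) labels[Math.max(ra, rb)] = Math.min(ra, rb); // no ranking, as noted above
    }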
So, I've been struggling with this problem for some time, and haven't had any luck tapping the wisdom of the internets and related SO posts on the subject.
I am writing an Android app that uses the ubiquitous Accelerometer, but I seem to be getting an incredible amount of "noise" even while at rest, and can't seem to figure out how to deal with it as my readings need to be relatively accurate. I thought that maybe my phone (HTC Incredible) was dysfunctional, but the sensor seems to work well with other games and apps I've played.
I've tried to use various "filters" but I can't seem to wrap my mind around them. I understand that gravity must be dealt with in some way, and maybe that's where I am going wrong. Currently I have tried this, adapted from an SO answer, which refers to an example from the iPhone SDK:
// Low-pass filter: accel[] tracks the slowly changing gravity component.
accel[0] = event.values[0] * kFilteringFactor + accel[0] * (1.0f - kFilteringFactor);
accel[1] = event.values[1] * kFilteringFactor + accel[1] * (1.0f - kFilteringFactor);
// High-pass result: subtracting the gravity estimate leaves the fast changes.
double x = event.values[0] - accel[0];
double y = event.values[1] - accel[1];
The poster says to "play with" the kFilteringFactor value (kFilteringFactor = 0.1f in the example) until satisfied. Unfortunately I still seem to get a lot of noise, and all this seems to do is make the readings come in as tiny decimals, which doesn't help me all that much, and it appears to just make the sensor less sensitive. The math centers of my brain are also atrophied from years of neglect, so I don't completely understand how this filter is working.
Can someone explain to me in some detail how to go about getting a useful reading from the accelerometer? A succinct tutorial would be an incredible help, as I haven't found a really good one (at least aimed at my level of knowledge). I get frustrated because I feel like all of this should be more apparent to me. Any help or direction would be greatly appreciated, and of course I can provide more samples from my code if needed.
I hope I'm not asking to be spoon-fed too much; I wouldn't be asking unless I'd been trying to figure it out for a while. It also looks like there is some interest from other SO members.
To get a useful reading from the accelerometer you can use the magnitude of the acceleration vector: magnitude = sqrt(x*x + y*y + z*z). When the phone is at rest this magnitude is just gravity, about 9.8 m/s^2, so if you subtract that (SensorManager.GRAVITY_EARTH) you will get a reading of roughly 0 at rest. As for noise, Blrfl might be right about cheap accelerometers; even when my phone is at rest it continuously flickers by a few tenths of a unit. You could just set a small threshold, e.g. 0.4 m/s^2, and if the magnitude doesn't go over that, treat the phone as at rest.
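Inside onSensorChanged that could look like the following sketch (the 0.4 threshold is just an example value to tune per device):

    float ax = event.values[0], ay = event.values[1], az = event.values[2];
    double magnitude = Math.sqrt(ax * ax + ay * ay + az * az); // ~9.81 m/s^2 at rest
    double linear = magnitude - SensorManager.GRAVITY_EARTH;   // ~0 at rest
    boolean moving = Math.abs(linear) > 0.4;                    // noise threshold, tune per device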
Partial answer:
Accuracy. If you're looking for high accuracy, the inexpensive accelerometers you find in handsets won't cut the mustard. For comparison, a three-axis sensor suitable for industrial or scientific use runs north of $1,500 for just the sensor; adding the hardware to power it and turn its readings into something a computer can use doubles the price. The sensor in a handset runs well below $5 in quantity.
Noise. Cheap sensors are inaccurate, and inaccuracy translates to noise. An inaccurate sensor that isn't moving won't always show zeros; it will show values on either side of zero within some range. About the best you can do is characterize the sensor while it's motionless to get some idea of how noisy it is, and use that to round your measurements to a less precise scale based on the expected error. (In other words, if it's within ±x m/s^2 of zero, it's safe to say the sensor's not moving, but you can't be completely sure because it could be moving very slowly.) You'll have to do this on every device, because they don't all use the same accelerometer and they all behave differently. I guess that's one advantage the iPhone has: the hardware's pretty much homogeneous.
Gravity. There's some discussion in the SensorEvent documentation about factoring gravity out of what the accelerometer says. You'll notice it bears a lot of similarity to the code you posted, except that it's clearer about what it's doing. :-)
HTH.
How do you deal with jitteriness? You smooth the data. Instead of treating the raw sequence of values from the sensor as your values, you average them on an ongoing basis, and the new sequence formed becomes the values you use. This moves each jittery value closer to the moving average. Averaging necessarily gets rid of quick variations in adjacent values, which is why people use the term low(-frequency)-pass filtering: data that originally may have varied a lot per sample (or unit time) now varies more slowly.
E.g., instead of using the values 10 6 7 11 7 10, you can average them in many ways. For example, we can compute the next value from an equal weight of the running average (i.e., of your last processed data point) and the next raw data point. Using a 50-50 mix for the numbers above, we'd get 10, 8, 7.5, 9.25, 8.125, 9.0625. This new sequence, our processed data, would be used in lieu of the noisy data. And of course we could use a different mix than 50-50.
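The same idea in code (a minimal sketch of an exponential moving average; alpha is the mixing weight):

    // Exponential moving average: smoothed follows the raw data but damps quick jumps.
    double alpha = 0.5;            // 0.5 gives the 50-50 mix above; smaller = smoother
    double smoothed = 10;          // seeded with the first raw value
    double[] raw = {6, 7, 11, 7, 10};
    for (double r : raw) {
        smoothed = alpha * r + (1 - alpha) * smoothed;
        System.out.println(smoothed); // 8.0, 7.5, 9.25, 8.125, 9.0625
    }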
As an analogy, imagine you are reporting where a certain person is located using only your eyesight. You have a good view of the wider landscape, but the person is engulfed in a fog. You will see pieces of the body that catch your attention .. a moving left hand, a right foot, shine off eyeglasses, etc, that are jittery, BUT each value is fairly close to the true center of mass. If we run some sort of running averaging, we'd get values that approach the center of mass of that target as it moves through the fog and are in effect more accurate than the values we (the sensor) reported which was made noisy by the fog.
Now it seems like we are losing potentially interesting data to get a boring curve. It makes sense though. If we are trying to recreate an accurate picture of the person in the fog, the first task is to get a good smooth approximation of the center of mass. To this we can then add data from a complementary sensor/measuring process. For example, a different person might be up close to this target. That person might provide very accurate description of the body movements, but might be in the thick of the fog and not know overall where the target is ending up. This is the complementary position to what we first got -- the second data gives detail accurately without a sense of the approximate location. The two pieces of data would be stitched together. We'd low pass the first set (like your problem presented here) to get a general location void of noise. We'd high pass the second set of data to get the detail without unwanted misleading contributions to the general position. We use high quality global data and high quality local data, each set optimized in complementary ways and kept from corrupting the other set (through the 2 filterings).
Specifically, we'd mix in gyroscope data -- data that is accurate in the local detail of the "trees" but gets lost in the forest (drifts) -- into the data discussed here (from accelerometer) which sees the forest well but not the trees.
To summarize, we low-pass the data from the sensor that is jittery but stays close to the "center of mass". We combine this smooth base value with data that is accurate in the detail but drifts, so this second set is high-pass filtered. We get the best of both worlds, as we process each group of data to clean it of its incorrect aspects. For the accelerometer, we smooth/low-pass the data by running some variation of a running average on its measured values. If we were treating the gyroscope data, we'd do math that effectively keeps the detail (accepts the deltas) while rejecting the accumulated error that would eventually grow and corrupt the accelerometer's smooth curve. How? Essentially, we use the actual gyro values (not averages), but only a small number of samples (of deltas) at a time when deriving each final clean value. Using a small number of deltas keeps the overall curve mostly along the averages tracked by the low-pass stage (the averaged accelerometer data), which forms the bulk of each final data point.
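Put together, that pairing is often written as a complementary filter; here is a minimal sketch (alpha, dt and the angle inputs are illustrative assumptions, not anything prescribed by a particular API):

    class ComplementaryFilter {
        double alpha = 0.98; // weight on the gyro-integrated estimate
        double dt = 0.02;    // seconds between samples (e.g. 50 Hz)
        double angle = 0;    // current fused estimate

        // Trust the gyro for short-term changes (high-pass), the accelerometer
        // for the long-term reference (low-pass).
        double update(double gyroRateDegPerSec, double accelAngleDeg) {
            angle = alpha * (angle + gyroRateDegPerSec * dt) + (1 - alpha) * accelAngleDeg;
            return angle;
        }
    }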