How can I get the volume of the audio stream coming in through the android microphone?
There may be a simpler way, but a method that I know works is to use an AudioRecord object configured to capture the MediaRecorder.AudioSource.MIC audio source and record in 8-bit (AudioFormat.ENCODING_PCM_8BIT). You would need a background thread that constantly polls the object for audio with the read() call, which fills a given byte array with audio data from the mic.
Because you are recording in 8-bit, each audio sample would range from -128 to 127 if the samples were signed (although on Android, 8-bit PCM is typically unsigned, so the range is 0 to 255 with 128 representing silence). You would have to experiment with taking either the maximum-magnitude byte from the returned byte array, or perhaps the RMS average, depending on your application's needs. The size of the byte array also determines how frequently your application can sample the input audio volume.
If you are forced to record in 16-bit PCM, then you would have to look at the value of every other byte, because each sample spans two bytes. The trick is knowing which byte to look at: for the little-endian PCM that Android devices produce, it is the odd-indexed (high) bytes in the returned array. Or, if you want higher fidelity, you could combine both bytes of each sample, which gives you the volume within a range of 2^15 rather than just 2^7.
Since the byte type is signed in Java, you could simply look at every buffer[index * 2 + 1] byte and save the largest one. This would give you the max volume detected over the given sample.
As stated in my other answer, you could also take the average of these values, depending on what you are using this number for.
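For example, here is a rough sketch of such a polling loop using 16-bit PCM (assuming the RECORD_AUDIO permission is granted and 44100 Hz is supported; the class name and constants are made up for illustration):

import android.media.AudioFormat;
import android.media.AudioRecord;
import android.media.MediaRecorder;

public class MicLevelReader implements Runnable {
    private static final int SAMPLE_RATE = 44100; // assumed; 8000 also works on most devices

    private volatile boolean running = true;

    @Override
    public void run() {
        int minBufBytes = AudioRecord.getMinBufferSize(SAMPLE_RATE,
                AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT);
        AudioRecord recorder = new AudioRecord(MediaRecorder.AudioSource.MIC,
                SAMPLE_RATE, AudioFormat.CHANNEL_IN_MONO,
                AudioFormat.ENCODING_PCM_16BIT, minBufBytes * 2);
        short[] buffer = new short[minBufBytes / 2];
        recorder.startRecording();
        try {
            while (running) {
                int read = recorder.read(buffer, 0, buffer.length);
                int peak = 0;
                long sumSquares = 0;
                for (int i = 0; i < read; i++) {
                    int sample = Math.abs(buffer[i]);
                    if (sample > peak) peak = sample;
                    sumSquares += (long) buffer[i] * buffer[i];
                }
                double rms = read > 0 ? Math.sqrt((double) sumSquares / read) : 0;
                // peak is in 0..32767; rms is the average level over this buffer
            }
        } finally {
            recorder.stop();
            recorder.release();
        }
    }

    public void stop() { running = false; }
}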
This question is very important for me, as I want to make the audio frames received from the AudioRecord API independent of absolute time. So basically, the question is: let's say I have called AudioRecord.startRecording(). After that, I start a thread (call it Thread1) that reads audio frames from the AudioRecord instance using AudioRecord.read(...). While my application is running, suppose Thread1 gets stalled for 500 milliseconds. When Thread1 resumes, would I lose some audio data, or does AudioRecord maintain a buffer to handle this (which I have a strong hunch it does)?
If yes, what is the size of the buffer that the AudioRecord maintains?
Is it defined in terms of the frame size for the AudioRecord for the device or some absolute time duration?
Also, how much latency can I expect from the time I call AudioRecord.startRecording() till when it actually starts recording the data?
I know that I have asked a lot of questions and I will be really grateful if someone could actually answer these.
The only buffer that AudioRecord uses to keep data before the user consumes it is the buffer created during initialization. You can set its size via the bufferSizeInBytes constructor parameter. If the user doesn't read PCM for a while, data will be lost and replaced with new data. So to ensure that you don't lose anything, specify a reasonably large buffer size. The calculation is simple:
bufSz = samplingFreqHz * sampleSize * channelNum * bufCapacitySec;
E.g. to hold 5 sec of stereo 16-bit PCM sampled at 44100 Hz you need 44100 * 2 * 2 * 5 = 882000 bytes. So just decide how long your reader thread may sleep and provide a buffer size large enough for AudioRecord to accumulate all data during that sleep.
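For instance, here is a minimal sketch of that sizing applied when constructing the recorder (assuming 16-bit stereo at 44100 Hz and a 5-second safety margin; variable names are illustrative):

import android.media.AudioFormat;
import android.media.AudioRecord;
import android.media.MediaRecorder;

int samplingFreqHz = 44100;
int bytesPerSample = 2;   // 16-bit PCM
int channelCount = 2;     // stereo
int bufCapacitySec = 5;   // how long the reader thread may stall

int desiredBytes = samplingFreqHz * bytesPerSample * channelCount * bufCapacitySec; // 882000

// Never go below the device minimum, whatever margin you choose.
int minBytes = AudioRecord.getMinBufferSize(samplingFreqHz,
        AudioFormat.CHANNEL_IN_STEREO, AudioFormat.ENCODING_PCM_16BIT);

AudioRecord recorder = new AudioRecord(MediaRecorder.AudioSource.MIC,
        samplingFreqHz, AudioFormat.CHANNEL_IN_STEREO,
        AudioFormat.ENCODING_PCM_16BIT, Math.max(desiredBytes, minBytes));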
To address your questions one by one:
If yes, what is the size of the buffer that the AudioRecord maintains?
It is your responsibility to pass a proper size to the constructor call.
Is it defined in terms of the frame size for the AudioRecord for the device or some absolute time duration?
It is just a number of bytes; you should calculate the proper size yourself.
Also, how much latency can I expect from the time I call AudioRecord.startRecording() till when it actually starts recording the data?
There is no perfect answer. It depends on the actual device and OS version. Audio recording is implemented as a dedicated process, and your commands and recorded data go through IPC, which in general has unpredictable delays. There are some nice articles about audio latency in Android (they are mostly about playback, but I guess much of it applies to recording too).
I'd like to analyze a piece of a recorded sound sample and find its properties, like pitch and so on.
I have tried to analyze the recorded bytes of the buffer with no success.
How can it be done?
You will have to look into the FFT (fast Fourier transform).
Then do something like this pseudocode indicates:
Complex in[1024];
Complex out[1024];
// copy your signal into in
FFT(in, out);
// for every element of out, compute sqrt(re^2 + im^2)
// to find the frequency with the highest power, scan for the maximum value among the first 512 elements of out
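If you just want to experiment without pulling in an FFT library, a naive DFT is enough to see the idea. This is only a sketch (O(N^2), far slower than a real FFT), with illustrative names, assuming 16-bit PCM input:

/** Naive DFT magnitude spectrum; fine for experiments, use a real FFT library for production. */
public static double[] magnitudeSpectrum(short[] pcm) {
    int n = pcm.length;                       // e.g. 1024 samples
    double[] magnitudes = new double[n / 2];  // only the first half is meaningful for real input
    for (int k = 0; k < n / 2; k++) {
        double re = 0, im = 0;
        for (int t = 0; t < n; t++) {
            double angle = 2 * Math.PI * k * t / n;
            re += pcm[t] * Math.cos(angle);
            im -= pcm[t] * Math.sin(angle);
        }
        magnitudes[k] = Math.sqrt(re * re + im * im);
    }
    return magnitudes;
}
// The bin with the largest magnitude is the strongest frequency:
// frequencyHz = maxIndex * sampleRateHz / n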
Also check out that user's original post (linked here), because this question is probably a duplicate.
Use a fast Fourier transform; libraries are available for most languages. Raw bytes on their own are no good: they could be MP3-encoded or WAV/PCM, so you need to decode first, then analyze.
I am trying to find maximum amplitude value from PCM Buffer.
My questions are-
1) I found that to find this value in dB, the formula is: amplDB = 20 * log10(abs(ampl) / 32767). Now, given that ampl is in the range -32768 to 32767, the value of log10(abs(ampl) / 32767) will always be zero or negative. So is this formula correct? Should I just negate the value of amplDB?
2) My values are coming out very high. Even for a normal song, the maximum amplitude value is 32767, which doesn't seem correct. What are the usual amplitude values for a song?
3) I found another formula, amplDb = ampl / 2700. What is this 2700 for?
4) Is there any other way I can calculate the amplitude value?
Thanks
The formula you are using is correct. Keep in mind that dB is a relative measurement: it compares a level with a reference level you choose. Therefore, it makes sense that your result is always zero or negative, since the reference level used in the formula is the maximum PCM level. In other words, your dB value will always be at or below your maximum level (0 dB).
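As a quick sketch of that formula applied to a whole buffer of 16-bit samples (the method name is illustrative):

/** Peak level of a 16-bit PCM buffer in dB relative to full scale (dBFS). */
public static double peakDbfs(short[] pcm) {
    int peak = 1; // avoid log10(0) for an all-zero buffer
    for (short s : pcm) {
        int magnitude = Math.abs(s);
        if (magnitude > peak) peak = magnitude;
    }
    return 20.0 * Math.log10(peak / 32767.0); // ~0 dB at full scale, more negative for quieter buffers
}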
Regarding the values you're obtaining, it is quite normal to hit the maximum amplitude. If it is a commercial song, a common mastering practice is to boost the signal as much as possible. If it is a recording you made, it could have to do with the microphone's sensitivity and the sounds you're recording.
Finally, just to be clear, this has nothing to do with the sound pressure levels at which the sound will happen upon playback, since you're only looking at the differences in amplitude of a recorded sound.
I intend to encode YUV data to H.264 format on the Android platform. I have it all implemented, but I have one small query regarding the DSI (codec-specific data) returned by the dequeueOutputBuffer() call.
Currently, the first call to dequeueOutputBuffer() gives me the DSI data back. So for the first YUV frame fed to the video encoder, I call dequeueOutputBuffer() twice to get the encoded stream; for the remaining frames, I call dequeueOutputBuffer() only once to get the corresponding encoded data. This approach works fine on ARM devices; however, on an x86 device it hangs in dequeueOutputBuffer() while encoding the first YUV frame.
So, my questions are:
Am I missing something w.r.t. the encoder configuration?
Is there a way to get back a combined stream of DSI + encoded data with a single call to dequeueOutputBuffer()?
Hope the question is clear.
The video encoder is going to accept N frames before producing any output. In some cases N will be 1, and you will see an output frame shortly after providing a single input frame. Other codecs will want to gather up a fair bit of video data before starting to produce output. It appears you've managed to resolve your current situation by doubling-up frames and discarding half the output, but you should be aware that different devices and different codecs will behave differently (assuming portability is a concern).
The CSD data is provided in a buffer with the BUFFER_FLAG_CODEC_CONFIG flag set. There is no documented behavior in MediaCodec for if or when such buffers will appear. (In fact, if you're using VP8, it doesn't appear at all.) For AVC, it arrives in the first buffer. If you're not interested in the CSD data, just ignore any packet with that flag set.
Because the buffer info flags apply to the entire buffer of data, the API doesn't provide a way to return a single buffer that has both CSD and encoded-frame data in it.
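As an illustration, here is a rough sketch of a typical output drain loop that treats the CSD buffer separately (assuming API 21+, an already configured and started encoder named encoder, and imports from android.media and java.nio; error handling omitted):

MediaCodec.BufferInfo info = new MediaCodec.BufferInfo();
while (true) {
    int index = encoder.dequeueOutputBuffer(info, 10000 /* timeout in microseconds */);
    if (index == MediaCodec.INFO_TRY_AGAIN_LATER) {
        break; // no output ready yet; feed more input and come back later
    } else if (index == MediaCodec.INFO_OUTPUT_FORMAT_CHANGED) {
        MediaFormat format = encoder.getOutputFormat(); // for AVC this also carries csd-0/csd-1
    } else if (index >= 0) {
        ByteBuffer encoded = encoder.getOutputBuffer(index);
        if ((info.flags & MediaCodec.BUFFER_FLAG_CODEC_CONFIG) != 0) {
            // CSD/DSI only (e.g. SPS/PPS): save it or ignore it, but don't treat it as a frame
        } else if (info.size > 0) {
            // a real encoded frame; info.presentationTimeUs tells you which input it belongs to
        }
        encoder.releaseOutputBuffer(index, false);
        if ((info.flags & MediaCodec.BUFFER_FLAG_END_OF_STREAM) != 0) break;
    }
}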
Note also that the encoder is allowed to reorder output, so you might submit frames 0,1,2 and receive encoded data for 0,2,1. The easiest way to keep track is to supply a presentation time stamp with each frame that uniquely identifies it. Some codecs will use the PTS value to adjust the encoding quality in an attempt to meet the bit rate goal, so you need to use reasonably "real" values, not a trivial integer counter.
I am starting out with audio recording using my Android smartphone.
I successfully saved voice recordings to a PCM file. When I parse the data and print out the signed, 16-bit values, I can create a graph like the one below. However, I do not understand the amplitude values along the y-axis.
What exactly are the units for the amplitude values? The values are signed 16-bit, so they must range from -32K to +32K. But what do these values represent? Decibels?
If I use 8-bit values, then the values must range from -128 to +128. How would that get mapped to the volume/"loudness" of the 16-bit values? Would you just use a 16-to-1 quantisation mapping?
Why are there negative values? I would think that complete silence would result in values of 0.
If someone can point me to a website with information on what's being recorded, I would appreciate it. I found webpages on the PCM file format, but not what the data values are.
Think of the surface of the microphone. When it's silent, the surface is motionless at position zero. When you talk, you cause the air around your mouth to vibrate. Vibrations are spring-like and have movement in both directions: back and forth, up and down, in and out. The vibrations in the air cause the microphone surface to vibrate as well, i.e. move up and down. When it moves down, that might be measured or sampled as a positive value; when it moves up, that might be sampled as a negative value (or it could be the opposite). When you stop talking, the surface settles back down to the zero position.
What numbers you get from your PCM recording data depend on the gain of the system. With common 16-bit samples, the range is from -32768 to 32767 for the largest possible excursion of a vibration that can be recorded without distortion, clipping, or overflow. Usually the gain is set a bit lower so that the maximum values aren't right on the edge of distortion.
ADDED:
8-bit PCM audio is often an unsigned data type, with the range from 0..255, with a value of 128 indicating "silence". So you have to add/subtract this bias, as well as scale by about 256 to convert between 8-bit and 16-bit audio PCM waveforms.
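For example, a minimal sketch of that conversion (the method name is illustrative):

/** Convert unsigned 8-bit PCM (0..255, 128 = silence) to signed 16-bit PCM. */
public static short[] pcm8ToPcm16(byte[] pcm8) {
    short[] pcm16 = new short[pcm8.length];
    for (int i = 0; i < pcm8.length; i++) {
        int unsigned = pcm8[i] & 0xFF;               // undo Java's signed byte interpretation
        pcm16[i] = (short) ((unsigned - 128) << 8);  // remove the bias, then scale by 256
    }
    return pcm16;
}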
The raw numbers are an artefact of the quantization process used to convert an analog audio signal into digital. It makes more sense to think of an audio signal as a vibration around 0, extending as far as +1 and -1 for maximum excursion of the signal. Outside that, you get clipping, which distorts the harmonics and sounds terrible.
However, computers don't work all that well with fractions, so that range is mapped onto 65,536 discrete integer steps (-32768 to +32767 for signed 16-bit samples). In most applications like this, +32767 is considered the maximum positive excursion of the microphone's or speaker's diaphragm. There is no correlation between a sample value and a sound pressure level, unless you start factoring in the characteristics of the recording (or playback) circuits.
(BTW, 16-bit audio is very standard and widely used. It is a good balance of signal-to-noise ratio and dynamic range. 8-bit is noisy unless you do some funky non-standard scaling.)
Lots of good answers here, but they don't directly address your questions in an easy to read way.
What exactly are the units for the amplitude values? The values are signed 16-bit, so they must range from -32K to +32K. But what do these values represent? Decibels?
The values have no unit. They simply represent a number that has come out of an analog-to-digital converter. The numbers from the A/D converter are a function of the microphone and pre-amplifier characteristics.
If I use 8-bit values, then the values must range from -128 to +128. How would that get mapped to the volume/"loudness" of the 16-bit values? Would you just use a 16-to-1 quantisation mapping?
I don't understand this question. If you are recording 8-bit audio, your values will be 8-bits. Are you converting 8-bit audio to 16-bit?
Why are there negative values? I would think that complete silence would result in values of 0
The diaphragm on a microphone vibrates in both directions and as a result creates positive and negative voltages. A value of 0 is silence as it indicates that the diaphragm is not moving. See how microphones work
For more details on how sound is represented digitally, see here.
Why are there negative values? I would think that complete silence would result in values of 0

The diaphragm on a microphone vibrates in both directions and as a result creates positive and negative voltages. A value of 0 is silence as it indicates that the diaphragm is not moving. See how microphones work
Small clarification: it is the position of the diaphragm that is being recorded. Silence occurs when there is no vibration, i.e. when there is no change in position. The vibration you are seeing is what pushes the air and creates changes in air pressure over time. At the top and bottom peaks of any vibration the air is momentarily not being pushed, so those instants are where momentary silence occurs. The loudest part of the signal is where the position changes the fastest, which is somewhere between the peaks. The speed with which the diaphragm moves from one peak to another determines the amount of pressure it generates. When the top and bottom peaks are reduced to zero (or to some other value they share), there is no vibration and no sound at all. Likewise, as the diaphragm slows down so that there is a greater span of time between peaks, less sound pressure is generated or recorded.
I recommend the Yamaha Sound Reinforcement Handbook for more in-depth reading. A basic understanding of calculus also helps with understanding audio and vibration.
The 16-bit numbers are the A/D converter values from your microphone (you knew this). Know also that the amplifier between your microphone and the A/D converter has an Automatic Gain Control (AGC). The AGC actively changes the amplification of the microphone signal to prevent too much voltage from hitting the A/D converter (usually < 2 volts DC). There is also DC decoupling that biases the input signal to the middle of the A/D converter's range (say 1 volt DC).
So, when there is no sound hitting the microphone, the AGC amplifier sends a flat 1.0 V DC signal to the A/D converter. When sound waves hit the microphone, they create a corresponding AC voltage wave. The AGC amp takes the AC voltage wave, centers it at 1.0 V DC, and sends it to the A/D converter. The A/D samples (measures) the voltage, say 44,100 times per second, and spits out the signed 16-bit value of that voltage: -32,768 corresponds to 0.0 V DC and +32,767 to 2.0 V DC, so a value of +100 is about 1.003 V DC and -100 is about 0.997 V DC at the A/D converter.
Positive values are above 1.0 V DC; negative values are below 1.0 V DC.
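A tiny sketch of that mapping, under this answer's assumed 0-2 V input range and 1.0 V mid-rail bias (purely illustrative):

/** Map a signed 16-bit sample back to the assumed input voltage (0..2 V range, biased at 1.0 V). */
public static double sampleToVolts(short sample) {
    double voltsPerStep = 2.0 / 65536.0;  // about 30.5 microvolts per code
    return 1.0 + sample * voltsPerStep;   // e.g. +100 -> ~1.003 V, -32768 -> 0.0 V
}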
Note that most audio systems apply a logarithmic curve to the audio wave so that the human ear can hear it better. In digital audio systems (with ADCs), digital signal processing applies this curve to the signal. DSP chips are big business; TI has made a fortune selling them for all kinds of applications, not just audio processing. DSPs can run very complicated math on a real-time stream of data that would choke an iPhone's ARM7 processor. Say you are sending 2 MHz pulses to an array of 256 ultrasound sensors/receivers; you get the idea.