I am currently writing some code for a sample sequencer in Android, using the AudioTrack class. I have been told the only proper way to get accurate timing is to use the timing of the AudioTrack itself. E.g. I know that if I write a buffer of X samples to an AudioTrack playing at 44100 samples per second, the time to write it will be X * (1/44100) seconds.
Then you use that info to know what samples should be written when.
I am trying to implement my first attempt using this approach. I am using only one sample and am writing it as continuous 16th notes at a tempo of 120 BPM. But for some reason it is playing at a rate of 240 BPM.
First I checked my code that derives the duration of a 16th note (in nanoseconds) at tempo X. It checks out.
private void setPeriod()
{
    // milliseconds per beat: (60 / TEMPO) * 1000, i.e. 500 ms at 120 BPM
    period = (int) ((1 / (((double) TEMPO) / 60)) * 1000);
    // convert to nanoseconds and divide by 4 for a 16th note: 125,000,000 ns at 120 BPM
    period = (period * 1000000) / 4;
    Log.i("test", String.valueOf(period));
}
Then I verified my code that gets the time, in nanoseconds, for my buffer to be played at 44100 Hz, and it is correct.
long bufferTime=(1000000000/SAMPLE_RATE)*buffSize;
So now I am left thinking that the AudioTrack is playing at a rate different from 44100 Hz. Maybe 96000 Hz, which would explain the doubling of speed. But when I instantiate the AudioTrack, it is indeed set to 44100 Hz.
final int SAMPLE_RATE is set to 44100
buffSize = AudioTrack.getMinBufferSize(SAMPLE_RATE, AudioFormat.CHANNEL_OUT_MONO,
AudioFormat.ENCODING_PCM_16BIT);
track = new AudioTrack(AudioManager.STREAM_MUSIC, SAMPLE_RATE,
AudioFormat.CHANNEL_OUT_MONO,
AudioFormat.ENCODING_PCM_16BIT,
buffSize,
AudioTrack.MODE_STREAM);
So I am confused as to why my tempo is being doubled. I ran a debug to compare elapsed AudioTrack time against elapsed system time, and it seems that the AudioTrack is indeed playing twice as fast as it should be. I am confused.
Just to make sure, this is my play loop.
public void run() {
    // TODO Auto-generated method stub
    int buffSize=192;
    byte[] output = new byte[buffSize];
    int pos1=0; // index for output array
    int pos2=0; // index for sample array
    long bufferTime=(1000000000/SAMPLE_RATE)*buffSize;
    long elapsed=0;
    int writes=0;
    currTrigger=trigger[triggerPointer];
    Log.i("test","period="+String.valueOf(period));
    Log.i("test","bufferTime="+String.valueOf(bufferTime));
    long time=System.nanoTime();
    while(play)
    {
        //fill up the buffer
        while(pos1<buffSize)
        {
            output[pos1]=0;
            if(currTrigger&&pos2<sample.length)
            {
                output[pos1]=sample[pos2];
                pos2++;
            }
            pos1++;
        }
        track.write(output, 0, buffSize);
        elapsed=elapsed+bufferTime;
        writes++;
        //time passed is more than one 16th note
        if(elapsed>=period)
        {
            Log.i("test",String.valueOf(writes));
            Log.i("test","elapsed A.T.="+String.valueOf(elapsed)+" elapsed S.T.="+String.valueOf(System.nanoTime()-time));
            time=System.nanoTime();
            writes=0;
            elapsed=0;
            triggerPointer++;
            if(triggerPointer==16)
                triggerPointer=0;
            currTrigger=trigger[triggerPointer];
            pos2=0;
        }
        pos1=0;
    }
}
}
Edited: rephrased and updated due to my initial erroneous assumption that system time was used to synchronize the sequenced audio :)
As for the audio playing back at twice the speed: this is a bit strange, as the write() method of AudioTrack blocks until the native layer has enqueued the next buffer. Are you sure the render loop isn't being invoked from two different places (although from your example I assume you invoke the loop from within a single thread)?
However, what is certain is that there is a time synchronization issue to address: the problem lies in the calculation of the buffer time you use in your example:
(1000000000/SAMPLE_RATE)*buffSize;
which will always return roughly 4,353,741 ns for a buffer size of 192 samples at a sample rate of 44100 Hz, regardless of tempo (it is the same at 300 BPM or at 40 BPM). Now, in your example this doesn't have any consequences for the actual syncing per se, but I'd like to point it out as we'll return to it further on in this text.
Also, nanoseconds are a nicely precise unit, but overkill here: milliseconds will suffice for audio operations. As such, I will continue the illustration in milliseconds.
Your calculation of the period of a 16th note at 120 BPM indeed checks out at the correct value of 125 ms. The previously mentioned period corresponding to each buffer is 4.3537 ms. This means you will iterate the buffer loop 28.7112 times before the time of a single sixteenth note passes. In your example, however, you check whether the "offset" for this sixteenth note has passed at the END of the buffer iteration loop (where the period of a single buffer has already been added to the elapsed time!), by using:
elapsed>=period
which will already drift at the first occasion, as at that moment "elapsed" would be at (192 * 29 iterations) 5568 samples (or 126.26 ms), rather than at (192 * 28.7112 iterations) 5512 samples (or 125 ms). This is a difference of 56 samples (or, in time, about 1.27 ms). This wouldn't of course lead to samples playing back FASTER than expected (as you stated), but it already leads to an irregularity in playback. For the second 16th note (which would occur at the 57.4224th iteration), the drift would be 11136 - 11025 = 111 samples, or 2.517 ms (more than half your buffer time!). As such, you must perform this check WITHIN the
while(pos1<buffSize)
loop, where you are incrementing the buffer write position (pos1) until the size of the buffer has been reached. As such, you will need to increase another counter by exactly one sample (a fraction of the buffer period) per buffer sample, as in the sketch below.
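A minimal sketch of that idea, reusing the variable names from your loop; periodInSamples and elapsedSamples are illustrative names I'm assuming here, where periodInSamples would be the 16th-note length expressed in samples, e.g. ( SAMPLE_RATE * 60 / TEMPO ) / 4. The rest of your outer loop (track.write(), resetting pos1) stays as it is:
// assumption: periodInSamples = ( SAMPLE_RATE * 60 / TEMPO ) / 4, i.e. one 16th note in samples
int elapsedSamples = 0;

while ( pos1 < buffSize )
{
    output[ pos1 ] = 0;
    if ( currTrigger && pos2 < sample.length )
    {
        output[ pos1 ] = sample[ pos2 ];
        pos2++;
    }
    pos1++;

    // advance the clock by exactly one sample and check the 16th-note boundary HERE,
    // inside the buffer loop, so the trigger can switch mid-buffer without drifting
    if ( ++elapsedSamples >= periodInSamples )
    {
        elapsedSamples -= periodInSamples;
        triggerPointer = ( triggerPointer + 1 ) % 16;
        currTrigger = trigger[ triggerPointer ];
        pos2 = 0;
    }
}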
I hope the above example illustrates why I'd initially proposed counting time by sample iterations rather than elapsed time (of course the samples DO indicate time, as they are merely translations of a unit of time to an amount of samples in a buffer, but you can use these numbers as the markers, rather than adding a fixed interval to a counter as in your render loop).
First of all, some convenience math to help you with getting these values :
// calculate the amount of samples necessary for storing the given length of time
// ( in milliseconds ) at the given sample rate ( in Hz )
int millisecondsToSamples( int milliSeconds, int sampleRate )
{
    // multiply before dividing so integer division doesn't throw away precision
    return ( int ) ((( long ) milliSeconds * sampleRate ) / 1000 );
}
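For instance, one 16th note at 120 BPM (125 ms) at 44100 Hz:
int sixteenthInSamples = millisecondsToSamples( 125, 44100 ); // 5512 samples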
OR: these calculations, which are more convenient when thinking in a musical context like you mentioned in your post. They calculate the amount of samples present in a single bar of music at the given sample rate (in Hz), tempo (in BPM) and time signature (timeSigBeatUnit being the "4" and timeSigBeatAmount being the "3" in a time signature of 3/4; although most sequencers limit themselves to 4/4, I've added the calculation to explain the logic):
int samplesPerBeat = ( int ) (( sampleRate * 60 ) / tempo );
int samplesPerBar = samplesPerBeat * timeSigBeatAmount;
int samplesPerSixteenth = ( int ) ( samplesPerBeat / 4 ); // 1/4 of a beat being a 16th
etc.
The way you then write the timed samples into the output buffer is by keeping track of the "playback position" in your buffer callback, i.e. each time you write a buffer, you increment the playback position by the length of the buffer. Returning to a musical context: if you were looping a single bar of 120 BPM in 4/4 time, then when the playback position exceeds (( sampleRate * 60 ) / 120 ) * 4 = 88200 samples, you reset it to 0 to "loop" from the beginning.
So let's assume you have two "events" of audio that occur in a sequence of a single bar of 4/4 time at 120 BPM. One event plays on the 1st beat of the bar and lasts for a quaver (1/8 of the full bar), and the other plays on the 3rd beat of the bar and lasts for another quaver. These two "events" (which you could represent in a value object) would have the following properties. For the first event:
int start = 0; // buffer position 0 is at the 1st beat/start of the bar
int length = 11025; // 1/8 of the full bar size
int end = 11025; // start + length
and the second event:
int start = 44100; // 3rd beat (or half-way through the bar)
int length = 11025;
int end = 55125; // start + length
These value objects could have two additional properties, such as "sample" (the buffer containing the actual audio) and "readPointer" (holding the last sample-buffer index the sequencer read from).
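As a rough sketch (CustomValueObject is just an illustrative name, as in the loop below), such a value object could look like:
// illustrative value object holding one sequenced audio event
class CustomValueObject
{
    int start;        // sequence position (in samples) at which the event begins
    int length;       // event length in samples
    int end;          // start + length
    byte[] sample;    // buffer containing the actual audio for this event
    int readPointer;  // last sample-buffer index the sequencer has read from

    CustomValueObject( int start, int length, byte[] sample )
    {
        this.start       = start;
        this.length      = length;
        this.end         = start + length;
        this.sample      = sample;
        this.readPointer = 0;
    }
}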
Then in the buffer write loop:
int playbackPosition = 0; // at start of bar
int maximumPlaybackPosition = 88200; // i.e. a single bar of 4/4 at 120 bpm

public void run()
{
    // loop through list of "audio events" / samples
    for ( CustomValueObject audioEvent : audioEventList )
    {
        // loop through the buffer length this cycle will write
        for ( int i = 0; i < bufferSize; ++i )
        {
            // calculate "sequence position" from playback position and current iteration
            int seqPosition = playbackPosition + i;

            // sequence position within start and end range of audio event ?
            if ( seqPosition >= audioEvent.start && seqPosition <= audioEvent.end )
            {
                // YES! write its sample content into the output buffer
                output[ i ] += audioEvent.sample[ audioEvent.readPointer ];

                // update the sample read pointer to the next slot (but keep in bounds)
                if ( ++audioEvent.readPointer == audioEvent.length )
                    audioEvent.readPointer = 0;
            }
        }
    }
    // update the playback position once per written buffer (after all events are mixed)
    // and keep it within the sequencer range for looping
    playbackPosition += bufferSize;
    if ( playbackPosition >= maximumPlaybackPosition )
        playbackPosition -= maximumPlaybackPosition;
}
This should give you a perfectly timed approach to writing audio. There's still some magic to work out for the iteration where the sequence loops (i.e. reading the remaining unprocessed buffer length from the start of the sequence for seamless looping), but I hope this gives you a general idea of a working approach; a rough sketch of that loop handling follows below.
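A minimal sketch of that loop handling, assuming the same playbackPosition, maximumPlaybackPosition and bufferSize as above; writeChunk( sequencePosition, outputOffset, amount ) is a hypothetical helper that runs the event-mixing loop shown earlier for a sub-range of the output buffer:
// how many samples remain until the loop point ?
int samplesUntilLoopEnd = maximumPlaybackPosition - playbackPosition;

if ( samplesUntilLoopEnd >= bufferSize )
{
    // the whole buffer fits before the loop point
    writeChunk( playbackPosition, 0, bufferSize );
    playbackPosition += bufferSize;
}
else
{
    // write the tail of the sequence, then continue from position 0 for a seamless loop
    writeChunk( playbackPosition, 0, samplesUntilLoopEnd );
    writeChunk( 0, samplesUntilLoopEnd, bufferSize - samplesUntilLoopEnd );
    playbackPosition = bufferSize - samplesUntilLoopEnd;
}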
I see many resources recommending that AudioTrack.getTimestamp() be used on modern Android versions to calculate audio latency for audio/video sync.
For instance:
https://stackoverflow.com/a/37625791/332798
https://developer.amazon.com/docs/fire-tv/audio-video-synchronization.html#section1-1
https://groups.google.com/forum/#!topic/android-platform/PoHfyNK54ps
However, none of these explain how to use the timestamp to calculate the latency. I'm struggling to figure out what to do with the timestamp's framePosition/nanoTime to come up with a latency number.
So, prior to this API, you would use AudioTrack.getPlaybackHeadPosition(), which was just an approximation. Thus, to account for latency you had to offset that value with a latency figure from one of two hidden methods: AudioManager.getOutputLatency() or AudioTrack.getLatency().
With the new AudioTrack.getTimestamp() API, you get a snapshot of the playhead position at a given time, taken directly at the output. As such, it is fully accurate and already accounts for device latency. Thus there's no need to call any other APIs now to add/remove latency.
The caveat is that this timestamp is only a snapshot, and the docs recommend you don't call this new method very often. So the trick to getting the "current" position is to use your last snapshot and linearly interpolate what the current value should be:
playheadPos = timestamp.framePosition +
(System.nanoTime() - timestamp.nanoTime) * samplerate / 1e9;
This position can then be compared against how many frames you've written into the AudioTrack, by maintaining another counter which increments every time AudioTrack.write() completes:
int bytesWritten = track.write(...);
writtenPos += bytesWritten / pcmFrameSize;
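Putting the two together, here is a rough sketch of the actual latency calculation, using the samplerate and writtenPos variables from above; AudioTrack.getTimestamp() returns false when no timestamp is available yet, in which case you would retry later or fall back to the older APIs:
AudioTimestamp timestamp = new AudioTimestamp();

if ( track.getTimestamp( timestamp ) )
{
    // interpolate the current playhead position from the last snapshot
    long playheadPos = timestamp.framePosition
            + ( System.nanoTime() - timestamp.nanoTime ) * samplerate / 1000000000L;

    // frames that have been written but not yet played, i.e. the latency
    long latencyFrames = writtenPos - playheadPos;
    long latencyMs = latencyFrames * 1000 / samplerate;
}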
If you're working with ENCODING_AC3, the playhead position reported by AudioTrack is still in terms of samples. You will either need to convert it to bytes, or convert the number of bytes you've written back into samples. Either way, you will need to know the bitrate of your AC3 stream (e.g. 384000 bps):
int bytesWritten = track.write(...);
writtenPos += bytesWritten * samplerate / (bitrate / 8);
I am using the AudioTrack class to play a stream of raw sound data:
AudioTrack audioTrack;
int sampleRate = 11025;
int channelConfigIn = AudioFormat.CHANNEL_IN_MONO;
int channelConfigOut = AudioFormat.CHANNEL_OUT_STEREO;
int audioFormat = AudioFormat.ENCODING_PCM_16BIT;
.....
int bufferSize = AudioTrack.getMinBufferSize(sampleRate,channelConfigOut,audioFormat);
audioTrack = new AudioTrack(AudioManager.STREAM_VOICE_CALL,sampleRate,channelConfigOut,audioFormat,bufferSize,AudioTrack.MODE_STREAM);
audioTrack.play();
Then on a separate thread:
while (true)
{
    short[] buffer = new short[14500];
    // fill buffer with sound data
    long time = System.currentTimeMillis();
    audioTrack.write(buffer, 0, 14500);
    Log.i("time", (System.currentTimeMillis() - time) + "");
}
My problem is that the log always shows that the write method blocks for about 0.6 seconds, which is about the same as the length of the played sound (14500 samples). Moreover, the phone is not responsive during playback; the main thread can hardly do anything. Can anyone help?
You are using the blocking version of the write() method. You could use write(float[], int, int, int) instead, passing WRITE_NON_BLOCKING as the 4th parameter, but that only writes as much data as currently fits in the playback buffer (a rough sketch of that variant follows below). I generally prefer the approach you already have (dedicating a thread to writing). You should expect each call to block; after all, the playback buffer can only fit so much data, and to make more space, sound must be played (which takes time). 14500 samples is ~1.3 seconds of sound at your chosen sample rate, so I'm guessing it takes you about 0.7 seconds to fill the buffer each time.
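Since your data is in a short[] buffer, note that there is also a short[] overload that takes a write mode parameter (API 23+). A minimal sketch of the non-blocking variant under that assumption; it may write fewer samples than requested, so you have to track how much was actually consumed:
short[] buffer = new short[14500];
// ...fill buffer with sound data...

int offset = 0;
while (offset < buffer.length)
{
    int written = audioTrack.write(buffer, offset, buffer.length - offset,
            AudioTrack.WRITE_NON_BLOCKING);
    if (written < 0)
        break; // an error code such as ERROR_INVALID_OPERATION
    offset += written;
    // do other work here (or sleep briefly) instead of blocking on write()
}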
I cannot tell, based on the code presented, why your UI thread is not responding.
In my app I allow the user to record audio using the phone's camera. While the recording is in progress, I update a Path using time as the X value and a normalized form of getMaxAmplitude() for the Y value.
float amp = Math.min(mRecorder.getMaxAmplitude(), mMaxAmplitude)
/ (float) mMaxAmplitude;
This works rather well.
My problem occurs when I go to play back the audio (after transporting it over the network). I want to recreate the waveform generated while recording, but the MediaPlayer class does not possess the same getMaxAmplitude() method.
I have been attempting to use the Visualizer class provided by the framework, but am having a difficult time getting a usable result for the Y value. The byte array returned contains values between -128 and 127, but when I look at the actual values they do not appear to represent the waveform as I would expect it to be.
How do I use the values returned from the visualizer to get a value related to the loudness of the sound?
Your byte array is probably an array of 16-, 24- or 32-bit signed values. Assuming they are 16-bit signed, the bytes will alternate between the hi byte (with the MSB being the sign bit) and the lo byte, or, depending on the endianness, the lo byte followed by the hi byte. Moreover, if you have two channels of data, the samples of each channel are probably interleaved. Again, assuming 16 bits, you can decode the samples in a manner similar to this:
for (int i = 0; i < numBytes / 2; ++i)
{
    // big-endian: hi byte first; mask the lo byte so its sign bit isn't extended
    sample[i] = (short) ((bytes[i * 2] << 8) | (bytes[i * 2 + 1] & 0xFF));
}
According to the documentation of getMaxAmplitude, it returns the maximum absolute amplitude that was sampled since the last call. I guess this means the peak amplitude, though it's not totally clear from the documentation. To compute the peak amplitude, just compute the max of the absolute values of all the samples:
int maxPeak = 0;
for (int i = 0; i < numSamples; ++i)
{
    maxPeak = Math.max(maxPeak, Math.abs(sample[i]));
}
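To turn that peak into something comparable to the normalized amplitude you were drawing while recording, you can scale it by the 16-bit full-scale value (a small sketch, assuming the decoded 16-bit samples from above):
// 16-bit PCM full scale is 32768; clamp to 1.0f to be safe
float normalized = Math.min(maxPeak / 32768f, 1.0f);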
I'm using an audio recorder to record sound and do some processing in pseudo-realtime on an Android phone.
I'm facing a problem with FFT versus convolution of the audio signal:
I perform an FFT on a known signal (a sine waveform), and using the FFT I always correctly find the single tone contained in it.
Now I want to do the same thing using a convolution (it's an exercise): I perform 5000 convolutions of that signal using 5000 filters. Each filter is a sine waveform at a different frequency between 0 and 5000 Hz.
Then I search for the peak of each convolution output. This way, I should find the maximum peak when I'm using the filter with the same tone that is contained in the signal.
In fact, with a 2 kHz tone I find the max with the 2 kHz filter.
The problem is that when I receive a 4 kHz tone, I find the max in the convolution with the 4200 Hz filter (while the FFT always works fine).
Is it mathematically possible?
What is the problem in my convolution?
This is the convolution function that I wrote:
// I do the convolution and return the max.
// in is the array with the signal
// dataSize is the size of the array in
// kernel is the filter containing the sine at the selected frequency
int convolveAndGetPeak(short[] in, int dataSize, double[] kernel) {
    // to avoid overflow, the kernel must have a maximum amplitude of 1/10 of the max
    int i, j, k;
    int kernelSize = kernel.length;
    int tmpSignalAfterFilter = 0;
    double out;

    // convolution from out[0] to out[kernelSize-2]
    for (i = 0; i < kernelSize - 1; ++i)
    {
        out = 0; // init to 0 before sum
        for (j = i, k = 0; j >= 0; --j, ++k)
            out += in[j] * kernel[k];
        if (Math.abs((int) out) > tmpSignalAfterFilter) {
            tmpSignalAfterFilter = Math.abs((int) out);
        }
    }

    // continue the convolution from out[kernelSize-1] to out[dataSize-1] (last)
    for ( ; i < dataSize; ++i)
    {
        out = 0; // initialize to 0 before accumulating
        for (j = i, k = 0; k < kernelSize; --j, ++k)
            out += in[j] * kernel[k];
        if (Math.abs((int) out) > tmpSignalAfterFilter) {
            tmpSignalAfterFilter = Math.abs((int) out);
        }
    }
    return tmpSignalAfterFilter;
}
The kernel, used as the filter, is generated this way:
// curFreq is the frequency of the filter in Hz
// kernelSamplesSize is the desired length of the filter (number of samples); for time-precision reasons I'm using a length of 20 samples
// sampleRate is the sampling frequency
double[] generateKernel(int curFreq, int kernelSamplesSize, int sampleRate) {
    double[] curKernel = new double[kernelSamplesSize];
    for (int kernelIndex = 0; kernelIndex < curKernel.length; kernelIndex++) {
        // the part that makes this a sine wave...
        curKernel[kernelIndex] = Math.sin((double) kernelIndex * (2 * Math.PI * (double) curFreq / (double) sampleRate));
    }
    return curKernel;
}
If you want to try the convolution, the data contained in the in array is the following:
http://www.tr3ma.com/Dati/signal.txt
Note 1: the sampling frequency is 44100 Hz.
Note 2: the tone contained in the signal is a single 4 kHz tone (even though the convolution has its max peak with the 4200 Hz filter).
EDIT: I also repeated the test in an Excel sheet. The result is the same (of course, I'm using the same algorithm) and the algorithm seems correct to me...
This is the Excel sheet I prepared, if you prefer to work in Excel: http://www.tr3ma.com/Dati/convolutions.xlsm
You change the bandwidth by two factors:
a) The length of your kernel (e.g. a kernel length t of 5 ms produces a rough bandwidth of f >= 200 Hz, estimated via 1/0.005, because Δt·Δf >= 1, see "Heisenberg"), and
b) the window function, which you should definitely implement to make your algorithm work in real-world applications, because otherwise in some cases the sidelobes of some filter outputs could yield more energy than the main lobe of the expected filter output.
But you have another problem: you need to convolve with a 2nd kernel consisting of cosine waves, i.e. the same waves as in the 1st kernel but shifted by 90 degrees. Why is that? Because with only the sine kernel you get a phase-dependent modulation of the filter outputs (e.g. if the phase difference between the input signal and the kernel wave of the identical frequency is 90 degrees, you get an amplitude of 0).
Finally, you combine the outputs of both kernels with Pythagoras, i.e. per output sample sqrt(sineOutput^2 + cosineOutput^2); a sketch of this follows below.
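A rough sketch of that idea (the helper names are illustrative, not your exact code): generate a windowed sine kernel and a windowed cosine kernel, convolve the input with both, and combine the two outputs per output sample with Pythagoras before taking the peak:
// generate a Hann-windowed kernel; phase = 0 gives the sine kernel, phase = PI/2 the cosine kernel
double[] generateWindowedKernel( int curFreq, int kernelSize, int sampleRate, double phase )
{
    double[] kernel = new double[ kernelSize ];
    for ( int i = 0; i < kernelSize; i++ )
    {
        double window = 0.5 - 0.5 * Math.cos( 2.0 * Math.PI * i / ( kernelSize - 1 )); // Hann window
        kernel[ i ] = window * Math.sin( phase + 2.0 * Math.PI * curFreq * i / sampleRate );
    }
    return kernel;
}

// convolve with both kernels and return the peak of the combined (phase-independent) magnitude
double convolveAndGetMagnitudePeak( short[] in, int dataSize, double[] sinKernel, double[] cosKernel )
{
    int kernelSize = sinKernel.length;
    double maxMagnitude = 0.0;
    for ( int i = 0; i < dataSize; i++ )
    {
        double sinOut = 0.0, cosOut = 0.0;
        for ( int k = 0; k < kernelSize; k++ )
        {
            int j = i - k;
            if ( j < 0 )
                break;
            sinOut += in[ j ] * sinKernel[ k ];
            cosOut += in[ j ] * cosKernel[ k ];
        }
        // Pythagoras: the magnitude no longer depends on the phase of the incoming tone
        double magnitude = Math.sqrt( sinOut * sinOut + cosOut * cosOut );
        if ( magnitude > maxMagnitude )
            maxMagnitude = magnitude;
    }
    return maxMagnitude;
}
The cosine kernel is obtained by passing a phase of Math.PI / 2, since sin(x + π/2) = cos(x).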
It seems all correct, apart from the number of samples of the kernel (the filter).
Increasing the size of the filter makes the result more accurate.
I don't know how to calculate the bandwidth of this filter, but it seems clear to me that it's a matter of filter bandwidth. So the filter bandwidth also depends on the number of samples of the filter used in the convolution, relative to the sampling frequency (and maybe also to the tone frequency). Unfortunately I cannot increase the number of samples of my filter too much, since otherwise the phone cannot perform the filtering in realtime.
Note: I need the convolution because I need to identify the precise moment when the tone was fired.
EDIT: I made a comparison between a filter with 20 samples and a filter with 40 samples.
I don't know the formula to obtain the filter bandwidth, but the difference between the two filters is clear in the following image.
EDIT 2: A few days after posting the solution I found how to calculate the bandwidth of such a filter: it's simply the inverse of the filter duration. For example, a kernel of 40 samples at 44100 Hz has a duration of about 907 µs, so the filter bandwidth, with this kernel and a window of the same length, is 1/907 µs ≈ 1.1 kHz.
(source: tr3ma.com)
I'm using the library by @LeffelMania: https://github.com/LeffelMania/android-midi-lib
I'm a musician, but I've only ever made studio recordings, not MIDI, so I don't understand some things.
The thing I want to understand is this piece of code:
// 2. Add events to the tracks
// Track 0 is the tempo map
TimeSignature ts = new TimeSignature();
ts.setTimeSignature(4, 4, TimeSignature.DEFAULT_METER, TimeSignature.DEFAULT_DIVISION);
Tempo tempo = new Tempo();
tempo.setBpm(228);
tempoTrack.insertEvent(ts);
tempoTrack.insertEvent(tempo);
// Track 1 will have some notes in it
final int NOTE_COUNT = 80;
for(int i = 0; i < NOTE_COUNT; i++)
{
int channel = 0;
int pitch = 1 + i;
int velocity = 100;
long tick = i * 480;
long duration = 120;
noteTrack.insertNote(channel, pitch, velocity, tick, duration);
}
OK, I have 228 beats per minute, and I know that I have to insert each note after the previous one. What I don't understand is the duration: is it in milliseconds? That doesn't make sense if I keep the duration = 120 and set my BPM to 60, for example. Nor do I understand the velocity.
MY GOAL
I want to insert notes of X pitch with Y duration.
Could anyone give me a clue?
The way MIDI files are designed, notes are in terms of musical length, not time. So when you insert a note, its duration is a number of ticks, not a number of seconds. By default, there are 480 ticks per quarter note. So that code snippet is inserting 80 sixteenth notes since there are four sixteenths per quarter and 480 / 4 = 120. If you change the tempo, they will still be sixteenth notes, just played at a different speed.
If you think of playing a key on a piano, the velocity parameter is the speed at which the key is struck. The valid values are 1 to 127. A velocity of 0 means to stop playing the note. Typically a higher velocity means a louder note, but really it can control any parameter the MIDI instrument allows it to control.
A note in a MIDI file consists of two events: a Note On and a Note Off. If you look at the insertNote code you'll see that it is inserting two events into the track. The first is a Note On command at time tick with the specified velocity. The second is a Note On command at time tick + duration with a velocity of 0.
Pitch values also run from 0 to 127. If you do a Google search for "MIDI pitch numbers" you'll get dozens of hits showing you how pitch number relates to note and frequency.
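So for your goal of inserting "notes of X pitch with Y duration", you express Y in ticks relative to that 480-ticks-per-quarter resolution. A small sketch using the insertNote() call from your own snippet (the pitch values and the 480 default are based on the explanation above):
final int TICKS_PER_QUARTER = 480; // the library's default resolution

long quarter   = TICKS_PER_QUARTER;     // 480 ticks
long eighth    = TICKS_PER_QUARTER / 2; // 240 ticks
long sixteenth = TICKS_PER_QUARTER / 4; // 120 ticks (the duration used in your snippet)

int channel  = 0;
int velocity = 100;
long tick    = 0; // start of the track

// middle C (pitch 60) for a quarter note, then D (pitch 62) for an eighth note right after it
noteTrack.insertNote(channel, 60, velocity, tick, quarter);
noteTrack.insertNote(channel, 62, velocity, tick + quarter, eighth);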
There is a nice description of timing in MIDI files here. Here's an excerpt in case the link dies:
In a standard MIDI file, there’s information in the file header about “ticks per quarter note”, a.k.a. “parts per quarter” (or “PPQ”). For the purpose of this discussion, we’ll consider “beat” and “quarter note” to be synonymous, so you can think of a “tick” as a fraction of a beat. The PPQ is stated in the last word of information (the last two bytes) of the header chunk that appears at the beginning of the file. The PPQ could be a low number such as 24 or 96, which is often sufficient resolution for simple music, or it could be a larger number such as 480 for higher resolution, or even something like 500 or 1000 if one prefers to refer to time in milliseconds.
What the PPQ means in terms of absolute time depends on the designated tempo. By default, the time signature is 4/4 and the tempo is 120 beats per minute. That can be changed, however, by a “meta event” that specifies a different tempo. (You can read about the Set Tempo meta event message in the file format description document.) The tempo is expressed as a 24-bit number that designates microseconds per quarter-note. That’s kind of upside-down from the way we normally express tempo, but it has some advantages. So, for example, a tempo of 100 bpm would be 600000 microseconds per quarter note, so the MIDI meta event for expressing that would be FF 51 03 09 27 C0 (the last three bytes are the Hex for 600000). The meta event would be preceded by a delta time, just like any other MIDI message in the file, so a change of tempo can occur anywhere in the music.
Delta times are always expressed as a variable-length quantity, the format of which is explained in the document. For example, if the PPQ is 480 (standard in most MIDI sequencing software), a delta time of a dotted quarter note (720 ticks) would be expressed by the two bytes 85 50 (hexadecimal).
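To make the two conversions from the excerpt concrete, here is a small sketch (the class and method names are illustrative): converting BPM to the tempo meta event payload, and encoding a delta time as a variable-length quantity:
class MidiTimingMath
{
    // tempo meta event payload: microseconds per quarter note
    // e.g. 100 BPM -> 600000 -> the bytes 09 27 C0 in FF 51 03 09 27 C0
    static int bpmToMicrosPerQuarter( int bpm )
    {
        return 60_000_000 / bpm;
    }

    // encode a delta time as a MIDI variable-length quantity: 7 bits per byte,
    // most significant group first, continuation bit set on every byte except the last
    static byte[] writeVarLen( int value )
    {
        byte[] tmp = new byte[ 5 ];
        int count = 0;
        do {
            tmp[ count++ ] = ( byte )( value & 0x7F );
            value >>>= 7;
        } while ( value != 0 );

        byte[] out = new byte[ count ];
        for ( int i = 0; i < count; i++ )
        {
            int b = tmp[ count - 1 - i ];
            if ( i < count - 1 )
                b |= 0x80;
            out[ i ] = ( byte ) b;
        }
        return out; // writeVarLen( 720 ) -> 0x85 0x50
    }
}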