I am looking to get the offset in bytes used by the seekTo() method of the MediaPlayer class.
I was wondering if there is any way to retrieve this information directly, and if not, whether there is a way to calculate it myself. For example:
If the media file has a registered bit rate in its metadata and I wanted to seek 10 seconds in, I could use the following calculation:
offset in bytes = 10 (seconds) × bit rate (bits per second) / 8
Can I assume that the MediaPlayer retrieves the bit rate information using the MediaMetadataRetriever class?
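To make the idea concrete, here is a rough sketch of the calculation I have in mind using MediaMetadataRetriever, assuming a constant bitrate and ignoring any ID3/container header offset (the file path is just a placeholder and the helper is hypothetical):

// Hypothetical sketch: naive byte offset for a seek, constant bitrate assumed.
private static long naiveByteOffset(String path, long seekSeconds) throws IOException {
    MediaMetadataRetriever retriever = new MediaMetadataRetriever();
    retriever.setDataSource(path);
    // METADATA_KEY_BITRATE is reported as a string, in bits per second (may be null).
    String bitrateStr = retriever.extractMetadata(MediaMetadataRetriever.METADATA_KEY_BITRATE);
    retriever.release();
    if (bitrateStr == null) {
        return -1;  // no bitrate in the metadata
    }
    long bitrateBps = Long.parseLong(bitrateStr);
    // seconds * (bits per second) / 8 = bytes, ignoring any header/metadata offset
    return seekSeconds * bitrateBps / 8;
}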
I have read the following:
Accuracy of MediaPlayer.seekTo(int msecs)
And I am aware of the issues with variable bit rate, but I am not looking for accuracy in the seekTo() method, rather how to get or calculate the value it uses as the offset when retrieving the new data.
Your objective of implementing seekTo() based on a byte offset is novel, but it comes with multiple challenges. Before going into the seekTo() implementation, a clarification about MediaPlayer and MediaMetadataRetriever: both classes employ a MediaExtractor object internally to retrieve metadata, so MediaPlayer does not use a MediaMetadataRetriever internally.
First, let's consider extracting the bitrate. MediaPlayer is a generic implementation that has to support multiple file formats. Hence, for your design, you need to ensure that the bitrate parameter can be extracted from every file format supported by your system, whether audio-visual formats such as MP4, MPEG-2 TS, AVI, and Matroska, or audio-only formats such as WAV and MP3. In the latest Android implementation, I found that only MP3Extractor exposes the bitrate, through the kKeyBitrate key.
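For completeness, if you want to query the bitrate yourself through the public API, a hedged sketch using MediaExtractor might look like the following; as noted above, many extractors simply do not populate KEY_BIT_RATE, so the containsKey() check matters (the helper name is hypothetical):

// Hypothetical helper: returns the container-reported bitrate in bits/s, or -1 if absent.
private static long readBitrateBps(String path) throws IOException {
    MediaExtractor extractor = new MediaExtractor();
    try {
        extractor.setDataSource(path);
        for (int i = 0; i < extractor.getTrackCount(); i++) {
            MediaFormat format = extractor.getTrackFormat(i);
            // KEY_BIT_RATE is optional; many extractors never set it.
            if (format.containsKey(MediaFormat.KEY_BIT_RATE)) {
                return format.getInteger(MediaFormat.KEY_BIT_RATE);
            }
        }
        return -1;
    } finally {
        extractor.release();
    }
}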
Next, coming to your algorithm, I see the following challenges attached to a size-based seek:
1. Audio and video tracks are stored in an interleaved fashion. Hence, time × bitrate (in bytes) will not be directly usable due to the interleaved nature of your input data.
2. The starting offset needs to be considered. There is format-specific metadata (or boxes) stored at the beginning of the file; this offset also has to be accounted for and differs between formats.
3. If your input has more tracks, such as audio, video, and text, or multiple audio tracks as in a movie, the problem becomes more complex.
4. Video frames are typically irregular in size. Even when a constant bitrate model is employed, video frame sizes can vary significantly with the frame type. Typically, an I-frame/IDR frame in H.264 consumes a large number of bits compared to a P-frame or B-frame; an I-frame can easily be five times the size of a P-frame. This poses practical difficulties for a size-based seekTo() implementation.
5. There is a definite impact from a variable bitrate model, which you have already acknowledged, so I am skipping this point.
Given the points above, and without wanting to discourage you, I feel a size-based implementation is going to be difficult.
Related
I'm building an app where it's important to have accurate seeking in MP3 files.
Currently, I'm using ExoPlayer in the following manner:
public void playPeriod(long startPositionMs, long endPositionMs) {
    // ClippingMediaSource expects its start/end positions in microseconds,
    // hence the * 1000 conversion from milliseconds.
    MediaSource mediaSource = new ClippingMediaSource(
            new ExtractorMediaSource.Factory(mDataSourceFactory).createMediaSource(mFileUri),
            startPositionMs * 1000,
            endPositionMs * 1000
    );
    mExoPlayer.prepare(mediaSource);
    mExoPlayer.setPlayWhenReady(true);
}
In some cases, this approach results in offsets of 1-3 seconds relative to the expected playback times.
I found this issue on ExoPlayer's GitHub. It looks like this is an intrinsic limitation of ExoPlayer with the MP3 format and it won't be fixed.
I also found this question, which seems to suggest that the same issue exists in Android's native MediaPlayer and MediaExtractor.
Is there a way to perform an accurate seek in local (i.e. on-device) MP3 files on Android? I'm more than willing to take any hack or workaround.
MP3 files are not inherently seekable. They don't contain any timestamps. It's just a series of MPEG frames, one after the other. That makes this tricky. There are two methods for seeking an MP3, each with some tradeoffs.
The most common (and fastest) method is to read the bitrate from the first frame header (or perhaps the average bitrate of the first few frame headers), say 128 kbps. Then take the byte length of the entire file and divide it by that bitrate to estimate the time length of the file. Then let the user seek into the file: if they seek 1:00 into a 2:00 file, jump to the byte at the 50% mark of the file size and "needle drop" into the stream. Read the file until the sync word for the next frame header comes by, and then begin decoding.
As you can imagine, this method isn't accurate. At best, you're going to be within half a frame of the target on average. With frame sizes of 576 or 1152 samples (depending on the MPEG version and layer), that's still pretty accurate. However, there are problems with calculating the needle-drop point in the first place. The most common issue is that ID3 tags and similar metadata add size to the file, throwing off the size calculation. A more severe issue is a variable bitrate (VBR) file: if your music is encoded with VBR and the beginning of the track is silent-ish or otherwise easy to encode, the beginning might be 32 kbps whereas one second in might be 320 kbps, a 10x error in calculating the time length of the file!
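To illustrate the needle-drop estimate only (the sync-word scan and ID3 parsing are omitted; id3SizeBytes and bitrateBps are assumed to have been determined already, and the helper name is hypothetical):

// Hypothetical sketch of the "needle drop" estimate for a CBR MP3.
// id3SizeBytes would come from parsing the ID3v2 header; bitrateBps from the first frame header.
private static long estimateNeedleDropOffset(long fileLengthBytes, long id3SizeBytes,
                                             long targetMs, long bitrateBps) {
    long audioBytes = fileLengthBytes - id3SizeBytes;
    long totalMs = audioBytes * 8L * 1000L / bitrateBps;          // estimated track length
    double fraction = Math.min(1.0, (double) targetMs / totalMs); // how far into the audio
    // Start reading here, then scan forward for the next frame sync word before decoding.
    return id3SizeBytes + (long) (fraction * audioBytes);
}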
The second method is to decode the whole file to raw PCM samples. This means you can guarantee sample-accurate seeking, but you must decode at least up to the seek point. If you want a proper time length for the full track, you must decode the whole file. Some 20 years ago, this was painfully slow. Seeking into a track would take almost as long as listening to the track to the point you were seeking to! These days, for short files, you can probably decode them so fast that it doesn't matter so much.
TL;DR: If you must have sample-accurate seeking, decode the files before putting them in your player, but understand the performance penalty before deciding on this tradeoff.
For those who might come across this issue in the future, I ended up simply converting mp3 to m4a. This was the simplest solution in my specific case.
Constant bitrate MP3s are better. The system I used was to record the sample offset of each frame header in the MP3 into a list. To seek, I would jump to the closest frame header before the desired sample using the values in the list, and then read from that location up to the desired sample. This works fairly well but not perfectly, as the rendered waveform is decoded from that reference frame rather than from the values you would get by decoding from the start of the file. If accuracy is required, use libmpg123; it appears to be almost sample-accurate. Note: check the licensing if this is for a commercial app.
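A sketch of the lookup half of that approach (building the list of frame-header offsets is the format-specific part and is omitted here; samplesPerFrame is assumed constant, e.g. 1152 for MPEG-1 Layer III, and the helper is hypothetical):

// Hypothetical seek-table lookup for a constant-bitrate MP3.
// frameOffsets[i] = byte offset of frame i; each frame decodes to samplesPerFrame samples.
private static long frameOffsetForSample(long[] frameOffsets, int samplesPerFrame,
                                         long targetSample) {
    int frameIndex = (int) (targetSample / samplesPerFrame);       // frame containing the sample
    frameIndex = Math.min(frameIndex, frameOffsets.length - 1);
    // Seek to this frame header, decode from here, then skip
    // (targetSample - frameIndex * samplesPerFrame) samples of PCM to land exactly.
    return frameOffsets[frameIndex];
}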
I've just written some iOS code that uses Audio Units to get a mono float stream from the microphone at the hardware sampling rate.
It's ended up being quite a lot of code! First I have to set up an audio session, specifying a desired sample rate of 48kHz. I then have to start the session and inspect the sample rate that was actually returned. This will be the actual hardware sampling rate. I then have to set up an audio unit, implementing a render callback.
But at least I am able to use the hardware sampling rate (so I can be certain that no information is lost through software resampling), and I am able to set the smallest possible buffer size, so that I achieve minimal latency.
What is the analogous process on android?
How can I get down to the wire?
PS Nobody has mentioned it yet but it appears to be possible to work at the JNI level.
The AudioRecord class should be able to help you do what you need from the Java/Kotlin side of things. This will give you raw PCM data at the sampling rate you requested (assuming the hardware supports it.) It's up to your app to read the data out of the AudioRecord class in an efficient and timely manner so it does not overflow the buffer and drop data.
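A minimal sketch of that, assuming API 23+ for the float read path; the 48 kHz request, the isRecording flag, and processMono() are placeholders, the RECORD_AUDIO permission is required, and the read loop should run on its own thread (also check the AudioRecord state before starting):

int requestedRate = 48000;  // assumption: ask for 48 kHz, the hardware may give something else
int minBufferBytes = AudioRecord.getMinBufferSize(requestedRate,
        AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_FLOAT);

AudioRecord recorder = new AudioRecord(MediaRecorder.AudioSource.MIC,
        requestedRate, AudioFormat.CHANNEL_IN_MONO,
        AudioFormat.ENCODING_PCM_FLOAT, minBufferBytes);

int actualRate = recorder.getSampleRate();   // what the platform actually gave you
float[] buffer = new float[minBufferBytes / 4];  // PCM_FLOAT: 4 bytes per sample

recorder.startRecording();
while (isRecording) {                        // hypothetical flag owned by your app
    int read = recorder.read(buffer, 0, buffer.length, AudioRecord.READ_BLOCKING);
    if (read > 0) {
        processMono(buffer, read);           // hypothetical processing callback
    }
}
recorder.stop();
recorder.release();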
I intend to encode YUV data to H.264 format on the Android platform. I've got it all implemented; however, I have one small query regarding the DSI data returned by the dequeueOutputBuffer() call.
Currently, on the first call to dequeueOutputBuffer() I get the DSI data back. So for the first YUV frame input to the video encoder, I'm calling dequeueOutputBuffer() twice to get the encoded stream. For the remaining frames, I call dequeueOutputBuffer() only once to get the corresponding encoded data. This approach works fine on devices with an ARM architecture; however, on a device with an x86 architecture, it hangs in dequeueOutputBuffer() while encoding the first YUV frame.
So, my questions are:
Am I missing something w.r.t. Encoder configuration?
Is there a way to get back a combined stream of DSI + encoded data with a single call to dequeueOutputBuffer()?
Hope the question is clear.
The video encoder is going to accept N frames before producing any output. In some cases N will be 1, and you will see an output frame shortly after providing a single input frame. Other codecs will want to gather up a fair bit of video data before starting to produce output. It appears you've managed to resolve your current situation by doubling-up frames and discarding half the output, but you should be aware that different devices and different codecs will behave differently (assuming portability is a concern).
The CSD data (the codec-specific data, i.e. the "DSI" you refer to) is provided in a buffer with the BUFFER_FLAG_CODEC_CONFIG flag set. There is no documented behavior in MediaCodec for whether or when such buffers will appear. (In fact, if you're using VP8, it doesn't appear at all.) For AVC, it arrives in the first buffer. If you're not interested in the CSD data, just ignore any buffer with that flag set.
Because the buffer info flags apply to the entire buffer of data, the API doesn't provide a way to return a single buffer that has both CSD and encoded-frame data in it.
Note also that the encoder is allowed to reorder output, so you might submit frames 0,1,2 and receive encoded data for 0,2,1. The easiest way to keep track is to supply a presentation time stamp with each frame that uniquely identifies it. Some codecs will use the PTS value to adjust the encoding quality in an attempt to meet the bit rate goal, so you need to use reasonably "real" values, not a trivial integer counter.
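A hedged sketch of a drain loop along those lines, where encoder is an already-configured and started MediaCodec; negative return codes (format changed, try again later) simply end the loop here, and writeSample() is a hypothetical sink:

MediaCodec.BufferInfo info = new MediaCodec.BufferInfo();
int outIndex = encoder.dequeueOutputBuffer(info, 10000 /* timeout in microseconds */);
while (outIndex >= 0) {
    ByteBuffer encoded = encoder.getOutputBuffer(outIndex);
    if ((info.flags & MediaCodec.BUFFER_FLAG_CODEC_CONFIG) != 0) {
        // Codec-specific data (the "DSI"); save it or ignore it, but don't treat it as a frame.
    } else {
        // A real encoded frame; info.presentationTimeUs identifies which input it came from.
        writeSample(encoded, info);   // hypothetical sink
    }
    encoder.releaseOutputBuffer(outIndex, false);
    outIndex = encoder.dequeueOutputBuffer(info, 10000);
}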
I'd like to display decoded video frames from MediaCodec out of order, or omit frames, or show frames multiple times.
I considered configuring MediaCodec to use a Surface, calling MediaCodec.dequeueOutputBuffer() repeatedly, saving the resulting buffer indices, and then later calling MediaCodec.releaseOutputBuffer(desired_index, true), but there doesn't seem to be a way to increase the number of output buffers, so I might run out of them if I'm dealing with a lot of frames to be rearranged.
One idea I'm considering is to use glReadPixels() to read the pixel data into a frame buffer, convert the color format appropriately, then copy it to a SurfaceView when I need the frame displayed. But this seems like a lot of copying (and color format conversion) overhead, especially when I don't inherently need to modify the pixel data.
So I'm wondering if there is a better, more performant way. Perhaps there is a way to configure a different Surface/Texture/Buffer for each decoded frame, and then a way to tell the SurfaceView to display a specific Surface/Texture/Buffer (without having to do a memory copy). It seems like there must be a way to accomplish this with OpenGL, but I'm very new to OpenGL and could use recommendations on areas to investigate. I'll even go NDK if I have to.
So far I've been reviewing the Android docs, and fadden's bigflake and Grafika. Thanks.
Saving copies of lots of frames could pose a problem when working with higher-resolution videos and higher frame counts. A 1280x720 frame, saved in RGBA, will be 1280x720x4 = 3.5MB. If you're trying to save 100 frames, that's 1/3rd of the memory on a 1GB device.
If you do want to go this approach, I think what you want to do is attach a series of textures to an FBO and render to them to store the pixels. Then you can just render from the texture when it's time to draw. Sample code for FBO rendering exists in Grafika (it's one of the approaches used in the screen recording activity).
Another approach is to seek around in the decoded stream. You need to seek to the nearest sync frame before the frame of interest (either by asking MediaExtractor to do it, or by saving off the encoded data with the BufferInfo flags) and decode until you reach the target frame. How fast this is depends on how many frames you need to traverse, the resolution of the frames, and the speed of the decoder on your device. (As you might expect, stepping forward is easier than stepping backward. You may have noticed a similar phenomenon in other video players you've used.)
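A hedged sketch of the extractor-assisted variant, assuming extractor, decoder (configured with an output Surface and started), and targetTimeUs already exist; end-of-stream and output-format-change handling are omitted:

// Jump to the nearest sync (key) frame at or before the target, then decode forward.
extractor.seekTo(targetTimeUs, MediaExtractor.SEEK_TO_PREVIOUS_SYNC);
decoder.flush();   // drop any buffered output from the previous position

boolean reachedTarget = false;
MediaCodec.BufferInfo info = new MediaCodec.BufferInfo();
while (!reachedTarget) {
    int inIndex = decoder.dequeueInputBuffer(10000);
    if (inIndex >= 0) {
        ByteBuffer inBuf = decoder.getInputBuffer(inIndex);
        int size = extractor.readSampleData(inBuf, 0);
        if (size >= 0) {
            decoder.queueInputBuffer(inIndex, 0, size, extractor.getSampleTime(), 0);
            extractor.advance();
        }
    }
    int outIndex = decoder.dequeueOutputBuffer(info, 10000);
    if (outIndex >= 0) {
        // Only render the frame whose timestamp has reached the target; drop the rest.
        boolean render = info.presentationTimeUs >= targetTimeUs;
        decoder.releaseOutputBuffer(outIndex, render);
        reachedTarget = render;
    }
}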
Don't bother with glReadPixels(). Generally speaking, if decoded data is directly accessible from your app, you're going to take a speed hit (more so on some devices than others). Also, the number of buffers used by the MediaCodec decoder is somewhat device-dependent, so I wouldn't count on having more than 4 or 5.
I'm trying to use a C library (Aubio) to perform beat detection on some music playing from a MediaPlayer in Android. To capture the raw audio data, I'm using a Visualizer, which sends a byte buffer at regular intervals to a callback function, which in turn sends it to the C library through JNI.
I'm getting inconsistent results (i.e. almost no beats are detected, and the few that are detected don't really line up with the audio). I've checked multiple times and, while I can't entirely rule out a mistake on my end, I'm wondering how exactly the Android Visualizer behaves, since the documentation is not explicit about it.
If I set the buffer size using setCaptureSize, does that mean that the captured buffer is averaged over the complete audio samples? For instance, if I divide the capture size by 2, will it still represent the same captured sound, but with 2 times less precision on the time axis?
Is it the same with the capture rate? For instance, does setting twice the capture size with half the rate yield the same data?
Are the captures consecutive? To put it another way, if I take too long to process a capture, are the sounds played during the processing ignored when I receive the next capture?
Thanks for your insight!
Make sure the callback function receives the entire audio signal, for instance by counting the frames that come out of the player and comparing that with the number that reach the callback.
It would help to be pointed at the Visualizer documentation.
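For reference, here is a minimal sketch of wiring up the Visualizer at its maximum capture size and rate, with a crude counter to sanity-check how many captures actually arrive; audioSessionId would come from MediaPlayer.getAudioSessionId(), aubioProcess() is a hypothetical stand-in for the JNI call, and the RECORD_AUDIO permission is required:

Visualizer visualizer = new Visualizer(audioSessionId);
visualizer.setCaptureSize(Visualizer.getCaptureSizeRange()[1]);   // largest supported capture size

visualizer.setDataCaptureListener(new Visualizer.OnDataCaptureListener() {
    int captureCount = 0;

    @Override
    public void onWaveFormDataCapture(Visualizer v, byte[] waveform, int samplingRate) {
        captureCount++;                       // compare against elapsed playback time to spot gaps
        aubioProcess(waveform, samplingRate); // hypothetical JNI bridge into the beat detector
    }

    @Override
    public void onFftDataCapture(Visualizer v, byte[] fft, int samplingRate) {
        // not used in this sketch
    }
}, Visualizer.getMaxCaptureRate(), true, false);                  // waveform only, at the max capture rate

visualizer.setEnabled(true);

Keep in mind that the waveform delivered here is an 8-bit capture rather than full-resolution PCM, which by itself can hurt beat detection.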