The short version:
I'm developing a synth app using OpenSL with low latency. I was doing all the audio calculation in the OpenSL callback function (I know I shouldn't, but I did anyway). Now the calculations take about 75% CPU time on my Nexus 4, so the next step is to do all the calculations in multiple threads instead.
The problem I ran into was that the audio started to stutter, since the callback thread obviously runs at a high priority while my new thread doesn't. If I use more/bigger buffers the problem goes away, but so does the real-time feel. Setting a higher priority on the new thread doesn't seem to work.
So, is it even possible to do threaded low-latency audio, or do I have to do everything in the callback for it to work?
I have a buffer of 256 samples, which is about 5 ms, and that should be ages for the thread-scheduler-thingie to run my calc thread.
I think the fundamental problem lies in the performance of your synth engine. A decent channel count is achievable on a single Cortex-A8 or -A9 core. What language have you implemented it in? If it happens to be Java, I recommend porting it to C++.
Using multiple threads for synthesis is certainly possible, but brings with it new problems - namely that each thread must synchronise before the generated audio can be mixed.
Unless you accept the additional latency hit that would come from running the synthesis threads asynchronously, the likely set-up is that in your render callback you'd signal the additional synthesis threads and then wait for them to complete before mixing the audio from all of them together (a rough sketch of this structure is at the end of this answer).
(An obvious optimisation is that the render callback runs some of the processing itself, as it's already running on the CPU and would otherwise be doing nothing.)
Herein lies the problem. Unless you can be certain that your synth render threads run with real-time priority, you can potentially take a scheduling hit each time the render callback runs, and potentially another if you block the callback thread waiting for the synth render threads to catch up.
Last time I looked at audio on Android, Bionic lacked a means of setting real-time thread priority (e.g. SCHED_FIFO). In any case, whether this is even allowed is a matter of operating system policy: on a desktop Linux system you either need to be root or have adjusted the appropriate ulimit (as root). I'm not sure what Android does here, but I very much suspect that downloaded apps aren't given this permission by default. Nor the other useful capability, which is to mlock() the code and its likely stack needs into physical memory.
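For illustration, here is a rough sketch of that signal-the-workers-then-wait structure, written in Java purely to show the shape (the real callback would be your OpenSL buffer-queue callback in C, and every name below is made up for the example):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class ThreadedSynth {
        private static final int WORKERS = 2;    // assumption: one worker per spare core
        private static final int FRAMES = 256;   // one audio block, as in the question

        private final ExecutorService pool = Executors.newFixedThreadPool(WORKERS);
        private final float[][] partial = new float[WORKERS][FRAMES];
        private final float[] mix = new float[FRAMES];

        // Called from the audio callback: fan out, block until done, then mix.
        float[] renderBlock() throws InterruptedException {
            List<Callable<Void>> jobs = new ArrayList<>();
            for (int w = 0; w < WORKERS; w++) {
                final int id = w;
                jobs.add(() -> { renderVoices(id, partial[id]); return null; });
            }
            pool.invokeAll(jobs); // this wait is where the scheduling hit can bite

            for (int i = 0; i < FRAMES; i++) {
                float sum = 0f;
                for (int w = 0; w < WORKERS; w++) sum += partial[w][i];
                mix[i] = sum;
            }
            return mix;
        }

        // Hypothetical: renders this worker's share of the voices into 'out'.
        private void renderVoices(int worker, float[] out) { }
    }

Note that the worker threads created by Executors.newFixedThreadPool run at ordinary priority, which is exactly the weakness discussed above.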
Related
I recently upgraded my old Galaxy S2 phone to a brand new Galaxy S7, and was very surprised to find an old game I wrote seemed to be performing worse on the new phone. After cutting everything down to a bare bones project, I have discovered the problem - the GLES20.glFinish() call I was performing at the end of every onDrawFrame. With this in there, with a glClear but no draw calls, the FPS hovered around 40. Without the glFinish, solid 60 FPS. My old S2 had solid 60 FPS regardless.
I then went back to my game, and removed the glFinish method call, and sure enough performance went back to being perfect and there was no obvious downside to its removal.
Why was glFinish slowing down my frame rate on my new phone but not my old phone?
I think a speculative answer is as good as it's going to get, so — apologies for almost certainly repeating a lot of what you already know:
Commands sent to OpenGL go through three states, named relative to the GPU side of things:
unsubmitted
submitted but pending
completed
Communicating with the code running the GPU is usually expensive. So most OpenGL implementations accept your calls and just queue the work up inside your memory space for a while. At some point it'll decide that a communication is justified and will pay the cost to transfer all the calls at once, promoting them to the submitted state. Then the GPU will complete each one (potentially out-of-order, subject to not breaking the API).
glFinish:
... does not return until the effects of all previously called GL commands are complete. Such effects include all changes to GL state, all changes to connection state, and all changes to the frame buffer contents.
So for some period when that CPU thread might have been doing something else, it now definitely won't. But if you don't glFinish then your output will probably still appear, it's just unclear when. glFlush is often the correct way forwards — it'll advance everything to submitted but not wait for completed, so everything will definitely appear shortly, you just don't bother waiting for it.
OpenGL bindings to the OS vary a lot; in general though you almost certainly want to flush rather than finish if your environment permits you to do so. If it's valid to neither flush nor finish and the OS isn't pushing things along for you based on any criteria then it's possible you're incurring some extra latency (e.g. the commands you issue one frame may not reach the GPU until you fill up the unsubmitted queue again during the next frame) but if you're doing GL work indefinitely then output will almost certainly still proceed.
Android sits upon EGL. Per the spec, section 3.9.3:
... eglSwapBuffers and eglCopyBuffers perform an implicit flush operation on the context ...
I therefore believe that you are not required to perform either a flush or a finish in Android if you're double buffering. A call to swap the buffers will cause a buffer swap as soon as drawing is complete without blocking the CPU.
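As a minimal sketch of that conclusion (assuming an ordinary GLSurfaceView.Renderer set up for double buffering; draw calls omitted):

    import android.opengl.GLES20;
    import android.opengl.GLSurfaceView;
    import javax.microedition.khronos.egl.EGLConfig;
    import javax.microedition.khronos.opengles.GL10;

    class MyRenderer implements GLSurfaceView.Renderer {
        @Override public void onSurfaceCreated(GL10 unused, EGLConfig config) { }

        @Override public void onSurfaceChanged(GL10 unused, int width, int height) {
            GLES20.glViewport(0, 0, width, height);
        }

        @Override public void onDrawFrame(GL10 unused) {
            GLES20.glClear(GLES20.GL_COLOR_BUFFER_BIT | GLES20.GL_DEPTH_BUFFER_BIT);
            // ... issue draw calls here ...

            // No GLES20.glFinish(): the implicit flush in eglSwapBuffers, which
            // GLSurfaceView issues after this method returns, is sufficient.
            // If you did want to push work to the GPU early, GLES20.glFlush()
            // would be the non-blocking choice.
        }
    }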
As to the real question: the S7 (at least the Snapdragon variant) has an Adreno 530 GPU, while the S2 has a Mali-400 MP4. The Malis are produced by ARM, the Adrenos by Qualcomm, so they're completely different architectures with completely different driver implementations. So the difference that causes the blocking could be almost anything. But it's permitted behaviour: glFinish isn't required here and is a very blunt instrument, so it's probably not one of the drivers' major optimisation targets.
I am writing a video processing app and have come across the following performance issue:
Most of the methods in my app show large differences between CPU time and real time.
I have investigated using the DDMS TraceView and have discovered that the main culprit for these discrepancies is context switching in some base methods, such as MediaCodec.start() or MediaCodec.dequeueOutputBuffer().
MediaCodec.start(), for example, has 0.7 ms CPU time and 24.2 ms real time. 97% of this real time is taken up by the context switch.
This would not be a real problem, but the method is called quite often, and it is not the only one that shows this kind of symptom.
I also need to mention that all of the processing happens in a single AsyncTask, therefore on a single non-UI thread.
Is context switching a result of poor implementation, or an inescapable reality of threading?
I would very much appreciate any advice in this matter.
First, I doubt the time is actually spent context-switching. MediaCodec.start() is going to spend some amount of time waiting for the mediaserver process to talk to the video driver, and that's probably what you're seeing. (Unless you're using a software codec, your process doesn't do any of the actual work -- it sends IPC requests to mediaserver, which talks to the hardware codec.) It's possible traceview is just reporting its best guess at where the time went.
Second, AsyncTask threads are executed at a lower priority. Since MediaCodec should be doing all of the heavy lifting in the hardware codec, this won't affect throughput, but it's possible that it's having some effect on latency because other threads will be prioritized by the scheduler. If you're worried about performance, stop using AsyncTask. Either do the thread management yourself, or use the handy helpers in java.util.concurrent.
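For example (a sketch only, with made-up class and thread names), a single dedicated executor from java.util.concurrent keeps the codec work off AsyncTask's background-priority pool:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class VideoProcessor {
        // One dedicated worker at normal priority, instead of AsyncTask's
        // background-priority thread pool.
        private final ExecutorService worker = Executors.newSingleThreadExecutor(r ->
                new Thread(r, "VideoProcessing"));

        void process() {
            worker.execute(() -> {
                // configure MediaCodec, feed input buffers, drain output here
            });
        }

        void shutdown() {
            worker.shutdown();
        }
    }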
Third, if you really want to know what's happening when multiple threads and processes are involved, you should be using systrace, not traceview. An example of using systrace with custom trace markers (to watch CPU cores spin up) can be found here.
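Since the linked example isn't reproduced here, a hedged illustration of the custom-marker idea: android.os.Trace sections (API 18+; you may need to enable app tracing for your package when capturing) show up on the systrace timeline next to the kernel's scheduling data, which makes it obvious whether the time is spent in your process or waiting on mediaserver. The wrapper below is made up for the example:

    import android.media.MediaCodec;
    import android.os.Trace;

    class CodecTracing {
        static int drainOnce(MediaCodec codec, MediaCodec.BufferInfo info) {
            Trace.beginSection("dequeueOutputBuffer");
            try {
                // The marker brackets exactly the call whose real time looked suspicious.
                return codec.dequeueOutputBuffer(info, 10000 /* microseconds */);
            } finally {
                Trace.endSection();
            }
        }
    }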
My app records video and uses TextToSpeech (android.speech.tts.TextToSpeech.speak()) at the same time.
On a high-end device (4 processors at 1.5 GHz) it works OK, but on a 2-processor 1.1 GHz device the UI thread becomes very slow, with freezes of 2-6 seconds.
I know the problem is TextToSpeech, because if I don't use it and just record video, the UI thread runs fluently on the low-end device. If I use TextToSpeech while recording video, the UI thread stops responding and the voice also freezes for 1-2 seconds.
Is there any way to improve performance of TextToSpeech.speak()?
You're using text to speech and video recording at the same time? And you're surprised it's slow? Both of these take a non-trivial amount of CPU resources. Some things just take processing power. Try not using them at the same time and you'll get better results.
If you need to use them at the same time, try using synthesizeToFile first to write the sound clip to a file, then playing the clip while recording. This way you aren't trying to generate the phonemes at the same time as recording.
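A rough sketch of that approach, assuming API 21+ and a TextToSpeech instance that has already finished initialising (the file name, utterance ID and helper class are just placeholders):

    import android.media.MediaPlayer;
    import android.os.Bundle;
    import android.speech.tts.TextToSpeech;
    import android.speech.tts.UtteranceProgressListener;
    import java.io.File;
    import java.io.IOException;

    class PromptPlayer {
        // Synthesize to a file first; play it back later while recording.
        static void prepareAndPlay(TextToSpeech tts, File cacheDir, String text) {
            final File clip = new File(cacheDir, "prompt.wav");
            tts.setOnUtteranceProgressListener(new UtteranceProgressListener() {
                @Override public void onStart(String utteranceId) { }
                @Override public void onError(String utteranceId) { }
                @Override public void onDone(String utteranceId) {
                    // Synthesis finished: playback is now just file I/O plus audio
                    // output, with no phoneme generation competing with the camera.
                    MediaPlayer player = new MediaPlayer();
                    try {
                        player.setDataSource(clip.getAbsolutePath());
                        player.prepare();
                        player.start();
                    } catch (IOException e) {
                        player.release();
                    }
                }
            });
            tts.synthesizeToFile(text, Bundle.EMPTY, clip, "prompt");
        }
    }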
If you are referring to 'cores' when you say 'processors', then it seems like you are doing work that should run on 3 different threads.
The Main Thread should always be kept free. Try not to bog it down... ever!
Extend the AsyncTask class. AsyncTask will allow you to do something that will take some lengthy amount of time without blocking the main thread.
Since this is all running on a virtual machine (Dalvik, to be precise), bear in mind that Dalvik threads are backed by ordinary Linux threads, and the kernel scheduler decides which threads get processor cycles. If you run 3 threads on two cores, that sometimes means sharing cores.
I would say that if you ONLY plan on doing two heavy things at once, for a lower end device, you could implement this using the main thread for video, and a second thread for TextToSpeech. This isn't ideal because it potentially blocks the main thread. But since Video is the smoother of the two, it would be the first choice candidate for running on the Main UI Thread.
Ideally, you want minimum three threads, leaving the main UI Thread primarily unblocked. You can poll for results from both threads to detect completion.
If you happen to have 4 cores, then creating three threads should likely have more distributed performance over the available cores.
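A quick sketch of that three-thread layout, using HandlerThread (API 18+ for quitSafely) so the UI thread only posts work and never waits; all names are illustrative:

    import android.os.Handler;
    import android.os.HandlerThread;

    class Workers {
        private final HandlerThread ttsThread = new HandlerThread("tts");
        private final HandlerThread recordThread = new HandlerThread("recording");
        private Handler ttsHandler;
        private Handler recordHandler;

        void start() {
            ttsThread.start();
            recordThread.start();
            ttsHandler = new Handler(ttsThread.getLooper());
            recordHandler = new Handler(recordThread.getLooper());
        }

        // The UI thread just posts jobs; it never blocks on either worker.
        void speak(Runnable speakJob)   { ttsHandler.post(speakJob); }
        void record(Runnable recordJob) { recordHandler.post(recordJob); }

        void stop() {
            ttsThread.quitSafely();
            recordThread.quitSafely();
        }
    }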
Some docs to get you going:
Android Multithreading - a Qualcomm article, and
Android: AsyncTask
I'm using OpenSL ES in one of my Android apps. When the app is in the foreground the callbacks are pretty regular: the mic callback is called approximately every 10 ms, and so is the speaker callback. However, if I put my app in the background and open a browser (or another app, for that matter), I see that a "storm" of callbacks is triggered upon opening the browser (or browsing). Is there a way to get around that? And why does it happen? Is OpenSL compensating for a period of time in which it wasn't able to execute the callbacks (like it's trying to catch up)?
My source code is in C and I'm on Jelly Bean 4.3.
I have tried to increase the thread priorities of AudioTrack and AudioRecorder, and it does seem to help, but I'm not sure that's the way to go.
ADDITIONAL QUESTIONS
So you're saying that even with increased thread priority you might get a burst of callbacks, and that you should discard those?
How is that a good solution? You'll be dropping mic packets (or draining the source of the speaker packets), right? And if you don't drop mic packets, the receiver of the mic packets will interpret the burst as excessive jitter, right?
More importantly: I manually increased the thread priority of AudioTrack and AudioRecorder and changed the scheduling policy to round-robin. That required both root access and installing BusyBox (which comes with a command-line utility for changing thread priorities/scheduling policy). How is this done programmatically from C? I want to make sure that it IS the individual thread priority that is increased, and not just the priority of my app (process).
Yes, this is by design. Trying to push the thread priority high is the legitimate workaround. Make sure to work with the native buffer size and sample rate (see Low-latency audio playback on Android) for best results. You should still be prepared to discard bursts of callbacks, because there is no way to guarantee they will never happen. You should also try to reduce the overall CPU consumption and RAM footprint of your app while it is in the background.
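On the "how is this done programmatically" part: from Java, android.os.Process.setThreadPriority() adjusts the nice level of the calling thread; from C you can reach the same call through a small JNI bridge. Note this is a niceness change, not SCHED_FIFO, and the platform may clamp how far an ordinary app thread is allowed to go. A minimal sketch (the helper name is made up):

    import android.os.Process;

    class AudioPriority {
        // Call this from the thread that does your audio work, e.g. at the top
        // of its run() method or once from inside the first audio callback.
        static void boostCurrentThread() {
            // THREAD_PRIORITY_URGENT_AUDIO is the level the platform's own
            // audio threads run at (numerically -19).
            Process.setThreadPriority(Process.THREAD_PRIORITY_URGENT_AUDIO);
        }
    }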
This may be very specific, still trying to ask:
I'm the founder of Heat Synthesizer, a software music synthesizer for Android. (https://play.google.com/store/apps/details?id=com.nilsschneider.heat.demo)
This app generates audio signals in realtime and needs to do heavy math calculations to do so.
Having seen the Google I/O 2013 talk "High Performance Audio on Android" (http://www.youtube.com/watch?v=d3kfEeMZ65c), I was excited to implement it as they suggested, but I keep having problems with crackling.
I have about 50% CPU usage on a single core on a Nexus 7 (2012), so everything seems okay so far. Locking has been reduced to a minimum and most of the code is lock-free.
Using an app called Usemon, I can see that the core I use for processing is only about 50% utilised and is even being downclocked by the kernel because my CPU usage isn't high enough.
However, these core speed changes cause the audio to crackle, because the next audio block isn't calculated fast enough while the core is underclocked.
Is there any way to prevent a core from changing its clock frequency?
FWIW, I recommend the use of systrace (docs, explanation, example) for this sort of analysis. On a rooted device you can enable the "freq" tags, which show the clock frequencies of various components. Works best on Android 4.3 and later.
The hackish battery-unfriendly way to deal with this is to start up a second thread that does nothing but spin while your computations are in progress. In theory this shouldn't work (since you're spinning on a different core), but in practice it usually gets the job done. Make sure you verify that the device has multiple cores (Runtime.getRuntime().availableProcessors() or NDK equivalent), as doing this on a single-core device would be bad.
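A sketch of that spinner hack (all names are made up; the flag would be set around each audio block's computation):

    import java.util.concurrent.atomic.AtomicBoolean;

    class ClockKeeper {
        private final AtomicBoolean computing = new AtomicBoolean(false);
        private Thread spinner;

        void start() {
            // Never do this on a single-core device: you'd starve the audio thread.
            if (Runtime.getRuntime().availableProcessors() < 2) return;
            spinner = new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    if (!computing.get()) {
                        try { Thread.sleep(1); } catch (InterruptedException e) { return; }
                    }
                    // else: busy-spin so the governor sees load and keeps the clocks up
                }
            }, "clock-keeper");
            spinner.start();
        }

        void blockStarted()  { computing.set(true); }
        void blockFinished() { computing.set(false); }

        void stop() { if (spinner != null) spinner.interrupt(); }
    }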
Assuming your computations are performed asynchronously in a separate thread, you can do a bit better by changing the worker thread from a "compute, then wait for work" model to a "compute, then poll for work" model. Again, this is far less efficient battery-wise, but if you never sleep, the kernel will assume you're working hard and will keep the core at full speed. Make sure you drop out of polling mode if there isn't any actual work to do (i.e. you've hit the end of input).
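And a sketch of that compute-then-poll variant (queue type and names are illustrative):

    import java.util.concurrent.ConcurrentLinkedQueue;

    class PollingWorker implements Runnable {
        private final ConcurrentLinkedQueue<float[]> work = new ConcurrentLinkedQueue<>();
        private volatile boolean moreInputExpected = true;

        void submit(float[] block) { work.offer(block); }
        void noMoreInput()         { moreInputExpected = false; }

        @Override public void run() {
            while (moreInputExpected || !work.isEmpty()) {
                float[] block = work.poll();
                if (block != null) {
                    compute(block);
                }
                // Deliberately no sleep while more input is expected: staying busy
                // is what keeps the governor from downclocking the core.
            }
            // End of input reached: fall out of polling mode and let the core idle.
        }

        private void compute(float[] block) { /* heavy math goes here */ }
    }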