RenderScript speedup 10x when forcing default CPU implementation - android

I have implemented a CNN in RenderScript, described in a previous question which spawned this one. Basically, when running
adb shell setprop debug.rs.default-CPU-driver 1
there is a 10x speedup on both the NVIDIA Shield and the Nexus 7. The average computation time goes from around 50ms to 5ms, and the test app goes from around 50fps to 130 or more. There are two convolution algorithms:
(1) moving kernel
(2) im2col and GEMM from ScriptIntrinsicBLAS.
Both experience a similar speedup. The question is: why is this happening, and can this effect be achieved from code in a predictable way? And is detailed information about this available somewhere?
Edit:
As per the suggestions below, I verified the use of finish() and copyTo(); here is a breakdown of the procedure. The time reported is taken AFTER the call to copyTo() but without finish(). Uncommenting finish() adds about 1ms to the time.
double forwardTime = 0;
long t = System.currentTimeMillis();
//double t = SystemClock.elapsedRealtime(); // makes no difference

for (Layer a : layers) {
    blob = a.forward(blob);
}

mRS.finish(); // adds about 1ms to measured time
blob.copyTo(outbuf);

forwardTime = System.currentTimeMillis() - t;
Maybe this is unrelated, but on the NVIDIA Shield I get an error message at startup which disappears when running with adb shell setprop debug.rs.default-CPU-driver 1
E/Renderscript: rsAssert failed: 0, in vendor/nvidia/tegra/compute/rs/driver/nv/rsdNvBcc.cpp
I'm setting compileSdkVersion, minSdkVersion and targetSdkVersion to 23 right now, with buildToolsVersion "23.0.2". The tablets are auto-updated to the very latest Android version. I'm not sure about the minimum target I need to set and still have ScriptIntrinsicBLAS available.
I'm using #pragma rs_fp_relaxed in all scripts. The Allocations all use default flags.
This question has a similar situation, but it turned out the OP was creating new Script objects every computation round. I do nothing of the sort; all Scripts and Allocations are created at init time.

The original post has the mRS.finish() commented out. I am wondering if that is the case here.
To benchmark RenderScript properly, we should wait for pending asynchronous operations to complete. There are generally two ways to do that:
Use RenderScript.finish(). This works well when using debug.rs.default-CPU-driver 1, and it also works with most GPU drivers. However, certain GPU drivers treat this as a no-op.
Use Allocation.copyTo() or other similar APIs to access the data of an Allocation, preferably the final output Allocation. This is really a trick, but it works on all devices. Just be aware that the copyTo() operation itself may take some time; make sure you take that into consideration.
5ms here seems suspicious; it might be real, depending on the actual algorithm, but it is worth double-checking whether it still holds when you add finish() or copyTo().
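For example, a re-check along these lines, averaging over many iterations with both sync points in place, would settle it. This is only a sketch: it reuses the question's mRS, layers and outbuf, while inputBlob and the run count are assumptions.

    final int runs = 100;                       // arbitrary; more runs give a steadier average
    long t = System.currentTimeMillis();
    for (int i = 0; i < runs; i++) {
        Allocation b = inputBlob;               // hypothetical: the network's input allocation
        for (Layer a : layers) {
            b = a.forward(b);                   // kernels are enqueued asynchronously here
        }
        mRS.finish();                           // drain the queue (a no-op on some GPU drivers)
        b.copyTo(outbuf);                       // reliable sync point; its copy cost is included
    }
    Log.d("RSBench", "average forward pass: "
            + (System.currentTimeMillis() - t) / (double) runs + " ms");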

That's very strange indeed. The fact that you're getting the same result across both devices and with two very different implementations of the conv layers suggests there is still something else going on with the benchmarking or timing itself, rather than differences with CPU/GPU execution, as things are rarely that conclusive.
I would suggest verifying that the outputs from the copyTo() calls are always the same. Set up a logcat dump of, say, the first (and last!) 10 values in the float array that comes back from each layer's output allocation, to make sure all implementations and execution modes are truly processing the data properly and equally at each layer.
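Something like this could serve as that sanity check. It is only a sketch: layerOutput and layerName are placeholders for your per-layer output Allocation (assumed to hold floats) and its label.

    // Hypothetical sanity check: dump the first and last 10 floats of one layer's output.
    float[] vals = new float[layerOutput.getType().getCount()];
    layerOutput.copyTo(vals);                    // forces the layer's result to be materialized
    StringBuilder head = new StringBuilder();
    StringBuilder tail = new StringBuilder();
    int n = Math.min(10, vals.length);
    for (int i = 0; i < n; i++) {
        head.append(vals[i]).append(' ');
        tail.append(vals[vals.length - n + i]).append(' ');
    }
    Log.d("LayerCheck", layerName + " head: " + head + " tail: " + tail);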
Depending on your setup, it's also possible that the data copying overhead I mentioned before is overpowering the computation time itself, and what you're seeing is just an unfortunate effect of that, since copying data from one place or another can take more or less time. Try increasing the conv kernel sizes or count (with dummy/random values, just for testing's sake) to make the computation much more complex and thereby shift the balance between computing and data loading, and see how that affects your results.
If all else fails, it could just be that the GPU really is taking longer for some reason, though it can be hard to track down why. Some things to check: What data type and size are you using for the data? How are you loading/writing the data to the allocations? Are you already using #pragma rs_fp_relaxed to set your float precision? What flags are you setting for the allocation usage (such as Allocation.USAGE_SCRIPT | Allocation.USAGE_GRAPHICS_TEXTURE)?
And as for your last question, detailed RS documentation on specific optimization matters is still very scarce unfortunately... I think just asking here on SO is still one of the best resources available for now :)

Related

OpenGL ES 2.0 - why does glFinish give me a lower framerate on my new Android phone compared with an old one?

I recently upgraded my old Galaxy S2 phone to a brand new Galaxy S7, and was very surprised to find an old game I wrote seemed to be performing worse on the new phone. After cutting everything down to a bare bones project, I have discovered the problem - the GLES20.glFinish() call I was performing at the end of every onDrawFrame. With this in there, with a glClear but no draw calls, the FPS hovered around 40. Without the glFinish, solid 60 FPS. My old S2 had solid 60 FPS regardless.
I then went back to my game, and removed the glFinish method call, and sure enough performance went back to being perfect and there was no obvious downside to its removal.
Why was glFinish slowing down my frame rate on my new phone but not my old phone?
I think a speculative answer is as good as it's going to get, so — apologies for almost certainly repeating a lot of what you already know:
Commands sent to OpenGL go through three states, named relative to the GPU side of things:
unsubmitted
submitted but pending
completed
Communicating with the code that drives the GPU is usually expensive. So most OpenGL implementations accept your calls and just queue the work up inside your memory space for a while. At some point they'll decide that a communication is justified and will pay the cost to transfer all the calls at once, promoting them to the submitted state. Then the GPU will complete each one (potentially out of order, subject to not breaking the API).
glFinish:
... does not return until the effects of all previously called GL
commands are complete. Such effects include all changes to GL state,
all changes to connection state, and all changes to the frame buffer
contents.
So for some period when that CPU thread might have been doing something else, it now definitely won't. But if you don't glFinish then your output will probably still appear, it's just unclear when. glFlush is often the correct way forwards — it'll advance everything to submitted but not wait for completed, so everything will definitely appear shortly, you just don't bother waiting for it.
OpenGL bindings to the OS vary a lot; in general though you almost certainly want to flush rather than finish if your environment permits you to do so. If it's valid to neither flush nor finish and the OS isn't pushing things along for you based on any criteria then it's possible you're incurring some extra latency (e.g. the commands you issue one frame may not reach the GPU until you fill up the unsubmitted queue again during the next frame) but if you're doing GL work indefinitely then output will almost certainly still proceed.
Android sits upon EGL. Per the spec, 3.9.3:
... eglSwapBuffers and eglCopyBuffers perform an implicit flush operation
on the context ...
I therefore believe that you are not required to perform either a flush or a finish in Android if you're double buffering. A call to swap the buffers will cause a buffer swap as soon as drawing is complete without blocking the CPU.
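As an illustration of that point, here is a minimal renderer sketch (not the asker's code) that issues no explicit flush or finish and relies entirely on the swap GLSurfaceView performs after onDrawFrame():

    import android.opengl.GLES20;
    import android.opengl.GLSurfaceView;
    import javax.microedition.khronos.egl.EGLConfig;
    import javax.microedition.khronos.opengles.GL10;

    // Hypothetical renderer: no glFinish()/glFlush() at all. GLSurfaceView calls
    // eglSwapBuffers() after onDrawFrame() returns, and per EGL 3.9.3 that swap
    // performs an implicit flush.
    class ClearRenderer implements GLSurfaceView.Renderer {
        public void onSurfaceCreated(GL10 gl, EGLConfig config) {
            GLES20.glClearColor(0f, 0f, 0f, 1f);
        }

        public void onSurfaceChanged(GL10 gl, int width, int height) {
            GLES20.glViewport(0, 0, width, height);
        }

        public void onDrawFrame(GL10 gl) {
            GLES20.glClear(GLES20.GL_COLOR_BUFFER_BIT);
            // ... issue draw calls here; do not block on GLES20.glFinish()
        }
    }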
As to the real question: the S7 has an Adreno 530 GPU, while the S2 has a Mali-400 MP4. The Malis are produced by ARM, the Adrenos by Qualcomm, so they're completely different architectures and driver implementations. So the difference that causes the blocking could be almost anything, but the driver is permitted to block there: glFinish isn't required and is a very blunt instrument, so it's probably not one of the major optimisation targets.

Interrupt forEach_root in RenderScript from Java-Side?

I am writing an Android application and I use RenderScript for a complex calculation (I am simulating a magnetic pendulum) that is performed on each pixel of a bitmap (using script.forEach_root(...)). This calculation might last from a tenth of a second up to about 10 seconds or even more, depending on the input parameters.
I want to keep the application responsive and allow users to change parameters without waiting. Therefore I would like to interrupt a running calculation based on user input on the Java-Side of the program. Hence, can I interrupt a forEach_root-call?
I already tried some solutions but they either do not work or do not fully satisfy me:
Add a cancel flag variable to the RenderScript and check its status in root(): this does not work because I cannot change variables using set while forEach_root is running (they are synchronized, I guess for good reasons).
Split the image up into multiple tiles: this is a possible solution and currently the one I favor most, yet it is only a workaround because calculating a single tile might also take several seconds.
Since I am new to RenderScript, I am wondering whether there are other solutions I am not aware of.
Unfortunately, there is no simple way to cancel a running kernel in RenderScript. I think your tiling approach is probably the best solution today; you should be able to implement it using Script.LaunchOptions (http://developer.android.com/reference/android/renderscript/Script.LaunchOptions.html) when you begin kernel execution.
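A sketch of how that tiling could look with Script.LaunchOptions follows. The script field, kernel name and tile height are hypothetical; the point is restricting each launch to a strip and checking a Java-side cancel flag between launches.

    // Hypothetical tiled launch: process the bitmap strip by strip so the Java side
    // can abandon the remaining strips when the user changes parameters.
    private volatile boolean mCancelled = false;

    void renderTiled(Allocation in, Allocation out, int width, int height) {
        final int tileHeight = 64;                      // smaller strips react to cancellation faster
        for (int y = 0; y < height && !mCancelled; y += tileHeight) {
            Script.LaunchOptions opts = new Script.LaunchOptions()
                    .setX(0, width)
                    .setY(y, Math.min(y + tileHeight, height));
            mScript.forEach_root(in, out, opts);        // launch restricted to this strip
            mRS.finish();                               // wait for the strip before checking the flag
        }
    }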

Is achartengine ready for realtime graphing?

I'm trying to graph some real-time data; "realtime" here means < 10 msec data, ideally as low as possible. I've been able to get Android to fetch and process data this fast, but ACE just looks like it was not designed with real-time use in mind. The first symptom is that the garbage collector kicks in like there's no tomorrow and totally kills the app. I'm visualizing the data in a "sliding window" fashion, so it's not like I'm expecting ACE to plot hundreds of thousands of points in real time.
I've taken a look at it, and onDraw for XYChart certainly allocates very heavily in cases where it looks convenient and probably makes the code more readable, but is not really required.
This might even be worse than it used to be, so it might not have been noticed yet. I saw that the bugfix for issue #225 solved a concurrency problem by changing:
return mXY.subMap(start, stop);
to:
return new TreeMap<Double, Double>(mXY.subMap(start, stop));
This creates huge allocations (still backed by the original subMap, though) when it would probably be better to queue updates while onDraw is going on and process them later in atomic updates, or something along those lines, to avoid the concurrency issues.
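To illustrate the "queue updates while onDraw is going on" idea, here is a rough sketch. This is not AChartEngine code; BufferedSeries is a hypothetical wrapper around the library's XYSeries.

    import java.util.concurrent.ConcurrentLinkedQueue;
    import org.achartengine.model.XYSeries;

    // Hypothetical wrapper: producers enqueue points without locking; the render thread
    // drains the queue at a safe moment, so repaints never race with additions and no
    // defensive TreeMap copies are needed.
    class BufferedSeries {
        private final ConcurrentLinkedQueue<double[]> pending =
                new ConcurrentLinkedQueue<double[]>();

        void add(double x, double y) {                 // called from the data/telemetry thread
            pending.offer(new double[] { x, y });
        }

        void drainInto(XYSeries series) {              // called on the UI thread before repaint
            double[] p;
            while ((p = pending.poll()) != null) {
                series.add(p[0], p[1]);
            }
        }
    }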
The real pity here is that ACE is certainly fast enough for what I need. It can do what I need on my hardware perfectly, but since it allocates so heavily on repaint, Android goes crazy with GC. It soon starts allocating while GC is running, so it has to wait, and my app starts looking like a stop-motion movie.
The real question though is: is it reasonable to expect to repaint 4 or 6 line charts (tablet app) in realtime (sub-200ms refresh rate) with ACE, or is it simply not prepared for that kind of abuse?
If the answer is no, are there any other options you'd recommend?
EDIT 20130109:
Revision 471 improves things quite a bit for small data sets. 2,000 points / 4 charts / 100 msec refresh rate is doable and smooth. The logs still show "GC_CONCURRENT freed" like crazy (around 10/sec), but no "WAIT_FOR_CONCURRENT_GC blocked", which are the showstoppers that make your app look stop-motion like.
At 3,000 points / 1 chart / 100 msec it's clearly not smooth. We again get the avalanche of "WAIT_FOR_CONCURRENT_GC blocked" in logcat and the stuttering app. Again, it looks like we do not have a speed problem, only a memory management problem.
It may look like I'm asking ACE to do magic, but I hit this wall after refactoring all my code to retrieve and store telemetry data at 1KHz. Once I finally saw my app retrieve and store all of that in realtime without triggering GC at all, I pulled my hair out with ACE when trying to graph it :)
First of all, thanks for the great question and for the point you raised. You were definitely right about the huge memory allocation that was done in the onDraw() method. I fixed that and checked the code in to SVN. I have also added a synchronized block inside the onDraw() method, such that it will hopefully not throw ConcurrentModificationException when adding new data to the dataset during repaints.
Please check out the code from SVN and do an ant dist in order to build a new AChartEngine jar file and embed it in your application. Please see the instructions here.
To answer your question: AChartEngine is definitely ready for dynamic charting. The issue you reported was a show-stopper, but it should be fixed now. I have written dynamic charting using it. However, you need to make sure you don't add hundreds of thousands of data values to the datasets. Old data can be removed from the datasets in order to gain performance.
It is definitely reasonable to paint the 5 or so line charts if they have up to a few thousand points each.
After a big effort optimizing everything else in my app, I still can't make it plot in what I understand as "realtime".
The library is great and very fast, but the way every onDraw allocates memory inevitably triggers an avalanche of garbage collection that collides with its own allocations, so Android momentarily freezes the app, causing stutter that is totally incompatible with "realtime" graphing. The "stutter" here can range from 50-250ms (yes, milliseconds), but that's enough to kill a realtime app.
AChartEngine will allow you to create "dynamic" graphs just as long as you don't require them to be "realtime" (10 frames/sec, i.e. <100ms refresh rate, or what you'd call "smooth").
If someone needs more information on the core problem here, or on why I'm saying that the library is fast enough but its memory allocation patterns end up causing performance problems, take a look at Google I/O 2009 - Writing Real-Time Games for Android.
Ok, so after some time looking for another graphing library I found nothing good enough :)
That made me look into ACE again and I ended up doing a small patch which makes it "usable" for me although far from ideal.
In XYSeries.java I added a new method:
/**
 * Removes the first value from the series.
 * Useful for sliding, realtime graphs where a standard remove takes up too much time.
 * It assumes data is sorted on the key value (X component).
 */
public synchronized void removeFirst() {
    mXY.removeByIndex(0);
    mMinX = mXY.getXByIndex(0);
}
I found that on top of the memory issues there were some real speed issues at high frame rates too. I was spending around 90% of the time in the remove function as I removed points that scrolled out of view. The reason is that when you remove a min|max point, ACE calls InitRange, which iterates over every point to recalculate the min/max points it uses internally. As I'm processing 1000 telemetry frames per second and have a small viewport (forced by ACE's memory allocation strategies), I hit min/max points very often in the remove function.
I created a new method to remove the first point of a series, which will normally be called as soon as you add a point that makes the first one scroll out of your viewport. If your points are sorted on the key value (classic dateTime series), then we can adjust mMinX so the viewport still looks nice, and do that very fast.
We can't update minY or maxY fast with the current ACE implementation (not that I looked into it though), but if you set up an appropriate initial range it may not be required. In fact I manually adjust the range from time to time using extra information I have because I know what I'm plotting and what the normal ranges are at different points in time.
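Usage then looks roughly like this. It is only a sketch: MAX_POINTS and the series/renderer setup are assumed to exist elsewhere.

    // Hypothetical sliding-window update: append the new sample and drop the oldest one
    // once the window is full, keeping both the item count and the GC churn bounded.
    void onNewSample(XYSeries series, double timestamp, double value) {
        series.add(timestamp, value);
        if (series.getItemCount() > MAX_POINTS) {
            series.removeFirst();   // the patched fast removal from above
        }
    }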
So, this might be good enough for someone else but I insist that a memory allocation refactoring is still required on ACE for any serious realtime graphing.

eglSwapBuffers is erratic/slow

I have a problem with very slow rendering on an Android tablet using the NDK and the EGL commands. I have timed calls to eglSwapBuffers and it takes a variable amount of time, frequently exceeding the device's frame time. I know it synchronizes to the refresh, but that is around 60FPS, and the times here drop well below that.
The only command I issue between calls to swap is glClear, so I know it isn't anything I'm drawing that causes the problem. Even just clearing, the frame rate drops to 30FPS (erratically, though).
On the same device a simple GL program in Java easily renders at 60FPS, so I know it isn't fundamentally a hardware issue. I've looked through the Android Java code for setting up the GL context and can't see any significant difference. I've also played with every config attribute, and while some alter the speed slightly, none (that I can find) fix this horrible frame rate drop.
To ensure the event polling wasn't an issue I moved the rendering into a thread. That thread now only does rendering, thus just calls clear and swap repeatedly. The slow performance still persists.
I'm out of ideas what to check and am looking for suggestions as to what the problem might be.
There's really not enough info (like what device you are testing on, what your exact config was, etc.) to answer this 100% reliably, but this kind of behavior is usually caused by a window and surface pixel format mismatch, e.g. 16-bit (RGB565) vs 32-bit.
The FB_MULTI_BUFFER=3 environment variable enables multi-buffering on the Freescale i.MX 6 (Sabrelite) board with some recent LTIB builds (without X). Your GFX driver may need something like this.
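For comparison, on the Java side avoiding the pixel format mismatch mentioned above amounts to requesting an EGL config that matches the window format. A sketch, not taken from the asker's NDK code; the renderer argument is whatever Renderer implementation you use:

    import android.content.Context;
    import android.graphics.PixelFormat;
    import android.opengl.GLSurfaceView;

    // Hypothetical Java-side illustration: request an RGBA8888 EGL config and make the
    // window surface format match it, so no per-frame format conversion is needed.
    GLSurfaceView createMatchedView(Context context, GLSurfaceView.Renderer renderer) {
        GLSurfaceView view = new GLSurfaceView(context);
        view.setEGLContextClientVersion(2);
        view.setEGLConfigChooser(8, 8, 8, 8, 16, 0);        // R, G, B, A, depth, stencil
        view.getHolder().setFormat(PixelFormat.RGBA_8888);  // keep the window format consistent
        view.setRenderer(renderer);
        return view;
    }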

Why does traceview give inconsistent measurements?

I am trying to speed up my app start-up time (currently ~5 seconds due to slow Guice binding), and when I run traceview I'm seeing pretty big variations (as high as 30%) in measurements from executions of the same code.
I would assume this is from garbage collection differences, but the time spent in startGC according to traceview is completely insignificant.
This is particularly aggravating because it's very difficult to determine what the effects were of my optimizations when the measurements are so variable.
Why does this happen? Is there any way to make the measurements more consistent?
I suppose you are starting profiling from code rather than turning it on manually? But anyway, even if you use Debug.startMethodTracing and Debug.stopMethodTracing from a specific point in your code, you will receive different measurements.
You can see here that Traceview disables the JIT and, I believe, some other optimizations, so during profiling your code executes more slowly than without it. Your code's performance also depends on overall system load: if another app is doing a heavy operation in the background, your code will take longer to execute. So you should definitely expect results that differ slightly, and the start-up time won't be a constant.
Generally it is not so important how long your method executes, but how much CPU time it consumes compared to other methods.
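If you are tracing from code, a typical pattern is to bracket only the startup path you care about. A minimal sketch; the trace name and the init method are placeholders:

    import android.os.Bundle;
    import android.os.Debug;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        Debug.startMethodTracing("app_startup");  // trace file named after this label
        initGuiceAndOtherHeavyStuff();            // hypothetical: the slow startup work being profiled
        Debug.stopMethodTracing();                // flush and close the trace
    }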
Sounds like measurement is not your ultimate goal. Your ultimate goal is to make it faster.
The way to do that is by finding what activities are accounting for a large fraction of time, so you can find a better way to do them.
I said "finding", not "measuring", and I said "activities", not "routines".
To do this, it is only necessary to sample the program's state.
Many profilers collect a large number of samples of the program's state, but then they all fall into the same logic: they summarize, on the theory that all you want is measurements, and you don't really care what they are measurements of.
In fact, if rather than getting summaries you could examine some of the samples in detail, it would tell you a great deal more about how the program is spending its time.
What's more, if on as few as two(2) samples you could see the program pursuing some goal, and it was something you could improve significantly, you would see a significant speedup.
This process can be repeated several times, and that's how you can really optimize it.
The process is explained in more detail here, and there's a use case here.
If you are doing any network related activity on startup then this tool can help you understand what is happening and how you might be able to optimize connections and caching. http://developer.att.com/developer/legalAgreementPage.jsp?passedItemId=9700312
