I'm wondering if anybody has developed a RenderScript program that actually runs on the GPU. I've tried some simple implementations, like doing IntrinsicBlur via RS, but it turned out to run on the CPU rather than the GPU.
Intrinsics will always run on the processor that is expected to execute them fastest. If an intrinsic is running on the CPU, that means the GPU was not suitable for running it quickly. One likely reason is that the GPU is usually busy drawing the screen (which takes a lot of effort too), so there is no additional compute bandwidth available there.
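For reference, here is a minimal sketch of the kind of intrinsic usage being discussed (a ScriptIntrinsicBlur pass); whether it actually lands on the GPU is decided by the vendor driver, not by this code, and the class and variable names below are just illustrative:

import android.content.Context;
import android.graphics.Bitmap;
import android.renderscript.Allocation;
import android.renderscript.Element;
import android.renderscript.RenderScript;
import android.renderscript.ScriptIntrinsicBlur;

public final class BlurHelper {
    // Blurs the given bitmap in place using the blur intrinsic.
    // The driver (CPU or GPU) that runs it is chosen by the platform, not by us.
    public static Bitmap blur(Context context, Bitmap src, float radius) {
        RenderScript rs = RenderScript.create(context);
        try {
            Allocation in = Allocation.createFromBitmap(rs, src);
            Allocation out = Allocation.createTyped(rs, in.getType());

            ScriptIntrinsicBlur blur =
                    ScriptIntrinsicBlur.create(rs, Element.U8_4(rs));
            blur.setRadius(radius);      // radius must be in (0, 25]
            blur.setInput(in);
            blur.forEach(out);

            out.copyTo(src);             // copy the blurred result back
            return src;
        } finally {
            rs.destroy();
        }
    }
}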
Related
I've been searching and trying stuff for a few days with no luck. I have an embedded system using a Snapdragon SoC. It is running Android 5.0 and using OpenGL ES 3.0. It is not a phone and does not have a display, but I am able to use the Vysor Chrome extension to see and work with the Android GUI.
Since it's not a phone and in a rather tight physical package, and I will eventually be doing some intensive encoding/decoding stuff, I am trying to test thermal output and properties under load. I am using Snapdragon Profiler to monitor CPU utilization and temperature.
I have been able to successfully load up the CPU and get a good idea of thermal output. I just made some test code that encodes a bunch of bitmaps to jpeg using standard Android SDK calls (using the CPU).
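For what it's worth, that kind of CPU-side test loop can be as simple as repeatedly compressing a bitmap with the SDK's Bitmap.compress; a rough sketch (the iteration count and quality below are arbitrary):

import android.graphics.Bitmap;
import java.io.ByteArrayOutputStream;

public final class CpuLoadTest {
    // Repeatedly JPEG-encodes a bitmap to generate sustained CPU load.
    // Run this off the main thread; the numbers here are arbitrary.
    public static void burnCpuWithJpeg(Bitmap source, int iterations) {
        for (int i = 0; i < iterations; i++) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            // Bitmap.compress performs the JPEG encode in software on the CPU.
            source.compress(Bitmap.CompressFormat.JPEG, 90, out);
        }
    }
}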
Now I want to see what happens if I do some GPU intensive stuff. The idea being that if I leverage the GPU for some encoding chores maybe things won't get so hot because the GPU can more efficiently handle some types of jobs.
I have been reading, and from what I gather there are a few ways I can eventually leverage the GPU. I could use a library such as FFMPEG or Android's MediaCodec stuff that uses hardware acceleration, or I could use OpenCV or RenderScript.
Before I go down any of those paths I want to just get some test code running and profile the hardware.
What's a quick, easy way to do this? I have done a little bit of OpenGL ES shader programming, but since this is not really a 3D graphics task, I am not sure I can use shaders to test this. Since shaders are part of the graphics pipeline, will OpenGL let me do some GPU-intensive work in them? Or will it just drop frames or crash if I start doing heavy work there? What can I do to load up the GPU if I try shaders? Just a long loop or something?
If shaders aren't the best way to load up the GPU, what is? I think shaders are the only programmable part of OpenGL ES. With RenderScript, can I explicitly run operations on the GPU, or does the framework just decide automatically where to run the code?
Finally, what is the metric I should be probing to determine GPU usage? In my profiler I have CPU Utilization but there is no GPU utilization. I have available the following GPU metrics:
but I am able to use Vysor Chrome extension to see and work with the Android GUI.
If you have Chrome working on the platform with a network connection, and don't care too much about what is actually being rendered, then https://www.shadertoy.com/ is a quick and dirty way of getting some complex graphics running via WebGL.
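If you would rather drive the load from inside your own app instead of a browser, a deliberately expensive fragment shader drawn on a full-screen quad works too; heavy shaders normally just drop your frame rate rather than crash (though some platforms have a GPU watchdog). A rough sketch, assuming you already have a GL context (e.g. a GLSurfaceView renderer) and only need the shader plus its compile step; the shader body and loop count are arbitrary:

import android.opengl.GLES20;

public final class GpuLoadShader {
    // A deliberately expensive fragment shader to keep the GPU busy.
    // Raise the loop bound until the frame rate drops. The uniform uTime
    // prevents the compiler from folding the loop at compile time.
    public static final String HEAVY_FRAGMENT_SHADER =
            "precision highp float;\n" +
            "uniform float uTime;\n" +
            "void main() {\n" +
            "    vec2 p = gl_FragCoord.xy * 0.01;\n" +
            "    float acc = 0.0;\n" +
            "    for (int i = 0; i < 512; i++) {\n" +  // constant bound keeps ES 2.0 happy
            "        acc += sin(p.x * float(i) + uTime) * cos(p.y + acc);\n" +
            "    }\n" +
            "    gl_FragColor = vec4(fract(acc), fract(acc * 0.5), fract(acc * 0.25), 1.0);\n" +
            "}\n";

    // Compiles the fragment shader; linking it into a program and drawing
    // a full-screen quad with it is left to the existing renderer code.
    public static int compileFragmentShader() {
        int shader = GLES20.glCreateShader(GLES20.GL_FRAGMENT_SHADER);
        GLES20.glShaderSource(shader, HEAVY_FRAGMENT_SHADER);
        GLES20.glCompileShader(shader);
        int[] status = new int[1];
        GLES20.glGetShaderiv(shader, GLES20.GL_COMPILE_STATUS, status, 0);
        if (status[0] == 0) {
            throw new RuntimeException(
                    "Shader compile failed: " + GLES20.glGetShaderInfoLog(shader));
        }
        return shader;
    }
}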
I could use some library such as FFMPEG or Android's MediaCodec stuff that uses hardware acceleration. I could also use openCV or RenderScript.
FFMPEG and MediaCodec will be hardware accelerated, but most likely not on the 3D GPU; they typically use a separate, dedicated video encoder/decoder block.
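If you do go the MediaCodec route later, you can at least check which codec you end up with. A small sketch (the MIME type is just an example) that lists the available encoders for a format, so you can see whether a vendor (hardware) implementation is present:

import android.media.MediaCodecInfo;
import android.media.MediaCodecList;
import android.util.Log;

public final class CodecProbe {
    // Logs every encoder that supports the given MIME type (e.g. "video/avc").
    // Vendor-prefixed names such as "OMX.qcom..." usually indicate the
    // dedicated hardware encoder; "OMX.google..." is the software fallback.
    public static void listEncoders(String mimeType) {
        for (int i = 0; i < MediaCodecList.getCodecCount(); i++) {
            MediaCodecInfo info = MediaCodecList.getCodecInfoAt(i);
            if (!info.isEncoder()) {
                continue;
            }
            for (String type : info.getSupportedTypes()) {
                if (type.equalsIgnoreCase(mimeType)) {
                    Log.d("CodecProbe", "Encoder: " + info.getName());
                }
            }
        }
    }
}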
My dev env is as follows:
Device: Nexus 5
Android: 4.4.2
SDK Tools: 22.6.1
Platform Tools: 19.0.1
Build tools: 19.0.3
Build Target: level 19
Min Target: level 19
I'm working on an image processing application. Basically I need to run a preprocessing step on the image and then filter it with a 5x5 convolution. In the preprocessing step, I successfully got the script to run on the GPU and achieved good performance. Since Renderscript offers a 5x5 convolution intrinsic, I'd like to use it to make the whole pipeline as fast as possible. However, I found that using the 5x5 convolution intrinsic after the preprocessing step is very slow. In contrast, if I use the adb tool to force all the scripts to run on the CPU, the 5x5 convolution intrinsic is a lot faster. In both cases, the time consumed by the preprocessing step is basically the same, so it was the performance of the intrinsic that made the difference.
Also, in the code I use Allocation.USAGE_SHARED when creating all the Allocations, hoping the shared memory would facilitate memory access between the CPU and GPU.
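To make the setup concrete, a stripped-down sketch of the intrinsic part of such a pipeline might look like the following (the preprocessing script is omitted, the names are illustrative, and USAGE_SHARED is combined with USAGE_SCRIPT as the docs require):

import android.content.Context;
import android.graphics.Bitmap;
import android.renderscript.Allocation;
import android.renderscript.Element;
import android.renderscript.RenderScript;
import android.renderscript.ScriptIntrinsicConvolve5x5;

public final class Convolve5x5Stage {
    // Runs the 5x5 convolve intrinsic on a bitmap. Allocations are created
    // with USAGE_SHARED | USAGE_SCRIPT so the backing memory can be shared
    // between the CPU and GPU where the driver supports it.
    public static void filter(Context context, Bitmap src, Bitmap dst, float[] coefficients) {
        RenderScript rs = RenderScript.create(context);
        try {
            Allocation in = Allocation.createFromBitmap(rs, src,
                    Allocation.MipmapControl.MIPMAP_NONE,
                    Allocation.USAGE_SHARED | Allocation.USAGE_SCRIPT);
            Allocation out = Allocation.createFromBitmap(rs, dst,
                    Allocation.MipmapControl.MIPMAP_NONE,
                    Allocation.USAGE_SHARED | Allocation.USAGE_SCRIPT);

            ScriptIntrinsicConvolve5x5 convolve =
                    ScriptIntrinsicConvolve5x5.create(rs, Element.U8_4(rs));
            convolve.setCoefficients(coefficients); // 25 values, row-major
            convolve.setInput(in);
            convolve.forEach(out);

            out.copyTo(dst); // sync the result back into the output bitmap
        } finally {
            rs.destroy();
        }
    }
}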
Since I understand that intrinsics run on the CPU, is this behavior expected? Or did I miss anything? Is there a way to make the mixed GPU script / CPU intrinsic code fast? Thanks a lot!
The 5x5 convolve intrinsic (in the default Android RS driver for the CPU) uses NEON. This is extremely fast, and my measurements showed the same. That said, I did not find any RS API that does a 5x5 convolve of two 5x5 matrices; this is a problem as it prevents one from writing more complex kernels.
Given the performance difference you are noticing, it is quite possible that the GPU driver on your device supports a 5x5 convolve intrinsic that runs slower than the NEON-based CPU 5x5 convolve intrinsic. So forcing CPU usage for RenderScript gives better performance.
I am using a Nexus 10 with Android 4.4. I see that if the script writes to global variables, then the script is executed on the CPU instead of the GPU. I can see this from the Mali driver prints in logcat.
I read somewhere that this limitation will go away in the future. I was hoping 4.4 would remove it. Does anyone know more about why this limitation exists and when it might go away?
This limitation appears to be quite restrictive. For instance, I am using an intermediate allocation as a global variable between kernels in a ScriptGroup, and my script guarantees that the kernels write to different locations in the allocation. Due to this restriction, my script now falls back to the CPU, which causes significant performance loss in at least a few cases. For instance, the loss is significant if one uses cos or pow functions in the kernel; the CPU does a far worse job than the GPU on these functions.
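For context, the usual way to pass an intermediate result between two kernels in a ScriptGroup is to let the framework wire them together with addConnection rather than going through a script global (which is what triggers the fallback described above). A rough sketch, assuming a hypothetical generated script ScriptC_pipeline with kernels stageOne and stageTwo; it is not a drop-in fix for the scatter-write case described here:

import android.renderscript.Allocation;
import android.renderscript.RenderScript;
import android.renderscript.ScriptGroup;
import android.renderscript.Type;

public final class PipelineRunner {
    // Chains two kernels of a hypothetical generated script (ScriptC_pipeline)
    // through a ScriptGroup. The intermediate buffer between stageOne and
    // stageTwo is managed by the framework via addConnection, not by a
    // script global.
    public static void run(RenderScript rs, ScriptC_pipeline script,
                           Allocation input, Allocation output, Type middleType) {
        ScriptGroup.Builder builder = new ScriptGroup.Builder(rs);
        builder.addKernel(script.getKernelID_stageOne());
        builder.addKernel(script.getKernelID_stageTwo());
        // The output of stageOne feeds the input of stageTwo.
        builder.addConnection(middleType,
                script.getKernelID_stageOne(),
                script.getKernelID_stageTwo());
        ScriptGroup group = builder.create();

        group.setInput(script.getKernelID_stageOne(), input);
        group.setOutput(script.getKernelID_stageTwo(), output);
        group.execute();
    }
}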
I have a few basic algorithms (DCT/IDCT and a few others) ported and working (as expected, at least functionally) on a Nexus 10. Since these algorithms are first implementations, their execution time currently runs into seconds, which is understandable.
However, given the architecture of RenderScript, I see that these algorithms run either on the CPU or the GPU depending on other parallel application activity. For instance, in my application there is a scroll view for images, and any activity on this view essentially pushes the RenderScript execution onto the CPU; if there is no activity, the algorithm runs on the GPU. I can see this live via ARM DS-5 Mali/A15 traces.
This situation is a debugging/tuning nightmare, since the performance delta between the algorithm running on the CPU (dual core) and the GPU (Mali) is on the order of 2 seconds, making it very difficult to gauge the performance improvements I am making to my algorithm code.
Is there a way to get around this problem? One possible solution would be to at least have a debug configuration option to choose the target type (ARM CPU or GPU) for RenderScript code.
adb shell setprop debug.rs.default-CPU-driver 1
This will force execution to run on the reference CPU implementation. There is no equivalent to force things to the GPU as many conditions could make that impossible at runtime.
Also useful is:
adb shell setprop debug.rs.max-threads 1
which limits the number of CPU cores used to 1 (or any other value you set, up to the CPU core count of the device).
Are there any Android devices where renderscript executes on the GPU instead of the CPU, or is this something not yet implemented anywhere?
As of Jelly Bean 4.2 there is direct GPU integration for RenderScript. See this and this.
I cannot confirm this with any official documentation from Google, but I work with RenderScript all day every day, and each time I run it I see logcat report loading drivers for the graphics chips in my devices, most notably Tegra 2. Google has really lagged in documenting RenderScript, and I would not be at all surprised if they simply haven't corrected this omission in their discussion.
Currently the compute side of Renderscript will only run on the CPU:
For now, compute Renderscripts can only take advantage of CPU cores, but in the future, they can potentially run on other types of processors such as GPUs and DSPs.
Taken from Renderscript dev guide.
The graphics side of Renderscript sits on top of OpenGL ES so the shaders will run on the GPU.
ARM's Mali-T604 GPU will provide a target for the compute side of Renderscript (in a future Android release?) (see ARM Blog entry).
RenderScript was designed so that it can run on the GPU; that was the main purpose of adding the new language. I assume there are devices where it runs on the CPU due to lack of support, but on most devices it runs on the GPU.
I think this may depend on whether you're doing graphics or compute operations. The graphics operations will likely get executed on the GPU, but the compute operations won't, as far as I understand.
When you use the forEach construct, the computation will run in multiple threads on the CPU, not the GPU (you can see this in the ICS source code). In future releases this may change (see https://events.linuxfoundation.org/slides/2011/lfcs/lfcs2011_llvm_liao.pdf), but I haven't seen any announcements.
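For clarity, this is the host-side pattern being referred to; the forEach call below is what currently gets spread across CPU worker threads. The script name foo, its root kernel, and R.raw.foo are placeholders for a hypothetical generated ScriptC_foo class built from foo.rs:

import android.content.res.Resources;
import android.renderscript.Allocation;
import android.renderscript.RenderScript;

public final class ForEachExample {
    // Invokes the root kernel of a hypothetical generated script (ScriptC_foo,
    // compiled from foo.rs) over every element of the input allocation. On ICS
    // this fans out over CPU worker threads rather than the GPU.
    public static void process(RenderScript rs, Resources res, Allocation in, Allocation out) {
        ScriptC_foo script = new ScriptC_foo(rs, res, R.raw.foo); // generated class and resource
        script.forEach_root(in, out); // one kernel invocation per element
        rs.finish();                  // block until the launch completes
    }
}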
Currently, only the Nexus 10 seems to support Renderscript GPU compute.