BACKGROUND:
I want to add a live filter based on the code of the Android camera app. However, the architecture of the Android camera app is based on OpenGL ES 1.x, while I need shaders to implement our custom filters. Updating the camera app to OpenGL ES 2.0 is too difficult, so I had to find some method other than OpenGL to implement the live filter. After some research I decided to use RenderScript.
PROBLEM:
I have written a demo of a simple filter in RenderScript. It shows that the FPS is much lower than an OpenGL implementation of the same filter: about 5 FPS vs. 15 FPS.
QUESTIONS:
The official Android documentation says: "The RenderScript runtime will parallelize work across all processors available on a device, such as multi-core CPUs, GPUs, or DSPs, allowing you to focus on expressing algorithms rather than scheduling work or load balancing." Why, then, is the RenderScript implementation slower?
If RenderScript cannot satisfy my requirement, is there a better way?
CODE DETAILS:
Hi, I am on the same team as the questioner. We want to write a RenderScript-based live-filter camera. In our test demo project, we use a simple filter: a YuvToRGB intrinsic script combined with an overlay-filter ScriptC script.
In the OpenGL version, we upload the camera data as textures and do the image filtering in a shader, like this:
GLES20.glActiveTexture(GLES20.GL_TEXTURE0);
GLES20.glBindTexture(GLES20.GL_TEXTURE_2D, textureYHandle);
GLES20.glUniform1i(shader.uniforms.get("uTextureY"), 0);
GLES20.glTexSubImage2D(GLES20.GL_TEXTURE_2D, 0, 0, 0, mTextureWidth,
mTextureHeight, GLES20.GL_LUMINANCE, GLES20.GL_UNSIGNED_BYTE,
mPixelsYBuffer.position(0));
In the RenderScript version, we put the camera data into an Allocation and do the image filtering with script kernels, like this:
// The following code is from onPreviewFrame(byte[] data, Camera camera), which delivers the camera frame data
byte[] imageData = datas[0];
long timeBegin = System.currentTimeMillis();
mYUVInAllocation.copyFrom(imageData);
mYuv.setInput(mYUVInAllocation);
mYuv.forEach(mRGBAAllocationA);
// To make sure the process of YUVtoRGBA has finished!
mRGBAAllocationA.copyTo(mOutBitmap);
Log.e(TAG, "RS time: YUV to RGBA : " + String.valueOf((System.currentTimeMillis() - timeBegin)));
mLayerScript.forEach_overlay(mRGBAAllocationA, mRGBAAllocationB);
mRGBAAllocationB.copyTo(mOutBitmap);
Log.e(TAG, "RS time: overlay : " + String.valueOf((System.currentTimeMillis() - timeBegin)));
mCameraSurPreview.refresh(mOutBitmap, mCameraDisplayOrientation, timeBegin);
The two problems are:
(1) The RenderScript version seems slower than the OpenGL version.
(2) According to our time log, the YUV-to-RGBA pass, which uses the intrinsic script, is very quick and takes about 6 ms; but the overlay pass, which uses the ScriptC kernel, is very slow and takes about 180 ms. Why does this happen?
Here is the rs kernel code of the ScriptC we use (mLayerScript):
#pragma version(1)
#pragma rs java_package_name(**.renderscript)
#pragma stateFragment(parent)
#include "rs_graphics.rsh"
static rs_allocation layer;
static uint32_t dimX;
static uint32_t dimY;
void setLayer(rs_allocation layer1) {
layer = layer1;
}
void setBitmapDim(uint32_t dimX1, uint32_t dimY1) {
dimX = dimX1;
dimY = dimY1;
}
static float BlendOverlayf(float base, float blend) {
return (base < 0.5 ? (2.0 * base * blend) : (1.0 - 2.0 * (1.0 - base) * (1.0 - blend)));
}
static float3 BlendOverlay(float3 base, float3 blend) {
float3 blendOverLayPixel = {BlendOverlayf(base.r, blend.r), BlendOverlayf(base.g, blend.g), BlendOverlayf(base.b, blend.b)};
return blendOverLayPixel;
}
uchar4 __attribute__((kernel)) overlay(uchar4 in, uint32_t x, uint32_t y) {
float4 inPixel = rsUnpackColor8888(in);
// Scale the current coordinate into the (possibly differently sized) layer allocation
uint32_t layerDimX = rsAllocationGetDimX(layer);
uint32_t layerDimY = rsAllocationGetDimY(layer);
uint32_t layerX = x * layerDimX / dimX;
uint32_t layerY = y * layerDimY / dimY;
uchar4* p = (uchar4*)rsGetElementAt(layer, layerX, layerY);
float4 layerPixel = rsUnpackColor8888(*p);
// Overlay-blend the camera pixel with the layer pixel; keep the original alpha
float3 color = BlendOverlay(inPixel.rgb, layerPixel.rgb);
float4 outf = {color.r, color.g, color.b, inPixel.a};
uchar4 outc = rsPackColorTo8888(outf.r, outf.g, outf.b, outf.a);
return outc;
}
Renderscript does not use any GPU or DSP cores. That is a common misconception encouraged by Google's deliberately vague documentation. Renderscript used to have an interface to OpenGL ES, but that has been deprecated and was never used for much beyond animated wallpapers. Renderscript will use multiple CPU cores, if available, but I suspect Renderscript will be replaced by OpenCL.
Take a look at the Effects class and the Effects demo in the Android SDK. It shows how to use OpenGL ES 2.0 shaders to apply effects to images without writing OpenGL ES code.
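For illustration, here is a minimal sketch of that API (android.media.effect); it has to run on a thread with a current OpenGL ES 2.0 context (e.g. inside a GLSurfaceView.Renderer), and the texture IDs are assumed to be created and filled elsewhere:
import android.media.effect.Effect;
import android.media.effect.EffectContext;
import android.media.effect.EffectFactory;
// Applies a built-in effect from inputTexId into outputTexId.
// Must be called on the GL thread while an OpenGL ES 2.0 context is current.
void applyGrayscale(int inputTexId, int outputTexId, int width, int height) {
    EffectContext effectContext = EffectContext.createWithCurrentGlContext();
    Effect effect = effectContext.getFactory().createEffect(EffectFactory.EFFECT_GRAYSCALE);
    effect.apply(inputTexId, width, height, outputTexId);
    effect.release();
    effectContext.release();
}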
http://software.intel.com/en-us/articles/porting-opengl-games-to-android-on-intel-atom-processors-part-1
UPDATE:
It's wonderful when I learn more from answering a question than from asking one, and that is the case here. You can see from the lack of answers that Renderscript is hardly used outside of Google, because of its strange architecture that ignores industry standards like OpenCL and its almost non-existent documentation on how it actually works.
Nonetheless, my answer did evoke a rare response from the Renderscript development team, which contains only one link with any genuinely useful information about Renderscript - this article by Alexandru Voica at IMG, the PowerVR GPU vendor:
http://withimagination.imgtec.com/index.php/powervr/running-renderscript-efficiently-with-powervr-gpus-on-android
That article has some good information which was new to me. There are comments posted there from more people who are having trouble getting Renderscript code to actually run on the GPU.
But I was incorrect to assume that Renderscript is no longer being developed at Google. Although my statement that "Renderscript does not use any GPU or DSP cores" was true until fairly recently, I have learned that this changed as of one of the Jelly Bean releases.
It would have been great if one of the Renderscript developers could have explained that, or if they had a public web page that explains it and lists which GPUs are actually supported and how you can tell whether your code actually gets run on a GPU.
My opinion is that Google will replace Renderscript with OpenCL eventually and I would not invest time developing with it.
Context
I have an Android app that takes a picture, blurs the picture, removes the blur based on a mask and applies a final layer (not relevant). The last 2 steps, removing the blur based on a mask and applying a final layer, are done repeatedly, each time with a new mask (150 masks).
The output gets drawn on a canvas (SurfaceView). This way the app effectively creates a view of the image with an animated blur.
Technical details & code
All of these image processing steps are achieved with RenderScript.
I'm leaving out the code for step 1, blurring the picture, since this is irrelevant for the problem I'm facing.
Step 2: removing the blur based on a mask
I have a custom kernel which takes an in Allocation as argument and holds 2 global variables, which are Allocations as well.
These 3 Allocations all get their data from bitmaps using Allocation.copyFrom(bitmap).
Step 3: applying a final layer
Here I have a custom kernel as well, which takes an in Allocation as argument and holds 3 global variables, of which 1 is an Allocation and 2 are floats.
How these kernels work is irrelevant to this question but just to be sure I included some simplified snippets below.
Another thing to note is that I am following all the best practices regarding Allocations, RenderScript and my SurfaceView to ensure performance is at its best.
So common mistakes such as creating a new RenderScript instance each time or not re-using Allocations when possible can be ruled out.
blurMask.rs
#pragma version(1)
#pragma rs java_package_name(com.example.rs)
#pragma rs_fp_relaxed
// Extra allocations
rs_allocation allocBlur;
rs_allocation allocBlurMask;
/*
* RenderScript kernel that performs a masked blur manipulation.
* Blur Pseudo: out = original * blurMask + blur * (1.0 - blurMask)
* -> Execute this for all channels
*/
uchar4 __attribute__((kernel)) blur_mask(uchar4 inOriginal, uint32_t x, uint32_t y) {
// Manually getting current element from the blur and mask allocations
uchar4 inBlur = rsGetElementAt_uchar4(allocBlur, x, y);
uchar4 inBlurMask = rsGetElementAt_uchar4(allocBlurMask, x, y);
// normalize to 0.0 -> 1.0
float4 inOriginalNorm = rsUnpackColor8888(inOriginal);
float4 inBlurNorm = rsUnpackColor8888(inBlur);
float4 inBlurMaskNorm = rsUnpackColor8888(inBlurMask);
inBlurNorm.rgb = inBlurNorm.rgb * 0.7f + 0.3f;
float4 outNorm = inOriginalNorm;
outNorm.rgb = inOriginalNorm.rgb * inBlurMaskNorm.rgb + inBlurNorm.rgb * (1.0f - inBlurMaskNorm.rgb);
return rsPackColorTo8888(outNorm);
}
myKernel.rs
#pragma version(1)
#pragma rs java_package_name(com.example.rs)
#pragma rs_fp_relaxed
// Extra allocations
rs_allocation allocExtra;
// Random values; these are set from Kotlin, the values below just act as minimum placeholders.
float randB = 0.1f;
float randW = 0.75f;
/*
* RenderScript kernel that performs a manipulation.
*/
uchar4 __attribute__((kernel)) my_kernel(uchar4 inOriginal, uint32_t x, uint32_t y) {
// Manually getting current element from the extra allocation
uchar4 inExtra = rsGetElementAt_uchar4(allocExtra, x, y);
// normalize to 0.0 -> 1.0
float4 inOriginalNorm = rsUnpackColor8888(inOriginal);
float4 inExtraNorm = rsUnpackColor8888(inExtra);
float4 outNorm = inOriginalNorm;
if (inExtraNorm.r > 0.0) {
outNorm.rgb = inOriginalNorm.rgb * 0.7f + 0.3f;
// Separate channel operation since we are using inExtraNorm.r everywhere
outNorm.r = outNorm.r * inExtraNorm.r + inOriginalNorm.r * (1.0 - inExtraNorm.r);
outNorm.g = outNorm.g * inExtraNorm.r + inOriginalNorm.g * (1.0 - inExtraNorm.r);
outNorm.b = outNorm.b * inExtraNorm.r + inOriginalNorm.b * (1.0 - inExtraNorm.r);
}
else if (inExtraNorm.g > 0.0) {
...
}
return rsPackColorTo8888(outNorm);
}
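For reference, the Java-side wiring for these two kernels looks roughly like the sketch below. The generated class and method names (ScriptC_blurMask, set_allocBlur, forEach_blur_mask, ...) follow from the .rs files above; everything else (bitmap names, the reused output Allocations) is assumed, and the real app sets the values from Kotlin and reuses the context and Allocations across frames as described.
// One-time setup, reused across frames.
RenderScript rs = RenderScript.create(context);
ScriptC_blurMask blurMask = new ScriptC_blurMask(rs);
ScriptC_myKernel myKernel = new ScriptC_myKernel(rs);
Allocation allocOriginal = Allocation.createFromBitmap(rs, originalBitmap);
Allocation allocBlur = Allocation.createFromBitmap(rs, blurredBitmap);
Allocation allocExtra = Allocation.createFromBitmap(rs, extraBitmap);
Allocation allocMask = Allocation.createTyped(rs, allocOriginal.getType()); // reused for all 150 masks
Allocation allocTmp = Allocation.createTyped(rs, allocOriginal.getType());
Allocation allocOut = Allocation.createTyped(rs, allocOriginal.getType());
blurMask.set_allocBlur(allocBlur);
myKernel.set_allocExtra(allocExtra);
myKernel.set_randB(0.1f);
myKernel.set_randW(0.75f);
// Per animation step (one of the 150 masks):
allocMask.copyFrom(maskBitmap);                      // Allocation.copyFrom(bitmap), as in the question
blurMask.set_allocBlurMask(allocMask);
blurMask.forEach_blur_mask(allocOriginal, allocTmp); // step 2: masked blur
myKernel.forEach_my_kernel(allocTmp, allocOut);      // step 3: final layer
allocOut.copyTo(outBitmap);                          // sync point; outBitmap is then drawn on the SurfaceView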
Problem
So the app works great on a range of devices, even on low-end devices. I manually cap the FPS at 15, but when I remove this cap, I get results ranging from 15-20 on low-end devices to 35-40 on high-end devices.
The Samsung Galaxy S8 is where my problem occurs. For some reason I only manage to get around 10 FPS. If I use adb to force RenderScript to use CPU instead:
adb shell setprop debug.rs.default-CPU-driver 1
I get around 12-15 FPS, but obviously I want it to run on the GPU.
An important, weird thing I noticed
If I trigger a touch event, no matter where (even outside the app), the performance dramatically increases to around 35-40 FPS. As soon as I lift my finger from the screen again, the FPS drops back to 10.
NOTE: drawing the result on the SurfaceView can be excluded as an impacting factor since the results are the same with just the computation in RenderScript without drawing the actual result.
Questions
So I have more than one question really:
What could be the reason behind the low performance?
Why would a touch event improve this performance so dramatically?
How could I solve or work around this issue?
I have implemented a small CNN in RenderScript and want to profile the performance on different hardware. On my Nexus 7 the times make sense, but on the NVIDIA Shield they do not.
The CNN (LeNet) is implemented as 9 layers residing in a queue; computation is performed in sequence. Each layer is timed individually.
Here is an example:
(times in ms)  conv1   pool1   conv2   pool2   resh1   ip1     relu1   ip2     softmax
nexus7         11.177  7.813   13.357  8.367   8.097   2.1     0.326   1.557   2.667
shield         13.219  1.024   1.567   1.081   0.988   14.588  13.323  14.318  40.347
The distribution of the times looks about right for the Nexus, with conv1 and conv2 (the convolution layers) taking most of the time. But on the Shield, the times for layers 2-4 drop far below what is reasonable and seem to pile up towards the end. The softmax layer is a relatively small job, so 40 ms is far too large. Either my timing method is faulty, or something else is going on.
The code running the layers looks something like this:
double[] times = new double[layers.size()];
int layerindex = 0;
for (Layer a : layers) {
double t = SystemClock.elapsedRealtime();
//long t = System.currentTimeMillis(); // makes no difference
blob = a.forward(blob); // here we call renderscript forEach_(), invoke_() etc
//mRS.finish(); // makes no difference
t = SystemClock.elapsedRealtime() - t;
//t = System.currentTimeMillis() - t; // makes no difference
times[layerindex] += t; // later we take average etc
layerindex++;
}
It is my understanding that once forEach_() returns, the job is supposed to be finished. In any case, mRS.finish() should provide a final barrier. But looking at the times, the only reasonable explanation is that jobs are still processed in the background.
The app is very simple, I just run the test from MainActivity and print to logcat. Android Studio builds the app as a release and runs it on the device which is connected by USB.
(1) What is the correct way to time RenderScript processes?
(2) Is it true that when forEach_() returns, the threads spawned by the script are guaranteed to be done?
(3) In my test app, I simply run directly from the MainActivity. Is this a problem (other than blocking the UI thread and making the app unresponsive)? If this influences the timing or causes the weirdness, what is a proper way to set up a test app like this?
I've implemented CNNs in RenderScript myself, and as you explain, it does require chaining multiple processes and calling forEach_*() various times for each layer if you implement them each as a different kernel. As such, I can assure you that the forEach call returning does not really guarantee that the process has completed. In theory, this will only schedule the kernel and all queued up requests will actually run whenever the system determines it's best to, especially if they get processed in the tablet's GPU.
Usually, the only way to make absolutely sure you have some kind of control over a kernel truly running is by explicitly reading the output of the RS kernel in between layers, such as by using .copyTo() on the output allocation object of that kernel. This "forces" any queued up RS jobs that have not run yet (on which that layer's output allocation is dependent), to execute at that time. Granted, that may introduce data transfer overheads and your timing will not be fully accurate -- in fact, the execution time of the full network will quite surely be lower than the sum of the individual layers if timed in this manner. But as far as I know, it's the only reliable way to time individual kernels in a chain and it will give you some feedback to find out where bottlenecks are, and to better guide your optimization, if that's what you're after.
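For example, the timing loop from the question could be adapted along these lines. This is just a sketch: it assumes blob is an F32 Allocation returned by forward(), and the readback buffer exists only to force the queued kernels to execute before the second timestamp is taken.
double[] times = new double[layers.size()];
float[] readback = null; // scratch buffer; copyTo() forces pending RS work to finish
int layerIndex = 0;
for (Layer a : layers) {
    double t = SystemClock.elapsedRealtime();
    blob = a.forward(blob);                  // only enqueues the forEach_()/invoke_() work
    if (readback == null || readback.length != blob.getType().getCount()) {
        readback = new float[blob.getType().getCount()];
    }
    blob.copyTo(readback);                   // blocks until this layer's kernels have actually run
    times[layerIndex++] += SystemClock.elapsedRealtime() - t;
}
As noted, the forced copies add overhead of their own, so treat these per-layer numbers as upper bounds rather than as the true end-to-end cost.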
Maybe a little bit off topic, but for CNNs: if you can structure your algorithm using matrix-matrix multiplication as the basic computing block, you can actually use RenderScript IntrinsicBLAS, especially BNNM and SGEMM.
Pros:
High performance implementation of 8bit Matrix Multiplication (BNNM), available in N Preview.
Support back to Android 2.3 through the RenderScript support lib, when using Build-Tools 24.0.0 rc3 and above.
High performance GPU acceleration of SGEMM on Nexus5X and 6P with N Preview build NPC91K.
If you only use RenderScript Intrinsics, you can code everything in java.
Cons:
Your algorithm may need to be refactored, and it needs to be based on 2D matrix multiplication.
Though BNNM is available in Android 6.0, its performance in 6.0 is not satisfactory, so it is better to use the support lib for BNNM and set targetSdkVersion to 24.
SGEMM GPU acceleration is currently only available on Nexus5X and Nexus6P, and it currently requires the width and height of the matrices to be multiples of 8.
It's worth trying if BLAS fits into your algorithm. And it is easy to use:
import android.support.v8.renderscript.*;
// if you are not using support lib:
// import android.renderscript.*;
private void runBNNM(int m, int n, int k, byte[] a_byte, byte[] b_byte, int c_offset, RenderScript mRS) {
Allocation A, B, C;
Type.Builder builder = new Type.Builder(mRS, Element.U8(mRS));
Type a_type = builder.setX(k).setY(m).create();
Type b_type = builder.setX(k).setY(n).create();
Type c_type = builder.setX(n).setY(m).create();
// If you are reusing the input Allocations, just create and cache them somewhere else.
A = Allocation.createTyped(mRS, a_type);
B = Allocation.createTyped(mRS, b_type);
C = Allocation.createTyped(mRS, c_type);
A.copyFrom(a_byte);
B.copyFrom(b_byte);
ScriptIntrinsicBLAS blas = ScriptIntrinsicBLAS.create(mRS);
// Computes: C = A * B.Transpose
int a_offset = 0;
int b_offset = 0;
// c_offset is taken from the method parameter
int c_multiplier = 1;
blas.BNNM(A, a_offset, B, b_offset, C, c_offset, c_multiplier);
}
SGEMM is similar:
ScriptIntrinsicBLAS blas = ScriptIntrinsicBLAS.create(mRS);
// Construct the Allocations: A, B, C somewhere and make sure the dimensions match.
// Computes: C = 1.0f * A * B + 0.0f * C
float alpha = 1.0f;
float beta = 0.0f;
blas.SGEMM(ScriptIntrinsicBLAS.NO_TRANSPOSE, ScriptIntrinsicBLAS.NO_TRANSPOSE,
alpha, A, B, beta, C);
I know this has been asked in general terms before, but the answer is always "it depends", so I'm asking a concrete question in the hope of getting a concrete answer.
I know about the evil of ifs in GLSL: they can be really expensive, and on some hardware both branches get executed anyway.
So, I have a fragment shader from an example (a dual-paraboloid shadow map) which uses ifs to determine which map to use and to compute the depth. I know it's very easy to replace those ifs with a multiplier, but there is a texture sampling inside the fragment shader. Which would be faster: using an if, or using a multiplier to filter out the unused data?
These are the proposed codes:
IF version:
//Alpha is a variable computed on the fly, cannot be replaced
float depth = 0.0;
float mydepth = 0.0;
if(alpha >= 0.5)
{
depth = texture2D(ShadowFrontS, P0.xy).x;
mydepth = P0.z;
}
else
{
depth = texture2D(ShadowBackS, P1.xy).x;
mydepth = P1.z;
}
Filter version:
float mlt = ceil(alpha - 0.5);
float depth = 0.0;
float mydepth = 0.0;
depth = texture2D(ShadowFrontS, P0.xy).x * mlt;
mydepth = P0.z * mlt;
mlt = 1.0 - mlt;
depth = depth + (texture2D(ShadowBackS, P1.xy).x * mlt);
mydepth = mydepth + (P1.z * mlt);
P.S.: I'm targeting desktop and mobile devices, so performance on low-end hardware is a must.
Branching is not "evil" per se on massively SIMD architectures. If all the threads in a "bunch" (NVIDIA calls them warps) follow the same code path, i.e. all take the same branches, everything is fine.
Only if a branch is taken by part of such a bunch and not by the rest must both branches be executed, with the calculations and data fetches that are not relevant to the current thread discarded afterwards.
Now in your case it requires some careful profiling to see which variant benefits your GPU more. But my gut instinct tells me it's actually the branching version. Why? Because the value by which you decide on a branch usually depends on the screen-space position, and large contiguous areas of fragments often share the same code path and branching; so performance penalties happen only for those bunches which cover a bordering region. These bunches are usually only a few pixels² in size (8×8 or 16×16).
The shader you have there is not GPU limited (i.e. limited by the computational capabilities of the GPU) but memory bandwidth limited, i.e. limited by the throughput that the GPU's memory link offers, because of the texture2D fetch operations. In that case, reducing the actual number of fetches, and thereby the required memory bandwidth, will probably benefit your program more than reducing the number of computations.
The branchless mix-multiplex variant of your shader will always fetch both textures; the branching one will do so only within the bordering regions. So from that heuristic I'd guess that your branching variant is actually the better choice.
But to be sure you have to profile it.
I'm writing a cross-platform application. It should run on Android devices.
I want to use dFdx/dFdy for antialiasing. But, unfortunately, GLSL ES 2.0 does not support derivatives.
Can I replace dFdx/dFdy with something else, e.g. 1/sprite_width and 1/sprite_height in screen pixels?
As I said, I need this to work on Android devices. And I saw that my device supports GL_OES_standard_derivatives, which allows it to use these functions. Do all Android OpenGL ES 2.0 devices support it?
As I said, many OpenGL ES 2.0 devices support the GL_OES_standard_derivatives extension.
But for those that don't, I made this workaround:
float myFunc(vec2 p){
return p.x*p.x - p.y; // that's our function. We want derivative from it.
}
// pixel_step is calculated in the vertex shader
// (width/height = triangle/quad width/height in px)
vec2 pixel_step = vec2(1.0 / width, 1.0 / height);
float current = myFunc(texCoord);
float dfdx = myFunc(texCoord + vec2(pixel_step.x, 0.0)) - current;
float dfdy = myFunc(texCoord + vec2(0.0, pixel_step.y)) - current;
float fwidth = abs(dfdx) + abs(dfdy); // from the Khronos doc: http://www.khronos.org/registry/gles/extensions/OES/OES_standard_derivatives.txt
P.S. I get results very close to the GLSL built-ins, just a little bit more blurry (in my shader). To fix this I multiplied pixel_step by 1/1.75. If someone knows why, let me know.
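For completeness, checking at runtime whether the extension is available is straightforward. A sketch in plain Java: it has to run on the GL thread with a current context, and the fragment shader still needs #extension GL_OES_standard_derivatives : enable when the built-ins are used.
import android.opengl.GLES20;
// Returns true if the device exposes dFdx/dFdy/fwidth via the extension.
// Must be called on the GL thread after the OpenGL ES 2.0 context has been created.
static boolean hasStandardDerivatives() {
    String extensions = GLES20.glGetString(GLES20.GL_EXTENSIONS);
    return extensions != null && extensions.contains("GL_OES_standard_derivatives");
}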
I'm developing a mobile application that runs on Android and iOS. It is capable of real-time processing of a video stream. On Android I get the preview video stream of the camera via android.hardware.Camera.PreviewCallback.onPreviewFrame. I decided to use the NV21 format, since it should be supported by all Android devices, whereas RGB isn't (or only RGB565).
For my algorithms, which mostly are for pattern recognition, I need grayscale images as well as color information. Grayscale is not a problem, but the color conversion from NV21 to BGR takes way too long.
As described, I use the following method to capture the images.
In the app, I override the onPreviewFrame handler of the Camera. This is done in CameraPreviewFrameHandler.java:
@Override
public void onPreviewFrame(byte[] data, Camera camera) {
    try {
        AvCore.getInstance().onFrame(data, _prevWidth, _prevHeight, AvStreamEncoding.NV21);
    } catch (NativeException e) {
        e.printStackTrace();
    }
}
The onFrame function then calls a native function which fetches the data from the Java objects as local references. This is converted to an unsigned char* byte stream, which is passed to the following C++ function that uses OpenCV to convert from NV21 to BGR:
void CoreManager::api_onFrame(unsigned char* rImageData, avStreamEncoding_t vImageFormat, int vWidth, int vHeight)
{
// rImageData is a local JNI-reference to the java-byte-array "data" from onPreviewFrame
Mat bgrMat; // Holds the converted image
Mat origImg; // Holds the original image (OpenCV-Wrapping around rImageData)
double ts; // for profiling
switch(vImageFormat)
{
// other formats
case NV21:
origImg = Mat(vHeight + vHeight/2, vWidth, CV_8UC1, rImageData); // fast, only creates header around rImageData
bgrMat = Mat(vHeight, vWidth, CV_8UC3); // Prepare Mat for target image
ts = avUtils::gettime(); // PROFILING START
cvtColor(origImg, bgrMat, CV_YUV2BGR_NV21);
_onFrameBGRConversion.push_back(avUtils::gettime()-ts); // PROFILING END
break;
}
[...APPLICATION LOGIC...]
}
As one might conclude from the comments in the code, I have already profiled the conversion, and it turns out that it takes ~30 ms on my Nexus 4, which is unacceptably long for such a "trivial" pre-processing step. (My profiling methods are double-checked and work properly for real-time measurement.)
Now I'm trying desperately to find a faster implementation of this color conversion from NV21 to BGR. This is what I've already done:
Adapted the code "convertYUV420_NV21toRGB8888" from this topic to C++ (a multiple of the OpenCV conversion time)
Modified the code from 1 to use only integer operations (twice the conversion time of the OpenCV solution)
Browsed through a couple of other implementations, all with similar conversion times
Checked the OpenCV implementation; they use a lot of bit shifting to get performance. I guess I'm not able to do better on my own
Do you have suggestions, know of good implementations, or even have a completely different way to work around this problem? I somehow need to capture RGB/BGR frames from the Android camera, and it should work on as many Android devices as possible.
Thanks for your replies!
Did you try libyuv? I used it in the past, and if you compile it with NEON support it uses assembly code optimized for ARM processors. You can start from there to further optimize for your particular situation.