RenderScript low performance on Samsung Galaxy S8

RenderScript low performance on Samsung Galaxy S8 - android

Context
I have an Android app that takes a picture, blurs the picture, removes the blur based on a mask and applies a final layer (not relevant). The last 2 steps, removing the blur based on a mask and applying a final layer is done repeatedly, each time with a new mask (150 masks).
The output get's drawn on a canvas (SurfaceView). This way the app effectively creates a view of the image with an animated blur.
Technical details & code
All of these image processing steps are achieved with RenderScript.
I'm leaving out the code for step 1, blurring the picture, since this is irrelevant for the problem I'm facing.
Step 2: removing the blur based on a mask
I have a custom kernel which takes an in Allocation as argument and holds 2 global variables, which are Allocations as well.
These 3 Allocations all get their data from bitmaps using Allocation.copyFrom(bitmap).
Step 3: applying a final layer
Here I have a custom kernel as well which takes an in Allocation as argument and holds 3 global variables, of which 1 is and Allocation and 2 are floats.
How these kernels work is irrelevant to this question but just to be sure I included some simplified snippets below.
Another thing to note is that I am following all best practices to ensure performance is at its best regarding Allocations, RenderScript and my SurfaceView.
So common mistakes such as creating a new RenderScript instance each time, not re-using Allocations when possible,.. are safe to ignore.
blurMask.rs
#pragma version(1)
#pragma rs java_package_name(com.example.rs)
#pragma rs_fp_relaxed
// Extra allocations
rs_allocation allocBlur;
rs_allocation allocBlurMask;
/*
* RenderScript kernel that performs a masked blur manipulation.
* Blur Pseudo: out = original * blurMask + blur * (1.0 - blurMask)
* -> Execute this for all channels
*/
uchar4 __attribute__((kernel)) blur_mask(uchar4 inOriginal, uint32_t x, uint32_t y) {
// Manually getting current element from the blur and mask allocations
uchar4 inBlur = rsGetElementAt_uchar4(allocBlur, x, y);
uchar4 inBlurMask = rsGetElementAt_uchar4(allocBlurMask, x, y);
// normalize to 0.0 -> 1.0
float4 inOriginalNorm = rsUnpackColor8888(inOriginal);
float4 inBlurNorm = rsUnpackColor8888(inBlur);
float4 inBlurMaskNorm = rsUnpackColor8888(inBlurMask);
inBlurNorm.rgb = inBlurNorm.rgb * 0.7 + 0.3;
float4 outNorm = inOriginalNorm;
outNorm.rgb = inOriginalNorm.rgb * inBlurMaskNorm.rgb + inBlurNorm.rgb * (1.0 - inBlurMaskNorm.rgb);
return rsPackColorTo8888(outNorm);
}
myKernel.rs
#pragma version(1)
#pragma rs java_package_name(com.example.rs)
#pragma rs_fp_relaxed
// Extra allocations
rs_allocation allocExtra;
// Randoms; Values are set from kotlin, the values below just act as a min placeholder.
float randB = 0.1f;
float randW = 0.75f;
/*
* RenderScript kernel that performs a manipulation.
*/
uchar4 __attribute__((kernel)) my_kernel(uchar4 inOriginal, uint32_t x, uint32_t y) {
// Manually getting current element from the extra allocation
uchar4 inExtra = rsGetElementAt_uchar4(allocExtra, x, y);
// normalize to 0.0 -> 1.0
float4 inOriginalNorm = rsUnpackColor8888(inOriginal);
float4 inExtraNorm = rsUnpackColor8888(inExtra);
float4 outNorm = inOriginalNorm;
if (inExtraNorm.r > 0.0) {
outNorm.rgb = inOriginalNorm.rgb * 0.7 + 0.3;
// Separate channel operation since we are using inExtraNorm.r everywhere
outNorm.r = outNorm.r * inExtraNorm.r + inOriginalNorm.r * (1.0 - inExtraNorm.r);
outNorm.g = outNorm.g * inExtraNorm.r + inOriginalNorm.g * (1.0 - inExtraNorm.r);
outNorm.b = outNorm.b * inExtraNorm.r + inOriginalNorm.b * (1.0 - inExtraNorm.r);
}
else if (inExtraNorm.g > 0.0) {
...
}
return rsPackColorTo8888(outNorm);
}
Problem
So the app works great on a range of devices, even on low-end devices. I manually cap the FPS at 15, but when I remove this cap, I get results ranging from 15-20 on low-end devices to 35-40 on high-end devices.
The Samsung Galaxy S8 is where my problem occurs. For some reason I only manage to get around 10 FPS. If I use adb to force RenderScript to use CPU instead:
adb shell setprop debug.rs.default-CPU-driver 1
I get around 12-15 FPS, but obviously I want it to run on the GPU.
An important, weird thing I noticed
If I trigger a touch event, no matter where (even out of the app), the performance dramatically increases to around 35-40 FPS. If I lift my finger from the screen again, FPS drops back to 10 FPS.
NOTE: drawing the result on the SurfaceView can be excluded as an impacting factor since the results are the same with just the computation in RenderScript without drawing the actual result.
Questions
So I have more than one question really:
What could be the reason behind the low performance?
Why would a touch event improve this performance so dramatically?
How could I solve or work around this issue?

Related

Call Renderscrip kernel inside function

I am trying to call a Renderscript kernel inside a function inside the same Renderscript file, but I have no idea how to do it (and the Google documentation doesn't really help).
So I want to call this kernel:
uchar __attribute__((kernel)) nextPixel(uint32_t x) {
tImgIndexB = (uint32_t) (lBlackX[rsGetElementAt_uchar(num, x)] + lX) * 426 + (lBlackY[rsGetElementAt_uchar(num, x)] + lY);
tImgIndexW = (uint32_t) (lWhiteX[rsGetElementAt_uchar(num, x)] + lX) * 426 + (lWhiteY[rsGetElementAt_uchar(num, x)] + lY);
if (tImg[tImgIndexB] == 0 && tImg[tImgIndexW] == 1) {
output = 1;
tImg[lX*426+lY] = 3;
//lX += lBlackX[rsGetElementAt_uchar(num, x)];
//lY += lBlackY[rsGetElementAt_uchar(num, x)];
} else {
output = 0;
}
return output;
}
into a function like this:
void function() {
// call kernel 'nextPixel'
}
Thank you in advance.

That's not really how RS is intended to be used. The RS engine calls your kernel with the appropriate data and your kernel can call other functions. However, it's not really a normal case to have a function within your RS code call into an RS kernel.

I get a frame from the camera with a line on it (that starts somewhere at the bottom edge). I need to get every pixel of the left and right edge of the line in two arrays (one for the left edge, one for the right edge), with the first element of the array being the pixel at the bottom edge, and the last element in the array either the pixel on the left, top or right edge.
The frame I get from the camera is in YUV. So I convert it to a binary image (line in black, background in white) with Renderscript (that works).
I could send the processed frame back to Java, set it in a bitmap, and do the line-detection on the bitmap. However, reading and writing data to a bitmap is slow (and I need it to be as fast as possible), so I was trying to do everything in Renderscript. The kernel posted in my first post looks for the next pixel in the line (there are 8 possibilities, so I would like to check the 8 possibilities in parallel).

how to check ray intersection with object in ARCore

Is there a way to check if I touched the object on the screen ? As I understand the HitResult class allows me to check if I touched the recognized and maped surface. But I want to check this I touched the object that is set on that surface.

ARCore doesn't really have a concept of an object, so we can't directly provide that. I suggest looking at ray-sphere tests for a starting point.
However, I can help with getting the ray itself (to be added to HelloArActivity):
/**
* Returns a world coordinate frame ray for a screen point. The ray is
* defined using a 6-element float array containing the head location
* followed by a normalized direction vector.
*/
float[] screenPointToWorldRay(float xPx, float yPx, Frame frame) {
float[] points = new float[12]; // {clip query, camera query, camera origin}
// Set up the clip-space coordinates of our query point
// +x is right:
points[0] = 2.0f * xPx / mSurfaceView.getMeasuredWidth() - 1.0f;
// +y is up (android UI Y is down):
points[1] = 1.0f - 2.0f * yPx / mSurfaceView.getMeasuredHeight();
points[2] = 1.0f; // +z is forwards (remember clip, not camera)
points[3] = 1.0f; // w (homogenous coordinates)
float[] matrices = new float[32]; // {proj, inverse proj}
// If you'll be calling this several times per frame factor out
// the next two lines to run when Frame.isDisplayRotationChanged().
mSession.getProjectionMatrix(matrices, 0, 1.0f, 100.0f);
Matrix.invertM(matrices, 16, matrices, 0);
// Transform clip-space point to camera-space.
Matrix.multiplyMV(points, 4, matrices, 16, points, 0);
// points[4,5,6] is now a camera-space vector. Transform to world space to get a point
// along the ray.
float[] out = new float[6];
frame.getPose().transformPoint(points, 4, out, 3);
// use points[8,9,10] as a zero vector to get the ray head position in world space.
frame.getPose().transformPoint(points, 8, out, 0);
// normalize the direction vector:
float dx = out[3] - out[0];
float dy = out[4] - out[1];
float dz = out[5] - out[2];
float scale = 1.0f / (float) Math.sqrt(dx*dx + dy*dy + dz*dz);
out[3] = dx * scale;
out[4] = dy * scale;
out[5] = dz * scale;
return out;
}
If you're calling this several times per frame see the comment about the getProjectionMatrix and invertM calls.

Apart from Mouse Picking with Ray Casting, cf. Ian's answer, the other commonly used technique is a picking buffer, explained in detail (with C++ code) here
The trick behind 3D picking is very simple. We will attach a running
index to each triangle and have the FS output the index of the
triangle that the pixel belongs to. The end result is that we get a
"color" buffer that doesn't really contain colors. Instead, for each
pixel which is covered by some primitive we get the index of this
primitive. When the mouse is clicked on the window we will read back
that index (according to the location of the mouse) and render the
select triangle red. By combining a depth buffer in the process we
guarantee that when several primitives are overlapping the same pixel
we get the index of the top-most primitive (closest to the camera).
So in a nutshell:
Every object's draw method needs an ongoing index and a boolean for whether this draw renders the pixel buffer or not.
The render method converts the index into a grayscale color and the scene is rendered
After the whole rendering is done, retrieve the pixel color at the touch position GL11.glReadPixels(x, y, /*the x and y of the pixel you want the colour of*/). Then translate the color back to an index and the index back to an object. Voilà, you have your clicked object.
To be fair, for a mobile usecase you should probably read a 10x10 rectangle, iterate trough it and pick the first found non-background color - because touches are never that precise.
This approach works independently of the complexity of your objects

Renderscript Documentation and Advice - Android

I have been following this guide on how to use Render-script on Android.
http://www.jayway.com/2014/02/11/renderscript-on-android-basics/
My code is this (I got a wrapper class for the script):
public class PixelCalcScriptWrapper {
private Allocation inAllocation;
private Allocation outAllocation;
RenderScript rs;
ScriptC_pixelsCalc script;
public PixelCalcScriptWrapper(Context context){
rs = RenderScript.create(context);
script = new ScriptC_pixelsCalc(rs, context.getResources(), R.raw.pixelscalc);
};
public void setInAllocation(Bitmap bmp){
inAllocation = Allocation.createFromBitmap(rs,bmp);
};
public void setOutAllocation(Bitmap bmp){
outAllocation = Allocation.createFromBitmap(rs,bmp);
};
public void forEach_root(){
script.forEach_root(inAllocation, outAllocation);
}
}
This methods calls the script:
public Bitmap processBmp(Bitmap bmp, Bitmap bmpCopy) {
pixelCalcScriptWrapper.setInAllocation(bmp);
pixelCalcScriptWrapper.setOutAllocation(bmpCopy);
pixelCalcScriptWrapper.forEach_root();
return bmpCopy;
};
and here is my script:
#pragma version(1)
#pragma rs java_package_name(test.foo)
void root(const uchar4 *in, uchar4 *out, uint32_t x, uint32_t y) {
float3 pixel = convert_float4(in[0]).rgb;
if(pixel.z < 128) {
pixel.z = 0;
}else{
pixel.z = 255;
}
if(pixel.y < 128) {
pixel.y = 0;
}else{
pixel.y = 255;
}
if(pixel.x < 128) {
pixel.x = 0;
}else{
pixel.x = 255;
}
out->xyz = convert_uchar3(pixel);
}
Now where can I find some documentation about this ?
For example, I have these questions:
1) What does this convert_float4(in[0]) do ?
2) What does the rgb return here convert_float4(in[0]).rgb;?
3) What is float3 ?
4) I don't know where to start with this line out->xyz = convert_uchar3(pixel);
5) I am assuming in the parameters, in and out are the Allocations passed?
what are x and y?

http://developer.android.com/guide/topics/renderscript/reference/rs_convert.html#android_rs:convert
What does this convert_float4(in[0]) do?
convert_float4 will convert from a uchar4 to a float4;
.rgb turns it into a float3 of the first 3 elements.
What does the rgb return?
RenderScript vector types have .r .g .b .a or
.x .y .z .w representing the first, second, third and forth element respectively. You can use any combination (e.g. .xy or .xwy)
What is float3?
float3 is a "vector type" sort of like a float but 3 of them.
There are float2, float3 and float4 vector types of float.
(there are uchar4, int4 etc.)
http://developer.android.com/guide/topics/renderscript/reference/overview.html might be helpful
I hope this helps.

1) In the kernel, the in pointer is a 4-element unsigned char, that is, it represents a pixel color with R, G, B and A values in the 0-255 range. So convert_float4 simply casts each of the four uchar as a float. In this particular code you are using, it probably doesn't make much sense to work with floats, since you're doing a simple threshold, and you could just as well had worked with the uchar data directly. Using floats is better aimed when doing other types of image processing algorithms where you do need to have the extra precision (example: blurring an image).
2) The .rgb suffix is a shorthand to return only the first three values of the float4, i.e. the R, G, and B values. If you had used only .r it would give you the first value as a regular float, if you had used .g it would give you the second value as a float, etc... These three values are then assigned to that float3 variable, which now represents the pixel with only three color channels (that is, no A alpha channel).
3) See #2.
4) Now convert_uchar3 is again another cast that converts the float3 pixel variable back to a uchar3 variable. You are assigning the three values to each of the x, y, and z elements in that order. This is probably a good time to mention that X, Y and Z are completely interchangeable with R, G and B. That statement could just as well have used out->rgb, and it would actually have been more readable that way. Note that out is a uchar4, and by doing this, you are assigning only the first three "rgb" or "xyz" elements in that pointer, the fourth element is left undefined here.
5) Yes, in is the input pixel, out is the output pixel. Then x and y are the x, and y coordinates of the pixel in the overall image. This kernel function is going to be called once for every pixel in the image/allocation you're working with, and so it's usually good to know what coordinate you're at when processing an image. In this particular example since it's only thresholding all pixels in the same way, the coordinates are irrelevant.
Good documentation on RenderScript is very hard to find. I would greatly recommend you take a look at these two videos though, as they will give you a much better sense of how RenderScript works:
AnDevCon: A Deep Dive into RenderScript
Google I/O 2013 - High Performance Applications with RenderScript
Keep in mind that both videos are a couple years old, so some minor details may have changed on the recent APIs, but overall, those are probably the best sources of information for RS.

Render script rendering is much slower than OpenGL rendering on Android

BACKGROUND:
I want to add live filter based on the code of Android camera app. But the architecture of Android camera app is based on OpenGL ES 1.x. I need to use shader to custom our filter implementation. However, it is too difficult to update the camera app to OpenGL ES 2.0. Then I have to find some other methods to implement live filter instead of OpenGL. I decided to use render script after some research.
PROBLEM:
I have wrote a demo of a simple filter by render script. It shows that the fps is much lower than implementing it by OpenGL. About 5 fps vs 15 fps.
QUESTIONS:
The Android official offsite says: The RenderScript runtime will parallelize work across all processors available on a device, such as multi-core CPUs, GPUs, or DSPs, allowing you to focus on expressing algorithms rather than scheduling work or load balancing. Then why is render script implementation slower?
If render script cannot satisfy my requirement, is there a better way?
CODE DETAILS:
Hi I am in the same team with the questioner. We want to write a render-script based live-filter camera. In our test-demo-project, we use a simple filter: a YuvToRGB IntrinsicScript added with a overlay-filter ScriptC script.
In the OpenGL version, we set the camera data as textures and do the image-filter-procss with shader. Like this:
GLES20.glActiveTexture(GLES20.GL_TEXTURE0);
GLES20.glBindTexture(GLES20.GL_TEXTURE_2D, textureYHandle);
GLES20.glUniform1i(shader.uniforms.get("uTextureY"), 0);
GLES20.glTexSubImage2D(GLES20.GL_TEXTURE_2D, 0, 0, 0, mTextureWidth,
mTextureHeight, GLES20.GL_LUMINANCE, GLES20.GL_UNSIGNED_BYTE,
mPixelsYBuffer.position(0));
In the RenderScript version, we set the camera data as Allocation and do the image-filter-procss with script-kernals. Like this:
// The belowing code is from onPreviewFrame(byte[] data, Camera camera) which gives the camera frame data
byte[] imageData = datas[0];
long timeBegin = System.currentTimeMillis();
mYUVInAllocation.copyFrom(imageData);
mYuv.setInput(mYUVInAllocation);
mYuv.forEach(mRGBAAllocationA);
// To make sure the process of YUVtoRGBA has finished!
mRGBAAllocationA.copyTo(mOutBitmap);
Log.e(TAG, "RS time: YUV to RGBA : " + String.valueOf((System.currentTimeMillis() - timeBegin)));
mLayerScript.forEach_overlay(mRGBAAllocationA, mRGBAAllocationB);
mRGBAAllocationB.copyTo(mOutBitmap);
Log.e(TAG, "RS time: overlay : " + String.valueOf((System.currentTimeMillis() - timeBegin)));
mCameraSurPreview.refresh(mOutBitmap, mCameraDisplayOrientation, timeBegin);
The two problems are :
(1) RenderScript process seems slower than OpenGL process.
(2) According to our time-log, the process of YUV to RGBA which uses intrinsic script is very quick, takes about 6ms; but the process of overlay which uses scriptC is very slow, takes about 180ms. How does this happen?
Here is the rs-kernal code of the ScriptC we use(mLayerScript):
#pragma version(1)
#pragma rs java_package_name(**.renderscript)
#pragma stateFragment(parent)
#include "rs_graphics.rsh"
static rs_allocation layer;
static uint32_t dimX;
static uint32_t dimY;
void setLayer(rs_allocation layer1) {
layer = layer1;
}
void setBitmapDim(uint32_t dimX1, uint32_t dimY1) {
dimX = dimX1;
dimY = dimY1;
}
static float BlendOverlayf(float base, float blend) {
return (base < 0.5 ? (2.0 * base * blend) : (1.0 - 2.0 * (1.0 - base) * (1.0 - blend)));
}
static float3 BlendOverlay(float3 base, float3 blend) {
float3 blendOverLayPixel = {BlendOverlayf(base.r, blend.r), BlendOverlayf(base.g, blend.g), BlendOverlayf(base.b, blend.b)};
return blendOverLayPixel;
}
uchar4 __attribute__((kernel)) overlay(uchar4 in, uint32_t x, uint32_t y) {
float4 inPixel = rsUnpackColor8888(in);
uint32_t layerDimX = rsAllocationGetDimX(layer);
uint32_t layerDimY = rsAllocationGetDimY(layer);
uint32_t layerX = x * layerDimX / dimX;
uint32_t layerY = y * layerDimY / dimY;
uchar4* p = (uchar4*)rsGetElementAt(layer, layerX, layerY);
float4 layerPixel = rsUnpackColor8888(*p);
float3 color = BlendOverlay(inPixel.rgb, layerPixel.rgb);
float4 outf = {color.r, color.g, color.b, inPixel.a};
uchar4 outc = rsPackColorTo8888(outf.r, outf.g, outf.b, outf.a);
return outc;
}

Renderscript does not use any GPU or DSPs cores. That is a common misconception encouraged by Google's deliberately vague documentation. Renderscript used to have an interface to OpenGL ES, but that has been deprecated and has never been used for much beyond animated wallpapers. Renderscript will use multiple CPU cores, if available, but I suspect Renderscript will be replaced by OpenCL.
Take a look at the Effects class and the Effects demo in the Android SDK. It shows how to use OpenGL ES 2.0 shaders to apply effects to images without writing OpenGL ES code.
http://software.intel.com/en-us/articles/porting-opengl-games-to-android-on-intel-atom-processors-part-1
UPDATE:
It's wonderful when I learn more answering a question than asking one and that is the case here. You can see from the lack of answers that Renderscript is hardly used outside of Google because of its strange architecture that ignores industry standards like OpenCL and almost non-existent documentation on how it actually works.
Nonetheless, my answer did evoke a rare response from the Renderscrpt development team which includes only one link that actually contains any useful information about renderscript - this article by Alexandru Voica at IMG, the PowerVR GPU vendor:
http://withimagination.imgtec.com/index.php/powervr/running-renderscript-efficiently-with-powervr-gpus-on-android
That article has some good information which was new to me. There are comments posted there from more people who are having trouble getting Renderscript code to actually run on the GPU.
But, I was incorrect to assume that Renderscript is no longer being developed at Google. Although my statement that "Renderscript does not use any GPU or DSPs cores." was true until just fairly recently, I have learned that this has changed as of one of the Jelly Bean releases.
It would have been great if one of the Renderscript developers could have explained that. Or even if they had a public webpage that explains that or that lists
which GPUs are actually supported and how can you tell if your code actually gets run on a GPU.
My opinion is that Google will replace Renderscript with OpenCL eventually and I would not invest time developing with it.

Allocation.copyTo(Bitmap) corrupting pixel values

I'm new to Renderscript, and am striking issues with my first script. As far as I can see (from debugging statements I've inserted) my code works fine, but the computed values are getting mangled when they are being copied back to the Bitmap by the Allocation.copyTo(Bitmap) method.
I was getting weird colours out, so eventually stripped down my script to this sample which shows the problem:
void root(const uchar4 *v_in, uchar4 *v_out, const void *usrData, uint32_t x, uint32_t y)
{
*v_out = rsPackColorTo8888(1.f, 0.f, 0.f, 1.f);
if (x==0 && y==0) {
rsDebug("v_out ", v_out->x, v_out->y, v_out->z, v_out->w);
}
}
Here we are just writing out an opaque red pixel. The debug line seems to print the right value (255 0 0 255) and indeed I get a red pixel in the bitmap.
However if I change the alpha on the red pixel slightly:
*v_out = rsPackColorTo8888(1.f, 0.f, 0.f, 0.998f);
The debug prints (255 0 0 254) which still seems correct, but the final pixel value ends up being (0 0 0 254) ie. black.
Obviously I suspected it was a premulltiplied alpha issue, but my understanding is that the Allocation routines to copy from and to Bitmaps is supposed to handle that for you. At least that's what Chet Haase suggests in this blog post: https://plus.google.com/u/0/+ChetHaase/posts/ef6Deey6xKA.
Also none of the example compute scripts out there seem to mention any issues with pre-multiplied alpha. My script was based on the HelloComputer example from the SDK.
If I am making a mistake, I would love an RS guru to point it out for me.
It's a shame that after 2+ years the documentation for Renderscript is still so poor.
PS. The Bitmaps I'm using are ARGB_888 and I am building and targetting on SDK18 (Android 4.3)

The example works fine because the example does not modify alpha.
If you are going to modify alpha and then use the Allocation as a normal bitmap you should return (r*a, g*a, b*a, a).
However, if you were sending the Allocation to a GL surface which is not pre-multiplied, your code would work as-is.

Develop Reference

The Android operating system is a mobile operating system that was developed by Google (GOOGL?) to be primarily used for touchscreen devices, cell phones, and tablets.