I've just started to play around with NDK to explore the sweet performance boost that I've been promised. To get a feel for the difference, I tried a dumb number-crunching task (render a Mandelbrot set to a bitmap) and compared it to a Java version of the same code. To my big surprise, the C version is significantly slower (5.0 seconds vs 1.6 on my HTC One, on average). Even stranger, the cost isn't because of the overhead of making a native call, but it's the actual number-crunching that takes longer.
This can't be right, can it? What did I miss?
C version (debug timer code removed):
const int MAX_ITER = 63;
const float MAX_DEPTH = 16;
static uint16_t rgb565(int red, int green, int blue)
{
return (uint16_t)(((red << 8) & 0xf800) | ((green << 2) & 0x03e0) | ((blue >> 3) & 0x001f));
}
float zAbs(float re, float im) {
return re*re + im*im;
}
int depth(float cRe, float cIm) {
int i=0;
float re, im;
float zRe = 0.0f;
float zIm = 0.0f;
while ((zAbs(zRe, zIm) < MAX_DEPTH) && (i < MAX_ITER)) {
re = zRe * zRe - zIm * zIm + cRe;
im = 2.0f * zRe * zIm + cIm;
zRe = re;
zIm = im;
i++;
}
return i;
}
extern "C"
void Java_com_example_ndktest_MainActivity_renderFractal(JNIEnv* env, jobject thiz, jobject bitmap, float re0, float im0, float b)
{
AndroidBitmapInfo info;
void* pixels;
int ret;
long t0 = currentTimeInMilliseconds();
if ((ret = AndroidBitmap_getInfo(env, bitmap, &info)) < 0) {
LOGE("AndroidBitmap_getInfo() failed ! error=%d", ret);
return;
}
if (info.format != ANDROID_BITMAP_FORMAT_RGB_565) {
LOGE("Bitmap format is not RGB_565 !");
return;
}
if ((ret = AndroidBitmap_lockPixels(env, bitmap, &pixels)) < 0) {
LOGE("AndroidBitmap_lockPixels() failed ! error=%d", ret);
}
int w = info.width;
int h = info.height;
float re, im;
int z = 0;
uint16_t* px = (uint16_t*)pixels;
for(int y=0; y<h; y++) {
im = im0 + b*((float)y/(float)h);
for(int x=0; x<info.width; x++) {
re = re0 + b*((float)x/(float)w);
z = depth(re, im);
px[y*w + x] = rgb565(0, z*4, z * 16);
}
}
AndroidBitmap_unlockPixels(env, bitmap);
}
Java version:
private static final int MAX_ITER = 63;
private static final float MAX_DEPTH = 16;
static int rgb565(int red, int green, int blue)
{
return ((red << 8) & 0xf800) | ((green << 2) & 0x03e0) | ((blue >> 3) & 0x001f);
}
static float zAbs(float re, float im) {
return re*re + im*im;
}
static int depth(float cRe, float cIm) {
int i=0;
float re, im;
float zRe = 0.0f;
float zIm = 0.0f;
while ((zAbs(zRe, zIm) < MAX_DEPTH) && (i < MAX_ITER)) {
re = zRe * zRe - zIm * zIm + cRe;
im = 2.0f * zRe * zIm + cIm;
zRe = re;
zIm = im;
i++;
}
return i;
}
static void renderFractal(Bitmap bitmap, float re0, float im0, float b)
{
int w = bitmap.getWidth();
int h = bitmap.getHeight();
int[] pixels = new int[w * h];
bitmap.getPixels(pixels, 0, w, 0, 0, w, h);
float re, im;
int z = 0;
for(int y=0; y<h; y++) {
im = im0 + b*((float)y/(float)h);
for(int x=0; x<w; x++) {
re = re0 + b*((float)x/(float)w);
z = depth(re, im);
pixels[y*w + x] = rgb565(0, z*4, z * 16);
}
}
bitmap.setPixels(pixels, 0, w, 0, 0, w, h);
}
As noted in the comments, this was because the NDK code was built for the armeabi target rather than the armeabi-v7a target. The former is intended to work across a broad range of hardware, including devices without floating-point hardware, so it does all floating-point calculations in software.
Building for armeabi-v7a enables the VFP instructions, so anything that relies heavily on floating point calculations will speed up dramatically.
If you build exclusively for armeabi-v7a, you will exclude a fairly broad selection of devices, even relatively recent ones (e.g. the Samsung Galaxy Ace). These devices have VFP support, but the CPU is based on the ARMv6 instruction set rather than ARMv7. There is no "pre-ARMv7 CPU with VFP" build target, so you have to build for armeabi, or use custom build rules and careful selection of supported devices.
At the other end of the spectrum, you may get a small performance boost by specifying hard-float ABI within your armeabi-v7a library (-mhard-float -- requires NDK r9b).
FWIW, one of the selling points of just-in-time compilers like the one in Dalvik is that they can recognize the system capabilities and adapt code generation appropriately.
Related
I am trying to floodfill a bitmap using Renderscript. and my renderscript file progress.rs is
#pragma version(1)
#pragma rs java_package_name(com.intel.sample.androidbasicrs)
rs_allocation input;
int width;
int height;
int xTouchApply;
int yTouchApply;
static int same(uchar4 pixel, uchar4 in);
uchar4 __attribute__((kernel)) root(const uchar4 in, uint32_t x, uint32_t y) {
uchar4 out = in;
rsDebug("Process.rs : image width: ", width);
rsDebug("Process.rs : image height: ", height);
rsDebug("Process.rs : image pointX: ", xTouchApply);
rsDebug("Process.rs : image pointY: ", yTouchApply);
if(xTouchApply >= 0 && xTouchApply < width && yTouchApply >=0 && yTouchApply < height){
// getting touched pixel
uchar4 pixel = rsGetElementAt_uchar4(input, xTouchApply, yTouchApply);
rsDebug("Process.rs : getting touched pixel", 0);
// resets the pixel stack
int topOfStackIndex = 0;
// creating pixel stack
int pixelStack[width*height];
// Pushes the touched pixel onto the stack
pixelStack[topOfStackIndex] = xTouchApply;
pixelStack[topOfStackIndex+1] = yTouchApply;
topOfStackIndex += 2;
//four way stack floodfill algorithm
while(topOfStackIndex>0){
rsDebug("Process.rs : looping while", 0);
// Pops a pixel from the stack
int x = pixelStack[topOfStackIndex - 2];
int y1 = pixelStack[topOfStackIndex - 1];
topOfStackIndex -= 2;
while (y1 >= 0 && same(rsGetElementAt_uchar4(input, x, y1), pixel)) {
y1--;
}
y1++;
int spanLeft = 0;
int spanRight = 0;
while (y1 < height && same(rsGetElementAt_uchar4(input, x, y1), pixel)) {
rsDebug("Process.rs : pointX: ", x);
rsDebug("Process.rs : pointY: ", y1);
float3 outPixel = dot(f4.rgb, channelWeights);
out = rsPackColorTo8888(outPixel);
// conditions to traverse skipPixels to check threshold color(Similar color)
if (!spanLeft && x > 0 && same(rsGetElementAt_uchar4(input, x - 1, y1), pixel)) {
// Pixel to the left must also be changed, pushes it to the stack
pixelStack[topOfStackIndex] = x - 1;
pixelStack[topOfStackIndex + 1] = y1;
topOfStackIndex += 2;
spanLeft = 1;
} else if (spanLeft && !same(rsGetElementAt_uchar4(input, x - 1, y1), pixel)) {
// Pixel to the left has already been changed
spanLeft = 0;
}
// conditions to traverse skipPixels to check threshold color(Similar color)
if (!spanRight && x < width - 1 && same(rsGetElementAt_uchar4(input, x + 1, y1), pixel)) {
// Pixel to the right must also be changed, pushes it to the stack
pixelStack[topOfStackIndex] = x + 1;
pixelStack[topOfStackIndex + 1] = y1;
topOfStackIndex += 2;
spanRight = 1;
} else if (spanRight && x < width - 1 && !same(rsGetElementAt_uchar4(input, x + 1, y1), pixel)) {
// Pixel to the right has already been changed
spanRight = 0;
}
y1++;
}
}
}
return out;
}
static int same(uchar4 px, uchar4 inPx){
int isSame = 0;
if((px.r == inPx.r) && (px.g == inPx.g) && (px.b == inPx.b) && (px.a == inPx.a)) {
isSame = 1;
// rsDebug("Process.rs : matching pixel: ", isSame);
} else {
isSame = 0;
}
// rsDebug("Process.rs : matching pixel: ", isSame);
return isSame;
}
and my Activity's code is:
inputBitmap = Bitmap.createScaledBitmap(inputBitmap, displayWidth, displayHeight, false);
// Create an allocation (which is memory abstraction in the RenderScript)
// that corresponds to the inputBitmap.
allocationIn = Allocation.createFromBitmap(
rs,
inputBitmap,
Allocation.MipmapControl.MIPMAP_NONE,
Allocation.USAGE_SCRIPT
);
allocationOut = Allocation.createTyped(rs, allocationIn.getType());
int imageWidth = inputBitmap.getWidth();
int imageHeight = inputBitmap.getHeight();
script.set_width(imageWidth);
script.set_height(imageHeight);
script.set_input(allocationIn);
//....
//....
// and my onTouchEvent Code is
script.set_xTouchApply(xTouchApply);
script.set_yTouchApply(yTouchApply);
// Run the script.
script.forEach_root(allocationIn, allocationOut);
allocationOut.copyTo(outputBitmap);
when I touched bitmap it is showing Application not responding. It is because of root method is calling for every pixels. How can I optimize this code. And how can I compare two uchar4 variables in Renderscript? How can I improve my same method? Or How can I find similar neighbor pixels using threshold value? I got stuck. Please guys help me.
I don't have much knowledge of c99 programming language and Renderscript. Can you guys debug my renderscript code. and please tell me what's wrong in this code. Or can I improve this renderscript code to floodfill the bitmap. Any help will be appreciated And sorry for my poor English ;-) . Thanks
Renderscript is Android's front-end to GPU-instructions. And it is extremely good if you want to perform operations on each pixel because it uses the massive GPU-parallelism-capabilities. So, you can run an operation on each pixel. For this purpose, you start a program in Renderscript with sth like "for all pixels, do the following".
The flood fill algorithm though cannot run in such a parallel environment because you only know which pixel to paint after painting another pixel before it. This is not only true for renderscript but all GPU-related libraries, like CUDA or others.
i use the Renderscript to do the gaussian blur on a image.
but no matter what i did. the ScriptIntrinsicBlur is more more faster.
why this happened? ScriptIntrinsicBlur is using another method?
this id my RS code:
#pragma version(1)
#pragma rs java_package_name(top.deepcolor.rsimage.utils)
//aussian blur algorithm.
//the max radius of gaussian blur
static const int MAX_BLUR_RADIUS = 1024;
//the ratio of pixels when blur
float blurRatio[(MAX_BLUR_RADIUS << 2) + 1];
//the acquiescent blur radius
int blurRadius = 0;
//the width and height of bitmap
uint32_t width;
uint32_t height;
//bind to the input bitmap
rs_allocation input;
//the temp alloction
rs_allocation temp;
//set the radius
void setBlurRadius(int radius)
{
if(1 > radius)
radius = 1;
else if(MAX_BLUR_RADIUS < radius)
radius = MAX_BLUR_RADIUS;
blurRadius = radius;
/**
calculate the blurRadius by Gaussian function
when the pixel is far way from the center, the pixel will not contribute to the center
so take the sigma is blurRadius / 2.57
*/
float sigma = 1.0f * blurRadius / 2.57f;
float deno = 1.0f / (sigma * sqrt(2.0f * M_PI));
float nume = -1.0 / (2.0f * sigma * sigma);
//calculate the gaussian function
float sum = 0.0f;
for(int i = 0, r = -blurRadius; r <= blurRadius; ++i, ++r)
{
blurRatio[i] = deno * exp(nume * r * r);
sum += blurRatio[i];
}
//normalization to 1
int len = radius + radius + 1;
for(int i = 0; i < len; ++i)
{
blurRatio[i] /= sum;
}
}
/**
the gaussian blur is decomposed two steps:1
1.blur in the horizontal
2.blur in the vertical
*/
uchar4 RS_KERNEL horizontal(uint32_t x, uint32_t y)
{
float a, r, g, b;
for(int k = -blurRadius; k <= blurRadius; ++k)
{
int horizontalIndex = x + k;
if(0 > horizontalIndex) horizontalIndex = 0;
if(width <= horizontalIndex) horizontalIndex = width - 1;
uchar4 inputPixel = rsGetElementAt_uchar4(input, horizontalIndex, y);
int blurRatioIndex = k + blurRadius;
a += inputPixel.a * blurRatio[blurRatioIndex];
r += inputPixel.r * blurRatio[blurRatioIndex];
g += inputPixel.g * blurRatio[blurRatioIndex];
b += inputPixel.b * blurRatio[blurRatioIndex];
}
uchar4 out;
out.a = (uchar) a;
out.r = (uchar) r;
out.g = (uchar) g;
out.b = (uchar) b;
return out;
}
uchar4 RS_KERNEL vertical(uint32_t x, uint32_t y)
{
float a, r, g, b;
for(int k = -blurRadius; k <= blurRadius; ++k)
{
int verticalIndex = y + k;
if(0 > verticalIndex) verticalIndex = 0;
if(height <= verticalIndex) verticalIndex = height - 1;
uchar4 inputPixel = rsGetElementAt_uchar4(temp, x, verticalIndex);
int blurRatioIndex = k + blurRadius;
a += inputPixel.a * blurRatio[blurRatioIndex];
r += inputPixel.r * blurRatio[blurRatioIndex];
g += inputPixel.g * blurRatio[blurRatioIndex];
b += inputPixel.b * blurRatio[blurRatioIndex];
}
uchar4 out;
out.a = (uchar) a;
out.r = (uchar) r;
out.g = (uchar) g;
out.b = (uchar) b;
return out;
}
Renderscript intrinsics are implemented very differently from what you can achieve with a script of your own. This is for several reasons, but mainly because they are built by the RS driver developer of individual devices in a way that makes the best possible use of that particular hardware/SoC configuration, and most likely makes low level calls to the hardware that is simply not available at the RS programming layer.
Android does provide a generic implementation of these intrinsics though, to sort of "fall back" in case no lower hardware implementation is available. Seeing how these generic ones are done will give you some better idea of how these intrinsics work. For example, you can see the source code of the generic implementation of the 3x3 convolution intrinsic here rsCpuIntrinsicConvolve3x3.cpp.
Take a very close look at the code starting from line 98 of that source file, and notice how they use no for loops whatsoever to do the convolution. This is known as unrolled loops, where you add and multiply explicitly the 9 corresponding memory locations in the code, thereby avoiding the need of a for loop structure. This is the first rule you must take into account when optimizing parallel code. You need to get rid of all branching in your kernel. Looking at your code, you have a lot of if's and for's that cause branching -- this means the control flow of the program is not straight through from beginning to end.
If you unroll your for loops, you will immediately see a boost in performance. Note that by removing your for structures you will no longer be able to generalize your kernel for all possible radius amounts. In that case, you would have to create fixed kernels for different radii, and this is exactly why you see separate 3x3 and 5x5 convolution intrinsics, because this is just what they do. (See line 99 of the 5x5 intrinsic at rsCpuIntrinsicConvolve5x5.cpp).
Furthermore, the fact that you have two separate kernels doesn't help. If you're doing a gaussian blur, the convolutional kernel is indeed separable and you can do 1xN + Nx1 convolutions as you've done there, but I would recommend putting both passes together in the same kernel.
Keep in mind though, that even doing these tricks will probably still not give you as fast results as the actual intrinsics, because those have probably been highly optimized for your specific device(s).
I want to add a blur feature to my Android photo editor app. So far, I've made the following code in Cpp to improve speed and efficiency.
class JniBitmap
{
public:
uint32_t* _storedBitmapPixels;
AndroidBitmapInfo _bitmapInfo;
JniBitmap()
{
_storedBitmapPixels = NULL;
}
};
JNIEXPORT void JNICALL Java_com_myapp_utils_NativeBitmapOperations_jniBlurBitmap(JNIEnv * env, jobject obj, jobject handle, uint32_t radius)
{
JniBitmap* jniBitmap = (JniBitmap*) env->GetDirectBufferAddress(handle);
if (jniBitmap->_storedBitmapPixels == NULL) return;
uint32_t width = jniBitmap->_bitmapInfo.width;
uint32_t height = jniBitmap->_bitmapInfo.height;
uint32_t* previousData = jniBitmap->_storedBitmapPixels;
uint32_t* newBitmapPixels = new uint32_t[width * height];
// Array to hold totalRGB
uint8_t totalRGB[3];
uint8_t Pixel_col[3];
uint32_t Pixel_col;
int x, y, kx, ky;
uint8_t tmp;
for (y=0; y<height; y++)
{
for (x=0; x<width; x++)
{
// Colour value RGB
totalRGB[0] = 0.0;
totalRGB[1] = 0.0;
totalRGB[2] = 0.0;
for (ky=-radius; ky<=radius; ky++)
{
for (kx=-radius; kx<=radius; kx++)
{
// Each pixel position
pPixel_col = previousData[(y + ky) * width + x + kx];
totalRBG[0] += (Pixel_col & 0xFF0000) >> 16;
totalRBG[1] += (Pixel_col & 0x00FF00) >> 8;
totalRBG[2] += Pixel_col & 0x0000FF;
}
}
tmp = (radius * 2 + 1) * (radius * 2 + 1);
totalRGB[0] += tmp;
totalRGB[1] += tmp;
totalRGB[2] += tmp;
pPixel_col = totalRGB[0] << 16 + totalRGB[1] << 8 + totalRGB[2];
newBitmapPixels[y * width + x] = pPixel_col;
}
}
delete[] previousData;
jniBitmap->_storedBitmapPixels = newBitmapPixels;
}
I've compiled it with success unsing the latest Android NDK version.
In my Android application, I got this Java code to call the native method:
private ByteBuffer handler = null;
static
{
System.loadLibrary("JniBitmapOperationsLibrary");
}
private native void jniBlurBitmap(ByteBuffer handler, final int radius);
public void blurBitmap(final int radius)
{
if (handler == null) return;
jniBlurBitmap(handler, radius);
}
When I try to call it from my application it gives a blank picture. Did I make something wrong ?
PS: I also have a crop and scale method in my JNI files and they work perfectly. It might be an issue with my Blur algorithm.
Finally, I found the fastest way to do that. At first, I made a JNI script but the Android Support V8 way is clearly faster (About 10 times).
Add the library AND the .so files from the directory SDK/build-tools/android-4.4.2/renderscript/lib/. Copy renderscript-v8.jar to your libs folder and the packaged content to your jni folder.
Use this code
public static Bitmap getBlurredBitmap(Context context, Bitmap bm, int radius)
{
if (bm == null) return null;
if (radius < 1) radius = 1;
if (radius > 25) radius = 25;
Bitmap outputBitmap = Bitmap.createBitmap(bm);
RenderScript rs = RenderScript.create(context);
ScriptIntrinsicBlur theIntrinsic = ScriptIntrinsicBlur.create(rs, Element.U8_4(rs));
Allocation tmpIn = Allocation.createFromBitmap(rs, bm);
Allocation tmpOut = Allocation.createFromBitmap(rs, outputBitmap);
theIntrinsic.setRadius(radius);
theIntrinsic.setInput(tmpIn);
theIntrinsic.forEach(tmpOut);
tmpOut.copyTo(outputBitmap);
return outputBitmap;
}
My Problem is: I've set up a camera in Android and receive the preview data by using an onPreviewFrame-listener which passes me an byte[] array containing the image data in the default android YUV-format (device does not support R5G6B5-format). Each pixel consists of 12bits which makes the thing a little tricky. Now what I want to do is converting the YUV-data into ARGB-data in order to do image processing with it. This has to be done with renderscript, in order to maintain a high performance.
My idea was to pass two pixels in one element (which would be 24bits = 3 bytes) and then return two ARGB pixels. The problem is, that in Renderscript a u8_3 (a 3dimensional 8bit vector) is stored in 32bit, which means that the last 8 bits are unused. But when copying the image data into the allocation all of the 32bits are used, so the last 8bit get lost. Even if I used a 32bit input data, the last 8bit are useless, because they're only 2/3 of a pixel. When defining an element consisting a 3-byte-array it actually has a real size of 3 bytes. But then the Allocation.copyFrom()-method doesn't fill the in-Allocation with data, argueing it doesn't has the right data type to be filled with a byte[].
The renderscript documentation states, that there is a ScriptIntrinsicYuvToRGB which should do exactly that in API Level 17. But in fact the class doesn't exist. I've downloaded API Level 17 even though it seems not to be downloadable any more. Does anyone have any information about it? Does anyone have ever tried out a ScriptIntrinsic?
So in conclusion my question is: How to convert the camera data into ARGB data fast, hardwareaccelerated?
That's how to do it in Dalvik VM (found the code somewhere online, it works):
#SuppressWarnings("unused")
private void decodeYUV420SP(int[] rgb, byte[] yuv420sp, int width, int height) {
final int frameSize = width * height;
for (int j = 0, yp = 0; j < height; j++) {
int uvp = frameSize + (j >> 1) * width, u = 0, v = 0;
for (int i = 0; i < width; i++, yp++) {
int y = (0xff & ((int) yuv420sp[yp])) - 16;
if (y < 0)
y = 0;
if ((i & 1) == 0) {
v = (0xff & yuv420sp[uvp++]) - 128;
u = (0xff & yuv420sp[uvp++]) - 128;
}
int y1192 = 1192 * y;
int r = (y1192 + 1634 * v);
int g = (y1192 - 833 * v - 400 * u);
int b = (y1192 + 2066 * u);
if (r < 0)
r = 0;
else if (r > 262143)
r = 262143;
if (g < 0)
g = 0;
else if (g > 262143)
g = 262143;
if (b < 0)
b = 0;
else if (b > 262143)
b = 262143;
rgb[yp] = 0xff000000 | ((r << 6) & 0xff0000) | ((g >> 2) & 0xff00) | ((b >> 10) & 0xff);
}
}
}
I'm sure you will find the LivePreview test application interesting ... it's part of the Android source code in the latest Jelly Bean (MR1). It implements a camera preview and uses ScriptIntrinsicYuvToRgb to convert the preview data with Renderscript. You can browse the source online here:
LivePreview
I was not able to get running ScriptInstrinsicYuvToRgb, so I decided to write my own RS solution.
Here's ready script (named yuv.rs):
#pragma version(1)
#pragma rs java_package_name(com.package.name)
rs_allocation gIn;
int width;
int height;
int frameSize;
void yuvToRgb(const uchar *v_in, uchar4 *v_out, const void *usrData, uint32_t x, uint32_t y) {
uchar yp = rsGetElementAtYuv_uchar_Y(gIn, x, y) & 0xFF;
int index = frameSize + (x & (~1)) + (( y>>1) * width );
int v = (int)( rsGetElementAt_uchar(gIn, index) & 0xFF ) -128;
int u = (int)( rsGetElementAt_uchar(gIn, index+1) & 0xFF ) -128;
int r = (int) (1.164f * yp + 1.596f * v );
int g = (int) (1.164f * yp - 0.813f * v - 0.391f * u);
int b = (int) (1.164f * yp + 2.018f * u );
r = r>255? 255 : r<0 ? 0 : r;
g = g>255? 255 : g<0 ? 0 : g;
b = b>255? 255 : b<0 ? 0 : b;
uchar4 res4;
res4.r = (uchar)r;
res4.g = (uchar)g;
res4.b = (uchar)b;
res4.a = 0xFF;
*v_out = res4;
}
Don't forget to set camera preview format to NV21:
Parameters cameraParameters = camera.getParameters();
cameraParameters.setPreviewFormat(ImageFormat.NV21);
// Other camera init stuff: preview size, framerate, etc.
camera.setParameters(cameraParameters);
Allocations initialization and script usage:
// Somewhere in initialization section
// w and h are variables for selected camera preview size
rs = RenderScript.create(this);
Type.Builder tbIn = new Type.Builder(rs, Element.U8(rs));
tbIn.setX(w);
tbIn.setY(h);
tbIn.setYuvFormat(ImageFormat.NV21);
Type.Builder tbOut = new Type.Builder(rs, Element.RGBA_8888(rs));
tbOut.setX(w);
tbOut.setY(h);
inData = Allocation.createTyped(rs, tbIn.create(), Allocation.MipmapControl.MIPMAP_NONE, Allocation.USAGE_SCRIPT & Allocation.USAGE_SHARED);
outData = Allocation.createTyped(rs, tbOut.create(), Allocation.MipmapControl.MIPMAP_NONE, Allocation.USAGE_SCRIPT & Allocation.USAGE_SHARED);
outputBitmap = Bitmap.createBitmap(w, h, Bitmap.Config.ARGB_8888);
yuvScript = new ScriptC_yuv(rs);
yuvScript.set_gIn(inData);
yuvScript.set_width(w);
yuvScript.set_height(h);
yuvScript.set_frameSize(previewSize);
//.....
Camera callback method:
public void onPreviewFrame(byte[] data, Camera camera) {
// In your camera callback, data
inData.copyFrom(data);
yuvScript.forEach_yuvToRgb(inData, outData);
outData.copyTo(outputBitmap);
// draw your bitmap where you want to
// .....
}
For anyone who didn't know, RenderScript is now in the Android Support Library, including intrinsics.
http://android-developers.blogspot.com.au/2013/09/renderscript-in-android-support-library.html
http://android-developers.blogspot.com.au/2013/08/renderscript-intrinsics.html
We now have the new renderscript-intrinsics-replacement-toolkit to do it. First, build and import the renderscript module to your project and add it as a dependency to your app module. Then, go to Toolkit.kt and add the following:
fun toNv21(image: Image): ByteArray? {
val nv21 = ByteArray((image.width * image.height * 1.5f).toInt())
return if (!nativeYuv420toNv21(
nativeHandle,
image.width,
image.height,
image.planes[0].buffer, // Y buffer
image.planes[1].buffer, // U buffer
image.planes[2].buffer, // V buffer
image.planes[0].pixelStride, // Y pixel stride
image.planes[1].pixelStride, // U/V pixel stride
image.planes[0].rowStride, // Y row stride
image.planes[1].rowStride, // U/V row stride
nv21
)
) {
null
} else nv21
}
private external fun nativeYuv420toNv21(
nativeHandle: Long,
imageWidth: Int,
imageHeight: Int,
yByteBuffer: ByteBuffer,
uByteBuffer: ByteBuffer,
vByteBuffer: ByteBuffer,
yPixelStride: Int,
uvPixelStride: Int,
yRowStride: Int,
uvRowStride: Int,
nv21Output: ByteArray
): Boolean
Now, go to JniEntryPoints.cpp and add the following:
extern "C" JNIEXPORT jboolean JNICALL Java_com_google_android_renderscript_Toolkit_nativeYuv420toNv21(
JNIEnv *env, jobject/*thiz*/, jlong native_handle,
jint image_width, jint image_height, jobject y_byte_buffer,
jobject u_byte_buffer, jobject v_byte_buffer, jint y_pixel_stride,
jint uv_pixel_stride, jint y_row_stride, jint uv_row_stride,
jbyteArray nv21_array) {
auto y_buffer = static_cast<jbyte*>(env->GetDirectBufferAddress(y_byte_buffer));
auto u_buffer = static_cast<jbyte*>(env->GetDirectBufferAddress(u_byte_buffer));
auto v_buffer = static_cast<jbyte*>(env->GetDirectBufferAddress(v_byte_buffer));
jbyte* nv21 = env->GetByteArrayElements(nv21_array, nullptr);
if (nv21 == nullptr || y_buffer == nullptr || u_buffer == nullptr
|| v_buffer == nullptr) {
// Log this.
return false;
}
RenderScriptToolkit* toolkit = reinterpret_cast<RenderScriptToolkit*>(native_handle);
toolkit->yuv420toNv21(image_width, image_height, y_buffer, u_buffer, v_buffer,
y_pixel_stride, uv_pixel_stride, y_row_stride, uv_row_stride,
nv21);
env->ReleaseByteArrayElements(nv21_array, nv21, 0);
return true;
}
Go to YuvToRgb.cpp and add the following:
void RenderScriptToolkit::yuv420toNv21(int image_width, int image_height, const int8_t* y_buffer,
const int8_t* u_buffer, const int8_t* v_buffer, int y_pixel_stride,
int uv_pixel_stride, int y_row_stride, int uv_row_stride,
int8_t *nv21) {
// Copy Y channel.
for(int y = 0; y < image_height; ++y) {
int destOffset = image_width * y;
int yOffset = y * y_row_stride;
memcpy(nv21 + destOffset, y_buffer + yOffset, image_width);
}
if (v_buffer - u_buffer == sizeof(int8_t)) {
// format = nv21
// TODO: If the format is VUVUVU & pixel stride == 1 we can simply the copy
// with memcpy. In Android Camera2 I have mostly come across UVUVUV packaging
// though.
}
// Copy UV Channel.
int idUV = image_width * image_height;
int uv_width = image_width / 2;
int uv_height = image_height / 2;
for(int y = 0; y < uv_height; ++y) {
int uvOffset = y * uv_row_stride;
for (int x = 0; x < uv_width; ++x) {
int bufferIndex = uvOffset + (x * uv_pixel_stride);
// V channel.
nv21[idUV++] = v_buffer[bufferIndex];
// U channel.
nv21[idUV++] = u_buffer[bufferIndex];
}
}
}
Finally, go to RenderscriptToolkit.h and add the following:
/**
* https://blog.minhazav.dev/how-to-use-renderscript-to-convert-YUV_420_888-yuv-image-to-bitmap/#tobitmapimage-image-method
* #param image_width width of the image you want to convert to byte array
* #param image_height height of the image you want to convert to byte array
* #param y_buffer Y buffer
* #param u_buffer U buffer
* #param v_buffer V buffer
* #param y_pixel_stride Y pixel stride
* #param uv_pixel_stride UV pixel stride
* #param y_row_stride Y row stride
* #param uv_row_stride UV row stride
* #param nv21 the output byte array
*/
void yuv420toNv21(int image_width, int image_height, const int8_t* y_buffer,
const int8_t* u_buffer, const int8_t* v_buffer, int y_pixel_stride,
int uv_pixel_stride, int y_row_stride, int uv_row_stride,
int8_t *nv21);
You are now ready to harness the full power of renderscript. Below, I am providing an example with the ARCore Camera Image object (replace the first line with whatever code gives you your camera image):
val cameraImage = arFrame.frame.acquireCameraImage()
val width = cameraImage.width
val height = cameraImage.height
val byteArray = Toolkit.toNv21(cameraImage)
byteArray?.let {
Toolkit.yuvToRgbBitmap(
byteArray,
width,
height,
YuvFormat.NV21
).let { bitmap ->
saveBitmapToDevice(
name,
session,
bitmap,
context
)}}
``I want to use RGB in the camera's preview.I used JNI to do the YUV to RGB conversion.I changed the data in RGB,then I show RGB on preview by using drawBitmap.But it shows the very slow,how could I improve it
public void onPreviewFrame(final byte[] data, Camera camera) {
Thread showPic = new Thread(new Runnable() {
#Override
public void run() {
// TODO Auto-generated method stub
Canvas c = mHolder.lockCanvas(null);
try {
synchronized (mHolder) {
// TODO Auto-generated method stub
int imageWidth = mCamera.getParameters()
.getPreviewSize().width;
int imageHeight = mCamera.getParameters()
.getPreviewSize().height;
int RGBData[] = new int[imageWidth * imageHeight];
int RGBDataa[] = new int[imageWidth * imageHeight];
int RGBDatab[] = new int[imageWidth * imageHeight];
int center = imageWidth * imageHeight / 2;
Jni.decodeYUV420SP(RGBData, data, imageWidth,
imageHeight); // decode
for (int i = 0; i < center; i++)
RGBDataa[i] = RGBData[i];
for (int i = center; i < imageWidth * imageHeight; i++)
RGBDatab[i - center] = RGBData[i];
for (int i = 0; i < center; i++)
RGBData[i] = RGBDatab[i];
for (int i = center; i < imageWidth * imageHeight; i++)
RGBData[i] = RGBDataa[i - center];
c.drawBitmap(RGBData, 0, imageWidth, 0, 0, imageWidth,
imageHeight, false, new Paint());
// Bitmap bm = Bitmap.createBitmap(RGBData, imageWidth,
// imageHeight, Config.ARGB_8888);
}
} finally {
if (data != null)
mHolder.unlockCanvasAndPost(c);
}
}
});
showPic.run();
}
The following code is Jni
public class Jni {
public native static void decodeYUV420SP(int[] rgb, byte[] yuv420sp, int width,
int height);
}
The method decodeYUV420SP is completed by C.
I came across this thread searching for YUV420sp to RGB565 conversion. Using decodeYUV420SP as suggested above works fine, but I want to add some runtime considerations.
My first test using decodeYUV420SP in Java took 74 seconds decoding and converting a 10 second .mp4 FullHD video (using an Nexus 6 Android device). The profiler reported 96% of the overall process has been spent in decodeYUV420SP.
Moving decodeYUV420SP to JNI / C made made overall time spent go down to 32 seconds. Turning on GCC code optimization, it fell to 12 seconds.
Although the OP suggested this already, I wanted to emphasizes it will not be an improvement to move the routine to Java. While it is correct a context change between Java and C comes at a cost, this is a good example an expensive operation performed in the native function will easily compensate this overhead. The statement on overhead costs is valid for very simple JNI functions and in case method and field lookups back and forth are required. Nothing of this applies here.
To improve the performance of the OP's code: get rid of the memory allocation done for each frame and store it globally, get rid of the arithmetic on RGBData* and move that stuff to JNI too. Arrays can be accessed on JNI level at no cost using the Critical functions.
It's not recommended to use jni because of slow speed for compiler to change java code to C++. I think you should use direct method like this
public void decodeYUV420SP(int[] rgb, byte[] yuv420sp, int width, int height) {
final int frameSize = width * height;
for (int j = 0, yp = 0; j < height; j++) {
int uvp = frameSize + (j >> 1) * width, u = 0, v = 0;
for (int i = 0; i < width; i++, yp++) {
int y = (0xff & ((int) yuv420sp[yp])) - 16;
if (y < 0) y = 0;
if ((i & 1) == 0) {
v = (0xff & yuv420sp[uvp++]) - 128;
u = (0xff & yuv420sp[uvp++]) - 128;
}
int y1192 = 1192 * y;
int r = (y1192 + 1634 * v);
int g = (y1192 - 833 * v - 400 * u);
int b = (y1192 + 2066 * u);
if (r < 0) r = 0; else if (r > 262143) r = 262143;
if (g < 0) g = 0; else if (g > 262143) g = 262143;
if (b < 0) b = 0; else if (b > 262143) b = 262143;
rgb[yp] = 0xff000000 | ((r << 6) & 0xff0000) | ((g >> 2) & 0xff00) | ((b >> 10) & 0xff);
}
}
}
OR simply use this
Camera.Parameters.setPreviewFormat(ImageFormat.RGB_565);
to change the format output to rgb directly