OpenMP slower than serial on Android - android

I am trying to use OpenMP to parallelize the deblocking filter of OpenHEVC.
However, the OpenMP version is slower than the serial one. I even tried an empty loop body, and it still took four times as long as the serial version. I don't know why this happens.
Serial code
for (y = y0; y < y_end; y += 8) {
    for (x = x0 ? x0 : 8; x < x_end; x += 8) {
        const int bs0 = s->vertical_bs[(x >> 3) + (y >> 2) * s->bs_width];
        const int bs1 = s->vertical_bs[(x >> 3) + ((y + 4) >> 2) * s->bs_width];
        int c_tc[2], beta[2], tc[2];
        uint8_t no_p[2] = { 0 };
        uint8_t no_q[2] = { 0 };
        if (bs0 || bs1) {
            const int qp0 = (get_qPy(s, x - 1, y) + get_qPy(s, x, y) + 1) >> 1;
            const int qp1 = (get_qPy(s, x - 1, y + 4) + get_qPy(s, x, y + 4) + 1) >> 1;
            beta[0] = betatable[av_clip(qp0 + (beta_offset >> 1 << 1), 0, MAX_QP)];
            beta[1] = betatable[av_clip(qp1 + (beta_offset >> 1 << 1), 0, MAX_QP)];
            tc[0] = bs0 ? TC_CALC(qp0, bs0) : 0;
            tc[1] = bs1 ? TC_CALC(qp1, bs1) : 0;
            src = &s->frame->data[LUMA][y * s->frame->linesize[LUMA] + (x << s->sps->pixel_shift)];
            if (pcmf) {
                no_p[0] = get_pcm(s, x - 1, y);
                no_p[1] = get_pcm(s, x - 1, y + 4);
                no_q[0] = get_pcm(s, x, y);
                no_q[1] = get_pcm(s, x, y + 4);
                omp_set_lock(&writelock);
                s->hevcdsp.hevc_v_loop_filter_luma_c(src,
                                                     s->frame->linesize[LUMA],
                                                     beta, tc, no_p, no_q);
                omp_unset_lock(&writelock);
            } else {
                omp_set_lock(&writelock);
                s->hevcdsp.hevc_v_loop_filter_luma(src,
                                                   s->frame->linesize[LUMA],
                                                   beta, tc, no_p, no_q);
            }
        }
    }
}
OpenMP code
omp_set_num_threads(4);
#pragma omp parallel shared(s) private(src)
{
    #pragma omp for
    for (y = y0; y < y_end; y += 8) {
        for (x = x0 ? x0 : 8; x < x_end; x += 8) {
            const int bs0 = s->vertical_bs[(x >> 3) + (y >> 2) * s->bs_width];
            const int bs1 = s->vertical_bs[(x >> 3) + ((y + 4) >> 2) * s->bs_width];
            int c_tc[2], beta[2], tc[2];
            uint8_t no_p[2] = { 0 };
            uint8_t no_q[2] = { 0 };
            if (bs0 || bs1) {
                const int qp0 = (get_qPy(s, x - 1, y) + get_qPy(s, x, y) + 1) >> 1;
                const int qp1 = (get_qPy(s, x - 1, y + 4) + get_qPy(s, x, y + 4) + 1) >> 1;
                beta[0] = betatable[av_clip(qp0 + (beta_offset >> 1 << 1), 0, MAX_QP)];
                beta[1] = betatable[av_clip(qp1 + (beta_offset >> 1 << 1), 0, MAX_QP)];
                tc[0] = bs0 ? TC_CALC(qp0, bs0) : 0;
                tc[1] = bs1 ? TC_CALC(qp1, bs1) : 0;
                src = &s->frame->data[LUMA][y * s->frame->linesize[LUMA] + (x << s->sps->pixel_shift)];
                if (pcmf) {
                    no_p[0] = get_pcm(s, x - 1, y);
                    no_p[1] = get_pcm(s, x - 1, y + 4);
                    no_q[0] = get_pcm(s, x, y);
                    no_q[1] = get_pcm(s, x, y + 4);
                    s->hevcdsp.hevc_v_loop_filter_luma_c(src,
                                                         s->frame->linesize[LUMA],
                                                         beta, tc, no_p, no_q);
                } else {
                    s->hevcdsp.hevc_v_loop_filter_luma(src,
                                                       s->frame->linesize[LUMA],
                                                       beta, tc, no_p, no_q);
                }
            }
        }
    }
}
Time (longest):
Serial: 1004 ns
OpenMP: 4150 ns

A blank loop will take longer in parallel than in serial. You don't have nearly enough work inside the loop for parallelization to pay off; the overhead of spawning and joining the threads is taking up most of that time.
Try putting a really heavy workload in there and see what happens. For example, I use OpenMP in Fortran code with loops that take 5 minutes per iteration.
You could even put a 5-second sleep in, just to verify that the iterations are actually running in parallel.
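To see the effect concretely, here is a minimal, self-contained sketch (editor's example, not OpenHEVC code; the loop body and problem size are made up) that times the same loop serially and in parallel:

/* Sketch: compare fork/join overhead against per-iteration work.
   Compile with e.g.: gcc -O2 -fopenmp omp_overhead.c -lm */
#include <math.h>
#include <omp.h>
#include <stdio.h>

int main(void) {
    const int n = 1 << 20;
    double sum = 0.0, t0;

    /* serial baseline */
    t0 = omp_get_wtime();
    for (int i = 0; i < n; i++)
        sum += sqrt((double)i);
    double t_serial = omp_get_wtime() - t0;

    /* same work plus one parallel fork/join */
    sum = 0.0;
    t0 = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += sqrt((double)i);
    double t_parallel = omp_get_wtime() - t0;

    printf("serial %.6f s, parallel %.6f s (sum = %.1f)\n",
           t_serial, t_parallel, sum);
    return 0;
}

With n this large the parallel version should win; shrink the loop toward a few hundred cheap iterations, like the deblocking loop above, and the fixed thread-management cost dominates, which is the regime the question's measurements are in.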

Related

How to convert YUV420SP to RGB and display it?

I'm trying to render a video frame using the Android NDK.
I'm using this Google sample, the Native-Codec NDK sample code, and modified it so I can manually display each video frame (non-tunneled).
I added this code to get the output buffer, which is in YUV:
ANativeWindow_setBuffersGeometry(mWindow, bufferWidth, bufferHeight,
                                 WINDOW_FORMAT_RGBA_8888);
uint8_t *decodedBuff = AMediaCodec_getOutputBuffer(d->codec, status, &bufSize);
auto format = AMediaCodec_getOutputFormat(d->codec);
LOGV("VOUT: format %s", AMediaFormat_toString(format));
AMediaFormat *myFormat = format;
int32_t w, h;
AMediaFormat_getInt32(myFormat, AMEDIAFORMAT_KEY_HEIGHT, &h);
AMediaFormat_getInt32(myFormat, AMEDIAFORMAT_KEY_WIDTH, &w);
err = ANativeWindow_lock(mWindow, &buffer, nullptr);
and this code to convert the YUV to RGB and display it using the native window:
if (err == 0) {
    LOGV("ANativeWindow_lock()");
    int width = w;
    int height = h;
    int const frameSize = width * height;
    int *line = reinterpret_cast<int *>(buffer.bits);
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            /* accessing YUV420SP elements */
            int indexY = y * width + x;
            int indexU = frameSize + (y / 2) * width + (x / 2) * 2;
            int indexV = frameSize + (y / 2) * width + (x / 2) * 2 + 1;
            /* todo: this conversion to int and then later back to int really isn't required.
               There's room for better work here. */
            int Y = 0xFF & decodedBuff[indexY];
            int U = 0xFF & decodedBuff[indexU];
            int V = 0xFF & decodedBuff[indexV];
            /* constants picked up from http://www.fourcc.org/fccyvrgb.php */
            int R = (int) (Y + 1.402f * (V - 128));
            int G = (int) (Y - 0.344f * (U - 128) - 0.714f * (V - 128));
            int B = (int) (Y + 1.772f * (U - 128));
            /* clamping values */
            R = R < 0 ? 0 : R;
            G = G < 0 ? 0 : G;
            B = B < 0 ? 0 : B;
            R = R > 255 ? 255 : R;
            G = G > 255 ? 255 : G;
            B = B > 255 ? 255 : B;
            line[buffer.stride * y + x] = 0xff000000 + (B << 16) + (G << 8) + R;
        }
    }
    ANativeWindow_unlockAndPost(mWindow);
}
Finally, I was able to display video on my device. My problem now is that the video does not scale to fit the surface view :(
Your thoughts are very much appreciated.
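One thing worth checking (an editor's suggestion, untested against this sample): ANativeWindow_setBuffersGeometry() is documented to scale buffers whose dimensions differ from the window's physical size, so passing the decoded frame's w and h instead of fixed bufferWidth/bufferHeight values should let the compositor stretch the frame to fill the surface view:

/* Sketch (untested): request w x h buffers; the window scales them to its
   physical size on screen. Assumes mWindow is the ANativeWindow backing the
   surface view and w/h come from AMediaFormat as in the code above. */
#include <android/native_window.h>

void configureWindow(ANativeWindow *mWindow, int32_t w, int32_t h) {
    ANativeWindow_setBuffersGeometry(mWindow, w, h, WINDOW_FORMAT_RGBA_8888);
}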

Crop YUV byte[] based on a rectangle without any conversion

I have tried to use the logic and the pictorial representation from this SO post. I am confused by the images, though, since one of them follows 4:1:1 while the other uses 4:2:2 nomenclature for a YUV image (NV21).
Right now the issue is that I get an image (converted to Bitmap/PNG) with YUV components smeared all over it, essentially an unusable image.
Any recommendation on how to fix this?
private byte[] cropImage(byte[] data, Rect cropRect) {
    int dataHeight = 480;
    int dataWidth = 640;
    int totalWH = dataWidth * dataHeight;
    // make rect points even; the width & height are already even numbers
    // adjust x coordinates to make them even
    if (cropRect.left % 2 != 0 || cropRect.right % 2 != 0) {
        cropRect.left -= 1;
        cropRect.right -= 1;
    }
    // adjust y coordinates to make them even
    if (cropRect.top % 2 != 0 || cropRect.bottom % 2 != 0) {
        cropRect.top -= 1;
        cropRect.bottom -= 1;
    }
    int area = cropRect.width() * cropRect.height() * 3 / 2;
    Logger.getLogger().d("Size of byte array " + data.length + " Size of alloc area " + area);
    byte[] pixels = new byte[area]; // the size of the array is the dimensions of the sub-photo
    // size.total = size.width * size.height;
    // y = yuv[position.y * size.width + position.x];
    // u = yuv[(position.y / 2) * (size.width / 2) + (position.x / 2) + size.total];
    // v = yuv[(position.y / 2) * (size.width / 2) + (position.x / 2) + size.total + (size.total / 4)];
    try {
        // copy Y plane first
        int srcOffset = cropRect.top * dataWidth;
        int destOffset = 0;
        int lengthToCopy = cropRect.width();
        int y = 0;
        for (; y < cropRect.height(); y++, srcOffset += dataWidth, destOffset += cropRect.width()) {
            // Logger.getLogger().d("IO " + srcOffset + cropRect.left + " oO " + destOffset + " LTC " + lengthToCopy);
            System.arraycopy(data, srcOffset + cropRect.left, pixels, destOffset, lengthToCopy);
        }
        Logger.getLogger().d("Completed Y copy");
        // U and V components are not-interleaved, hence their size is just 1/4th the original size
        // copy U plane
        int nonYPlanerHeight = dataHeight / 4;
        int nonYPlanerWidth = dataWidth / 4;
        srcOffset = totalWH + (cropRect.top / 4 * nonYPlanerWidth);
        for (y = 0; y < cropRect.height();
                y++, srcOffset += nonYPlanerWidth, destOffset += cropRect.width() / 4) {
            System.arraycopy(data, srcOffset + cropRect.left / 4, pixels, destOffset, cropRect.width() / 4);
        }
        Logger.getLogger().d("Completed U copy " + y + " destOffset=" + destOffset);
        // copy V plane
        srcOffset = totalWH + totalWH / 4 + (cropRect.top / 4 * nonYPlanerWidth);
        for (y = 0; y < cropRect.height();
                y++, srcOffset += nonYPlanerWidth, destOffset += cropRect.width() / 4) {
            System.arraycopy(data, srcOffset + cropRect.left / 4, pixels, destOffset, cropRect.width() / 4);
        }
        Logger.getLogger().d("Completed V copy " + y + " destOffset=" + destOffset);
    } catch (ArrayIndexOutOfBoundsException ae) {
        // do nothing
        Logger.getLogger().e("Exception " + ae.getLocalizedMessage());
    }
    return pixels;
}
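For comparison, here is a minimal sketch (in C, untested, with hypothetical names) of the same crop done against NV21's actual layout: a full-resolution Y plane followed by a single half-height plane of interleaved VU pairs whose rows are as wide as the Y rows, rather than the two quarter-width planes the code above assumes:

/* Sketch (untested): crop an NV21 frame without conversion.
   NV21 layout: w*h Y bytes, then (w * h/2) bytes of interleaved VU,
   one VU pair per 2x2 luma block. Crop offsets/sizes assumed even;
   dst must hold cropW * cropH * 3 / 2 bytes. */
#include <stdint.h>
#include <string.h>

void crop_nv21(const uint8_t *src, int srcW, int srcH,
               uint8_t *dst, int left, int top, int cropW, int cropH) {
    /* Y plane: copy cropW bytes from each of cropH rows */
    for (int y = 0; y < cropH; y++)
        memcpy(dst + y * cropW,
               src + (top + y) * srcW + left, cropW);

    /* VU plane: half as many rows, same row stride as Y; the x byte
       offset equals `left` because V and U are interleaved in the row */
    const uint8_t *srcVU = src + srcW * srcH;
    uint8_t *dstVU = dst + cropW * cropH;
    for (int y = 0; y < cropH / 2; y++)
        memcpy(dstVU + y * cropW,
               srcVU + (top / 2 + y) * srcW + left, cropW);
}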

FFT implementation produces a glitch

I'm getting a strange glitch in an FFT graph for white noise:
I've checked with a reference program, and the white-noise file seems to be fine.
Is it a bug in the implementation?
void four1(float data[], int nn, int isign) {
    int n, mmax, m, j, istep, i;
    float wtemp, wr, wpr, wpi, wi, theta;
    float tempr, tempi;

    n = nn << 1;
    j = 1;
    for (int i = 1; i < n; i += 2) {
        if (j > i) {
            tempr = data[j];
            data[j] = data[i];
            data[i] = tempr;
            tempr = data[j + 1];
            data[j + 1] = data[i + 1];
            data[i + 1] = tempr;
        }
        m = n >> 1;
        while (m >= 2 && j > m) {
            j -= m;
            m >>= 1;
        }
        j += m;
    }
    mmax = 2;
    while (n > mmax) {
        istep = 2 * mmax;
        theta = TWOPI / (isign * mmax);
        wtemp = sin(0.5 * theta);
        wpr = -2.0 * wtemp * wtemp;
        wpi = sin(theta);
        wr = 1.0;
        wi = 0.0;
        for (m = 1; m < mmax; m += 2) {
            for (i = m; i <= n; i += istep) {
                j = i + mmax;
                tempr = wr * data[j] - wi * data[j + 1];
                tempi = wr * data[j + 1] + wi * data[j];
                data[j] = data[i] - tempr;
                data[j + 1] = data[i + 1] - tempi;
                data[i] += tempr;
                data[i + 1] += tempi;
            }
            wr = (wtemp = wr) * wpr - wi * wpi + wr;
            wi = wi * wpr + wtemp * wpi + wi;
        }
        mmax = istep;
    }
}
Apart from a few minor changes, this code appears to be taken out of the 2nd edition of Numerical Recipes in C. The documentation for this function (taken from the book) states:
Replaces data[1..2*nn] by its discrete Fourier transform, if isign is input as 1; or replaces data[1..2*nn] by nn times its inverse discrete Fourier transform, if isign is input as −1.
data is a complex array of length nn or, equivalently, a real array of length 2*nn. nn MUST be an integer power of 2 (this is not checked for!).
This implementation yields correct results given an input array with 1-based indexing. You can keep that convention by allocating a C array of size 2*nn+1 and filling it starting at index 1. Alternatively, you can pass an array of size 2*nn that has been filled starting at index 0, but call four1(data - 1, nn, isign) (note the -1 offset on the data pointer).
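A minimal usage sketch of that second option (editor's example; the white-noise fill is only illustrative):

/* Sketch: driving four1() from a 0-based C array via the data - 1 trick.
   data holds NN interleaved (re, im) pairs; NN must be a power of 2. */
#include <stdlib.h>

void four1(float data[], int nn, int isign);

#define NN 1024

int main(void) {
    float *data = malloc(2 * NN * sizeof *data);
    for (int k = 0; k < NN; k++) {
        data[2 * k] = (float) rand() / RAND_MAX - 0.5f; /* real part: white noise */
        data[2 * k + 1] = 0.0f;                         /* imaginary part */
    }
    four1(data - 1, NN, 1); /* four1 indexes data[1..2*NN] */
    /* data now holds the DFT, still as interleaved (re, im) pairs */
    free(data);
    return 0;
}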

Android image processing algorithm performance

I have created a method that performs Sobel edge detection.
I use the camera's YUV byte array to perform the detection on.
My problem is that I only get about 5 fps, which is really low.
I know it can be done faster, because other apps on the market manage good frame rates at good quality.
I pass images at 800x400 resolution.
Can anyone check whether my algorithm can be made shorter or more performant?
I have already moved the algorithm into native code, but there seems to be no difference in fps.
public void process() {
    progress = 0;
    index = 0;
    // calculate size
    // pixel index
    size = width * (height - 2) - 2;
    // pixel loop
    while (size > 0) {
        // get Y matrix values from YUV
        ay = input[index];
        by = input[index + 1];
        cy = input[index + 2];
        gy = input[index + doubleWidth];
        hy = input[index + doubleWidth + 1];
        iy = input[index + doubleWidth + 2];
        // get X matrix values from YUV
        ax = input[index];
        cx = input[index + 2];
        dx = input[index + width];
        fx = input[index + width + 2];
        gx = input[index + doubleWidth];
        ix = input[index + doubleWidth + 2];
        //  1  2  1
        //  0  0  0
        // -1 -2 -1
        sumy = ay + (by * 2) + cy - gy - (2 * hy) - iy;
        // -1  0  1
        // -2  0  2
        // -1  0  1
        sumx = -ax + cx - (2 * dx) + (2 * fx) - gx + ix;
        total[index] = (int) Math.sqrt(sumx * sumx + sumy * sumy);
        // Math.atan2(sumx, sumy);
        if (max < total[index])
            max = total[index];
        // sum = -a - (2*b) - c + g + (2*h) + i;
        if (total[index] < 0)
            total[index] = 0;
        // clamp to 255
        if (total[index] > 255)
            total[index] = 0;
        sum = (int) (total[index]);
        output[index] = 0xff000000 | (sum << 16) | (sum << 8) | sum;
        size--;
        // next
        index++;
    }
    // ratio = max / 255;
}
Thanks in advance!
So I have two things:
I would consider dropping the Math.sqrt() call: if you are only interested in edge detection, I see no need for it, since the sqrt function is monotonic and really costly to calculate.
I would consider another algorithm; in particular, I have had good results with a separable convolution filter: http://www.songho.ca/dsp/convolution/convolution.html#separable_convolution. This might bring down the number of arithmetic operations per pixel (which are probably your bottleneck); see the sketch after this answer.
I hope this helps, or at least sparks some inspiration. Good luck.
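To illustrate the separable idea (a rough C sketch by the editor, not a drop-in replacement for the Java code above): the Sobel x-kernel factors into the column vector [1, 2, 1] times the row vector [-1, 0, 1], so the 3x3 convolution can be replaced by two cheap 1-D passes:

/* Sketch: separable Sobel-x on an 8-bit grayscale image.
   Two 1-D passes (vertical smoothing [1 2 1], then horizontal
   derivative [-1 0 1]) replace the 3x3 convolution; image borders
   are skipped for brevity. tmp and out are caller-allocated w*h. */
#include <stdint.h>

void sobel_x(const uint8_t *in, int16_t *tmp, int16_t *out, int w, int h) {
    /* vertical pass: tmp = [1 2 1]^T applied down each column */
    for (int y = 1; y < h - 1; y++)
        for (int x = 0; x < w; x++)
            tmp[y * w + x] = in[(y - 1) * w + x]
                           + 2 * in[y * w + x]
                           + in[(y + 1) * w + x];

    /* horizontal pass: out = [-1 0 1] applied along each row */
    for (int y = 1; y < h - 1; y++)
        for (int x = 1; x < w - 1; x++)
            out[y * w + x] = tmp[y * w + x + 1] - tmp[y * w + x - 1];
}

The y-kernel factors the same way with the two passes swapped: horizontal smoothing [1 2 1], then a vertical [1 0 -1] derivative.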
If you are using your algorithm in real time, call it less often, maybe every ~20 frames instead of every frame.
Do more work per iteration: at 800x400, your algorithm runs 318,398 iterations. Each iteration pulls from the input array in a way that looks random to the processor, which causes issues with caching. Try pulling ay, ay2, by, by2, cy, cy2 and doing twice the calculations per loop; you'll notice that the variables in the next iteration relate to the previous one (ay is now ay2, etc.).
Here's a rewrite of your algorithm doing twice the work per iteration. It saves a bit in redundant memory accesses and replaces the square root mentioned in the other answer with an integer version.
public void process() {
    progress = 0;
    index = 0;
    // calculate size
    // pixel index
    size = width * (height - 2) - 2;

    // do FIRST iteration outside of the loop
    // grab input, avoiding redundant memory accesses
    ay = ax = input[index];
    by = ay2 = ax2 = input[index + 1];
    cy = by2 = cx = input[index + 2];
    cy2 = cx2 = input[index + 3];
    gy = gx = input[index + doubleWidth];
    hy = gy2 = gx2 = input[index + doubleWidth + 1];
    iy = hy2 = ix = input[index + doubleWidth + 2];
    iy2 = ix2 = input[index + doubleWidth + 3];
    dx = input[index + width];
    dx2 = input[index + width + 1];
    fx = input[index + width + 2];
    fx2 = input[index + width + 3];
    //
    sumy = ay + (by * 2) + cy - gy - (2 * hy) - iy;
    sumy2 = ay2 + (by2 * 2) + cy2 - gy2 - (2 * hy2) - iy2;
    sumx = -ax + cx - (2 * dx) + (2 * fx) - gx + ix;
    sumx2 = -ax2 + cx2 - (2 * dx2) + (2 * fx2) - gx2 + ix2;
    // integer square root instead of Math.sqrt
    total[index] = fastSqrt(sumx * sumx + sumy * sumy);
    total[index + 1] = fastSqrt(sumx2 * sumx2 + sumy2 * sumy2);
    max = Math.max(max, Math.max(total[index], total[index + 1]));
    // skip the test for negative values, it can never happen
    if (total[index] > 255) total[index] = 0;
    if (total[index + 1] > 255) total[index + 1] = 0;
    sum = (int) (total[index]);
    sum2 = (int) (total[index + 1]);
    output[index] = 0xff000000 | (sum << 16) | (sum << 8) | sum;
    output[index + 1] = 0xff000000 | (sum2 << 16) | (sum2 << 8) | sum2;
    size -= 2;
    index += 2;

    while (size > 0) {
        // grab input, reusing values from the previous iteration
        ay = ax = cy;
        by = ay2 = ax2 = cy2;
        cy = by2 = cx = input[index + 2];
        cy2 = cx2 = input[index + 3];
        gy = gx = iy;
        hy = gy2 = gx2 = iy2;
        iy = hy2 = ix = input[index + doubleWidth + 2];
        iy2 = ix2 = input[index + doubleWidth + 3];
        dx = fx;
        dx2 = fx2;
        fx = input[index + width + 2];
        fx2 = input[index + width + 3];
        //
        sumy = ay + (by * 2) + cy - gy - (2 * hy) - iy;
        sumy2 = ay2 + (by2 * 2) + cy2 - gy2 - (2 * hy2) - iy2;
        sumx = -ax + cx - (2 * dx) + (2 * fx) - gx + ix;
        sumx2 = -ax2 + cx2 - (2 * dx2) + (2 * fx2) - gx2 + ix2;
        // integer square root instead of Math.sqrt
        total[index] = fastSqrt(sumx * sumx + sumy * sumy);
        total[index + 1] = fastSqrt(sumx2 * sumx2 + sumy2 * sumy2);
        max = Math.max(max, Math.max(total[index], total[index + 1]));
        // skip the test for negative values, it can never happen
        if (total[index] > 255) total[index] = 0;
        if (total[index + 1] > 255) total[index + 1] = 0;
        sum = (int) (total[index]);
        sum2 = (int) (total[index + 1]);
        output[index] = 0xff000000 | (sum << 16) | (sum << 8) | sum;
        output[index + 1] = 0xff000000 | (sum2 << 16) | (sum2 << 8) | sum2;
        size -= 2;
        index += 2;
    }
}
// some faster integer-only implementation of square root
// (a simple bit-by-bit integer square root; see the link below for faster variants)
public static int fastSqrt(int x) {
    int res = 0;
    int bit = 1 << 30; // the highest power of four <= 2^30
    while (bit > x)
        bit >>= 2;
    while (bit != 0) {
        if (x >= res + bit) {
            x -= res + bit;
            res = (res >> 1) + bit;
        } else {
            res >>= 1;
        }
        bit >>= 2;
    }
    return res;
}
Please note, the above code was not tested; it was written inside the browser window and may contain syntax errors.
EDIT: You could try using a fast integer-only square root function to avoid Math.sqrt:
http://atoms.alife.co.uk/sqrt/index.html

How do I downsize a YUV420SP (NV21) frame retrieved from Android camera?

I am currently working on an Android application that processes camera frames retrieved from Camera.PreviewCallback.onPreviewFrame(). These frames are encoded in YUV420SP format and provided as a byte array.
I need to downsize the full frame and its contents, let's say by a factor of 2, from 640x480 px to 320x240. I guess, for downsizing the luminance part, I could just run a loop copying every second value from the byte[] frame to a new, smaller array, but what about the chrominance part? Does anyone know more about the structure of a YUV420SP frame?
Many thanks in advance!
Here is code to get a half-size RGBA image from YUV420SP bytes:
// byte[] data;
int frameSize = getFrameWidth() * getFrameHeight();
int[] rgba = new int[frameSize / 4];
for (int i = 0; i < getFrameHeight() / 2; i++)
    for (int j = 0; j < getFrameWidth() / 2; j++) {
        int y1 = (0xff & ((int) data[2 * i * getFrameWidth() + j * 2]));
        int y2 = (0xff & ((int) data[2 * i * getFrameWidth() + j * 2 + 1]));
        int y3 = (0xff & ((int) data[(2 * i + 1) * getFrameWidth() + j * 2]));
        int y4 = (0xff & ((int) data[(2 * i + 1) * getFrameWidth() + j * 2 + 1]));
        int y = (y1 + y2 + y3 + y4) / 4;
        int u = (0xff & ((int) data[frameSize + i * getFrameWidth() + j * 2 + 0]));
        int v = (0xff & ((int) data[frameSize + i * getFrameWidth() + j * 2 + 1]));
        y = y < 16 ? 16 : y;
        int r = Math.round(1.164f * (y - 16) + 1.596f * (v - 128));
        int g = Math.round(1.164f * (y - 16) - 0.813f * (v - 128) - 0.391f * (u - 128));
        int b = Math.round(1.164f * (y - 16) + 2.018f * (u - 128));
        r = r < 0 ? 0 : (r > 255 ? 255 : r);
        g = g < 0 ? 0 : (g > 255 ? 255 : g);
        b = b < 0 ? 0 : (b > 255 ? 255 : b);
        rgba[i * getFrameWidth() / 2 + j] = 0xff000000 + (b << 16) + (g << 8) + r;
    }
Bitmap bmp = Bitmap.createBitmap(getFrameWidth() / 2, getFrameHeight() / 2, Bitmap.Config.ARGB_8888);
bmp.setPixels(rgba, 0 /* offset */, getFrameWidth() / 2 /* stride */, 0, 0, getFrameWidth() / 2, getFrameHeight() / 2);
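If you would rather stay in YUV420SP instead of producing RGBA, a minimal untested sketch (in C, with hypothetical names; dimensions assumed even) is to average each 2x2 luma block and keep every other VU pair from every other chroma row:

/* Sketch (untested): downscale an NV21/YUV420SP frame by 2.
   Layout: w*h Y bytes, then (w * h/2) bytes of interleaved VU.
   Output is a valid (w/2) x (h/2) NV21 frame. */
#include <stdint.h>

void downsize_nv21_half(const uint8_t *src, int w, int h, uint8_t *dst) {
    int ow = w / 2, oh = h / 2;

    /* Y plane: average each 2x2 block (rounding) */
    for (int y = 0; y < oh; y++)
        for (int x = 0; x < ow; x++) {
            const uint8_t *p = src + (2 * y) * w + 2 * x;
            dst[y * ow + x] = (uint8_t) ((p[0] + p[1] + p[w] + p[w + 1] + 2) / 4);
        }

    /* VU plane: take every other pair from every other chroma row */
    const uint8_t *srcVU = src + w * h;
    uint8_t *dstVU = dst + ow * oh;
    for (int y = 0; y < oh / 2; y++)
        for (int x = 0; x < ow / 2; x++) {
            dstVU[y * ow + 2 * x]     = srcVU[(2 * y) * w + 4 * x];     /* V */
            dstVU[y * ow + 2 * x + 1] = srcVU[(2 * y) * w + 4 * x + 1]; /* U */
        }
}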
