We are using RenderScript for audio DSP processing. It is simple and improves performance significantly for our use case. But we have run into an annoying issue with USAGE_SHARED on devices whose custom driver enables GPU execution.
As you may know, the USAGE_SHARED flag makes the RenderScript allocation reuse the given memory rather than creating a copy of it. As a consequence, it not only saves memory but, in our case, also brings performance up to the desired level.
The following code with USAGE_SHARED works fine with the default RenderScript driver (libRSDriver.so). With the custom driver (libRSDriver_adreno.so), USAGE_SHARED does not reuse the given memory, so the backing data is never updated.
This is the code that makes use of USAGE_SHARED and calls the RenderScript kernel:
void process(float* in1, float* in2, float* out, size_t size) {
    sp<RS> rs = new RS();
    rs->init(app_cache_dir);

    sp<const Element> e = Element::F32(rs);
    sp<const Type> t = Type::create(rs, e, size, 0, 0);

    sp<Allocation> in1Alloc = Allocation::createTyped(
        rs, t,
        RS_ALLOCATION_MIPMAP_NONE,
        RS_ALLOCATION_USAGE_SCRIPT | RS_ALLOCATION_USAGE_SHARED,
        in1);
    sp<Allocation> in2Alloc = Allocation::createTyped(
        rs, t,
        RS_ALLOCATION_MIPMAP_NONE,
        RS_ALLOCATION_USAGE_SCRIPT | RS_ALLOCATION_USAGE_SHARED,
        in2);
    sp<Allocation> outAlloc = Allocation::createTyped(
        rs, t,
        RS_ALLOCATION_MIPMAP_NONE,
        RS_ALLOCATION_USAGE_SCRIPT | RS_ALLOCATION_USAGE_SHARED,
        out);

    ScriptC_x* rsX = new ScriptC_x(rs);
    rsX->set_in1Alloc(in1Alloc);
    rsX->set_in2Alloc(in2Alloc);
    rsX->set_size(size);
    rsX->forEach_compute(in1Alloc, outAlloc);
}
NOTE: This variation of Allocation::createTyped() is not mentioned in the documentation, but the header rsCppStructs.h has it. This is the allocation factory method that allows providing a backing pointer and respects the USAGE_SHARED flag. This is how it is declared:
/**
 * Creates an Allocation for use by scripts with a given Type and a backing pointer. For use
 * with RS_ALLOCATION_USAGE_SHARED.
 * @param[in] rs Context to which the Allocation will belong
 * @param[in] type Type of the Allocation
 * @param[in] mipmaps desired mipmap behavior for the Allocation
 * @param[in] usage usage for the Allocation
 * @param[in] pointer existing backing store to use for this Allocation if possible
 * @return new Allocation
 */
static sp<Allocation> createTyped(
        const sp<RS>& rs, const sp<const Type>& type,
        RsAllocationMipmapControl mipmaps,
        uint32_t usage,
        void* pointer);
This is the RenderScript kernel:
rs_allocation in1Alloc, in2Alloc;
uint32_t size;

// JUST AN EXAMPLE KERNEL
// Not using a reduction kernel since those are only available in later API levels.
// Not sure if the support library helps here; anyway, that's unrelated to the current problem.
float __attribute__((kernel)) compute(float ignored, uint32_t x) {
    float result = 0.0f;
    for (uint32_t i = 0; i < size; i++) {
        result += rsGetElementAt_float(in1Alloc, x) * rsGetElementAt_float(in2Alloc, size - i - 1); // just an example computation
    }
    return result;
}
As mentioned, with the custom driver out never receives the results of the computation.
syncAll(RS_ALLOCATION_USAGE_SHARED) also didn't help.
The following works, though much more slowly:
void process(float* in1, float* in2, float* out, size_t size) {
    sp<RS> rs = new RS();
    rs->init(app_cache_dir);

    sp<const Element> e = Element::F32(rs);
    sp<const Type> t = Type::create(rs, e, size, 0, 0);

    sp<Allocation> in1Alloc = Allocation::createTyped(rs, t);
    in1Alloc->copy1DFrom(in1);
    sp<Allocation> in2Alloc = Allocation::createTyped(rs, t);
    in2Alloc->copy1DFrom(in2);
    sp<Allocation> outAlloc = Allocation::createTyped(rs, t);

    ScriptC_x* rsX = new ScriptC_x(rs);
    rsX->set_in1Alloc(in1Alloc);
    rsX->set_in2Alloc(in2Alloc);
    rsX->set_size(size);
    rsX->forEach_compute(in1Alloc, outAlloc);
    outAlloc->copy1DTo(out);
}
Copying makes it work, but in our testing the copying back and forth significantly degrades performance.
If we switch off GPU execution through the debug.rs.default-CPU-driver system property, we can see that the custom driver works well, with the desired performance.
Aligning the memory given to RenderScript to 16, 32, ..., 1024, etc. bytes did not make the custom driver respect USAGE_SHARED.
Question
So, our question is this: how do we make this kernel work on devices whose custom RenderScript driver enables GPU execution?
You need to keep the copy even if you use USAGE_SHARED.
USAGE_SHARED is just a hint to the driver; it doesn't have to honor it.
If the driver does share the memory, the copy will be ignored and performance will be the same.
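In Java terms, the same advice looks like the slower variant above: a minimal sketch, assuming a generated script class with a forEach_compute kernel (hypothetical name), where the explicit copies stay in place and simply become cheap on drivers that really do share the memory:
// Sketch only: keep the explicit copies. Drivers that honor the sharing hint
// make them near free; drivers that ignore it need them for correctness.
Allocation inAlloc  = Allocation.createSized(rs, Element.F32(rs), size);
Allocation outAlloc = Allocation.createSized(rs, Element.F32(rs), size);
inAlloc.copyFrom(in);                        // float[] -> Allocation
script.forEach_compute(inAlloc, outAlloc);
outAlloc.copyTo(out);                        // Allocation -> float[]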
Related
I have an overheating issue that turns off my phone after running for a couple of hours. I want to run this 24/7; please help me improve it.
I use the Camera2 interface with the RAW format, followed by a RenderScript to convert YUV_420_888 to RGBA. My RenderScript is as below:
#pragma version(1)
#pragma rs java_package_name(com.sensennetworks.sengaze)
#pragma rs_fp_relaxed

rs_allocation gCurrentFrame;
rs_allocation gByteFrame;
int32_t gFrameWidth;

uchar4 __attribute__((kernel)) yuv2RGBAByteArray(uchar4 prevPixel, uint32_t x, uint32_t y)
{
    // Read in pixel values from the latest frame - YUV color space
    // The rsGetElementAtYuv_uchar_? functions require API 18
    uchar4 curPixel;
    curPixel.r = rsGetElementAtYuv_uchar_Y(gCurrentFrame, x, y);
    curPixel.g = rsGetElementAtYuv_uchar_U(gCurrentFrame, x, y);
    curPixel.b = rsGetElementAtYuv_uchar_V(gCurrentFrame, x, y);

    // uchar4 rsYuvToRGBA_uchar4(uchar y, uchar u, uchar v);
    // This function uses the NTSC formulae to convert YUV to RGB
    uchar4 out = rsYuvToRGBA_uchar4(curPixel.r, curPixel.g, curPixel.b);

    rsSetElementAt_uchar(gByteFrame, out.r, 4 * (y * gFrameWidth + x) + 0);
    rsSetElementAt_uchar(gByteFrame, out.g, 4 * (y * gFrameWidth + x) + 1);
    rsSetElementAt_uchar(gByteFrame, out.b, 4 * (y * gFrameWidth + x) + 2);
    rsSetElementAt_uchar(gByteFrame, 255,   4 * (y * gFrameWidth + x) + 3);
    return out;
}
This is where I call the RenderScript to convert to RGBA:
@Override
public void onBufferAvailable(Allocation a) {
    inputAllocation.ioReceive();

    // Run a processing pass if we should send a frame
    final long current = System.currentTimeMillis();
    if ((current - lastProcessed) >= frameEveryMs) {
        yuv2rgbaScript.forEach_yuv2RGBAByteArray(scriptAllocation, outputAllocation);
        if (rgbaByteArrayCallback != null) {
            outputAllocationByte.copyTo(outBufferByte);
            rgbaByteArrayCallback.onRGBAArrayByte(outBufferByte);
        }
        lastProcessed = current;
    }
}
And this is the callback that runs image processing using OpenCV:
@Override
public void onRGBAArrayByte(byte[] rgbaByteArray) {
    try {
        /* Fill images. */
        rgbaMat.put(0, 0, rgbaByteArray);
        analytic.processFrame(rgbaMat);
        /* Send fps to UI for debug purposes. */
        calcFPS(true);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
The whole thing runs at ~22 fps. I've checked carefully and there are no memory leaks. But after running this for some time, even with the screen off, the phone gets very hot and turns itself off. Note that the issue persists even if I remove the image-processing part. What could be wrong with this? I can turn on the phone's camera app and leave it running for hours without a problem.
Does RenderScript cause the heat?
Does 22 fps cause the heat? Maybe I should reduce it?
Does the Android background service cause the heat?
Thanks.
PS: I tested this on an LG G4 with full Camera2 interface support.
In theory, your device should throttle itself if it starts to overheat, and never shut down; that would just reduce your frame rate as the device warms up. But some devices aren't as good at this as they should be, unfortunately.
Basically, anything that reduces your CPU/GPU usage will reduce power consumption and heat generation. Basic tips:
Do not copy buffers. Each copy is very expensive when you're doing it at ~30 fps. Here, you're copying from the Allocation to a byte[], and then from that byte[] to the rgbaMat. That's twice as expensive as copying from the Allocation to the rgbaMat directly. Unfortunately, I'm not sure there's a direct way to copy from the Allocation to the rgbaMat, or to create an Allocation that's backed by the same memory as the rgbaMat.
Are you sure you can't do your OpenCV processing on YUV data instead? That would save you a lot of overhead here; the YUV->RGB conversion is not cheap when not done in hardware.
There's also an RS intrinsic, ScriptIntrinsicYuvToRGB, which may give you better performance than your hand-written kernel.
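For reference, a minimal sketch of swapping in the intrinsic (assuming a RenderScript context rs and the inputAllocation/outputAllocation from the code above, with the output allocation using Element.U8_4):
ScriptIntrinsicYuvToRGB yuvToRgb = ScriptIntrinsicYuvToRGB.create(rs, Element.U8_4(rs));
// per frame, after inputAllocation.ioReceive():
yuvToRgb.setInput(inputAllocation);
yuvToRgb.forEach(outputAllocation);  // writes RGBA directly, no per-pixel rsSetElementAt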
Exactly as the title says.
I have a parallelized image creation/processing algorithm that I'd like to use. It is a kind of Perlin noise implementation.
// Logging is never used here
#pragma version(1)
#pragma rs java_package_name(my.package.name)
#pragma rs_fp_full

float sizeX, sizeY;
float ratio;

static float fbm(float2 coord)
{ ... }

uchar4 RS_KERNEL root(uint32_t x, uint32_t y)
{
    float u = x / sizeX * ratio;
    float v = y / sizeY;
    float2 p = {u, v};

    float res = fbm(p) * 2.0f; // rs.: 8245 ms, fs: 8307 ms; fs: 9842 ms on tablet
    float4 color = {res, res, res, 1.0f};
    //float4 color = {p.x, p.y, 0.0, 1.0}; // rs.: 96 ms
    return rsPackColorTo8888(color);
}
As a comparison, this exact algorithm runs at at least 30 fps when I implement it on the GPU via a fragment shader on a textured quad.
The overhead of running the RenderScript should be at most 100 ms, which I calculated by making a simple bitmap that just returns the normalized x and y coordinates.
This means that if it were using the GPU, it would certainly not take 10 seconds.
The code I am using the RenderScript with:
// The non-support version gives at least an extra 25% performance boost
import android.content.Context;
import android.graphics.Bitmap;
import android.renderscript.Allocation;
import android.renderscript.RenderScript;

public class RSNoise {

    private RenderScript renderScript;
    private ScriptC_noise noiseScript;

    private Allocation allOut;
    private Bitmap outBitmap;

    final int sizeX = 1536;
    final int sizeY = 2048;

    public RSNoise(Context context) {
        renderScript = RenderScript.create(context);
        outBitmap = Bitmap.createBitmap(sizeX, sizeY, Bitmap.Config.ARGB_8888);
        allOut = Allocation.createFromBitmap(renderScript, outBitmap,
                Allocation.MipmapControl.MIPMAP_NONE, Allocation.USAGE_GRAPHICS_TEXTURE);
        noiseScript = new ScriptC_noise(renderScript);
    }

    // Only the render function is benchmarked
    public Bitmap render() {
        noiseScript.set_sizeX((float) sizeX);
        noiseScript.set_sizeY((float) sizeY);
        noiseScript.set_ratio((float) sizeX / (float) sizeY);

        noiseScript.forEach_root(allOut);
        allOut.copyTo(outBitmap);

        return outBitmap;
    }
}
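For reference, the render() benchmark can be timed along these lines (a minimal sketch; the SystemClock/Log calls are an assumption, not the original measurement code):
RSNoise noise = new RSNoise(context);
long t0 = SystemClock.elapsedRealtime();
Bitmap bitmap = noise.render();
Log.d("RSNoise", "render() took " + (SystemClock.elapsedRealtime() - t0) + " ms");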
If I change it to FilterScript, following this answer (https://stackoverflow.com/a/14942723/4420543), I get results several hundred milliseconds worse with the support library and about twice as slow with the non-support one. The precision did not influence the results.
I have also checked every question on Stack Overflow, but most of them are outdated, and I have tried it on a Nexus 5 (7.1.1 OS version) among several other new devices, but the problem still remains.
So, when does RenderScript run on the GPU? It would be enough if someone could give me an example of a GPU-running RenderScript.
Can you try running it with rs_fp_relaxed instead of rs_fp_full?
#pragma rs_fp_relaxed
rs_fp_full will force your script to run on the CPU, since most GPUs don't support full-precision floating-point operations.
I can agree with your guess.
On a Nexus 7 (2013, JellyBean 4.3) I wrote a RenderScript and a FilterScript, respectively, to calculate the famous Mandelbrot set.
Compared to an OpenGL fragment shader doing the same thing (all with 32-bit floats), the scripts were about 3 times slower. I assume OpenGL uses the GPU where RenderScript (and FilterScript!) does not.
Then I compared camera preview conversion (NV21 format -> RGB) with a RenderScript, a FilterScript, and the ScriptIntrinsicYuvToRGB, respectively.
Here the intrinsic is about 4 times faster than the self-written scripts.
Again, I see no difference in performance between RenderScript and FilterScript. In this case I assume the self-written scripts again use only the CPU, whereas the intrinsic makes use of the GPU (too?).
Background
I'm trying to learn RenderScript, so I want to try some simple operations that I have in mind.
The problem
I thought of rotating a bitmap, which is something that's simple enough to manage.
In C/C++, it's a simple thing to do (search for "jniRotateBitmapCw90"):
https://github.com/AndroidDeveloperLB/AndroidJniBitmapOperations/blob/master/JniBitmapOperationsLibrary/jni/JniBitmapOperationsLibrary.cpp
Thing is, when I try this in RenderScript, I get this error:
android.support.v8.renderscript.RSRuntimeException: Dimension mismatch between parameters ain and aout!
Here's what I do:
RS:
void rotate90CW(const uchar4 *in, uchar4 *out, uint32_t x, uint32_t y) {
    // XY. ..X ... ...
    // ...>..Y>...>Y..
    // ... ... .YX X..
    out[...] = in[...] ...
}
Java:
mRenderScript = RenderScript.create(this);
mInBitmap = BitmapFactory.decodeResource(getResources(), R.drawable.sample_photo);
mOutBitmap = Bitmap.createBitmap(mInBitmap.getHeight(), mInBitmap.getWidth(), mInBitmap.getConfig());
final Allocation input = Allocation.createFromBitmap(mRenderScript, mInBitmap,
        Allocation.MipmapControl.MIPMAP_NONE, Allocation.USAGE_SCRIPT);
final Allocation output = Allocation.createFromBitmap(mRenderScript, mOutBitmap,
        Allocation.MipmapControl.MIPMAP_NONE, Allocation.USAGE_SCRIPT);
ScriptC_test script = new ScriptC_test(mRenderScript, getResources(), R.raw.test);
...
script.forEach_rotate90CW(input, output);
output.copyTo(mOutBitmap);
Even when I set both allocations to the same size (a square bitmap) and just set the output to the input:
out[width * y + x] = in[width * y + x];
what I get is a bitmap with holes... How come?
The questions
Does this mean I can't do this kind of operation?
Does it mean that I can't use allocations of various sizes/dimensions?
Is it possible to overcome this issue (and still use RenderScript, of course)? If so, how?
Maybe I could add an array variable inside the RS side, and set the allocation to it, instead?
Why do I get holes in the bitmap, for the case of a square input&output?
EDIT: This is my current code:
RS
rs_allocation *in;

uchar4 __attribute__((kernel)) rotate90CW(uint32_t x, uint32_t y) {
    // XY. ..X ... ...
    // ...>..Y>...>Y..
    // ... ... .YX X..
    uchar4 curIn = rsGetElementAt_uchar4(in, 0, 0);
    return curIn; // just for testing...
}
Java:
mRenderScript = RenderScript.create(this);
mInBitmap = BitmapFactory.decodeResource(getResources(), R.drawable.sample_photo);
mOutBitmap = Bitmap.createBitmap(mInBitmap.getHeight(), mInBitmap.getWidth(), mInBitmap.getConfig());
final Allocation input = Allocation.createFromBitmap(mRenderScript, mInBitmap,
        Allocation.MipmapControl.MIPMAP_NONE, Allocation.USAGE_SCRIPT);
final Allocation output = Allocation.createFromBitmap(mRenderScript, mOutBitmap,
        Allocation.MipmapControl.MIPMAP_NONE, Allocation.USAGE_SCRIPT);
ScriptC_test script = new ScriptC_test(mRenderScript, getResources(), R.raw.test);
script.bind_in(input);
script.forEach_rotate90CW(output);
output.copyTo(mOutBitmap);
mImageView.setImageBitmap(mOutBitmap);
Here goes:
Does this mean I can't do this kind of operation?
No, not really. You just have to craft things correctly.
Does it mean that I can't use allocations of various sizes/dimensions?
No, but it does mean you can't use different size allocations in the way you currently are doing things. The default kernel in/out mechanism expects the input and output sizes to match so it can iterate over all of the elements correctly. If you need something different, it's up to you to manage it. More on that below.
Is it possible to overcome this issue... how?
The easiest solution would be to create an Allocation for the input and set it on the script instance as a global, rather than passing it as a kernel parameter. Then your RS would only need an output allocation (and your kernel would only take the output, x, and y). From there you can determine which coordinate within the input allocation you want and place it directly into the output location:
int inX = ...;
int inY = ...;
uchar4 curIn = rsGetElementAt_uchar4(inAlloc, inX, inY);
*out = curIn;
Why do I get holes in the bitmap, for the case of a square input&output?
It's because you cannot use the x and y parameters to offset into the input and output allocations. Those in/out parameters already point to the correct (same) location in both the input and the output, so the indexing you're doing is unnecessary and not really supported. Each time your kernel is called, it is called for one element location within the allocation. This is why the input and output sizes must be the same when they are provided as parameters.
This should solve your problem:
RS
rs_allocation in; // an rs_allocation global is set from Java with script.set_in(input), not bound

uchar4 __attribute__((kernel)) rotate90CW(uint32_t x, uint32_t y) {
    // Output (x, y) of a 90° clockwise rotation takes its pixel from
    // input (y, inHeight - 1 - x); the output is inHeight x inWidth.
    uint32_t inHeight = rsAllocationGetDimY(in);
    return rsGetElementAt_uchar4(in, y, inHeight - 1 - x);
}
I'm drawing some objects with OpenGL ES 2.0, and each vertex has several attributes which, as recommended for best performance, I'm storing interleaved in a single buffer. Some of these (like position) are represented with floats (GLES20.GL_FLOAT), but I'm also passing in a color, and it's wasteful of memory to pass the color attribute as four float components, so I'd like to use GLES20.GL_UNSIGNED_BYTE; four of these will pack into one 32-bit field. However, I can't figure out an easy way of storing these in ByteBuffers. E.g., using GL_FLOAT for the colors I'd do the following:
void addVertex(float xpos, float ypos, int color) {
    floatBuffer.put(xpos);
    floatBuffer.put(ypos);
    floatBuffer.put(Color.red(color) / 255f);
    floatBuffer.put(Color.green(color) / 255f);
    floatBuffer.put(Color.blue(color) / 255f);
    floatBuffer.put(Color.alpha(color) / 255f);
}
This works, but I'm using 128 bits to represent a color that could be represented in 32, which will also cause a performance hit. So what I'd like to do is create a ByteBuffer, use asFloatBuffer() to create an alias to it, and use whichever is appropriate for the type being stored. Like so:
void addVertices() {
    byteBuffer = ByteBuffer.allocateDirect(vertexCount * VERTEX_STRIDE).order(ByteOrder.nativeOrder());
    floatBuffer = byteBuffer.asFloatBuffer();
    for (Vertex v : vertices)
        addVertex(v.x, v.y, v.color);
}

void addVertex(float xpos, float ypos, int color) {
    floatBuffer.put(xpos);
    floatBuffer.put(ypos);
    byteBuffer.put((byte) Color.red(color));
    byteBuffer.put((byte) Color.green(color));
    byteBuffer.put((byte) Color.blue(color));
    byteBuffer.put((byte) Color.alpha(color));
}
However, this won't work, because asFloatBuffer() creates a FloatBuffer that shares the underlying storage of the ByteBuffer but maintains its own position index. I'd have to manually track the position and update both buffers before storing anything, which is ugly.
I could also split the attributes into separate buffers, but the performance penalty for that would probably outweigh the gain.
Any ideas on how to do this in an elegant, efficient manner?
ByteBuffer has additional methods that can be used to put other data types besides byte into the buffer in a machine-independent way. So instead of wrapping the ByteBuffer in a FloatBuffer, use the ByteBuffer directly, and use putFloat() to append the float values:
void addVertex(float xpos, float ypos, int color) {
    byteBuffer.putFloat(xpos);
    byteBuffer.putFloat(ypos);
    byteBuffer.put((byte) Color.red(color));
    byteBuffer.put((byte) Color.green(color));
    byteBuffer.put((byte) Color.blue(color));
    byteBuffer.put((byte) Color.alpha(color));
}
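On the GL side, the packed bytes can then be handed to the shader as a normalized unsigned-byte attribute. A sketch, where colorHandle and COLOR_OFFSET are hypothetical names for the attribute location and the color's byte offset within the vertex stride:
byteBuffer.position(COLOR_OFFSET);
GLES20.glVertexAttribPointer(colorHandle, 4, GLES20.GL_UNSIGNED_BYTE,
        true /* normalized: 0-255 maps to 0.0-1.0 */, VERTEX_STRIDE, byteBuffer);
GLES20.glEnableVertexAttribArray(colorHandle);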
You could either pack the color into one float and put that into the float buffer, or split the position into bytes and put them into the byte buffer.
As an example, here's how to do the former:
Assuming 0-255 ints, first pack them into one int:
int colorBits = a << 24 | b << 16 | g << 8 | r;
then make that into a float:
float floatColor = Float.intBitsToFloat(colorBits);
floatBuffer.put(floatColor);
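Wrapped up as a self-contained helper, a sketch assuming the android.graphics.Color accessors and a native-order (little-endian) buffer, so the bytes land in memory as r, g, b, a:
static float packColor(int color) {
    int colorBits = Color.alpha(color) << 24 | Color.blue(color) << 16
            | Color.green(color) << 8 | Color.red(color);
    // Caveat: a few bit patterns are NaNs and may not survive the float
    // round-trip; masking (e.g. colorBits & 0xfeffffff) is a common workaround.
    return Float.intBitsToFloat(colorBits);
}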
I am trying to access the raw data of a Bitmap in ARGB_8888 format on Android, using the copyPixelsToBuffer and copyPixelsFromBuffer methods. However, invoking those calls seems to always apply the alpha channel to the RGB channels. I need the raw data in a byte[] or similar (to pass through JNI; yes, I know about bitmap.h in Android 2.2, but I cannot use that).
Here is a sample:
// Create a 1x1 Bitmap with alpha channel, 8 bits per channel
Bitmap one = Bitmap.createBitmap(1, 1, Bitmap.Config.ARGB_8888);
one.setPixel(0, 0, 0xef234567);
Log.v("?", "hasAlpha() = " + Boolean.toString(one.hasAlpha()));
Log.v("?", "pixel before = " + Integer.toHexString(one.getPixel(0, 0)));

// Copy the Bitmap to a buffer
byte[] store = new byte[4];
ByteBuffer buffer = ByteBuffer.wrap(store);
one.copyPixelsToBuffer(buffer);

// Change the value of the pixel
int value = buffer.getInt(0);
Log.v("?", "value before = " + Integer.toHexString(value));
value = (value >> 8) | 0xffffff00;
buffer.putInt(0, value);
value = buffer.getInt(0);
Log.v("?", "value after = " + Integer.toHexString(value));

// Copy the buffer back to the Bitmap
buffer.position(0);
one.copyPixelsFromBuffer(buffer);
Log.v("?", "pixel after = " + Integer.toHexString(one.getPixel(0, 0)));
The log then shows
hasAlpha() = true
pixel before = ef234567
value before = 214161ef
value after = ffffff61
pixel after = 619e9e9e
I understand that the order of the ARGB channels is different; that's fine. But I don't want the alpha channel to be applied on every copy (which is what it seems to be doing).
Is this how copyPixelsToBuffer and copyPixelsFromBuffer are supposed to work? Is there any way to get the raw data in a byte[]?
Added in response to the answer below:
Putting buffer.order(ByteOrder.nativeOrder()); before the copyPixelsToBuffer call does change the result, but still not in the way I want it:
pixel before = ef234567
value before = ef614121
value after = ffffff41
pixel after = ff41ffff
It seems to suffer from essentially the same problem (alpha being applied on each copyPixelsFrom/ToBuffer).
One way to access the data in a Bitmap is to use the getPixels() method. Below you can find an example I used to get a grayscale image from ARGB data and then back from a byte array to a Bitmap (of course, if you need RGB, you reserve 3x the bytes and save them all):
/* Free-to-use licence by Sami Varjo (but nice if you retain this line) */
public final class BitmapConverter {

    private BitmapConverter() {}

    /**
     * Get grayscale data from an ARGB image into a byte array
     */
    public static byte[] ARGB2Gray(Bitmap img)
    {
        int width = img.getWidth();
        int height = img.getHeight();
        int[] pixels = new int[height * width];
        byte[] grayIm = new byte[height * width];

        img.getPixels(pixels, 0, width, 0, 0, width, height);

        int pixel = 0;
        int count = width * height;
        while (count-- > 0) {
            int inVal = pixels[pixel];
            // Get the pixel channel values from the int
            double r = (double) ((inVal & 0x00ff0000) >> 16);
            double g = (double) ((inVal & 0x0000ff00) >> 8);
            double b = (double) (inVal & 0x000000ff);
            grayIm[pixel++] = (byte) (0.2989 * r + 0.5870 * g + 0.1140 * b);
        }
        return grayIm;
    }

    /**
     * Create a grayscale bitmap from a byte array
     */
    public static Bitmap gray2ARGB(byte[] data, int width, int height)
    {
        int count = height * width;
        int[] outPix = new int[count];

        int pixel = 0;
        while (count-- > 0) {
            int val = data[pixel] & 0xff; // convert byte to unsigned
            outPix[pixel++] = 0xff000000 | val << 16 | val << 8 | val;
        }

        Bitmap out = Bitmap.createBitmap(outPix, 0, width, width, height, Bitmap.Config.ARGB_8888);
        return out;
    }
}
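A hypothetical round trip with the helpers above (srcBitmap is any ARGB_8888 bitmap):
byte[] gray = BitmapConverter.ARGB2Gray(srcBitmap);
Bitmap grayBitmap = BitmapConverter.gray2ARGB(gray, srcBitmap.getWidth(), srcBitmap.getHeight());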
My guess is that this might have to do with the byte order of the ByteBuffer you are using. ByteBuffer uses big-endian by default.
Set the endianness on the buffer with
buffer.order(ByteOrder.nativeOrder());
and see if it helps.
Moreover, copyPixelsFromBuffer/copyPixelsToBuffer do not change the pixel data in any way; the pixels are copied raw.
I realize this is very stale and probably won't help you now, but I came across it recently while trying to get copyPixelsFromBuffer to work in my app. (Thank you for asking this question, by the way! You saved me tons of time in debugging.) I'm adding this answer in the hope that it helps others like me going forward...
Although I haven't used it yet to ensure that it works, it looks like, as of API level 19, we'll finally have a way to specify not to "apply the alpha" (a.k.a. premultiply) within Bitmap: the setPremultiplied(boolean) method should help in situations like this going forward by allowing us to specify false.
I hope this helps!
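Applied to the sample from the question, the idea would be (an untested sketch, API 19+):
Bitmap one = Bitmap.createBitmap(1, 1, Bitmap.Config.ARGB_8888);
one.setPremultiplied(false);  // treat the pixel data as un-premultiplied
one.setPixel(0, 0, 0xef234567);
ByteBuffer buffer = ByteBuffer.wrap(new byte[4]).order(ByteOrder.nativeOrder());
one.copyPixelsToBuffer(buffer);  // bytes should now come back without alpha applied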
This is an old question, but I ran into the same issue and just figured out that the bitmap bytes are pre-multiplied. As of API 19 you can set the bitmap not to pre-multiply the buffer, but the API makes no guarantee.
From the docs:
public final void setPremultiplied(boolean premultiplied)
Sets whether the bitmap should treat its data as pre-multiplied.
Bitmaps are always treated as pre-multiplied by the view system and Canvas for performance reasons. Storing un-pre-multiplied data in a Bitmap (through setPixel, setPixels, or BitmapFactory.Options.inPremultiplied) can lead to incorrect blending if drawn by the framework.
This method will not affect the behaviour of a bitmap without an alpha channel, or if hasAlpha() returns false.
Calling createBitmap or createScaledBitmap with a source Bitmap whose colors are not pre-multiplied may result in a RuntimeException, since those functions require drawing the source, which is not supported for un-pre-multiplied Bitmaps.