Working with output allocations of indeterminate size in RenderScript - Android

I'm trying to wrap my head around the most efficient way to deal with arrays of indeterminate size as outputs of RS kernels. I would send the index of the last relevant array slot in the out allocation, but as I learned in the answer to my previous question, there's not a good way to pass a global back to Java after kernel execution. I've decided to "zoom out" the process again, which led me to the pattern below.
For example, let's say we have an input allocation containing a struct (or structs) that contains two arrays of polar coordinates; something like set_pair below:
typedef struct polar_tag {
    uint8_t angle;
    uint32_t mag;
} polar;

typedef struct polar_set_tag {
    uint8_t filled_slots;
    polar coordinates[60];
} polar_set;

typedef struct set_pair_tag {
    polar_set probe_set;
    polar_set candidate_set;
} set_pair;
We want to find similar coordinate pairs between the sets, so we set up a kernel to decide which (if any) of the polar coordinates are similar. If they are, we load the pair into an output allocation that looks something like matching_set:
typedef struct matching_pair_tag {
    uint8_t probe_index;
    uint8_t candidate_index;
} matching_pair;

typedef struct matching_set_tag {
    matching_pair pairs[120];
    uint8_t filled_slots;
} matching_set;
Is keeping a counter like filled_slots in the allocation the most efficient (or only) way to handle this sort of indeterminate I/O with RS, or is there a better way?

I think the way I would try to approach this is to do two passes.
For the 0-2 case:
Setup: for each coordinate, allocate an array to hold the max expected number of pairs (2).
Pass 1: run over coords, look for pairs by comparing the current item to a subset of other coords. Choose subset to avoid duplicate answers when the kernel runs on the other coord being compared.
Pass 2: Merge the results from #1 back into a list or whatever other data structure you want. Could run as an invokable if the number of coordinates is small.
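Sketched in plain C (the similarity test, array sizes, and function names are placeholder assumptions, not the answer's actual kernels), the two passes for the 0-2 case might look like:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical similarity test: angles within 2 units of each other. */
static int similar(uint8_t a, uint8_t b) {
    return abs((int)a - (int)b) <= 2;
}

/* Pass 1: for each probe coord, record up to 2 matching candidate
 * indices in a fixed-size per-coordinate array; -1 marks an empty slot,
 * so no filled-slot counter is needed. */
static void pass1(const uint8_t *probe, int np,
                  const uint8_t *cand, int nc,
                  int out[][2]) {
    for (int i = 0; i < np; i++) {
        out[i][0] = out[i][1] = -1;
        int used = 0;
        for (int j = 0; j < nc && used < 2; j++)
            if (similar(probe[i], cand[j]))
                out[i][used++] = j;
    }
}

/* Pass 2: compact the per-coordinate arrays into one (probe, candidate)
 * pair list; returns the number of pairs written. */
static int pass2(int in[][2], int np, int pairs[][2]) {
    int n = 0;
    for (int i = 0; i < np; i++)
        for (int s = 0; s < 2; s++)
            if (in[i][s] >= 0) {
                pairs[n][0] = i;
                pairs[n][1] = in[i][s];
                n++;
            }
    return n;
}
```

In RS, pass 1 would be the kernel (one cell per coordinate) and pass 2 the invokable merge.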
For the 0-N case:
This gets a lot harder. I'd likely do something similar to what's above, but with the per-coord array sized for a typical number of pairs. For the (hopefully small) number of overflows, use atomics to reserve a slot in an overflow buffer. The catch is that I think most GPU drivers would not be very happy with the atomics today; it would run very well on the CPU reference implementation, though.
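The overflow-buffer reservation for the 0-N case can be modeled in C11 with stdatomic; in RS, rsAtomicInc() on a script global plays this role (the buffer size here is an arbitrary assumption):

```c
#include <assert.h>
#include <stdatomic.h>

#define OVERFLOW_CAP 4

static int overflow[OVERFLOW_CAP];
static atomic_int next_slot;

/* Atomically reserve a slot in the overflow buffer; returns the slot
 * index, or -1 if the buffer itself is full. The atomic increment is
 * what makes this safe when many kernel cells overflow concurrently. */
static int reserve_slot(int value) {
    int slot = atomic_fetch_add(&next_slot, 1);
    if (slot >= OVERFLOW_CAP)
        return -1; /* overflow buffer exhausted; caller must cope */
    overflow[slot] = value;
    return slot;
}
```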
There are a lot of ways to go about this. One important decision point revolves around how expensive the comparison to find the points is versus the cost of writing the result.

Related

Recommended approach to compute over arbitrary sized 3D volume

To frame my question:
I'm writing a custom convolution (for a CNN) where an arbitrarily sized HxWxD input volume is convolved with an FxFxD filter. D could be 3 or 4, but also much more. I'm new to RenderScript and currently investigating approaches with the goal of maybe creating a framework which can be used in the future, so I don't want to end up using the API in a way that may be deprecated soon. I'm targeting API 23 right now, but might need to move back to 18-19 at some point; this is up for discussion.
It appears that if I define a 3D Allocation and use float as type for in-parameter in kernel, the kernel visits every element, also along the Z-axis. Like this:
The kernel:
void __attribute__((kernel)) convolve(float in, uint32_t x, uint32_t y, uint32_t z) {
    rsDebug("x y z: ", x, y, z);
}
Java:
Allocation in;
Type.Builder tb = new Type.Builder(mRS, Element.F32(mRS));
Type in_type = tb.setX(W).setY(H).setZ(D).create();
in = Allocation.createTyped(mRS, in_type);
//...
mKonvoScript.forEach_convolve(in);
With W=H=5 and D=3 there are 75 floats in the 3D volume. Running the program prints 75 outputs:
x y z: {0.000000, 0.000000, 0.000000}
x y z: {1.000000, 0.000000, 0.000000}
...
x y z: {0.000000, 0.000000, 1.000000}
x y z: {1.000000, 0.000000, 1.000000}
...
etc.
The pattern repeats 3x25 times.
On the other hand, the reference is unclear about the z-coordinate, and the answer at "renderscript: accessing 'z' coordinate" states that z-coordinate parameters are not supported.
Also I will need to bind the filter to an rs_allocation variable inside the kernel. Right now I have:
Kernel:
rs_allocation gFilter;
//...
float f = rsGetElementAt_float(gFilter, 1,2,3);
Java:
Allocation filter;
Type filter_type = tb.setX(F).setY(F).setZ(D).create();
filter = Allocation.createTyped(mRS, filter_type);
This seems to work well (no compile or runtime errors). BUT there is an SE entry somewhere from 2014 which states that from version 20 onward we can only bind 1D allocations, which contradicts my results.
There is a lot of contradictory and outdated information out there, so I'm hoping someone on the inside could comment on this, and recommend an approach both from a sustainability and optimality perspective.
(1) Should I go ahead and use the passed xyz coordinates to compute the convolution with the bound 3D allocation? Or will this approach become deprecated at some point?
(2) There are other ways to do this; for example, I can reshape all allocations into 1D, pass them into the kernel, and use index arithmetic. This would also allow for placing certain values close to each other. Another approach might be to subdivide the input 3D volumes into blocks of depth 4 and use float4 as the in type. Assuming (1) is OK to use, is there, from an optimization perspective, a disadvantage in using (1) as opposed to the other approaches?
(3) In general, is there a desirable memory layout formulation, for example to reformulate a problem into float3 or float4 depths, for optimality reasons, as opposed to a "straightforward" approach like (1)?
1) z is supported now as a coordinate that you can query, so my older answer is outdated. It is also why your example code above doesn't generate a compiler error (assuming you are targeting a relatively modern API level).
2) Stop using bind(), even for 1D things (that is the only supported case we have now, but even that isn't a great technique). You can use rs_allocation as a global variable in your .rs file and call the generated set_*() method from Java to get equivalent access to these global Allocations. Then you use rsGetElementAt_*() and rsSetElementAt_*() of the appropriate types to read/write directly in the .rs file.
3) Doing memory layout optimizations like this can be beneficial for some devices and worse on others. If you can use the regular x/y/z APIs, those give the implementation the best opportunity to lay things out efficiently.
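If one does go the flattened-1D route mentioned in question (2), the index arithmetic for a W x H x D volume with x varying fastest (matching the visit order in the question's debug output) might look like this; the layout itself is an assumption, not the only option:

```c
#include <assert.h>
#include <stdint.h>

/* Flattened index: x varies fastest, then y, then z, so the volume is
 * stored as D consecutive W*H planes. */
static uint32_t flat_index(uint32_t x, uint32_t y, uint32_t z,
                           uint32_t w, uint32_t h) {
    return x + w * (y + h * z);
}

/* Recover (x, y, z) from a flat index. */
static void unflatten(uint32_t i, uint32_t w, uint32_t h,
                      uint32_t *x, uint32_t *y, uint32_t *z) {
    *x = i % w;
    *y = (i / w) % h;
    *z = i / (w * h);
}
```

With W=H=5 and D=3 this gives indices 0..74, matching the 75 cells the launch visits.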

What are the available kernel functions that you can create in RenderScript?

Background
I'm learning how to use RenderScript, and I found this part in the docs:
In most respects, this is identical to a standard C function. The
first notable feature is the __attribute__((kernel)) applied to the
function prototype.
and they show a sample code of a kernel function:
uchar4 __attribute__((kernel)) invert(uchar4 in, uint32_t x, uint32_t y) {
    uchar4 out = in;
    out.r = 255 - in.r;
    out.g = 255 - in.g;
    out.b = 255 - in.b;
    return out;
}
The problem
It seems that some samples show that the parameters of kernel functions can be different, and not only those that appear above.
Example:
uchar4 __attribute__((kernel)) grayscale(uchar4 v_in) {
    float4 f4 = rsUnpackColor8888(v_in);
    float3 mono = dot(f4.rgb, gMonoMult);
    return rsPackColorTo8888(mono);
}
Thing is, the generated function in Java is still the same for all of those functions:
void forEach_FUNCTIONNAME(Allocation ain, Allocation aout)
where FUNCTIONNAME is the name of the function on RS.
So I assume that not every possible function can be a kernel function, and all of them need to follow some rules (besides the __attribute__((kernel)) part, which needs to be added).
Yet I can't find those rules.
The only things I found are in the docs:
A kernel may have an input Allocation, an output Allocation, or both.
A kernel may not have more than one input or one output Allocation. If
more than one input or output is required, those objects should be
bound to rs_allocation script globals and accessed from a kernel or
invokable function via rsGetElementAt_type() or rsSetElementAt_type().
A kernel may access the coordinates of the current execution using the
x, y, and z arguments. These arguments are optional, but the type of
the coordinate arguments must be uint32_t.
The questions
What are the rules for creating kernel functions, besides what's written?
Which other parameters are allowed? Is it only those two "templates" of functions that I can use, or can I use kernel functions that have other sets of parameters?
Is there a list of valid kernel functions? One that shows which parameters sets are allowed?
Is it possible for me to customize those kernel functions to have more parameters? For example, if I made my own blurring function (I know we have a built-in one), I could set the radius and the blurring algorithm.
Basically, all of those questions are about the same thing.
There really aren't that many rules. You have to have either an input and/or an output, because kernels are executed over the range present there (e.g. if you have a 2-D Allocation with x=200, y=400, it will execute on each cell of the input/output). We do support an Allocation-less launch, but it is only available in the latest Android release, and thus not usable on most devices. We also support multi-input as of Android M, but earlier target APIs won't build with that (unless you are using the compatibility library).
Parameters are usually primitive types (char, int, unsigned int, long, float, double, ...) or vector types (e.g. float4, int2, ...). You can also use structures, provided that they don't contain pointers in their definition. You cannot use pointer types unless you are using the legacy kernel API, but even then, you are limited to a single pointer to a non-pointer piece of data. https://android.googlesource.com/platform/cts/+/master/tests/tests/renderscript/src/android/renderscript/cts/kernel_all.rs has a lot of simple kernels that we use for trivial testing. It shows how to combine most of the types.
You can optionally include the rs_kernel_context parameter (which lets you look up information about the size of the launch). You can also optionally pass x, y, and/or z (with uint32_t type each) to get the actual indices on which the current execution is happening. Each x/y/z coordinate will be unique for a single launch, letting you know what cell is being operated on.
For your question 4, you can't use a radius the way that you want to. It would have to be a global variable as input, since our only kernel inputs traditionally vary as you go from cell to cell of the input/output Allocations. You can look at https://android.googlesource.com/platform/cts/+/master/tests/tests/renderscript/src/android/renderscript/cts/intrinsic_blur.rs for an example about blur specifically.
Just some key points I was struggling with when I started to learn RS. Basically the yellow texts above contain all the RS wisdom, but in a "too compact" way to understand. In order to answer your questions 1 and 2, you have to differentiate between two types of allocations. The first type I call the "formal" allocations. In the kernel expression
uchar4 __attribute__((kernel)) invert(uchar4 in, uint32_t x, uint32_t y) {
these are the input allocation in (of type uchar4, i.e. a vector of four 8-bit unsigned integers) and the output allocation, which is also uchar4 - this is the type you can see on the left-hand side of the kernel expression. The output is what will be given back via return, same as in Java functions. You need at least one formal allocation (i.e. one input OR one output OR both of them).
The other type of allocation I call "side" allocations. These are what you handle via script globals, and they can likewise be input or output allocations. If you use them as input, you fill them from the Java side via copyFrom(); if you use them as output, you read them back on the Java side via copyTo().
Now, the point is that although you need at least one formal allocation, there is no qualitative difference between the formal and the side allocations; the only thing you need to take care of is that you use at least one formal allocation.
All allocations in the kernel (whether "formal" or "side") have the same dimensions in terms of width and height.
Question 3 is implicitly answered by 1 and 2. A kernel can have:
1. only a formal input allocation,
2. only a formal output allocation, or
3. both a formal input and a formal output allocation.
Each of 1.-3. can have any number of additional "side" allocations.
Question 4: Yes. In your Gauss example, if you want to pass the radius of the blur (e.g. 1-100) or the blurring algorithm (e.g. types 1, 2 and 3), you would simply use one global variable for each of these, so that they can be applied within the kernel. Here I would not speak of an "allocation" in the above sense, since allocations are always of the same dimension as the grid spanned by the kernel (typically x width times y height). Nevertheless, you still need to pass these parameters from Java via the generated set_xxx(yyy) methods.
Hope this helps a bit.

What is the meaning of "0x48151642" in malloc_debug_leak.cpp?

Recently I've been reading the libc init code of Android. In malloc_debug_leak.cpp, at lines 70 and 263, I saw the following:
#define GUARD 0x48151642
static uint32_t MEMALIGN_GUARD = 0xA1A41520;
I know what GUARD and MEMALIGN_GUARD are used for, but I really don't get the meaning of the values. For example, would static uint32_t MEMALIGN_GUARD = 0x0001 be OK, or any other value? Does 0xA1A41520 carry some useful info?
I really don't get the meaning of the value
It's a magic value, intended to catch common programming mistakes. See the Wikipedia article on magic numbers for a detailed explanation.
Would 0x0001 be OK?
No. It lacks the "distinctive unique value that is unlikely to be mistaken for other meanings" property.
When you see the value 0x1 in a certain memory location, that value could very likely have been placed there by lots of different code sequences. On the other hand, when you see 0xA1A41520, it is quite unlikely (though still possible) that the value was placed there by code other than the code using MEMALIGN_GUARD.
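A minimal sketch of the idea (simplified, not bionic's actual implementation): plant the magic word right after an allocation, then check it later to detect an overwrite.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define GUARD 0x48151642u

/* Allocate `size` bytes with a guard word appended after them. */
static void *guarded_alloc(size_t size) {
    char *p = malloc(size + sizeof(uint32_t));
    uint32_t guard = GUARD;
    memcpy(p + size, &guard, sizeof guard);
    return p;
}

/* Returns nonzero if the guard word is intact (no overflow detected).
 * A distinctive value like 0x48151642 makes a false "intact" verdict
 * after corruption extremely unlikely; 0x1 would not. */
static int guard_intact(const void *p, size_t size) {
    uint32_t guard;
    memcpy(&guard, (const char *)p + size, sizeof guard);
    return guard == GUARD;
}
```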

Android: does short really take 2 bytes?

I am trying to decide how to design my app.
I have about 300 instances of class just like this:
public class ParamValue {
    protected String sValue = null;
    protected short shValue = 0;
    protected short mode = PARAM_VALUE_MODE_UNKNOWN;
    /*
     * ...
     */
}
I have an array of these instances. I can't figure out whether these shorts really take 2 bytes each, or whether they take 4 bytes anyway.
And I need to pass the list of these objects via AIDL as a List&lt;Parcelable&gt;. Parcel has no readShort() or writeShort(); it can only work with int. So, to use short here too, I have to manually pack my two shorts into one int, parcel it, and then unpack them. That looks too obtrusive.
Could you please tell me how many bytes shorts take, and does it make sense to use short instead of int here?
UPDATE:
I updated my question for future readers.
So, I wrote a test app and figured out that in my case there's absolutely no reason to use short, because it takes the same space as int. But if I define an array of shorts like this:
protected short[] myValues = new short[2];
then it takes less space than an array of ints:
protected int[] myValues = new int[2];
Technically, in the Java language, a short is 2 bytes. Inside the JVM, though, short is a storage type, not a full-fledged primitive data type like int, float, or double. JVM registers always hold 4 bytes at a time; there are no half-word or byte registers. Whether the JVM ever actually stores a short in two bytes inside an object, or whether it is always stored as 4 bytes, is really up to the implementation.
This all holds for a "real" JVM. Does Dalvik do things differently? Dunno.
According to the Java Virtual Machine Specification, Sec. 2.4.1, a short is always exactly two bytes.
The Java Native Interface allows direct access from native code to arrays of primitives stored in the VM. A similar thing can happen in Android's JNI. This pretty much guarantees that a Java short[] will be an array of 2-byte values in either a JVM-compliant environment or in a Dalvik virtual machine.
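The pack-two-shorts-into-an-int workaround the question mentions is just masking and shifting; sketched here in C (the same expressions carry over to Java, where the cast to short plays the role of the int16_t cast). Masking the low half with 0xFFFF matters: without it, a negative low short would smear its sign bits over the high half.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two 16-bit values into one 32-bit int. Unsigned intermediate
 * arithmetic avoids shifting a negative value. */
static int32_t pack_shorts(int16_t hi, int16_t lo) {
    return (int32_t)(((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo);
}

/* Unpack the halves; casting back to int16_t restores the sign. */
static int16_t unpack_hi(int32_t packed) {
    return (int16_t)(((uint32_t)packed >> 16) & 0xFFFFu);
}

static int16_t unpack_lo(int32_t packed) {
    return (int16_t)((uint32_t)packed & 0xFFFFu);
}
```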

Is the binary representation of native types guaranteed the same on all targets?

I'm planning to store my data in a binary format as a resource, read it into an int buffer, and basically pass it straight down to a native C++ function, which might cast it to a struct/class and work with it. No pointers, obviously, just ints and floats.
The question is - what kind of fixing up do I need to do? I suppose that I need to check ByteOrder.nativeOrder(), figure out if it's big endian or little endian, and perform byte-swapping if need be.
Other than that, floats are presumably guaranteed to be expected in IEEE 754 format? Are there any other caveats I'm completely overlooking here?
(Also - since I'm compiling using the NDK, I know what architecture it is already (ARMv7-A, in my case), so can I technically skip the endian shenanigans and just take the data the way it is?)
ARM supports both big and little endian. This will most probably be set by the OS, so it might be worthwhile checking beforehand.
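A hedged sketch of that runtime check, plus a 32-bit byte swap for the case where the file's endianness and the target's don't match:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Detect host endianness at runtime by inspecting the first byte of a
 * known 32-bit value: little-endian stores the low byte first. */
static int host_is_little_endian(void) {
    uint32_t probe = 1;
    uint8_t first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Reverse the byte order of a 32-bit word. */
static uint32_t bswap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u) | (v << 24);
}
```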
There is also the issue of padding to word size in a struct:
struct st
{
    char a;
    int b;
};
will have a sizeof of 8 and not the expected 5 bytes. This is so that the int will be word aligned. Generally, align everything to 4 bytes, and consider using GCC's packed attribute (struct __attribute__((__packed__)) my_packed_struct) as well. This will ensure that the internals of the struct are laid out as you expect.
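The padding claim can be checked directly; with GCC/Clang the packed variant drops to 5 bytes (the packed attribute is compiler-specific, so treat this as a sketch for those compilers):

```c
#include <assert.h>
#include <stddef.h>

struct st {        /* natural layout: 3 padding bytes after `a` */
    char a;        /* offset 0 */
    int b;         /* offset 4, aligned to the int's word size */
};

struct __attribute__((__packed__)) st_packed {
    char a;        /* offset 0 */
    int b;         /* offset 1, unaligned */
};
```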
Alternatively, use the Android emulator to generate the data file for you.
