To frame my question:
I'm writing a custom convolution (for a CNN) where an arbitrarily sized HxWxD input volume is convolved with an FxFxD filter. D could be 3 or 4 but also much more. I'm new to RenderScript and currently investigating approaches, with the goal of maybe creating a framework that can be reused later, so I don't want to end up using the API in a way that may be deprecated soon. I'm targeting API 23 right now but might need to move back to 18-19 at some point; this is up for discussion.
It appears that if I define a 3D Allocation and use float as the type of the kernel's in-parameter, the kernel visits every element, including along the Z-axis. Like this:
The kernel:
void __attribute__((kernel)) convolve(float in, uint32_t x, uint32_t y, uint32_t z){
rsDebug("x y z: ", x, y, z);
}
Java:
Allocation in;
Type.Builder tb = new Type.Builder(mRS, Element.F32(mRS));
Type in_type = tb.setX(W).setY(H).setZ(D).create();
in = Allocation.createTyped(mRS, in_type);
//...
mKonvoScript.forEach_convolve(in);
With W=H=5 and D=3 there are 75 floats in the 3D volume. Running the program prints 75 outputs:
x y: {0.000000, 0.000000, 0.000000}
x y: {1.000000, 0.000000, 0.000000}
...
x y: {0.000000, 0.000000, 1.000000}
x y: {1.000000, 0.000000, 1.000000}
...
etc.
The 25-element x/y pattern repeats 3 times, once per z slice.
On the other hand, the reference is unclear about the z-coordinate, and the answer at "renderscript: accessing 'z' coordinate" states that z-coordinate parameters are not supported.
Also I will need to bind the filter to an rs_allocation variable inside the kernel. Right now I have:
Kernel:
rs_allocation gFilter;
//...
float f = rsGetElementAt_float(gFilter, 1,2,3);
Java:
Allocation filter;
Type filter_type = tb.setX(F).setY(F).setZ(D).create();
filter = Allocation.createTyped(mRS, filter_type);
This seems to work well (no compile or runtime errors). But there is an SE entry somewhere from 2014 which states that from version 20 onward we can only bind 1D allocations, which contradicts my results.
There is a lot of contradictory and outdated information out there, so I'm hoping someone on the inside could comment on this, and recommend an approach both from a sustainability and optimality perspective.
(1) Should I go ahead and use the passed xyz coordinates to compute the convolution with the bound 3D allocation? Or will this approach become deprecated at some point?
(2) There are other ways to do this. For example, I can reshape all allocations into 1D, pass them into the kernel and use index arithmetic. This would also allow for placing certain values close to each other. Another approach might be to subdivide the input 3D volumes into blocks of depth 4 and use float4 as the in type. Assuming (1) is OK to use, from an optimization perspective, is there a disadvantage in using (1) as opposed to the other approaches?
(3) In general, is there a desirable memory layout formulation, for example to reformulate a problem into float3 or float4 depths, for optimality reasons, as opposed to a "straightforward" approach like (1)?
1) z is supported now as a coordinate that you can query, so my older answer is outdated. It is also why your example code above doesn't generate a compiler error (assuming you are targeting a relatively modern API level).
2) Stop using bind(), even for 1D things (that is the only case we still support, and even that isn't a great technique). You can use rs_allocation as a global variable in your .rs file, and the generated set_*() methods from Java to get equivalent access to these global Allocations. Then you use rsGetElementAt_*() and rsSetElementAt_*() of the appropriate types to read/write directly in the .rs file.
3) Doing memory layout optimizations like this can be beneficial for some devices and worse on others. If you can use the regular x/y/z APIs, those give the implementation the best opportunity to lay things out efficiently.
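To make 1) and 2) concrete, here is a minimal sketch of what the .rs side could look like for the convolution in the question, assuming API 23; the globals gIn, gFilter and gF, the package name, and the choice of a 2D output allocation are my assumptions, not the poster's actual code:
#pragma version(1)
#pragma rs java_package_name(com.example.rs)   // hypothetical package name

rs_allocation gIn;      // input volume, W x H x D floats, set from Java via set_gIn()
rs_allocation gFilter;  // filter, F x F x D floats, set from Java via set_gFilter()
int gF;                 // filter width/height F, set from Java via set_gF()

// Launched over a 2D float output allocation of size (W-F+1) x (H-F+1).
float __attribute__((kernel)) convolve(uint32_t x, uint32_t y) {
    uint32_t depth = rsAllocationGetDimZ(gFilter);
    float acc = 0.0f;
    for (uint32_t dz = 0; dz < depth; dz++) {
        for (uint32_t dy = 0; dy < (uint32_t)gF; dy++) {
            for (uint32_t dx = 0; dx < (uint32_t)gF; dx++) {
                acc += rsGetElementAt_float(gIn, x + dx, y + dy, dz)
                     * rsGetElementAt_float(gFilter, dx, dy, dz);
            }
        }
    }
    return acc;
}
On the Java side, the equivalent of the old bind() is simply mScript.set_gIn(in) and mScript.set_gFilter(filter), followed by forEach_convolve(out) over the output allocation.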
Background
I'm learning how to use Renderscript, and I found this part in the docs:
In most respects, this is identical to a standard C function. The first notable feature is the __attribute__((kernel)) applied to the function prototype.
and they show a sample code of a kernel function:
uchar4 __attribute__((kernel)) invert(uchar4 in, uint32_t x, uint32_t y) {
uchar4 out = in;
out.r = 255 - in.r;
out.g = 255 - in.g;
out.b = 255 - in.b;
return out;
}
The problem
It seems that some samples show kernel functions whose parameters differ from the ones that appear above.
Example:
const static float3 gMonoMult = {0.299f, 0.587f, 0.114f}; // luminance weights declared elsewhere in this sample
uchar4 __attribute__((kernel)) grayscale(uchar4 v_in) {
float4 f4 = rsUnpackColor8888(v_in);
float3 mono = dot(f4.rgb, gMonoMult);
return rsPackColorTo8888(mono);
}
The thing is, the generated function on the Java side is still the same for all of those functions:
void forEach_FUNCTIONNAME(Allocation ain, Allocation aout)
where FUNCTIONNAME is the name of the function on RS.
So I assume that not every possible function can be a kernel function, and all of them need to follow some rules (besides the "__attribute__((kernel))" part, which needs to be added).
Yet I can't find those rules.
The only things I found are in the docs:
A kernel may have an input Allocation, an output Allocation, or both.
A kernel may not have more than one input or one output Allocation. If
more than one input or output is required, those objects should be
bound to rs_allocation script globals and accessed from a kernel or
invokable function via rsGetElementAt_type() or rsSetElementAt_type().
A kernel may access the coordinates of the current execution using the
x, y, and z arguments. These arguments are optional, but the type of
the coordinate arguments must be uint32_t.
The questions
What are the rules for creating kernel functions, besides what's written?
Which other parameters are allowed? I mean, what other parameters can I pass? Am I limited to those 2 "templates" of functions, or can I write kernel functions with other sets of parameters?
Is there a list of valid kernel functions? One that shows which parameters sets are allowed?
Is it possible for me to customize those kernel functions to take more parameters? For example, if I made my own blurring function (I know there is a built-in one), could I set the radius and the blurring algorithm?
Basically, all of those questions are about the same thing.
There really aren't that many rules. You have to have an input and/or an output, because kernels are executed over the range present there (i.e. if you have a 2-D Allocation with x=200, y=400, it will execute on each cell of the input/output). We do support an Allocation-less launch, but it is only available in the latest Android release, and thus not usable on most devices. We also support multi-input as of Android M, but earlier target APIs won't build with that (unless you are using the compatibility library).
Parameters are usually primitive types (char, int, unsigned int, long, float, double, ...) or vector types (e.g. float4, int2, ...). You can also use structures, provided that they don't contain pointers in their definition. You cannot use pointer types unless you are using the legacy kernel API, but even then, you are limited to a single pointer to a non-pointer piece of data. https://android.googlesource.com/platform/cts/+/master/tests/tests/renderscript/src/android/renderscript/cts/kernel_all.rs has a lot of simple kernels that we use for trivial testing. It shows how to combine most of the types.
You can optionally include the rs_kernel_context parameter (which lets you look up information about the size of the launch). You can also optionally pass x, y, and/or z (with uint32_t type each) to get the actual indices on which the current execution is happening. Each x/y/z coordinate will be unique for a single launch, letting you know what cell is being operated on.
For your question 4, you can't use a radius the way that you want to. It would have to be a global variable as input, since our only kernel inputs traditionally vary as you go from cell to cell of the input/output Allocations. You can look at https://android.googlesource.com/platform/cts/+/master/tests/tests/renderscript/src/android/renderscript/cts/intrinsic_blur.rs for an example about blur specifically.
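To illustrate the rules above, here are a few kernel shapes that should be accepted; all names are invented, and the rs_kernel_context variant assumes API 23:
#pragma version(1)
#pragma rs java_package_name(com.example.rs)   // hypothetical package name

// Output only, with coordinates: forEach_ramp(aout) fills every cell of aout.
float __attribute__((kernel)) ramp(uint32_t x, uint32_t y) {
    return (float)(x + y);
}

// Input and output with vector element types, no coordinates needed.
uchar4 __attribute__((kernel)) swap_rb(uchar4 in) {
    uchar4 out = in;
    out.r = in.b;
    out.b = in.r;
    return out;
}

// Input, output, the optional launch context, and an x coordinate.
float __attribute__((kernel)) scale_by_width(float in, rs_kernel_context ctx, uint32_t x) {
    return in / (float)rsGetDimX(ctx);   // divide by the launch size along x
}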
Just some key points I was struggling with when I started to learn RS. Basically, the quoted (yellow) texts above contain all the RS wisdom, but in a form too compact to digest easily. In order to answer your questions 1 and 2, you have to differentiate between two types of allocations. The first type I call the "formal" allocations. In the kernel expression
uchar4 __attribute__((kernel)) invert(uchar4 in, uint32_t x, uint32_t y) {
these are the input allocation in (of type uchar4, i.e. a vector of four 8-bit unsigned integers) and the output allocation, which is also uchar4 - this is the type you can see on the left-hand side of the kernel expression. The output is what will be given back via "return", same as in Java functions. You need at least one formal allocation (i.e. one input OR one output OR both of them).
The other type of allocations I call "side" allocations. These are handled via script globals, and they can likewise act as inputs or outputs. If you use them as input, you fill them from the Java side via copyFrom(). If you use them as output, you read the result back on the Java side via copyTo().
Now, the point is that there is no qualitative difference between the formal and the side allocations; the only thing you need to take care of is that you use at least one formal allocation.
The formal input and output allocations have the same dimensions in terms of width and height; "side" allocations are not tied to the launch dimensions and can have any size.
Question 3 is implicitly answered by 1 and 2. A kernel can use:
1) only a formal input allocation,
2) only a formal output allocation, or
3) both a formal input and a formal output allocation.
Cases 1)-3) can each have any number of additional "side" allocations.
Question 4: Yes. In your Gaussian blur example, if you want to pass the radius of the blur (e.g. 1-100) or the blurring algorithm (e.g. types 1, 2 and 3), you would simply use one global variable for each of these, so that they can be applied within the kernel. Here I would not speak of an "allocation" in the above sense, since allocations span a grid (typically x width times y height), whereas these are just single values. Nevertheless you still need to pass these parameters from Java via the generated set_xxx(yyy) methods.
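As a sketch of the formal/side split plus such a scalar parameter (the names, the gain/clamp body, and the package name are all invented, just to show the plumbing):
#pragma version(1)
#pragma rs java_package_name(com.example.rs)   // hypothetical package name

int gRadius = 1;          // scalar parameter, set from Java via script.set_gRadius(...)
rs_allocation gLookup;    // side input, filled on the Java side via Allocation.copyFrom()
rs_allocation gStats;     // side output, read back on the Java side via Allocation.copyTo()

// Formal input: in; formal output: the returned uchar4.
uchar4 __attribute__((kernel)) adjust(uchar4 in, uint32_t x, uint32_t y) {
    float gain = rsGetElementAt_float(gLookup, x, y);     // read the side input
    rsSetElementAt_float(gStats, gain * gRadius, x, y);   // write the side output
    uchar4 out = in;
    out.r = (uchar)clamp(in.r * gain, 0.0f, 255.0f);
    return out;
}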
Hope this helps a bit.
I'm trying to wrap my head around the most efficient way to deal with arrays of indeterminate size as outputs of RS kernels. I would send the index of the last relevant array slot in the out allocation, but as I learned in the answer to my previous question, there's not a good way to pass a global back to Java after kernel execution. I've decided to "zoom out" the process again, which led me to the pattern below.
For example, let's say we have an input allocation containing a struct (or structs) that contains two arrays of polar coordinates; something like set_pair below:
typedef struct polar_tag{
uint8_t angle;
uint32_t mag;
} polar;
typedef struct polar_set_tag{
uint8_t filled_slots;
polar coordinates[60];
} polar_set;
typedef struct set_pair_tag{
polar_set probe_set;
polar_set candidate_set;
} set_pair;
We want to find similar coordinate pairs between the sets, so we set up a kernel to decide which (if any) of the polar coordinates are similar. If they are, we load the pair into an output allocation that looks something like matching_set:
typedef struct matching_pair_tag{
uint8_t probe_index;
uint8_t candidate_index;
} matching_pair;
typedef struct matching_set_tag{
matching_pair pairs[120];
uint8_t filled_slots;
} matching_set;
Is creating allocations with bookkeeping fields like "filled_slots" the most efficient (or only) way to handle this sort of indeterminate I/O with RS, or is there a better way?
I think the way I would try to approach this is to do two passes.
For the 0-2 case:
Setup: for each coordinate, allocate an array to hold the max expected number of pairs (2).
Pass 1: run over coords, look for pairs by comparing the current item to a subset of other coords. Choose subset to avoid duplicate answers when the kernel runs on the other coord being compared.
Pass 2: Merge the results from #1 back into a list or whatever other data structure you want. Could run as an invokable if the number of coordinates is small.
For the 0-N case:
This gets a lot harder. I'd likely do something similar to what's above, but with the per-coordinate array sized for a typical number of pairs. For the (hopefully small) number of overflows, use atomics to reserve a slot in an overflow buffer. The catch here is that I think most GPU drivers would not be very happy with the atomics today; it would run very well on the CPU reference implementation.
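A rough sketch of that overflow idea, living in the same .rs file as the comparison kernel; gOverflow, gCount and the 8-bit packing are invented for illustration:
rs_allocation gOverflow;       // 1-D int32 allocation holding packed overflow pairs
volatile int32_t gCount = 0;   // next free slot in gOverflow

static void push_overflow(uint32_t probe, uint32_t candidate) {
    int32_t slot = rsAtomicInc(&gCount);   // returns the value before the increment
    if ((uint32_t)slot < rsAllocationGetDimX(gOverflow)) {
        rsSetElementAt_int(gOverflow, (int)((probe << 8) | candidate), slot);
    }
}
Since plain script globals can't be read back from Java easily, an invokable run after the launch could copy gCount into a 1-element side allocation for the Java side to copyTo().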
There are a lot of ways to go about this. One important decision point revolves around how expensive the comparison is to find the points vs the cost of writing the result.
Background
I'm new to renderscript, and I would like to try some experiments with it (but small ones and not the complex ones we find in the SDK), so I thought of an exercise to try out, which is based on a previous question of mine (using NDK).
What I want to do
In short, I would like to pass bitmap data to RenderScript, and then I would like it to copy the data to another bitmap whose dimensions are swapped relative to the first one, so that the second bitmap is a rotation of the first.
For illustration:
From this bitmap (width:2 , height:4):
01
23
45
67
I would like it to rotate (counter-clockwise by 90 degrees) to:
1357
0246
The problem
I've noticed that when I try to change the signature of the root function, Eclipse gives me errors about it.
Even making new functions creates new errors. I've even tried the same code written on Google's blog (here), but I couldn't figure out how they created the functions they used, or why I can't change the filter function to take the input and output bitmap arrays.
What can I do in order to customize the parameters I send to renderscript, and use the data inside it?
Is it ok not to use "filter" or "root" functions (API 11 and above)? What can I do in order to have more flexibility about what I can do there?
You are asking a bunch of separate questions here, so I will answer them in order.
1) You want to rotate a non-square bitmap. Unfortunately, the bitmap model for Renderscript won't allow you to do this easily. The reason for this is that the input and output allocations must have the same shape (i.e. same number of dimensions and values of those dimensions, even if the Types are different). In order to get the effect you want, you should use a root function that only has an output allocation of the new shape (i.e. input columns X input rows). You can create an rs_allocation global variable for holding your input bitmap (which you can then create/bind on the Java side). The kernel then merely needs to set the output cell to the result of rsGetElementAt(globalInAlloc, y, x).
2) If you are using API 11, you can't adjust the signature of the root() function (you can pass NULL allocations as input, output on the Java side if you are not using them). You also can't create more than 1 kernel per source file on these older API levels, so you are forced to only have a single "root()" function. If you want to use more kernels per source file, consider targeting a higher API level.
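A sketch of the output-only kernel described in 1), with the index math adjusted to match the 90-degree counter-clockwise rotation in the question; it assumes a modern target (API 17+, where __attribute__((kernel)) is available) and invented names:
#pragma version(1)
#pragma rs java_package_name(com.example.rs)   // hypothetical package name

rs_allocation gIn;   // original bitmap (W x H), set from Java via set_gIn()

// Launched over the output allocation, which is H x W.
uchar4 __attribute__((kernel)) rotate_ccw(uint32_t x, uint32_t y) {
    uint32_t inWidth = rsAllocationGetDimX(gIn);
    // out(x, y) <- in(inWidth - 1 - y, x) is a 90-degree counter-clockwise turn
    return rsGetElementAt_uchar4(gIn, inWidth - 1 - y, x);
}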
After watching Jeff Sharkey's great Google I/O presentation, I'm eager to write some RenderScript to speed up my existing audio processing project. The first problem is that, in the example code given, the conversion function in the very first line of code isn't documented anywhere. At least not in http://developer.android.com/guide/topics/renderscript/reference.html
float4 inColor = convert_float4(*inPixel);
Well, the function convert_float4() in the example is obvious enough to understand. But in my case I would like to know whether other built-in conversions exist, for example from a char to a float, which I guess could be convert_float(char*)?
The generic answer is that RS supports conversion from all basic vector numeric types to other types of the same vector size. The conversions behave like a normal C cast, per component, with respect to rounding.
The form is:
convert_[dest type](source type)
Vectors of size 2, 3, and 4 of char, uchar, int, uint, short, ushort, and float are supported.
Avoid:
float4 f = (float4)myInt4;
It doesn't do what you expect it to do.
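A short sketch of the forms described above (the kernel and all names are mine, not from the presentation):
#pragma version(1)
#pragma rs java_package_name(com.example.rs)   // hypothetical package name

float4 __attribute__((kernel)) to_float(uchar4 in) {
    // Per-component uchar -> float; each lane behaves like a C cast.
    float4 f = convert_float4(in);
    // Going back works the same way, again per component.
    uchar4 back = convert_uchar4(f);
    // Scalars have no convert_*; an ordinary C cast covers that case.
    f.x = (float)back.x;
    return f;
}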
Looks like there are no such built-ins. convert_float4() is the only conversion function declared in rc_core.c.
I simply need to add floatArray1 to floatArray2, storing the result in floatArray2 - no third array. All arrays are one-dimensional but very large, probably as large as the OS will let me get away with. The most I would need is two float arrays with 40,000 floats each, but I suppose I could get away with 1/10th of that as a minimum.
I would love to do this in 1/30th or 1/60th of a second, but that does not seem possible? Also, if the code is JNI, NDK or OpenGL ES, that's fine. Does Android have an assembly language or machine code I could use somehow?
Since a float is 32 bits and you have 40,000 floats in each array, you would need:
40,000 * 32 * 2 = 2,560,000 bits
which is 320,000 bytes. Not too much memory-wise, I would say, since the default limit for an Android app is 16 MB.
Regarding performance, you would definitely gain some speed using JNI. OpenGL would not give you enough benefit, I would think, since OpenGL context creation takes some time as well.
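Since the answer points at JNI, here is a minimal JNI sketch of the in-place add; the class and package names are invented, and the Java side would declare a matching native void addInPlace(float[] a, float[] b):
#include <jni.h>

JNIEXPORT void JNICALL
Java_com_example_ArrayMath_addInPlace(JNIEnv *env, jobject thiz,
                                      jfloatArray a, jfloatArray b) {
    jsize n = (*env)->GetArrayLength(env, a);
    jfloat *pa = (*env)->GetFloatArrayElements(env, a, NULL);
    jfloat *pb = (*env)->GetFloatArrayElements(env, b, NULL);
    for (jsize i = 0; i < n; i++) {
        pb[i] += pa[i];   // floatArray2 += floatArray1
    }
    (*env)->ReleaseFloatArrayElements(env, a, pa, JNI_ABORT);  // a was only read
    (*env)->ReleaseFloatArrayElements(env, b, pb, 0);          // copy results back into b
}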