Maximum triangle count in RenderScript - android

I wrote a quick application to get a feel for the limits of RenderScript and discovered that at approximately 65,000 triangles, the system simply stops drawing any additional ones. For example, if I create a cylinder with 70,000 triangles, a wedge is missing from the cylinder, corresponding to the triangles that exceed the ~65,000 count. The triangles are textured and, for ease of writing the app, I simply used the TriangleMeshBuilder class, so there is no real optimization going on such as using trifans or tristrips. The hardware is a Samsung Galaxy Nexus. LogCat reports a heap size of about 15MB with 3% free. I receive no errors or warnings regarding the graphics system or RenderScript.
Can anyone explain the reason for the triangles being dropped? Am I at a hardware limit that RenderScript is handling gracefully?
UPDATE Happens on a Samsung Galaxy Nexus (4.0.3), Samsung Galaxy Tab 7.0+ (3.2) and Motorola Xoom (3.2), all at the same point of approximately 65,000 triangles. Each of these devices has a different GPU.
UPDATE 2 In response to Steve Blackwell's insights, I have some additional thoughts.
Lines 710-712 do indeed downcast the int indices to short, so 65536 goes to 0, as Steve points out. Additionally, the "cast" on line 757 is not so much a cast as a declaration to RenderScript of the format of the binary data that will eventually be sent to it. RenderScript requires all data to be packed into a RenderScript-specific data type called an Allocation in order to move from Java to the RenderScript runtime, and that Allocation needs to be told its data layout. In line with Steve's opinion that this is a bug, line 757 tells RenderScript to treat the index data as unsigned 16-bit shorts, but it is then sent 32-bit signed values. Because there is no check, those values are accepted, treated as unsigned, and only their lower 16 bits are used, which is why something is drawn when we are below this threshold and why triangles connect back to the first indices when we go over it.
Subclassing TriangleMeshBuilder to see if I could make it accept these values all as integers, to raise this limit, did not work, which leads me to believe that somewhere in the deep code we do not have access to there is an additional reference to unsigned shorts. It looks like the only workaround is to add additional vertex buffers, as Steve suggests, which is easily done with the existing Mesh.AllocationBuilder class. I will also bring it up with Google in the Developer Hangouts to determine whether this is in fact a bug or intentional.
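For reference, here is a rough, untested sketch of that workaround: split the model into sub-meshes of at most 65,536 vertices each and build one Mesh per chunk with Mesh.AllocationBuilder, so every index still fits in an unsigned 16-bit value. The VertexChunk container and the chunking of the original vertex/index data are hypothetical glue code, not part of the framework.

import android.renderscript.*;
import java.util.ArrayList;
import java.util.List;

class MeshSplitter {
    // Hypothetical container for one sub-mesh worth of data (at most 65,536 vertices).
    static class VertexChunk {
        float[] packedVertices;  // interleaved x, y, z, s, t per vertex
        short[] indices;         // triangle indices local to this chunk
        int vertexCount;
    }

    static List<Mesh> buildSubMeshes(RenderScriptGL rs, List<VertexChunk> chunks) {
        // Same vertex layout TriangleMeshBuilder would produce: position plus one texcoord.
        Element vtxElem = new Element.Builder(rs)
                .add(Element.F32_3(rs), "position")
                .add(Element.F32_2(rs), "texture0")
                .create();

        List<Mesh> subMeshes = new ArrayList<Mesh>();
        for (VertexChunk chunk : chunks) {
            Allocation vertices = Allocation.createSized(rs, vtxElem, chunk.vertexCount,
                    Allocation.USAGE_GRAPHICS_VERTEX | Allocation.USAGE_SCRIPT);
            vertices.copyFromUnchecked(chunk.packedVertices);

            Allocation indices = Allocation.createSized(rs, Element.U16(rs), chunk.indices.length,
                    Allocation.USAGE_GRAPHICS_VERTEX | Allocation.USAGE_SCRIPT);
            indices.copyFrom(chunk.indices);  // every index stays below 65,536

            Mesh.AllocationBuilder mb = new Mesh.AllocationBuilder(rs);
            mb.addVertexAllocation(vertices);
            mb.addIndexSetAllocation(indices, Mesh.Primitive.TRIANGLE);
            subMeshes.add(mb.create());
        }
        return subMeshes;  // draw each one with rsgDrawMesh() in the graphics script
    }
}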

I know almost nothing about RenderScript, so I don't know whether this is some inherent limitation, a hardware issue, or something to do with TriangleMeshBuilder, but I would bet you're running out of triangles right after number 65535.
This is a magic number because it's the maximum value of an unsigned 16-bit integer. (Wikipedia)
I would suspect that somewhere in the code there's an unsigned short that holds the number of triangles. It won't be in the Java code since Java doesn't have unsigned values. And the limitation is probably not hardware since CPU registers/pathways are >= 32-bit. So I would check TriangleMeshBuilder.
EDIT:
That's a great find on line 553. The value of every index has to fit into a short. It looks like the downcast is happening at lines 710-712.
I assume that you're calling addTriangle(). That function takes three ints and then does an explicit cast to short. I think that's a bug right there because the downcast happens silently, and it's not what you'd expect from the function signature.
On line 768, that bogus data gets passed to Allocation.copy1DRangeFromUnchecked(). I didn't follow it all the way down, but I imagine that at some point those signed values get cast back to unsigned: -32768 through -1 turn back into 32768 through 65535. So turning the indices into negatives looks bad, but it's just a reinterpretation of the same data and not really a problem.
The real problem starts when you send in values like 65536. When 65536 is cast to a short, it turns into 0. That's a real loss of data. Now you're referring to different indices, and a cast to unsigned doesn't fix it.
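A tiny bit of plain Java shows both cases (illustrative only, no RenderScript involved):

public class ShortCastDemo {
    public static void main(String[] args) {
        int below = 40000;              // still a valid index if read back as unsigned 16-bit
        int above = 65536;              // one past the 16-bit range
        short sBelow = (short) below;   // -25536: same 16 bits, just reinterpreted as signed
        short sAbove = (short) above;   // 0: the 17th bit is silently dropped
        System.out.println(sBelow & 0xFFFF);  // 40000 again, so drawing still works
        System.out.println(sAbove & 0xFFFF);  // 0, so the triangle snaps back to vertex 0
    }
}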
The real kicker is that copy1DRangeFromUnchecked() is an overloaded function, and one of the overloads takes an int[], so none of this ever needed to be an issue.
For workarounds, I guess you could subclass TriangleMeshBuilder and override the member variable mIndexData[] and method addTriangle(). Or maybe you could use multiple vertex buffers. Or file a bug report someplace? Anyway, interesting problem.

It's probably because OpenGL ES allows only short element indices, not int. Source: http://duriansoftware.com/joe/An-intro-to-modern-OpenGL.-Chapter-2.1:-Buffers-and-Textures.html (search for "OpenGL ES")
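For what it's worth, the same restriction is visible from the Java GLES bindings (android.opengl.GLES20): with plain OpenGL ES 2.0, glDrawElements() only accepts byte or short index types, and 32-bit indices need the GL_OES_element_index_uint extension (or OpenGL ES 3.0). A hedged sketch, assuming an index buffer is already bound to GL_ELEMENT_ARRAY_BUFFER and indexCount holds the number of indices:

String extensions = GLES20.glGetString(GLES20.GL_EXTENSIONS);
boolean hasUintIndices = extensions != null && extensions.contains("GL_OES_element_index_uint");
if (hasUintIndices) {
    // 32-bit indices allowed, so more than 65,536 vertices can be addressed.
    GLES20.glDrawElements(GLES20.GL_TRIANGLES, indexCount, GLES20.GL_UNSIGNED_INT, 0);
} else {
    // Without the extension, every index must fit in 16 bits.
    GLES20.glDrawElements(GLES20.GL_TRIANGLES, indexCount, GLES20.GL_UNSIGNED_SHORT, 0);
}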

Related

Android: allocate a graphic once and move it around on the screen without rewriting pixels

I have a Google Pixel 4, rooted, and I have the AOSP code building successfully. Inside an Android app, I'd like to gralloc an extra-large area of memory (probably 4x as large as the 1080x2280 screen) and draw a simple graphic (for example, a square) into the middle of the buffer. Then, on each frame, I just want to slide the pointer around on the buffer to make it look like the square is moving around on the screen (since the rest of the buffer will be blank).
I'm not sure if this is feasible. So far, I have a completely native Android app. At the beginning of android_main(), I malloc a region 4x as large as the screen, and I draw a red square in the middle. On each frame, I call
ANativeWindow_lock()
to get the pointer to the gralloced memory, which is accessed in ANativeWindow_Buffer->bits. Then I use memcpy to copy pixels from my big malloced buffer into the bits address, adjusting the src pointer in my memcpy call to slide around within the buffer and make it seem like the square is moving around on the screen. Finally, I call
ANativeWindow_unlockAndPost()
to release CPU access and post the buffer to the display.
However, I'm trying to render as fast as possible. There are 2462400 pixels (2 bytes each) for the screen, so memcpy is copying 5MB of data for each frame, which takes ~10ms. So, I want to avoid the memcpy and access the ORIGINAL pointer to the dma_buf, or whatever it is that Gralloc3 originally allocates for the GraphicBuffer (the ANativeWindow is basically just a Surface which uses a GraphicBuffer, which uses GraphicBufferMapper, which uses GrallocMapper in gralloc). This is complicated by the fact that the system appears to be triple-buffering, putting three gralloced buffers in BufferQueue and rotating between them.
By adding log statements to my AOSP build, I can see when Gralloc3 allocates buffers and how big they are. So I can allocate extra-large buffers. But it's manually adjusting the pointers and having that reflect on the display where I'm getting stuck. ANativeWindow_lock() gets a copy of the original pointer to the pixel buffer, so I figured if I can trace that call all the way down, then I can find the original pointer. I traced it down into hardware/interfaces/graphics/mapper/3.0/IMapper.hal and IAllocator.hal, which are used by Gralloc3 to interact with memory. But I don't know where to go after this. The HAL file is basically a header that's implemented by some other vendor-specific file, I guess....
[Screenshot] Checking out ANativeWindow_lock() using Android Studio's CPU profiler
Based on this picture, it seems like some QtiMapper.cpp file might be implementing the HAL. There are a few files called QtiMapper, in hardware/qcom. (I'm guessing I should be looking in the sm8150 folder because the Pixel 4 uses Snapdragon 855.) Then lower down, it looks like the IonAlloc::CleanBuffer and BufferManager::LockBuffer might be in the gr_ files in hardware/qcom/display/msmxxxx/gralloc folders. But I can't be sure where the calls are being routed exactly because if I try to modify or add log statements to these files, I get problems with the AOSP build. Any directions on how to mod these would be very helpful, too. If these are the actual files being used by the system, it looks like I could possibly use them for my app because I can see the ioctl and mmap calls in them.
Using the Linux Direct Rendering Manager, I was able to write directly to the display in a C file by shutting down SurfaceFlinger and mmapping some memory. See my demo here. So if I shut down the Android framework, I can accomplish what I want to do. But I want to keep Android up, because I'm looking to use the accelerometers and maybe other APIs in my app. (The goal is to use the accelerometer readings to stabilize text on the display as fast as possible.) It's also annoying because starting up the display server again does some kind of reboot.
First of all, is what I want to do even worth it? It seems like it should be, because the display on the Pixel can refresh every 10 milliseconds, and taking the time to copy the pixel memory is pointless in this case.
Second of all, does anyone know of a way I can, within my app, adjust the low-level pointer to the pixel buffer in memory and still make it push to the display?

RenderScript speedup 10x when forcing default CPU implementation

I have implemented a CNN in RenderScript, described in a previous question which spawned this one. Basically, when running
adb shell setprop debug.rs.default-CPU-driver 1
there is a 10x speedup on both the Nvidia Shield and the Nexus 7. The average computation time goes from around 50 ms to 5 ms, and the test app goes from around 50 fps to 130 or more. There are two convolution algorithms:
(1) moving kernel
(2) im2col and GEMM from ScriptIntrinsicBLAS.
Both see a similar speedup. The question is: why is this happening, and can this effect be triggered from code in a predictable way? And is detailed information about this available anywhere?
Edit:
As per the suggestions below, I verified the use of finish() and copyTo(); here is a breakdown of the procedure. The speedup reported is measured AFTER the call to copyTo() but without finish(). Uncommenting finish() adds about 1 ms to the time.
double forwardTime = 0;
long t = System.currentTimeMillis();
// double t = SystemClock.elapsedRealtime(); // makes no difference
for (Layer a : layers) {
    blob = a.forward(blob);
}
mRS.finish(); // adds about 1ms to the measured time
blob.copyTo(outbuf);
forwardTime = System.currentTimeMillis() - t;
Maybe this is unrelated, but on the NVIDIA Shield I get an error message at startup which disappears when running with adb shell setprop debug.rs.default-CPU-driver 1
E/Renderscript: rsAssert failed: 0, in vendor/nvidia/tegra/compute/rs/driver/nv/rsdNvBcc.cpp
I'm setting compileSdkVersion, minSdkVersion and targetSdkVersion to 23 right now, with buildToolsVersion "23.0.2". The tablets are auto-updated to the very latest Android version. I am not sure about the minimum target I need to set and still have ScriptIntrinsicBLAS available.
I'm using #pragma rs_fp_relaxed in all scripts. The Allocations all use default flags.
This question has a similar situation, but it turned out the OP was creating new Script objects every computation round. I do nothing of the sort; all Scripts and Allocations are created at init time.
The original post has the mRS.finish() commented out. I am wondering if that is the case here.
To benchmark RenderScript properly, we should wait for pending asynchronous operations to complete. There are generally two ways to do that:
Use RenderScript.finish(). This works well when using debug.rs.default-CPU-driver 1, and it also works with most GPU drivers. However, certain GPU drivers treat this as a no-op.
Use Allocation.copyTo() or a similar API to access the data of an Allocation, preferably the final output Allocation. This is really a trick, but it works on all devices. Just be aware that the copyTo() operation itself may take some time, and make sure you take that into consideration.
5 ms here seems suspicious; it might be real, depending on the actual algorithm, but it is worth double-checking whether it still holds once you add finish() or copyTo().
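As an illustration of the copyTo() point, here is a rough sketch (reusing the variable names from the code in the question) that times the kernel work and the copy-out separately, so the cost of copyTo() is visible instead of being folded into one number:

long t0 = System.nanoTime();
for (Layer a : layers) {
    blob = a.forward(blob);      // queues the RenderScript launches
}
mRS.finish();                    // waits for queued work; may be a no-op on some GPU drivers
long t1 = System.nanoTime();
blob.copyTo(outbuf);             // forces completion on every driver, plus the copy itself
long t2 = System.nanoTime();
Log.d("Bench", "compute ~" + (t1 - t0) / 1e6 + " ms, copyTo ~" + (t2 - t1) / 1e6 + " ms");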
That's very strange indeed. The fact that you're getting the same result across both devices and with two very different implementations of the conv layers suggests there is still something else going on with the benchmarking or timing itself, rather than differences with CPU/GPU execution, as things are rarely that conclusive.
I would suggest verifying that the outputs from the copyTo() calls are always the same. Set up a logcat output of, say, the first (and last!) 10 values in the float array that comes back from each layer's output allocation, to make sure all implementations and execution modes are truly processing the data properly and equally at each layer.
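One possible way to do that check, using java.util.Arrays; the layerOutput allocation and layerOutputSize here are stand-ins for however your layer class exposes its output:

float[] probe = new float[layerOutputSize];
layerOutput.copyTo(probe);
Log.d("LayerCheck", "first 10: " + Arrays.toString(Arrays.copyOfRange(probe, 0, 10))
        + ", last 10: " + Arrays.toString(Arrays.copyOfRange(probe, probe.length - 10, probe.length)));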
Depending on your setup, it's also possible that the data-copying overhead I mentioned before is overpowering the computation time itself, and what you're seeing is just an unfortunate effect of that, since copying data from one place or another can take more or less time. Try increasing the conv kernel sizes or counts (with dummy/random values, just for testing's sake) to make the computation much more complex and thereby shift the balance between compute time and data-loading time, and see how that affects your results.
If all else fails, it could just be that the GPU really is taking longer for some reason, though it can be hard to track down why. Some things to check: What data type and size are you using for the data? How are you loading/writing the data to the allocations? Are you already using #pragma rs_fp_relaxed to set your float precision? What flags are you setting for the allocation usage (such as Allocation.USAGE_SCRIPT | Allocation.USAGE_GRAPHICS_TEXTURE)?
And as for your last question, detailed RS documentation on specific optimization matters is still very scarce unfortunately... I think just asking here on SO is still one of the best resources available for now :)

Renderscript: Create a vector of structs

I'm writing a small piece of Renderscript to dynamically take an image and sort the pixels into 'buckets' based on each pixel's RGB values. The number of buckets could vary, so my instinct would be to create an ArrayList. This isn't possible within Renderscript, obviously, so I was wondering what the approach would be to creating a dynamic list of structs within the script. Any help greatly appreciated.
There's no clear answer to this. The problem is that dynamic memory management is anathema to platforms like RenderScript: it's slow, it implies a lot of things about page tables and TLBs that may not be easy to guarantee from a given processor at an arbitrary time, and it's almost never an efficient way to do what you want to do.
What the right alternative is depends entirely on what you're doing with the buckets after they're created. Do you just need every pixel categorized, without physically sorting the pixels into buckets? Then create a per-pixel mask (or use the alpha channel) and store the category alongside the pixel data. Do you have some upper bound on the size of each bucket? Then allocate every bucket at that size.
Sorry that this is open-ended, but memory management is one of those things that brings high-performance code to a screeching halt. Workarounds are necessary, but the right workaround varies in every case.
I'll try to answer your goal question of classifying pixel values, and not your title question of creating a dynamically-sized list of structs.
Without knowing much about your algorithm, I will frame my answer in terms of two possible algorithms:
RGB Joint Histogram
Does not use neighboring pixel values.
Connected Component
Requires neighboring pixel values.
Requires a supporting data structure called "Disjoint set".
Common advice.
Both algorithms require a lot of memory per worker thread. Also, both algorithms are poorly suited to the GPU because they require some kind of random memory access (see the note below). Therefore, it is likely that both will end up being executed on the CPU, so it is a good idea to reduce the number of "threads" to avoid multiplying the memory requirement.
Note: Non-coalesced (non-sequential) memory access - reads, writes, or both.
RGB Joint Histogram
The best way is to compute a joint color histogram using Renderscript, and then run your classification algorithm on the histogram instead (presumably on the CPU). After that, you can perform a final step of pixel-wise label assignment back in Renderscript.
The whole process is almost exactly the same as Tim Murray's Renderscript presentation in Google I/O 2013.
Link to recorded session (video)
Link to slides (PDF)
The joint color histogram will have to have a hard-coded size. For example, a 32x32x32 RGB joint histogram uses 32768 histogram bins. This allows 32 levels of shade for each channel, so the quantization error per channel is about +/- 4 levels out of 256.
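As a Java-side illustration of that binning scheme (the actual histogram kernel would live in a .rs script, as in the talk above); the packed ARGB pixels array is an assumption:

int[] hist = new int[32 * 32 * 32];      // 32768 bins, 5 bits per channel
for (int pixel : pixels) {               // pixels packed as 0xAARRGGBB
    int r = (pixel >> 16) & 0xFF;
    int g = (pixel >> 8) & 0xFF;
    int b = pixel & 0xFF;
    int bin = ((r >> 3) << 10) | ((g >> 3) << 5) | (b >> 3);
    hist[bin]++;
}
// Classify the 32768 bins on the CPU, then map each pixel's bin back to its label in a second pass.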
Connected Component
I have successfully implemented multi-threaded connected-component labeling in Renderscript. Note that my implementation is limited to execution on the CPU; it cannot run on the GPU.
Prerequisites.
Understand the Union-Find algorithm (and its various theoretical parts, such as path-compression and ranking) and how connected-component labeling benefits from it.
Some design choices.
I use a 32-bit integer array, same size as the image, to store the "links".
Linking occurs in the same way as Union-Find, except that I do not have the benefit of ranking. This means the tree may become highly unbalanced, and therefore the path length may become long.
On the other hand, I perform path-compression at various steps of the algorithm, which counteracts the risk of suboptimal tree merging by shortening the paths (depths).
One small but important implementation detail.
The value stored in the integer array is essentially an encoding of the "(x, y)" coordinates of either (i) the pixel itself, if the pixel is its own root, or (ii) a different pixel that has the same label as the current pixel.
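For illustration, here is a minimal plain-Java sketch of that linking scheme, with linear indices (y * width + x) standing in for the encoded (x, y) coordinates; it uses path compression but no ranking, matching the trade-off described above:

class UnionFind {
    final int[] parent;                   // one entry per pixel; parent[i] == i means i is a root

    UnionFind(int pixelCount) {
        parent = new int[pixelCount];
        for (int i = 0; i < pixelCount; i++) parent[i] = i;  // every pixel starts as its own root
    }

    int find(int i) {
        int root = i;
        while (parent[root] != root) root = parent[root];
        while (parent[i] != root) {       // path compression: point the whole chain at the root
            int next = parent[i];
            parent[i] = root;
            i = next;
        }
        return root;
    }

    void union(int a, int b) {            // link the components containing pixels a and b
        int rootA = find(a), rootB = find(b);
        if (rootA != rootB) parent[rootA] = rootB;  // no ranking, so trees can get deep until compressed
    }
}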
Steps.
The multi-threaded stage.
Divide the image into small tiles.
Inside each tile, compute the connected components, using label values local to that tile.
Perform path compression inside each tile.
Convert the label values into global coordinates and copy the tile's labels into the main result matrix.
The single-threaded stage.
Horizontal stitching.
Vertical stitching.
A global round of path-compression.

Converting a short array to floating point using ARM neon

I've just started trying to optimised some android code using NEON. I'm having a few issues, however. The main issue is that I really can't work out how to do a quick 16-bit to float conversion.
I see it's possible to convert multiple 32-bit ints to floats in one SIMD instruction using vcvt.f32.s32. However, how do I convert a set of 4 S16s to 4 S32s? I assume it has something to do with the VUZP instruction but I cannot figure out how...
Equally, I see that it's possible to use VCVT.s16.f32 to convert one 16-bit value to a float at a time, but while this is helpful it seems very wasteful not to be able to do it using SIMD.
I've written assembler on many different platforms over the years but I find the ARM documentation completely unfathomable for some reason.
As such any help would be HUGELY appreciated.
Also is there any way to get the throughput and latency figures for the NEON unit?
Thanks in advance!
If no other computation is to be done along with the conversion from 16-bit integer to 32-bit integer, you can go for uint32x4_t = vmovl_u16(uint16x4_t) (or int32x4_t = vmovl_s16(int16x4_t) for signed values).
If any simple addition or multiplication etc. is being performed before the conversion, you can combine them in a single instruction, like int32x4_t = vmull_s16(int16x4_t, int16x4_t) or int32x4_t = vaddl_s16(int16x4_t, int16x4_t), and thus save some cycles.
Elaborating a small bit on my comment: you want to "widen" the four 16-bit values to four 32-bit integers before converting them to four 32-bit floats. Looking at the instruction set I don't think there are any faster conversion paths, but I could easily be wrong.
The direct method is to use vaddl.s16 with a second operand of four zeros, but unless you're only doing conversion you can often combine the conversion with a previous operation. E.g. if you're multiplying two int16x4 registers you can use vmull.s16 to get 32-bit output directly rather than first multiplying and widening later (provided you're not depending on any truncating behavior).
Why use vaddl and waste cycles initializing a valuable register with 0?
vmovl.s16 q0, d1      @ sign-extend four 16-bit lanes to four 32-bit lanes
vcvt.f32.s32 q0, q0   @ then convert q0 to four floats
That will do.
My question is: is it absolutely necessary to convert them to float? NEON is much faster at integer operations than at float (in both execution and the pipeline). Therefore, fixed-point operations will be more appropriate in most cases, thanks to the powerful long/wide/narrow models combined with the arithmetic instructions and the automatic rounding/saturation options.
PS: Strangely, I think ARM's PDFs are the best documentation around.

android kernel libm pow(float,float) implementation

I am testing corner cases of the pow() call (#include <math.h>), specifically pow(-1, Inf).
On my desktop (Ubuntu) I get the result 1.0, which is in accordance with the 2008 IEEE floating-point specification.
When I run the same test on the Android Gingerbread kernel, I get NaN returned.
I have looked around and can see that there are indeed many implementations of pow in the standard libraries for different platforms, and in the case of pow(-1, Inf) they are coded to produce different results.
The question is which one should be deemed correct? Any Ideas or thoughts?
I apologize if I am posting on the wrong forum, I followed the link from the android developer resources and ended up here.
The C standard is perfectly clear on this point (§F.9.4.4); there's no room for "ideas or thoughts":
pow(−1, ±∞) returns 1.
Annex F applies only if an implementation defines __STDC_IEC_559__, but there is no question that 1.0 is the right answer.
I suspect that this is a Java-ism that has leaked over into the NDK. (Java defines pow(-1,infinity) to be NaN):
If the absolute value of the first argument equals 1 and the second argument is infinite, then the result is NaN.
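This is easy to confirm from plain Java (or from an Android app); the same rule applies to StrictMath.pow():

public class PowCorner {
    public static void main(String[] args) {
        // Java's rule quoted above: |x| == 1 with an infinite exponent gives NaN...
        System.out.println(Math.pow(-1.0, Double.POSITIVE_INFINITY));  // prints NaN
        // ...whereas C99 Annex F / IEEE 754-2008 define pow(-1, +/-inf) as 1.0.
    }
}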
Edit:
Since Matteo objects that this "makes no sense", I'll offer a few sentences of explanation for why the committee made this choice. Although lim_{n->inf} (-1)^n does not exist in the real numbers, we must remember that floating-point numbers are not real numbers, and in fact, for all sufficiently large floating-point numbers y, pow(-1,y) is +1. This is because all sufficiently large floating-point numbers are even integers. From this perspective, it is quite reasonable to define pow(-1,infinity) to be +1, and this turns out to actually lead to more useful behavior in some floating-point computations.
There are a surprising number of extremely competent mathematicians (as well as very skilled programmers and compiler writers) involved with both the C and the IEEE-754 committees, and they do not make these decisions flippantly. Every standard has bugs, but this is not one of them.
