I'm trying to decide on whether to primarily use floats or ints for all 3D-related elements in my app (which is C++ for the most part). I understand that most ARM-based devices have no hardware floating point support, so I figure that any heavy lifting with floats would be noticeably slower.
However, I'm planning to prep all data for the most part (i.e. have vertex buffers where applicable and transform using matrices that don't change a lot), so I'm just stuffing data down OpenGL's throat. Can I assume that this goes more or less straight to the GPU and will as such be reasonably fast? (Btw, the minimum requirement is OpenGL ES 2.0, so that presumably excludes older 1.x-based phones.)
Also - what is the penalty when I mix and match ints and floats? Assuming that all my geometry is just pre-built float buffers, but I use ints for matrices since those do require expensive operations like matrix multiplications, how much wrath will I incur here?
By the way, I know that I should keep my expectations low (sounds like even asking for floats on the CPU is asking for too much), but is there anything remotely like 128-bit VMX registers?
(And I'm secretly hoping that fadden is reading this question and has an awesome answer.)
Older Android devices like the G1 and MyTouch have ARMv6 CPUs without floating point support. Most newer devices, like the Droid, Nexus One, and Incredible, use ARMv7-A CPUs that do have FP hardware. If your game is really 3D-intensive, it might demand more from the 3D implementation than the older devices can provide anyway, so you need to decide what level of hardware you want to support.
If you code exclusively in Java, your app will take advantage of the FP hardware when available. If you write native code with the NDK, and select the armv5te architecture, you won't get hardware FP at all. If you select the armv7-a architecture, you will, but your app won't be available on pre-ARMv7-A devices.
OpenGL from Java should be sitting on top of "direct" byte buffers now, which are currently slow to access from Java but very fast from the native side. (I don't know much about the GL implementation though, so I can't offer much more than that.)
Some devices additionally support the NEON "Advanced SIMD" extension, which provides some fancy features beyond what the basic VFP support has. However, you must test for this at runtime if you want to use it (looks like there's sample code for this now -- see the NDK page for NDK r4b).
An earlier answer has some info about the gcc flags used by the NDK for "hard" fp.
Ultimately, the answer to "fixed or float" comes down to what class of devices you want your app to run on. It's certainly easier to code for armv7-a, but you cut yourself off from a piece of the market.
In my opinion you should stick with fixed-point as much as possible.
It is not only old phones that lack floating point support; some new ones, such as the HTC Wildfire, lack it too.
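For what it's worth, here is a minimal sketch of what fixed-point arithmetic could look like, assuming a 16.16 format stored in a 32-bit int. The names (fixed16, toFixed, toFloat, fixedMul) are made up for illustration and are not from any library; the format you actually want depends on the range and precision your geometry needs.

```cpp
#include <cstdint>

// Q16.16 fixed point: 16 integer bits and 16 fractional bits in an int32_t.
// All names here are hypothetical, chosen for this sketch only.
using fixed16 = std::int32_t;
constexpr int kFracBits = 16;
constexpr fixed16 kOne = 1 << kFracBits;  // the value 1.0 in Q16.16

// Convert between float and fixed point (truncates toward zero).
constexpr fixed16 toFixed(float v) { return static_cast<fixed16>(v * kOne); }
constexpr float toFloat(fixed16 v) { return static_cast<float>(v) / kOne; }

// Multiply in 64 bits so the intermediate product cannot overflow,
// then divide out the extra 2^16 scale factor.
constexpr fixed16 fixedMul(fixed16 a, fixed16 b) {
    return static_cast<fixed16>((static_cast<std::int64_t>(a) * b) / kOne);
}
```

A matrix multiply built on this would use fixedMul for each product and plain integer addition for the sums.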
Also, if you choose to require ARMv7, please note that for example the Motorola Milestone (Droid for Europe) does feature an ARMv7 CPU, but because of the way Android 2.1 has been built for this device, the device will not use your armeabi-v7a libs (and might hide your app from the Market).
I personally worked around this by detecting ARMv7 support using the new cpufeatures library provided with NDK r4b, to load some armeabi-v7a lib on demand with dlopen().
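A rough sketch of that workaround follows. The library names (libmath_v7a.so, libmath_generic.so) and the helper names are hypothetical. The cpufeatures calls (android_getCpuFamily, android_getCpuFeatures) come from the NDK's cpu-features.h, so that part only builds inside an NDK project with the cpufeatures module linked in; the selection logic is kept separate so it can be checked on any machine.

```cpp
#include <string>

#if defined(__ANDROID__)
#include <cpu-features.h>  // NDK cpufeatures module
#include <dlfcn.h>
#endif

// Pick which shared library to load. Both names are placeholders.
inline std::string chooseMathLib(bool isArm, bool hasArmv7) {
    return (isArm && hasArmv7) ? "libmath_v7a.so" : "libmath_generic.so";
}

#if defined(__ANDROID__)
// Query the CPU at runtime and dlopen() the matching library.
// Returns nullptr if the library cannot be loaded.
inline void* loadMathLib() {
    const bool isArm = android_getCpuFamily() == ANDROID_CPU_FAMILY_ARM;
    const bool hasV7 =
        (android_getCpuFeatures() & ANDROID_CPU_ARM_FEATURE_ARMv7) != 0;
    return dlopen(chooseMathLib(isArm, hasV7).c_str(), RTLD_NOW);
}
#endif
```

The same pattern works for NEON by testing ANDROID_CPU_ARM_FEATURE_NEON instead.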
I believe that OpenGL ES 3.2 (and 3.1 + Android Extensions Pack AEP) support it, but I've heard that some GPUs with previous versions (specifically 3.1 without AEP) also have this particular extension.
My question is: how can I tell which GPUs have that particular extension, enabling one to render to a float texture?
I've searched manufacturer sites, but haven't been able to find this info (maybe I'm looking in the wrong place?)
I'm also a little wary, I heard that one manufacturer added this ability in their driver... but I wonder if that is a software solution (and therefore much slower, defeating the purpose).
Further to this, of course it's possible to do the encoding decoding in your own shader - but wouldn't this incur significant overhead? Or, maybe it's fine?
[BTW: the reason I'm asking is I want to purchase a phone to play around with general-purpose computing on mobile GPUs, and the latest phones are much more expensive]
Many thanks for any help! I've been trying to find this on-and-off for months...
Float rendering support is mandatory in OpenGL ES 3.2. It is not required for OpenGL ES 3.0 / 3.1 / 3.1 + AEP.
For earlier implementations you want to use a platform exposing the EXT_color_buffer_half_float and/or EXT_color_buffer_float extension.
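To check for these at runtime, one option is to search the extension string for the exact token. Below is a small sketch; hasExtension is a hypothetical helper, and it assumes you pass in the space-separated string returned by glGetString(GL_EXTENSIONS) on a current context. In that string the names carry the GL_ prefix, e.g. GL_EXT_color_buffer_float.

```cpp
#include <cstring>

// Returns true if `name` appears as a whole space-separated token in
// `extensions`. A plain strstr() is not enough, because one extension
// name can be a prefix of another.
inline bool hasExtension(const char* extensions, const char* name) {
    if (extensions == nullptr || name == nullptr || *name == '\0') return false;
    const std::size_t len = std::strlen(name);
    for (const char* p = std::strstr(extensions, name); p != nullptr;
         p = std::strstr(p + len, name)) {
        const bool startOk = (p == extensions) || (p[-1] == ' ');
        const bool endOk = (p[len] == ' ') || (p[len] == '\0');
        if (startOk && endOk) return true;
    }
    return false;
}
```

Note that this only tells you the driver advertises the extension; it says nothing about how fast the implementation is.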
Note that floating point rendering is relatively expensive due to the additional bandwidth, even when supported natively in the hardware. For higher dynamic range consider using something like RGB10_A2 if you can, it's smaller and faster (and supported in 3.0 core).
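To illustrate the size difference: an RGB10_A2 pixel fits in 32 bits, versus 64 bits for four half floats or 128 bits for four full floats. The sketch below packs normalized values into one 32-bit word, assuming the GL_UNSIGNED_INT_2_10_10_10_REV layout (red in the low 10 bits, alpha in the top 2). The function names are made up for this example.

```cpp
#include <algorithm>
#include <cstdint>

// Clamp v to [0, 1] and scale it to an integer in [0, maxValue], rounding
// to nearest.
inline std::uint32_t quantize(float v, std::uint32_t maxValue) {
    const float clamped = std::min(std::max(v, 0.0f), 1.0f);
    return static_cast<std::uint32_t>(clamped * static_cast<float>(maxValue) + 0.5f);
}

// Pack RGBA into 10:10:10:2 bits, red in the lowest bits and alpha in the
// highest (assumed GL_UNSIGNED_INT_2_10_10_10_REV ordering).
inline std::uint32_t packRGB10A2(float r, float g, float b, float a) {
    return quantize(r, 1023u) |
           (quantize(g, 1023u) << 10) |
           (quantize(b, 1023u) << 20) |
           (quantize(a, 3u) << 30);
}
```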
I am developing some numerical software whose performance depends a lot on the numerical accuracy (i.e., float, double, etc.).
I have noticed that the ARM NEON does not fully comply with the IEEE754 floating point standard. Is there a way to emulate NEON's floating point precision, on an x86 CPU ? For example a library that emulates the NEON SIMD floating point operations.
Probably.
I'm less familiar with SSE, but you can force many of the SSE modes to behave like NEON. This will depend on your compiler and available libraries, but see some Visual Studio FP unit control functions. This might be good enough for your requirements.
Furthermore, you can use the arm_neon.h header to ensure that you are using similar intrinsics to accomplish similar things.
Finally, if you really require achieving this precision at these boundary conditions, you are going to want a good test suite to verify that you are achieving your results as intended.
Finally finally, even with pure "C" code, which typically complies with IEEE-754, and uses the VFP on ARM as other commenters have mentioned, you will get different results because floating point is a highly... irregular process, subject to the whim of optimization and order of operations. It is challenging to get results to match across different compilers, let alone hardware architectures. For example, to get highly agreeable results on Intel with gcc it's often required to use the -ffloat-store flag, if you want to compare with /fp:precise on CL/MSVS.
In the end, you may need to accept some kind of non-zero error tolerance. Trying to get to zero may be difficult, but it would be awesome to hear your results if you get there. It seems possible... but difficult.
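If you do end up accepting a tolerance, a comparison along these lines is a common starting point. This is only a sketch: approxEqual is a hypothetical helper, and the default tolerances are placeholders that you would need to tune for your own data.

```cpp
#include <algorithm>
#include <cmath>

// Compare two floats using a relative tolerance, with an absolute floor so
// that values very close to zero can still compare equal.
inline bool approxEqual(float a, float b,
                        float relTol = 1e-5f, float absTol = 1e-8f) {
    const float diff = std::fabs(a - b);
    const float scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(absTol, relTol * scale);
}
```

A test suite would then compare the ARM and x86 outputs element by element with approxEqual instead of ==.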
Thanks for your answers.
In the end, I used an Android phone connected to a desktop and ran certain functions on the phone.
My application has a loop that performs multiply and add calculations.
I know some facts:
Android devices support ARMv6 and later processors
ARMv6 does not support NEON instructions
Will performance improve on ARMv6 and later if I use assembler math instructions instead of the C math operations?
UPDATE
I need this loop of math operations to run faster. Is using assembler instead of C the right way to do it?
UPDATE
I have this calculation:
Ry0 = (b0a0 * buffer[index] + b1a0 * Rx1 + b2a0 * Rx2 - a1a0 * Ry1
- a2a0 * Ry2);
It is a biquad transfer function.
Can I make this calculation run faster with asm?
UPDATE
The buffer size is 192000.
The variables are of type float.
Compilers are pretty good at their job, so unless you KNOW what your compiler is producing, and know that you can do better, probably not.
Without knowing exactly what your code does, it would be impossible to give a better answer.
Edit: to summarize this discussion:
The FIRST step in improving performance is not to start writing assembler. The first step is to find the most efficient algorithm. Once that has been done you can look at assembler coding.
Infinite Impulse Response (IIR) functions are difficult to implement with high performance because each output element depends closely on the immediately preceding output element. This compels a latency from output to output. This dependency chain defeats common high-performance techniques (such as SIMD, strip mining, and superscalar execution).
Working in assembly initially is not a good approach to this. At some point, working in assembly may help. However, you have a fundamental issue to resolve: You cannot produce a new output until you have completed the previous output, multiplied it by a coefficient, and added the results of additional arithmetic. Therefore, the best you can do with this formulation is to produce one output as frequently as the processor can do a multiply and an add from start to finish, even supposing the other work can be done in parallel.
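To make the dependency concrete, here is the biquad from the question written as a plain loop. This is a sketch, not the asker's actual code: the struct and function names are made up, and the history variables are assumed to start at zero.

```cpp
#include <cstddef>
#include <vector>

// Coefficients already divided by a0, named as in the question.
struct BiquadCoeffs { float b0a0, b1a0, b2a0, a1a0, a2a0; };

// Direct-form biquad over a whole buffer.
inline std::vector<float> biquad(const std::vector<float>& in,
                                 const BiquadCoeffs& c) {
    std::vector<float> out(in.size());
    float Rx1 = 0.0f, Rx2 = 0.0f;  // previous two inputs
    float Ry1 = 0.0f, Ry2 = 0.0f;  // previous two outputs
    for (std::size_t i = 0; i < in.size(); ++i) {
        const float x = in[i];
        // Ry0 needs Ry1, which was only produced by the previous iteration:
        // this is the serial dependency chain.
        const float Ry0 = c.b0a0 * x + c.b1a0 * Rx1 + c.b2a0 * Rx2
                        - c.a1a0 * Ry1 - c.a2a0 * Ry2;
        Rx2 = Rx1; Rx1 = x;
        Ry2 = Ry1; Ry1 = Ry0;
        out[i] = Ry0;
    }
    return out;
}
```

The input terms could be computed for many samples at once, but the Ry1 term forces each iteration to wait for the one before it.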
It is mathematically possible to rewrite the IIR so that the output depends on other outputs and inputs further in the past, instead of the immediately previous output. This uses more arithmetic but provides a possibility of doing more of the arithmetic in parallel, thus obtaining higher throughput.
On an iPhone or other iOS device, you could simply call vDSP_deq22 in the Accelerate framework. Accelerate is an Apple library, so it is not available on Android. However, perhaps somebody has implemented something similar.
One approach is to measure how many processor cycles each output is taking (calculate many, divide the time by the number of outputs, multiply by the processor speed) and compare that to the latency, in cycles, of a multiplication followed by an addition (from the documentation for the processor model you are using). If the time taken is the same as the latency, then it is impossible to perform this arithmetic any more quickly on that processor, and you must either accept it or find an alternate solution with different math.
You might be able to gain some extra speed by taking a look at what your compiler does, but this should be the last thing you do. First take a good look at your algorithm and variable types.
Since your target is ARMv6, the first thing I would do is to switch from floating-point to fixed-point arithmetic. ARMv6 usually has no or very slow hardware floating point support. ARMv7 is usually better, but for ARM, fixed-point arithmetic is usually a lot faster than floating-point implementations.
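As a rough idea of what a fixed-point version might look like, here is a sketch that assumes 16-bit integer samples and coefficients stored in Q2.14 (so 16384 represents 1.0). Those format choices and all names are assumptions for illustration; a real filter would need care with coefficient quantization and rounding.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Q2.14 coefficients: value = integer / 16384, covering roughly [-2, 2).
constexpr std::int64_t kCoeffOne = 1 << 14;

struct FixedBiquadCoeffs { std::int32_t b0, b1, b2, a1, a2; };

// Integer-only biquad over 16-bit samples.
inline std::vector<std::int16_t> fixedBiquad(const std::vector<std::int16_t>& in,
                                             const FixedBiquadCoeffs& c) {
    std::vector<std::int16_t> out(in.size());
    std::int32_t x1 = 0, x2 = 0, y1 = 0, y2 = 0;  // filter history
    for (std::size_t i = 0; i < in.size(); ++i) {
        const std::int32_t x0 = in[i];
        // Accumulate in 64 bits so the sum of products cannot overflow.
        const std::int64_t acc =
            static_cast<std::int64_t>(c.b0) * x0 +
            static_cast<std::int64_t>(c.b1) * x1 +
            static_cast<std::int64_t>(c.b2) * x2 -
            static_cast<std::int64_t>(c.a1) * y1 -
            static_cast<std::int64_t>(c.a2) * y2;
        // Remove the Q2.14 scale, then saturate to the 16-bit sample range.
        const std::int64_t y0 = std::min<std::int64_t>(
            32767, std::max<std::int64_t>(-32768, acc / kCoeffOne));
        x2 = x1; x1 = x0;
        y2 = y1; y1 = static_cast<std::int32_t>(y0);
        out[i] = static_cast<std::int16_t>(y0);
    }
    return out;
}
```

Whether this beats the float version depends on the device, so it is worth measuring both.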
Android supports ARMv5TE and ARMv7-A. Read NDK docs about supported CPU ARCHs & ABIs available at $NDK/docs/CPU-ARCH-ABIS.html.
ARMv5TE is the default and doesn't give you any hardware floating point support; see the Android NDK page for more about this. You should add ARMv7-A support to your application to get the best support from the hardware.
ARMv6 is somewhere in between and if you want to target these devices you must do some Android.mk trickery.
Nowadays, if you are coding a modern app, you'll probably be targeting newer devices with an ARMv7-A processor that has VFPv3 and NEON. If you just want to support ARMv6, you should use ARMv5TE to cover those devices. If you want to take advantage of the little bit extra provided by ARMv6, then you'll lose ARMv5TE support completely.
I compiled your line of code with NDK r8c, and it produced the binary below. The best that ARM VFP offers for your statement is the multiply-accumulate instruction, fmac, and the compiler emits it easily.
00000000 <f>:
0: ee607aa2 fmuls s15, s1, s5
4: ed9f7a05 flds s14, [pc, #20]
8: ee407a07 fmacs s15, s0, s14
c: ee417a03 fmacs s15, s2, s6
10: ee417ae3 fnmacs s15, s3, s7
14: eeb00a67 fcpys s0, s15
18: ee020a44 fnmacs s0, s4, s8
1c: e12fff1e bx lr
It might be better to divide your statement into a few chunks to make dual issue possible, but you can do this in C.
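One way to split the statement, as a sketch: compute the input-only terms and the previous-output terms as two independent sums, then combine them. The struct and function names are hypothetical, and whether this actually helps depends on the compiler and the core, so check the generated code. Also note that regrouping float operations can change the result in the last bits.

```cpp
// Coefficients already divided by a0, named as in the question.
struct Coeffs { float b0a0, b1a0, b2a0, a1a0, a2a0; };

// One biquad step, split into two sums that do not depend on each other.
inline float biquadStepSplit(const Coeffs& c, float x,
                             float Rx1, float Rx2, float Ry1, float Ry2) {
    // Depends only on inputs.
    const float feedForward = c.b0a0 * x + c.b1a0 * Rx1 + c.b2a0 * Rx2;
    // Depends only on previous outputs.
    const float feedBack = c.a1a0 * Ry1 + c.a2a0 * Ry2;
    return feedForward - feedBack;
}
```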
You can't create miracles just by using assembly, but the compiler can also produce very poor code. GCC for ARM is not as good as GCC for Intel, especially at vectorization and NEON usage. It is always worth checking what the compiler produces if you need high-performance routines.
I've been learning up a little on the cpu features and stumbled upon NEON.
From what I've read, it looks like NEON requires specific programming to use it, but is this completely true, or do the CPUs that have this feature still find ways to utilize it and speed up media processing for some applications even though there is no specific code for it?
There are a number of ways to make use of the NEON instructions. Some of them are:
Libraries. There is a good chance that your memcpy is handcrafted using NEON. Music/video playback libs in the API use NEON and/or the GPU for acceleration. Also, there are third-party libs that use it. FastCV from Qualcomm is a good example.
Compiler-issued instructions. Some compilers, when provided with the correct options, will issue NEON instructions. Most compilers will use NEON for float operations, but not vectorize them. They will use the unit as a single-data unit, just because it is fast and convenient. There are some vectorization capabilities in GCC and the ARM compiler, but they are really limited in scope and results.
Hand-coded C with intrinsics http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html It is probably the best way to get started in the NEON world.
Hand-coded assembler. This seems to be the best, if you want to achieve max performance. It also requires a good deal of effort and CS knowledge.
Last but not least, you can use NEON by just downloading apps that use it. Your favourite music player and your camera app put the NEON unit in your smartphone to good use.
Conclusion: NEON has a lot of usages, but it is only used if the code specifically contains NEON instructions. More technically, as #pst said, it must be targeted by a piece of code.
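As a small taste of the intrinsics route, here is a sketch that adds two float arrays four lanes at a time. The intrinsics (vld1q_f32, vaddq_f32, vst1q_f32) are from arm_neon.h; the function name is made up, and the code falls back to a plain scalar loop when the compiler is not targeting NEON.

```cpp
#include <cstddef>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// out[i] = a[i] + b[i] for i in [0, n).
inline void addArrays(const float* a, const float* b, float* out,
                      std::size_t n) {
    std::size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // NEON path: process four floats per iteration.
    for (; i + 4 <= n; i += 4) {
        const float32x4_t va = vld1q_f32(a + i);
        const float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));
    }
#endif
    // Scalar path: handles the tail, or everything when NEON is unavailable.
    for (; i < n; ++i) out[i] = a[i] + b[i];
}
```

Without the NEON path compiled in, the same binary simply runs the scalar loop, which illustrates the point: the unit is only used when the instructions are actually in the code.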
I am developing an Android app using OpenGL ES for drawing, and I use the draw_texture extension as it's the fastest.
I read you have to query the string to check and see if the drawing method is supported on the phone and degrade gracefully if not. My main concern is, how common is it really to have a device which doesn't support this?
I mean, drawing textured quads (the only method standard in OpenGL) is so slow the game would hardly be enjoyable on these devices.
I'm just curious if it's worth the time to support these devices.
I don't know of an Android device that lacks the draw_texture extension, but it is highly likely that such devices exist in small numbers. It's definitely not worth dedicating much effort to supporting them, but on the other hand it is nearly trivial to switch between drawTex and quads, especially if your code supports rotated sprites.