Is there any other free vector library optimized for NEON than math-neon?
I would like to take advantage of NEON in my code. I have a lot of objects and do a lot of simple vector math for physics: adding, multiplying, and taking dot products of vectors. They are 3D vectors, but if it made things much faster, 2D would be acceptable too. The question is: is it worth using NEON? For example, take 100,000 points whose movement, collisions, etc. I need to calculate. I currently use my own math code, based on inline functions. I would also like to use this hypothetical NEON library for matrices; I currently use glm for that, and it does fine, but could it be faster? The speed difference between the armeabi and armeabi-v7a ABIs in the NDK is about 30 percent in my case. Can NEON be faster than that, or is my code perhaps already translated to NEON at compile time?
You can check Eigen. It has special code paths that are enabled when NEON instruction support is turned on.
Like someone else mentioned, you should look into Eigen, it is probably good enough for you. But if you want full performance (much better than 30% gain, more like 300% gain), you should use NEON code yourself and make sure your entire inner loop is written completely with NEON (not any CPU or VFP code).
If you just NEON optimize part of your loop instead of the entire loop, you get major penalties and so the NEON code is perhaps just 30% faster or perhaps even slower than regular C code. But a full NEON loop can often give you 300% - 2000% speedup!
If you are developing for an ARM Cortex-A9 then NEON C Intrinsics should be good enough, but for ARM Cortex-A8 devices you usually need NEON Assembly code to get full performance. I give some more info on how to NEON optimize your whole loop at "http://www.shervinemami.info/armAssembly.html"
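As a rough illustration of keeping the whole inner loop in NEON, here is a minimal sketch using C intrinsics. The function name `integrate_positions` and the flat array layout are assumptions made for this example, not something from the answers above; on targets without NEON the code falls back to the plain scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* pos[i] += vel[i] * dt for n floats (vector components stored contiguously).
 * Hypothetical helper, shown only to illustrate an all-NEON inner loop. */
void integrate_positions(float *pos, const float *vel, float dt, size_t n)
{
    assert(pos != NULL && vel != NULL);
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Load, multiply-accumulate and store all happen in NEON registers. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t p = vld1q_f32(pos + i);
        float32x4_t v = vld1q_f32(vel + i);
        p = vmlaq_n_f32(p, v, dt);      /* p + v * dt, four lanes at once */
        vst1q_f32(pos + i, p);
    }
#endif
    /* Scalar tail, and the whole loop when NEON is not available. */
    for (; i < n; ++i)
        pos[i] += vel[i] * dt;
}
```

Whether this beats the compiler's own output depends on the device and compiler, so it is worth timing both versions on real hardware.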
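Code can be compiled for NEON only if the target architecture supports it, that is, if it is built for armeabi-v7a. To do this, add armeabi-v7a to the APP_ABI list in your app's Application.mk file. Note that in older NDK releases armeabi-v7a on its own did not imply NEON; NEON code generation also had to be enabled per module, for example with LOCAL_ARM_NEON := true in Android.mk.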
I am developing some numerical software whose performance depends a lot on the numerical accuracy (i.e., float, double, etc.).
I have noticed that ARM NEON does not fully comply with the IEEE 754 floating-point standard. Is there a way to emulate NEON's floating-point precision on an x86 CPU? For example, a library that emulates the NEON SIMD floating-point operations.
Probably.
I'm less familiar with SSE, but you can force many of the SSE modes to behave like NEON. This will depend on your compiler and available libraries, but see some Visual Studio FP unit control functions. This might be good enough for your requirements.
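One commonly cited difference is that NEON on ARMv7 flushes subnormal (denormal) values to zero. Where changing the FP unit's mode is not an option, that one behavior can be approximated in software. The sketch below is an assumption-laden model, not a full NEON emulator: the names `flush_to_zero` and `mul_arrays_ftz` are made up for this example, and it does not model NEON's other differences such as NaN handling.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Replace a subnormal value with a zero of the same sign; leave others alone. */
static float flush_to_zero(float x)
{
    if (x != 0.0f && x > -FLT_MIN && x < FLT_MIN)
        return (x < 0.0f) ? -0.0f : 0.0f;
    return x;
}

/* out[i] = a[i] * b[i], flushing subnormal inputs and subnormal results. */
void mul_arrays_ftz(float *out, const float *a, const float *b, size_t n)
{
    assert(out != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; ++i)
        out[i] = flush_to_zero(flush_to_zero(a[i]) * flush_to_zero(b[i]));
}
```

Running the same inputs through this and through a plain multiply is a quick way to see whether subnormals matter for a given data set.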
Furthermore, you can use the arm_neon.h header to ensure that you are using similar intrinsics to accomplish similar things.
Finally, if you really require achieving this precision at these boundary conditions, you are going to want a good test suite to verify that you are achieving your results as intended.
Finally finally, even with pure "C" code, which typically complies with IEEE-754, and uses the VFP on ARM as other commenters have mentioned, you will get different results because floating point is a highly... irregular process, subject to the whim of optimization and order of operations. It is challenging to get results to match across different compilers, let alone hardware architectures. For example, to get highly agreeable results on Intel with gcc it's often required to use the -ffloat-store flag, if you want to compare with /fp:precise on CL/MSVS.
In the end, you may need to accept some kind of non-zero error tolerance. Trying to get to zero may be difficult, but it would be awesome to hear your results if you get there. It seems possible... but difficult.
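A non-zero tolerance is often expressed in units in the last place (ULPs). The following is a minimal sketch of such a comparison for 32-bit floats; the function name `nearly_equal_ulps` is an assumption for this example, and the right tolerance has to come from your own test suite.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Map a float's bit pattern to an integer that grows monotonically with the
 * float's value, so that adjacent floats differ by exactly 1. */
static int64_t ordered_bits(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    return (u & 0x80000000u) ? -(int64_t)(u & 0x7fffffffu) : (int64_t)u;
}

/* Returns 1 if a and b are at most max_ulps representable floats apart.
 * NaN never compares equal; +0 and -0 are treated as identical. */
int nearly_equal_ulps(float a, float b, int64_t max_ulps)
{
    assert(max_ulps >= 0);
    if (a != a || b != b)
        return 0;
    int64_t d = ordered_bits(a) - ordered_bits(b);
    if (d < 0)
        d = -d;
    return d <= max_ulps;
}
```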
Thanks for your answers.
In the end, I used an Android phone connected to a desktop, with certain functions running on the phone.
Are there any tools, like OpenGL ES fragment shaders, for drawing on an Android canvas/bitmap? I need to calculate the color of every pixel depending on its position, but working with the bitmap as an array is very slow. I can't use OpenGL ES, because the result I need is a bitmap.
Thanks.
It looks like you want to offload some heavy pixel manipulations to the GPU.
On Android, you have two major options:
OpenCL, but OpenCL support on Android is cumbersome
RenderScript
But don't underestimate the power that a CPU has, when you use vectorization instructions well. This might require hand-coding the vectorized loop using NEON intrinsics, but it will be worth it. Note that the performance issues mentioned in this last link are all resolved.
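Before vectorizing, it may be enough to move the per-pixel loop into native code that writes straight into the pixel buffer. The sketch below assumes a 32-bit RGBA buffer of the kind `AndroidBitmap_lockPixels` from `<android/bitmap.h>` hands back; the function name `fill_by_position`, the gradient formula, and the stride given in pixels rather than bytes are all choices made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack 8-bit channels so that R is the lowest byte (RGBA order in memory
 * on a little-endian device). */
static uint32_t pack_rgba(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    return (uint32_t)r | ((uint32_t)g << 8) | ((uint32_t)b << 16) | ((uint32_t)a << 24);
}

/* Color each pixel from its position: red grows with x, green with y. */
void fill_by_position(uint32_t *pixels, int width, int height, size_t stride_px)
{
    assert(pixels != NULL && width > 0 && height > 0 && stride_px >= (size_t)width);
    for (int y = 0; y < height; ++y) {
        uint32_t *row = pixels + (size_t)y * stride_px;
        uint8_t g = (uint8_t)((y * 255) / (height > 1 ? height - 1 : 1));
        for (int x = 0; x < width; ++x) {
            uint8_t r = (uint8_t)((x * 255) / (width > 1 ? width - 1 : 1));
            row[x] = pack_rgba(r, g, 0, 255);
        }
    }
}
```

The row-by-row structure keeps memory access sequential, which is also the shape a later NEON version of the inner loop would need.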
I have a loop in my application that performs multiply and add calculations.
I know some facts:
Android devices support ARMv6 and newer processors
ARMv6 does not support NEON instructions
Will I improve the application's performance on ARMv6 and up if I use assembler math instructions instead of C math operations?
UPDATE
I need the loop with the math operations to run faster. Is using assembler instead of C the right way?
UPDATE
I have this calculation:
Ry0 = (b0a0 * buffer[index] + b1a0 * Rx1 + b2a0 * Rx2 - a1a0 * Ry1
- a2a0 * Ry2);
It is a biquad transfer function.
Can I make this calculation run faster with asm?
UPDATE
The buffer size is 192000
The variables are of type float
Compilers are pretty good at their job, so unless you KNOW what your compiler is producing, and know that you can do better, probably not.
Without knowing exactly what your code does, it would be impossible to give a better answer.
Edit: to summarize this discussion:
The FIRST step in improving performance is not to start writing assembler. The first step is to find the most efficient algorithm. Once that has been done you can look at assembler coding.
Infinite Impulse Response (IIR) functions are difficult to implement with high performance because each output element depends closely on the immediately preceding output element. This compels a latency from output to output. This dependency chain defeats common high-performance techniques (such as SIMD, strip mining, and superscalar execution).
Working in assembly initially is not a good approach to this. At some point, working in assembly may help. However, you have a fundamental issue to resolve: You cannot produce a new output until you have completed the previous output, multiplied it by a coefficient, and added the results of additional arithmetic. Therefore, the best you can do with this formulation is to produce one output as frequently as the processor can do a multiply and an add from start to finish, even supposing the other work can be done in parallel.
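To make the dependency concrete, here is a minimal sketch of the loop around the statement from the question, reusing its coefficient and state names. The `Biquad` struct and the function name `biquad_process` are assumptions for this example. Note how each `Ry0` needs `Ry1`, the output computed one iteration earlier.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    float b0a0, b1a0, b2a0, a1a0, a2a0;  /* coefficients, already divided by a0 */
    float Rx1, Rx2, Ry1, Ry2;            /* previous two inputs and outputs */
} Biquad;

/* Filter n samples from in to out; state is kept in *f between calls. */
void biquad_process(Biquad *f, const float *in, float *out, size_t n)
{
    assert(f != NULL && in != NULL && out != NULL);
    float Rx1 = f->Rx1, Rx2 = f->Rx2, Ry1 = f->Ry1, Ry2 = f->Ry2;
    for (size_t i = 0; i < n; ++i) {
        float Rx0 = in[i];
        /* Ry0 cannot start until Ry1 from the previous iteration is ready. */
        float Ry0 = f->b0a0 * Rx0 + f->b1a0 * Rx1 + f->b2a0 * Rx2
                  - f->a1a0 * Ry1 - f->a2a0 * Ry2;
        Rx2 = Rx1; Rx1 = Rx0;
        Ry2 = Ry1; Ry1 = Ry0;
        out[i] = Ry0;
    }
    f->Rx1 = Rx1; f->Rx2 = Rx2; f->Ry1 = Ry1; f->Ry2 = Ry2;
}
```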
It is mathematically possible to rewrite the IIR so that the output depends on other outputs and inputs further in the past, instead of the immediately previous output. This uses more arithmetic but provides a possibility of doing more of the arithmetic in parallel, thus obtaining higher throughput.
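One way to do such a rewrite, sketched here under stated assumptions: write the feed-forward part as f[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2], so y[n] = f[n] - a1*y[n-1] - a2*y[n-2], then substitute the same formula for y[n-1]. That gives y[n] = f[n] - a1*f[n-1] + (a1*a1 - a2)*y[n-2] + a1*a2*y[n-3], which no longer uses y[n-1]. The function name `biquad_lookahead` and the zero initial state are choices made for this example, and rounding (and possibly stability margins) can differ from the direct form.

```c
#include <assert.h>
#include <stddef.h>

/* One-step look-ahead biquad: each output depends on outputs two and three
 * samples back, not on the immediately preceding one. Starts from zero state. */
void biquad_lookahead(float b0, float b1, float b2, float a1, float a2,
                      const float *in, float *out, size_t n)
{
    assert(in != NULL && out != NULL);
    const float k2 = a1 * a1 - a2;   /* weight of y[n-2] */
    const float k3 = a1 * a2;        /* weight of y[n-3] */
    float x1 = 0.0f, x2 = 0.0f, f1 = 0.0f;
    float yd1 = 0.0f, yd2 = 0.0f, yd3 = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float x0 = in[i];
        float f0 = b0 * x0 + b1 * x1 + b2 * x2;          /* inputs only */
        float y0 = f0 - a1 * f1 + k2 * yd2 + k3 * yd3;   /* no y[n-1] term */
        out[i] = y0;
        x2 = x1; x1 = x0; f1 = f0;
        yd3 = yd2; yd2 = yd1; yd1 = y0;
    }
}
```

With the y[n-1] term gone, two consecutive outputs can in principle be computed side by side, at the cost of the extra multiplies.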
On an iPhone or other iOS device, you could simply call vDSP_deq22 in the Accelerate framework. Accelerate is an Apple library, so it is not available on Android. However, perhaps somebody has implemented something similar.
One approach is to measure how many processor cycles each output takes (calculate many outputs, divide the elapsed time by the number of outputs, and multiply by the processor's clock frequency) and compare that to the latency, in cycles, of a multiplication followed by an addition (from the documentation for the processor model you are using). If the time taken is the same as the latency, then it is impossible to perform this arithmetic any more quickly on that processor, and you must either accept it or find an alternate solution with different math.
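The arithmetic of that measurement is small enough to write down. In this sketch the function names and the `slack` parameter are assumptions for illustration; the elapsed time comes from your own timer and the chain latency from the processor's documentation.

```c
#include <assert.h>
#include <stddef.h>

/* Average cycles spent per output: elapsed time / outputs * clock frequency. */
double cycles_per_output(double elapsed_seconds, size_t outputs, double clock_hz)
{
    assert(outputs > 0 && elapsed_seconds >= 0.0 && clock_hz > 0.0);
    return elapsed_seconds * clock_hz / (double)outputs;
}

/* Returns 1 if the measured cost is within `slack` (e.g. 0.2 for 20%) of the
 * multiply-then-add latency, i.e. the loop is already latency-bound. */
int is_latency_bound(double measured_cycles, double chain_latency_cycles, double slack)
{
    assert(chain_latency_cycles > 0.0 && slack >= 0.0);
    return measured_cycles <= chain_latency_cycles * (1.0 + slack);
}
```

For instance, 192000 outputs in 1.92 ms on a 1 GHz core works out to about 10 cycles per output; those figures are made up for the example, not measurements.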
You might be able to gain some extra speed by taking a look at what your compiler does, but this should be the last thing you do. First take a good look at your algorithm and variable types.
Since your target is ARMv6, the first thing I would do is to switch from floating-point to fixed-point arithmetic. ARMv6 usually has no or very slow hardware floating point support. ARMv7 is usually better, but for ARM, fixed-point arithmetic is usually a lot faster than floating-point implementations.
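For readers who have not used fixed-point before, here is a minimal sketch of a Q16.16 format (16 integer bits, 16 fractional bits). The type and function names are assumptions for this example, and the format itself is just one common choice; audio filters often need a different split between integer and fractional bits.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t q16_t;                 /* Q16.16 fixed-point value */
#define Q16_ONE ((q16_t)1 << 16)

/* Convert from float; the value must fit in the 16 integer bits. */
q16_t q16_from_float(float x)
{
    assert(x > -32768.0f && x < 32768.0f);
    return (q16_t)(x * 65536.0f);
}

float q16_to_float(q16_t q)
{
    return (float)q / 65536.0f;
}

/* Multiply in 64 bits, then drop the extra 16 fractional bits.
 * Relies on arithmetic right shift of negative values, which mainstream
 * compilers provide but the C standard leaves implementation-defined. */
q16_t q16_mul(q16_t a, q16_t b)
{
    return (q16_t)(((int64_t)a * (int64_t)b) >> 16);
}
```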
Android supports ARMv5TE and ARMv7-A. Read NDK docs about supported CPU ARCHs & ABIs available at $NDK/docs/CPU-ARCH-ABIS.html.
ARMv5TE is the default and doesn't give you any hardware floating-point support; see the Android NDK page for more about this. You should add ARMv7-A support to your application to get the best support from the hardware.
ARMv6 is somewhere in between and if you want to target these devices you must do some Android.mk trickery.
Nowadays, if you are coding a modern app, you'll probably be targeting newer devices with ARMv7-A processors that have VFPv3 and NEON. If you just want to support ARMv6, you should use ARMv5TE to cover those devices. If you want to take advantage of the little bit extra that ARMv6 provides, then you'll lose ARMv5TE support completely.
I compiled your simple line of code with NDK r8c, and it produces a binary like the one below. The best that ARM VFP allows for your statement is the multiply-accumulate instruction, fmac, and the compiler can emit these easily.
00000000 <f>:
0: ee607aa2 fmuls s15, s1, s5
4: ed9f7a05 flds s14, [pc, #20]
8: ee407a07 fmacs s15, s0, s14
c: ee417a03 fmacs s15, s2, s6
10: ee417ae3 fnmacs s15, s3, s7
14: eeb00a67 fcpys s0, s15
18: ee020a44 fnmacs s0, s4, s8
1c: e12fff1e bx lr
It might be better to split your statement into a few chunks to make dual issue possible, but you can do this in C.
You can't work miracles just by using assembly; however, the compiler can also produce very poor code. GCC on ARM is not as good as GCC on Intel, especially for vectorization and NEON usage. It is always good to check what the compiler produces if you need high-performing routines.
I've been learning up a little on the cpu features and stumbled upon NEON.
From what I've read, it looks like NEON requires specific programming to use it. Is this completely true, or do CPUs that have this feature still find ways to utilize it and speed up media processing for some applications even when there is no specific code for it?
There are a number of ways to make use of the NEON instructions. Some of them are:
Libraries. There is a good chance that your memcpy is handcrafted using NEON. Music/video playback libs in the API use NEON and/or the GPU for acceleration. Also, there are third-party libs that use it. FastCV from Qualcomm is a good example.
Compiler-issued instructions. Some compilers, when provided with the correct options, will issue NEON instructions. Most compilers will use NEON for float operations, but not vectorize them. They will use the unit as a single-data unit, just because it is fast and convenient. There are some vectorization capabilities in GCC and the ARM compiler, but they are really limited in scope and results.
Hand-coded C with intrinsics http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html It is probably the best way to get started in the NEON world.
Hand-coded assembler. This seems to be the best, if you want to achieve max performance. It also requires a good deal of effort and CS knowledge.
Last but not least, you can use NEON by just downloading apps that use it. Your favourite music player and your camera app put the NEON unit in your smartphone to good use.
Conclusion: NEON has a lot of usages, but it is only used if the code specifically contains NEON instructions. More technically, as #pst said, it must be targeted by a piece of code.
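To show what "code that targets NEON" looks like with intrinsics, here is a small sketch of a dot product. The function name `dot_f32` is an assumption for this example; the NEON path is compiled only when the compiler defines `__ARM_NEON`, and otherwise the same function runs as plain C.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Sum of a[i] * b[i] over n elements. */
float dot_f32(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);
    float sum = 0.0f;
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4)
        acc = vmlaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    /* Fold the four lane sums into one scalar. */
    float32x2_t half = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    half = vpadd_f32(half, half);
    sum = vget_lane_f32(half, 0);
#endif
    /* Scalar tail, and the whole loop when NEON is not compiled in. */
    for (; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

The NEON path adds the products in a different order than the scalar loop, so results can differ in the last bits for general inputs.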
I'm trying to decide on whether to primarily use floats or ints for all 3D-related elements in my app (which is C++ for the most part). I understand that most ARM-based devices have no hardware floating point support, so I figure that any heavy lifting with floats would be noticeably slower.
However, I'm planning to prep all data for the most part (i.e. have vertex buffers where applicable and transform using matrices that don't change a lot), so I'm just stuffing data down OpenGL's throat. Can I assume that this goes more or less straight to the GPU and will as such be reasonably fast? (Btw, the minimum requirement is OpenGL ES 2.0, so that presumably excludes older 1.x-based phones.)
Also - how big is the penalty when I mix and match ints and floats? Assuming that all my geometry is just pre-built float buffers, but I use ints for matrices since those do require expensive operations like matrix multiplications, how much wrath will I incur here?
By the way, I know that I should keep my expectations low (sounds like even asking for floats on the CPU is asking for too much), but is there anything remotely like 128-bit VMX registers?
(And I'm secretly hoping that fadden is reading this question and has an awesome answer.)
Older Android devices like the G1 and MyTouch have ARMv6 CPUs without floating point support. Most newer devices, like the Droid, Nexus One, and Incredible, use ARMv7-A CPUs that do have FP hardware. If your game is really 3D-intensive, it might demand more from the 3D implementation than the older devices can provide anyway, so you need to decide what level of hardware you want to support.
If you code exclusively in Java, your app will take advantage of the FP hardware when available. If you write native code with the NDK, and select the armv5te architecture, you won't get hardware FP at all. If you select the armv7-a architecture, you will, but your app won't be available on pre-ARMv7-A devices.
OpenGL from Java should be sitting on top of "direct" byte buffers now, which are currently slow to access from Java but very fast from the native side. (I don't know much about the GL implementation though, so I can't offer much more than that.)
Some devices additionally support the NEON "Advanced SIMD" extension, which provides some fancy features beyond what the basic VFP support has. However, you must test for this at runtime if you want to use it (looks like there's sample code for this now -- see the NDK page for NDK r4b).
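A runtime test along those lines might look like the sketch below. On Android it uses `android_getCpuFamily` and `android_getCpuFeatures` from the NDK's cpufeatures helper (`<cpu-features.h>`, which has to be linked into the module); elsewhere it just reports what the compiler was told. The names `device_has_neon`, `select_kernel` and `halve_portable` are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ANDROID__)
#include <cpu-features.h>   /* NDK cpufeatures helper library */
#endif

/* Returns 1 if NEON can be used on the CPU we are running on. */
int device_has_neon(void)
{
#if defined(__ANDROID__)
    if (android_getCpuFamily() == ANDROID_CPU_FAMILY_ARM)
        return (android_getCpuFeatures() & ANDROID_CPU_ARM_FEATURE_NEON) != 0;
    return 0;
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
    return 1;
#else
    return 0;
#endif
}

typedef void (*kernel_fn)(float *data, int n);

/* Portable fallback kernel used for illustration: halves every element. */
void halve_portable(float *data, int n)
{
    for (int i = 0; i < n; ++i)
        data[i] *= 0.5f;
}

/* Use the NEON kernel only if one was supplied and the CPU reports NEON. */
kernel_fn select_kernel(kernel_fn neon_kernel, kernel_fn portable_kernel)
{
    assert(portable_kernel != NULL);
    return (neon_kernel != NULL && device_has_neon()) ? neon_kernel : portable_kernel;
}
```

The NEON kernel itself would live in a source file built with NEON enabled, so that the portable path never executes NEON instructions on a device that lacks them.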
An earlier answer has some info about the gcc flags used by the NDK for "hard" fp.
Ultimately, the answer to "fixed or float" comes down to what class of devices you want your app to run on. It's certainly easier to code for armv7-a, but you cut yourself off from a piece of the market.
In my opinion you should stick with fixed-point as much as possible.
It is not only old phones that lack floating-point support, but also new ones such as the HTC Wildfire.
Also, if you choose to require ARMv7, please note that for example the Motorola Milestone (Droid for Europe) does feature an ARMv7 CPU, but because of the way Android 2.1 has been built for this device, the device will not use your armeabi-v7a libs (and might hide your app from the Market).
I personally worked around this by detecting ARMv7 support using the new cpufeatures library provided with NDK r4b, to load some armeabi-v7a lib on demand with dlopen().