I have heard that ARM processors can switch between little-endian and big-endian. What do processors need this for? Is it used on Android phones?
Depending on the processor, it may be possible to switch endianness on the fly. Older processors boot up in one endian state and are expected to stay there; in that case, the whole design will generally be set up for either big or little endian.
The primary reason for supporting mixed-endian operation is to support networking stacks where the underlying datasets being manipulated are native big-endian. This is significant for switches/routers and mobile base-stations where the processor is running a well-defined software stack, rather than operating as a general purpose applications device.
Be aware that there are several different implementations of big-endian behaviour across the different ARM Architectures, and you need to check exactly how this works on any specific core.
You can switch endianness, but you wouldn't do that after the OS is up and running. It would only screw things up. If you were going to do it, you'd do it very early on in the boot sequence. By the time your app is running, the endianness is chosen and won't be changed.
Why would you do it? The only real reason would be if you were writing embedded software that had to deal with a lot of big-endian data, or to run a program that was written assuming big endian and never fixed to be endian agnostic. That kind of data tends to come from networking, where protocols carry multi-byte values in big-endian (network) byte order. There aren't a lot of other reasons to do it. You'll see ARM pretty much exclusively run in little-endian mode, and Android in particular runs ARM little-endian.
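If you want to verify which byte order your code actually ends up running with, here is a quick C check (a minimal sketch; nothing here is Android-specific):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t word = 0x01020304;
        /* The first byte in memory is 0x04 on little-endian, 0x01 on big-endian. */
        uint8_t first = *(const uint8_t *)&word;
        printf("%s-endian\n", first == 0x04 ? "little" : "big");
        return 0;
    }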
I am developing some numerical software whose performance depends a lot on numerical accuracy (i.e., floats, doubles, etc.).
I have noticed that ARM NEON does not fully comply with the IEEE 754 floating-point standard. Is there a way to emulate NEON's floating-point precision on an x86 CPU? For example, a library that emulates the NEON SIMD floating-point operations.
Probably.
I'm less familiar with SSE, but you can force many of the SSE modes to behave like NEON. This will depend on your compiler and available libraries, but see some Visual Studio FP unit control functions. This might be good enough for your requirements.
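For example, one concrete step is to put SSE into flush-to-zero / denormals-are-zero mode, which matches ARMv7 NEON's default handling of subnormals (a sketch using the standard SSE control macros; this covers only subnormal behavior, not every IEEE-754 deviation NEON has):

    #include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE */
    #include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */

    void configure_sse_like_neon(void)
    {
        /* NEON (ARMv7) flushes subnormal inputs and results to zero. */
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    }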
Furthermore, you can use the arm_neon.h header to ensure that you are using similar intrinsics to accomplish similar things.
Finally, if you really require achieving this precision at these boundary conditions, you are going to want a good test suite to verify that you are achieving your results as intended.
Finally finally, even with pure "C" code, which typically complies with IEEE-754 and uses the VFP on ARM as other commenters have mentioned, you will get different results, because floating point is a highly... irregular process, subject to the whims of optimization and order of operations. It is challenging to get results to match across different compilers, let alone hardware architectures. For example, to get highly agreeable results on Intel with gcc, it's often necessary to use the -ffloat-store flag if you want to compare with /fp:precise on CL/MSVS.
In the end, you may need to accept some kind of non-zero error tolerance. Trying to get to zero may be difficult, but it would be awesome to hear your results if you get there. It seems possible... but difficult.
Thanks for your answers.
In the end, I used an Android phone connected to a desktop, with certain functions running on the phone.
I have a loop in my application that executes mathematical multiply and addition calculations.
I know some facts:
Android devices support ARMv6 and later processors
ARMv6 does not support NEON instructions
Will I increase the performance of my application on ARMv6 and later if I use assembler math instructions instead of C math code?
UPDATE
I need to execute the loop with the math operations faster; is using assembler instead of C the right way?
UPDATE
I have this calculation:
Ry0 = (b0a0 * buffer[index] + b1a0 * Rx1 + b2a0 * Rx2 - a1a0 * Ry1
- a2a0 * Ry2);
It is a biquad transfer function.
Can I make this calculation execute faster with asm?
UPDATE
The buffer size is 192000.
The variables are of float type.
Compilers are pretty good at their job, so unless you KNOW what your compiler is producing, and know that you can do better, probably not.
Without knowing exactly what your code does, it would be impossible to give a better answer.
Edit: to summarize this discussion:
The FIRST step in improving performance is not to start writing assembler. The first step is to find the most efficient algorithm. Once that has been done you can look at assembler coding.
Infinite Impulse Response (IIR) functions are difficult to implement with high performance because each output element depends closely on the immediately preceding output element. This compels a latency from output to output. This dependency chain defeats common high-performance techniques (such as SIMD, strip mining, and superscalar execution).
Working in assembly initially is not a good approach to this. At some point, working in assembly may help. However, you have a fundamental issue to resolve: You cannot produce a new output until you have completed the previous output, multiplied it by a coefficient, and added the results of additional arithmetic. Therefore, the best you can do with this formulation is to produce one output as frequently as the processor can do a multiply and an add from start to finish, even supposing the other work can be done in parallel.
It is mathematically possible to rewrite the IIR so that the output depends on other outputs and inputs further in the past, instead of the immediately previous output. This uses more arithmetic but provides a possibility of doing more of the arithmetic in parallel, thus obtaining higher throughput.
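As a sketch of that idea for a first-order IIR (the biquad works the same way, just with more terms): starting from

    y[n] = b0*x[n] - a1*y[n-1]

substitute the recurrence into itself once:

    y[n] = b0*x[n] - a1*b0*x[n-1] + a1*a1*y[n-2]

Now y[n] depends on y[n-2] instead of y[n-1], so the even and odd outputs form two independent dependency chains that can be computed in parallel, at the cost of an extra multiply-add per output.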
On an iPhone or other iOS device, you could simply call vDSP_deq22 in the Accelerate framework. Accelerate is an Apple library, so it is not available on Android. However, perhaps somebody has implemented something similar.
One approach is to compare how many processor cycles each output takes (calculate many outputs, divide the time by the number of outputs, multiply by the processor speed) against the combined latency, in cycles, of a multiplication followed by an addition (from the documentation for the processor model you are using). If the time taken equals that latency, then it is impossible to perform this arithmetic any more quickly on that processor, and you must either accept it or find an alternate solution with different math.
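A minimal timing sketch of that measurement (run_biquad and CPU_HZ are placeholders for your own filter routine and your device's actual clock rate):

    #include <time.h>

    #define CPU_HZ 1.0e9                        /* assumption: substitute your clock rate */
    extern void run_biquad(float *buf, int n);  /* hypothetical: your filter loop */

    double cycles_per_output(float *buf, int n)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        run_biquad(buf, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        return secs / n * CPU_HZ;               /* cycles spent per output sample */
    }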
You might be able to gain some extra speed by taking a look at what your compiler does, but this should be the last thing you do. First take a good look at your algorithm and variable types.
Since your target is ARMv6, the first thing I would do is switch from floating-point to fixed-point arithmetic. ARMv6 devices usually have no hardware floating-point support, or very slow support. ARMv7 is usually better, but on ARM, fixed-point arithmetic is usually a lot faster than floating-point implementations.
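A minimal fixed-point sketch of the same biquad, assuming a Q14 coefficient format (the format choice and all names are illustrative; pick the Q format that fits your coefficient range):

    #include <stdint.h>

    #define QBITS 14

    typedef struct {
        int32_t b0, b1, b2, a1, a2;   /* coefficients scaled by 1 << QBITS */
        int32_t x1, x2, y1, y2;       /* delay line */
    } biquad_q14;

    static int32_t biquad_step(biquad_q14 *f, int32_t x0)
    {
        /* 64-bit accumulator avoids overflow in the sum of products. */
        int64_t acc = (int64_t)f->b0 * x0
                    + (int64_t)f->b1 * f->x1
                    + (int64_t)f->b2 * f->x2
                    - (int64_t)f->a1 * f->y1
                    - (int64_t)f->a2 * f->y2;
        int32_t y0 = (int32_t)(acc >> QBITS);
        f->x2 = f->x1;  f->x1 = x0;
        f->y2 = f->y1;  f->y1 = y0;
        return y0;
    }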
Android supports ARMv5TE and ARMv7-A. Read the NDK docs about supported CPU architectures and ABIs, available at $NDK/docs/CPU-ARCH-ABIS.html.
ARMv5TE is the default and doesn't give you any hardware floating-point support; see the Android NDK page for more about this. You should add ARMv7-A support to your application to get the best support from the hardware.
ARMv6 is somewhere in between and if you want to target these devices you must do some Android.mk trickery.
Nowadays, if you are coding a modern app, you'll probably be targeting newer devices with an ARMv7-A processor having VFPv3 and NEON. If you also want to support ARMv6 devices, you can use ARMv5TE builds to cover those. If you instead want to take advantage of the little bit extra provided by ARMv6, then you'll lose ARMv5TE support completely.
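For reference, building the same native code for both ABIs is just a matter of listing them in Application.mk; the package manager installs the best match for the device (a minimal sketch):

    # Application.mk
    APP_ABI := armeabi armeabi-v7a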
I compiled your simple line of code with NDK r8c, and it produced a binary like the one below. The best the ARM VFP allows for your statement is the multiply-accumulate instruction, fmacs, and the compiler can emit these easily.
00000000 <f>:
0: ee607aa2 fmuls s15, s1, s5
4: ed9f7a05 flds s14, [pc, #20]
8: ee407a07 fmacs s15, s0, s14
c: ee417a03 fmacs s15, s2, s6
10: ee417ae3 fnmacs s15, s3, s7
14: eeb00a67 fcpys s0, s15
18: ee020a44 fnmacs s0, s4, s8
1c: e12fff1e bx lr
It might be better to divide your statement into a few chunks to make dual issuing possible, and you can do this in C, as sketched below.
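For instance, a hand-split version that hands the compiler two shorter, independent dependency chains might look like this (illustrative; measure before trusting it):

    /* Two independent sub-expressions the scheduler can interleave.
       Note: reassociating float math can change the low bits of the result. */
    float t1 = b0a0 * buffer[index] + b1a0 * Rx1 + b2a0 * Rx2;
    float t2 = a1a0 * Ry1 + a2a0 * Ry2;
    Ry0 = t1 - t2;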
You can't create miracles just by using assembly; however, the compiler can also produce really bad code. GCC on ARM is not as good as GCC on Intel, especially at vectorization and NEON usage. It is always good to check what the compiler produces if you need high-performance routines.
I wonder if there is a penalty for running Dalvik+JIT on a multi-core ARM chip vs a single core chip?
E.g., if I disable multi-core support in my Android system build and run the entire phone on a single CPU core, will I get higher performance when running a single-threaded Java benchmark?
What is the cost of memory barriers and synchronization on a multi-core system?
I am asking because I vaguely remember seeing single-threaded benchmark scores from single-core phones vs. dual-core phones. As long as the MHz is about the same, there is no big difference between the two phones. I had expected a slowdown on the dual-core phone....
The simple answer is "why don't you try it and find out?"
The complex answer is this:
There are costs to doing multicore synchronization but there are also benefits to have multiple cores. You can undoubtedly devise a pathological case where a program suffers from the additional overhead of synchronization primitives such that it is deeply affected by their performance. This is usually due to locking at too deep of a level (inside your fast loop). But in the general case, the fact that the dozen other system programs are able to get CPU time on other cores, as well as the kernel servicing interrupts and IO on them instead of interrupting your process, are likely to greatly overwhelm the penalty incurred by MP synchronization.
In answer to your question, a DSB can take dozens to hundreds of cycles, and a DMB is typically somewhat cheaper, being a strictly weaker barrier. Depending on the implementation, exclusive load-store instructions can be very fast or very slow. WFE can consume several microseconds, though it shouldn't be needed if you are not experiencing contention.
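For concreteness, an explicit barrier on ARMv7 in GCC inline assembly looks like this (a sketch; this is the kind of instruction the synchronization primitives above boil down to):

    /* ARMv7 data memory barrier via GCC inline assembly. */
    static inline void data_memory_barrier(void)
    {
        __asm__ __volatile__("dmb" ::: "memory");
    }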
Background: http://developer.android.com/training/articles/smp.html
Dalvik built for SMP does have additional overhead. The Java Memory Model requires that certain guarantees be enforced, which means issuing additional memory barriers, particularly when dealing with volatile fields and immutable objects.
Whether or not the added overhead will be noticeable depends on what exactly you're doing and what device you're on, but generally speaking it's unlikely you'll notice it unless you're running a targeted benchmark.
If you build for UP and run Dalvik on a device with multiple cores, you may see flaky behavior -- see the "SMP failure example" appendix in the doc referenced above.
Since all smartphones (at least the ones I can find specs on) have 32-bit processors, I would imagine that using single-precision floating-point values in extensive calculations would perform significantly better than doubles. However, that doesn't seem to be the case.
Even if I avoid type casts and use the FloatMath package whenever possible, I can hardly see any improvement in performance except in memory use, when comparing float-based methods to double-based ones.
I am currently working on a rather large, calculation intensive sound analysis tool, which is doing several million multiplications and additions per second. Since a double precision multiplication on a 32-bit processor takes several clock cycles vs. 1 for single precision, I was assuming the type change would be noticeable... But it isn't :-(
Is there a good explanation for this? Is it due to the way the Dalvik VM works, or what?
Floating-point units on typical CPUs perform all of their calculations in double-precision (or better) and simply round or convert to whatever the final precision is. In other words, even 32-bit CPUs have 64-bit FPUs.
Many phones have CPUs that include FPUs, but have the FPUs disabled to save power, causing the floating-point operations to be slowly emulated (in which case 32-bit floats would be an advantage).
There are also vector units that have 32-bit FPUs, causing 64-bit floating-point operations to take longer. Some SIMD units (like those that execute SSE instructions) perform 32-bit and 64-bit operations in the same amount of time, so you could do twice as many 32-bit ops at a time, but a single 32-bit op won't go any faster than a single 64-bit op.
Many, perhaps most, Android devices have no floating-point co-processor.
I am currently working on a rather large, calculation intensive sound analysis tool, which is doing several million multiplications and additions per second.
That's not going to work very well on Android devices lacking a floating-point co-processor.
Move it into C/C++ with the NDK, then limit your targets to ARMv7-A, which has a floating-point co-processor.
Or, change your math to work in fixed-point mode. For example, Google Maps does not deal with decimal degrees for latitude and longitude, but rather microdegrees (10^6 times degrees), specifically so that it can do its calculations using fixed-point math.
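The microdegree trick in miniature (illustrative names; the point is that the runtime arithmetic stays in fast integer registers):

    #include <stdint.h>
    #include <stdio.h>

    typedef int32_t microdeg;                  /* 1e-6 of a degree */
    #define DEG_TO_UDEG(d) ((microdeg)((d) * 1000000))

    int main(void)
    {
        microdeg lat_a = DEG_TO_UDEG(37.422);  /* stored as 37422000 */
        microdeg lat_b = DEG_TO_UDEG(40.0);
        microdeg delta = lat_b - lat_a;        /* pure integer subtraction at runtime */
        printf("%d microdegrees\n", delta);
        return 0;
    }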
It seems that you're using a Nexus One, which has a Scorpion core.
I believe that both single- and double-precision scalar floating point are fully pipelined in Scorpion, so although the latency of the operations may differ, the throughput is the same.
That said, I believe that Scorpion also has a SIMD unit which is capable of operating on floats, but not doubles. In theory, a program written against the NDK that takes advantage of the SIMD instructions can run substantially faster in single precision than in double precision, but only with significant effort from the programmer.
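To illustrate what that effort looks like, here is a sketch using NEON intrinsics via the NDK (single precision only, since NEON on ARMv7 has no double-precision SIMD lanes; the function name is illustrative):

    #include <arm_neon.h>

    /* dst[i] += a[i] * b[i], four floats per iteration.
       n is assumed to be a multiple of 4 for brevity. */
    void madd4(float *dst, const float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i += 4) {
            float32x4_t va = vld1q_f32(a + i);
            float32x4_t vb = vld1q_f32(b + i);
            float32x4_t vd = vld1q_f32(dst + i);
            vst1q_f32(dst + i, vmlaq_f32(vd, va, vb));
        }
    }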
I'm trying to decide on whether to primarily use floats or ints for all 3D-related elements in my app (which is C++ for the most part). I understand that most ARM-based devices have no hardware floating point support, so I figure that any heavy lifting with floats would be noticeably slower.
However, I'm planning to prep all data for the most part (i.e. have vertex buffers where applicable and transform using matrices that don't change a lot), so I'm just stuffing data down OpenGL's throat. Can I assume that this goes more or less straight to the GPU and will as such be reasonably fast? (Btw, the minimum requirement is OpenGL ES 2.0, so that presumably excludes older 1.x-based phones.)
Also - how is the penalty when I mix and match ints and floats? Assuming that all my geometry is just pre-built float buffers, but I use ints for matrices since those do require expensive operations like matrix multiplications, how much wrath will I incur here?
By the way, I know that I should keep my expectations low (sounds like even asking for floats on the CPU is asking for too much), but is there anything remotely like 128-bit VMX registers?
(And I'm secretly hoping that fadden is reading this question and has an awesome answer.)
Older Android devices like the G1 and MyTouch have ARMv6 CPUs without floating point support. Most newer devices, like the Droid, Nexus One, and Incredible, use ARMv7-A CPUs that do have FP hardware. If your game is really 3D-intensive, it might demand more from the 3D implementation than the older devices can provide anyway, so you need to decide what level of hardware you want to support.
If you code exclusively in Java, your app will take advantage of the FP hardware when available. If you write native code with the NDK, and select the armv5te architecture, you won't get hardware FP at all. If you select the armv7-a architecture, you will, but your app won't be available on pre-ARMv7-A devices.
OpenGL from Java should be sitting on top of "direct" byte buffers now, which are currently slow to access from Java but very fast from the native side. (I don't know much about the GL implementation though, so I can't offer much more than that.)
Some devices additionally support the NEON "Advanced SIMD" extension, which provides some fancy features beyond what the basic VFP support has. However, you must test for this at runtime if you want to use it (looks like there's sample code for this now -- see the NDK page for NDK r4b).
An earlier answer has some info about the gcc flags used by the NDK for "hard" fp.
Ultimately, the answer to "fixed or float" comes down to what class of devices you want your app to run on. It's certainly easier to code for armv7-a, but you cut yourself off from a piece of the market.
In my opinion you should stick with fixed-point as much as possible.
Not only do old phones lack floating-point support; some new ones, such as the HTC Wildfire, do too.
Also, if you choose to require ARMv7, note that, for example, the Motorola Milestone (the Droid for Europe) does feature an ARMv7 CPU, but because of the way Android 2.1 was built for this device, it will not use your armeabi-v7a libs (and might hide your app from the Market).
I personally worked around this by detecting ARMv7 support using the new cpufeatures library provided with NDK r4b, to load some armeabi-v7a lib on demand with dlopen().
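A sketch of that detect-then-dlopen() approach with the cpufeatures library (the library names are illustrative):

    #include <cpu-features.h>
    #include <dlfcn.h>

    void *load_best_math_lib(void)
    {
        /* Probe the CPU at runtime with the NDK's cpufeatures library. */
        if (android_getCpuFamily() == ANDROID_CPU_FAMILY_ARM &&
            (android_getCpuFeatures() & ANDROID_CPU_ARM_FEATURE_ARMv7)) {
            void *h = dlopen("libmath_v7a.so", RTLD_NOW);
            if (h) return h;
        }
        return dlopen("libmath_v5.so", RTLD_NOW);   /* ARMv5TE fallback */
    }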