Why does using float instead of double not improve Android performance?

Since all smartphones (at least the ones that I can find specs on) have 32-bit processors, I would imagine that using single-precision floating-point values in extensive calculations would perform significantly better than doubles. However, that doesn't seem to be the case.
Even if I avoid type casts and use the FloatMath package whenever possible, I can hardly see any improvement in performance, apart from reduced memory use, when comparing float-based methods to double-based ones.
I am currently working on a rather large, calculation-intensive sound analysis tool, which is doing several million multiplications and additions per second. Since a double-precision multiplication on a 32-bit processor takes several clock cycles vs. 1 for single precision, I was assuming the type change would be noticeable... but it isn't :-(
Is there a good explanation for this? Is it due to the way the Dalvik VM works, or what?

Floating-point units on typical CPUs perform all of their calculations in double-precision (or better) and simply round or convert to whatever the final precision is. In other words, even 32-bit CPUs have 64-bit FPUs.
Many phones have CPUs that include FPUs, but have the FPUs disabled to save power, causing the floating-point operations to be slowly emulated (in which case 32-bit floats would be an advantage).
There are also vector units that have 32-bit FPUs, causing 64-bit floating-point operations to take longer. Some SIMD units (like those that execute SSE instructions) perform 32-bit and 64-bit operations in the same amount of time, so you could do twice as many 32-bit ops at a time, but a single 32-bit op won't go any faster than a single 64-bit op.
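As a hedged illustration of that width difference using SSE intrinsics (this is a sketch, not code from the question): one 128-bit register holds four floats but only two doubles, so a single instruction does twice as much single-precision work.

#include <xmmintrin.h>  // SSE:  four floats per 128-bit register
#include <emmintrin.h>  // SSE2: two doubles per 128-bit register

// One SSE instruction adds four floats, while its double-precision
// counterpart adds only two.
void add4f(const float* a, const float* b, float* out) {
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

void add2d(const double* a, const double* b, double* out) {
    _mm_storeu_pd(out, _mm_add_pd(_mm_loadu_pd(a), _mm_loadu_pd(b)));
}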

Many, perhaps most, Android devices have no floating-point co-processor.
I am currently working on a rather large, calculation-intensive sound analysis tool, which is doing several million multiplications and additions per second.
That's not going to work very well on Android devices lacking a floating-point co-processor.
Move it into C/C++ with the NDK, then limit your targets to ARMv7, which has a floating-point co-processor.
Or, change your math to work in fixed-point mode. For example, Google Maps does not deal with decimal degrees for latitude and longitude, but rather microdegrees (10^6 times degrees), specifically so that it can do its calculations using fixed-point math.
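A minimal sketch of the microdegree idea (the helper name here is illustrative, not from any Google Maps API): store coordinates as 32-bit integer microdegrees and do all the arithmetic on integers.

#include <cmath>
#include <cstdint>
#include <cstdio>

// Convert degrees to int32 microdegrees (degrees * 10^6).
static int32_t to_microdegrees(double degrees) {
    return (int32_t)std::llround(degrees * 1000000.0); // round, don't truncate
}

int main() {
    int32_t lat = to_microdegrees(52.520008); // 52520008
    lat += 1500;                              // move 0.0015 degrees: a pure integer add
    std::printf("%d.%06d\n", lat / 1000000, lat % 1000000); // prints 52.521508
    return 0;
}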

It seems that you're using a Nexus One, which has a Scorpion core.
I believe that both single- and double-precision scalar floating point are fully pipelined in Scorpion, so although the latency of the operations may differ, the throughput is the same.
That said, I believe that Scorpion also has a SIMD unit which is capable of operating on floats, but not doubles. In theory, a program written against the NDK that takes advantage of the SIMD instructions can run substantially faster on single precision than on double precision, but only with significant effort from the programmer.
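For example, a hypothetical NDK routine using NEON intrinsics from arm_neon.h processes four floats per instruction; there is no double-precision counterpart on this class of hardware (this is a sketch, not code from the question):

#include <arm_neon.h>

// Multiply four floats at a time with NEON.
void mul4(const float* a, const float* b, float* out) {
    float32x4_t va = vld1q_f32(a);      // load 4 floats
    float32x4_t vb = vld1q_f32(b);
    vst1q_f32(out, vmulq_f32(va, vb));  // 4 multiplies in one instruction
}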

Related

Does phone CPU have separate integer and floating point compute units that can operate in parallel?

On desktop CPUs, interleaved integer and float computation (such as with float arrays: updating integer indexes while computing the array values) is faster than doing all the integer computation and then all the float computation. This is because integer ops and float ops are processed by different parts of the CPU, so they can be processed at basically the same time.
Is it the same for newer phones' CPU and ARM architecture in general?
After the x86 architecture has already been discussed in the comments, now about ARM:
Basically, this also depends on the processor model used. Most ARM processors have only two pipelines for SIMD calculations. Some instructions can only be executed on one of the two pipelines, but most do not care. This also applies to simple ALU operations such as
FADD, FSUB, FMUL for floating-point SIMD
ADD, SUB, MUL for integer SIMD
If such an addition, for example, already has a (maximum) throughput of 2 instructions per cycle, both pipelines are fully utilized. So here, simple integer instructions are just as fast as floating-point instructions. Because the throughput is already at its maximum, no speed advantage can be achieved by moving work onto these pipelines as SIMD or even SISD integer operations instead. Here I assume, of course, that there are no dependencies between the instructions.
In addition to the throughput, the latency of the instructions must also be taken into account: the integer SIMD ADD has a maximum latency of 3 cycles, while for the floating-point FADD it is 4 cycles. On the other hand, the non-SIMD ADD has only one cycle of latency. The latency indicates the number of cycles after which the result is available at the earliest. If the next instruction depends on the result of the previous one, the throughput is limited, and it can be useful to put other instructions in between that use other pipelines, for example the non-SIMD ALU ones.
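As a hedged C++ sketch of that latency point (not tied to any particular core): a single-accumulator sum serializes on the FADD latency, while independent accumulators create parallel dependency chains the pipelines can interleave.

// Note: reassociating float additions changes rounding slightly, which
// is why compilers won't do this on their own without -ffast-math.
float sum_single(const float* a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += a[i];  // each add waits on the previous one
    return s;
}

float sum_interleaved(const float* a, int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i = 0;
    for (; i + 4 <= n; i += 4) {            // four independent chains
        s0 += a[i]; s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3];
    }
    for (; i < n; ++i) s0 += a[i];          // leftover elements
    return (s0 + s1) + (s2 + s3);
}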
At least that's the case with the Cortex-A72 and Cortex-A76. With the older Cortex-A55 it's a bit more complicated. You can find information in the respective "Software Optimization Guide", for example:
Arm® Cortex®-A55 Software Optimization Guide
Arm® Cortex®-A72 Software Optimization Guide
Arm® Cortex®-A76 Software Optimization Guide
Clarification after some comments: scalar operations on SIMD registers (using s0 to s31, d0 to d31, etc.) and vector operations on them (v0 to v31) always take place on the two SIMD pipelines. Only operations on general-purpose registers (w0 to w30, wzr, wsp, x0 to x30, xzr, xsp) run on the two non-SIMD ALU pipelines I0/I1 and the M pipeline. That is why, in some cases, one ALU pipeline I0/I1 is also used for address calculation with SIMD instructions.

Can I emulate ARM NEON in an x86 C program?

I am developing some numerical software whose performance depends a lot on the numerical accuracy (i.e., floats, doubles, etc.).
I have noticed that ARM NEON does not fully comply with the IEEE 754 floating-point standard. Is there a way to emulate NEON's floating-point precision on an x86 CPU? For example, a library that emulates the NEON SIMD floating-point operations.
Probably.
I'm less familiar with SSE, but you can force many of the SSE modes to behave like NEON. This will depend on your compiler and available libraries, but see some Visual Studio FP unit control functions. This might be good enough for your requirements.
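For instance, a sketch under the assumption that the NEON deviation you care about is its flush-to-zero handling of denormals: the SSE MXCSR control intrinsics can set comparable modes. This does not reproduce every NEON quirk (default NaN behavior, for example), only the denormal handling.

#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE

void enable_neon_like_flush_to_zero() {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          // denormal results -> 0
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  // denormal inputs  -> 0
}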
Furthermore, you can use the arm_neon.h header to ensure that you are using similar intrinsics to accomplish similar things.
Finally, if you really require achieving this precision at these boundary conditions, you are going to want a good test suite to verify that you are achieving your results as intended.
Lastly, even with pure "C" code, which typically complies with IEEE 754 and uses the VFP on ARM as other commenters have mentioned, you will get different results, because floating point is a highly... irregular process, subject to the whims of optimization and order of operations. It is challenging to get results to match across different compilers, let alone hardware architectures. For example, to get highly agreeable results on Intel with gcc, it's often required to use the -ffloat-store flag if you want to compare with /fp:precise on CL/MSVS.
In the end, you may need to accept some kind of non-zero error tolerance. Trying to get to zero may be difficult, but it would be awesome to hear your results if you get there. It seems possible... but difficult.
Thanks for your answers.
In the end, I used an Android phone connected to a desktop, with certain functions running on the phone.

Do I get a performance bonus if I use ARM math assembler instructions instead of C?

I have a loop in my application in which mathematical multiply and addition calculations are executed.
I know some facts:
Android devices support ARMv6 and up processors
ARMv6 does not support NEON instructions
Will the performance of my application increase on ARMv6 and up if, instead of C math operations, I start using assembler math instructions?
UPDATE
I need to execute the loop with the math operations faster; is using assembler instead of C the right way?
UPDATE
I have this calculation:
Ry0 = (b0a0 * buffer[index] + b1a0 * Rx1 + b2a0 * Rx2 - a1a0 * Ry1
- a2a0 * Ry2);
It is a biquad transfer function.
Can I make this calculation execute faster with asm?
UPDATE
The buffer size is 192000.
The variables are of float type.
Compilers are pretty good at their job, so unless you KNOW what your compiler is producing, and know that you can do better, probably not.
Without knowing exactly what your code does, it would be impossible to give a better answer.
Edit: to summarize this discussion:
The FIRST step in improving performance is not to start writing assembler. The first step is to find the most efficient algorithm. Once that has been done you can look at assembler coding.
Infinite Impulse Response (IIR) functions are difficult to implement with high performance because each output element depends closely on the immediately preceding output element. This compels a latency from output to output. This dependency chain defeats common high-performance techniques (such as SIMD, strip mining, and superscalar execution).
Working in assembly initially is not a good approach to this. At some point, working in assembly may help. However, you have a fundamental issue to resolve: You cannot produce a new output until you have completed the previous output, multiplied it by a coefficient, and added the results of additional arithmetic. Therefore, the best you can do with this formulation is to produce one output as frequently as the processor can do a multiply and an add from start to finish, even supposing the other work can be done in parallel.
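For concreteness, here is a sketch of the loop implied by the question's formula (variable names follow the question's code; the state initialization and buffer handling are assumptions):

// Direct-form biquad. Note how Ry0 needs Ry1, the result produced one
// iteration earlier -- that loop-carried dependency is the latency bound.
void biquad(const float* in, float* out, int n,
            float b0a0, float b1a0, float b2a0, float a1a0, float a2a0) {
    float Rx1 = 0.0f, Rx2 = 0.0f;  // previous two inputs
    float Ry1 = 0.0f, Ry2 = 0.0f;  // previous two outputs
    for (int i = 0; i < n; ++i) {
        float Ry0 = b0a0 * in[i] + b1a0 * Rx1 + b2a0 * Rx2
                  - a1a0 * Ry1 - a2a0 * Ry2;
        Rx2 = Rx1; Rx1 = in[i];
        Ry2 = Ry1; Ry1 = Ry0;
        out[i] = Ry0;
    }
}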
It is mathematically possible to rewrite the IIR so that the output depends on other outputs and inputs further in the past, instead of the immediately previous output. This uses more arithmetic but provides a possibility of doing more of the arithmetic in parallel, thus obtaining higher throughput.
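As an algebraic sketch of that rewriting (writing b0, b1, b2, a1, a2 for the question's combined coefficients b0a0 through a2a0), substitute the expression for y[n-1] into y[n]:

y[n-1] = b0*x[n-1] + b1*x[n-2] + b2*x[n-3] - a1*y[n-2] - a2*y[n-3]
y[n]   = b0*x[n] + (b1 - a1*b0)*x[n-1] + (b2 - a1*b1)*x[n-2] - a1*b2*x[n-3]
         + (a1*a1 - a2)*y[n-2] + a1*a2*y[n-3]

Now y[n] depends only on outputs at least two steps back, so y[n] and y[n+1] can be computed in parallel, at the cost of extra multiplications.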
On an iPhone or other iOS device, you could simply call vDSP_deq22 in the Accelerate framework. Accelerate is an Apple library, so it is not available on Android. However, perhaps somebody has implemented something similar.
One approach is to compare how many processor cycles each output takes (calculate many outputs, divide the time by the number of outputs, multiply by the processor speed) to the combined latency, in cycles, of a multiplication followed by an addition (from the documentation for the processor model you are using). If the time taken equals that latency, then it is impossible to perform this arithmetic any more quickly on that processor, and you must either accept it or find an alternate solution with different math.
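A sketch of that measurement, reusing the biquad sketch above (CPU_HZ and the coefficient values are placeholders you must adjust for your device):

#include <chrono>
#include <cstdio>

constexpr double CPU_HZ = 600e6;  // placeholder clock rate -- adjust!

void measure_cycles_per_output(const float* in, float* out, int n) {
    auto t0 = std::chrono::high_resolution_clock::now();
    biquad(in, out, n, 0.2f, 0.2f, 0.2f, 0.2f, 0.2f);  // placeholder coefficients
    auto t1 = std::chrono::high_resolution_clock::now();
    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("%.1f cycles per output\n", secs / n * CPU_HZ);
}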
You might be able to gain some extra speed by taking a look at what your compiler does, but this should be the last thing you do. First take a good look at your algorithm and variable types.
Since your target is ARMv6, the first thing I would do is switch from floating-point to fixed-point arithmetic. ARMv6 usually has no, or very slow, hardware floating-point support. ARMv7 is usually better, but on ARM, fixed-point arithmetic is usually a lot faster than floating-point implementations.
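For instance, a minimal Q16.16 fixed-point sketch (16 integer bits, 16 fractional bits; the helper names are illustrative, not from any library):

#include <cstdint>

typedef int32_t q16_16;

static inline q16_16 to_fixed(float f)  { return (q16_16)(f * 65536.0f); }
static inline float to_float(q16_16 x)  { return x / 65536.0f; }

static inline q16_16 fixed_mul(q16_16 a, q16_16 b) {
    // Multiply in 64 bits, then shift the binary point back into place.
    return (q16_16)(((int64_t)a * b) >> 16);
}
// Addition and subtraction work directly with plain integer + and -.

The biquad's multiply-adds then become fixed_mul calls plus integer additions.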
Android supports ARMv5TE and ARMv7-A. Read the NDK docs about supported CPU architectures and ABIs, available at $NDK/docs/CPU-ARCH-ABIS.html.
ARMv5TE is the default and doesn't give you any hardware floating-point support; see the Android NDK page for more about this. You should add ARMv7-A support to your application to get the best support from the hardware.
ARMv6 is somewhere in between and if you want to target these devices you must do some Android.mk trickery.
Nowadays, if you are writing a modern app, you'll probably be targeting newer devices with an ARMv7-A processor having VFPv3 and NEON. If you just want to support ARMv6 devices as well, you should use ARMv5TE to cover them. If you want to take advantage of the little bit extra provided by ARMv6, you'll lose ARMv5TE support completely.
I compiled your simple line of code with NDK r8c, and it produces a binary like the one below. The best the ARM VFP allows for your statement is the multiply-accumulate instruction, fmacs, and the compiler can emit these easily.
00000000 <f>:
0: ee607aa2 fmuls s15, s1, s5
4: ed9f7a05 flds s14, [pc, #20]
8: ee407a07 fmacs s15, s0, s14
c: ee417a03 fmacs s15, s2, s6
10: ee417ae3 fnmacs s15, s3, s7
14: eeb00a67 fcpys s0, s15
18: ee020a44 fnmacs s0, s4, s8
1c: e12fff1e bx lr
It might be better to divide your statement into a few chunks to make dual issuing possible, but you can do this in C, as sketched below.
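A hedged sketch of that chunking (names follow the question's code): t0 and t1 do not depend on each other, so the VFP can overlap their multiply-accumulates before the final combine.

static inline float biquad_step(float x, float Rx1, float Rx2,
                                float Ry1, float Ry2,
                                float b0a0, float b1a0, float b2a0,
                                float a1a0, float a2a0) {
    float t0 = b0a0 * x + b1a0 * Rx1;    // independent chain 1
    float t1 = b2a0 * Rx2 - a1a0 * Ry1;  // independent chain 2
    return t0 + t1 - a2a0 * Ry2;
}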
You can't create miracles just by using assembly; however, the compiler can also produce terrible code. GCC on ARM is not as good as GCC on Intel, especially for vectorization and NEON usage. It is always good to check what the compiler produces if you need high-performing routines.

Speeding up floating point operations (Android ARMv6)

I'm doing some image compression in Android using native code. For various reasons, I can't use a pre-built library.
I profiled my code using the android-ndk-profiler and found that the bottleneck is -- surprisingly -- floating point operations! Here's the profile output:
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
40.37      0.44      0.44                             __addsf3
11.93      0.57      0.13     7200     0.02     0.03  EncodeBlock
 6.42      0.64      0.07   535001     0.00     0.00  BitsOut
 6.42      0.71      0.07                             __aeabi_fdiv
 6.42      0.78      0.07                             __gnu_mcount_nc
 5.50      0.84      0.06                             __aeabi_fmul
 5.50      0.90      0.06                             __floatdisf
...
I googled __addsf3 and apparently it is a software floating-point operation. Yuck. I did more research on the ARMv6 architecture, and unless I missed something, it doesn't have hardware floating-point support. So what can I do here to speed this up? Fixed point? I know that's normally done with integers, but I'm not really sure how to convert my code to do that. Is there a compiler flag I could set so it will do that? Other suggestions welcome.
Of course you can do everything with integer arithmetic only (after all, that is exactly what your program is doing right now), but whether it can be done faster really depends on what exactly you are trying to do.
Floating point is a sort of generic solution you can apply in most cases and just forget about, but it's somewhat rare that your problem really needs numbers ranging wildly from the incredibly small to the incredibly big, with 52 bits of mantissa accuracy. Supposing your computations are about graphics, with a double-precision floating-point number you can go from much less than sub-atomic scale to much more than the size of the universe... is that range really needed? The accuracy provided of course depends on the scale with FP, but what accuracy do you really need?
What are your numbers used for in your "inner loop"? Without knowing that, it is hard to say whether the computation can be made much faster or not. Almost surely it can be made faster (FP is a generic, blind solution), but the degree of gain you may hope for varies a lot. I don't know the specific implementation, but I'd expect it to be reasonably efficient (for the generic case).
You should aim at a higher logical level of optimization.
For image (de)compression based on, say, DCT or wavelet transforms, I think there is indeed no need for floating-point arithmetic: you can just consider the exact scales your numbers will be at and use integer arithmetic. Moreover, you may also have an extra degree of freedom because of the ability to produce approximate results.
See 6502's excellent answer first...
Most processors don't have FPUs because they are not needed. And when they do, for some reason they try to conform to IEEE 754, which is equally unnecessary; the cases that need any of that are quite rare. The FPU is just an integer ALU with some stuff around it to keep track of the floating point, all of which you can do yourself.
How? Let's think in decimals and dollars. We can think about $110.50, add $0.07, and get $110.57; or you could have just done everything in pennies, 11050 + 7 = 11057, and then, when you print it for a user, place a dot in the right place. That is all the FPU is doing, and that is all you need to do. This link may or may not give some insight into this: http://www.divms.uiowa.edu/~jones/bcd/divide.html
Don't blanket all ARMv6 processors that way; that is not how ARMs are categorized. Some cores have the option for an FPU, or you can add one on yourself after you buy from ARM, etc. The ARM11s, for example, are ARMv6 with the option for an FPU.
Also, even though you can keep track of the decimal point yourself, if there is a hard FPU it is possible for it to be faster than doing it yourself in fixed point. Likewise, it is possible and easy to not know how to use an FPU and get bad results, just get them faster; it is very easy to write bad floating-point code. Whether you use fixed or float, you need to keep track of the range of your numbers, and from that control where you move the point around to keep the integer math at the core within the mantissa. Which means that to use floating point effectively, you should be thinking in terms of what the integer math is doing. One very common mistake is to think that multiplies mess up your precision, when it is actually addition and subtraction that can hurt you the most.

Float or Double?

Which is faster, double or float, when performing arithmetic (+ - * / %), and is it worth just using float for memory reasons? Precision is not much of an issue.
Feel free to call me crazy for even thinking this. Just curious, as I see the number of floats I'm using is getting larger.
EDIT 1:
The only reason this is under android is because that is where I believe memory matters; I wouldn't even ask this for desktop development.
The processing speed of both types should be approximately the same on today's CPUs.
"use whichever precision is required for acceptable results."
Related questions have been asked a couple of times here on SO, here is one.
Edit:
In speed terms, there's no difference between float and double on the more modern hardware.
Please check out this article from developer.android.com.
Using double rather than float was advised by an ADT v21 lint message, due to the JIT (Just-In-Time) optimizations in Dalvik from Froyo onwards (API 8 and later).
I was using FloatMath.sin, and lint suggested Math.sin instead, with the following text under the "explain issue" context menu. It reads to me like a general message about double vs. float, not just trig-related.
"In older versions of Android, using
android.util.FloatMath was recommended for
performance reasons when operating on floats.
However, on modern hardware doubles are just as fast
as float (though they take more memory), and in
recent versions of Android, FloatMath is actually
slower than using java.lang.Math due to the way the JIT
optimizes java.lang.Math. Therefore, you should use
Math instead of FloatMath if you are only targeting
Froyo and above."
Hope this helps.
I wouldn't advise either for fast operations, but I would believe that operations on floats would be faster, as they are 32-bit vs. the 64-bit of doubles.
http://developer.android.com/training/articles/perf-tips.html#AvoidFloat
Avoid Using Floating-Point
As a rule of thumb, floating-point is about 2x slower than integer on Android-powered devices.
In speed terms, there's no difference between float and double on the more modern hardware. Space-wise, double is 2x larger. As with desktop machines, assuming space isn't an issue, you should prefer double to float.
Also, even for integers, some processors have hardware multiply but lack hardware divide. In such cases, integer division and modulus operations are performed in software—something to think about if you're designing a hash table or doing lots of math.
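A common workaround sketch for that hash-table case: size the table to a power of two so the modulus becomes a bit mask, avoiding the software divide entirely.

// hash % size == hash & (size - 1), valid only when size is 2^k.
unsigned index_of(unsigned hash, unsigned pow2_size) {
    return hash & (pow2_size - 1);
}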
a float is 32 bits or 4 bytes
a double is 64 bits or 8 bytes
so yeah, floats are half the size, according to the Sun Java certification book.
In speed terms, there's no difference between float and double on the more modern hardware.
Very cheap devices seem to have a limited FPU where float is faster than double. I tested on a CMX device that is currently marketed as one of the cheapest tablets in the world:
"float" test code takes 4.7 seconds
same code with "double" takes 6.6 seconds
This question has been asked a couple of times ...
Yes, because the answer differs for different types of hardware. On desktop computers, double has the same speed as float. On devices without an FPU (interesting for WLAN router hacks), float is 2-5 times faster than double; and on devices with a 32-bit FPU (often found in industrial and automotive applications), even up to 100 times faster.
Please check out this article ...
The last section of the article says that you have to do time measurements on the hardware device you are going to use to be 100% sure.
The Android documentation quoted above indicates that integers are preferable for fast operations. This seems a little strange on the face of it, but the speed of an algorithm using ints vs. floats vs. doubles depends on several layers:
The JIT or VM: these will convert the mathematical operations to the host machine's native instruction set and that translation can have a large impact on performance. Since the underlying hardware can vary dramatically from platform to platform, it can be very difficult to write a VM or JIT that will emit optimal code in all cases. It is probably still best to use the JIT/VM's recommended fast type (in this case, integers) because, as the JITs and VMs get better at emitting more efficient native instructions, your high-level code should get the associated performance boosts without any modification.
The native hardware (why the first level isn't perfect): most processors nowadays have hardware floating-point units (these support floats and doubles). If such a hardware unit is present, floats/doubles can be faster than integers, unless there is also hardware integer support. Compounding the issue, most CPUs have some form of SIMD (Single Instruction, Multiple Data) support that allows operations to be vectorized if the data types are small enough (e.g., four floats fit in the same 128-bit register as two doubles, so one instruction can add four floats but only two doubles). This can allow data types that use fewer bits to be processed much faster than a double, at the expense of precision.
Optimizing for speed requires detailed knowledge of both of these levels and how they interact. Even optimizing for memory use can be tricky, because the VM can choose to represent your data with a larger footprint for other reasons: a float may occupy 8 bytes in the VM's code, though that is less likely. All of this makes optimization almost the antithesis of portability. So here again, it is better to use the VM's recommended "fast" data type, because that should result in the best performance averaged across supported devices.
This is not a bad question at all, even on desktops. Yes, they are very fast today, but if you are implementing a complicated algorithm (for example, the fast Fourier transform), even small optimizations can have an enormous impact on the algorithm's run time. In any case, the answer to your question "which is faster: floats or doubles?" is "it depends" :)
I wondered about this too and wrote a small test:
#include <iostream>
#include <chrono>

template<typename numType>
void test(void) {
    std::cout << "Size of variable: " << sizeof(numType) << std::endl;
    numType array[20000];

    auto t1 = std::chrono::high_resolution_clock::now();
    // fill array
    for (numType& number : array) {
        number = 1.0014535;
    }
    auto t2 = std::chrono::high_resolution_clock::now();
    // multiply each number with itself 10,000 times
    for (numType& number : array) {
        for (int i = 0; i < 10000; i++) {
            number *= number;
        }
    }
    auto t3 = std::chrono::high_resolution_clock::now();

    auto filltime = t2 - t1;
    auto calctime = t3 - t2;
    std::cout << "Fill time: " << filltime.count() << std::endl;
    std::cout << "Calc time: " << calctime.count() << std::endl;
}

int main(int argc, char* argv[]) {
    test<float>();
    test<double>();
}
I compiled and ran it under Ubuntu 12.04 x64 using GCC on an Intel i7 3930K processor.
These were the results:
Size of variable: 4
Fill time: 69
Calc time: 694303
Size of variable: 8
Fill time: 76
Calc time: 693363
The results were reproducible. So memory allocation for double takes slightly longer, but the actual calculation time is exactly the same.
Out of curiosity, I also compiled and ran it under Windows 7 x64 using Visual Studio 2012 in release mode on an Intel i7 920 processor.
(The unit of time is different, so don't compare the results above to these; the comparison is only valid internally.)
Size of variable: 4
Fill time: 0
Calc time: 3200183
Size of variable: 8
Fill time: 0
Calc time: 3890223
The results were reproducible.
It seems that on Windows, allocation is instant. Perhaps Linux does not actually give you memory until you use it, while Windows hands it all over to you at once, requiring fewer system calls. Or perhaps the assignment is optimized away.
The multiplication of doubles is 21.5% slower here than for floats. The difference from the previous test is likely due to the different processor (that's my best guess, at least).
