I'm doing some image compression in Android using native code. For various reasons, I can't use a pre-built library.
I profiled my code using the android-ndk-profiler and found that the bottleneck is -- surprisingly -- floating point operations! Here's the profile output:
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
40.37 0.44 0.44 __addsf3
11.93 0.57 0.13 7200 0.02 0.03 EncodeBlock
6.42 0.64 0.07 535001 0.00 0.00 BitsOut
6.42 0.71 0.07 __aeabi_fdiv
6.42 0.78 0.07 __gnu_mcount_nc
5.50 0.84 0.06 __aeabi_fmul
5.50 0.90 0.06 __floatdisf
...
I googled __addsf3 and apparently it is a software floating point operation. Yuck. I did more research on the ARMv6 architecture core, and unless I missed something, it doesn't have hardware floating point support. So what can I do here to speed this up? Fixed-point? I know that's normally done with integers, but I'm not really sure how to convert my code to do that. Is there a compiler flag I could set so it will do that? Other suggestions welcome.
Of course you can do anything with integer arithmetic only (after all is exactly what you program is doing right now) but if it can be done faster or not really depends on what exactly you are trying to do.
Floating point is sort of a generic solution can you can apply in most cases and just forget about it, but it's somewhat rare that your problem really needs numbers ranging wildly from the incredibly small to the incredibly big and with 52 bits of mantissa accuracy. Supposing your computations are about graphics with a double precision floating point number you can go from much less than sub-atomic scale to much more than the size of the universe... is it really that range needed? Accuracy provided of course depends on the scale with FP, but what is the accuracy you really need?
What are your numbers used for in your "inner loop"? Without knowing that is hard to say if the computation can be made faster by much or not. Almost surely it can be made faster (FP is a generic blind solution) but the degree of gain you may hope in varies a lot. I don't know the specific implementation but I'd expect it to be reasonably efficient (for the generic case).
You should aim at an higher logical level of optimization.
For image (de)compression based on say DCT or wavelet transform I think that indeed there is no need of floating point arithmetic: you can just consider the exact scales your number will be and use integer arithmetic. Moreover may be you also have an extra degree of freedom because of the ability of produce approximate results.
See 6502's excellent answer first...
Most processors dont have fpus because they are not needed. And when they do for some reason they try to conform to IEEE754 which is equally unnecessary, the cases that need any of that are quite rare. The fpu is just an integer alu with some stuff around it to keep track of the floating point, all of which you can do yourself.
How? Lets think decimals and dollars we can think about $110.50 and adding $0.07 and getting $110.57 or you could have just done everything in pennies, 11050 + 7 = 11057, then when you print it for a user place a dot in the right place. That is all the fpu is doing, and that is all you need to do. this link may or may not give some insight into this http://www.divms.uiowa.edu/~jones/bcd/divide.html
Dont blanket all ARMv6 processors that way, that is not how ARMs are categorized. Some cores have the option for an FPU or you can add one on yourself after you buy from ARM, etc. the ARM11's are ARMv6 with the option for an fpu for example.
Also, just because you can keep track of the decimal point yourself, if there is a hard fpu it is possible to have it be faster than doing it yourself in fixed point. Likewise it is possible and easy to not know how to use an fpu and get bad results, just get them faster. Very easy to write bad floating point code. Whether you use fixed or float you need to keep track of the range of your numbers and from that control where you move the point around to keep the integer math at the core within the mantissa. Which means to use floating point effectively you should be thinking in terms of what the integer math is doing. One very common mistake is to think that multiplies mess up your precision, when it is actually addition and subtraction that can hurt you the most.
Related
I was reading this great article talking about how to build more efficient Android apps:http://blog.azoft.com/android-application-development-tips/.
Those tips are really helpful. But I don't quite understand this one:
"Since the calculation of a floating point requires lots of battery power, you might consider using microdegrees for bulk geo math and caching values when performing DPI tasks with DisplayMetrics."
Why calculating a floating point requires lots of battery power please?
"Lots" is a bit of a hyperbole. If you are doing multiple seconds of floating point calculation, it will be more battery-intensive than the equivalent integer math, but the occasional multiply won't hurt. Unless you know you have math heavy operations, I wouldn't worry about it. To put a number to it, you are looking at ~1 mAh per billion operations (typical).
As for why, most integer operations execute in fewer than 4 cycles, while single precision floating point division can hit 96 cycles. Further, in some cases, a floating-point coprocessor may be used, which will draw additional power since it may be shut down when not in use to save battery.
See the ARM9 Instruction Cycle Count Summary for details.
I've just started trying to optimised some android code using NEON. I'm having a few issues, however. The main issue is that I really can't work out how to do a quick 16-bit to float conversion.
I see its possible to convert multiple 32-bit ints to float in 1 SIMD instruction using vcvt.s32.f32. However how do I convert a set of 4 S16s to 4 S32s? I assume it has something to do with the VUZP instruction but I cannot figure out how...
Equally I see that its possible to use VCVT.s16.f32 to convert 1 16-bit to a float at a time but while this is helpful it seems very wasteful not to be able to do it using SIMD.
I've written assembler on many different platforms over the years but I find the ARM documentation completely unfathomable for some reason.
As such any help would be HUGELY appreciated.
Also is there any way to get the throughput and latency figures for the NEON unit?
Thanks in advance!
If no other computation is to be done along with the conversion from 16bit integer to 32bit integer you can go for uint32x4_t = vmovl_u16 (uint16x4_t)
If any simple addition or multiplication etc is being performed before the conversion, you can combine them in a single instruction like int32x4_t = vmull_u16 (int16x4_t, int16x4_t) or int32x4_t = vaddl_u16 (int16x4_t, int16x4_t) etc and thus saving some amount of cycles.
Elaborating a small bit on my comment: you want to "widen" the 4 16-bit registers to 4 32-bit integers before converting to 4 32-bit floats. Looking at the instruction set I don't think there are any faster conversion paths, but I could easily be wrong.
The direct method is to use vaddl.s16 with a second operand of four zeros, but unless you're only doing conversion you can often combine the conversion with a previous operation. E.g. if you're multiplying two int16x4 registers you can use vmull.s16 to get 32-bit output directly rather than first multiplying and widening later (provided you're not depending on any truncating behavior).
why use vaddl wasting cycles initializing a valuable register with 0?
vmovl.s16 q0, d1
then convert q0
that will do.
My question is :
Is it absolutely necessary to convert them to float? NEON is much faster doing integer operations than float. (both execution and pipeline) Therefore, fixed-point operations will be more appropriate in most cases thanks to the powerful long, wide, narrow models combined with arithmetic instructions and automatic round/saturation options.
PS : strange, I think ARM's PDF to be the best around.
I am testing corner cases on the pow call(#include <math.h>), specifically pow(-1, Inf).
On my desktop (Ubuntu) I get the result 1.0, this is in accordance with the 2008 IEEE floating point specification.
I run the same test when running the Android Gingerbread kernel and I get NaN returned.
I have looked around and can see that there is indeed many implementations of pow in the standard libraries for different platforms and in the case pow(-1, Inf) they are coded to produce different results.
The question is which one should be deemed correct? Any Ideas or thoughts?
I apologize if I am posting on the wrong forum, I followed the link from the android developer resources and ended up here.
The C standard is perfectly clear on this point (§F.9.4.4); there's no room for "ideas or thoughts":
pow(−1, ±∞) returns 1.
Annex F applies only if an implementation defines __STDC_IEC_559__, but there is no question that 1.0 is the right answer.
I suspect that this is a Java-ism that has leaked over into the NDK. (Java defines pow(-1,infinity) to be NaN):
If the absolute value of the first argument equals 1 and the second argument is infinite, then the result is NaN.
Edit:
Since Matteo objects that this "makes no sense", I'll offer a few sentences of explanation for why the committee made this choice. Although lim_{n->inf} (-1)^n does not exist in the real numbers, we must remember that floating-point numbers are not real numbers, and in fact, for all sufficiently large floating-point numbers y, pow(-1,y) is +1. This is because all sufficiently large floating-point numbers are even integers. From this perspective, it is quite reasonable to define pow(-1,infinity) to be +1, and this turns out to actually lead to more useful behavior in some floating-point computations.
There are a surprising number of extremely competent mathematicians (as well as very skilled programmers and compiler writers) involved with both the C and the IEEE-754 committees, and they do not make these decisions flippantly. Every standard has bugs, but this is not one of them.
I am working on an application where I would like to track the position of a mobile user inside a building where GPS is unavailable. The user starts at a well known fixed location (accurate to within 5 centimeters), at which point the accelerometer in the phone is to be activated to track any further movements with respect to that fixed location. My question is, in current generation smart phones (iphones, android phones, etc), how accurately can one expect to be able to track somebodies position based on the accelerometer these phones generally come equip with?
Specific examples would be good, such as "If I move 50 meters X from the starting point, 35 meters Y from the starting point and 5 meters Z from the starting point, I can expect my location to be approximated to within +/- 80 centimeters on most current smart phones", or whatever.
I have only a superficial understanding of techniques like Kalman filters to correct for drift, though if such techniques are relevant to my application and someone wants to describe the quality of the corrections I might get from such techniques, that would be a plus.
If you integrate the accelerometer values twice you get position but the error is horrible. It is useless in practice.
Here is an explanation why (Google Tech Talk) at 23:20.
I answered a similar question.
I don't know if this thread is still open or even if you are still attempting this approach, but I could at least give an input into this, considering I tried the same thing.
As Ali said.... it's horrible! the smallest measurement error in accelerometers turn out to be rediculess after double integration. And due to constant increase and decrease in acceleration while walking (with each foot step in fact), this error quickly accumulates over time.
Sorry for the bad news. I also didn't want to believe it, till trying it self... filtering out unwanted measurements also doesn't work.
I have another approach possibly plausible, if you're interested in proceeding with your project. (approach which I followed for my thesis for my computer engineering degree)... through image processing!
You basically follow the theory for optical mice. Optical flow, or as called by a view, Ego-Motion. The image processing algorithms implemented in Androids NDK. Even implemented OpenCV through the NDK to simplify algorithms. You convert images to grayscale (compensating for different light entensities), then implement thresholding, image enhancement, on the images (to compensate for images getting blurred while walking), then corner detection (increase accuracy for total result estimations), then template matching which does the actual comparing between image frames and estimates actual displacement in amount of pixels.
You then go through trial and error to estimate which amount of pixels represents which distance, and multiply with that value to convert pixel displacement into actual displacement. This works up till a certain movement speed though, the real problem being camera images still getting too blurred for accurate comparisons due to walking. This can be improved by setting camera shutterspeeds, or ISO (I'm still playing around with this).
So hope this helps... otherwise google for Egomotion for real-time applications. Eventually you'll get the right stuff and figure out the jibberish I just explained to you.
enjoy :)
The optical approach is good, but OpenCV provides a few feature transforms. You then feature match (OpenCV provides this).
Without having a second point of reference (2 cameras) you can't reconstruct where you are directly because of depth. At best you can estimate a depth per point, assume a motion, score the assumption based on a few frames and re-guess at each depth and motion till it makes sense. Which isn't that hard to code but it isn't stable, small motions of things in the scene screw it up. I tried :)
With a second camera though, it's not that hard at all. But cell phones don't have them.
Typical phone accelerometer chips resolve +/- 2g # 12 bits providing 1024 bits over full range or 0.0643 ft/sec^2 lsb. The rate of sampling depends on clock speeds and overall configuration. Typical rates enable between one and 400 samples per second, with faster rates offering lower accuracy. Unless you mount the phone on a snail, displacement measurement likely will not work for you. You might consider using optical distance measurement instead of a phone accelerometer. Check out Panasonic device EKMB1191111.
Since all smart phones (at least the ones, that I can find specs on) have 32-bit processors, I would imagine that using single precision floating-point values in extensive calculations would perform significantly better than doubles. However, that doesn't seem to be the case.
Even if I avoid type casts, and use the FloatMath package whenever possible, I can hardly see any improvements in performance except for the memory use, when comparing float-based methods to double-based ones.
I am currently working on a rather large, calculation intensive sound analysis tool, which is doing several million multiplications and additions per second. Since a double precision multiplication on a 32-bit processor takes several clock cycles vs. 1 for single precision, I was assuming the type change would be noticeable... But it isn't :-(
Is there a good explanation for this? Is it due to the way the Dalvik VM works, or what?
Floating-point units on typical CPUs perform all of their calculations in double-precision (or better) and simply round or convert to whatever the final precision is. In other words, even 32-bit CPUs have 64-bit FPUs.
Many phones have CPUs that include FPUs, but have the FPUs disabled to save power, causing the floating-point operations to be slowly emulated (in which case 32-bit floats would be an advantage).
There are also vector units that have 32-bit FPUs, causing 64-bit floating-point operations to take longer. Some SIMD units (like those that execute SSE instructions) perform 32-bit and 64-bit operations in the same amount of time, so you could do twice as many 32-bit ops at a time, but a single 32-bit op won't go any faster than a single 64-bit op.
Many, perhaps most, Android devices have no floating-point co-processor.
I am currently working on a rather large, calculation intensive sound analysis tool, which is doing several million multiplications and additions per second.
That's not going to work very well on Android devices lacking a floating-point co-processor.
Move it into C/C++ with the NDK, then limit your targets to ARM7, which has a floating-point co-processor.
Or, change your math to work in fixed-point mode. For example, Google Maps does not deal with decimal degrees for latitude and longitude, but rather microdegrees (10^6 times degrees), specifically so that it can do its calculations using fixed-point math.
It seems that you're using a Nexus One, which has a Scorpion core.
I believe that both single- and double-precision scalar floating point are fully pipelined in Scorpion, so although the latency of the operations may differ, the throughput is the same.
That said, I believe that Scorpion also has a SIMD unit which is capable of operating on floats, but not doubles. In theory a program written against he NDK taking advantage of the SIMD instructions can run substantially faster on single-precision than on double-precision, but only with significant effort from the programmer.