I've just started trying to optimise some Android code using NEON. I'm having a few issues, however. The main issue is that I really can't work out how to do a quick 16-bit to float conversion.
I see it's possible to convert multiple 32-bit ints to floats in one SIMD instruction using VCVT.F32.S32. However, how do I convert a set of 4 S16s to 4 S32s? I assume it has something to do with the VUZP instruction, but I cannot figure out how...
Equally, I see that it's possible to use VCVT.F32.S16 to convert one 16-bit value to a float at a time, but while this is helpful it seems very wasteful not to be able to do it using SIMD.
I've written assembler on many different platforms over the years but I find the ARM documentation completely unfathomable for some reason.
As such any help would be HUGELY appreciated.
Also is there any way to get the throughput and latency figures for the NEON unit?
Thanks in advance!
If no other computation is to be done along with the conversion from 16-bit integer to 32-bit integer, you can go for uint32x4_t = vmovl_u16(uint16x4_t) (or int32x4_t = vmovl_s16(int16x4_t) for signed data).
If any simple addition or multiplication etc. is being performed before the conversion, you can combine it with the widening in a single instruction, like int32x4_t = vmull_s16(int16x4_t, int16x4_t) or int32x4_t = vaddl_s16(int16x4_t, int16x4_t), and thus save some cycles.
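To make that concrete, here is a minimal intrinsics sketch; the helper names are mine, invented for illustration, while the intrinsics themselves are the standard arm_neon.h ones:

#include <arm_neon.h>

// Widen 4 signed 16-bit values to 32 bits, then convert lane-wise to float.
static inline float32x4_t s16x4_to_f32x4(int16x4_t v)
{
    int32x4_t wide = vmovl_s16(v);   // sign-extend each lane to 32 bits
    return vcvtq_f32_s32(wide);      // per-lane int32 -> float32
}

// If a multiply comes just before the conversion, vmull_s16 widens for free.
static inline float32x4_t mul_s16x4_to_f32x4(int16x4_t a, int16x4_t b)
{
    return vcvtq_f32_s32(vmull_s16(a, b));
}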
Elaborating a bit on my comment: you want to "widen" the four 16-bit values to four 32-bit integers before converting them to four 32-bit floats. Looking at the instruction set, I don't think there are any faster conversion paths, but I could easily be wrong.
The direct method is to use vaddl.s16 with a second operand of four zeros, but unless you're only doing the conversion you can often fold it into a previous operation. E.g. if you're multiplying two int16x4 registers, you can use vmull.s16 to get 32-bit output directly rather than multiplying first and widening later (provided you're not depending on any truncating behavior).
Why use vaddl and waste cycles initializing a valuable register with zero?
vmovl.s16 q0, d1
then convert q0 with vcvt.f32.s32 q0, q0
that will do.
My question is: is it absolutely necessary to convert them to float? NEON is much faster at integer operations than at float (both in execution and in the pipeline). Therefore, fixed-point operations will be more appropriate in most cases, thanks to the powerful long, wide, and narrow models combined with the arithmetic instructions and the automatic rounding/saturation options.
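As a rough illustration of the fixed-point route, a Q15 multiply can be done eight lanes at a time with the saturating, rounding, doubling-high-half multiply; the helper name below is made up:

#include <arm_neon.h>

// Q15 * Q15 -> Q15, eight lanes at once.
// VQRDMULH doubles the product, rounds, keeps the high half and saturates,
// which is exactly a rounded Q15 multiply.
static inline int16x8_t q15_mul(int16x8_t a, int16x8_t b)
{
    return vqrdmulhq_s16(a, b);
}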
PS: strange, I find ARM's PDFs to be the best around.
Related
I'm doing some image compression in Android using native code. For various reasons, I can't use a pre-built library.
I profiled my code using the android-ndk-profiler and found that the bottleneck is -- surprisingly -- floating point operations! Here's the profile output:
Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 40.37      0.44     0.44                             __addsf3
 11.93      0.57     0.13     7200     0.02     0.03  EncodeBlock
  6.42      0.64     0.07   535001     0.00     0.00  BitsOut
  6.42      0.71     0.07                             __aeabi_fdiv
  6.42      0.78     0.07                             __gnu_mcount_nc
  5.50      0.84     0.06                             __aeabi_fmul
  5.50      0.90     0.06                             __floatdisf
 ...
I googled __addsf3 and apparently it is a software floating point operation. Yuck. I did more research on the ARMv6 architecture core, and unless I missed something, it doesn't have hardware floating point support. So what can I do here to speed this up? Fixed-point? I know that's normally done with integers, but I'm not really sure how to convert my code to do that. Is there a compiler flag I could set so it will do that? Other suggestions welcome.
Of course you can do anything with integer arithmetic only (after all, that is exactly what your program is doing right now), but whether it can be done faster really depends on what exactly you are trying to do.
Floating point is sort of a generic solution you can apply in most cases and just forget about it, but it's somewhat rare that your problem really needs numbers ranging wildly from the incredibly small to the incredibly big with 52 bits of mantissa accuracy. Supposing your computations are about graphics, with a double-precision floating-point number you can go from much less than sub-atomic scale to much more than the size of the universe... is that range really needed? The accuracy FP provides of course depends on the scale, but what accuracy do you really need?
What are your numbers used for in your "inner loop"? Without knowing that, it is hard to say whether the computation can be made much faster or not. Almost surely it can be made faster (FP is a generic, blind solution), but the degree of gain you can hope for varies a lot. I don't know the specific implementation, but I'd expect it to be reasonably efficient (for the generic case).
You should aim at a higher logical level of optimization.
For image (de)compression based on, say, a DCT or wavelet transform, I think there is indeed no need for floating-point arithmetic: you can just work out the exact scales your numbers will be at and use integer arithmetic. Moreover, you may also have an extra degree of freedom thanks to the ability to produce approximate results.
See 6502's excellent answer first...
Most processors don't have FPUs because they are not needed. And when they do, for some reason they try to conform to IEEE 754, which is equally unnecessary; the cases that need any of that are quite rare. The FPU is just an integer ALU with some stuff around it to keep track of the floating point, all of which you can do yourself.
How? Let's think in decimals and dollars: we can think about $110.50, adding $0.07, and getting $110.57, or you could have just done everything in pennies, 11050 + 7 = 11057, and then, when you print it for the user, place a dot in the right place. That is all the FPU is doing, and that is all you need to do. This link may or may not give some insight into this: http://www.divms.uiowa.edu/~jones/bcd/divide.html
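A minimal sketch of the same penny idea in code, using the amounts from the paragraph above:

#include <stdio.h>

int main(void)
{
    // Keep money as an integer number of cents; only the printing
    // code knows where the decimal point goes.
    long balance_cents = 11050;              // $110.50
    balance_cents += 7;                      // add $0.07, integer add only
    printf("$%ld.%02ld\n", balance_cents / 100, balance_cents % 100);   // $110.57
    return 0;
}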
Don't blanket all ARMv6 processors that way; that is not how ARMs are categorized. Some cores have the option of an FPU, or you can add one yourself after you buy from ARM, etc. The ARM11s are ARMv6 with an optional FPU, for example.
Also, even though you can keep track of the decimal point yourself, if there is a hard FPU it can be faster than doing it yourself in fixed point. Likewise, it is possible and easy to not know how to use an FPU and get bad results, just get them faster; it is very easy to write bad floating-point code. Whether you use fixed or float, you need to keep track of the range of your numbers and, from that, control where you move the point around to keep the integer math at the core within the mantissa. That means that to use floating point effectively you should be thinking in terms of what the integer math is doing. One very common mistake is to think that multiplies mess up your precision, when it is actually addition and subtraction that can hurt you the most.
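A tiny illustration of that last point, with values picked purely to show the effect in single precision:

#include <stdio.h>

int main(void)
{
    float big = 1.0e8f;                  // spacing between adjacent floats here is 8
    printf("%f\n", (big + 1.0f) - big);  // 0.000000: the 1 was lost in the addition
    printf("%f\n", big * 1.5f);          // 150000000.000000: the multiply stays exact
    return 0;
}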
I wrote a quick application to get a feel for the limits of RenderScript and discovered that when I reach approximately 65,000 triangles, the system simply does not draw any additional ones. For example, if I create a cylinder with 70,000 triangles, there is a missing wedge from the cylinder corresponding to the triangles that exceed the ~65,000 count. The triangles are textured, and for ease of writing the app I simply used a TriangleMeshBuilder class, so there is no real optimization going on such as using trifans or tristrips. The hardware is a Samsung Galaxy Nexus. LogCat reports a heap size of about 15MB with 3% free. I receive no errors or warnings regarding the graphics system or RenderScript.
Can anyone explain the reason for the triangles being dropped? Am I at a hardware limit that RenderScript is handling gracefully?
UPDATE This happens on a Samsung Galaxy Nexus (4.0.3), Samsung Galaxy Tab 7.0+ (3.2) and Motorola Xoom (3.2), all at the same point of approximately 65,000 triangles. Each of these devices has a different GPU.
UPDATE 2 In response to Steve Blackwell's insights, I have some additional thoughts.
Lines 710-712 do indeed downcast the int indices to short, so 65536 goes to 0 as Steve points out. Additionally, the "cast" on line 757 is not so much a cast as telling RenderScript the format of the binary data that will eventually be sent to it. RenderScript requires all data to be packed into a RenderScript-specific data type called an Allocation in order to move from Java to the RenderScript runtime, and this Allocation needs to be told what the data structure is. In line with Steve's opinion that this is a bug, line 757 tells RenderScript to treat the index data as short (unsigned 16-bit), but it is sent a 32-bit signed value (which is accepted due to the lack of a check and treated as unsigned, with only the lower 16 bits used; hence we get something drawn below this threshold and triangles connecting back to the first indices when we go over it).
Subclassing TriangleMeshBuilder to see if I could make it accept these values all as integers, to raise this limit, did not work, which leads me to believe that somewhere in the deep code we do not have access to there is an additional reference to unsigned shorts. It looks like the only workaround is to add additional vertex buffers as Steve suggests, which is easily done with the existing Mesh.AllocationBuilder class. I will also bring it up with Google in the Developer Hangouts to determine whether this is in fact a bug or intentional.
I know almost nothing about RenderScript, so I don't know whether this is some inherent limitation, a hardware issue, or something to do with TriangleMeshBuilder, but I would bet you're running out of triangles right after number 65535.
This is a magic number because it's the maximum value of an unsigned 16-bit integer. (Wikipedia)
I would suspect that somewhere in the code there's an unsigned short that holds the number of triangles. It won't be in the Java code since Java doesn't have unsigned values. And the limitation is probably not hardware since CPU registers/pathways are >= 32-bit. So I would check TriangleMeshBuilder.
EDIT:
That's a great find on line 553. The value of every index has to fit into a short. It looks like the downcast is happening at lines 710-712.
I assume that you're calling addTriangle(). That function takes three ints and then does an explicit cast to short. I think that's a bug right there because the downcast happens silently, and it's not what you'd expect from the function signature.
On line 768, that bogus data gets passed to Allocation.copy1DRangeFromUnchecked(). I didn't follow it all the way down, but I imagine that at some point those signed values get cast back to unsigned: -32768 to -1 turns back into 32768 to 65535. So turning the indices into negatives looks bad, but it's just reinterpreting the same data and it's not really a problem.
The real problem starts when you send in values like 65536. When 65536 is cast to a short, it turns into 0. That's a real loss of data. Now you're referring to different indices, and a cast to unsigned doesn't fix it.
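The arithmetic is easy to check in a few lines. This is C, but Java's narrowing conversion behaves the same way numerically; the variable names are made up:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int idx_ok  = 40000;                // fits in 16 bits once reinterpreted as unsigned
    int idx_bad = 65536;                // does not fit in 16 bits at all
    int16_t  s_ok  = (int16_t)idx_ok;   // -25536: looks wrong...
    uint16_t u_ok  = (uint16_t)s_ok;    // ...but comes back as 40000, so no harm done
    int16_t  s_bad = (int16_t)idx_bad;  // 0: the high bits are simply gone
    printf("%d %d %d\n", s_ok, (int)u_ok, s_bad);   // -25536 40000 0
    return 0;
}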
The real kicker is that copy1DRangeFromUnchecked() is an overloaded function, and one of the overloads takes an int[], so none of this ever needed to be an issue.
For workarounds, I guess you could subclass TriangleMeshBuilder and override the member variable mIndexData[] and method addTriangle(). Or maybe you could use multiple vertex buffers. Or file a bug report someplace? Anyway, interesting problem.
It's probably because OpenGL ES allows only short element indices, not int. Source: http://duriansoftware.com/joe/An-intro-to-modern-OpenGL.-Chapter-2.1:-Buffers-and-Textures.html (search for "OpenGL ES")
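That limit shows up in the draw call itself. A minimal sketch, assuming a GLES 2.0 context and 16-bit index data; without the GL_OES_element_index_uint extension, GL_UNSIGNED_INT is not a valid index type here:

#include <GLES2/gl2.h>

// Draw an indexed mesh with 16-bit indices: no index may exceed 65535.
void draw_mesh(GLsizei index_count, const GLushort *indices)
{
    glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_SHORT, indices);
}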
I am testing corner cases of the pow call (#include <math.h>), specifically pow(-1, Inf).
On my desktop (Ubuntu) I get the result 1.0; this is in accordance with the IEEE 754-2008 floating-point specification.
When I run the same test on the Android Gingerbread kernel, I get NaN returned.
I have looked around and can see that there are indeed many implementations of pow in the standard libraries for different platforms, and for the case pow(-1, Inf) they are coded to produce different results.
The question is: which one should be deemed correct? Any ideas or thoughts?
I apologize if I am posting on the wrong forum, I followed the link from the android developer resources and ended up here.
The C standard is perfectly clear on this point (§F.9.4.4); there's no room for "ideas or thoughts":
pow(−1, ±∞) returns 1.
Annex F applies only if an implementation defines __STDC_IEC_559__, but there is no question that 1.0 is the right answer.
I suspect that this is a Java-ism that has leaked over into the NDK. (Java defines pow(-1,infinity) to be NaN):
If the absolute value of the first argument equals 1 and the second argument is infinite, then the result is NaN.
Edit:
Since Matteo objects that this "makes no sense", I'll offer a few sentences of explanation for why the committee made this choice. Although lim_{n->inf} (-1)^n does not exist in the real numbers, we must remember that floating-point numbers are not real numbers, and in fact, for all sufficiently large floating-point numbers y, pow(-1,y) is +1. This is because all sufficiently large floating-point numbers are even integers. From this perspective, it is quite reasonable to define pow(-1,infinity) to be +1, and this turns out to actually lead to more useful behavior in some floating-point computations.
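Both halves of that argument are easy to check with a conforming libm; the constant below is 2^53, from which point on the spacing between representable doubles is at least 2, so every such value is an even integer:

#include <math.h>
#include <stdio.h>

int main(void)
{
    printf("%f\n", pow(-1.0, INFINITY));      // 1.000000 on a conforming libm
    double y = 9007199254740992.0;            // 2^53
    printf("%.1f\n", nextafter(y, INFINITY)); // 9007199254740994.0: the next double is y + 2
    return 0;
}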
There are a surprising number of extremely competent mathematicians (as well as very skilled programmers and compiler writers) involved with both the C and the IEEE-754 committees, and they do not make these decisions flippantly. Every standard has bugs, but this is not one of them.
Which is faster, double or float, when performing arithmetic (+ - * / %), and is it worth just using float for memory reasons? Precision is not much of an issue.
Feel free to call me crazy for even thinking this. Just curious as I see the amount of floats I'm using is getting larger.
EDIT 1:
The only reason this is under android is because that is where I believe memory matters; I wouldn't even ask this for desktop development.
The processing speed of both types should be approximately the same on today's CPUs.
"use whichever precision is required for acceptable results."
Related questions have been asked a couple of times here on SO, here is one.
Edit:
In speed terms, there's no difference between float and double on the more modern hardware.
Please check out this article from developer.android.com.
Using double rather than float was advised by an ADT v21 lint message, due to the JIT (Just In Time) optimizations in Dalvik from Froyo onwards (API 8 and later).
I was using FloatMath.sin and it suggested Math.sin instead, with the following text under the "explain issue" context menu. It reads to me like a general message about double vs. float, not just about trig.
"In older versions of Android, using
android.util.FloatMath was recommended for
performance reasons when operating on floats.
However, on modern hardware doubles are just as fast
as float (though they take more memory), and in
recent versions of Android, FloatMath is actually
slower than using java.lang.Math due to the way the JIT
optimizes java.lang.Math. Therefore, you should use
Math instead of FloatMath if you are only targeting
Froyo and above."
Hope this helps.
I wouldn't advise either for fast operations, but I would believe that operations on floats would be faster, as they are 32-bit versus the 64-bit of doubles.
http://developer.android.com/training/articles/perf-tips.html#AvoidFloat
Avoid Using Floating-Point
As a rule of thumb, floating-point is about 2x slower than integer on
Android-powered devices.
In speed terms, there's no difference between float and double on the
more modern hardware. Space-wise, double is 2x larger. As with desktop
machines, assuming space isn't an issue, you should prefer double to
float.
Also, even for integers, some processors have hardware multiply but
lack hardware divide. In such cases, integer division and modulus
operations are performed in software—something to think about if
you're designing a hash table or doing lots of math.
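A common way around that last point is to size the table as a power of two so the modulus becomes a mask; a small sketch with made-up names:

#include <stdint.h>

#define TABLE_SIZE 1024u   /* must be a power of two for the mask trick */

/* hash % TABLE_SIZE and hash & (TABLE_SIZE - 1) pick the same bucket,
 * but the mask never falls back to a software divide routine. */
static uint32_t bucket_of(uint32_t hash)
{
    return hash & (TABLE_SIZE - 1u);
}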
a float is 32 bits or 4 bytes
a double is 64 bits or 8 bytes
So yeah, floats are half the size, according to the Sun Java certification book.
In speed terms, there's no difference between float and double on the more modern hardware.
Very cheap devices seem to have a limited FPU where float is faster than double. I tested on a CMX device that is currently marketed as one of the cheapest tablets in the world:
"float" test code takes 4.7 seconds
same code with "double" takes 6.6 seconds
This question has been asked a couple of times ...
Yes, because the answer differs for different types of hardware. On desktop computers, double has the same speed as float. On devices without an FPU (interesting for WLAN router hacks), float is 2-5 times faster than double; and on devices with a 32-bit FPU (often found in industrial and automotive applications), even up to 100 times faster.
Please check out this article ...
The last section of the article says that you have to do time measurements on the hardware device you are going to use to be 100% sure.
The Android documentation quoted above indicates that integers are preferable for fast operations. This seems a little strange on the face of it, but the speed of an algorithm using ints vs. floats vs. doubles depends on several layers:
The JIT or VM: these will convert the mathematical operations to the host machine's native instruction set and that translation can have a large impact on performance. Since the underlying hardware can vary dramatically from platform to platform, it can be very difficult to write a VM or JIT that will emit optimal code in all cases. It is probably still best to use the JIT/VM's recommended fast type (in this case, integers) because, as the JITs and VMs get better at emitting more efficient native instructions, your high-level code should get the associated performance boosts without any modification.
The native hardware (why the first level isn't perfect): most processors nowadays have hardware floating-point units (which support floats and doubles). If such a hardware unit is present, floats/doubles can be faster than integers, unless there is also hardware integer support. Compounding the issue is that most CPUs have some form of SIMD (Single Instruction Multiple Data) support that allows operations to be vectorized if the data types are small enough (e.g. adding four floats in one instruction by packing several values into each register, instead of needing a whole register for each double). This can allow data types that use fewer bits to be processed much faster than a double, at the expense of precision.
Optimizing for speed requires detailed knowledge of both of these levels and how they interact. Even optimizing for memory use can be tricky because the VM can choose to represent your data in a larger footprint for other reasons: a float may occupy 8 bytes in the VM's code, though that is less likely. All of this makes optimization almost the antithesis of portability. So here again, it is better to use the VM's recommended "fast" data type because that should result in the best performance averaged across supported devices.
This is not a bad question at all, even on desktops. Yes they are very fast today, but if you are implementing a complicated algorithm (for example, the fast Fourier transform), even small optimizations can have an enormous impact on the algorithm's run time. In any case, the answer to your question "which is faster: floats or doubles" is "it depends" :)
I wondered about this too and wrote a small test:
#include <iostream>
#include <chrono>

template<typename numType>
void test(void) {
    std::cout << "Size of variable: " << sizeof(numType) << std::endl;
    numType array[20000];

    auto t1 = std::chrono::high_resolution_clock::now();
    // fill array
    for( numType& number : array ) {
        number = 1.0014535;
    }
    auto t2 = std::chrono::high_resolution_clock::now();

    // multiply each number with itself 10,000 times
    for( numType& number : array ) {
        for( int i = 0; i < 10000; i++ ) {
            number *= number;
        }
    }
    auto t3 = std::chrono::high_resolution_clock::now();

    auto filltime = t2 - t1;
    auto calctime = t3 - t2;
    std::cout << "Fill time: " << filltime.count() << std::endl;
    std::cout << "Calc time: " << calctime.count() << std::endl;
}

int main(int argc, char* argv[]) {
    test<float>();
    test<double>();
}
I compiled and ran it under Ubuntu 12.04 x64 using GCC on an Intel i7 3930K processor.
These were the results:
Size of variable: 4
Fill time: 69
Calc time: 694303
Size of variable: 8
Fill time: 76
Calc time: 693363
Results were reproducible. So memory allocation for double takes slightly longer, but the actual calculation time is exactly the same.
Out of curiosity, I also compiled and ran it under Windows 7 x64 using Visual Studio 2012 in release mode on an Intel i7 920 processor.
(The time unit is different, so don't compare the results above to these; they are only valid for internal comparison.)
Size of variable: 4
Fill time: 0
Calc time: 3200183
Size of variable: 8
Fill time: 0
Calc time: 3890223
Results were reproducible.
It seems that on Windows allocation is instant, perhaps because Linux does not actually give you memory until you use it while Windows hands it all over to you at once, requiring fewer system calls. Or perhaps the assignment is optimized away.
The multiplication of doubles is 21.5% slower here than for floats. This difference from the previous test is likely due to the different processor (that's my best guess, at least).
Since all smartphones (at least the ones that I can find specs on) have 32-bit processors, I would imagine that using single-precision floating-point values in extensive calculations would perform significantly better than doubles. However, that doesn't seem to be the case.
Even if I avoid type casts and use the FloatMath package whenever possible, I can hardly see any improvement in performance, except for memory use, when comparing float-based methods to double-based ones.
I am currently working on a rather large, calculation-intensive sound analysis tool, which is doing several million multiplications and additions per second. Since a double-precision multiplication on a 32-bit processor takes several clock cycles vs. one for single precision, I was assuming the type change would be noticeable... But it isn't :-(
Is there a good explanation for this? Is it due to the way the Dalvik VM works, or what?
Floating-point units on typical CPUs perform all of their calculations in double-precision (or better) and simply round or convert to whatever the final precision is. In other words, even 32-bit CPUs have 64-bit FPUs.
Many phones have CPUs that include FPUs, but have the FPUs disabled to save power, causing the floating-point operations to be slowly emulated (in which case 32-bit floats would be an advantage).
There are also vector units that have 32-bit FPUs, causing 64-bit floating-point operations to take longer. Some SIMD units (like those that execute SSE instructions) perform 32-bit and 64-bit operations in the same amount of time, so you could do twice as many 32-bit ops at a time, but a single 32-bit op won't go any faster than a single 64-bit op.
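For instance, with SSE (purely illustrative and not Android-specific), one 128-bit add handles four floats but only two doubles, so the per-instruction single-precision throughput is twice that of double precision:

#include <emmintrin.h>   /* SSE2 */

/* One 128-bit register holds 4 floats but only 2 doubles. */
__m128  add4f(__m128  a, __m128  b) { return _mm_add_ps(a, b); }
__m128d add2d(__m128d a, __m128d b) { return _mm_add_pd(a, b); }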
Many, perhaps most, Android devices have no floating-point co-processor.
I am currently working on a rather large, calculation intensive sound analysis tool, which is doing several million multiplications and additions per second.
That's not going to work very well on Android devices lacking a floating-point co-processor.
Move it into C/C++ with the NDK, then limit your targets to ARMv7, which has a floating-point co-processor.
Or, change your math to work in fixed-point mode. For example, Google Maps does not deal with decimal degrees for latitude and longitude, but rather microdegrees (10^6 times degrees), specifically so that it can do its calculations using fixed-point math.
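A minimal sketch of that microdegree idea, with an invented coordinate:

#include <stdio.h>

int main(void)
{
    /* Store latitude as an integer count of microdegrees (1e-6 degree). */
    int lat_e6 = 37422005;   /* 37.422005 degrees */
    lat_e6 += 150;           /* move north by 0.000150 degrees, integer add only */
    printf("%d.%06d\n", lat_e6 / 1000000, lat_e6 % 1000000);   /* 37.422155 */
    return 0;
}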
It seems that you're using a Nexus One, which has a Scorpion core.
I believe that both single- and double-precision scalar floating point are fully pipelined in Scorpion, so although the latency of the operations may differ, the throughput is the same.
That said, I believe that Scorpion also has a SIMD unit which is capable of operating on floats, but not doubles. In theory, a program written against the NDK that takes advantage of the SIMD instructions can run substantially faster on single precision than on double precision, but only with significant effort from the programmer.