I am testing corner cases of the pow call (#include <math.h>), specifically pow(-1, Inf).
On my desktop (Ubuntu) I get the result 1.0, which is in accordance with the IEEE 754-2008 floating-point specification.
When I run the same test on the Android Gingerbread kernel, I get NaN back.
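For reference, the test boils down to something like this (a minimal sketch, not my exact harness):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* INFINITY comes from <math.h>; this is the corner case in question */
        printf("pow(-1, +inf) = %f\n", pow(-1.0, INFINITY));
        return 0;
    }

On the desktop this prints 1.000000; on the device the result is NaN.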
I have looked around and can see that there are indeed many implementations of pow in the standard libraries for different platforms, and in the case of pow(-1, Inf) they are coded to produce different results.
The question is: which one should be deemed correct? Any ideas or thoughts?
I apologize if I am posting on the wrong forum; I followed the link from the Android developer resources and ended up here.
The C standard is perfectly clear on this point (§F.9.4.4); there's no room for "ideas or thoughts":
pow(−1, ±∞) returns 1.
Annex F applies only if an implementation defines __STDC_IEC_559__, but there is no question that 1.0 is the right answer.
I suspect that this is a Java-ism that has leaked over into the NDK (Java defines pow(-1, infinity) to be NaN):
If the absolute value of the first argument equals 1 and the second argument is infinite, then the result is NaN.
Edit:
Since Matteo objects that this "makes no sense", I'll offer a few sentences of explanation for why the committee made this choice. Although lim_{n->inf} (-1)^n does not exist in the real numbers, we must remember that floating-point numbers are not real numbers, and in fact, for all sufficiently large floating-point numbers y, pow(-1,y) is +1. This is because all sufficiently large floating-point numbers are even integers. From this perspective, it is quite reasonable to define pow(-1,infinity) to be +1, and this turns out to actually lead to more useful behavior in some floating-point computations.
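To see this concretely (my own illustration, not part of the standard): every double at or above 2^53 is an even integer, because the spacing between adjacent doubles there is at least 2.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double y = 0x1p53;                                   /* 2^53 */
        printf("%g\n", fmod(y, 2.0));                        /* 0: y is an even integer */
        printf("%g\n", pow(-1.0, y));                        /* 1 */
        printf("%g\n", pow(-1.0, nextafter(y, INFINITY)));   /* still 1: the next double up is also even */
        return 0;
    }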
There are a surprising number of extremely competent mathematicians (as well as very skilled programmers and compiler writers) involved with both the C and the IEEE-754 committees, and they do not make these decisions flippantly. Every standard has bugs, but this is not one of them.
I am running a comparison between lightweight ciphers and non-lightweight ones.
My chosen lightweight cipher is Clefia, which is a 128-bit block cipher by Sony, and I am comparing it to the well-known 128-bit AES, with both keys being 128 bits.
My comparison is being run on a real mobile device running Android (a Samsung Galaxy S3).
The paper about Clefia states that it is faster than AES.
This seems logical, given that it is a lightweight algorithm intended for less resourceful devices.
In order to run both on Android, I converted the official Clefia code, written in C, to Java as-is. (Although perhaps C could be compiled on Android? Not sure.)
For AES, I used the standard javax.crypto libraries (there are lots of examples on the internet for that).
What struck me is that the complete opposite happened. Instead of Clefia being way faster, it was AES that was around 350 times faster than Clefia.
The only reason I can think of now is that the code Clefia has posted on their official website is not optimized, which they admit; the below is a copy-paste from their code:
* NOTICE
* This reference code is written for a clear understanding of the CLEFIA
* block cipher algorithm based on the specification of CLEFIA.
* Therefore, this code does not include any optimizations for
* high-speed or low-cost implementations or any countermeasures against
* implementation attacks.
I can assume (I may be wrong) that the javax.crypto classes use a much more optimized version of AES.
This is the only reason I can think of for such a huge difference in speed.
Therefore, my questions are as follows.
When we say optimized, what is meant technically? Fewer rounds at the expense of security? Different code? Etc.?
Can the difference in speed be explained differently? That is, could something other than optimization account for such a difference?
I still could not locate an optimized version of Clefia, and I am not sure if Java has included it with the latest JDK, given that Clefia is now a standard. Is producing an optimized implementation left to the user who wants to use the algorithm, or does the company (the side that proposed the algorithm) offer one?
Any ideas, insights and thoughts are highly appreciated. (If you find a logical flaw in what I posted, please feel free to share. Also note that I was going to post this on http://crypto.stackexchange.com, but the user base is much smaller there and this involves Java, so for the time being I am posting it here; if you think I should move it there, please advise. Also, I do not mind sharing the code of both Clefia and AES if needed.)
Hardware Speed
In the paper you refer to, they show that Clefia, when implemented in hardware, can be faster than AES when considering Kbps/gate. The best Clefia implementation has 268.63 Kbps/gate and the best AES has 135.81 Kbps/gate, which is around a factor of 2.
Software Speed
They also have a comparison of software implementations, where Clefia performs a bit slower at 12.9 cycles/byte than AES at only 10.6 cycles/byte.
So this shows that the speeds of the two algorithms themselves are within a factor of 2 of each other.
Now, the problem is that you are comparing a highly optimized, and maybe even hardware-backed, implementation (the ARMv8 instruction set now includes instructions that do a full AES round in one instruction) to your own Java port of an implementation that is not optimized in the first place (the original code even states: this code does not include any optimizations for high-speed).
Also, how big is the data set you are testing on? And how has the effect of JIT compilation been accounted for in the test?
If you want a comparable result, you ought to implement the AES algorithm in Java as well, and then do the comparison. My guess is that this approach would give a comparatively slow implementation of AES as well.
I am developing some numerical software whose performance depends a lot on numerical precision (i.e., float, double, etc.).
I have noticed that ARM NEON does not fully comply with the IEEE 754 floating-point standard. Is there a way to emulate NEON's floating-point precision on an x86 CPU? For example, a library that emulates the NEON SIMD floating-point operations.
Probably.
I'm less familiar with SSE, but you can force many of the SSE modes to behave like NEON. This will depend on your compiler and available libraries, but see, for example, the Visual Studio FP unit control functions. This might be good enough for your requirements.
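For example, if the main difference you care about is NEON's flush-to-zero handling of denormals, the SSE control register can be put into a similar mode (a sketch using the standard MXCSR intrinsics; whether this is "close enough" to NEON depends on your workload):

    #include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE     */
    #include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */

    /* Make SSE arithmetic treat denormals roughly the way ARMv7 NEON does:
       flush denormal results to zero and treat denormal inputs as zero. */
    static void use_neon_like_denormal_handling(void)
    {
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    }

That only covers denormal behavior; rounding and operation ordering still have to be checked, as discussed below.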
Furthermore, you can use the arm_neon.h header to ensure that you are using similar intrinsics to accomplish similar things.
Finally, if you really require achieving this precision at these boundary conditions, you are going to want a good test suite to verify that you are achieving your results as intended.
Finally finally, even with pure "C" code, which typically complies with IEEE 754 and uses the VFP on ARM as other commenters have mentioned, you will get different results, because floating point is a highly... irregular process, subject to the whims of optimization and order of operations. It is challenging to get results to match across different compilers, let alone hardware architectures. For example, to get highly agreeable results on Intel with gcc it is often required to use the -ffloat-store flag if you want to compare with /fp:precise on CL/MSVS.
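The order-of-operations point alone is enough to change results, with no exotic hardware involved; a classic example:

    #include <stdio.h>

    int main(void)
    {
        double a = 1e20, b = -1e20, c = 1.0;
        printf("%g\n", (a + b) + c);   /* 1: a and b cancel first                         */
        printf("%g\n", a + (b + c));   /* 0: c is absorbed into b before the cancellation */
        return 0;
    }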
In the end, you may need to accept some kind of non-zero error tolerance. Trying to get to zero may be difficult, but it would be awesome to hear your results if you get there. It seems possible... but difficult.
Thanks for your answers.
In the end, I used an Android phone connected to a desktop, with certain functions running on the phone.
I'm doing some image compression in Android using native code. For various reasons, I can't use a pre-built library.
I profiled my code using the android-ndk-profiler and found that the bottleneck is -- surprisingly -- floating point operations! Here's the profile output:
Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 40.37      0.44     0.44                             __addsf3
 11.93      0.57     0.13     7200     0.02     0.03  EncodeBlock
  6.42      0.64     0.07   535001     0.00     0.00  BitsOut
  6.42      0.71     0.07                             __aeabi_fdiv
  6.42      0.78     0.07                             __gnu_mcount_nc
  5.50      0.84     0.06                             __aeabi_fmul
  5.50      0.90     0.06                             __floatdisf
...
I googled __addsf3 and apparently it is a software floating-point operation. Yuck. I did more research on the ARMv6 architecture, and unless I missed something, this core doesn't have hardware floating-point support. So what can I do here to speed this up? Fixed point? I know that's normally done with integers, but I'm not really sure how to convert my code to do that. Is there a compiler flag I could set so it will do that? Other suggestions welcome.
Of course you can do anything with integer arithmetic only (after all, that is exactly what your program is doing right now), but whether it can be done faster really depends on what exactly you are trying to do.
Floating point is sort of a generic solution that you can apply in most cases and just forget about, but it's somewhat rare that your problem really needs numbers ranging wildly from the incredibly small to the incredibly big, with 52 bits of mantissa accuracy. Supposing your computations are about graphics, with a double-precision floating-point number you can go from much less than sub-atomic scale to much more than the size of the universe... is that range really needed? The accuracy provided of course depends on the scale with FP, but what is the accuracy you really need?
What are your numbers used for in your "inner loop"? Without knowing that, it is hard to say whether the computation can be made much faster or not. Almost surely it can be made faster (FP is a generic, blind solution), but the degree of gain you may hope for varies a lot. I don't know the specific implementation, but I'd expect it to be reasonably efficient (for the generic case).
You should aim at a higher logical level of optimization.
For image (de)compression based on, say, DCT or wavelet transforms, I think there is indeed no need for floating-point arithmetic: you can just consider the exact scales your numbers will be at and use integer arithmetic. Moreover, you may also have an extra degree of freedom because of the ability to produce approximate results.
See 6502's excellent answer first...
Most processors don't have FPUs because they are not needed. And when they do, for some reason they try to conform to IEEE 754, which is equally unnecessary; the cases that need any of that are quite rare. The FPU is just an integer ALU with some stuff around it to keep track of the floating point, all of which you can do yourself.
How? Let's think in decimals and dollars: we can think about $110.50, adding $0.07 and getting $110.57, or you could have just done everything in pennies, 11050 + 7 = 11057, and then, when you print it for the user, place a dot in the right place. That is all the FPU is doing, and that is all you need to do. This link may or may not give some insight into this: http://www.divms.uiowa.edu/~jones/bcd/divide.html
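In C, that idea is usually written as fixed point with a binary rather than decimal scale. A minimal Q16.16 sketch (my own illustration, nothing to do with your encoder specifically):

    #include <stdint.h>

    typedef int32_t q16_16;            /* 16 integer bits, 16 fraction bits */
    #define Q_ONE (1 << 16)

    static inline q16_16 q_from_double(double x) { return (q16_16)(x * Q_ONE); }
    static inline double q_to_double(q16_16 x)   { return (double)x / Q_ONE; }

    /* add/sub are plain integer ops; multiply needs a 64-bit intermediate,
       then a shift to drop the extra 16 fraction bits */
    static inline q16_16 q_add(q16_16 a, q16_16 b) { return a + b; }
    static inline q16_16 q_mul(q16_16 a, q16_16 b) { return (q16_16)(((int64_t)a * b) >> 16); }

Here q_mul(q_from_double(1.5), q_from_double(2.5)) gives the Q16.16 representation of 3.75; the "dot" only shows up when you convert back for display.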
Don't blanket all ARMv6 processors that way; that is not how ARM cores are categorized. Some cores have the option for an FPU, or you can add one yourself after you buy from ARM, etc. The ARM11s are ARMv6 cores with the option for an FPU, for example.
Also, even though you can keep track of the decimal point yourself, if there is a hardware FPU it can be faster than doing it yourself in fixed point. Likewise, it is easy to not know how to use an FPU and get bad results, just get them faster. It is very easy to write bad floating-point code. Whether you use fixed or float, you need to keep track of the range of your numbers and, from that, control where you move the point around to keep the integer math at the core within the mantissa. Which means that to use floating point effectively you should be thinking in terms of what the integer math is doing. One very common mistake is to think that multiplies mess up your precision, when it is actually addition and subtraction that can hurt you the most.
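A quick single-precision illustration of that last point (made-up numbers):

    #include <stdio.h>

    int main(void)
    {
        float big = 1.0e8f, small = 0.001f;
        printf("%f\n", (big + small) - big);   /* 0.000000: small is absorbed, then cancelled away       */
        printf("%f\n", big * small);           /* about 100000: the multiply keeps its relative accuracy */
        return 0;
    }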
I wrote a quick application to get a feel for the limits of RenderScript and discovered that when reaching approximately 65,000 triangles, the system simply does not draw any additional ones. For example, if I create a cylinder with 70,000 triangles, there is a wedge missing from the cylinder corresponding to the triangles that exceed the ~65,000 count. The triangles are textured, and for ease of writing the app I simply used the TriangleMeshBuilder class, so there is no real optimization going on such as using trifans or tristrips. The hardware is a Samsung Galaxy Nexus. LogCat reports a heap size of about 15 MB with 3% free. I receive no errors or warnings regarding the graphics system or RenderScript.
Can anyone explain the reason for the triangles being dropped? Am I at a hardware limit that RenderScript is handling gracefully?
UPDATE This happens on a Samsung Galaxy Nexus (4.0.3), a Samsung Galaxy Tab 7.0+ (3.2) and a Motorola Xoom (3.2), all at the same point of approximately 65,000 triangles. Each of these devices has a different GPU.
UPDATE 2 In response to Steve Blackwell's insights, I have some additional thoughts.
Lines 710-712 do indeed downcast the int indices to short, so 65536 goes to 0, as Steve points out. Additionally, the "cast" on line 757 is not so much a cast as a way of telling RenderScript the format of the binary data that will eventually be sent to it. RenderScript requires all data to be packed into a RenderScript-specific data type called an Allocation to move from Java to the RenderScript runtime, and the Allocation needs to be informed of the data structure. In line with Steve's opinion that this is a bug, line 757 informs RenderScript to treat the index data as short (unsigned 16-bit) but sends it a 32-bit signed value (which is accepted due to the lack of a check, treated as unsigned, and then only the lower 16 bits are used; hence we get something drawn when below this threshold and triangles connecting back to the first indices when we go over).
Subclassing TriangleMeshBuilder to see if I could make it accept these values all as integers, to increase this limit, did not work, which leads me to believe that somewhere deep in code we do not have access to there is an additional reference to unsigned shorts. It looks like the only workaround is to add additional vertex buffers, as Steve suggests, which is easily done with the existing Mesh.AllocationBuilder class. I will also bring this up with Google in the Developer Hangouts to determine whether it is in fact a bug or intentional.
I know almost nothing about RenderScript, so I don't know whether this is some inherent limitation, a hardware issue, or something to do with TriangleMeshBuilder, but I would bet you're running out of triangles after number 65535.
This is a magic number because it's the maximum value of an unsigned 16-bit integer. (Wikipedia)
I would suspect that somewhere in the code there's an unsigned short that holds the number of triangles. It won't be in the Java code since Java doesn't have unsigned values. And the limitation is probably not hardware since CPU registers/pathways are >= 32-bit. So I would check TriangleMeshBuilder.
EDIT:
That's a great find on line 553. The value of every index has to fit into a short. It looks like the downcast is happening at lines 710-712.
I assume that you're calling addTriangle(). That function takes three ints and then does an explicit cast to short. I think that's a bug right there because the downcast happens silently, and it's not what you'd expect from the function signature.
On line 768, that bogus data gets passed to Allocation.copy1DRangeFromUnchecked(). I didn't follow it all the way down, but I imagine that at some point those signed values get reinterpreted as unsigned: -32768 to -1 turn back into 32768 to 65535. So turning the indices into negatives looks bad, but it's just reinterpreting the same data, and it's not really a problem.
The real problem starts when you send in values like 65536. When 65536 is cast to a short, it turns into 0. That's a real loss of data. Now you're referring to different indices, and a cast to unsigned doesn't fix it.
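The code here is Java, but the 16-bit narrowing works the same as in this little C illustration (made-up index values):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        int i = 40000;                         /* fits in 16 bits only when read as unsigned */
        int16_t  s = (int16_t)i;               /* -25536: same bits, read as signed          */
        uint16_t u = (uint16_t)s;              /*  40000: reinterpreting recovers the index  */
        printf("%d %d\n", s, u);

        int j = 65536;                         /* does not fit in 16 bits at all */
        printf("%d\n", (uint16_t)(int16_t)j);  /* 0: the upper bits are gone for good */
        return 0;
    }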
The real kicker is that copy1DRangeFromUnchecked() is an overloaded function, and one of the overloads takes an int[], so none of this ever needed to be an issue.
For workarounds, I guess you could subclass TriangleMeshBuilder and override the member variable mIndexData[] and method addTriangle(). Or maybe you could use multiple vertex buffers. Or file a bug report someplace? Anyway, interesting problem.
It's probably because OpenGL ES allows only short element indices, not int. Source: http://duriansoftware.com/joe/An-intro-to-modern-OpenGL.-Chapter-2.1:-Buffers-and-Textures.html (search for "OpenGL ES")
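In other words, by the time the mesh is drawn, the index buffer ends up in a call shaped like this (GLES 2.0; 32-bit indices need the OES_element_index_uint extension):

    #include <GLES2/gl2.h>

    /* Core OpenGL ES 2.0 element indices are GL_UNSIGNED_BYTE or GL_UNSIGNED_SHORT,
       so a single index can only address 65536 distinct vertices. */
    void draw_mesh(GLsizei index_count)
    {
        glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_SHORT, 0);
    }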
Since all smartphones (at least the ones that I can find specs on) have 32-bit processors, I would imagine that using single-precision floating-point values in extensive calculations would perform significantly better than doubles. However, that doesn't seem to be the case.
Even if I avoid type casts and use the FloatMath package whenever possible, I can hardly see any improvement in performance, except in memory use, when comparing float-based methods to double-based ones.
I am currently working on a rather large, calculation-intensive sound analysis tool, which is doing several million multiplications and additions per second. Since a double-precision multiplication on a 32-bit processor takes several clock cycles vs. one for single precision, I was assuming the type change would be noticeable... but it isn't :-(
Is there a good explanation for this? Is it due to the way the Dalvik VM works, or what?
Floating-point units on typical CPUs perform all of their calculations in double-precision (or better) and simply round or convert to whatever the final precision is. In other words, even 32-bit CPUs have 64-bit FPUs.
Many phones have CPUs that include FPUs, but have the FPUs disabled to save power, causing the floating-point operations to be slowly emulated (in which case 32-bit floats would be an advantage).
There are also vector units that have 32-bit FPUs, causing 64-bit floating-point operations to take longer. Some SIMD units (like those that execute SSE instructions) perform 32-bit and 64-bit operations in the same amount of time, so you could do twice as many 32-bit ops at a time, but a single 32-bit op won't go any faster than a single 64-bit op.
Many, perhaps most, Android devices have no floating-point co-processor.
I am currently working on a rather large, calculation intensive sound analysis tool, which is doing several million multiplications and additions per second.
That's not going to work very well on Android devices lacking a floating-point co-processor.
Move it into C/C++ with the NDK, then limit your targets to ARMv7, which has a floating-point co-processor.
Or, change your math to work in fixed-point mode. For example, Google Maps does not deal with decimal degrees for latitude and longitude, but rather microdegrees (10^6 times degrees), specifically so that it can do its calculations using fixed-point math.
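A sketch of that microdegree idea (illustrative only, not Google's actual code):

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Store coordinates as integer microdegrees (10^6 times degrees);
       the hot-path math on them is then plain integer arithmetic. */
    typedef int32_t microdeg;

    static microdeg to_microdeg(double degrees) { return (microdeg)lround(degrees * 1e6); }
    static double   to_degrees(microdeg e6)     { return e6 / 1e6; }

    int main(void)
    {
        microdeg lat1 = to_microdeg(40.712800);
        microdeg lat2 = to_microdeg(40.730600);
        printf("delta = %d microdegrees (%f degrees)\n",
               (int)(lat2 - lat1), to_degrees(lat2 - lat1));
        return 0;
    }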
It seems that you're using a Nexus One, which has a Scorpion core.
I believe that both single- and double-precision scalar floating point are fully pipelined in Scorpion, so although the latency of the operations may differ, the throughput is the same.
That said, I believe that Scorpion also has a SIMD unit which is capable of operating on floats, but not doubles. In theory, a program written against the NDK that takes advantage of the SIMD instructions can run substantially faster on single precision than on double precision, but only with significant effort from the programmer.
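For instance, with the standard arm_neon.h intrinsics (the helper and buffer names here are made up), four single-precision multiplies go out as one NEON operation; there is no double-precision equivalent on this class of hardware:

    #include <arm_neon.h>

    /* Multiply two float arrays four lanes at a time with NEON.
       For brevity, n is assumed to be a multiple of 4. */
    void mul_arrays(const float *a, const float *b, float *out, int n)
    {
        for (int i = 0; i < n; i += 4) {
            float32x4_t va = vld1q_f32(a + i);
            float32x4_t vb = vld1q_f32(b + i);
            vst1q_f32(out + i, vmulq_f32(va, vb));
        }
    }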