What is meant (technically) by optimized cipher? (Case of AES in Java) - android

I am running a comparison between lightweight ciphers and non-lightweight ones.
My chosen lightweight cipher is Clefia, a 128-bit block cipher by Sony, and I am comparing it to the well-known 128-bit AES, with both ciphers using 128-bit keys.
The comparison is being run on a real mobile device running Android OS (Samsung Galaxy S3).
The paper about Clefia states that it is faster than AES.
This seems logical, given that it is a lightweight algorithm intended for less resourceful devices.
In order to run both ciphers on Android, I converted the official Clefia code, which is written in C, to Java as-is. (Although perhaps the C could be compiled on Android? I'm not sure.)
For AES, I used the standard javax.crypto classes (there are lots of examples on the internet for that).
What struck me is that the complete opposite happened: instead of Clefia being way faster, AES was around 350 times faster than Clefia.
The only reason I can think of is that the Clefia code posted on the official website is not optimized, which the authors admit; the following is copied from their code:
* NOTICE
* This reference code is written for a clear understanding of the CLEFIA
* block cipher algorithm based on the specification of CLEFIA.
* Therefore, this code does not include any optimizations for
* high-speed or low-cost implementations or any countermeasures against
* implementation attacks.
I assume (I could be wrong) that the javax.crypto classes use a much more optimized version of AES.
That is the only reason I can think of for such a huge difference in speed.
Therefore, my questions are as follows:
When we say optimized, what is meant technically? Fewer rounds at the expense of security? Different code? Something else?
Can such a difference in speed be explained differently? That is, could something other than optimization account for it?
I still could not locate an optimized version of Clefia, and I am not sure whether Java has included it in the latest JDK, given that Clefia is now a standard. Is producing an optimized implementation left to the user who wants to adopt the algorithm, or does the party that proposed the algorithm provide one?
Any ideas, insights and thoughts are highly appreciated. (If you find a logical flaw in what I posted, please feel free to share. Also note that I was going to post this on http://crypto.stackexchange.com, but the user base there is much smaller and this involves Java, so for the time being I am posting it here; if you think I should move it there, please advise. Also, I do not mind sharing the code of both Clefia and AES if needed.)

Hardware Speed
In the paper you refer to, they show that Clefia, when implemented in hardware, can be faster than AES when considering Kbps/gate: the best Clefia implementation achieves 268.63 Kbps/gate and the best AES 135.81 Kbps/gate, which is around a factor of 2.
Software Speed
They also have a comparison of software implementations, where Clefia performs a bit slower at 12.9 cycles/byte versus 10.6 cycles/byte for AES.
So the speeds of the two algorithms themselves are within a factor of 2 of each other.
Now, the problem is that you are comparing a highly optimized, and maybe even hardware-backed, implementation (the ARMv8 instruction set now includes instructions that perform a full AES round in one instruction) against your own Java port of an implementation that is not optimized in the first place (the original code even states: "this code does not include any optimizations for high-speed").
Also, how big is the data set you are testing on? And how has the effect of JIT compilation been accounted for in the test?
If you want a comparable result, you ought to implement the AES algorithm in Java as well and then do the comparison. My guess is that this approach would give a comparatively slow implementation of AES too.
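For what it's worth, here is a minimal sketch (my own code, not taken from the question) of how a javax.crypto AES measurement with an explicit warm-up phase could look, so that JIT compilation is at least partly accounted for. The buffer size, iteration counts and the ECB/NoPadding transformation are arbitrary choices for illustration only.

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

public class AesThroughput {
    public static void main(String[] args) throws Exception {
        // 128-bit AES key from the platform's default provider.
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();

        // ECB/NoPadding keeps the measurement close to the raw block cipher;
        // it is not a recommendation for real use.
        Cipher cipher = Cipher.getInstance("AES/ECB/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key);

        byte[] data = new byte[1 << 20]; // 1 MiB, a multiple of the 16-byte block size

        // Warm-up: let JIT compilation and class loading settle before timing.
        for (int i = 0; i < 50; i++) {
            cipher.doFinal(data);
        }

        int runs = 100;
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            cipher.doFinal(data);
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("AES-128: %.1f MiB/s%n", runs / seconds);
    }
}

The same harness, with the same buffer and the same warm-up, could then be pointed at the Clefia port so that both ciphers are measured under identical conditions.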

Related

Cache timing on ARM processor

I need to implement the AES algorithm on a smartphone with an ARM Cortex-A15 processor (Samsung Galaxy Note 3, etc.) and need to observe and save cache timings for each process and round. How do I go about it?
To be precise, I need to observe the time taken by the processor to run each round of AES per plaintext-key pair. I am trying to assess the practicability of timing attacks on smartphones (focusing on Bernstein's modified attacks, but I will also look at the feasibility of both trace-driven and access-driven cache attacks). It is for academic purposes.
I understand the architecture of the processor used. The problem lies in the assembly programming (not getting the right code) as well as in how to load this program onto the smartphone.

Can I emulate ARM NEON in an x86 C program?

I am developing some numerical software whose performance depends a lot on numerical accuracy (i.e., floats, doubles, etc.).
I have noticed that ARM NEON does not fully comply with the IEEE 754 floating-point standard. Is there a way to emulate NEON's floating-point precision on an x86 CPU? For example, a library that emulates NEON's SIMD floating-point operations.
Probably.
I'm less familiar with SSE, but you can force many of the SSE modes to behave like NEON. This will depend on your compiler and available libraries, but see some Visual Studio FP unit control functions. This might be good enough for your requirements.
Furthermore, you can use the arm_neon.h header to ensure that you are using similar intrinsics to accomplish similar things.
Finally, if you really require achieving this precision at these boundary conditions, you are going to want a good test suite to verify that you are achieving your results as intended.
Finally finally, even with pure "C" code, which typically complies with IEEE-754, and uses the VFP on ARM as other commenters have mentioned, you will get different results because floating point is a highly... irregular process, subject to the whim of optimization and order of operations. It is challenging to get results to match across different compilers, let alone hardware architectures. For example, to get highly agreeable results on Intel with gcc it's often required to use the -ffloat-store flag, if you want to compare with /fp:precise on CL/MSVS.
In the end, you may need to accept some kind of non-zero error tolerance. Trying to get to zero may be difficult, but it would be awesome to hear your results if you get there. It seems possible... but difficult.
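To illustrate the tolerance idea, here is a small Java-flavoured sketch of mine (the values and the tolerance are made up for the example); whatever language your test suite ends up in, the shape of the check is the same.

public class ToleranceCheck {
    // True if the two values agree within a relative tolerance,
    // with an absolute floor so values near zero do not fail spuriously.
    static boolean approxEqual(float expected, float actual, float relTol) {
        float diff = Math.abs(expected - actual);
        float scale = Math.max(Math.abs(expected), Math.abs(actual));
        return diff <= relTol * scale || diff <= Float.MIN_NORMAL;
    }

    public static void main(String[] args) {
        float reference = 0.30000001f; // e.g. value from the x86 reference build
        float device    = 0.29999998f; // e.g. value reported by the ARM build
        System.out.println(approxEqual(reference, device, 1e-6f)); // true
    }
}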
Thanks for your answers.
In the end, I used an Android phone connected to a desktop, with certain functions running on the phone.

Android sound synthesis

I am trying to play a synthesized sound (basically two sine waves and some noise) using the AudioTrack class. It doesn't seem to be any different from SourceDataLine in javax.sound.sampled, BUT the synthesis is REALLY SLOW. Even by ARM standards, it's unrealistic that 32768 samples (16-bit, stereo, for a total of 65536) take over one second to render on a Nexus 4 (measured with System.nanoTime(), with the write to AudioTrack excluded).
The synthesis part is almost identical to this http://audioprograming.wordpress.com/2012/10/18/a-simple-synth-in-android-step-by-step-guide-using-the-java-sdk/, the only difference being that I play stereo sound (I can't reduce it to mono because it's a binaural tone).
Any ideas? What can I do?
Thanks in advance.
Marko's answer seems very good. But if you're still in the experimental/investigational phase of your project, you might want to consider using Pure Data, which already is implemented as a combination Android library/NDK library and which would allow you to synthesize many sounds and interact with them in a relatively simple manner.
The libpd distribution is the Android implementation of Pure Data. Some good starting references can be found at the SoundOnSound site and also at this site.
Addendum: I found a basic but functional implementation of an Android Midi Driver through this discussion link. The relevant code can be found here (github, project by billthefarmer, named mididriver).
You can view how I use it in my Android app (imSynt link leads you to Google Play), or on YouTube.
The performance of audio synthesis on ARM is actually very respectable with native code that makes good use of the NEON unit. Dalvik's JIT compiler is never going to get close to this level of performance for floating-point-intensive code.
A look at the enormous number of soft-synth apps for iOS provides ample evidence of what should be possible on ARM devices with similar levels of performance.
However, the performance you are reporting is several orders of magnitude short of what I would expect. You might consider the following:
Double-precision floating-point arithmetic is particularly expensive on ARM Cortex-A NEON units, whereas single precision is very fast and highly parallelizable. Math.sin() returns a double, so it is unnecessarily precise and liable to be slow. The 24-bit mantissa provided by a single-precision floating-point value is already substantially larger than the 16-bit int used by the audio subsystem.
You could precompute sin(x) and then perform a table lookup in your render loop (see the sketch after this answer).
There is a previous post on SO concerning Math.sin(x) on Android suggesting that its performance degrades as x becomes large, as it is likely to in this case over time.
For a more advanced table-based synthesiser, you might consider using a DDS Oscillator.
Ultimately, you might consider using native code for synthesis, with the NDK.
You should be able to render multiple oscillators with filters and envelopes and still have CPU time left over. Check your inner loops to make sure there are no system calls.
Are you on a very old phone? You did not mention the hardware or OS version.
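As a rough illustration of the table-lookup idea above, here is a sketch of mine (not the question's code; the table size and sample rate are arbitrary). A stereo binaural tone would simply use two such oscillators at slightly different frequencies and interleave their outputs.

public class SineTableOsc {
    private static final int TABLE_SIZE = 4096;
    private static final float[] SINE_TABLE = new float[TABLE_SIZE];
    static {
        // Pay the Math.sin() cost once, at start-up, into a single-precision table.
        for (int i = 0; i < TABLE_SIZE; i++) {
            SINE_TABLE[i] = (float) Math.sin(2.0 * Math.PI * i / TABLE_SIZE);
        }
    }

    private float phase;          // current position in the table, in table samples
    private final float phaseInc; // table samples to advance per output sample

    SineTableOsc(float frequencyHz, float sampleRateHz) {
        this.phaseInc = frequencyHz * TABLE_SIZE / sampleRateHz;
    }

    // Fill a mono buffer of 16-bit samples; no Math.sin() in the hot loop.
    void render(short[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = (short) (SINE_TABLE[(int) phase] * Short.MAX_VALUE);
            phase += phaseInc;
            if (phase >= TABLE_SIZE) phase -= TABLE_SIZE;
        }
    }

    public static void main(String[] args) {
        SineTableOsc osc = new SineTableOsc(440f, 44100f);
        short[] buffer = new short[1024];
        osc.render(buffer); // the buffer could then be handed to AudioTrack.write()
    }
}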
You might want to try using JSyn. It is a free modular Java synthesizer that runs on any Java platform including desktops, Raspberry Pi and Android.
https://github.com/philburk/jsyn
Have you tried profiling your code? It sounds like something else may be causing your slowdown; profiling would help to highlight the cause.
Mike

android kernel libm pow(float,float) implementation

I am testing corner cases of the pow call (#include <math.h>), specifically pow(-1, Inf).
On my desktop (Ubuntu) I get the result 1.0, this is in accordance with the 2008 IEEE floating point specification.
I run the same test on the Android Gingerbread kernel and get NaN returned.
I have looked around and can see that there are indeed many implementations of pow in the standard libraries for different platforms, and in the case of pow(-1, Inf) they are coded to produce different results.
The question is: which one should be deemed correct? Any ideas or thoughts?
I apologize if I am posting on the wrong forum, I followed the link from the android developer resources and ended up here.
The C standard is perfectly clear on this point (§F.9.4.4); there's no room for "ideas or thoughts":
pow(−1, ±∞) returns 1.
Annex F applies only if an implementation defines __STDC_IEC_559__, but there is no question that 1.0 is the right answer.
I suspect that this is a Java-ism that has leaked over into the NDK. (Java defines pow(-1,infinity) to be NaN):
If the absolute value of the first argument equals 1 and the second argument is infinite, then the result is NaN.
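The Java behaviour quoted above is easy to confirm directly (a quick sketch; the same inputs under C Annex F should give 1.0):

public class PowCorner {
    public static void main(String[] args) {
        double r = Math.pow(-1.0, Double.POSITIVE_INFINITY);
        System.out.println(r);               // prints NaN on the JVM
        System.out.println(Double.isNaN(r)); // true
    }
}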
Edit:
Since Matteo objects that this "makes no sense", I'll offer a few sentences of explanation for why the committee made this choice. Although lim_{n->inf} (-1)^n does not exist in the real numbers, we must remember that floating-point numbers are not real numbers, and in fact, for all sufficiently large floating-point numbers y, pow(-1,y) is +1. This is because all sufficiently large floating-point numbers are even integers. From this perspective, it is quite reasonable to define pow(-1,infinity) to be +1, and this turns out to actually lead to more useful behavior in some floating-point computations.
There are a surprising number of extremely competent mathematicians (as well as very skilled programmers and compiler writers) involved with both the C and the IEEE-754 committees, and they do not make these decisions flippantly. Every standard has bugs, but this is not one of them.

How would you improve Dalvik? Android's Virtual Machine

I am currently writing a paper on the Android platform. After some research, it's clear that Dalvik has room for improvement. I was wondering, what do you think would be the best use of a developer's time with this goal?
JIT compilation seems like the big one, but I've also heard it would be of limited use on such a low-resource machine. Does anyone have a resource or data that backs this up?
Are there any other options that should be considered, aside from developing a robust native development kit to bypass the VM?
For those who are interested, there is a lecture that has been recorded and put online regarding the Dalvik VM.
Any thoughts are welcome. As this question appears subjective, I'll clarify that the answer I accept must have some justification for the proposed changes. Any data to back it up, such as the improvement seen in the Sun JVM when JIT was introduced, would be a massive plus.
Better garbage collection: compacting at minimum (to eliminate the memory fragmentation problems experienced today), and ideally less CPU-intensive collection itself (to reduce the "my game frame rates suck" complaints)
JIT, as you cite
Enough documentation that, when coupled with an NDK, somebody sufficiently crazy could compile Dalvik bytecode to native code for an AOT compilation option
Make it separable from Android itself, such that other projects might experiment with it and community contributions might arrive in greater quantity and at a faster clip
I'm sure I could come up with other ideas if you need them.
JIT. The stuff about it not helping is a load of crap. You might be more selective about what code you JIT, but having 1/10th the performance of native code is always going to be limiting.
Decent GC. Modern generational garbage collectors do not have big stutters.
Better code analysis. There are lots of cases where allocations/frees don't need to be made, locks don't need to be held, and so on. It allows you to write clean code rather than doing optimizations that the machine is better at (see the sketch after this answer).
In theory, most of the higher-level languages (Java, JavaScript, Python, ...) should be within 20% of native-code performance for most cases. But it requires the platform vendor to spend hundreds of developer man-years. Sun's Java is getting good, and they have been working on it for 10 years.
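To make the code-analysis point a bit more concrete, here is a small illustration of my own (the class and method names are made up): with escape analysis and scalar replacement, a VM can make the clean version as cheap as the hand-optimized one, so the programmer never has to do the rewrite by hand.

public class EscapeExample {
    static final class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
        double dist() { return Math.sqrt(x * x + y * y); }
    }

    // Clean code: as written, it allocates a Point per iteration.
    // A VM with escape analysis can eliminate those allocations entirely.
    static double sumClean(double[] xs, double[] ys) {
        double sum = 0;
        for (int i = 0; i < xs.length; i++) {
            sum += new Point(xs[i], ys[i]).dist();
        }
        return sum;
    }

    // Hand-optimized: no allocation, but less readable. A sufficiently smart
    // VM makes this manual rewrite unnecessary.
    static double sumManual(double[] xs, double[] ys) {
        double sum = 0;
        for (int i = 0; i < xs.length; i++) {
            sum += Math.sqrt(xs[i] * xs[i] + ys[i] * ys[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] xs = {3, 6}, ys = {4, 8};
        System.out.println(sumClean(xs, ys) + " == " + sumManual(xs, ys));
    }
}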
One of the main problems with Dalvik is performance, which I have heard is terrible, but the thing I would like most is the addition of more languages.
The JVM has had community projects getting Python and Ruby running on the platform, and even special languages such as Scala, Groovy and Clojure developed for it. It would be nice to see these (and/or others) on the Dalvik platform as well. Sun has been working on the Da Vinci Machine as well, a dynamic-typing extension of the JVM, which indicates a major shift away from the "one language fits all" philosophy Sun has followed over the last 15 years.
