I have heard that ARM processors can switch between little-endian and big-endian. What do processors need this for? Is it used on Android phones?
Depending on the processor, it may be possible to switch endianness on the fly. Older processors boot up in one endian state and are expected to stay there; in that case the whole design will generally be built around either big- or little-endian operation.
The primary reason for supporting mixed-endian operation is to support networking stacks where the underlying datasets being manipulated are native big-endian. This is significant for switches/routers and mobile base-stations where the processor is running a well-defined software stack, rather than operating as a general purpose applications device.
Be aware that there are several different implementations of big-endian behaviour across the different ARM Architectures, and you need to check exactly how this works on any specific core.
You can switch endianness, but you wouldn't do that after the OS is up and running. It would only screw things up. If you were going to do it, you'd do it very early on in the boot sequence. By the time your app is running, the endianness is chosen and won't be changed.
Why would you do it? The only real reason would be if you were writing embedded software that had to deal with a lot of big-endian data, or had to run a program that was written assuming big-endian and never fixed to be endian-agnostic. That kind of data mostly comes from networking, since network byte order is big-endian. There aren't many other reasons to do it: in practice you'll see ARM, and Android in particular, run almost exclusively in little-endian mode.
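If the concern is just reading data of the "wrong" endianness, the usual answer in application code is to state the byte order explicitly rather than reconfigure the CPU. A minimal Java sketch (Java only because the question is about Android; the byte values are illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianDemo {
    public static void main(String[] args) {
        // The byte order of the underlying CPU (little-endian on Android/ARM in practice).
        System.out.println("Native order: " + ByteOrder.nativeOrder());

        // Reading a 32-bit value from a little-endian file/wire format,
        // independent of whatever the CPU happens to be.
        byte[] raw = {0x78, 0x56, 0x34, 0x12};            // 0x12345678 stored little-endian
        int fromLittle = ByteBuffer.wrap(raw)
                                   .order(ByteOrder.LITTLE_ENDIAN)
                                   .getInt();
        System.out.printf("0x%08X%n", fromLittle);         // prints 0x12345678

        // The same bytes interpreted as big-endian (network byte order).
        int fromBig = ByteBuffer.wrap(raw)
                                .order(ByteOrder.BIG_ENDIAN)
                                .getInt();
        System.out.printf("0x%08X%n", fromBig);            // prints 0x78563412
    }
}
```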
I am running a comparison between lightweight and non-lightweight ciphers.
My chosen lightweight cipher is Clefia, a 128-bit block cipher by Sony, and I am comparing it to the well-known AES, with both ciphers using 128-bit keys.
My comparison is being run on a real mobile device running Android OS (a Samsung Galaxy S3).
The paper about Clefia states that it is faster than AES.
This seems plausible, given that it is a lightweight algorithm intended for resource-constrained devices.
In order to run both ciphers on Android, I converted the official Clefia code, written in C, to Java as-is. (C could probably be compiled for Android via the NDK, but I am not sure.)
For AES, I used the standard javax.crypto classes (there are lots of examples on the internet for that).
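(For reference, the AES side boils down to the standard javax.crypto pattern; the snippet below is only a minimal sketch of that kind of usage, not my actual benchmark code, and ECB/NoPadding is used purely to exercise the raw block cipher.)

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

public class AesDemo {
    public static void main(String[] args) throws Exception {
        // Generate a 128-bit AES key (in a real app the key would come from a KDF or keystore).
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();

        // ECB/NoPadding is used here only to mirror a raw block-cipher comparison;
        // it is not recommended for real data.
        Cipher cipher = Cipher.getInstance("AES/ECB/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key);

        byte[] block = new byte[16];                 // one 128-bit block
        byte[] encrypted = cipher.doFinal(block);
        System.out.println("Ciphertext length: " + encrypted.length);
    }
}
```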
What struck me is that the complete opposite happened: instead of Clefia being faster, AES was around 350 times faster than Clefia.
The only reason I can think of is that the code posted on the official Clefia website is not optimized, which the authors admit; the following is copied from their code:
* NOTICE
* This reference code is written for a clear understanding of the CLEFIA
* block cipher algorithm based on the specification of CLEFIA.
* Therefore, this code does not include any optimizations for
* high-speed or low-cost implementations or any countermeasures against
* implementation attacks.
I can assume (though I may be wrong) that the javax.crypto classes use a much more optimized version of AES.
This is the only reason I can think of for such a huge difference in speed.
Therefore, my questions are as follows:
When we say "optimized", what is meant technically? Fewer rounds at the expense of security? Different code? Something else?
Could such a difference in speed be explained differently, i.e. could something other than optimization account for it?
I still could not locate an optimized version of Clefia, and I am not sure whether Java includes one in the latest JDK, given that Clefia is now a standard. Is producing an optimized implementation left to the user who wants to use the algorithm, or does the party that proposed the algorithm usually offer one?
Any ideas, insights, and thoughts are highly appreciated. (If you find a logical flaw in what I posted, please feel free to point it out. Also note that I was going to post this on http://crypto.stackexchange.com, but the user base there is much smaller and this involves Java, so for the time being I am posting it here; if you think it should be moved, please advise. Also, I do not mind sharing the code of both Clefia and AES if needed.)
Hardware Speed
In the paper you refer to, they show that Clefia, when implemented in hardware, can be faster than AES when measured in Kbps/gate: the best Clefia implementation achieves 268.63 Kbps/gate and the best AES 135.81 Kbps/gate, which is around a factor of 2.
Software Speed
They also have a comparison of software implementations, where Clefia is a bit slower at 12.9 cycles/byte than AES at 10.6 cycles/byte.
So the speeds of the two algorithms themselves are within a factor of 2 of each other.
Now, the problem is that you are comparing a highly optimized, and possibly even hardware-backed, implementation (the ARMv8 instruction set now includes instructions that perform a full AES round in a single instruction) to your own Java port of an implementation that was not optimized in the first place (the original code even states: "this code does not include any optimizations for high-speed").
Also, how big is the data set you are testing on? And how has the effect of JIT compilation been accounted for in the test?
If you want a comparable result, you ought to implement the AES algorithm in Java yourself as well and then do the comparison. My guess is that this approach would give a comparably slow implementation of AES too.
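On the JIT point: on Dalvik/ART (and HotSpot) the first iterations of a loop run interpreted, so a fair micro-benchmark should warm up before timing. A minimal sketch of such a harness; the Workload interface, key, and buffer size are invented here, and the same wrapper would be used around your Clefia port:

```java
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

public class CipherBench {
    // Minimal workload abstraction; plug in the Clefia port here the same way.
    interface Workload {
        void run() throws Exception;
    }

    static long averageNanos(Workload w, int warmupIters, int timedIters) throws Exception {
        // Let the JIT compile the hot path before measuring.
        for (int i = 0; i < warmupIters; i++) w.run();
        long start = System.nanoTime();
        for (int i = 0; i < timedIters; i++) w.run();
        return (System.nanoTime() - start) / timedIters;   // average ns per iteration
    }

    public static void main(String[] args) throws Exception {
        // A fixed all-zero key and a 16 KiB buffer, both purely illustrative.
        SecretKeySpec key = new SecretKeySpec(new byte[16], "AES");
        Cipher aes = Cipher.getInstance("AES/ECB/NoPadding");
        aes.init(Cipher.ENCRYPT_MODE, key);
        byte[] data = new byte[16 * 1024];

        long ns = averageNanos(() -> aes.doFinal(data), 1000, 1000);
        System.out.println("AES: " + ns + " ns per 16 KiB buffer");
    }
}
```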
All the intro texts to OpenGL ES repeat that, since it's based on OpenGL, it's designed around a client/server model, though these two things tend to be on the same machine.
Well, I would like to put them on separate machines (on the same local network). Is this possible in Android? How can it be done? Extra kudos if you can figure out how to work this into a libgdx scenario (which is the gaming library I use).
(Long-winded and perhaps unnecessary further information: my use case is faster prototyping of Android games for phones. It's pretty straightforward to get finger taps, accelerometer data, and so on and send them over the network to a PC. If I can have the PC send GL calls to the phone, then I can effectively run the entire game from the PC while it appears to run on the phone. This lets me test whether a game or game idea will work on the phone and its GPU, with the advantage of far superior RAM/CPU/compile times/hot-swapped code, and just see what works on a phone before worrying about fitting everything into the RAM and CPU footprint and logistics of a handset. I know I can do this by deconstructing rendered frames, sending byte[] arrays to the device, and using libgdx Pixmap or Android BitmapFactory to get the image and render it; but if it's simple to stream GL calls instead, I'd rather do that, especially since it's a more realistic test of the phone GPU's rendering ability.)
There is a difference between a protocol supporting remote operation and an implementation of a server or client that does the remote operation. I don't think there are existing Android implementations that support anything like this. I suspect any of the "remote desktop" apps just forward 2D images, and don't do anything with OpenGL.
That said, there isn't anything particularly preventing you from implementing a new libGDX backend that would "remote" OpenGL calls to a server that runs on the phone and forwards those operations to the local OpenGL backend. (I can only say this with confidence because I have not looked at it in any detail....)
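To make the idea concrete: the core of such a backend would just serialize each GL call as an opcode plus its arguments onto a socket, and the phone side would replay them against its local GLES implementation. A very rough, hypothetical sketch of the desktop side (the opcodes and wire format are invented; a real backend would have to implement libGDX's full com.badlogic.gdx.graphics.GL20 interface this way):

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.Socket;

// Rough sketch of the "remote GL" idea: each GL call is encoded and written to a socket.
// Only two calls are shown, and the opcodes/wire format are made up for illustration.
public class RemoteGlSketch {
    private static final byte OP_CLEAR_COLOR = 1;
    private static final byte OP_CLEAR       = 2;

    private final DataOutputStream out;

    public RemoteGlSketch(Socket toPhone) throws IOException {
        this.out = new DataOutputStream(toPhone.getOutputStream());
    }

    public void glClearColor(float r, float g, float b, float a) throws IOException {
        out.writeByte(OP_CLEAR_COLOR);
        out.writeFloat(r);
        out.writeFloat(g);
        out.writeFloat(b);
        out.writeFloat(a);
    }

    public void glClear(int mask) throws IOException {
        out.writeByte(OP_CLEAR);
        out.writeInt(mask);
    }

    // The phone side would read the same stream, decode the opcodes, and replay the
    // calls against its local OpenGL ES implementation.
}
```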
However, given that one of the bigger bottlenecks in OpenGL performance is (generally) the bandwidth between the client and the GPU (e.g., uploading textures, vertex data, shaders, etc), adding a network is only going to exacerbate that problem and will make it hard to reason about actual performance on a phone.
You're probably better off running on your desktop and using profiling to make sure you only use a "reasonable" amount of CPU and GPU resources.
I'm writing an Android app that involves some form of pattern recognition to count the number of similar objects in an image. The app would be designed to work with a specific type of object and would not involve machine learning.
Is the computation and processing for such a scenario feasible on the device, or would it be better to send the image to a remote server?
If the computation can be handled by the device, would a first-generation device running Android 2.2, with a 528 MHz CPU and 288 MB of RAM, be able to return a result within a reasonable amount of time?
It completely depends on your algorithm. There's no universal pattern recognition/image processing algorithm, even for your somewhat specific case of counting similar objects.
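To make "it depends" concrete: a naive approach such as thresholding and counting connected bright regions is a single O(width x height) pass, which should be manageable on a 528 MHz device at modest resolutions, while anything like per-pixel template matching multiplies that cost quickly. A rough Java sketch of the naive approach (the threshold and 4-connectivity choices are purely illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: counts connected bright regions in a grayscale image, which is
// one naive way to "count similar objects". Cost is O(width * height), so on a slow
// device the image resolution is what dominates the runtime.
public class BlobCount {
    public static int countBlobs(int[] gray, int width, int height, int threshold) {
        boolean[] visited = new boolean[gray.length];
        Deque<Integer> stack = new ArrayDeque<>();
        int blobs = 0;

        for (int start = 0; start < gray.length; start++) {
            if (visited[start] || gray[start] < threshold) continue;
            blobs++;
            // Flood-fill the whole bright region so it is counted only once.
            stack.push(start);
            visited[start] = true;
            while (!stack.isEmpty()) {
                int p = stack.pop();
                int x = p % width, y = p / width;
                int[] neighbours = {p - 1, p + 1, p - width, p + width};
                boolean[] inBounds = {x > 0, x < width - 1, y > 0, y < height - 1};
                for (int i = 0; i < 4; i++) {
                    int n = neighbours[i];
                    if (inBounds[i] && !visited[n] && gray[n] >= threshold) {
                        visited[n] = true;
                        stack.push(n);
                    }
                }
            }
        }
        return blobs;
    }
}
```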
I am planning to build an embedded system for processing the sound of my guitar, like a POD, with audio input and output and so on, and a program with presets, options, etc. on a small LCD screen, which should be multitouch for navigation.
I am at the very beginning and don't know where to start or what system I should use.
It should support the features I wrote above (like multitouch) and should be free.
Embedded Linux,
or
Android
or what?
Are you using off-the-shelf effects modules with some sort of interface to an embedded system, or are you planning on doing the effects in your program as well? I assume the latter in this response; please clarify if I have misunderstood the nature of the project.
Do your system engineering...
You are going to need to deal with the analog side of the inputs and outputs. Even digital inputs and outputs are analog in some respects if you want to keep the signals clean, and even optical is analog between the optical interface and the processor's interface.
(I know this is long, keep reading it will converge on the answer to your question)
You will have some sort of hardware-to-software data input interface. If you choose to support different interfaces, you will ideally want to normalize the data into a common format and data rate so that the effects processing only has to deal with it one way (avoiding a pile of if-then-elses in the code: if the bit rate is this then..., else if the bit rate is this then..., else if the bit rate is this and the data is unipolar then..., else if the bit rate is this and the data is bipolar then...).
The guts of the effects processing are as complicated as you want to make them: one effect at a time, or several? For each effect, define the parameters you are going to allow to be adjusted (I would start with the minimum number, which might be none, then add parameters later once it is all working). These parameters are going to need to be global in some form or fashion so that the user interface can get at them and modify them for the effects processing, as sketched below.
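As an illustration of what "parameters the user interface can get at" might look like in code, here is a hypothetical shape for an effect module (the names and the normalized sample format are invented for this sketch):

```java
// Sketch of one way to expose per-effect parameters to both the audio path and the UI:
// each effect processes a block of samples in the normalized common format and
// publishes the knobs it supports. All names here are invented for illustration.
public interface Effect {
    String name();
    String[] parameterNames();                  // what the UI is allowed to adjust
    void setParameter(String param, float value);
    void process(short[] samples, int count);   // in-place, common sample format
}
```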
The output is the same story as the input: a lot of analog work, converting from the normalized data stream into whatever the interface wants or needs, or whatever you define it to be.
Then there is the user interface... the easy part.
...
The guts of the effects-processing software can be system-independent code, and it is probably more comfortable to develop and test it on a desktop/laptop than on the target system, bearing in mind that the code should be written to be system- and operating-system-independent as well as embeddable (avoid floating point, divides, lots of local variables, etc.).
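A tiny example of the "avoid floating point" advice: convert a knob value to fixed point once, outside the audio path, and keep the per-sample loop to integer multiplies and shifts. This is only a sketch; clipping/saturation is not shown:

```java
// Minimal example of keeping floating point out of the per-sample loop: the gain is
// converted once to Q15 fixed point when the UI sets it, and the audio loop uses only
// integer arithmetic. Purely illustrative; a real effect would saturate the result.
public class FixedPointGain {
    private volatile int gainQ15 = 1 << 15;           // 1.0 in Q15

    public void setGain(float gain) {                  // called rarely, from the UI side
        gainQ15 = (int) (gain * (1 << 15));
    }

    public void process(short[] samples, int count) {  // called for every audio block
        long g = gainQ15;                               // read the volatile once per block
        for (int i = 0; i < count; i++) {
            samples[i] = (short) ((samples[i] * g) >> 15);
        }
    }
}
```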
Sometimes, if not often, in an enclosed system with some sort of user interface on the same black box (knobs or buttons, a screen of some sort, a touch screen, etc.), one subsystem manages the user interface while the other performs the task, with a connection between them. It is not always done that way, but it is a nice clean design, and it allows, for example, a product designed yesterday with buttons, knobs, and say a two-line LCD panel to be modernized with a touch screen at a fraction of the effort; and tomorrow there may be some fiber that plugs directly into a socket in the back of your head, who knows.
Another reason to separate the processing tasks is that it is easier to ensure the effects processor never gets bogged down by user interface work. You don't want turning a virtual knob on your touchscreen, and the graphics load of drawing the picture, to cause your audio to get garbled or turn into a nasty whine. Basically, the effects processor is real-time critical: you don't want to pick a string on the guitar and have the sound come out of the amp three seconds later because the processor is also drawing an animated background on your touchscreen. That processing needs to be tight, fast, and deterministic; every if-then-else in the code has to be accounted for and balanced. If you allow multiple effects in parallel, your processor needs the bandwidth to process all of the effects without a noticeable delay; if it is only one effect at a time, the processor needs to be chosen to handle the single effect with the worst computational cost. The worst that could happen is that the input-to-output latency varies because of something the GUI processing is doing, making the music sound horrible.
So you can build the effects processor with its user interface being, for example, a serial interface and a protocol across that interface (which you define) for selecting effects and changing parameters. You can get the effects processor up, working, and tested using your desktop and/or laptop connected through the serial interface, with some ad hoc code used to change parameters, perhaps a command-line program.
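For example, the serial protocol could be as simple as text lines like "select chorus" or "set delay.time 250", which you can type from a terminal program during bring-up. A toy parser for the effects-processor side (the command names and format are invented here; the real protocol is whatever you define):

```java
// Toy version of the kind of serial protocol described above. The two commands and
// their syntax are made up for illustration.
public class CommandParser {
    public void handleLine(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length == 0 || parts[0].isEmpty()) return;

        switch (parts[0]) {
            case "select":                       // e.g. "select chorus"
                if (parts.length >= 2) selectEffect(parts[1]);
                break;
            case "set":                          // e.g. "set delay.time 250"
                if (parts.length >= 3) setParameter(parts[1], Float.parseFloat(parts[2]));
                break;
            default:
                // Unknown command: ignore, or echo an error back over the link.
                break;
        }
    }

    private void selectEffect(String name)              { /* switch the active effect */ }
    private void setParameter(String name, float value) { /* update the shared parameter */ }
}
```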
Now is where it becomes interesting. You can get an off-the-shelf embedded Linux system, for example, or embedded Android or whatever, write your app so it speaks the serial protocol, and if need be glue, bolt, tape, or mold this user interface system on top of, around, or next to the effects processor module. Note that you could have all of the platforms suggested: an Android version, a Linux (without Android) version, a Mac version, a Windows version, a DOS version, a QNX version, an Amiga version, you name it. You can try 100 different user interface variations on the same OS: maybe you want the knobs to be sliders, or up/down push buttons, or a dial-looking thing you rotate with a two-finger touch, or some other multi-touch gesture.
And it gets better: instead of, or in addition to, serial you could use a Bluetooth module. Your user interface could be an iPhone app, an Android phone app, a Linux or Windows laptop app, your desktop computer, etc., all of which are (relatively) easy platforms for writing graphical user interfaces for selecting things.
Another approach, of course, could be Ethernet, in particular wireless Ethernet; then your user interface could be a web page, and the bulk of your user interface work has already been done by the Firefox or Chrome or other browser team. (Wireless Ethernet, Bluetooth, ZigBee, or similar also allows the effects processor to sit somewhere convenient; it doesn't have to be within arm's/foot's reach of you.)
...
Do your system engineering. Break the problem into a few big modules, define the interfaces between the modules, and then, if necessary, repeat the system engineering inside those modules until you get down to easily digestible bites. The better the system engineering and the better defined the interfaces between modules, the easier the project will be to implement.
...
I would also investigate the XCore processors from XMOS; they have a very nice simulator with VCD waveform output that you can also use to accurately profile your effects processing. Personally, I would have a very tough time not choosing this platform for this project.
You should also investigate the OMAP from TI; this is what is on a BeagleBoard. You get a nice ARM that already has Linux and other things ported and running on it, but you also get a DSP block, and that DSP block could do your effects processing, likely in a way that the two don't interfere. You lose the ability to physically separate your user interface processor and effects processor, but you gain elsewhere, and you can probably use a BeagleBoard off the shelf to develop a prototype (using analog audio in and out). I actually liked the Hawkboard better (with the Hawkboard you get a usable system out of the box; with the BeagleBoard you spend another BeagleBoard's worth of money on stuff that should have been on the board), but last I saw they had an instability flaw in the PCB design.
I am not up on the specs, but the Tegra (a number of upcoming phones are or will be Tegra-based), like the OMAP, should give some parallel processing with a lean toward audio/video as well as GUI work. You only need the audio and GUI (the easier two of the three). I think there is a development platform for sale that has a touchscreen and runs the popular embedded OSes.
If you are trying to save money by making one of these things yourself: stop now and go to the store and buy one. The homebrew one will cost a lot more, even if all the design tools are free; the hardware, and the melted-down guitars and guitar amps, are not. I speak from experience: many times I have spent thousands of dollars on a homebrew project to avoid buying some off-the-shelf $300 item. I learned an awful lot, and personally the building of the thing is more fun than the using of it; I normally shelve it once it is finally working. YMMV.
If I have misunderstood your question, please let me know and I will edit/remove/replace all of it with a different (short) answer.
In fact, it depends on what kind of hardware you want to run and interface with (and, as a consequence, how much you will have to work at the driver level... or not).
The problem with Android remains the same as with bare Linux. It could even be worse if there is no framework-level (Java) library, since you will then have to manage both the C part (through JNI) and the Java part.
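For reference, the Java side of that JNI split is small; the library and method names below are purely illustrative, and the matching C implementation would live in the NDK part of the project:

```java
public class NativeEffects {
    static {
        // Loads libeffects.so built with the NDK; the library name is made up for this sketch.
        System.loadLibrary("effects");
    }

    // Implemented in C and bound through JNI; processes one block of audio in place.
    public static native void process(short[] samples, int count);
}
```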
Work the specs... then you will choose wisely...
Reminder: Android is Linux-based.
Go for Android:
With any other embedded OS you will have too much integration work to deal with.
You can start by buying off-the-shelf hardware (a Galaxy Tab, an HTC phone, etc.) to begin your development and reach a prototype fast.