So I'm trying to write some low-level code for Android, and my main concern is that I want to avoid ALL optimization by the JIT compiler (or anything else). After doing some research, the best approach seems to be to:
write Java bytecode by hand
convert it to a dex file using the "dx" command
run it on the device using the "dalvikvm" command (via adb shell) with the "-Xverify:none -Xdexopt:none" parameters specified
My question is: will this in fact avoid ALL optimization? The previous discussion here https://groups.google.com/forum/#!topic/android-platform/Y-pzP9z6xLw makes me unsure, and I can't 100% convince myself by reading the docs.
Any confirmation one way or the other is greatly appreciated.
Some of the instruction rewriting performed by dexopt cannot be disabled. For example, accesses to volatile long fields must be handled differently from accesses to non-volatile long fields, and the specialization is handled by replacing the field-get instruction with a different instruction.
The optimizations performed by dexopt take the form of instruction replacement, usually some sort of "quickening" that allows the VM to do a little less work. All such optimizations are performed statically, ahead of time, not dynamically at run time, so you will get consistent behavior. Enabling the dexopt optimizations doesn't introduce unknowns, it just changes from one set of knowns to a different set of knowns.
The biggest source of variation is going to be Dalvik's JIT compiler, which you can disable with -Xint:fast. See this slightly outdated doc for notes on how to configure this system-wide.
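For reference, the workflow described above ends up looking roughly like this on the command line (the file names, path, and class name are placeholders; the flags are the ones discussed here):

dx --dex --output=classes.dex MyTest.class
adb push classes.dex /data/local/tmp/
adb shell dalvikvm -cp /data/local/tmp/classes.dex -Xint:fast -Xverify:none -Xdexopt:none MyTest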
I have a working app which I need to speed up. I set up profiling (see here for details), which appears to report how much time each function takes. I cannot find a way to discover anything about the time consumed in different sub-parts of a function.
I then inserted the keyword "inline" in the declarations of some frequently accessed small functions hoping for some speedup. But when I profiled again, I saw the same list of functions, including the ones I'd made inline. This made me suspicious as to whether the inline keyword had just been ignored.
I have a vague recollection that with some compilers the inline keyword was something that the compiler could optionally ignore, depending on things like the amount of memory available.
So is there some check I could do to confirm whether or not the "inline" keyword has actually done its job?
You could try:
examining the compiler's assembly or machine code output (whether disassembling, or just checking for the function symbol with nm or whatever Android has; see the example at the end of this answer), or stepping through with a debugger
using a compiler pragma/attribute to force inlining (if available, for example GCC has a function attribute always_inline), if your profiling results aren't affected then presumably the compiler was already inlining
checking your profiling docs to make sure that however you're doing profiling doesn't inhibit inlining
As you recalled, inline (and the implicit inline on member functions defined inside their class) is just a hint to the compiler. Some people argue it's now just a convenient way to manage One Definition Rule issues, but you'd have to check individual C++ compilers' code to see if the keyword is really that meaningless these days. The compiler might use all sorts of metrics to work out when to inline, including the optimisation flags in effect, the size of the out-of-line function, the number of calls to the function (e.g. if there's only one, why not inline even a large function?) etc.
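As a concrete version of the first bullet, a quick symbol-table check on an NDK build might look like this (the library name, function name, and toolchain prefix are just placeholders for illustration):

arm-linux-androideabi-nm -C libnative.so | grep smallHelper

If the out-of-line definition has disappeared from the symbol table (and the code still links and runs), the calls to it were almost certainly inlined; if the symbol is still present, it may still have been inlined at individual call sites, so disassembling a caller (objdump -d) is the more reliable check.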
On a recent SO question, I explained how calling a RenderScript kernel multiple times will effectively force all threads to be globally synchronized between calls.
I am currently working with multiple convolutions applied in sequence to image data. Since the convolution algorithm requires reading surrounding pixel data of the input image, I have implemented a workflow where my own custom kernel is called multiple times -- to make sure that at every step, all data from the previous convolution is ready and available at the correct coordinates. This technique has worked great for me so far.
However, in my constant quest for optimization, I have noticed that there is much performance to be obtained by keeping intermediate values in local registers for a thread, instead of writing them back to the global memory allocation in between kernel calls. If I were able to chain these convolutions in such a way, things would run much quicker. The problem is obviously that accessing the registers of surrounding threads is not really possible. Furthermore, this would require threads to run in synch to make sure these intermediate values in between stages get calculated in the expected order.
In CUDA and OpenCL, these issues are very common, and are addressed by well-known barrier synchronization + shared memory tiling techniques, which in turn depend on the concept of CUDA thread blocks or OpenCL work groups. I believe these concepts are non-existent in RenderScript, as this issue is very much tied to the wildly different architectures between desktop-class GPUs and mobile SoCs.
So my obvious question here is, are such things possible in RenderScript? That is, better management of threads and possibly thread groups for quicker data sharing among them.
In the Google I/O 2013 RenderScript talk by Jason Sams and Tim Murray, it is discussed how Script Groups might be able to do some behind-the-scenes optimizations, such as cross-device parallelization, memory tiling, and kernel fusion; all this by analyzing the dependency DAG in the group at runtime, and either automatically creating allocations where needed or possibly optimizing them away. I'm assuming this last bit refers to fusing kernels so that they work off their own local data, much like what I mentioned above about keeping data in local registers and combining separate steps inside a single kernel.
All this seems very much in line with what I'm looking for, especially since my application is indeed a well-defined DAG of inter-dependent operations (for a Convolutional Neural Network). So if Script Groups are indeed a plausible mobile-centric alternative to these mechanisms, I'm wondering if there is any way of influencing how and where these optimizations happen. Or, if not, how much can the runtime be trusted to make the correct inference from my data dependencies given the hardware it's running on -- specifically for the "surrounding" pixel data access of the convolution algorithm?
I realize this might all still be work in progress, and methods would be highly hardware-dependent at this point. So if there is no straight solution for such matters at the present time, I'd be very much willing to accept a speculative answer on how this kind of workflow might potentially be approached by RenderScript in future releases.
I'd be immensely grateful for some insight on this, as it would greatly affect the development direction of my own project going forward, not to mention there are surely many other people out there wondering how such general parallel computing tasks can be handled in RS.
Thank you very much!
As you've discovered, there's no way in RS to directly share data across threads. However, what you are describing can be done using a ScriptGroup. The catch is that each script in the group has to be unique, so you cannot feed your same script over and over. At least, not as it is written now. You could certainly put the "core" of your script in a RS header and include it from multiple kernels.

The ScriptGroup allows you to have the output from one script become the input of another, or the output of one script become a global field in another. The documentation states that the kernel-to-kernel (output to input) connection is the more efficient use case.

Using this approach, your synchronization issue would be resolved, as the engine will execute the first script against the entire input data set before starting the second script, and so on. The scripts themselves will be parallelized appropriately for the hardware (using either the CPU or GPU/DSP). The engine will not have to pop back out to Java between scripts and can also manage the data allocations behind the scenes, if needed.
Something you may notice is that the ScriptGroup utilizes Script.KernelID or Script.FieldID to identify the exact kernel or field at which two scripts are connected. These IDs are auto-generated for your custom scripts as long as you explicitly call out your kernel function using the RS compiler attribute pragma. Then you can call getKernelID_<name> (where 'name' is the kernel function name from your script) to get the kernel ID.
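A minimal sketch of wiring two kernels together this way (the script names, kernel names, 'connectType', 'rs', and the Allocations are all hypothetical; only the ScriptGroup.Builder calls themselves come from the API described above):

ScriptC_convolve1 s1 = new ScriptC_convolve1(rs);
ScriptC_convolve2 s2 = new ScriptC_convolve2(rs);

ScriptGroup.Builder b = new ScriptGroup.Builder(rs);
b.addKernel(s1.getKernelID_stage1());
b.addKernel(s2.getKernelID_stage2());
// Feed the output of stage1 straight into stage2 (the more efficient
// kernel-to-kernel connection mentioned above); 'connectType' is the Type
// describing the intermediate data the engine may allocate for you.
b.addConnection(connectType, s1.getKernelID_stage1(), s2.getKernelID_stage2());
ScriptGroup group = b.create();

group.setInput(s1.getKernelID_stage1(), inAllocation);
group.setOutput(s2.getKernelID_stage2(), outAllocation);
group.execute();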
I want to find the total number of machine instructions executed by an Android application. I have explored the Debug.InstructionCount class of the Android SDK, but I believe it reports Dalvik VM instructions (not the machine-level instructions that actually execute on the processor).
I need this info to estimate the time required to execute an Android application on a particular processor (at a fixed frequency). I am aware that different types of instructions take different numbers of cycles, so the computation time cannot be estimated accurately, but I still want to do some experimentation. Thank you.
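For reference, the Dalvik-level counting I am referring to looks roughly like this (runTheWorkload() is just a placeholder for the code being measured):

Debug.InstructionCount icount = new Debug.InstructionCount();   // android.os.Debug
icount.resetAndStart();
runTheWorkload();
icount.collect();
int dalvikInstructions = icount.globalTotal();                  // Dalvik VM instructions, not machine instructions
int methodCalls = icount.globalMethodInvocations();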
My solution involves writing an instruction set simulator, running the app and counting the instructions. There are already open source avr simulators out there that you can just use/modify for this.
At the end of the day, in order to do this you have to follow the instruction flow. Either you actually simulate it, which automatically accounts for how many times the code really goes through a loop and such, or you write a disassembler (which is half of a simulator) and follow the code flow in execution order. The latter is actually much simpler than a full disassembler or simulator, but you have to deal with all the possible code paths and loops, finding the different paths and counting each of them. With minimal work you could come up with the shortest possible path and know the code could never be faster than that.
I am not sure if the terminology is correct: what coding practices can you use to make it difficult for someone to modify the binary/assembly to bypass a check?
For example, in the source code:
bool verificationResult = verify();
if (verificationResult) {
    allow_Something();
} else {
    prevent_Something();
}
A person looking at the disassembled version of the above code could modify the 'jump opcodes(?)' so that allow_Something runs even when the verification result is false.
Something similar is covered here
http://www.codeproject.com/Articles/18961/Tamper-Aware-and-Self-Healing-Code#pre0
Note I am creating the binary in C++ for it to be used via NDK on Android.
As the general consensus so far indicates, it's impossible to prevent anyone hell-bent upon "cracking" your APK from doing so. Obfuscation techniques will only increase the complexity required to "crack" the APK once. After it gets uploaded to the myriad of sites that offer to host APKs for free, it's just a Google search away from even the "noob-est" of Android noobs.
Also security through obscurity will NOT get you far.
Regarding protecting your APK from being hacked, I would recommend the following article that discusses the current state of license validation of APKs on Android. The techniques described in it should give you an idea of the common attack vectors that you need to safeguard against.
Proguard is a good place to start obfuscating your APK.
After you manage to obtain an obfuscated APK, DO run it through the following tools and observe the decompiled source. All of these are free, open-source tools that are very popular and will surely be the first thing that any decent "cracker" will try:
1. baksmali
2. apktool
3. Dex2Jar + JD-Gui
Keep adding layers of obfuscation to your code until you are satisfied that the output of the above tools is sufficiently hard to make sense of. (Again, do NOT underestimate what a college grad armed with Coke, pizza and a knowledge of DVM opcodes can accomplish over a weekend.)
Regarding the techniques discussed in the link you shared, I fail to see how they can be implemented to protect the .dex on Android. And if you end up implementing the verification logic in a separate .so, then all the "cracker" would need to do is patch the call in your Java code to the verify() function inside the .so.
UPDATE:
Additional obfuscation steps to secure the .so.
1. Do NOT follow a more or less linear path.
Adding additional jumps all over the place works by flooding the "cracker" with so many potential targets, each of which needs to be individually modified, patched, and then checked to see whether the protection has been bypassed.
2. Add timing checks
This is mainly to throw off the "cracker" by making the code follow different paths during debugging and during an actual run. If the time spent between two points is a lot more than usual, then it's a clear indication that your program is being debugged, i.e. time to jump into that part of the junk code that calculates the number of pianos in the world. (A tiny sketch of this idea follows item 3 below.)
3. Write self-modifying code
Again, this thwarts static analysis. For example, the jump into the verification function might not exist in the binary at all, but instead be patched in everywhere by some init() function in the .so.
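As a tiny illustration of the timing idea from point 2 (shown in Java for brevity; the same idea applies inside the .so, and the threshold, verify(), and the decoy function are all made-up names):

long start = System.nanoTime();
boolean ok = verify();                                  // the code a debugger would be stepping through
long elapsedMs = (System.nanoTime() - start) / 1000000;
if (elapsedMs > SUSPICIOUS_DELAY_MS) {
    countPianosInTheWorld();                            // decoy/junk path instead of an obvious "fail" branch
}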
All the above techniques (and more) are described with examples in the following article on anti-debugging techniques.
A more comprehensive guide is the "Ultimate Anti-Debugging Reference" by Peter Ferrie.
Avoid overly transparent checks. Try some basic workflow obfuscation (for example, XOR-ing the result); this can help defend against simple opcode replacement. But I assure you that if someone really, really wants to crack you, they can do it regardless of the complexity of your protection.
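A tiny sketch of that idea (every name here is made up for illustration): instead of branching on a bare boolean, fold the verification result into a value the program actually needs later, so patching a single jump is no longer enough:

int token = verify();                               // assumed to return a key-like value rather than just true/false
int key = token ^ SECRET_MASK;                      // a wrong token silently yields a wrong key
byte[] config = decrypt(encryptedConfig, key);      // produces garbage if the check was patched out
applyConfig(config);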
Dexguard is made by the same people who did Proguard, but it allows for even finer-grained options. That said, Proguard is more or less the industry standard for Android obfuscation. Though, as said above, if someone with the know-how wants to crack your app, there's no protection to be had for love or money.
The simple truth: you can't.
You can purchase utilities to obfuscate your object code but they are all trivially bypassed by any slightly motivated attacker. If your user can write to the program image (on disk or in memory) no amount of obfuscation will defend against it.
If it is extremely important, I recommend moving the important component to a device you control and provide some form of challenge-response code to access it. It won't prevent people from cracking it, but it can put up a much more significant barrier against it.
I just read that Android got a 450% performance improvement because it added a JIT compiler. I know what a JIT is, but I don't really understand why it is faster than normal compiled code, or what the difference is from the older approach on the Android platform (the Java-like approach of running compiled bytecode).
Thanks!
EDIT: This is hugely interesting, thanks! I wish I could pick every answer as correct :)
First a disclaimer, I'm not at all familiar with Android. Anyway...
There are two applications of JIT compilation that I am familiar with. One is to convert from byte-codes into the actual machine instructions. The second is Superoptimisation.
JIT bytecode compilation speeds things up because the bytecodes are only translated to machine instructions once, instead of being interpreted each time they are executed. This is probably the sort of optimisation you are seeing.
JIT superoptimisation, which searches for the truly optimal set of instructions to implement a program's logic, is a little more esoteric. It's probably not what you're talking about, though I have read reports of 100%-200% speed increases as a result.
The VM needs to turn compiled byte code into machine instructions to run. Previously this was done using an interpreter which is fine for code that is only invoked once but is suboptimal for functions that are called repeatedly.
The Java VM saw similar speedups when JIT versions of the VM replaced the initial interpreter versions.
The JIT compiler knows about the system it's running on, and it can use that knowledge to produce highly efficient code compared to bytecode; rumor has it that it can even surpass pre-compiled programs.
That's why it can go faster than the traditional Java approach, where the code was run as interpreted bytecode only, which is what Android used too.
Besides compiling Java code to native code, which an ahead-of-time compiler could do too, a JIT performs optimizations that can only be done at runtime.
A JIT can monitor the application's behavior over time and optimize those usage patterns that really make a difference, even at the expense of other branches in the execution path of the code, if those are less frequently used.