Does dx conversion to dex include verification of original class files?

I'm interested in doing some tinkering on compiled Class files before they're converted to dex files by dx. I've looked a bit at the official Dalvik documentation and also at comparisons between the DEX format and Class format. I can't find much information regarding the actual conversion process, class->dex. Does dx first verify the Class files before the conversion? Does it simply go field by field and method by method, merging groups of instructions into more compact groupings? Any insight would be appreciated.
Thanks.

The way that dx is run, it doesn't typically have sufficient information to do all possible verification, nor is it written to do so. In particular, part of verification has to do with how the code in one class refers to code in other classes, and when dx is run, the code for the "other classes" in question might not actually be available. For example, you could compile some code against Android API level 6, producing a .dex file. Later, when a device running API level 29 comes out, you could try to run that .dex file. It's only when the file is on a system and getting ready to run that the system has all the info needed to perform verification. At that point, it can compare the references in the .dex file against what's available on the system and either accept (pass verification of) or reject (fail verification of) that file.
As a brief example, maybe the .dex file refers to a class or method that existed in API level 6 but was removed as of API level 29.
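To make the "references are only checked against what is actually on the system" idea concrete, here is a minimal sketch in plain Java. It uses reflection as a loose analogy only: the class name ReferenceCheck, the helper resolves and the method name removedInSomeRelease are all made up for illustration, and this is not how the Dalvik verifier is implemented.

```java
// Hypothetical helper (not part of dx or Dalvik): asks the running system
// whether a class and a public no-argument method with these names exist.
final class ReferenceCheck {
    static boolean resolves(String className, String methodName) {
        try {
            Class<?> cls = Class.forName(className); // is the class available here?
            cls.getMethod(methodName);               // is the method available here?
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            // The reference cannot be satisfied on this system.
            return false;
        }
    }

    public static void main(String[] args) {
        // A reference that exists on every Java runtime.
        System.out.println(resolves("java.lang.String", "length"));
        // Made-up method name standing in for an API that was removed.
        System.out.println(resolves("java.lang.String", "removedInSomeRelease"));
    }
}
```

The same compiled code gives different answers depending on which libraries are present when it runs, which is why this kind of check cannot be completed at translation time.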
But to be clear, as @JesusFreke said, dx needs to be able to parse .class files enough to be able to do its job of translation. If it runs into a problem at that layer, it will report that as a failure to translate, which, in context, is about equivalent to a verification error, though it's not generally phrased as such.
Even disregarding the possibility of evolution of the API, it is possible to take a .class that wouldn't verify, succeed in translating it into a (part of a) .dex file, and then observe that the .dex file fails to verify.
I hope this helps!

I'm not as familiar with dx itself and the conversion process as with dalvik bytecode, but I don't recall seeing any verification of the original java bytecode, although obviously it has to be well-formed enough to be parsed/understood by dx.
There is no documentation on the conversion process that I am aware of. It involves converting the bytecode into a couple of intermediate formats (ROP, SSA), and includes some logic for efficient register allocation and some optimizations on the intermediate forms (I think).
For more information on the conversion process, your best bet is to look at the dx source itself (/dalvik/dx)
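As a rough illustration of what moving from a stack-based form to a register-based form means, here is a toy sketch. It is not dx's algorithm (dx goes through the ROP and SSA forms mentioned above and does real register allocation); the input mnemonics "push" and "add", the class name StackToRegister and the smali-like output text are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Toy translator: each operand-stack slot is mapped to a register
// numbered after the slot's depth. Purely illustrative.
final class StackToRegister {
    static List<String> translate(String[] stackOps) {
        List<String> out = new ArrayList<>();
        int depth = 0; // current height of the simulated operand stack
        for (String op : stackOps) {
            String[] parts = op.split(" ");
            switch (parts[0]) {
                case "push": // push <constant> onto the stack
                    out.add("const v" + depth + ", " + parts[1]);
                    depth++;
                    break;
                case "add": // pop two values, push their sum
                    if (depth < 2) {
                        throw new IllegalArgumentException("stack underflow at: " + op);
                    }
                    out.add("add-int v" + (depth - 2) + ", v" + (depth - 2) + ", v" + (depth - 1));
                    depth--;
                    break;
                default:
                    throw new IllegalArgumentException("unknown op: " + op);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] program = {"push 2", "push 3", "add"};
        for (String line : translate(program)) {
            System.out.println(line);
        }
    }
}
```

Note that the translator has to understand the input well enough to track the stack height, so malformed input (here, an "add" with too few operands) is rejected as a translation failure rather than as a verification error.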

Related

What is Smali Code Android

I am trying to learn a little about the Dalvik VM, dex and Smali.
I have read about smali, but I still cannot clearly understand its place in the chain of compilers, or what its purpose is.
Here are some questions:
As far as I know, Dalvik, like other virtual machines, runs bytecode; in the case of Android, it is dex bytecode.
What is smali? Does the Android OS or the Dalvik VM work with it directly, or is it just the same dex bytecode in a more human-readable form?
Is it something like a disassembler on Windows (like OllyDbg)? There, a program executable consists of machine code bytes (D3, 5F, for example), and each one has a corresponding assembly instruction. The Dalvik VM is also software, so is smali a readable representation of its bytecodes?
There is the new ART environment. Does it still use bytecode, or does it execute native code directly?
Thank you in advance.

When you build an application, the apk file contains a .dex file, which contains binary Dalvik bytecode. This is the format that the platform actually understands. However, it's not easy to read or modify binary code, so there are tools out there to convert to and from a human-readable representation. The most common human-readable format is known as Smali. This is essentially the same idea as the disassembler you mentioned.
For example, say you have Java code that does something like
int x = 42;
Assuming this is the first variable, then the dex code for the method will most likely contain the hexadecimal sequence
13 00 2A 00
If you run baksmali on it, you'd get a text file containing the line
const/16 v0, 42
This is obviously a lot more readable than the binary code. But the platform doesn't know anything about smali; it's just a tool to make it easier to work with the bytecode.
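To show how the four bytes and the smali line correspond, here is a small sketch that encodes a const/16 instruction by hand. It assumes the layout from the Dalvik bytecode documentation (opcode 0x13, an 8-bit destination register, a signed 16-bit literal, stored as little-endian 16-bit code units); the class name Const16Encoder and its helpers are made up for this example.

```java
// Hand encoder for a single instruction, for illustration only.
final class Const16Encoder {
    static byte[] encode(int register, int literal) {
        if (register < 0 || register > 0xFF) {
            throw new IllegalArgumentException("register must fit in 8 bits");
        }
        if (literal < Short.MIN_VALUE || literal > Short.MAX_VALUE) {
            throw new IllegalArgumentException("literal must fit in 16 bits");
        }
        return new byte[] {
            0x13,                          // opcode: const/16
            (byte) register,               // destination register (v0 -> 00)
            (byte) (literal & 0xFF),       // low byte of the literal
            (byte) ((literal >> 8) & 0xFF) // high byte of the literal
        };
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // const/16 v0, 42
        System.out.println(toHex(encode(0, 42))); // prints "13 00 2A 00"
    }
}
```

Reading it the other way round (13 is the opcode, 00 the register, 2A 00 the literal 42) is essentially what a disassembler such as baksmali does for every instruction.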
Dalvik and ART both take .dex files containing Dalvik bytecode. The difference is completely transparent to the application developer; all that changes is what happens behind the scenes when the application is installed and run.

High-level programming languages include extra tools to make programming easier and save the programmer time. If a compiled program were to be decompiled, getting back to the original source code would need a lot of code analysis to determine the structure and flow of the program, most likely in more than one pass. The decompiler would then have to structure the source based on the features of the compiler that compiled the code, the version of the compiler, and the operating system it was compiled on, e.g. whether any OS-specific features, frameworks, parsers or external libraries were involved, such as .net or dome.dll, and their versions, etc.
The next best result would be to output the whole program flow, as if the source code was written in one large file ie. no separate objects, libraries, dependencies, inheritances, classes or api. This is where the decompiler would spit out code which when compiled, would result in errors since there's no access to the source codes & structure of the other files/dependencies. See example here.
The 3rd & best option would be to follow what the operating system is doing based on the programmed instructions, which would be machine code, or dex (in case of Android). Unless you're sitting in the Nebuchadnezzar captained by Morpheus and don't have time to decode every opcode in the instruction set of the architecture your processor is running, you'd want something more readable than unicode characters scrolling on the screen as you monitor the program flow/execution.
This is where assembly code makes the difference; it's almost the direct translation of machine code, in a human readable format. I say "almost" direct because microprocessors have helpers like microcodes, multithreaders for pipelining & hardware accelerators to give a better user experience.
If you have the source code, you'd be editing in the language the code is written in. Similarly, if you don't have the source code, and you're editing the compiled app, you'd still be editing in the language the code is written in; in this case, it's machine code, or the next best thing: smali.
Here's a diagram to illustrate "Dalvik VM, dex and Smali" and "its place in chain of compilers".

Does the Android ART runtime have the same method limit limitations as Dalvik?

Does the Android ART runtime have the same method limit limitations as Dalvik?
Currently, there's a limit of 64k methods in the primary dex file

The issue is not with the Dalvik runtime nor the DEX file format, but with the current set of Dalvik instructions. Specifically, the various method invocation instructions, which look like this:
invoke-kind {vC, vD, vE, vF, vG}, meth@BBBB
B: method reference index (16 bits)
You can reference a very large number of methods in a DEX file, but you can only invoke the first 65536, because that's all the room you have in the method invocation instruction.
I'd like to point out that the limitation is on the number of methods referenced, not the number of methods defined. If your DEX file has only a few methods, but together they call 70,000 different externally-defined methods, you're going to exceed the limit.
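The arithmetic behind the limit can be sketched in a few lines. This is only an illustration of the 16-bit index, not code from dx; the class name MethodRefLimit, its helpers and the method descriptors in the usage example are all hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrates why a 16-bit method index caps invocable references at 65536.
final class MethodRefLimit {
    static final int MAX_INVOKE_INDEX = 0xFFFF; // largest value BBBB can hold

    // True if an invoke instruction could encode this method index.
    static boolean fitsInInvoke(int methodIndex) {
        return methodIndex >= 0 && methodIndex <= MAX_INVOKE_INDEX;
    }

    // True if every distinct referenced method can get an invocable index.
    static boolean allInvocable(Set<String> referencedMethods) {
        return referencedMethods.size() <= MAX_INVOKE_INDEX + 1;
    }

    public static void main(String[] args) {
        System.out.println(fitsInInvoke(65535)); // last index that fits
        System.out.println(fitsInInvoke(65536)); // one past the end

        // A file that defines few methods but references 70,000 external ones.
        Set<String> referenced = new HashSet<>();
        for (int i = 0; i < 70_000; i++) {
            referenced.add("Lcom/example/Lib;->m" + i + "()V"); // made-up names
        }
        System.out.println(allInvocable(referenced));
    }
}
```

The count that matters is the size of the set of referenced methods, regardless of whether they are defined in the same file or elsewhere.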
One way to fix this is to add additional instructions that take wider method references. An approach called "jumbo opcodes" was implemented and released in Android 4.0 (ICS), but was never fully put into action, and was later removed from the tree. (I occasionally see posts here with error messages from "dx" that reference jumbo ops, or from developers who stumbled over them.)
Note this is not the problem solved by the Facebook hack. That's due to a fixed-size buffer for holding class/method/field meta-data. There's no method-specific limit there; you can blow out the buffer by having lots of fields.
My understanding is that the current implementation of ART handles the same set of instructions that Dalvik does, so the situation will be no different.

Anwar Ghuloum said in this Android Developers Backstage episode that they're not going to fix the bytecode in the near future.
Instead, starting from Android L they will natively support multi-dex by collapsing all the dex files (from an APK) into a single oat file.

Efficient way of using proguard in android

I am trying to prevent my app from being decompiled and thus exposed. I know there is proguard, which I can use to convert the java files to .smali files. But my question is, how secure are these .smali files?
When I researched this, I found claims that .smali files can be converted back to java files. Is that true? If so, what is the best way to prevent the apk from being decompiled? My app includes a lot of financial details, so they must not be revealed to the outside world at any cost, or at least I want to make decompiling it very difficult.
Note: I have already done a lot of work on getting proguard working.
Your answer would be greatly appreciated.

Proguard is built in to later versions of the Android SDK. You just point to proguard.cfg and it will be used during release. I assume you know this bit.
Proguard is not related to smali. In the end all these tools output working bytecode, and you can always decompile bytecode. You can't stop that. What proguard can do is rename all the symbols in your code so that the result is very hard to understand.
If you mean you are storing sensitive info in string literals in your app then don't do that. These can't be obfuscated or else your app wouldn't work. They are always visible as literals in the byte code.
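To see why literals are easy to find, here is a minimal sketch of the kind of scan anyone can run over a binary file. It works on a made-up byte array rather than a real .dex file (real dex string data uses its own encoding and layout), and the class name PrintableStrings and the sample literal are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Finds runs of printable ASCII in arbitrary bytes, similar in spirit
// to the Unix "strings" tool.
final class PrintableStrings {
    static List<String> extract(byte[] data, int minLength) {
        List<String> found = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            if (b >= 0x20 && b < 0x7F) {       // printable ASCII range
                run.append((char) b);
            } else {                           // binary byte ends the run
                if (run.length() >= minLength) found.add(run.toString());
                run.setLength(0);
            }
        }
        if (run.length() >= minLength) found.add(run.toString());
        return found;
    }

    public static void main(String[] args) {
        // Stand-in for compiled code: binary bytes around a string literal.
        byte[] literal = "api_key=12345".getBytes(StandardCharsets.US_ASCII);
        byte[] blob = new byte[literal.length + 4];
        blob[0] = 0x13;
        blob[1] = 0x00;
        System.arraycopy(literal, 0, blob, 2, literal.length);
        blob[blob.length - 2] = 0x0E;
        blob[blob.length - 1] = 0x00;
        System.out.println(extract(blob, 4)); // prints "[api_key=12345]"
    }
}
```

Renaming classes and methods does nothing to hide such a value, because the program needs the literal intact in order to work.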

Does the Dalvik file format (*.dex) support more instructions than a Java .class file?

Is there anything the Dalvik VM supports (in terms of bytecode) which is not used currently because the .class files don't have it?
As an example, if people wrote their own source-to-dex converter for their functional language XYZ, would they be able to implement e.g. full tail calls, even though the .class format supports tail calls only under certain circumstances?

I'm no expert, but from what I can see, the answer would be no.
The following two sites list the Dalvik and JVM opcodes, and putting aside the fact that Dalvik is a register-based VM and the JVM is stack-based, the opcodes are fairly similar.
http://pallergabor.uw.hu/androidblog/dalvik_opcodes.html
http://en.wikipedia.org/wiki/Java_bytecode
Both of them are tailored specifically to handle the Java language (even though there are suggestions to lift this constraint in future versions of the JVM).
One of the problems with tail call optimization on Java is that the call stack is actually available to the program (for instance through new Throwable().getStackTrace(), which is also present on Android). If the VM did tail call optimizations, it would need to keep some bookkeeping for what it just "optimized away" in order to properly implement the getStackTrace method.
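A short sketch makes the stack trace issue visible. The method below is tail-recursive, yet on a VM that performs no tail call elimination every recursive call still shows up as a frame; the class name TailCallDepth and the helper framesAt are made up for this example.

```java
// Shows that a program can observe how many frames a tail-recursive
// call chain has left on the stack.
final class TailCallDepth {
    // Tail-recursive: the recursive call is the last thing the method does.
    static int framesAt(int remaining) {
        if (remaining == 0) {
            // Number of frames currently on the call stack.
            return new Throwable().getStackTrace().length;
        }
        return framesAt(remaining - 1);
    }

    public static void main(String[] args) {
        int base = framesAt(0);
        int deep = framesAt(5);
        // Without tail call elimination the five extra calls are all visible.
        System.out.println(deep - base);
    }
}
```

If the VM reused the frame for each tail call, the difference would collapse to zero unless it kept extra records, which is the bookkeeping cost described above.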

Android and Protocol Buffers

I am writing an Android application that would both store data and communicate with a server using protocol buffers. However, the stock implementation of protocol buffers compiled with the LITE flag (in both the JAR library and the generated .java files) has an overhead of ~30 KB, where the program itself is only ~30 KB. In other words, protocol buffers doubled the program size.
Searching online, I found a reference to an Android specific implementation. Unfortunately, there seems to be no documentation for it, and the code generated from the standard .proto file is incompatible with it. Has anyone used it? How do I generate code from a .proto file for this implementation? Are there any other lightweight alternatives?

I know it's not a direct answer to your question, but an extra 30 KB doesn't sound that bad to me. Even on EDGE that'll only take an extra 1 to 2 seconds to download. And memory is tight on Android, but not THAT tight -- 30 KB is only about 1/10th of one percent of the available application memory space.

Are there any other lightweight alternatives?
I'm taking this to mean "to using protocol buffers", rather than "for using protocol buffers with an Android application". I apologise if you are already committed to protocol buffers.
This site is about "comparing serialization performance and other aspects of serialization libraries on the JVM". You'll find many alternatives listed there.
While there is no mention of the memory footprint of the different implementations at the moment, I am sure it is a metric which the people on the wiki would be interested in.

Just to revive this archaic thread for anyone seeing it, the answer is to use Square's Wire library (https://github.com/square/wire)
As they mention themselves:
Wire messages declare public final fields instead of the usual getter methods. This cuts down on both code generated and code executed. Less code is particularly beneficial for Android programs.
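To illustrate the trade-off described in that quote, here is a hand-written sketch of the two styles. It is not code generated by Wire or by protoc; the class names Messages, PointWithGetters and PointWithFields are made up, and the method count helper is just a quick way to compare the two.

```java
// Compares a getter-based message class with a public-final-field one.
final class Messages {
    // Getter style: one private field plus one accessor method per property.
    static final class PointWithGetters {
        private final int x;
        private final int y;
        PointWithGetters(int x, int y) { this.x = x; this.y = y; }
        int getX() { return x; }
        int getY() { return y; }
    }

    // Field style: immutable public fields, no accessor methods at all.
    static final class PointWithFields {
        public final int x;
        public final int y;
        PointWithFields(int x, int y) { this.x = x; this.y = y; }
    }

    // Number of methods a class declares itself (constructors not included).
    static int declaredMethodCount(Class<?> type) {
        return type.getDeclaredMethods().length;
    }

    public static void main(String[] args) {
        PointWithFields p = new PointWithFields(3, 4);
        System.out.println(p.x + p.y); // fields are read directly
        System.out.println(declaredMethodCount(PointWithGetters.class));
        System.out.println(declaredMethodCount(PointWithFields.class));
    }
}
```

Both classes are immutable, but the field style declares fewer methods per message type, which adds up when a schema has many messages and fields.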
They also internally build using the Lite runtime I believe.
And of course Proguard, the new Android 2.0 minify tools, [other generic answers], etc etc.

Use ProGuard[1] on your project. It will reduce the size of jars included in APK file.
[1] http://developer.android.com/guide/developing/tools/proguard.html
