Dalvik’s smali static code analysis with Python, Part II

If you still haven’t read the first blog post of this series, I highly recommend you do. You can grab it here: Dalvik’s smali static code analysis with Python.

Without further ado, lets dive straight into business. Continuing the topic of static data extraction from smali files such as classes info, method deceleration and invocation. Why only extract these types of data and not something like other glorious types of opcodes? Well, It would take a really long time and I needed a quick solution for my main question – “What could have called the function I’m currently reading?”

When dealing with regular code and utilizing some advanced IDE’s you might as well click on the “call hierarchy” button, if you have one, and check out all the method invocations done all the way until the method you are standing on. But with obfuscated code, it is much harder for the IDE’s to track these calls and tell you what’s the call hierarchy.

Personally I found that when dealing with a large and obfuscated code base, all of the major IDE’s cant cope with displaying call hierarchies. When dumping all of our extracted data into a database, you could by theory track method calls and make queries in order to recursively print call hierarchies. So that’s exactly what we’ll do.

We will start with creating the database and all the relevant schemes for our call tracking. We will use python’s sqlite3 library because we don’t need crazy amount of processing power, at least for this stage. The following present the scheme layout:

This gives us a nice start for inserting all of the extracted data to be later selected by our recursive matcher function. The id field is simply the full representation of a smali class+method. For example: ibe/a->c when ibe is the parent class, a is the child class and c is the method declared and defined inside a class. The data column contains all of the method’s smali code. It is much simpler containing this inside the database and not extracting this straight out of the smali files each time we want to access some method’s data. calling_to column is for listing all of the method invocation done inside a single method definition.

Lets try to give an example. Imagine you have the following class definition and a method definition like so:


Let’s try to deconstruct this code ourselves. What come after .class is the type of the class and the full path that the class is contained in. The .super is the super class, or the base class from which the class should derive. The .source is simply for stating in which file Java file the code should reside. The .method directive is what we are really after, and it states what method is being declared, what it’s type is, it’s parameter types, it’s return type and all of the method is defined up until the closest .end method. Also, all of the invoke- directives are for method calling. This is exactly where we extract what methods are being called.

I’ll skip the part where we scan for all smali files inside a directory and load all of the extracted data into the database. This could be done as seen in parse_smali_file method.

After understanding the most basic directives, and ignoring some for the sake of simplicity, we can start building our method matcher. Let’s assume we stumble upon a method, and we would like to know who across all the application’s code could have invoked it. After populating all of the defined methods inside the application’s smali code files, we could select from the database all of the rows containing the relevant method ID inside their calling_to column. This is achieved by:

For each method received by this select statement, apply the same statement recursively. This will result in something like this:

downloadThe full image can be downloaded from here.

To summarize it all, by applying all of the above we could achieve greater accomplishments which we would talk about extensively later on the series. The matcher function and the graph populating function can be grabbed from here.

Dalvik’s smali static code analysis with Python

A while ago I played around with android applications, trying to figure out new ways I could make my life easier as a vulnerability researcher. I stumbled upon Caleb Fenton’s amazing Simplify. This neat framework executes application’s code, trying to make it more human-approachable. This allows researchers like me understand obfuscated code better. Going through Simplify’s code was a bit shocker, as I wasn’t a big fan of java frameworks, not to talk about code analysis done in java.

So being the stubborn fella I am, I tried to implement my own version of smali code analysis with Python. Dalvik, which is Android’s Java Virtual Machine implementation uses smali code as it’s assembly-level opcodes.

The following regular expressions take care of extracting data from any relevant opcodes for my needs.

For class deceleration extraction:

This results in a result group containing only one member – the name of the class.

For method deceleration extraction:

Quite a fancy regex, but this extracts into a single result group: methods’ name, methods’ parameters, methods’ return value, and all of the methods’ data declared inside a single class.

For method invocation extraction:

When applying this regex on the captured method’s data as seen above, you get all of the called methods’ parameters values, methods’ object type, methods’ name, methods’ parameter object types and the methods’ return object type.

For method return deceleration extraction:

The move-result opcode moves the result of the last invoked method inside a register. This allows us to track the returned values from various method invocations inside a single smali class.

The whole python package and it’s features will be discussed in a future blog post, but for now, the complete code extracting all of the data above from a set of smali files can be found here: https://github.com/0rka/smalidroid/blob/master/parser.py