If you haven’t read the first post in this series yet, I highly recommend you do. You can grab it here: Dalvik’s smali static code analysis with Python.
Without further ado, let’s dive straight into business. We continue with static data extraction from smali files: class information, method declarations and method invocations. Why extract only these kinds of data and not the many other glorious types of opcodes? Well, that would take a really long time, and I needed a quick answer to my main question: “What could have called the function I’m currently reading?”
When dealing with regular code in an advanced IDE, you can simply click the “call hierarchy” button, if it has one, and see every chain of method invocations that leads to the method you are looking at. With obfuscated code, however, it is much harder for an IDE to track these calls and show you the call hierarchy.
Personally, I found that with a large, obfuscated code base, none of the major IDEs can cope with displaying call hierarchies. If we dump all of our extracted data into a database, we can, in theory, track method calls and run queries that recursively print call hierarchies. So that’s exactly what we’ll do.
We will start by creating the database and the schema needed for our call tracking. We will use Python’s sqlite3 library because we don’t need a crazy amount of processing power, at least at this stage. The following presents the schema layout:
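A minimal sketch of such a schema, for illustration only. The table name `methods`, the helper `create_database` and the choice to store `calling_to` as newline-separated text are my assumptions; the three column names are the ones described in this post.

```python
import sqlite3

# Assumed table name "methods"; column names follow the post's description.
SCHEMA = """
CREATE TABLE IF NOT EXISTS methods (
    id         TEXT PRIMARY KEY,  -- full class+method representation
    data       TEXT,              -- the method's smali code
    calling_to TEXT               -- invoked method ids, one per line (assumed encoding)
);
"""

def create_database(path=":memory:"):
    """Open (or create) the SQLite database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

conn = create_database()
```

Passing a file path instead of `":memory:"` would keep the extracted data between runs, which is handy when the smali tree is large.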
This gives us a nice start for inserting all of the extracted data, to be selected later by our recursive matcher function. The `id` field is simply the full representation of a smali class+method. For example, an id can name `ibe` as the parent class, `a` as the child class and `c` as the method declared and defined inside class `a`. The `data` column contains all of the method’s smali code. It is much simpler to keep this inside the database than to extract it straight out of the smali files each time we want to access a method’s data. The `calling_to` column lists all of the method invocations made inside a single method definition.
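To make the three columns concrete, here is a small sketch that builds one row. Everything specific in it is hypothetical: the package `com/example`, the `make_method_id` helper, the `methods` table name and the newline-separated `calling_to` encoding are assumptions for illustration. Only the `Lclass;->method(args)ret` reference style is standard smali.

```python
import sqlite3

def make_method_id(class_descriptor, method_signature):
    """Join a class descriptor and a method signature, smali-reference style."""
    return f"{class_descriptor}->{method_signature}"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE methods (id TEXT PRIMARY KEY, data TEXT, calling_to TEXT)")

# Hypothetical names: parent class ibe, child (inner) class a, method c.
method_id = make_method_id("Lcom/example/ibe$a;", "c()V")

# The method body as it would appear in the smali file.
method_data = "\n".join([
    ".method public c()V",
    "    invoke-static {}, Lcom/example/ibe;->b()V",
    "    return-void",
    ".end method",
])

# Every method invoked inside c(), one id per line (assumed encoding).
calling_to = "Lcom/example/ibe;->b()V"

conn.execute(
    "INSERT INTO methods (id, data, calling_to) VALUES (?, ?, ?)",
    (method_id, method_data, calling_to),
)
conn.commit()
```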
Let’s look at an example. Imagine you have the following class definition and method definition:
Let’s try to deconstruct this code ourselves. What comes after `.class` is the class’s access flags and the full path of the class. `.super` is the superclass, the base class from which the class derives. `.source` simply states which Java file the code came from. The `.method` directive is what we are really after: it states which method is being declared, its access flags, its parameter types and its return type, and the method’s definition runs until the closest `.end method`. Finally, the `invoke-` instructions are method calls. This is exactly where we extract which methods are being called.
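The directives above can be picked apart with a few string checks. What follows is a rough sketch, not the parser used in this series: the function name `parse_smali_text`, the sample class and the returned dictionary layout are all made up for illustration. It also assumes the common `invoke-*` forms, where the invoked method reference is the last token on the line.

```python
SAMPLE_SMALI = """
.class public Lcom/example/ibe$a;
.super Ljava/lang/Object;
.source "ibe.java"

.method public c()V
    .locals 0
    invoke-static {}, Lcom/example/ibe;->b()V
    invoke-virtual {p0}, Lcom/example/ibe$a;->d()V
    return-void
.end method
"""

def parse_smali_text(text):
    """Return (class descriptor, {method id: {"data": ..., "calling_to": [...]}})."""
    class_name = None
    methods = {}
    current = None  # id of the method we are currently inside, if any
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(".class "):
            # Access flags come first; the class descriptor is the last token.
            class_name = stripped.split()[-1]
        elif stripped.startswith(".method "):
            # The last token is the name plus parameter and return types.
            current = f"{class_name}->{stripped.split()[-1]}"
            methods[current] = {"data": line, "calling_to": []}
        elif current is not None:
            methods[current]["data"] += "\n" + line
            if stripped.startswith("invoke-"):
                # Assumption: the target method reference is the last token.
                methods[current]["calling_to"].append(stripped.split()[-1])
            elif stripped == ".end method":
                current = None
    return class_name, methods

class_name, methods = parse_smali_text(SAMPLE_SMALI)
```

Each entry of `methods` maps directly onto one database row: the key is the id, `data` is the method’s smali code and `calling_to` is the list of invoked methods.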
I’ll skip the part where we scan for all smali files inside a directory and load the extracted data into the database. You can see how this is done in the `parse_smali_file` method.
After understanding the most basic directives, and ignoring some for the sake of simplicity, we can start building our method matcher. Let’s assume we stumble upon a method and would like to know which code across the whole application could have invoked it. Once the database is populated with all of the methods defined in the application’s smali files, we can select every row that contains the relevant method ID inside its `calling_to` column. This is achieved by:
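A minimal sketch of such a select, under my own assumptions: a table named `methods`, a `calling_to` column holding one invoked method id per line, and a hypothetical helper called `find_callers`. The method ids in the demo are made up.

```python
import sqlite3

def find_callers(conn, method_id):
    """Return the ids of all methods whose calling_to column lists method_id."""
    # Wrapping both sides in newlines (char(10)) matches whole entries only,
    # so an id never matches as a mere substring of a longer id.
    query = """
        SELECT id FROM methods
        WHERE instr(char(10) || calling_to || char(10),
                    char(10) || ? || char(10)) > 0
        ORDER BY id
    """
    return [row[0] for row in conn.execute(query, (method_id,))]

# Tiny in-memory demo with made-up method ids.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE methods (id TEXT PRIMARY KEY, data TEXT, calling_to TEXT)")
conn.executemany(
    "INSERT INTO methods VALUES (?, ?, ?)",
    [
        ("La;->main()V", "", "La;->b()V\nLa;->c()V"),
        ("La;->b()V", "", "La;->c()V"),
        ("La;->c()V", "", ""),
    ],
)
callers = find_callers(conn, "La;->c()V")
```

I used `instr` rather than `LIKE` here because obfuscated names often contain `_`, which `LIKE` treats as a wildcard.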
For each method returned by this select statement, apply the same statement recursively. This will result in something like this:
The full image can be downloaded from here.
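The recursion itself can be sketched in a few lines. This is an illustration under assumptions, not the matcher from this series: `walk_callers` and `find_callers` are hypothetical names, the `methods` table and its newline-separated `calling_to` column are assumed, and the ids are made up. The `path` argument stops the walk when a method ends up calling itself through a cycle.

```python
import sqlite3

def find_callers(conn, method_id):
    """Return the ids of all methods whose calling_to column lists method_id."""
    query = """
        SELECT id FROM methods
        WHERE instr(char(10) || calling_to || char(10),
                    char(10) || ? || char(10)) > 0
        ORDER BY id
    """
    return [row[0] for row in conn.execute(query, (method_id,))]

def walk_callers(conn, method_id, depth=0, path=()):
    """Yield (depth, id) for method_id and, recursively, everything that calls it."""
    yield depth, method_id
    if method_id in path:
        # Already on the current chain: a call cycle, so stop descending.
        return
    for caller in find_callers(conn, method_id):
        yield from walk_callers(conn, caller, depth + 1, path + (method_id,))

# Made-up call graph: main -> b -> c, and run -> c.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE methods (id TEXT PRIMARY KEY, data TEXT, calling_to TEXT)")
conn.executemany(
    "INSERT INTO methods VALUES (?, ?, ?)",
    [
        ("La;->main()V", "", "La;->b()V"),
        ("La;->b()V", "", "La;->c()V"),
        ("La;->c()V", "", ""),
        ("Lx;->run()V", "", "La;->c()V"),
    ],
)

hierarchy = list(walk_callers(conn, "La;->c()V"))
for depth, method_id in hierarchy:
    print("    " * depth + method_id)
```

The `(depth, id)` pairs can be printed as an indented tree, as done here, or used as edges when populating a graph.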
To summarize, by applying all of the above we can achieve greater things, which we will talk about extensively later in the series. The matcher function and the graph-populating function can be grabbed from here.