Wednesday, 5 November 2014

Automated DEX Decompilation using Androguard


Hey guys, it's been a while since my last post and my blog is beginning to gather dust. So I thought I would drop a couple of posts about some new stuff I've been trying and learning. This post is about Androguard and how to write a simple Python script that dumps decompiled Dalvik bytecode from an Android APK.



Androguard is a framework, basically a collection of Python libraries, that gives hopeful Android bug hunters like me a programmatic interface for decompiling, analyzing, visualizing and parsing content from an APK; another feature allows you to mount Androguard as a malware analysis engine. You can learn more about the epic features of this project from the wiki.

Why is this cool? Well basically it allows you to orchestrate:
  • As close to completely automated vulnerability analysis and debugging of Android applications as possible (if you're up to the challenge, which I think I am).
  • Application Packaging and Wrapping (for Mobile Device Management solutions)

What I'm going to cover here, and probably in the next couple of posts, is the basic process of installing Androguard and how to get going with a few basic Python scripts that automate a couple of tedious tasks. So let's get going!

Install Androguard

Before we start you will need to grab a copy of the Androguard project, which you can get over here: https://code.google.com/p/androguard/. What you need to do then is dump the archive to a handy location and unzip it like so:

Once Androguard is unpacked it is ready to install; please make sure you have python-setuptools installed.
After that you can go ahead and fire off the setup.py script with an argument of 'install'. And that's it!

My First Androguard Script

Here's what this script looks like:

You can clone a copy of this little script from here https://gist.github.com/2f0c0330cc2cd0910040.git.
Running this on a sample application, the output should look a little something like this:

[Screenshot: running dump_methods on aca.db,tool.apk]

Breakdown

So let's break down what this script does. Obviously I'm gonna skip the basic stuff and jump down into the Androguard-specific calls, the first of which appears in line 6. The call to APK creates an instance of the APK class, which is used to represent an APK file and all its interesting attributes; you can check out a copy of this class under androguard/core/bytecodes/apk.py.

Now one thing to note is that this whole class works by accepting an APK file, which as you should know is just a plain old zip file. Once the constructor method (not sure if that's the correct Python-specific term) triggers, it scrapes and pulls attributes from the file, primarily by using a library that handles zip files (this collection of code is included in the apk.py source). What happens next is we pull out the dex file and pass it to the DalvikVMFormat constructor call.
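To make the "an APK is just a zip" point concrete, here is a minimal sketch that uses only Python's standard zipfile module, not Androguard. The helper names extract_dex and make_fake_apk are my own inventions for illustration, and the fake archive is nowhere near a valid APK (a real one carries a binary manifest, resources, signatures and a complete dex file).

```python
import io
import zipfile


def extract_dex(apk_bytes, name="classes.dex"):
    """Return the raw bytes of the dex entry inside an APK (a zip archive)."""
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as archive:
        return archive.read(name)


def make_fake_apk():
    """Build a tiny in-memory zip that merely looks like an APK (demo only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("AndroidManifest.xml", b"<manifest/>")
        # Only the 8-byte dex magic plus padding, not a parseable dex file
        archive.writestr("classes.dex", b"dex\n035\x00" + b"\x00" * 8)
    return buffer.getvalue()
```

For a real file you would read the APK from disk and hand its bytes to extract_dex; the result is what a dex parser then takes as input.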

The DalvikVMFormat class is pretty cool; you can find it under androguard/core/bytecodes/dvm.py. The dvm.py file is where all the dirty dex file parsing happens. What's awesome about this is that it's a pure Python implementation of a dex file parser, so if you'd like to do your own dex file parsing in Python that's where you should look for ideas (it's also pretty isolated and includes a lot of the classes that support its own operation, so it's pretty easy to pull this class out and plug it into whatever else you'd like to use it for that requires dex file parsing). What this file does is strip out the contents of the dex file and plug them into a class that represents the file and its attributes (bytecode offsets, type descriptors etc).
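If you're curious what the very first step of dex parsing looks like, here is a rough sketch of reading the fixed-size dex header with the standard struct module. This is not how dvm.py is written; it just follows the header layout from the published dex format (8-byte magic, checksum, SHA-1 signature, then twenty 32-bit fields). The function names parse_dex_header and make_fake_header are made up for this example.

```python
import struct

# magic, adler32 checksum, SHA-1 signature, then twenty little-endian uint32 fields
HEADER_FORMAT = "<8sI20s20I"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 112 bytes (0x70)

FIELD_NAMES = (
    "file_size", "header_size", "endian_tag", "link_size", "link_off",
    "map_off", "string_ids_size", "string_ids_off", "type_ids_size",
    "type_ids_off", "proto_ids_size", "proto_ids_off", "field_ids_size",
    "field_ids_off", "method_ids_size", "method_ids_off", "class_defs_size",
    "class_defs_off", "data_size", "data_off",
)


def parse_dex_header(data):
    """Decode the dex header into a dict of named fields."""
    if len(data) < HEADER_SIZE:
        raise ValueError("too short to be a dex file")
    magic, checksum, _signature, *fields = struct.unpack_from(HEADER_FORMAT, data)
    if not magic.startswith(b"dex\n"):
        raise ValueError("bad dex magic")
    header = dict(zip(FIELD_NAMES, fields))
    header["version"] = magic[4:7].decode("ascii")
    header["checksum"] = checksum
    return header


def make_fake_header(class_defs_size=3):
    """Pack a header-only blob for demonstration; not a usable dex file."""
    fields = [0] * 20
    fields[0] = HEADER_SIZE       # file_size
    fields[1] = HEADER_SIZE       # header_size
    fields[2] = 0x12345678        # endian_tag for little-endian files
    fields[16] = class_defs_size  # class_defs_size
    return struct.pack(HEADER_FORMAT, b"dex\n035\x00", 0, b"\x00" * 20, *fields)
```

The size/offset pairs in the header are what tell a parser where the string, type, method and class tables live in the rest of the file.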

What we are left with when the dex file parsing finishes is a class with full grasp of all the attributes of the dex file as well as where all the interesting stuff is stored. For instance, in this script we are interested in the classes and the methods. The get_classes() call which follows returns a list (or iterable - basically something you can stick in a for loop and traverse) of ClassDefItem objects, which are defined in dvm.py. Methods are associated with ClassDefItem objects, which is why you can call get_methods to grab a list of EncodedMethod objects. Each ClassDefItem object includes information about its index in the dex file, because every method, string, class and other object is dereferenced in the dex file using indexes (commonly referred to as idx's in the Androguard and Dalvik VM source).
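The idx business is easier to see with a toy model. The sketch below is not Androguard code and is simplified from the real format (real proto entries also carry a shorty string and point at a separate parameter list), but it shows the chain of lookups: a method entry holds indexes into the type, proto and string tables rather than the names themselves. All the table contents are invented for the example.

```python
from collections import namedtuple

# Simplified stand-in for a dex method_id_item: three indexes, no names
MethodId = namedtuple("MethodId", "class_idx proto_idx name_idx")

# Toy tables; a real dex file stores these as sorted, offset-addressed sections
strings = ["Lcom/example/Main;", "V", "onCreate", "Landroid/os/Bundle;"]
type_ids = [0, 1, 3]       # each entry is an index into `strings`
proto_ids = [(1, (2,))]    # (return type idx, parameter type idxs) into `type_ids`
method_ids = [MethodId(class_idx=0, proto_idx=0, name_idx=2)]


def resolve_method(method_idx):
    """Follow the idx chain and build a readable method reference."""
    method = method_ids[method_idx]
    class_name = strings[type_ids[method.class_idx]]
    name = strings[method.name_idx]
    return_idx, param_idxs = proto_ids[method.proto_idx]
    params = "".join(strings[type_ids[i]] for i in param_idxs)
    return_type = strings[type_ids[return_idx]]
    return "%s->%s(%s)%s" % (class_name, name, params, return_type)
```

Calling resolve_method(0) walks method table, then type table, then string table to rebuild the full reference, which is the same kind of dereferencing a dex parser does for you behind get_name() and friends.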

By the way, it's great to talk about ClassDefItems and EncodedMethods like this because the API represents the structure of a dex file as it is defined and handled by the Dalvik VM itself :) So I might be talking about Python objects and whatnot, but you are actually learning the dex file format as well!

Anyway, back to the code. We now have a list of EncodedMethod objects; what we do next is print out the method name and its descriptor. What is a method descriptor, you may ask?

"A descriptor is a string representing the type of a field or method. Descriptors are represented in the class file format using modified UTF-8 strings (§4.4.7) and thus may be drawn, where not further constrained, from the entire Unicode codespace." - http://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.3

Well, it's basically just a string that describes the basic attributes of a method or class: stuff like the types of the parameters and the return type. If you're just getting into reverse engineering Android apps I suggest reading about these, since they are used by the Dalvik VM to identify object and method types, aaaand you're not gonna know what the hell is going on in bytecode if you don't understand this little language.
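To get a feel for this little language, here is a small sketch of a method descriptor decoder. It is my own illustration, not something from Androguard: parse_method_descriptor and its helper are hypothetical names, and it only handles the basic grammar (primitives, L...; object types and [ array prefixes). It also skips spaces between parameter types, on the assumption that some tools print descriptors that way.

```python
PRIMITIVES = {
    "V": "void", "Z": "boolean", "B": "byte", "S": "short", "C": "char",
    "I": "int", "J": "long", "F": "float", "D": "double",
}


def _read_type(desc, pos):
    """Decode one type starting at desc[pos]; return (readable name, next position)."""
    dims = 0
    while desc[pos] == "[":  # each '[' adds one array dimension
        dims += 1
        pos += 1
    if desc[pos] == "L":     # object type: L<package/path/Class>;
        end = desc.index(";", pos)
        name = desc[pos + 1:end].replace("/", ".")
        pos = end + 1
    else:                    # single-letter primitive
        name = PRIMITIVES[desc[pos]]
        pos += 1
    return name + "[]" * dims, pos


def parse_method_descriptor(desc):
    """Split '(params)return' into (list of parameter types, return type)."""
    if not desc.startswith("("):
        raise ValueError("method descriptors start with '('")
    pos = 1
    params = []
    while desc[pos] != ")":
        if desc[pos] == " ":  # tolerate space-separated parameters
            pos += 1
            continue
        type_name, pos = _read_type(desc, pos)
        params.append(type_name)
    return_type, _ = _read_type(desc, pos + 1)
    return params, return_type
```

So a descriptor like (ILjava/lang/String;[B)V reads as a method taking an int, a java.lang.String and a byte array, and returning void.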

Moving on, finally we get to pulling code out of methods. The next interesting lines of code start at line 11. Here we make a call to the get_code() method, which simply returns a DalvikCode object. After that we pull out a DCode object by calling get_bc(). The DCode object is used to represent the actual bytecode and some of its attributes.

You may wonder why it's necessary to have DalvikCode objects in the first place? Well, I'm not entirely sure of this, but looking at the source of the Dalvik VM's dexlib as well as the source for Androguard, it looks like the DalvikCode object is used to associate metadata with the actual bytecode. This is stuff like the register sizes (number of registers accessible to the associated bytecode), number of try blocks, debug information and other cool stuff. As far as I know the bytecode checker uses these details to ensure the actual executable bytecode operates within its limits, only touches the arguments it is given access to, and makes use of its registers correctly. It also allows methods to be defined differently according to their DalvikCode objects (for optimization purposes, most probably), i.e. one method could have bytecode that only requires 2 registers to complete its task and another may need 10.
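For the curious, the metadata mentioned above sits in a small fixed header right in front of the instructions. The sketch below decodes that header with struct, following the code_item layout from the dex format documentation; it is not lifted from Androguard, parse_code_item is a made-up name, and it ignores the try/handler data that can follow the instructions.

```python
import struct

# registers_size, ins_size, outs_size, tries_size (uint16 each),
# then debug_info_off and insns_size (uint32 each)
CODE_ITEM_HEADER = "<4H2I"


def parse_code_item(data, offset=0):
    """Decode a code_item header and its instruction array of 16-bit code units."""
    registers, ins, outs, tries, debug_off, insns_size = struct.unpack_from(
        CODE_ITEM_HEADER, data, offset
    )
    start = offset + struct.calcsize(CODE_ITEM_HEADER)
    # insns_size counts 16-bit code units, not bytes
    insns = struct.unpack_from("<%dH" % insns_size, data, start)
    return {
        "registers_size": registers,
        "ins_size": ins,
        "outs_size": outs,
        "tries_size": tries,
        "debug_info_off": debug_off,
        "insns": insns,
    }
```

As a tiny usage example, struct.pack("<4H2I1H", 2, 1, 0, 0, 0, 1, 0x000E) builds a method body with 2 registers, 1 incoming argument and a single return-void instruction, which parse_code_item reads back field by field.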

Anyway, the DalvikCode object holds a reference to the DCode object, which is used to handle the actual bytecode. In Androguard this object serves as an interface to the part of the dex file that stores the raw dex opcodes. It includes support methods for parsing bytecode into a human-readable representation; one of these methods is get_instructions(). The script calls get_instructions to get an iterable list of bytecode instructions, after which it pulls out parsed bytecode by invoking get_name(), which returns the opcode name (invoke-virtual, iget, etc), and get_output(), which returns the operands for the associated instruction as well as the contents of the operands where applicable.
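Putting the whole walkthrough together, the flow looks roughly like the sketch below. Treat it as my reconstruction of the idea and not the exact listing from the gist, so the line numbers mentioned above won't match it. A few assumptions: the import paths follow the androguard/core/bytecodes layout described in this post (other releases may organize things differently), the DCode accessor on DalvikCode is get_bc() as far as I know, and the function names iter_method_dump and load_dex are mine. It is written in Python 3 syntax.

```python
def iter_method_dump(dex):
    """Yield one text line per method and per instruction of a parsed dex object."""
    for cls in dex.get_classes():              # ClassDefItem objects
        for method in cls.get_methods():       # EncodedMethod objects
            yield "%s %s" % (method.get_name(), method.get_descriptor())
            code = method.get_code()           # DalvikCode, or None
            if code is None:                   # e.g. abstract or native methods
                continue
            for ins in code.get_bc().get_instructions():
                line = "    %s %s" % (ins.get_name(), ins.get_output())
                yield line.rstrip()


def load_dex(apk_path):
    """Open an APK with Androguard and parse its dex file (needs Androguard installed)."""
    # Imported lazily so iter_method_dump works without Androguard present
    from androguard.core.bytecodes.apk import APK
    from androguard.core.bytecodes.dvm import DalvikVMFormat

    return DalvikVMFormat(APK(apk_path).get_dex())
```

To dump an application you would loop over iter_method_dump(load_dex("sample.apk")) and print each line, where sample.apk is just a placeholder path.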

Aaand that's it folks! I hope you guys enjoyed this post and will be writing your own Androguard scripts in no time!