[Learning LLVM I] Introduction to the LLVM Pass Framework

Hi folks, it's been a while! In this post I'm going to talk about getting started with LLVM, and I'll walk through writing a basic pass that we will build on as the series develops.


LLVM is becoming really popular. With a sprawling community behind it and a string of research projects contributing plugins and passes, there's ever more reason to get involved and hack out some passes of your own.

We should start with the obvious question: what is LLVM? The name formerly stood for "Low Level Virtual Machine" (I hear it is no longer treated as an acronym) and now refers to a collection of tools that make up a whole compiler architecture and toolchain. There are components that debug, instrument code, link libraries and much, much more. In this post we will focus on the set of libraries for interacting with the compiler internals, called the LLVM Pass Framework.

An important thing to know about LLVM before we move on is its modular design. The actual compiler is made up of 3 separate components, namely the:
  • Front-end - handles lexing, parsing and compiling source code into LLVM's intermediate representation, or IR (more on this in future posts). IR is a powerful tool in compilers, especially here, because it means whatever strange language you generated the IR from doesn't really matter to the phases going forward. LLVM can be re-targeted for pretty much whatever you can contrive into bitcode or IR.
  • Optimizer - performs, uhm, well, optimization on the IR passed to it. There are a ton of different optimizations (I believe some papers speak of something like 100 different loop optimizations, for instance). The optimizer strips out as many redundant look-ups and dead variable assignments as possible, all backed by static single assignment (SSA) form, in which each register is assigned a value only once. SSA form lets the compiler isolate important properties and anomalies in the code that would otherwise be quite ambiguous and tedious to code around.
  • Back-end - emits the actual machine-dependent assembly code. If you want to participate in machine-dependent code generation then writing a MachineFunctionPass is for you, since it kicks in every time a function is rendered into machine-dependent code (I discuss the pass types further on in the post). Useful reasons to do this might be to check for differences between the IR and the machine code, to nuke certain instructions like cache flushes in case code is trying some side-channel attacks, to inject functions that force code to run with a conditioned cache to defend against attacks, or to instrument specific asm side effects at machine level, and I'm sure tons more awesome stuff.
We can see from the diagram above that IR flows between the different stages, which means passes actually operate on LLVM IR. So you can't go into this assuming that you can filter for stack base register behavior or instruction pointer weirdness. You'll have to get to grips with LLVM IR if you mean to do anything useful with the framework; it's the lingua franca of the compiler, so learn it well.
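
To make the single-assignment idea from the optimizer bullet a bit more concrete, here is a tiny hand-written sketch. It is ordinary C++ rather than LLVM IR, and both function names are made up for illustration; the point is only to show what "each value is assigned once" looks like next to the usual style.

```cpp
// Hand-written illustration of single assignment. This is plain C++,
// not LLVM IR, and the function names are invented for this sketch.

// Ordinary style: the variable x is overwritten twice.
int reuse_variable(int a, int b) {
    int x = a + b;
    x = x * 2;
    x = x - a;
    return x;
}

// SSA style: every intermediate value gets its own name and is never
// reassigned, so each use points at exactly one definition.
int single_assignment(int a, int b) {
    const int x1 = a + b;
    const int x2 = x1 * 2;
    const int x3 = x2 - a;
    return x3;
}
```

Both functions compute the same thing; the second form just makes it trivial to see where each value came from, which is what the optimizer leans on.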

This video is an excellent introduction to the LLVM IR concepts; please check it out if you're looking for a well-structured and well-delivered introduction: https://www.youtube.com/watch?v=m8G_S5LwlTo

Here come the compiler bugs

A motivating factor for security folk like me is the advent of nifty security bugs in code that stem from compiler optimizations. One really good example is a bug termed "memsad". The key issue here is the compiler applying aggressive optimizations to memset calls in certain contexts (Ilja van Sprundel from IOActive originally showed me this bug, defo check out what he has to say [https://media.ccc.de/v/35c3-9788-memsad]).

What we learned here is that optimizations can actually remove memset calls that were meant to clear cryptographic material from memory. The memset optimization can culminate in almost Heartbleed-like conditions, directly compromising cryptographic operations if it goes unchecked for too long or appears in too many contexts.

To clarify what I'm talking about, here's a potential example of memsad in something called RIOT-OS:

Extract from https://github.com/RIOT-OS/RIOT/issues/10751

In the code above you can see the code call memset (on line 14) with a buffer as an argument and then never use the buffer again before returning. In short, this means GCC (and some other compilers) will not treat the buffer as "in use" after that point (after line 14) and will remove the memset during optimization, correctly concluding that it has no impact on the outcome of the function. The result is that in the code actually being run, memset is never called.
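
The general shape of the problem can be sketched like this. To be clear, this is not the RIOT code: checksum_with_plain_memset and wipe are hypothetical names I made up for illustration, and the volatile-pointer wipe is just one commonly used workaround that compilers in practice do not elide, not something the RIOT issue prescribes.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a routine that handles secret material.
// The final memset is a "dead store": key is never read again, so an
// optimizing compiler is allowed to delete the call.
unsigned checksum_with_plain_memset(const unsigned char *secret, std::size_t len) {
    unsigned char key[32] = {0};
    std::memcpy(key, secret, len < sizeof key ? len : sizeof key);
    unsigned sum = 0;
    for (unsigned char byte : key) sum += byte;
    std::memset(key, 0, sizeof key);  // may vanish once optimization is on
    return sum;
}

// One common workaround (an assumption here, not taken from the post):
// write through a volatile pointer so every store is treated as a side
// effect that has to stay in the generated code.
void wipe(void *buf, std::size_t len) {
    volatile unsigned char *p = static_cast<volatile unsigned char *>(buf);
    while (len--) *p++ = 0;
}
```

Swapping the std::memset call for wipe(key, sizeof key) is the kind of change that keeps the clearing code in the binary; whether your platform offers something better (like a dedicated secure-zero function) is worth checking.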

Of course, if that buffer happens to hold a hash of something sensitive or a private key, this means that when the function returns these values will still be sitting in memory, potentially leaked out to disk during swaps or divulged through any number of kernel memory disclosure vulnerabilities. Either way, if you are serious about controlling access to your crypto, this bug can be a big problem because it means you no longer have a solid grasp of where exactly in your org cryptographic material is accessible.

Anyway, the point I'm trying to make here is that the internals of a compiler matter in a security sense. The existence of this bug immediately suggests other examples exist, at the least as more contrived versions of this one. And in order to get a view of where these bugs come from, you either trudge through unfriendly compiler code or hook into a framework specifically designed to give you purchase on the internals of the compiler. LLVM is trying to be this framework.

In order to invoke some of the magic of LLVM one can write "passes" using the nifty API LLVM exposes. The next section discusses some background on these passes and gets you going with your first one.

Your first Pass

To start, I should say that compilers don't do everything in a single run at your code (well, at least LLVM doesn't); most compilers resort to a simple strategy of doing things in separate "passes" over the code. This effectively means that the compiler will pass over the code once to achieve a certain goal, and then when a desired property emerges from the code (like provably correct syntax, efficient array look-ups, etc.) it will hit it with more passes until it is rendered into compiled machine code (or LLVM bitcode if that's your target).

LLVM gives you access to these passes via something called the LLVM Pass Framework (documentation linked below). The way this works is you write a subclass of the llvm::Pass class whose methods get called as the compiler works through your code, and you get to tell it what to do with the code! Pretty cool right? You are literally writing a compiler here, and if that doesn't get you laid then I dunno what will.

Anyway here's a list of some of the passes LLVM has APIs for:
  • Module Pass - When you write a module pass your methods trigger in a context that gives you a view of the entire module (roughly, a .c/.cpp source file) being processed as a single unit. What's neat about the Module Pass is that you can trigger analysis on functions from it. I know this sounds redundant, but imagine trying to process function semantics and, say, needing context about the global variables, or wanting to set up FunctionPass-specific work by running some analysis over the entire module first, e.g. setting up your shadow memory manager, collecting metrics on the module, etc.
  • Function Pass - The function pass, as the name indicates, triggers on functions as stand-alone units. This is obviously very useful, since functions are where the action happens! There are some caveats to using these though: because functions are processed in no guaranteed order, you shouldn't expect to hit each function in your pass in a given order or make any analysis depend on that order. More important caveats are mentioned on the LLVM site (check out "Writing an LLVM Pass" in the reference section).
  • Loop Pass - Loop passes run on, you guessed it, loops in context. Nested loops are processed from the inside out, meaning the last loop processed in a nest will be the outer-most one, i.e. loop A { loop B { loop C } } will be processed C->B->A. These passes are great for performing quick optimizations on loops. For instance, if you're looking for any code that may have interesting cache behaviour, you can whip up a loop pass and model what the cache would look like during the loop. Other more obvious applications could be things like simplifying array operations, or removing statements that don't affect the computation or that slow the loop down.
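
The inside-out ordering described for loop passes is just a post-order walk over the loop nest. Here's a toy model of that in plain C++; LoopNode and innermost_first are names invented for this sketch and are not LLVM classes, so treat it as an illustration of the visiting order only.

```cpp
#include <string>
#include <vector>

// Toy model of a loop nest. LoopNode and innermost_first are invented
// for this sketch; they are not part of LLVM.
struct LoopNode {
    std::string name;
    std::vector<LoopNode> inner;  // loops nested directly inside this one
};

// Visit nested loops before the loop that contains them (post-order),
// recording the order in which loops are reached.
void innermost_first(const LoopNode &loop, std::vector<std::string> &order) {
    for (const LoopNode &child : loop.inner)
        innermost_first(child, order);
    order.push_back(loop.name);
}
```

Feeding it the nest A { B { C } } from the bullet above records the names in the order C, B, A.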

There are a few other very powerful pass types I'm not mentioning here for the sake of brevity, namely the RegionPass, CallGraphSCCPass and the MachineFunctionPass. If you need the deets on these I suggest checking out the LLVM documentation in the reference section.

The general pattern for employing LLVM to do stuff is having one part of your code collect data and another analyze it. For instance, let's say you're looking for use-after-frees: you could have one part of your code tag and log all the calls to free() and another go through these contexts to see if there are any funky things going on. The point I'm making here is that there is usually a collection phase and an analysis phase, loosely speaking. For us these will be a bit compressed in our first example pass, because we're just going to visit every function and make LLVM tell us its name.
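
As a rough sketch of that collect-then-analyze split, here is the analysis half written against a made-up event log. Nothing here is an LLVM API: Kind, Event and find_use_after_free are hypothetical names, and the log is assumed to already be in program order for straight-line code (no branches, no pointer reassignment), which a real pass could not assume.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical event log. In a real pass these records would come from
// walking the IR during the collection phase; none of these names are
// LLVM APIs.
enum class Kind { Free, Use };
struct Event {
    Kind kind;
    std::string pointer;   // name of the pointer involved
    std::size_t position;  // order in which the event was seen
};

// Analysis phase: report every use of a pointer that comes after a free
// of the same pointer. Events are assumed to be in program order.
std::vector<Event> find_use_after_free(const std::vector<Event> &log) {
    std::set<std::string> freed;
    std::vector<Event> findings;
    for (const Event &e : log) {
        if (e.kind == Kind::Free)
            freed.insert(e.pointer);
        else if (freed.count(e.pointer))
            findings.push_back(e);
    }
    return findings;
}
```

The collection phase would be whatever fills the log; keeping the two halves apart means you can swap in smarter analysis later without touching the code that gathers the data.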

The Code

Here's the code for our first LLVM pass, don't fret, I'll explain what's going on right after the code snippet:
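
A sketch of the full listing, pieced together from the fragments quoted in this post and the layout of Adrian Sampson's llvm-pass-skeleton. The include list, the RegisterMyPass variable name and the EP_EarlyAsPossible extension point are assumptions carried over from that skeleton rather than something the post states. It also assumes an LLVM release that still ships the legacy pass manager; newer releases have dropped PassManagerBuilder, so this will not build against them.

```cpp
#include "llvm/Pass.h"
#include "llvm/IR/Function.h"
#include "llvm/Support/raw_ostream.h"
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/Transforms/IPO/PassManagerBuilder.h"
using namespace llvm;

namespace {
  struct FunctionNamePass : public FunctionPass {
    static char ID;
    FunctionNamePass() : FunctionPass(ID) {}

    bool runOnFunction(Function &F) override {
      errs() << "[*] function '" << F.getName() << "'\n";
      return false; // the IR was not modified
    }
  };
}

char FunctionNamePass::ID = 0;

// Registration hook so clang runs the pass when the plugin is loaded.
// RegisterMyPass and EP_EarlyAsPossible follow the skeleton repo (assumed).
static void registerFunctionNamePass(const PassManagerBuilder &,
                                     legacy::PassManagerBase &PM) {
  PM.add(new FunctionNamePass());
}
static RegisterStandardPasses
  RegisterMyPass(PassManagerBuilder::EP_EarlyAsPossible,
                 registerFunctionNamePass);
```

The layout is arranged so that the line numbers used in the walkthrough (9, 13-16, 14, 24-30 and 28) point at the same statements here.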

Lines 1-6 are pretty straightforward, they just make sure all the relevant headers and the llvm namespace are pulled in. More important to discuss is the code on line 9:

9 struct FunctionNamePass : public FunctionPass {

Here we are declaring what kind of pass we want and what our pass should be called; namely, we're writing a pass of type llvm::FunctionPass called "FunctionNamePass". If you'd like to check out the documentation for the FunctionPass class, it's available in the LLVM Doxygen reference.

Moving on, we then give it an ID member field so the LLVM Pass framework can uniquely identify it. Then finally, in lines 13-16, we implement llvm::FunctionPass::runOnFunction(Function &), which is the star of the show as far as getting stuff done goes. This function is where you put all the analysis you want to do: pull out function calls, check arguments, etc. The argument passed in is a reference to a Function, which is the function currently being analysed. The bool it returns tells LLVM whether the pass modified the function; ours only prints, so it returns false. In line 14 we can see the code do the following:

14 errs() << "[*] function '" << F.getName() << "'\n";

This prints to stderr (via the errs() call), getting the function's name with the nifty Function::getName() method.

We now have a defined runOnFunction method and we can move on to registering our pass so that clang can pick it up. We're doing this so that we can use the clang -Xclang -load -Xclang command (more detail on this in the next section) to invoke our pass, though there is an alternative way discussed near the end of the post. Doing things this way makes the story a little shorter and easier to understand. So how do we get clang to see and invoke our pass automatically? We make use of LLVM's pass registration.

Here's the code that registers our Pass (lines 24-30):

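24 static void registerFunctionNamePass(const PassManagerBuilder &,
                                        legacy::PassManagerBase &PM) {
     PM.add(new FunctionNamePass());
   }
28 static RegisterStandardPasses
     RegisterMyPass(PassManagerBuilder::EP_EarlyAsPossible,  // name and extension point as in the skeleton repo (assumed)
                    registerFunctionNamePass);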

Above we can see registerFunctionNamePass, a callback we are defining that will register our pass for us. You can name it anything of course; the important thing is that it hands an instance of our pass to legacy::PassManagerBase::add(). Next we need to pass our registration callback to the actual pass registration system, which is done by creating an instance of RegisterStandardPasses in line 28.

Okay, so that pretty much makes up all the important aspects of the pass. We can move on to compiling and running it, which is discussed in the next section.

Compiling and Running

Okay, so our pass is all coded up and we need to be able to build and run it. To get that done we're gonna need to set up a folder structure and some CMakeLists.txt files. To skip all the headache involved in this I suggest checking out a repo that has all of this pre-cooked; I relied on this repo by Adrian Sampson: https://github.com/sampsyo/llvm-pass-skeleton

The top-level CMakeLists.txt expects a folder named FunctionName alongside it (FunctionName.cpp is what we are renaming Skeleton.cpp to), like so:

├── CMakeLists.txt
├── FunctionName
│   ├── CMakeLists.txt
│   └── FunctionName.cpp

Where the top-level llvm-pass-skeleton/CMakeLists.txt looks as follows:
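
A minimal sketch of what that top-level file can look like, modelled on the llvm-pass-skeleton repo. The project name and minimum CMake version are assumptions; the only part that has to match the folder layout above is the add_subdirectory line.

```cmake
cmake_minimum_required(VERSION 3.1)
project(FunctionName)

# Locate the installed LLVM and pick up its compile settings.
find_package(LLVM REQUIRED CONFIG)
add_definitions(${LLVM_DEFINITIONS})
include_directories(${LLVM_INCLUDE_DIRS})
link_directories(${LLVM_LIBRARY_DIRS})

# Build the pass that lives in the FunctionName/ folder.
add_subdirectory(FunctionName)
```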

And the sub-level one under FunctionName/CMakeLists.txt looks like this:
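
Again a sketch along the lines of the skeleton repo rather than a definitive file. The target name FunctionNamePass is chosen to match the libFunctionNamePass.so that shows up in the build output further down; depending on your LLVM version you may also need to request a newer C++ standard here.

```cmake
# Build the pass as a loadable module (libFunctionNamePass.so on Linux).
add_library(FunctionNamePass MODULE
    FunctionName.cpp
)

# LLVM is typically built without C++ RTTI, so the pass has to match,
# otherwise the linker complains about missing RTTI data.
set_target_properties(FunctionNamePass PROPERTIES
    COMPILE_FLAGS "-fno-rtti"
)
```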

Once your folders are set up good, you can make your build folder and pump out some cmake and make action:

cd llvm-pass-skeleton
mkdir build
cd build
cmake ..
make

If everything goes well you should see the following output:

>$ cmake ..
-- The C compiler identification is GNU 9.2.1
-- The CXX compiler identification is GNU 9.2.1
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /home/kh3m/Research/llvm/tutorials/llvm-passes/build

>$ make
Scanning dependencies of target FunctionNamePass
[ 50%] Building CXX object FunctionName/CMakeFiles/FunctionNamePass.dir/FunctionName.cpp.o
[100%] Linking CXX shared module libFunctionNamePass.so
[100%] Built target FunctionNamePass

We can then run it by invoking the following command:

>$ clang -Xclang -load -Xclang FunctionName/libFunctionNamePass.so ../radamsa.c
[*] function 'main'
[*] function 'find_heap'
[*] function 'setup'
[*] function 'load_heap'
[*] function 'vm'
[*] function 'onum'
[*] function 'read_heap'
[*] function 'heap_metrics'
[*] function 'get_obj_metrics'
[*] function 'get_nat'
[*] function 'set_signal_handler'
[*] function 'signal_handler'
[*] function 'decode_fasl'
[*] function 'get_obj'
[*] function 'get_field'
[*] function 'mkraw'
[*] function 'gc'
[*] function 'mkpair'


You may not have a radamsa.c in your folder; any C code should suffice. I just like using radamsa because it's standalone, doesn't have any complex dependencies and comes packed with all kinds of crazy code to test on.

Alternatively, you can invoke your pass using opt, which is the way the LLVM documentation does it. More on that method here: https://llvm.org/docs/WritingAnLLVMPass.html#running-a-pass-with-opt

Okay, I think this is a good place to end this post. We will carry on adding stuff to our pass as the series goes on, so stay tuned!

References and Further Reading

  1. LLVM For Grad Students - https://www.cs.cornell.edu/~asampson/blog/llvm.html  
  2. Static Single Assignment (wikipedia) - https://en.wikipedia.org/wiki/Static_single_assignment_form
  3. llvm::Pass Class Reference abstract - https://llvm.org/doxygen/classllvm_1_1Pass.html
  4. RIOT-OS - https://github.com/RIOT-OS/RIOT/blob/master/sys/crypto/helper.c#L38-L44
  5. Memsad: why clearing memory is hard (CCC, 2018) - https://media.ccc.de/v/35c3-9788-memsad
  6. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation (2004) - https://llvm.org/pubs/2004-01-30-CGO-LLVM.pdf
  7. The most dangerous function in the C/C++ world (2015) - https://www.viva64.com/en/b/0360/ 
  8. 2019 EuroLLVM Developers’ Meeting: V. Bridgers & F. Piovezan “LLVM IR Tutorial - Phis, GEPs ...” - https://www.youtube.com/watch?v=m8G_S5LwlTo  
  9. Loop Optimization Framework - https://arxiv.org/pdf/1811.00632.pdf  
  10. LLVM Skeleton Pass (Adrian Sampson) - https://github.com/sampsyo/llvm-pass-skeleton