Throwing Some Fuzzy Dice
I mentioned a while ago that the Mirage project agreed to have me on board, through the OPW internship project, for the summer. We started on Monday, and I’ve already had a lot of fun!
Officially, my job for the summer is to help shore up the network stack in Mirage, in part by running the current code through its paces, and in part through implementing some new functionality. This first week, I continued some work I started at the end of Hacker School - figuring out how to fuzz some strange (and not-so-strange) corners, and how to wrangle the data I got out of doing so.
Fuzz What Now?
Let’s step back. Way, way, way back.
If you’re a computer program, and you have some data that you care about, your data is likely in some kind of structure reflecting an underlying order to that data. Objects are a common way to organize this stuff; dictionaries, hashmaps, lists, arrays, trees, the list goes on. That’s all well and good when your program is running, keeping all this stuff in memory. But it happens depressingly often that you need to dump this stuff to permanent storage, or express it in some way to some other program or another computer, or represent it on the screen because something awful has happened, and you can’t just say “memory address 0x52413abd
, memory address 0x52413cda
, memory address 0x52413ea2
” - these things are meaningless outside the context of the current run of that program.
So we have serialization, the high-level concept for the jillion different ways to take that data and put it in a string, or a binary data format, so something else can read that string and reassemble the structure of the data. That’s deserialization, which implies parsing; parsing is a pretty big deal.
When the data you’re attempting to assemble into a structure is as you expect it and everything is correct, parsing’s no problem. But it frequently happens that everything is not as you expect it, for any number of reasons - the programmer who made the program that made the message made a mistake; the programmer who made the program that reads the message made a mistake; the programs reading and writing the message are using different versions of the specification in the first place; the specification wasn’t specific about whether the third byte’s range from 0 to 5 was inclusive or exclusive and each programmer made a different decision; both programs agree, but the message was corrupted in transit; the message was corrupted in transit, and one program has implemented a different corruption recovery algorithm than the other… I’ll stop now, but I could keep going for a long time.
There are a lot of bad messages out there. It’s hard to make your parser do the right thing when it receives an arbitrary bad message. It can be hard to even know that your parser does the wrong thing when it receives an arbitrary bad message - if you thought of a certain kind of bad message to use in testing, of course you fixed your code to deal with it; you thought of it! But there are almost certainly loads more bad messages out there than the ones you thought of - both by chance, and by design.
If humans can’t make enough bad messages, maybe computers can. Randomly generating a whole mess of bad messages, sending them to your program, and seeing what happens is called fuzz testing, and it’s awesome.
After finding a problem in the DHCP client option parser for Mirage, I wondered whether there might be more kinds of messages which would cause the parser to do something unexpected. There are a lot of fuzzers out there for common attack targets like HTTP stacks and XML parsers, but I couldn’t find any nice, precooked ways to fuzz a DHCP client (besides hooking up a random noise generator to firehose a whole bunch of broadcast packets on the DHCP server port), so I decided to make my own.
I needed two things - a Mirage program that would repeatedly request a DHCP lease, and a program that would generate random sets of incorrect or unusual messages. After diving into the lightweight threads library tutorial for a couple of days, I was able to get the client side up and running, requesting leases all over the place and quickly exhausting the available lease pool of any DHCP server that didn’t know to reserve it a lease of its very own.
Fuzzy Answering Machine
The fuzzing DHCP server took a little while longer to put together. I ended up using scapy, a venerable Python 2.7 domain-specific language for sending, receiving, dissecting, and manipulating network traffic. Scapy is full-featured enough to even offer a fuzz()
method! All I had to do was figure out how to plumb that into a simple DHCP server (also already provided in scapy, one of a few “answering machines” provided by the base package), which took a few days of swearing and learning how to use Python. The fuzzing answering machine, plus a modification of the native BOOTP answering machine, happily serves strange leases to Mirage instances all day long, and leaving just one instance of the repeated DHCP asker to run overnight generated a lot of data.
After gleefully looking at a really big packet dump, I realized that I had absolutely no way of knowing what any of it meant. Why? A successful DHCP transaction has four messages:
- Client sends a broadcast DHCPDISCOVER message with a random ID number
- Server sends a broadcast DHCPOFFER message with the same ID number and some parameters for network configuration
- Client sends a DHCPREQUEST message with the same ID number and parameters from DHCPOFFER
- Server sends a DHCPACK message with the same ID number
(The protocol is more complex than this, but that’s what you see when all goes well.) Unfortunately, the failure mode I encountered on EC2 had this exact profile on the wire, and so I needed for something else to happen after the lease was successfully transacted. I added a little bit of extra code to my client:
let reset_state (thread : D.t) =
C.log c "Resetting state...";
match thread.state with
| D.Lease_held offer ->
let output_message = Io_page.(to_cstruct (get 1)) in
Cstruct.blit_from_string (uint32_to_bytes offer.xid)
0 output_message 0 4;
let output_message = Cstruct.set_len output_message 4 in
ignore (U.write ~dest_ip:(Ipaddr.V4.make 255 255 255 255)
~source_port:1337 ~dest_port:1337 udp output_message);
thread.state <- D.Shutting_down;
deconfigure_ip ();
fst (D.create c i udp)
| _ -> thread
After successfully getting a lease, the client sends a broadcast packet with the random ID for which it got a successful lease.
With this change, I was able to write some code that, given a whole bunch of DHCP transactions in a packet dump, would figure out which ones were successful and which not. I could then examine the failed transactions and succeeded transactions for common elements. I could also figure out (for example) whether in a given run, any DHCP leases with option 53 were encountered, and whether they succeeded or failed.
After all of that, I didn’t find anything that I couldn’t have found with grep TODO
.
Okay, But What Have You Done For Us Lately?
Pretty much the same thing, but with a prepackaged HTTP fuzzer called pathod and the Cohttp codebase. I wrote some more analysis helpers for scapy, and discovered that Cohttp’s major behavior in error cases resulted in timeouts, rather than failure messages.
Here’s the kind of thing I was doing:
$ sudo scapy
Welcome to Scapy (2.2.0)
>>> execfile("http_server_analyzer.py")
>>> domain=rdpcap("/home/catpetter/experimenting_with_pathoc.pcap")
>>> conversations = domain.sessions(session_extractor)
>>> conversation_list = conversations.values()
>>> timeouts = filter (lambda p: not has_request_and_response(p), conversation_list)
>>> # timeouts is a list of the TCP conversations that did not include
>>> # both an HTTP request (e.g. GET / HTTP/1.1)
>>> # and a response (e.g. 200 OK).
>>> # How many of them are there?
>>> len(timeouts)
3051
>>> # out of the whole set of conversations?
>>> len(conversation_list)
4755
>>> # that's a lot! let's look at a few,
>>> # although we should only bother with conversations bearing payloads.
>>> payload_list = map(payloads, conversation_list)
>>> nonempty = filter(lambda p: len(p) > 0, payload_list)
>>> len(nonempty)
2577
>>> # that's still a lot! let's see what they have in common:
types = get_request_types(nonempty)
>>> set(sorted(types))
set(['G\xe9ET', 'GE\xecT', 'GET\xed', 'G\xfaET', 'G\nET', '\xd2GET', '\x11GET', 'G\xfeET', 'GE\xdfT', 'GUET', '\xecGET', 'G\x89ET', 'G\x08ET', 'GET', 'GET\xc2', 'G\xebET', '6GET', 'GeET', 'GE{T', 'GET[', 'CONNECT', 'OPTIONS', 'G\xd5ET', 'GETT', 'GE\xc7T', 'HEAD', 'GE\x01T', '~GET', '\xcfGET', 'GET!', 'G\x11ET', '\xfdGET', '\xa8GET', '\x92GET', '\x0cGET', 'G\xd3ET', 'GE(T', 'WGET', 'POST', 'G\x12ET', 'GuET', 'GET\n', 'GE!T', 'GET\x8d', 'TRACE', 'GET\xce', 'G\xa7ET', ':GET', 'GETO', '\x10GET', 'lGET', 'GE\x87T', 'gGET', 'PUT', '7GET', 'GET\x13', 'GET\x02', 'DELETE'])
>>> # well, to be fair, those mostly don't look like requests
>>> # we want to honor, although it would be nice to get
>>> # a response.
>>> # what types do successful transactions have?
>>> successful = filter(has_request_and_response, conversation_list)
>>> successful_payloads = map(payloads,successful)
>>> len(successful_payloads)
1704
>>> set(sorted(get_request_types(successful_payloads)))
set(['HEAD', 'GET', 'PUT', 'POST', 'OPTIONS', 'DELETE'])
>>> # That all looks much more civil!
>>> # The good news is, we're not answering bad requests!
>>> # The bad news is, 'TRACE' and 'G\nET' are bad for different reasons.
Nice Round Thing; I Think I’ve Heard of Those
If you think this all sounds like stuff you should be able to do with a generative testing framework like quickcheck, you’re right! I had only barely heard of generative testing when embarking on this project, and one of the first questions I got on the Mirage mailing list was about using it for this project. Generative testing might be a really good fit for Mirage. Furthermore, a framework that integrates this testing in the develop-test-release cycle has some clear advantages over an extra process that some random idiot (hey, that’s the name of the blog!) does every now and again.