Fun with Opam: Advice to my Past Self

Most instructions on how to get started with OCaml packages now advise the user to start with opam, which is excellent advice. Getting up and running with opam is pretty easy, but I wasn’t sure where to go from there when I wanted to modify other people’s packages and use the modifications in my environment. I wish I’d realized that the documentation for making packages has a lot of applicable advice for that use case, as well as for its apparent target (making your own packages from scratch).
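For instance, the `opam pin` workflow from the packaging documentation works just as well for running a modified copy of someone else’s package. A sketch of that workflow (the mirage-tcpip repository is just an example; substitute whatever package you’re hacking on):

```shell
# Clone the package you want to modify and hack on your local copy.
git clone https://github.com/mirage/mirage-tcpip.git
cd mirage-tcpip
# ...make your modifications...

# Pin the package to this directory; opam rebuilds it (and anything
# that depends on it) from your working tree instead of the repository.
opam pin add tcpip .

# After further edits, rebuild from the pinned source:
opam upgrade tcpip

# When you're done experimenting, remove the pin:
opam pin remove tcpip
```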

Retrospective

In 2014, I spent 12 weeks at the Recurse Center, formerly (and at the time) known as Hacker School. After finishing up my time there in May of that year, a lot of people asked me reasonable questions like:

  • How was the Recurse Center?
  • Was attending RC worth your time?
  • What did you learn at the Recurse Center?

My response to these questions was “I don’t know yet! It’s too early to say.” Now that more than a year has passed, I think I might have some idea of where to start.

What a Distributed, Version-Controlled ARP Cache Gets You

git (and its distributed version control system friends hg and darcs) has some great properties. Not only do you get a full history of changes to the objects stored in a repository, you also get comments on changes, plus branching and merging, which let you make intermediate changes without messing up state for other entities that want to work with the repository.

That’s all pretty cool. I actually want that for some of my data structures, come to think of it. Say, for example, a boring ol’ key-value store which can be updated from a few different threads – in this case, a cache that stores values it gets from the network and the querying/timeout code around it. It would be nice if each thread could make a new branch, make its changes, then merge them into the primary branch once it’s done.

It turns out you can totally do that with Irmin, “the database that never forgets”! I did (and am still doing) a bit of work on sticking a modified version of the MirageOS address resolution protocol code’s data structures into Irmin:

$ git log --all --decorate --oneline --graph
* 68216f3 (HEAD, primary, expire_1429688434.645130) Arp.tick: updating to age out old entries
* ec10c9a entry added: 192.168.3.1 -> 02:50:2a:16:6d:01
* 6446cef entry added: 10.20.254.2 -> 02:50:2a:16:6d:01
* 81cfa43 entry added: 10.50.20.22 -> 02:50:2a:16:6d:01
*   4e1e1c7 Arp.tick: merge expiry branch
|\  
| * cd787a0 (expire_1429688374.601896) Arp.tick: updating to age out old entries
* | 8df2ef7 entry added: 10.23.10.1 -> 02:50:2a:16:6d:01
|/  
* 8d11bba Arp.create: Initial empty cache
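The per-writer pattern behind that history can be emulated with plain git: each writer works on its own branch and merges back into `primary` when it’s done. A toy sketch, using one file per cache key (addresses and MAC values are invented for illustration, and this is the concept, not Irmin’s actual API):

```shell
set -e
# A toy key-value store: one file per key, versioned with git.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"
git checkout -q -b primary

# The main writer records an ARP entry on the primary branch.
echo "02:50:2a:16:6d:01" > 10.23.10.1
git add 10.23.10.1
git commit -q -m "entry added: 10.23.10.1"

# An expiry writer branches off and ages out the old entry...
git checkout -q -b expire primary
git rm -q 10.23.10.1
git commit -q -m "tick: age out old entries"

# ...while the main writer adds another entry on primary.
git checkout -q primary
echo "02:50:2a:16:6d:02" > 10.50.20.22
git add 10.50.20.22
git commit -q -m "entry added: 10.50.20.22"

# Merge the expiry branch back in: the removal and the new
# entry coexist cleanly, just like the merge commit above.
git merge -q --no-edit expire
ls   # only 10.50.20.22 remains
```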

Let's Play Network Address Translation: The Home Game

When last we spoke, I left you with a teaser about writing your own NAT implementation. iptables (and friends nftables and pf, to be a little less partisan and outdated) provide the interfaces to the kernel modules that implement NAT in many widely-used routers. If we wanted to implement our own in a traditional OS, we’d have to either take a big dive into kernel programming or find a way to manipulate packets at the Ethernet layer in userspace.
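For a sense of what that kernel interface looks like, here is the classic two-line NAT setup on a Linux box (the interface name `eth0` is a placeholder for your outward-facing link, and this needs root):

```shell
# Let the kernel forward packets between interfaces at all.
sysctl -w net.ipv4.ip_forward=1

# Masquerade traffic leaving via the outward-facing link: rewrite its
# source address to the router's own, and track each connection so
# replies can be rewritten back on the way in.
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
```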

But if all we need to do is NAT traffic, why not just build something that only knows how to NAT traffic? I’ve spent a lot of time looking at building networked applications on top of (and with) the full network stack provided by the MirageOS library OS, but we can also build lower-level applications with fundamentally the same programming tactics and tools we use to write, for example, DNS resolvers.

Building A Typical Stack From Scratch

Let’s have a look at the ethif-v4 example in the mirage-skeleton example repository. This example unikernel shows how to build a network stack “by hand” from a bunch of different functors, starting from a physical device (provided by config.ml at build time, representing either a Xen backend if you configure with mirage configure --xen or a Unix tuntap backend if you configure with mirage configure --unix). I’ve reproduced the network setup bits from the most recent version as of this writing and annotated them a bit:
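Concretely, both backends come from the same source, selected at configuration time. A sketch of the two builds (the directory name follows mirage-skeleton’s layout, and the exact build command may differ between Mirage versions):

```shell
cd mirage-skeleton/ethifv4

# Unix backend: the unikernel runs as an ordinary process,
# pushing packets through a tuntap device.
mirage configure --unix && make

# Xen backend: the same code builds into a bootable Xen guest image.
mirage configure --xen && make
```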

Things Routers Do: Network Address Translation

WiFi is fairly ubiquitous in 2015. In most of the nonprofessional contexts in which we use it, it’s provided by a small box that’s plugged into mains power and an Ethernet cable, usually with an antenna or two sticking out of it. I’ve heard these boxes called all kinds of things: hotspots, middleboxes, edge routers, home routers, NAT devices, gateways, and probably a few more I’ve forgotten; there are surely more I haven’t heard. “Router” is the word I hear and use most often myself, despite the unfortunate overlap with a more specific meaning (a device with multiple network links, capable of sending traffic between them). There are an awful lot of things these boxes do which aren’t implied by “router”!

Some Random Idiot

My first interesting job was as a student systems administrator for a fairly heterogeneous group of UNIX servers. For the first many months, I was essentially a clever interface to an array of search engines. I came to have a great appreciation for the common phenomenon of a detailed solution to a very specific problem, laid out beautifully in the personal site of someone I’d never met. I answered a lot of “how on Earth did you figure that out?” with “somebody on the Internet wrote about it”.

Virtualization: WTF

For reasons that don’t need exploring at this juncture, I decided to start reading through a bunch of papers on virtualization, and I thought I’d force myself to actually do it by publicly committing to blogging about them.

First on deck is Disco: Running Commodity Operating Systems on Scalable Multiprocessors, a paper from 1997 that itself “brings back an idea popular in the 1970s” – run a small virtualization layer between hardware and multiple virtual machines (referred to in the paper as a virtual machine monitor; “hypervisor” in more modern parlance). Disco was aimed at allowing software to take advantage of new hardware innovations without requiring huge changes in the operating system. I can speculate on a few reasons this paper’s first in the list:

  • if you have a systems background, most of it is intelligible with some brow-furrowing
  • it goes into a useful level of detail on the actual work of intercepting, rewriting, and optimizing host operating systems’ access to hardware resources
  • the authors went on to found VMware, a massively successful virtualization company

I read the paper intending to summarize it for this blog, but I got completely distracted by the paper’s motivation, which I found both interesting and unexpected.

OPW FIN

We’ve come to the end of my round of the Outreach Program for Women, which sponsored my work with the MirageOS folks this summer. I was fortunate to be able to mark the occasion by joining my mentors and an awful lot of badass Xen hackers at the Xen Project Developers Summit earlier this week, where I waved my extremely conspicuous American accent around in everyone’s face and saw some awesome presentations on Xen internals and research. (Xen on ARM is relatively performant! It’s hard to run 10,000 host VMs! The HaLVM has already implemented a whole bunch of stuff I was thinking about doing!)

I Am Unikernel (And So Can You!)

Julia Evans, prolific blogger and rad person, gave me several kind comments on the “Why I Unikernel” posts (security, self-hosting). She also asked, quite reasonably, whether I’d written a high-level summary of how I host my blog from a unikernel. “No, but I should,” I said, and unlike most times I say I should do something, I actually did it.

Here’s the very-high-level overview:

  • use brain to generate content that some human, somewhere, might want to read (hardest step)
  • write all that stuff in Markdown
  • use Octopress to generate a static site from that Markdown
  • use Mirage to build a unikernel with the blog content
  • upload the unikernel to an EC2 instance running Linux
  • build a new EC2 instance from the uploaded unikernel
  • make sure that newly generated instance looks like my website with new content
  • shut down the Linux host that made the new EC2 instance
  • make somerandomidiot.com point to the new EC2 instance
  • kill the EC2 instance which previously served somerandomidiot.com

And below, one can find the gory details.
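Stitched together, the pipeline above might look something like this from a shell (`deploy-to-ec2` and `point-dns` are stand-ins for longer scripts described in the gory details, not real commands):

```shell
# Generate the static site from Markdown with Octopress.
rake generate

# Build a Xen unikernel that bundles the generated blog content.
mirage configure --xen
make

# Hypothetical helper scripts: register the kernel as a bootable image
# from a Linux EC2 build host, boot a fresh instance from it, repoint
# DNS at the new instance, then kill the old one.
./deploy-to-ec2 mir-blog.xen
./point-dns somerandomidiot.com
```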

My Content is Mine: Why I Unikernel, Part 2

Having a machine capable of executing arbitrary instructions on the public Internet is a responsibility, and it’s a fairly heavy one to assume just to run a blog. Some people solve this by letting someone else take care of it – GitHub, Tumblr, or Medium, for example. I’m not so keen on that solution for a number of reasons, almost none of which are Internet-old-person crankery.

First, and most emotionally: as dumb as my thoughts are, they’re mine. Not GitHub’s or Medium’s or any other group’s. Most entities on the web don’t host user content out of the goodness of their heart; they’re getting something out of it, and it’s likely that they’re getting more out of it than the user is. I’m reminded of the old MetaFilter maxim: “If you’re not paying for it, you’re not the consumer, you’re the product.” Either someone’s making money off of you now or they plan to do it later. I don’t want to encourage that kind of behavior. I just want to write things that people can read about how to make stuff work.