The Minnesota Goodbye

Looking into some of the results from last week’s fuzzing session, I noticed something interesting:

$ tcpdump -r experimenting_with_pathoc.pcap 'src host 192.168.2.24 and tcp[13] & 1 != 0'
reading from file experimenting_with_pathoc.pcap, link-type EN10MB (Ethernet)
$

Let’s translate that into human.

  • tcpdump -r experimenting_with_pathoc.pcap: use tcpdump to read an existing packet trace named experimenting_with_pathoc.pcap.
  • src host 192.168.2.24: show me only packets that were sent by 192.168.2.24, which is the IP address of a running unikernel that’s serving web pages on port 80.
  • and tcp[13] & 1 != 0: of the packets sent by 192.168.2.24, show me only those where the least significant bit of the byte at offset 13 in the TCP header (tcpdump offsets are zero-indexed) is nonzero. That byte holds the TCP control flags, which tell the TCP state machine how the packet should be processed, and its least significant bit is the FIN flag, used to initiate graceful connection closures.

All together, “show me all the packets sent by 192.168.2.24 which initiated a graceful connection closure.” tcpdump helpfully shows us… all zero such packets in the trace.
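To make the bit-twiddling concrete, here's a small stand-alone OCaml sketch of the flags byte. The names and values follow the flag layout in RFC 793; none of this is mirage-tcpip code:

```ocaml
(* The TCP flags byte (header offset 13), least significant bit first:
   FIN, SYN, RST, PSH, ACK, URG. *)
let fin = 0x01
let syn = 0x02
let ack = 0x10

(* true when [flag] is set in [flags_byte] - the same test that
   tcpdump's tcp[13] & 1 != 0 performs for FIN *)
let has_flag flags_byte flag = flags_byte land flag <> 0

let () =
  assert (has_flag 0x12 syn && has_flag 0x12 ack);  (* a SYN/ACK *)
  assert (not (has_flag 0x12 fin));
  assert (has_flag 0x11 fin)                        (* a FIN/ACK *)
```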

This isn’t necessarily wrong for a webserver implementing HTTP/1.1, which defaults to persistent connections:

8.1.2 Overall Operation

A significant difference between HTTP/1.1 and earlier versions of HTTP is that persistent connections are the default behavior of any HTTP connection. That is, unless otherwise indicated, the client SHOULD assume that the server will maintain a persistent connection, even after error responses from the server. – RFC 2616

So let’s make something that will try to initiate a connection closure.

open Lwt
open V1_LWT

module Main (C: V1_LWT.CONSOLE) (S: V1_LWT.STACKV4) = struct

  let make_output =
    let buf = Io_page.(to_cstruct (get 1)) in
    let output = "I made a string!" in
    Cstruct.blit_from_string output 0 buf 0 (String.length output);
    (* send only the bytes we filled in, not the whole page *)
    Cstruct.sub buf 0 (String.length output)

  let start c s = (* equivalent to a "main" function *)
    S.listen_tcpv4 s ~port:19 (* define a listener for TCP port 19 *)
      (fun flow ->
         let dst, dst_port = S.TCPV4.get_dest flow in
         (* when we get a new connection, output some info to the console *)
         C.log_s c (Printf.sprintf "new tcp connection from %s %d"
                      (Ipaddr.V4.to_string dst) dst_port)
         >>= fun () ->
         (* once we've written to the console, write to the socket *)
         S.TCPV4.write flow make_output
         >>= fun () ->
         ignore (C.log_s c (Printf.sprintf "closing tcp connection with %s %d"
                              (Ipaddr.V4.to_string dst) dst_port));
         S.TCPV4.close flow (* close the connection *)
      );

    S.listen s (* begin listening with all defined listeners,
                * and continue until the program is terminated *)
end

We’d expect this code to respond to an incoming TCP connection by finishing the 3-way handshake, sending I made a string! in jubilant celebration, then closing the connection gracefully.

However, this doesn’t seem to be the case. If I build and run a unikernel with this code (say, at 192.168.2.25), and run echo "hi mom"|nc 192.168.2.25 19 & from my build host 128 times, all 128 jobs are still running after many minutes. No packets with the FIN flag set have been sent, and netstat -an | grep 192.168.2.25:19 shows 128 connections in FIN_WAIT2 (“represents waiting for a connection termination request from the remote TCP”). Hm.

Let’s look at S.TCPV4.close in mirage-tcpip. It leads me quickly to mirage-tcpip/tcp/pcb.ml, which looks like the actual implementation of our close function. Here it is:

let close pcb = Tx.close pcb

and in module Tx, where transmission code lives:

(* Queue up an immediate close segment *)
let close pcb =
  match state pcb.state with
  | Established | Close_wait ->
    UTX.wait_for_flushed pcb.utx >>
    (let { wnd } = pcb in
     STATE.tick pcb.state (State.Send_fin (Window.tx_nxt wnd));
     TXS.output ~flags:Segment.Fin pcb.txq []
    )
  | _ -> return ()

Hm. So it would appear there are many possible reasons why we might never see a FIN:

  • pcb.state is never one of Established or Close_wait when this code is entered
  • UTX.wait_for_flushed waits forever, and so the code after the >> never runs
  • There is a logic error in the function called once UTX.wait_for_flushed returns, resulting in the packet never being sent
  • Something else that I didn’t see
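The first possibility is easy to model in isolation. Here's a toy version of that match (stand-alone types, not mirage-tcpip's) showing that any state other than Established or Close_wait silently drops the close:

```ocaml
type state = Closed | Listen | Syn_rcvd | Established | Close_wait
type action = Send_fin | No_op

(* mirrors the shape of Tx.close above: a FIN comes out of exactly
   two states *)
let close_action = function
  | Established | Close_wait -> Send_fin
  | _ -> No_op

let () =
  assert (close_action Established = Send_fin);
  (* a close in Syn_rcvd does nothing at all - no FIN, no error *)
  assert (close_action Syn_rcvd = No_op)
```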

I add a debug print statement to the function, before any action is taken but inside the branch of the pattern match that only executes for the Established and Close_wait states. With this, I can see whether that code ever executes. I also dig a little bit into state.ml and find the tick function. It has a helpful bit of commented-out debug output at the top, so I uncomment it.

Looking at the output, it’s looking an awful lot like pcb.state might not ever be one of Established or Close_wait. It’s also possible that we don’t actually enter this function, so I throw some more debug output at the very beginning of the function, before the match statement.

Jackpot! I do get output from that, and it looks like this:

Close invoked on connection in state Syn_rcvd

Which is extremely interesting. We shouldn’t be invoking close until after we’ve finished the three-way handshake and sent some data!

It’s also interesting that we see Syn_rcvd and not Listen as the state description. Syn_rcvd implies that we’ve sent a SYN/ACK but not received the ACK back yet. If we’re sending a whole mess of data (potentially) before receiving the third thing in the three-way handshake, that would explain a lot! That’s surprising behavior from listen, though - I think listen should likely only give us data (or allow us to write) once the connection is actually established.

It turns out this behavior isn’t replicated with a unikernel client (which is a little difficult to get going; one must invoke listen to get IP configuration done via DHCP). The state machine seems to go through the expected transitions in that case:

{ Closed }  - Send_syn 466432097 ->  { Syn_sent 466432097 }
{ Syn_sent 466432097 }  - Recv_synack 466432098 ->  { Established }
Close invoked on connection in state Established.
Enqueueing connection close packet.
{ Established }  - Send_fin 466432120 ->  { Fin_wait_1 466432120 }
{ Fin_wait_1 466432120 }  - Recv_ack 466432120 ->  { Fin_wait_1 466432120 }
{ Fin_wait_1 466432120 }  - Recv_ack 466432121 ->  { Fin_wait_2 0 }
{ Fin_wait_2 0 }  - Recv_fin ->  { Time_wait }

So the problem seems to be in the transition from Syn_rcvd to Established, and is only encountered for unikernels running listeners.
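The transitions we've seen in the traces can be summarized as a pure function. This is a simplified stand-alone model - mirage-tcpip's state.ml also tracks sequence numbers, which I omit here:

```ocaml
type state = Closed | Listen | Syn_sent | Syn_rcvd | Established
           | Fin_wait_1 | Fin_wait_2 | Time_wait
type event = Passive_open | Send_syn | Send_synack | Recv_synack
           | Recv_ack | Recv_ack_of_fin | Send_fin | Recv_fin

let tick state event =
  match state, event with
  | Closed, Passive_open -> Listen         (* server side *)
  | Closed, Send_syn -> Syn_sent           (* client side *)
  | Listen, Send_synack -> Syn_rcvd
  | Syn_sent, Recv_synack -> Established
  | Syn_rcvd, Recv_ack -> Established      (* the transition that never fires *)
  | Established, Send_fin -> Fin_wait_1
  | Fin_wait_1, Recv_ack_of_fin -> Fin_wait_2
  | Fin_wait_2, Recv_fin -> Time_wait
  | s, _ -> s                              (* everything else: stay put *)

let () =
  let final =
    List.fold_left tick Closed
      [Passive_open; Send_synack; Recv_ack;
       Send_fin; Recv_ack_of_fin; Recv_fin]
  in
  assert (final = Time_wait)
```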

Digging really deeply into the pcb.ml code, it looks like the processes by which we initiate a live connection from the server end (in tandem with sending SYN/ACK) are decoupled from receipt of an appropriate ACK packet. In this case, it seems like the most appropriate thing to do might be to allow close to be invoked on a connection in state Syn_rcvd, considering that our close will be enqueued behind other messages. Let’s try it!
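In terms of the code in Tx.close, the change amounts to adding Syn_rcvd to the accepting branch of the match. Here's that idea as a stand-alone toy, not the real mirage-tcpip types:

```ocaml
type state = Closed | Listen | Syn_rcvd | Established | Close_wait
type action = Enqueue_fin | No_op

(* proposed behavior: Syn_rcvd may also queue a FIN; the segment
   queue will emit it after the data that's already pending *)
let close_action = function
  | Established | Close_wait | Syn_rcvd -> Enqueue_fin
  | _ -> No_op

let () =
  assert (close_action Syn_rcvd = Enqueue_fin);
  assert (close_action Listen = No_op)
```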

{ Closed }  - Passive_open ->  { Listen }
{ Listen }  - Send_synack 452873839 ->  { Syn_rcvd 452873839 }
new tcp connection from 192.168.2.1 37629
closing tcp connection with 192.168.2.1 37629
Close invoked on connection in state Syn_rcvd 452873839.
Enqueueing connection close packet.
TCP retransmission on timer seq = 452873839
{ Syn_rcvd 452873839 }  - Recv_ack 452873840 ->  { Established }
{ Established }  - Send_fin 452873864 ->  { Fin_wait_1 452873864 }
{ Fin_wait_1 452873864 }  - Recv_ack 452873840 ->  { Fin_wait_1 452873864 }
{ Fin_wait_1 452873864 }  - Recv_ack 452873864 ->  { Fin_wait_1 452873864 }
{ Fin_wait_1 452873864 }  - Recv_ack 452873865 ->  { Fin_wait_2 0 }
{ Fin_wait_2 0 }  - Recv_fin ->  { Time_wait }

Yay, we sent a FIN! …but we’re not really done here. There is a perceptible pause between making the connection and actually receiving any data. It’s also notable that we had to retransmit our SYN/ACK, and that’s no fluke - rerunning many times produces the same result. Examining a packet trace is illuminating - there’s a normal 3-way handshake, then a long pause, then a repeated SYN/ACK and ACK, followed by the expected data transmission, followed by a normal connection closure. Moreover, it looks like the unikernel keeps retrying the SYN/ACK after the connection’s been closed!

It seems rather like the state machine is both rejecting (because it doesn’t advance the state from Syn_rcvd) and not rejecting (because it initiates the callback function) the first transmitted ACK. Is it the case that there should be an entry for this in the listeners table, but there isn’t yet, and so the ACK is discarded? I add some debug output to input_no_pcb, the function that figures out what to do with packets that don’t correspond with an established connection, and run another test.

The code is pretty deep in a big function, but here’s the relevant snippet:

if Sequence.(to_int32 (incr tx_isn)) = ack_number then begin
  (* Established connection - promote to active channels *)
  Hashtbl.remove t.listens id;
  Hashtbl.add t.channels id newconn;
  (* send new connection up to listener *)
  pushf (fst newconn)
  >>= fun () ->
  Rx.input t pkt newconn
end

The callback we submit in our end-user application, here called pushf, is invoked before we call Rx.input to process the incoming packet with the ACK flag set. That means our callback runs on a channel that has just been promoted to the active table, but whose ACK hasn’t yet been processed - the state machine is still sitting in Syn_rcvd. If we swap the order of these operations, the user function runs on a consistent model of the connection, and there’s no duplicate SYN/ACK traffic or waiting for timeouts. Hooray!
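A toy model makes the ordering problem visible. The names below (process_ack, user_callback) are illustrative, not mirage-tcpip's:

```ocaml
(* Record side effects in order so the two schedules can be compared. *)
let run_promotion ~ack_first =
  let log = ref [] in
  let process_ack () = log := "ack processed" :: !log in
  let user_callback () = log := "callback ran" :: !log in
  if ack_first then (process_ack (); user_callback ())
  else (user_callback (); process_ack ());
  List.rev !log

let () =
  (* buggy order: the callback writes to a connection still in Syn_rcvd *)
  assert (run_promotion ~ack_first:false = ["callback ran"; "ack processed"]);
  (* fixed order: the ACK is processed first, then the callback runs *)
  assert (run_promotion ~ack_first:true = ["ack processed"; "callback ran"])
```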

And best of all, we send our FIN packet, the remote end sends a FIN/ACK, and both sides close the connection in a timely manner - we say “goodbye”, and we mean it.