The Minnesota Goodbye
Looking into some of the results from last week’s fuzzing session, I noticed something interesting:
$ tcpdump -r experimenting_with_pathoc.pcap 'src host 192.168.2.24 and tcp[13] & 1 != 0'
reading from file experimenting_with_pathoc.pcap, link-type EN10MB (Ethernet)
$
Let’s translate that into human.
- tcpdump -r experimenting_with_pathoc.pcap: use tcpdump to read an existing packet trace named experimenting_with_pathoc.pcap.
- src host 192.168.2.24: show me only packets that were sent by 192.168.2.24, which is the IP address of a running unikernel that’s serving web pages on port 80.
- and tcp[13] & 1 != 0: of the packets sent by 192.168.2.24, show me only those where the least significant bit of the byte at offset 13 in the TCP header is not zero. That byte of the TCP header holds the flags that tell the TCP state machine how to process the packet, and its least significant bit corresponds to the FIN flag, used to initiate graceful connection closures.
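As a quick sanity check on the bit arithmetic, here’s the same test written out in OCaml (a standalone sketch, nothing to do with the unikernel yet):

(* tcp[13] is the TCP flags byte; "tcp[13] & 1 != 0" tests its least
   significant bit, which is FIN *)
let has_fin flags = flags land 0x01 <> 0

let () =
  assert (has_fin 0x11);       (* FIN + ACK *)
  assert (not (has_fin 0x10))  (* ACK alone *)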
All together, “show me all the packets sent by 192.168.2.24 which initiated a graceful connection closure.” tcpdump helpfully shows us… all zero such packets in the trace.
This isn’t necessarily wrong for a webserver implementing HTTP/1.1, which defaults to persistent connections:
8.1.2 Overall Operation
A significant difference between HTTP/1.1 and earlier versions of HTTP is that persistent connections are the default behavior of any HTTP connection. That is, unless otherwise indicated, the client SHOULD assume that the server will maintain a persistent connection, even after error responses from the server. – RFC 2616
So let’s make something that will try to initiate a connection closure.
open Lwt
open V1_LWT

module Main (C: V1_LWT.CONSOLE) (S: V1_LWT.STACKV4) = struct

  let make_output =
    let buf = Io_page.(to_cstruct (get 1)) in
    let output = "I made a string!" in
    Cstruct.blit_from_string output 0 buf 0 (String.length output);
    (* send only the bytes we filled in, not the whole page *)
    Cstruct.sub buf 0 (String.length output)

  let start c s = (* equivalent to a "main" function *)
    S.listen_tcpv4 s ~port:19 (* define a listener for TCP port 19 *)
      (fun flow ->
         let dst, dst_port = S.TCPV4.get_dest flow in
         (* when we get a new connection, output some info to the console *)
         C.log_s c (Printf.sprintf "new tcp connection from %s %d"
                      (Ipaddr.V4.to_string dst) dst_port)
         >>= fun () ->
         (* once we've written to the console, write to the socket *)
         S.TCPV4.write flow make_output
         >>= fun () ->
         ignore (C.log_s c (Printf.sprintf "closing tcp connection with %s %d"
                              (Ipaddr.V4.to_string dst) dst_port));
         S.TCPV4.close flow (* close the connection *)
      );
    S.listen s (* begin listening with all defined listeners,
                * and continue until the program is terminated *)
end
We’d expect this code to respond to an incoming TCP connection by finishing the 3-way handshake, sending “I made a string!” in jubilant celebration, then closing the connection gracefully.
However, this doesn’t seem to be the case: if I build and run a unikernel with this code (say, at 192.168.2.25), and run echo "hi mom" | nc 192.168.2.25 19 & from my build host 128 times, all 128 jobs are still running after many minutes. No packets with the FIN flag set have been sent, and netstat -an | grep 192.168.2.25:19 shows 128 connections in FIN_WAIT2 (which “represents waiting for a connection termination request from the remote TCP”, per RFC 793). Hm.
Let’s look at S.TCPV4.close in mirage-tcpip. It leads me quickly to mirage-tcpip/tcp/pcb.ml, which looks like the actual implementation of our close function. Here it is:
let close pcb = Tx.close pcb
and in module Tx, where transmission code lives:
(* Queue up an immediate close segment *)
let close pcb =
match state pcb.state with
| Established | Close_wait ->
UTX.wait_for_flushed pcb.utx >>
(let {wnd} = pcb in
STATE.tick pcb.state (State.Send_fin (Window.tx_nxt wnd));
TXS.output ~flags:Segment.Fin pcb.txq []
)
|_ -> return ()
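(If you haven’t met it, >> is older Lwt’s sequencing operator - roughly the following, modulo the syntax extension that provided it:)

open Lwt

(* roughly what Lwt's >> does: run the first thread, discard its
   result, then run the second *)
let ( >> ) a b = a >>= fun _ -> b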
Hm. So it would appear there are many possible reasons why we might never see a FIN:

- pcb.state is never one of Established or Close_wait when this code is entered
- UTX.wait_for_flushed waits forever, and so the code after the >> never runs
- There is a logic error in the function called once UTX.wait_for_flushed returns, resulting in the packet never being sent
- Something else that I didn’t see
I add a debug print statement to the function, before any action is taken but in the branch of the pattern match that only executes in the Established and Close_wait states. With this, I can see whether that code ever executes. I also dig a little bit into state.ml and find the tick function. It has a helpful bit of commented-out debug output at the top, so I uncomment it.
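To give a flavor of what that tracing produces, here’s a toy model (names and types are illustrative, not mirage-tcpip’s actual internals):

(* a toy model of the state-machine tracing: each transition prints
   "{ old } - action -> { new }", the format of the traces below *)
type tcp_state = Closed | Listen | Syn_rcvd of int32 | Established

let to_string = function
  | Closed -> "Closed"
  | Listen -> "Listen"
  | Syn_rcvd isn -> Printf.sprintf "Syn_rcvd %ld" isn
  | Established -> "Established"

let tick before action after =
  Printf.printf "{ %s } - %s -> { %s }\n%!"
    (to_string before) action (to_string after);
  after

let _ = tick Listen "Send_synack 452873839" (Syn_rcvd 452873839l)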
Looking at the output, it’s looking an awful lot like pcb.state might never be one of Established or Close_wait. It’s also possible that we don’t actually enter this function, so I throw some more debug output at the very beginning of the function, before the match statement.
Jackpot! I do get output from that, and it looks like this:
Close invoked on connection in state Syn_rcvd
Which is extremely interesting. We shouldn’t be invoking close until after we’ve finished the three-way handshake and sent some data!

It’s also interesting that we see Syn_rcvd and not Listen as the state description. Syn_rcvd implies that we’ve sent a SYN/ACK but not yet received the ACK back. If we’re (potentially) sending a whole mess of data before receiving the third step of the three-way handshake, that would explain a lot! That’s surprising behavior from listen, though - I think listen should only give us data (or allow us to write) once the connection is actually established.
It turns out this behavior isn’t replicated with a unikernel client (which is a little difficult to get going; one must invoke listen to get IP configuration done via DHCP). The state machine seems to go through the expected transitions in that case:
{ Closed } - Send_syn 466432097 -> { Syn_sent 466432097 }
{ Syn_sent 466432097 } - Recv_synack 466432098 -> { Established }
Close invoked on connection in state Established.
Enqueueing connection close packet.
{ Established } - Send_fin 466432120 -> { Fin_wait_1 466432120 }
{ Fin_wait_1 466432120 } - Recv_ack 466432120 -> { Fin_wait_1 466432120 }
{ Fin_wait_1 466432120 } - Recv_ack 466432121 -> { Fin_wait_2 0 }
{ Fin_wait_2 0 } - Recv_fin -> { Time_wait }
So the problem seems to be in the transition from Syn_rcvd to Established, and is only encountered for unikernels running listeners.
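For reference, the client end of that experiment looks roughly like this - a sketch against the same V1_LWT interfaces with minimal error handling; the address and port are the ones from the server example above:

open Lwt
open V1_LWT

module Client (C: V1_LWT.CONSOLE) (S: V1_LWT.STACKV4) = struct
  let start c s =
    let connect_and_close () =
      S.TCPV4.create_connection (S.tcpv4 s)
        (Ipaddr.V4.of_string_exn "192.168.2.25", 19)
      >>= function
      | `Error _ -> C.log_s c "connection failed"
      | `Ok flow ->
        (* read whatever the server sends, then close gracefully *)
        S.TCPV4.read flow
        >>= fun _ ->
        S.TCPV4.close flow
    in
    (* listen must run too: it drives the stack, including the DHCP
       exchange that configures our address *)
    Lwt.join [ S.listen s; connect_and_close () ]
end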
Digging really deeply into the pcb.ml code, it looks like the process by which we initiate a live connection from the server end (in tandem with sending the SYN/ACK) is decoupled from receipt of the corresponding ACK packet. In this case, the most appropriate thing to do might be to allow close to be invoked on a connection in state Syn_rcvd, since our close will be enqueued behind other messages. Let’s try it!
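Concretely, that means letting the Syn_rcvd case into the pattern match in Tx.close - a sketch of the change (the real patch may differ in its details):

(* Queue up an immediate close segment - now also from Syn_rcvd *)
let close pcb =
  match state pcb.state with
  | Established | Close_wait
  | Syn_rcvd _ -> (* close before the final ACK arrives; the FIN will
                     queue behind anything we've already enqueued *)
    UTX.wait_for_flushed pcb.utx >>
    (let {wnd} = pcb in
     STATE.tick pcb.state (State.Send_fin (Window.tx_nxt wnd));
     TXS.output ~flags:Segment.Fin pcb.txq []
    )
  |_ -> return ()

With that change in place, the trace looks like this: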
{ Closed } - Passive_open -> { Listen }
{ Listen } - Send_synack 452873839 -> { Syn_rcvd 452873839 }
new tcp connection from 192.168.2.1 37629
closing tcp connection with 192.168.2.1 37629
Close invoked on connection in state Syn_rcvd 452873839.
Enqueueing connection close packet.
TCP retransmission on timer seq = 452873839
{ Syn_rcvd 452873839 } - Recv_ack 452873840 -> { Established }
{ Established } - Send_fin 452873864 -> { Fin_wait_1 452873864 }
{ Fin_wait_1 452873864 } - Recv_ack 452873840 -> { Fin_wait_1 452873864 }
{ Fin_wait_1 452873864 } - Recv_ack 452873864 -> { Fin_wait_1 452873864 }
{ Fin_wait_1 452873864 } - Recv_ack 452873865 -> { Fin_wait_2 0 }
{ Fin_wait_2 0 } - Recv_fin -> { Time_wait }
Yay, we sent a FIN! …but we’re not really done here. There is a perceptible pause between making the connection and actually receiving any data. It’s also notable that we had to retransmit our SYN/ACK, and that’s no fluke - rerunning many times produces the same result. Examining a packet trace is illuminating - there’s a normal 3-way handshake, then a long pause, then a repeated SYN/ACK and ACK, followed by the expected data transmission, followed by a normal connection closure. Moreover, it looks like the unikernel keeps retrying the SYN/ACK after the connection’s been closed!
It seems rather like the state machine is both rejecting (because it doesn’t advance the state from Syn_rcvd) and not rejecting (because it invokes the callback function) the first transmitted ACK. Is it the case that there should be an entry for this connection in the listeners table, but there isn’t yet, and so the ACK is discarded? I add some debug output to input_no_pcb, the function that figures out what to do with packets that don’t correspond to an established connection, and run another test.
The code is pretty deep in a big function, but here’s the relevant snippet:
if Sequence.(to_int32 (incr tx_isn)) = ack_number then begin
(* Established connection - promote to active channels *)
Hashtbl.remove t.listens id;
Hashtbl.add t.channels id newconn;
(* send new connection up to listener *)
pushf (fst newconn)
>>= fun () ->
Rx.input t pkt newconn
end
The callback we submit in our end-user application, here called pushf, is invoked before we call Rx.input to process the incoming packet with the ACK flag set. That means our callback operates on the newly promoted channel before the ACK that completed the handshake has been processed. If we swap the order of these operations, the user function runs on a consistent model of the connection, and there’s no duplicate SYN/ACK traffic or waiting for timeouts. Hooray!
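In code, the fix is just swapping the two steps in the snippet above:

if Sequence.(to_int32 (incr tx_isn)) = ack_number then begin
  (* Established connection - promote to active channels *)
  Hashtbl.remove t.listens id;
  Hashtbl.add t.channels id newconn;
  (* process the ACK first, so the connection state is consistent... *)
  Rx.input t pkt newconn
  >>= fun () ->
  (* ...then send the new connection up to the listener *)
  pushf (fst newconn)
end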
And best of all, we send our FIN packet, the remote end sends a FIN/ACK, and both sides close the connection in a timely manner - we say “goodbye”, and we mean it.