
Using iptables To Emulate Network Conditions

A few weeks ago, I ran into an interesting bug in some production daemon code, one that turned out to be caused by weird network behavior.

This is what was happening on the wire (courtesy of tcpdump). Addresses and ports have been changed to protect the guilty, and some whitespace added to improve readability.

1   0.000000 10.0.1.8 -> 10.0.0.5 TCP 74 33892 > 1234 [SYN]      Seq=0       Win=5840 Len=0
2   0.000004 10.0.0.5 -> 10.0.1.8 TCP 74 1234 > 33892 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0
3   0.000466 10.0.1.8 -> 10.0.0.5 TCP 66 33892 > 1234 [ACK]      Seq=1 Ack=1 Win=5840 Len=0

4   2.057643 10.0.1.8 -> 10.0.0.5 TCP 74 33894 > 1234 [SYN]      Seq=0       Win=5840 Len=0
5   2.057656 10.0.0.5 -> 10.0.1.8 TCP 74 1234 > 33894 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0
6   2.058656 10.0.1.8 -> 10.0.0.5 TCP 66 33894 > 1234 [ACK]      Seq=1 Ack=1 Win=5840 Len=0

7  15.614608 10.0.1.8 -> 10.0.0.5 TCP 74 33899 > 1234 [SYN]      Seq=0       Win=5840 Len=0
8  15.614615 10.0.0.5 -> 10.0.1.8 TCP 74 1234 > 33899 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0
9  15.614991 10.0.1.8 -> 10.0.0.5 TCP 74 33899 > 1234 [ACK]      Seq=1 Ack=1 Win=5840 Len=0

That pattern continues. A SYN packet from the client, the SYN+ACK response, and the final ACK to nail up the connection, and then... nothing.

The upshot for the server is that the finite number of connection slots in the daemon's memory structures filled up over time with essentially dead connections from hosts like this one. The fix was easy — add a timer to each slot and eject clients that tarry for too long without sending us any real data.

But that's not the point of this post.

The interesting part of the whole ordeal was reproducing the network issue post-incident, to ensure that my proposed fix to the daemon connection-handling logic would withstand hours of such misbehavior.

The problem is we don't know exactly what happened. Graphs of per-process open file descriptor usage helped us to pinpoint when the daemon started leaking fds. We think we know why it happened. We know that a hypervisor / virtual switch upgrade was taking place around the same time as the ramp-up period. We also know that we had seemingly unrelated network outages in backbone connectivity to other data centers.

[Figure: an RRD (round-robin database) graph of open file descriptors over time. The count holds relatively constant below 2k, then grows linearly to an outlying peak of 16k, cuts off cleanly (as the process crashes, we assume), and grows linearly again until the problem is fixed 36 hours later.]
We definitely know the 'when'

Literally all we have to go on is the tcpdump we took while things were broken. The challenge was to introduce that kind of network outage in a controlled fashion, inside our development infrastructure.

Naturally, I turned to iptables, which can do more than just keep out intruders. As a consummate inspector of packets, it can be used to emulate specific network problems, with a surprising degree of accuracy.

For starters, here's the shell script I start all iptables experimentation with:

#!/bin/sh
IPTABLES="/usr/bin/sudo /sbin/iptables"

$IPTABLES -F

# rules go here

$IPTABLES -L -nv
echo
echo "The firewall is up.  Ctrl-C if want to keep it up;"
echo "otherwise it will be automatically dropped in 4 minutes."
sleep 240

echo "Dropping firewall (failsafe)"
$IPTABLES -F
echo "Dropped."

Whenever I deal with firewalls, especially on virtual machines and remote physical servers where it can be difficult to get a non-networked console, I build in a deadman switch. After four minutes, without human intervention, the nascent firewall will be dropped. This ensures that if I do bork the connection all up (a distinct possibility), the box will "fix" itself after some time. Four minutes is enough time to switch to another terminal and verify that I can get back in over SSH.

In my case, my dev environment is set up with lots of other machines that are constantly connecting to my dev server, just as if it were a prod service. As such, I won't be going into generating network load or anything like that.

We'll start off with (incorrect) rules that outright block a single host in our dev network, and verify that everything else is kosher:

VICTIM=10.44.0.6
PORT=1234

$IPTABLES -A INPUT -p tcp --dport $PORT --src $VICTIM -j DROP

That works (except that it doesn't). If we raise that firewall (and Ctrl-C to make it permanent), we should see that connections from our hapless victim at 10.44.0.6 are being silently dropped; no RST packet will be sent back.
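To sanity-check that rule, a quick capture on the server side does the trick. This is just a sketch (assuming the server's interface is eth0); you should see the victim's SYNs arrive and get retransmitted, with no SYN+ACK ever going back:

# watch the victim's connection attempts go unanswered
sudo tcpdump -ni eth0 'host 10.44.0.6 and tcp port 1234'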

Cool. But that's not the observed behavior. We were seeing the SYN / SYN+ACK / ACK three-way handshake complete, we just weren't seeing any data packets. Enter the --tcp-flags option.

Activated whenever you set the protocol match to TCP (via -p tcp), this handy option lets you accept or reject packets based on the TCP flags set in each. It takes two arguments. The first is the set of flags you want to consider, given by their symbolic name (SYN ACK URG PSH etc.). The second is the subset of those that must be set for the rule to match.
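As a quick illustration of the two-argument form (not a rule we'll use below), this matches packets where, of SYN and ACK, only SYN is set, i.e. brand-new connection attempts, and logs them:

iptables -A INPUT -p tcp --tcp-flags SYN,ACK SYN -j LOG --log-prefix "NEW SYN: "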

The canonical example (and about the only one you'll find by googling it) is BADFLAGS. Some flag combinations are nonsensical. Take SYN+FIN and SYN+RST, for example: "set me up a connection that should be torn down".

Here's the full rule set:

iptables -N BADFLAGS
iptables -A BADFLAGS -j LOG --log-prefix "BADFLAGS: "
iptables -A BADFLAGS -j DROP

iptables -N TCP_FLAGS
iptables -A TCP_FLAGS -p tcp --tcp-flags ACK,FIN FIN                  -j BADFLAGS
iptables -A TCP_FLAGS -p tcp --tcp-flags ACK,PSH PSH                  -j BADFLAGS
iptables -A TCP_FLAGS -p tcp --tcp-flags ACK,URG URG                  -j BADFLAGS
iptables -A TCP_FLAGS -p tcp --tcp-flags FIN,RST FIN,RST              -j BADFLAGS
iptables -A TCP_FLAGS -p tcp --tcp-flags SYN,FIN SYN,FIN              -j BADFLAGS
iptables -A TCP_FLAGS -p tcp --tcp-flags SYN,RST SYN,RST              -j BADFLAGS
iptables -A TCP_FLAGS -p tcp --tcp-flags ALL     ALL                  -j BADFLAGS
iptables -A TCP_FLAGS -p tcp --tcp-flags ALL     NONE                 -j BADFLAGS
iptables -A TCP_FLAGS -p tcp --tcp-flags ALL     FIN,PSH,URG          -j BADFLAGS
iptables -A TCP_FLAGS -p tcp --tcp-flags ALL     SYN,FIN,PSH,URG      -j BADFLAGS
iptables -A TCP_FLAGS -p tcp --tcp-flags ALL     SYN,RST,ACK,FIN,URG  -j BADFLAGS
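As written, nothing ever jumps to TCP_FLAGS; to actually put it in the packet path, you'd send incoming TCP traffic through it from the INPUT chain, with something like:

iptables -A INPUT -p tcp -j TCP_FLAGS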

That's not how we want to use this option, though. What we need to do is allow the SYN, SYN+ACK and ACK of the handshake, but disallow almost everything else (including the FIN packet and any data packets).

More formally, we should:

  1. Allow any packet with only the ACK flag set
  2. Allow any packet with only the SYN flag set
  3. Drop everything else

This translates quite nicely into our iptables calls:

VICTIM=10.44.0.6
PORT=1234

$IPTABLES -A INPUT -p tcp --dport $PORT --src $VICTIM --tcp-flags ALL SYN -j ACCEPT
$IPTABLES -A INPUT -p tcp --dport $PORT --src $VICTIM --tcp-flags ALL ACK -j ACCEPT
$IPTABLES -A INPUT -p tcp --dport $PORT --src $VICTIM -j DROP

Perfect!

If we run a tcpdump with this firewall up, we should see the same SYN / SYN+ACK / ACK pattern we saw from our production packet capture.
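For a quick end-to-end check from the victim's side, something like the following should work (SERVER here is a stand-in for whatever dev box is running the daemon, and -q assumes a traditional/Debian-style netcat): the connect() succeeds, but the payload never reaches the daemon, and the wire looks just like the production capture.

# run on the victim (10.44.0.6); the handshake completes, but the
# PSH+ACK data segment gets eaten by the DROP rule
echo "hello" | nc -q 5 $SERVER 1234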

iptables is a flexible packet filter. Combined with the insight you get from tcpdump, it becomes a powerful development tool.
