BOSH-lite/warden on 10.x.x.x

UPDATE: Geoff helpfully pointed out this morning that you can set the Warden netmask via the bare-metal.yml.example file. But that's no fun.


I recently spun up BOSH-lite on my lab server, thanks to the vagrant tweaks that Geoff Franks made against Ruben Koster's bare-metal-bosh-lite code.

With BOSH spinning, I put together a small testing BOSH release (which you can find here), and deployed it. The deployment worked fine, and soon I had some test VMs running as Warden containers on the server.

Problem is, the networking was all mismatched.

Anyone familiar with BOSH-lite is accustomed to the 10.244.0.0/16 network space that it uses by default. This convention-over-configuration approach has led to lots of BOSH releases shipping a Warden configuration out-of-the-box that spins up static IPs in the 10.244 network. No problem; BOSH-lite is intended to be used for development.

For reasons I'll not go into right now (maybe a future blog post), I want to be able to run several bare-metal BOSH servers using Warden containers. I don't want to invest the money in vSphere, and I don't have the patience to stand up OpenStack. That leaves AWS, which is too expensive for my ambitions, and a bit overkill.

Which is why I turned to BOSH + vagrant in the first place.

The problem I ran into had to do with routing. Here's a highly simplified view of the network topology inside of my lab:

All wireless clients (laptops and phones included) live on the Unprivileged Access Network (the pink cloud), which routes everything to the Core Network (green) via a consumer-grade wireless router at 10.0.1.1.

The core network router is a Linksys WRT54G with a custom build of OpenWRT that allows it to manage routing, firewall duty, and termination of a handful of OpenVPN point-to-point VPNs that stitch my network into a larger one distributed across the Internet.

The important part of the diagram, however, is the 10.4.0.0/16 (yellow) network labeled "Service Network 1". This network exists entirely inside of the beefy server hardware that I am running BOSH-lite on. The Core Network OpenWRT router has static routes for the 10.4/16 subnet.

The Easy Solution

The easy solution, of course, is to just change the Service Network range from 10.4/16 to 10.244/16. Problem solved. Off to the deployments!

Unfortunately, that masks a not-so-subtle problem with the default configuration of BOSH-lite (one that, on a development platform, rarely matters enough to warrant discussion): you can't run more than one on the same routed network, because every instance lays claim to the same 10.244/16 range!

The Hacker Solution

Just for fun, let's try changing the BOSH deployment manifest to use networks in 10.4/16, and see if it works!

jumpbox $ ./test-dev manifest warden
jumpbox $ sed -i -e 's/10\.244\./10.4./g' manifests/test-warden-manifest.yml
jumpbox $ bosh -n deploy
Acting as user 'admin' on deployment 'test-dev' on 'Bosh Lite Director'
Getting deployment properties from director...

Deploying
---------

Director task 59
  Started unknown
  Started unknown > Binding deployment. Done (00:00:00)

  Started preparing deployment
  Started preparing deployment > Binding releases. Done (00:00:00)
  Started preparing deployment > Binding existing deployment. Done (00:00:00)
  Started preparing deployment > Binding resource pools. Done (00:00:00)
  Started preparing deployment > Binding stemcells. Done (00:00:00)
  Started preparing deployment > Binding templates. Done (00:00:00)
  Started preparing deployment > Binding properties. Done (00:00:00)
  Started preparing deployment > Binding unallocated VMs. Done (00:00:00)
  Started preparing deployment > Binding instance networks. Done (00:00:00)

  Started preparing package compilation > Finding packages to compile. Done (00:00:00)

  Started preparing dns > Binding DNS. Done (00:00:00)

  Started creating bound missing vms > small_z1/0. Failed: Creating VM with agent ID \
    'aacf0752-d306-4788-b6ee-aabbfc338d4b': Creating container: network already acquired: \
    10.4.2.8/30 (00:00:01)

Error 100: Creating VM with agent ID 'aacf0752-d306-4788-b6ee-aabbfc338d4b': \
  Creating container: network already acquired: 10.4.2.8/30

Task 59 error

For a more detailed error report, run: bosh task 59 --debug

You can try any number of different networks, but if they aren't in 10.244/16, BOSH (or more specifically, Warden) will fail to provision the network, claiming that it is "already acquired".
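
One plausible reading of that error is that the /30 the manifest asks for (10.4.2.8/30 above) falls outside the address range Warden is willing to hand out. Here is a minimal bash sketch of that kind of containment test. It is not Warden's implementation, and the function names ip_to_int and cidr_in_pool are made up for illustration.

```shell
#!/usr/bin/env bash

# Convert a dotted-quad IPv4 address to a 32-bit integer.
# The 10# prefix forces base 10 so an octet like "08" is not read as octal.
ip_to_int() {
  local IFS=.
  local -a o
  read -r -a o <<< "$1"
  echo $(( (10#${o[0]} << 24) | (10#${o[1]} << 16) | (10#${o[2]} << 8) | 10#${o[3]} ))
}

# cidr_in_pool SUBNET POOL
# Succeeds (exit 0) only if SUBNET lies entirely inside POOL.
cidr_in_pool() {
  local sub_ip=${1%/*} sub_len=${1#*/}
  local pool_ip=${2%/*} pool_len=${2#*/}

  # A subnet with a shorter prefix than the pool cannot fit inside it.
  (( sub_len >= pool_len )) || return 1

  # Compare both network addresses under the pool's netmask.
  local mask=$(( (0xFFFFFFFF << (32 - pool_len)) & 0xFFFFFFFF ))
  (( ( $(ip_to_int "$sub_ip") & mask ) == ( $(ip_to_int "$pool_ip") & mask ) ))
}

# The subnet from the failed deploy, tested against the default BOSH-lite range.
cidr_in_pool 10.4.2.8/30 10.244.0.0/16 || echo "10.4.2.8/30 is outside 10.244.0.0/16"
```

The same function reports success for 10.244.2.8/30 against 10.244.0.0/16, which matches the observation that only 10.244 networks provision cleanly.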

Luckily, while the decision to use 10.244/16 is hard-coded into the BOSH-lite/warden distribution, it is explicitly called out in exactly one place: the startup script that runs the Warden supervisor.

boshbox # grep -nC6 10.244.0.0 /var/vcap/jobs/warden/bin/warden_ctl
29-    exec /var/vcap/packages/warden-linux/bin/warden-linux \
30-      -disableQuotas=true \
31-      -listenNetwork=tcp \
32-      -listenAddr=0.0.0.0:7777 \
33-      -denyNetworks= \
34-      -allowNetworks= \
35:      -networkPool=10.244.0.0/16 \
36-      -depot=/var/vcap/data/warden/depot \
37-      -rootfs=/var/vcap/packages/rootfs_lucid64 \
38-      -overlays=/var/vcap/data/warden/overlays \
39-      -bin=/var/vcap/packages/warden-linux/src/github.com/cloudfoundry-incubator/warden-linux/linux_backend/bin \
40-      -containerGraceTime=5m \
41-      1>>$LOG_DIR/warden.stdout.log \

If you change line 35 to specify the -networkPool option as 10.0.0.0/8, and subsequently restart the Warden supervisor, you can provision against any subnet of 10/8, and even use different nets on different boxen.

boshbox # sed -i -e 's@10.244.0.0/16@10.0.0.0/8@' \
          /var/vcap/jobs/warden/bin/warden_ctl
boshbox # monit restart warden
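
If you would rather not edit the control script blind, the same change can be wrapped in a small helper that keeps a backup and shows the resulting line. This is only a sketch: set_network_pool is a hypothetical name, and it assumes the option appears in the script as -networkPool=<cidr>, as in the grep output above.

```shell
#!/usr/bin/env bash

# set_network_pool FILE POOL
# Rewrite the value of the -networkPool= option in FILE to POOL.
# Keeps the original as FILE.orig and prints the changed line(s).
set_network_pool() {
  local file=$1 pool=$2

  # Refuse to touch a file that does not carry the option at all.
  if ! grep -q -- '-networkPool=' "$file"; then
    echo "no -networkPool option found in $file" >&2
    return 1
  fi

  # Back up first, then write the edited copy over the original path,
  # which leaves the original file's mode (e.g. the executable bit) intact.
  cp -p "$file" "$file.orig"
  sed -e "s@-networkPool=[0-9./]*@-networkPool=${pool}@" "$file.orig" > "$file"

  # Show the result so the new value can be checked by eye.
  grep -n -- '-networkPool=' "$file"
}
```

On the BOSH-lite box this would be called as set_network_pool /var/vcap/jobs/warden/bin/warden_ctl 10.0.0.0/8, followed by the same monit restart warden as above. If something goes wrong, copying warden_ctl.orig back over warden_ctl restores the default.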

Happy Hacking!

James (@iamjameshunt) works on the Internet, spends his weekends developing new and interesting bits of software and his nights trying to make sense of research papers.

Currently working on Rook.