UPDATE: Geoff helpfully pointed out this morning that you can set the Warden netmask via the bare-metal.yml.example file. But that's no fun.
I recently spun up BOSH-lite on my lab server, thanks to the Vagrant tweaks that Geoff Franks made against Ruben Koster's bare-metal-bosh-lite code.
With BOSH spinning, I put together a small testing BOSH release (which you can find here), and deployed it. The deployment worked fine, and soon I had some test VMs running as Warden containers on the server.
Problem is, the networking was all mismatched.
Anyone familiar with BOSH-lite is accustomed to the 10.244.0.0/16 network space that it uses by default. This convention-over-configuration approach has led to lots of BOSH releases shipping a Warden configuration out-of-the-box that spins up static IPs in the 10.244 network. No problem; BOSH-lite is intended to be used for development.
For reasons I'll not go into right now (maybe a future blog post), I want to be able to run several bare-metal BOSH servers using Warden containers. I don't want to invest the money in vSphere, and I don't have the patience to stand up OpenStack. That leaves AWS, which is too expensive for my ambitions, and a bit overkill.
Which is why I turned to BOSH + Vagrant in the first place.
The problem I ran into had to do with routing. Here's a highly simplified view of the network topology inside of my lab:
All wireless clients (laptops and phones included) live on the Unprivileged Access Network (the pink cloud), which routes everything to the Core Network (green) via a consumer-grade wireless router at 10.0.1.1.
The core network router is a Linksys WRT54G with a custom build of OpenWRT that allows it to manage routing, firewall duty, and termination of a handful of OpenVPN point-to-point VPNs that stitch my network into a larger one distributed across the Internet.
The important part of the diagram, however, is the 10.4.0.0/16 (yellow) network labeled "Service Network 1". This network exists entirely inside of the beefy server hardware that I am running BOSH-lite on. The Core Network OpenWRT router has static routes for the 10.4/16 subnet.
The Easy Solution
The easy solution, of course, is to just change the Service Network range from 10.4/16 to 10.244/16. Problem solved. Off to the deployments!
Unfortunately, that masks a not-so-subtle problem with the default configuration of BOSH-lite (which, as a development platform, is not important enough to warrant discussion): you can't run more than one!
The Hacker Solution
Just for fun, let's try changing the BOSH deployment manifest to use networks in 10.4/16, and see if it works!
jumpbox $ ./test-dev manifest warden
jumpbox $ sed -i -e 's/10.244/10.4/' manifests/test-warden-manifest.yml
jumpbox $ bosh -n deploy
Acting as user 'admin' on deployment 'test-dev' on 'Bosh Lite Director'
Getting deployment properties from director...
Deploying
---------
Director task 59
Started unknown
Started unknown > Binding deployment. Done (00:00:00)
Started preparing deployment
Started preparing deployment > Binding releases. Done (00:00:00)
Started preparing deployment > Binding existing deployment. Done (00:00:00)
Started preparing deployment > Binding resource pools. Done (00:00:00)
Started preparing deployment > Binding stemcells. Done (00:00:00)
Started preparing deployment > Binding templates. Done (00:00:00)
Started preparing deployment > Binding properties. Done (00:00:00)
Started preparing deployment > Binding unallocated VMs. Done (00:00:00)
Started preparing deployment > Binding instance networks. Done (00:00:00)
Started preparing package compilation > Finding packages to compile. Done (00:00:00)
Started preparing dns > Binding DNS. Done (00:00:00)
Started creating bound missing vms > small_z1/0. Failed: Creating VM with agent ID \
'aacf0752-d306-4788-b6ee-aabbfc338d4b': Creating container: network already acquired: \
10.4.2.8/30 (00:00:01)
Error 100: Creating VM with agent ID 'aacf0752-d306-4788-b6ee-aabbfc338d4b': \
Creating container: network already acquired: 10.4.2.8/30
Task 59 error
For a more detailed error report, run: bosh task 59 --debug
You can try any number of different networks, but if they aren't in 10.244/16, BOSH (or more specifically, Warden) will fail to provision the network, claiming that it is "already acquired".
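To see ahead of time whether a manifest subnet would land inside Warden's pool, a quick containment check helps. The following is only a sketch in plain bash; the helper names ip_to_int and subnet_in_pool are my own inventions for illustration, not part of BOSH or Warden, and it assumes well-formed IPv4 CIDRs without leading zeros in the octets.

```shell
#!/usr/bin/env bash

# Convert a dotted-quad IPv4 address into a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Exit 0 if subnet $1 (CIDR) lies entirely inside pool $2 (CIDR), 1 otherwise.
# Both helpers are hypothetical names used only for this example.
subnet_in_pool() {
  local sub_ip=${1%/*} sub_len=${1#*/}
  local pool_ip=${2%/*} pool_len=${2#*/}

  # A subnet with a shorter prefix than the pool cannot fit inside it.
  (( sub_len >= pool_len )) || return 1

  # Build the pool's netmask and compare the masked network addresses.
  local mask=$(( (0xFFFFFFFF << (32 - pool_len)) & 0xFFFFFFFF ))
  (( ($(ip_to_int "$sub_ip") & mask) == ($(ip_to_int "$pool_ip") & mask) ))
}
```

With these defined, subnet_in_pool 10.4.2.8/30 10.244.0.0/16 returns non-zero, which matches the subnet Warden refused above, while the same subnet checked against 10.0.0.0/8 returns zero.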
Luckily, while the decision to use 10.244/16 is hard-coded into the BOSH-lite/warden distribution, it is explicitly called out in exactly one place: the startup script that runs the Warden supervisor.
boshbox # grep -nC6 10.244.0.0 /var/vcap/jobs/warden/bin/warden_ctl
29- exec /var/vcap/packages/warden-linux/bin/warden-linux \
30- -disableQuotas=true \
31- -listenNetwork=tcp \
32- -listenAddr=0.0.0.0:7777 \
33- -denyNetworks= \
34- -allowNetworks= \
35: -networkPool=10.244.0.0/16 \
36- -depot=/var/vcap/data/warden/depot \
37- -rootfs=/var/vcap/packages/rootfs_lucid64 \
38- -overlays=/var/vcap/data/warden/overlays \
39- -bin=/var/vcap/packages/warden-linux/src/github.com/cloudfoundry-incubator/warden-linux/linux_backend/bin \
40- -containerGraceTime=5m \
41- 1>>$LOG_DIR/warden.stdout.log \
If you change line 35 to specify the -networkPool option as 10.0.0.0/8, and subsequently restart the Warden supervisor, you can provision against any subnet of 10/8, and even use different nets on different boxen.
boshbox # sed -i -e 's@10.244.0.0/16@10.0.0.0/8@' \
/var/vcap/jobs/warden/bin/warden_ctl
boshbox # monit restart warden
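If you would rather not edit the control script blind, the same change can be wrapped in a small function that keeps a backup and confirms the new value landed. This is a sketch, not anything BOSH-lite ships: set_network_pool is a hypothetical helper name, the .orig backup suffix is my choice, and it assumes GNU sed (for -i) and a plain CIDR as the second argument.

```shell
#!/usr/bin/env bash

# set_network_pool FILE NEW_CIDR
# Rewrite the -networkPool flag in FILE to NEW_CIDR, keeping FILE.orig as a backup.
# (Hypothetical helper for illustration only.)
set_network_pool() {
  local file=$1 new_pool=$2

  # Refuse to touch a file that does not carry the flag at all.
  grep -q -- '-networkPool=' "$file" || return 1

  # Keep a copy of the original so the change is easy to revert.
  cp -p "$file" "$file.orig" || return 1

  # Replace whatever CIDR currently follows -networkPool=.
  sed -i -e "s@-networkPool=[0-9./]*@-networkPool=${new_pool}@" "$file" || return 1

  # Confirm the new value is actually present (fixed-string match).
  grep -qF -- "-networkPool=${new_pool}" "$file"
}
```

On the BOSH-lite box this would be called as set_network_pool /var/vcap/jobs/warden/bin/warden_ctl 10.0.0.0/8, followed by the same monit restart warden as above.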
Happy Hacking!