We have a tough job here at Packet. Provisioning bare metal servers on-demand in 5 minutes or less is a tricky proposition. Bringing a server from power-off, with no operating system, to booting into an OS with functional networking and user access is complicated. It involves configuring and orchestrating many different systems, many of which don’t really like to talk to each other (*cough* IPMI *cough*), which can make troubleshooting extraordinarily difficult. Add to the mix some of the unusual and cool things we are doing at Packet, like bonded line-rate network interfaces or programming customer networks with software overlays, and things start getting real.
We’ve set up a nice suite of tools that help monitor, diagnose, test and fix problems as they arise: service monitoring, aggregated logging, unit and functional tests, exception reporting, and alerts all play a critical role in exposing problems and helping us fix issues in the stack. But they don’t give us a consistent indicator that every stage in the life cycle of a device is working properly at all times. We need a way to ensure that everything works as expected from an initial “device create” API call, to building a correctly configured environment for the end-user, to a clean deprovision once the device is terminated, across all operating systems and server configurations, at all times. We need a canary that can tell us when something has gone wrong.
The solution, as can happen, presented itself while I was working on building the Packet driver for Docker Machine.
As its README states, Docker Machine makes it really easy to create Docker hosts on your computer, on cloud providers, and inside your own data center. It creates servers, installs Docker on them, then configures the Docker client to talk to them. In short, it makes it really easy to create and manage hosts running Docker, regardless of platform or provider. For example, if I want to create a host called “funtimes” running Docker on Packet, I simply do:
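Something along these lines (the exact driver flag names can vary by driver version, so check `docker-machine create --driver packet --help` for the ones your build accepts):

```shell
# Provision a Packet server named "funtimes" and install Docker on it.
# Flag names here are illustrative of the Packet driver's options;
# verify them against your installed driver version.
docker-machine create --driver packet \
    --packet-api-key=$PACKET_API_KEY \
    --packet-project-id=$PACKET_PROJECT_ID \
    funtimes
```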
and once complete, I can then ssh into the machine using the ssh key pair generated during setup, like so:
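Docker Machine handles the key lookup for us:

```shell
# Open a shell on the remote host, authenticating with the
# key pair Docker Machine generated during provisioning.
docker-machine ssh funtimes
```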
and if I set up the Docker environment, I can issue docker commands against the remote host from my local machine:
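That setup is just an `eval` of the variables `docker-machine env` prints:

```shell
# Export DOCKER_HOST, DOCKER_CERT_PATH, etc. so the local
# docker client talks to the daemon on the remote host...
eval "$(docker-machine env funtimes)"

# ...after which any docker command runs against "funtimes".
docker ps
```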
Machine also supports creating nodes and having them join a Docker Swarm pool, and although it only supports Ubuntu at the moment, there is active work on adding support for additional operating systems (CoreOS, RancherOS, etc.).
As I was developing the driver, I realized that this was exactly what we wanted for testing the end-user life cycle. Docker Machine creates a server using our API, generates an ssh key pair that is installed on the server, logs in with it, and installs Docker. Once set up, it’s trivial to execute remote commands over ssh, and then destroy the device using “docker-machine rm”.
Once the machine driver was working well, I wrote a little program in Go that pings one of our channels in Slack, and built a Docker image for it on Quay.io. I then whipped up a bash script (canary.sh) that uses Docker Machine to create a server with a hostname based on the current time and operating system flavor, runs my Slack pinger image on the new host, and then deprovisions the host. The script logs these events to Logentries, where we have two alerts configured: the first triggers if a failure is logged, and the second triggers if provision test log entries *don’t* appear after a period of time. This ensures that we’ll be notified if the canary logs a failure, but also if something has happened to the canary which is preventing it from logging, or has stopped it from working altogether.
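A minimal sketch of that loop follows. The OS flavor, image name, and `log` function are placeholders, not our production values; in the real script the log lines are shipped to Logentries rather than echoed:

```shell
#!/bin/bash
# canary.sh (sketch) -- provision, test, deprovision, logging each step.

OS="ubuntu_14_04"                           # placeholder OS flavor
NAME="canary-${OS}-$(date +%Y%m%d%H%M%S)"   # hostname from time + OS flavor

log() {
    # Placeholder: the real script ships these lines to Logentries,
    # where alerts watch both for FAILURE entries and for silence.
    echo "$(date -u +%FT%TZ) canary ${NAME} $*"
}

log "provision start"
if docker-machine create --driver packet --packet-os "$OS" "$NAME"; then
    log "provision ok"
    # Run the Slack pinger on the new host (image name is illustrative).
    eval "$(docker-machine env "$NAME")"
    docker run --rm quay.io/example/slack-pinger \
        && log "test ok" || log "FAILURE test"
else
    log "FAILURE provision"
fi

log "deprovision start"
docker-machine rm -f "$NAME" \
    && log "deprovision ok" || log "FAILURE deprovision"
```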
We plan to expand the tests that run on the host so that in addition to pinging Slack, we do things like checking that the network configuration is set up properly, that the disk, CPU, and RAM all report what they should, and that the host is getting the expected metadata from our metadata service. We also plan to use Docker Machine and a slightly more sophisticated application to do full rack burn-in and benchmarking on new inventory (we get servers in batches of 120 per rack, so doing this manually is out of the question).
Hope you enjoyed the post!