Building a Marathon Cluster on CoreOS and Bare Metal

August 20th, 2015

Force12 is all about scaling and orchestrating microservices - and open source!  To that end, they've been crafting up code that deploys a 3 node Mesos cluster running the Marathon scheduler and have released it on GitHub.  This post explains how they deployed the project on Packet bare metal.

Note: This is a guest blog from Ross Fairbanks of Force12.io.  Thanks, Ross!

At Force12, we’re big into scaling and orchestrating microservices - and open source!  To that end, we’ve been crafting up code that deploys a 3 node Mesos cluster running the Marathon scheduler and have released it on GitHub.  The cluster can either be run locally using Vagrant or deployed to physical servers at Packet via their API, using cloud-init compatible user data. Everything we talk about here is available on GitHub.

What is Force12.io?
We think a big benefit of Linux containers is enabling microscaling: scaling containers up and down in close to real time, within existing compute capacity and based on current demand (for more on our vision and the Force12 project, read up here). The idea is that if there is a spike in web traffic, one wants the ability to stop demand-insensitive worker containers and replace them with demand-sensitive web containers.  The Force12 project aims to make “application QoS” a realistic capability for users of microservices.  To be explicit, the goal of Force12 is not auto-scaling, which requires spinning up additional capacity; net new capacity simply can’t be deployed in real time.

Why Packet?
Back in May, Force12 launched a demo of our microscaling service running on Amazon ECS (EC2 Container Service). This was the quickest way for us to launch a working demo, prove some of our theories and get feedback from the community.  However, it was always our goal to make Force12 a platform-agnostic project and support as many major container platforms as we could.

So when I saw Sam Tresler’s presentation at CoreOS Fest in San Francisco about running CoreOS on bare metal, I thought it was a perfect time to both try a new cloud platform (with bare metal being as agnostic as one can get!) and get deeper into CoreOS (note: we had already used CoreOS for our ECS setup, vs the proprietary Amazon Linux).

So in good modern web 3.0 fashion, I tweeted about Sam’s presentation and the Packet team got in touch -- turns out we had tons in common, including shared interest in Calico, a highly innovative project for dynamic container networking.  The Packet guys got me some credits (thanks!) to cover a few of their “Type 3” servers (16 physical CPU cores, 1.8TB of local NVMe Flash and 128GB of DDR4 RAM), which would allow me to test a large number of containers as well as achieve much faster container launch speeds (near instant, versus the 4-5 seconds we were used to on ECS).

Why Marathon?
Eventually, we’d like to see Force12 as a container scheduler that can work nicely with other schedulers: Force12 handling microscaling for your container cluster, and other schedulers handling fault tolerance. However, just as our first version leveraged ECS, we needed to choose a scheduler to get started with here.  We decided to use Marathon, developed by Mesosphere, which provides fault tolerance and has a nice RESTful API -- and, if you squint a bit, is pretty close to the ECS API provided by AWS!  So we chose Marathon as the base for the next Force12 demo on Packet, giving us an open-source scheduler that could provide high availability on any infrastructure platform.

CoreOS Vagrant Template
As mentioned earlier, we wanted to ensure that Force12 could work locally on your laptop or on a cloud platform, so it was important to have a good Vagrant option that was easy to deploy and automate.  Luckily, CoreOS has a great Vagrant template that supports creating a standalone VM or a cluster of 3 VMs running on either VirtualBox or VMware Fusion. It has some nice features, such as setting the discovery token for starting an etcd cluster, and supports providing a cloud-config file, a cloud-init style YAML file that we could customize to configure CoreOS.  This YAML should work the same on our local Vagrant setup as on cloud providers with a modern metadata service, like Packet.
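
For flavour, a minimal cloud-config along these lines (adapted from the CoreOS documentation; the discovery URL below is a placeholder and each cluster needs its own token from discovery.etcd.io) is enough to bring the three machines up as a single etcd2 cluster:

    #cloud-config
    coreos:
      etcd2:
        # placeholder - generate a fresh token per cluster at https://discovery.etcd.io/new?size=3
        discovery: https://discovery.etcd.io/<token>
        advertise-client-urls: http://$private_ipv4:2379
        initial-advertise-peer-urls: http://$private_ipv4:2380
        listen-client-urls: http://0.0.0.0:2379
        listen-peer-urls: http://$private_ipv4:2380
      units:
        - name: etcd2.service
          command: start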

Marathon & Mesos systemd templates
One of the other things we leveraged was a set of systemd unit templates released by Mesosphere for starting Marathon, ZooKeeper and Mesos. Sadly, these systemd templates have since been deprecated by Mesosphere in favour of their DCOS product.  However, in order to run Mesos on Packet with CoreOS, using these was our best option because DCOS is currently only available for AWS.  So, we decided to plough on with systemd regardless.  It wasn’t exactly smooth sailing, but we got through it and have detailed some of our issues below.
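
To give a flavour of the approach, here’s a rough sketch of a Mesos Master unit expressed as a cloud-config unit entry. It’s illustrative rather than a verbatim copy of the Mesosphere templates: the Docker image, flags and the Consul DNS name are assumptions about a typical setup, not the exact unit in our repo.

    coreos:
      units:
        - name: mesos-master.service
          command: start
          content: |
            [Unit]
            Description=Mesos Master (illustrative sketch)
            After=docker.service
            Requires=docker.service

            [Service]
            Restart=always
            # The ZooKeeper address uses the Consul DNS name described
            # in the Service Discovery section below
            ExecStart=/usr/bin/docker run --name mesos-master --net host \
              mesosphere/mesos-master \
              --zk=zk://zookeeper-2181.service.consul:2181/mesos \
              --work_dir=/var/lib/mesos \
              --quorum=1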

Packet API
Using the Packet API, we needed to create a cluster consisting of one master node and two slave nodes. The master node runs Marathon, Mesos Master and ZooKeeper; the slave nodes just run Mesos Slave. There are separate cloud-config YAML files for the master and slave nodes, which are fed to the Packet API when each node is provisioned.
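
Roughly speaking, the two files differ only in which units they start (unit names here are illustrative; the real cloud-configs in the repo carry the full unit definitions):

    # master cloud-config (sketch): ZooKeeper, Mesos Master and Marathon
    coreos:
      units:
        - name: zookeeper.service
          command: start
        - name: mesos-master.service
          command: start
        - name: marathon.service
          command: start

    # slave cloud-config (sketch): just Mesos Slave
    coreos:
      units:
        - name: mesos-slave.service
          command: start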

Side note -- as mentioned above, one of the features I like most is that the cloud-config files Vagrant runs locally are identical to the cloud-config provided when provisioning servers via the Packet API.  Now we’re cooking with cross-platform gas!

Service Discovery
Service discovery was the biggest challenge in getting the cluster working reliably. Both Mesos and Marathon use ZooKeeper for finding the master node they should connect to. ZooKeeper runs on the master node, so the slave nodes need to know the master’s IP address. Within Vagrant this is easy because the IP address is set in the Vagrantfile, but on Packet (and most cloud providers) it is trickier because the IP address is not pre-assigned. To solve this, we used etcd2 to bootstrap a Consul cluster.  The ZooKeeper service gets registered with Consul and all clients access it via the DNS interface, connecting to zookeeper-2181.service.consul.
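
Concretely, that means a slave never needs the master’s IP baked into its cloud-config. A hedged sketch of a Mesos Slave unit using the Consul DNS name (again, the image and flags are illustrative, not the exact unit from our repo):

    coreos:
      units:
        - name: mesos-slave.service
          command: start
          content: |
            [Unit]
            Description=Mesos Slave (illustrative sketch)
            After=docker.service
            Requires=docker.service

            [Service]
            Restart=always
            # The master is found via Consul's DNS interface, not a fixed IP
            ExecStart=/usr/bin/docker run --name mesos-slave --net host \
              -v /var/run/docker.sock:/var/run/docker.sock \
              mesosphere/mesos-slave \
              --master=zk://zookeeper-2181.service.consul:2181/mesos \
              --containerizers=docker,mesos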

Bootstrapping
The whole bootstrapping process involves first setting the unique discovery token from discovery.etcd.io, followed by processing of the cloud-config, which starts etcd2 as a service. The etcd2 cluster is then used to bootstrap the Consul cluster; we use consul-coreos from Democracy Works to do this.

We then use Registrator to register each service with Consul as it starts. Once Consul is running, we start the Marathon and Mesos services. The bootstrapping process is quite complex and getting it working involved fixing several problems.
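
For illustration, a Registrator unit in the cloud-config might look something like the following; the consul.service unit name is an assumption about how the local Consul agent is wired up, not the exact unit we ship:

    coreos:
      units:
        - name: registrator.service
          command: start
          content: |
            [Unit]
            Description=Registrator (illustrative sketch)
            After=docker.service consul.service
            Requires=docker.service

            [Service]
            Restart=always
            # Watch the Docker socket and register each container's ports
            # with the local Consul agent as containers come and go
            ExecStart=/usr/bin/docker run --name registrator --net host \
              -v /var/run/docker.sock:/tmp/docker.sock \
              gliderlabs/registrator consul://localhost:8500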

The two main problems were an issue with the Consul DNS interface and the sequence in which units were started in the cloud-config files.

  • The DNS interface problem happened because we were setting the hostname to be fully qualified (e.g. core-01.force12.io). This meant .force12.io was being appended to the DNS requests to .service.consul and they were failing.  The pro-tip is that you should not set the domain on the hostnames when using Consul!

  • The startup sequence problem was due to starting the Marathon and Mesos services too early. It takes time to form the Consul cluster, as each node has to have started and a leader has to be elected before consensus is reached. To get around this, we simply defined Marathon and Mesos as dependent on Consul in the cloud-config files, so that those services don’t start until Consul is fully up and running (both fixes are sketched below).
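
Both fixes boil down to a couple of lines of cloud-config. A sketch (the consul.service unit name is an assumption, as it depends on how consul-coreos sets things up, and the same [Unit] lines go into the Mesos units too):

    #cloud-config

    # Fix 1: keep the hostname unqualified so Consul DNS queries
    # aren't suffixed with our domain
    hostname: core-01

    coreos:
      units:
        # Fix 2: don't start Marathon (or Mesos) until Consul is up
        - name: marathon.service
          command: start
          content: |
            [Unit]
            Requires=consul.service
            After=consul.service
            # (rest of the unit as before)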

Side note: ideally we wouldn’t use ZooKeeper at all and would instead leverage Consul for both service discovery and its key/value store.  We plan to implement this in a future release.

Other Problems
The other problem was somewhat mundane -- while I was doing most of this work, I was travelling around the UK rather than comfortably at home in Barcelona.  The Docker images were rather large because the Marathon and Mesos components are written in Java or Scala and take up a decent amount of space.  Needless to say, provisioning the Vagrant VMs on shared wifi with remotely hosted Docker images simply didn’t work all the time and wasted a lot of cycles.  I did look into setting up a local Docker registry, but didn’t have time to get it fully working.  Drats!

Last Steps!
Now that we had a working Marathon cluster, we needed to start porting our demo to use it. This involved changing our Go scheduler and Ruby API to use the Marathon REST API rather than the ECS API. We had also used DynamoDB for storing state in the ECS demo and chose to replace it with the Consul key/value store for the Marathon demo.

The local Vagrant cluster was useful for testing all of this, but with 3 VMs on my laptop, it was still pretty heavy.  So, for future development we plan to use mini-mesos from Container Solutions. This is a single Docker container that has all the Mesos components. We’ll run this, along with a Marathon container and our custom containers, within a Docker Compose environment for development.  

Conclusion
Hopefully the automation code will be useful to people wanting to run a Marathon cluster. We’d like to thank Packet not only for sponsoring us with some very shiny servers, but also for the extremely useful support they provided whilst getting this working.

Again, the code is at http://github.com/force12io/coreos-marathon if you want to take a look, or give us a shout over at @force12io.