How We Failed at OpenStack

Early last summer, Zac reached out to me about building a modern, from-the-ground-up bare metal cloud hosting platform. Having spent the vast majority of my working life building, supporting or using scalable infrastructure services, I was intrigued, but asked myself - was this really needed? Weren’t there plenty of good IaaS services around?

As the conversation developed, I eventually agreed that many of the public cloud services were not user friendly and had an overly high barrier to usage. Moreover, I was an early Docker adopter and knew that the coming wave of container-powered application deployments would make high-quality bare metal servers exponentially more useful in the DevOps toolbox. And yet, virtualization-specific public clouds and legacy dedicated hosting providers weren’t positioned well to meet the growing demand for flexible physical hardware. I decided there was work to be done - so my journey began as I hopped aboard the Packet Express!

Over in Install Land...

As I dove head-first into Packet, I spent some time taking another look into the current state of the art for deployment and cloud automation. I checked out bespoke installers, all the open source cloud platforms and what it’d take to architect our own suite of services from the ground up.

During my time at Voxel, a cloud hosting platform acquired by Internap, we had built much of our software stack and had experienced both the benefits and the consequences of owning our own software platform. It seems like installing servers should be pretty simple - you get it right once and off you go, right? Wrong! There are countless networking “gotchas”, ongoing hardware changes and a myriad of OS differences to tackle in providing a truly automated service layer for customers. Installing, managing and securing thousands of servers at scale and doing so within Zac’s mandate (“5 minutes or less, every time!”) was something I wasn’t taking lightly.

To help Packet meet its goal of operating thousands of installs 24/7 and be up and running in a few short months, I became interested in leveraging OpenStack’s innovations in infrastructure plumbing as building blocks for our services: network automation, IP management, hardware lifecycle, and (of course!) installations. If I could rely on these core components of OpenStack, then my team and I could focus more of our efforts on adding value to users with things like hardware profiling and container support.

I’d been warned of some of the pitfalls of OpenStack, but I had also spent weeks reading the latest commits, trawling a half-dozen official IRC channels, and running DevStack. I was getting pretty comfortable with the core OpenStack projects, and OpenStack as a whole had matured dramatically over the past 2 years. Also, the timing was pretty good: Rackspace had recently released OnMetal and blogged openly about how they were running Ironic for their bare metal cloud servers, and a big, important release, Juno, was nearly ready for primetime.

So I committed myself and the team to leveraging OpenStack for our bare metal server deployments.

The Story

I knew that OpenStack would be a steep learning curve and I needed to know the working guts of each project, not just install them. I dug into the OpenStack projects one by one, working to understand the exact state of Nova, the Ironic driver, and Neutron, in particular. Not only did we want to leverage Ironic for bare metal installs, we needed to support Packet’s host-level networking model, which specifically avoids Layer 2 networks and VLANs in favor of bringing Layer 3 networks directly to each host.
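To make the "Layer 3 to each host" idea concrete, here is a minimal sketch using Python's standard `ipaddress` module. It is purely illustrative: the /31 point-to-point sizing, the `10.0.0.0/24` block, and the `host_links` helper are assumptions made for the example, not a description of how Packet actually sized or assigned its per-host networks.

```python
import ipaddress


def host_links(block, count):
    """Carve `count` routed point-to-point /31 subnets out of `block`.

    Hypothetical helper: each server gets its own tiny Layer 3 network
    instead of sharing a VLAN with its neighbors.
    """
    network = ipaddress.ip_network(block)
    links = []
    # zip() stops after `count` subnets, so the generator is never exhausted.
    for _, subnet in zip(range(count), network.subnets(new_prefix=31)):
        # A /31 contains exactly two addresses: one per end of the link.
        gateway, host = list(subnet)
        links.append(
            {"subnet": str(subnet), "gateway": str(gateway), "host": str(host)}
        )
    return links


links = host_links("10.0.0.0/24", 3)
for link in links:
    print(link)
```

With a model like this there is no shared broadcast domain to configure per customer, which is exactly the part of Neutron's Layer 2 oriented tooling that the model has no use for.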

My first impression was, “man, there are a lot of docs to read and a lot to learn!” Over the course of a month, what became obvious was that a huge amount of the documentation I was consuming was either outdated or flatly inaccurate. This forced me to sift through an ever greater library of documents, wiki articles, IRC logs and commit messages to find the “source of truth”. After the basics, I needed significant Python debug time just to prove various conflicting assertions of feature capability, e.g. “should X work?”. It was slow going.

It’s worth noting that there is a large ecosystem of people and companies that have experience with OpenStack, particularly relating to Nova and standard Neutron implementations. However, there is hardly anyone with production-level experience working with Ironic. And even though the community is sizable compared to many other OSS projects out there, I regularly ran into situations where even some of the core developers couldn’t answer our implementation questions and Google searches for error messages yielded fewer than a dozen results.

Lesson #1: OpenStack is big, young and fast moving. Docs are pretty spotty once you get past the basics.

I got myself further into Ironic, leaving Neutron to one of my colleagues. The truth was (per Lesson #1), we needed a dedicated developer for each OpenStack component just to understand the code base, keep pace with the project and work out how to apply it appropriately to our needs. So I focused my attention and spent many long nights with the fantastic group from Rackspace’s OnMetal team via IRC, email threads and the OpenStack developer forums. I’m pretty sure I read every doc, forum post and debug output that a Google search turned up on Ironic!

Even though much work has been done to break out Nova’s baremetal driver into the first-class Ironic project, OpenStack remains extremely virtualization-centric in its design. There are still many features and documentation changes that are in flight between Nova “baremetal” and Ironic with Nova’s Ironic driver. I ran into this square on with Ironic’s limited networking support. With Ironic you are pigeonholed into the openvswitch and linuxbridge agents that come with the Modular Layer 2 (“ML2”) plugin. Our networking model conflicted heavily with this and, as I was to find out, Neutron was lacking both in terms of its vendor-specific switch support and its ability to extend into different network models.

Bigger players (most notably Rackspace) that have an even deeper understanding of the core OpenStack code have resorted to heavily customizing most of the individual OpenStack components to be able to deploy physical servers on real physical networks. Several of these patches have been made available to the public, but many of the important ones have not; they would need to be written from the ground up and then campaigned for inclusion in upcoming releases so that they stay maintained.

Lesson #2: OpenStack is all about VMs. If you’re not, good luck!

At this point, I was having serious concerns about leveraging OpenStack installation services for our product. The amount of resources it was taking to understand and keep pace with each project was daunting - and I was beginning to feel that the level of customization we’d have to do in the Nova and Ironic projects would be nontrivial and would counteract the OSS benefits we were looking for in the OpenStack project and its developer momentum.

However, I felt it was important to fully understand the details of Neutron, one of the last key projects on my personal bucket list.

In the world of physical switches and servers, installing servers is not incredibly hard. Doing so reliably, on the other hand, is hard. Automation needs a consistent set of tools to work with and, from my experience, the most error-prone part of most infrastructure deployment systems is network automation. You see, physical switch operating systems leave a lot to be desired in terms of supporting modern automation and API interaction (Juniper’s forthcoming 14.2 JUNOS updates offer some refreshing REST APIs!). In fact, my disappointing experience with other network automation tools was one of the primary reasons I chose to invest time with OpenStack -- and the Neutron project has a pretty awesome mission: “To implement services and associated libraries to provide on-demand, scalable, and technology-agnostic network abstraction.” Sign me up!

However, the reality doesn’t match the promise. With all the talk of Software Defined Networking, most of it has to do with virtual networks sitting on top of hypervisors and not with real switches. Not only was the Neutron driver for our switch vendor (Juniper) woefully out of date, the support was minimal even after we ported it to the latest OpenStack Juno release. Additionally, Neutron leverages an internal, rudimentary IP address manager (IPAM) and has no concept of accessing anything outside itself for assigning, reporting on or providing permissions on IP address assets. Bending our user experience to fit the limited capabilities of Neutron was unacceptable.
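The IPAM complaint is easier to see with a small sketch of what "accessing something outside itself" could look like: the deployment code depends only on an interface, so address assignment, reporting and permission checks can live in an external service. Everything named here (`IpamBackend`, `InMemoryIpam`, `provision`) is hypothetical and written for this illustration; it is neither Neutron's API nor the design of our own IP manager.

```python
import ipaddress
from typing import Dict, List, Protocol


class IpamBackend(Protocol):
    """Hypothetical contract an external IP address manager would satisfy."""

    def assign(self, owner: str) -> str: ...
    def report(self, owner: str) -> List[str]: ...
    def release(self, owner: str, address: str) -> None: ...


class InMemoryIpam:
    """Toy stand-in for a real external IPAM service (assumption)."""

    def __init__(self, block: str) -> None:
        self._free = [str(ip) for ip in ipaddress.ip_network(block).hosts()]
        self._owners: Dict[str, str] = {}

    def assign(self, owner: str) -> str:
        if not self._free:
            raise RuntimeError("address block exhausted")
        address = self._free.pop(0)
        self._owners[address] = owner
        return address

    def report(self, owner: str) -> List[str]:
        # Reporting: list every address currently held by this owner.
        return [a for a, o in self._owners.items() if o == owner]

    def release(self, owner: str, address: str) -> None:
        # Permissions: only the recorded owner may give an address back.
        if self._owners.get(address) != owner:
            raise PermissionError(f"{owner} does not own {address}")
        del self._owners[address]
        self._free.append(address)


def provision(ipam: IpamBackend, owner: str) -> str:
    """Deployment code sees only the interface, never the implementation."""
    return ipam.assign(owner)


ipam = InMemoryIpam("192.0.2.0/29")
address = provision(ipam, "alice")
print(address, ipam.report("alice"))
```

Swapping `InMemoryIpam` for a client that talks to a real IPAM service would not change `provision` at all, which is the kind of seam we could not find in Neutron at the time.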

Lesson #3: Neutron support is pretty fragmented. Check your switches first!

So, What’d We Do?

Long story short, we ditched OpenStack a week before Christmas and spent the next three weeks developing a custom deployment automation platform. After building our own IP manager in early December, the team was energized to build on top of bespoke tools. While every new software project creates its own legacy of responsibility, our vision as a company is 100% forward-looking and we felt that, in exploring and deploying OpenStack, we had filled most of the gaps: a flexible, service-provider grade IPAM (we call it Magnum IP), a user and permissions model that was powered by our SWITCH OSS and tighter integration between our facility management platform and our physical infrastructure.

Sometimes, what exists just isn’t good enough or fitted to your needs, and this clearly was the case with OpenStack and what we are doing at Packet. While we look forward to releasing our Neutron plugin to the community and staying current with the OpenStack project as it develops, for now we’re moving on.

As we finalize our installation setup for CoreOS this next week (after plowing through Ubuntu, Debian and CentOS) I’m excited about how our lean, fast, documented system will allow us to support advanced functionality and high availability without compromising user experience. Dare I say: Lessons Learned and Mission (nearly) Accomplished?
