Most of our infrastructure is heavily virtualised. Picking the right technology for the type of virtual hardware you need to run is critical.
In my last post I wrote about the base of our hosting stack, our infrastructure providers over at Rackspace UK. Now it's time to move up a level and look at what we put on that infrastructure, starting with our virtualisation platform.
When we started out we were platformed at a public cloud provider using Xen as their virtualisation platform. Even now, the vast majority of our servers are virtual, not physical dedicated hosts. When it comes to performance, stability, security and so on, I suspect (without actually doing the checking) you'll find the records of the major providers are pretty similar, regardless of whether they are open source or proprietary. You can certainly assume they have competent developers working on secure and stable products that have stood the test of time. And of course these products all have the standard tools you'd expect: good metrics, terminal access, a decent web interface, snapshotting, stored images, and so on.
So where do you look for differentiators? Or do you just spin the bottle? Well, it might help to look at our decision making. At our last product review with Rackspace we had three different virtualisation platforms on the table:

- OpenStack
- Microsoft Hyper-V
- VMware vSphere
We went with (or rather, we stayed with as it happens) VMware vSphere, but to understand how we arrived there, we need to knock down the other two options:
Let's start with OpenStack. It's a great open source platform, so it fits with our ethos as a business. Also, Rackspace have invested heavily in OpenStack so they're keen to see it succeed (although they're only one of many organisations who have poured a lot of time and resource into it). Lastly, it's a lot cheaper, because there are no licensing costs at all. But we decided we couldn't use OpenStack, primarily because we provide our clients with customised systems; to use an old analogy, we're looking after "pets versus cattle".
Make no mistake, we offer a platform as a service, but while some of our competitors sell a software product on top of virtual machines (or Linux containers sometimes) - an application - our product is the virtual machines. Other people (sometimes us) make the applications that live on those virtual machines. While our competitors have devised fault-tolerant software to keep their websites going, even if the underlying virtual machines go away (within reasonable limits), in our case we need the virtualisation layer itself to be fault-tolerant. The virtualisation layer is our platform, it is our product, we don't want anything to happen to that, even if the whole physical host machine catches fire. We need a highly available virtualisation platform, and OpenStack just isn't that. (Yet. It might be one day. Or the lead architects might just decide to leave the "pets" market alone entirely.)
OpenStack was designed to mirror the AWS experience of virtualisation. It's very service oriented, designed to run applications that need their servers to be like "cattle" - disposable units that are rapidly replaced if they fail, where the failure of no single one really affects the operation of the overall application. If that's your ethos, if your application or platform is fault-tolerant to that degree, OpenStack is great. Indeed, that's why many of our competitors in this space are on AWS. They've taken the route of building a fault-tolerant platform that treats servers like cattle. (Though I'd be concerned if I was purchasing a product from someone and they appeared to be applying the "pets" mentality at AWS; I've seen that in some surprising places.)
At Code Enigma we cater to organisations who have bespoke needs, we don't believe in "one size fits all" and all that it implies for customers. Consequently, we sell servers and then we configure and tailor each server (or set of servers, as the case may be) for our customer. Some people might call that old-fashioned, but that's how we do it and that's how our customers like it. Our servers are pets.
You don't just shoot a pet when it gets sick and buy another. Sure, we keep their DNA on file so we can clone them for their distraught owners, should the worst happen (that'd be Puppet), but fundamentally we don't want them to die; we care that they're healthy and happy.
So a virtualisation platform that just shrugs and says "well, make another!" if the host machine dies and takes the guest with it isn't going to work for us. Nor is one that says "Oh, yeah, sorry about that ... but we'll get you a new one tomorrow." (Which is pretty much what happened with our previous provider.) While we're waiting, our customer is going nuts, and rightly so. So sorry, OpenStack - no ability to seamlessly balance host load and/or automatically handle host failure, no dice.
(For more on OpenStack versus VMware, I read this excellent blog post while gathering my thoughts for this one.)
On to Microsoft. At first Hyper-V seemed like an OK option, so I floated it to the team. But collectively, the team was worried about Hyper-V.
Firstly, Microsoft. Everyone on our team has philosophical misgivings about Microsoft, given we're all steeped in the free open source software movement. Microsoft is busily trying to convince the world it's ready to play nice in that regard, but we're a tough crowd.
Secondly, um, Microsoft. The product is just going to have huge take-up, they are going to be running to keep up, and it's Microsoft: it's going to be the hackers' target of choice when they go hunting for exploits, simply because of the volumes there'll be to go at. Hacking is as much a numbers game as anything. Take a vulnerability, even one that broke 2 years ago, and start hunting for people who haven't patched it. Let's say, for argument's sake, 0.01% of sysadmins are so ignorant or dumb or demotivated they haven't patched that vulnerability yet, even though it's 2 years old. If the platform has 100,000 sysadmins, that's only 10 systems that might be hackable and might be worth hacking. If it has 10,000,000, that's 1,000 potentially vulnerable and interesting systems. Volumes. Go for Microsoft. Go for the volumes.
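The numbers game above is just back-of-the-envelope arithmetic; here it is as a tiny sketch, using the illustrative figures from the text (the 0.01% rate and the populations are made up for the argument, not real measurements):

```python
# Back-of-the-envelope: why attackers favour high-volume platforms.
# The rate and populations are the illustrative figures from the text.

def vulnerable_systems(population, unpatched_rate=0.0001):
    """Systems still exposed to a 2-year-old, long-since-patched bug."""
    return int(population * unpatched_rate)

# A niche platform with 100,000 admins vs a mass-market one with 10 million:
print(vulnerable_systems(100_000))     # 10 systems worth probing
print(vulnerable_systems(10_000_000))  # 1,000 systems worth probing
```

The absolute rate barely matters; the attacker's yield scales linearly with the size of the install base, which is the whole point.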
Thirdly, it's Microsoft. We know nothing about Windows Server and we have no desire to start learning now. We're a *nix business through and through. So we're not going to procure a system that requires us to throw all our collective knowledge in the bin and start over. This makes VMware ESXi a much better fit. (Note, contrary to popular belief, ESXi is categorically not Linux, but it's a lot more familiar than Windows, nevertheless.)
Fourthly, there are rumblings that Hyper-V live migrations might not be as live as VMware's vMotion. And we didn't want to find out the hard way that was true! Hyper-V migrations might be fast, but VMware migrations are so fast as to be totally invisible. (More on DRS shortly.)
And finally, the API didn't allow us to STONITH (at time of writing), which is an issue in some scenarios. You need STONITH ("Shoot The Other Node In The Head") in some highly available system designs to avoid data corruption: it allows one machine to power off another immediately, to prevent what's known as a "split brain" scenario, where two nodes both believe they own the same data.
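To illustrate why that matters, here's a sketch of the ordering a shared-storage failover has to follow: fence the peer first, take over second. Everything here is hypothetical, for illustration only - `power_off_via_oob` stands in for a real out-of-band call (IPMI, the hypervisor's management API, and so on), it's not a real library function:

```python
# Illustrative only: the fencing ("STONITH") step that must succeed
# before failover in a shared-storage HA pair. power_off_via_oob() is a
# hypothetical stand-in for a real out-of-band call (IPMI, hypervisor API).

class FencingError(Exception):
    pass

def power_off_via_oob(node):
    # Stub: pretend we reached the management interface and cut power.
    print(f"fencing {node}: power off via management API")
    return True

def take_over_shared_storage(standby, failed_peer):
    # Fence first: never touch shared data while the peer might still be
    # alive, or both nodes may write to it at once (a "split brain").
    if not power_off_via_oob(failed_peer):
        raise FencingError(f"could not fence {failed_peer}; refusing takeover")
    print(f"{standby} takes over shared storage safely")
    return True

take_over_shared_storage("web-b", "web-a")
```

The key design point is that a failed fence aborts the takeover entirely; uncertainty about the peer's state is treated as "still alive". Without an API that can do that power-off, you can't build this safely - hence our problem with Hyper-V at the time.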
So bye bye Hyper-V.
That leaves VMware vSphere standing.
And the big BIG sell for us is VMware's fault tolerance. It is just superb. Sure, you can't bank on it. That's why, for very important websites, we still implement multi-VM solutions with no single point of failure at the guest level, because you can never know when a service on a VM will fail on you - and it is when, not if.
But VMware have a thing called Distributed Resource Scheduler - DRS for short - which is basically magic. DRS continuously monitors your hypervisors and their load, and constantly moves your guests around between them. That's right: unless you turn DRS off (and why would you, when it works beautifully?) you never know where your guests are going to be from one moment to the next. What you do know is which guests will never share a host, because happily DRS lets us set anti-affinity rules to ensure, for example, that a customer's two load balancers never end up on the same hypervisor, no matter what the load patterns are doing. But for everything else, DRS is in charge.
If three or four guest VMs on the same hypervisor start getting really busy while another hypervisor's guests are all snoozing, DRS will push a couple of guests from the busy host to the quiet host. Seamlessly. No one will even know it happened. And if a host fails entirely, the cluster simply brings its guests back up on the least busy hypervisors (vMotion handles the planned, live moves; when a host genuinely dies, the guests are restarted elsewhere). You can literally shut down a host machine in a VMware cluster, with no warning, and all its guests will hop over to ones that are still up. And DRS will choreograph all this, so no single hypervisor takes a hammering. Like I said, magic.
Another big sell is clearly familiarity. We manage servers for a number of customers whose private clouds, which we have to interact with, are all VMware. We also work with an absolutely top-notch freelance Linux sysadmin with a Drupal twist called Miguel Jacq. We've worked with him so long he's on our team page (and you'll be hearing a lot more about him in coming posts); he's a key member of our team, and he's now very familiar with the VMware products and what can be achieved with them. So why change?
So there you have it, Code Enigma and VMware, sitting in a tree... For now it's a great fit for us. One day it might not be, but that's fine. We'll roll with that when the time comes. On to the next part, our preferred Linux distribution. More on that next time.
Photo by "jemimus" on Flickr.com, released under the Creative Commons Attribution 2.0 Generic license.