Cloud vs Metal Infrastructure

A Changing Industry

It’s important to understand that, while aspects of the cloud industry are reaching a certain level of maturity, the industry is still innovating at a breakneck pace. There is a lot of competition in the space, and every big player is looking to stake their claim, as evidenced by Google’s recent entry into the IaaS market.

With this constantly changing landscape in mind, the highest-level takeaway I’d like someone to have from this article is that “cloud vs metal” is about what you can take advantage of, and that the best choice for any given system can change over time.

The real subtext is disaster recovery and agility: the former a topic that often induces acute narcolepsy in software developers; the latter rarely considered in the context of infrastructure, being assumed to be the domain of development methodologies and business management.

Differentiators

A good way to help make an informed decision is to understand what tends to make one architecture different from the other.

Programmability

The concept of server infrastructure as something one can build and maintain programmatically is probably the most revolutionary part of cloud infrastructure. It is also one of the fundamental hurdles to overcome when reasoning about the cloud as part of your technical resources.

When you have the ability to programmatically define your hardware, you can version control it, test it, and deploy it as many times as you like. System administrators have been moving towards this with configuration management for a number of years, and cloud programmability helps reinforce that process.
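
As a concrete sketch of what programmable infrastructure looks like in practice (this uses Python’s boto library against the EC2 API; the AMI ID and key pair name are placeholders, and credentials are assumed to be configured in the environment):

    import boto.ec2

    # Credentials are read from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY.
    conn = boto.ec2.connect_to_region("us-east-1")

    # Launching a server is now just a function call, and this file can be
    # version controlled, tested and deployed like any other code.
    reservation = conn.run_instances(
        "ami-12345678",            # placeholder AMI ID
        instance_type="m1.small",
        key_name="my-keypair",     # placeholder key pair name
    )
    print(reservation.instances[0].id)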

Speed of Tear Up/Down

With the cloud, the ability to quickly build and destroy a collection of servers is (inevitably) complementary to their programmability, and arguably, just as revolutionary.

One of the more frequently espoused use cases is that of Netflix, smoothly dealing with the daily cycle of consumer video demand by ramping up or down servers as required. Netflix is not unique, but it is high volume, and has a predictable day/night media demand cycle across different geographical regions.

Netflix is a special case of variable, voluminous demand. The difference between a trough and a peak is hundreds, if not thousands, of servers. It is this quality that allows Netflix to take advantage of cloud services.

One of the more important, but perhaps less frequently highlighted, benefits of being able to quickly spin up and tear down machines is that it lowers the barrier to entry for experimenting with distributed architectures. Indeed, any research-style project can benefit greatly from this, with no capital requirements up front. It tackles the cost, and thus the fear, of failure. In a startup, this can make the initial gambles cheaper; in a large organization, it can help to tackle the Innovator’s Dilemma.
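
To make that concrete, here is a hedged sketch along the same lines as the snippet above (boto again, with a placeholder AMI ID) of spinning up a small cluster for a throwaway experiment and tearing it down afterwards, paying only for the hours used:

    import boto.ec2

    conn = boto.ec2.connect_to_region("us-east-1")

    # Spin up ten identical machines for a throwaway experiment.
    reservation = conn.run_instances(
        "ami-12345678",            # placeholder AMI ID
        min_count=10,
        max_count=10,
        instance_type="m1.small",
    )

    # ... run the experiment against reservation.instances ...

    # Tear the whole cluster down; billing stops within the hour.
    conn.terminate_instances([i.id for i in reservation.instances])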

Management Architecture

A sometimes overlooked aspect of cloud-based services, including IaaS and PaaS providers, is the management architecture they provide you. An interface which gives you high-level and immediate control over infrastructure is a huge psychological step forward.

Interfaces like the AWS Management Console, Engine Yard’s dashboard, or Netflix’s Asgard also subtly alter the way we reason about infrastructure. They can make the resources more tangible.

Cost Per CPU Cycle or Byte Stored

When you want a consistent chunk of CPU, storage space or RAM available, dedicated hardware is a lot cheaper (byte for byte, cycle for cycle) than the cloud equivalent.

According to Amazon’s FAQ:

“One EC2 Compute Unit [ECU] provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.”

Right now, the pricing for a basic Amazon “m1.small” instance, which has one ECU, is $0.08 per hour. It has 1.7GB of RAM available to it and 160GB of ephemeral storage [1]. At $0.08 × 720 hours, this translates to about $57.60 per month of continuous use (assuming reserved instances are not used). It can be cancelled with an effective hour’s notice, since charging is per hour.

By contrast, the popular European host Hetzner are currently offering an AMD Athlon 64 X2 5600+ (2.9GHz) with 4GB of RAM and 800GB of storage for €49 per month (about $61.60). The minimum cancellation period for this offer is 30 days.

Interestingly, if one makes use of the middle-of-the-road AWS Reserved Instances facility (1 Year Medium Utilization Reserved Instances) for the m1.small, the monthly cost drops to $30.61. $17.28 of this is the reduced hourly rate of $0.024, and the remaining $13.33 reflects the “up front” $160 charge amortized over 12 months.
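
The arithmetic above is simple enough to sanity-check in a few lines of Python, using the prices as quoted at the time of writing:

    HOURS_PER_MONTH = 720  # 30 days, matching the per-hour billing above

    # On-demand m1.small
    on_demand = 0.08 * HOURS_PER_MONTH              # $57.60

    # 1 Year Medium Utilization Reserved Instance
    usage = 0.024 * HOURS_PER_MONTH                 # $17.28
    upfront = 160 / 12.0                            # ~$13.33 amortized
    reserved = usage + upfront                      # ~$30.61

    print("on-demand: $%.2f, reserved: $%.2f" % (on_demand, reserved))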

A web application generally expects a consistent level of traffic, or at minimum some traffic. That means at least one machine must always be available to take requests, so the ability to dynamically spin machines up and down is not particularly cost effective at lower traffic levels.

Taken by themselves, these price points are something of an indictment of the cloud’s ability to compete on cost. Real cost, of course, is more than the dollars you pay for CPU cycles, and there are a host of complementary advantages that come with that higher price tag.

Moore’s law

Some observers have noted that cloud providers slowly erode a benefit that infrastructure used to accrue automatically: the regular reduction in hardware costs described by Moore’s law.

Put simply, cloud providers do not reduce their prices in step with the falling cost of computing that advances in hardware design and development bring.

At the crudest interpretation, this would mean Amazon are making twice as much money every 18 months for providing the same resources to customers. This is not an accurate reflection, however, because the cost of providing infrastructure does not progress linearly: meta-services need to be added at various points, along with sysops, logistics, administration, and so on. Even leaving that aside, it’s arguable that Amazon are investing a lot of money into additional services that add value to their core offering, thus increasing utility to customers and passing on the benefits of Moore’s law indirectly.

Vendor Lock-in

When you write your infrastructure against Amazon’s API and services, you are wedded to them with a certain amount of technical debt.

This debt can manifest itself in various ways, and it is up to you to decide how much of it you can afford. Ideally, of course, you want to avoid it altogether; various projects (fog.io being one of the more notable) aim to build a provider-independent interface to IaaS providers.
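
fog.io is a Ruby library; Apache Libcloud plays a similar role in Python. A minimal sketch of the idea, with placeholder credentials, might look like this; note that only the driver line (and the credentials) changes per provider:

    from libcloud.compute.types import Provider
    from libcloud.compute.providers import get_driver

    # Swapping Provider.EC2 for another constant targets a different
    # IaaS provider; the rest of the code stays the same.
    cls = get_driver(Provider.EC2)
    driver = cls("access-key-id", "secret-key")  # placeholder credentials

    for node in driver.list_nodes():
        print(node.name, node.state)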

You may wish to switch providers to save on costs as increased IaaS competition results in price differentiation, or because your requirements change and moving all or part of your architecture between cloud and metal would be beneficial.

There is no easy answer here. For many companies, the answer is to assume the debt and deal with problems if and when they arise. For others, building parts of their stack twice (or more) is accepted as a cost of high availability.

Vendor lock-in is not an issue unique to cloud computing. Previously, lock-in took the form of the practical difficulty of moving your physical machines, or the software on them, from one place to another. With the cloud, a proprietary API just makes it more obvious.

Illusion of Redundancy

There is a persistent and dangerous myth that cloud == redundancy. Cloud services can certainly lower the barrier to building a redundant architecture, but on their own these services are rarely inherently redundant.

In fact, as evidenced by various outages, cascading failures are a potential emergent behaviour of these large IaaS architectures.

EBS mounts are as susceptible to failure as regular hard drives, though recovery options are sometimes better.

“Availability Zones” are data centres and, like any data centre, are susceptible to power, ventilation and weather problems.

Entire groups of data centres (regions) are vulnerable to anything large enough to affect a single data centre, simply due to network effects. If an entire AZ goes offline, the easiest solution for customers is to try to bring their systems up in other, inevitably geographically close, AZs: the APIs are the same, the response times will be similar, and certain services are easier to migrate between AZs than between regions. As a result, these zones may receive unprecedented and unsustainable demands on their resources.
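
The failover behaviour described above is easy to picture in code. A hypothetical boto sketch (placeholder AMI ID) shows why the load concentrates on the sibling zones:

    import boto.ec2
    from boto.exception import EC2ResponseError

    conn = boto.ec2.connect_to_region("us-east-1")

    # When one zone fails, the path of least resistance is to retry the
    # same launch call in the remaining zones of the same region, which
    # is exactly why those zones get swamped.
    for zone in conn.get_all_zones():
        try:
            conn.run_instances("ami-12345678",  # placeholder AMI ID
                               instance_type="m1.small",
                               placement=zone.name)
            break
        except EC2ResponseError:
            continue  # zone unavailable or at capacity; try the next one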

Visibility & Performance

Why did X machine go offline? Why is this task suddenly taking the whole machine offline?

As you make more demands on your infrastructure, your need to understand its behaviour will increase in tandem.

Once your infrastructure is programmable, you may find a machine disappearing without a trace within 24 hours. Perhaps an errant termination request was dispatched to the instance by your own software, or perhaps a bug in the IaaS provider’s software caused it.

If you’re using network-mounted disks like EBS [2], you could find an instance inaccessible, or inexplicably (or at least unpredictably) corrupted.

This is not a situation likely to occur with colocated hardware. At the very least, you will probably have a physical machine to examine (if it hasn’t been stolen or destroyed in a fire).

Your computing resource may also be subject to the dreaded “steal”, i.e., another virtual machine on the same hardware taking CPU cycles that your VM would otherwise be using. This can result in bizarre and difficult-to-inspect behaviour.
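
Steal is at least visible from inside the guest: on Linux, the eighth value on the “cpu” line of /proc/stat is the time stolen by the hypervisor. A quick sketch of reading it:

    # Print the fraction of CPU time stolen by the hypervisor so far.
    with open("/proc/stat") as f:
        fields = f.readline().split()  # aggregate "cpu" line

    # Fields: user nice system idle iowait irq softirq steal ...
    values = [float(v) for v in fields[1:9]]
    steal = values[7]
    print("steal: %.2f%%" % (100 * steal / sum(values)))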

Databases and other I/O-heavy applications hit their performance ceiling earlier than they would on physical machines with better disks and equivalent RAM/CPU, and costs can become prohibitive past a certain point. One effect of this is that the data layer of a stack may need to become “distributed” earlier than before.

Some businesses and development teams may be able to take advantage of this to improve their architecture early, but it may also be a critically limiting factor for a business short on time and people.

Knowledge & Expertise

With managing hardware comes the responsibility of understanding hardware. Beyond that, understanding how an operating system will interact with a specific set of hardware, how applications will perform, and what happens when a piece of that hardware dies is vital when you own your hardware.

Heck, depending on how involved your data centre operations team is, you may need to have the physical strength to lift and rack a 4U server packed with hard drives.

On the other hand, cloud providers do not yet support running completely arbitrary operating systems on their infrastructure. This is a function of the virtualization layer, and it results in some annoying quirks: kernel alterations are laborious, and some established operating systems are not supported. If you have significant tooling built around one of these weak points, you may find diversifying into the cloud a difficult process.

Just like software developers, your increasingly talented ops team is going to be poached constantly by Amazon, Google, Facebook, and any of the other major players looking to get a leg up on the opposition. Except these guys are going to be even harder to replace.

Conclusion & tl;dr

There are some topics I haven’t touched on, like VPS and shared hosting, and others I could go into more depth on, but this is a broad overview and is necessarily limited by attention spans (not least my own). When deciding how to develop your infrastructure, your choices are not clear cut, but you also don’t have to go all in.

Experiment where you can afford to, learn, explore. Make informed decisions.

A brief synopsis of the points above:

Cloud

Pro

  • Programmable
  • Fast to deploy / destroy
  • Management software

Con

  • More expensive per cycle/byte than metal
  • Vendor lock-in
  • Illusion of built-in redundancy
  • Poor visibility

Metal

Pro

  • Much more performant individual units
  • Costs potentially more measurable

Con

  • High upfront cost
  • Slow to deploy / destroy
  • Requires hardware knowledge/debugging, OS tuning

  1. So-called “ephemeral” storage refers to the physical hard drive on the host machine where the Amazon EC2 instance is currently running. Because the physical machine on which an instance resides can change between reboots, any data stored here may “disappear” from the machine; hence the name. 

  2. Elastic Block Store, described by Amazon as “off-instance storage that persists independently from the life of an instance”, is comparable to iSCSI devices. Its major benefits are persistent storage, variable sizes, and ease of snapshotting. Downsides are that performance is network-bound and that a volume must be treated as a physical drive, i.e., it cannot be mounted on multiple instances simultaneously. 
