Using gems to package polyglot CLI tools

Just in time for Halloween! A true horror, but with a sort of tortured beauty. How (and why, WHYYYY?!) you can annex one of the ruby community’s most pervasive technologies to distribute your filthy, heathen, non-ruby code.

I’m not going to lie to you – this is not for the faint of heart. There’s not a lot of ruby involved, but you will need to edit some. If you’re a veteran ruby developer and packager – I am so sorry.


Rubygems, for the uninitiated, are how the ruby community does ad-hoc package management. Despite the potential security nightmare involved in allowing anyone at all to add to the primary rubygems repository, it’s pretty much the only way packages get distributed for ruby.

Whatever qualms one may have about that particular aspect of the gem architecture, from a developer and consumer’s perspective (if not a sysadmin’s), it’s a fantastic system. It’s also very straightforward to actually build gems themselves.

If you want to get some software in front of a bunch of developers, rubygems are a delicious low hanging fruit. The only problem, really, is that it’s pretty much expected that you’re writing the bulk of your stuff in ruby. Maybe you’ll be patching in some native extensions, etc, but otherwise why would you be using gems?

Well, for starters, rubygems comes preinstalled on every OSX machine for the last few years, putting it one step ahead of the likes of brew, macports, etc in terms of penetration. It also does not have the centralized (if benevolent) authority of these resources (a mixed blessing).

Another reason is that there are other languages you can expect to be on a Unix system if Ruby is on it – some sort of shell for starters, but probably a smattering of others depending on your audience.

You’re a savvy, globetrotting developer; you know there’s more to programming than picking a language and running with it like someone who has just recently discovered scissors.

In any case, here’s what you need to get your abomination into the world.

Before we continue

You’re gonna want to have ruby and rubygems installed. This is sort of a pre-req. There are many truly horrifying ways to get ruby on to your system; you’re a coder (if you’re not, bail here), you’ll figure it out.

You will need to have your code, at least, all sitting in a directory on your hard drive.

Dependencies, etc, will be a bit of a show stopper. If you’re going to be dropping executables on machines, you’re probably going to expect whatever runs them to already be there. For the purposes of shlint, which uses POSIX shell and Perl, we can expect most Unix based systems to either have these preinstalled, or for them to land on a system before you’ll want something like shlint.

Yes, haphazard.

First step – the gem specification

So the major thing that makes a gem a gem is the gemspec. The gemspec is a file, written in ruby, which describes your code. It’s a really straightforward bit of code, and you can read the gemspec for shlint here:

# -*- encoding: utf-8 -*-
lib = File.expand_path('../lib/', __FILE__)
$:.unshift lib unless $:.include?(lib)

Gem::Specification.new do |s|
  s.name        = "shlint"
  s.version     = "0.1.1"
  s.platform    = Gem::Platform::RUBY
  s.authors     = ["Ross Duggan"]
  s.email       = ["[email protected]"]
  s.homepage    = ""
  s.summary     = "A linting tool for shell."
  s.description = "Checks the syntax of your shellscript against known and available shells."
  s.required_rubygems_version = ">= 1.3.6"
  s.files        = Dir.glob("{bin,lib}/**/*") + %w(LICENSE README.md)
  s.executables  = ['shlint', 'checkbashisms']
  s.require_path = 'lib'
end

Ok, so the biggest chunk of this is pretty straightforward. The lines at the top add the gem’s lib directory to ruby’s load path; honestly, they’re there because they were in the example I used as a reference. Ruby voodoo?

The interesting things are the s.files, s.executables and s.require_path directives.

We’re going to use this sort of directory structure (again, see shlint for a working example):

shlint/
├── bin/
│   ├── shlint
│   └── checkbashisms
├── lib/
│   ├── shlint
│   └── checkbashisms
├── LICENSE
├── README.md
└── shlint.gemspec

For the sake of simplicity, you’re going to throw the executables you wish to run (shell, perl, etc) into a directory named lib. You are going to do something horrible in the bin directory (nobody will forgive you).

LICENSE and README.md are going to be in the root of your gem, naturally.

Second step – the executable

Ok, so you’ve got your alien code in lib, but what are we putting in bin? Well, unfortunately when you install a gem, it gets interpreted as ruby, meaning your unruby will cause it to throw a total fit.

(Un)fortunately, there’s a solution! For each tool you want available on the command line, you’re going to want something that looks like this:

#!/usr/bin/env ruby
spec = Gem::Specification.find_by_name("shlint")
gem_root = spec.gem_dir
gem_lib = gem_root + "/lib"

shell_output = ""
IO.popen("#{gem_lib}/shlint #{ARGV.join(" ")}", 'r+') do |pipe|
  shell_output = pipe.read
end
puts shell_output

Haha, it is so evil, but it works!

The gist of what’s happening:

spec = Gem::Specification.find_by_name("shlint")
gem_root = spec.gem_dir
gem_lib = gem_root + "/lib"

Here, we’re interrogating the gem system to find out where the hell we’re executing from, using that to direct to what we have sitting in the lib directory of our gem.


shell_output = ""
IO.popen("#{gem_lib}/shlint #{ARGV.join(" ")}", 'r+') do |pipe|
  shell_output = pipe.read
end
puts shell_output

We’re opening a pipe and basically shunting all arguments into our tool to deal with. You can get clever here if you like, but I felt just passing it all through was preferable. The result gets printed to screen.

Important side note here: if you’re executing the tool directly like I am here you’ll need the shebang set correctly. Otherwise, you’ll want to invoke the code prefixed by whatever executable you hope is going to be running it (like perl #{gem_lib}/ #{ARGV.join(" ")}, etc.)

Third step – packaging

Once all the bits and pieces are in place, you can try packaging your code with:

gem build mypackage.gemspec

If you’re lucky, you’ll have gotten everything right first time and you’ll now have a .gem file sitting in the directory you ran the command in. If not, the error output is pretty good, you’ll muddle through.

Next thing you’ll want to do is install your freshly forged gem with gem install mypackage-0.1.gem and see if it actually works. I went through several iterations to get the path stuff worked out, but you should find it easier.
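If you hit problems, the whole iteration loop is short enough to repeat freely – something like this (the package name and arguments are illustrative):

```
gem build mypackage.gemspec          # produces mypackage-0.1.gem
gem install ./mypackage-0.1.gem      # install locally for testing
mypackage some-args                  # exercise the wrapper end to end
gem uninstall mypackage              # clean up before the next attempt
```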

Once you’re satisfied the gem is working, it’s time to magic it out to the world.

Fourth step – rubygems

So to get your gem out to the rest of the world, you’ll need to sign up for an account at rubygems.org. This process is simple, and once you’ve got an account, you can then proceed to run this command:

gem push mypackage-0.1.gem

The first time this is run, it’ll ask for your account details, fill these in and boom, your gem will be published to the world!

Now try booting a VM or something to see if you can install it from anywhere by just running gem install mypackage.

Final thoughts

This is a pretty horrible hack, but it has its benefits, frankly the largest of which is the sheer glee of building such a Frankenstein. Maybe though, if you’ve got a system where you’re pretty confident that it’s both got ruby on it and a particular subset of other languages, it’s a fun way to subvert the intended usage.

Posted in Code | Comments closed

CNAMEs – how do they work?

CNAMEs are a neat trick of the domain name system, but are often unused or misunderstood.

Here’s a quick(ish) explanation of what CNAMEs are, how they work, and why you should use them.

CNAMEs are basically an alias of an existing domain. It’s easy to imagine that CNAMEs behave like internal redirects, but they’re much neater.

Aside: CNAME is shorthand for “canonical name”, and refers to the *relationship* that a subdomain – www.example.com, say – has with the primary or “canonical” domain, rather than the entry itself. So for a CNAME record like www.example.com. 6 IN CNAME example.com., the “canonical name” is actually the part to the right of “CNAME” – example.com.

If a domain is set up like so:

ross-air:~ ross$ dig +nostats +nocomments +nocmd www.example.com
;www.example.com.		IN	A
www.example.com.	6	IN	CNAME	example.com.
example.com.		513	IN	A	192.0.2.1

Then when a browser visits www.example.com, the DNS resolver (usually part of your OS stack) actually sends a standard request for an A record for www.example.com to the nameserver. Finding no A record under that name, the nameserver checks whether the record exists as a CNAME entry. When it finds it, the nameserver restarts the query using the canonical name example.com, which resolves to the IP 192.0.2.1.

The browser sets the Host header of the request to www.example.com (so that the receiving server knows what was originally requested) and continues with the request to the server.

When the server receives this message, it looks up the host it’s been passed and serves up the corresponding configured website.

It’s not just web browsers like Chrome, Firefox, etc that perform this task; tools like curl, wget, etc will perform the same lookup as part of the request.

CNAMEs are particularly useful when a lot of websites are being hosted at a single IP address. If all the domains hosted at that IP point to a single “gateway” domain, then only that gateway domain needs to point to an IP address, making Domain->IP address portability less of a headache.
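That gateway pattern looks something like this in a BIND-style zone file (example.com and the 192.0.2.x address are reserved documentation values, used here purely for illustration):

```
gateway.example.com.   3600  IN  A      192.0.2.10
www.example.com.       3600  IN  CNAME  gateway.example.com.
blog.example.com.      3600  IN  CNAME  gateway.example.com.
shop.example.com.      3600  IN  CNAME  gateway.example.com.
```

If the gateway’s IP changes, only the single A record needs to be updated.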

If you’re using a web host or platform provider that deals with some of the hassle of server and domain administration for you, this is probably one of the tricks they use to make it possible to handle lots of users without breaking everything when an IP address somewhere needs to change.


Posted in Infrastructure | Comments closed

The Noflake Manifesto

The Snowflake Server is a system which is difficult to reproduce due to the unique and undocumented methods by which it came to its current state. Developers and systems administrators grow to fear altering these servers in case they cause damage.

The Noflake Manifesto proposes that we have reached a point where there is no longer any excuse for such servers to exist.

All servers should:

  • Have their hardware specifications known and documented.
  • Have their build process automated.
  • Have their configuration managed and version controlled.
  • Be regularly tested for consistency with documented expectations.

Google Docs and Github wikis, amongst others, allow for collaborative, secure, version controlled document editing.

Build systems do not have to be complicated. Something as simple as a collection of shellscripts, when version controlled and documented, is the seed of a replicable system.
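To make that concrete, here is a minimal sketch of such a script – every name in it (paths, filenames, settings) is illustrative, not prescriptive:

```shell
#!/bin/sh
# build.sh — a minimal, version-controllable build step.
# Idempotent: safe to run on a fresh or an existing machine.
set -eu

# Install prefix; overridable for testing.
PREFIX="${PREFIX:-$PWD/myapp-root}"

mkdir -p "$PREFIX/etc"

# Render configuration from a template that lives in version control,
# so the machine's state is documented by the repository itself.
cat > "$PREFIX/etc/app.conf" <<EOF
listen_port=8080
log_level=info
EOF

echo "configured $PREFIX"
```

Checked into a repository alongside a short README, even this documents how a machine came to be in its current state.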

Open source configuration management and deployment software, including Chef, Puppet and BOSH are well tested, documented and deployed across thousands of production environments.

Companies making use of cloud computing services such as Amazon EC2 can afford to regularly and automatically build new machines to replace existing ones, ensuring no uncontrolled state becomes a dependency.

Don’t build snowflakes.

Posted in Infrastructure | Comments closed

Cloud vs Metal Infrastructure

A Changing Industry

It’s important to understand that, while aspects of the cloud industry are reaching a certain level of maturity, it is still an industry moving at a breakneck pace of innovation. There is a lot of competition in the space, and every big player is looking to stake their claim, as evidenced by Google’s recent entry into the IaaS market.

With this constantly changing system in mind, the most high level takeaway I’d like someone to have from this article is that “cloud vs metal” is about what you can take advantage of, and that the best choice for any given system can change over time.

The real subtext is about disaster recovery and agility – the former a topic which often induces acute narcolepsy in software developers, the latter often not considered in context of infrastructure, assumed to be the domain of development methodologies and business management.


A good way to help make an informed decision is to understand what tends to make one architecture different from the other.


The concept of server infrastructure being something one can build and maintain programmatically is probably the most revolutionary part of cloud infrastructure. This also makes it one of the fundamental hurdles to overcome when reasoning about it as part of your technical resources.

When you have the ability to programmatically define your hardware, you can version control it, test it and deploy it as many times as you like. System administrators have been moving towards this with configuration management for a number of years, and cloud programmability helps reinforce that process.

Speed of Tear Up/Down

With the cloud, the ability to quickly build and destroy a collection of servers is (inevitably) complementary to their programmability, and arguably, just as revolutionary.

One of the more frequently espoused use cases is that of Netflix, smoothly dealing with the daily cycle of consumer video demand by ramping up or down servers as required. Netflix is not unique, but it is high volume, and has a predictable day/night media demand cycle across different geographical regions.

Netflix is a special case of variable, voluminous demand. The difference between a trough and a peak is hundreds, if not thousands of servers. It is this quality which allows Netflix to take advantage of cloud services.

One of the more important, but perhaps less frequently highlighted, benefits of being able to quickly spin up and tear down machines is that it lowers the barrier to entry for experimenting with distributed architectures. Indeed any “research” style project can benefit greatly from this, with no capital requirements up front. It tackles the cost, and thus the fear, of failure. In a startup, this can make the initial gambles cheaper, in a large organization it can help to tackle the Innovator’s Dilemma.

Management Architecture

A sometimes overlooked aspect of cloud based services, including IaaS and PaaS providers, is the management architecture they provide you. An interface which gives you high level and immediate control over infrastructure is a huge psychological step forward.

Interfaces like the AWS Management Console, Engine Yard’s dashboard, or Netflix’s Asgard also subtly alter the way we reason about infrastructure. They can make the resources more tangible.

Cost Per CPU Cycle or Byte Stored

When you want a consistent chunk of CPU, storage space or RAM available, dedicated hardware is a lot cheaper (byte for byte, cycle for cycle) than the cloud equivalent.

According to Amazon’s FAQ:

“One EC2 Compute Unit [ECU] provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.”

Right now, the pricing for a basic Amazon “m1.small” instance, which has one ECU, is $0.08 per hour. It has 1.7GB of RAM available to it and 160GB of ephemeral storage [1]. This translates to about $57.60 per month of continuous use (assuming reserved instances are not used). It can be cancelled within an effective hour’s notice (charging is per hour).

To contrast, popular European host Hetzner are currently offering an AMD Athlon 64 X2 5600+ (2.9GHz) with 4GB of RAM and 800GB of storage for €49 per month (about $61.60). The minimum cancellation period for this offer is 30 days.

Interestingly, if one makes use of the middle of the road AWS Reserved Instances facility (1 Year Medium Utilization Reserved Instances) for the m1.small, the monthly cost drops to $30.61. $17.28 of this is the reduced hourly rate of $0.024, and the remaining $13.33 reflects 1/12th of the “up front” $160 charge amortized over 12 months.
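The arithmetic above is easy to sanity-check. A quick sketch in ruby, using the figures as quoted and assuming a 30-day (720-hour) month:

```ruby
# Back-of-the-envelope check of the m1.small figures quoted above.
HOURS_PER_MONTH = 30 * 24  # 720

# On-demand: $0.08/hour, continuously.
on_demand_monthly = 0.08 * HOURS_PER_MONTH                # $57.60

# 1 Year Medium Utilization Reserved Instance:
# reduced hourly rate plus 1/12th of the $160 up-front charge.
reserved_hourly_part  = 0.024 * HOURS_PER_MONTH           # $17.28
reserved_upfront_part = 160.0 / 12                        # ~$13.33
reserved_monthly = reserved_hourly_part + reserved_upfront_part  # ~$30.61

puts format("on-demand: $%.2f/month", on_demand_monthly)
puts format("reserved:  $%.2f/month", reserved_monthly)
```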

For most web applications, at least some traffic is expected at all times. This means at least one machine must always be available to take requests, so the ability to dynamically spin machines up and down is not particularly cost effective at lower traffic levels.

Taken by themselves, these price points are something of an indictment of the cloud’s ability to compete on cost. Real cost, of course, is more than the dollars you pay for CPU cycles, and there are a host of complementary advantages that come with that higher price tag.

Moore’s law

Some observers have noted a perceived slow erosion, by cloud providers, of the benefit infrastructure buyers used to accrue from the regular reduction in hardware costs described by Moore’s law.

Simply, cloud providers do not reduce their prices in accordance with the reduction in cost of computing due to advances in hardware design and development.

At the crudest interpretation, this would mean Amazon are making twice as much money every 18 months for providing the same resources to customers. This is not an accurate reflection however, because the cost of providing infrastructure does not progress linearly – meta services need to be added at various points, sysops, logistics, administration, etc. Even leaving that aside, it’s also arguable that Amazon are investing a lot of money into additional services to add value to their core offering, thus increasing utility to customers and passing on the benefits of Moore’s law indirectly.

Vendor Lock-in

When you write your infrastructure against Amazon’s API and services, you are wedded to them with a certain amount of technical debt.

This debt can manifest itself in various ways and it is up to you to decide how much of it you can afford. Ideally, of course, you want to avoid it altogether – there are various projects which aim to build a provider-independent interface to IaaS providers.

You may wish to switch your provider to save on costs as increased IaaS competition results in price differentiation, or as your requirements change and moving all or part of your architecture between cloud and metal would be beneficial.

There is no easy answer to this question. For many companies, the answer is to assume the debt and deal with problems if and when they arise. For others, building parts of their stack twice (or more) is accepted as a cost of high availability.

Vendor lock-in is not an issue unique to cloud computing. Previously, vendor lock-in was the practical difficulty associated with moving your physical machines from one place to another, or the software on them. With the cloud, a proprietary API is just more obvious.

Illusion of Redundancy

There is a persistent and dangerous myth that cloud == redundancy. Cloud services can certainly lower the barrier to building a redundant architecture, but on their own these services are often not inherently redundant.

In fact, as evidenced by various outages, cascading failures are a potential emergent behaviour of these large IaaS architectures.

EBS mounts are as susceptible to failure as regular hard drives, though recovery options are sometimes better.

“Availability Zones” are data centres, and like any data centre, are susceptible to power, ventilation and weather problems.

Entire groups of data centres (regions) are vulnerable to anything large enough to affect a single data center, simply due to network effects. If an entire AZ goes offline, the easiest solution for customers is to try and bring their systems up in other, inevitably geographically close, AZs. The APIs are the same, the response times will be similar, and certain services are easier to migrate between AZs than Regions. As a result, these zones may receive unprecedented and unsustainable demands on their resources.

Visibility & Performance

Why did X machine go offline? Why is this task suddenly taking the whole machine offline?

As you make more demands on your infrastructure, your need to understand its behaviour will increase in tandem.

Once your infrastructure is programmable, you may find a situation occurring where a machine disappears without a trace within 24 hours. Perhaps an errant termination request was dispatched to the instance by your own software, or perhaps a bug in the IaaS provider’s software caused it.

If you’re using network-mounted disks like EBS [2], you could find an instance inaccessible, or inexplicably (or at least unpredictably) corrupted.

This is not a situation likely to occur with colocated hardware. At the very least, you will probably have a physical machine to examine (if it hasn’t been stolen or destroyed in a fire).

Your computing resource may also be subject to the dreaded “steal” – ie, another virtual machine on the same hardware taking CPU that your VM would otherwise be using. This can result in bizarre and difficult to inspect behaviour.

Databases and other I/O heavy applications reach peak performance earlier than they might on physical machines with better disks and equivalent RAM/CPU. Costs can become prohibitive after a certain point. One effect of this is that the data layer of a stack may need to become “distributed” earlier than before.

Some businesses and development teams may be able to take advantage of this to improve their architecture early, but it may also be a critically limiting factor for a business short on time and people.

Knowledge & Expertise

With managing hardware comes the responsibility of understanding hardware. Beyond that, understanding how an operating system will interact with a specific set of hardware, how applications will perform, and what happens when a piece of that hardware dies, are all vital when you own your hardware.

Heck, depending on how involved your data centre operations team is, you may need to have the physical strength to lift and rack a 4U server packed with hard drives.

On the other hand, cloud providers do not yet provide completely arbitrary operating systems to run on their infrastructure. This is a function of the virtualization layer, and results in some annoying quirks like kernel alterations being laborious and established operating systems not being supported. If you have significant tooling built around one of these weak points, you may find diversifying into the cloud a difficult process.

Just like software developers, your increasingly talented ops team is going to be poached constantly by Amazon, Google, Facebook, and any of the other major players looking to get a leg up on the opposition. Except these guys are going to be even harder to replace.

Conclusion & tl;dr

There are some topics I haven’t touched on, like VPS and shared hosting, or items I could go into more depth on, but this is a broad overview and is necessarily limited by attention spans (not least of which is my own). When deciding on how to develop your infrastructure, your choices are not clear cut, but you also don’t have to go all in.

Experiment where you can afford to, learn, explore. Make informed decisions.

A brief synopsis of the points above:

Cloud

Pros:

  • Programmable
  • Fast to deploy / destroy
  • Management software

Cons:

  • More expensive per cycle/byte than metal
  • Vendor lock-in
  • Illusion of built-in redundancy
  • Poor visibility

Metal

Pros:

  • Much more performant individual units
  • Costs potentially more measurable

Cons:

  • High upfront cost
  • Slow to deploy / destroy
  • Requires hardware knowledge/debugging, OS tuning

  1. So-called “ephemeral” storage refers to the physical hard drive on the host machine where the Amazon EC2 instance is currently running. Because the physical machine on which an instance resides can change between reboots, any data stored here may “disappear” from the machine; hence the name. 

  2. Elastic Block Store, described by Amazon as “off-instance storage that persists independently from the life of an instance”, is comparable to iSCSI devices. Its major benefits are persistent storage, variable sizes and ease of snapshotting. Downsides are that performance is network-bound and that a volume must be treated as a physical drive – ie, it cannot be mounted to multiple instances. 

Posted in Infrastructure | Comments closed

Starting with Arduino (Musical Floppy Drives!)

First, a confession.

I have forgotten almost everything that I used to know about electronics, and I didn’t know that much to begin with either. I’ve spent the last week or so getting a tentative reintroduction to the fundamentals. Frankly, a current running across a wire is still a novel and exciting thing for me. I used to build my own computers when I was a teenager, but I’ve lost interest in recent years in favour of computers that just work.

With that disclaimer out of the way, the idea of mucking around with Arduino has been stewing in the back of my head for a while. Arduino is an open source, hobbyist computer hardware hacking platform, and can be used for anything from automated plant feeding to powering remote control aircraft. A couple of weeks ago, I was passing by a local electronics store (Maplin), and, on a whim, decided to see if they had any Arduino kits in stock.

I had no real idea what I wanted to do, but I was pretty sure I could figure that out along the way.

Arduino Uno

The first thing I discovered is that there are several types of Arduino, with the variable factors seeming to be the physical size of the board (weight/size is important in embedded systems), the power of the microcontroller (probably not an issue when you’re starting out), and the number of available connector pins (the complexity of device you’re building).

Make: Arduino Bots and Gadgets

Available in the store I visited were the Uno and the Mega. There are quite a few other types, including the awesome-looking LilyPad (can be sewn into clothing), but the Uno seems like a good general starting point. I did not know this, however, and decided on balance that the slightly more expensive Mega was a good way to make sure I didn’t have to come visit the shop again too soon. I also found a surprisingly good book on a nearby shelf: Make: Arduino Bots & Gadgets published by O’Reilly. I flipped through the first few pages to discover what was not inside the Arduino box (but was still required to get it working) and discovered that a USB-B (fat, square connector) cable is a necessary component, so picked up one of those (the book, too) and made my first Arduino purchases.

I also picked up two small packets of jumper wires, an adjustable power adapter (3.3V up to 12V with interchangeable heads) and a precision screwdriver set, not really knowing what the heck I was going to use these for, but they all seemed like reasonable purchases. There was a device called an “ethernet shield” on the rack next to the Mega, but the name “shield” did not help me make any immediate logical connections so I ignored it (bit of sensory overload at this point).

I should also mention that the book had a litany of tools listed, which I decided to ignore in favor of getting familiar with the software environment of the Arduino first. I was short on time that day, and didn’t get much done other than installing the development software and running the blinking LED “Hello World”.

I started reading my newly acquired O’Reilly book the following evening, and it quickly became apparent that I had nowhere near enough basic electronics resources to do anything immediately satisfying. Wire cutters, wire strippers, alligator clips, breadboards, soldering irons… I dithered. I let about a week slip by before I decided to press on with the book.

Further reading rekindled my interest, I decided another trip to the electronics store was in order. This time, I picked up an Uno, an Ethernet Shield, a combined wire cutter / stripper and decided I was in business. Satisfied, I spent a good chunk of that evening hunting down and taking apart every old, broken and/or unused electronic device in my apartment to salvage for parts. This yielded me a couple of motors, a 3.7V rechargeable battery, some microphones, speakers, the core of a CD player, and, most promisingly, a pretty simple LCD screen salvaged from a Nokia 3310 (drivers and sample code are available).

Unfortunately, none of these were really morphing into project ideas in my head. Then, while chatting to colleagues the following day, the idea of recreating the “Derezzed on floppy drives” video I’d seen a couple of weeks previously came up, and I had my goal. I was ready for a grueling few weeks of trial and error bit manipulation, but it turns out that floppy drive music is something of a micro-community. One chap, Sammy1Am, seems to have kicked off the recent interest by releasing some software (Moppy) to let people stream MIDI sequences to their Arduinos, and manipulate the floppy drives. This was the software used to create the Derezzed video I’d seen. Not only that, but he had also put together a helpful tutorial video. What a hero!

First problem I encountered: no floppy drives. At least, none near me. Hell, my laptop doesn’t even have a DVD drive. Another week passed, and I visited my family for Easter. I figured I had a few useful bits and pieces in storage, so I rummaged around the attic and pulled out two floppy drives, an old PC power supply and a few other bits and pieces. I also borrowed a digital multimeter from my dad, but I haven’t used it much yet – have been using that 3.7V battery to test whether current is going through something so far.

The power supply turned out to be a dud, but it took me a good while to figure that out; the damn thing would work for a little while, then go through periods of shorting out entirely. Eventually I decided that I could power the floppy drives via the Arduino itself, as it has a handy little power passthrough socket. Also, I have no idea how Sam does it, but hooking up wires salvaged from Cat5 to a floppy drive’s male pins is hard. Well, more precisely, it is really, really fiddly.

I gave up and decided that yet another trip to the store was in order; this time I purchased a couple of small breadboards, a box of variously sized jump wires, some little jumpers (like you get on the main board of your PC – if you’ve ever built one) and some plastic containers to store the small pile of parts I had been accruing. Now I had all the pieces assembled. I cabled everything up as well as I could, then tried to get Moppy working.

Moppy on OSX Lion does not work. Or, at least, does not work without some fiddling – it’s most of the way there. It’s been a long time since I’ve touched Java above the JVM, and I had to install “Netbeans IDE” to get the project built. I’ve forked the project and updated it with the changes that got it working for me, but I’m not comfortable pushing them back since it changes the assumed development OS from Windows to OSX, and that’s not something that should be pushed back upstream.

Eventually, this evening, I got the whole ensemble working together, and recorded my first test (one of the sample MIDIs included):

Now I just need to get a lot more floppy drives, and either Learn How to Music, or find a musician who can create MIDIs and wants to floppydrive-ify some music :) I’ve a bunch of ideas I want to muck around with on the software side too.

Bootstrapping from zero-electronics was a little expensive, but a lot of fun, and it’s been great to get into the hardware in a way I haven’t done for some time.

The (current) finished product

Posted in Arduino | Comments closed

Receive.js – simple mocking of remote HTTP requests

This is a quick and really basic node.js snippet for mocking up remote HTTP requests, like API calls.

During initial development of an API client, sometimes you just want to see what the server is receiving in real time, so you can hack things together quickly. This script sets up an HTTP server which spits out the URL requested, along with header information and any POSTed parameters.

Ideally, you could have this running on a parallel screen as you code, so you can see requests as they come in. This could also be used for simple mocking out of responses, but I haven’t gone that far :)
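To give a flavour of the idea, here’s an illustrative sketch in ruby – not the actual receive.js code; the names, port and output format are invented for this example:

```ruby
require 'socket'

# Format a raw HTTP request (request line, headers, body) for display.
def dump_request(raw)
  head, _sep, body = raw.partition("\r\n\r\n")
  request_line, *headers = head.split("\r\n")
  out = ["> #{request_line}"]
  headers.each { |h| out << "  #{h}" }
  out << "  body: #{body}" unless body.empty?
  out.join("\n")
end

# A tiny blocking server: print each request as it arrives, reply 200 OK.
def serve(port = 8000)
  server = TCPServer.new(port)
  loop do
    client = server.accept
    puts dump_request(client.readpartial(65_536))
    client.write("HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nOK")
    client.close
  end
end
```

Point your client at the port and every request it makes gets echoed to your terminal as it happens.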

receive.js on github.

Posted in Miscellaneous | Comments closed

Reading list for scaling Solr

Brain dump time. I kinda need this as a memory aid for myself, and I figure it’ll be useful to anyone else who is building a Solr cluster. There’s probably a lot of crossover here for tuning any JVM-based application servicing a large number of requests, but this is my first, so it’s all together.

Some of the background for this list can be read here, and for some further context, this is some of what I read to build something probably more powerful than websolr’s top tier offering (those guys are probably worth investigating before building your own cluster, by the way). There were some pretty “out there” requirements for this cluster, though (potentially thousands of FQ permutations per search phrase, lots of big, ugly, old data, etc).

Some of the issues I ran into scaling Solr are relatively unique, but the general approach should be the same for everyone:

  1. Learn how to gather information
  2. Gather all the information you can
  3. Analyze / Evaluate
  4. Make incremental changes
  5. Return to Step 2.

Figure out what you’re actually running

If you’re unfamiliar with the world of Java (or a rusty shade of green like me), you might be surprised – horrified, even – to discover that there are a few different implementations (let alone versions) of the Java Virtual Machine (JVM) available to you. What’s more, the best documented and supported one, the “Oracle” JVM (still documented almost everywhere as the Sun JVM), is probably not what you’re running if you’re running Ubuntu Server.

There’s also a difference between Java’s engineering numbers and product numbers (for example, “Java 6” is engineering version 1.6.0), which may not be immediately apparent from the outset, and often the two appear to be used interchangeably.

Understanding the JVM

Understanding Solr

Somewhat passive aggressive Lucid Imagination advice –
An example of some of the hilarious bureaucracy in Solr development –

There’s probably plenty more, but those are the ones I have saved :)

Posted in Technology | Comments closed

I’m joining EngineYard to work on Orchestra

Today, I finish up three and a half great years with Distilled Media, and next week I join EngineYard, where I’ll work with Eamon, David, Helgi and Gwoo, along with Noah, Elizabeth Davey (and a couple of others), to help kick Orchestra into orbit.

At EngineYard, I hope to build upon my experience bringing scale to web applications within Distilled Media, and take it to a wider audience.

I’m looking forward to (and a little daunted by!) working with such an accomplished group of people, and helping to drive our manifesto. Our offices will be down on Barrow Street, near Facebook and Google, in the same building as DogPatchLabs and friends at Intercom, in an area that is being dubbed the “Silicon Dock.” I’m incredibly excited by the sort of buzz that is going to be around this place.

Posted in Miscellaneous | Comments closed

Bots are crawling new domain registrations and namesquatting Twitter handles

Something to be wary of when you’re domain shopping for that perfect .com: bots are watching.

I’m not sure what combination triggers it, but when I was done brainstorming for an app name, I checked Twitter to see if the handle was taken, then registered the domain name. It was quite late and I didn’t want to start fiddling around with email aliases to get the Twitter account, so I decided to leave it since the application is only in early planning and layout stages.

A couple of days later, I decided to grab the account to start playing around with the Twitter API – couldn’t believe it when I discovered the account had been nabbed. No updates, no avatar – just a squatter. It seems obvious in retrospect that someone would be scraping domain registrations and comparing against Twitter handles (there was nothing but the .com taken and I’d never mentioned it anywhere), but it is immensely frustrating. The app is a Twitter-based tool, so having a .com and matching account name is essential.

What’s even more frustrating is that this seems to fall within Twitter’s acceptable usage policy. They have a section on namesquatting, but it states:

“Please note that if an account has had no updates, no profile image, and there is no intent to mislead, it typically means there’s no name-squatting or impersonation.”

You’d be hard-pressed to claim that there is an “intent to mislead” when you’re starting to build something – after all, the fact that you’re only starting to build it means there’s not much to mislead. So I’m back to the drawing board on figuring out a name while I put the app together.

The lesson here, a rather obvious one I suppose, is that Twitter handles are as valuable as the domain for certain classes of applications. If you’re building something that will interface with Twitter, you have to get the Twitter handle at the same time as the domain, or it’ll be swept out from under you.

As per Andrew’s comment below, I’ve submitted an impersonation report – hopefully something will come of it.

Well, it looks like this might be going places: the account in question is now showing as “suspended”. No word from support yet (it’s a weekend; I wasn’t expecting anything until at least Monday), but it’s encouraging!

Update 2011-08-29:
So, received some communication from Twitter today; they will not release the account name to me. This is disappointing, but I will move on.

Posted in Technology | Comments closed

“Levelling the playing field” in education

Came across a post via HN which suggests levelling the playing field in CS by teaching with obscure functional programming languages.

The reasoning is that there are “privileged” students who begin a computer science degree already knowing how to code, and that this is unfair.

Beyond the impracticality of doing this (you’re going to have to change the language every year so that people don’t “cheat”), the fact that so many educators in the comments agree with this position is further evidence that the current education system is creaking.

People like Salman Khan are already helping to make education more accessible with the likes of Khan Academy. In fact, teachers who have used Khan Academy apparently don’t want their students to be too smart.

When you are effectively looking to punish students for learning how to read before the rest of their peers, you have – at best – lost perspective as an educator.

At some point (actually around the time of the industrial revolution), it seems that education stopped being about the acquisition of knowledge and started being about churning out template humans to fulfil tasks.

If computer science education is broken because people are learning all by themselves, then society should (and already does) route around the problem – the problem being universities holding a monopoly on what it means to be educated. Tech companies in particular have been navigating around this for years, reaping the rewards of hiring self-taught, self-motivated individuals.

If established enterprises wish to take advantage of this revolution in education, they should recognize it too, lest they lose some of the best candidates to more modern companies.

Posted in Technology | Comments closed