Munin plugins for Solr

I’ve been mucking around with Python recently and have written a couple of simple Munin plugins for Boards.ie’s Solr cluster (in the hope of helping to track down some annoying performance bugs).

If you’re not familiar with Munin, it’s a bit like Nagios (and if you haven’t heard of Nagios, it’s a network monitoring tool). Munin is particularly nifty because it’s easily extensible in pretty much any language that takes your fancy; Perl seems to be the favorite, but Python is also used.
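
To give a flavour of why Munin is so pleasant to extend, here’s a minimal sketch of the plugin protocol in Python. This isn’t one of the actual plugins (those are linked just below), and the Solr stats URL and the JSON paths being read are assumptions that vary between Solr versions:

#!/usr/bin/env python
# Minimal sketch of a Munin plugin -- illustrative only. The stats URL and
# the JSON structure parsed below are assumptions; both vary between Solr
# versions.
import json
import sys
import urllib.request

STATS_URL = "http://localhost:8983/solr/admin/mbeans?cat=QUERYHANDLER&stats=true&wt=json"

def config():
    # Munin invokes the plugin with "config" to learn how to draw the graph.
    print("graph_title Solr average query time")
    print("graph_vlabel ms")
    print("graph_category solr")
    print("avgtime.label avg time per request")

def fetch():
    # A plain invocation prints one "field.value N" line per data point.
    data = json.load(urllib.request.urlopen(STATS_URL))
    mbeans = data["solr-mbeans"]  # flat list alternating category name, stats dict
    handlers = mbeans[mbeans.index("QUERYHANDLER") + 1]
    print("avgtime.value %s" % handlers["/select"]["stats"]["avgTimePerRequest"])

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "config":
        config()
    else:
        fetch()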

You can check out the plugins on the Github account of Distilled (the parent company of Adverts.ie, Boards.ie, Daft.ie and TheJournal.ie).

There’s a bunch more there that Conor McDermottroe has written for MySQL, Snort, IPMI and FreeBSD, which you might also be interested in!

Edit (2012-11-06): Github user kura has written an installer for these plugins, available here: https://github.com/kura/solr-munin


Google Plus

I’ve been playing on and off with Google’s new toy, Google+, for the last few days and while there are a plethora of opinion pieces all over the place, I figure what the hell, one more can’t really hurt (especially a nice short one that isn’t too gushing).

The Bad

I think one way in which Google really dropped the ball is that Google Apps customers are having serious difficulties using the service. Google is undergoing a massive effort to unify various types of accounts across their network and is mucking it up quite badly for such a smart group of people. They risk alienating some of their most committed users; heck, they’re alienating the minority of users who actually pay.

Something else I’ve heard is that Google should have gone to more effort to allow groups of people to sign up – after all, critical mass is what is going to make this thing successful. Coming back to Google Apps, they had an excellent opportunity to pre-authenticate all Google Apps account holders for the service, enabling entire, already-established groups to bootstrap it.

Sparks feels like a bit of a miss. I’m not quite sure what mix of keywords is going to give me stuff I should know about. I’m probably trying too hard to recreate the experience of hackernews or Techmeme, but since I don’t know what “topic” to pick and I’m not sure how the results are populated, it feels… off.

The Good

Having the Google+ notification bar as part of my GMail bar is inspired. It’s amazing how in many ways this is not all that different from the time Buzz was lodged below our inboxes – yet it just feels so much more appropriate.

The interface is finally something between the dorky, utilitarian UI of older products and the sweeping strokes of pastel colours, Helveticaesque typefaces and thick lines we’ve become more familiar with in the last few years.

Profiles are very nicely executed. There’s just the right amount of subtle completionist bait to encourage people to add more information and fill out more circles, but without feeling like you’re being pressured into it (a la LinkedIn).

It feels like it has the opportunity to be a richer experience than Twitter (who are struggling to fit the rich world of the web into their 140 character ethos), while retaining the serendipity of interaction with strangers (for want of a more appropriate term) that isn’t really available with Facebook. It doesn’t seem like it would replace the organic, real time stream that is Twitter; but it might supplant a good portion of my (admittedly low) Facebook usage.

Overall, I think if the bootstrapping issues can be addressed then this is going to be a really useful addition to the space that Facebook, Twitter and LinkedIn occupy.


Getting Windows 7 onto a USB stick using Ubuntu

I spent way too much time trying to do this so maybe this will save someone else some time.

I haven’t owned a copy of Windows for years, and have been using Ubuntu as my sole home and office OS for some time. Last week, though, I decided I should have a copy of Windows 7, since there’s some software I can’t run without it (my hardware is pretty new, and it seems to take Linux about 6-12 months to catch up with changes in hardware, especially graphics).

So, after a bit of hunting, I discovered it was possible to purchase Windows 7 in downloadable ISO format at http://emea.microsoftstore.com

Great, I thought, that’s pretty progressive of Microsoft! Unfortunately their helpfulness ends there, as the only instructions for actually using the ISO revolve around DVD burners and/or assume you already have Windows (of some sort) installed. I didn’t buy a DVD drive with my last computer build because I haven’t actually needed to use one for a long time.

So, if, like me, you have Ubuntu as your only OS and no DVD drive, it seems like you’re kinda up the creek. Fortunately, this is actually pretty easily remedied. This guide does assume some familiarity with both Ubuntu and partitioning and I’m not big on tech support, so consider yourself forewarned :)

You need:

  1. A USB key that you can afford to wipe completely (at least 4GB)
  2. Your Windows 7 ISO of choice (32 bit and 64 bit varieties are available from the store) and the serial (it’ll be emailed to you).
  3. A connection to the Internet on your Ubuntu machine (or just read the guide and preinstall anything you see mentioned)
  4. A willingness to nuke whatever OS you already have installed, because Windows takes no prisoners (it blindly overwrites the Master Boot Record).

Now, technically #4 can be alleviated by recovering grub afterwards, but it’s a bit of a roundabout process, and my Ubuntu machines are all basically thin clients. Important documents are stored in Dropbox, my code is on Github, my music and the like are on external drives, and almost everything else is also stored remotely or run in the browser. If you don’t want to lose your Ubuntu install, I imagine the guide on the Ubuntu Community Wiki is where you’ll want to look after this.

Important: when you’re downloading the ISO, do not bother with "Part 1" – it is a boot sector editor for Windows and will be useless to you. Download only "Part 2", as it’s the ISO (you can only do this about three times, so try to download and store the ISO the first time round).

Partitioning the USB key

We’re going to prepare this USB key with GParted; Disk Utility won’t do, since it won’t allow you to set a device as bootable. You might have GParted installed already, but if you’re running a fairly fresh copy of Ubuntu, you’ll need to install it using Synaptic. Once it’s installed, you should find it under System->Administration->GParted Partition Editor.

  1. Insert your USB key, start up GParted and select the key from the drop down in the top right. It’ll be something like /dev/sdc, with the size of the drive in brackets next to it.
  2. Delete any partitions that are on the drive, create a new NTFS one, and apply the changes.
  3. Right click on the device and select “manage flags”. Set the “boot” flag and nothing else.

Extracting the ISO

You may need to eject and re-insert the drive for it to appear. At this point, open up your freshly downloaded ISO using the Archive Manager; do not “mount” the ISO, as it will not be usable.
Extract the contents of the archive straight onto your newly formatted (and bootable) USB key. Once this is complete, eject the drive and voila, you have a bootable USB key with Windows 7 on it!


Searching Boards.ie – Solr, EC2, SQS, SNS, Node.js

This is the first in a series of posts about the design and implementation of a search engine for Boards.ie.

Boards.ie recently launched a new search engine – http://www.boards.ie/search/ – which is built upon Amazon Web Services using Solr with PHP as the glue.

Currently, Boards.ie users are searching nearly 30 million posts almost a million times a month.

A little background

Solr has been used in production by Distilled Media (formerly the Daft Media Group) for a couple of years now, first being tested on LetsRent.ie, then for powering the maps functionality on Daft.ie, and even more recently as the core technology for the relaunch of Adverts.ie.

Boards.ie’s usage has been an interesting look into the future for Daft.ie and Adverts.ie; with nearly 26 million posts – “documents” in Solr nomenclature – and growing by more than a million every month, it presents a challenge in providing a reliable, affordable, relevant and fast search solution to our users.

Boards.ie has relied on MySQL’s “Fulltext” search for as long as it has been using MySQL (forever). As time passed, more and more restrictions had to be placed on search in order to keep it online, and the sheer volume of data was making results less and less useful to people accustomed to Google search.

There have been a number of challenges to solve along the way:

Can we make search faster and more relevant with what we have already?

This is sort of an obvious question, but we didn’t want to jump the gun and leap into an entirely new technology stack if it were possible for us to overhaul what was currently there.

The answer was not particularly straightforward, but any solution that involved keeping the MySQL fulltext engine as our primary search system involved a lot of new beefy hardware and a lot of the same slowness, punting the relevance issues down the line. As a result, we decided that it was not a realistic option.

Which search technology should we use?

There were only a couple of choices available to us when we started working on this project. First, we wanted to use Open Source – there are plenty of good reasons to choose either open source or proprietary software depending on your requirements, and we have experience with both. We wanted the flexibility to hack away at what we were using and share back to the community where possible. We wanted to be able to scale horizontally without worrying about skyrocketing licensing costs. We wanted access to the huge amount of expertise and goodwill that comes with open source projects, as well as experimental work done by other developers.

It also had to be compatible with the rest of our infrastructure and fit with our areas of experience – we’re pretty evenly split between FreeBSD and Linux on the infrastructure side; we didn’t want to introduce an entirely new stack (Microsoft) if it could be avoided.

There were two major options for us as a result: SphinxSearch and Apache Solr. SphinxSearch appeared to have successful deployments with vBulletin (our forum software), but all the experience in the Distilled Group was with Solr; Daft.ie had already deployed successful Solr installations, so we decided to go with what we had some experience with.
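
As an aside for anyone who hasn’t used it: part of Solr’s appeal is how thin the query interface is – just an HTTP GET that can return JSON. A quick sketch (not our production code, which is PHP; this assumes a default single-core Solr on localhost):

import requests

# Query a default single-core Solr install over plain HTTP; "wt=json"
# asks for a JSON response instead of the default XML.
resp = requests.get(
    "http://localhost:8983/solr/select",
    params={"q": "text:search", "rows": 10, "wt": "json"},
)
for doc in resp.json()["response"]["docs"]:
    print(doc["id"])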

What infrastructure should we use?

We had the option of building on our own hardware, or making the leap and introducing cloud infrastructure. Moving partially into the cloud introduced a data transportation issue – how were we going to get data into the cloud while keeping our core infrastructure in Digiweb? Latency and data transfer costs were likely to be show stoppers.

Amazon had recently introduced a new service in beta, Simple Notification Service (SNS). We were pretty sure we could combine this with their Simple Queue Service (SQS) to create a scalable message routing and queuing system for posts. With a few hiccups and some conversation with Amazon along the way, we were successful. As it turns out, getting the data out of our core infrastructure and into Amazon is the cheapest component of the entire operation: for the tens of thousands of messages we send to Amazon on a daily basis, we’re billed a couple of dollars a month.
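
A rough sketch of the shape of that pipeline (the production glue is PHP; this uses Python and boto3 for brevity, and the topic ARN and queue are placeholders):

import json
import boto3

sns = boto3.client("sns", region_name="eu-west-1")
sqs = boto3.client("sqs", region_name="eu-west-1")

# Publish side, on the core infrastructure: push each new post to an SNS
# topic. SQS queues subscribed to the topic each receive a copy, giving
# fan-out plus buffering for free.
def relay_post(post):
    sns.publish(
        TopicArn="arn:aws:sns:eu-west-1:000000000000:posts",  # placeholder
        Message=json.dumps(post),
    )

# Consume side, on EC2: an indexer worker drains its queue and hands the
# posts over to Solr.
def drain_queue(queue_url):
    while True:
        resp = sqs.receive_message(QueueUrl=queue_url,
                                   MaxNumberOfMessages=10,
                                   WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            envelope = json.loads(msg["Body"])   # SNS wraps the payload
            post = json.loads(envelope["Message"])
            print(post)                          # hand off to the indexer here
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=msg["ReceiptHandle"])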

What are the tradeoffs?

There are always tradeoffs.

The advantages and disadvantages of SQL and NoSQL solutions have been covered at length elsewhere, so I’ll just highlight the areas that were significant for us.

With MySQL, our data was/is potentially consistent. For our usage, it is consistent enough for us to call it the canonical source of our data. Occasionally it chews up a post or thread, but rarely one that cannot be rewritten.

The information is also effectively instantly available upon submission. There’s occasionally a lag of up to a couple of seconds as the data waits to propagate across our MySQL cluster due to an extended lock of some sort, but it’s generally not particularly noticeable to the average user.

With the new search system we have sacrificed some consistency and some of the immediacy with which data is made available to the searcher. For instance, posts sometimes (more frequently than with MySQL) do not make it into the search system. Multiple layers of redundancy are built into the system to limit this, but occasionally a full resynchronization is required. We judged this to be an acceptable cost.

Immediacy has been sacrificed: it usually takes about 2-5 minutes for a new post (or an update to an old one) to become available in searches; occasionally an hour, infrequently a day, and very infrequently a couple of months (relating back to consistency). We determined that the normal search profile does not require the absolute newest data, only the most relevant.

At the time we made these decisions, losing real time search was the tradeoff I was least happy about; for most users, it does not appear to be a concern. In hindsight it’s almost obvious why – Boards.ie is a gigantic repository of historical information. The value is in being able to search this rich back catalogue of conversation, opinion and information, not just the most recent posts.

Even at that, recency is only sacrificed when measured in seconds.

Architectural choice

As useful as Solr has been for us, it’s a bit of a black box in our architecture. Trying to run Solr through a debugger remotely is a gigantic pain in the arse and not something I had much success with; fortunately, the Jetty error logs are enough to illuminate most problems. Still, a few pain points stand out:

  • Memory management. It’s perhaps unfair to take issue with Solr over this – after all, it’s a far cry from MySQL’s famously labyrinthine memory usage configuration options, and Solr’s problems are really the restrictions of the JVM – but it feels like a single purpose machine with one major application should be able to figure out how much RAM to dedicate to disk cache and how much the application should get for optimal performance.
  • XML everywhere. It’s inescapable, and again, this can be ascribed to Java culture, but damn.
  • Fixed schema. I suppose this is an ideological argument, but during development this was tedious.
  • The book is two inches thick and you need to know it. If you’re building a search service that you expect to take a lot of traffic and contain a lot of documents (why else would you be interested, right?) you will simply have to know all (or at least a good chunk of) the features, quirks, optimizations and architectural choices.

Solr is now a pretty well field-tested application in Distilled, and I’m pretty sure I could rapidly prototype an installation and have it up and running in production inside a week for another site, but I would prefer to investigate ElasticSearch the next time I am revisiting search options.

Further development

During the course of development, I wrote a small node.js server to speed up the relay of post data from our web servers to Amazon SNS. Due to the limitations of deploying node.js on our FreeBSD infrastructure at the time, this system has, unfortunately, yet to be implemented. Once it is, however, write operations on the site should speed up noticeably for end users. Once it’s up and running and has the bugs ironed out, it will be released as an open source project on Github.

A side effect of this is that I contributed some code to the AWS Library node.js project, something I would like to continue doing.

I have also amended chunks of the Solr PHP client with a sharded Solr deployment in mind, but it’s quite clunky and I would prefer to have another stab at it, maybe writing my own Solr PHP client and making it available.

In the next posts on this topic, I plan to dive more deeply into the specifics of our implementation, and hopefully release some code in tandem.


EC2: Create AMI from a running instance

Log into the AWS web console, find the instance you want to create an AMI from, right click and select “Create Image (EBS AMI)”. Follow the wizard.

All the top results in Google are for The Long Way™ to do this. The Long Way has a bunch of useful things to take into consideration (security, for example), but the actual process of creating an AMI from a running instance has been made simpler by Amazon in the last few months.
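
For what it’s worth, the same operation is a single API call if you’d rather script it; a sketch using boto3 (the instance ID and image name are placeholders):

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Create an EBS-backed AMI from a running instance. By default Amazon
# reboots the instance first to get a consistent filesystem snapshot;
# NoReboot=True would skip that, at the cost of consistency.
response = ec2.create_image(
    InstanceId="i-0123456789abcdef0",  # placeholder
    Name="webserver-snapshot-2011-06-01",
    NoReboot=False,
)
print(response["ImageId"])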

Right-click context menus in web pages aren’t the most obvious metaphor for people who have been using the web for a number of years. I know people new to EC2 who have completely missed this and could have saved themselves some time. Hopefully this will bubble up and save people a half hour or so :)


Gender breakdown for software development in Ireland

This post was inspired by this TED talk from Sheryl Sandberg, which reminded me of this brilliant blog post by Jolie O’Dell. The whole women in tech thing is something which interests me (as a techie, as well as a human being), and there are loads of great discussions (and terrible ones) about it.

I’ve been trying to figure out a good way to get an idea of the size and gender breakdown of the somewhat nebulous software development community in Ireland. I’ve settled on LinkedIn as probably the best way to figure this out; Facebook appears to be useless for it (despite having that information). I want to compare these numbers to the official ones provided by government and other agencies to see how they measure up, though doing so involves a good few caveats.

These numbers are at least interesting to play around with. I suggest trying out LinkedIn’s DirectAds system yourself; you don’t have to pay to get to the targeting options at step 2.

Currently, LinkedIn lists 416,030 people as working in Ireland. The National Skills Bulletin 2010 (pdf) states that there were 1.88 million people working in full or part time employment in Q4 2009. As that number has probably only gone down since then, I think LinkedIn has a pretty sizeable chunk of the workforce listed; maybe more than a quarter of the total, though probably with a heavy bias towards networking-enabled industries with strong computer usage.

Of those listed on LinkedIn, 37,322 describe themselves as working in “Engineering” or “Information Technology” – two quite broad descriptions, both of which will include a portion of people who have absolutely nothing to do with software development. At this level, the breakdown is 28,387 male and 6,722 female, with about 2,213 not listing a gender.

It gets tricky at this point – if you wish to break down by Industry, you’ve got a maximum of ten to choose from a list of hundreds, and techies work in almost every industry. To try and extract the main corpus of “software developers” from the Industry section, I did a quick poll of my own LinkedIn connections and chose: Computer Hardware, Computer Software, Computer Networking, Internet, Information Technology and Services, Computer & Network Security, Wireless, Online Media, Publishing and Information Services. I figure this cuts out most of the mechanical/bio/pharma engineers.

This gives me 13,080 people in total. The gender breakdown for this is 10,141 male, 2,148 female (791 unknown).

To see how these numbers match up with regular LinkedIn search (if I were looking for software people), I searched for people in Ireland with a couple of different keywords as their current job title. “Engineer” turned up 13,848 (24,446 including past jobs). “Systems Administrator” turned up 330 (997 including past jobs). “Developer” turned up 2,930 (6,414 including past jobs) – not sure how many of these might be “business developer, property developer” or similar, but the first few pages of results looked about right. “Operations” turned up too many misses (COO, ad operations, etc), so I haven’t included it.

This tots up to about 17,108 people. I might guess that my DirectAds version of this is missing 4,028 people who have listed themselves as being in finance, banking, etc, or engineers who have nothing to do with software. Either way, not totally disparate figures.
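
A trivial bit of Python, just to show where the 4,028 comes from:

# Totting up the job-title searches and comparing against the 13,080
# people found via the DirectAds industry filter.
title_counts = {"Engineer": 13848, "Systems Administrator": 330, "Developer": 2930}
search_total = sum(title_counts.values())
print(search_total)          # 17108
print(search_total - 13080)  # 4028, the gap vs the industry-filtered figure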

So, does 13,000 sound like a good ballpark figure for the number of software developers in Ireland? 84% male vs 16% female?

CareersPortal.ie has some information compiled from the CSO and the National Skills Bulletin 2010. They conclude that there are 9,000 people employed in software development in Ireland, and that the gender breakdown is 89% male, 11% female. It’s interesting that while there’s a difference in the totals, the percentages are pretty close. Considering the time gap between the Forfás data (about a year old, sometimes more) and the LinkedIn data (almost real time), I think those numbers are a pretty good indicator.


Dublin Bus route statuses

I’ve thrown together a little hack that gives current Dublin Bus route status information in JSON format.

Endpoint: http://rossduggan.ie/stuff/bus/

Simply calling the endpoint will return a JSON object of all bus routes and their associated statuses.

Append ?route=x to the endpoint and you’ll get just the results for the route specified (or what it thinks you mean if it doesn’t understand the route):

GET http://rossduggan.ie/stuff/bus/?route=15

You’ll get something like this:

{
	"status":"Operating on normal route.",
	"match":
		{
			"exact":"15"
		}
}

If you look for one of the routes that Dublin Bus merge together (for whatever reason; 42A and 42B are like this):

GET http://rossduggan.ie/stuff/bus/?route=42a

You’ll get:

{
	"status":"Unable to serve Edenmore or Harmondstown. Operating via Springdale Rd and Tonlagee Rd.",
	"match":
		{
			"closest":"42a\/b"
		}
}

The reason for the fuzzy matching is that I threw this together in about 30 minutes as an excuse to familiarize myself with XPath, and I thought some error handling would be better than no error handling :)
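
If you want to poll it from Python, something along these lines should work (a sketch using the requests library; the response shape is as in the examples above):

import requests

ENDPOINT = "http://rossduggan.ie/stuff/bus/"

def route_status(route):
    # Returns (matched route, status text); "match" holds either an
    # "exact" or a "closest" key depending on how the lookup went.
    data = requests.get(ENDPOINT, params={"route": route}).json()
    match = data["match"].get("exact") or data["match"].get("closest")
    return match, data["status"]

print(route_status("42a"))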

Once the bad weather goes away this will probably break, since Dublin Bus will change the layout of their notices, but I’ll try to keep it functional for as long as I can. I have a horrible feeling the data source is being copied and pasted by whoever is maintaining it from an old copy of MS Word into a WYSIWYG editor for the dublinbus.ie homepage.

In usage:

I pushed this out on Twitter at about 7:30pm, and by about 10:00pm @walmc had already thrown together a neat little node.js growl notifier using it!


Links for Friday, 3rd December 2010

Paul Conroy has written a little bookmarklet to add imgur previews to the Twitter web interface, and has a detailed explanation of how he’s done it.

The Algorithm + the Crowd is not enough. I have a response post gestating.

Defecting by Accident, A Flaw Common to Analytical People. This is “How to Win Friends and Influence People” digested into a blog post for nerds.

Potentially Consistent, or why your MySQL master-slave setup is not “Eventually Consistent”.

This post and this video are a good introduction to Clojure.

Kafka is a roll-your-own SNS from the guys at LinkedIn (though I’m sure they wouldn’t describe it that way).

First class APIs, by @h.

What the HTTP is CouchApp?

Engineering Shortage Is Real. Former Digg Engineer Gets 7 Offers, Takes One for $150K. Good news for software engineers.

This is a really cool take on introducing an application.


Extracting information from a lot of images on disk using find

If you need to extract information from a large number of images on disk (and you’re using a *nix system), you could do worse than using find with Imagemagick’s command line tools.

If you’re unfamiliar with find, I’d recommend reading the beginners’ guide on Linux.ie. find has a terse and initially daunting syntax, but it is one of the most powerful tools available to *nix users, and proficiency with it is massively useful, especially for sysadmins and developers.

Here’s how you’d go about finding all jpg, gif, png and bmp images in a directory, excluding anything in a “thumbs” directory, getting their dimensions, compression type and filesize, separating each piece of information with a comma and writing it all out to a file:

find . -path "*/thumbs/*" -prune -o -type f \(\
 -iname "*.jp*g" -o -iname "*.gif" -o -iname "*.png" -o -iname "*.bmp"  \)\
  -exec identify -format "%i,%wx%h,%m,%[size]\n" {} + > /tmp/images.info

Broken down:

find .

Searches in the current directory (.) – you can specify a path just as easily (find /path/to/directory/)

-path "*/thumbs/*" -prune

Exclude (prune) paths that match the preceding pattern. You can specify this multiple times (or not at all).

-o

This is the OR operator. AND is implied between each modifier if left out.

-type f

Specifies that we’re looking for a file (a directory would be -type d)

\(\
 -iname "*.jp*g" -o -iname "*.gif" -o -iname "*.png" -o -iname "*.bmp" 
 \)\

( opens a group, ) closes it. The backslashes escape the parentheses and newline (I’ve just used the newline to make it more readable). The -iname directive specifies a case-insensitive filename, in this case matching file extensions. The usage of the -o operator is more obvious here, as without it we’d be asking that each file match .jpg AND .png AND .gif – which wouldn’t really work.

-exec ... {} +

This executes a command on each item found, the “current” found item being contained in the {} placeholder. + is the terminator in this case. \; can also be used (again, backslash as escape), but the + terminator batches results and performs much better with large numbers of files. This is roughly equivalent to piping into xargs on older systems which may not have the + terminator available (pre-2005 builds).

identify -format "%i,%wx%h,%m,%[size]\n"

In this case, the command we’re executing is Imagemagick’s identify tool. There’s quite a lot of information available here, so it’s prudent to use the -format option to limit the output to what you need. Helpfully, there’s a list of escape characters to let you know what can be extracted.
Here, I’m getting the file path (%i), the width (%w) and the height (%h), putting in a literal ‘x’ to separate the dimensions. After that, there’s the compression type (%m) and the filesize in KB (%[size]). I separate each value with a literal comma and end each line with a newline (\n).

> /tmp/images.info

Finally, rather than output this information to the screen (by default), we direct the output into a file in the tmp directory. If there are a lot of files to process, you won’t immediately see data start to pour in here, as it’ll be batched using the + terminator mentioned before. You’ll probably see it populate in lumps of several thousand.
You should get a file containing results that look something like this:

./images/3tm9wzz4z9kzd51168cef0a9cc77ca616916128aaa3d.JPG,640x480,JPEG,22.8KB
./images/226te3jc3m85519d6348418bdde11ee08d77ffd338ff.JPG,626x639,JPEG,44.6KB
./images/2s9262f4uix2e26113b8007a2a3dfadb6aa3fa7aa0ee.JPG,384x288,JPEG,36.6KB
./images/3572wcuya3pi3fb0f68eff3d6104a7b94d5725b2b526.jpg,480x640,JPEG,50.9KB
./images/5wby49rxay9lcc890e914b4d52e9909700f8d5227bb9.jpg,354x142,JPEG,11.9KB
./images/1c6cf3icti8v9c2b997592c0c7c51c25e900969eaec4.JPG,478x640,JPEG,41.4KB
./images/53h1y0x1q37q22d65cc682f6d7994db2510cab013ddf.JPG,478x640,JPEG,28.1KB
./images/4r8ck3kn1ezi809f7d4a63c0fb95b4f07053641bd8d3.JPG,478x640,JPEG,33.5KB
./images/156m118zdn7n4a10fef7d6c88067482f0803db2837e6.JPG,478x640,JPEG,25.5KB
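
From there it’s trivial to pull the data back into a script for analysis; a quick sketch in Python, assuming the file produced above:

import csv

# Each row is: path, WxH dimensions, compression type, size in KB.
with open("/tmp/images.info") as f:
    for path, dimensions, compression, size in csv.reader(f):
        width, height = (int(n) for n in dimensions.split("x"))
        if width * height > 300000:  # e.g. flag anything bigger than ~0.3 megapixels
            print(path, dimensions, compression, size)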

If you spot any typos, mistakes or ways you think this might be improved, feel free to let me know.


Links for 7th April, 2010

Leslie Jensen-Inman, via A List Apart, examines the use of colour. Worth a read, especially for the tools she highlights.

Clay Shirky – “The Collapse of Complex Business Models”

Microsoft’s TERMINATOR project – solving the halting problem… for a finite number of “real” systems.

Damien Katz gives up on git for managing CouchDB. He’s a smart dude; I hope he elaborates.

Bruce Schneier – Privacy and Control
