Reading list for scaling Solr

Brain dump time. I kinda need this as a memory aid for myself, and I figure it’ll be useful to anyone else who is building a Solr cluster. There’s probably a lot of crossover here for tuning any JVM-based application servicing a large number of requests, but this is my first, so it’s all together.

Some of the background for this list can be read here, and for some further context, this is some of what I read to build something probably more powerful than websolr’s top tier offering (those guys are probably worth investigating before building your own cluster, by the way). There were some pretty “out there” requirements for Boards.ie though (potentially thousands of FQ permutations per search phrase, lots of big, ugly, old data, etc).

Some of the issues I ran into scaling Solr are relatively unique, but the general approach should be the same for everyone:

  1. Learn how to gather information
  2. Gather all the information you can
  3. Analyze / Evaluate
  4. Make incremental changes
  5. Return to Step 2.

Figure out what you’re actually running

If you’re unfamiliar with the world of Java (or a rusty shade of green like me), you might be surprised (or horrified) to discover that there are a few different implementations (let alone versions) of the Java Virtual Machine (JVM) available to you. What’s more, the best documented and supported one, the “Oracle” JVM (still documented almost everywhere as the Sun JVM), is probably not what you’re running if you’re running Ubuntu Server.

There’s also a difference between engineering version numbers and product version numbers (1.6.0 versus 6, for example). This may not be apparent at the outset, because the two are often used interchangeably.
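To make the implementation and version-number question concrete, here is a rough Python sketch that picks apart the text printed by `java -version`. The sample output is an assumed example (the exact strings vary by vendor and build), and `describe_jvm` is a name made up for this illustration.

```python
import re

# Assumed sample of `java -version` output; real output differs per vendor and build.
SAMPLE_OPENJDK = '''java version "1.6.0_22"
OpenJDK Runtime Environment (IcedTea6 1.10.2)
OpenJDK 64-Bit Server VM (build 20.0-b11, mixed mode)'''


def describe_jvm(version_output):
    """Extract engineering version, product version, update and VM name."""
    match = re.search(r'version "(\d+)\.(\d+)\.(\d+)(?:_(\d+))?', version_output)
    if match is None:
        raise ValueError("unrecognised java -version output")
    major, minor, micro, update = match.groups()
    engineering = "%s.%s.%s" % (major, minor, micro)
    # In the 1.x series the product number is the second component (1.6.0 -> 6).
    product = int(minor) if major == "1" else int(major)
    lines = version_output.strip().splitlines()
    # The last line names the VM implementation, followed by build details in brackets.
    vm_name = lines[-1].split(" (")[0] if len(lines) > 1 else "unknown"
    return {
        "engineering": engineering,
        "product": product,
        "update": int(update) if update else None,
        "vm": vm_name,
    }


info = describe_jvm(SAMPLE_OPENJDK)
```

On a real machine the text would come from running `java -version` and reading its stderr, since that is where the JVM prints it.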

Understanding the JVM

Understanding Solr

http://www.lucidimagination.com/content/scaling-lucene-and-solr

Somewhat passive aggressive Lucid Imagination advice - http://www.lucidimagination.com/blog/2010/01/21/the-seven-deadly-sins-of-solr/
An example of some of the hilarious bureaucracy in Solr development - https://issues.apache.org/jira/browse/SOLR-1143

There’s probably plenty more, but those are the ones I have saved :)

Posted in Technology

I’m joining EngineYard to work on Orchestra

Today, I finish up three and a half great years with Boards.ie, and next week I join Eamon, David, Helgi and Gwoo with Noah, Elizabeth, Davey (and a couple of others) to help kick Orchestra into orbit by joining EngineYard.

At EngineYard, I hope to build upon my experience bringing scale to web applications within Distilled Media, and take it to a wider audience.

I’m looking forward to (and a little daunted by!) working with such an accomplished group of people, and helping to drive our manifesto. Our offices will be down on Barrow Street, near Facebook and Google, in the same building as DogPatchLabs and friends at Intercom, in an area that is being dubbed the “Silicon Dock.” I’m incredibly excited by the sort of buzz that is going to be around this place.

Posted in Miscellaneous

Bots are crawling new domain registrations and namesquatting Twitter handles

Something to be wary of when you’re domain shopping for that perfect .com: bots are watching.

I’m not sure what combination triggers it, but when I was done brainstorming for an app name, I checked Twitter to see if the handle was taken, then registered the domain name. It was quite late and I didn’t want to start fiddling around with email aliases to get the Twitter account, so I decided to leave it since the application is only in early planning and layout stages.

A couple of days later, I decided to grab the account to start playing around with the Twitter API – couldn’t believe it when I discovered the account had been nabbed. No updates, no avatar – just a squatter. It seems obvious in retrospect that someone would be scraping domain registrations and comparing against Twitter handles (there was nothing but the .com taken and I’d never mentioned it anywhere), but it is immensely frustrating. The app is a Twitter-based tool, so having a .com and matching account name is essential.

What’s even more frustrating is that this seems to fall within Twitter’s acceptable usage policy. They have a section on namesquatting, but it states:

“Please note that if an account has had no updates, no profile image, and there is no intent to mislead, it typically means there’s no name-squatting or impersonation.”

You’d be hard-pressed to claim that there is an “intent to mislead” when you’re starting to build something – after all, the fact that you’re only starting to build it means there’s not much to mislead. So I’m back to the drawing board on figuring out a name while I put the app together.

The lesson here, a rather obvious one I suppose, is that Twitter handles are as valuable as the domain for certain classes of applications. If you’re building something that will interface with Twitter, you have to get the Twitter handle at the same time as the domain, or it’ll be swept out from under you.

Edit:
As per Andrew’s comment below, I’ve submitted an impersonation report – hopefully something will come of it.

Update:
Well, it looks like this might be going places: the account in question is now showing as “suspended” – no word from support yet (it’s a weekend, I wasn’t expecting anything until at least Monday) but it’s encouraging!

Update 2011-08-29:
So, received some communication from Twitter today; they will not release the account name to me. This is disappointing, but I will move on.

Posted in Technology

“Levelling the playing field” in education

Came across a post via HN which suggests levelling the playing field in CS by teaching with obscure functional programming languages.

The reasoning is that there are “privileged” students who begin a computer science degree already knowing how to code, and that this is unfair.

Beyond the impracticality of doing this (you’re going to have to change the language every year so that people don’t “cheat”), the fact that so many educators in the comments agree with this position is further evidence that the current education system is creaking.

People like Salman Khan are already helping to make education more accessible with the likes of Khan Academy. In fact, teachers who have used Khan Academy apparently don’t want their students to be too smart.

When you are effectively looking to punish students for learning how to read before the rest of their peers, you have – at best – lost perspective as an educator.

At some point (actually around the time of the industrial revolution), it seems that education stopped being about the acquisition of knowledge and started being about churning out template humans to fulfil tasks.

If computer science education is broken because people are learning all by themselves, then society should (and already does) route around the problem – the problem being universities holding a monopoly on what it means to be educated. Tech companies in particular have been navigating around this for years, reaping the rewards of hiring self-taught, self-motivated individuals.

If modern enterprises wish to take advantage of this revolution in education, they should recognize this lest they lose some of the best candidates to more modern companies.

Posted in Technology

Munin plugins for Solr

I’ve been mucking around with Python recently and have written a couple of simple Munin plugins for Boards.ie’s Solr cluster (in the hope of helping to track down some annoying performance bugs).

If you’re not familiar with Munin, it’s a bit like Nagios (and if you haven’t heard of Nagios, it’s a network monitoring tool). Munin is particularly nifty because it’s easily extensible in pretty much any language that takes your fancy; Perl seems to be the favorite, but Python is also used.

You can check out the plugins on Distilled’s Github account (parent company of Adverts.ie, Boards.ie, Daft.ie and TheJournal.ie).

There’s a bunch more there that Conor McDermottroe has written for MySQL, Snort, IPMI and FreeBSD too, which you might also be interested in!
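For anyone who hasn’t written one, a Munin plugin is just an executable that prints graph settings when called with `config` and prints `field.value` lines otherwise. Below is a minimal Python sketch of that shape. It is not the code from the Github repository; the field names and the `fetch_stats` placeholder are made up for illustration, and a real plugin would read its numbers from Solr.

```python
import sys

# Hypothetical fields; a real plugin would choose whichever Solr stats it cares about.
FIELDS = {"numdocs": "Documents in index", "qps": "Queries per second"}


def config_lines():
    """Lines Munin expects when the plugin is called with the 'config' argument."""
    lines = ["graph_title Solr (example)", "graph_category solr", "graph_vlabel value"]
    for name in sorted(FIELDS):
        lines.append("%s.label %s" % (name, FIELDS[name]))
    return lines


def value_lines(stats):
    """Lines Munin expects on a normal run: one 'field.value N' per field."""
    return ["%s.value %s" % (name, stats[name]) for name in sorted(FIELDS)]


def fetch_stats():
    # Placeholder returning fixed numbers; fetching from Solr is left out of this sketch.
    return {"numdocs": 0, "qps": 0}


def run(argv):
    if len(argv) > 1 and argv[1] == "config":
        return config_lines()
    return value_lines(fetch_stats())


if __name__ == "__main__":
    print("\n".join(run(sys.argv)))
```

Keeping the formatting in small functions makes it easy to check the output without a Munin node to hand.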

Posted in Technology

Google Plus

I’ve been playing on and off with Google’s new toy, Google+, for the last few days and while there are a plethora of opinion pieces all over the place, I figure what the hell, one more can’t really hurt (especially a nice short one that isn’t too gushing.)

The Bad

I think one way in which Google really dropped the ball is that Google Apps customers are having serious difficulties using the service. Google is undergoing a massive effort to unify various types of accounts over their network and is mucking it up quite badly for such a smart group of people. They risk alienating some of their most committed users; heck, they’re alienating the minority of users who actually pay.

Something else I’ve heard is that Google should have gone to more effort to allow groups of people to sign up – after all, critical mass is what is going to make this thing successful. Coming back to Google Apps, they had an excellent opportunity to pre-authenticate all Google Apps account holders for the service, thus enabling entire already dedicated groups to bootstrap the service.

Sparks feels like a bit of a miss. I’m not quite sure what mix of keywords is going to give me stuff I should know about. I’m probably trying too hard to recreate the experience of hackernews or Techmeme, but since I don’t know what “topic” to pick and I’m not sure how the results are populated, it feels… off.

The Good

Having the Google+ notification bar as part of my GMail bar is inspired. It’s amazing how in many ways this is not all that different from the time Buzz was lodged below our inboxes – yet it just feels so much more appropriate.

The interface is finally something between the dorky, utilitarian UI of older products and the sweeping strokes of pastel colours, Helveticaesque typefaces and thick lines we’ve become more familiar with in the last few years.

Profiles are very nicely executed. There’s just the right amount of subtle completionist bait to encourage people to add more information and fill out more circles, but without feeling like you’re being pressured into it (a la LinkedIn).

It feels like it has the opportunity to be a richer experience than Twitter (who are struggling to fit the rich world of the web into their 140 character ethos), while retaining the serendipity of interaction with strangers (for want of a more appropriate term) that isn’t really available with Facebook. It doesn’t seem like it would replace the organic, real time stream that is Twitter; but it might supplant a good portion of my (admittedly low) Facebook usage.

Overall, I think if the bootstrapping issues can be addressed then this is going to be a really useful addition to the space that Facebook, Twitter and LinkedIn occupy.

Posted in Technology

Getting Windows 7 onto a USB stick using Ubuntu

I spent way too much time trying to do this so maybe this will save someone else some time.

I haven’t owned a copy of Windows for years, and have been using Ubuntu as my solitary home and office OS for some time. Last week though, I decided I should have a copy of Windows 7 since there’s some software I can’t run without it (my hardware is pretty new, and it seems to take Linux about 6-12 months to catch up with changes in hardware, especially graphics).

So, after a bit of hunting, I discovered it was possible to purchase Windows 7 in downloadable ISO format at http://emea.microsoftstore.com

Great, I thought, that’s pretty progressive of Microsoft! Unfortunately their helpfulness ends there, as the only instructions for actually using the ISO revolve around DVD burners and/or assume you already have Windows (of some sort) installed. I didn’t buy a DVD drive with my last computer build because I haven’t actually needed to use one for a long time.

So, if, like me, you have Ubuntu as your only OS and no DVD drive, it seems like you’re kinda up the creek. Fortunately, this is actually pretty easily remedied. This guide does assume some familiarity with both Ubuntu and partitioning and I’m not big on tech support, so consider yourself forewarned :)

You need:

  1. A USB key that you can afford to wipe completely (at least 4GB)
  2. Your Windows 7 ISO of choice (32 bit and 64 bit varieties are available from the store) and the serial (it’ll be emailed to you).
  3. A connection to the Internet on your Ubuntu machine (or just read the guide and preinstall anything you see mentioned)
  4. A willingness to nuke whatever OS you already have installed, because Windows takes no prisoners (blindly overwrites the Master Boot Record).

Now, technically 4 can be alleviated by recovering grub, but it’s a bit of a roundabout process, and my Ubuntu machines are all basically thin clients. Important documents are stored in Dropbox, my code is on Github, my music, etc is on external drives and almost everything else is also stored remotely or run in the browser. If you don’t want to lose your Ubuntu install, I imagine the guide on the Ubuntu Community Wiki is where you’ll want to look after this.

Important: when you’re downloading the ISO, do not bother with “Part 1” – it is a boot sector editor for Windows and will be useless to you. Download only “Part 2” as it’s the ISO (you can only do this about three times, so try to download and store the ISO the first time round).

Partitioning the USB key

We’re going to prepare this USB key with GParted. Disk Utility won’t do (since it won’t allow you to set a partition as bootable). You might have GParted installed already, but if you’re running a fairly fresh copy of Ubuntu, you’ll need to install it using Synaptic. Once it’s installed, you should find it under System->Administration->GParted Partition Editor.
Insert your USB key, start up GParted and select the key from the drop down in the top right. It’ll be something like /dev/sdc, with the size of the drive in brackets next to it.
Delete any partitions that are on the drive, and create a new NTFS one, applying the changes.
Right click on the new partition, and select “Manage Flags”. Set the “boot” flag and nothing else.

Extracting the ISO

You may need to eject and re-insert the drive for it to appear. At this point you need to open up your freshly downloaded ISO using the Archive Manager. Do not “mount” the ISO as it will not be usable.
Extract the contents of the archive straight onto your newly formatted (and bootable) USB key. Once this is completed, eject the drive and voila, you have a bootable USB key with Windows 7 on it!

Posted in Technology

Searching Boards.ie – Solr, EC2, SQS, SNS, Node.js

This is the first in a series of posts about the design and implementation of a search engine for Boards.ie.

Boards.ie recently launched a new search engine – http://www.boards.ie/search/ – which is built upon Amazon Web Services using Solr with PHP as the glue.

Currently, Boards.ie users are searching nearly 30 million posts almost a million times a month.

A little background

Solr has been used in production by Distilled Media (formerly the Daft Media Group) for a couple of years now, first being tested on LetsRent.ie, then for powering the maps functionality on Daft.ie, and even more recently as the core technology for the relaunch of Adverts.ie.

Boards.ie’s usage has been an interesting look into the future for Daft.ie and Adverts.ie; with nearly 26 million posts – “documents” in Solr nomenclature – and growing by more than a million again every month, it presents a challenge in providing a reliable, affordable, relevant and fast search solution to our users.

Boards.ie has relied on MySQL’s “Fulltext” search as long as it has been using MySQL (forever). As time has passed, more and more restrictions had to be placed on search in order to keep it online, and the sheer volume of data was making results less and less useful to people well used to using Google search.

There have been a number of challenges to solve along the way:

Can we make search faster and more relevant with what we have already?

This is sort of an obvious question, but we didn’t want to jump the gun and leap into an entirely new technology stack if it were possible for us to overhaul what was currently there.

The answer was not particularly straightforward, but any solution that involved keeping the MySQL fulltext engine as our primary search system involved a lot of new beefy hardware and a lot of the same slowness, punting relevance issues down the line. As a result, we decided that it was not a realistic option.

Which search technology should we use?

There were only a couple of choices available to us when we started working on this project. First, we wanted to use Open Source – there are plenty of good reasons to choose either open source or proprietary software depending on your requirements, and we have experience with both. We wanted the flexibility to hack away at what we were using and share back to the community where possible. We wanted to be able to scale horizontally without worrying about skyrocketing licensing costs. We wanted access to the huge amount of expertise and goodwill that comes with open source projects, as well as experimental work done by other developers.

It also had to be compatible with the rest of our infrastructure and fit with our areas of experience – we’re pretty evenly split between FreeBSD and Linux on the infrastructure side; we didn’t want to introduce an entirely new stack (Microsoft) if it could be avoided.

There were two major options for us as a result, SphinxSearch and Apache Solr. SphinxSearch appeared to have successful deployments with vBulletin (our forum software), but all the experience in the Distilled Group was with Solr. Daft.ie had already deployed successful Solr installations, so we decided to go with what we had some experience with.

What infrastructure should we use?

We had the option of building using our own hardware, or making the leap to introducing cloud infrastructure. Moving partially into the cloud introduced a data transportation issue – how were we going to get data into the cloud while keeping our core infrastructure in Digiweb? Latency and data transfer costs were likely to be show stoppers. Amazon had recently introduced a new service in beta, Simple Notification Service (SNS). We were pretty sure we could combine this with their Simple Queue Service (SQS) to create a scalable message routing and queuing system for posts. With a few hiccups and some conversation with Amazon along the way, we were successful. As it turns out, getting the data out of our core infrastructure and into Amazon is the cheapest component of the entire operation. For the tens of thousands of messages we send to Amazon on a daily basis, we’re billed a couple of dollars a month.
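The SNS-to-SQS routing described above can be sketched roughly as follows. This is not the code that runs on Boards.ie (the glue there is PHP); it is a Python illustration that assumes boto3-style `sns` and `sqs` client objects are passed in, that the queue is subscribed to the topic without raw message delivery, and that `handle_post` is whatever feeds a post to the indexer.

```python
import json


def publish_post(sns, topic_arn, post):
    """Publish one post to an SNS topic as a JSON string."""
    return sns.publish(TopicArn=topic_arn, Message=json.dumps(post))


def drain_queue(sqs, queue_url, handle_post):
    """Read one batch from an SQS queue subscribed to the topic, handle it, delete it."""
    response = sqs.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=10
    )
    handled = 0
    for message in response.get("Messages", []):
        # Without raw delivery, SNS wraps the payload in a JSON envelope.
        envelope = json.loads(message["Body"])
        handle_post(json.loads(envelope["Message"]))
        # Only delete once the post has been handled, so a failure leaves it on the queue.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
        handled += 1
    return handled
```

Passing the clients in keeps the sketch independent of credentials and makes it easy to exercise with stand-in objects.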

What are the tradeoffs?

There are always tradeoffs.

The advantages and disadvantages of SQL and NoSQL solutions have been covered at length; I’ll just highlight the areas that were significant for us.

With MySQL, our data was/is potentially consistent. For our usage, it is consistent enough for us to call it the canonical source of our data. Occasionally it chews up a post or thread, but rarely one that can not be rewritten.

The information is also effectively instantly available upon submission. There’s a lag of up to a couple of seconds occasionally as the data waits to propagate across our MySQL cluster due to an extended lock of some sort, but it’s generally not particularly noticeable to the average user.

With the new search system we have sacrificed some consistency and how immediately data is made available to the searcher. For instance, posts sometimes (more frequently than with MySQL) do not make it into the search system. There is multiple redundancy built into the system to limit this, but occasionally a full resynchronization is required. We judged this to be an acceptable cost.

Immediacy has been sacrificed such that it usually takes about 2-5 minutes for a new post or update to an old one to become available in searches, occasionally an hour, infrequently a day, and very infrequently a couple of months (relating back to consistency). We determined that the normal search profile does not require the absolute newest data, only the most relevant.

At the time we made these decisions, losing real time search was the one I was least happy about, but for most users it does not appear to be a concern. In hindsight it’s almost obvious why – Boards.ie is a gigantic repository of historical information. The value is in being able to search this rich back catalogue of conversation, opinion and information, not just the most recent.

Even at that, recency is only sacrificed when measured in seconds.

Architectural choice

As useful as Solr has been for us, it’s a bit of a black box in our architecture. Trying to run Solr through a debugger remotely is a gigantic pain in the arse and not something I had much success with. Fortunately, the Jetty error logs are enough to illuminate most problems. The main pain points for us were:

  • Memory management. It’s perhaps unfair to take issue with Solr over this – after all, it’s a far cry from MySQL’s famously labyrinthine memory usage configuration options, and Solr’s problems are really the restrictions of the JVM – but it feels like a single purpose machine with one major application should be able to figure out how much RAM to dedicate to disk cache and how much the application should get for optimal performance.
  • XML everywhere. It’s inescapable, and again, this can be ascribed to Java culture, but damn.
  • Fixed schema. I suppose this is an ideological argument, but during development this was tedious.
  • The book is two inches thick and you need to know it. If you’re building a search service that you expect to take a lot of traffic and contain a lot of documents (why else would you be interested, right?) you will simply have to know all (or at least a good chunk of) the features, quirks, optimizations and architectural choices.

Solr is now a pretty well field-tested application in Distilled, and I’m pretty sure I could rapidly prototype an installation and have it up and running in production inside a week for another site, but I would prefer to investigate ElasticSearch the next time I am revisiting search options.

Further development

During the course of development, I wrote a small node.js server to speed up the relay of post data from our web servers to Amazon SNS. Due to the limitations of deploying node.js on our FreeBSD architecture at the time, this system is, unfortunately, yet to be implemented. Once it is, however, write operations on the site should speed up noticeably for end users. Once this is up and running and has the bugs ironed out, it will be released as an open source project on github.

A side effect of this is that I contributed some code to the AWS Library node.js project, something I would like to continue doing.

I have also amended chunks of the Solr PHP client with a sharded Solr deployment in mind, but it’s quite clunky and I would prefer to have another stab at it, maybe writing my own Solr PHP client and making it available.
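For context on what a shard-aware client has to do: Solr’s distributed search takes a `shards` parameter listing the cores to fan the query out to, alongside the usual `q` and `fq` parameters. Here is a small Python sketch of building such a request URL. The host names and the `forumid` field are invented for the example, and this is not the amended PHP client.

```python
from urllib.parse import urlencode


def sharded_select_url(base_url, shards, query, filter_queries=(), rows=10):
    """Build a /select URL that asks one Solr node to query across several shards."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    # Each filter query is sent as its own fq parameter.
    params.extend(("fq", fq) for fq in filter_queries)
    if shards:
        # Shards are given as host:port/path entries, comma separated, without a scheme.
        params.append(("shards", ",".join(shards)))
    return "%s/select?%s" % (base_url.rstrip("/"), urlencode(params))


url = sharded_select_url(
    "http://solr1.example:8983/solr",
    ["solr1.example:8983/solr", "solr2.example:8983/solr"],
    "hello world",
    filter_queries=["forumid:7"],
)
```

Any node can act as the coordinator here; which one receives the request is a deployment decision the sketch leaves open.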

In the next posts on this topic, I plan to dive more deeply into the specifics of our implementation, and hopefully release some code in tandem.

Posted in Technology

EC2: Create AMI from a running instance

Log into the AWS web console, find the instance you want to create an AMI from, right click and select “Create Image (EBS AMI)”. Follow the wizard.

All the top results in Google are for The Long Way™ to do this. The Long Way has a bunch of useful things to take into consideration (security, for example), but the actual process of creating an AMI from a running instance has been made simpler by Amazon in the last few months.

Right-click context menus in web pages aren’t the most obvious metaphor for people who have been using the web for a number of years. I know people new to EC2 who have completely missed this and could have saved themselves some time. Hopefully this will bubble up and save people a half hour or so :)
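If you would rather script it than click, the same operation is exposed through the EC2 API as CreateImage. A minimal Python sketch follows, assuming a boto3 EC2 client is passed in; the `create_ami` helper and the IDs in the usage comment are placeholders.

```python
def create_ami(ec2, instance_id, name, no_reboot=False):
    """Ask EC2 to create an EBS-backed AMI from an instance and return the new image ID."""
    # With NoReboot=True the instance keeps running, but the filesystem
    # state captured in the image is not guaranteed to be consistent.
    response = ec2.create_image(InstanceId=instance_id, Name=name, NoReboot=no_reboot)
    return response["ImageId"]


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   image_id = create_ami(boto3.client("ec2"), "i-0123456789abcdef0", "my-instance-image")
```

The security considerations from The Long Way still apply to whatever ends up baked into the image.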

Posted in Technology

Gender breakdown for software development in Ireland

This post was inspired by this TED talk from Sheryl Sandberg, which reminded me of this brilliant blog post by Jolie O’Dell. The whole women in tech thing is something which interests me (as a techie, as well as a human being), and there are loads of great discussions (and terrible ones) about it.

I have been trying to figure out a good way to get an idea of the size and gender breakdown of the somewhat nebulous software development community in Ireland. I’ve settled on LinkedIn as probably the best way to figure this out; Facebook appears to be useless for it (despite having that information). I want to compare these to the official numbers provided by government and other agencies to see how they measure up, though it involves a good few caveats.

These numbers are at least interesting to play around with. I suggest trying out LinkedIn’s DirectAds system yourself; you don’t have to pay to get to the “targeting” stuff at step 2.

Currently, LinkedIn lists 416,030 people as working in Ireland. The National Skills Bulletin 2010 (pdf) states that there were 1.88 million people working in full or part time employment in Q4 2009, which would put LinkedIn at about 22% of the workforce. As that number has probably only gone down since then, I think LinkedIn has a pretty sizeable chunk of the workforce listed; maybe as much as a quarter of the total, though probably with a heavy bias towards networking-enabled industries with strong computer usage.

Of those listed on LinkedIn, 37,322 describe themselves as working in “Engineering” or “Information Technology” – two quite broad descriptions, both of which will include a portion of people who have absolutely nothing to do with software development. At this level, the breakdown is 28,387 male, 6,722 female. Looks to be about 2,213 who have not listed gender.

It gets tricky at this point – if you wish to break down by Industry, you’ve got a max of ten to choose from a list of hundreds, and techies work in almost every industry. To try and extract the main corpus of “software developers” from the Industry section, I did a quick poll of my own LinkedIn connections and chose: Computer Hardware, Computer Software, Computer Networking, Internet, Information Technology and Services, Computer & Network Security, Wireless, Online Media, Publishing, and Information Services. I figure this cuts out most of the mechanical/bio/pharma engineers.

This gives me 13,080 people in total. The gender breakdown for this is 10,141 male, 2,148 female (791 unknown).

To see how these numbers match up with regular LinkedIn search (if I were looking for software people), I searched for people in Ireland with a couple of different keywords as their current job title. “Engineer” turned up 13,848 (24,446 including past jobs). “Systems Administrator” turned up 330 (997 including past jobs). “Developer” turned up 2,930 (6,414 including past jobs) – not sure how many of these might be “business developer, property developer” or similar, but the first few pages of results looked about right. “Operations” turned up too many misses (COO, ad operations, etc), so haven’t included it.

This tots up to about 17,108 people. I might guess that my DirectAds version of this is missing 4,028 people who have listed themselves as being in finance, banking, etc, or engineers who have nothing to do with software. Either way, not totally disparate figures.

So, does 13,000 sound like a good ballpark figure for the number of software developers in Ireland? Roughly 83% male vs 17% female, going by those who list a gender?
CareersPortal.ie has some information compiled from the CSO and the National Skills Bulletin 2010. They conclude that there are 9,000 people employed in software development in Ireland, and that the gender breakdown is 89% male, 11% female. Interesting that while there’s a difference in the “totals” for each, the percentages are pretty close. Considering the time gap between the Forfás data (about a year old, sometimes more) and the LinkedIn (almost realtime), I think those numbers are a pretty good indicator.
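The percentages shift a little depending on whether the people with no gender listed are counted, so here is the arithmetic on the LinkedIn figures above as a quick Python sketch (the `shares` helper is just for illustration):

```python
def shares(male, female, total):
    """Percentage breakdown, with and without the people who list no gender."""
    known = male + female
    return {
        "unknown": total - known,
        "male_of_known": round(100.0 * male / known, 1),
        "female_of_known": round(100.0 * female / known, 1),
        "female_of_total": round(100.0 * female / total, 1),
    }


# The ten selected industries: 13,080 people, 10,141 male, 2,148 female.
narrow = shares(10141, 2148, 13080)
# -> unknown 791, male 82.5% and female 17.5% of those listing a gender,
#    female 16.4% of everyone including unknowns

# "Engineering" or "Information Technology": 37,322 people, 28,387 male, 6,722 female.
broad = shares(28387, 6722, 37322)
```

Either way, the female share on LinkedIn comes out at around 16 to 17 percent for the narrower selection, a few points above the 11% in the official figures.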

Posted in Technology