Update (24th March): Stop your engines! The response has been amazing: in 24 hours we’ve managed to crawl almost everything, and we’re processing the final few batches now. No need for more instances!
I’ve created a tool to help people who decided to fire up a whole bunch of spot instances slowly trim down their cluster as the workload winds down. Check out Slayer.
You may have missed the news that Yahoo is shuttering its old message boards, taking a huge amount of Internet history with it. Old news, maybe, but Internet historians (or people’s family, friends, or just interested parties) should have that data available to them in future.
In this instance, the Yahoo Message Boards are shutting down in just eight days. There’s been very little notice, and it’s become a race to try to get the entire history of the boards before they’re wiped from the Internet. The Archive Team have supplied a virtual appliance for use with VirtualBox, VMware, etc., as a sort of volunteer-computing-style distributed system. Unfortunately Yahoo is rate limiting, which is making for slow progress even with all those who are helping.
Over at the Archive Tracker, you can see the real time progress of the archival process.
If you can afford to throw a couple of dollars at this and have an AWS account, this is a good opportunity to experiment with spot instances on EC2 – I did.
I set it up to use micros at a cost of $0.005/hour, which is a good chunk below the standard on-demand price for a micro, itself already quite low. I’m rocketing my way up the rankings right now (duggan on the leaderboard).
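For a rough sense of what "a couple of dollars" means, here is a back-of-the-envelope sketch in Python. It assumes the spot price stays at the $0.005/hour bid for the whole run (spot prices fluctuate, and the bid is only a ceiling), and the function name and cluster sizes are purely illustrative.

```python
from decimal import Decimal

BID_PER_HOUR = Decimal("0.005")  # the t1.micro spot bid quoted above, in USD


def estimated_cost(instances, days, price_per_hour=BID_PER_HOUR):
    """Upper-bound cost in USD if every instance runs the whole time at the bid price."""
    hours = days * 24
    return price_per_hour * instances * hours


# Hypothetical cluster sizes, running for the eight days left before the shutdown.
for n in (1, 10, 100):
    print(f"{n:>3} instances x 8 days: ${estimated_cost(n, 8):.2f}")
```

Under those assumptions a single micro costs under a dollar for the full eight days, and a hundred of them come to roughly $96.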
To make it super easy, I’ve thrown together a public AMI (search for 149682410612 or see AMI id list below) which takes a username in the userdata field (if blank it’ll default to “hackernews”).
It’s just a basic Alestic Ubuntu 12.04 LTS image with some short installation scripts.
Update: due to demand, I’ve made the image available in all other regions:
N. Virginia: ami-2400984d
Ireland: ami-d8d2d8ac
Tokyo: ami-a361e1a2
Singapore: ami-6e703c3c
Sydney: ami-4e0e9f74
São Paulo: ami-9d7aa180
N. California: ami-94f6dbd1
Oregon: ami-cf9206ff
All will be called “ArchiveTeam Warrior Yahoo Messages” under account 149682410612 (you can also search for this number in the AMI screen). This means that you can now run a max of about 800 spot instances per account (max of 100 per region) without going to Amazon to look for more.
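If you want to script against all eight regions, something like the following Python sketch may help. The AMI IDs are copied from the list above; the region codes are the standard AWS identifiers for those region names; `PER_REGION_LIMIT` and `plan_requests` are hypothetical names introduced here, and the 100-per-region figure is simply the limit mentioned above. The sketch only plans the spread and does not talk to AWS.

```python
# AMI IDs from the list above, keyed by the standard AWS region code.
AMI_BY_REGION = {
    "us-east-1": "ami-2400984d",       # N. Virginia
    "eu-west-1": "ami-d8d2d8ac",       # Ireland
    "ap-northeast-1": "ami-a361e1a2",  # Tokyo
    "ap-southeast-1": "ami-6e703c3c",  # Singapore
    "ap-southeast-2": "ami-4e0e9f74",  # Sydney
    "sa-east-1": "ami-9d7aa180",       # São Paulo
    "us-west-1": "ami-94f6dbd1",       # N. California
    "us-west-2": "ami-cf9206ff",       # Oregon
}

PER_REGION_LIMIT = 100  # spot instance limit per region, as stated above


def plan_requests(total_instances):
    """Spread total_instances over the regions, filling each up to the limit in turn."""
    plan = {}
    remaining = total_instances
    for region, ami in AMI_BY_REGION.items():
        count = min(remaining, PER_REGION_LIMIT)
        if count == 0:
            break
        plan[region] = (ami, count)
        remaining -= count
    if remaining > 0:
        raise ValueError("more instances requested than the regions allow")
    return plan


# The ceiling across all regions: 8 regions x 100 instances.
print(len(AMI_BY_REGION) * PER_REGION_LIMIT)
```

Each entry of the returned plan gives the AMI and instance count to request in that region, for example with the boto call shown further down.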
The script for setting up the image (without the scrubbing of history and private keys) can be found here, if you’d rather not trust my AMI. It’s not complex.
Help save a bit of web history!
Note: The reason I’m advocating using spot instances to get around the rate limiting is that I don’t believe the rate limiting is intentional on Yahoo’s part (in the context of Archive Team’s cause), just bureaucratic slowness in removing anti-spam measures. Hopefully they’ll get on the case next week, but until then, we should do what we can.
Edit: conroy on Hacker News has provided this little snippet of Python for those who use boto:
    import boto.ec2

    # Connect to N. Virginia and bid $0.005/hour for one t1.micro running the AMI
    conn = boto.ec2.connect_to_region("us-east-1")
    conn.request_spot_instances('0.005', 'ami-2400984d', instance_type='t1.micro', user_data='USERNAME')
In the comments, Ian McEwan has put together a quick and dirty guide for the AWS uninitiated:
a.) the AMI is only in us-east, as far as I can tell
b.) once you have an AWS account, go to the dashboard and to “AMIs” in the sidebar
c.) search “public images” and “all platforms”, wait a while for it to actually finish searching, filter to ‘warrior’, and choose the one that matches the number here
d.) click ‘spot request’ button, fill in form with price/etc. I’d recommend turning “Persistent” on and setting an end date of April 1.
e.) click through and do whatever bits it asks, mostly you don’t need to care
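For those who would rather script steps d and e than click through the console, here is a minimal Python sketch. It assumes boto 2 is installed with AWS credentials configured, and it uses the `count`, `type` and `valid_until` parameters of boto's `request_spot_instances`. The helper names `build_spot_request` and `submit` are hypothetical, the end year is left as a parameter because the guide only says "April 1", and the exact timestamp format is an assumption (an ISO 8601 string).

```python
def build_spot_request(username, end_year, count=1):
    """Keyword arguments for a persistent spot request that expires on April 1."""
    return {
        "price": "0.005",                 # bid in USD per hour, as in the post
        "image_id": "ami-2400984d",       # the N. Virginia AMI
        "count": count,                   # number of instances to request
        "type": "persistent",             # re-submit the request if an instance is terminated
        "valid_until": f"{end_year}-04-01T00:00:00Z",  # assumed ISO 8601 end date
        "instance_type": "t1.micro",
        "user_data": username,            # the Archive Team leaderboard name
    }


def submit(conn, params):
    """Send the request through an existing boto EC2 connection."""
    return conn.request_spot_instances(**params)
```

With a connection from `boto.ec2.connect_to_region("us-east-1")`, calling `submit(conn, build_spot_request("USERNAME", end_year, count=10))` would ask for ten persistent micros, where `end_year` is the current year.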