Update (24th March): Stop your engines! The response has been amazing: in 24 hours we’ve managed to crawl almost everything, and are processing the final few batches now. No need for more instances!
I’ve created a tool to help people who decided to fire up a whole bunch of spot instances slowly trim down their cluster as the workload winds down. Check out Slayer.
Original post:
You may have missed the news that Yahoo is shuttering its old message boards, taking a huge amount of Internet history with it. Old news, maybe, but Internet historians (or people’s family, friends, or just interested parties) should have that data available to them in future.
Here’s where the Archive Team, fronted by Jason Scott (of archive.org), comes to the rescue.
In this instance, the Yahoo Message Boards are shutting down in just eight days. There’s been very little notice and it’s become a race to try and get the entire history of the boards before they’re wiped from the Internet. The Archive Team have supplied a virtual appliance for use with Virtualbox, VMware, etc, as a sort of folding@home style distributed system. Unfortunately Yahoo is rate limiting, which is making for slow progress even with all those who are helping.
Over at the Archive Tracker, you can see the real time progress of the archival process.
If you can afford to throw a couple of dollars at this and have an AWS account, this is a good opportunity to experiment with spot instances on EC2 – I did.
I set it up to use micros at a cost of $0.005/hour, a good chunk below the standard on-demand price for a micro, which is already quite low. I’m rocketing my way up the rankings right now (duggan on the leaderboard).
To make it super easy, I’ve thrown together a public AMI (search for 149682410612 or see AMI id list below) which takes a username in the userdata field (if blank it’ll default to “hackernews”).
It’s just a basic Alestic Ubuntu 12.04 LTS image with some short installation scripts.
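For the curious, here is a minimal sketch of how a boot script could pick up the username from the userdata field and fall back to the default. This is not the actual script on the AMI (that one is linked further down); the function names are made up for illustration, and the only real pieces are the standard EC2 metadata URL and the "hackernews" default mentioned above.

```python
import urllib.request

# Standard EC2 instance metadata endpoint for user data.
USER_DATA_URL = "http://169.254.169.254/latest/user-data"
DEFAULT_USERNAME = "hackernews"  # the fallback named in the post


def fetch_user_data(url=USER_DATA_URL, timeout=2):
    """Return the raw user data string, or "" if none can be read."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:
        # No user data was set (HTTP 404), or we are not on EC2 at all.
        return ""


def resolve_username(raw_user_data, default=DEFAULT_USERNAME):
    """Use the first non-empty line of the user data, else the default."""
    for line in raw_user_data.splitlines():
        candidate = line.strip()
        if candidate:
            return candidate
    return default
```

On an instance, `resolve_username(fetch_user_data())` would give the name to report to the tracker under these assumptions.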
Update: due to demand, I’ve made the image available in all other regions:
N. Virginia: ami-2400984d
Ireland: ami-d8d2d8ac
Tokyo: ami-a361e1a2
Singapore: ami-6e703c3c
Sydney: ami-4e0e9f74
São Paulo: ami-9d7aa180
N. California: ami-94f6dbd1
Oregon: ami-cf9206ff
All will be called “ArchiveTeam Warrior Yahoo Messages” under account 149682410612 (you can also search for this number in the AMI screen). This means that you can now run a max of about 800 spot instances per account (max of 100 per region) without going to Amazon to look for more.
The script for setting up the image (without the scrubbing of history and private keys) can be found here, if you’d rather not trust my AMI. It’s not complex.
Help save a bit of web history!
Note: The reason I’m advocating spot instances to get around the rate limiting is that I don’t believe the rate limiting is intentional on Yahoo’s part (in the context of Archive Team’s cause), just bureaucratic slowness in removing anti-spam measures. Hopefully they’ll get on the case next week, but until then, we should do what we can.
Edit: conroy on Hacker News has provided this little snippet of Python for those who use boto:
    import boto.ec2
    conn = boto.ec2.connect_to_region("us-east-1")
    conn.request_spot_instances('0.005', 'ami-2400984d', instance_type='t1.micro', user_data='USERNAME')
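Building on that snippet, here is a rough sketch of spreading requests over every region in the AMI list above. The region codes are the standard AWS names for those locations and the AMI ids are copied from the list. The `build_requests` and `submit` helpers are my own naming, not part of boto. Treat it as a starting point, not something I have run at scale.

```python
# Region code -> public AMI id, copied from the list in the post.
AMI_BY_REGION = {
    "us-east-1": "ami-2400984d",       # N. Virginia
    "eu-west-1": "ami-d8d2d8ac",       # Ireland
    "ap-northeast-1": "ami-a361e1a2",  # Tokyo
    "ap-southeast-1": "ami-6e703c3c",  # Singapore
    "ap-southeast-2": "ami-4e0e9f74",  # Sydney
    "sa-east-1": "ami-9d7aa180",       # Sao Paulo
    "us-west-1": "ami-94f6dbd1",       # N. California
    "us-west-2": "ami-cf9206ff",       # Oregon
}


def build_requests(username, per_region, max_price="0.005"):
    """Plan one spot request per region (hypothetical helper)."""
    return [
        {
            "region": region,
            "price": max_price,
            "image_id": ami,
            "count": per_region,
            "instance_type": "t1.micro",
            "user_data": username,
        }
        for region, ami in sorted(AMI_BY_REGION.items())
    ]


def submit(requests):
    """Send the planned requests to EC2 using boto (needs AWS credentials)."""
    import boto.ec2  # imported lazily so planning works without boto installed

    for req in requests:
        conn = boto.ec2.connect_to_region(req["region"])
        conn.request_spot_instances(
            req["price"],
            req["image_id"],
            count=req["count"],
            instance_type=req["instance_type"],
            user_data=req["user_data"],
        )
```

Calling `submit(build_requests("USERNAME", 100))` would ask for 100 micros in each of the eight regions, which is where the figure of about 800 per account comes from.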
In the comments, Ian McEwan has put together a quick and dirty guide for the AWS uninitiated:
a.) the AMI is only in us-east, as far as I can tell
b.) once you have an AWS account, go to the dashboard and to “AMIs” in the sidebar
c.) search “public images” and “all platforms”, wait a while for it to actually finish searching, filter to ‘warrior’, and choose the one that matches the number here
d.) click ‘spot request’ button, fill in form with price/etc. I’d recommend turning “Persistent” on and setting an end date of April 1.
e.) click through and do whatever bits it asks, mostly you don’t need to care
f.) profit
14 Comments
If the instance dies, will its reserved items or uploads get corrupted?
I would really like to support The Archive and run some micros for the megacrawl but I have no experience at all with setting up anything on AWS. Since running the Warrior appliance on my own machine helps only a small bit because of the rate limiting, any chance you would elaborate a bit how to run your AMI on AWS? Thanks!
Where is that AMI? can’t find it in my aws console…
That’s a cost I can bear. Someone just has to come up with a tut/walkthrough about how to do it on/with AWS for mere mortals.
Basic attempt at a “mere mortals” version:
a.) the AMI is only in us-east, as far as I can tell
b.) once you have an AWS account, go to the dashboard and to “AMIs” in the sidebar
c.) search “public images” and “all platforms”, wait a while for it to actually finish searching, filter to ‘warrior’, and choose the one that matches the number here
d.) click ‘spot request’ button, fill in form with price/etc. I’d recommend turning “Persistent” on and setting an end date of April 1.
e.) click through and do whatever bits it asks, mostly you don’t need to care
f.) profit
re: retries, the archiveteam tracker will automatically do that via some criteria — if nothing else, when it runs out of other items to do, AFAICT, it’ll reassign old ones. Code is on github if someone’s inspired to go spelunking.
Scott: it uploads periodically, so only the last batch it was crawling will be lost.
JB: you may need to search under “all AMIs”
AK/Andyfoo: I’ll see what I can come up with.
This sounds like a silly question, but what happens if I request 10 instances instead of the default 1? Will I be doing 10x as much good for the project (at 10x the AWS bill)? Assuming I can afford it, is there a practical limit to how many instances I can run?
You’ll be doing about 10x as much good/price, yes!
There’s a practical limit after which you have to contact Amazon to increase it. That limit is 100 spot instances (or 20 on-demand), according to AWS reps on Quora.
Trying to follow the steps you have listed in the gist there, but this fails:
# git clone https://gist.github.com/5226491.git setup-config && cd setup-config
Cloning into setup-config…
error: The requested URL returned error: 403 Forbidden while accessing https://gist.github.com/5226491.git/info/refs
fatal: HTTP request failed
Thoughts?
Could be just a problem using HTTP, I’ve changed the gist to use a git URI instead, try that.
Including the cost of bandwidth downloading from Yahoo and uploading to Archive Team, how much is each of your instances costing you per day?
Not 100% sure on the individual cost as I’m only monitoring aggregate bandwidth and not tracking when instances are killed or reappear, but I’m running about 300 right now.
That’s at about $0.003 per hour per instance, and only charged for outgoing bandwidth (about 10 GB for all so far). Ballpark figure is about $23 per day. It’ll fluctuate up occasionally, but I’ve set my payment limit to $0.005, so there’s a roof.
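The arithmetic behind that ballpark, as a small sketch. The instance count and hourly prices are the ones quoted above; the bandwidth arguments are placeholders left at zero, since I haven't worked out a per-day bandwidth figure.

```python
def daily_cost(instances, price_per_hour, outbound_gb_per_day=0.0, price_per_gb=0.0):
    """Estimate the daily bill: compute hours plus outgoing bandwidth."""
    compute = instances * price_per_hour * 24
    bandwidth = outbound_gb_per_day * price_per_gb
    return compute + bandwidth


# Roughly what I'm paying now: 300 instances at about $0.003/hour.
typical = daily_cost(300, 0.003)

# Worst case if every instance ran at my $0.005 bid ceiling.
ceiling = daily_cost(300, 0.005)

print("typical compute per day: $%.2f" % typical)  # $21.60
print("ceiling compute per day: $%.2f" % ceiling)  # $36.00
```

Compute alone comes to about $21.60 a day, so the rest of the $23 ballpark is bandwidth and price fluctuation.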
I built 2 CloudFormation templates to allow you to easily spin up a ton of these things across multiple availability zones:
With a keypair (so you can log in to the host)
http://files.wordsaboutbytes.com/yahoo-messages-save.cf.txt
Without a keypair (can’t log in locally, but it will run)
http://files.wordsaboutbytes.com/yahoo-messages-save-nokeypair.cf.txt
1. Open the console
2. Go to CloudFormation
3. Give your stack a name.
4. Select the file you downloaded from above
5. Click Next.
6. Fill in the parameters here (number of instances, the nick you want to be tracked with at the Archive Team site, the spot price you are willing to pay, and optionally a keypair if you selected that file).
7. Check the box at the bottom acknowledging that the template will create IAM resources (used by the host to bootstrap).
8. Click Continue.
9. Add tags if you want, or click Continue.
10. Review. Click Continue.
11. Close.
This will launch however many instances you told it to, as t1.micros, at the spot price you set. When you want to stop, just go and delete the stack in this console and everything should go away.
Oh yeah, the CloudFormation templates I created use the Amazon Linux AMI for each region, so this way the person who created the other AMIs doesn’t need to worry about maintaining them, or paying for access to them. These are standard across regions, are free, and might work a slight bit better too!