ArchiveTeam + Yahoo Messages Shuttering + EC2 Spot Instances = MegaCrawl

Update (24th March): Stop your engines! The response has been amazing: in 24 hours we’ve managed to crawl almost everything, and we’re processing the final few batches now. No need for more instances!

I’ve created a tool to help people who decided to fire up a whole bunch of spot instances slowly trim down their cluster as the workload winds down. Check out Slayer.

See the follow-up post.

Original post:


You may have missed the news that Yahoo is shuttering its old message boards, taking a huge amount of Internet history with it. Old news, maybe, but Internet historians (or people’s family, friends, or just interested parties) should have that data available to them in future.

Here’s where the Archive Team, fronted by Jason Scott (of archive.org), comes to the rescue.

In this instance, the Yahoo Message Boards are shutting down in just eight days. There’s been very little notice and it’s become a race to try and get the entire history of the boards before they’re wiped from the Internet. The Archive Team have supplied a virtual appliance for use with VirtualBox, VMware, etc., as a sort of SETI@home-style distributed system. Unfortunately Yahoo is rate limiting, which is making for slow progress even with all those who are helping.

Over at the Archive Tracker, you can see the real time progress of the archival process.

If you can afford to throw a couple of dollars at this and have an AWS account, this is a good opportunity to experiment with spot instances on EC2 – I did.
I set it up to use micros at a cost of $0.005/hour, a good chunk below the standard on-demand price for a micro, which is already quite low. I’m rocketing my way up the rankings right now (duggan on the leaderboard).

To make it super easy, I’ve thrown together a public AMI (search for 149682410612 or see AMI id list below) which takes a username in the userdata field (if blank it’ll default to “hackernews”).
It’s just a basic Alestic Ubuntu 12.04 LTS image with some short installation scripts.

Update: due to demand, I’ve made the image available in all other regions:

N. Virginia: ami-2400984d
Ireland: ami-d8d2d8ac
Tokyo: ami-a361e1a2
Singapore: ami-6e703c3c
Sydney: ami-4e0e9f74
São Paulo: ami-9d7aa180
N. California: ami-94f6dbd1
Oregon: ami-cf9206ff

All will be called “ArchiveTeam Warrior Yahoo Messages” under account 149682410612 (you can also search for this number in the AMI screen). This means that you can now run a max of about 800 spot instances per account (max of 100 per region) without going to Amazon to look for more.
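As a rough illustration of that arithmetic, here is a minimal Python sketch that pairs each AMI above with its EC2 region code and spreads a requested instance count across regions without exceeding the 100-per-region cap. The region codes are the standard EC2 identifiers; the function and variable names are hypothetical, and the sketch only builds a plan, it does not call AWS.

```python
# AMI IDs from the list above, keyed by standard EC2 region code.
AMIS_BY_REGION = {
    "us-east-1": "ami-2400984d",       # N. Virginia
    "eu-west-1": "ami-d8d2d8ac",       # Ireland
    "ap-northeast-1": "ami-a361e1a2",  # Tokyo
    "ap-southeast-1": "ami-6e703c3c",  # Singapore
    "ap-southeast-2": "ami-4e0e9f74",  # Sydney
    "sa-east-1": "ami-9d7aa180",       # Sao Paulo
    "us-west-1": "ami-94f6dbd1",       # N. California
    "us-west-2": "ami-cf9206ff",       # Oregon
}

# Default spot-instance cap per region mentioned in the post.
MAX_SPOT_PER_REGION = 100


def plan_spot_requests(total_instances):
    """Spread total_instances across regions, filling each up to the cap.

    Returns a list of (region, ami_id, count) tuples; hypothetical helper.
    """
    max_total = MAX_SPOT_PER_REGION * len(AMIS_BY_REGION)
    if total_instances < 0 or total_instances > max_total:
        raise ValueError("can request between 0 and %d instances" % max_total)

    plan = []
    remaining = total_instances
    for region, ami_id in AMIS_BY_REGION.items():
        if remaining == 0:
            break
        count = min(remaining, MAX_SPOT_PER_REGION)
        plan.append((region, ami_id, count))
        remaining -= count
    return plan


if __name__ == "__main__":
    for region, ami_id, count in plan_spot_requests(250):
        print(region, ami_id, count)
```

Each tuple in the plan could then be fed to a per-region spot request; filling regions in order is just the simplest choice here, and spreading evenly would work equally well.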

The script for setting up the image (without the scrubbing of history and private keys) can be found here, if you’d rather not trust my AMI :) It’s not complex.

Help save a bit of web history!

Note: I’m advocating using spot instances to get around the rate limiting because I don’t believe the rate limiting is intentional on Yahoo’s part (in the context of Archive Team’s cause), just bureaucratic slowness in removing anti-spam measures. Hopefully they’ll get on the case next week, but until then, we should do what we can.

Edit: conroy on Hacker News has provided this little snippet of Python for those who use boto:

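    import boto.ec2

    # Connect to the region that hosts the AMI (ami-2400984d is the N. Virginia image)
    conn = boto.ec2.connect_to_region("us-east-1")
    # Bid $0.005/hour for a t1.micro; user_data is your tracker username
    conn.request_spot_instances('0.005', 'ami-2400984d',
                                instance_type='t1.micro', user_data='USERNAME')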

In the comments, Ian McEwan has put together a quick and dirty guide for the AWS uninitiated:

a.) the AMI is only in us-east, as far as I can tell
b.) once you have an AWS account, go to the dashboard and to “AMIs” in the sidebar
c.) search “public images” and “all platforms”, wait a while for it to actually finish searching, filter to ‘warrior’, and choose the one that matches the number here
d.) click ‘spot request’ button, fill in form with price/etc. I’d recommend turning “Persistent” on and setting an end date of April 1.
e.) click through and do whatever bits it asks, mostly you don’t need to care
f.) profit
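
For those scripting step d.) with boto rather than clicking through the console, a persistent request with an end date might look roughly like the sketch below. It assumes boto 2's `request_spot_instances` accepts `type='persistent'` and an ISO 8601 `valid_until` string; the helper names, the 2013 year, and the midnight UTC cut-off are my assumptions. The AWS call is kept in a separate function so nothing is launched by accident.

```python
from datetime import datetime


def build_spot_request_kwargs(username, end_date, price="0.005",
                              image_id="ami-2400984d", count=1):
    """Build keyword arguments for boto 2's request_spot_instances.

    Hypothetical helper: it only assembles the arguments, it does not call AWS.
    """
    return {
        "price": price,                 # maximum bid in USD per hour
        "image_id": image_id,           # N. Virginia AMI from the post
        "count": count,                 # number of instances to request
        "type": "persistent",           # request stays open after an instance is reclaimed
        "valid_until": end_date.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "instance_type": "t1.micro",
        "user_data": username,          # tracker username read by the AMI
    }


def submit(kwargs, region="us-east-1"):
    """Send the request; requires boto 2 and configured AWS credentials."""
    import boto.ec2  # imported lazily so the helper above works without boto
    conn = boto.ec2.connect_to_region(region)
    return conn.request_spot_instances(**kwargs)


if __name__ == "__main__":
    # Assumed end date: April 1, 2013, treated as midnight UTC.
    print(build_spot_request_kwargs("USERNAME", datetime(2013, 4, 1)))
```

Calling `submit(build_spot_request_kwargs("USERNAME", datetime(2013, 4, 1), count=10))` would then place the actual request, assuming credentials are set up.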


14 Comments

  1. Posted March 23, 2013 at 9:02 am | Permalink

    If the instance dies, will its reserved items or uploads get corrupted?

  2. Andyfoo
    Posted March 23, 2013 at 10:44 am | Permalink

    I would really like to support The Archive and run some micros for the megacrawl but I have no experience at all with setting up anything on AWS. Since running the Warrior appliance on my own machine helps only a small bit because of the rate limiting, any chance you would elaborate a bit on how to run your AMI on AWS? Thanks!

  3. JB
    Posted March 23, 2013 at 11:19 am | Permalink

    Where is that AMI? can’t find it in my aws console…

  4. AK
    Posted March 23, 2013 at 11:53 am | Permalink

    That’s a cost I can bear. Someone just has to come up with a tut/walkthrough about how to do it on/with AWS for mere mortals.

  5. Posted March 23, 2013 at 1:10 pm | Permalink

    Basic attempt at a “mere mortals” version:

    a.) the AMI is only in us-east, as far as I can tell
    b.) once you have an AWS account, go to the dashboard and to “AMIs” in the sidebar
    c.) search “public images” and “all platforms”, wait a while for it to actually finish searching, filter to ‘warrior’, and choose the one that matches the number here
    d.) click ‘spot request’ button, fill in form with price/etc. I’d recommend turning “Persistent” on and setting an end date of April 1.
    e.) click through and do whatever bits it asks, mostly you don’t need to care
    f.) profit

    re: retries, the archiveteam tracker will automatically do that via some criteria — if nothing else, when it runs out of other items to do, AFAICT, it’ll reassign old ones. Code is on github if someone’s inspired to go spelunking.

  6. Posted March 23, 2013 at 1:10 pm | Permalink

    Scott: it uploads periodically, so only the last batch it was crawling will be lost.

    JB: you may need to search under “all AMIs”

    AK/Andyfoo: I’ll see what I can come up with.

  7. Mark
    Posted March 23, 2013 at 2:13 pm | Permalink

    This sounds like a silly question, but what happens if I request 10 instances instead of the default 1? Will I be doing 10x as much good for the project (at 10x the AWS bill)? Assuming I can afford it, is there a practical limit to how many instances I can run?

  8. Posted March 23, 2013 at 2:34 pm | Permalink

    You’ll be doing about 10x as much good/price, yes!

    There’s a practical limit after which you have to contact Amazon to increase it. That limit is 100 spot instances (or 20 on-demand), according to AWS reps on Quora.

  9. TryingToHelp
    Posted March 23, 2013 at 3:26 pm | Permalink

    Trying to follow the steps you have listed in the gist there, but this fails:

    # git clone https://gist.github.com/5226491.git setup-config && cd setup-config
    Cloning into setup-config…
    error: The requested URL returned error: 403 Forbidden while accessing https://gist.github.com/5226491.git/info/refs

    fatal: HTTP request failed

    Thoughts?

  10. Posted March 23, 2013 at 3:45 pm | Permalink

    Could be just a problem using HTTP, I’ve changed the gist to use a git URI instead, try that.

  11. Mark
    Posted March 23, 2013 at 5:40 pm | Permalink

    Including the cost of bandwidth downloading from Yahoo and uploading to Archive Team, how much are each of your instances costing you per day?

  12. Posted March 23, 2013 at 5:55 pm | Permalink

    Not 100% sure on the individual cost as I’m only monitoring aggregate bandwidth and not tracking when instances are killed or reappear, but I’m running about 300 right now.

    That’s at about $0.003 per hour per instance, and only charged for outgoing bandwidth (about 10 GB for all so far). Ballpark figure is about $23 per day. It’ll fluctuate up occasionally, but I’ve set my payment limit to $0.005, so there’s a roof.
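
The instance-hours part of that ballpark can be checked with a few lines of Python. This is only a sketch of the arithmetic using the figures quoted above (300 instances, $0.003 typical price, $0.005 bid ceiling); the function name is hypothetical and bandwidth charges are deliberately left out.

```python
HOURS_PER_DAY = 24


def daily_instance_cost(instances, price_per_hour):
    """Instance-hours cost in USD for one day, excluding bandwidth."""
    return instances * price_per_hour * HOURS_PER_DAY


if __name__ == "__main__":
    typical = daily_instance_cost(300, 0.003)   # observed spot price
    ceiling = daily_instance_cost(300, 0.005)   # maximum bid
    print("typical: $%.2f/day" % typical)
    print("ceiling: $%.2f/day" % ceiling)
```

That works out to about $21.60 a day at the typical price and at most $36 a day if every instance ran at the bid ceiling, with outgoing bandwidth on top.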

  13. TryingToHelp
    Posted March 23, 2013 at 7:20 pm | Permalink

    I built 2 CloudFormation templates to allow you to easily spin up a ton of these things across multiple availability zones:

    With a keypair ( so you can login to the host)
    http://files.wordsaboutbytes.com/yahoo-messages-save.cf.txt
    Without a keypair ( can’t log in locally, but it will run)
    http://files.wordsaboutbytes.com/yahoo-messages-save-nokeypair.cf.txt

    1. Open the console
    2. Go to CloudFormation
    3. Give your stack a name.
    4. Select the file you downloaded from above
    5. Click Next.
    6. Fill in the parameters here ( # of instances, The nick you want to be tracked with at the archive team site, the spot price you are willing to pay, and optionally a keypair if you selected that file).
    7. Check the box at the bottom acknowledging that the template will create IAM resources ( used by the host to bootstrap )
    8. Click Continue.
    9. Tags if you want, or click continue.
    10. Review. Click Continue.
    11. Close.

    This will launch however many instances you told it to, as t1.micros, at the spot price you set. When you want to stop, just delete the stack in this console and everything should go away.

  14. TryingToHelp
    Posted March 23, 2013 at 7:21 pm | Permalink

    Oh yeah, the CloudFormation templates I created use the Amazon Linux AMI for each region, so this way the person who created the other AMIs doesn’t need to worry about maintaining them, or paying for access to them. These are standard across regions, are free, and might work a slight bit better too!

One Trackback

  1. […] response to my post last week was amazing. As I write this, there are nine items left processing out of what ended up being more […]