Hosting TIGER U.S. Census Data in Amazon's Cloud
Open Data is Also About Accessibility, and the Cloud Improves Access
Amazon is now hosting all United States TIGER Census data in its cloud. We just finished moving 140 gigs of shapefiles of U.S. states, counties, districts, parcels, military areas, and more over to Amazon. This means that you can now load all of this data directly onto one of Amazon’s virtual machines, use the power of the cloud to work with these large data sets, generate output that you can then save on Amazon’s storage, and even use Amazon’s cloud to distribute what you make.
This looks a lot like this:

Let me explain how this works. The TIGER data is available as an EBS store. EBS, or Elastic Block Store, is essentially a virtual hard drive. Unlike S3, there isn’t a separate API for EBS stores and there are no special limitations. Instead, an EBS store appears just like an external hard drive when it’s mounted to an EC2 instance, which is a virtual machine at Amazon. You can hook up this public virtual disk to your virtual machine and work with the data as if it’s local to your virtual machine. It’s that fast.
We built a custom EC2 instance that runs Mapnik and TileCache to generate tiles. This instance is configured specifically for mapping and is available to everyone because it’s stored as an AMI, an “image” of a virtual machine that’s listed in a public directory. You can select an AMI, boot it up, edit just one configuration file, and start running with pre-installed software.
So, it goes like this. We mount an EBS drive of TIGER data to our custom EC2 machine, and then run tilecache_seed.py on it. This frontloads the processing, making tons and tons of tiles at your desired zoom levels. I like to think of EC2 as what it is at its best, a big computer, rather than a server. TileCache has had an S3 interface since version 2.0, so we generate tiles and send them directly there.
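Before seeding, it helps to know roughly how many tiles "tons and tons" will be, since the count roughly quadruples with each zoom level. Here is a minimal sketch of that arithmetic. It assumes the common spherical mercator XYZ tile grid, which may not match the grid your TileCache layer is configured with, and the function names are made up for illustration rather than taken from TileCache.

```python
import math


def lonlat_to_tile(lon, lat, zoom):
    """Return the (x, y) index of the XYZ tile containing a lon/lat point."""
    n = 2 ** zoom  # the grid is n tiles wide and n tiles tall
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the far edge still fall inside the grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tiles_in_bbox(west, south, east, north, zoom):
    """Count the tiles that cover a bounding box at one zoom level."""
    x_min, y_min = lonlat_to_tile(west, north, zoom)  # top-left tile
    x_max, y_max = lonlat_to_tile(east, south, zoom)  # bottom-right tile
    return (x_max - x_min + 1) * (y_max - y_min + 1)


def total_tiles(bbox, zooms):
    """Sum the tile counts for a bounding box over several zoom levels."""
    return sum(tiles_in_bbox(*bbox, zoom) for zoom in zooms)
```

Calling total_tiles((-125.0, 24.0, -66.5, 49.5), range(0, 13)) with a rough bounding box for the continental U.S. gives a ballpark for how long a seeding run will take and how many objects will land in S3.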
From here, the possibilities are great. For example, you can make maps showing different areas either from one large EC2 instance or from many smaller ones. Or by combining this data with Census data, which is already available as an EBS, and spending some time working out the details, you could create color-coded voting prediction maps that could be widely accessible and rather customizable.
So, if you need to get at the TIGER data and the Census server has been throttling your IP, stop banging your head against scripting through all the subdirectories and give this a try. The 2008 TIGER/Line Shapefiles EBS Volume is a public dataset available for any EC2 user to instantiate.
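Once the volume is mounted, getting at the data is just a walk over a local directory tree. Here is a minimal sketch that lists every shapefile under a mount point. The path /mnt/tiger in the comment is a hypothetical mount point, and nothing here assumes anything about how the volume's directories are actually laid out.

```python
import os


def find_shapefiles(root):
    """Yield the path of every .shp file below root, in sorted order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # makes os.walk visit subdirectories in a stable order
        for name in sorted(filenames):
            if name.lower().endswith(".shp"):
                yield os.path.join(dirpath, name)


# Example (hypothetical mount point, adjust to wherever you mounted the volume):
#     for path in find_shapefiles("/mnt/tiger"):
#         print(path)
```

Because the volume behaves like a local disk, there is no throttling and no FTP session to script around.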
The folks at Amazon, Vikram Rana and Santiago Alonso Lord, have been awesome to work with over the past couple of weeks. Thank you! Also a big thanks to the New America Foundation’s Federal Education Budget project, which first sponsored the data collection and got us thinking about how to make this process better. If you’re interested, check out Eric’s post about the FEBP project. Eric will also be talking a little about this TIGER work with the FEBP and Amazon in his talk at Where 2.0 tonight.
Several people have asked me follow-up questions, and I wanted to share the answers with everyone.
What are the restrictions on the data? All of the TIGER data has unrestrictive licensing (TIGER/Line data, of course, is public domain), so there’s no cost involved in getting or maintaining it in the cloud.
How much does it cost? Amazon hosts the public EBS stores for free, and users can mount the datasets as virtual drives without paying for the storage. Money starts changing hands when users launch EC2 instances (which are virtual servers, charged by the hour) and transfer and process the data. So Amazon has an interest in keeping this data in the cloud. And having data that would otherwise take hours to download instantly mountable and freely stored (we’re talking about what could be hundreds of gigabytes of genes or roads) is a huge selling point for people who want to run big computing tasks with this kind of data.
As for the actual cost, it all depends on the operation. EC2 instances aren’t just tuned to be servers: to get an equivalent amount of RAM and CPU time elsewhere, you would need an unusually large VPS. And if you can work with a several-hundred-gig dataset that is hosted for free, sending your results to S3 (Amazon’s more traditional file storage service), that’s another point at which Amazon can be cheaper than doing it in-house.
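To make "it all depends on the operation" a bit more concrete, here is a minimal sketch of the arithmetic for a single tile-generation job. The public dataset itself costs nothing, so the bill comes from instance hours, data transferred out, and whatever you keep on S3. Every rate is a parameter you supply, and the numbers in the example comment are made-up placeholders, not Amazon’s prices.

```python
def estimate_job_cost(instance_hours, hourly_rate,
                      gb_transferred_out=0.0, rate_per_gb_out=0.0,
                      gb_stored=0.0, rate_per_gb_month=0.0, months=0.0):
    """Return a rough total for one job: compute plus transfer plus storage.

    All rates are supplied by the caller. Mounting the public dataset is
    free, so it does not appear as a term here.
    """
    compute = instance_hours * hourly_rate
    transfer = gb_transferred_out * rate_per_gb_out
    storage = gb_stored * rate_per_gb_month * months
    return compute + transfer + storage


# Example with placeholder rates (look up the current ones before budgeting):
#     estimate_job_cost(instance_hours=10, hourly_rate=0.5,
#                       gb_transferred_out=20, rate_per_gb_out=0.1)
```

The useful comparison is against the in-house version of the same job, which also has to count the hours spent downloading the data in the first place.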
Reducing barriers to entry.
Having this data in the cloud makes it easier to access for people running servers not in Amazon’s cloud, and also makes downloading the data easier, compared to the currently slow and inconvenient state of the actual TIGER/Line FTP servers. Another big benefit is the fact that EC2 instances can be stored as AMIs. The mapping technology that we’re looking at has been historically tricky to install, and that’s held people back. Being able to boot up a pre-made free image of an OS and its software significantly reduces the barrier to entry.