Hosting TIGER U.S. Census Data in Amazon's Cloud

Tom MacWright on May 19, 2009

Amazon is now hosting all “United States TIGER Census data in its cloud”:http://en.wikipedia.org/wiki/Topologically_Integrated_Geographic_Encoding_and_Referencing. We just finished moving 140 gigs of shapefiles of U.S. states, counties, districts, parcels, military areas, and more over to Amazon. This means that you can now load all of this data directly onto one of Amazon’s virtual machines, use the power of the cloud to work with these large data sets, generate output that you can then save on Amazon’s storage, and even use Amazon’s cloud to distribute what you make.

The setup looks a lot like this: !http://farm3.static.flickr.com/2157/3545384521_5d77ff2bb7.jpg?v=0!

Let me explain how this works. The TIGER “data is available as an EBS store”:http://aws.typepad.com/aws/2009/04/new-aws-public-data-sets-anthrokids-twiliowiglenet-sparse-matrices-usa-spending-tiger.html. EBS, or Elastic Block Store, is essentially a virtual hard drive. Unlike S3, EBS stores have no separate API and no special limitations. Instead, an EBS store appears just like an external hard drive once it’s attached to an EC2 instance, which is a virtual machine at Amazon. You can hook up this public virtual disk to your virtual machine and work with the data as if it were local to that machine; it’s that fast.
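Because the volume behaves like an ordinary disk, plain file system code is enough to explore it. Here is a minimal sketch in Python that walks a mounted volume and counts the shapefiles in each directory. The mount point @/mnt/tiger@ and the helper names are assumptions made for illustration; the actual directory layout of the TIGER volume may differ.

```python
import os
from collections import Counter


def find_shapefiles(root):
    """Yield the path of every .shp file under root, in a stable order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories alphabetically
        for name in sorted(filenames):
            if name.lower().endswith(".shp"):
                yield os.path.join(dirpath, name)


def count_by_directory(root):
    """Count shapefiles per directory, keyed by path relative to root."""
    counts = Counter()
    for path in find_shapefiles(root):
        counts[os.path.relpath(os.path.dirname(path), root)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical mount point: wherever you mounted the EBS volume.
    MOUNT_POINT = "/mnt/tiger"
    for directory, total in sorted(count_by_directory(MOUNT_POINT).items()):
        print(directory, total)
```

Nothing here is specific to Amazon, which is the point: once the volume is attached and mounted, the data is just files.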

We built a custom EC2 instance that runs Mapnik and TileCache to generate tiles. This instance is configured specifically for mapping and is accessible to everyone since it’s stored as an AMI, an “image” of a virtual machine that’s listed in a public directory. You can select an AMI, boot it up, edit just one configuration file, and start running with pre-installed software.

So, it goes like this. We mount an EBS drive of TIGER data to our custom EC2 machine, and then run tilecache_seed.py on that machine. I like to think of EC2 as what it is at its best, a big computer, rather than a server. Seeding frontloads the processing, making tons and tons of tiles at your desired zoom levels. TileCache has had an “S3 interface since version 2.0”:http://crschmidt.net/blog/289/tilecache-two-announcements/, so we generate tiles and send them directly to S3.
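To get a feel for what "tons and tons of tiles" means before seeding, it helps to estimate the tile count per zoom level. The sketch below assumes the common 256-pixel spherical mercator tile scheme, where zoom level z divides the world into 2^z by 2^z tiles. That is an assumption for illustration: a TileCache layer may be configured with a different grid, and the function names and the bounding box are made up for this example.

```python
import math


def tile_xy(lon, lat, zoom):
    """Return the (x, y) index of the tile containing lon/lat at this zoom."""
    n = 2 ** zoom
    x = int(math.floor((lon + 180.0) / 360.0 * n))
    merc = math.asinh(math.tan(math.radians(lat)))
    y = int(math.floor((1.0 - merc / math.pi) / 2.0 * n))
    # Clamp so points on the far edge still map to a valid tile.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tiles_in_bbox(west, south, east, north, zoom):
    """Count the tiles needed to cover a bounding box at one zoom level."""
    x_min, y_min = tile_xy(west, north, zoom)  # y grows southward
    x_max, y_max = tile_xy(east, south, zoom)
    return (x_max - x_min + 1) * (y_max - y_min + 1)


def tiles_for_zooms(bbox, zooms):
    """Map each zoom level to its tile count for bbox = (w, s, e, n)."""
    return {z: tiles_in_bbox(*bbox, z) for z in zooms}


if __name__ == "__main__":
    # Rough, illustrative bounding box for the contiguous United States.
    lower_48 = (-125.0, 24.0, -66.0, 50.0)
    for zoom, count in tiles_for_zooms(lower_48, range(4, 13)).items():
        print(zoom, count)
```

The count roughly quadruples with each zoom level, which is why it pays to do the seeding on a big machine next to the data.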

From here, the possibilities are great. For example, you can make maps showing different areas either from one large EC2 image or from many smaller ones. Or by combining this data with Census data, which is already available as an EBS volume, and spending some time working out the details, you could create color-coded voting prediction maps that could be widely accessible and rather customizable.

So, if you need to get at the TIGER data and the Census server has been throttling your IP, stop banging your head against scripting through all the subdirectories and give this a try. The “2008 TIGER/line Shapefiles EBS Volume”:http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2367&categoryID=275 is a public dataset available for any EC2 user to instantiate.

The folks at Amazon, Vikram Rana and Santiago Alonso Lord, have been awesome to work with over the past couple of weeks. Thank you! Also a big thanks to the “New America Foundation’s Federal Education Budget project”:http://febp.newamerica.net/ that first sponsored the data collection and got us thinking about how to make this process better. If you’re interested, check out “Eric’s post about the FEBP project”:http://www.developmentseed.org/blog/2009/may/13/opening-education-data-funding-demographic-and-achievement-data. Eric will also be talking a little about this TIGER work with the FEBP and Amazon in “his talk at Where 2.0 tonight”:http://www.developmentseed.org/blog/2009/may/19/open-data-ignite-presentation-where-20-featuring-wwwstumblesafelycom.
