When we were building the Afghanistan election data website, performance was a chief concern. As Eric described in his post, there is enough tabular data in the site to fill a few thousand pages of a PDF document, and getting all the calculations and queries needed for each page to run in a reasonable time was a challenge. We were able to get Drupal (and Managing News, the Drupal distribution the site is built on) to generate pages in an acceptable amount of time, but it still wasn’t fast, and we needed it to be. To bring it to the speed we needed, we turned to Varnish, an HTTP accelerator that caches data in virtual memory, and Pressflow, a Drupal distribution focused on performance that works with Varnish.
As nearly all of the site’s traffic is anonymous and the content is relatively static, Varnish was a straightforward way to speed the site up. Varnish is a reverse proxy cache, meaning it keeps copies of frequently requested pages handy and doesn’t bother the webserver to generate pages for every request. So if two people request the same page, the webserver will assemble it once and the second requester gets a copy of the page that was already generated. When you imagine that the two people are really a few thousand, you can start to see how this sort of cache can be useful.
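To make the idea concrete, here is a minimal sketch of a reverse proxy cache in Python. It is a toy, not how Varnish is implemented: `ReverseProxyCache` and `render_page` are hypothetical names, the "webserver" is just a function, and there is no expiry or memory limit.

```python
from typing import Callable, Dict


class ReverseProxyCache:
    """Toy in-memory cache that sits in front of a slow page generator."""

    def __init__(self, backend: Callable[[str], str]) -> None:
        self.backend = backend            # stands in for the webserver
        self.store: Dict[str, str] = {}   # cached pages, keyed by URL
        self.backend_hits = 0             # how often the backend did real work

    def get(self, url: str) -> str:
        # Only ask the backend when no copy of the page exists yet.
        if url not in self.store:
            self.backend_hits += 1
            self.store[url] = self.backend(url)
        return self.store[url]


def render_page(url: str) -> str:
    # Hypothetical stand-in for the expensive page build.
    return f"<html>results for {url}</html>"


cache = ReverseProxyCache(render_page)
pages = [cache.get("/province/33") for _ in range(3000)]
print(cache.backend_hits)  # 1
```

Three thousand requests for the same page cost the backend a single page build; every other request is answered from the stored copy.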
On top of that, Varnish is fast. Without Varnish on AfghanistanElectionData.org, it takes about two seconds to generate and deliver the HTML of a page, and it takes longer to render the page itself, since CSS and images all need to be delivered to the browser. But once a page has been cached by Varnish, the time to deliver the HTML drops to around 115 milliseconds. That is more than a 10x improvement (roughly 17x by these numbers), and it requires significantly fewer server resources as well.
The way this works is that reverse proxy caches like Varnish examine each request they receive and try to determine if a new page really needs to be generated. This process means looking at the headers of the request, any cookies, and the URL that has been requested. Any information that the reverse proxy cache needs to make the correct decision needs to be in the request, so nearly all of the information about what the page contains needs to live in the request.
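As a rough illustration of that decision, the sketch below looks at the method, the cookies, and the URL of a request and either returns a cache key or signals that the request must go to the webserver. The function name `cache_key` and the exact rules are assumptions for illustration; in Varnish the real policy is written in its own configuration language and can be tuned per site.

```python
from typing import Mapping, Optional


def cache_key(method: str, url: str, headers: Mapping[str, str]) -> Optional[str]:
    """Return a key to look up in the cache, or None to bypass the cache.

    Assumes header names arrive already normalized (e.g. "Cookie").
    """
    # Only safe, read-only requests are candidates for caching.
    if method not in ("GET", "HEAD"):
        return None
    # A cookie suggests the response may be specific to one user.
    if headers.get("Cookie"):
        return None
    # Everything that describes the page lives in the URL, so the URL is the key.
    return url


print(cache_key("GET", "/province/33?views-mode=views-mode-map", {}))
print(cache_key("GET", "/province/33", {"Cookie": "SESS=abc"}))
print(cache_key("POST", "/province/33", {}))
```

The first call prints the URL itself, while the other two print `None`. The second case is the one that matters most here: any request carrying a cookie falls out of the cache.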
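For example, take this URL:
http://afghanistanelectiondata.org/filter/600-95-AND/province/33?views-mode=views-mode-map#layers=district_map_openlayers_1&baseLayers=afghanistan-grey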
From front to back, the parts of this URL are:
http://afghanistanelectiondata.org
Nothing special about the domain here. It just tells the browser what server to go to.
/filter/600-95-AND/province/33
The first two elements of the path together describe how poll station highlighting should work on the site. In this case, poll stations with more than 600 votes and 95% of those votes for a single candidate should be highlighted. The rest of the path, /province/33, is just the path for the page, and in this case it represents the province ‘Farah’.
?views-mode=views-mode-map
The querystring describes how the page is going to be displayed. Here the map is being requested.
#layers=district_map_openlayers_1&baseLayers=afghanistan-grey
The anchor here describes what the map is going to look like and what layers are active.
The domain is implied by how you’re connected to the server, and the anchor is never actually sent to the server. So when the webserver gets the request, it basically looks like this:
/filter/600-95-AND/province/33?views-mode=views-mode-map
This path explicitly describes the contents of the page, the filtering criteria, and the way it should be displayed. Having a URL like this removes the need to use a session to store information about what the user wants to look at, and that makes it possible to use tools like Varnish.
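The split between what stays in the browser and what reaches the server can be checked with Python’s standard `urllib.parse`. This is only a sketch: the URL is reassembled from the parts listed above, and `request_target` is a name chosen here for the path plus querystring.

```python
from urllib.parse import urlsplit

url = (
    "http://afghanistanelectiondata.org/filter/600-95-AND/province/33"
    "?views-mode=views-mode-map"
    "#layers=district_map_openlayers_1&baseLayers=afghanistan-grey"
)

parts = urlsplit(url)

# The path and querystring are what the webserver (and Varnish) get to see.
request_target = parts.path + ("?" + parts.query if parts.query else "")

print(request_target)   # /filter/600-95-AND/province/33?views-mode=views-mode-map
print(parts.fragment)   # layers=district_map_openlayers_1&baseLayers=afghanistan-grey
```

Because the fragment never leaves the browser, changing map layers does not create a new cache entry; only the path and the querystring do.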
To get URLs like this with persistent, meaningful elements, we’re using the Persistent URL (aka PURL) module. PURL allows the site builder to establish certain elements of the request as meaningful and map them to conditions in Drupal. On this site we’re using PURL for a few things. It’s being used to make filtering criteria persistent in the path, so that you can effectively set fraud criteria and browse the election results through that lens, and it’s being used to map the querystring to a Views display, so that the user can browse the data with listings or on the map.
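The idea of a persistent path element can be sketched in a few lines of Python. This is not PURL’s code or API (PURL is a Drupal module written in PHP). The names `FraudFilter` and `split_filter_prefix` are hypothetical, and the reading of `600-95-AND` as minimum votes, minimum percentage, and an operator is an assumption based on the description of the example URL.

```python
from typing import NamedTuple, Optional, Tuple


class FraudFilter(NamedTuple):
    min_votes: int     # assumed meaning of the first number
    min_percent: int   # assumed meaning of the second number
    operator: str      # e.g. "AND"


def split_filter_prefix(path: str) -> Tuple[Optional[FraudFilter], str]:
    """Peel a leading /filter/<criteria> off the path and return the rest."""
    segments = path.strip("/").split("/")
    if len(segments) >= 2 and segments[0] == "filter":
        votes, percent, operator = segments[1].split("-")
        rest = "/" + "/".join(segments[2:])
        return FraudFilter(int(votes), int(percent), operator), rest
    # No filter prefix: the whole path is the page path.
    return None, "/" + "/".join(segments)


print(split_filter_prefix("/filter/600-95-AND/province/33"))
print(split_filter_prefix("/province/33"))
```

The first call yields the criteria plus `/province/33`, the second yields `None` plus the same page path. Since the criteria travel in the path rather than in a session, every filtered view has its own URL and can be cached like any other page.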
But URL design is not enough. Drupal demands that all users have a session and a cookie, and this sort of behavior isn’t compatible with how reverse proxy caches work. Thankfully there is Pressflow, a fork of Drupal that focuses on performance and scalability. Among its many improvements is the fact that it doesn’t try to hand out cookies to all users and it works very well with external caching layers like Varnish.
What is really cool about Pressflow is that you can completely and quickly swap out Drupal for Pressflow without making any changes to your application. AfghanistanElectionData.org was built with Managing News, which uses Drupal core out of the box, but the switch to Pressflow was quick and painless. The most difficult part was properly configuring Varnish, which has its own configuration language and requires some pretty intimate knowledge of how HTTP works. But if the web is your field, the more you know about HTTP the better, so dive on in! If you’d like help, there are some very good write-ups about how to set up Pressflow and Varnish.
This was the first site we’ve put into production using Pressflow and Varnish, and the experience continues to be great. There are others who have more detailed benchmarks and who are doing very interesting things with these two products. Below are some places to dig in a little deeper.