Using Varnish + Pressflow to Make Managing News Faster
How we increased performance tenfold and used fewer server resources on AfghanistanElectionData.org
When we were building the Afghanistan election data website, performance was a chief concern. As Eric described in his post, there is enough tabular data in the site to fill a few thousand pages of a PDF document, and getting all the calculations and queries needed to run each page in a reasonable time was a challenge. We were able to get Drupal - and Managing News, the Drupal distribution the site is built on - to generate pages in an acceptable amount of time, but it still wasn't fast - and we needed it to be. To bring it to the speed we needed, we turned to Varnish, an HTTP accelerator that caches data in virtual memory, and Pressflow, a Drupal distribution focused on performance that works with Varnish.
As nearly all of the site's traffic is anonymous and the content is relatively static, Varnish was a straightforward way to speed the site up. Varnish is a reverse proxy cache, meaning it keeps copies of frequently requested pages handy and doesn't bother the webserver to generate pages for every request. So if two people request the same page, the webserver will assemble it once and the second requester gets a copy of the page that was already generated. When you imagine that the two people are really a few thousand, you can start to see how this sort of cache can be useful.
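To make the idea concrete, here is a minimal Python sketch of that behavior. It is only a toy, not how Varnish is implemented: the `TinyPageCache` class and the `slow_backend` function are hypothetical names invented for illustration, and a real cache also handles expiry, headers, and memory limits.

```python
from typing import Callable, Dict


class TinyPageCache:
    """Toy stand-in for a reverse proxy cache: one stored copy per URL."""

    def __init__(self, backend: Callable[[str], str]) -> None:
        self.backend = backend            # the "webserver" that builds pages
        self.store: Dict[str, str] = {}   # generated pages keyed by URL
        self.backend_calls = 0            # how often the backend did real work

    def get(self, url: str) -> str:
        # Only bother the backend the first time a URL is requested.
        if url not in self.store:
            self.backend_calls += 1
            self.store[url] = self.backend(url)
        return self.store[url]


def slow_backend(url: str) -> str:
    # Hypothetical page builder standing in for Drupal.
    return f"<html>page for {url}</html>"


cache = TinyPageCache(slow_backend)
for _ in range(1000):                     # a thousand visitors, one page
    page = cache.get("/province/33")

print(cache.backend_calls)                # the backend built the page once
```

However many visitors ask for the same URL, the backend builds the page a single time and every later request is answered from the stored copy.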
On top of that, Varnish is fast. Without Varnish on AfghanistanElectionData.org, it takes about two seconds to generate and deliver the HTML of a page, and it takes longer to render the page itself, since CSS and images all need to be delivered to the browser. But once a page has been cached by Varnish, the time to deliver the HTML drops to around 115 milliseconds. This is more than a 10x improvement, and it requires significantly fewer server resources as well.
The way this works is that reverse proxy caches like Varnish examine each request they receive and try to determine if a new page really needs to be generated. This process means looking at the headers of the request, any cookies, and the URL that has been requested. Any information that the reverse proxy cache needs to make the correct decision needs to be in the request, so nearly all of the information about what the page contains needs to live in the request.
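As a rough illustration of that decision, the Python sketch below checks the request method and headers and builds a cache key from the host and URL. The rules are an assumption, loosely modelled on the kind of conservative defaults a reverse proxy cache tends to apply; they are not the configuration used on this site, and `can_serve_from_cache` and `cache_key` are hypothetical helper names.

```python
from typing import Mapping


def can_serve_from_cache(method: str, headers: Mapping[str, str]) -> bool:
    """Return True when a request looks safe to answer from a shared cache.

    Assumed rules (illustrative only): read-only methods, and no
    per-user state such as cookies or credentials in the request.
    """
    if method.upper() not in ("GET", "HEAD"):
        return False
    names = {name.lower() for name in headers}
    if "cookie" in names or "authorization" in names:
        return False
    return True


def cache_key(host: str, path_and_query: str) -> str:
    # Everything that distinguishes one page from another has to be in
    # the request, so host + path + querystring identifies the page.
    return f"{host}{path_and_query}"


anonymous = {"Host": "afghanistanelectiondata.org"}
logged_in = {"Host": "afghanistanelectiondata.org", "Cookie": "SESSabc=123"}

print(can_serve_from_cache("GET", anonymous))   # cacheable
print(can_serve_from_cache("GET", logged_in))   # carries a session cookie
print(cache_key("afghanistanelectiondata.org",
                "/filter/600-95-AND/province/33?views-mode=views-mode-map"))
```

The second request is the reason cookies matter so much: a single session cookie is enough to make an otherwise identical request look unique to the cache.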
For example take this URL:
http://afghanistanelectiondata.org/filter/600-95-AND/province/33?views-mode=views-mode-map#layers=district_map_openlayers_1&baseLayers=afghanistan-grey
From the front to the back the parts of this URL are:
- Domain:
  http://afghanistanelectiondata.org
  Nothing special about the domain here. It just tells the browser what server to go to.
- Path:
  /filter/600-95-AND/province/33
  The first two elements of the path together describe how poll station highlighting should work on the site. In this case, poll stations with more than 600 votes and 95% of those votes for a single candidate should be highlighted. The rest of the path, /province/33, is just the path for the page, and in this case it represents the province 'Farah'.
- Querystring:
  ?views-mode=views-mode-map
  The querystring describes how the page is going to be displayed. Here the map is being requested.
- Anchor:
  #layers=district_map_openlayers_1&baseLayers=afghanistan-grey
  The anchor here describes what the map is going to look like and what layers are active.
The domain is implied by how you're connected to the server, and the anchor is never actually sent to the server. So when the webserver gets the request it basically looks like this:
GET /filter/600-95-AND/province/33?views-mode=views-mode-map
This path explicitly describes the contents of the page, the filtering criteria, and the way it should be displayed. Having a URL like this removes the need to store information about what the user wants to look at in a session, and that makes it possible to use tools like Varnish.
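The breakdown above can be reproduced with Python's standard `urllib.parse` module. This is only a sketch for inspecting the example URL; the `request_line` helper is a hypothetical name, and a real request line would also include the HTTP version.

```python
from urllib.parse import SplitResult, urlsplit

url = (
    "http://afghanistanelectiondata.org/filter/600-95-AND/province/33"
    "?views-mode=views-mode-map"
    "#layers=district_map_openlayers_1&baseLayers=afghanistan-grey"
)

parts = urlsplit(url)


def request_line(parts: SplitResult) -> str:
    """Build the part of the request the server sees: path plus querystring.

    The fragment (anchor) is deliberately left out, because browsers keep
    it to themselves and never send it over the wire.
    """
    target = parts.path
    if parts.query:
        target += "?" + parts.query
    return f"GET {target}"


print(parts.netloc)         # the domain
print(parts.path)           # filter criteria plus the page path
print(parts.query)          # how the page should be displayed
print(parts.fragment)       # map layers, visible to the browser only
print(request_line(parts))  # what the webserver actually receives
```

Because the anchor never reaches the server, every combination of map layers shares one cached copy of the page, and the browser applies the layer settings on its own.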
To get URLs like this with persistent meaningful elements, we're using the Persistent URL (aka PURL) module. PURL allows the site builder to establish certain elements of the request as meaningful and map them to conditions in Drupal. On this site we're using PURL for a few things. It's being used to make filtering criteria persistent in the path, so that you can effectively set fraud criteria and browse the election results through that lens, and it's being used to map the querystring to a Views display, so that the user can browse the data with listings or on the map.
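The general idea of a persistent path prefix can be sketched in a few lines of Python. To be clear, this is not PURL's code or API (PURL is a Drupal module written in PHP): `HighlightFilter` and `split_persistent_prefix` are hypothetical names, and the reading of "600-95-AND" as votes, percent, and operator is an assumption based on the example URL described earlier.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class HighlightFilter:
    min_votes: int     # e.g. more than 600 votes at a poll station
    min_percent: int   # e.g. 95% of votes for a single candidate
    operator: str      # how the two conditions combine, e.g. "AND"


def split_persistent_prefix(path: str) -> Tuple[Optional[HighlightFilter], str]:
    """Separate a leading /filter/<votes>-<percent>-<op> prefix from the page path."""
    segments = path.strip("/").split("/")
    if len(segments) >= 2 and segments[0] == "filter":
        votes, percent, operator = segments[1].split("-")
        criteria = HighlightFilter(int(votes), int(percent), operator)
        return criteria, "/" + "/".join(segments[2:])
    # No prefix present: the whole path is the page path.
    return None, "/" + "/".join(segments)


criteria, page_path = split_persistent_prefix("/filter/600-95-AND/province/33")
print(criteria)    # the filter carried along in the URL
print(page_path)   # the page being viewed under that filter
```

Since the criteria travel in the path itself, two visitors using the same filter on the same page send exactly the same request, and a cache can answer both from one stored copy.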
But URL design is not enough. Drupal demands that all users have a session and a cookie, and this sort of behavior isn't compatible with how reverse proxy caches work. Thankfully there is Pressflow, a fork of Drupal that focuses on performance and scalability. Among its many improvements is the fact that it doesn't try to hand out cookies to all users and it works very well with external caching layers like Varnish.
What is really cool about Pressflow is that you can completely and quickly swap out Drupal with Pressflow without making any changes to your application. AfghanistanElectionData.org was built with Managing News, which uses Drupal core out of the box, but the switch to Pressflow was quick and painless. The most difficult part was properly configuring Varnish, which has its own configuration language and requires some pretty intimate knowledge of how HTTP works. But if the web is your field, the more you know about HTTP the better, so dive on in! If you'd like help, there are some very good write-ups about how to set up Pressflow and Varnish.
This was the first site we've put into production using Pressflow and Varnish, and the experience continues to be great. There are others who have more detailed benchmarks and who are doing very interesting things with these two products. Below are some places to dig in a little deeper.
Does Pressflow work with Varnish out of the box, without any configuration?
You certainly need to configure Pressflow, Apache and Varnish to work together. In the links above there are some very good descriptions of this process. In short, you need to tell Pressflow that you’re using an external cache layer, you need to change the port that Apache serves on, and you need to tell Varnish how to make requests of Apache and under what conditions to serve a cached page or request a new one.
FWIW, we use Pressflow on every project we do. There’s not really a reason not to, I think. By using Pressflow, our deployment options are much wider because it will work with regular boring shared hosting or we can deploy the site on a super awesome Slicehost server with Varnish and such installed and have the site be crazy fast.
Just my sleep-deprived $0.02.
Pressflow requirements
Pressflow is a great performance-focused fork of Drupal. A big part of the performance is that they ONLY support PHP 5+ and MySQL.
Will Pressflow be the new standard? I.e., will you start building projects with Pressflow, or stick with core Drupal?
For now we’re making this decision on a case-by-case basis. Not every site benefits as hugely from using Pressflow as this one does. But Pressflow and Varnish are certainly a pair of tools I expect we’ll start using more and more for sites that are primarily public-facing.
Reasons for not using Pressflow?
What are the cons of using Pressflow over core? Or the benefits of core over Pressflow?
Very quickly: the main cons of using Pressflow are that it is MySQL-only and PHP 5-only. So if you’re using PostgreSQL as your database or are stuck with PHP 4, it isn’t going to work for you. The benefits are that you get a version of Drupal that is faster and more scalable. For more details you may want to read our interview with David Strauss, the creator of Pressflow.