Blog: Aggregator

Improved Aggregator for Drupal 7: What's Under the Hood
engineer

An Overview of Its New Features and a Request to Test Drive It

The patch for an improved aggregator for Drupal 7 is now available on Drupal.org (#236237). This code is the result of Aron Novak's Google Summer of Code project, and it is also available as a Drupal project with regular patches against Drupal HEAD (#236237). The patch has been out for a couple of weeks, so it's high time to talk about what improvements it aims to bring to Drupal core.

Before I dive into the details, though, I'd like to point out that several people have asked us to break the patch into smaller pieces, as it is rather big and touches more than one area of the aggregator's functionality. We have yet to work on this; however, I do think there is value in presenting the proposed improvements as a whole. So here we go.

There are four major differences in comparison to the existing aggregator:

  • Extensible architecture - allows external modules to add or replace functionality
  • Per feed content type configuration of aggregator
  • Replaces aggregator's XML parser with a SimpleXML-based parser
  • Replaces aggregator's category system with taxonomy

1. Extensible architecture

This change is the most far-reaching of all. At its core are the concepts of parsers and processors for aggregator. Parsers download and parse feeds, normalize feed data and expose it to other parts of the application. Processors grab feed data and act on it. For example, they create database records for feed items.

In order to define a parser or a processor, one of two hooks needs to be implemented:

  • hook_parse() for defining a parser
  • hook_process() for defining a processor

According to the parser/processor architecture, the new structure of aggregator is as follows:

  • aggregator module - implements API and standard download routines
  • syndication_parser module - standard RSS/Atom/RDF parser module that is supposed to ship with core. Can be used independently from aggregator.
  • aggregator.light.inc (part of aggregator module) - this implements a processor that stores feed items as lightweight database records just as the current aggregator does

The implementation of parsers and processors can be seen in syndication_parser module and in aggregator.light.inc. There is also a feed-items-as-nodes implementation in the project version of aggregator for Drupal 7. To my knowledge, the parser/processor architecture was first introduced in Drupal by Ted Serbinski with SimpleFeed, and it also exists in FeedAPI.
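
To make the parser/processor split more concrete, here is a minimal, self-contained sketch. It is not code from the patch: the function names, their signatures and the shape of the normalized item array are all assumptions made up for illustration. The first function stands in for a hook_parse() implementation that uses SimpleXML on an RSS 2.0 string, and the second stands in for a hook_process() implementation that stores items in a plain array instead of database records.

```php
<?php
// Hypothetical sketch: names, signatures and the item structure are
// assumptions for illustration, not the API defined by the patch.

/**
 * Parser: turns raw RSS 2.0 XML into a normalized array of items.
 * Stands in for an implementation of hook_parse().
 */
function example_parser_parse($xml_string) {
  $items = array();
  $xml = simplexml_load_string($xml_string);
  if ($xml === FALSE || !isset($xml->channel)) {
    // Not parseable as RSS 2.0: expose no items.
    return $items;
  }
  foreach ($xml->channel->item as $item) {
    $items[] = array(
      'title' => (string) $item->title,
      'link' => (string) $item->link,
    );
  }
  return $items;
}

/**
 * Processor: acts on normalized items. Here it fills an array keyed by
 * link; a real processor would write database records or nodes.
 * Stands in for an implementation of hook_process().
 */
function example_processor_process(array $items, array &$storage) {
  foreach ($items as $item) {
    $storage[$item['link']] = $item['title'];
  }
  return count($items);
}

$rss = '<rss version="2.0"><channel><title>Example</title>
<item><title>First</title><link>http://example.com/1</link></item>
<item><title>Second</title><link>http://example.com/2</link></item>
</channel></rss>';

$storage = array();
$count = example_processor_process(example_parser_parse($rss), $storage);
print $count . " items processed\n";
```

The point of the split is that the processor only ever sees the normalized array, so either side can be swapped out without touching the other.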

Current discussion points around the extensible API are:

Why the heck a new aggregator for Drupal 7?
engineer

Or, Check Out the Patch

This Google Summer of Code season I’ve had the distinct pleasure of mentoring Aron Novak’s work on a new aggregator for Drupal 7. Aron’s well into his task and has just rolled a patch for core and an alpha 2 version – time to share why I think this patch is important and why you should have a look at it. If you’re into aggregation and Drupal, that is.

Drupal’s original aggregator module was designed primarily for pulling news feeds into your site and displaying them in a straightforward fashion: no workflow, very basic permissioning, no API for interacting with feed items, no event aggregation, no custom parsers – to name a few limitations.

Soon contrib modules mushroomed, each addressing one shortcoming or another of the core aggregator: a list of them would start with the aggregator 2 module, which was published in the fall of 2005, and would include Leech (I don’t regret its demise), Aggregation (first use of SimpleXML for parsing in Drupal), SimpleFeed (first extensible architecture) and FeedAPI

Migrate Your Aggregator From Leech to FeedAPI This Weekend
Drupal coder

A Migration Script to Make the Transition Easy

After a lot of hard work, we've released FeedAPI 1.0. But what does this mean for Leech users? Simply put, in most cases it means a better aggregation option is now available - and with the ready-to-use migration script explained below the transition is easy.

As most active users know, Leech isn't being developed any further. FeedAPI, on the other hand, is actively maintained and constantly improved. This alone means that in most cases the obvious move is to migrate to FeedAPI. You probably don't want to replace Leech if your site is running nicely, doesn't need new features, and has no performance problems. But if that isn’t the case, then migrating to FeedAPI is a good idea. If you're still in doubt, try FeedAPI out at feedapi.novaak.net. (Note: The site can be snail slow, but that's not because of FeedAPI - it's on shared hosting :) )

Follow these instructions to upgrade to FeedAPI from Leech:

Requirements: Leech 1.9 installed, MySQL, PHP4-5 (Note: The migration script does not work with PostgreSQL)

1) Download the migration script.

2) Put the site into offline (maintenance) mode at admin/settings/site-maintenance. Copy the leech2feedapi.php script to Drupal's main directory. This is the directory where Drupal's index.php can be found.

FeedAPI 1.0 Released
Drupal coder

New Flexible Aggregator for Drupal Ready for Use

After being in development for about seven months, we released FeedAPI 1.0 nearly two weeks ago! This is really exciting for me and everyone else who has been craving a more flexible aggregator for Drupal.

We wanted FeedAPI to be as flexible and as fast as possible, and I'm happy to say that the end result has reached those goals. All feeds are represented by nodes. Feed items remain independent from the node system - you can choose whether you would like to store them as nodes or as fast, flat database records. By default, FeedAPI is configured to use SimplePie as a feed parser, which works perfectly in most cases. But if you have particular requirements, you can switch to the alternative Common Syndication parser or write your own.

A particularly nice thing about FeedAPI is that you can configure your feed processing preferences on a per content type basis. Cron processes are performance optimized and work on a time limit rather than a limit on the number of feeds - a big relief for tuning cron performance. FeedAPI also collects important statistics for monitoring wicked huge aggregating sites. Additionally, FeedAPI has a straightforward configuration and is easily extensible through an API. (Note: The content-type preset, among other things, is entirely Alex's concept and work, and Jose did a lot to make FeedAPI faster.)
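
To illustrate the idea of a time limit instead of a feed-count limit, here is a rough sketch. It is not FeedAPI's actual code: the function name, its parameters and the fake refresh callback are assumptions for illustration only. The loop refreshes feeds until the time budget is used up and hands back whatever it did not reach, so the next cron run can pick up from there.

```php
<?php
// Hypothetical sketch of time-boxed processing; the names are illustrative
// and are not FeedAPI's actual API.

/**
 * Refreshes feeds until the time budget is used up.
 *
 * @param array $feeds Feed identifiers, in the order they should be refreshed.
 * @param callable $refresh Callback that refreshes a single feed.
 * @param float $budget Number of seconds this run may spend.
 * @return array The feeds that were not reached and wait for the next run.
 */
function example_refresh_feeds(array $feeds, $refresh, $budget) {
  $start = microtime(TRUE);
  // Check the clock before each feed, not a counter of feeds handled.
  while ($feeds && (microtime(TRUE) - $start) < $budget) {
    $feed = array_shift($feeds);
    call_user_func($refresh, $feed);
  }
  return $feeds;
}

// Usage: each fake refresh takes about 20 ms; the budget is 50 ms.
$done = array();
$remaining = example_refresh_feeds(
  array('a', 'b', 'c', 'd', 'e', 'f'),
  function ($feed) use (&$done) {
    usleep(20000);
    $done[] = $feed;
  },
  0.05
);
print count($done) . " refreshed, " . count($remaining) . " left for the next run\n";
```

With a count limit, one slow feed can push a run past PHP's execution limit; with a time budget, a slow feed just means fewer feeds are handled in that run.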

In the near future, we plan to further tune FeedAPI's performance and make it more feature-rich (as I type this blog entry, we are working on committing new features that aren't included in 1.0 :) )

For now, I'd like to share with you what you can do with FeedAPI 1.0:

  • Install it on a PHP4 or PHP5 site with a MySQL or a PostgreSQL database
  • Cache feed downloading to save bandwidth and time
  • Turn feed items into nodes
  • Turn feed items into lightweight items – use FeedAPI like the good old core aggregator
  • Adjust cron processing of feeds based on elapsed time. No more PHP timeouts
  • Get statistics about feed processing
  • With Feed Element Mapper, map feed item elements to CCK fields or taxonomy
  • Import your feeds from anywhere with OPML import
  • Export OPML
  • Inherit taxonomy and organic group settings from feeds to feed items
  • Views support, with a predefined view for exploring the feed items of a specific feed
  • Content-type based configurable presets
  • Promote items to the front page according to your settings
  • One-click feed creation block. Just supply the feed URL and the feed is created
  • Purge or not purge outdated feed items – it's your choice
  • Automated tests (simpletest module yeay!) - easy checking for problems on your installation.
  • Auto-detect the feed URL: you do not have to know the RSS/Atom link, you can just supply the site URL, for example http://drupal.org
  • Access to a documented API, you can write parsers or processors and use FeedAPI’s power in feed management
  • Ready-to-translate .po files in the package
  • +1: low turnaround time issue queue-based support :D

Drupal, Meet Python
professional hacker

Speeding Up Aggregation in Drupal with a Daemon

Recently we’ve been working to get our team aggregator and media analyzer Managing News running faster – we want it to fly. To do this we’ve had to really push what Drupal and the LAMP stack can do. Aron and Alex have done great things with aggregation and feed parsing to extend the volume of content that can be collected, which is essential for Managing News since it can aggregate tens of thousands of articles every day. But we still wanted to be able to aggregate content faster.

A major thing that slows down adding content to Managing News is its semantic analysis – every piece of content that the system pulls in is automatically browsed and given tags that describe it. To do this we mostly use third party web services like Yahoo's term extraction API. Waiting for Yahoo to process the text of each incoming article can add a significant amount of time to the process of adding content - it keeps cron running longer, which increases the chances of a bad cron run, and ties up the system’s resources just waiting.

To get around this we looked for ways to move the semantic analysis elsewhere. We settled on the idea of using a small external program to talk to tagging web services and do its own analysis of content in our Drupal system. We looked at a couple of options, and quickly decided to write a Python daemon. (A daemon is a program that runs in the background on a server. It waits for certain events and then takes specified actions. In our case the daemon waits for new content, and then tags and processes it as it becomes available.)

Pick It off the Feed, Stick It on the Node
engineer

FeedAPI Now Maps Feed Elements to Node Fields

Last week I did some cleaning up of the soon-to-come FeedAPI beta release. I got fired up and decided to address one of the more exciting feature requests for Drupal aggregators - a feed element mapper. This feature will make it possible to freely map elements on feed items to fields on nodes that are created from these feed items.

The feed element mapper is an add-on module for FeedAPI. It's currently in a proof of concept state and therefore lives in my sandbox. This screencast shows a demo of the straightforward mapping process.

Feed element mapper screencast

For the developers

At the moment it's only possible to map to taxonomy, although writing your own mapper should be easy. This example of the taxonomy mapper shows how you can define a feed element mapper for your node field. First your mapping function is asked which fields it exposes ($op == 'list'). Then, when the user defines a feed element for one of the fields you exposed, the function is called again with the mapped elements and the node to stick them onto ($op == 'map').

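function feedapi_mapper_map_taxonomy($op = 'map', $node, $feed_element = array(), $sub_field = '') {
  if ($op == 'list') {
    // Expose one mappable sub-field per vocabulary available for this node type.
    if ($vocabularies = taxonomy_get_vocabularies($node->type)) {
      $sub_fields = array();
      foreach ($vocabularies as $v) {
        $sub_fields[$v->vid] = $v->name;
      }
      return $sub_fields;
    }
    // No vocabularies for this node type: nothing to map to.
    return FALSE;
  }
  elseif ($op == 'map') {
    // Treat a single string as a one-element list.
    if (is_string($feed_element)) {
      $feed_element = array($feed_element);
    }
    if (is_array($feed_element)) {
      if (!isset($node->taxonomy) || !is_array($node->taxonomy)) {
        $node->taxonomy = array();
      }
      // Merge the terms returned for the mapped elements into the node's taxonomy.
      $node->taxonomy = array_merge($node->taxonomy, _feedapi_mapper_create_terms($feed_element, $sub_field));
    }
    return $node;
  }
}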
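
As a sketch of what writing your own mapper might look like, the example below follows the same two-step $op protocol as the taxonomy mapper. It is self-contained and does not depend on Drupal. The function name and the "title" and "teaser" sub-fields are hypothetical, chosen only to show the list-then-map call sequence.

```php
<?php
// Hypothetical mapper following the same $op protocol as the taxonomy
// mapper. The function name and its sub-fields are made up for illustration.
function example_mapper_map_text($op, $node, $feed_element = array(), $sub_field = '') {
  if ($op == 'list') {
    // Advertise the node fields this mapper can fill.
    return array('title' => 'Title', 'teaser' => 'Teaser');
  }
  elseif ($op == 'map') {
    // Accept a string or an array of strings; join arrays into one value.
    if (is_array($feed_element)) {
      $feed_element = implode(', ', $feed_element);
    }
    if (is_string($feed_element) && $sub_field !== '') {
      $node->$sub_field = $feed_element;
    }
    return $node;
  }
}

// Usage: first ask what can be mapped, then map a feed element onto the node.
$node = new stdClass();
$fields = example_mapper_map_text('list', $node);
$node = example_mapper_map_text('map', $node, array('Drupal', 'aggregation'), 'title');
print $node->title . "\n";
```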