FeedAPI and Drush: Refresh Your Feeds Faster


FeedAPI 1.7beta2 was recently released, and it adds Drush support. Drush (DRUpal SHell), in a nutshell, lets you manage your Drupal instance from the Unix command line, and with Drush support now in FeedAPI, you can control your feeds more precisely. In this post, I’ll introduce a way to refresh your feeds faster and more efficiently.

This method is only useful when your Drupal site runs on a VPS or a dedicated box; otherwise you cannot really use Drush at all. The script starts refreshing four feeds at once, each in its own background process. It is not truly parallel, however, since the next four feeds only start refreshing after all four of the previous feeds have finished. You can adjust the number of feeds started per round. The script is basically an infinite loop, but don’t worry, it won’t eat up all your server resources: if all of your feeds can be refreshed in less than the minimum iteration time (900 seconds, configurable), the script waits out the remainder before it starts from the beginning again.
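The round-based behaviour and the pacing described above can be sketched in isolation. This is only an illustration, not the script itself: `refresh_one` is a hypothetical stand-in for the real refresh command, `run_in_rounds` and `remaining_wait` are names made up for this sketch, and it uses a bare `wait` where the actual script waits on each stored process ID.

```shell
#!/bin/bash
# Sketch only: refresh_one is a hypothetical stand-in for refreshing one feed.
refresh_one() {
  sleep 0.1
  echo "done $1"
}

# Run the given jobs in rounds of $1; a new round starts only after
# every job of the previous round has exited.
run_in_rounds() {
  local batch=$1; shift
  local running=0 item
  for item in "$@"; do
    refresh_one "$item" &
    running=$((running + 1))
    if [ "$running" -eq "$batch" ]; then
      wait        # block until the whole round is finished
      running=0
    fi
  done
  wait            # added in this sketch: also collect the last, partial round
}

# Seconds left to sleep so that one full pass lasts at least $3 seconds.
remaining_wait() {
  local start=$1 end=$2 min_iteration=$3
  local elapsed=$((end - start))
  if [ "$elapsed" -lt "$min_iteration" ]; then
    echo $((min_iteration - elapsed))
  else
    echo 0
  fi
}
```

For example, `run_in_rounds 4 1 2 3 4 5 6` would start jobs 1 to 4 together, wait for all of them, and only then start 5 and 6. A pass that took 300 seconds with a 900-second minimum leaves `remaining_wait` reporting 600 seconds of sleep.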

Imagine a similar solution in pure PHP. It would have to be really hackish because PHP lacks parallel/multithreaded constructs. And needless to say, with this script you don’t have to worry about memory limits and timeouts (you can have a separate PHP configuration for the command-line environment).

Here is the script:

#! /bin/bash
#
# File:   feedapi.sh
# Author: Aron Novak
# Contact: [email protected]

# begin config

# if all the feeds are refreshed before min_iteration seconds have elapsed
# since the start of the iteration, wait out the remaining time
min_iteration=900

# refresh $paralell feeds in each round
paralell=4

## end config, do not modify below unless you know what to do

# check for sane parameter
if [ -z "$1" ]; then
  echo "Usage: ./feedapi.sh /path/to/drupal_instance/"
  exit 1
fi
# quote the path so directories with spaces work; stop if it is not accessible
cd "$1" || exit 1
command_avail=`drush  | grep "feedapi refresh" | wc -l`
if [ ! $command_avail -eq 1 ]; then
  echo "The directory is not a Drupal instance or FeedAPI Drush commands are not available."
  exit 1
fi

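# become a daemon: relaunch in the background with "--" as the second argument
if [ "x$2" != "x--" ]; then
  # the cd above may have invalidated relative paths in $0 and $1, so relaunch
  # from the directory the script was originally started in
  ( cd "$OLDPWD" && exec "$0" "$1" -- ) 1> /dev/null 2> /dev/null &
  exit 0
fi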

while [ 1 ]
  do
  start=`date +%s`
  i=0
  for nid in `echo "SELECT f.nid FROM feedapi f NATURAL JOIN node n WHERE n.status = 1" | drush sql cli | grep '^[0-9]'`
    do
    already=`ps aux | grep drush | grep $nid`
    if [ -z "$already" ]; then
      drush feedapi refresh $nid 2> /dev/null 1> /dev/null &
      pids[$i]=$!
      i=`expr $i + 1`
    fi
    if [ $i -eq $paralell ]; then
      end=`expr $paralell - 1`
      for p in `seq 0 $end`
        do
        wait ${pids[$p]}
      done;
      i=0
    fi
  done;
  end=`date +%s`
  elapsed=`expr $end - $start`
  if [ $elapsed -lt $min_iteration ]; then
    sleep `expr $min_iteration - $elapsed`
  fi
done;

It’s simple, right? With Drush and FeedAPI, you can really have fine-grained control over your feeds. For example, if you replace the SQL in the script above, you could filter out some of the feeds or restrict the refresh to a specific content type.
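As a rough sketch of such a replacement, the query could be built by a small helper that optionally adds a content-type condition. The helper name `feed_query` and the example type name `feed` are assumptions made for illustration; `n.type` is the node table's content-type column.

```shell
#!/bin/bash
# Sketch only: build the feed-selection query, optionally limited to one
# content type. feed_query is a hypothetical helper, not part of FeedAPI.
feed_query() {
  local node_type=${1:-}
  local sql="SELECT f.nid FROM feedapi f NATURAL JOIN node n WHERE n.status = 1"
  if [ -n "$node_type" ]; then
    # Accept only machine-name characters so the value is safe to embed.
    case $node_type in
      *[!a-z0-9_]*) echo "invalid content type: $node_type" >&2; return 1 ;;
    esac
    sql="$sql AND n.type = '$node_type'"
  fi
  echo "$sql"
}

# In the script, the output would be piped to Drush the same way as before:
#   feed_query feed | drush sql cli | grep '^[0-9]'
```

Called without an argument, the helper prints the same query the script already uses, so the default behaviour stays unchanged.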

The script switches to daemon mode when it starts, which means you get your prompt back and can log off while the script keeps running in the background. And don’t forget to comment out hook_cron() in feedapi.module; it’s no longer necessary.

Finally, you don’t want to try this on a shared hosting environment — I bet the administrator would terminate the script quickly : ) This method is meant for large FeedAPI deployments with VPS or dedicated boxes.

This post is also available in Hungarian.
