Sometimes The API Just Won't Do It

By Greg Harvey (greg), Fri, 2009-10-02 23:30

So I've been having lots of fun (read: horrible pain) this week thanks to some quirks of Drupal that only really present themselves when you are looping through, loading, saving and manipulating nodes quickly. The scenario for this sort of thing is normally (yup, you've probably guessed it) importing.

I had an ugly problem. I had to import 64,000 XML documents I received from a client into Drupal as nodes. Doesn't sound too bad? If it were one XML document per node, with everything I needed contained within each document, it wouldn't be. But actually there are more like four documents per node.

Why? When the client's IT team exported the data from their system, they produced one copy of each article for every category it was in.

As a result I had to parse the first document I came to, save *their* unique document ID somewhere - an ID found in all XML documents relating to that node - then continue on to the next document. I looped through the documents until I found another one with that ID, but this time I was only interested in the taxonomy data. Now this is where the fun starts.

The first problem was with node_load(). The node object is cached inside the function using a static variable. I didn't realise this, so I spent a good deal of time wondering where the hell some (not all) CCK data had gone - specifically file and nodereference fields. Fortunately, and thanks to some help in IRC, someone pointed out a little-known feature of the node_load() function. It has a $reset parameter (the third argument) that, when set to TRUE, clears that static node cache.

I changed my function call to look like this and, finally, my nodes started coming out right:

<?php
$my_node = node_load($nid, NULL, TRUE);
krumo($my_node);
?>
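
If you want to see the mechanism in isolation, here is a minimal standalone sketch of a static cache with a reset flag. It is not Drupal code: load_record() and the $GLOBALS['fake_db'] array are made-up names standing in for node_load() and the database, so treat it as an illustration of the pattern only.

```php
<?php
// Hypothetical stand-in for a database table; not part of Drupal.
$GLOBALS['fake_db'] = array(1 => 'first title');

/**
 * Loads a record and keeps it in a static variable between calls.
 * load_record() is a made-up name used only for this illustration.
 */
function load_record($id, $reset = FALSE) {
  static $cache = array();

  if ($reset) {
    // Throw away everything cached so far.
    $cache = array();
  }
  if (!isset($cache[$id])) {
    $cache[$id] = $GLOBALS['fake_db'][$id];
  }
  return $cache[$id];
}

$before = load_record(1);                 // primes the static cache
$GLOBALS['fake_db'][1] = 'second title';  // the "database" changes underneath
$stale = load_record(1);                  // still served from the cache
$fresh = load_record(1, TRUE);            // cache cleared, value re-read

print "$before / $stale / $fresh\n";
```

Running it prints "first title / first title / second title": the second call never notices the change, and only the call with the reset flag does.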

So that was one static variable caching issue dealt with. I thought, at this point, I was home and dry. How wrong can you be!

Remember that from this point on all I needed to do was load subsequent matching documents, pull out their category data and attach it to the corresponding node as a taxonomy term. So I've loaded my reset node, I've loaded the category from the XML document, I've looked up the corresponding taxonomy term and I've added it to $node->taxonomy and done a node_save().

Yet when I load the same node again to add the next term, the taxonomy data I added on the previous loop was gone! What the deuce??

Turns out the taxonomy_node_get_terms() function, used to populate the $node->taxonomy property of the node object, *also* caches the terms from the previous run in a static variable. However, it does not respect the reset from node_load() and worse, it has no reset parameter of its own.

(You don't want to know how many hours it took me to work this out.)

So what was happening? My terms were being saved successfully, but when I re-loaded the node object to apply more terms, a stale, cached set of terms came back with it. Because that set didn't include my previous update, saving the node again overwrote the term I had just saved, giving the impression it was never saved.
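
The sequence is easier to follow in a toy version. The sketch below is standalone PHP, not Drupal code: cached_terms(), save_terms() and the $GLOBALS['fake_term_node'] array are invented for the illustration, and it assumes (as described above) that a save replaces the stored terms with whatever list it is handed.

```php
<?php
// Stand-in for the term_node table: node ID => list of term IDs.
$GLOBALS['fake_term_node'] = array(42 => array());

// Made-up getter that caches per node in a static, with no way to reset it.
function cached_terms($nid) {
  static $terms = array();
  if (!isset($terms[$nid])) {
    $terms[$nid] = $GLOBALS['fake_term_node'][$nid];
  }
  return $terms[$nid];
}

// Made-up save routine: replaces the stored terms with the given list.
function save_terms($nid, array $tids) {
  $GLOBALS['fake_term_node'][$nid] = $tids;
}

// Each pass represents one XML document contributing one term.
foreach (array(7, 8, 9) as $new_tid) {
  $tids = cached_terms(42);  // always the empty list cached on the first pass
  $tids[] = $new_tid;
  save_terms(42, $tids);     // overwrites whatever the last pass stored
}

print_r($GLOBALS['fake_term_node'][42]);
```

After three passes only term 9 is left, because every pass started from the same stale, empty list.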

How to get around this? Ditch the API! =(

Here's how I got my taxonomy terms back, ignoring the node object since it contained an incorrect set and I couldn't change that:

<?php
$node = node_load($nid, NULL, TRUE);

// start with empty lists so a node without terms doesn't trip us up
$values = array('taxonomy' => array());
$existing_terms = array();

// existing taxonomy terms must be loaded from the db
// node_load can't get around caching problems with taxonomy.module
$result = db_query("SELECT * FROM {term_node}
  WHERE nid = %d
  AND vid = %d",
  $node->nid, $node->vid);
while ($term = db_fetch_object($result)) {
  $existing_terms[] = $term;
}

// save back the terms we just rescued
foreach ($existing_terms as $term) {
  $values['taxonomy'][] = $term->tid;
}

$node->taxonomy = $values['taxonomy'];
?>
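
From there the new term still has to go on the end of that list before saving. A rough sketch of that step, kept free of Drupal calls so it can be run on its own: merge_term_ids() is a hypothetical helper of my own naming, and the sample rows only imitate the shape of the objects db_fetch_object() returns from {term_node}.

```php
<?php
/**
 * Builds the list of term IDs to assign: the rescued ones plus a new one.
 * merge_term_ids() is a hypothetical helper, not a Drupal function.
 *
 * @param array $existing_rows Row objects with a ->tid property.
 * @param int $new_tid Term ID looked up for the current XML document.
 */
function merge_term_ids(array $existing_rows, $new_tid) {
  $tids = array();
  foreach ($existing_rows as $row) {
    $tids[] = (int) $row->tid;
  }
  $tids[] = (int) $new_tid;
  // Drop duplicates and reindex from zero.
  return array_values(array_unique($tids));
}

// Sample rows shaped like {term_node} records (values invented).
$rows = array(
  (object) array('nid' => 42, 'vid' => 42, 'tid' => 7),
  (object) array('nid' => 42, 'vid' => 42, 'tid' => 8),
);

$merged = merge_term_ids($rows, 9);     // new term gets appended
$unchanged = merge_term_ids($rows, 8);  // term already present, no duplicate
```

The result would then be assigned to $node->taxonomy in place of $values['taxonomy'] before calling node_save().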

Apparently all of this is going to be a lot smarter in Drupal 7, but for now, if you have static variable caching problems, specifically with taxonomy, there's not much you can do except forget about the API. It won't help you. In fact, quite the opposite.

So, lessons learned:

1. Odd behaviour where data appears to be missing/overwritten in Drupal 5 and 6 could well be a static variable issue - search for statics that might be in the way.

2. If you need to save a node, load it again and then manipulate the taxonomy, you *must* get the terms manually from the database.

3. Thankfully, node caching can be prevented by using the node_load() function's $reset parameter.

*phew*

What a week.