Main image: a long grain train of the Union Pacific Railroad crossing a bridge in Washington state, United States. Photo by Kabelleger / David Gubler (http://www.bahnbilder.ch), CC BY-SA 3.0.

Processing heavy/lengthy PHP operations in batches

Stop operating at the edge of PHP limits!

Mon, 2016-10-17 14:29, by pascal

Quite often, you'll need to perform operations on so many items that the work will exceed the resources available. We are going to look at some ways around this, and at how to loop over smaller batches.

Let's pretend you want to perform some kind of heavy lifting for each and every user of your website, and start by querying them all at once.

<?php
// Connect to our db and query all users at once.
$dsn = 'mysql:host=localhost;dbname=drupal';
$username = 'drupal';
$password = 'drupal';
$pdo = new PDO($dsn, $username, $password);
$string = "SELECT * FROM users";
$query = $pdo->prepare($string);
$query->execute();
// This is likely to fill the memory.
$results = $query->fetchAll();
// This is going to time out.
$i = 0;
foreach ($results as $result) {
  error_log("Processing item $i");
  // Perform lengthy operation.
  sleep(1);
  $i++;
}

Obviously, unless we have very few users to query, chances are this is just going to blow up right away with a nice OOM message, "Fatal error: Allowed memory size of XXX bytes exhausted", since we're trying to load them all into memory at once. So let's try to fix that up a bit.

<?php

// Connect to our db and query users 10 by 10.
$dsn = 'mysql:host=localhost;dbname=drupal';
$username = 'drupal';
$password = 'drupal';
$pdo = new PDO($dsn, $username, $password);
for ($i = 0; TRUE; $i++) {
  error_log("Processing pass $i");
  // Each pass fetches the next slice of 10 users.
  $offset = $i * 10;
  $string = "SELECT * FROM users LIMIT $offset,10";
  $query = $pdo->prepare($string);
  $query->execute();
  // This won't fill memory anymore.
  $results = $query->fetchAll();
  // Nothing to do, we've finished.
  if (!count($results)) {
    return;
  }
  // This is going to time out, still.
  foreach ($results as $result) {
    // Perform lengthy operation.
    sleep(1);
  }
}

Right, this looks better now, memory should be fine. But wait: as you'd expect, accessing this over HTTP, we're now hitting max_execution_time and the request times out.
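
As an aside, a couple of lines are enough to see the limit we're bumping into, along with the blunt set_time_limit() escape hatch, which only hides the problem (purely illustrative, not part of the fix):

<?php
// Purely illustrative: show the execution cap currently in effect.
// Web SAPIs usually default to 30 seconds; the CLI defaults to 0 (no limit).
error_log('max_execution_time: ' . ini_get('max_execution_time'));
// set_time_limit(0); // Would lift the cap for this script only.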

Easy fix: let's just move that over to the CLI, e.g. by writing a drush command. No execution timeout there. So far so good; in fact, it would even work in this example. In real life, chances are memory usage would grow, slowly but surely, until we hit OOM again.
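
For reference, such a wrapper could look roughly like the sketch below, using a Drush 8 style command definition living in a mymodule.drush.inc file (the module and command names are made up for illustration):

<?php

/**
 * Implements hook_drush_command().
 */
function mymodule_drush_command() {
  $items['mymodule-process-users'] = array(
    'description' => 'Process every user in small batches.',
    'callback' => 'mymodule_process_users',
  );
  return $items;
}

/**
 * Command callback: the batched loop from above would live here,
 * free of max_execution_time but not of memory growth.
 */
function mymodule_process_users() {
  // ... the for/foreach loop shown above goes here ...
}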

You can still find, here and there, advice about unsetting variables or NULLing them in your loops, destroying child objects in __destruct() methods and so on. This advice is mostly outdated and refers to PHP 5.2. Nowadays, we typically don't have to care about leaks or garbage collection and can just let PHP handle it, unless we're really trying to optimize some very custom app.

However, in a larger application mainly built for web content, there will always be something piling up data in memory on each cycle, along with parts of code and interactions that are beyond our control. Drupal, for example, makes extensive use of static variables. Try calling entity_load() on a few dozen thousand nodes in a loop while keeping an eye on memory usage. Even if you try hard to reset the static cache, something, somewhere, will most probably end up filling your memory. And even if we know that what we are doing today runs within the available memory_limit, how can we be confident that the same will still be true when the number of users has doubled in a few months?
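
Before we get to the answer, here is a rough way to watch that memory creep for yourself, assuming Drupal 7 style APIs (the chunk size and query are arbitrary and only there for illustration):

<?php
// Watch memory usage grow while loading nodes in a loop, Drupal 7 style.
$nids = db_query('SELECT nid FROM {node}')->fetchCol();
foreach (array_chunk($nids, 100) as $pass => $chunk) {
  entity_load('node', $chunk);
  // Resetting the static entity cache helps, but rarely brings usage all the way back down.
  entity_get_controller('node')->resetCache();
  error_log(sprintf('Pass %d: %.1f MB', $pass, memory_get_usage(TRUE) / 1024 / 1024));
}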

So how do we stay on the safe side? By splitting processes, that is, by processing only a small batch of data within each request/process. Let's have a look:

<?php

$dsn = 'mysql:host=localhost;dbname=drupal';
$username = 'drupal';
$password = 'drupal';
$pdo = new PDO($dsn, $username, $password);
// Do we have a batch "marker" present?
$i = !empty($_GET['pass']) ? (int) $_GET['pass'] : 0;
error_log("Processing pass $i");
// Connect to our db and query only the 10 users for this pass.
$offset = $i * 10;
$string = "SELECT * FROM users LIMIT $offset,10";
$query = $pdo->prepare($string);
$query->execute();
// This won't fill memory anymore.
$results = $query->fetchAll();
// Nothing to do, we've finished.
if (!count($results)) {
  return;
}
foreach ($results as $result) {
  // Perform lengthy operation.
  sleep(1);
}
// Increment our marker.
$i++;
// Send the browser back to process the next pass.
$redirect = $_SERVER['DOCUMENT_URI'] . '?pass=' . $i;
header("Location: $redirect");
exit();

See what we've done here? Each batch of 10 users is now treated separately, and everything gets reset memory-wise, since we issue a new request for each pass. Note: your browser would eventually stop following these redirects after a dozen or so, so a real implementation would use AJAX calls and/or an "http-equiv: Refresh" HTML meta tag as a fallback.
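
That meta tag fallback would boil down to something like the following, printed in place of the Location header once the current pass is done (again purely illustrative, reusing the same $redirect as above):

<?php
// Instead of header("Location: ..."), print a tiny page that asks the
// browser to reload itself with the next pass marker.
$redirect = $_SERVER['DOCUMENT_URI'] . '?pass=' . $i;
echo '<html><head>';
echo '<meta http-equiv="refresh" content="0; url=' . htmlspecialchars($redirect) . '">';
echo '</head><body>Processing pass ' . (int) $i . '...</body></html>';
exit();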

The equivalent for the CLI version would be quite similar, as shown below:

<?php

$dsn = 'mysql:host=localhost;dbname=drupal';
$username = 'drupal';
$password = 'drupal';
$pdo = new PDO($dsn, $username, $password);
// Do we have a batch "marker" present?
$i = !empty($argv[1]) ? (int) $argv[1] : 0;
error_log("Processing pass $i");
// Connect to our db and query only the 10 users for this pass.
$offset = $i * 10;
$string = "SELECT * FROM users LIMIT $offset,10";
$query = $pdo->prepare($string);
$query->execute();
$results = $query->fetchAll();
// Nothing to do, we've finished.
if (!count($results)) {
  return;
}
foreach ($results as $result) {
  // Perform lengthy operation.
  sleep(1);
}
// Increment our marker.
$i++;
// Call ourselves again to process the next pass.
exec('/usr/bin/php -f ' . __FILE__ . ' ' . $i);

What we end up with here is a parent process driving the processing and a child process for each pass, each child only consuming the memory needed to process its own batch of 10 users. No matter how many passes we run, we know resource consumption will stay stable.

Congrats, we have just re-invented the wheel!

It should be obvious that you MUST NOT reuse these snippets as is: they are wrong in many ways (bypassing PDO prepared statements, blindly trusting passed arguments, etc.) and won't really work beyond illustrating the general principles. You should instead use the ready-to-use mechanism your platform of choice most likely provides. In the case of Drupal, we're talking about the Batch API (and its Drush counterpart, drush_backend_batch_process()).
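
For the curious, the same job expressed with the Drupal 7 Batch API could look roughly like the sketch below; the callback names are invented, and the persistent sandbox is what replaces our hand-rolled pass marker:

<?php

/**
 * Kicks off the batch, e.g. from a form submit handler.
 */
function mymodule_process_users_start() {
  $batch = array(
    'title' => t('Processing users'),
    'operations' => array(
      array('mymodule_process_users_op', array()),
    ),
  );
  batch_set($batch);
  // On a form submission batch_process() is called for us; from Drush,
  // drush_backend_batch_process() drives the very same batch.
}

/**
 * Batch operation: process 10 users per request.
 */
function mymodule_process_users_op(&$context) {
  if (!isset($context['sandbox']['offset'])) {
    // The sandbox persists between requests: it replaces our "pass" marker.
    $context['sandbox']['offset'] = 0;
    $context['sandbox']['total'] = db_query('SELECT COUNT(uid) FROM {users}')->fetchField();
  }
  $uids = db_query_range('SELECT uid FROM {users}', $context['sandbox']['offset'], 10)->fetchCol();
  foreach ($uids as $uid) {
    // Perform the lengthy operation on each user.
  }
  $context['sandbox']['offset'] += 10;
  // Tell the Batch API how far along we are; 1 means "done".
  $context['finished'] = empty($context['sandbox']['total'])
    ? 1
    : min(1, $context['sandbox']['offset'] / $context['sandbox']['total']);
}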

Hopefully, though, this makes it clearer how these mechanisms operate behind the scenes, and what benefits we can expect from them.