Doing more with Drush sql-sanitize

Support team lead Dan Smith shows us how to anonymise personal data in Drupal using Drush.

 

The sites that we build very often end up containing large amounts of user data, much of which allows individual people to be identified. Not just names, but social media accounts, telephone numbers, addresses, occupation details etc. In the UK, user data comes under the Data Protection Act, particularly personal data, which the Act defines as "data which relate to a living individual who can be identified".

So, your site contains lots of personal data, but you want to give a database dump to a Drupal developer so that she can code some new feature for you. How can you handle this without getting on the wrong side of the Information Commissioner's Office (ICO)? You could manually delete all the personal data before you created the database dump, but then your developer won't have any realistic data to work with. Fortunately, there are ways of sanitising the personal data, a process that the ICO refers to as anonymisation. The ICO has a useful code of practice that is worth reading – Anonymisation: managing data protection risk code of practice.

The Drupal command line tool, Drush, comes with a command to sanitise the users table, drush sql-sanitize. While this is better than nothing, it only sanitises user passwords and email addresses. Obviously, it doesn't know about all of the additional fields storing personal data that you added when you built your site.

Note: Drush sql-sanitize will sanitise data in the current database – you do not want to play with this on a live or production server!

With that caveat out of the way, assume I have a site with profile fields set up for a user to enter a LinkedIn url (field_profile_linkedin_page), a telephone number (field_profile_telephone_number), and a twitter username (field_profile_twitter_handle). How can I tell Drush to sanitise these fields too?

Thankfully, Drush is kind enough to provide a hook, hook_drush_sql_sync_sanitize(), which makes this straightforward. Assuming you’ve got a custom module that you’re using already, my_module, you would add a drush include file, /my_module/my_module.drush.inc, containing:

/**
* Implements hook_drush_sql_sync_sanitize.
*/
function my_module_drush_sql_sync_sanitize($source) {
  // The query to sanitise field_data_field_profile_linkedin_page
  $linkedin_page_query = "UPDATE field_data_field_profile_linkedin_page SET field_profile_linkedin_page_value = CONCAT('http://uk.linkedin.com/user-name/', FLOOR(100000 + (RAND() * 900000)));";
 
  // The query to sanitise field_revision_field_profile_linkedin_page
  $linkedin_page_query .= "UPDATE field_revision_field_profile_linkedin_page SET field_profile_linkedin_page_value = CONCAT('http://uk.linkedin.com/user-name/', FLOOR(100000 + (RAND() * 900000)));";
 
  drush_sql_register_post_sync_op('sanitise_linkedin_page', dt('Sanitise field_profile_linkedin_page'), $linkedin_page_query);
 
  // The query to sanitise field_data_field_profile_telephone_number
  $telephone_number_query = "UPDATE field_data_field_profile_telephone_number SET field_profile_telephone_number_value = CONCAT('+44', FLOOR(1000000000 + (RAND() * 100000000)));";
 
  // The query to sanitise field_data_field_profile_telephone_number
  $telephone_number_query .= "UPDATE field_revision_field_profile_telephone_number SET field_profile_telephone_number_value = CONCAT('+44', FLOOR(1000000000 + (RAND() * 100000000)));";
 
  drush_sql_register_post_sync_op('sanitise_telephone_number', dt('Sanitise field_profile_telephone_number'), $telephone_number_query);
 
  // The query to sanitise field_data_field_profile_twitter_handle
  $twitter_query = "UPDATE field_data_field_profile_twitter_handle SET field_profile_twitter_handle_value =  SUBSTRING(MD5(RAND()) FROM 1 FOR 8);";
 
  // The query to sanitise field_revision_field_profile_twitter_handle
  $twitter_query .= "UPDATE field_revision_field_profile_twitter_handle SET field_profile_twitter_handle_value =  SUBSTRING(MD5(RAND()) FROM 1 FOR 8);";
 
  drush_sql_register_post_sync_op('sanitise_twitter_handle', dt('Sanitise field_profile_twitter_handle'), $twitter_query);
}

The hook works by allowing you to specify database queries that will be run as part of the Drush sql-sanitize process.

The first query added is used to sanitise the LinkedIn url:

$linkedin_page_query = "UPDATE field_data_field_profile_linkedin_page SET field_profile_linkedin_page_value = CONCAT('http://uk.linkedin.com/user-name/', FLOOR(100000 + (RAND() * 900000)));";

The database table storing the LinkedIn field data is field_data_field_profile_linkedin_page, and the column containing the actual url string is field_profile_linkedin_page_value. The query will step through each entry and replace the url with the output of CONCAT('http://uk.linkedin.com/user-name/', FLOOR(100000 + (RAND() * 900000))).

CONCAT is the SQL command to concatenate strings together, RAND returns a random floating point number between 0 and 1, and FLOOR returns the largest integer value not greater than the value it’s passed. Putting that together, the real LinkedIn urls are replaced with ‘http://uk.linkedin.com/user-name/<random number between 100,000 and 1,000,000>’.

LinkedIn seem to use a few different url formats, and this is just the one that I picked as an example.

Don’t forget that Drupal stores revisions of data, so as well as sanitising the field_data_field_profile_linkedin_page table, we need to repeat the query on the field_revision_field_profile_linkedin_page table too. This is easy to overlook!

Now that we have a query built to sanitise both of the LinkedIn tables, we let Drush know about it by calling drush_sql_register_post_sync_op().

drush_sql_register_post_sync_op('sanitise_linkedin_page', dt('Sanitise field_profile_linkedin_page'), $linkedin_page_query);

The first parameter isn’t actually used for anything at the moment. The second parameter is the text that Drush will display when the the sql-sanitize command is run, and the third parameter is the SQL query that you want to run.

Moving on to the telephone number field, we need a new sanitisation query:

$telephone_number_query = "UPDATE field_data_field_profile_telephone_number SET field_profile_telephone_number_value = CONCAT('+44', FLOOR(1000000000 + (RAND() * 100000000)));";

This is very similar to the LinkedIn query above, replacing all user telephone numbers with ‘+44<random number between 1000000000 and 1100000000>’. Again, don’t forget to do the same on the corresponding revision table before calling drush_sql_register_post_sync_op().

The final example is the twitter username field. This one is a little different as instead of a random number, I’ll generate a random 8 character string. One way to do this is to use the SUBSTRING, MD5 and RAND SQL functions.

$twitter_query = "UPDATE field_data_field_profile_twitter_handle SET field_profile_twitter_handle_value =  SUBSTRING(MD5(RAND()) FROM 1 FOR 8);";

This generates an MD5 hash of a random value, which we then take the first 8 characters from. Yet again, don’t forget to do the same thing on the corresponding revision table!

With our hook_drush_sql_sync_sanitize() implementation complete, we can try running it with Drush sql-sanitize. Don’t forget – you do not want to be running this on a production server!

$ drush sql-sanitize
 
The following post-sync operations will be done on the destination:
  * Reset passwords and email addresses in users table
  * Truncate Drupal's sessions table
  * Sanitize field_profile_linkedin_page
  * Sanitize field_profile_telephone_number
  * Sanitize field_profile_twitter_handle
 
Do you really want to sanitize the current database? (y/n):

All being well, you should see a new item in the bullet list for each call to drush_sql_register_post_sync_op() that you made in your hook.

After answering ‘y’, I checked my database. For the LinkedIn tables, I had entries like:

For the telephone values, I had entries like:

  • +441007009394
  • +441058298812
  • +441070465880

And for the twitter values, I had entries like:

  • 8845fb7e
  • 9af12ccd
  • 43df577c

Not the best twitter usernames in the world, but adequate for testing.

Now that you’ve got a sanitised database, you can export it using Drush sql-dump. Before you do that, though, you should be aware that Drupal’s cache tables in the database can also potentially leak information. To protect against this, you can instruct Drush sql-dump to export just the structure of certain tables, omitting the actual data.

You can specify these structure-only tables directly on the command line, using the --structure-tables-list option (see the sql-dump documentation for more information). Alternatively, you can add a list of tables to the $options['structure-tables'][‘your_ref’] array in your drushrc.php file. That way, you don’t have to type the list each time, just use the --structure-tables-key=your_ref option with sql-dump.

So, with the addition of a small chunk of code, we can generate a completely anonymised database dump very easily. If you keep this in mind during development and add a sanitisation query every time you add a new field which stores personal data, it’s even easier.

 

Main image by Mike Mozart, released on Flickr.com under the Creative Commons Attribution 2.0 Generic license.