Breaking words

Migrating away from legacy content management systems (CMS) can sometimes throw up interesting technical challenges. Recently we were involved in a migration from a CMS where URLs contained no word boundaries... here's how we transformed these unfriendly URLs and improved search engine optimisation.

Tue, 2015-10-20 13:39 by chris

I'm sure we're all aware that human-readable URLs are a good thing, not only for us but for search engines too. Take, for example, the following URL:

/topsubjects/businessandintellectualproperty/businessandmanagementstudies/yourstudyprogram

Whilst it's readable (with effort), it's quite unfriendly and does nothing to promote the content of the page.

Initially I thought it an impossible (or at least very complicated) task to programmatically split a word such as "businessandintellectualproperty" into its component parts: "business and intellectual property". There were literally thousands of URLs, so doing it manually was out of the question.

The exported data from the legacy CMS contained human-readable page titles and the HTML of the body content, so at least I had the option of using the page titles to generate the "your study program" part of the URL. But what about the other parts of the path? How could "businessandmanagementstudies" be transformed into "business-and-management-studies"?

Enter Wordbreaker! Wordbreaker and Indexer are utilities packaged with Sphinx (since version 2.1.1), an open source full-text search server. Given that you have some useful data somewhere (in my case, the combined HTML body content of all the pages in the CMS), Indexer can be used to create a frequency dictionary. That dictionary is then passed to Wordbreaker along with the string you want to split, such as "businessandmanagementstudies", and lo and behold, out pops "business and management studies". It's magic! Let me explain how to use both Indexer and Wordbreaker.

Creating a frequency dictionary with Indexer

The first thing you need to do is to identify the most useful data you have from which to create a frequency dictionary. A frequency dictionary is essentially a list of the unique words found in a body of data, each with a count of how often it appears. Common words appear often and so have a high frequency count, whilst uncommon words appear less often and thus have a low frequency count. Wordbreaker uses the frequency dictionary to judge how likely it is that the candidate words it finds in a string are the words you actually want.

To give you an idea of what a frequency dictionary looks like, I've created one using the content of this blog post, here's a snippet:

the 51
to 30
of 27
a 23
frequency 16
you 13
in 12
dictionary 11
content 11
and 9
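To see the idea in miniature, a frequency dictionary is really just a word count over a corpus. Here's a toy sketch in Python using collections.Counter; this is purely to illustrate the concept, not how Indexer works internally:

```python
from collections import Counter

# A tiny stand-in corpus; in practice this would be your whole site's content.
corpus = "the quick brown fox jumps over the lazy dog the fox"

# Count how often each word appears.
freq = Counter(corpus.split())

# Most frequent words first, like the dictionary Indexer produces.
for word, count in freq.most_common(3):
    print(word, count)
```

Scaled up to an entire website's content, these counts are exactly the signal Wordbreaker needs to prefer likely splits over unlikely ones.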

Clearly then, the richness of your source data and its relevance to the strings you're trying to split is an important factor in how successful the word splitting will be. In my migration, I used the combined content of the entire website to create the frequency dictionary, which thankfully proved quite successful. What could have been days of mind-numbing manual editing of URLs was reduced to a few hours of reviewing and fixing the few unsuccessful cases.

So, now you've identified your data, how do you create a frequency dictionary?

To use Indexer, you'll first need to create a Sphinx configuration file. This example shows how you can configure Sphinx to index data from an XML file:

source demo
{
  type = xmlpipe2
  xmlpipe_command = cat source.xml
  xmlpipe_fixup_utf8 = 1
}

index demo
{
  source = demo
  path = /tmp/demo
}

indexer
{
  mem_limit = 128M
}

Here we have Sphinx indexing the content of source.xml, but it could just as easily be configured to index content from a MySQL or PostgreSQL database. See the sphinx.dist.conf file that comes with Sphinx for examples of how to do this.
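For reference, a database-backed source would look something like this. The credentials, database, and table names here are made up, so treat it as a sketch and check sphinx.dist.conf for the full set of options:

```
source demo
{
  type      = mysql
  sql_host  = localhost
  sql_user  = cms
  sql_pass  = secret
  sql_db    = legacy_cms
  # One row per page; the column aliased "content" becomes the indexed field.
  sql_query = SELECT id, body AS content FROM pages
}
```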

Now let's take a look at the content of the source.xml file:

<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
    
  <sphinx:schema>
  <sphinx:field name="content"/>
  </sphinx:schema>
  
  <sphinx:document id="1">
    <content><![CDATA[Document content here]]></content>
  </sphinx:document>
  <sphinx:document id="2">
    <content><![CDATA[More content here]]></content>
  </sphinx:document>

  <sphinx:killlist>
    <id>1</id>
    <id>2</id>
  </sphinx:killlist>
  
</sphinx:docset>

For conciseness, the example contains only two documents with hardly any content; in reality, you'd want much more source content from which to generate a rich frequency dictionary.

Now, assuming you've installed Sphinx already, the command for producing the frequency dictionary is:

$ indexer --buildstops demo.dict 100000 --buildfreqs demo -c sphinx.conf

The --buildstops flag tells indexer to stop short of actually producing an index and to just produce the list of words. The --buildfreqs flag tells indexer to add the frequency count.

demo.dict is the name of the resulting frequency dictionary file.

demo is the name of the source to use (referred to in sphinx.conf, the configuration file to use for this operation).

We've now got everything we need to start splitting strings into their component words.

Using Wordbreaker

The command for doing so is:

$ echo businessandmanagementstudies | wordbreaker --dict demo.dict split

And the result:

business and management studies

I think that's pretty amazing and, more importantly, so did our client! We were able to pass each section of the URL to Wordbreaker, replace the spaces in the string it returned with hyphens, and transform URLs like this:

/topsubjects/businessandintellectualproperty/businessandmanagementstudies/yourstudyprogram

Into URLs like this:

/top-subjects/business-and-intellectual-property/business-and-management-studies/your-study-program
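The whole transformation can be scripted. Here's a sketch in Python; the splitter is injected as a function so the logic is easy to test, and the commented-out subprocess call shows how it might shell out to wordbreaker in practice (an assumption about your setup, not our exact migration code):

```python
def slugify_path(path, split_words):
    """Rebuild a run-together URL path, hyphenating each split segment.

    split_words: a function that takes "businessandmanagementstudies"
    and returns "business and management studies".
    """
    segments = [s for s in path.split("/") if s]
    return "/" + "/".join(
        split_words(segment).replace(" ", "-") for segment in segments
    )

# In production, split_words might shell out to wordbreaker, e.g.:
#   import subprocess
#   split_words = lambda s: subprocess.run(
#       ["wordbreaker", "--dict", "demo.dict", "split"],
#       input=s, capture_output=True, text=True).stdout.strip()

# Demo with canned wordbreaker output:
canned = {
    "topsubjects": "top subjects",
    "businessandmanagementstudies": "business and management studies",
    "yourstudyprogram": "your study program",
}
print(slugify_path(
    "/topsubjects/businessandmanagementstudies/yourstudyprogram",
    canned.get,
))
# prints: /top-subjects/business-and-management-studies/your-study-program
```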

I did a lightning talk at the October North West Drupal User Group; here are the slides - http://slides.com/matason/wordbreaker

And here's the blog post from Sphinx about Wordbreaker - http://sphinxsearch.com/blog/2013/01/29/a-new-tool-in-the-trunk-wordbrea...