Document Management in Drupal

Quick write-up of an interesting talk on managing documents with Drupal, from DrupalCamp Paris 2013

By Greg Harvey (greg), Sat, 2013-06-22 16:12

This weekend I'm hanging around at DrupalCamp Paris in the Microsoft campus, in the west of Paris on the banks of the Seine. I am writing a few blog posts around this event because, as I always find, there's a wealth of information and nice people here. I love DrupalCamps in general, and Paris is no exception!

This morning I attended an interesting presentation by Simon Göger, project manager at a company here in France called Actency, on how they use Drupal for document management. As a business delivering Drupal solutions we find document management always presents an interesting challenge. Should we recommend building it in Drupal? Should we integrate? If we integrate, what document management system should we recommend? SharePoint, Alfresco, KnowledgeTree, something else? To date we at Code Enigma have mostly explored SharePoint and Alfresco integration approaches, and of course, there's no single answer. But it was interesting to see Actency make the case for using Drupal without integration. In fact, they have built document management systems in Drupal for several large enterprises, including the world-famous Médecins Sans Frontières.

Simon began the talk with some interesting statistics (from IBM, if my notes are correct). If you weren't convinced of the importance of proper document management, this is a wake-up call: 

Apparently 7.5% of documents in a company, on average, are badly categorised or lost. Sound familiar? That number might even be a bit low! But here's the one that got me. It has been calculated that finding those documents again has an average cost of about $120 each (that's just under £80 at time of writing). So if an organisation has 10,000 documents, there are probably 750 missing in action and the cost of recovering them is around £60,000. Ouch! You do not want to be losing documents!
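To make that arithmetic concrete, here is a minimal sketch in Python. The 7.5% rate and the per-document cost come from the talk; the function name and the rounding of "just under £80" up to a flat £80 are my own assumptions for illustration.

```python
def lost_document_cost(total_docs, lost_rate=0.075, cost_per_doc=80.0):
    """Estimate how many documents go missing and what recovering them costs.

    cost_per_doc is in pounds; 80.0 is a rounded, assumed figure.
    """
    lost = round(total_docs * lost_rate)
    return lost, lost * cost_per_doc


lost, cost = lost_document_cost(10_000)
print(f"{lost} documents lost, roughly £{cost:,.0f} to recover")
```

Plug in your own document count and the number gets uncomfortable quite quickly.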


Archives: Image by Marino González, released under the Creative Commons Attribution-ShareAlike 2.0 Generic license.

So, you're going to want a document management system. If you're integrating with an existing system, the two main routes for custom development are the Services module and the CMIS module. CMIS seems to be the more popular of the two right now and supports CMIS-enabled systems right out of the box (in theory, though SharePoint support seems broken at the moment). There's also a CMIS Views module for exposing lists of content directly from a document management system.

But let's assume we have decided we're going to build a document management system with Drupal. What are the key components we'll need?

The first basic requirement is the ability to store and view the revision history of files. Drupal 7 does this in quite an elegant way right out of the box. Files are available as a field, and they can be stored privately so that documents cannot be accessed directly by people who do not have the correct permissions. They can be connected to any entity type for storing metadata (we'll come to that), and Drupal also handles revisioning, so old versions of files can be stored, easily accessed, restored from history, etc.
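If the idea of file revisioning is new to you, the model is roughly the one sketched below in Python. To be clear, this is not Drupal's API or anything shown in the talk; the class names, the example paths and the rule that a rollback is stored as a new revision are all my own assumptions, chosen only to illustrate the concept.

```python
from dataclasses import dataclass, field


@dataclass
class Revision:
    number: int
    path: str    # where this version of the file lives (hypothetical paths)
    author: str


@dataclass
class Document:
    title: str
    revisions: list = field(default_factory=list)

    def save(self, path, author):
        """Store a new version; older versions are kept, never overwritten."""
        rev = Revision(len(self.revisions) + 1, path, author)
        self.revisions.append(rev)
        return rev

    def current(self):
        """Return the most recent revision."""
        return self.revisions[-1]

    def restore(self, number):
        """Roll back by re-saving an old version as the newest revision."""
        old = self.revisions[number - 1]
        return self.save(old.path, old.author)


doc = Document("Annual report")
doc.save("private://reports/annual-v1.pdf", "greg")
doc.save("private://reports/annual-v2.pdf", "simon")
doc.restore(1)
print([r.path for r in doc.revisions])
```

The point is simply that nothing is ever thrown away: restoring an old version adds to the history rather than rewriting it.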

Next, you obviously want to classify and organise your documents. Drupal's taxonomy system has been a strong offer for a number of years, since at least Drupal 5 in fact. Nowadays it's even better because taxonomy terms are fieldable entities. For example, one of our clients has a vocabulary of vehicle manufacturers and models and we can attach manufacturer information (e.g. their logo, website, head office location, etc.) to that organisation's taxonomy term but still keep it as a category, something that wasn't really possible with Drupal 6. But I digress. Taxonomy, free tagging, automated tagging via web services like Calais, descriptions, alternative versions: all these things are easy to apply in Drupal to help you categorise your documents. I'd also add that the Taxonomy Manager module is available if your vocabularies start to become unmanageably large in the core interface, which is still somewhat limited.

Then come access permissions and organisational units. You probably have organisational units and lots of users, and perhaps you want to integrate users with existing systems (see our piece by Mark Davies on Microsoft Active Directory and Drupal). You might want to make private assets only available to certain teams as well, and control who can edit what, who can create content, who can review it and who can roll back revisions. The main module Simon mentioned for this kind of work was Taxonomy Access, though on some projects he said Actency had to create very customised permissions modules for particular clients. Personally, I'd add Organic Groups to the list of modules that could be very useful for this sort of thing, particularly in Drupal 7. In fact, I think I prefer to use "OG" for organisational units and grouping of content like that, because it gives you more complex management options and lots of bolt-on extras you might take advantage of.

You also need to manage your workflow. You probably want to ensure there's some level of workflow control, sign-off of assets before they are available across your organisation, these sorts of things. At Code Enigma we typically use Workbench for this kind of thing, to allow different states for documents, define the roles and users who can change those states, trigger Rules on transitions, etc. But Simon mentioned the Maestro module, which I'm very keen to find out more about. It certainly looks very interesting from the project page.
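For anyone who hasn't set up this kind of sign-off before, the underlying idea is a small state machine: a fixed set of states, and a list of which roles may move a document from one state to another. Here is a rough Python sketch of that idea. The state names, role names and function are hypothetical, and this is not how Workbench or Maestro are implemented.

```python
# Hypothetical states and roles, for illustration only.
TRANSITIONS = {
    ("draft", "needs_review"): {"author", "editor"},
    ("needs_review", "draft"): {"editor"},
    ("needs_review", "published"): {"editor"},
}


def change_state(current, target, role):
    """Return the new state, or raise if this role may not make the change."""
    allowed = TRANSITIONS.get((current, target))
    if allowed is None:
        raise ValueError(f"No transition from {current} to {target}")
    if role not in allowed:
        raise PermissionError(f"Role {role!r} cannot move {current} to {target}")
    return target


state = change_state("draft", "needs_review", "author")
state = change_state(state, "published", "editor")
print(state)
```

In this sketch an author can submit a draft for review but only an editor can publish it, which is the sort of rule you would normally configure through the module's interface rather than write by hand.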

Re-surfacing documents


Card Catalog: Image by Panatomix, released under the Creative Commons Attribution-ShareAlike 2.0 Generic license.

Finally, re-surfacing the documents in an efficient way is crucial. Simon talked about the Facet API, Solr Views and of course, the corresponding Solr integration module. These tools combined give us powerful options for displaying Solr search data within Drupal, affording the flexibility for highly customised search experiences that can be tailored to specific clients. I noted from the screenshots it seems Actency have spent some time making a nice file-manager interface using these tools.
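Under the hood, a faceted search like this boils down to an HTTP request to Solr with faceting switched on. The Python sketch below only builds such a URL and does not send it. The parameters (q, fq, wt, rows, facet, facet.field) are standard Solr query parameters, but the host, the core name "documents" and the field names are assumptions of mine, not Actency's setup.

```python
from urllib.parse import urlencode


def solr_facet_url(base_url, text, facet_fields, filters=()):
    """Build a Solr select URL that returns matches plus facet counts."""
    params = [("q", text), ("wt", "json"), ("rows", 10), ("facet", "true")]
    # One facet.field parameter per field we want counts for.
    params += [("facet.field", name) for name in facet_fields]
    # Each fq (filter query) narrows the result set, e.g. after a facet click.
    params += [("fq", query) for query in filters]
    return f"{base_url}/select?{urlencode(params)}"


url = solr_facet_url(
    "http://localhost:8983/solr/documents",     # hypothetical Solr core
    "budget",
    facet_fields=["department", "file_type"],   # hypothetical index fields
    filters=["file_type:pdf"],
)
print(url)
```

In a Drupal build you would not normally write this yourself; the Solr integration and Facet API modules generate the equivalent requests for you.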

And what of Solr itself? It is a very powerful indexing tool and it gets better with every release (we're now on Solr 4 at time of writing, and clustering has been implemented so an index across multiple servers is properly supported). You can configure it to search inside PDF files and other document types, and it can index metadata kept in Drupal's database. It is fast, scalable and, best of all, free open source software. Performance? Well, one of Simon's examples processed 6,000 individual documents every day, over 2 million documents a year, and the processing time is typically only about 2 hours a day, running on an overnight cycle. That's more than adequate for most people and if it isn't, no problem. Solr will scale up. There are even hosted, cloud-based Solr services you can buy so you don't have to worry about supporting and scaling it yourself.
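A quick back-of-the-envelope check on those figures, in Python. The only assumption I'm adding is that the overnight cycle runs every day of the year.

```python
docs_per_day = 6_000
hours_per_day = 2

# Assumes processing runs 365 days a year.
docs_per_year = docs_per_day * 365
docs_per_minute = docs_per_day / (hours_per_day * 60)
seconds_per_doc = (hours_per_day * 3600) / docs_per_day

print(docs_per_year, docs_per_minute, seconds_per_doc)
```

That works out at a little over a second per document, which squares with the "over 2 million documents a year" figure.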

So you can see, all the building blocks are there, and Actency have successfully convinced several large organisations Drupal is the way forward for their document management system. So what are the specific advantages Drupal might have over dedicated document management systems?

  • Guaranteed support through the community! There is a much larger set of committed Drupal developers than any of the dedicated document management systems can bring to bear, even the open source ones. It's the nature of the beast, the communities are smaller, albeit very specialised, so close support options are more limited.
  • A growing base. Drupal gets more popular all the time, some of the document management systems less so. Drupal feels more like a future-proof solution in that context.
  • It's proven. Big organisations are using it for knowledge and document management already.
  • It's PHP. Now, at this point some developers would say that's why it sucks, but if you're going to talk plain economics of the solution, PHP developers are simply cheaper than Java developers. There's no escaping that fact; look on any job board.
  • Flexibility is king. And Drupal is so much more flexible than all the document manager alternatives, primarily because it's not actually a document manager. Its API framework-meets-CMS architecture makes it very easy to bend to the will of a specific brief. And while 'boxed' document management solutions might offer more features at a glance, they generally fall down if you need to modify them.
  • Strong performance. Not something people often say about Drupal, when compared with other, lighter, PHP-based CMS, but try some of the Java-based document management competitors and it's a different story! Drupal is far easier to scale and tune than the competition in this market.

So, all told, Drupal is actually a pretty powerful document management system in the making. It will be cheaper, more scalable, easier to integrate and more flexible than any dedicated document management system. Simon highlighted that the two main missing components are WebDAV support and inline document integration.

However, I have an addition of my own on that second point. I'm aware (from conversations at Drupal CXO in Barcelona earlier in the year) that Kristof Van Tomme and team are working on Google Drive integration, as is our own Mark Davies in his labs time. This might fill a hole, as one thing SharePoint can do is allow you to edit documents inline, in your browser, save, version, collaborate, etc. This is a strong offer, but I know some large organisations (notably ITV) are transferring to corporate Google Drive accounts. A live-editing solution coupled to Google documents and embedded in Drupal is therefore already very possible, and this presents an opportunity for Drupal to match these abilities in some cases.

All that remains is to thank Simon for a very interesting session. More DrupalCamp Paris posts later.