Sunday, 1 January 2012

Document Management System review - Mayan EDMS

We're redoing all our web pages in Maths and Stats. This means getting rid of our old Plone 2.1 CMS and replacing it with a groovy new Django-based system using Arkestra - see previous blog entry for more on that.

There's a casualty though. We have an archive of old exam papers going back to 1997. These are sitting on our Plone server (technically as Zope File objects in the ZODB) and are accessed via some custom page templates I wrote so students can get them by calendar year, year of study, or search for a string.

Its a mess. I've been wanting to put them into a proper document management system for a while. Here's my requirements:
  • Users should be able to search by course code, year of study, and calendar year.
  • Users should be able to bulk download a zip file of selected PDFs
  • Admins should be able to bulk upload and enter metadata
I imagine Sharepoint can do this, but I'm no Sharepoint Wrangler and I don't really want to be one. There are a few open source solutions around, such as LogicalDoc and these all seem to be Java-based Enterprisey solutions. Often there's an open-source, or 'community', version, and then an Enterprise supported version with integration features such as drag n drop, OCR, versioning etc.

Then the other day I noticed Mayan EDMS. Its a Django-based DMS which looks like it might well do everything I want. And being open-source if it doesn't, I can have a go at making it do so.

Smooth as you like. The documentation was spot on, just download, get some dependencies with pip, and it all goes nicely in a virtual environment. The whole virtual env ends up about 90M big on my system, which is comparable to Java-based systems when they bundle the JVM with themselves (which they often seem to do).

The included file has options for development and production servers. For development you can run from the usual Django dev server, and for production you can collect all your static files into /static for serving from a web server. For test purposes I set the development mode and did the usual Django syncdb/migrate dance. I couldn't see what the default superuser name and password was, so I just created one with ./ createsuperuser and I was in. I was uploading documents in no time.

Can it do what I want? The version I downloaded currently couldn't. The big missing thing - anonymous read-only access. We don't want our students to need to login to download exam papers, but the permission system on Mayan EDMS doesn't allow anonymous access to anything.

I had a look at creating a new user with read-only permissions. The system has a very fine level of permission control, with lots of capabilities that can bet set on or off. The access to permission is via the usual Role/Group/User style of thing. You create a Role with permissions, then give that role to a User (or a Group of Users). So I created a 'Reader' role and a 'guest' user with that role, thinking we could use that for anonymous access. But the guest user could still create Folders. At that point I had a look at the code.

The code is all on  Github and it's incredibly well written, seemingly by one person. It uses lots of existing django apps for some of its functionality which is a good thing. It appears to comply with Python and Django best practices, is commented, and easy to find your way around. I found the Folder creation code and noticed there was no permission check like there was with, say, Document creation. I popped a quick issue on the issue tracker.

The next morning I had a response from Roberto. And that was over New Year's Eve to New Year's Day. He's already added a permission check for Folder creation on a development branch. Coming out in the next version. Not sure if that will also have support for fully anonymous users without logging in (Django returns AnonymousUser if this is the case, rather than a real User object) but it should be sufficient for us. I don't think I've ever had such a quick response from an Open Source dev for a while - and certainly never over New Year celebrations! Credit also goes to Github for its fantastic tracking and notification system, but most of the credit goes to Roberto himself!

You can define metadata sets in Mayan EDMS, so I defined a set for our exam papers and added the properties we want to classify our papers with - calendar year, year of study, and course code. Documents can also be tagged, so we could tag very old exams with 'obsolete' - since sometimes course codes are re-used for different courses altogether.

I'm not sure some of the features are working properly yet. In the document details view, there's a 'Contents' box that I thought would extract the text contents of the document (PDF, OCR'd image etc) but it's blank even for a text document. Searches for text that occurs in that text document don't return the document. Then I noticed one of my PDFs did have text in the box. But only one. Here the text was all run on together with no spaces, but the search engine could find 'scenario' in the text containing 'modelscenariowas', but it would also match on run-on strings like 'wehavealsofoundthat'. This might be a problem but I know it is hard to extract words from PDF files sometimes.

 I also tried the OCR functionality on an image but that returned no text too - looking in the Django admin showed me the OCR Document queue - with an entry in it with an error message: "get_image_cache_name() takes exactly 3 non-keyword argument (2 given)". I can run tessaract from the command line successfully so something isn't quite right. I'll have a look at these issues before reporting them on Github (and checking to see if they are fixed in dev versions!)

The other missing function I need as far as I can see is bulk downloads. From a search I'd like to click a few boxes and have a 'Download All' option, that gets everything in a ZIP (or tar.gz) file.

More Features?

The documentation for some of the features are a bit thin. But clicking around and looking at the source code revealed that it has:

  • Themes: change the look with a config option, several themes supplied
  • Versioning: keep old docs around. Not sure we need this.
  • History: show all the changes to files and metadata
  • Multilingual: with English, Spanish, Portuguese and Russian.
  • Smart Links, Indexes: I don't know what these are and I couldn't make them do anything. I'm sure they are wonderful.
  • Signatures: documents can be signed by uploading a signature file. I don't know how this works.
  • Public Keys: In the setup, you can upload GPG public keys which you can easily get by querying a keyserver. I'm not sure what these keys then let you do.

So overall I'm very impressed with it - it does a lot of what the Enterprisey, Java-based solutions try to give you but in a light python package. Just about the only Enterprise feature I've seen that's not here is true Filesystem integration via SMB or WebDAV, even though there's "Filesystem Integration" on the features list. I suppose we need to see what that means. I'm sure it's good!