Saturday, 29 September 2012

The Taarifa Project

I seem to have been drawn into helping a bit with The Taarifa Project for one reason or another - mostly the enthusiasm of Mark Iliffe, but also the chance to help out with an interesting project.

For those who don't know and are too lazy to click the link I thoughtfully provided, here's the elevator pitch:

The Taarifa Platform is an open source web application for information collection, visualization and interactive mapping. It allows people to collect and share their own stories using various mediums such as SMS, Web Forms, Email or Twitter, placing these reports into a workflow where they can be followed up and acted upon, while engaging citizens and communities.

It is currently deployed in Uganda, and further deployments are planned. The current codebase is a heavily modified fork of Ushahidi, but the application has outgrown that base. The new project is a reboot, and a new existence as a Django application.

Which suits me fine. I've done a lot of work with Django, including content management systems, applications for disease modelling and monitoring (that's my day job), and a social site that built up a linked system of interesting technology news items.

The Django version of Taarifa, djangorifa, is open source and hosted on GitHub. Getting the source and the requirements took about ten minutes. I first tried to get it running with SpatiaLite, but currently that doesn't work because SpatiaLite has limitations that mean it can't work out distances given lat-long coordinates. Seems a bit odd, since any schoolkid can do it with a calculator with sin and cos buttons. But never mind.
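
For the record, the schoolkid calculation I mean is the haversine great-circle distance. Here's a minimal sketch in Python, assuming a spherical Earth of radius 6371km (plenty good enough for this purpose):

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # great-circle distance between two lat-long points, in km
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat/2)**2 + cos(radians(lat1))*cos(radians(lat2))*sin(dlon/2)**2
    return 2 * 6371.0 * asin(sqrt(a))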

So I stuck PostGIS on my virtual box and away it went. Using the sample OpenStreetMap dump file supplied, I had a demo site working and could start pretend-reporting things.

There's clearly a lot to do. At the top level there's still plenty of functionality to rethink; I believe the PHP version had a complicated workflow for dealing with reports, so maybe that needs doing. There's also a lot of basic web site functionality to build. I thought of a few things:

  1. Translation. For something intended for developing countries it has to be multilingual. Django supports this out of the box: translations are supported in templates, in Python code, and even in URLs themselves, so a site could have URLs in several languages that all lead to the same effective page, with that page translated into the matching language. For example, /voiture/12 would show the page for car number 12 in French, /car/12 would show it in English, and /auto/12 would show it in German (there's a sketch of this after the list). Not sure how you handle pages where the same word is used in two languages...
  2. Search. Vital for this system. I've used Haystack integrated into Django-CMS, talking to a Whoosh backend. Pretty heavyweight, and it works well. I had a quick look for newer Django search projects and gave django-watson a test-drive (also sketched after the list). It worked straight out of the box. It uses Postgres' full-text search facilities, so the indexing is all handled in the database. Might be my new go-to Django search system.
  3. REST API. Data in is one thing, data out is another. For a community-centric system such as this, it would be great to let people get the data out and do that web 2.0 mash-up thing that you may remember from a few years ago.
  4. Spatial Analytics. Take the report data and look for hotspots, time trends and so on. Present it all as nice graphs and maps. Make that available via a WMS server for mash-up purposes.
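
On the translation point, here's roughly what Django 1.4's translatable URLs look like. This is just a sketch: the vehicles app and view are made up, and the actual French and German patterns would come from the translation catalogues:

# urls.py: marking the pattern for translation means /car/12,
# /voiture/12 and /auto/12 can all resolve to the same view,
# each in its own language
from django.conf.urls import url
from django.conf.urls.i18n import i18n_patterns
from django.utils.translation import ugettext_lazy as _

urlpatterns = i18n_patterns('',
    url(_(r'^car/(?P<number>\d+)/$'), 'vehicles.views.car_detail'),
)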
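
And django-watson really was as simple as it sounds. A sketch of what "straight out of the box" means, using a hypothetical Report model rather than Taarifa's actual schema:

# register a model with django-watson, build the index with
# ./manage.py buildwatson, and full-text search is one call
import watson

from reports.models import Report  # hypothetical app and model

watson.register(Report)

results = watson.search("broken water pump")
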
Check the project out if it looks interesting to you! http://taarifa.org/

Tuesday, 11 September 2012

Fixing Polygons with pprepair

At the OSGIS UK conference last week the prize for best presentation went to Ken Arroyo Ohori of TU Delft for his talk on "Automatically repairing polygons and planar partitions". The presentation, jointly credited to Hugo Ledoux and Martijn Meijers, detailed a method for cleaning up the dodgy polygons we've all probably seen at some point in our geo-lifetimes.

I've noticed that computational geometry theorists seem averse to publishing code - it's as if they don't want to sully the purity of their theory with dirty code, dealing with edge cases and arithmetic precision and all that. Not these guys. The code is on GitHub, and I was eager to try it out. There was a lot of buzz about integrating this with PostGIS or QGIS or R, the stumbling block seeming to be the requirement for the CGAL library.

Anyway, back at Lancaster, I checked out the code. Easy as:

git clone https://github.com/tudelft-gist/pprepair.git

and then 

make -f Makefile.linux

There were a couple of build problems, which I submitted as issues, but I managed to fix them myself before getting any response. Check these on the GitHub issue tracker.

Then I wanted a sample dataset to test it on. From the workshop I gave at the UseR! conference I remembered some awful polygons outlining local police neighbourhood regions; I'd used them as an example of bad data. Some crime locations had fallen into cracks between the neighbourhood polygons, and some were in locations covered by more than one polygon. Maybe you could escape justice by claiming you were outside any police jurisdiction this way?

Here's what a small region looks like in QGIS:
Notice the white spaces between some polygons, and also the overlapping made visible by setting transparency.

The pprepair process is currently a command-line tool, but its options are simple: an input shapefile, an output shapefile, and some repair parameters. To start with, though, all you need is "-fix".

pprepair -i neighbourhoods.shp -o fixed.shp -fix

Loading this into QGIS looks much nicer:
Everything is now clean: there are no overlaps and no holes (note the colours differ because they're generated by my semi-random topological colouring system). You now have a consistent topology that won't cause odd errors when you use other analysis techniques.

Ken's presentation claimed big wins in speed and memory usage over similar algorithms in both open-source and proprietary GIS packages. Once the licensing requirements of the CGAL library are sorted out, it would be great to see this algorithm pop up in all the open-source geospatial systems.


Tuesday, 10 July 2012

The World Wine Web

The ggmap package caused a buzz at UseR! last month. Well, anything gg-something plus anything to do with maps will do that. It lets you download map tiles to use as backgrounds for ggplot graphics, and it has some other useful functions, such as interfaces to Google's routing engine and geocoder.

Dave Smith of Revolution Analytics used it to map his local wineries - a map someone on Twitter called 'useful'. I disagree...

That map shows me one thing: that the wineries you don't need an appointment for tend to be close to the main roads, and the ones you do need an appointment for are up in the hills. It's a static, dead map. If you decide to visit a winery, you've then got to do some spatial query to find out its name and address. That's not very useful.

One poster on Dave's blog asked if it could be linked to Google Vis to produce a dynamic map. Well, we can do better than that: we can use R to create a web map with all open source tools and open data.

My off-CRAN package, webmaps, has two capabilities: it can download map tiles like ggmap, and it can create slippy web maps using OpenLayers. Recent changes to OpenLayers have broken the slippy map functionality, however, and I'm working on a dedicated replacement package called openlayr. But enough of the webmaps functionality still exists to give me a good start on a dynamic map of the wineries.

If it was all working, all I'd need to do is divide the wineries into those that need appointments and those that don't, and then do:

osmMap(layer(Appt,name="Yes",style=s1),layer(NoAppt,name="No",style=s2),outputDir="WineMap")

and I'd get something like this: a web map with two layers of points (the objects s1 and s2 define the colours). Each layer can be switched on or off via the checkboxes in the pullout menu, and you can zoom and pan. And if you click on a winery, up pops its information. Remember, apart from those s1 and s2 objects and a bit of standard dataframe manipulation, this is one line of R.

The background imagery is an OpenStreetMap tile layer, so I really need to add an attribution: Copyright OpenStreetMap Contributors, CC BY-SA. It's also possible to add other base layers, such as the oooh-inducing Stamen.com watercolour maps.

Once I get openlayr working nicely I can probably give you a quick script to do all this. For now, you can check out the map online and see for yourself. What happens is that each layer is converted to GML (I might change to GeoJSON for openlayr) and a simple JavaScript and HTML wrapper is created to load the layers.

With a bit more customisation you could make the popups link straight to the winery web pages! This is how easy it is to make a web map with R - and remember, you can push this onto a web server and anyone can see it.

I'm off now for a glass of... oh, my wine cellar is empty...

Wednesday, 30 May 2012

Crop Circles

There was a question on the GIS group on LinkedIn about creating circular buffers of a given area that are enclosed within a polygonal parcel. I reckon the way to do it is to try various values of the radius, computing the overlay each time, and repeating until you find the radius that gives the target area. The clipped area is strictly increasing with the radius, so as long as the parcel is big enough there will be a single unique solution between the min and max values. Simple. Here's the R code that does it:
library(sp)
library(rgeos)

clipCir <- function(parcel,x,y,area){
### if the parcel is smaller than the target area, we're stuffed:
  if(parea(parcel)<area){
    stop("can't be done, parcel too small")
  }
### radius is somewhere between a bit less than sqrt(area/pi)
### and the diagonal of the parcel's bounding box.
  bb = bbox(parcel)
  rmin = sqrt(area/pi)*0.99
  rmax = sqrt(diff(bb[1,])^2+diff(bb[2,])^2)
### this function returns the difference between the area
### of the clipped poly and the target area:
  f = function(r){
    area-clipAreaR(parcel,x,y,r)
  }
### uniroot computes f for r between rmin and rmax until f~=0,
### which is when the area of the clipped poly = the target area
  soln = uniroot(f,c(rmin,rmax))
### return the polygon
  poly = makeClipP(parcel,x,y,soln$root)
  poly
}

clipAreaR <- function(parcel,x,y,r){
### area of the circle of radius r at (x,y), clipped to the parcel
  return(parea(makeClipP(parcel,x,y,r)))
}

makeClipP <- function(parcel,x,y,r){
### create a polygon of a circle clipped to a polygon.
### the circle is approximated by 360 points, radius r, centre (x,y)
  theta = seq(0,2*pi,len=360)
### hoop jumping to make a SpatialPolygons object
  circ = Polygon(cbind(x+r*sin(theta),y+r*cos(theta)))
  pc = Polygons(list(circ),ID="A")
  spc = SpatialPolygons(list(pc))
  proj4string(spc)=proj4string(parcel)
### intersect them
  gIntersection(spc,parcel)
}

parea <- function(sp){
### compute the area of a spatial object by summing the
### areas of its component polygons
  sum(unlist(lapply(sp@polygons,function(x){x@area})))
}
The code relies on sp objects and the rgeos package; the uniroot function, from the stats package, does the root finding. Here it is in action - readOGR comes from the rgdal package, and xy is a previously chosen centre point (a vector of length two):

library(rgdal)
parcels = readOGR("Data/","counties")
p = clipCir(parcels[2,],xy[1],xy[2],0.3)
plot(parcels[2,])
plot(p,add=TRUE,lwd=4)
Here's one solution:
The thin outline is the county boundary, and the thick line is a polygon centred at xy[1],xy[2] with area 0.3 (actually 0.299984). It should be easy enough to convert this to any language that can handle geometry objects - the only tricky part might be writing a uniroot function. The easiest way is interval bisection: start at the min and max, compute the function at (min+max)/2, and decide whether the zero is left or right of the midpoint; there's a sketch below. This used to be an exercise for our first-year Fortran programming students...
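
In Python, say, a home-made bisection stand-in for uniroot only takes a few lines. A sketch, assuming f(lo) and f(hi) have opposite signs (which the area check in clipCir guarantees):

def bisect_root(f, lo, hi, tol=1e-9, max_iter=100):
    # find r in [lo,hi] where f crosses zero, by repeated halving
    flo = f(lo)
    for i in range(max_iter):
        mid = (lo + hi) / 2.0
        fmid = f(mid)
        if abs(fmid) < tol:
            break
        if (flo < 0) == (fmid < 0):
            lo, flo = mid, fmid   # the zero is to the right of mid
        else:
            hi = mid              # the zero is to the left of mid
    return mid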

Here's another example with a weirder-shaped parcel - here the required area was 0.005, and the returned polygon had an area of 0.004999925 units.
Some brief timings tell me I can do about 3000 of these a minute on my two-year-old home PC.

This is all good for a fixed centre, but someone mentioned the possibility that the centre isn't fixed, in which case you need some other condition to pin down the shape - maximising some compactness measure, perhaps. Given that the parcel could be a complex, concave polygon (possibly with holes), that's a very hard problem. But I think I've solved the original question.

Sunday, 20 May 2012

knitr + cactus + TwitterBootstrap + Jquery

A few notes and tips from a week of preparing some course notes.

I switched from Sweave to knitr because all the cool kids are doing it. And it's better. The caching has already saved me more time than I spent switching from Sweave to knitr, which wasn't much time at all. Win.

At first I was building a PDF from .Rnw sources, but then thought maybe HTML would be a better delivery platform. No need to open a PDF reader. So I figured out the .Rhtml syntax and it was done. Win.

But how to do my equations? MathJax. Problem solved. Almost pixel-perfect equations using LaTeX syntax. Users without Javascript get the LaTeX source, so it's not all lost on them. Win.

Making a simple HTML file is easy enough, but I want to make a few tutorial sessions and some other info, so I want to use a template engine. My current fave is cactus.py: you design Django templates, write pages that fill in sections such as 'title' or 'content' in the template, and cactus.py builds the static pages. These can then be served by a simple web server or straight from the file system. Win.

So instead of a full HTML file, I just got knitr to process a Django template chunk. However, this didn't give me the CSS for the source-code syntax highlighting, because there's no HEAD section in the HTML. Solved by copying the knitr CSS into a file and serving it from my template. Works fine, and serving CSS from files rather than inline in the HTML is a win in itself, because now the server or client can cache it. Win.

For style and structure, I just write my templates to use styles from the Twitter Bootstrap framework. Include the CSS and JS in your template base and you have good-looking responsive pages. Sure, they look like every other Twitter Bootstrap site, but the flexibility to style them is there too. Win.

Then I figured something real cool. The R syntax highlighting wraps all functions with a span of class "functioncall". One line of jQuery can turn all those spans into hyperlinks to information from my current favourite R documentation - the R Graphical Manual. Here's some code which might get mangled by Blogger's formatting and/or my inexperience trying to stick hypertext tags into a Blogger doc:

$("span.functioncall").replaceWith(function(){return '< a href="http://rgm2.lab.nig.ac.jp/RGM2/search.php?query='+$(this).text()+'" >'+$(this).text()+'< /a >'})

Now every function links to a search for that function name. It would be hard to link directly to exactly the right help page, since the system would have to know which package each function comes from, but this is a big help. Win.

So that was my weekend. Also, Chelsea FC. Win.

Sunday, 26 February 2012

Stopwatch in R

There was a discussion the other day on stackoverflow started by someone who wanted a stopwatch timer in R. It would start when they hit return, and stop when that line was finished executing. I guess they often have long-running jobs going and frequently think, "hmm, how long has that been running?".

I figured something could be done using R's mechanisms for hooking code into various places, but there seems to be no way to hook in before code runs. There's the addTaskCallback mechanism, but that happens at the end of the code, just before your prompt reappears. I wasted a bit of time trying to figure something out using that, and gave up.

So, the solution: a stopwatch function that evaluates its argument. You can do:

stopwatch(foo(a))
stopwatch({z=foo(a)})
stopwatch({x=foo(a);y=foo(b)})

and a handy little HH:MM:SS dialog will tick along until your code finishes. Then it stops, and you can close it.

The dialog is done with the tcltk package, using a one-second timer tick. There's just one main function, plus a timeForm function that produces the HH:MM:SS display from an elapsed time:

stopwatch <- function(expr){
# tcltk R expression timer function, Barry Rowlingson Feb 2012
  require(tcltk)

  start=Sys.time()
  tt <- tktoplevel()

### title the window with the deparsed expression being timed
  tkwm.title(tt,gsub(" ","",paste(deparse(substitute(expr)),sep="",collapse=";")))

  closeme <- function(){
    tkdestroy(tt)
  }

### a label showing the elapsed time, updated once a second
  labelText <- tclVar("Running...")
  label1 <- tklabel(tt,text=tclvalue(labelText))
  tkconfigure(label1,textvariable=labelText)
  tkgrid(label1)
  e = environment()
  z <- function () {
    tclvalue(labelText)=paste("Running: ",timeForm(Sys.time()-start))
### re-arm the one-second timer, keeping its id so it can be cancelled
    assign("sw", tcl("after", 1000, z), env=e)
  }
  quit.but <- tkbutton(tt,text="Close",command=closeme)
  tkgrid(quit.but)
  sw = tcl("after", 1000, z)

### when the expression finishes, cancel the timer and show the total
  finished=function(){
    tcl("after","cancel",sw)
    tclvalue(labelText)=paste("Done after: ",timeForm(Sys.time()-start))
  }

### evaluate the expression, stopping the clock even if it errors
  tryCatch(eval(expr),finally=finished())
}
 
timeForm <- function(t){
### format an elapsed time as HH:MM:SS. t can be a difftime
### (whatever its units) or a plain number of seconds.
  if(inherits(t,"difftime")) units(t) = "secs"
  t = round(as.numeric(t))
  s = t %% 60
  m = ((t-s)/60) %% 60
  h = t %/% (60*60)
  stopifnot(s+m*60+h*60*60==t)
  tstring = sprintf("%02d",c(h,m,s))
  return(paste(tstring,collapse=":"))
}


This code is hereby released for anyone to do anything with, but keep the credit in the comment unless you've really changed a lot and it's unrecognisable from my original...

I can think of a few improvements. Possibly you might want to run a few things on different lines and have a cumulative stopwatch - a bit like split/lap times. You might then need a different paradigm, something like:

s = stopwatch()
timeOn(s,foo(a))
timeOn(s,foo(b))
foo(c)
timeOn(s,foo(c))

This would create a TclTk dialog with a timer that only runs within the timeOn evaluations.

Another possibility is a text-based timer rather than a GUI. You might also want it to return the time taken - currently it returns the value of the expression. Go ahead, knock yourself out.

Sunday, 1 January 2012

Document Management System review - Mayan EDMS

We're redoing all our web pages in Maths and Stats. This means getting rid of our old Plone 2.1 CMS and replacing it with a groovy new Django-based system using Arkestra - see previous blog entry for more on that.

There's a casualty though. We have an archive of old exam papers going back to 1997. These are sitting on our Plone server (technically as Zope File objects in the ZODB) and are accessed via some custom page templates I wrote so students can get them by calendar year, year of study, or search for a string.

It's a mess. I've been wanting to put them into a proper document management system for a while. Here are my requirements:
  • Users should be able to search by course code, year of study, and calendar year.
  • Users should be able to bulk download a zip file of selected PDFs
  • Admins should be able to bulk upload and enter metadata
I imagine SharePoint can do this, but I'm no SharePoint wrangler and I don't really want to be one. There are a few open source solutions around, such as LogicalDoc, and these all seem to be Java-based Enterprisey solutions. Often there's an open-source, or 'community', version, and then a supported Enterprise version with integration features such as drag-and-drop, OCR, versioning and so on.

Then the other day I noticed Mayan EDMS. It's a Django-based DMS which looks like it might well do everything I want. And being open source, if it doesn't I can have a go at making it do so.

Installation
Smooth as you like. The documentation was spot on: just download, get some dependencies with pip, and it all goes nicely into a virtual environment. The whole virtualenv ends up about 90MB on my system, which is comparable to Java-based systems when they bundle the JVM with themselves (which they often seem to do).

Setup
The included settings.py file has options for development and production servers. For development you can run from the usual Django dev server, and for production you can collect all your static files into /static for serving from a web server. For test purposes I set development mode and did the usual Django syncdb/migrate dance. I couldn't see what the default superuser name and password were, so I just created one with ./manage.py createsuperuser and I was in. I was uploading documents in no time.

Features
Can it do what I want? The version I downloaded currently can't. The big missing thing is anonymous read-only access: we don't want our students to have to log in to download exam papers, but the permission system in Mayan EDMS doesn't allow anonymous access to anything.

I had a look at creating a new user with read-only permissions. The system has a very fine level of permission control, with lots of capabilities that can be set on or off. Access is governed by the usual Role/Group/User arrangement: you create a Role with permissions, then give that role to a User (or a Group of Users). So I created a 'Reader' role and a 'guest' user with that role, thinking we could use that for anonymous access. But the guest user could still create Folders. At that point I had a look at the code.

Code
The code is all on GitHub and it's incredibly well written, seemingly by one person. It uses lots of existing Django apps for some of its functionality, which is a good thing. It appears to comply with Python and Django best practices, is commented, and is easy to find your way around. I found the Folder creation code and noticed there was no permission check like there was with, say, Document creation. I popped a quick issue on the issue tracker.

Author
The next morning I had a response from Roberto - and that was over New Year's Eve and New Year's Day. He's already added a permission check for Folder creation on a development branch, coming out in the next version. I'm not sure if that will also support fully anonymous users who never log in (Django returns an AnonymousUser in that case, rather than a real User object), but it should be sufficient for us. I haven't had such a quick response from an open source dev in a long time - and certainly never over New Year celebrations! Credit also goes to GitHub for its fantastic tracking and notification system, but most of the credit goes to Roberto himself!

Metadata
You can define metadata sets in Mayan EDMS, so I defined one for our exam papers and added the properties we want to classify papers by: calendar year, year of study, and course code. Documents can also be tagged, so we could tag very old exams as 'obsolete', since course codes are sometimes re-used for different courses altogether.

Problems
I'm not sure some of the features are working properly yet. In the document details view there's a 'Contents' box that I thought would show the extracted text of the document (PDF, OCR'd image, etc.), but it's blank even for a text document, and searches for text that occurs in that document don't return it. Then I noticed that one of my PDFs did have text in the box - but only one. There the text was all run together with no spaces, yet the search engine could still find 'scenario' in the text containing 'modelscenariowas' - though it would also match run-on strings like 'wehavealsofoundthat'. This might be a problem, but I know it's sometimes hard to extract words from PDF files.

I also tried the OCR functionality on an image, but that returned no text either. Looking in the Django admin showed me the OCR Document queue, with an entry in it bearing an error message: "get_image_cache_name() takes exactly 3 non-keyword argument (2 given)". I can run tesseract from the command line successfully, so something isn't quite right. I'll have a look at these issues before reporting them on GitHub (and check whether they're fixed in the dev version!)

The other missing function I need, as far as I can see, is bulk download. From a search, I'd like to tick a few boxes and have a 'Download All' option that gets everything in a ZIP (or tar.gz) file.
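
For what it's worth, here's the kind of Django view I have in mind. This is only a rough sketch against a hypothetical Document model with a file field and a label - not Mayan's actual schema:

import zipfile
from io import BytesIO

from django.http import HttpResponse
from documents.models import Document  # hypothetical app and model

def download_zip(request):
    # ?id=3&id=7&id=12 - the boxes ticked on the search results page
    ids = request.GET.getlist("id")
    docs = Document.objects.filter(pk__in=ids)
    buf = BytesIO()
    zf = zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED)
    for doc in docs:
        zf.writestr(doc.label, doc.file.read())  # assumed field names
    zf.close()
    response = HttpResponse(buf.getvalue(), content_type="application/zip")
    response["Content-Disposition"] = "attachment; filename=documents.zip"
    return response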

More Features?

The documentation for some of the features is a bit thin. But clicking around and looking at the source code revealed that it has:

  • Themes: change the look with a config option, several themes supplied
  • Versioning: keep old docs around. Not sure we need this.
  • History: show all the changes to files and metadata
  • Multilingual: with English, Spanish, Portuguese and Russian.
  • Smart Links, Indexes: I don't know what these are and I couldn't make them do anything. I'm sure they are wonderful.
  • Signatures: documents can be signed by uploading a signature file. I don't know how this works.
  • Public Keys: In the setup, you can upload GPG public keys which you can easily get by querying a keyserver. I'm not sure what these keys then let you do.

So overall I'm very impressed with it - it does a lot of what the Enterprisey, Java-based solutions try to give you, but in a lightweight Python package. Just about the only Enterprise feature I haven't seen is true filesystem integration via SMB or WebDAV, even though there's "Filesystem Integration" on the features list. I suppose we'll have to see what that means. I'm sure it's good!