Tuesday, 11 September 2012

At the OSGIS UK conference last week the prize for best presentation went to Ken Arroyo Ohori of TU Delft for his talk on "Automatically repairing polygons and planar partitions". The presentation, jointly credited to Hugo Ledoux and Martijn Meijers, detailed a method for cleaning up the kind of dodgy polygons we've all probably seen at some point in our geo-lifetimes.
I've noticed that computational geometry theorists seem averse to publishing code - it's as if they don't want to sully the purity of their theory with dirty code, dealing with edge cases and arithmetic precision and all that. Not these guys. The code is on GitHub, and I was eager to try it out. There was a lot of buzz about integrating this with PostGIS or QGIS or R, the stumbling block seeming to be the requirement for the CGAL library.
Anyway, back at Lancaster, I checked out the code. Easy as:
git clone https://github.com/tudelft-gist/pprepair.git
and then
make -f Makefile.linux
There were a couple of build problems, which I submitted as issues, but I managed to fix them myself before getting a response. You can check these on the GitHub issue tracker.
Then I wanted a sample dataset to test it on. From the workshop I gave at the UseR! conference I remembered some awful polygons outlining local police neighbourhood regions. I'd used it as an example of bad data. Some crime locations had fallen into cracks between the neighbourhood polygons, and some were in locations covered by more than one polygon. Maybe you could escape justice by claiming you were outside a police jurisdiction this way?
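(As an aside, you can find such points with sp's overlay machinery. A minimal sketch, assuming hypothetical crimes and hoods objects - a SpatialPointsDataFrame of crime locations and the neighbourhood SpatialPolygonsDataFrame:)

library(sp)
# for each crime point, count how many neighbourhood polygons contain it:
# 0 means it fell into a crack, 2 or more means an overlap.
# 'crimes' and 'hoods' are assumed objects, not real data here
hits <- over(crimes, hoods, returnList = TRUE)
ncover <- sapply(hits, nrow)
inCrack <- crimes[ncover == 0, ]
inOverlap <- crimes[ncover > 1, ]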
Here's what a small region looks like in QGIS:
Notice the white spaces between some polygons, and also the overlaps, made visible by setting transparency.
The pprepair process is currently a command-line tool, but its options are simple. An input shapefile, an output shapefile and some repair parameters. But to start, all you need is "-fix".
pprepair -i neighbourhoods.shp -o fixed.shp -fix
Loading this into QGIS looks much nicer:
Everything is now clean: there are no overlaps and no holes (note the colours are generated by my semi-random topological colouring system, so they differ from before). You now have a consistent topology that won't cause odd errors when using other analysis techniques.
Ken's presentation claimed big wins on speed and memory usage over similar algorithms in both open-source and proprietary GIS packages. Once the licensing requirements of the CGAL library are solved, it would be great to see this algorithm pop up in all the open-source geospatial systems.
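Until then, you can always just shell out to the binary from R. A minimal sketch of a wrapper (the function is mine, not part of pprepair; it assumes the binary is on your PATH and that rgdal is installed):

library(rgdal)

# hypothetical wrapper: run the pprepair binary, then read the
# repaired shapefile back in as a Spatial object
pprepair <- function(infile, outfile, extra = "-fix"){
  status <- system2("pprepair", args = c("-i", infile, "-o", outfile, extra))
  if(status != 0) stop("pprepair failed with status ", status)
  layer <- tools::file_path_sans_ext(basename(outfile))
  readOGR(dirname(outfile), layer)
}

fixed <- pprepair("neighbourhoods.shp", "fixed.shp")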
Tuesday, 10 July 2012
The World Wine Web
The ggmap package caused a buzz at UseR! last month. Well, anything gg-something plus anything to do with maps will do that. It lets you download map tiles to use as backgrounds to ggplot graphics, and has some other useful functions, such as interfaces to Google's routing engine and geocoder.
Dave Smith of Revolution Analytics used it to map his local wineries, a map someone on twitter called 'useful'. I disagree...
That map shows me one thing - that the wineries you don't need an appointment for tend to be close to the main roads, and the ones you do need an appointment for are up in the hills. It's a static, dead map. If you decide to visit a winery, you've then got to do some spatial query to find out the name and address. That's not very useful.
One poster on Dave's blog asked if it could be linked to Google Vis to produce a dynamic map. Well, better than that. We can use R to create a web map with all open source tools and open data.
My off-CRAN package, webmaps, has two capabilities - it can download map tiles like ggmap, and it can create slippy web maps using OpenLayers. Recent changes to OpenLayers have, however, broken the slippy map functionality, and I'm working on a dedicated replacement package called openlayr. However, enough of the webmaps functionality exists to give me a good start on making a dynamic map of the wineries.
If it was all working, all I'd need to do is to divide the wineries into those that need appointments and those that don't, and then do:
osmMap(layer(Appt,name="Yes",style=s1),layer(NoAppt,name="No",style=s2),outputDir="WineMap")
and I'd get something like this:
- a web map with two layers of points (the objects s1 and s2 define the colours). Each layer can be toggled on or off via the checkboxes in the pullout menu, and you can zoom and pan. And if you click on a winery...
up pops the information. Remember, apart from those s1 and s2 objects and a bit of standard dataframe manipulation, this is one line of R.
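(That dataframe manipulation is just a split on the appointment column; a sketch, assuming a hypothetical wineries SpatialPointsDataFrame with an Appointment column:)

# split the wineries by whether an appointment is needed;
# 'wineries' and its 'Appointment' column are assumptions
Appt   <- wineries[wineries$Appointment == "Yes", ]
NoAppt <- wineries[wineries$Appointment == "No", ]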
The background imagery is an OpenStreetMap tile background - so I really need to add an attribution to this. Copyright OpenStreetMap Contributors, CC BY-SA. It's also possible to add other base layers, such as the oooh-inducing Stamen.com watercolour maps.
Once I get openlayr all working nicely I can probably give you a quick script to do all this. For now, you can check out the map online and see for yourself. What happens is that each layer is converted to GML (I might change to using GeoJSON for openlayr) and a simple Javascript and HTML wrapper is created to load the layers.
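(If I do switch to GeoJSON, rgdal can already write it; a sketch of that step, not webmaps' actual internals - the filename is illustrative:)

library(rgdal)
# write one layer of points out as a GeoJSON file
writeOGR(Appt, dsn = "appt.geojson", layer = "Appt", driver = "GeoJSON")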
With a bit more customisation you could make the popups automatically link to the winery web page! This is how easy it is to make a web map with R - remember you can push this onto a web server and anyone can see it.
I'm off now for a glass of... oh, my wine cellar is empty...
Wednesday, 30 May 2012
Crop Circles
There was a question on the GIS group on LinkedIn about creating circular buffers that are enclosed within a polygonal parcel yet of a given size. I reckon the way to do it is to try various values of the radius, computing the overlay, and repeating until you find the radius that gives the target area. The area is strictly increasing with radius, and as long as the parcel is big enough there is a single unique solution between the min and max values. Simple. Here's the R code that does it:
library(sp)
library(rgeos)

clipCir <- function(parcel,x,y,area){
  ### if the parcel is smaller than the target area, we're stuffed:
  if(parea(parcel)<area){
    stop("can't be done, parcel too small")
  }
  ### radius is somewhere between a bit less than sqrt(a/pi)
  ### and the diagonal of the parcel.
  bb = bbox(parcel)
  rmin = sqrt(area/pi)*0.99
  rmax = sqrt(diff(bb[1,])^2+diff(bb[2,])^2)
  ### this function returns the difference between the area
  ### of the clipped poly and the target area:
  f = function(r){
    area-clipAreaR(parcel,x,y,r)
  }
  ### uniroot computes f for r between rmin and rmax until f~=0,
  ### which is when the area of the clipped poly = the target area
  soln = uniroot(f,c(rmin,rmax))
  ### return the polygon
  poly = makeClipP(parcel,x,y,soln$root)
  poly
}

clipAreaR <- function(parcel,x,y,r){
  return(parea(makeClipP(parcel,x,y,r)))
}

makeClipP <- function(parcel,x,y,r){
  ### create a polygon of a circle clipped to a polygon.
  ### the circle has 360 points round it, radius r, centre (x,y)
  theta = seq(0,2*pi,len=360)
  ### hoop jumping to make a SpatialPolygons object
  c = Polygon(cbind(x+r*sin(theta),y+r*cos(theta)))
  pc = Polygons(list(c),ID="A")
  spc = SpatialPolygons(list(pc))
  proj4string(spc)=proj4string(parcel)
  ### intersect them
  gIntersection(spc,parcel)
}

parea <- function(sp){
  ### compute the area of a spatial object by summing the
  ### area of the components
  sum(unlist(lapply(sp@polygons,function(x){x@area})))
}

The code relies on sp objects and the rgeos package. The uniroot function from base R's stats package does the root finding.
library(rgdal)
parcels = readOGR("Data/","counties")
p = clipCir(parcels[2,],xy[1],xy[2],0.3)
plot(parcels[2,])
plot(p,add=TRUE,lwd=4)

Here's one solution:
The thin outline is the county boundary, and the thick line is a polygon centred at xy[1],xy[2] with area 0.3 (actually 0.299984). It should be easy enough to convert this to any language that can handle geometry objects - the only tricky part might be writing a uniroot function - the easiest way is to start at the min and max, and do interval bisection - compute the function at (min+max)/2 and then decide whether the zero is left or right of the midpoint. This used to be an exercise for our first-year Fortran programming students...
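If your language has no root-finder, that bisection really is only a few lines. A sketch in R (my own illustration - uniroot already does this for you):

# interval bisection: f must change sign between lo and hi
bisect <- function(f, lo, hi, tol = 1e-9, maxit = 100){
  for(i in 1:maxit){
    mid <- (lo + hi) / 2
    if(abs(f(mid)) < tol) return(mid)
    # keep whichever half still contains the sign change
    if(sign(f(lo)) == sign(f(mid))) lo <- mid else hi <- mid
  }
  mid
}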
Here's another example with a weirder-shaped parcel - here the required area was 0.005, and the returned polygon came out at 0.004999925 units.
Some brief timings tell me you can do about 3000 of these a minute on my two-year-old home PC.
This is all good for a fixed centre, but someone mentioned the possibility that the centre isn't fixed, in which case you need some other condition for the shape you are finding - maximising some compactness parameter perhaps. Given that the parcel could be a complex, concave polygon (with holes?) this is a very hard problem. But I think I've solved the original question.
Sunday, 20 May 2012
knitr + cactus + TwitterBootstrap + Jquery
A few notes and tips from a week of preparing some course notes.
I switched from Sweave to knitr because all the cool kids are doing it. And it's better. The caching has already saved me more time than I spent switching from Sweave to knitr, which wasn't much time at all. Win.
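(For anyone following along, the caching is just a chunk option. A minimal example, assuming a reasonably recent knitr:)

library(knitr)
# cache every chunk: a chunk is only re-run when its code changes
opts_chunk$set(cache = TRUE)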
At first I was building a PDF from .Rnw sources, but then thought maybe HTML would be a better delivery platform. No need to open a PDF reader. So I figured out the .Rhtml syntax and it was done. Win.
But how to do my equations? MathJax. Problem solved. Almost pixel-perfect equations using LaTeX syntax. Users without Javascript get the LaTeX source, so it's not all lost on them. Win.
Making a simple HTML file is easy enough, but I want to make a few tutorial sessions, and some other info, so I want to use a template engine. My current fave is cactus.py - you design Django templates, then write your pages to fill in sections such as 'title' or 'content' in the template, then cactus.py will build the static pages. These can then be served by a simple web server or straight from the file system. Win.
So instead of a full HTML file, I just got knitr to process a Django template chunk. However, this didn't give me the CSS for the source-code syntax highlighting, because there's no HEAD section in the HTML. Solved by copying the knitr CSS into a file and serving it up from my template. Works fine, and serving CSS from files rather than inline in HTML is a win because now the server or client can cache it. Win.
For style and structure, I just write my templates to use styles from the Twitter Bootstrap framework. Include the CSS and JS in your template base and you have good-looking responsive pages. Sure, they look like every other Twitter Bootstrap site, but the flexibility to style them is there too. Win.
Then I figured out something really cool. The R syntax highlighting wraps all function calls in a span of class "functioncall". One line of jQuery can turn all those spans into hyperlinks to information from my current favourite R documentation source - the R Graphical Manual. Here's some code, which might get mangled by Blogger's formatting and/or my inexperience sticking hypertext tags into a Blogger doc:
$("span.functioncall").replaceWith(function(){return '< a href="http://rgm2.lab.nig.ac.jp/RGM2/search.php?query='+$(this).text()+'" >'+$(this).text()+'< /a >'})
Now every function links to the search results for that function name. Linking directly to exactly the right help page would be hard, since the system would have to know which package each function came from to find it on the RGM, but this is a big help. Win.
So that was my weekend. Also, Chelsea FC. Win.
Sunday, 26 February 2012
Stopwatch in R
There was a discussion the other day on stackoverflow started by someone who wanted a stopwatch timer in R. It would start when they hit return, and stop when that line was finished executing. I guess they often have long-running jobs going and frequently think, "hmm, how long has that been running?".
I figured something could be done using R's mechanisms for hooking code into various places, but there seems to be no way to hook in before code runs. There's the task callback mechanism (addTaskCallback), but that fires at the end of the code, just before your prompt appears. I wasted a bit of time trying to figure something out using that, and gave up.
So, the solution: a stopwatch function that evaluates its argument. You can do:
stopwatch(foo(a))
stopwatch({z=foo(a)})
stopwatch({x=foo(a);y=foo(b)})
and a handy little HH:MM:SS dialog will tick along until your code is finished. Then it will stop, and you can close it.
The dialog is done with the tcltk package, using a one-second timer tick. There's just the one main function, and a timeForm function that produces the HH:MM:SS display from a time in seconds:
stopwatch <- function(expr){
  # tcltk R expression timer function, Barry Rowlingson Feb 2012
  require(tcltk)
  start = Sys.time()
  tt <- tktoplevel()
  # title the window with the deparsed expression
  tkwm.title(tt, gsub(" ","",paste(deparse(substitute(expr)),sep="",collapse=";")))
  closeme <- function(){
    tkdestroy(tt)
  }
  labelText <- tclVar("Running...")
  label1 <- tklabel(tt, text=tclvalue(labelText))
  tkconfigure(label1, textvariable=labelText)
  tkgrid(label1)
  e = environment()
  # update the elapsed-time label and re-arm the one-second timer
  z <- function(){
    tclvalue(labelText) = paste("Running: ", timeForm(Sys.time()-start))
    assign("sw", tcl("after", 1000, z), envir=e)
  }
  quit.but <- tkbutton(tt, text="Close", command=closeme)
  tkgrid(quit.but)
  sw = tcl("after", 1000, z)
  # cancel the timer and show the final elapsed time
  finished = function(){
    tcl("after", "cancel", sw)
    tclvalue(labelText) = paste("Done after: ", timeForm(Sys.time()-start))
  }
  tryCatch(eval(expr), finally=finished())
}
timeForm <- function(t){
  # format a time difference in seconds as HH:MM:SS
  t = round(as.numeric(t))  # whole seconds, so sprintf("%02d") is valid
  s = t %% 60
  m = ((t-s)/60) %% 60
  h = t %/% (60*60)
  stopifnot(s + m*60 + h*60*60 == t)
  tstring = sprintf("%02d", c(h,m,s))
  paste(tstring, collapse=":")
}
This code is hereby released for anyone to do anything with, but keep the credit in the comment unless you've really changed a lot and it's unrecognizable from my original...
I can think of a few improvements - possibly you might want to run a few things on different lines and have a cumulative stopwatch - a bit like split lap times. You might then need a different paradigm, something like:
s = stopwatch()
timeOn(s,foo(a))
timeOn(s,foo(b))
foo(c)
timeOn(s,foo(c))
This would create a TclTk dialog with a timer that only runs within the timeOn evaluations.
Another possibility is a text-based timer rather than a GUI. You might also want it to return the time taken - currently it returns the value of the expression. Go ahead, knock yourself out.
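For what it's worth, a text-based cumulative version is only a few lines. A sketch of that timeOn idea (my own code, not the tcltk version above; stopwatch2 avoids clobbering the function already defined):

# minimal text-based cumulative timer: state is just a running total
stopwatch2 <- function(){
  s <- new.env()
  s$total <- 0
  s
}

timeOn <- function(s, expr){
  t0 <- Sys.time()
  # evaluate the expression in the caller's environment
  result <- eval(substitute(expr), envir = parent.frame())
  s$total <- s$total + as.numeric(Sys.time() - t0, units = "secs")
  message(sprintf("cumulative: %.1f seconds", s$total))
  invisible(result)
}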
Sunday, 1 January 2012
Document Management System review - Mayan EDMS
We're redoing all our web pages in Maths and Stats. This means getting rid of our old Plone 2.1 CMS and replacing it with a groovy new Django-based system using Arkestra - see previous blog entry for more on that.
There's a casualty though. We have an archive of old exam papers going back to 1997. These are sitting on our Plone server (technically as Zope File objects in the ZODB) and are accessed via some custom page templates I wrote so students can get them by calendar year, year of study, or search for a string.
It's a mess. I've been wanting to put them into a proper document management system for a while. Here are my requirements:
- Users should be able to search by course code, year of study, and calendar year.
- Users should be able to bulk download a zip file of selected PDFs
- Admins should be able to bulk upload and enter metadata
I imagine Sharepoint can do this, but I'm no Sharepoint wrangler and I don't really want to be one. There are a few open-source solutions around, such as LogicalDoc, and these all seem to be Java-based Enterprisey solutions. Often there's an open-source, or 'community', version, and then a supported Enterprise version with integration features such as drag'n'drop, OCR, versioning etc.
Then the other day I noticed Mayan EDMS. It's a Django-based DMS which looks like it might well do everything I want. And being open source, if it doesn't I can have a go at making it do so.
Installation
Smooth as you like. The documentation was spot on: just download, get some dependencies with pip, and it all goes nicely into a virtual environment. The whole virtualenv ends up about 90M on my system, which is comparable to Java-based systems when they bundle the JVM with themselves (which they often seem to do).
Setup
The included settings.py file has options for development and production servers. For development you can run from the usual Django dev server, and for production you can collect all your static files into /static for serving from a web server. For test purposes I set development mode and did the usual Django syncdb/migrate dance. I couldn't see what the default superuser name and password were, so I just created one with ./manage.py createsuperuser and I was in. I was uploading documents in no time.
Features
Can it do what I want? The version I downloaded currently can't. The big missing thing is anonymous read-only access. We don't want our students to need to log in to download exam papers, but the permission system in Mayan EDMS doesn't allow anonymous access to anything.
I had a look at creating a new user with read-only permissions. The system has a very fine level of permission control, with lots of capabilities that can be set on or off. Access to permissions is via the usual Role/Group/User style of thing: you create a Role with permissions, then give that role to a User (or a Group of Users). So I created a 'Reader' role and a 'guest' user with that role, thinking we could use that for anonymous access. But the guest user could still create Folders. At that point I had a look at the code.
Code
The code is all on GitHub and it's incredibly well written, seemingly by one person. It uses lots of existing Django apps for some of its functionality, which is a good thing. It appears to comply with Python and Django best practices, is commented, and is easy to find your way around. I found the Folder creation code and noticed there was no permission check like there was with, say, Document creation. I popped a quick issue on the issue tracker.
Author
The next morning I had a response from Roberto. And that was over New Year's Eve and New Year's Day. He's already added a permission check for Folder creation on a development branch, coming out in the next version. I'm not sure whether that will also support fully anonymous users without logging in (Django returns an AnonymousUser in that case, rather than a real User object), but it should be sufficient for us. I don't think I've had such a quick response from an open-source dev in a long while - and certainly never over the New Year celebrations! Credit also goes to GitHub for its fantastic tracking and notification system, but most of the credit goes to Roberto himself!
Metadata
You can define metadata sets in Mayan EDMS, so I defined a set for our exam papers and added the properties we want to classify our papers with - calendar year, year of study, and course code. Documents can also be tagged, so we could tag very old exams with 'obsolete' - since sometimes course codes are re-used for different courses altogether.
Problems
I'm not sure some of the features are working properly yet. In the document details view there's a 'Contents' box that I thought would show the extracted text contents of the document (PDF, OCR'd image etc.), but it's blank even for a text document. Searches for text that occurs in that text document don't return the document. Then I noticed one of my PDFs did have text in the box - but only one. Here the text was all run together with no spaces; the search engine could find 'scenario' in the text containing 'modelscenariowas', but it would also match on run-on strings like 'wehavealsofoundthat'. This might be a problem, but I know it is sometimes hard to extract words from PDF files.
I also tried the OCR functionality on an image, but that returned no text either - looking in the Django admin showed me the OCR Document queue, with an entry in it bearing an error message: "get_image_cache_name() takes exactly 3 non-keyword argument (2 given)". I can run tesseract from the command line successfully, so something isn't quite right. I'll have a look at these issues before reporting them on GitHub (and check whether they are fixed in the dev versions!)
The other missing function I need, as far as I can see, is bulk download. From a search, I'd like to tick a few boxes and have a 'Download All' option that gets everything in a ZIP (or tar.gz) file.
More Features?
The documentation for some of the features is a bit thin. But clicking around and looking at the source code revealed that it has:
- Themes: change the look with a config option, several themes supplied
- Versioning: keep old docs around. Not sure we need this.
- History: show all the changes to files and metadata
- Multilingual: with English, Spanish, Portuguese and Russian.
- Smart Links, Indexes: I don't know what these are and I couldn't make them do anything. I'm sure they are wonderful.
- Signatures: documents can be signed by uploading a signature file. I don't know how this works.
- Public Keys: In the setup, you can upload GPG public keys which you can easily get by querying a keyserver. I'm not sure what these keys then let you do.
So overall I'm very impressed with it - it does a lot of what the Enterprisey, Java-based solutions try to give you but in a light python package. Just about the only Enterprise feature I've seen that's not here is true Filesystem integration via SMB or WebDAV, even though there's "Filesystem Integration" on the features list. I suppose we need to see what that means. I'm sure it's good!
Wednesday, 7 December 2011
Reading AfriPop data
The AfriPop project aims to produce high-resolution population density estimates for Africa. This could be very useful for some of our spatial epi projects. So I thought I'd have a look.
Their methods for computing population are explained. I can't find any licensing or usage restrictions on the data, so I'll have to clear that up before doing anything serious with it. Let's download something and read it in.
The download form asks for a few details, but there's no email check (or 'I have read your terms and conditions' checkbox) so fill in as honestly as your conscience requires. You'll get a zip file to download.
When unzipped you'll have three files. For Benin these are called apben10v3 with extensions hdr, flt and prj.
The flt file is the data, the hdr is header info, and prj is projection info. Use gdalinfo to get a summary - run it on the flt file:
$ gdalinfo apben10v3.flt
Driver: EHdr/ESRI .hdr Labelled
Files: apben10v3.flt
apben10v3.hdr
apben10v3.prj
[WGS84 coordinate system info here]
Origin = (0.773989957311480,12.409206711176100)
Pixel Size = (0.000833300000000,-0.000833300000000)
Corner Coordinates:
Upper Left (0.7739900,12.4092067) (0d46'26.36"E,12d24'33.14"N)
Lower Left (0.7739900,6.2252874) (0d46'26.36"E,6d13'31.03"N)
Upper Right (3.8522002,12.4092067)(3d51' 7.92"E,12d24'33.14"N)
Lower Right (3.8522002,6.2252874) (3d51' 7.92"E,6d13'31.03"N)
Center (2.3130951,9.3172471) (2d18'47.14"E, 9d19' 2.09"N)
Band 1 Block=3694x1 Type=Byte, ColorInterp=Undefined
NoData Value=-9999
All pretty good. A standard ESRI file (except there's often nothing really standard about ESRI file formats). Let's read it and plot it using the raster package:
> bpop=raster("./apben10v3.flt")
> image(bpop)
Oh dear. Reading the docs about these files tells me that sometimes gdal just doesn't get it right. It said "Type=Byte", but the file is 109652696 bytes long, which is exactly 7421x3694x4 bytes: these are 4-byte values. In fact the AfriPop page says these files are "ESRI Float". Luckily there's an easy way to persuade gdal that these are floats. Open up the hdr file (NOT the flt file. Don't open the flt file!) in a text editor (emacs, vi, notepad++, notepad) and add:
pixeltype float
to the items. Then try again:
> bpop=raster("./apben10v3.flt")
> plot(bpop,asp=1)
Looks good. Note we can get away with asp=1 since we're close to the equator. Or we can give it a projection:
projection(bpop)="+proj=tmerc"
plot(bpop)
and that's just as good. We can also zoom in using the very nifty zoom function from the raster package:
Note that the raster package does a really good job of dealing with large rasters. When plotting these things it is sampling the raster file so not all of it is read into memory. I don't have a monitor capable of showing 7000x3000 pixels (yet) so raster just grabs enough to look good. But if you try and do anything with this size of raster, you will strangle your machine and it may die. Don't do this:
plot(log(bpop))
for example. Let's sample our raster down a bit for some more exploring. First I need to restart R after the last line crashed out. I'll sample down to something of the order of a few hundred pixels by a few hundred pixels by specifying the size as 500*500 - sampleRegular will work out the exact resolution, keeping the aspect ratio. It's smart like that:
> library(raster)
> bpop=raster("apben10v3.flt")
> projection(bpop)="+proj=tmerc"
> plot(bpop)
> bpop2 = sampleRegular(bpop, 500*500, asRaster=TRUE)
> bpop2
class : RasterLayer
dimensions : 708, 352, 249216 (nrow, ncol, ncell)
resolution : 0.008744915, 0.008734349 (x, y)
extent : 0.77399, 3.8522, 6.225287, 12.40921 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=tmerc +ellps=WGS84
values : in memory
min value : 0
max value : 336.4539
> plot(bpop2)
> plot(log(bpop2))
[Perhaps we should use 'resample' here?] Now the log plot is doable:
BUT these values are now wrong. The units of the original data are persons per grid square. Our grid squares are now bigger than the original grid squares, so more people live in them. We can correct this by scaling by the ratio of the cell sizes:
> bpop2= bpop2 * prod(res(bpop2))/prod(res(bpop))
> plot(log(bpop2))
and now we get a more correct map:
I'm not sure how 'correct' or optimal this way of downsampling population density grids is - for example, the total population (the sum of the grid cells) will be slightly different. Perhaps it is better to rescale the sampled grid to keep the total population the same? Thoughts and musings in the comments for discussion, please!
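(One way to do that rescaling, sketched with raster's cellStats - though note that summing the full-resolution grid will take a while, since raster has to work through the whole file in blocks:)

# rescale the downsampled raster so its total population matches
# the original grid's total
totalOrig <- cellStats(bpop, sum)
bpop2 <- bpop2 * totalOrig / cellStats(bpop2, sum)
cellStats(bpop2, sum)   # should now equal totalOrig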