Thursday, 4 April 2013

Why You Should Not Use WinBUGS (or OpenBUGS)

The BUGS project did a lot to bring Gibbs sampling to the masses. But by using a non-mainstream development environment, it has painted itself into a corner.

My initial objection to using BUGS was that WinBUGS only ran on Windows PCs. That meant I couldn't run it on my desktop, and I couldn't run it on the university high-end cluster. The only way I could run it on a Linux box would be to set up a Windows Virtual Machine.

Attempts to run it using Wine on my desktop would fail with odd errors - current one is "trap #101" and a traceback list, although I've also seen various memory errors once I have started it. I can't trust it to be doing anything right.

Trust was another reason I was far from keen on using WinBUGS. It was released under a closed-source license. There was no right to modify or redistribute WinBUGS. There was even a fee for a registration code that unlocked some features - although that fee is currently $0 and the unlocking key is freely available.

The last patch for WinBUGS was dated August 2007, over five years ago. The most recent update on the WinBUGS development page talks about a beta-test period ending nearly nine years ago. Because things have moved on. In 2004, OpenBUGS was born. The code was made available under the GNU GPL, a license that allows modifications and redistributions, and most of all allows - indeed insists on - access to the source code.

So let's look at the source code. The Download page on the OpenBUGS site even has a Linux Download section with a link to a 'source package'. But this turns out to be misleading - its actually a binary shared library with a short C file that links to it and calls the main routine in it. Where is the source code? It supposes to be on SourceForge, but although there appears to be code checkins, I find nothing in the SVN repo: http://openbugs.svn.sourceforge.net/viewvc/openbugs/

I'm sure this has all worked before. So what's going on now? In desperation, I emailed one of the developers. He reports they are ironing out a few wrinkles in the source code, and everything will be back to normal within a few weeks.

Even when I do have the source code, development is problematic. The code is written in Component Pascal, and can only be read with the BlackBox Component Builder from Oberon Microsystems. The files are not plain text. The Component Builder only runs on Windows. An email on the OpenBUGS mailing list in July 2012 claims that the BlackBox tools have been abandoned by their own developers, which is why there's no true 64-bit OpenBUGS.

How the Linux binary is produced is also a mystery. I can't find any clues on the OpenBUGS developer pages about how to build it. Clearly some kind of cross-compilation or embedding of Windows code is needed. But a handy guide to do it, I can not find.

Other open-source software is not as difficult as this. Most R packages install from source code with a one-line build command. Installing python packages can be as simple as 'pip install packagename'. I can configure, compile, and build a major open-source GIS in two command lines. Why has OpenBUGS got so difficult?

  • It used a very niche language and toolset. Pascal, and Component Pascal, was a minority interest back when the BUGS project started. But they got in deep pretty soon. Even though the language is perhaps richer than the C and C++ systems of its time, they've been hit by the problem outlined in "Don't be Distracted By Superior Technology" by James Hague.
  • It targeted Windows only, providing a low barrier to entry for simple users. However, by not building a cross-platform solution it alienated more technical users running Unix variants, the sort of people who form the majority of developers.
  • It failed to build a community. By not attracting developers, development was reliant on the paid staff. The current page of "Future Developments" has plenty of things that people could be getting on with, but the esoteric and restricted development environment shuts out many people.
  • It opened up too late. Now we have true open-source alternatives that can run many of the BUGS models - such as JAGS, stan, PyMC, and several specialised R packages. These packages are also easy to extend with custom distributions and models. Few people are going to put development effort into OpenBUGS now.
So what of the future? If the BlackBox Component Builder is dead, then there is a big brick wall ahead once that stops working with the current version of Windows. It is Open Source, but probably requires an elite skill set to get it working. But if that goes not only will you not be able to run your models, you won't even be able to open your model files. That's so important I'll repeat that. You will not be able to read your model files. Imagine not being able to open your old Word files. Oh wait, you probably cant.

 There's a paper in Statistics in Medicine (2009) by David Lunn et al that talks about a foundation for BUGS code, but there's no sign of this existing as of now. Plenty of other projects thrive without needing a foundation, you just have to open the code, make working with it easy, and people will come. I'm not sure if this is possible without re-implementing OpenBUGS in a mainstream language such as C++ or Python. Without doing that, OpenBUGS is a dead end.

Sunday, 17 February 2013

An NHS Twitter Sentiment Map

One of the project proposals at the NHS Hack Day in Oxford was about doing some kind of sentiment analysis using twitter and NHS hashtags and ids. I had a brief word with the proposer since I'd recently seen something similar done in R but he was on another project.

But this weekend I thought I'd have a go at doing something. The main idea came from Jeffrey Breen's blog Mining Twitter for Airline Consumer Sentiment - there he writes some simple R functions that looks for positive and negative words in tweets to give a score. I pretty much used his code for scoring exactly like that.

I found a twitter list of NHS accounts which I quickly pulled out of Firefox. This could probably be done with the twitteR package in R but I found it just as quick to save it from Firebug and parse the HTML with Python and BeautifulSoup. Make sure you've scrolled down to see all the members, then save the list from the HTML tab in Firebug. That gave me a parsable list of accounts with id, description, and image URL for good measure.

Then I could run the sentiment analysis, looking for tweets that mention the NHS accounts, and computing the score. This would hopefully be tweets from people mentioning those accounts, and expressing their opinion of the service.
There were some problems with escape characters (the RJSONIO package worked better than the rjson package) and other encodings, but I managed to work round them. Soon I had a table where I could order the NHS accounts by sentiment.

But I'm a map geek. How could I map this? Most of the twitter accounts were geographic, but some were individual trusts or hospitals, and some were entire PCTs, SHAs, or even the new CCGs. There were also a few nationwide NHS accounts.

So I adopted a manual approach to geocoding. First I generated a shapefile with all the accounts located in the North Sea. I loaded this into a QGIS with an OpenStreetMap background layer, and then dragged each account around until it was roughly in its place. I left the national accounts floating in the sea. 156 drags later, they all had a place.

Back in R, I put the sentiment scores and counts of number of tweets back onto the geodata and saved a shapefile. But how to distribute this? I could save as a KML and people could use Google Earth, but I thought I'd give CartoDB a try. This is a cloud mapping service where you can upload shapefiles and make pretty maps out of them. With a bit of styling, I had it. Here it is, blue for positive sentiments and reds for negative ones (to the nearest whole number):


Or use the direct link to the map
Of course you need to take a lot of this with a pinch of salt. The locations are approximate. The sentiment scoring system is quite naive. The scores are based on fairly small numbers of tweets (click on a point to see). And the scores were a snapshot of when I ran the analysis. Nevertheless, its a start. It would be interesting to automate this, and see how sentiment changes over time, but I think it requires a lot more tweets to get something amenable to statistical analysis. Once I figure out a statistical distribution for these scores (difference of two Poissons maybe) I could map significance of sentiment scores, which would take into account the small samples. Exeter may look red and angry, but that's from one single tweet!

Wednesday, 6 February 2013

A new paradigm for spatial classes

I have a love-hate relationship with R. So many things annoy me - even things not in the R Inferno. But we can fix some of them.

For example, SpatialPolygonsDataFrames (and the other spatial data frame classes) aren't really data frames. They don't inherit from data.frame, but they almost behave like data frames. And when they don't, that's an annoyance, and you end up having to get the foo@data member which really is a data frame.

So why can't data frames have a spatial component? Partly because the columns of a data frame can only be vectors of atomic objects. You can't have a column of list objects. Or a column of model fits. Or a column of dates.. wait, yes you can...

> ds = seq(as.Date("1910/1/1"), as.Date("1999/1/1"), "years")

> df = data.frame(date=ds,i=1:90)
> head(df)
        date i
1 1910-01-01 1
2 1911-01-01 2
3 1912-01-01 3
4 1913-01-01 4
5 1914-01-01 5
6 1915-01-01 6

Dates like this are atomic. Strip away its class and a date object is just a number:

> unclass(ds[1])
[1] -21915

and its the S3 OO system that prints it nicely. So how can we store geometry in an atomic data item? We can use WKT. This is a text representation of points, polygons and so on.

So here's a function to take a SpatialPolygonsDataFrame and return a data frame with a geometry column. It adds a column called the_geom of class "geom", and attaches an attribute to the data frame so that we know where the geometry column is. Anyone who has used PostGIS will recognise this.


as.newsp.SpatialPolygonsDataFrame = function(spdf){
  require(rgeos)
  spdf$the_geom = writeWKT(spdf,byid=TRUE)
  class(spdf$the_geom) = c("geom")
  df = spdf@data
  attr(df,"the_geom") = "the_geom"
  class(df) = c("newsp","data.frame")
  df
}


Now if you convert a SPDF to a "newsp" object you get a data frame. With geometry. What can you do with it? Well, anything you can do with a data frame, because that's exactly what it is. You want to do spatial things with it? Ah, well, for now here's some code that gets the geometry columns and turns it back into a SPDF:


as.SpatialPolygonsDataFrame.newsp = function(nsp){
  geom_column= attr(nsp,"the_geom")
  geom = nsp[[geom_column]]
  class(nsp) = "data.frame"
  geom = lapply(geom,readWKT)
  glist = lapply(geom,function(p){p@polygons[[1]]})
  for(i in 1:length(glist)){
    glist[[i]]@ID=as.character(i)
  }
  SpatialPolygonsDataFrame(SpatialPolygons(glist),nsp,match.ID=FALSE)
}

This just does the reverse of the previous function. So you can write a plot method for these newsp objects:


plot.newsp = function(x,...){
  d = as.SpatialPolygonsDataFrame.newsp(x)
  plot(d,...)
}

So far, so pointless? Eventually if you developed this you'd write code to interpret the WKT (or better still, use WKB for compactness) and draw from that. This is just proof-of-concept.

These WKT strings can be very long, and that messes up printing of these data frames. In order to neaten this up, let's write some methods for the geom class to truncate the output:


as.character.geom = function(x,...){
  paste0(substr(x,1,10),"...")
}

format.geom = function(x,...){
  as.character.geom(x)
}

But now we hit a problem with subclassing any classed vector. Facepalm. R drops the class. 

> z=1:10
> class(z)="foo"
> z
 [1]  1  2  3  4  5  6  7  8  9 10
attr(,"class")
[1] "foo"
> z[3:10]
[1]  3  4  5  6  7  8  9 10

Annoying. But how does a vector of dates work around this? Well, it defines a "[" method for date vectors. We'll just copy that line for line:


"[.geom" = function (x, ..., drop = TRUE) 
{
    cl = oldClass(x)
    class(x) = NULL
    val = NextMethod("[")
    class(val) = cl
    val
}


and while we're at it, we need a print method for the geom class too:


print.geom = function(x,...){
  print(paste0(substr(x,1,10),"..."))
}

Now look what I can do (using the Columbus data from one of the spdep examples):

> cnew = as.newsp.SpatialPolygonsDataFrame(columbus[,1:4])
> head(cnew)
      AREA PERIMETER COLUMBUS_ COLUMBUS_I      the_geom
0 0.309441  2.440629         2          5 POLYGON ((...
1 0.259329  2.236939         3          1 POLYGON ((...
2 0.192468  2.187547         4          6 POLYGON ((...
3 0.083841  1.427635         5          2 POLYGON ((...
4 0.488888  2.997133         6          7 POLYGON ((...
5 0.283079  2.335634         7          8 POLYGON ((...

This is what kicked this all off. Someone couldn't understand why head(spdf) didn't do what he expected - in some cases:

> pp=data.frame(x=1:10,y=1:10,z=1:10)
> coordinates(pp)=~x+y
> head(pp)
Error in `[.data.frame`(x@data, i, j, ..., drop = FALSE) : 
  undefined columns selected

Try it. Understand it. Then tell me that spatial data frames that inherit from data.frame with the spatial data in a column aren't a better idea. Its the principle of least **BOO!** surprise. Of course it would take a lot of work to re-tool all the R spatial stuff to use this, so it's not going to happen.

Another possibility may be to drop data.frames and use data.tables instead... Anyway, just an afternoon hack when I realised you could put WKT in a column. 



Tuesday, 29 January 2013

NHS Hack Day Oxford - The Good, The Bad, and The Ugly

My Experience at NHS Hack Day Oxford

Let's get the "Ugly" out of the way. By the end of day two (and possibly earlier) - the first stall had the seat broken and strewn on the floor, the second had no paper, and the third had been filled by a mysterious bulging carrier bag. Given this was a hospital, nobody had dared investigate its contents.

The "Bad"?. The networking. Getting reliable internet connections for an event like this is a must. There had been some effort to get BT Wireless connectivity, but this was costing £20 for the weekend per person, and even then there were reliability problems. I connected freely using the Eduroam system for academics but the connection would regularly drop. With an average of 2.1 WiFi devices per person at this event (that's a rough guess!) it can be a deal breaker. At least one person arrived late on Sunday to take advantage of the unlimited fast broadband at their friend's house where they were staying - although he also admitted the unlimited toast and Australian tennis action was another factor.  I wonder if wired network connections and a few switches and hubs scattered round the venue might be a better solution. For the FOSS4G Conference in Nottingham this year (shameless plug) the network connectivity has been top of our agenda since our first meeting. 

For future Hack Day events, it might be worth putting a reminder out for power extension leads a bit further in advance. Also, most of us have a couple of lanyards kicking around from previous conferences and a reminder to bring and share lanyards might be an idea too. Sticky labels were okay, but sub-optimal!

So, onto the work itself. Given the huge number of people who turned up I was surprised at the level of traffic on the email and the google group. Perhaps a dozen or so individuals voiced up on the pre-conference channels, yet at least ten times that number walked through the doors on the day. What were that 90% thinking beforehand?

Partly I suspect some of the work groups were formed in advance. There seems to be a classification of hack day work groups:
  • The group of people who have met and worked before. They have a plan for the hack day worked out already, and are perhaps using the weekend to set aside two days from the sound of ceaseless door-knocking (or the ping of arriving emails) in their day jobs to get this done. These groups can be hard for an outsider to help out with, since their work programme is decided and the division of labour is understood. They get down and get on.
  • The small group that has a plan, and knows what it needs, and knows it can find it at the hack day. Perhaps one or two people have the kernel of an idea together, but need a database expert, or a web scraper. They get them, and a designer and maybe a doctor and a patient jump in too as representative end users. Soon a group forms that adds value to the original idea and the development snowballs.
  • The randoms. A group of people with no great agenda, but assorted skill sets, who get together and come up with both a problem and a solution on the day. It may be nothing to do with any of the individual ideas, but emerges from the sum of their parts.
 I skipped the presentations and judging partly because I'm slightly uncomfortable with judging and prizes, especially  for works of art (don't talk to me about the Oscars, the Man Booker prize, or the Queen's Honours list) and I was getting a bit cabin-fever and I fancied some fresh air. I wandered down to the river and was rewarded by a refreshing downpour followed by a DOUBLE RAINBOW! which inspired me with some new ideas. I find a combination of solitary thinking time with group work most inspirational, and a giant double rainbow arcing right across the sky can only help that process.

The time spent with other people is the real "Good" of the NHS Hack Day. Even if none of the code cut at the weekend goes into production then the value in making those connections makes the weekend worthwhile. What this means is that even if the WiFi doesn't work and there's no electricity, just stick everyone in the room, lock the doors, push sandwiches and cake in at midday and keep a flow of coffee and good things will still happen.

The big challenge is getting these things continuing, which requires funding, further collaboration, acceptance and adoption. The greater challenge is integrating the spirit of the hack day into the processes that produced, for example, the ePortfolio that was so hated that a hack day group spent the weekend conspiring to produce a better one. So-called solutions are seemingly often imposed from above, based on marketing promises from big business, and the vast majority of users have no input to the problem. This is not a problem confined to the NHS, and there is a need to "loop back up" so that managers have a better awareness of user requirements, something they could learn from modern agile computer science development methods and open-source practices. 

So overall a great weekend - and I'll probably keep an eye out on some of the projects to see how they develop. All the details are available from the NHS Hack Day web site.

Comments etc to @geospacedman on twitter

Saturday, 29 September 2012

The Taarifa Project

I seem to have been drawn in to helping a bit with The Taarifa Project for one reason or another. Mostly the enthusiasm of Mark Iliffe, but also the possibility of helping out with an interesting project.

For those who don't know and are too lazy to click the link I thoughtfully provided, here's the elevator pitch:

The Taarifa Platform is an open source web application for information collection, visualization and interactive mapping. It allows people to collect and share their own stories using various mediums such as SMS, Web Forms, Email or Twitter, placing these reports into a workflow. Where these reports can be followed up and acted upon, while engaging citizens and communities.

Currently it has been deployed in Uganda, and further deployments are planned. The current codebase is a heavily modified fork of Ushahidi, but the application has grown out of that now. The new project is a reboot and a new existence as a Django application.

Which suits me fine. I've done a lot of work with Django, including content management systems, applications for disease modelling and monitoring (that's my day job), and a social site that built a linked system of interesting technological new items. 

The Django version of Taarifa, djangorifa, is open source, and hosted on github. Getting the source and the requirements took about ten minutes. I first tried to get it running with Spatialite, but currently that doesn't work because Spatialite has some limitations that mean it can't work out distances given lat-long coordinates. Seems a bit odd since any schoolkid can do it with a calculator with sin and cos buttons. But never mind.

So I stuck PostGIS on my virtual box and away it went. Using the sample OpenStreetMap dump file supplied I had a demo site working, and I could start pretend-reporting things.

There's clearly a lot to do. I think from a top-level there's still plenty of functionality to re-think. I believe the PHP version had a complicated workflow for dealing with reports, so maybe that needs doing. There's also a lot of basic web site functionality too. I thought of a few things:

  1. Translation. For something intended for developing countries it has to be multilingual. Django supports this out of the box, and translations are supported in templates, python code, and even in URLs themselves, so a site could have URLs in several languages that go to the same effective page, and that page would be translated into that language. For example, /voiture/12 would show the page for car number 12 in French, and /car/12 would show it in English, and /auto/12 would show it in German. Not sure how you handle pages where the same word is used in two languages...
  2. Search. Vital for this system. I've used Haystack integrated into Django-CMS talking to a whoosh backend. Pretty heavy, and works well. I had a quick look for new django search projects and gave django-watson a test-drive. Worked straight out of the box. Uses Postgres' search facilities so indexing is all handled there. Might be my new goto- django search system.
  3. REST API. Data in is one thing, data out is another. For a community-centric system such as this, it would be great to let people get the data and do that web 2.0 mash-up think that you may remember from a few years ago. 
  4. Spatial Analytics. Take the report data and look for hotspots, time trends and so on. Present it all as nice graphs and maps. Make that available via a WMS server for mash-up purposes.
Check the project out if it looks interesting to you! http://taarifa.org/

Tuesday, 11 September 2012

Fixing Polygons with pprepair

At the OSGIS UK conference last week the prize for best presentation went to Ken Arroyo Ohori of TU Delft and his talk on "Automatically repairing polygons and planar partitions". The presentation, jointly credited to Hugo Ledoux and Martijn Meijers detailed a method for cleaning up dodgy polygons which we've all probably seen in our geo-lifetimes.

I've noticed that computational geometry theorists seem averse to publishing code - its as if they don't want to sully the purity of their theory with dirty code, dealing with edge cases and arithmetic precision and all that. Not these guys. The code is on github, and I was eager to try it out. There was a lot of buzz about integrating this with PostGIS or QGIS or R, the stumbling block seeming to be the requirement for the CGAL library.

Anyway, back at Lancaster, and check out the code. Easy as:

git clone https://github.com/tudelft-gist/pprepair.git

and then 

make -f Makefile.linux

There were a couple of problems which I submitted as issues, but I managed to fix them before any response. Check these on the github issue tracker.

Then I wanted a sample dataset to test it on. From the workshop I gave at the UseR! conference I remembered some awful polygons outlining local police neighbourhood regions. I'd used it as an example of bad data. Some crime locations had fallen into cracks between the neighbourhood polygons, and some were in locations covered by more than one polygon. Maybe you could escape justice by claiming you were outside a police jurisdiction this way? 

Here's what a small region looks like in Qgis:
Notice the white spaces between some polygons, and also the overlapping made visible by setting transparency.

The pprepair process is currently a command-line tool, but its options are simple. An input shapefile, an output shapefile and some repair parameters. But to start, all you need is "-fix".

pprepair  -i neighbourhoods.shp -o fixed.shp -fix

Loading this into Qgis looks much nicer:
Everything is now clean, there's no overlaps and no holes (note the colours are generated by my semi-random topological colouring system so are different). You now have a consistent topology that won't cause odd errors when using other analysis techniques.

Ken's presentation claimed big wins on speed and memory usage over similar algorithms in both open-source and proprietary GIS packages. Once the licensing requirements of the CGAL library are solved, it would be great to see this algorithm pop up in all the open-source geospatial systems.


Tuesday, 10 July 2012

The World Wine Web

The ggmap package caused a buzz at UseR! last month. Well, anything gg-something plus anything to do with maps will do that. It lets you download map tiles to use as backgrounds to ggplot graphics, and has some other useful functions such as interfacing to google's routing engine and geocoding.

Dave Smith of Revolution Analytics used it to map his local wineries, a map someone on twitter called 'useful'. I disagree...

 That map shows me one thing - that the wineries that you don't need an appointment for tend to be close to the main roads, and the ones you do need an appointment for are up in the hills. Its a static, dead map. If you decide to visit a winery, you've then got to do some spatial query to find out the name and address. That's not very useful.

 One poster on Dave's blog asked if it could be linked to Google Vis to produce a dynamic map. Well, better than that. We can use R to create a web map with all open source tools and open data.

 My off-CRAN package, webmaps has two capabilities - it can download map tiles like ggmap, and it can create slippy webmaps using OpenLayers. Recent changes to OpenLayers have however broken the slippy map functionality, and I'm working on a dedicated replacement package called openlayr. However, enough of the webmaps functionality exists to give me a good start on making a dynamic map of the wineries.

 If it was all working, all I'd need to do is to divide the wineries into those that need appointments and those that don't, and then do:

osmMap(layer(Appt,name="Yes",style=s1),layer(NoAppt,name="No",style=s2),outputDir="WineMap")

and I'd get something like this:
- a web map with two layers of points (the objects s1 and s2 define the colours). Each layer can be activated on or off via the checkboxes in the pullout menu, and you can zoom and pan. And if you click on a winery...
up pops the information. Remember apart from those s1 and s2 objects and a bit of standard dataframe manipulation, this is one line of R.

The background imagery is an OpenStreetMap tile background - so I really need to add an attribution to this. Copyright OpenStreetMap Contributors, CC BY-SA. Its also possible to add other base-layers, such as the oooh-inducing Stamen.com watercolour maps.

Once I get openlayr all working nicely I can probably give you a quick script to do all this. For now, you can check out the map online and see for yourself. What happens is that each layer is converted to GML (I might change to using GeoJSON for openlayr) and a simple Javascript and HTML wrapper is created to load the layers.

With a bit more customisation you could make the popups automatically link to the winery web page! This is how easy it is to make a web map with R - remember you can push this onto a web server and anyone can see it.

I'm off now for a glass of... oh, my wine cellar is empty...