Thursday, 19 December 2013

Get Off My Server! Geocoding Attack Clients

So this year I had the joy of helping manage a WordPress (WP) instance for the FOSS4G conference in Nottingham. That introduced me to the joy that is the paranoia of the WP manager. The justified paranoia.

WP managers lose sleep over exploits. Not sleeping is the only way they can be sure of never waking up to discover that their site has been cracked and is now serving up malware, scraping users' credentials, or acting as part of a vast BitCoin-mining botnet. Patch everything, often.

There are also a lot of security plugins for WordPress, but I figured we ought to have something at a lower level, and my favourite first-line tool is fail2ban. You set up pattern-matching expressions, and when log entries match those patterns, the system adds rules to the iptables ruleset to kick that connection.

After watching the log files and seeing the server slow down as WP tried to process hundreds of invalid requests, I figured out a rule that seemed to match most of them. My suspicion was that a lot of the WP exploit attempts used a kit, and that kit had a fairly clear signature. So along with the other handy rules in my fail2ban config, I added my rule too.

One of the outputs of fail2ban's logs is the IP address of each banned host. So I thought it might be nice to geocode them via the GeoIP database and see where they have all been coming from. "China" and "Russia" are the answers that most people seem to give when you ask them to speculate on the source of these attacks. Are they right?

So first, I took the log files that I had and extracted the IP address and timestamp of each ban. Then, using the Python GeoIP module, I translated all the IP addresses to lat-long and country code. That gave me about 1200 locations from one month of retained log files.
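
The extraction and geocoding step only needs a few lines of Python. Here's a minimal sketch, assuming the pygeoip package and a local copy of MaxMind's GeoLiteCity.dat - the log pattern and file names are illustrative rather than exactly what I ran:

import re
import csv
import pygeoip  # assumed MaxMind legacy GeoIP bindings

# fail2ban log lines look roughly like:
# 2013-11-20 01:23:45,678 fail2ban.actions: WARNING [wordpress] Ban 203.0.113.7
BAN_RE = re.compile(r"^(\S+ \S+) fail2ban\.actions.*\sBan\s+(\S+)")

gi = pygeoip.GeoIP("GeoLiteCity.dat")

with open("fail2ban.log") as logfile, open("bans.csv", "w") as out:
    writer = csv.writer(out)
    writer.writerow(["timestamp", "ip", "lon", "lat", "country"])
    for line in logfile:
        m = BAN_RE.match(line)
        if not m:
            continue
        stamp, ip = m.groups()
        rec = gi.record_by_addr(ip)  # returns None for unknown addresses
        if rec:
            writer.writerow([stamp, ip, rec["longitude"], rec["latitude"],
                             rec["country_code"]])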

Here's a table of the number of bans for the top few countries.

Country Ban Count
USA 573
Germany 76
France 50
Japan 49
Poland 39
Netherlands 37
Turkey 29
Spain 27
Indonesia 27
Vietnam 24
Great Britain 21
Myanmar 21
India 20
Russia 19
Austria 15

So the USA is clearly the big trouble here, with China coming in way down the list (it doesn't even make the table). Of course that's not to say all these US PCs aren't being controlled by Chinese or Russian botnets.

Now we have lat-longs, we can save all this as a shapefile, load it into QGIS, and plot it on an OpenStreetMap background.
Note that this map contains overlapping points and so isn't a perfect representation of density. Also, the spatial precision of the MaxMind GeoIP database varies wildly.
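
Getting from that CSV of geocoded bans to a shapefile only takes a few more lines; here's a sketch using the pyshp package (assuming its 1.x API and the made-up field names from above):

import csv
import shapefile  # the pyshp package

# one point per banned IP, with a few attribute fields
w = shapefile.Writer(shapefile.POINT)
w.field("TIMESTAMP", "C", 24)
w.field("IP", "C", 16)
w.field("COUNTRY", "C", 2)

with open("bans.csv") as f:
    for row in csv.DictReader(f):
        w.point(float(row["lon"]), float(row["lat"]))
        w.record(row["timestamp"], row["ip"], row["country"])

w.save("bans")  # writes bans.shp, bans.shx and bans.dbf; tell QGIS the
                # CRS is WGS84 (EPSG:4326) when loading it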

First I'd like to thank Australia and New Zealand for not bothering to try and hack our server. Much appreciated. Let's look east first:

Quite a good representation here, including Iran and most of south-east Asia. I don't know why Vietnam scored so highly in the table. Let's look at Europe:
Europe has a good spread of banned IP addresses, but Portugal, Greece, and Ireland have nothing. Maybe everyone unplugs their machines in the countries hit hardest by the financial crisis? Off to the biggie now, let's check out the USA:
Mostly an east coast thing here. Examining the west coast in more detail shows a lot more activity from LA than from San Fran or points further north, up into Vancouver - thanks hipsters! What's going on on the east coast though? As they say on CSI, "enhance that area"...
Time to switch to Stamen BW maps here, just because I can, and because it's a bit less distracting. Quite a few attackers around the state, but let's go closer. Take me into Lower Manhattan and enhance - with Bing Aerial Maps...
At this point the CSI team head off to the NYU tennis courts and find a guy with a laptop sitting in the middle of that patch of grass, trying to hack into the FOSS4G server. Of course we don't really have data at anything like that precision, so it's quite meaningless. Only the NSA (and Mac Taylor and friends) can track you down that closely!

I don't know if there's any value in doing any more analysis of this particular data set, but it is at least handy for countering the prejudice that all the cyber attacks come from China or Russia. I've not used the timestamps here, but it would be possible to create an animation of attack points from the data. If you'd like a copy of the data, get in touch.

Update!

I found another monthly tranche of fail logs. This one looks very different, and now we can point a finger at the Russians. Here's the top table:
Country Ban Count
USA 872
Russia 795
Japan 375
Peru 314
Thailand 308
Mexico 265
Philippines 206
Ukraine 182
Turkey 179
Ecuador 154
Kazakhstan 151
Iran 140
India 138
Vietnam 130
Indonesia 118
- which is a bit different! How did Peru get up there?

I had a quick play with some of QGIS's plotting functions and discovered that if I used an SVG symbol with a few dots on it, randomly rotated it via the data-driven symbology, and set the opacity to something fairly small, I could get a much better impression of the density, including where overlapping points create hotspots. There's probably a density-estimation plugin for QGIS somewhere, but until I find it, or until I load the data into R and do a proper kernel density estimation, this will have to do.

Monday, 7 October 2013

Refugee Camp Spatial Analysis

Zaatari Refugee camp was established in 2012 to host Syrian refugees fleeing the fighting in the Syrian civil war [Wikipedia]. After a tweet and a blog post from Lillian Pierson I thought I'd see what data was around.

The UNOSAT web site has shelter locations in the camp, as a shapefile, but I also had a look at OpenStreetMap to see what was there. As well as shelter locations, it has locations of other facilities such as toilets, kitchens, mosques etc.

So for starters I thought I'd do a simple kernel density estimate of the shelters. The plan was to get the OSM data using the R package osmar, create a 2D smoothing, then write that out as a raster and map it in QGIS. Here's the R code:

# need these packages:
require(osmar)
require(KernSmooth)
require(raster)
# define the source for OSM data
api = osmsource_api()
# define the area and get the data
box=corner_bbox(36.310,32.281,36.338,32.303)
camp=get_osm(box,source=api)

# subset the shelters - first find shelter ids:
shelters=find(camp,node(tags(v == "shelter")))
# then subset:
shelters=subset(camp,node_ids=shelters)

# convert to spatial object for kernel smoothing
sh=as_sp(shelters)
# set bandwidth by trial and error
k1 = bkde2D(coordinates(sh$points),bandwidth=0.0002,gridsize=c(200,200))
# convert to raster
rsh = raster(list(x=k1$x1,y=k1$x2,z=k1$fhat))
# set CRS to lat-long
projection(rsh)=CRS("+init=epsg:4326")
# write a GeoTIFF file
writeRaster(rsh,"shelters.0002.tiff","GTiff")

With that done, you can load the tiff into QGIS and plot it over OSM basemap data. Stick that in a Map Composition and get this:

Obviously there are some problems with this - I should probably convert everything to a proper cartesian coordinate system so the units can be in shelters per square metre rather than shelters per square degree (which varies with latitude...), but this was a quick analysis on a Monday morning. The bounding box is also a bit small on the left and the top, the bandwidth was chosen to make the map look good, and so on.

I'm not sure this analysis in itself is any practical use. I don't know how much GIS analysis the camp management are using, but this illustrates what can be done with open data and free and open source software. Anyone can do this.

Next steps might be to see how shelters relate to facilities, which is a first step to planning new facilities and is a classic GIS problem. With Zaatari becoming one of the largest settlements in Jordan, there is clearly a need for expansion planning in the camp.



Monday, 23 September 2013

My Contribution to FOSS4G

The purpose of this post is to log all the stuff I did for FOSS4G 2013 in Nottingham. I'm excluding stuff we all did, like writing the proposal, choosing presentations, proofreading and general help. We all did lots of that. I tried to keep my tasks to those that required more communication with computers than with human beings, and left that messy business to others on the committee.
It's in roughly chronological order, but plenty of these tasks overlapped.

Setting up the WP skin

After Jo and Barry K had set up the Amazon server and installed WP and MySQL, we all messed around styling it for a bit until I worked out the basics of WP skinning, and built the current skin. That involved some CSS and understanding the WP method for getting page templates. I found a simple WP plugin for handling the sponsors which we thought might be useful for more of the conference data systems. I did the image carousel for the home page. Some PHP was written. Sorry.

Sysadmin duties

Jo set up email backup of the MySQL DB. I set up incremental network backups of the database and sections of the filestore to a large external USB device on my work desktop.
I requisitioned a spare server from work and used it to set up a development environment with WP and MySQL on it. Changes tested there were pushed to the live server.

OJS selection setup and admin

Before deciding on OJS for the Academic Track submissions, I looked at other possibilities, including hosted solutions such as EasyChair. Eventually I installed, administered, and lightly themed an OJS instance alongside WP on the Amazon box, and our academic chairs managed the process from there.

Graphic design and logo from contest winner

Naomi Gale's logo was selected as the winner, and I tweaked it a bit to make some usable SVG and PDF files. I produced some sample style guides, as well as B+W and inverse colour versions. Early on I also made some A5 flyers (laid out two to an A4 page for slicing) for publicity.

OSGeo-discuss mailings

Somehow I got the job of keeping the OSGeo-discuss list up to date with progress. We felt this necessary after the previous year's FOSS4G fell apart - one of the criticisms was lack of communication between that local team and OSGeo. So every two weeks or so, after our committee meetings, I'd summarise progress and write up a little note for the mailing list. These mailings were linked on the OSGeo wiki site.

Maptember - concept, website, logo, shirt design

I can't remember who first noticed all these good things going on in September. I think the first phrase was 'Geotember', which was a bit jarring to my ears, so I coined 'maptember'. I designed a simple web site and animated logo and hosted them on the same Amazon server. I handled incoming event requests and found some events myself to add to the page. I designed the t-shirts (and Rollo sorted the printing).

Workshop booking system

We opened registration before we had the workshops arranged, or even chosen. At that point, people could book one or two days of workshops. Given that they might have ended up wanting to attend two half-day workshops on different days, we decided to go for a scheme of one or two days' worth of credits, to allow flexibility.
So I built the booking system. At the time I was also investigating conference management solutions, including hosted offerings from Eldarion (using the open-source Symposion system). However, most of what they provide we had already done (conference registration, paper submission), so instead I took large chunks from the German Python user group's Django site (pyconde) and built the workshop reservation system. This was developed on a home Linux box and managed via GitHub - changes were simply pushed to the live server.
Most of the user handling stuff was already there (logins, passwords etc). I created a new Django app to create a few tables for the workshops in the database. The user front-end for booking carefully made sure the user didn't book more than they'd paid for or book overlapping workshops. In all it registered about 300 people.
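
Roughly, the checks behind that front-end looked like the sketch below - the models and field names here are hypothetical stand-ins, not the actual pyconde-derived schema:

from django.core.exceptions import ValidationError
from django.db import models


class Workshop(models.Model):
    title = models.CharField(max_length=200)
    start = models.DateTimeField()
    end = models.DateTimeField()


class Booking(models.Model):
    user = models.ForeignKey("auth.User", on_delete=models.CASCADE)
    workshop = models.ForeignKey(Workshop, on_delete=models.CASCADE)
    # hypothetical: how many half-day credits this booking uses up
    credits_used = models.PositiveIntegerField(default=1)

    def clean(self):
        mine = Booking.objects.filter(user=self.user).exclude(pk=self.pk)
        # don't let anyone book more than they've paid for
        # (credits_paid is a hypothetical field on the user's profile)
        spent = sum(b.credits_used for b in mine) + self.credits_used
        if spent > self.user.profile.credits_paid:
            raise ValidationError("Not enough workshop credits left")
        # don't allow two bookings whose time slots overlap
        clashes = mine.filter(workshop__start__lt=self.workshop.end,
                              workshop__end__gt=self.workshop.start)
        if clashes.exists():
            raise ValidationError("That clashes with a workshop already booked")
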
User data came into the system via a spreadsheet emailed daily by Claire, exported from the main registration system. I ran this through a Python script that updated the database and then printed a list of new users, to whom I would then send an announcement email. This step could have been automated completely, but I wanted to keep a close check on the process so I ran it by hand most days.
Some custom reports were written for the workshop system so that the admin desk staff and workshop presenters could check off attendees and possibly open up new spaces if people didn't show.

Conference timetable system - data integration from submissions and OJS. 

The programme selection was done via shared Google Docs spreadsheets on which the committee recorded their votes and to which the community votes were added. After that process Rollo arranged the selected presentations on a spreadsheet in timetable format, with sessions. I created the first provisional timetable on the web site by basically saving that as HTML and tweaking it heavily to make it a bit more usable. However, this was not easily maintainable, so I set about thinking of a better system.
Meanwhile Rollo was refining the order of presentations on the timetable. To make this easier for him I developed 'slotdrop', so he could drag-and-drop presentations between slots on the web page, with the changes in slot assignments being reflected in the database. The database now had presentations, sessions, people, locations, tags and so on.
Parts of the pyconde system already being used for bookings could have handled this, since it had facilities for submissions, rooms, and schedules. However, we already had a lot of that functionality sorted, and it was easier to simply build a few more Django database tables and construct it all in custom views.
We were now treating the Django database as the master record of presentations and sessions, and all changes were being made against that. The static provisional timetable on the web site was drifting out of sync with it, and people were starting to notice. So I spent a few days developing the database-driven timetable pages, including plenary sessions and other special events. These pages included a python-whoosh search index, tags to highlight certain talks, and hyperlinks between sessions, presentations, authors, buildings etc. A simple cookie-based favourites system was implemented, with an ICS calendar file download option.
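
The search itself was nothing clever - roughly the standard python-whoosh pattern, sketched here with an illustrative schema rather than the real one:

import os

from whoosh.fields import ID, TEXT, Schema
from whoosh.index import create_in
from whoosh.qparser import QueryParser

# placeholder data standing in for the real presentation records
presentations = [
    {"slug": "example-talk", "title": "An Example Talk",
     "abstract": "Something about PostGIS and routing."},
]

# build the index
schema = Schema(slug=ID(stored=True, unique=True),
                title=TEXT(stored=True),
                abstract=TEXT)
os.makedirs("timetable_index", exist_ok=True)
ix = create_in("timetable_index", schema)
writer = ix.writer()
for talk in presentations:
    writer.add_document(**talk)
writer.commit()

# query it from the search view
with ix.searcher() as searcher:
    query = QueryParser("abstract", ix.schema).parse("postgis")
    for hit in searcher.search(query, limit=20):
        print(hit["slug"], hit["title"])
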
I used Leaflet.js for the first time to produce building location map pages.

Mobile App

Although the timetable was usable on mobile devices it wasn't ideal - for one thing, it needed a live internet connection. I went looking for suitable mobile apps with offline operation, although not having an iOS device meant I knew it would be hard to find one for that platform.
There are a number of companies who will take your conference and produce a stylish mobile conference app for it. For a price (and when the price isn't specified, you know it's a big price). I looked for free and preferably open solutions.
"Giggity" provided that solution for Android . It is an open-source app that reads schedule files and keeps them for off-line use. It can show a timetable, room streams, give reminders and so on. I wrote a Django view to convert our conference programme into the right XML format and publicised the URL for it.
For iOS, I did find a similar application, but the file format wasn't documented. It was plain text, and I started a reverse-engineering effort, but without a device to test it on I didn't want to waste too much time. The app came with a Windows program for creating schedules, but for a conference of our size that wasn't an option. Recently that app's web site has been down, so I'm not sure how well supported it is.
I also looked at using HTML5 offline storage for a mobile solution. The Chaos Computer Club (a group of hackers I first encountered in 1985, but that's another story) have an HTML/JS/CSS solution for their conference schedule which I attempted to convert to use the HTML5 Application Cache. I did have this working, but I was unsure about exactly how the cache worked, and we planned to have good connectivity at the conference anyway.

Volunteer system

Abi Page took on the job of volunteer wrangler for the conference. I created some Django database tables for recording volunteer activity and some reports so she could see what volunteers were signed up for. This included daily volunteer roster sheets and tables showing where volunteers were needed. Abi entered the data into the system.

Pledge site, concept, implementation, admin

During one of our fortnightly conference calls I had this pledge page idea. I pitched it to Steven and he loved it. At first I thought I could implement it using Google Forms, and investigated ways to customise Google Forms pages and collect pledges that way. However, at the time I was also setting up Django for the workshop database, and decided it would be easier done that way. The site is one form with some client- and server-side validation and a simple mathematical CAPTCHA field. New pledges are notified by email, and I can then use the Django admin to accept, reject, or delete the pledge. The page of pledges shows ten random pledges and allows the viewer to 'like' a pledge. A simple cookie is used to stop trivial multiple voting. For the end of the conference I quickly re-purposed a JavaScript "Star Wars"-style scroller found online to display all the pledges during the run-up to the closing session.
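
For what it's worth, the mathematical CAPTCHA really is as simple as it sounds; a hedged sketch with made-up field names (it only needs to deter the dumbest form-spamming bots):

import random

from django import forms


def captcha_numbers():
    """Pick two small numbers for the 'what is a + b?' question."""
    return random.randint(1, 9), random.randint(1, 9)


class PledgeForm(forms.Form):
    name = forms.CharField(max_length=100)
    pledge = forms.CharField(widget=forms.Textarea)
    # the two numbers are rendered into the question and carried in hidden
    # fields; a smarter bot could read them, but casual spammers don't bother
    captcha_a = forms.IntegerField(widget=forms.HiddenInput)
    captcha_b = forms.IntegerField(widget=forms.HiddenInput)
    captcha_answer = forms.IntegerField(label="Answer the sum above")

    def clean(self):
        cleaned = super(PledgeForm, self).clean()
        a = cleaned.get("captcha_a")
        b = cleaned.get("captcha_b")
        if a is None or b is None or cleaned.get("captcha_answer") != a + b:
            raise forms.ValidationError("Please check your arithmetic")
        return cleaned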

Map Exhibit

I created the web page with the voting system, and the screens for the iPad wall. Ken supplied some static graphics from each map entry and produced the movie for the large plasma screens. I wrote a web page that showed thumbnails of all the entries in random order, using Isotope for a dynamic box layout and fancybox to pop up a larger preview. A link went to the URL for web entries, the PDF for static maps, or YouTube for movie entries. Voting used a similar system to the pledge voting to count the popular vote. I wrote a report to list the current vote count, and checked for obvious ballot-stuffing.
The wall of iPads was driven by a web page that took a range of map entries and displayed each one, with a title and author overlay, with a 'Ken Burns'-style transition between them. This was all made easy with various jQuery plugins. With 77 entries we had 7 maps on each of 11 iPads, and a 12th iPad was used to display a set of credits and information.

Nerds

I suggested we try and hire The Festival of the Spoken Nerd for the Gala Night after seeing some of their performances on TV and on YouTube. I contacted their agent and despite only two of them being available we decided they could still put on a great show. And they did.


Wednesday, 12 June 2013

Formation of a Research Computing Users Group at Lancaster

As part of my Fellowship with the Software Sustainability Institute, I want to encourage better communication and collaboration between people using computing in their research. Anyone who spends a chunk of their research time writing programs, manipulating data or producing interesting graphs - as well as anyone with an interest in improving their skills in those areas - is welcome at the inaugural meeting of the totally informal Research Computing Users Group. This will bring together researchers across faculties to share best practice and new skills in research computing.

The initial meeting will be at 1pm on the 5th of July in the A54 Lecture Theatre of the Postgraduate Statistics Centre. As I have no idea of numbers at this stage, could you please fill in this doodle poll if you are interested in attending? A light lunch will be supplied, sponsored by the Software Sustainability Institute.

I've also prepared a short questionnaire on aspects of research computing.

For the first meeting we'll probably spend 20 minutes in general discussion, and I'll spend 20 minutes with a short presentation on magically turning your data into publications. If anyone would like to do a short presentation on any aspect of their research computing processes, please email me with details.

Please encourage your fellow researchers and PhD students to attend.

Thursday, 4 April 2013

Why You Should Not Use WinBUGS (or OpenBUGS)

The BUGS project did a lot to bring Gibbs sampling to the masses. But by using a non-mainstream development environment, it has painted itself into a corner.

My initial objection to using BUGS was that WinBUGS only ran on Windows PCs. That meant I couldn't run it on my desktop, and I couldn't run it on the university high-end cluster. The only way I could run it on a Linux box would be to set up a Windows Virtual Machine.

Attempts to run it using Wine on my desktop fail with odd errors - the current one is "trap #101" and a traceback list, although I've also seen various memory errors once I have got it started. I can't trust it to be doing anything right.

Trust was another reason I was far from keen on using WinBUGS. It was released under a closed-source license. There was no right to modify or redistribute WinBUGS. There was even a fee for a registration code that unlocked some features - although that fee is currently $0 and the unlocking key is freely available.

The last patch for WinBUGS was dated August 2007, over five years ago. The most recent update on the WinBUGS development page talks about a beta-test period ending nearly nine years ago. That's because things have moved on: in 2004, OpenBUGS was born. The code was made available under the GNU GPL, a license that allows modification and redistribution, and most of all allows - indeed insists on - access to the source code.

So let's look at the source code. The Download page on the OpenBUGS site even has a Linux Download section with a link to a 'source package'. But this turns out to be misleading - it's actually a binary shared library with a short C file that links to it and calls the main routine in it. So where is the source code? It is supposed to be on SourceForge, but although there appear to be code check-ins, I find nothing in the SVN repo: http://openbugs.svn.sourceforge.net/viewvc/openbugs/

I'm sure this has all worked before. So what's going on now? In desperation, I emailed one of the developers. He reports they are ironing out a few wrinkles in the source code, and everything will be back to normal within a few weeks.

Even when I do have the source code, development is problematic. The code is written in Component Pascal, and can only be read with the BlackBox Component Builder from Oberon Microsystems. The files are not plain text. The Component Builder only runs on Windows. An email on the OpenBUGS mailing list in July 2012 claims that the BlackBox tools have been abandoned by their own developers, which is why there's no true 64-bit OpenBUGS.

How the Linux binary is produced is also a mystery. I can't find any clues on the OpenBUGS developer pages about how to build it. Clearly some kind of cross-compilation or embedding of Windows code is needed, but I cannot find a handy guide to doing it.

Other open-source software is not as difficult as this. Most R packages install from source code with a one-line build command. Installing python packages can be as simple as 'pip install packagename'. I can configure, compile, and build a major open-source GIS in two command lines. Why has OpenBUGS got so difficult?

  • It used a very niche language and toolset. Pascal, and Component Pascal, was a minority interest back when the BUGS project started. But they got in deep pretty soon. Even though the language is perhaps richer than the C and C++ systems of its time, they've been hit by the problem outlined in "Don't be Distracted By Superior Technology" by James Hague.
  • It targeted Windows only, providing a low barrier to entry for non-technical users. However, by not building a cross-platform solution it alienated the more technical users running Unix variants, the sort of people who form the majority of developers.
  • It failed to build a community. By not attracting developers, development was reliant on the paid staff. The current page of "Future Developments" has plenty of things that people could be getting on with, but the esoteric and restricted development environment shuts out many people.
  • It opened up too late. Now we have true open-source alternatives that can run many of the BUGS models - such as JAGS, stan, PyMC, and several specialised R packages. These packages are also easy to extend with custom distributions and models. Few people are going to put development effort into OpenBUGS now.
So what of the future? If the BlackBox Component Builder is dead, then there is a big brick wall ahead once it stops working with the current version of Windows. It is open source, but probably requires an elite skill set to get working. And if that goes, not only will you not be able to run your models, you won't even be able to open your model files. That's so important I'll repeat it: you will not be able to read your model files. Imagine not being able to open your old Word files. Oh wait, you probably can't.

There's a paper in Statistics in Medicine (2009) by David Lunn et al that talks about a foundation for BUGS code, but there's no sign of this existing as of now. Plenty of other projects thrive without needing a foundation: just open the code, make working with it easy, and people will come. I'm not sure that is possible without re-implementing OpenBUGS in a mainstream language such as C++ or Python. Without doing that, OpenBUGS is a dead end.

Sunday, 17 February 2013

An NHS Twitter Sentiment Map

One of the project proposals at the NHS Hack Day in Oxford was about doing some kind of sentiment analysis using Twitter and NHS hashtags and ids. I had a brief word with the proposer, since I'd recently seen something similar done in R, but he was on another project.

But this weekend I thought I'd have a go at doing something. The main idea came from Jeffrey Breen's blog post Mining Twitter for Airline Consumer Sentiment - there he writes some simple R functions that look for positive and negative words in tweets to give each one a score. I used his scoring code pretty much exactly as-is.
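
The scoring idea is nothing more than counting word matches. Here's the gist, sketched in Python rather than Breen's R, with tiny placeholder word lists standing in for a full opinion lexicon:

import re

# tiny placeholder word lists; a real run uses a full opinion lexicon
positive_words = {"good", "great", "thanks", "helpful", "excellent"}
negative_words = {"bad", "awful", "rude", "waiting", "terrible"}


def score_sentiment(tweet):
    """Score = count of positive words minus count of negative words."""
    words = re.findall(r"[a-z']+", tweet.lower())
    return (sum(w in positive_words for w in words)
            - sum(w in negative_words for w in words))


print(score_sentiment("Great care from the nurses, thanks!"))    # 2
print(score_sentiment("Still waiting, and the staff were rude"))  # -2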

I found a Twitter list of NHS accounts which I quickly pulled out of Firefox. This could probably be done with the twitteR package in R, but I found it just as quick to save it from Firebug and parse the HTML with Python and BeautifulSoup. Make sure you've scrolled down to see all the members, then save the list from the HTML tab in Firebug. That gave me a parsable list of accounts with id, description, and image URL for good measure.
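
The parsing side was a few lines of BeautifulSoup. The selectors below are purely illustrative - they depend entirely on the markup Twitter was serving at the time, so inspect the saved HTML and adjust to match:

from bs4 import BeautifulSoup

with open("nhs_list.html") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

accounts = []
# 'div.account' and the attribute/class names are guesses at the markup;
# check the saved page in Firebug and change them to match
for card in soup.select("div.account"):
    bio = card.select_one("p.bio")
    avatar = card.select_one("img.avatar")
    accounts.append({
        "screen_name": card.get("data-screen-name"),
        "description": bio.get_text(strip=True) if bio else "",
        "image": avatar["src"] if avatar else "",
    })

print(len(accounts), "accounts scraped")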

Then I could run the sentiment analysis, looking for tweets that mention the NHS accounts, and computing the score. This would hopefully be tweets from people mentioning those accounts, and expressing their opinion of the service.
There were some problems with escape characters (the RJSONIO package worked better than the rjson package) and other encodings, but I managed to work round them. Soon I had a table where I could order the NHS accounts by sentiment.

But I'm a map geek. How could I map this? Most of the twitter accounts were geographic, but some were individual trusts or hospitals, and some were entire PCTs, SHAs, or even the new CCGs. There were also a few nationwide NHS accounts.

So I adopted a manual approach to geocoding. First I generated a shapefile with all the accounts located in the North Sea. I loaded this into QGIS with an OpenStreetMap background layer, and then dragged each account around until it was roughly in its place. I left the national accounts floating in the sea. 156 drags later, they all had a place.

Back in R, I put the sentiment scores and counts of number of tweets back onto the geodata and saved a shapefile. But how to distribute this? I could save as a KML and people could use Google Earth, but I thought I'd give CartoDB a try. This is a cloud mapping service where you can upload shapefiles and make pretty maps out of them. With a bit of styling, I had it. Here it is, blue for positive sentiments and reds for negative ones (to the nearest whole number):


Or use the direct link to the map
Of course you need to take a lot of this with a pinch of salt. The locations are approximate. The sentiment scoring system is quite naive. The scores are based on fairly small numbers of tweets (click on a point to see). And the scores were a snapshot of when I ran the analysis. Nevertheless, it's a start. It would be interesting to automate this and see how sentiment changes over time, but I think it requires a lot more tweets to get something amenable to statistical analysis. Once I figure out a statistical distribution for these scores (a difference of two Poissons, maybe) I could map the significance of the sentiment scores, which would take the small samples into account. Exeter may look red and angry, but that's from one single tweet!

Wednesday, 6 February 2013

A new paradigm for spatial classes

I have a love-hate relationship with R. So many things annoy me - even things not in the R Inferno. But we can fix some of them.

For example, SpatialPolygonsDataFrames (and the other spatial data frame classes) aren't really data frames. They don't inherit from data.frame, but they almost behave like data frames. And when they don't, that's an annoyance, and you end up having to get the foo@data member which really is a data frame.

So why can't data frames have a spatial component? Partly because the columns of a data frame can only be vectors of atomic objects. You can't have a column of list objects. Or a column of model fits. Or a column of dates... wait, yes you can...

> ds = seq(as.Date("1910/1/1"), as.Date("1999/1/1"), "years")

> df = data.frame(date=ds,i=1:90)
> head(df)
        date i
1 1910-01-01 1
2 1911-01-01 2
3 1912-01-01 3
4 1913-01-01 4
5 1914-01-01 5
6 1915-01-01 6

Dates like this are atomic. Strip away the class and a date object is just a number:

> unclass(ds[1])
[1] -21915

and it's the S3 OO system that prints it nicely. So how can we store geometry in an atomic data item? We can use WKT, a text representation of points, polygons and so on.

So here's a function to take a SpatialPolygonsDataFrame and return a data frame with a geometry column. It adds a column called the_geom of class "geom", and attaches an attribute to the data frame so that we know where the geometry column is. Anyone who has used PostGIS will recognise this.


as.newsp.SpatialPolygonsDataFrame = function(spdf){
  require(rgeos)
  # store each feature's geometry as a WKT string in a new column
  spdf$the_geom = writeWKT(spdf,byid=TRUE)
  class(spdf$the_geom) = c("geom")
  # keep just the attribute data, remembering which column holds the geometry
  df = spdf@data
  attr(df,"the_geom") = "the_geom"
  class(df) = c("newsp","data.frame")
  df
}


Now if you convert an SPDF to a "newsp" object you get a data frame. With geometry. What can you do with it? Well, anything you can do with a data frame, because that's exactly what it is. You want to do spatial things with it? Ah, well, for now here's some code that gets the geometry column and turns it back into an SPDF:


as.SpatialPolygonsDataFrame.newsp = function(nsp){
  # find the geometry column and pull out the WKT strings
  geom_column= attr(nsp,"the_geom")
  geom = nsp[[geom_column]]
  class(nsp) = "data.frame"
  # parse each WKT string back into an sp geometry object
  geom = lapply(geom,readWKT)
  glist = lapply(geom,function(p){p@polygons[[1]]})
  # give each Polygons object a unique ID before reassembly
  for(i in 1:length(glist)){
    glist[[i]]@ID=as.character(i)
  }
  SpatialPolygonsDataFrame(SpatialPolygons(glist),nsp,match.ID=FALSE)
}

This just does the reverse of the previous function. So you can write a plot method for these newsp objects:


plot.newsp = function(x,...){
  d = as.SpatialPolygonsDataFrame.newsp(x)
  plot(d,...)
}

So far, so pointless? Eventually if you developed this you'd write code to interpret the WKT (or better still, use WKB for compactness) and draw from that. This is just proof-of-concept.

These WKT strings can be very long, and that messes up printing of these data frames. In order to neaten this up, let's write some methods for the geom class to truncate the output:


as.character.geom = function(x,...){
  paste0(substr(x,1,10),"...")
}

format.geom = function(x,...){
  as.character.geom(x)
}

But now we hit a problem with subsetting any classed vector. Facepalm. R drops the class.

> z=1:10
> class(z)="foo"
> z
 [1]  1  2  3  4  5  6  7  8  9 10
attr(,"class")
[1] "foo"
> z[3:10]
[1]  3  4  5  6  7  8  9 10

Annoying. But how does a vector of dates work around this? Well, it defines a "[" method for date vectors. We'll just copy that line for line:


"[.geom" = function (x, ..., drop = TRUE) 
{
    cl = oldClass(x)
    class(x) = NULL
    val = NextMethod("[")
    class(val) = cl
    val
}


and while we're at it, we need a print method for the geom class too:


print.geom = function(x,...){
  print(paste0(substr(x,1,10),"..."))
}

Now look what I can do (using the Columbus data from one of the spdep examples):

> cnew = as.newsp.SpatialPolygonsDataFrame(columbus[,1:4])
> head(cnew)
      AREA PERIMETER COLUMBUS_ COLUMBUS_I      the_geom
0 0.309441  2.440629         2          5 POLYGON ((...
1 0.259329  2.236939         3          1 POLYGON ((...
2 0.192468  2.187547         4          6 POLYGON ((...
3 0.083841  1.427635         5          2 POLYGON ((...
4 0.488888  2.997133         6          7 POLYGON ((...
5 0.283079  2.335634         7          8 POLYGON ((...

This is what kicked this all off. Someone couldn't understand why head(spdf) didn't do what he expected - in some cases:

> pp=data.frame(x=1:10,y=1:10,z=1:10)
> coordinates(pp)=~x+y
> head(pp)
Error in `[.data.frame`(x@data, i, j, ..., drop = FALSE) : 
  undefined columns selected

Try it. Understand it. Then tell me that spatial data frames that inherit from data.frame, with the spatial data in a column, aren't a better idea. It's the principle of least **BOO!** surprise. Of course it would take a lot of work to re-tool all the R spatial stuff to use this, so it's not going to happen.

Another possibility may be to drop data.frames and use data.tables instead... Anyway, just an afternoon hack when I realised you could put WKT in a column. 



Tuesday, 29 January 2013

NHS Hack Day Oxford - The Good, The Bad, and The Ugly

My Experience at NHS Hack Day Oxford

Let's get the "Ugly" out of the way. By the end of day two (and possibly earlier) - the first stall had the seat broken and strewn on the floor, the second had no paper, and the third had been filled by a mysterious bulging carrier bag. Given this was a hospital, nobody had dared investigate its contents.

The "Bad"?. The networking. Getting reliable internet connections for an event like this is a must. There had been some effort to get BT Wireless connectivity, but this was costing £20 for the weekend per person, and even then there were reliability problems. I connected freely using the Eduroam system for academics but the connection would regularly drop. With an average of 2.1 WiFi devices per person at this event (that's a rough guess!) it can be a deal breaker. At least one person arrived late on Sunday to take advantage of the unlimited fast broadband at their friend's house where they were staying - although he also admitted the unlimited toast and Australian tennis action was another factor.  I wonder if wired network connections and a few switches and hubs scattered round the venue might be a better solution. For the FOSS4G Conference in Nottingham this year (shameless plug) the network connectivity has been top of our agenda since our first meeting. 

For future Hack Day events, it might be worth putting a reminder out for power extension leads a bit further in advance. Also, most of us have a couple of lanyards kicking around from previous conferences and a reminder to bring and share lanyards might be an idea too. Sticky labels were okay, but sub-optimal!

So, onto the work itself. Given the huge number of people who turned up, I was surprised at how little traffic there had been on the email list and the Google group beforehand. Perhaps a dozen or so individuals spoke up on the pre-conference channels, yet at least ten times that number walked through the doors on the day. What were the other 90% thinking beforehand?

Partly I suspect some of the work groups were formed in advance. There seems to be a classification of hack day work groups:
  • The group of people who have met and worked together before. They have a plan for the hack day worked out already, and are perhaps using the weekend to set aside two days away from the ceaseless door-knocking (or the ping of arriving emails) of their day jobs to get it done. These groups can be hard for an outsider to help out with, since their work programme is decided and the division of labour is understood. They get down and get on.
  • The small group that has a plan, and knows what it needs, and knows it can find it at the hack day. Perhaps one or two people have the kernel of an idea together, but need a database expert, or a web scraper. They get them, and a designer and maybe a doctor and a patient jump in too as representative end users. Soon a group forms that adds value to the original idea and the development snowballs.
  • The randoms. A group of people with no great agenda, but assorted skill sets, who get together and come up with both a problem and a solution on the day. It may be nothing to do with any of the individual ideas, but emerges from the sum of their parts.
I skipped the presentations and judging, partly because I'm slightly uncomfortable with judging and prizes, especially for works of art (don't talk to me about the Oscars, the Man Booker prize, or the Queen's Honours list), and partly because I was getting a bit of cabin fever and fancied some fresh air. I wandered down to the river and was rewarded by a refreshing downpour followed by a DOUBLE RAINBOW! which inspired me with some new ideas. I find a combination of solitary thinking time and group work most inspirational, and a giant double rainbow arcing right across the sky can only help that process.

The time spent with other people is the real "Good" of the NHS Hack Day. Even if none of the code cut at the weekend goes into production, the value of making those connections makes the weekend worthwhile. What this means is that even if the WiFi doesn't work and there's no electricity, just stick everyone in a room, lock the doors, push sandwiches and cake in at midday, keep a flow of coffee, and good things will still happen.

The big challenge is keeping these things going, which requires funding, further collaboration, acceptance and adoption. The greater challenge is integrating the spirit of the hack day into the processes that produced, for example, the ePortfolio that was so hated that a hack day group spent the weekend conspiring to produce a better one. So-called solutions are often imposed from above, based on marketing promises from big business, and the vast majority of users have no input into them. This is not a problem confined to the NHS, and there is a need to "loop back up" so that managers have a better awareness of user requirements - something they could learn from modern agile software development methods and open-source practices.

So overall a great weekend - and I'll probably keep an eye on some of the projects to see how they develop. All the details are available from the NHS Hack Day web site.

Comments etc to @geospacedman on twitter