Wednesday, 22 December 2010

Mapping The Kendal Mint Quake

About 11pm last night my wall of photographs rattled. It's not unusual; it seems to happen whenever anyone is having a washing machine delivered and a truck crawls past my house. But a minute later I was on the computer and someone local was tweeting about her house shaking. Earthquake? A few twitter searches revealed that twitterers in Cumbria and North Lancashire had indeed experienced a small earthquake.

So I did what all statisticians would do - go looking for data. Tweets can be geo-coded, but getting reliable info that way might be hard. I checked the USGS earthquake site, but it's not real-time. Then I found the British Geological Survey earthquakes page, with true real-time reports from a bunch of sensors around the country. Under the Monitoring menu there's the Real-Time Data option.

Here you can choose the station and a day and get a seismogram. Here's what the one in Keswick (KESW) was showing:


BIG spike. It went off the scale of the plot. I looked at a few other plots and discovered most of them had recorded it too. And I noticed the quake had arrived at each station at a slightly different time. Maybe I could use that information to work out the location of the quake.

At this point the BGS site was getting slow as worried Cumbrians checked to see if it was an earthquake or if the local nuclear power station had exploded. Anyway, it was bedtime, so I left it to the morning, as something to do in my lunch hour.

I couldn't find a way to get the raw data, so I decided I'd get what I could from the images. A quick script pulled the seismogram GIF images for all the stations, except for a couple that seemed to be off-line. Another script pulled the station information book HTML files, and from there I got the station locations.
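
The scraping was nothing clever - something along these lines, where the base URL and all the station codes apart from KESW are placeholders rather than the real BGS addresses:

stations <- c("KESW", "STN2", "STN3")          # station codes (mostly made up here)
base <- "http://example.org/realtime/%s.gif"   # placeholder URL pattern

for (stn in stations) {
  ## wrap in try() so one off-line station doesn't kill the whole loop
  try(download.file(sprintf(base, stn), paste0(stn, ".gif"), mode = "wb"))
}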

I then stepped through each of the 24 images, identifying the point near 11pm where the tremors started, and entered the data.

Now I had the location of each station and the time at which the quake had been sensed there. The locations were converted to OSGB coordinates using rgdal in R.
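
The conversion itself is only a couple of lines with sp and rgdal - here's a sketch, using a made-up data frame obs of station longitudes, latitudes and eyeballed arrival times (in seconds after some arbitrary origin):

library(sp)
library(rgdal)

## made-up example values - the real ones came from the station book
obs <- data.frame(lon  = c(-3.13, -2.75, -2.90),
                  lat  = c(54.55, 54.05, 54.90),
                  time = c(12.0, 15.5, 14.0))

pts  <- SpatialPoints(obs[, c("lon", "lat")], proj4string = CRS("+init=epsg:4326"))
osgb <- spTransform(pts, CRS("+init=epsg:27700"))   # OSGB 1936 / British National Grid
obs[, c("east", "north")] <- coordinates(osgb)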

I didn't have a zero-time for the quake or a location. A quick plot did show the quake had to be somewhere in Cumbria.


I don't know how professional seismologists work out earthquake epicentres, so I came up with something quick and dirty.

Assume the velocity of the quake waves is constant across the UK. Then if the quake was at location (x,y), a plot of distance from (x,y) against arrival time should be a straight line, with all the points on it. Some quick experiments showed I could fit a straight line and then use the sum of squared residuals as a measure to be minimised over (x,y). A few lines of code later and I had a coordinate!
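
In code it was something along these lines - a sketch only, reusing the made-up obs data frame from above; the real version ran on the scraped station data:

## residual sum of squares of a straight-line fit of arrival time
## against distance from a candidate epicentre at xy = c(easting, northing)
rss <- function(xy, east, north, time) {
  d <- sqrt((east - xy[1])^2 + (north - xy[2])^2)
  sum(resid(lm(time ~ d))^2)
}

start <- c(mean(obs$east), mean(obs$north))   # start the search among the stations
fit <- optim(start, rss, east = obs$east, north = obs$north, time = obs$time)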

> fit$par
[1] 307499.9 506035.7

By now the quake had hit the news and the USGS web site was showing it:


How did that compare to my point? I converted it to OSGB coordinates:

> coordinates(usgsos)
     coords.x1 coords.x2
[1,]  328994.6  500054.7

20km further east, and 6km south. Not bad considering I had to eyeball the times from the seismograms, and made various other assumptions. I don't know what kind of uncertainty the USGS put on their predictions, and I can't put an uncertainty on my location either, since there's no proper statistical model behind this. I can give you the data if you have a better idea - or you can scrape it again from the web site! Here's a comparison of the epicentre predictions on a map:


Here's what the distance/time plot looks like for my predicted epicentre:


And from the slope of this we can get the speed of travel of the quake waves - about 7,500 m/s - which is about what a quick google search on earthquake wave speeds tells me.
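
That slope is just another straight-line fit - with the sketch above it's something like:

d <- sqrt((obs$east - fit$par[1])^2 + (obs$north - fit$par[2])^2)
coef(lm(d ~ obs$time))[2]   # slope = metres per second, i.e. the wave speed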

Footnote: I think on the bike home tonight I worked out a way to do this statistically. Suppose the earthquake is at (x,y), at time t0, and the waves move with speed v. Then suppose the time of arrival of the quake at a point at distance d is Normally distributed with mean t0 + d/v and variance k(d;theta), where k() is some increasing function of d. This allows us to account for the increasing noise in the system for points further away. Fit the whole thing by maximum likelihood and we get estimates (with standard errors) for location, time of origin, and wave speed. Something for next lunchtime...
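
As a rough, untested sketch - taking k(d) = exp(a + b*d) as one arbitrary choice of increasing variance function, and again using the made-up obs data frame - the negative log-likelihood might look like:

negll <- function(p, east, north, time) {
  x <- p[1]; y <- p[2]; t0 <- p[3]
  v <- exp(p[4])                          # wave speed, kept positive
  a <- p[5]; b <- exp(p[6])               # variance grows with distance
  d <- sqrt((east - x)^2 + (north - y)^2)
  -sum(dnorm(time, mean = t0 + d / v, sd = sqrt(exp(a + b * d)), log = TRUE))
}

mle <- optim(c(fit$par, 0, log(7500), 0, log(1e-6)), negll,
             east = obs$east, north = obs$north, time = obs$time, hessian = TRUE)
sqrt(diag(solve(mle$hessian)))            # rough standard errors (log scale for v and b)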

Tuesday, 21 December 2010

Speeding up some R Code

Part of my job involves looking at code written by research students and trying to make it do other things. Often a student is focused on their application area or particular dataset, and so the code and data are inextricably linked. Add another observation to the dataset and you'll find yourself having to change 3467 to 3468 in a dozen places. And maybe find some 6934s and change them to 6936s. The other problem is that research students are mostly new to serious programming, and so still need to develop both intuition and formal skills.

Today I've been working on such code for predicting epidemic outcomes. It was taking 14 seconds to run, which isn't too bad (especially considering some other work I'm doing takes 6 hours using OpenMP on an 8-core machine), but in my inspection I'd spotted some odd R code.

exp(var%*%t(coef[1,]))

The oddity here is that var is one-dimensional and so is coef[1,]. This matrix multiply is really just something like exp(sum(var*coef[1,])). So I tried that, and the code took 6 times longer to run. Matrix multiplications are efficient in R, even in one dimension.
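
A toy check of the general point, with made-up shapes (a 1 x 5 row of covariates and a vector of 5 coefficients): a one-row matrix times a coefficient vector is just the sum of the element-wise products.

v <- matrix(rnorm(5), nrow = 1)                  # 1 x 5 row of covariates
b <- rnorm(5)                                    # 5 coefficients
all.equal(drop(exp(v %*% b)), exp(sum(v * b)))   # TRUE - same number either way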

So I switched back to the original code. Time to get profiling. This is easy in R.


Rprof("run.out")
dostuff()
Rprof("")
summaryProfile("run.out")


This told me most of the time was spent in the t function, within the function I was playing with earlier. Then I noticed that pretty much wherever coef was used, it was transposed with the t function.

I refactored the code to do t(coef) once, and then pass the transposed matrix around. That sped the code up from the original 14 seconds to 3 seconds. Sweet.
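
The shape of the refactor, on a made-up example (coef here is a small matrix of coefficient draws and v a single row of covariates - not the student's real objects):

coef <- matrix(rnorm(20), nrow = 4)    # 4 draws x 5 coefficients (made up)
v <- matrix(rnorm(5), nrow = 1)        # one row of covariates

## before: transpose happens on every call
before <- function(v, coef, i) exp(v %*% t(coef[i, , drop = FALSE]))

## after: transpose once, outside the hot code, and pass tcoef around
tcoef <- t(coef)
after <- function(v, tcoef, i) exp(v %*% tcoef[, i, drop = FALSE])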

Note that at each stage I check that the new output is the same as the old output - it can be useful to write some little test functions at this point, but doing formal test-driven development can be a bit tricky for statistical applications in R.
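
With the toy before/after above, that check is a one-liner:

stopifnot(isTRUE(all.equal(before(v, coef, 2), after(v, tcoef, 2))))   # complain loudly if they differ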

There's still some work to do to make this code a bit more general purpose. Currently it works on predictions for one particular country - we want to do predictions over half of Africa!