Tuesday, 21 December 2010

Speeding up some R Code

Part of my job involves looking at code written by research students and trying to make it do other things. Often a student is focused on their application area or particular dataset, and so the code and data are inextricably linked. Add another observation to the dataset and you'll find yourself having to change 3467 to 3468 in a dozen places. And maybe find some 6934s and change them to 6936s. The other problem is that research students are mostly new to serious programming, and so still need to develop both intuition and formal skills.

Today I've been working on such code for predicting epidemic outcomes. It was taking 14 seconds to run, which isn't too bad (especially considering some other work I'm doing takes 6 hours using OpenMP on an 8-core machine), but in my inspection I'd spotted some odd R code.


The oddity here is that var is one-dimensional and so is coef[1,]. This matrix multiply is really just something like exp(sum(var*coef[,1])). So I tried that, and the code took 6 times longer to run. Matrix multiplications are efficient in R, even in one dimension.
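To make the equivalence concrete, here is a minimal sketch comparing the two forms. The sizes and values of `var` and `coef` are invented for illustration and are not taken from the real epidemic code.

```r
# Hypothetical stand-ins for the real data: sizes and values are made up.
set.seed(1)
p    <- 5
var  <- runif(p)                                  # one-dimensional covariate vector
coef <- matrix(rnorm(p * 3, sd = 0.1), nrow = p)  # p x 3 coefficient matrix

# Matrix-product form: two length-p vectors give a 1 x 1 matrix.
via_matmul <- exp(var %*% coef[, 1])

# Elementwise form: multiply, then sum, giving a plain number.
via_sum <- exp(sum(var * coef[, 1]))

# The two agree up to floating-point rounding.
stopifnot(isTRUE(all.equal(as.numeric(via_matmul), via_sum)))
```

Which form is faster can depend on the vector length and the R build, so it is worth wrapping each in `system.time()` on the real data rather than guessing.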

So I switched back to the original code. Time to get profiling. This is easy in R.
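A typical profiling session with base R's `Rprof` and `summaryRprof` looks something like the sketch below. The `slow_step` function and its inputs are hypothetical placeholders for the real prediction code, with a deliberate transpose inside the loop so the profiler has something to find.

```r
# Hypothetical workload standing in for the real prediction code.
slow_step <- function(coef, var, n_iter = 10000) {
  total <- 0
  for (i in seq_len(n_iter)) {
    # t(coef) is recomputed on every pass through the loop.
    total <- total + as.numeric(exp(var %*% t(coef)[1, ]))
  }
  total
}

set.seed(1)
coef <- matrix(rnorm(500 * 100, sd = 0.01), nrow = 500)  # made-up 500 x 100 matrix
var  <- runif(500)

prof_file <- tempfile()
Rprof(prof_file)                 # start sampling the call stack
result <- slow_step(coef, var)
Rprof(NULL)                      # stop profiling

prof <- summaryRprof(prof_file)
head(prof$by.self)               # functions ranked by time spent in their own code
```

The `by.self` table lists time spent inside each function itself, while `by.total` also counts time spent in the functions it calls.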


This told me most of the time was spent in the t function, within the function I was playing with earlier. Then I noticed that pretty much wherever coef was used, it was transposed with the t function.

I refactored the code to do t(coef) once, and then pass the transposed matrix around. That sped the code up from the original 14 seconds to 3 seconds. Sweet.
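The shape of that refactoring can be sketched as follows. The function names `rate_before` and `rate_after`, and the data, are hypothetical and only illustrate the idea of moving the transpose out of the repeated call.

```r
set.seed(1)
coef <- matrix(rnorm(500 * 100, sd = 0.01), nrow = 500)  # made-up 500 x 100 matrix
var  <- runif(500)

# Before: every call transposes the whole matrix again.
rate_before <- function(coef, var, k) as.numeric(exp(var %*% t(coef)[k, ]))

# After: the caller transposes once and passes the result in.
rate_after <- function(tcoef, var, k) as.numeric(exp(var %*% tcoef[k, ]))

tcoef <- t(coef)                 # done once, outside any loop
ks <- seq_len(ncol(coef))

before <- vapply(ks, function(k) rate_before(coef, var, k), numeric(1))
after  <- vapply(ks, function(k) rate_after(tcoef, var, k), numeric(1))

# Same numbers, far fewer transposes.
stopifnot(isTRUE(all.equal(before, after)))
```

In this sketch the "before" version performs one transpose per call, whereas the "after" version performs a single transpose in total.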

Note that at each stage I checked that the new output was the same as the old output. It can be useful to write some little test functions at this point, but doing formal test-driven development can be a bit tricky for statistical applications in R.
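One possible shape for such a little test function is sketched below. The helper name `check_same`, the tolerance, and the sample values are all assumptions made for illustration.

```r
# Hypothetical helper: stop with an error if new output drifts from the reference.
check_same <- function(new, old, tolerance = 1e-8) {
  same <- all.equal(new, old, tolerance = tolerance)
  if (!isTRUE(same)) {
    stop("output changed: ", paste(same, collapse = "; "))
  }
  invisible(TRUE)
}

old_output <- c(0.12, 0.48, 0.40)   # made-up reference values
new_output <- old_output + 1e-12    # tiny rounding-level difference

check_same(new_output, old_output)  # passes silently
```

Using `all.equal` rather than `identical` allows for harmless floating-point differences. The reference output could be stored once with `saveRDS` and reloaded with `readRDS` before each comparison.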

There's still some work to do to make this code more general purpose. Currently it works on predictions for one particular country, and we want to do predictions over half of Africa!

1 comment:

  1. You don't have to be doing TDD for dedicated testing frameworks to be useful. What's more, it's not that hard to do TDD even for statistical applications. Most people that don't use testing frameworks and don't do repeatable testing make excuses such as this because they are lazy. In fact the reality is that, in the long term, investing in repeatable tests results in less work being done and so the truly lazy should test more than they do!