Today I've been working on some code for predicting epidemic outcomes. It was taking 14 seconds to run, which isn't too bad (especially considering some other work I'm doing takes 6 hours using OpenMP on an 8-core machine), but while inspecting it I'd spotted some odd R code.
exp(var%*%t(coef[1,]))
The oddity here is that var is one-dimensional and so is coef[1,]. This matrix multiply is really just a dot product: something like exp(sum(var*coef[1,])). So I tried that, and the code took six times longer to run. Matrix multiplication is efficient in R, even in one dimension.
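A minimal sketch of the equivalence, with made-up shapes: here var is assumed to be a numeric vector of k covariates and coef a data frame of coefficient rows (so that coef[1, ] stays two-dimensional and t(coef[1, ]) is a k-by-1 matrix, making the %*% a dot product as the post describes).

```r
set.seed(1)
k <- 5
var  <- runif(k)                                  # hypothetical covariates
coef <- as.data.frame(matrix(runif(3 * k), nrow = 3))  # hypothetical coefficients

# Matrix form from the post: t(coef[1, ]) is k x 1, giving a 1 x 1 result.
a <- exp(var %*% t(coef[1, ]))

# Elementwise form: a plain scalar with the same value.
b <- exp(sum(var * unlist(coef[1, ])))

stopifnot(isTRUE(all.equal(as.numeric(a), b)))
```

Whether the vectorised sum or the %*% form wins on speed depends on sizes and on how often the expression is evaluated, which is exactly why profiling (below) beats guessing.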
So I switched back to the original code. Time to get profiling. This is easy in R.
Rprof("run.out")
dostuff()
Rprof("")
summaryRprof("run.out")
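For reference, here is the same recipe on a toy function (the function and its sizes are made up, standing in for dostuff()). The by.self table from summaryRprof ranks functions by time spent in their own code, which is usually the view that points at the hot spot.

```r
# A stand-in workload with deliberately repeated transposes.
slow_fn <- function(n = 2000) {
  m <- matrix(rnorm(n * n), n)
  s <- 0
  for (i in 1:20) s <- s + sum(t(m))   # t() recomputed every iteration
  s
}

Rprof("run.out")          # start sampling profiler
slow_fn()
Rprof(NULL)               # NULL (or "") stops profiling
prof <- summaryRprof("run.out")
head(prof$by.self)        # functions ranked by time in their own code
```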
This told me most of the time was spent in the t function, within the function I was playing with earlier. Then I noticed that pretty much wherever coef was used, it was transposed with the t function. I refactored the code to do t(coef) once, and then pass the transposed matrix around. That sped the code up from the original 14 seconds to 3 seconds. Sweet.
Note that at each stage I check that the new output is the same as the old output - it can be useful to write some little test functions at this point, but doing formal test-driven development can be a bit tricky for statistical applications in R.
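A sketch of that kind of refactor, with hypothetical sizes: the invariant t(coef) is hoisted out of the loop and the transposed matrix is passed around instead, and the outputs are compared with all.equal, as the post does at each stage.

```r
set.seed(1)
coef <- matrix(runif(1000 * 50), nrow = 1000)  # hypothetical coefficient matrix
var  <- runif(50)                              # hypothetical covariate vector

# Before: every prediction re-transposes the whole coefficient matrix.
before <- sapply(1:1000, function(i) exp(sum(var * t(coef)[, i])))

# After: transpose once up front, then reuse the result.
tcoef <- t(coef)                               # 50 x 1000
after <- sapply(1:1000, function(i) exp(sum(var * tcoef[, i])))

# Check the refactor didn't change the answers.
stopifnot(isTRUE(all.equal(before, after)))
```

Wrapping each version in system.time() is an easy way to confirm the speedup on real data.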
There's still some work to do to make this code more general purpose. Currently it works on predictions for one particular country - we want to do predictions over half of Africa!
You don't have to be doing TDD for dedicated testing frameworks to be useful. What's more, it's not that hard to do TDD even for statistical applications. Most people who don't use testing frameworks and don't do repeatable testing make excuses like this because they are lazy. In reality, investing in repeatable tests means less work in the long term - so the truly lazy should test more than they do!
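One way to make statistical tests repeatable, sketched in base R with a hypothetical simulation function (the same pattern works with a framework like testthat): fix the RNG seed so the "random" result is deterministic, and compare against expectations with a tolerance rather than exactly.

```r
# Hypothetical statistical function under test.
simulate_mean <- function(n) mean(rnorm(n))

run_tests <- function() {
  set.seed(42)                     # fixed seed makes the run repeatable
  x <- simulate_mean(1e4)
  set.seed(42)
  y <- simulate_mean(1e4)
  stopifnot(identical(x, y))       # same seed => identical result
  stopifnot(abs(x) < 0.05)         # near the true mean of 0, with tolerance
  invisible(TRUE)
}

run_tests()
```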