One of the project proposals at the NHS Hack Day in Oxford was about doing some kind of sentiment analysis using Twitter and NHS hashtags and account IDs. I had a brief word with the proposer, since I'd recently seen something similar done in R, but he was on another project.
But this weekend I thought I'd have a go at doing something myself. The main idea came from Jeffrey Breen's blog post Mining Twitter for Airline Consumer Sentiment - there he writes some simple R functions that look for positive and negative words in tweets to give each one a score. I used his scoring code pretty much verbatim.
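Breen's score is simply the number of positive-word matches minus the number of negative-word matches in each tweet. Here's a minimal sketch of that idea (not his exact code), assuming pos.words and neg.words are character vectors loaded from an opinion lexicon:

```r
# Breen-style scoring sketch: score = (positive matches) - (negative matches).
# pos.words and neg.words are assumed character vectors of lexicon words.
score.sentiment <- function(sentences, pos.words, neg.words) {
  sapply(sentences, function(sentence) {
    # strip punctuation, control characters and digits, then lower-case
    sentence <- gsub("[[:punct:]]|[[:cntrl:]]|\\d+", "", sentence)
    words <- strsplit(tolower(sentence), "\\s+")[[1]]
    # match() gives NA for words not in a list; count the hits in each
    sum(!is.na(match(words, pos.words))) - sum(!is.na(match(words, neg.words)))
  }, USE.NAMES = FALSE)
}
```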
I found a Twitter list of NHS accounts, which I quickly pulled out of Firefox. This could probably be done with the twitteR package in R, but I found it just as quick to save the page from Firebug and parse the HTML with Python and BeautifulSoup. Make sure you've scrolled down to see all the members, then save the list from the HTML tab in Firebug. That gave me a parsable list of accounts with id, description, and image URL for good measure.
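I stayed in Python for that bit, but the same parse could be sketched in R with the XML package. The XPath class names here are guesses, so inspect the HTML you actually saved and adjust:

```r
# An equivalent parse in R with the XML package, instead of Python and
# BeautifulSoup. The class names below are assumptions about Twitter's
# markup -- check the saved HTML and change them to match.
library(XML)
doc  <- htmlParse("nhs_list.html")
ids  <- xpathSApply(doc, "//span[@class='username']", xmlValue)
bios <- xpathSApply(doc, "//p[@class='bio']", xmlValue)
imgs <- xpathSApply(doc, "//img[@class='avatar']", xmlGetAttr, "src")
accounts <- data.frame(id = ids, description = bios, image = imgs,
                       stringsAsFactors = FALSE)
```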
Then I could run the sentiment analysis, looking for tweets that mention the NHS accounts and computing a score for each. The hope was that these would be tweets from people mentioning those accounts and expressing an opinion of the service.
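The collection loop might look something like this with the twitteR package (a sketch: the handles come from the accounts table above, and score.sentiment() is the Breen-style function):

```r
# Search for mentions of each account and score the results. accounts$id and
# score.sentiment() are the (assumed) names from the sketches above.
library(twitteR)
results <- lapply(accounts$id, function(handle) {
  tweets <- try(searchTwitter(paste0("@", handle), n = 100), silent = TRUE)
  if (inherits(tweets, "try-error") || length(tweets) == 0) return(NULL)
  texts <- sapply(tweets, function(t) t$getText())
  data.frame(account = handle,
             score   = sum(score.sentiment(texts, pos.words, neg.words)),
             n       = length(texts),
             stringsAsFactors = FALSE)
})
scores <- do.call(rbind, results)  # rbind drops the NULLs for empty searches
```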
There were some problems with escape characters (the RJSONIO package worked better than the rjson package) and other encodings, but I managed to work round them. Soon I had a table where I could order the NHS accounts by sentiment.
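With per-account scores collected, that table is just an ordering, for example by average sentiment per tweet:

```r
# Order the accounts by mean sentiment per tweet, most negative first
# (column names as in the hypothetical scores table above)
scores$mean.score <- scores$score / scores$n
scores[order(scores$mean.score), ]
```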
But I'm a map geek. How could I map this? Most of the Twitter accounts were geographic, but at different scales: some were individual trusts or hospitals, while others covered entire PCTs, SHAs, or even the new CCGs. There were also a few nationwide NHS accounts.
So I adopted a manual approach to geocoding. First I generated a shapefile with all the accounts located in the North Sea (sketched below). I loaded this into QGIS with an OpenStreetMap background layer, then dragged each account into roughly the right place. I left the national accounts floating in the sea. 156 drags later, they all had a place.
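Seeding that shapefile in R might look like this (a sketch assuming the sp and rgdal packages; the small jitter keeps the 156 points separable for dragging):

```r
# Drop every account at a jittered spot in the North Sea (roughly 1.5E 55.5N)
# and write a point shapefile ready for dragging around in QGIS.
library(sp)
library(rgdal)
n <- nrow(accounts)
pts <- SpatialPointsDataFrame(
  coords = cbind(1.5 + runif(n, 0, 0.5), 55.5 + runif(n, 0, 0.5)),
  data = accounts,
  proj4string = CRS("+proj=longlat +datum=WGS84")
)
writeOGR(pts, dsn = ".", layer = "nhs_accounts", driver = "ESRI Shapefile")
```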
Back in R, I put the sentiment scores and tweet counts back onto the geodata and saved a new shapefile (there's a sketch of that join below, after the map). But how to distribute this? I could save it as KML and people could use Google Earth, but I thought I'd give CartoDB a try. This is a cloud mapping service where you can upload shapefiles and make pretty maps out of them. With a bit of styling, I had it. Here it is, blues for positive sentiment and reds for negative (to the nearest whole number):
Or use the direct link to the map.
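For the record, the join is straightforward with rgdal: match the scores table onto the hand-placed points by account handle and write a new shapefile. Layer and field names here are my assumptions:

```r
# Join the sentiment scores onto the hand-placed points. The layer name
# (as saved from QGIS after dragging) and field names are assumptions.
library(rgdal)
placed <- readOGR(".", "nhs_accounts_placed")
idx <- match(placed$id, scores$account)   # align rows by account handle,
placed$score   <- scores$score[idx]       # keeping the geometry order intact
placed$ntweets <- scores$n[idx]
writeOGR(placed, dsn = ".", layer = "nhs_sentiment", driver = "ESRI Shapefile")
```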
Of course you need to take a lot of this with a pinch of salt. The locations are approximate. The sentiment scoring system is quite naive. The scores are based on fairly small numbers of tweets (click on a point to see). And they're a snapshot from the moment I ran the analysis.
Nevertheless, it's a start. It would be interesting to automate this and see how sentiment changes over time, but I think it would need a lot more tweets to get something amenable to statistical analysis. Once I figure out a statistical distribution for these scores (a difference of two Poissons, maybe) I could map the significance of the sentiment scores, which would take the small samples into account. Exeter may look red and angry, but that's from a single tweet!
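Incidentally, the difference of two independent Poisson counts is the Skellam distribution, so even without the algebra a quick simulation gives a feel for significance. Everything here (the rate, the observed score) is made up for illustration:

```r
# Toy significance check: under a null where positive and negative word counts
# are independent Poissons with the same rate, their difference is Skellam-
# distributed. Simulate it and ask how extreme an observed score would be.
null.rate <- 2                                 # assumed matches per account
sim <- rpois(10000, null.rate) - rpois(10000, null.rate)
observed <- -3                                 # e.g. one angry-looking account
mean(abs(sim) >= abs(observed))                # two-sided Monte Carlo p-value
```

With real per-account rates for the positive and negative matches, the same idea could flag which reds and blues are more than noise.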