Data journalists generally use Microsoft Excel for analysing data. Why? If you run Windows, Word is most likely your word processor, and Excel comes bundled with it in Microsoft Office. You don’t have to install anything, and it works well for making a top ten, calculating an average or percentages, or building a (pivot) table from your variables.
For deeper statistical analysis, SPSS (Statistical Package for the Social Sciences) is popular, especially at universities that hold a license for the package. There are five good reasons for data journalists to rethink the use of these tools and move towards R (the R Project).
As Gregor Aisch of the Open Knowledge Foundation notes in the Data Journalism Handbook: “It is hard to find any visualisation method or data wrangling technique that is not already built into R. R is a universe in its own, the mecca of visual data analysis. …Trained data journalists can use R to analyze huge dataset which extends the limits of Excel”
1. R is free and open source
That means you don’t have to pay to use it and can download the software for free. On top of this, the software is constantly developing: users and programmers add new packages to R all the time, opening new areas and better tools for analysis.
2. R is available for all major platforms
It doesn’t matter whether you’re on Windows, Mac or Linux: your R experience will be the same. It was mind-boggling to do data analysis with a group of journalists who were using two different Windows versions of Excel (2003 and 2007/2010) alongside Excel for Mac. You easily get lost in all the differences in menus, ribbons and available context options.
3. R is not a single software program like Excel for making calculations
Instead, it’s a language to be used in combination with packages developed for specific jobs. When you download R, a number of standard packages are installed with it, enough to do a simple analysis and produce some graphs.
For specific tasks, other packages have to be installed from the CRAN servers, where all R packages are available.
There are, for example, different packages for social network analysis, scraping data, or producing high-end graphics. Argh… that sounds difficult. Well, R itself is just a terminal screen with a prompt, waiting for your command to install new packages. RStudio, however, gives you a friendly GUI for all of this. On top of that, R Commander offers a complete GUI for detailed statistical analysis.
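To give an idea of what installing a package from the prompt looks like, here is a minimal sketch. The helper name `ensure_package` is made up for this illustration, and the mirror URL is one common CRAN mirror rather than the only choice. The sketch lists the packages that ship with R, then loads a package, installing it from CRAN first only if it is missing.

```r
# Packages that come with a standard R installation
base_pkgs <- rownames(installed.packages(priority = "base"))
print(base_pkgs)

# Hypothetical helper: install a package from CRAN only if it is
# not yet on this machine, then load it for the current session
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Needs an internet connection; the mirror URL is an assumption
    install.packages(pkg, repos = "https://cloud.r-project.org")
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" ships with R, so nothing is downloaded here
ensure_package("stats")

# For an add-on package you would call, for example:
# ensure_package("ggplot2")
```

In everyday use most people simply type `install.packages("ggplot2")` once and `library(ggplot2)` at the start of each session; the helper only wraps those two steps.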
4. You are not alone in R
There is an active community out there, writing manuals and handouts and giving examples of analysis. Just go to R and explore the veritable gold mine of different resources. A great source to follow is R-Bloggers, for ready-to-use examples on scraping or interesting how-tos for making a nice scatter plot.
5. Interest in R is growing
It is estimated that there are around 20-30 downloads of R packages per week. The page views on Wikipedia for R meanwhile add up to 1 000 per day.
Interest in the job market is also increasing. The demand in the industry for data specialists with R skills is even bigger than for SPSS. For journalists who have lost their jobs recently and have an interest in data journalism, this opens up new possibilities.
One minor snag
With all these goodies in the basket, there is, inevitably, a snag. As Aisch points out, “one drawback is that you need to learn (yet another) programming language as R has its own language. But once you have taken the initial climb on the learning curve, there’s no tool more powerful than R”.
Except for R Commander, R is not just clicking around in menus. You need to tell the software what to do. From importing your data (.xls, .dbf or .csv) to producing a table, calculating the margins, plotting a regression line, or making a histogram or even a choropleth map, it is all command driven.
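A short session gives a feel for what "command driven" means in practice. This is only a sketch under a few assumptions: the data is R's built-in `mtcars` example set, written to a temporary .csv file to stand in for a file of your own, and the choice of columns is arbitrary.

```r
# Stand-in for your own data file: write the built-in mtcars
# example data to a temporary .csv file
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Import the data
cars <- read.csv(csv_path)

# Produce a table: number of cylinders by number of gears
tab <- table(cars$cyl, cars$gear)
print(tab)

# Calculate the margins: row and column totals, then row percentages
print(addmargins(tab))
print(round(prop.table(tab, margin = 1) * 100, 1))

# Make a histogram of fuel consumption (miles per gallon)
hist(cars$mpg, main = "Miles per gallon", xlab = "mpg")

# Scatter plot of weight against mpg with a regression line
fit <- lm(mpg ~ wt, data = cars)
plot(cars$wt, cars$mpg, xlab = "Weight", ylab = "mpg")
abline(fit)
```

Each line is one instruction, and together they form a record of the analysis that can be rerun step by step.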
RStudio is the environment to use for this, and Swirl is a great help in learning how to use it. The good thing is that it helps you track exactly what you have done: store your analysis, with the commands and the results, in a file to review later. Finally, you could turn the whole operation into a script to reuse later with other data.
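As a rough illustration of reusing an analysis, the commands can be wrapped in a function and pointed at a different file each time. The function name `top_ten` and its arguments are invented for this sketch, and the example again uses a temporary copy of the built-in `mtcars` data in place of a real file.

```r
# Hypothetical helper: the same top-ten ranking for any .csv file
top_ten <- function(path, column) {
  data <- read.csv(path)
  # Sort rows from highest to lowest value in the chosen column
  ranked <- data[order(data[[column]], decreasing = TRUE), ]
  head(ranked, 10)
}

# Example run on a temporary copy of the built-in mtcars data
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

result <- top_ten(csv_path, "mpg")
print(result)
```

Saved in a script file, the same function can be run next month on a new file by changing only the path and the column name.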
R does not teach you how to do statistics. It applies statistics to your data. Using R presupposes some statistical knowledge of how to do an analysis and what to calculate. But the same is true of Excel. Of course you can learn both; there are interesting books on how to do statistical analysis in R.
I started with data analysis long ago, doing calculations by hand on a piece of paper, then with SPSS and later with Excel and Calc (the OpenOffice counterpart to Excel). I use Excel in training for data journalists and it is fine as a first step. I think, however, that learning R is worth the effort; it is much more flexible, with a wider range of possibilities.