It was a work in progress for almost a year, but 40 ‘versions’ later, Paul Bradshaw’s ‘Scraping for journalists’ is published. Bradshaw teaches at City University London and Birmingham City University, and he is also a respected data journalist and blogger at the Online Journalism Blog. And not without reason.
You can order a copy of the work as an e-book, available in PDF, Mobi or Epub formats. Leanpub, where you can obtain a copy, has an interesting concept: it offers all the tools for producing and publishing a book. You can make changes and additions while publishing, and, not an unimportant factor, the royalties are higher than in traditional publishing. Bradshaw says he has “become a huge fan” as “the format combines the best qualities of traditional book publishing with those of blogging and social media.”
‘Scraping for journalists’ is a must-read for data journalists. One of their problems is how to get data from online sources into a spreadsheet. Scraping is the answer. But how do you do that, given that most journalists are not coders? In 30 chapters and almost 500 pages, Bradshaw gives his recipes for scraping data. The book is not meant to be read from cover to cover; it is for learning by doing. You follow the recipes step by step on your computer, add some variation to the examples, and finally try to apply the recipes to your own data. This works wonderfully, because learning to program from scratch takes too long before you get results. Now you have some ready-made code that works, and you can experiment until you can successfully apply it to your own data.
You can make a quick start from chapter one. Within five minutes, you can scrape your first data. Bradshaw starts by explaining the functions IMPORTHTML and IMPORTXML, which are used in Google Drive spreadsheets to import data from a web page. The trick is to find the right table or list holding the data. You can dig deep into the HTML or XML soup, but you can also guess and experiment. Just try some numbers in the expression, advises Bradshaw.
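To make the "just try some numbers" advice concrete, here is a minimal Python sketch of the idea behind picking a table by its number. It is not Bradshaw's code and not how the spreadsheet function is implemented; the names `TableCollector`, `import_html_table` and `SAMPLE_PAGE` are made up for illustration, and the sample page is invented. The sketch assumes tables are not nested and, like the spreadsheet function, counts tables from 1.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collects every <table> on a page as a list of rows of cell text.

    Assumption: tables are not nested inside each other.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table, in page order
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_html_table(html, index):
    """Return table number `index` from the page, counting from 1."""
    parser = TableCollector()
    parser.feed(html)
    return parser.tables[index - 1]


# Invented sample page: table 1 is navigation, table 2 holds the data.
SAMPLE_PAGE = """
<table><tr><td>Home</td><td>Contact</td></tr></table>
<table>
  <tr><th>Member</th><th>Party</th></tr>
  <tr><td>A. Jansen</td><td>Green</td></tr>
</table>
"""

if __name__ == "__main__":
    # Guess and experiment: try each number until the right table shows up.
    for number in (1, 2):
        print(number, import_html_table(SAMPLE_PAGE, number))
```

Trying the numbers one by one shows quickly that the first table is only page furniture and the second one is the data you want, which is exactly the kind of experimenting the chapter encourages.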
Of course, extracting tables from a website can be done faster with a nice tool called Outwit Hub. You just load your data web page in Outwit and push the ‘table’ button, and there is your scraped data, ready to be exported in Excel format. The free version works, but Bradshaw advises buying the official one for about 60 euros, because it does not have the limitation of scraping only a hundred lines. This is useful when you are scraping a lot of data. Take, for example, 150 members of parliament, who all have their own web pages. If the pages are structured in the same way, with a heading and paragraph where the members state their education and former jobs, collecting this by hand, page after page, is pretty boring and time-consuming. Instead, you can build a scraper based on the opening and end tags for education and jobs, then run it over the 150 individual member pages. Have a cup of coffee, and after a while your data will be ready for exporting to Excel. Bradshaw takes great care to explain how to find the opening and end tags in the HTML soup for the data you are looking for. This makes sure you will get it working after a while.
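The opening-tag and end-tag approach is easy to picture in a few lines of Python. This is only a rough sketch of the principle, not what Outwit Hub does internally and not code from the book: the page contents, the markers in `FIELDS`, and the helper names `between`, `scrape` and `to_csv` are all hypothetical. A real scraper would download one page per member instead of reading them from a dictionary.

```python
import csv
import io

# Hypothetical member pages that all share the same structure.
PAGES = {
    "member-1": "<h1>Member 1</h1><h3>Education</h3><p>Law</p>"
                "<h3>Former jobs</h3><p>Lawyer</p>",
    "member-2": "<h1>Member 2</h1><h3>Education</h3><p>Economics</p>"
                "<h3>Former jobs</h3><p>Teacher, civil servant</p>",
}

# For each column: the text just before the data and the text just after it.
FIELDS = {
    "education": ("<h3>Education</h3><p>", "</p>"),
    "former_jobs": ("<h3>Former jobs</h3><p>", "</p>"),
}


def between(text, start_marker, end_marker):
    """Return the text between the first start_marker and the next end_marker.

    Returns an empty string if either marker is missing.
    """
    start = text.find(start_marker)
    if start == -1:
        return ""
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        return ""
    return text[start:end].strip()


def scrape(pages):
    """Apply the same markers to every page and return one row per page."""
    rows = []
    for name, html in pages.items():
        row = {"page": name}
        for field, (start_marker, end_marker) in FIELDS.items():
            row[field] = between(html, start_marker, end_marker)
        rows.append(row)
    return rows


def to_csv(rows):
    """Turn the rows into CSV text that a spreadsheet program can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["page", *FIELDS])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(scrape(PAGES)))
```

The point of the design is that the markers are written down once and then reused for every page, so 150 pages cost no more effort than two, as long as the pages really are structured alike.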
You are not the only journalist who is scraping data. Scraperwiki is the playground where you can meet your friends and share your skills. On Scraperwiki you will find various scrapers used by others to collect data. Copy one, revise it for your own purposes and run it. This sounds simple; however, scrapers are written in code, and generally three languages are used: PHP, Ruby and Python. You don’t have to be a programmer to use the scripts. After Bradshaw’s explanation of the structure of a scraper, you can start experimenting yourself. And, like any good educator and trainer, Bradshaw gives you some assignments at the end of each chapter.
There is much more to discover: do you know how to scrape a PDF, cells in a large spreadsheet, or data in a CSV file? In the book you will find the recipes. When I show the tricks in training sessions, participants always ask: do you have this in writing? Now it is in writing.