Every data journalist knows the feeling: you’re working on a massive project, you’ve finally found the data… but it is in PDF format.
Last month I had a crime reporter from Cape Town in one of my data journalism training sessions, who had managed to get around 60 PDF pages worth of stats out the relevant authorities. She explored and analyzed them by hand, which took days. That set me thinking. The problem can’t be all that uncommon and there must be a good few data journalists out there who could use a quick guide to scraping spreadsheets from PDFs.
The ideal of course is not getting your data in PDF form in the first place. It all comes from the same database, and it shouldn’t be any effort for the people concerned to save the same data in an Excel spreadsheet. The unfortunate truth however is that a lot of officials aren’t willing to do that out of fear that you’ll tinker with their data.
There are some web services like cometdocs or pdftoexcelonline that could help you out. Or you could try to build a scraper yourself, but then you have to read Paul Bradshaw‘s Scraping for Journalists first.
My favourite tool though is Tabula. Tabula describes itself as “a tool for liberating data tables trapped inside PDF files”. It’s fairly easy to use too. All you have to do is import your PDF, select your data, push a button and there is your spreadsheet! You save the scraped page in CSV and from there you can import it into any spreadsheet program.
One small problem is that Tabula only scrapes one PDF page at a time. So 10 PDF pages worth of data gives you 10 spreadsheets.
Installing Tabula is a piece of cake: download, unzip and run. Tabula is written in Java (so you should have Java installed) and uses Ruby for scraping, which is one of the languages used on Scraperwiki to build tailor-made PDF scrapers.