• Motorburn
      Because cars are gadgets
    • Gearburn
      Incisive reviews for the gadget obsessed
    • Ventureburn
      Startup news for emerging markets
    • Jobsburn
      Digital industry jobs for the anti 9 to 5!

The 5 minute guide to scraping data from PDFs

Every data journalist knows the feeling: you’re working on a massive project, you’ve finally found the data… but it is in PDF format.

Last month I had a crime reporter from Cape Town in one of my data journalism training sessions, who had managed to get around 60 PDF pages worth of stats out the relevant authorities. She explored and analyzed them by hand, which took days. That set me thinking. The problem can’t be all that uncommon and there must be a good few data journalists out there who could use a quick guide to scraping spreadsheets from PDFs.

The ideal of course is not getting your data in PDF form in the first place. It all comes from the same database, and it shouldn’t be any effort for the people concerned to save the same data in an Excel spreadsheet. The unfortunate truth however is that a lot of officials aren’t willing to do that out of fear that you’ll tinker with their data.

There are some web services like cometdocs or pdftoexcelonline that could help you out. Or you could try to build a scraper yourself, but then you have to read Paul Bradshaw‘s Scraping for Journalists first.


My favourite tool though is Tabula. Tabula describes itself as “a tool for liberating data tables trapped inside PDF files”. It’s fairly easy to use too. All you have to do is import your PDF, select your data, push a button and there is your spreadsheet! You save the scraped page in CSV and from there you can import it into any spreadsheet program.

One small problem is that Tabula only scrapes one PDF page at a time. So 10 PDF pages worth of data gives you 10 spreadsheets.

Installing Tabula is a piece of cake: download, unzip and run. Tabula is written in Java (so you should have Java installed) and uses Ruby for scraping, which is one of the languages used on Scraperwiki to build tailor-made PDF scrapers.

Author | Peter Verweij

Peter Verweij
After 30 years of lecturing and training at the School of Journalism at Utrecht in journalism, politics and new media, Peter Verweij, started in 2005 his own company D3-Media, which focuses on the following areas: Production of journalistic content for multimedia media and blogs; Research in the area... More