The 5 minute guide to scraping data from PDFs



Every data journalist knows the feeling: you’re working on a massive project, you’ve finally found the data… but it is in PDF format.

Last month a crime reporter from Cape Town attended one of my data journalism training sessions. She had managed to get around 60 PDF pages’ worth of stats out of the relevant authorities, and then explored and analyzed them by hand, which took days. That set me thinking. The problem can’t be all that uncommon, and there must be a good few data journalists out there who could use a quick guide to scraping spreadsheets from PDFs.

The ideal, of course, is not to get your data in PDF form in the first place. It all comes from the same database, and it shouldn’t take any effort for the people concerned to save the same data as an Excel spreadsheet. The unfortunate truth, however, is that a lot of officials aren’t willing to do that, out of fear that you’ll tinker with their data.

There are some web services, like cometdocs or pdftoexcelonline, that could help you out. Or you could try to build a scraper yourself, but then you have to read Paul Bradshaw’s Scraping for Journalists first.

Tabula

My favourite tool, though, is Tabula. Tabula describes itself as “a tool for liberating data tables trapped inside PDF files”. It’s fairly easy to use too. All you have to do is import your PDF, select your data, push a button and there is your spreadsheet! You save the scraped page as a CSV file, and from there you can import it into any spreadsheet program.

One small problem is that Tabula only scrapes one PDF page at a time. So 10 PDF pages’ worth of data gives you 10 separate CSV files, one per page.
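If you end up with one CSV file per page, stitching them back together by hand gets tedious. The following is a minimal Python sketch of one way to do it, not something Tabula provides. The function name merge_page_csvs and the file names are hypothetical, and it assumes every page has the same columns and (by default) repeats the same header row, which is kept only once.

```python
import csv


def merge_page_csvs(page_files, output_file, has_header=True):
    """Concatenate per-page CSV files into one CSV file.

    page_files  -- list of CSV paths, already in page order
    output_file -- path of the combined CSV to write
    has_header  -- assumption: the first non-empty row of every page
                   is a header row; it is written only once

    Returns the number of data rows written (header excluded).
    """
    rows_written = 0
    header_written = False
    with open(output_file, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for page_file in page_files:
            with open(page_file, newline="", encoding="utf-8") as src:
                first_row = True
                for row in csv.reader(src):
                    if not row:
                        continue  # skip completely empty lines
                    if has_header and first_row:
                        first_row = False
                        if not header_written:
                            writer.writerow(row)
                            header_written = True
                        continue  # never count a header as data
                    first_row = False
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

You would call it with your own file names in page order, for example merge_page_csvs(["page_01.csv", "page_02.csv"], "all_pages.csv"). If your pages have no header row, pass has_header=False so that the first row of each file is kept as data.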

Installing Tabula is a piece of cake: download, unzip and run. Tabula is written in Java (so you should have Java installed) and uses Ruby for scraping, which is one of the languages used on ScraperWiki to build tailor-made PDF scrapers.
