The 5-minute guide to scraping data from PDFs


Every data journalist knows the feeling: you’re working on a massive project, you’ve finally found the data… but it is in PDF format.

Last month I had a crime reporter from Cape Town in one of my data journalism training sessions, who had managed to get around 60 PDF pages' worth of stats out of the relevant authorities. She explored and analyzed them by hand, which took days. That set me thinking: the problem can't be all that uncommon, and there must be a good few data journalists out there who could use a quick guide to scraping spreadsheets out of PDFs.

The ideal, of course, is not to get your data in PDF form in the first place. It all comes from the same database, so it shouldn't be any extra effort for the people concerned to save the same data as an Excel spreadsheet. The unfortunate truth, however, is that a lot of officials aren't willing to do that, out of fear that you'll tinker with their data.

There are some web services, like Cometdocs or PDF to Excel Online, that could help you out. Or you could try to build a scraper yourself, in which case you should read Paul Bradshaw's Scraping for Journalists first.
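A DIY scraper usually boils down to two steps: pulling the raw text out of the PDF (libraries like pdfplumber or PDFMiner handle that part) and then parsing that text back into rows and columns. Here's a minimal sketch of the parsing step in Python, assuming the text has already been extracted and the columns are separated by runs of spaces (the crime figures below are invented sample data, not real stats):

```python
import csv
import io
import re

# Invented sample: text as it might come out of a PDF text extractor,
# with columns separated by runs of two or more spaces.
raw_text = """\
Station           Burglaries   Assaults
Cape Town               1021        433
Khayelitsha              877        912
Mitchells Plain          954        655
"""

def text_to_rows(text):
    """Split each non-empty line on runs of 2+ spaces into a list of cells."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append(re.split(r"\s{2,}", line.strip()))
    return rows

rows = text_to_rows(raw_text)

# Write the parsed rows out as CSV, ready for any spreadsheet program.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
print(buf.getvalue())
```

Note that splitting on multiple spaces keeps "Mitchells Plain" together as one cell; real PDFs are messier, which is exactly why tools like the one below exist.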

Tabula

My favourite tool, though, is Tabula. Tabula describes itself as "a tool for liberating data tables trapped inside PDF files", and it's fairly easy to use too. All you have to do is import your PDF, select your data and push a button, and there's your spreadsheet. You save the scraped page as CSV, and from there you can import it into any spreadsheet program.

One small problem is that Tabula only scrapes one PDF page at a time, so 10 PDF pages' worth of data gives you 10 spreadsheets.
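Stitching those per-page files back together is a few lines of Python. Here's a sketch that assumes each per-page CSV shares the same header row, so the header is kept from the first file only (the file names and figures are invented for the demo):

```python
import csv
import tempfile
from pathlib import Path

def merge_csvs(paths, out_path):
    """Concatenate CSV files that share a header row: keep the header
    from the first file, append only the data rows of the rest."""
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for i, path in enumerate(paths):
            with open(path, newline="") as f:
                rows = list(csv.reader(f))
            writer.writerows(rows if i == 0 else rows[1:])

# Demo with two invented per-page files, as Tabula might produce them.
tmp = Path(tempfile.mkdtemp())
(tmp / "page1.csv").write_text("Station,Burglaries\nCape Town,1021\n")
(tmp / "page2.csv").write_text("Station,Burglaries\nKhayelitsha,877\n")

merge_csvs([tmp / "page1.csv", tmp / "page2.csv"], tmp / "merged.csv")
print((tmp / "merged.csv").read_text())
```

On the command line you could get the same result with `cat`, but dropping the repeated header rows by hand gets tedious past a handful of pages.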

Installing Tabula is a piece of cake: download, unzip and run. Tabula is written in Java (so you need Java installed) and uses Ruby for the scraping, Ruby being one of the languages used on ScraperWiki to build tailor-made PDF scrapers.


