Online journalism • 28 Nov 2013

The 5 minute guide to scraping data from PDFs

By Peter Verweij


Every data journalist knows the feeling: you’re working on a massive project, you’ve finally found the data… but it is in PDF format.

Last month a crime reporter from Cape Town attended one of my data journalism training sessions. She had managed to get around 60 PDF pages’ worth of stats out of the relevant authorities, then explored and analysed them by hand, which took days. That got me thinking. The problem can’t be all that uncommon, and there must be a good few data journalists out there who could use a quick guide to scraping spreadsheets from PDFs.

The ideal, of course, is not getting your data in PDF form in the first place. It all comes from the same database, and it shouldn’t take the people concerned any effort to save the same data as an Excel spreadsheet. The unfortunate truth, however, is that a lot of officials aren’t willing to do that for fear that you’ll tinker with their data.

There are some web services, such as cometdocs or pdftoexcelonline, that could help you out. Or you could try to build a scraper yourself, but then you should read Paul Bradshaw’s Scraping for Journalists first.

Tabula

My favourite tool, though, is Tabula. Tabula describes itself as “a tool for liberating data tables trapped inside PDF files”. It’s fairly easy to use too. All you have to do is import your PDF, select your data, push a button and there is your spreadsheet! You save the scraped page as a CSV file, and from there you can import it into any spreadsheet program.

One small problem is that Tabula only scrapes one PDF page at a time. So 10 PDF pages’ worth of data gives you 10 separate CSV files.
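If you end up with one CSV file per page, stitching them back together does not have to be done by hand. The following is a minimal sketch in Python, not part of Tabula itself. The function name merge_csv_pages and the file names in the usage note are hypothetical, and the sketch assumes every page was exported with the same columns and that the first row of the first file is the header.

```python
import csv


def merge_csv_pages(page_paths, out_path, skip_repeated_header=True):
    """Concatenate per-page CSV files into one CSV, in the order given.

    Assumption: the first row of the first file is the header. If a later
    page starts with an identical row, it is treated as a repeated header
    and dropped (unless skip_repeated_header is False).
    Returns the number of rows written, header included.
    """
    header = None
    rows_written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        for path in page_paths:
            with open(path, newline="", encoding="utf-8") as page_file:
                for i, row in enumerate(csv.reader(page_file)):
                    if not row:
                        continue  # ignore completely blank lines
                    if i == 0:
                        if header is None:
                            header = row  # remember the first header seen
                        elif skip_repeated_header and row == header:
                            continue  # same header again on a later page
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


# Hypothetical usage: list the per-page files explicitly, in page order.
# merge_csv_pages(["page-1.csv", "page-2.csv", "page-3.csv"], "merged.csv")
```

Listing the files explicitly keeps the pages in the right order; a plain alphabetical sort would put page-10.csv before page-2.csv. The merged file can then be opened in any spreadsheet program like a single scraped page.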

Installing Tabula is a piece of cake: download, unzip and run. Tabula is written in Java (so you need to have Java installed) and uses Ruby for scraping. Ruby is also one of the languages used on ScraperWiki to build tailor-made PDF scrapers.
