Online journalism • 28 Nov 2013

The 5 minute guide to scraping data from PDFs

By Peter Verweij

Every data journalist knows the feeling: you’re working on a massive project, you’ve finally found the data… but it is in PDF format.

Last month a crime reporter from Cape Town attended one of my data journalism training sessions. She had managed to get around 60 PDF pages’ worth of stats out of the relevant authorities, then explored and analyzed them by hand, which took days. That set me thinking. The problem can’t be all that uncommon, and there must be a good few data journalists out there who could use a quick guide to scraping spreadsheets from PDFs.

The ideal, of course, is not getting your data in PDF form in the first place. It all comes from the same database, and it shouldn’t take much effort for the people concerned to save the same data as an Excel spreadsheet. The unfortunate truth, however, is that a lot of officials aren’t willing to do that, out of fear that you’ll tinker with their data.

There are some web services, like cometdocs or pdftoexcelonline, that could help you out. Or you could try to build a scraper yourself, but then you should read Paul Bradshaw’s Scraping for Journalists first.

Tabula

My favourite tool, though, is Tabula. Tabula describes itself as “a tool for liberating data tables trapped inside PDF files”. It’s fairly easy to use too. All you have to do is import your PDF, select your data, push a button and there is your spreadsheet! You save the scraped page as a CSV file, and from there you can import it into any spreadsheet program.

One small problem is that Tabula only scrapes one PDF page at a time, so 10 PDF pages’ worth of data gives you 10 spreadsheets.
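If you end up with one CSV file per page, stitching them back together is a small scripting job. The following is a minimal sketch in Python, not part of Tabula itself: the function name merge_csv_pages and the file names in the usage comment are made up for illustration, and it assumes every page's table has the same columns and that the files are UTF-8 encoded. It keeps the header row from the first file and drops identical header rows from the later ones.

```python
import csv


def merge_csv_pages(page_paths, output_path):
    """Concatenate per-page CSV files into one file with a single header row.

    Assumes (hypothetically) that every page shares the same columns.
    Returns the number of data rows written, not counting the header.
    """
    header = None
    rows_written = 0
    with open(output_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        for path in page_paths:
            with open(path, newline="", encoding="utf-8") as in_file:
                for index, row in enumerate(csv.reader(in_file)):
                    if not row:
                        continue  # skip blank lines
                    if index == 0:
                        if header is None:
                            # First row of the first file becomes the header.
                            header = row
                            writer.writerow(row)
                            continue
                        if row == header:
                            # Repeated header on a later page: drop it.
                            continue
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


# Example usage with hypothetical file names:
# from pathlib import Path
# pages = sorted(Path("tabula_output").glob("page-*.csv"))
# merge_csv_pages(pages, "combined.csv")
```

If a later page starts with a data row rather than a repeated header, that row is kept as data, so it is worth checking the combined file against the original PDF before you start analysing it.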

Installing Tabula is a piece of cake: download, unzip and run. Tabula is written in Java (so you should have Java installed) and uses Ruby for scraping. Ruby is also one of the languages used on ScraperWiki to build tailor-made PDF scrapers.
