Online journalism • 28 May 2013

Aspiring data journalist? This book is a must-read

By Peter Verweij


After almost a year of work in progress and 40 ‘versions’, Paul Bradshaw’s ‘Scraping for journalists’ has been published. Bradshaw teaches at City University London and Birmingham City University, and he is also a respected data journalist and blogger at the Online Journalism Blog. And not without reason.

You can order the book as an e-book in PDF, Mobi or Epub format. Leanpub, the platform that sells it, has an interesting concept: it offers all the tools for both producing and publishing a book. Authors can make changes and additions after publication and, not unimportantly, the royalties are higher than in traditional publishing. Bradshaw says he has “become a huge fan” as “the format combines the best qualities of traditional book publishing with those of blogging and social media.”

Must read

‘Scraping for journalists’ is a must read for data journalists. One of their recurring problems is how to get data from online sources into a spreadsheet. Scraping is the answer. But how do you do that, given that most journalists are not coders? In 30 chapters and almost 500 pages, Bradshaw gives his recipes for scraping data. The book is not meant to be read from cover to cover; it is meant for learning by doing. You follow the recipes step by step on your computer, add some variation to the examples and finally try to apply the recipes to your own data. This works wonderfully, because learning to program from scratch takes too long before you see results. Here you get ready-made code that works, and you can experiment until you can apply it successfully to your own data.

Fast start

You can make a quick start from chapter one. Within five minutes, you can scrape your first data. Bradshaw begins by explaining the ImportHTML and ImportXML functions, which pull data from a web page into a Google Drive spreadsheet. The trick is to find the right table or list on the page. You can dig deep into the HTML or XML soup, but you can also guess and experiment. Just try some numbers in the expression, advises Bradshaw.
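To make the idea of "trying some numbers" concrete, here is a minimal Python sketch of what picking a table by its position on the page amounts to. It is an illustration only, not code from the book: the names `TableCollector`, `import_table` and `SAMPLE_PAGE` are hypothetical, the sample HTML is made up, and the sketch assumes well-formed tables that are not nested.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # all tables found so far, in page order
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_table(html, index):
    """Return the index-th table on the page, counting from 1."""
    collector = TableCollector()
    collector.feed(html)
    return collector.tables[index - 1]


# Hypothetical page: the first table is navigation, the second holds the data.
SAMPLE_PAGE = """
<table><tr><td>Menu</td><td>Contact</td></tr></table>
<table>
  <tr><th>Party</th><th>Seats</th></tr>
  <tr><td>Red</td><td>80</td></tr>
  <tr><td>Blue</td><td>70</td></tr>
</table>
"""

print(import_table(SAMPLE_PAGE, 2))
```

Changing the second argument from 2 to 1 returns the navigation table instead, which is the kind of guess-and-check the chapter encourages.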

Of course, extracting tables from a website can be done faster with a nice tool called Outwit Hub. You load the web page with your data in Outwit, push the ‘table’ button, and your scraped data is ready to be exported in Excel format. The free version works, but Bradshaw advises buying the paid one for about 60 euros, because it is not limited to scraping a hundred lines. That matters when you are scraping a lot of data. Take, for example, 150 members of parliament who all have their own web pages. If the pages are structured in the same way, with a heading and paragraph where the members state their education and former jobs, copying this by hand, page after page, is boring and time-consuming. Instead, you can build a scraper based on the opening and closing tags around education and jobs, then run it over the 150 individual member pages. Have a cup of coffee and, after a while, your data will be ready for export to Excel. Bradshaw takes great care in explaining how to find the opening and closing tags in the HTML soup for the data you are looking for, so that you will get it working after a while.
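The approach of scraping by opening and closing tags can be sketched in a few lines of Python. This is an assumption-laden illustration, not Outwit Hub's method and not code from the book: `extract_between` and `MEMBER_PAGES` are hypothetical names, the member pages are invented strings standing in for downloaded pages, and the markers would have to be adapted to the real page structure.

```python
import re


def extract_between(html, start_marker, end_marker):
    """Return the text between the first start_marker and the next end_marker."""
    start = html.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = html.find(end_marker, start)
    if end == -1:
        return None
    # Drop any leftover tags inside the fragment and tidy the whitespace.
    text = re.sub(r"<[^>]+>", " ", html[start:end])
    return " ".join(text.split())


# Hypothetical, identically structured member pages (normally fetched by URL).
MEMBER_PAGES = {
    "member-1": "<h1>A. Jansen</h1><h2>Education</h2><p>Law, <b>Leiden</b></p>"
                "<h2>Former jobs</h2><p>Lawyer</p>",
    "member-2": "<h1>B. de Vries</h1><h2>Education</h2><p>Economics</p>"
                "<h2>Former jobs</h2><p>Teacher, mayor</p>",
}

# Run the same pair of markers over every page and collect one row per member.
rows = []
for page, html in MEMBER_PAGES.items():
    rows.append({
        "page": page,
        "education": extract_between(html, "<h2>Education</h2>", "<h2>Former jobs</h2>"),
        "jobs": extract_between(html, "<h2>Former jobs</h2><p>", "</p>"),
    })

for row in rows:
    print(row)
```

Because every page shares the same structure, the same two markers work for all of them; a page that lacks a marker yields `None` rather than stopping the run, which makes the gaps easy to spot in the spreadsheet afterwards.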

Scraperwiki

You are not the only journalist who is scraping data. Scraperwiki is the playground where you can meet your peers and share your skills. On Scraperwiki you will find various scrapers that others have used to collect data. Copy one, revise it for your own purposes and run it. This sounds simple, but scrapers are written in code, generally in one of three languages: PHP, Ruby or Python. You don’t have to be a programmer to use the scripts, though. After Bradshaw’s explanation of the structure of a scraper, you can start experimenting yourself. And, like any good educator and trainer, Bradshaw gives you some assignments at the end of each chapter.

There is much more to discover: do you know how to scrape a PDF, cells in a large spreadsheet, or data in a CSV file? In the book you will find the recipes. When I show these tricks in training sessions, participants always ask: do you have this in writing? Now it is in writing.

Peter Verweij
