Scraping for Journalists

How to grab information from hundreds of sources, put it in data you can interrogate - and still hit deadlines

939 readers

Minimum: $15.10

Suggested: $20.01+

  • 559 pages
  • 87,635 words
  • Book language: English

About the Book

Scraping - getting a computer to capture information from online sources - is one of the most powerful techniques for data-savvy journalists who want to get to the story first, or find exclusives that no one else has spotted. Faster than FOI and more detailed than advanced search techniques, scraping also allows you to grab data that organisations would rather you didn’t have - and put it into a form that allows you to get answers.

Scraping for Journalists introduces you to a range of scraping techniques - from very simple ones that are no more complicated than a spreadsheet formula, to more complex challenges such as scraping databases or hundreds of documents. At every stage you'll see results - but you'll also be building towards more ambitious and powerful tools.

You’ll be scraping within 5 minutes of reading the first chapter - but more importantly you'll be learning key principles and techniques for dealing with scraping problems.

Unlike general books about programming languages, everything in this book has a direct application for journalism, and each principle of programming is related to its application in scraping for newsgathering. And unlike standalone guides and blog posts that cover particular tools or techniques, this book aims to give you skills that you can apply in new situations and with new tools.

About the Author

Paul Bradshaw runs the MA in Online Journalism at Birmingham City University, and is a Visiting Professor at City University’s School of Journalism in London. He publishes the Online Journalism Blog, and is the founder of investigative journalism website HelpMeInvestigate. He has written for journalism.co.uk, Press Gazette, the Guardian and Telegraph’s data blogs, InPublishing, Nieman Reports and the Poynter Institute in the US. He is the co-author of the Online Journalism Handbook with former Financial Times web editor Liisa Rohumaa, and of Magazine Editing (3rd Edition) with John Morrish. Other books to which Bradshaw has contributed include Investigative Journalism (second edition); Web Journalism: A New Form of Citizenship; and Citizen Journalism: Global Perspectives.

Bradshaw has been listed in Journalism.co.uk’s list of the leading innovators in journalism and media and Poynter’s most influential people in social media. In 2010, he was shortlisted for Multimedia Publisher of the Year.

In addition to teaching and writing, Paul acts as a consultant and trainer to a number of organisations on social media and data journalism. You can find him on Twitter @paulbradshaw.

The Leanpub Unconditional, No Risk, 100% Happiness Guarantee

Within 45 days of purchase you can get a 100% refund on any Leanpub purchase, in two clicks. We process the refunds manually, so they may take a few days to show up.

If you buy a Leanpub book you get all the updates to the book for free! All books are available in PDF, EPUB (for iPad) and MOBI (for Kindle). There is no DRM. There is no risk, just guaranteed happiness or your money back.

Table of Contents

  • 1 Introduction
    • A book about not reading books
    • I’m not a programmer
    • PS: This isn’t a book
  • 2 Scraper #1: Start scraping in 5 minutes
    • How it works: functions and parameters
    • What are the parameters? Strings and indexes
    • Tables and lists?
    • Recap
    • Tests
  • 3 Scraper #2: What happens when the data isn’t in a table?
    • Strong structure: XML
    • Scraping XML
    • Recap
    • Tests
  • 4 Scraper #3: Looking for structure in HTML
    • Detour: Introduction to HTML and the LIFO rule
    • Attributes and values
    • Classifying sections of content: div, span, classes and ids
    • Back to Scraper #3: Scraping a <div> in a HTML webpage
    • Recap
    • Tests
  • 5 Scraper #4: Finding more structure in webpages: Xpath
    • Recap
    • Tests
  • 6 Scraper #5: Scraping multiple pages with Google Docs
    • Recap
    • Tests
  • 7 Scraper #6: Structure in URLs - using Google Refine as a scraper
    • Assembling the ingredients
    • Bringing your data into Google Refine
    • Grabbing the HTML for each page
    • Extracting data from the raw HTML with parseHTML
    • Recap
    • Tests
  • 8 Scraper #7: Scraping multiple pages with ‘next’ links using OutWit Hub
    • Creating a basic scraper in OutWit Hub
    • Customised scrapers in OutWit
    • Recap
    • Tests
  • 9 Scraper #8: Poorly formatted webpages - solving problems with OutWit
    • Identifying what structure there is
    • Repeating a heading or other piece of data for each part within it
    • Splitting a larger piece of data into bits: using separators
    • Recap
    • Tests
  • 10 Scraper #9: Scraping uglier HTML and ‘regular expressions’ in an OutWit scraper
    • Introducing Regex
    • Using regex to specify a range of possible matches
    • Catching the regular expression too
    • I want any character: the wildcard and quantifiers
    • Matching zero, one or more characters - quantifiers
    • 3 questions: What characters, how many, where?
    • Using regex on an ugly page
    • What’s the pattern?
    • Matching non-textual characters
    • What if my data contains full stops, forward slashes or other special characters?
    • ‘Anything but that!’ - negative matches
    • This or that - looking for more than one regular expression at the same time
    • Only here - specifying location
    • Back to the scraper: grabbing the rest of the data
    • Which dash? Negative matches in practice.
    • Recap
    • Tests
  • 11 Scrapers #10 and #11: Scraping hidden and ‘invisible’ data on a webpage: icons and ‘reveals’
    • Scraping accessibility data on Olympic venues
    • Hidden HTML
    • Recap
    • Tests
  • 12 Scraper #12: Scraperwiki intro: adapting a Twitter and Google News scraper
    • Forking a scraper
    • Scraping Google News
    • Recap
    • Tests
  • 13 Scraper #13: Tracing the code - libraries and functions, and documentation in Scraperwiki
    • Parent/child relationships
    • Parameters (again)
    • Detour: Variables
    • Back to Scraper #9
    • Recap
    • Tests
  • 14 Scraper #13 continued: Scraperwiki’s tutorial scraper 2
    • What are those variables?
    • Detour: loops (for and while)
    • Back to scraper #13: Storing the data
    • Detour: Unique keys, primary keys, and databases
    • A unique key can’t be empty: fixing the ValueError
    • Summing up the scraper
    • Recap
    • Tests
  • 15 Scraper #14: Adapting the code to scrape a different webpage
    • Dealing with errors
    • Recap
    • Tests
  • 16 Scraper #15: Scraping multiple cells and pages
    • Creating your own functions: def
    • If statements - asking a question
    • Numbers in square brackets: indexes again
    • Attributes
    • Recap
    • Tests
  • 17 Scraper #16: Adapting your third scraper: creating more than one column of data
    • Recap
    • Tests
  • 18 Scraper #17: Scraping a list of pages
    • Creating the list of codes
    • Recap
    • Tests
  • 19 Scraper #18: Scraping a page - and the pages linked (badly) from it
    • Using ranges to avoid errors
    • Using len to test lists
    • Other workarounds
    • Recap
    • Scraper tip: a checklist for understanding someone else’s code
  • 20 Scraper #19: Scraping scattered data from multiple websites that share the same CMS
    • Finding websites using the same content management system (CMS)
    • Writing the scraper: looking at HTML structure
    • Using if statements to avoid errors when data doesn’t exist
    • The variable that doesn’t exist
    • Initialising an empty variable
    • Recap
    • Tests
  • 21 Scraper #20: Automating database searches (forms)
    • Understanding URLs: queries and parameters
    • When the URL doesn’t change
    • Solving the cookie problem: Mechanize
    • Recap
    • Tests
  • 22 Scraper #21: Storing the results of a search
    • Recap
    • Scraper tip: using print to monitor progress
    • Tests
  • 23 Scraper #22: Scraping PDFs part 1
    • Detour: indexes and slicing shortcuts
    • Back to the scraper
    • Detour: operators
    • Back to the scraper (again)
    • Detour: the % sign explained
    • Back to the scraper (again) (again)
    • Recap
    • Tests
  • 24 Scraper 23: Scraping PDFs part 2
    • Where’s the ‘view source’ on a PDF?
    • Scraping speed camera PDFs - welcome back to XPath
    • Ifs and buts: measuring and matching data
    • Recap
    • Tests
  • 25 Scraper 24: Scraping multiple PDFs
    • The code
    • Tasks 1 and 2: Find a pattern in the HTML and grab the links within
    • XPath contains…
    • The code: scraping more than one PDF
    • The wrong kind of data: calculations with strings
    • Putting square pegs in square holes: saving data based on properties
    • Recap
    • Tests
  • 26 Scraper 25: Text, not tables, in PDFs - regex
    • Starting the code: importing a regex library
    • Code continued: Find all the links to PDF reports on a particular webpage
    • Detour: global variables and local variables
    • The code part 3: scraping each PDF
    • Re: Python’s regex library
    • Other functions from the re library
    • Back to the code
    • Joining lists of items into a single string
    • The code in full
    • Recap
    • Tests
  • 27 Scraper 26: Scraping CSV files
    • The CSV library
    • Process of elimination 1: putting blind spots in the code
    • Process of elimination 2: amending the source data
    • Encoding, decoding, extracting
    • Removing the header row
    • Ready to scrape multiple sheets
    • Combining CSV files on your computer
    • Recap
    • Tests
  • 28 Scraper 27: Scraping Excel spreadsheets part 1
    • A library for scraping spreadsheets
    • What can you learn from a broken scraper?
    • But what is the scraper doing?
    • Recap
    • Tests
  • 29 Scraper 28: Scraping Excel spreadsheets part 2: scraping one sheet
    • Testing on one sheet of a spreadsheet
    • Recap
    • Tests
  • 30 Scraper 28 continued: Scraping Excel spreadsheets part 3: scraping multiple sheets
    • One dataset, or multiple ones
    • Using header row values as keys
    • Recap
    • Tests
  • 31 Scraper 28 continued: Scraping Excel spreadsheets part 4: Dealing with dates in spreadsheets
    • More string formatting: replacing bad characters
    • Scraping multiple spreadsheets
    • Loops within loops
    • Scraper tip: creating a sandbox
    • Recap
    • Tests
  • 32 Scraper 29: Scraping ASPX pages
    • The code
    • Submitting links in javascript
    • Saving the data
    • Recap
    • Tests
  • 33 The final chapter: where do you go from here?
    • The map is not the territory
    • If you’re API and you know it
    • Recommended reading and viewing
    • End != End
  • 34 Bonus scraper #1: Scraping XLSX spreadsheets with Openpyxl
    • Standing on the shoulders of Zarino Zappia
    • The scraper in full
    • Step by step: importing new libraries
    • 4 new functions
    • Function 1: grabbing only xlsx links
    • Functions 2 and 3: Grab the workbook from the spreadsheet
    • Functions 2 and 4: Grab the data from the spreadsheet
    • Function 2 resumes to completion
    • Recap
    • Tests
  • 35 Acknowledgements
  • 36 List of websites scraped
  • 37 Glossary

This Book is for Sale Through Leanpub

Authors and publishers use Leanpub to publish amazing in-progress and completed books like this one.
