Text Processing with Ruby

A comprehensive guide to extracting useful information from text with Ruby

Don't know your ARGF from your elbow? Having trouble with regular expressions? Want to know how to efficiently extract useful information from huge text files? Fear not! Help is at hand.


Minimum: $6.99

Suggested: $9.99+


  • Free sample download
  • 80 pages
  • 85% complete
  • Book language: English

About the Book

Text is everywhere. From the content of webpages to the output of system commands, from keyboard input to data formats like XML and JSON, much of the data our programs interact with is, when it comes down to it, text.

Knowing how to get the most out of code that processes text is therefore essential not only to being a productive developer but also to efficiency elsewhere, from monitoring servers to gaining insight into your business and its metrics.

TPWR examines how the Ruby programming language's robust text handling capabilities can be used to quickly and painlessly deal with large datasets, to write shell one-liners, to extract fields from delimited data, and much more.
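As a rough illustration of what extracting fields from delimited data can look like, here is a minimal sketch that is not taken from the book. The helper name extract_field and the sample data are hypothetical, and the sketch assumes tab-separated input whose first line is a header row.

```ruby
# Hypothetical helper (not from the book): return every value in the
# named column of tab-separated text whose first line is a header row.
def extract_field(tsv_text, column)
  lines = tsv_text.each_line.map(&:chomp)
  # A limit of -1 keeps trailing empty fields instead of dropping them.
  header = lines.first.split("\t", -1)
  index = header.index(column)
  raise ArgumentError, "no such column: #{column}" if index.nil?

  lines.drop(1).map { |line| line.split("\t", -1)[index] }
end

# Made-up sample data for illustration only.
sample = "name\tlanguage\nAlice\tRuby\nBob\tSQL\n"
p extract_field(sample, "language") # prints ["Ruby", "SQL"]
```

For real-world files with quoting or embedded delimiters, a dedicated parser such as Ruby's CSV library is likely a safer choice than splitting on a tab character.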

Working from the foundations upwards, TPWR examines how data actually gets into your program — from keyboard input to streaming large files — and how that data can be read efficiently. It looks at how you can extract information easily from both computer-generated files and passages of human-written text, and shows you how you can use regular expressions to identify patterns in text, extract them, and manipulate them.
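To make the idea of reading a stream efficiently and picking out patterns more concrete, here is a small sketch that is not the book's own code. The helper name count_statuses, the log format, and the pattern are all assumptions made for illustration. It reads one line at a time, so the whole input never has to sit in memory, and uses a regular expression with a named group to pull out a three-digit status code.

```ruby
require "stringio"

# Hypothetical helper (not from the book): tally three-digit status codes
# found in a line-oriented stream. Works on any IO-like object that
# responds to each_line, such as a File, $stdin, or a StringIO.
def count_statuses(io)
  counts = Hash.new(0)
  io.each_line do |line|
    # Named capture: three digits surrounded by whitespace.
    match = line.match(/\s(?<status>\d{3})\s/)
    counts[match[:status]] += 1 if match
  end
  counts
end

# Made-up log lines standing in for a large file.
log = StringIO.new(<<~LOG)
  GET /index.html 200 512
  GET /missing 404 0
  POST /login 200 128
LOG

p count_statuses(log)
```

Because the helper only relies on each_line, the same method could be handed an open file in place of the StringIO used here.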

It also shows you how you can use scraping techniques to extract data from even the most badly written web pages, allowing you to get data even where there isn't an API available.

Aimed at the novice-to-intermediate Ruby developer — someone who's perhaps comfortable writing Rails apps but has recently found themselves branching out into writing command-line tools, or a data scientist who enjoys Ruby but isn't the most experienced developer — this book requires little previous experience with handling text in Ruby. Even experienced developers, though, will find that there are some dark corners of Ruby's elaborate text processing abilities that they weren't aware of.

About the Author

Rob Miller is a web developer who works mainly in Ruby and SQL, and who spends his days crunching numbers and herding cats at Big Fish, a design, branding, and marketing agency based in London where he's Head of Digital.

Much of his day job involves mangling data from one arcane format into another, attempting to make sense of the information within; this involves writing everything from one-liners in the shell to sprawling custom libraries.

In his spare time he works on open source projects, some of which you can find on his GitHub profile. Among the more popular ones are: ruby-wpdb, a WordPress binding for Ruby; varnisher, a tool for purging Varnish HTTP caches; and batchtapaper, a mass-adding tool for Instapaper.

The Leanpub Unconditional, No Risk, 100% Happiness Guarantee


Within 45 days of purchase you can get a 100% refund on any Leanpub purchase, in two clicks. We process the refunds manually, so they may take a few days to show up. See full terms.

If you buy a Leanpub book you get all the updates to the book for free! All books are available in PDF, EPUB (for iPad) and MOBI (for Kindle). There is no DRM. There is no risk, just guaranteed happiness or your money back.

Table of Contents

    • Introduction
    • Preface
  • Part One: Acquiring Text
    • Working with files
      • Opening a filehandle
      • Reading a file in one go
      • Line-by-line processing
      • Reading huge files
      • More on streams
    • Working with standard input
      • Keyboard input
      • Redirecting input
      • Rewriting uniq
    • Shell one-liners
      • The -e switch
      • The -n switch
      • The -p switch
      • Using BEGIN blocks
      • Using END blocks
      • A more practical example
    • Having your cake and eating it too: ARGF
      • Basic usage
      • Reading from files
      • Reading from standard input
      • Two behaviours
      • Enumerating input
      • Some ARGF-specific features
      • Moving on
    • Delimited data
      • Parsing a TSV
      • Delimiters and the command line
      • The CSV format
    • Working with binary files
      • Extracting text from PDFs
      • Extracting text from Word Documents
  • Part Two: Extracting Data
    • Regular expressions basics
      • Regular expressions in Ruby
      • Defining regular expressions
      • Pattern syntax
      • Pattern modifiers
      • Ruby’s idiosyncrasies
      • Digging deeper
    • Extraction and substitution with regular expressions
      • Checking if a string matches a pattern
      • Extracting matches
      • Transforming text
    • Parsing text with StringScanner
    • Scraping HTML
      • Regular expressions
      • Installing Nokogiri
      • The document
      • Searching the document: XPath selectors
      • Searching the document: CSS selectors
      • What to do with elements
      • Getting a feel for a page
      • A practical example
