Getting Structured Data from Internet: Web Scraping and Rest APIs

About the Book

Note: this book is now available to order from Apress, with lots of extra content, under the title "Getting structured data from internet: Running Web Crawlers/Scrapers on a Big Data Production Scale".

This book will teach you web scraping so that you can quickly gather the large amounts of free data available on the web in a structured format. You'll learn to write Python scripts that not only access free APIs to get structured data from websites such as Twitter, but also scrape data from any HTML or JavaScript page and convert it into Excel, CSV, or a SQL database of your choice. We will go beyond the basics of web scraping and cover advanced topics such as natural language processing and text analytics to extract top keywords, text summaries, names of people and places, email addresses, and other contact details from a page. All the code used in the book will be available to help you understand the concepts in practice and write your own web scraper.
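To give a flavor of the "HTML page to CSV" workflow described above, here is a minimal sketch. It is not taken from the book: the table of contents indicates the book uses Beautiful Soup and pandas, whereas this example sticks to Python's standard library so that it runs without extra installs. The names `TableParser`, `html_table_to_csv`, and `SAMPLE_HTML`, as well as the sample table itself, are made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_csv(html):
    """Return the table rows found in `html` as CSV text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


# Hypothetical sample page; a real scraper would download this HTML first.
SAMPLE_HTML = """
<table>
  <tr><th>Ticker</th><th>Price</th></tr>
  <tr><td>ABC</td><td>10.5</td></tr>
  <tr><td>XYZ</td><td>7.25</td></tr>
</table>
"""

csv_text = html_table_to_csv(SAMPLE_HTML)
print(csv_text)
```

A dedicated parsing library is usually more forgiving of messy real-world markup than this hand-rolled parser, which assumes a single, simply structured table.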

About the Author

Jay M. Patel

My name is Jay M. Patel and I am a full-time freelance software developer and data scientist specializing in data mining, web crawling/scraping, and natural language processing (NLP) projects. Please check out my consulting page for details on how to hire me for your project.

I worked at the US Environmental Protection Agency (US EPA) for about five years before quitting in 2018 to do consulting full-time and bootstrap my startup, Specrom Analytics, which applies AI algorithms to marketing, social listening, and the creation of alternative financial datasets.

In my time at the US EPA, I designed text mining and NLP algorithms to extract useful insights from hundreds of thousands of documents that were part of regulatory filings from companies. I also led one of the first research teams within the agency to use Apache Spark based workflows for traditional cheminformatics applications such as chemical similarities and quantitative structure activity relationships. We also developed recurrent neural networks and more advanced LSTM models in TensorFlow for chemical SMILES generation. Please check out my Google Scholar for a full list of all my research papers and presentations.

I graduated with a Bachelor's degree in chemical engineering from UDCT, India and an M.S. in computational chemistry from the University of Georgia, Athens, GA, USA. Check out my CV for more information.

My blog posts here will focus on digital marketing, alternative financial datasets, my current work, data science, and my experiences as a startup founder. I also have a couple of book projects in the works and one published book; please check it out here for more info.

In my free time, I also volunteer in the Dangs district in India to assist the tribal community in building homes and getting clean water and sanitation.

Connect with me on LinkedIn or GitHub, or email me at jay@jaympatel.com with any questions.

Table of Contents

  • 1. Introduction to web scraping: Why is web scraping essential and who uses web scraping?
  • 2. Intro to web services to get structured data
      • 2.1 Getting data from Twitter APIs
      • 2.2 Getting stock market data from Alphavantage
  • 3. Web scraping in Python using the Beautiful Soup library
      • 3.1 Tags and structure of HTML documents
      • 3.2 Cascading style sheets (CSS)
      • 3.3 Building a first scraper with Beautiful Soup
      • 3.4 Scraping an HTML table into a pandas dataframe
      • 3.5 Scraping XML files from clinicaltrials.gov
  • 4. Using Selenium to scrape from JavaScript
  • 5. Advanced Topics
      • 5.1 Boilerplate text removal
      • 5.2 Solving captchas
      • 5.3 Extracting top keywords and text summarization from scraped documents
      • 5.4 Extracting names and entities from scraped documents
