Introduction to Data Engineering
Minimum price: $9.99
Suggested price: $9.99

Learn the skills needed to break into Data Engineering.

About the Book

This book covers the basic theory behind data engineering. It is not about writing code in a particular language; it is about the concepts you can use to learn and thrive as a data engineer.

About the Author

Daniel Beach

Daniel Beach is a data engineer who has spent years building large, scalable, high-throughput data pipelines for data warehousing and machine learning systems.

Table of Contents

  • Introduction
    • What is a Data Engineer?
    • What To Expect
    • The Focus of This Book
    • Knowledge and Experience
    • What are the topics we will cover?
    • Summary
  • Chapter 1 - The Theory.
    • What Is a Data Pipeline?
    • Data Pipelines built with Passion and Creativity
    • Storage and File Types
    • Access
    • Repeatable
    • Resilient
    • Scalable
    • In Summary
  • Chapter 2 - Data Pipeline Basics
    • Project Structure
    • Data Pipeline Code Structure
    • Code Readability and Organization
    • Tests.
    • Documentation
    • Containerization
    • Architecture First
    • Review
  • Chapter 3 - Pipeline Architecture
    • Architecture Applied to Data
    • Data Size and Velocity
    • Calculating Compute Requirements
    • Calculating Storage Requirements
    • Understanding the End Result
    • Understanding Cost
    • Code Architecture
    • Batch vs Streaming Architecture
    • Puzzle Pieces
    • Summary
  • Chapter 4 - Storage
    • Access Patterns
    • SQL/NoSQL Databases vs files.
    • File Types
    • Row vs Columnar Storage.
    • Common file types in data engineering.
    • Parquet.
    • Avro.
    • Orc.
    • CSV / Flat-file.
    • JSON
    • Compression.
    • Storage location.
    • Partitions.
  • Chapter 5 - Compute and Resources
    • Overview
    • RAM/Memory
    • CPU/Cores
    • Storage
    • Cluster/Nodes
  • Chapter 6 - Mastering SQL
    • Introduction To SQL
    • Does the type of database matter?
    • The fundamentals of SQL/Databases.
    • OLTP vs. OLAP
    • Table design/layout.
    • Table Design in Real Life.
    • Understanding Indexing Basics.
    • How to write fast/tune queries.
    • Where to look for common problems.
    • SQL Fundamentals
    • Python + SQL
    • SQL Summary
  • Chapter 7 - Data Warehousing / Data Lakes
    • Data Warehouse vs Data Lake vs Lake House
    • Data Modeling in Data Warehouses, Data Lakes, and Lake Houses.
    • Facts and Dimensions.
    • Constraints and Schema.
    • Data Types.
    • Column Names.
    • The Role of IDs in a Data Warehouse or Data Lake.
    • CDC / History Tracking.
    • Summary
  • Chapter 8 - Data Modeling
    • Data Types and Schema.
    • Data Types.
    • Example
    • Data Size.
    • Constraints.
    • Data Definitions.
    • Modeling Data Logically.
    • Logical data models lead to physical relationships.
    • Grain of Data.
    • Uniqueness of Data.
    • Access Patterns.
    • Example
    • Talking to the Business.
    • Normal Forms.
    • De-Duplication of Data.
    • Join Integrity.
    • Keys - Primary and Foreign.
    • The Idea Behind Keys.
    • Relational Databases (SQL) vs Data Lake (File Based) Modeling.
    • The number of Fact tables and Dimensions and normalization.
    • File size and table size matter in the new File-Based Data Lakes.
    • Partitions vs Indexes.
    • Walking the data model line between old and new.
  • Chapter 9 - Data Quality
    • What is Data Quality.
    • Reasoning about data.
    • Double meanings.
    • Data value quality.
    • Measures of Data Quality.
    • Correct Header or Column Names.
    • Correct File Formatting.
    • Correct data types.
    • Value ranges and value integrity.
  • Chapter 10 - DevOps for Data Engineers
    • Dockerfiles and Docker-compose.
    • Unit Testing.
    • CI/CD.
    • Automation is the name of the game.
  • Conclusion
