Introduction to Data Engineering
Learn the skills needed to break into Data Engineering.
About the Book
This book covers the basic theory behind data engineering. It is not about writing code in a particular language; it is about the concepts you can use to learn and thrive as a data engineer.
Table of Contents
Introduction
- Knowledge and Experience
- What are the topics we will cover?
Chapter 1 - The Theory
- What Is a Data Pipeline?
- Data Pipelines built with Passion and Creativity
- Storage and File Types
- Access
- Repeatable
- Resilient
- Scalable
- In Summary
Chapter 2 - Data Pipeline Basics
- Project Structure
- Data Pipeline Code Structure
- Code Readability and Organization
- Tests
- Documentation
- Containerization
- Architecture First
- Review
Chapter 3 - Pipeline Architecture
- Architecture Applied to Data
- Data Size and Velocity
- Calculating Compute Requirements
- Calculating Storage Requirements
- Understanding the End Result
- Understanding Cost
- Code Architecture
- Batch vs Streaming Architecture
- Puzzle Pieces
- Summary
Chapter 4 - Storage
- Access Patterns
- SQL/NoSQL Databases vs Files.
- File Types
- Row vs Columnar Storage.
- Common file types in data engineering.
- Parquet.
- Avro.
- ORC.
- CSV / Flat-file.
- JSON
- Compression.
- Storage location.
- Partitions.
Chapter 5 - Compute and Resources
- Overview
- RAM/Memory
- CPU/Cores
- Storage
- Cluster/Nodes
Chapter 6 - Mastering SQL
- Introduction To SQL
- Does the type of database matter?
- The fundamentals of SQL/Databases.
- OLTP vs. OLAP
- Table design/layout.
- Table Design in Real Life.
- Understanding Indexing Basics.
- How to write fast/tune queries.
- Where to look for common problems.
- SQL Fundamentals
- Python + SQL
- SQL Summary
Chapter 7 - Data Warehousing / Data Lakes
- Data Warehouse vs Data Lake vs Lake House
- Data Modeling in Data Warehouses, Data Lakes, and Lake Houses.
- Facts and Dimensions.
- Constraints and Schema.
- Data Types.
- Column Names.
- The Role of IDs in a Data Warehouse or Data Lake.
- CDC / History Tracking.
- Summary
Chapter 8 - Data Modeling
- Data Types and Schema.
- Data Types.
- Example
- Data Size.
- Constraints.
- Data Definitions.
- Modeling Data Logically.
- Logical data models lead to physical relationships.
- Grain of Data.
- Uniqueness of Data.
- Access Patterns.
- Example
- Talking to the Business.
- Normal Forms.
- De-Duplication of Data.
- Join Integrity.
- Keys - Primary and Foreign.
- The Idea Behind Keys.
- Relational Databases (SQL) vs Data Lake (File Based) Modeling.
- The number of Fact tables and Dimensions and normalization.
- File size and table size matter in the new File-Based Data Lakes.
- Partitions vs Indexes.
- Walking the data model line between old and new.
Chapter 9 - Data Quality
- What Is Data Quality?
- Reasoning about data.
- Double meanings.
- Data value quality.
- Measures of Data Quality.
- Correct Header or Column Names.
- Correct File Formatting.
- Correct data types.
- Value ranges and value integrity.
- Data Quality Applied
Chapter 10 - DevOps for Data Engineers
- DevOps applied to Data Engineering
- Dockerfiles and Docker-compose.
- Unit Testing.
- CI/CD.
- Automation is the name of the game.
- CI for Data Engineering
- Conclusion