How Query Engines Work

An Introductory Guide

This book provides an introduction to the high-level concepts behind query engines and walks through all aspects of building a fully working SQL query engine in Kotlin.

Please note that this is a short introductory book (around 100 pages). Around 4% of readers ask for a refund because they were expecting something far more comprehensive.

Minimum price: $19.99

Formats: PDF, EPUB
Readers: 788
Pages: 143
Words: 25,984

About the Book

Andy Grove has worked on numerous projects that required custom query engines or integrations with existing query engines, and this book provides an approachable introduction to the topic.

The book introduces the high-level concepts behind query engines and walks through every step of building a SQL query engine in Kotlin, with full source code available in a companion GitHub repository. Most of the book is programming-language agnostic; Kotlin was chosen for the code examples for its conciseness and readability, and the concepts should translate easily to other programming languages.

Andy is a PMC member of Apache Arrow, where he donated the initial Rust implementation and later donated the DataFusion query engine.

About the Author

Andy Grove

Andy Grove is a PMC member of Apache Arrow, where he donated the initial Rust implementation and also donated the DataFusion query engine.

Leanpub Podcast, Episode 194: An Interview with Andy Grove

Table of Contents

Acknowledgments

Preface

  1. Feedback

Introduction

  1. Who This Book Is For
  2. What You Will Learn
  3. How This Book Is Organized

The KQuery Project

  1. Why Kotlin?
  2. Repository Structure
  3. Building the Project
  4. Running Examples

1 What Is a Query Engine?

  1.1 From Code to Queries
  1.2 Anatomy of a Query Engine
  1.3 A Concrete Example
  1.4 SQL: The Universal Query Language
  1.5 Beyond SQL: DataFrame APIs
  1.6 Why Build a Query Engine?
  1.7 What This Book Covers

2 Apache Arrow

  2.1 Why Columnar?
  2.2 What Is Apache Arrow?
  2.3 Arrow Memory Layout
  2.4 Record Batches
  2.5 Schemas and Types
  2.6 Language Implementations
  2.7 Why Arrow for Our Query Engine?
  2.8 Further Reading

3 Type System

  3.1 Why Types Matter
  3.2 Building on Arrow
  3.3 Schemas and Fields
  3.4 Column Vectors
  3.5 Literal Values
  3.6 Record Batches
  3.7 Type Coercion
  3.8 Putting It Together

4 Data Sources

  4.1 Why Abstract Data Sources?
  4.2 The DataSource Interface
  4.3 CSV Data Source
  4.4 Parquet Data Source
  4.5 In-Memory Data Source
  4.6 Other Data Sources
  4.7 Schema-less Sources
  4.8 Connecting Data Sources to the Query Engine

5 Logical Plans and Expressions

  5.1 Why Separate Logical from Physical?
  5.2 The LogicalPlan Interface
  5.3 Printing Logical Plans
  5.4 Logical Expressions
  5.5 The LogicalExpr Interface
  5.6 Column Expressions
  5.7 Literal Expressions
  5.8 Binary Expressions
  5.9 Aggregate Expressions
  5.10 Aliased Expressions
  5.11 Logical Plans
  5.12 Putting It Together
  5.13 Serialization

6 DataFrame API

  6.1 Building Plans The Hard Way
  6.2 The DataFrame Approach
  6.3 The DataFrame Interface
  6.4 Implementation
  6.5 Execution Context
  6.6 Convenience Methods
  6.7 DataFrames vs SQL
  6.8 The Underlying Plan

7 SQL Support

  7.1 The Journey from SQL to Logical Plan
  7.2 Tokenizing
  7.3 Parsing with Pratt Parsers
  7.4 SQL Expressions
  7.5 Precedence in Action
  7.6 Parsing SELECT Statements
  7.7 SQL Planning: The Hard Part
  7.8 Aggregate Queries
  7.9 Why Build Your Own Parser?

8 Physical Plans and Expressions

  8.1 Why Separate Physical from Logical?
  8.2 The PhysicalPlan Interface
  8.3 Physical Expressions
  8.4 Physical Plans
  8.5 Execution Model
  8.6 Next Steps

9 Query Planner

  9.1 What the Query Planner Does
  9.2 The QueryPlanner Class
  9.3 Translating Expressions
  9.4 Translating Plans
  9.5 A Complete Example
  9.6 Where Optimization Fits
  9.7 Error Handling

10 Joins

  10.1 Join Types
  10.2 Join Conditions
  10.3 Join Algorithms
  10.4 Hash Join in Detail
  10.5 Join Ordering
  10.6 Bloom Filters
  10.7 Summary

11 Subqueries

  11.1 Types of Subqueries
  11.2 Planning Subqueries
  11.3 Implementation Complexity
  11.4 When Decorrelation Is Not Possible

12 Query Optimizations

  12.1 Why Optimize?
  12.2 Rule-Based Optimization
  12.3 Projection Push-Down
  12.4 Predicate Push-Down
  12.5 Eliminate Common Subexpressions
  12.6 Cost-Based Optimization
  12.7 Other Optimizations

13 Query Execution

  13.1 The Execution Context
  13.2 The Execution Pipeline
  13.3 Running a Query
  13.4 Lazy Evaluation
  13.5 Consuming Results
  13.6 Example: NYC Taxi Data
  13.7 The Impact of Optimization
  13.8 Comparison with Apache Spark
  13.9 Error Handling
  13.10 What We Have Built

14 Parallel Query Execution

  14.1 Why Parallelism Helps
  14.2 Data Parallelism
  14.3 A Practical Example
  14.4 Combining Results
  14.5 Partitioning Strategies
  14.6 Partition Pruning
  14.7 Parallel Joins
  14.8 Repartitioning and Exchange
  14.9 Limits of Parallelism

15 Distributed Query Execution

  15.1 When to Go Distributed
  15.2 Architecture Overview
  15.3 Embarrassingly Parallel Operators
  15.4 Distributed Aggregates
  15.5 Distributed Joins
  15.6 Query Stages
  15.7 Producing a Distributed Query Plan
  15.8 Serializing a Query Plan
  15.9 Serializing Data
  15.10 Choosing a Protocol
  15.11 Streaming vs Blocking Operators
  15.12 Data Locality
  15.13 Fault Tolerance
  15.14 Custom Code
  15.15 Distributed Query Optimizations

16 Testing

  16.1 Unit Testing
  16.2 Integration Testing
  16.3 Fuzzing

17 Benchmarks

  17.1 Measuring Performance
  17.2 Measuring Scalability
  17.3 Concurrency
  17.4 Automation
  17.5 Comparing Benchmarks
  17.6 Publishing Benchmark Results
  17.7 Transaction Processing Council (TPC) Benchmarks

Further Resources

  1. Open-Source Projects
  2. YouTube
  3. Sample Data
