Acknowledgments
Preface
- Feedback
Introduction
- Who This Book Is For
- What You Will Learn
- How This Book Is Organized
The KQuery Project
- Why Kotlin?
- Repository Structure
- Building the Project
- Running Examples
1 What Is a Query Engine?
- 1.1 From Code to Queries
- 1.2 Anatomy of a Query Engine
- 1.3 A Concrete Example
- 1.4 SQL: The Universal Query Language
- 1.5 Beyond SQL: DataFrame APIs
- 1.6 Why Build a Query Engine?
- 1.7 What This Book Covers
2 Apache Arrow
- 2.1 Why Columnar?
- 2.2 What Is Apache Arrow?
- 2.3 Arrow Memory Layout
- 2.4 Record Batches
- 2.5 Schemas and Types
- 2.6 Language Implementations
- 2.7 Why Arrow for Our Query Engine?
- 2.8 Further Reading
3 Type System
- 3.1 Why Types Matter
- 3.2 Building on Arrow
- 3.3 Schemas and Fields
- 3.4 Column Vectors
- 3.5 Literal Values
- 3.6 Record Batches
- 3.7 Type Coercion
- 3.8 Putting It Together
4 Data Sources
- 4.1 Why Abstract Data Sources?
- 4.2 The DataSource Interface
- 4.3 CSV Data Source
- 4.4 Parquet Data Source
- 4.5 In-Memory Data Source
- 4.6 Other Data Sources
- 4.7 Schema-less Sources
- 4.8 Connecting Data Sources to the Query Engine
5 Logical Plans and Expressions
- 5.1 Why Separate Logical from Physical?
- 5.2 The LogicalPlan Interface
- 5.3 Printing Logical Plans
- 5.4 Logical Expressions
- 5.5 The LogicalExpr Interface
- 5.6 Column Expressions
- 5.7 Literal Expressions
- 5.8 Binary Expressions
- 5.9 Aggregate Expressions
- 5.10 Aliased Expressions
- 5.11 Logical Plans
- 5.12 Putting It Together
- 5.13 Serialization
6 DataFrame API
- 6.1 Building Plans The Hard Way
- 6.2 The DataFrame Approach
- 6.3 The DataFrame Interface
- 6.4 Implementation
- 6.5 Execution Context
- 6.6 Convenience Methods
- 6.7 DataFrames vs SQL
- 6.8 The Underlying Plan
7 SQL Support
- 7.1 The Journey from SQL to Logical Plan
- 7.2 Tokenizing
- 7.3 Parsing with Pratt Parsers
- 7.4 SQL Expressions
- 7.5 Precedence in Action
- 7.6 Parsing SELECT Statements
- 7.7 SQL Planning: The Hard Part
- 7.8 Aggregate Queries
- 7.9 Why Build Your Own Parser?
8 Physical Plans and Expressions
- 8.1 Why Separate Physical from Logical?
- 8.2 The PhysicalPlan Interface
- 8.3 Physical Expressions
- 8.4 Physical Plans
- 8.5 Execution Model
- 8.6 Next Steps
9 Query Planner
- 9.1 What the Query Planner Does
- 9.2 The QueryPlanner Class
- 9.3 Translating Expressions
- 9.4 Translating Plans
- 9.5 A Complete Example
- 9.6 Where Optimization Fits
- 9.7 Error Handling
10 Joins
- 10.1 Join Types
- 10.2 Join Conditions
- 10.3 Join Algorithms
- 10.4 Hash Join in Detail
- 10.5 Join Ordering
- 10.6 Bloom Filters
- 10.7 Summary
11 Subqueries
- 11.1 Types of Subqueries
- 11.2 Planning Subqueries
- 11.3 Implementation Complexity
- 11.4 When Decorrelation Is Not Possible
12 Query Optimizations
- 12.1 Why Optimize?
- 12.2 Rule-Based Optimization
- 12.3 Projection Push-Down
- 12.4 Predicate Push-Down
- 12.5 Eliminate Common Subexpressions
- 12.6 Cost-Based Optimization
- 12.7 Other Optimizations
13 Query Execution
- 13.1 The Execution Context
- 13.2 The Execution Pipeline
- 13.3 Running a Query
- 13.4 Lazy Evaluation
- 13.5 Consuming Results
- 13.6 Example: NYC Taxi Data
- 13.7 The Impact of Optimization
- 13.8 Comparison with Apache Spark
- 13.9 Error Handling
- 13.10 What We Have Built
14 Parallel Query Execution
- 14.1 Why Parallelism Helps
- 14.2 Data Parallelism
- 14.3 A Practical Example
- 14.4 Combining Results
- 14.5 Partitioning Strategies
- 14.6 Partition Pruning
- 14.7 Parallel Joins
- 14.8 Repartitioning and Exchange
- 14.9 Limits of Parallelism
15 Distributed Query Execution
- 15.1 When to Go Distributed
- 15.2 Architecture Overview
- 15.3 Embarrassingly Parallel Operators
- 15.4 Distributed Aggregates
- 15.5 Distributed Joins
- 15.6 Query Stages
- 15.7 Producing a Distributed Query Plan
- 15.8 Serializing a Query Plan
- 15.9 Serializing Data
- 15.10 Choosing a Protocol
- 15.11 Streaming vs Blocking Operators
- 15.12 Data Locality
- 15.13 Fault Tolerance
- 15.14 Custom Code
- 15.15 Distributed Query Optimizations
16 Testing
- 16.1 Unit Testing
- 16.2 Integration Testing
- 16.3 Fuzzing
17 Benchmarks
- 17.1 Measuring Performance
- 17.2 Measuring Scalability
- 17.3 Concurrency
- 17.4 Automation
- 17.5 Comparing Benchmarks
- 17.6 Publishing Benchmark Results
- 17.7 Transaction Processing Council (TPC) Benchmarks
Further Resources
- Open-Source Projects
- YouTube
- Sample Data
