Modern GPU Architecture Second Edition — Volume One
Graphics Pipeline Design and Hardware Implementation

Modern GPUs are among the most complex and efficient parallel processors ever built, and this book shows you how they work at the hardware level.

Unlike typical graphics or programming guides, this volume takes you inside the GPU itself: how instructions flow through pipelines, how memory hierarchies sustain bandwidth, and how shader cores and fixed-function units cooperate to render billions of pixels per second.

You’ll explore every major stage of the graphics pipeline in depth—geometry, rasterization, shading, texturing, and render output—all supported by clear mathematical models and synthesizable Verilog examples. This is not “theory for theory’s sake”; it’s engineering detail you can apply directly in design, simulation, or hardware verification.

By reading this book, you’ll gain:

Architectural intuition — understand how throughput, latency, and bandwidth interact in real GPUs.
Practical RTL-level insight — see how each stage can be implemented with clean, synthesizable Verilog.
A foundation for advanced design — build the knowledge required for AI acceleration, compute architectures, or FPGA-based GPU prototyping.
Confidence to analyze real silicon — reason about performance, bottlenecks, and tradeoffs like a hardware architect.

Every chapter bridges concept and implementation, making it invaluable for anyone designing graphics hardware, studying computer architecture, or seeking mastery of parallel computation systems.

Dense, detailed, and unapologetically technical, this book is written for those who want to understand modern GPUs—not just use them.

⚠️ This isn’t entertainment. It’s engineering.
If that excites you, welcome aboard.
If it intimidates you, this book isn’t for you.

From the Editor at Burst Books — Gareth Thomas

A Smarter Kind of Learning Has Arrived — Thinking on Its Own.

Forget tired textbooks from years past. These AI-crafted STEM editions advance at the speed of discovery. Each page is built by intelligence trained on thousands of trusted sources, delivering crystal-clear explanations, flawless equations, and functional examples — all refreshed through the latest breakthroughs.

Best of all, these editions cost a fraction of traditional texts yet surpass expectations. You’re gaining more than a book — you’re enhancing the mind’s performance.

Explore BurstBooksPublishing on GitHub to find technical samples, infographics, and additional study material — a complete hub that supports deeper, hands-on learning.

In this age of AI, leave the past behind and learn directly from tomorrow.

Chapter 1. GPU Architecture Principles

Section 1. GPU vs CPU Philosophy

Throughput over latency
Task parallelism vs instruction-level parallelism
Scalability and energy efficiency
Suitability for graphics and compute workloads

Section 2. SIMT Execution Model

Threads and warps/wavefronts
SIMD-style execution with per-thread state
Divergence handling and reconvergence
Warp scheduling and masking

Section 3. Modern GPU Overview

NVIDIA SM, AMD CU, Intel Xe-core comparison
Streaming multiprocessor internal organization
Compute and graphics pipeline integration
Role of schedulers and dispatch units

Section 4. Memory Hierarchy Snapshot

Register files and local storage
Shared/LDS memory
L1, L2, and VRAM hierarchy
Latency and bandwidth tradeoffs

Chapter 2. Digital Design for GPUs

Section 1. Verilog and SystemVerilog Essentials

Modules, ports, and parameters
Procedural and continuous assignments
Always blocks and combinational logic
Timing control and simulation

Section 2. Pipeline Design

Pipeline depth and balancing
Register placement and retiming
Valid-ready handshaking
Control and data path separation

Section 3. Clock Domain Crossing and Handshaking

Metastability and synchronization
FIFO-based CDC mechanisms
Dual-clock RAM interfaces
Gray-coded counters

Section 4. Floating-Point Units

FP32, FP16, and BF16 support
Fused multiply-add (FMA) structure
Normalization and rounding
Verification and corner-case handling

Section 5. Verification Basics

Testbench architecture
Assertions and functional coverage
Waveform inspection and debugging
Regression testing

Chapter 3. 3D Graphics Fundamentals

Section 1. Geometric Primitives

Vertices, edges, and triangles
Indexed and non-indexed primitives
Vertex attributes and interpolation
Coordinate spaces and transformations

Section 2. Transformations

Model, view, and projection matrices
Perspective vs orthographic projection
Homogeneous coordinates
Matrix composition order

Section 3. Rasterization

Edge equations and barycentric coordinates
Pixel coverage determination
Sampling and antialiasing
Tile-based vs immediate rendering

Section 4. Shading and Lighting

Gouraud and Phong shading
Lambertian reflection
Specular highlights
BRDF introduction

Section 5. Texture Mapping

UV coordinate generation
Filtering and mipmapping
Texture addressing modes
Texture compression formats

Chapter 4. System Architecture

Section 1. Clocking and Power

Multi-domain clock networks
Dynamic frequency scaling
Clock gating and power gating
Thermal limits and throttling

Section 2. Memory Interfaces

GDDR6 and HBM3 architectures
Channel width and data rates
Memory controller scheduling
ECC and error recovery

Section 3. PCIe and Interconnects

PCIe generations and bandwidth
NVLink and Infinity Fabric
CXL and coherent interconnects
Topologies for multi-GPU systems

Section 4. Command Processing

DMA engines
Queue management
Workload submission
Context switching and synchronization

Chapter 5. Vertex Processing and Primitive Assembly

Section 1. Vertex Fetch and Transformation

Attribute fetch unit
MVP matrix multiplication
Viewport transformation
Perspective division

Section 2. Clipping and Culling

Frustum clipping logic
Backface culling
Guard band optimization
Degenerate triangle handling

Section 3. Triangle Setup

Edge equations
Gradient and interpolation setup
Subpixel precision
Fixed-point arithmetic

Section 4. Verilog Implementation

Matrix multiplier
Clipping module
Triangle setup unit
Testbench for vertex stage

Chapter 6. Rasterization

Section 1. Tile-Based Scan Conversion

Screen space subdivision
Tile binning and culling
Parallel raster pipelines
Memory locality optimization

Section 2. Early-Z and Hierarchical-Z

Depth pre-pass techniques
Hierarchical Z-buffer design
Early depth rejection
Performance considerations

Section 3. Attribute Interpolation

Perspective-correct interpolation
Fixed-point vs floating-point gradients
Interpolator pipeline stages
Precision and rounding effects

Section 4. Variable Rate Shading

Coarse pixel shading
Shading rate maps
Foveated rendering applications
Bandwidth and power benefits

Section 5. Verilog Implementation

Rasterizer core
Edge function generator
Z-test pipeline
Tile buffer controller

Chapter 7. Fragment Processing and Texturing

Section 1. Texture Units

Texture addressing and coordinate wrapping
Filtering modes (nearest, bilinear, trilinear)
Anisotropic filtering
Mipmapping hardware

Section 2. Alpha Blending and Depth Testing

Alpha test and discard
Blending equations
Depth comparison logic
Write mask control

Section 3. Texture Compression

BC and ASTC formats
Block decoding pipeline
Hardware decompression logic
Compression ratio and quality tradeoffs

Section 4. Texture Cache Architecture

Cache hierarchy for texture data
Tag-data organization
Replacement policies (LRU, random)
Cache line size optimization
Texture cache coherency

Section 5. Texture Unit Pipeline

Address calculation stage
Cache lookup and fetch
Decompression stage
Filtering computation
Output buffering

Section 6. Lighting Calculations

Ambient, diffuse, and specular components
Normal and bump mapping
Light source models
Multi-light accumulation

Section 7. Verilog Implementation

Texture address generator
Bilinear filter module
Texture cache controller
Fragment shader datapath
Texture pipeline testbench

Chapter 8. Shader Core Architecture

Section 1. Programmable Shader Overview

Evolution from fixed-function to programmable
Unified shader architecture
Shader stages and types
Program compilation and linkage

Section 2. Instruction Set Architecture

Scalar vs vector ISA
Arithmetic and transcendental instructions
Memory access operations
Control flow encoding

Section 3. Warp Scheduler

Warp and wavefront execution
Dependency scoreboarding
Issue policies and priorities
Latency hiding mechanisms

Section 4. Register File Design

Banked register organization
Read/write port design
Register renaming
Spilling and allocation

Section 5. Execution Units

Integer and floating-point ALUs
Special function units
Load/store pipelines
Data forwarding and bypassing

Section 6. Branch Divergence Handling

Active mask tracking
Divergence stack logic
Reconvergence hardware
Performance impact analysis

Section 7. Instruction Cache

Cache organization and fetch width
Branch prediction
Instruction buffering
Prefetch and alignment

Section 8. Verilog Implementation

Warp scheduler FSM
Banked register file
ALU and FPU datapaths
Divergence stack
Shader core testbench

Chapter 9. Memory Subsystem Design

Section 1. Memory Architecture Overview

GPU memory hierarchy
Bandwidth and latency tradeoffs
Parallel memory channels
Performance balancing

Section 2. Coalescing and Memory Transactions

Memory access coalescing
Transaction formation
Strided and misaligned access handling
Alignment requirements

Section 3. Shared Memory and Local Data Share

Banked architecture
Bank conflict resolution
Barrier synchronization
Common programming patterns

Section 4. L1 Cache Design

Set-associative structure
Write-back vs write-through
Coherency mechanisms
Replacement policies

Section 5. L2 Cache Architecture

Unified cache for all cores
Partitioning and crossbar interconnect
Victim cache optimization
Cache slice arbitration

Section 6. Memory Controller

DRAM command generation
Row buffer management
Bank interleaving
QoS scheduling

Section 7. Atomic Operations

Read-modify-write primitives
Memory ordering rules
Atomic unit architecture
Performance tradeoffs

Section 8. Verilog Implementation

Coalescing unit
Banked shared memory
Set-associative cache
Memory controller FSM
Memory system testbench

Chapter 10. Render Output Pipeline

Section 1. Depth and Stencil Testing

Z-buffer algorithm
Depth and stencil compare functions
Early-Z and late-Z pipelines
Depth bounds optimization

Section 2. Blending Operations

Alpha blending modes
Dual-source blending
Logical pixel operations
Independent render target blending

Section 3. Color Compression

Delta color compression (DCC)
Fast clear optimization
Metadata management
Compression efficiency

Section 4. Render Target Cache

Color and depth cache structure
Tile-based write combining
Compression integration
Eviction policy design

Section 5. Multi-Sample Anti-Aliasing

Sampling positions and coverage masks
Centroid and resolve operations
Bandwidth considerations
Quality and performance balance

Section 6. Framebuffer Organization

Linear and tiled layouts
Swizzling and Z-order curves
Multi-render target support
Memory access optimization

Section 7. Verilog Implementation

Depth test module
Blending unit
ROP cache controller
MSAA resolve logic
ROP testbench

Chapter 11. Compute Architecture

Section 1. GPGPU Programming Model

Kernels, threads, and work-groups
Hierarchical execution model
Global and shared memory scopes
Synchronization mechanisms

Section 2. Compute Dispatch

Kernel launch process
Command packet format
Work-group distribution
Concurrent kernel execution

Section 3. Occupancy and Resource Management

Register pressure and allocation
Shared memory partitioning
Warp and block occupancy
Resource scheduling

Section 4. Work Distribution

Static and dynamic scheduling
Load balancing algorithms
Persistent threads model
Cooperative group synchronization

Section 5. Data Parallel Processing Patterns

Map, reduce, and scan operations
Histogram and sort algorithms
Matrix operations
Prefix sums and reductions

Section 6. Verilog Implementation

Compute dispatch unit
Work-group scheduler
Resource allocator
Barrier synchronization logic
Compute kernel testbench


