Modern GPU Architecture Second Edition

Volume One: Graphics Pipeline Design and Hardware Implementation

About the Book

Modern GPU Architecture Second Edition — Volume One
Graphics Pipeline Design and Hardware Implementation

Modern GPUs are the most complex and efficient parallel processors ever created—and this book shows you exactly how they work at the hardware level.

Unlike typical graphics or programming guides, this volume takes you inside the GPU itself:
how instructions flow through pipelines, how memory hierarchies sustain bandwidth, how shader cores and fixed-function units cooperate to render billions of pixels per second.

You’ll explore every major stage of the graphics pipeline in depth—geometry, rasterization, shading, texturing, and render output—all supported by clear mathematical models and synthesizable Verilog examples. This is not “theory for theory’s sake”; it’s engineering detail you can apply directly in design, simulation, or hardware verification.

By reading this book, you’ll gain:

  • Architectural intuition — understand how throughput, latency, and bandwidth interact in real GPUs.
  • Practical RTL-level insight — see how each stage can be implemented with clean, synthesizable Verilog.
  • A foundation for advanced design — build the knowledge required for AI acceleration, compute architectures, or FPGA-based GPU prototyping.
  • Confidence to analyze real silicon — reason about performance, bottlenecks, and tradeoffs like a hardware architect.

Every chapter bridges concept and implementation, making it invaluable for anyone designing graphics hardware, studying computer architecture, or seeking mastery of parallel computation systems.

Dense, detailed, and unapologetically technical, this book is written for those who want to understand modern GPUs—not just use them.

⚠️ This isn’t entertainment. It’s engineering.
If that excites you, welcome aboard.
If it intimidates you, this book isn’t for you.

From the Editor at Burst Books — Gareth Thomas

A Smarter Kind of Learning Has Arrived — Thinking on Its Own.

Forget tired textbooks from years past. These AI-crafted STEM editions advance at the speed of discovery. Each page is built by intelligence trained on thousands of trusted sources, delivering crystal-clear explanations, flawless equations, and functional examples — all refreshed through the latest breakthroughs.

Best of all, these editions cost a fraction of traditional texts yet surpass expectations. You're gaining more than a book: you're upgrading how you learn.

Explore BurstBooksPublishing on GitHub to find technical samples, infographics, and additional study material — a complete hub that supports deeper, hands-on learning.

In this age of AI, leave the past behind and learn directly from tomorrow.

Share this book

Categories

Bundle

Bundles that include this book


About the Author

Gareth Thomas

Gareth Morgan Thomas is an expert with experience across multiple STEM fields. He holds six university diplomas in electronics, software development, web development, and project management, along with qualifications in computer networking, CAD, diesel engineering, well drilling, and welding, giving him a robust foundation of technical knowledge.

Educated in Auckland, New Zealand, he also spent three years in the New Zealand Army, where he honed his discipline and problem-solving skills. Drawing on years of technical training, he is now dedicated to sharing his understanding of science, technology, engineering, and mathematics through a series of specialized books aimed at both beginners and advanced learners.


Table of Contents

Chapter 1. GPU Architecture Principles

Section 1. GPU vs CPU Philosophy

  • Throughput over latency
  • Task parallelism vs instruction-level parallelism
  • Scalability and energy efficiency
  • Suitability for graphics and compute workloads

Section 2. SIMT Execution Model

  • Threads and warps/wavefronts
  • SIMD-style execution with per-thread state
  • Divergence handling and reconvergence
  • Warp scheduling and masking
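As a taste of the SIMT concepts this section covers, here is a small behavioral sketch in Python (the book's hardware examples are in Verilog). A warp of threads hits a data-dependent branch; the two sides are serialized under complementary active masks, and the full mask is restored at the reconvergence point. The warp width and the toy program are illustrative assumptions, not taken from the book.

```python
WARP_SIZE = 8  # illustrative warp width, not a specific vendor's

def simt_branch(values):
    """Execute `if v is even: v //= 2 else: v = 3*v + 1` across a warp.

    Divergent branches are serialized: the taken side runs under one
    active mask, the not-taken side under its complement, and all
    lanes reconverge afterwards with the full mask restored.
    """
    mask_if = [v % 2 == 0 for v in values]   # lanes taking the if-side
    mask_else = [not m for m in mask_if]     # complementary mask
    out = list(values)
    # Pass 1: if-side; only lanes active in mask_if commit results.
    for lane in range(WARP_SIZE):
        if mask_if[lane]:
            out[lane] = out[lane] // 2
    # Pass 2: else-side; the remaining lanes execute.
    for lane in range(WARP_SIZE):
        if mask_else[lane]:
            out[lane] = 3 * out[lane] + 1
    # Reconvergence point: all lanes active again.
    return out

print(simt_branch([0, 1, 2, 3, 4, 5, 6, 7]))
# [0, 4, 1, 10, 2, 16, 3, 22]
```

The serialization is why divergence costs performance: both passes occupy the execution units even though each lane does useful work in only one of them.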

Section 3. Modern GPU Overview

  • NVIDIA SM, AMD CU, Intel Xe-core comparison
  • Streaming multiprocessor internal organization
  • Compute and graphics pipeline integration
  • Role of schedulers and dispatch units

Section 4. Memory Hierarchy Snapshot

  • Register files and local storage
  • Shared/LDS memory
  • L1, L2, and VRAM hierarchy
  • Latency and bandwidth tradeoffs

Chapter 2. Digital Design for GPUs

Section 1. Verilog and SystemVerilog Essentials

  • Modules, ports, and parameters
  • Procedural and continuous assignments
  • Always blocks and combinational logic
  • Timing control and simulation

Section 2. Pipeline Design

  • Pipeline depth and balancing
  • Register placement and retiming
  • Valid-ready handshaking
  • Control and data path separation

Section 3. Clock Domain Crossing and Handshaking

  • Metastability and synchronization
  • FIFO-based CDC mechanisms
  • Dual-clock RAM interfaces
  • Gray-coded counters
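The Gray-coded counters listed above deserve a quick illustration. This Python sketch (values chosen for illustration, not from the book) shows the property that makes Gray codes safe for passing FIFO pointers across clock domains: successive counts differ in exactly one bit, so a synchronizer can never capture a corrupt multi-bit transition.

```python
def bin_to_gray(n: int) -> int:
    """Convert a binary count to its reflected Gray code."""
    return n ^ (n >> 1)

def gray_to_bin(g: int) -> int:
    """Invert the Gray encoding via a running prefix XOR."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

# Adjacent counts differ in exactly one bit after Gray encoding,
# which is what makes the code safe to sample in another clock domain.
for i in range(15):
    diff = bin_to_gray(i) ^ bin_to_gray(i + 1)
    assert bin(diff).count("1") == 1
    assert gray_to_bin(bin_to_gray(i)) == i
print("gray code round-trip and single-bit property hold")
```

In RTL the same two transforms are a one-line XOR network each, placed on either side of the two-flop synchronizer.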

Section 4. Floating-Point Units

  • FP32, FP16, and BF16 support
  • Fused multiply-add (FMA) structure
  • Normalization and rounding
  • Verification and corner-case handling

Section 5. Verification Basics

  • Testbench architecture
  • Assertions and functional coverage
  • Waveform inspection and debugging
  • Regression testing

Chapter 3. 3D Graphics Fundamentals

Section 1. Geometric Primitives

  • Vertices, edges, and triangles
  • Indexed and non-indexed primitives
  • Vertex attributes and interpolation
  • Coordinate spaces and transformations

Section 2. Transformations

  • Model, view, and projection matrices
  • Perspective vs orthographic projection
  • Homogeneous coordinates
  • Matrix composition order

Section 3. Rasterization

  • Edge equations and barycentric coordinates
  • Pixel coverage determination
  • Sampling and antialiasing
  • Tile-based versus immediate rendering
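To make the edge-equation idea concrete, here is a minimal software rasterizer sketch in Python (the triangle and resolution are illustrative; the book's implementations are hardware pipelines). A pixel center is covered when all three edge functions agree in sign, and the same three values, divided by the triangle area, are the barycentric coordinates used for interpolation.

```python
def edge(ax, ay, bx, by, px, py):
    """Signed edge function: positive when P lies left of edge A->B."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def rasterize(v0, v1, v2, width, height):
    """Return the set of pixel centers covered by a CCW triangle.

    A real rasterizer adds a top-left fill rule so pixels on shared
    edges are drawn exactly once; this sketch simply uses >= 0.
    """
    covered = set()
    area = edge(*v0, *v1, *v2)  # twice the signed triangle area
    for y in range(height):
        for x in range(width):
            px, py = x + 0.5, y + 0.5   # sample at the pixel center
            w0 = edge(*v1, *v2, px, py)
            w1 = edge(*v2, *v0, px, py)
            w2 = edge(*v0, *v1, px, py)
            if w0 >= 0 and w1 >= 0 and w2 >= 0:
                # (w0/area, w1/area, w2/area) are barycentric coords
                covered.add((x, y))
    return covered

pixels = rasterize((0, 0), (8, 0), (0, 8), 8, 8)
print(len(pixels), "pixels covered")
```

Hardware evaluates the three edge functions incrementally: stepping one pixel in x or y adds a constant, so coverage for a whole tile reduces to adders.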

Section 4. Shading and Lighting

  • Gouraud and Phong shading
  • Lambertian reflection
  • Specular highlights
  • BRDF introduction

Section 5. Texture Mapping

  • UV coordinate generation
  • Filtering and mipmapping
  • Texture addressing modes
  • Texture compression formats

Chapter 4. System Architecture

Section 1. Clocking and Power

  • Multi-domain clock networks
  • Dynamic frequency scaling
  • Clock gating and power gating
  • Thermal limits and throttling

Section 2. Memory Interfaces

  • GDDR6 and HBM3 architectures
  • Channel width and data rates
  • Memory controller scheduling
  • ECC and error recovery

Section 3. PCIe and Interconnects

  • PCIe generations and bandwidth
  • NVLink and Infinity Fabric
  • CXL and coherent interconnects
  • Topologies for multi-GPU systems

Section 4. Command Processing

  • DMA engines
  • Queue management
  • Workload submission
  • Context switching and synchronization

Chapter 5. Vertex Processing and Primitive Assembly

Section 1. Vertex Fetch and Transformation

  • Attribute fetch unit
  • MVP matrix multiplication
  • Viewport transformation
  • Perspective division

Section 2. Clipping and Culling

  • Frustum clipping logic
  • Backface culling
  • Guard band optimization
  • Degenerate triangle handling

Section 3. Triangle Setup

  • Edge equations
  • Gradient and interpolation setup
  • Subpixel precision
  • Fixed-point arithmetic

Section 4. Verilog Implementation

  • Matrix multiplier
  • Clipping module
  • Triangle setup unit
  • Testbench for vertex stage

Chapter 6. Rasterization

Section 1. Tile-Based Scan Conversion

  • Screen space subdivision
  • Tile binning and culling
  • Parallel raster pipelines
  • Memory locality optimization

Section 2. Early-Z and Hierarchical-Z

  • Depth pre-pass techniques
  • Hierarchical Z-buffer design
  • Early depth rejection
  • Performance considerations

Section 3. Attribute Interpolation

  • Perspective-correct interpolation
  • Fixed-point vs floating-point gradients
  • Interpolator pipeline stages
  • Precision and rounding effects

Section 4. Variable Rate Shading

  • Coarse pixel shading
  • Shading rate maps
  • Foveated rendering applications
  • Bandwidth and power benefits

Section 5. Verilog Implementation

  • Rasterizer core
  • Edge function generator
  • Z-test pipeline
  • Tile buffer controller

Chapter 7. Fragment Processing and Texturing

Section 1. Texture Units

  • Texture addressing and coordinate wrapping
  • Filtering modes (nearest, bilinear, trilinear)
  • Anisotropic filtering
  • Mipmapping hardware
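The bilinear mode in the filtering list above is easy to show in miniature. This Python sketch (texel-center convention and clamp addressing are simplifying assumptions, not the book's hardware) fetches the 2x2 footprint around a sample point and blends with the fractional weights, exactly the arithmetic a texture unit performs per sample.

```python
def bilinear(tex, u, v):
    """Bilinearly filter a 2D texture at continuous coords (u, v).

    tex is a list of rows; (u, v) are in texel units with texel
    centers at integer coordinates (a simplifying assumption).
    """
    x0, y0 = int(u), int(v)              # top-left texel of the 2x2 footprint
    x1 = min(x0 + 1, len(tex[0]) - 1)    # clamp-to-edge addressing
    y1 = min(y0 + 1, len(tex) - 1)
    fx, fy = u - x0, v - y0              # fractional blend weights
    top = tex[y0][x0] * (1 - fx) + tex[y0][x1] * fx
    bot = tex[y1][x0] * (1 - fx) + tex[y1][x1] * fx
    return top * (1 - fy) + bot * fy

tex = [[0.0, 1.0],
       [1.0, 0.0]]
print(bilinear(tex, 0.5, 0.5))  # midpoint of a 2x2 checker: 0.5
```

Trilinear filtering runs this twice, once per adjacent mip level, and blends the two results with a third fractional weight.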

Section 2. Alpha Blending and Depth Testing

  • Alpha test and discard
  • Blending equations
  • Depth comparison logic
  • Write mask control

Section 3. Texture Compression

  • BC and ASTC formats
  • Block decoding pipeline
  • Hardware decompression logic
  • Compression ratio and quality tradeoffs

Section 4. Texture Cache Architecture

  • Cache hierarchy for texture data
  • Tag-data organization
  • Replacement policies (LRU, random)
  • Cache line size optimization
  • Texture cache coherency

Section 5. Texture Unit Pipeline

  • Address calculation stage
  • Cache lookup and fetch
  • Decompression stage
  • Filtering computation
  • Output buffering

Section 6. Lighting Calculations

  • Ambient, diffuse, and specular components
  • Normal and bump mapping
  • Light source models
  • Multi-light accumulation

Section 7. Verilog Implementation

  • Texture address generator
  • Bilinear filter module
  • Texture cache controller
  • Fragment shader datapath
  • Texture pipeline testbench

Chapter 8. Shader Core Architecture

Section 1. Programmable Shader Overview

  • Evolution from fixed-function to programmable
  • Unified shader architecture
  • Shader stages and types
  • Program compilation and linkage

Section 2. Instruction Set Architecture

  • Scalar vs vector ISA
  • Arithmetic and transcendental instructions
  • Memory access operations
  • Control flow encoding

Section 3. Warp Scheduler

  • Warp and wavefront execution
  • Dependency scoreboarding
  • Issue policies and priorities
  • Latency hiding mechanisms

Section 4. Register File Design

  • Banked register organization
  • Read/write port design
  • Register renaming
  • Spilling and allocation

Section 5. Execution Units

  • Integer and floating-point ALUs
  • Special function units
  • Load/store pipelines
  • Data forwarding and bypassing

Section 6. Branch Divergence Handling

  • Active mask tracking
  • Divergence stack logic
  • Reconvergence hardware
  • Performance impact analysis

Section 7. Instruction Cache

  • Cache organization and fetch width
  • Branch prediction
  • Instruction buffering
  • Prefetch and alignment

Section 8. Verilog Implementation

  • Warp scheduler FSM
  • Banked register file
  • ALU and FPU datapaths
  • Divergence stack
  • Shader core testbench

Chapter 9. Memory Subsystem Design

Section 1. Memory Architecture Overview

  • GPU memory hierarchy
  • Bandwidth and latency tradeoffs
  • Parallel memory channels
  • Performance balancing

Section 2. Coalescing and Memory Transactions

  • Memory access coalescing
  • Transaction formation
  • Strided and misaligned access handling
  • Alignment requirements
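Coalescing can be captured in a few lines. This Python sketch models the idea behind the topics above under an assumed 128-byte transaction size (a common figure, but a parameter, not something this page specifies): lanes whose byte addresses fall in the same naturally aligned segment are serviced by one transaction.

```python
SEGMENT_BYTES = 128  # assumed transaction size for illustration

def count_transactions(addresses):
    """Number of memory transactions needed to service one warp's loads.

    Each lane supplies a byte address; the coalescer merges lanes that
    fall in the same naturally aligned segment into one transaction.
    """
    segments = {addr // SEGMENT_BYTES for addr in addresses}
    return len(segments)

# 32 lanes loading consecutive 4-byte words: fully coalesced.
unit_stride = [lane * 4 for lane in range(32)]
# The same warp with a 128-byte stride: one transaction per lane.
big_stride = [lane * 128 for lane in range(32)]

print(count_transactions(unit_stride))  # 1
print(count_transactions(big_stride))   # 32
```

The 32x gap between those two access patterns is why data layout dominates GPU memory performance.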

Section 3. Shared Memory and Local Data Share

  • Banked architecture
  • Bank conflict resolution
  • Barrier synchronization
  • Common programming patterns
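Bank conflicts, the second bullet above, also reduce to a small model. This Python sketch assumes 32 banks of 4-byte words (a common organization, chosen here for illustration) and computes the worst-case replay factor for one warp's access; it deliberately ignores same-address broadcast, which real hardware handles without replay.

```python
from collections import Counter

NUM_BANKS = 32   # assumed bank count for this sketch
WORD_BYTES = 4

def conflict_degree(addresses):
    """Worst-case serialization factor for one warp's shared-memory access.

    Words are interleaved across banks; if k lanes hit the same bank
    the access replays k times. (Broadcast of identical addresses,
    which hardware services in one cycle, is not modeled.)
    """
    banks = Counter((a // WORD_BYTES) % NUM_BANKS for a in addresses)
    return max(banks.values())

conflict_free = [lane * 4 for lane in range(32)]   # one lane per bank
two_way = [lane * 8 for lane in range(32)]         # stride-2 words
print(conflict_degree(conflict_free))  # 1
print(conflict_degree(two_way))        # 2
```

Padding a shared-memory array by one word is the classic software fix: it shifts each row's bank mapping so column accesses no longer collide.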

Section 4. L1 Cache Design

  • Set-associative structure
  • Write-back vs write-through
  • Coherency mechanisms
  • Replacement policies

Section 5. L2 Cache Architecture

  • Unified cache for all cores
  • Partitioning and crossbar interconnect
  • Victim cache optimization
  • Cache slice arbitration

Section 6. Memory Controller

  • DRAM command generation
  • Row buffer management
  • Bank interleaving
  • QoS scheduling

Section 7. Atomic Operations

  • Read-modify-write primitives
  • Memory ordering rules
  • Atomic unit architecture
  • Performance tradeoffs

Section 8. Verilog Implementation

  • Coalescing unit
  • Banked shared memory
  • Set-associative cache
  • Memory controller FSM
  • Memory system testbench

Chapter 10. Render Output Pipeline

Section 1. Depth and Stencil Testing

  • Z-buffer algorithm
  • Depth and stencil compare functions
  • Early-Z and late-Z pipelines
  • Depth bounds optimization
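The Z-buffer algorithm at the head of this section fits in a few lines. A Python sketch (the clear value of 1.0 and the "less" compare are assumed conventions, not this book's fixed choices): each fragment compares against the stored depth and writes back only on a pass.

```python
def depth_test(zbuffer, x, y, z, func="less"):
    """Classic Z-buffer test: keep the fragment only if it passes the
    compare against the stored depth, then update the buffer on pass."""
    stored = zbuffer[y][x]
    passed = z < stored if func == "less" else z <= stored
    if passed:
        zbuffer[y][x] = z   # depth write occurs only on pass
    return passed

# 2x2 depth buffer cleared to the far plane (1.0, an assumed convention).
zbuf = [[1.0, 1.0], [1.0, 1.0]]
print(depth_test(zbuf, 0, 0, 0.5))  # True: nearer fragment wins
print(depth_test(zbuf, 0, 0, 0.7))  # False: occluded by the stored 0.5
```

Early-Z moves exactly this compare ahead of the fragment shader so occluded fragments never consume shading cycles; late-Z keeps it after, which is required when the shader itself writes depth.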

Section 2. Blending Operations

  • Alpha blending modes
  • Dual-source blending
  • Logical pixel operations
  • Independent render target blending

Section 3. Color Compression

  • Delta color compression (DCC)
  • Fast clear optimization
  • Metadata management
  • Compression efficiency

Section 4. Render Target Cache

  • Color and depth cache structure
  • Tile-based write combining
  • Compression integration
  • Eviction policy design

Section 5. Multi-Sample Anti-Aliasing

  • Sampling positions and coverage masks
  • Centroid and resolve operations
  • Bandwidth considerations
  • Quality and performance balance

Section 6. Framebuffer Organization

  • Linear and tiled layouts
  • Swizzling and Z-order curves
  • Multi-render target support
  • Memory access optimization

Section 7. Verilog Implementation

  • Depth test module
  • Blending unit
  • ROP cache controller
  • MSAA resolve logic
  • ROP testbench

Chapter 11. Compute Architecture

Section 1. GPGPU Programming Model

  • Kernels, threads, and work-groups
  • Hierarchical execution model
  • Global and shared memory scopes
  • Synchronization mechanisms

Section 2. Compute Dispatch

  • Kernel launch process
  • Command packet format
  • Work-group distribution
  • Concurrent kernel execution

Section 3. Occupancy and Resource Management

  • Register pressure and allocation
  • Shared memory partitioning
  • Warp and block occupancy
  • Resource scheduling

Section 4. Work Distribution

  • Static and dynamic scheduling
  • Load balancing algorithms
  • Persistent threads model
  • Cooperative group synchronization

Section 5. Data Parallel Processing Patterns

  • Map, reduce, and scan operations
  • Histogram and sort algorithms
  • Matrix operations
  • Prefix sums and reductions
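As a flavor of the scan pattern listed above, here is a Hillis-Steele inclusive prefix sum sketched in Python (the input values are arbitrary examples). The loop structure mirrors how a GPU work-group scans in shared memory: log2(n) passes, each lane adding the element `offset` positions behind it, with a barrier between read and write that the double-buffering below stands in for.

```python
def inclusive_scan(data):
    """Hillis-Steele inclusive prefix sum in log2(n) passes.

    On a GPU every lane runs one iteration of the inner loop in
    parallel; copying to `prev` models the read-before-write barrier.
    """
    out = list(data)
    offset = 1
    while offset < len(out):
        prev = list(out)                 # all lanes read old values
        for i in range(offset, len(out)):
            out[i] = prev[i] + prev[i - offset]
        offset *= 2
    return out

print(inclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# [3, 4, 11, 11, 15, 16, 22, 25]
```

Hillis-Steele does O(n log n) additions but finishes in fewer passes than the work-efficient Blelloch scan, a classic throughput-versus-work tradeoff on wide machines.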

Section 6. Verilog Implementation

  • Compute dispatch unit
  • Work-group scheduler
  • Resource allocator
  • Barrier synchronization logic
  • Compute kernel testbench
