Modern GPU Architecture Second Edition

Volume Two: Compute Acceleration, Tensor Cores, and Advanced Systems

Modern GPUs are the most complex and efficient parallel processors ever created—and this book shows you exactly how they work at the hardware level. Unlike typical graphics or programming guides, this volume takes you inside the GPU itself: how instructions flow through pipelines, how memory hierarchies sustain bandwidth, how shader cores and fixed-function units cooperate to render billions of pixels per second.

Minimum price: $19.00
Suggested price: $29.00
You pay: $29.00
Author earns: $23.20

About the Book

Modern GPU Architecture Second Edition — Volume Two
Compute Acceleration, Tensor Cores, and Advanced Systems


You’ll explore the major subsystems of a modern compute-class GPU in depth—tensor and matrix acceleration, ray tracing hardware, synchronization and memory ordering, display and video engines, interconnect, performance analysis, and physical design—all supported by clear mathematical models and synthesizable Verilog examples. This is not “theory for theory’s sake”; it’s engineering detail you can apply directly in design, simulation, or hardware verification.

By reading this book, you’ll gain:

  • Architectural intuition — understand how throughput, latency, and bandwidth interact in real GPUs.
  • Practical RTL-level insight — see how each stage can be implemented with clean, synthesizable Verilog.
  • A foundation for advanced design — build the knowledge required for AI acceleration, compute architectures, or FPGA-based GPU prototyping.
  • Confidence to analyze real silicon — reason about performance, bottlenecks, and tradeoffs like a hardware architect.

Every chapter bridges concept and implementation, making it invaluable for anyone designing graphics hardware, studying computer architecture, or seeking mastery of parallel computation systems.

Dense, detailed, and unapologetically technical, this book is written for those who want to understand modern GPUs—not just use them.

⚠️ This isn’t entertainment. It’s engineering.
If that excites you, welcome aboard.
If it intimidates you, this book isn’t for you.

From the Editor at Burst Books — Gareth Thomas

A Smarter Kind of Learning Has Arrived — Thinking on Its Own.

Forget tired textbooks from years past. These AI-crafted STEM editions advance at the speed of discovery. Each page is built by intelligence trained on thousands of trusted sources, delivering crystal-clear explanations, flawless equations, and functional examples — all refreshed through the latest breakthroughs.

Best of all, these editions cost a fraction of traditional texts yet surpass expectations. You’re gaining more than a book — you’re enhancing the mind’s performance.

Explore BurstBooksPublishing on GitHub to find technical samples, infographics, and additional study material — a complete hub that supports deeper, hands-on learning.

In this age of AI, leave the past behind and learn directly from tomorrow.


About the Author

Gareth Thomas

Gareth Thomas is the publisher-author behind BurstBooks, creating rigorous, code-rich STEM textbooks for engineers and advanced learners. Based in Auckland, New Zealand, he builds practical, reference-grade titles that blend clear exposition with working math, diagrams, and real-world examples. His catalog focuses on high-impact domains—including humanoid robotics, GPU architecture and programming, electronic warfare systems, satellite and space systems, EEG/neuro-engineering, and applied AI/ML—designed for hands-on use in labs, teams, and self-study.

Gareth’s workflow is unapologetically modern: LaTeX for precision, reproducible figures and listings, and AI-assisted drafting to accelerate iteration while maintaining strict technical accuracy. Each BurstBooks edition aims to minimize fluff and maximize utility—clean structure, consistent notation, and actionable takeaways. When possible, he complements books with supporting GitHub repositories, infographics, and exercises to help readers move from theory to implementation quickly.

Note: Many BurstBooks titles are AI-assisted technical editions—structured for information density and correctness over prose style.

Table of Contents

Chapter 12. Tensor and Matrix Acceleration

Section 1. Matrix Multiplication Fundamentals

  • GEMM operation principles
  • Blocking and tiling strategies
  • Data reuse optimization
  • Arithmetic intensity considerations
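The blocking and data-reuse ideas this section covers can be sketched in a few lines. Below is a minimal illustration (Python for brevity here; the book's worked examples are in Verilog), with `tile` as a hypothetical tile size, not a value from the text:

```python
def gemm_blocked(A, B, tile=2):
    """Blocked GEMM: C = A @ B, computed tile by tile so each loaded
    tile of A and B is reused before moving on."""
    n, k = len(A), len(A[0])
    m = len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # inner loops stay within one tile of each operand
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        for p in range(p0, min(p0 + tile, k)):
                            C[i][j] += A[i][p] * B[p][j]
    return C
```

Each tile of A and B is fetched once and reused across the tile's rows and columns, which is exactly what raises arithmetic intensity in a hardware GEMM engine.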

Section 2. Tensor Core Architecture

  • Systolic array organization
  • Matrix multiply-accumulate units
  • Dataflow and accumulation patterns
  • Precision and throughput balance
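As a behavioral reference for the systolic organization listed above, the following sketch models an output-stationary array purely by its wavefront schedule: PE(i, j) consumes A[i][t] and B[t][j] at step t + i + j, as the skewed operands arrive. This is a timing model only, not an RTL description:

```python
def systolic_matmul(A, B):
    """Cycle-level model of an output-stationary systolic array.
    Operands are skewed so that PE(i, j) sees A[i][t] and B[t][j]
    at step t + i + j; each PE just multiply-accumulates in place."""
    n, k = len(A), len(A[0])
    m = len(B[0])
    C = [[0.0] * m for _ in range(n)]
    total_steps = k + n + m - 2  # time for the last wavefront to drain
    for step in range(total_steps):
        for i in range(n):
            for j in range(m):
                t = step - i - j
                if 0 <= t < k:
                    C[i][j] += A[i][t] * B[t][j]
    return C
```

Every element of C accumulates over exactly k steps, so the model reproduces the GEMM result while making the dataflow schedule explicit.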

Section 3. Mixed Precision Support

  • FP16, BF16, TF32 computation
  • INT8 and INT4 quantization
  • Accumulator precision control
  • Conversion and normalization units
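To give a concrete feel for one of these conversion units, here is a sketch of FP32-to-BF16 truncation with round-to-nearest-even, done on the raw bit patterns (NaN and infinity handling omitted for brevity):

```python
import struct

def f32_to_bf16_bits(x):
    """Convert a Python float (as FP32) to a 16-bit BF16 pattern by
    rounding away the low 16 mantissa bits, nearest-even."""
    (u,) = struct.unpack("<I", struct.pack("<f", x))
    # add 0x7FFF, plus 1 more when the kept LSB is set (ties to even)
    rounding = 0x7FFF + ((u >> 16) & 1)
    return ((u + rounding) >> 16) & 0xFFFF

def bf16_bits_to_f32(h):
    """BF16 widens to FP32 by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", h << 16))[0]
```

BF16 keeps FP32's 8-bit exponent, so range is preserved and only mantissa precision is traded away, which is why the widening direction is a simple zero-fill.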

Section 4. Tensor Memory Layout

  • Row-major and column-major ordering
  • Tiled and blocked formats
  • Swizzling for conflict avoidance
  • Efficient memory access patterns
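The address arithmetic behind these layouts is compact enough to sketch directly. The tile size and bank count below are illustrative values, not figures from the text:

```python
def row_major(r, c, width):
    """Linear offset of element (r, c) in a row-major matrix."""
    return r * width + c

def tiled(r, c, width, t=4):
    """Offset in a t-by-t tiled layout: tiles are laid out row-major,
    and elements are row-major within each tile."""
    tiles_per_row = width // t
    tile_index = (r // t) * tiles_per_row + (c // t)
    return tile_index * t * t + (r % t) * t + (c % t)

def swizzled_bank(r, c, banks=32):
    """XOR swizzle: perturb the bank index by the row so that a column
    walk touches every bank instead of hammering one."""
    return (c ^ r) % banks
```

Without the swizzle, striding down a column hits the same bank on every access; the XOR spreads those accesses across all banks, which is the conflict-avoidance effect this section describes.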

Section 5. Sparse Matrix Acceleration

  • CSR and COO representations
  • Structured sparsity (2:4, 4:8)
  • Zero-skipping hardware logic
  • Compression and decompression paths
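The 2:4 structured-sparsity format listed above stores two values plus two 2-bit indices per group of four elements. A small software model of the compression and decompression paths (illustrative only; hardware packs the metadata into dedicated fields):

```python
def compress_2to4(vals):
    """Compress a 2:4-sparse vector: per group of 4, keep the (at most
    2) nonzeros and their 2-bit positions, padding with unused slots."""
    data, meta = [], []
    for g in range(0, len(vals), 4):
        group = vals[g:g + 4]
        nz = [(i, v) for i, v in enumerate(group) if v != 0]
        assert len(nz) <= 2, "violates 2:4 structured sparsity"
        free = [i for i in range(4) if i not in {i2 for i2, _ in nz}]
        while len(nz) < 2:
            nz.append((free.pop(0), 0))  # pad with an unused position
        for i, v in sorted(nz):
            data.append(v)
            meta.append(i)  # 2-bit index within the group of 4
    return data, meta

def decompress_2to4(data, meta, n):
    """Expand the packed values back to a dense length-n vector."""
    out = [0] * n
    for k in range(len(data)):
        group = k // 2
        out[group * 4 + meta[k]] = data[k]
    return out
```

The compressed form is half the size plus the metadata, and the zero-skipping hardware uses the indices to steer the two surviving operands into the multiplier lanes.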

Section 6. Verilog Implementation

  • Systolic array module
  • Matrix multiply-accumulate block
  • Data distribution network
  • Tensor core testbench

Chapter 13. Ray Tracing Hardware

Section 1. Ray Tracing Fundamentals

  • Ray representation
  • Intersection with primitives
  • BVH construction principles
  • Traversal algorithms
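The ray-box intersection at the heart of BVH traversal is the classic slab test, sketched below (assuming nonzero direction components; real hardware also handles the zero-component and NaN cases):

```python
def ray_box_hit(origin, direction, bmin, bmax):
    """Slab test: intersect the ray with the three pairs of axis-aligned
    planes; the box is hit iff the entry intervals overlap."""
    tmin, tmax = 0.0, float("inf")
    for axis in range(3):
        inv = 1.0 / direction[axis]
        t1 = (bmin[axis] - origin[axis]) * inv
        t2 = (bmax[axis] - origin[axis]) * inv
        tmin = max(tmin, min(t1, t2))  # latest entry
        tmax = min(tmax, max(t1, t2))  # earliest exit
    return tmin <= tmax
```

An RT core evaluates several of these tests per clock per ray, which is why the per-axis arithmetic (two subtracts, two multiplies, min/max) is worth counting carefully.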

Section 2. RT Core Architecture

  • Ray-box intersection logic
  • Ray-triangle intersection unit
  • BVH traversal engine
  • Hit and miss determination

Section 3. Acceleration Structures

  • BVH node hierarchy
  • Memory layout optimization
  • Update and refit mechanisms
  • Build-time vs runtime tradeoffs

Section 4. Ray Coherence and Sorting

  • Coherent ray batching
  • Ray binning and bucketing
  • Cache-aware reordering
  • Wavefront path tracing

Section 5. Integration with Rasterization

  • Hybrid rendering pipeline
  • Shader-based ray generation
  • Shader binding table
  • Payload management

Section 6. Verilog Implementation

  • Ray-box intersection module
  • Ray-triangle intersection unit
  • BVH traversal FSM
  • RT core testbench

Chapter 14. Synchronization and Memory Ordering

Section 1. Memory Consistency Models

  • Sequential and relaxed models
  • Acquire-release semantics
  • Visibility scopes
  • GPU-specific ordering rules

Section 2. Barriers and Fences

  • Block and grid-level barriers
  • Memory fence types
  • System-wide synchronization
  • Performance overhead

Section 3. Cache Coherence

  • Write-invalidate protocols
  • Directory-based coherence
  • Cross-core consistency
  • Heterogeneous CPU-GPU models

Section 4. Atomic Operations

  • Read-modify-write logic
  • Compare-and-swap
  • Arbitration circuits
  • Performance optimizations
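The compare-and-swap retry loop that underlies most read-modify-write sequences can be modeled in software. In the sketch below a lock stands in for the memory controller's atomicity guarantee; the class and function names are illustrative:

```python
import threading

class AtomicCell:
    """Models a hardware atomic unit: the lock plays the role of the
    memory controller serializing read-modify-write on one address."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        """Return the observed old value; the swap took effect iff
        the returned value equals `expected`."""
        with self._lock:
            old = self._value
            if old == expected:
                self._value = new
            return old

def atomic_add(cell, delta):
    """Classic CAS retry loop: read, compute, attempt, retry on loss."""
    while True:
        old = cell.load()
        if cell.compare_and_swap(old, old + delta) == old:
            return old
```

Under contention every failed CAS costs a round trip, which is why GPUs also provide dedicated atomic-add units near memory rather than relying on CAS loops alone.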

Section 5. Lock-Free Algorithms

  • Wait-free synchronization
  • ABA problem handling
  • Lock-free queues
  • GPU-specific design considerations

Section 6. Verilog Implementation

  • Barrier synchronization module
  • Atomic operation unit
  • Memory fence controller
  • Synchronization testbench

Chapter 15. Advanced Rendering Features

Section 1. Tessellation Pipeline

  • Hull shader and control points
  • Fixed-function tessellator
  • Domain shader operations
  • Adaptive tessellation control

Section 2. Geometry Processing

  • Geometry shader stage
  • Primitive amplification
  • Stream output
  • Layered rendering

Section 3. Mesh Shaders

  • Meshlet-based processing
  • Task and mesh shader stages
  • Workgroup culling and amplification
  • Hardware resource mapping

Section 4. Variable Rate Shading

  • Shading rate images
  • Coarse shading patterns
  • Foveated rendering
  • Performance and power gains

Section 5. Deferred Rendering Architecture

  • G-buffer composition
  • Geometry and lighting passes
  • Tile-based deferred shading
  • Bandwidth and efficiency analysis

Section 6. Verilog Implementation

  • Tessellator hardware
  • Meshlet processor
  • VRS controller
  • G-buffer manager testbench

Chapter 16. Display and Video Engines

Section 1. Display Controller

  • Timing generation (HSYNC and VSYNC)
  • Frame buffer scanning
  • Pixel pipeline organization
  • Multi-display management
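The timing generator's arithmetic follows directly from the scan structure: each frame is h_total by v_total pixel clocks, counting the blanking intervals (front porch, sync pulse, back porch) as well as the active area. A one-line model, checked against the standard 1080p60 timing (2200 by 1125 totals at 148.5 MHz):

```python
def refresh_rate_hz(pixel_clock_hz, h_total, v_total):
    """Frames per second for a raster scan: the controller emits
    h_total * v_total pixel clocks per frame, blanking included."""
    return pixel_clock_hz / (h_total * v_total)
```

Working backwards, the same relation sizes the pixel clock a display controller must sustain for a given mode.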

Section 2. Display Compression

  • Display Stream Compression (DSC)
  • Encoder and decoder design
  • Bandwidth reduction analysis
  • Visual quality metrics

Section 3. Video Decode Acceleration

  • H.264, H.265, VP9, and AV1 decoding
  • Bitstream parsing and entropy decoding
  • Motion compensation hardware
  • Parallel decode engines

Section 4. Video Encode Acceleration

  • Motion estimation logic
  • Rate control mechanisms
  • Entropy encoder design
  • Multi-format support

Section 5. Video Processing Pipeline

  • Scaling and filtering
  • Color space conversion
  • Deinterlacing and denoising
  • HDR tone mapping
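Color space conversion in this pipeline is a fixed 3x3 matrix plus offsets per pixel. A sketch using the BT.601 full-range coefficients (one common choice; BT.709 and limited-range variants use different constants):

```python
def rgb_to_ycbcr(r, g, b):
    """BT.601 full-range RGB -> YCbCr: luma is a weighted sum of the
    channels, chroma are offset differences centered at 128."""
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128
    cr =  0.5 * r - 0.418688 * g - 0.081312 * b + 128
    return y, cb, cr
```

In hardware this is typically a fixed-point multiply-add array with programmable coefficients so one datapath serves all the supported standards.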

Section 6. Verilog Implementation

  • Display timing generator
  • Video decoder FSM
  • Motion estimation module
  • Video pipeline testbench

Chapter 17. Interconnect and Communication

Section 1. On-Chip Networks

  • Mesh and crossbar topologies
  • Router architecture and buffering
  • Flow control and arbitration
  • Deadlock prevention

Section 2. Memory Crossbar

  • SM-to-memory partition links
  • Bandwidth scheduling
  • Virtual channels
  • QoS enforcement

Section 3. PCIe Interface

  • Protocol layer overview
  • DMA engine design
  • Peer-to-peer communication
  • Error handling

Section 4. High-Speed Serial Links

  • NVLink and Infinity Fabric
  • CXL and coherent interfaces
  • PHY design and equalization
  • Latency and throughput tuning

Section 5. Multi-GPU Communication

  • GPU-to-GPU transfers
  • Collective operations
  • Topology optimization
  • Scalability challenges

Section 6. Verilog Implementation

  • Crossbar switch module
  • Round-robin arbiter
  • PCIe transaction engine
  • NoC router and testbench

Chapter 18. Performance Analysis and Optimization

Section 1. Performance Metrics

  • Throughput (GFLOPS and TFLOPS)
  • Bandwidth utilization
  • Cache efficiency
  • Power and thermal metrics

Section 2. Bottleneck Identification

  • Memory-bound and compute-bound workloads
  • Latency versus bandwidth limits
  • Roofline analysis
  • Profiling methodology
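The roofline model this section uses reduces to one min(): attainable throughput is capped either by peak compute or by bandwidth times arithmetic intensity. A minimal sketch with illustrative units (GFLOP/s and GB/s):

```python
def attainable_gflops(arith_intensity, peak_gflops, peak_bw_gbps):
    """Roofline: min of the compute roof and the bandwidth slope,
    with arith_intensity in FLOPs per byte."""
    return min(peak_gflops, peak_bw_gbps * arith_intensity)

def is_memory_bound(arith_intensity, peak_gflops, peak_bw_gbps):
    """Below the ridge point (peak_gflops / peak_bw), bandwidth limits."""
    return arith_intensity < peak_gflops / peak_bw_gbps
```

Plotting a kernel's measured intensity against the two roofs immediately tells you whether to chase memory traffic or instruction throughput.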

Section 3. Performance Counters

  • Counter and sampler architecture
  • Multiplexing techniques
  • Key hardware metrics
  • PMU software interfaces

Section 4. Workload Characterization

  • Instruction mix and balance
  • Cache and memory patterns
  • Thread divergence statistics
  • Power behavior profiling

Section 5. Optimization Techniques

  • Occupancy tuning
  • Coalesced memory access
  • Shared memory utilization
  • Instruction scheduling
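Occupancy tuning amounts to finding which per-SM resource runs out first. A sketch with hypothetical per-SM limits (the defaults below are illustrative; real values vary by architecture and are not taken from this text):

```python
def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_threads=2048, reg_file=65536, smem_bytes=65536,
              max_blocks=32):
    """Fraction of an SM's thread slots filled: resident blocks are
    limited by threads, registers, shared memory, and the block cap."""
    limit_threads = max_threads // threads_per_block
    limit_regs = reg_file // (regs_per_thread * threads_per_block)
    limit_smem = smem_bytes // smem_per_block if smem_per_block else max_blocks
    blocks = min(limit_threads, limit_regs, limit_smem, max_blocks)
    return blocks * threads_per_block / max_threads
```

Doubling register use from 32 to 64 per thread halves occupancy in this model, which is the kind of tradeoff the scheduler and compiler negotiate.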

Section 6. Power and Thermal Management

  • Dynamic power reduction
  • Workload-based DVFS
  • Thermal throttling
  • Efficiency optimization

Chapter 19. Physical Design and Manufacturing

Section 1. Floorplanning

  • Die partitioning and hierarchy
  • Block placement and routing
  • Power grid and clock tree
  • Thermal optimization

Section 2. Synthesis and Timing Closure

  • RTL synthesis flow
  • Timing constraints and setup
  • Clock domain verification
  • Multi-mode optimization

Section 3. Place and Route

  • Placement algorithms
  • Routing congestion management
  • Signal integrity checks
  • IR drop and EM control

Section 4. Design for Test

  • Scan insertion
  • Built-in self-test
  • JTAG and boundary scan
  • Yield analysis

Section 5. Packaging Technologies

  • Flip-chip and BGA packaging
  • Thermal interface materials
  • Multi-chip modules
  • TSV-based stacking

Section 6. Advanced Integration

  • Chiplet architectures
  • Die-to-die interconnects
  • UCIe protocol
  • Heterogeneous integration

Chapter 20. Future Directions and Emerging Technologies

Section 1. Modern GPU Case Studies

  • NVIDIA Hopper and AMD RDNA3
  • Intel Arc and Apple GPU designs
  • Mobile GPUs (Mali, Adreno)
  • Design tradeoff comparisons

Section 2. Specialized AI Accelerators

  • Google TPU and Cerebras engine
  • Graphcore IPU and Groq LPU
  • Comparison with general GPUs
  • Domain-specific optimization

Section 3. Beyond Moore’s Law

  • Process scaling limits
  • GAA and CFET transistor evolution
  • Advanced packaging methods
  • Economic and design impact

Section 4. Emerging Memory Technologies

  • HBM4 and next-gen DRAM
  • Processing-in-memory concepts
  • Persistent and near-data memory
  • Memory-centric system design

Section 5. Novel Computing Paradigms

  • Neuromorphic and optical computing
  • Quantum-accelerated systems
  • Approximate computation
  • Stochastic and hybrid models

Section 6. Sustainability and Green Computing

  • Power efficiency trends
  • Carbon footprint reduction
  • Lifecycle optimization
  • Renewable-powered data centers

Section 7. Research Frontiers

  • AI-driven hardware design
  • Self-optimizing microarchitectures
  • Secure and open GPU initiatives
  • Future scalability challenges
