Modern GPU Architecture Second Edition

Volume Two: Compute Acceleration, Tensor Cores, and Advanced Systems

Modern GPUs are the most complex and efficient parallel processors ever created—and this book shows you exactly how they work at the hardware level. Unlike typical graphics or programming guides, this volume takes you inside the GPU itself: how instructions flow through pipelines, how memory hierarchies sustain bandwidth, how shader cores and fixed-function units cooperate to render billions of pixels per second.

Minimum price: $19.00
Suggested price: $29.00
You pay: $29.00
Author earns: $23.20

About the Book

Modern GPU Architecture Second Edition — Volume Two
Compute Acceleration, Tensor Cores, and Advanced Systems


You’ll explore the major subsystems of a modern compute-class GPU in depth—tensor and matrix acceleration, ray tracing hardware, synchronization and memory ordering, display and video engines, interconnect, performance analysis, and physical design—all supported by clear mathematical models and synthesizable Verilog examples. This is not “theory for theory’s sake”; it’s engineering detail you can apply directly in design, simulation, or hardware verification.

By reading this book, you’ll gain:

  • Architectural intuition — understand how throughput, latency, and bandwidth interact in real GPUs.
  • Practical RTL-level insight — see how each stage can be implemented with clean, synthesizable Verilog.
  • A foundation for advanced design — build the knowledge required for AI acceleration, compute architectures, or FPGA-based GPU prototyping.
  • Confidence to analyze real silicon — reason about performance, bottlenecks, and tradeoffs like a hardware architect.

Every chapter bridges concept and implementation, making it invaluable for anyone designing graphics hardware, studying computer architecture, or seeking mastery of parallel computation systems.

Dense, detailed, and unapologetically technical, this book is written for those who want to understand modern GPUs—not just use them.

⚠️ This isn’t entertainment. It’s engineering.
If that excites you, welcome aboard.
If it intimidates you, this book isn’t for you.

From the Editor at Burst Books — Gareth Thomas

A Smarter Kind of Learning Has Arrived — Thinking on Its Own.

Forget tired textbooks from years past. These AI-crafted STEM editions advance at the speed of discovery. Each page is built by intelligence trained on thousands of trusted sources, delivering crystal-clear explanations, flawless equations, and functional examples — all refreshed through the latest breakthroughs.

Best of all, these editions cost a fraction of traditional texts yet surpass expectations. You’re gaining more than a book — you’re enhancing the mind’s performance.

Explore BurstBooksPublishing on GitHub to find technical samples, infographics, and additional study material — a complete hub that supports deeper, hands-on learning.

In this age of AI, leave the past behind and learn directly from tomorrow.


About the Author

Gareth Thomas

Gareth Thomas is the publisher-author behind BurstBooks, creating rigorous, code-rich STEM textbooks for engineers and advanced learners. Based in Auckland, New Zealand, he builds practical, reference-grade titles that blend clear exposition with working math, diagrams, and real-world examples. His catalog focuses on high-impact domains—including humanoid robotics, GPU architecture and programming, electronic warfare systems, satellite and space systems, EEG/neuro-engineering, and applied AI/ML—designed for hands-on use in labs, teams, and self-study.

Gareth’s workflow is unapologetically modern: LaTeX for precision, reproducible figures and listings, and AI-assisted drafting to accelerate iteration while maintaining strict technical accuracy. Each BurstBooks edition aims to minimize fluff and maximize utility—clean structure, consistent notation, and actionable takeaways. When possible, he complements books with supporting GitHub repositories, infographics, and exercises to help readers move from theory to implementation quickly.

Note: Many BurstBooks titles are AI-assisted technical editions—structured for information density and correctness over prose style.

Table of Contents

Chapter 12. Tensor and Matrix Acceleration

Section 1. Matrix Multiplication Fundamentals

  • GEMM operation principles
  • Blocking and tiling strategies
  • Data reuse optimization
  • Arithmetic intensity considerations
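The blocking and data-reuse ideas this section covers can be sketched in a few lines. Below is a minimal illustration (Python for brevity here; the book's worked examples are in Verilog), with `tile` as a hypothetical tile size, not a value from the text:

```python
def gemm_blocked(A, B, tile=2):
    """Blocked GEMM: C = A @ B, computed tile by tile so each loaded
    tile of A and B is reused before moving on."""
    n, k = len(A), len(A[0])
    m = len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # inner loops stay within one tile of each operand
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        for p in range(p0, min(p0 + tile, k)):
                            C[i][j] += A[i][p] * B[p][j]
    return C
```

Each tile of A and B is fetched once and reused across the tile's rows and columns, which is exactly what raises arithmetic intensity in a hardware GEMM engine.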

Section 2. Tensor Core Architecture

  • Systolic array organization
  • Matrix multiply-accumulate units
  • Dataflow and accumulation patterns
  • Precision and throughput balance
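As a behavioral reference for the systolic organization listed above, the following sketch models an output-stationary array purely by its wavefront schedule: PE(i, j) consumes A[i][t] and B[t][j] at step t + i + j, as the skewed operands arrive. This is a timing model only, not an RTL description:

```python
def systolic_matmul(A, B):
    """Cycle-level model of an output-stationary systolic array.
    Operands are skewed so that PE(i, j) sees A[i][t] and B[t][j]
    at step t + i + j; each PE just multiply-accumulates in place."""
    n, k = len(A), len(A[0])
    m = len(B[0])
    C = [[0.0] * m for _ in range(n)]
    total_steps = k + n + m - 2  # time for the last wavefront to drain
    for step in range(total_steps):
        for i in range(n):
            for j in range(m):
                t = step - i - j
                if 0 <= t < k:
                    C[i][j] += A[i][t] * B[t][j]
    return C
```

Every element of C accumulates over exactly k steps, so the model reproduces the GEMM result while making the dataflow schedule explicit.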

Section 3. Mixed Precision Support

  • FP16, BF16, TF32 computation
  • INT8 and INT4 quantization
  • Accumulator precision control
  • Conversion and normalization units
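To give a concrete feel for one of these conversion units, here is a sketch of FP32-to-BF16 truncation with round-to-nearest-even, done on the raw bit patterns (NaN and infinity handling omitted for brevity):

```python
import struct

def f32_to_bf16_bits(x):
    """Convert a Python float (as FP32) to a 16-bit BF16 pattern by
    rounding away the low 16 mantissa bits, nearest-even."""
    (u,) = struct.unpack("<I", struct.pack("<f", x))
    # add 0x7FFF, plus 1 more when the kept LSB is set (ties to even)
    rounding = 0x7FFF + ((u >> 16) & 1)
    return ((u + rounding) >> 16) & 0xFFFF

def bf16_bits_to_f32(h):
    """BF16 widens to FP32 by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", h << 16))[0]
```

BF16 keeps FP32's 8-bit exponent, so range is preserved and only mantissa precision is traded away, which is why the widening direction is a simple zero-fill.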

Section 4. Tensor Memory Layout

  • Row-major and column-major ordering
  • Tiled and blocked formats
  • Swizzling for conflict avoidance
  • Efficient memory access patterns
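The address arithmetic behind these layouts is compact enough to sketch directly. The tile size and bank count below are illustrative values, not figures from the text:

```python
def row_major(r, c, width):
    """Linear offset of element (r, c) in a row-major matrix."""
    return r * width + c

def tiled(r, c, width, t=4):
    """Offset in a t-by-t tiled layout: tiles are laid out row-major,
    and elements are row-major within each tile."""
    tiles_per_row = width // t
    tile_index = (r // t) * tiles_per_row + (c // t)
    return tile_index * t * t + (r % t) * t + (c % t)

def swizzled_bank(r, c, banks=32):
    """XOR swizzle: perturb the bank index by the row so that a column
    walk touches every bank instead of hammering one."""
    return (c ^ r) % banks
```

Without the swizzle, striding down a column hits the same bank on every access; the XOR spreads those accesses across all banks, which is the conflict-avoidance effect this section describes.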

Section 5. Sparse Matrix Acceleration

  • CSR and COO representations
  • Structured sparsity (2:4, 4:8)
  • Zero-skipping hardware logic
  • Compression and decompression paths
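The 2:4 structured-sparsity format listed above stores two values plus two 2-bit indices per group of four elements. A small software model of the compression and decompression paths (illustrative only; hardware packs the metadata into dedicated fields):

```python
def compress_2to4(vals):
    """Compress a 2:4-sparse vector: per group of 4, keep the (at most
    2) nonzeros and their 2-bit positions, padding with unused slots."""
    data, meta = [], []
    for g in range(0, len(vals), 4):
        group = vals[g:g + 4]
        nz = [(i, v) for i, v in enumerate(group) if v != 0]
        assert len(nz) <= 2, "violates 2:4 structured sparsity"
        free = [i for i in range(4) if i not in {i2 for i2, _ in nz}]
        while len(nz) < 2:
            nz.append((free.pop(0), 0))  # pad with an unused position
        for i, v in sorted(nz):
            data.append(v)
            meta.append(i)  # 2-bit index within the group of 4
    return data, meta

def decompress_2to4(data, meta, n):
    """Expand the packed values back to a dense length-n vector."""
    out = [0] * n
    for k in range(len(data)):
        group = k // 2
        out[group * 4 + meta[k]] = data[k]
    return out
```

The compressed form is half the size plus the metadata, and the zero-skipping hardware uses the indices to steer the two surviving operands into the multiplier lanes.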

Section 6. Verilog Implementation

  • Systolic array module
  • Matrix multiply-accumulate block
  • Data distribution network
  • Tensor core testbench

Chapter 13. Ray Tracing Hardware

Section 1. Ray Tracing Fundamentals

  • Ray representation
  • Intersection with primitives
  • BVH construction principles
  • Traversal algorithms
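The ray-box intersection at the heart of BVH traversal is the classic slab test, sketched below (assuming nonzero direction components; real hardware also handles the zero-component and NaN cases):

```python
def ray_box_hit(origin, direction, bmin, bmax):
    """Slab test: intersect the ray with the three pairs of axis-aligned
    planes; the box is hit iff the entry intervals overlap."""
    tmin, tmax = 0.0, float("inf")
    for axis in range(3):
        inv = 1.0 / direction[axis]
        t1 = (bmin[axis] - origin[axis]) * inv
        t2 = (bmax[axis] - origin[axis]) * inv
        tmin = max(tmin, min(t1, t2))  # latest entry
        tmax = min(tmax, max(t1, t2))  # earliest exit
    return tmin <= tmax
```

An RT core evaluates several of these tests per clock per ray, which is why the per-axis arithmetic (two subtracts, two multiplies, min/max) is worth counting carefully.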

Section 2. RT Core Architecture

  • Ray-box intersection logic
  • Ray-triangle intersection unit
  • BVH traversal engine
  • Hit and miss determination

Section 3. Acceleration Structures

  • BVH node hierarchy
  • Memory layout optimization
  • Update and refit mechanisms
  • Build-time vs runtime tradeoffs

Section 4. Ray Coherence and Sorting

  • Coherent ray batching
  • Ray binning and bucketing
  • Cache-aware reordering
  • Wavefront path tracing

Section 5. Integration with Rasterization

  • Hybrid rendering pipeline
  • Shader-based ray generation
  • Shader binding table
  • Payload management

Section 6. Verilog Implementation

  • Ray-box intersection module
  • Ray-triangle intersection unit
  • BVH traversal FSM
  • RT core testbench

Chapter 14. Synchronization and Memory Ordering

Section 1. Memory Consistency Models

  • Sequential and relaxed models
  • Acquire-release semantics
  • Visibility scopes
  • GPU-specific ordering rules

Section 2. Barriers and Fences

  • Block and grid-level barriers
  • Memory fence types
  • System-wide synchronization
  • Performance overhead

Section 3. Cache Coherence

  • Write-invalidate protocols
  • Directory-based coherence
  • Cross-core consistency
  • Heterogeneous CPU-GPU models

Section 4. Atomic Operations

  • Read-modify-write logic
  • Compare-and-swap
  • Arbitration circuits
  • Performance optimizations
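The compare-and-swap retry loop that underlies most read-modify-write sequences can be modeled in software. In the sketch below a lock stands in for the memory controller's atomicity guarantee; the class and function names are illustrative:

```python
import threading

class AtomicCell:
    """Models a hardware atomic unit: the lock plays the role of the
    memory controller serializing read-modify-write on one address."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        """Return the observed old value; the swap took effect iff
        the returned value equals `expected`."""
        with self._lock:
            old = self._value
            if old == expected:
                self._value = new
            return old

def atomic_add(cell, delta):
    """Classic CAS retry loop: read, compute, attempt, retry on loss."""
    while True:
        old = cell.load()
        if cell.compare_and_swap(old, old + delta) == old:
            return old
```

Under contention every failed CAS costs a round trip, which is why GPUs also provide dedicated atomic-add units near memory rather than relying on CAS loops alone.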

Section 5. Lock-Free Algorithms

  • Wait-free synchronization
  • ABA problem handling
  • Lock-free queues
  • GPU-specific design considerations

Section 6. Verilog Implementation

  • Barrier synchronization module
  • Atomic operation unit
  • Memory fence controller
  • Synchronization testbench

Chapter 15. Advanced Rendering Features

Section 1. Tessellation Pipeline

  • Hull shader and control points
  • Fixed-function tessellator
  • Domain shader operations
  • Adaptive tessellation control

Section 2. Geometry Processing

  • Geometry shader stage
  • Primitive amplification
  • Stream output
  • Layered rendering

Section 3. Mesh Shaders

  • Meshlet-based processing
  • Task and mesh shader stages
  • Workgroup culling and amplification
  • Hardware resource mapping

Section 4. Variable Rate Shading

  • Shading rate images
  • Coarse shading patterns
  • Foveated rendering
  • Performance and power gains

Section 5. Deferred Rendering Architecture

  • G-buffer composition
  • Geometry and lighting passes
  • Tile-based deferred shading
  • Bandwidth and efficiency analysis

Section 6. Verilog Implementation

  • Tessellator hardware
  • Meshlet processor
  • VRS controller
  • G-buffer manager testbench

Chapter 16. Display and Video Engines

Section 1. Display Controller

  • Timing generation (HSYNC and VSYNC)
  • Frame buffer scanning
  • Pixel pipeline organization
  • Multi-display management
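The timing generator's arithmetic follows directly from the scan structure: each frame is h_total by v_total pixel clocks, counting the blanking intervals (front porch, sync pulse, back porch) as well as the active area. A one-line model, checked against the standard 1080p60 timing (2200 by 1125 totals at 148.5 MHz):

```python
def refresh_rate_hz(pixel_clock_hz, h_total, v_total):
    """Frames per second for a raster scan: the controller emits
    h_total * v_total pixel clocks per frame, blanking included."""
    return pixel_clock_hz / (h_total * v_total)
```

Working backwards, the same relation sizes the pixel clock a display controller must sustain for a given mode.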

Section 2. Display Compression

  • Display Stream Compression (DSC)
  • Encoder and decoder design
  • Bandwidth reduction analysis
  • Visual quality metrics

Section 3. Video Decode Acceleration

  • H.264, H.265, VP9, and AV1 decoding
  • Bitstream parsing and entropy decoding
  • Motion compensation hardware
  • Parallel decode engines

Section 4. Video Encode Acceleration

  • Motion estimation logic
  • Rate control mechanisms
  • Entropy encoder design
  • Multi-format support

Section 5. Video Processing Pipeline

  • Scaling and filtering
  • Color space conversion
  • Deinterlacing and denoising
  • HDR tone mapping
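Color space conversion in this pipeline is a fixed 3x3 matrix plus offsets per pixel. A sketch using the BT.601 full-range coefficients (one common choice; BT.709 and limited-range variants use different constants):

```python
def rgb_to_ycbcr(r, g, b):
    """BT.601 full-range RGB -> YCbCr: luma is a weighted sum of the
    channels, chroma are offset differences centered at 128."""
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128
    cr =  0.5 * r - 0.418688 * g - 0.081312 * b + 128
    return y, cb, cr
```

In hardware this is typically a fixed-point multiply-add array with programmable coefficients so one datapath serves all the supported standards.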

Section 6. Verilog Implementation

  • Display timing generator
  • Video decoder FSM
  • Motion estimation module
  • Video pipeline testbench

Chapter 17. Interconnect and Communication

Section 1. On-Chip Networks

  • Mesh and crossbar topologies
  • Router architecture and buffering
  • Flow control and arbitration
  • Deadlock prevention

Section 2. Memory Crossbar

  • SM-to-memory partition links
  • Bandwidth scheduling
  • Virtual channels
  • QoS enforcement

Section 3. PCIe Interface

  • Protocol layer overview
  • DMA engine design
  • Peer-to-peer communication
  • Error handling

Section 4. High-Speed Serial Links

  • NVLink and Infinity Fabric
  • CXL and coherent interfaces
  • PHY design and equalization
  • Latency and throughput tuning

Section 5. Multi-GPU Communication

  • GPU-to-GPU transfers
  • Collective operations
  • Topology optimization
  • Scalability challenges

Section 6. Verilog Implementation

  • Crossbar switch module
  • Round-robin arbiter
  • PCIe transaction engine
  • NoC router and testbench

Chapter 18. Performance Analysis and Optimization

Section 1. Performance Metrics

  • Throughput (GFLOPS and TFLOPS)
  • Bandwidth utilization
  • Cache efficiency
  • Power and thermal metrics

Section 2. Bottleneck Identification

  • Memory-bound and compute-bound workloads
  • Latency versus bandwidth limits
  • Roofline analysis
  • Profiling methodology
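The roofline model this section uses reduces to one min(): attainable throughput is capped either by peak compute or by bandwidth times arithmetic intensity. A minimal sketch with illustrative units (GFLOP/s and GB/s):

```python
def attainable_gflops(arith_intensity, peak_gflops, peak_bw_gbps):
    """Roofline: min of the compute roof and the bandwidth slope,
    with arith_intensity in FLOPs per byte."""
    return min(peak_gflops, peak_bw_gbps * arith_intensity)

def is_memory_bound(arith_intensity, peak_gflops, peak_bw_gbps):
    """Below the ridge point (peak_gflops / peak_bw), bandwidth limits."""
    return arith_intensity < peak_gflops / peak_bw_gbps
```

Plotting a kernel's measured intensity against the two roofs immediately tells you whether to chase memory traffic or instruction throughput.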

Section 3. Performance Counters

  • Counter and sampler architecture
  • Multiplexing techniques
  • Key hardware metrics
  • PMU software interfaces

Section 4. Workload Characterization

  • Instruction mix and balance
  • Cache and memory patterns
  • Thread divergence statistics
  • Power behavior profiling

Section 5. Optimization Techniques

  • Occupancy tuning
  • Coalesced memory access
  • Shared memory utilization
  • Instruction scheduling
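Occupancy tuning amounts to finding which per-SM resource runs out first. A sketch with hypothetical per-SM limits (the defaults below are illustrative; real values vary by architecture and are not taken from this text):

```python
def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_threads=2048, reg_file=65536, smem_bytes=65536,
              max_blocks=32):
    """Fraction of an SM's thread slots filled: resident blocks are
    limited by threads, registers, shared memory, and the block cap."""
    limit_threads = max_threads // threads_per_block
    limit_regs = reg_file // (regs_per_thread * threads_per_block)
    limit_smem = smem_bytes // smem_per_block if smem_per_block else max_blocks
    blocks = min(limit_threads, limit_regs, limit_smem, max_blocks)
    return blocks * threads_per_block / max_threads
```

Doubling register use from 32 to 64 per thread halves occupancy in this model, which is the kind of tradeoff the scheduler and compiler negotiate.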

Section 6. Power and Thermal Management

  • Dynamic power reduction
  • Workload-based DVFS
  • Thermal throttling
  • Efficiency optimization

Chapter 19. Physical Design and Manufacturing

Section 1. Floorplanning

  • Die partitioning and hierarchy
  • Block placement and routing
  • Power grid and clock tree
  • Thermal optimization

Section 2. Synthesis and Timing Closure

  • RTL synthesis flow
  • Timing constraints and setup
  • Clock domain verification
  • Multi-mode optimization

Section 3. Place and Route

  • Placement algorithms
  • Routing congestion management
  • Signal integrity checks
  • IR drop and EM control

Section 4. Design for Test

  • Scan insertion
  • Built-in self-test
  • JTAG and boundary scan
  • Yield analysis

Section 5. Packaging Technologies

  • Flip-chip and BGA packaging
  • Thermal interface materials
  • Multi-chip modules
  • TSV-based stacking

Section 6. Advanced Integration

  • Chiplet architectures
  • Die-to-die interconnects
  • UCIe protocol
  • Heterogeneous integration

Chapter 20. Future Directions and Emerging Technologies

Section 1. Modern GPU Case Studies

  • NVIDIA Hopper and AMD RDNA3
  • Intel Arc and Apple GPU designs
  • Mobile GPUs (Mali, Adreno)
  • Design tradeoff comparisons

Section 2. Specialized AI Accelerators

  • Google TPU and Cerebras engine
  • Graphcore IPU and Groq LPU
  • Comparison with general GPUs
  • Domain-specific optimization

Section 3. Beyond Moore’s Law

  • Process scaling limits
  • GAA and CFET transistor evolution
  • Advanced packaging methods
  • Economic and design impact

Section 4. Emerging Memory Technologies

  • HBM4 and next-gen DRAM
  • Processing-in-memory concepts
  • Persistent and near-data memory
  • Memory-centric system design

Section 5. Novel Computing Paradigms

  • Neuromorphic and optical computing
  • Quantum-accelerated systems
  • Approximate computation
  • Stochastic and hybrid models

Section 6. Sustainability and Green Computing

  • Power efficiency trends
  • Carbon footprint reduction
  • Lifecycle optimization
  • Renewable-powered data centers

Section 7. Research Frontiers

  • AI-driven hardware design
  • Self-optimizing microarchitectures
  • Secure and open GPU initiatives
  • Future scalability challenges
