Chapter 1. GPU Architecture Principles
Section 1. GPU vs CPU Philosophy
- Throughput over latency
- Data parallelism vs instruction-level parallelism
- Scalability and energy efficiency
- Suitability for graphics and compute workloads
Section 2. SIMT Execution Model
- Threads and warps/wavefronts
- SIMD-style execution with per-thread state
- Divergence handling and reconvergence
- Warp scheduling and masking
Section 3. Modern GPU Overview
- NVIDIA SM, AMD CU, Intel Xe-core comparison
- Streaming multiprocessor internal organization
- Compute and graphics pipeline integration
- Role of schedulers and dispatch units
Section 4. Memory Hierarchy Snapshot
- Register files and local storage
- Shared/LDS memory
- L1, L2, and VRAM hierarchy
- Latency and bandwidth tradeoffs
Chapter 2. Digital Design for GPUs
Section 1. Verilog and SystemVerilog Essentials
- Modules, ports, and parameters
- Procedural and continuous assignments
- Always blocks and combinational logic
- Timing control and simulation
Section 2. Pipeline Design
- Pipeline depth and balancing
- Register placement and retiming
- Valid-ready handshaking
- Control and data path separation
Section 3. Clock Domain Crossing and Handshaking
- Metastability and synchronization
- FIFO-based CDC mechanisms
- Dual-clock RAM interfaces
- Gray-coded counters
Section 4. Floating-Point Units
- FP32, FP16, and BF16 support
- Fused multiply-add (FMA) structure
- Normalization and rounding
- Verification and corner-case handling
Section 5. Verification Basics
- Testbench architecture
- Assertions and functional coverage
- Waveform inspection and debugging
- Regression testing
Chapter 3. 3D Graphics Fundamentals
Section 1. Geometric Primitives
- Vertices, edges, and triangles
- Indexed and non-indexed primitives
- Vertex attributes and interpolation
- Coordinate spaces and transformations
Section 2. Transformations
- Model, view, and projection matrices
- Perspective vs orthographic projection
- Homogeneous coordinates
- Matrix composition order
Section 3. Rasterization
- Edge equations and barycentric coordinates
- Pixel coverage determination
- Sampling and antialiasing
- Tile-based versus immediate rendering
Section 4. Shading and Lighting
- Gouraud and Phong shading
- Lambertian reflection
- Specular highlights
- BRDF introduction
Section 5. Texture Mapping
- UV coordinate generation
- Filtering and mipmapping
- Texture addressing modes
- Texture compression formats
Chapter 4. System Architecture
Section 1. Clocking and Power
- Multi-domain clock networks
- Dynamic frequency scaling
- Clock gating and power gating
- Thermal limits and throttling
Section 2. Memory Interfaces
- GDDR6 and HBM3 architectures
- Channel width and data rates
- Memory controller scheduling
- ECC and error recovery
Section 3. PCIe and Interconnects
- PCIe generations and bandwidth
- NVLink and Infinity Fabric
- CXL and coherent interconnects
- Topologies for multi-GPU systems
Section 4. Command Processing
- DMA engines
- Queue management
- Workload submission
- Context switching and synchronization
Chapter 5. Vertex Processing and Primitive Assembly
Section 1. Vertex Fetch and Transformation
- Attribute fetch unit
- MVP matrix multiplication
- Viewport transformation
- Perspective division
Section 2. Clipping and Culling
- Frustum clipping logic
- Backface culling
- Guard band optimization
- Degenerate triangle handling
Section 3. Triangle Setup
- Edge equations
- Gradient and interpolation setup
- Subpixel precision
- Fixed-point arithmetic
Section 4. Verilog Implementation
- Matrix multiplier
- Clipping module
- Triangle setup unit
- Testbench for vertex stage
Chapter 6. Rasterization
Section 1. Tile-Based Scan Conversion
- Screen space subdivision
- Tile binning and culling
- Parallel raster pipelines
- Memory locality optimization
Section 2. Early-Z and Hierarchical-Z
- Depth pre-pass techniques
- Hierarchical Z-buffer design
- Early depth rejection
- Performance considerations
Section 3. Attribute Interpolation
- Perspective-correct interpolation
- Fixed-point vs floating-point gradients
- Interpolator pipeline stages
- Precision and rounding effects
Section 4. Variable Rate Shading
- Coarse pixel shading
- Shading rate maps
- Foveated rendering applications
- Bandwidth and power benefits
Section 5. Verilog Implementation
- Rasterizer core
- Edge function generator
- Z-test pipeline
- Tile buffer controller
Chapter 7. Fragment Processing and Texturing
Section 1. Texture Units
- Texture addressing and coordinate wrapping
- Filtering modes (nearest, bilinear, trilinear)
- Anisotropic filtering
- Mipmapping hardware
Section 2. Alpha Blending and Depth Testing
- Alpha test and discard
- Blending equations
- Depth comparison logic
- Write mask control
Section 3. Texture Compression
- BC and ASTC formats
- Block decoding pipeline
- Hardware decompression logic
- Compression ratio and quality tradeoffs
Section 4. Texture Cache Architecture
- Cache hierarchy for texture data
- Tag-data organization
- Replacement policies (LRU, random)
- Cache line size optimization
- Texture cache coherency
Section 5. Texture Unit Pipeline
- Address calculation stage
- Cache lookup and fetch
- Decompression stage
- Filtering computation
- Output buffering
Section 6. Lighting Calculations
- Ambient, diffuse, and specular components
- Normal and bump mapping
- Light source models
- Multi-light accumulation
Section 7. Verilog Implementation
- Texture address generator
- Bilinear filter module
- Texture cache controller
- Fragment shader datapath
- Texture pipeline testbench
Chapter 8. Shader Core Architecture
Section 1. Programmable Shader Overview
- Evolution from fixed-function to programmable
- Unified shader architecture
- Shader stages and types
- Program compilation and linkage
Section 2. Instruction Set Architecture
- Scalar vs vector ISA
- Arithmetic and transcendental instructions
- Memory access operations
- Control flow encoding
Section 3. Warp Scheduler
- Warp and wavefront execution
- Dependency scoreboarding
- Issue policies and priorities
- Latency hiding mechanisms
Section 4. Register File Design
- Banked register organization
- Read/write port design
- Register renaming
- Spilling and allocation
Section 5. Execution Units
- Integer and floating-point ALUs
- Special function units
- Load/store pipelines
- Data forwarding and bypassing
Section 6. Branch Divergence Handling
- Active mask tracking
- Divergence stack logic
- Reconvergence hardware
- Performance impact analysis
Section 7. Instruction Cache
- Cache organization and fetch width
- Branch prediction
- Instruction buffering
- Prefetch and alignment
Section 8. Verilog Implementation
- Warp scheduler FSM
- Banked register file
- ALU and FPU datapaths
- Divergence stack
- Shader core testbench
Chapter 9. Memory Subsystem Design
Section 1. Memory Architecture Overview
- GPU memory hierarchy
- Bandwidth and latency tradeoffs
- Parallel memory channels
- Performance balancing
Section 2. Coalescing and Memory Transactions
- Memory access coalescing
- Transaction formation
- Strided and misaligned access handling
- Alignment requirements
Section 3. Shared Memory and Local Data Share
- Banked architecture
- Bank conflict resolution
- Barrier synchronization
- Common programming patterns
Section 4. L1 Cache Design
- Set-associative structure
- Write-back vs write-through
- Coherency mechanisms
- Replacement policies
Section 5. L2 Cache Architecture
- Unified cache for all cores
- Partitioning and crossbar interconnect
- Victim cache optimization
- Cache slice arbitration
Section 6. Memory Controller
- DRAM command generation
- Row buffer management
- Bank interleaving
- QoS scheduling
Section 7. Atomic Operations
- Read-modify-write primitives
- Memory ordering rules
- Atomic unit architecture
- Performance tradeoffs
Section 8. Verilog Implementation
- Coalescing unit
- Banked shared memory
- Set-associative cache
- Memory controller FSM
- Memory system testbench
Chapter 10. Render Output Pipeline
Section 1. Depth and Stencil Testing
- Z-buffer algorithm
- Depth and stencil compare functions
- Early-Z and late-Z pipelines
- Depth bounds optimization
Section 2. Blending Operations
- Alpha blending modes
- Dual-source blending
- Logical pixel operations
- Independent render target blending
Section 3. Color Compression
- Delta color compression (DCC)
- Fast clear optimization
- Metadata management
- Compression efficiency
Section 4. Render Target Cache
- Color and depth cache structure
- Tile-based write combining
- Compression integration
- Eviction policy design
Section 5. Multi-Sample Anti-Aliasing
- Sampling positions and coverage masks
- Centroid and resolve operations
- Bandwidth considerations
- Quality and performance balance
Section 6. Framebuffer Organization
- Linear and tiled layouts
- Swizzling and Z-order curves
- Multi-render target support
- Memory access optimization
Section 7. Verilog Implementation
- Depth test module
- Blending unit
- ROP cache controller
- MSAA resolve logic
- ROP testbench
Chapter 11. Compute Architecture
Section 1. GPGPU Programming Model
- Kernels, threads, and work-groups
- Hierarchical execution model
- Global and shared memory scopes
- Synchronization mechanisms
Section 2. Compute Dispatch
- Kernel launch process
- Command packet format
- Work-group distribution
- Concurrent kernel execution
Section 3. Occupancy and Resource Management
- Register pressure and allocation
- Shared memory partitioning
- Warp and block occupancy
- Resource scheduling
Section 4. Work Distribution
- Static and dynamic scheduling
- Load balancing algorithms
- Persistent threads model
- Cooperative group synchronization
Section 5. Data Parallel Processing Patterns
- Map, reduce, and scan operations
- Histogram and sort algorithms
- Matrix operations
- Prefix sums and reductions
Section 6. Verilog Implementation
- Compute dispatch unit
- Work-group scheduler
- Resource allocator
- Barrier synchronization logic
- Compute kernel testbench
