Chapter 1. Introduction to GPU Assembly Languages
Section 1. The Evolution of GPU Programming
- From fixed-function pipelines to programmable shaders to GPGPU
- The CUDA revolution and abstraction layers
Section 2. Why PTX and SASS Matter
- Performance ceilings of high-level abstractions
- Case study: 10x speedup in production ML inference
- When to drop down to assembly
Section 3. PTX: The Portable Intermediate Representation
- Virtual ISA design philosophy
- Role in CUDA compilation pipeline (nvcc → PTX → SASS)
- Forward compatibility and JIT compilation
Section 4. SASS: The Hardware Truth
- Architecture-specific binary code
- PTX-to-SASS transformation (ptxas)
- Why SASS analysis matters for optimization
Section 5. PTX ISA Evolution: Versions 3.0 to 9.0
- Generational highlights (Kepler → Blackwell)
- PTX 9.0 features: cluster primitives, enhanced tensor ops
- Backward compatibility guarantees
Section 6. Comparison with Other GPU ISAs
- AMD GCN/RDNA/CDNA assembly
- Intel Xe ISA
- Apple GPU assembly (brief overview)
- Portability considerations
Chapter 2. Quick Start – Your First Optimization
Section 1. Case Study Setup: Naive Vector Addition
- CUDA C++ baseline implementation
- Performance measurement methodology
- Profiling with Nsight Compute
Section 2. Step 1: Examining Generated PTX
- Disassembly walkthrough
- Identifying optimization opportunities
- Memory access patterns
Section 3. Step 2: Hand-Optimized PTX Version
- Vectorized loads/stores (ld.v4, st.v4)
- Unrolling and instruction-level parallelism
- Complete annotated code listing
Section 4. Step 3: Understanding the SASS
- How ptxas transformed your PTX
- Register allocation analysis
- Instruction scheduling insights
Section 5. Performance Results and Analysis
- Benchmark across architectures (Volta, Ampere, Hopper)
- Bandwidth utilization
- When hand-optimization matters vs compiler sufficiency
Section 6. Lessons Learned and Best Practices
- Development workflow for PTX/SASS optimization
- Tools and iteration strategies
Chapter 3. The PTX Programming Model
Section 1. The GPU as a Massively Parallel Coprocessor
- Host-device interaction model
- Asynchronous execution and streams
- Kernel launch mechanics
Section 2. Thread Hierarchy Fundamentals
- Threads and SIMT execution
- Cooperative Thread Arrays (CTAs)
- Thread Block Clusters
- Grids and multi-device scaling
Section 3. Memory Hierarchy Deep Dive
- State spaces overview
- Memory performance characteristics
- Memory consistency model
Section 4. Execution Semantics
- Thread scheduling and warp execution
- Divergence and reconvergence mechanisms
- Predication vs branching
- Independent thread scheduling
Chapter 4. The PTX Machine Model
Section 1. GPU Architecture Primer
- Streaming Multiprocessors structure
- CUDA cores, tensor cores, special function units
- Evolution across architectures
Section 2. Warp Schedulers and Execution
- Warp scheduling policies
- Instruction issue and dual-issue capabilities
- Scoreboarding and dependency tracking
Section 3. Register Files and Predication
- Register file organization
- Predicate registers and execution masks
- Register pressure and occupancy tradeoffs
Section 4. Memory Subsystem Architecture
- Load/store units
- Texture units and TMA
- Memory controller and DRAM access
- NVLink and inter-GPU communication
Section 5. Cluster-Level Hardware
- Distributed shared memory hardware
- Cluster barriers implementation
- Performance implications
Section 6. From PTX to SASS: The Compilation Process
- ptxas optimization passes
- Instruction selection and scheduling
- Register allocation strategies
- Tools: nvdisasm, cuobjdump, nvcc flags
Section 7. SASS Instruction Encoding
- Binary format overview
- Opcode structure
- Why SASS differs across architectures
Chapter 5. PTX Syntax Fundamentals
Section 1. Source Format and Structure
- Module organization
- Version directives (.version)
- Target specifications (.target, .address_size)
- Entry points (.entry, .func)
Section 2. Comments and Readability
- Single-line and block comments
- Documentation conventions
Section 3. Statements and Directives
- Instruction format
- Labels and control flow targets
- Pragmas and compiler hints
Section 4. Identifiers and Naming
- Register naming conventions
- User-defined identifiers
- Special registers
Section 5. Constants and Literals
- Integer constants
- Floating-point constants
- Vector and tensor constants
- String literals
Section 6. Basic Assembly Example
- Complete minimal PTX module walkthrough
- Line-by-line explanation
Chapter 6. State Spaces, Types, and Variables
Section 1. State Space Deep Dive
- Register space (.reg)
- Shared memory (.shared)
- Global memory (.global)
- Constant memory (.const)
- Local memory (.local)
- Texture and surface memory
- Parameter space (.param)
Section 2. Data Types in Depth
- Fundamental types
- Vector types (.v2, .v4)
- Specialized types
Section 3. Variable Declarations
- Syntax and alignment attributes
- Array declarations
- Initialization
Section 4. Tensor Types and Memory Layouts
- NDHWC and NCDHW formats
- Tensor descriptor objects
- Interleave modes
- Swizzling patterns
Section 5. Type Qualifiers and Attributes
- Volatile and cache control hints
- Relaxed consistency
Chapter 7. Instruction Operands and Addressing
Section 1. Operand Types and Constraints
- Source and destination operands
- Type compatibility rules
- Immediate values
Section 2. Addressing Modes
- Register direct
- Register indirect
- Immediate addressing
Section 3. Array and Vector Addressing
- Element indexing syntax
- Strided access patterns
- Vector unpacking operations
Section 4. Special Address Computation
- Pointer arithmetic in PTX
- Address register usage
- Generic pointers and space casting
Section 5. Type Conversion Instructions
- Conversion between types
- Rounding modes
- Saturation modes
- Sign and zero extension
Section 6. Predication Mechanics
- Predicate register usage
- Conditional execution
- Predicate combination operations
Section 7. Inline PTX in CUDA C++
- asm volatile syntax
- Operand constraints
- Register allocation interaction
- Best practices and pitfalls
Chapter 8. Core Instruction Set
Section 1. Data Movement Instructions
- Load instructions
- Store instructions
- Move instructions
Section 2. Arithmetic Instructions
- Integer arithmetic
- Floating-point arithmetic
- Special math functions
- Multiply-add patterns
Section 3. Comparison and Selection
- Set predicate (setp)
- Select operations (selp, slct)
- Set register (set)
Section 4. Logic and Bit Manipulation
- and, or, xor, not
- Shift operations
- Bit reversal (brev)
- Bit field operations (bfe, bfi)
- Population count (popc)
- Leading/trailing zeros (clz)
Section 5. SASS Equivalents and Performance
- Instruction latency and throughput
- Pipeline utilization
- Opcode mapping examples
Chapter 9. Control Flow Instructions
Section 1. Branching Fundamentals
- Unconditional branch
- Conditional branch
- Uniform vs divergent branches
Section 2. Function Calls
- Call and return
- Device function optimization
Section 3. Synchronization Primitives
- Barrier synchronization
- Memory barriers
- Cluster barriers
Section 4. Exceptions and Debugging
- Trap
- Breakpoint
- Exit
Section 5. Advanced Control Patterns
- Loop construction
- Switch statements
Section 6. SASS Control Flow Analysis
- Branch target encoding
- Predication vs masking
- Reconvergence stack
Chapter 10. Memory Patterns and Advanced Layouts
Section 1. Coalescing Deep Dive
- Transaction sizes
- Perfect vs partial coalescing
- Misaligned penalties
Section 2. Shared Memory Bank Conflicts
- Bank structure
- Conflict-free patterns
- Multi-way conflicts
Section 3. Tiling and Blocking Strategies
- Matrix multiplication tiling
- Cache blocking
- Register tiling
Section 4. Tensor Memory Access Patterns
- NHWC vs NCHW layouts
- Tensor core layout requirements
- Im2col and Winograd layouts
Section 5. Swizzling Patterns
- XOR-based swizzling
- Tensor operation applications
Section 6. TMA (Tensor Memory Accelerator)
- Descriptor creation
- Async operations
- Performance advantages
Section 7. Software Pipelining
- Double and triple buffering
- Async copy overlap with compute
- Barrier phasing
Chapter 11. Tensor and Matrix Operations
Section 1. Introduction to Tensor Cores
- Hardware architecture and capabilities
- Mixed-precision compute
- Throughput advantages
Section 2. Matrix Multiply-Accumulate (mma)
- MMA instruction formats
- Complete GEMM example
- MMA performance analysis
Section 3. Warpgroup MMA (wgmma)
- Larger operation sizes
- Descriptor-based interface
- Async execution model
- Performance comparison
Section 4. Tensor Map Objects
- Tensor descriptor creation
- Tensor load/store operations
Section 5. Convolution-Specific Features
- Im2col mode
- Direct convolution optimization
Section 6. Specialized Tensor Operations
- Reductions
- Transpose and permute
- Broadcast patterns