Chapter 1. Introduction to GPU Assembly Languages
Section 1. The Evolution of GPU Programming
- From fixed-function pipelines to programmable shaders to GPGPU
- The CUDA revolution and abstraction layers
Section 2. Why PTX and SASS Matter
- Performance ceilings of high-level abstractions
- Case study: 10x speedup in production ML inference
- When to drop down to assembly
Section 3. PTX: The Portable Intermediate Representation
- Virtual ISA design philosophy
- Role in CUDA compilation pipeline (nvcc → PTX → SASS)
- Forward compatibility and JIT compilation
Section 4. SASS: The Hardware Truth
- Architecture-specific binary code
- PTX-to-SASS transformation (ptxas)
- Why SASS analysis matters for optimization
Section 5. PTX ISA Evolution: Versions 3.0 to 9.0
- Generational highlights (Kepler → Blackwell)
- PTX 9.0 features: cluster primitives, enhanced tensor ops
- Backward compatibility guarantees
Section 6. Comparison with Other GPU ISAs
- AMD GCN/RDNA/CDNA assembly
- Intel Xe ISA
- Apple GPU assembly (brief overview)
- Portability considerations
Chapter 2. Quick Start – Your First Optimization
Section 1. Case Study Setup: Naive Vector Addition
- CUDA C++ baseline implementation
- Performance measurement methodology
- Profiling with Nsight Compute
Section 2. Step 1: Examining Generated PTX
- Disassembly walkthrough
- Identifying optimization opportunities
- Memory access patterns
Section 3. Step 2: Hand-Optimized PTX Version
- Vectorized loads/stores (ld.v4, st.v4)
- Unrolling and instruction-level parallelism
- Complete annotated code listing
Section 4. Step 3: Understanding the SASS
- How ptxas transformed your PTX
- Register allocation analysis
- Instruction scheduling insights
Section 5. Performance Results and Analysis
- Benchmark across architectures (Volta, Ampere, Hopper)
- Bandwidth utilization
- When hand-optimization matters vs compiler sufficiency
Section 6. Lessons Learned and Best Practices
- Development workflow for PTX/SASS optimization
- Tools and iteration strategies
Chapter 3. The PTX Programming Model
Section 1. The GPU as a Massively Parallel Coprocessor
- Host-device interaction model
- Asynchronous execution and streams
- Kernel launch mechanics
Section 2. Thread Hierarchy Fundamentals
- Threads and SIMT execution
- Cooperative Thread Arrays (CTAs)
- Thread Block Clusters
- Grids and multi-device scaling
Section 3. Memory Hierarchy Deep Dive
- State spaces overview
- Memory performance characteristics
- Memory consistency model
Section 4. Execution Semantics
- Thread scheduling and warp execution
- Divergence and reconvergence mechanisms
- Predication vs branching
- Independent thread scheduling
Chapter 4. The PTX Machine Model
Section 1. GPU Architecture Primer
- Streaming Multiprocessors structure
- CUDA cores, tensor cores, special function units
- Evolution across architectures
Section 2. Warp Schedulers and Execution
- Warp scheduling policies
- Instruction issue and dual-issue capabilities
- Scoreboarding and dependency tracking
Section 3. Register Files and Predication
- Register file organization
- Predicate registers and execution masks
- Register pressure and occupancy tradeoffs
Section 4. Memory Subsystem Architecture
- Load/store units
- Texture units and TMA
- Memory controller and DRAM access
- NVLink and inter-GPU communication
Section 5. Cluster-Level Hardware
- Distributed shared memory hardware
- Cluster barriers implementation
- Performance implications
Section 6. From PTX to SASS: The Compilation Process
- ptxas optimization passes
- Instruction selection and scheduling
- Register allocation strategies
- Tools: nvdisasm, cuobjdump, nvcc flags
Section 7. SASS Instruction Encoding
- Binary format overview
- Opcode structure
- Why SASS differs across architectures
Chapter 5. PTX Syntax Fundamentals
Section 1. Source Format and Structure
- Module organization
- Version directives (.version)
- Target specifications (.target, .address_size)
- Entry points (.entry, .func)
Section 2. Comments and Readability
- Single-line and block comments
- Documentation conventions
Section 3. Statements and Directives
- Instruction format
- Labels and control flow targets
- Pragmas and compiler hints
Section 4. Identifiers and Naming
- Register naming conventions
- User-defined identifiers
- Special registers
Section 5. Constants and Literals
- Integer constants
- Floating-point constants
- Vector and tensor constants
- String literals
Section 6. Basic Assembly Example
- Complete minimal PTX module walkthrough
- Line-by-line explanation
Chapter 6. State Spaces, Types, and Variables
Section 1. State Space Deep Dive
- Register space (.reg)
- Shared memory (.shared)
- Global memory (.global)
- Constant memory (.const)
- Local memory (.local)
- Texture and surface memory
- Parameter space (.param)
Section 2. Data Types in Depth
- Fundamental types
- Vector types (.v2, .v4)
- Specialized types
Section 3. Variable Declarations
- Syntax and alignment attributes
- Array declarations
- Initialization
Section 4. Tensor Types and Memory Layouts
- NDHWC and NCDHW formats
- Tensor descriptor objects
- Interleave modes
- Swizzling patterns
Section 5. Type Qualifiers and Attributes
- Volatile and cache control hints
- Relaxed consistency
Chapter 7. Instruction Operands and Addressing
Section 1. Operand Types and Constraints
- Source and destination operands
- Type compatibility rules
- Immediate values
Section 2. Addressing Modes
- Register direct
- Register indirect
- Immediate addressing
Section 3. Array and Vector Addressing
- Element indexing syntax
- Strided access patterns
- Vector unpacking operations
Section 4. Special Address Computation
- Pointer arithmetic in PTX
- Address register usage
- Generic pointers and space casting
Section 5. Type Conversion Instructions
- Conversion between types
- Rounding modes
- Saturation modes
- Sign and zero extension
Section 6. Predication Mechanics
- Predicate register usage
- Conditional execution
- Predicate combination operations
Section 7. Inline PTX in CUDA C++
- asm volatile syntax
- Operand constraints
- Register allocation interaction
- Best practices and pitfalls
Chapter 8. Core Instruction Set
Section 1. Data Movement Instructions
- Load instructions
- Store instructions
- Move instructions
Section 2. Arithmetic Instructions
- Integer arithmetic
- Floating-point arithmetic
- Special math functions
- Multiply-add patterns
Section 3. Comparison and Selection
- Set predicate (setp)
- Select operations (selp, slct)
- Set register (set)
Section 4. Logic and Bit Manipulation
- and, or, xor, not
- Shift operations
- Bit reversal (brev)
- Bit field operations (bfe, bfi)
- Population count (popc)
- Leading/trailing zeros (clz)
Section 5. SASS Equivalents and Performance
- Instruction latency and throughput
- Pipeline utilization
- Opcode mapping examples
Chapter 9. Control Flow Instructions
Section 1. Branching Fundamentals
- Unconditional branch
- Conditional branch
- Uniform vs divergent branches
Section 2. Function Calls
- Call and return
- Device function optimization
Section 3. Synchronization Primitives
- Barrier synchronization
- Memory barriers
- Cluster barriers
Section 4. Exceptions and Debugging
- Trap
- Breakpoint
- Exit
Section 5. Advanced Control Patterns
- Loop construction
- Switch statements
Section 6. SASS Control Flow Analysis
- Branch target encoding
- Predication vs masking
- Reconvergence stack
Chapter 10. Memory Patterns and Advanced Layouts
Section 1. Coalescing Deep Dive
- Transaction sizes
- Perfect vs partial coalescing
- Misaligned penalties
Section 2. Shared Memory Bank Conflicts
- Bank structure
- Conflict-free patterns
- Multi-way conflicts
Section 3. Tiling and Blocking Strategies
- Matrix multiplication tiling
- Cache blocking
- Register tiling
Section 4. Tensor Memory Access Patterns
- NHWC vs NCHW layouts
- Tensor core layout requirements
- Im2col and Winograd layouts
Section 5. Swizzling Patterns
- XOR-based swizzling
- Tensor operation applications
Section 6. TMA (Tensor Memory Accelerator)
- Descriptor creation
- Async operations
- Performance advantages
Section 7. Software Pipelining
- Double and triple buffering
- Async copy overlap with compute
- Barrier phasing
Chapter 11. Tensor and Matrix Operations
Section 1. Introduction to Tensor Cores
- Hardware architecture and capabilities
- Mixed-precision compute
- Throughput advantages
Section 2. Matrix Multiply-Accumulate (mma)
- MMA instruction formats
- Complete GEMM example
- MMA performance analysis
Section 3. Warpgroup MMA (wgmma)
- Larger operation sizes
- Descriptor-based interface
- Async execution model
- Performance comparison
Section 4. Tensor Map Objects
- Tensor descriptor creation
- Tensor load/store operations
Section 5. Convolution-Specific Features
- Im2col mode
- Direct convolution optimization
Section 6. Specialized Tensor Operations
- Reductions
- Transpose and permute
- Broadcast patterns