Mastering PTX and SASS

Volume I — The PTX Language and Architecture Foundations

If you’ve ever wondered why your GPU code hits a wall long before the hardware’s limits, this book tells you why—and how to break through it.

Most programmers stop where the compiler starts. They trust nvcc to make the right decisions, to manage registers, to schedule instructions, and to use memory efficiently. But the compiler doesn’t know your problem. It guesses. And in GPU computing, guessing costs performance.

About the Book

Mastering PTX and SASS – Volume I pulls back the curtain on NVIDIA’s virtual machine—the PTX instruction set that every CUDA kernel becomes before it touches silicon. You’ll learn how threads, warps, and memory really behave at the hardware level, how each instruction interacts with caches and pipelines, and how to read, write, and reason about PTX like an architect, not just a coder.
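To give a taste of what that layer looks like, here is a minimal hand-written PTX kernel that increments one integer per thread. This is an illustrative sketch, not code from the book: the entry name add_one, the sm_70 target, and the register names are all assumptions made for the example.

```ptx
.version 7.0
.target sm_70
.address_size 64

.visible .entry add_one(
    .param .u64 p_data          // pointer passed from the kernel launch
)
{
    .reg .u32   %r<3>;
    .reg .u64   %rd<5>;

    ld.param.u64        %rd1, [p_data];
    cvta.to.global.u64  %rd2, %rd1;     // convert to a global-space address
    mov.u32             %r1, %tid.x;    // this thread's index in the block
    mul.wide.u32        %rd3, %r1, 4;   // byte offset for 4-byte elements
    add.u64             %rd4, %rd2, %rd3;
    ld.global.u32       %r2, [%rd4];
    add.u32             %r2, %r2, 1;    // the actual work: element += 1
    st.global.u32       [%rd4], %r2;
    ret;
}
```

Every line above maps to concepts the book covers: state spaces, typed registers, special registers like %tid, and address-space conversion.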

This isn’t a surface-level “how-to.” It’s a deep, methodical tour through the machinery of modern GPUs—built for professionals who want measurable, repeatable speedups, not guesswork. You’ll discover how the compiler transforms your high-level logic into executable reality, and where you can step in to take control.

By the time you finish, you won’t be relying on compiler magic. You’ll understand it, improve it, and surpass it.

Mastering PTX and SASS – Volume I gives you the foundation; Volume II takes you to the bleeding edge of optimization. Together, they turn GPU performance from a mystery into a science.

About the Author

Gareth Morgan Thomas

Gareth Morgan Thomas is an expert across multiple STEM fields. He holds six university diplomas in electronics, software development, web development, and project management, along with qualifications in computer networking, CAD, diesel engineering, well drilling, and welding, giving him a robust foundation of technical knowledge.

Educated in Auckland, New Zealand, he also spent three years serving in the New Zealand Army, where he honed his discipline and problem-solving skills. He is now dedicated to sharing his understanding of science, technology, engineering, and mathematics through a series of specialized books aimed at both beginners and advanced learners.

Table of Contents

Chapter 1. Introduction to GPU Assembly Languages

Section 1. The Evolution of GPU Programming

  • From fixed-function pipelines to programmable shaders to GPGPU
  • The CUDA revolution and abstraction layers

Section 2. Why PTX and SASS Matter

  • Performance ceilings of high-level abstractions
  • Case study: 10x speedup in production ML inference
  • When to drop down to assembly

Section 3. PTX: The Portable Intermediate Representation

  • Virtual ISA design philosophy
  • Role in CUDA compilation pipeline (nvcc → PTX → SASS)
  • Forward compatibility and JIT compilation
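The nvcc → PTX → SASS pipeline described above can be inspected directly with the CUDA toolchain. The commands below are a sketch: kernel.cu is a placeholder file name, and sm_80/compute_80 are example architecture choices.

```
# Emit PTX for a virtual architecture
nvcc -arch=compute_80 -ptx kernel.cu -o kernel.ptx

# Compile to an architecture-specific cubin, then disassemble the SASS
nvcc -arch=sm_80 -cubin kernel.cu -o kernel.cubin
cuobjdump -sass kernel.cubin      # or: nvdisasm kernel.cubin
```

Comparing the .ptx output against the SASS disassembly is the basic workflow the rest of the book builds on.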

Section 4. SASS: The Hardware Truth

  • Architecture-specific binary code
  • PTX-to-SASS transformation (ptxas)
  • Why SASS analysis matters for optimization

Section 5. PTX ISA Evolution: Versions 3.0 to 9.0

  • Generational highlights (Kepler → Blackwell)
  • PTX 9.0 features: cluster primitives, enhanced tensor ops
  • Backward compatibility guarantees

Section 6. Comparison with Other GPU ISAs

  • AMD GCN/RDNA/CDNA assembly
  • Intel Xe ISA
  • Apple GPU assembly (brief overview)
  • Portability considerations

Chapter 2. Quick Start – Your First Optimization

Section 1. Case Study Setup: Naive Vector Addition

  • CUDA C++ baseline implementation
  • Performance measurement methodology
  • Profiling with Nsight Compute

Section 2. Step 1: Examining Generated PTX

  • Disassembly walkthrough
  • Identifying optimization opportunities
  • Memory access patterns

Section 3. Step 2: Hand-Optimized PTX Version

  • Vectorized loads/stores (ld.v4, st.v4)
  • Unrolling and instruction-level parallelism
  • Complete annotated code listing
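The kind of vectorized access this section builds toward looks like the following PTX fragment (an illustrative sketch; register names are assumptions, and the pointers in %rd1 and %rd2 must be 16-byte aligned for v4 accesses):

```ptx
// One 128-bit transaction instead of four separate 32-bit loads
ld.global.v4.f32  {%f1, %f2, %f3, %f4}, [%rd1];

// ... compute on %f1 through %f4 ...

// Matching 128-bit store
st.global.v4.f32  [%rd2], {%f1, %f2, %f3, %f4};
```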

Section 4. Step 3: Understanding the SASS

  • How ptxas transformed your PTX
  • Register allocation analysis
  • Instruction scheduling insights

Section 5. Performance Results and Analysis

  • Benchmark across architectures (Volta, Ampere, Hopper)
  • Bandwidth utilization
  • When hand-optimization matters vs compiler sufficiency

Section 6. Lessons Learned and Best Practices

  • Development workflow for PTX/SASS optimization
  • Tools and iteration strategies

Chapter 3. The PTX Programming Model

Section 1. The GPU as Massively Parallel Coprocessor

  • Host-device interaction model
  • Asynchronous execution and streams
  • Kernel launch mechanics

Section 2. Thread Hierarchy: Foundation

  • Threads and SIMT execution
  • Cooperative Thread Arrays
  • Thread Block Clusters
  • Grids and Multi-Device Scaling

Section 3. Memory Hierarchy Deep Dive

  • State spaces overview
  • Memory performance characteristics
  • Memory consistency model

Section 4. Execution Semantics

  • Thread scheduling and warp execution
  • Divergence and reconvergence mechanisms
  • Predication vs branching
  • Independent thread scheduling

Chapter 4. The PTX Machine Model

Section 1. GPU Architecture Primer

  • Streaming Multiprocessors structure
  • CUDA cores, tensor cores, special function units
  • Evolution across architectures

Section 2. Warp Schedulers and Execution

  • Warp scheduling policies
  • Instruction issue and dual-issue capabilities
  • Scoreboarding and dependency tracking

Section 3. Register Files and Predication

  • Register file organization
  • Predicate registers and execution masks
  • Register pressure and occupancy tradeoffs

Section 4. Memory Subsystem Architecture

  • Load/store units
  • Texture units and TMA
  • Memory controller and DRAM access
  • NVLink and inter-GPU communication

Section 5. Cluster-Level Hardware

  • Distributed shared memory hardware
  • Cluster barriers implementation
  • Performance implications

Section 6. From PTX to SASS: The Compilation Process

  • ptxas optimization passes
  • Instruction selection and scheduling
  • Register allocation strategies
  • Tools: nvdisasm, cuobjdump, nvcc flags

Section 7. SASS Instruction Encoding

  • Binary format overview
  • Opcode structure
  • Why SASS differs across architectures

Chapter 5. PTX Syntax Fundamentals

Section 1. Source Format and Structure

  • Module organization
  • Version directives (.version)
  • Target specifications (.target, .address_size)
  • Entry points (.entry, .func)

Section 2. Comments and Readability

  • Single-line and block comments
  • Documentation conventions

Section 3. Statements and Directives

  • Instruction format
  • Labels and control flow targets
  • Pragmas and compiler hints

Section 4. Identifiers and Naming

  • Register naming conventions
  • User-defined identifiers
  • Special registers

Section 5. Constants and Literals

  • Integer constants
  • Floating-point constants
  • Vector and tensor constants
  • String literals

Section 6. Basic Assembly Example

  • Complete minimal PTX module walkthrough
  • Line-by-line explanation

Chapter 6. State Spaces, Types, and Variables

Section 1. State Space Deep Dive

  • Register space (.reg)
  • Shared memory (.shared)
  • Global memory (.global)
  • Constant memory (.const)
  • Local memory (.local)
  • Texture and surface memory
  • Parameter space (.param)

Section 2. Data Types Comprehensive

  • Fundamental types
  • Vector types (.v2, .v4)
  • Specialized types

Section 3. Variable Declarations

  • Syntax and alignment attributes
  • Array declarations
  • Initialization

Section 4. Tensor Types and Memory Layouts

  • NDHWC and NCDHW formats
  • Tensor descriptor objects
  • Interleave modes
  • Swizzling patterns

Section 5. Type Qualifiers and Attributes

  • Volatile and cache control hints
  • Relaxed consistency

Chapter 7. Instruction Operands and Addressing

Section 1. Operand Types and Constraints

  • Source and destination operands
  • Type compatibility rules
  • Immediate values

Section 2. Addressing Modes

  • Register direct
  • Register indirect
  • Immediate addressing

Section 3. Array and Vector Addressing

  • Element indexing syntax
  • Strided access patterns
  • Vector unpacking operations

Section 4. Special Address Computation

  • Pointer arithmetic in PTX
  • Address register usage
  • Generic pointers and space casting

Section 5. Type Conversion Instructions

  • Conversion between types
  • Rounding modes
  • Saturation modes
  • Sign and zero extension

Section 6. Predication Mechanics

  • Predicate register usage
  • Conditional execution
  • Predicate combination operations

Section 7. Inline PTX in CUDA C++

  • Asm volatile syntax
  • Operand constraints
  • Register allocation interaction
  • Best practices and pitfalls
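As a preview of the inline-PTX mechanism this section covers, here is a small CUDA C++ device function that wraps a single PTX instruction. The helper name add_sat is illustrative; the constraint letters are standard CUDA inline-assembly constraints.

```cuda
// Saturating 32-bit add exposed to CUDA C++ via inline PTX.
__device__ int add_sat(int a, int b) {
    int r;
    asm("add.sat.s32 %0, %1, %2;"   // PTX template; %0..%2 are numbered operands
        : "=r"(r)                   // "=r": write result to a 32-bit register
        : "r"(a), "r"(b));          // "r": read inputs from 32-bit registers
    return r;
}
```

Because this asm statement declares an output and has no side effects, the volatile qualifier is unnecessary here; volatile is needed when the statement must not be optimized away or reordered.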

Chapter 8. Core Instruction Set

Section 1. Data Movement Instructions

  • Load instructions
  • Store instructions
  • Move instructions

Section 2. Arithmetic Instructions

  • Integer arithmetic
  • Floating-point arithmetic
  • Special math functions
  • Multiply-add patterns

Section 3. Comparison and Selection

  • Set predicate
  • Select operations
  • Set register

Section 4. Logic and Bit Manipulation

  • and, or, xor, not
  • Shift operations
  • Bit reversal
  • Bit field operations
  • Population count
  • Leading/trailing zeros

Section 5. SASS Equivalents and Performance

  • Instruction latency and throughput
  • Pipeline utilization
  • Opcode mapping examples

Chapter 9. Control Flow Instructions

Section 1. Branching Fundamentals

  • Unconditional branch
  • Conditional branch
  • Uniform vs divergent branches

Section 2. Function Calls

  • Call and return
  • Device function optimization

Section 3. Synchronization Primitives

  • Barrier synchronization
  • Memory barriers
  • Cluster barriers

Section 4. Exception and Debugging

  • Trap
  • Breakpoint
  • Exit

Section 5. Advanced Control Patterns

  • Loop construction
  • Switch statements

Section 6. SASS Control Flow Analysis

  • Branch target encoding
  • Predication vs masking
  • Reconvergence stack

Chapter 10. Memory Patterns and Advanced Layouts

Section 1. Coalescing Deep Dive

  • Transaction sizes
  • Perfect vs partial coalescing
  • Misaligned penalties

Section 2. Shared Memory Bank Conflicts

  • Bank structure
  • Conflict-free patterns
  • Multi-way conflicts

Section 3. Tiling and Blocking Strategies

  • Matrix multiplication tiling
  • Cache blocking
  • Register tiling

Section 4. Tensor Memory Access Patterns

  • NHWC vs NCHW layouts
  • Tensor core layout requirements
  • Im2col and Winograd layouts

Section 5. Swizzling Patterns

  • XOR-based swizzling
  • Tensor operation applications

Section 6. TMA (Tensor Memory Accelerator)

  • Descriptor creation
  • Async operations
  • Performance advantages

Section 7. Software Pipelining

  • Double and triple buffering
  • Async copy overlap with compute
  • Barrier phasing

Chapter 11. Tensor and Matrix Operations

Section 1. Introduction to Tensor Cores

  • Hardware architecture and capabilities
  • Mixed-precision compute
  • Throughput advantages

Section 2. Matrix Multiply-Accumulate (mma)

  • MMA instruction formats
  • Complete GEMM example
  • MMA performance analysis

Section 3. Warpgroup MMA (wgmma)

  • Larger operation sizes
  • Descriptor-based interface
  • Async execution model
  • Performance comparison

Section 4. Tensor Map Objects

  • Tensor descriptor creation
  • Tensor load/store operations

Section 5. Convolution-Specific Features

  • Im2col mode
  • Direct convolution optimization

Section 6. Specialized Tensor Operations

  • Reductions
  • Transpose and permute
  • Broadcast patterns
