Mastering PTX and SASS

Volume II — Optimization, SASS, and Advanced Techniques


Minimum price: $19.00
Suggested price: $29.00

Format: PDF
About the Book

You’ve mastered the architecture—now it’s time to own the performance.

Every GPU developer hits the same wall: the profiler says you’re close to peak, but you know there’s still headroom. What’s missing isn’t another compiler flag—it’s visibility into the hardware’s final truth. That truth lives in SASS, the real machine code running on NVIDIA GPUs.

Mastering PTX and SASS – Volume II takes you past theory into the territory where nanoseconds matter. Here you’ll learn how to read, analyze, and tune instruction streams with surgical precision. You’ll uncover how schedulers pair ops, how register pressure throttles throughput, and how to turn your kernels into clock-cycle-balanced engines of pure efficiency.

This book is for engineers who refuse to settle for “good enough.” It turns profiling, disassembly, and optimization into a repeatable process—one grounded in data, not superstition. From tensor cores to warp shuffles, from atomic operations to multi-GPU scaling, you’ll learn how real experts bend hardware to their will.

Volume I built the foundation; Volume II shows you how to weaponize it.
If you’re ready to squeeze every drop of performance from your GPU—and understand exactly how you did it—this is the manual you’ve been waiting for.



About the Author

Gareth Morgan Thomas

Gareth Morgan Thomas brings extensive expertise across multiple STEM fields. He holds six university diplomas spanning electronics, software development, web development, and project management, along with qualifications in computer networking, CAD, diesel engineering, well drilling, and welding.

Educated in Auckland, New Zealand, he also spent three years serving in the New Zealand Army, where he honed his discipline and problem-solving skills. After years of technical training, he is now dedicated to sharing his understanding of science, technology, engineering, and mathematics through a series of specialized books aimed at both beginners and advanced learners.


Table of Contents

Chapter 12. Warp-Level Primitives

Section 1. Warp Shuffle Instructions

  • Variants (shfl.sync.idx, shfl.sync.up, shfl.sync.down, shfl.sync.bfly)
  • Data sharing patterns (reductions, prefix sum, broadcast, transpose)
  • Shuffle performance (latency, divergence considerations)
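As a taste of the data-sharing patterns this section covers, here is a minimal sketch of a shuffle-based warp reduction in CUDA C++ (the function name is illustrative, not from the book):

```cuda
// Warp-level sum reduction using shuffle-down.
// Assumes all 32 lanes of the warp are active (mask 0xffffffff).
__device__ float warpReduceSum(float val) {
    // Halve the distance each step: 16, 8, 4, 2, 1.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;  // lane 0 now holds the warp-wide sum
}
```

Each `__shfl_down_sync` compiles to a single `shfl.sync.down` instruction, so the whole reduction costs five shuffles and five adds with no shared-memory traffic.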

Section 2. Vote Instructions

  • Variants (vote.all, vote.any, vote.uni, vote.ballot)
  • Applications (early termination, predicate computation, dynamic scheduling)
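The early-termination application above can be sketched in one line: each lane evaluates its own predicate, and a single vote instruction combines the results warp-wide (the helper name is illustrative):

```cuda
// Warp-wide convergence check: true only if every lane's residual
// is under tolerance. Compiles to a single vote.all (vote.sync)
// instruction rather than a shared-memory round trip.
__device__ bool warpConverged(float residual, float tol) {
    return __all_sync(0xffffffffu, residual < tol);
}
```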

Section 3. Warp-Level Synchronization

  • Implicit sync in SIMT
  • Explicit bar.warp.sync
  • Independent thread scheduling implications

Section 4. Warp Specialization Techniques

  • Producer-consumer patterns
  • Persistent kernels
  • Warp specialization for tasks
  • Load balancing

Section 5. Match and Reduce

  • match.any.sync and duplicate detection

  • Warp-level hash table operations

Chapter 13. Atomic Operations and Synchronization

Section 1. Atomic Memory Operations

  • Basic atomics (add, sub, min, max, inc, dec, exch, cas)
  • Supported state spaces (.global, .shared)
  • Scope modifiers (.gpu, .sys, .cluster)
  • Semantic modes (.relaxed, .acquire, .release, .acq_rel)
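A minimal sketch of the basic atomics listed above, in CUDA C++ (kernel name illustrative): when the return value of an atomic is unused, ptxas may lower the `atom.global.add` to the fire-and-forget `red.global.add` form.

```cuda
// 256-bin histogram built with global-memory atomic adds.
// Each thread increments one bin; contention on hot bins is
// exactly the serialization effect Section 2 analyzes.
__global__ void histogram256(const int* data, int n, int* bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i] & 255], 1);
}
```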

Section 2. Atomic Performance

  • Throughput and serialization effects
  • Optimization strategies

Section 3. Reduction Operations

  • Global reductions: red vs atom
  • Use cases

Section 4. Compare-and-Swap Patterns

  • Lock-free data structures
  • Optimistic concurrency
  • ABA problem and mitigations
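The optimistic-concurrency pattern above follows a standard shape: read, compute a new value, attempt a compare-and-swap, and retry on conflict. A hedged sketch (helper name illustrative; `fmaxf` stands in for any combine step):

```cuda
// Lock-free float max via atomicCAS on the raw bit pattern.
// Classic optimistic loop: if another thread changed the word
// between our read and our CAS, the CAS fails and we retry.
__device__ void atomicMaxFloat(float* addr, float val) {
    unsigned int* u = reinterpret_cast<unsigned int*>(addr);
    unsigned int old = *u, assumed;
    do {
        assumed = old;
        float merged = fmaxf(__uint_as_float(assumed), val);
        old = atomicCAS(u, assumed, __float_as_uint(merged));
    } while (old != assumed);  // success when nobody intervened
}
```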

Section 5. Cluster Synchronization

  • Cluster barriers
  • Cluster reductions
  • Distributed shared memory atomics

Section 6. Memory Ordering and Consistency

  • Happens-before relationships
  • Acquire-release semantics
  • Memory model examples

Section 7. Advanced Synchronization Patterns

  • Producer-consumer queues
  • Reader-writer locks
  • Counting barriers
  • Phased synchronization

Section 8. Scope and Performance Tradeoffs

  • Thread, warp, block, cluster level tradeoffs
  • Optimization decision tree

Chapter 14. Special Instructions and System Operations

Section 1. Video and Image Processing Instructions

  • Video decode and encode hints
  • Pixel format conversions
  • Sub-byte data manipulation

Section 2. Graphics Interop Instructions

  • Surface memory operations
  • Texture sampling in compute
  • Ray tracing hint instructions

Section 3. Cryptography and Security

  • AES instructions
  • Hash acceleration
  • RNG support

Section 4. System-Level Operations

  • Device management
  • Debugging support
  • ABI and calling conventions

Section 5. Inline Assembly Integration

  • Mixing PTX with CUDA C++
  • Constraint specifications
  • Custom intrinsic implementation example
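The flavor of custom intrinsic this section builds can be illustrated with a few lines of inline PTX: a constrained `asm` statement that reads the `%laneid` special register (the wrapper name is an assumption, not the book's):

```cuda
// Custom intrinsic via inline PTX: read the lane ID special
// register. "=r" constrains the output to a 32-bit register.
__device__ unsigned int laneId() {
    unsigned int id;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(id));
    return id;
}
```

Note the doubled `%%` inside the template: a single `%` introduces an operand placeholder, so literal register names must escape it.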

Section 6. Runtime Compilation (NVRTC)

  • Dynamic PTX generation
  • Just-in-time optimization
  • Domain-specific languages and adaptive kernels

Section 7. Multi-GPU and NVLink Operations

  • Peer access
  • Inter-GPU transfers
  • NVLink atomics

Chapter 15. Memory Optimization Masterclass

Section 1. Memory Hierarchy Recap and Strategy

  • Latency hiding through occupancy
  • Bandwidth optimization
  • Roofline model applied to PTX

Section 2. Global Memory Optimization

  • Coalescing mastery
  • Cache behavior and PTX hints
  • Prefetching and async copy

Section 3. Shared Memory Optimization

  • Bank conflict elimination
  • Shared memory capacity and occupancy effects

Section 4. Register Optimization

  • Pressure management
  • Reuse strategies and manual management

Section 5. Texture and Constant Memory

  • Optimal use cases
  • Broadcast patterns and read-only optimizations

Section 6. Memory Access Analysis Tools

  • Nsight Compute memory analysis
  • Manual PTX and SASS inspection
  • Benchmarking methodology

Section 7. Case Studies

  • Matrix transpose optimization
  • Reduction algorithm comparison
  • Stencil computation optimization

Chapter 16. Compute Optimization and Instruction-Level Tuning

Section 1. Instruction Throughput

  • Latency and throughput by architecture
  • SP vs DP performance
  • SFU utilization
  • Integer vs floating-point tradeoffs

Section 2. Instruction-Level Parallelism

  • Dependency analysis
  • Loop unrolling and pipelining

Section 3. Dual-Issue and Instruction Pairing

  • Architecture-specific dual-issue rules
  • Instruction mixing for throughput

Section 4. Divergence Minimization

  • Understanding divergence costs
  • Reduction strategies

Section 5. Occupancy Optimization

  • Theory of active warps
  • Occupancy-performance relationship

Section 6. Mixed-Precision Optimization

  • FP16, BF16, TF32 strategies
  • Precision vs performance tradeoffs

Section 7. Compiler Optimization Flags

  • ptxas options
  • CUDA compilation flags
  • Architecture-specific tuning

Section 8. SASS-Level Tuning

  • Scheduler decisions
  • Manual interventions
  • Knowing when to stop

Chapter 17. Advanced Optimization Techniques

Section 1. Kernel Fusion

  • Eliminating intermediate memory traffic
  • Limitations and tradeoffs

Section 2. Persistent Kernels

  • Launch overhead reduction
  • Work stealing and load balancing
  • Termination strategies

Section 3. Cooperative Groups

  • Thread block clusters
  • Dynamic warp formation
  • Synchronization patterns

Section 4. Multi-GPU Optimization

  • Work distribution strategies
  • NVLink utilization
  • Peer-to-peer transfers

Section 5. Dynamic Parallelism

  • Child kernel launches
  • Use cases and limitations

Section 6. Streams and Concurrency

  • Overlap of compute and memory
  • Multi-stream patterns

Section 7. Power and Thermal Optimization

  • DVFS awareness
  • Instruction mix for power efficiency

Section 8. Algorithmic Optimization

  • GPU-friendly algorithms
  • Data structure design
  • Tiling for problem sizes

Section 9. AI-Driven Optimization

  • Auto-tuning frameworks
  • Reinforcement learning for scheduling
  • Evolutionary search methods

Chapter 18. Debugging and Profiling

Section 1. Debugging PTX Code

  • cuda-gdb debugging workflow
  • Nsight debugger visual inspection
  • Compute Sanitizer for race and memory detection

Section 2. Profiling with Nsight Compute

  • Performance metrics
  • PTX and SASS analysis
  • Guided bottleneck analysis

Section 3. Nsight Systems

  • Timeline visualization
  • Kernel duration analysis
  • Multi-GPU profiling

Section 4. Manual Instrumentation

  • Cycle counters (%clock, %clock64)
  • Custom timers and event logging
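The `%clock64` counter is exposed in CUDA C++ through the `clock64()` intrinsic; a minimal instrumentation sketch (kernel name illustrative) looks like this. Note it measures SM cycles, not wall time, and the SM clock varies with DVFS.

```cuda
// Bracket a region with %clock64 reads and report the cycle
// delta from one thread. Real measurements should also guard
// against instruction reordering around the reads.
__global__ void timedKernel(long long* cycles) {
    long long start = clock64();
    // ... region under measurement ...
    long long stop = clock64();
    if (threadIdx.x == 0)
        *cycles = stop - start;
}
```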

Chapter 19. Tools, Ecosystem, and Future Directions

Section 1. Companion Resources

  • GitHub repository and Docker containers
  • Online community and forums
  • Updates for new CUDA and PTX versions

Section 2. Ecosystem Integration

  • CUTLASS and cuBLAS comparisons
  • Machine learning framework integration
  • Compiler toolchains and research DSLs

Section 3. Research and Emerging Directions

  • AI-assisted kernel optimization
  • Future ISA features
  • Hardware roadmap

Section 4. Practical Outlook

  • Industry applications in AI, HPC, and scientific computing
  • Performance engineering as a discipline
  • Advice for practitioners
