Chapter 12. Warp-Level Primitives
Section 1. Warp Shuffle Instructions
- Variants (shfl.sync.idx, shfl.sync.up, shfl.sync.down, shfl.sync.bfly)
- Data sharing patterns (reductions, prefix sum, broadcast, transpose)
- Shuffle performance (latency, divergence considerations)
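The down-shuffle reduction is the canonical data-sharing pattern for this section. A minimal CUDA C++ sketch (kernel and buffer names are illustrative; a full warp of 32 active lanes is assumed):

```cuda
// Warp-wide sum via down-shuffles: each step halves the number of lanes
// still carrying partial sums. __shfl_down_sync lowers to PTX
// shfl.sync.down.b32.
__device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;  // lane 0 holds the sum of all 32 lanes
}

__global__ void reduce_demo(const float *in, float *out) {
    float sum = warp_reduce_sum(in[threadIdx.x]);
    if (threadIdx.x % 32 == 0)           // one writer per warp
        out[threadIdx.x / 32] = sum;
}
```

Swapping the `+=` for `max`/`min`, or the `.down` pattern for `.bfly` with `__shfl_xor_sync`, yields the other reduction and broadcast variants listed above.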
Section 2. Vote Instructions
- Variants (vote.sync.all, vote.sync.any, vote.sync.uni, vote.sync.ballot)
- Applications (early termination, predicate computation, dynamic scheduling)
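An early-termination sketch using the vote intrinsics (assumes a single-warp block; `search_demo` and its parameters are illustrative names):

```cuda
// Warp-cooperative search with vote-based early exit: each lane scans a
// strided slice, and the whole warp leaves the loop as soon as any lane
// has a hit or every lane has run past the end. __any_sync / __all_sync
// lower to PTX vote.sync.any / vote.sync.all.
__global__ void search_demo(const int *data, int n, int target, int *found_idx) {
    for (int i = threadIdx.x; ; i += 32) {
        bool hit = (i < n) && (data[i] == target);
        bool oob = (i >= n);
        if (hit) *found_idx = i;                    // any hitting lane reports
        if (__any_sync(0xffffffffu, hit) ||         // someone found it, or
            __all_sync(0xffffffffu, oob)) break;    // everyone is exhausted
    }
}
```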
Section 3. Warp-Level Synchronization
- Implicit sync in SIMT
- Explicit bar.warp.sync
- Independent thread scheduling implications
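A small sketch of why explicit warp synchronization matters under independent thread scheduling (the kernel name and access pattern are illustrative):

```cuda
// On Volta and later, lanes of a warp may not be converged after divergent
// control flow. __syncwarp (PTX bar.warp.sync) re-converges the named lanes
// and orders their memory accesses before data is exchanged through memory.
__global__ void its_demo(int *buf) {
    int lane = threadIdx.x % 32;
    if (lane % 2 == 0)
        buf[lane] = lane * lane;    // divergent write by even lanes only
    __syncwarp();                   // re-converge; even lanes' writes now visible
    int v = buf[lane & ~1];         // every lane reads its even neighbor's value
    buf[lane] = v;
}
```

Before independent thread scheduling this pattern often worked by accident; relying on that implicit lockstep is exactly what this section warns against.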
Section 4. Warp Specialization Techniques
- Producer-consumer patterns
- Persistent kernels
- Warp specialization for tasks
- Load balancing
Section 5. Match and Reduce
- match.any.sync, match.all.sync, and duplicate detection
- redux.sync warp-wide reductions
- Warp-level hash table operations
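A duplicate-grouping sketch with the match intrinsic, the building block for warp-level hash-table insertion (names are illustrative; requires compute capability 7.0+):

```cuda
// Lanes holding equal keys receive the same peer mask from
// __match_any_sync (PTX match.any.sync.b32), so the lowest lane in each
// group can be elected leader and act as that key's owner.
__global__ void match_demo(const int *keys, int *owner) {
    int lane = threadIdx.x % 32;
    int key = keys[lane];
    unsigned peers = __match_any_sync(0xffffffffu, key);
    int leader = __ffs(peers) - 1;   // lowest set bit = leader lane for this key
    owner[lane] = leader;            // owner[lane] == lane iff this lane leads
}
```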
Chapter 13. Atomic Operations and Synchronization
Section 1. Atomic Memory Operations
- Basic atomics (add, sub, min, max, inc, dec, exch, cas)
- Supported state spaces (.global, .shared)
- Scope modifiers (.cta, .cluster, .gpu, .sys)
- Semantic modes (.relaxed, .acquire, .release, .acq_rel)
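The CUDA C++ atomics expose only the relaxed, gpu-scope defaults; inline PTX makes the state space, scope, and ordering explicit. A sketch (the exact modifier order follows the PTX `atom` grammar, semantics before scope before space; the kernel name is illustrative):

```cuda
// atomicAdd compiles to a relaxed, gpu-scope global atomic; the asm form
// below instead requests system scope and release ordering explicitly.
__global__ void atomics_demo(int *counter) {
    atomicAdd(counter, 1);   // ~ atom.relaxed.gpu.global.add.u32
    unsigned old;
    asm volatile("atom.release.sys.global.add.u32 %0, [%1], %2;"
                 : "=r"(old)
                 : "l"(counter), "r"(1u)
                 : "memory");
}
```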
Section 2. Atomic Performance
- Throughput and serialization effects
- Optimization strategies
Section 3. Reduction Operations
- red vs atom: fire-and-forget reductions without a return value
- Use cases
Section 4. Compare-and-Swap Patterns
- Lock-free data structures
- Optimistic concurrency
- ABA problem and mitigations
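A minimal lock-free push built on compare-and-swap (PTX `atom.global.cas.b64`); `Node` and `push` are illustrative names, and the pop side would need the ABA mitigations discussed above:

```cuda
struct Node { int value; Node *next; };

// Optimistic-concurrency push: read the head, link the new node, then
// swap the head only if no other thread changed it in the meantime.
// Push alone cannot suffer ABA; a matching lock-free pop can, which is
// why tagged pointers or hazard-pointer schemes come into play.
__device__ void push(Node **head, Node *n) {
    Node *assumed;
    Node *old = *head;
    do {
        assumed = old;
        n->next = assumed;
        old = (Node *)atomicCAS((unsigned long long *)head,
                                (unsigned long long)assumed,
                                (unsigned long long)n);
    } while (old != assumed);   // retry if the head moved under us
}
```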
Section 5. Cluster Synchronization
- Cluster barriers
- Cluster reductions
- Distributed shared memory atomics
Section 6. Memory Ordering and Consistency
- Happens-before relationships
- Acquire-release semantics
- Memory model examples
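The standard happens-before example, message passing, written with libcu++ atomics (kernel names are illustrative; the two kernels are assumed to run concurrently, e.g. on separate streams):

```cuda
#include <cuda/atomic>

// The release store to `flag` orders the earlier plain store to `data`
// before it; any thread that observes flag == 1 via an acquire load is
// therefore guaranteed to see data == 42. These map to PTX
// st.release.gpu / ld.acquire.gpu.
__global__ void producer(int *data,
                         cuda::atomic<int, cuda::thread_scope_device> *flag) {
    *data = 42;
    flag->store(1, cuda::memory_order_release);
}

__global__ void consumer(const int *data,
                         cuda::atomic<int, cuda::thread_scope_device> *flag,
                         int *out) {
    while (flag->load(cuda::memory_order_acquire) == 0) { }  // spin
    *out = *data;   // reads 42: the acquire synchronizes with the release
}
```

With `memory_order_relaxed` on both sides the guarantee disappears, which is the cheapest way to demonstrate the consistency model in practice.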
Section 7. Advanced Synchronization Patterns
- Producer-consumer queues
- Reader-writer locks
- Counting barriers
- Phased synchronization
Section 8. Scope and Performance Tradeoffs
- Thread, warp, block, cluster level tradeoffs
- Optimization decision tree
Chapter 14. Special Instructions and System Operations
Section 1. Video and Image Processing Instructions
- Video decode and encode hints
- Pixel format conversions
- Sub-byte data manipulation
Section 2. Graphics Interop Instructions
- Surface memory operations
- Texture sampling in compute
- Ray tracing hint instructions
Section 3. Cryptography and Security
- AES instructions
- Hash acceleration
- RNG support
Section 4. System-Level Operations
- Device management
- Debugging support
- ABI and calling conventions
Section 5. Inline Assembly Integration
- Mixing PTX with CUDA C++
- Constraint specifications
- Custom intrinsic implementation example
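A compact example of the constraint syntax, wrapping the `%laneid` special register as a custom intrinsic:

```cuda
// "=r" binds a 32-bit register output; volatile keeps ptxas from
// reordering or eliding the instruction. Note the doubled %% needed to
// emit a literal % in the PTX special-register name.
__device__ unsigned lane_id() {
    unsigned lane;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
    return lane;
}
```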
Section 6. Runtime Compilation (NVRTC)
- Dynamic PTX generation
- Just-in-time optimization
- Domain-specific languages and adaptive kernels
Section 7. Multi-GPU and NVLink Operations
- Peer access
- Inter-GPU transfers
- NVLink atomics
Chapter 15. Memory Optimization Masterclass
Section 1. Memory Hierarchy Recap and Strategy
- Latency hiding through occupancy
- Bandwidth optimization
- Roofline model applied to PTX
Section 2. Global Memory Optimization
- Coalescing mastery
- Cache behavior and PTX hints
- Prefetching and async copy
Section 3. Shared Memory Optimization
- Bank conflict elimination
- Shared memory capacity and occupancy effects
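The classic bank-conflict fix in one sketch (assumes a square matrix with dimensions that are multiples of 32; the kernel name is illustrative):

```cuda
// A 32x32 tile of 4-byte words maps every column to one bank, so reading
// a column is a 32-way conflict. Padding each row by one word rotates
// columns across banks and eliminates the conflict entirely.
__global__ void transpose_tile(const float *in, float *out, int width) {
    __shared__ float tile[32][33];           // 33, not 32: the padding column

    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read
    __syncthreads();

    x = blockIdx.y * 32 + threadIdx.x;       // swapped block indices
    y = blockIdx.x * 32 + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```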
Section 4. Register Optimization
- Pressure management
- Reuse strategies and manual management
Section 5. Texture and Constant Memory
- Optimal use cases
- Broadcast patterns and read-only optimizations
Section 6. Memory Access Analysis Tools
- Nsight Compute memory analysis
- Manual PTX and SASS inspection
- Benchmarking methodology
Section 7. Case Studies
- Matrix transpose optimization
- Reduction algorithm comparison
- Stencil computation optimization
Chapter 16. Compute Optimization and Instruction-Level Tuning
Section 1. Instruction Throughput
- Latency and throughput by architecture
- SP vs DP performance
- SFU utilization
- Integer vs floating-point tradeoffs
Section 2. Instruction-Level Parallelism
- Dependency analysis
- Loop unrolling and pipelining
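A sketch of unrolling for ILP (names illustrative; a tail loop for `n` not divisible by 4 is omitted for brevity):

```cuda
// Four independent accumulator chains let the scheduler issue loads and
// FMAs back-to-back instead of stalling on a single serial dependency.
__global__ void dot_ilp(const float *a, const float *b, float *out, int n) {
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;   // independent chains
    int stride = blockDim.x * 4;
    for (int i = threadIdx.x * 4; i + 3 < n; i += stride) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    atomicAdd(out, s0 + s1 + s2 + s3);   // combine chains once at the end
}
```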
Section 3. Dual-Issue and Instruction Pairing
- Architecture-specific dual-issue rules
- Instruction mixing for throughput
Section 4. Divergence Minimization
- Understanding divergence costs
- Reduction strategies
Section 5. Occupancy Optimization
- Theory of active warps
- Occupancy-performance relationship
Section 6. Mixed-Precision Optimization
- FP16, BF16, TF32 strategies
- Precision vs performance tradeoffs
Section 7. Compiler Optimization Flags
- ptxas options
- CUDA compilation flags
- Architecture-specific tuning
Section 8. SASS-Level Tuning
- Scheduler decisions
- Manual interventions
- Knowing when to stop
Chapter 17. Advanced Optimization Techniques
Section 1. Kernel Fusion
- Eliminating intermediate memory traffic
- Limitations and tradeoffs
Section 2. Persistent Kernels
- Launch overhead reduction
- Work stealing and load balancing
- Termination strategies
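A persistent-kernel skeleton tying the three bullets together (`next`, `n_tasks`, and the per-task work are illustrative; `next` must be zeroed before launch, and the grid is sized to the number of resident blocks):

```cuda
__device__ unsigned next;   // global work-queue head

// A fixed grid of blocks loops over a work queue instead of relaunching:
// launch overhead is paid once, atomicAdd provides dynamic load balancing,
// and a drained queue is the termination signal.
__global__ void persistent(unsigned n_tasks, float *data) {
    __shared__ unsigned task;
    while (true) {
        if (threadIdx.x == 0)
            task = atomicAdd(&next, 1u);        // block grabs the next task
        __syncthreads();
        if (task >= n_tasks) return;            // queue drained: terminate
        data[task * blockDim.x + threadIdx.x] *= 2.0f;   // placeholder work
        __syncthreads();                        // done before task is reused
    }
}
```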
Section 3. Cooperative Groups
- Thread block clusters
- Dynamic warp formation
- Synchronization patterns
Section 4. Multi-GPU Optimization
- Work distribution strategies
- NVLink utilization
- Peer-to-peer transfers
Section 5. Dynamic Parallelism
- Child kernel launches
- Use cases and limitations
Section 6. Streams and Concurrency
- Overlap of compute and memory
- Multi-stream patterns
Section 7. Power and Thermal Optimization
- DVFS awareness
- Instruction mix for power efficiency
Section 8. Algorithmic Optimization
- GPU-friendly algorithms
- Data structure design
- Tiling for problem sizes
Section 9. AI-Driven Optimization
- Auto-tuning frameworks
- Reinforcement learning for scheduling
- Evolutionary search methods
Chapter 18. Debugging and Profiling
Section 1. Debugging PTX Code
- cuda-gdb debugging workflow
- Nsight debugger visual inspection
- Compute Sanitizer for race and memory detection
Section 2. Profiling with Nsight Compute
- Performance metrics
- PTX and SASS analysis
- Guided bottleneck analysis
Section 3. Nsight Systems
- Timeline visualization
- Kernel duration analysis
- Multi-GPU profiling
Section 4. Manual Instrumentation
- Cycle counters (%clock, %clock64)
- Custom timers and event logging
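A minimal in-kernel timing sketch using `clock64()`, which reads the `%clock64` special register (the kernel name and timed body are illustrative):

```cuda
// Cycle-level microbenchmark: %clock64 counts cycles on the SM executing
// the thread, so the difference is only meaningful for work that stays
// resident on one SM between the two reads.
__global__ void time_op(float *x, long long *cycles) {
    long long start = clock64();
    float v = x[threadIdx.x];
    for (int i = 0; i < 100; ++i)
        v = v * v + 1.0f;            // timed FMA dependency chain
    x[threadIdx.x] = v;              // keep the chain live past the timer
    long long stop = clock64();
    if (threadIdx.x == 0) *cycles = stop - start;
}
```

Writing `v` back before the second read keeps the compiler from dead-code-eliminating the timed region, a common pitfall with manual instrumentation.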
Chapter 19. Tools, Ecosystem, and Future Directions
Section 1. Companion Resources
- GitHub repository and Docker containers
- Online community and forums
- Updates for new CUDA and PTX versions
Section 2. Ecosystem Integration
- CUTLASS and cuBLAS comparisons
- Machine learning framework integration
- Compiler toolchains and research DSLs
Section 3. Research and Emerging Directions
- AI-assisted kernel optimization
- Future ISA features
- Hardware roadmap
Section 4. Practical Outlook
- Industry applications in AI, HPC, and scientific computing
- Performance engineering as a discipline
- Advice for practitioners


