Name: Running Local LLMs on Your Own Hardware
Brand: Leanpub
Price: 19.99 USD
Availability: InStock

Preface i
1 Why Run LLMs Locally, and the Open-Weight Landscape 2
- 1.1 What This Book Means by "Local" and "Your Own Hardware" 2
- 1.2 A Short History of How Local LLMs Became Practical 4
- 1.3 Five Reasons to Run Models Locally 5
- 1.4 What You Give Up 9
- 1.5 A Vocabulary for the Rest of the Book 12
- 1.6 "Open Weight" Versus "Open Source" 13
- 1.7 The Open-Weight Model Landscape 14
- 1.8 Where the Models Live 15
- 1.9 What Can You Realistically Run? 17
- 1.10 Choosing Where to Run: Deployment Shapes 17
- 1.11 What People Actually Build With Local Models 18
- 1.12 What Local Models Are Good and Bad At 19
- 1.13 The Tooling You Will Meet in This Book 20
- 1.14 A Ten-Minute Taste: Run a Model Right Now 21
- 1.15 Staying Current Without Getting Overwhelmed 27
- 1.16 How to Use This Book 27
- 1.17 Common Misconceptions to Leave Behind 29
- 1.18 Frequently Asked First Questions 30
- 1.19 A Note on Responsible Use 30
- 1.20 A Readiness Checklist Before Chapter 2 31
- 1.21 Recap and What Is Next 32
2 How LLMs Work Under the Hood 34
- 2.1 What a Language Model Actually Does 34
- 2.2 Training Versus Inference: What You Are and Are Not Doing 36
- 2.3 Tokens: The Unit Everything Is Measured In 36
- 2.4 From Tokens to Vectors: Embeddings 39
- 2.5 Inside a Transformer Block 41
- 2.6 Attention and Why Context Is Expensive 45
- 2.7 Autoregressive Decoding: One Token at a Time 46
- 2.8 Prefill Versus Decode: Two Very Different Phases 47
- 2.9 The KV Cache: Why Context Costs Memory 48
- 2.10 Counting the Memory: Parameters, Precision, and Footprint 50
- 2.11 Anatomy of a Model File 52
- 2.12 Why Decode Is Memory-Bandwidth Bound 53
- 2.13 Throughput Versus Latency 55
- 2.14 The Context Window, in Memory Terms 56
- 2.15 The Journey of a Prompt, End to End 57
- 2.16 Putting It Together: A Worked Example 57
- 2.17 A Diagnostic Table: Symptom to Mechanism 59
- 2.18 Common Misconceptions to Leave Behind 59
- 2.19 Pitfalls and Practical Reminders 60
- 2.20 Recap and What Is Next 61
3 Understanding Your Hardware 64
- 3.1 The Hierarchy That Actually Matters 64
- 3.2 Taking Inventory of Your Machine 66
- 3.3 The GPU and VRAM: The Hard Limit 69
- 3.4 Memory Bandwidth: The Real Speed Determinant 71
- 3.5 The CPU Path: Running Without a GPU 73
- 3.6 System RAM and Its Several Roles 74
- 3.7 Apple Silicon and Unified Memory 75
- 3.8 AMD GPUs and ROCm 76
- 3.9 Multiple GPUs 77
- 3.10 Cloud and Rented GPUs as a Hardware Option 78
- 3.11 Laptops, Desktops, and Small Form Factors 79
- 3.12 Common Machine Profiles 80
- 3.13 What Each Class Can and Cannot Do 80
- 3.14 Matching Hardware to Your Task 82
- 3.15 Choosing GPUs by Budget 83
- 3.16 The Unglamorous Constraints: PCIe, Power, Cooling, and Disk 84
- 3.17 Putting It Together: Estimating What You Can Run 85
- 3.18 From Inventory to a Model Shortlist 88
- 3.19 Choosing or Upgrading Hardware 89
- 3.20 Running Costs and Always-On Considerations 91
- 3.21 Common Pitfalls 91
- 3.22 A Hardware Diagnostic Table 92
- 3.23 Planning for the Future 92
- 3.24 A Hardware Quick Reference 93
- 3.25 Recap and What Is Next 93
4 Choosing and Sizing Models 96
- 4.1 What Parameter Counts Mean in Practice 96
- 4.2 The Memory Math, Made Precise 98
- 4.3 Quantization Formats: The Landscape 100
- 4.4 Context Length as a Sizing Factor 102
- 4.5 Speed Targets: What Tokens per Second Feels Like 103
- 4.6 Base, Instruct, Chat, and Specialized Variants 103
- 4.7 Mixture-of-Experts Models in Selection 105
- 4.8 Sizing Differs for Serving and Fine-Tuning 106
- 4.9 Reading a Model Card Critically 106
- 4.10 Evaluating Candidates on Your Own Tasks 108
- 4.11 Using More Than One Model 111
- 4.12 Reading Benchmarks Without Being Misled 111
- 4.13 Matching the Model to Your Task 113
- 4.14 Where to Discover Models 113
- 4.15 A Repeatable Selection Procedure 114
- 4.16 Finding and Downloading the Model You Chose 116
- 4.17 Worked Selections 120
- 4.18 Realistic Expectations by Task 121
- 4.19 Common Pitfalls 121
- 4.20 Documenting and Pinning Your Choice 122
- 4.21 Selection Is Iterative, Not Final 123
- 4.22 Frequently Asked Questions About Choosing 123
- 4.23 Choosing Auxiliary Models: Embeddings and Rerankers 124
- 4.24 When Local Is Not the Right Model Choice 125
- 4.25 Recap and What Is Next 125
5 Setting Up Your Environment 128
- 5.1 The Shape of a Local-Inference Software Stack 128
- 5.2 Two Paths: The Easy Way and the Full Way 129
- 5.3 NVIDIA: Drivers and CUDA on Debian and Ubuntu 130
- 5.4 AMD: Installing ROCm 133
- 5.5 Apple Silicon: Minimal Prerequisites 134
- 5.6 Windows: Use WSL2 134
- 5.7 CPU-Only Environments 135
- 5.8 Python Done Right 135
- 5.9 Users, Permissions, and Running as a Service 137
- 5.10 Containers: Docker and the NVIDIA Container Toolkit 137
- 5.11 Tailoring Setup to Your Goal 139
- 5.12 Why Setup Feels Hard, and Why It Gets Easier 140
- 5.13 Hugging Face Authentication 140
- 5.14 Where Models Will Live: Storage Layout and Disk Planning 141
- 5.15 Environment Variables That Matter 143
- 5.16 Networks, Firewalls, and Proxies 144
- 5.17 Offline and Air-Gapped Operation 145
- 5.18 Multi-GPU Environment Setup 145
- 5.19 Where Configuration Lives 146
- 5.20 Verifying the Whole Stack 146
- 5.21 A Worked End-to-End Setup 149
- 5.22 Backing Up and Moving to a New Machine 151
- 5.23 Keeping the Setup Reproducible 152
- 5.24 Updating and Maintaining the Environment 152
- 5.25 Troubleshooting Setup Problems 153
- 5.26 When You Are Truly Stuck: Getting Help 154
- 5.27 Common Pitfalls 155
- 5.28 Recap and What Is Next 156
6 Inference Engines and Runtimes 158
- 6.1 What an Inference Engine Does 158
- 6.2 How to Choose an Engine 159
- 6.3 Ollama: The Easy Default 161
- 6.4 llama.cpp: The Foundation 162
- 6.5 vLLM: High-Throughput Serving 164
- 6.6 Graphical Tools: LM Studio and Jan 165
- 6.7 text-generation-webui: The Experimenter's Workbench 166
- 6.8 Hugging Face TGI and the EXL2 Stack 166
- 6.9 What Each Engine's API Offers 167
- 6.10 Engines Beyond Chat: Embeddings, Vision, and More 169
- 6.11 Managing Models Across Engines 169
- 6.12 Engine Performance Is Not Uniform 170
- 6.13 How Engines Load Models 171
- 6.14 Concurrency: One Engine, Many Requests 171
- 6.15 Other Engines Worth Knowing 172
- 6.16 Choosing by Platform 173
- 6.17 What to Expect on First Run with Each Engine 174
- 6.18 The Capability Matrix 174
- 6.19 Running Engines in Containers 176
- 6.20 A Worked Multi-Engine Setup 176
- 6.21 Keeping Engines Updated 178
- 6.22 Network and Security Defaults 179
- 6.23 The OpenAI-Compatible API: The Unifying Thread 179
- 6.24 A Decision Guide by Use Case 181
- 6.25 Troubleshooting Engine Startup 182
- 6.26 Stability, Maturity, and Production Use 183
- 6.27 Running an Engine Alongside Your Other Work 184
- 6.28 The Engine's Own Footprint and Observability 185
- 6.29 Trust and Isolation of Engines 185
- 6.30 Common Pitfalls 186
- 6.31 If You Only Remember One Thing 186
- 6.32 Recap and What Is Next 187
7 Running Your First Model 189
- 7.1 The Shape of a First Run 189
- 7.2 The Obtain Stage in Practice 190
- 7.3 First Run with Ollama 191
- 7.4 First Run with llama.cpp 192
- 7.5 First Run on Other Platforms 194
- 7.6 First Run with a Graphical Tool 194
- 7.7 First Run in a Container 195
- 7.8 Your First API Call 195
- 7.9 Chat Templates in Practice 199
- 7.10 Interactive Mode Versus Server Mode 200
- 7.11 The Interactive Session 200
- 7.12 Setting Expectations for First-Run Output 201
- 7.13 Managing the Running Model 202
- 7.14 First-Run Problem: Out of Memory 203
- 7.15 First-Run Problem: Silent CPU Fallback 204
- 7.16 First-Run Problem: Garbled or Endless Output 205
- 7.17 First-Run Problem: Repetitive or Looping Output 205
- 7.18 First-Run Problem: Refusals and Over-Caution 206
- 7.19 First-Run Problem: Too Slow 206
- 7.20 First-Run Problem: Will Not Load 207
- 7.21 Engine-Specific First-Run Notes 208
- 7.22 A First-Run Decision Tree 208
- 7.23 A First-Run Troubleshooting Table 209
- 7.24 A Complete Worked Session 210
- 7.25 What a Successful First Run Looks Like 210
- 7.26 From First Run to a Habit 211
- 7.27 Confirming It Truly Runs Locally 211
- 7.28 Keeping Notes on Your First Setup 212
- 7.29 Your First Useful Task 212
- 7.30 What Comes After a Successful First Run 213
- 7.31 Leaving the Model Running for Your Devices 213
- 7.32 Trying Several Models in One Sitting 214
- 7.33 Frequently Asked First-Run Questions 214
- 7.34 Common Pitfalls 215
- 7.35 Recap and What Is Next 216
8 Quantization Deep Dive 219
- 8.1 How 4-Bit Became Normal 219
- 8.2 What Quantization Actually Does 220
- 8.3 Integer Versus Float Quantization 222
- 8.4 The Quality Cost and How to Measure It 224
- 8.5 GGUF Quantization Types in Detail 225
- 8.6 Importance-Matrix (IQ) Quants 228
- 8.7 GPU-Native Quantization Formats 229
- 8.8 Quantizing a Model Yourself 232
- 8.9 Quantizing the KV Cache 234
- 8.10 Per-Tier Recommendations 236
- 8.11 Measuring Quality on Your Own Tasks 237
- 8.12 A Worked Comparison: Q4 Versus Q8 239
- 8.13 A Decision Procedure for Quantization 239
- 8.14 The Trajectory of Quantization 241
- 8.15 Frequently Asked Questions About Quantization 241
- 8.16 What People Actually Use 242
- 8.17 Quantizing the Whole Pipeline 242
- 8.18 The Quantization Choices You Actually Make 243
- 8.19 Common Misconceptions 244
- 8.20 Quantization in the Full Memory Picture 244
- 8.21 Quantization Is Not Faking It 245
- 8.22 Common Pitfalls 245
- 8.23 Recap and What Is Next 246
9 Prompting, Sampling, and Generation Parameters 249
- 9.1 The Pipeline and Where Each Lever Acts 249
- 9.2 Chat Templates and Special Tokens 250
- 9.3 The System Prompt: Your Most Powerful Lever 250
- 9.4 Prompt Engineering Essentials 252
- 9.5 Temperature: The Core Sampling Knob 256
- 9.6 Top-p, Top-k, and Min-p 257
- 9.7 Penalties: Controlling Repetition 259
- 9.8 Controlling Length and Stopping 260
- 9.9 Seeds and Determinism 261
- 9.10 Structured Output: Schemas and Grammars 262
- 9.11 Managing the Context Window 265
- 9.12 Prompt Caching 267
- 9.13 Putting Parameters Together: Presets by Task 268
- 9.14 Combining the Levers: A Worked Task 270
- 9.15 A Diagnostic Table: Symptom to Setting 271
- 9.16 Common Misconceptions 272
- 9.17 When a Model Ignores Your Instructions 272
- 9.18 Prompt Injection: A Preview 273
- 9.19 What Prompting and Parameters Cannot Fix 274
- 9.20 Common Pitfalls 274
- 9.21 Prompting Is a Craft, Not Incantation 275
- 9.22 Building a Prompt-Evaluation Habit 275
- 9.23 Treating Prompts as Part of the System 276
- 9.24 Recap and What Is Next 277
10 Serving Local LLMs as APIs 280
- 10.1 The OpenAI-Compatible API as the Contract 280
- 10.2 To Serve or to Embed 281
- 10.3 Anatomy of a Request and Response 282
- 10.4 The Endpoints 283
- 10.5 Streaming Responses 284
- 10.6 Client Code in Any Language 285
- 10.7 Running the Engine as a Service 287
- 10.8 Reverse Proxy and TLS 289
- 10.9 Authentication and Rate Limiting 290
- 10.10 Concurrency and When to Use a Serving Engine 292
- 10.11 Multiple Models and Routing 292
- 10.12 Capacity Planning for a Served Model 293
- 10.13 Connecting Tools and Frameworks 295
- 10.14 Health Checks and Monitoring 297
- 10.15 Deployment Topologies 299
- 10.16 A Reference Deployment 300
- 10.17 Local Network Versus Internet Exposure 301
- 10.18 Logging, Privacy, and What a Service Records 302
- 10.19 Streaming in Real Applications 303
- 10.20 Securing What the Model Can Do 303
- 10.21 Common Pitfalls 304
- 10.22 When a Hosted API Is the Better Serving Choice 305
- 10.23 Choosing Your Serving Stack 305
- 10.24 Frequently Asked Questions About Serving 306
- 10.25 Operating the Service Day to Day 307
- 10.26 The Serving Mindset 307
- 10.27 Recap and What Is Next 308
11 Retrieval-Augmented Generation and Embeddings 311
- 11.1 Why Retrieval-Augmented Generation 311
- 11.2 The RAG Pipeline 312
- 11.3 Embeddings and Local Embedding Models 314
- 11.4 Chunking 317
- 11.5 Vector Stores 318
- 11.6 Retrieval 320
- 11.7 Building the Prompt and Generating 322
- 11.8 Embeddings Beyond Retrieval 324
- 11.9 Frameworks Versus Building It Yourself 325
- 11.10 Reranking 325
- 11.11 Hybrid Search 326
- 11.12 Evaluating Retrieval 328
- 11.13 Keeping It Offline 330
- 11.14 Retrieved Knowledge Versus the Model's Own 332
- 11.15 Improving RAG and Its Failure Modes 332
- 11.16 RAG Versus Fine-Tuning Versus Long Context 333
- 11.17 A Worked RAG Project 333
- 11.18 The Cost and Latency of RAG 335
- 11.19 Maintaining a RAG System Over Time 335
- 11.20 RAG and Agents 336
- 11.21 Frequently Asked Questions About RAG 336
- 11.22 Common Pitfalls 337
- 11.23 Why RAG Is the Quintessential Local Application 338
- 11.24 Recap and What Is Next 339
12 Fine-Tuning and Adapters 342
- 12.1 When to Fine-Tune, and When Not To 342
- 12.2 Fine-Tuning as Amortized Prompting 344
- 12.3 What Fine-Tuning Does, and Does Not Do 345
- 12.4 Full Fine-Tuning, LoRA, and QLoRA 348
- 12.5 VRAM Budgeting for Training 350
- 12.6 Datasets: Quality Over Quantity 352
- 12.7 Tooling 354
- 12.8 A QLoRA Run, End to End 356
- 12.9 Hyperparameters 358
- 12.10 Merging and Exporting to GGUF 359
- 12.11 Using Adapters at Inference 360
- 12.12 Evaluating a Fine-Tune 362
- 12.13 Common Problems and How to Fix Them 363
- 12.14 Fine-Tuning Other Model Types 365
- 12.15 RAG, Fine-Tuning, or Both 365
- 12.16 Realistic Expectations 366
- 12.17 Frequently Asked Questions About Fine-Tuning 367
- 12.18 Common Pitfalls 368
- 12.19 The Effort and Economics of Fine-Tuning 368
- 12.20 Recap and What Is Next 369
13 Performance Optimization and Benchmarking 373
- 13.1 Measure Before You Optimize 373
- 13.2 The Metrics That Matter 374
- 13.3 Benchmarking Tools 375
- 13.4 Benchmarking Methodology 375
- 13.5 Monitoring During Inference 377
- 13.6 Establishing a Baseline 378
- 13.7 The Roofline: Compute Versus Memory 379
- 13.8 GPU Offload Tuning 380
- 13.9 Flash Attention 380
- 13.10 KV-Cache Quantization for Speed 381
- 13.11 Context Length and Batch Size 382
- 13.12 Speculative Decoding 382
- 13.13 Prefill Optimization and Prompt Caching 383
- 13.14 Optimizing Throughput 384
- 13.15 Finding the Bottleneck 385
- 13.16 Quantization as a Speed Lever 386
- 13.17 Hardware-Level Factors 386
- 13.18 Quantization Format and Engine Speed 387
- 13.19 CPU and Hybrid Inference Tuning 387
- 13.20 Keeping a Benchmark Log 388
- 13.21 A Worked Optimization 389
- 13.22 Model Load Time and Storage 389
- 13.23 The Cost of Sampling and Generation Parameters 390
- 13.24 Latency in Interactive Use 391
- 13.25 Continuous Batching in Depth 391
- 13.26 Tokens per Watt: Efficiency 392
- 13.27 An Optimization Decision Guide 393
- 13.28 Memory Is Speed 394
- 13.29 Profiling Where the Time Goes 394
- 13.30 Realistic Expectations by Hardware Tier 395
- 13.31 Diagnosing a Silent CPU Fallback 396
- 13.32 Comparing Engines on Your Hardware 396
- 13.33 Speed Versus Quality 397
- 13.34 A Second Worked Example: Serving Throughput 398
- 13.35 Server-Level Tuning for Serving 398
- 13.36 Knowing When to Stop 399
- 13.37 Optimization in Context 400
- 13.38 Automating Your Benchmarks 400
- 13.39 Common Pitfalls 401
- 13.40 Recap and What Is Next 401
14 Multi-GPU, CPU Offload, and Scaling on a Budget 404
- 14.1 The Central Trade: Capacity Versus Speed 404
- 14.2 When You Need to Scale 405
- 14.3 Multi-GPU: Adding VRAM at VRAM Speed 406
- 14.4 Layer Split Versus Tensor Parallel 408
- 14.5 CPU Offload: Capacity at System-Memory Speed 408
- 14.6 Measuring the Offload Cliff 411
- 14.7 RAM Spill and the Disk Cliff 412
- 14.8 Mixing Mismatched GPUs 413
- 14.9 Serving a Large Model Across GPUs 414
- 14.10 Budget Scaling Strategy 415
- 14.11 Memory Accounting Across Devices 417
- 14.12 Predicting Speed Before You Commit 417
- 14.13 The Interconnect Decides Multi-GPU Speed 418
- 14.14 Prefill Versus Decode in Multi-Device Setups 419
- 14.15 MoE Models and Scaling 419
- 14.16 Operating a Multi-GPU Machine 420
- 14.17 Renting GPUs for Occasional Big Models 420
- 14.18 Quantization as the First Scaling Tool 421
- 14.19 Apple Silicon and Unified Memory 422
- 14.20 Offloading Weights Versus the KV Cache 422
- 14.21 Speculative Decoding for Scaled Models 423
- 14.22 Distributed Inference Across Machines 424
- 14.23 Loading Large Models 424
- 14.24 Realistic Speeds by Scaling Path 425
- 14.25 Knowing When to Stop Scaling 425
- 14.26 A Worked Scaling Decision 426
- 14.27 Long Context as a Scaling Problem 427
- 14.28 Running Several Models at Once 427
- 14.29 The Long-Term View 428
- 14.30 Multi-GPU for Fine-Tuning 429
- 14.31 Reliability of Scaled Setups 429
- 14.32 Containers and Scaled Deployments 430
- 14.33 Scaling in the Whole-System Picture 431
- 14.34 Common Pitfalls 431
- 14.35 Recap and What Is Next 432
15 Security, Privacy, Maintenance, and Troubleshooting 434
- 15.1 The Local Security Model 434
- 15.2 Never Expose the Raw Engine 435
- 15.3 Authenticated, Encrypted Access 436
- 15.4 Least Privilege for Capability 437
- 15.5 Prompt Injection and Untrusted Content 440
- 15.6 Agents, Tool Use, and the Model Context Protocol 441
- 15.7 The Model Supply Chain 444
- 15.8 Network Architecture for Access 445
- 15.9 Multi-User Access and Abuse Control 446
- 15.10 Resource Limits and Availability 447
- 15.11 Privacy in Earnest 448
- 15.12 Where Your Data Can Actually Go 449
- 15.13 Data Retention and Conversation History 451
- 15.14 Compliance and Regulated Data 451
- 15.15 Maintaining the Stack 452
- 15.16 Backups and Recovery 453
- 15.17 Keeping Models Current 454
- 15.18 Monitoring and Health Checks 455
- 15.19 Systematic Troubleshooting 456
- 15.20 Hardening the Host 458
- 15.21 Securing the RAG and Document Pipeline 459
- 15.22 Incident Response and Recovery 459
- 15.23 Operating at Different Scales 460
- 15.24 A Troubleshooting Method 461
- 15.25 Documenting Your Deployment 461
- 15.26 Routine Operations 462
- 15.27 Where Local Changes the Calculus 463
- 15.28 Two Worked Troubleshooting Cases 464
- 15.29 A Security and Operations Checklist 464
- 15.30 The Operator's Mindset 466
- 15.31 Common Pitfalls 466
- 15.32 Recap and What Is Next 467

