OpenShift AI Platform Guide
Platform Engineering, GPUs, and Air-Gapped Clusters with OpenShift AI
About the Book
OpenShift AI Platform Guide is a practical handbook for platform engineers who need to turn OpenShift into a real internal AI platform, not “just a Kubernetes cluster.”
Starting from the CNCF platform engineering whitepaper, the book shows how to apply those ideas on OpenShift: treating the platform as a product, reducing cognitive load for app teams, and building opinionated “golden paths” instead of one-off snowflakes.
From there, you’ll walk through end-to-end, production-grade scenarios:
- Installing OpenShift 4.20 in fully air-gapped environments with a local Quay registry
- Configuring cluster-wide proxies, NFS storage, and disconnected OperatorHub catalogs
- Deploying and managing key operators like Node Feature Discovery and the NVIDIA GPU Operator
- Enabling InfiniBand and RDMA networking with SR-IOV and the NVIDIA Network Operator
- Integrating observability with DCGM, Prometheus, and Grafana for GPU-aware monitoring
- Using GitOps (OpenShift GitOps / Argo CD + GitLab) for declarative, auditable platform config
- Running LLM performance benchmarks as code with Hugging Face’s Inference-Benchmarker and visualizing results with a Gradio dashboard
The guide is written in a “do this, then this” style, with YAML examples, command snippets, and explanations of why each piece matters for a modern AI platform.
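As a taste of that style, here is a cluster-wide `Proxy` resource of the kind the Cluster Configuration chapters walk through (the proxy hostname, port, and ConfigMap name below are illustrative placeholders, not values from the book):

```yaml
# Cluster-wide egress proxy (config.openshift.io/v1).
# OpenShift expects exactly one Proxy resource, named "cluster".
# proxy.example.com:3128 and user-ca-bundle are placeholder values.
apiVersion: config.openshift.io/v1
kind: Proxy
metadata:
  name: cluster
spec:
  httpProxy: http://proxy.example.com:3128
  httpsProxy: http://proxy.example.com:3128
  noProxy: .cluster.local,.svc,localhost,127.0.0.1
  trustedCA:
    name: user-ca-bundle   # ConfigMap in openshift-config holding the proxy's CA certificate
```

Each snippet in the book is paired with the commands to apply and verify it, plus an explanation of what the cluster does with it.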
If you are a platform engineer, SRE, or infrastructure-minded ML practitioner responsible for OpenShift-based GPU clusters—especially in regulated or disconnected environments—this book gives you a concrete, repeatable blueprint.
Table of Contents
- Introduction - Platform Engineering
- Platform Engineering
- 1. What platform engineering is really about
- 2. Key principles from the CNCF whitepaper
- 3. Why OpenShift is a strong foundation for an internal platform
- 4. Mapping CNCF platform capabilities to OpenShift
- 5. Product & app teams vs capability & service providers
- 6. Building a “thinnest viable platform” on OpenShift
- 7. Making it secure and governed by default
- 8. Measuring success of your OpenShift-based platform
- 9. A practical adoption roadmap
- Installation
- Agent-based Installation in Air-Gapped Environments
- 1. What we’re building
- 2. Mirroring OpenShift and ecosystem images into Quay
- 3. Auth & trust: pull secrets and CA bundle
- 4. The install-config.yaml for air-gapped, bare-metal, agent-based
- 5. AgentConfig: describing your hosts
- 6. Telling the installer to use your mirrored release image
- 7. Generating the Agent ISO (air-gapped aware)
- 8. Booting the nodes & running the install
- 9. Verifying the cluster is using your Quay mirror
- 10. Troubleshooting on the rendezvous host
- 11. Summary: your end-to-end air-gapped flow
- Proxy Configuration for Installation
- 1. Why Agent-based installation with a proxy?
- 2. Where the proxy lives: install-config.yaml
- 3. Example: Agent-based install with your proxy
- 4. Creating the Agent-based config files (with proxy)
- 5. Generating the Agent ISO (with proxy embedded)
- 6. Authentication & custom CA during Agent-based install
- 7. How the proxy behaves during and after install
- 8. Verifying your Agent-based install with proxy
- 9. Tips & pitfalls specific to Agent-based installs with proxy
- 10. Summary
- Local Quay Registry Setup
- 1. Where to specify the local Quay registry
- 2. Example: full install-config.yaml with proxy + local Quay
- 3. Make sure auth and TLS match your Quay
- 4. What happens during agent-based install
- 5. Quick sanity checks
- Using oc-mirror with Quay
- 1. Why oc-mirror v2 + Quay?
- 2. Requirements & network allowlist
- 3. Preparing auth & environment for oc-mirror
- 4. Designing the ImageSetConfiguration (4.20 + operators)
- 5. Running oc-mirror v2: Mirror-to-Disk
- 6. Disk-to-Mirror: pushing into Quay
- 7. Applying cluster resources in OpenShift
- 8. Verifying everything is working with Quay
- 9. Operational tips with oc-mirror v2 and Quay
- 10. Common pitfalls & troubleshooting
- 11. Summary
- Cluster Configuration
- Cluster-Wide Proxy Configuration
- 1. What the OpenShift cluster-wide proxy actually does
- 2. The Proxy resource – core fields
- 3. Basic proxy configuration (no authentication)
- 4. Proxy with basic auth
- 5. Custom CA for the proxy (trustedCA)
- 6. Configuring a proxy at install time vs Day 2
- 7. How workloads interact with the cluster proxy
- 8. Verifying the proxy configuration
- 9. Updating or removing the proxy
- 10. Common pitfalls & best practices
- 11. Summary
- NFS Storage Configuration
- 1. What we’re building
- 2. OpenShift + NFS basics (very short theory)
- 3. Setting up the NFS server (RHEL 9 example)
- 4. Cluster-side requirements (OpenShift)
- 5. Deploying nfs-subdir-external-provisioner on OpenShift
- 6. Understanding the result on the NFS server
- 7. Alternative: using a values.yaml instead of --set
- 8. Creating and using a PVC from OpenShift
- 9. Troubleshooting common issues
- 10. Hardening & best practices
- 11. Static NFS PVs vs. dynamic (what you built)
- 12. Summary of your working configuration
- Air-Gapped Operations
- OperatorHub with Local Quay
- 1. Background: OperatorHub, CatalogSource, ClusterCatalog
- 2. Step 1 – Disable the default catalog sources
- 3. Step 2 – Enable Quay-backed catalogs
- 4. Step 3 – Verify catalog health in openshift-marketplace
- 5. Operational tips & gotchas
- 6. Summary: What you have now
- Operators
- Node Feature Discovery (NFD)
- 1. What Node Feature Discovery actually does
- 2. High-level GitOps structure for NFD
- 3. Namespace: isolating the NFD operator
- 4. OperatorGroup: scoping where the operator works
- 5. Subscription: installing NFD from your internal catalog
- 6. GitOps RBAC: allowing Argo CD to manage NFD CRs
- 7. The NodeFeatureDiscovery CR: how you configure NFD
- 8. How this ties into your GPU / platform stack
- 9. Summary
- NVIDIA GPU Operator
- 1. What the NVIDIA GPU Operator does on OpenShift
- 2. Namespace: where the operator and operands live
- 3. OperatorGroup: scoping the operator
- 4. Subscription: install the certified operator from your Quay-backed catalog
- 5. ClusterPolicy: the GPU Operator’s master configuration
- 6. ConfigMap: device plugin config placeholder
- 7. How this ties in with your NFD & GitOps stack
- 8. Summary
- InstaSlice for Dynamic GPU Partitioning
- InstaSlice on OpenShift (OCP)
- 1. Common Prep (All OpenShift Scenarios)
- 2. OpenShift – Emulated Mode (no GPUs)
- 3. OpenShift – Real GPUs + MIG (A100/H100)
- 4. Day-2 Operations on OpenShift
- 5. Troubleshooting on OpenShift
- Networking
- InfiniBand and RDMA Configuration
- InfiniBand + RDMA on OpenShift AI with SR-IOV and NVIDIA Network Operator (Legacy Mode)
- 0. Prerequisites & Assumptions
- 1. Node Feature Discovery: Label the Right Nodes
- 2. SR-IOV Network Operator: Preparing the Legacy SR-IOV Path
- 3. NVIDIA Network Operator: Enabling DOCA/OFED in Legacy Mode
- 4. NicClusterPolicy: Deploying DOCA/OFED for InfiniBand
- 5. Defining SR-IOV InfiniBand Resources
- 6. Test Pod: GPU + InfiniBand RDMA
- 7. Troubleshooting Checklist
- 8. Files in This Setup
- 9. Where to Go Next
- Observability
- NVIDIA DCGM for GPU Monitoring
- 1. What is NVIDIA DCGM?
- 2. Key Capabilities of DCGM
- 3. DCGM Exporter: Bridge to Prometheus
- 4. DCGM in Kubernetes and OpenShift
- 5. Prometheus Integration: Scraping DCGM Metrics
- 6. Common DCGM / DCGM Exporter Metrics to Watch
- 7. Use Cases: Beyond “Nice Dashboards”
- 8. Best Practices and Gotchas
- 9. Putting It All Together
- Automation
- Ansible Automation with redhatci.ocp
- redhatci.ocp Ansible Collection
- What Is redhatci.ocp?
- Installation and Packaging
- A Tour of the Capabilities
- Example Workflows
- Where redhatci.ocp Fits in the Ecosystem
- Community, Support, and Contributions
- When Should You Use redhatci.ocp?
- OpenShift Installation
- site.yml
- Automate OpenShift GitOps
- What Is OpenShift GitOps?
- ✅ 1. Relevant Roles in redhatci.ocp
- ✅ 2. Minimal Requirements
- ✅ 3. Simple Example: Install GitOps Operator Only
- ✅ 4. Full Example: Install Operator + Configure Repo + Create an App
- ✅ 5. Using the “All-in-One” GitOps Wrapper Role
- ✅ 6. Typical Workflow After a Fresh OCP Install
- OpenShift AI
- Model Registry
- Enabling and Using the Model Registry in Red Hat OpenShift AI (with MySQL)
- 1. What the Model Registry Component Does
- 2. Prerequisites
- 3. Enabling the modelregistry Component in OpenShift AI
- 4. Verifying the Model Registry Component
- 5. Preparing the External MySQL Database
- 6. Creating a Model Registry in the OpenShift AI Dashboard
- 7. Securing the Database Connection (Optional but Recommended)
- 8. Verifying the New Model Registry
- 9. Next Steps: Permissions, Catalog, and MLOps Integration
- 10. Common Troubleshooting Tips
- Wrap-Up
- Run:AI
- Troubleshooting Run:AI
- Troubleshooting Run:ai and NVIDIA GPU Operator on OpenShift
- GitOps
- Repository Settings and Certificates
- 1. What “repository settings” mean in OpenShift GitOps
- 2. Step 1 – Add the GitLab TLS certificate (trust gitlab.example.local)
- 3. Step 2 – Add the GitLab repository with credentials and proxy
- 4. How proxy settings actually work for repositories
- 5. Declarative equivalent: TLS cert & repo with proxy as YAML
- 6. Quick health checks & troubleshooting
- 7. Summary: what you’ve achieved
- Applications and ApplicationSets
- 1. Quick mental model
- 2. Git layout: base / envs pattern
- 3. Argo CD Application: point to a single path
- 4. ApplicationSet: generate many Applications from one template
- 5. ApplicationSet for per-environment CSI Isilon
- 6. How this fits GitOps best practices
- 7. When to use Application vs ApplicationSet
- 8. Summary
- Benchmarking
- GPU Benchmarking with Inference-Benchmarker
- Deploying a GPU LLM Benchmark-as-Code Pipeline on Kubernetes with Inference-Benchmarker
- 1. Cluster preparation: GPUs & namespace
- 2. Clone the Inference-Benchmarker Helm chart
- 3. Configure values.yaml for your GPU benchmark
- 4. (Optional) Persist benchmark results with a PVC
- 5. Install the benchmark stack with Helm
- 6. Track benchmark progress
- 7. Collect the JSON results
- 8. Visualize GPU performance with the Gradio dashboard
- 9. Cleanup
- 10. Architecture overview
- 11. Troubleshooting guide
- 12. Next steps: turning this into a benchmark catalog
- Visualizing Benchmark Results
- Visualizing GPU Benchmark Results with the Inference-Benchmarker Dashboard
- 0. Prerequisites
- 1. Grab the result files from the cluster
- 2. Install the dashboard dependencies (one-time)
- 3. Launch the Gradio dashboard
- 4. Open the web UI
- 5. Pro tips for better visualization workflows
- 6. Cleanup
- 7. Summary