OpenShift AI Platform Guide
Platform Engineering, GPUs, and Air-Gapped Clusters with OpenShift AI
About the Book
OpenShift AI Platform Guide is a practical handbook for platform engineers who need to turn OpenShift into a real internal AI platform, not “just a Kubernetes cluster.”
Starting from the CNCF platform engineering whitepaper, the book shows how to apply those ideas on OpenShift: treating the platform as a product, reducing cognitive load for app teams, and building opinionated “golden paths” instead of one-off snowflakes.
From there, you’ll walk through end-to-end, production-grade scenarios:
- Installing OpenShift 4.20 in fully air-gapped environments with a local Quay registry
- Configuring cluster-wide proxies, NFS storage, and disconnected OperatorHub catalogs
- Deploying and managing key operators like Node Feature Discovery and the NVIDIA GPU Operator
- Enabling InfiniBand and RDMA networking with SR-IOV and the NVIDIA Network Operator
- Integrating observability with DCGM, Prometheus, and Grafana for GPU-aware monitoring
- Using GitOps (OpenShift GitOps / Argo CD + GitLab) for declarative, auditable platform config
- Running LLM performance benchmarks as code with Hugging Face’s Inference-Benchmarker and visualizing results with a Gradio dashboard
The guide is written in a “do this, then this” style, with YAML examples, command snippets, and explanations of why each piece matters for a modern AI platform.
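As a taste of that style, here is a cluster-wide `Proxy` resource of the kind the Cluster Configuration chapters walk through (the proxy hostname, port, and ConfigMap name below are illustrative placeholders, not values from the book):

```yaml
# Cluster-wide egress proxy (config.openshift.io/v1).
# OpenShift expects exactly one Proxy resource, named "cluster".
# proxy.example.com:3128 and user-ca-bundle are placeholder values.
apiVersion: config.openshift.io/v1
kind: Proxy
metadata:
  name: cluster
spec:
  httpProxy: http://proxy.example.com:3128
  httpsProxy: http://proxy.example.com:3128
  noProxy: .cluster.local,.svc,localhost,127.0.0.1
  trustedCA:
    name: user-ca-bundle   # ConfigMap in openshift-config holding the proxy's CA certificate
```

Each snippet in the book is paired with the commands to apply and verify it, plus an explanation of what the cluster does with it.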
If you are a platform engineer, SRE, or infrastructure-minded ML practitioner responsible for OpenShift-based GPU clusters—especially in regulated or disconnected environments—this book gives you a concrete, repeatable blueprint.
Table of Contents
- Introduction - Platform Engineering
- Platform Engineering
- 1. What platform engineering is really about
- 2. Key principles from the CNCF whitepaper
- 3. Why OpenShift is a strong foundation for an internal platform
- 4. Mapping CNCF platform capabilities to OpenShift
- 5. Product & app teams vs capability & service providers
- 6. Building a “thinnest viable platform” on OpenShift
- 7. Making it secure and governed by default
- 8. Measuring success of your OpenShift-based platform
- 9. A practical adoption roadmap
- Installation
- Agent-based Installation in Air-Gapped Environments
- 1. What we’re building
- 2. Mirroring OpenShift and ecosystem images into Quay
- 3. Auth & trust: pull secrets and CA bundle
- 4. The install-config.yaml for air-gapped, bare-metal, agent-based
- 5. AgentConfig: describing your hosts
- 6. Telling the installer to use your mirrored release image
- 7. Generating the Agent ISO (air-gapped aware)
- 8. Booting the nodes & running the install
- 9. Verifying the cluster is using your Quay mirror
- 10. Troubleshooting on the rendezvous host
- 11. Summary: your end-to-end air-gapped flow
- Proxy Configuration for Installation
- 1. Why Agent-based installation with a proxy?
- 2. Where the proxy lives: install-config.yaml
- 3. Example: Agent-based install with your proxy
- 4. Creating the Agent-based config files (with proxy)
- 5. Generating the Agent ISO (with proxy embedded)
- 6. Authentication & custom CA during Agent-based install
- 7. How the proxy behaves during and after install
- 8. Verifying your Agent-based install with proxy
- 9. Tips & pitfalls specific to Agent-based installs with proxy
- 10. Summary
- Local Quay Registry Setup
- 1. Where to specify the local Quay registry
- 2. Example: full install-config.yaml with proxy + local Quay
- 3. Make sure auth and TLS match your Quay
- 4. What happens during agent-based install
- 5. Quick sanity checks
- Using oc-mirror with Quay
- 1. Why oc-mirror v2 + Quay?
- 2. Requirements & network allowlist
- 3. Preparing auth & environment for oc-mirror
- 4. Designing the ImageSetConfiguration (4.20 + operators)
- 5. Running oc-mirror v2: Mirror-to-Disk
- 6. Disk-to-Mirror: pushing into Quay
- 7. Applying cluster resources in OpenShift
- 8. Verifying everything is working with Quay
- 9. Operational tips with oc-mirror v2 and Quay
- 10. Common pitfalls & troubleshooting
- 11. Summary
- Cluster Configuration
- Cluster-Wide Proxy Configuration
- 1. What the OpenShift cluster-wide proxy actually does
- 2. The Proxy resource – core fields
- 3. Basic proxy configuration (no authentication)
- 4. Proxy with basic auth
- 5. Custom CA for the proxy (trustedCA)
- 6. Configuring a proxy at install time vs Day 2
- 7. How workloads interact with the cluster proxy
- 8. Verifying the proxy configuration
- 9. Updating or removing the proxy
- 10. Common pitfalls & best practices
- 11. Summary
- NFS Storage Configuration
- 1. What we’re building
- 2. OpenShift + NFS basics (very short theory)
- 3. Setting up the NFS server (RHEL 9 example)
- 4. Cluster-side requirements (OpenShift)
- 5. Deploying nfs-subdir-external-provisioner on OpenShift
- 6. Understanding the result on the NFS server
- 7. Alternative: using a values.yaml instead of --set
- 8. Creating and using a PVC from OpenShift
- 9. Troubleshooting common issues
- 10. Hardening & best practices
- 11. Static NFS PVs vs. dynamic (what you built)
- 12. Summary of your working configuration
- Air-Gapped Operations
- OperatorHub with Local Quay
- 1. Background: OperatorHub, CatalogSource, ClusterCatalog
- 2. Step 1 – Disable the default catalog sources
- 3. Step 2 – Enable Quay-backed catalogs
- 4. Step 3 – Verify catalog health in openshift-marketplace
- 5. Operational tips & gotchas
- 6. Summary: What you have now
- Operators
- Node Feature Discovery (NFD)
- 1. What Node Feature Discovery actually does
- 2. High-level GitOps structure for NFD
- 3. Namespace: isolating the NFD operator
- 4. OperatorGroup: scoping where the operator works
- 5. Subscription: installing NFD from your internal catalog
- 6. GitOps RBAC: allowing Argo CD to manage NFD CRs
- 7. The NodeFeatureDiscovery CR: how you configure NFD
- 8. How this ties into your GPU / platform stack
- 9. Summary
- NVIDIA GPU Operator
- 1. What the NVIDIA GPU Operator does on OpenShift
- 2. Namespace: where the operator and operands live
- 3. OperatorGroup: scoping the operator
- 4. Subscription: install the certified operator from your Quay-backed catalog
- 5. ClusterPolicy: the GPU Operator’s master configuration
- 6. ConfigMap: device plugin config placeholder
- 7. How this ties in with your NFD & GitOps stack
- 8. Summary
- InstaSlice for Dynamic GPU Partitioning
- InstaSlice on OpenShift (OCP)
- 1. Common Prep (All OpenShift Scenarios)
- 2. OpenShift – Emulated Mode (no GPUs)
- 3. OpenShift – Real GPUs + MIG (A100/H100)
- 4. Day-2 Operations on OpenShift
- 5. Troubleshooting on OpenShift
- Networking
- InfiniBand and RDMA Configuration
- InfiniBand + RDMA on OpenShift AI with SR-IOV and NVIDIA Network Operator (Legacy Mode)
- 0. Prerequisites & Assumptions
- 1. Node Feature Discovery: Label the Right Nodes
- 2. SR-IOV Network Operator: Preparing the Legacy SR-IOV Path
- 3. NVIDIA Network Operator: Enabling DOCA/OFED in Legacy Mode
- 4. NicClusterPolicy: Deploying DOCA/OFED for InfiniBand
- 5. Defining SR-IOV InfiniBand Resources
- 6. Test Pod: GPU + InfiniBand RDMA
- 7. Troubleshooting Checklist
- 8. Files in This Setup
- 9. Where to Go Next
- Observability
- NVIDIA DCGM for GPU Monitoring
- 1. What is NVIDIA DCGM?
- 2. Key Capabilities of DCGM
- 3. DCGM Exporter: Bridge to Prometheus
- 4. DCGM in Kubernetes and OpenShift
- 5. Prometheus Integration: Scraping DCGM Metrics
- 6. Common DCGM / DCGM Exporter Metrics to Watch
- 7. Use Cases: Beyond “Nice Dashboards”
- 8. Best Practices and Gotchas
- 9. Putting It All Together
- Automation
- Ansible Automation with redhatci.ocp
- redhatci.ocp Ansible Collection
- What Is redhatci.ocp?
- Installation and Packaging
- A Tour of the Capabilities
- Example Workflows
- Where redhatci.ocp Fits in the Ecosystem
- Community, Support, and Contributions
- When Should You Use redhatci.ocp?
- OpenShift Installation
- site.yml
- Automate OpenShift GitOps
- What Is OpenShift GitOps?
- ✅ 1. Relevant Roles in redhatci.ocp
- ✅ 2. Minimal Requirements
- ✅ 3. Simple Example: Install GitOps Operator Only
- ✅ 4. Full Example: Install Operator + Configure Repo + Create an App
- ✅ 5. Using the “All-in-One” GitOps Wrapper Role
- ✅ 6. Typical Workflow After a Fresh OCP Install
- OpenShift AI
- Model Registry
- Enabling and Using the Model Registry in Red Hat OpenShift AI (with MySQL)
- 1. What the Model Registry Component Does
- 2. Prerequisites
- 3. Enabling the modelregistry Component in OpenShift AI
- 4. Verifying the Model Registry Component
- 5. Preparing the External MySQL Database
- 6. Creating a Model Registry in the OpenShift AI Dashboard
- 7. Securing the Database Connection (Optional but Recommended)
- 8. Verifying the New Model Registry
- 9. Next Steps: Permissions, Catalog, and MLOps Integration
- 10. Common Troubleshooting Tips
- Wrap-Up
- Run:AI
- Troubleshooting Run:AI
- Troubleshooting Run:ai and NVIDIA GPU Operator on OpenShift
- GitOps
- Repository Settings and Certificates
- 1. What “repository settings” mean in OpenShift GitOps
- 2. Step 1 – Add the GitLab TLS certificate (trust gitlab.example.local)
- 3. Step 2 – Add the GitLab repository with credentials and proxy
- 4. How proxy settings actually work for repositories
- 5. Declarative equivalent: TLS cert & repo with proxy as YAML
- 6. Quick health checks & troubleshooting
- 7. Summary: what you’ve achieved
- Applications and ApplicationSets
- 1. Quick mental model
- 2. Git layout: base / envs pattern
- 3. Argo CD Application: point to a single path
- 4. ApplicationSet: generate many Applications from one template
- 5. ApplicationSet for per-environment CSI Isilon
- 6. How this fits GitOps best practices
- 7. When to use Application vs ApplicationSet
- 8. Summary
- Benchmarking
- GPU Benchmarking with Inference-Benchmarker
- Deploying a GPU LLM Benchmark-as-Code Pipeline on Kubernetes with Inference-Benchmarker
- 1. Cluster preparation: GPUs & namespace
- 2. Clone the Inference-Benchmarker Helm chart
- 3. Configure values.yaml for your GPU benchmark
- 4. (Optional) Persist benchmark results with a PVC
- 5. Install the benchmark stack with Helm
- 6. Track benchmark progress
- 7. Collect the JSON results
- 8. Visualize GPU performance with the Gradio dashboard
- 9. Cleanup
- 10. Architecture overview
- 11. Troubleshooting guide
- 12. Next steps: turning this into a benchmark catalog
- Visualizing Benchmark Results
- Visualizing GPU Benchmark Results with the Inference-Benchmarker Dashboard
- 0. Prerequisites
- 1. Grab the result files from the cluster
- 2. Install the dashboard dependencies (one-time)
- 3. Launch the Gradio dashboard
- 4. Open the web UI
- 5. Pro tips for better visualization workflows
- 6. Cleanup
- 7. Summary