Can you let an AI assistant near a production Kubernetes cluster without it doing something stupid? Not "probably won't" — cannot. This book answers that with a concrete artifact: a safe, governed AI SRE agent skill that reviews a platform running on Talos Linux and helps your team understand its health, its risks, and its options — read-only by default, and structurally unable to mutate the cluster.
You build it one capability per chapter against a local, throwaway Talos lab: cluster health and reliability, security drift, certificate-expiry prediction, a scored platform maturity report, and GitOps remediation, where the agent proposes fixes as pull requests: it changes Git, never the cluster. The guardrails aren't promises; they're enforced in code (a read-only allow-list, asking which cluster before doing anything, showing every command, and refusing any unrecognized context).
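To give a feel for what "enforced in code" can mean, here is a minimal sketch of the four guardrails as a single check. It is an illustration under stated assumptions, not the book's implementation: the names `READ_ONLY_VERBS`, `KNOWN_CONTEXTS`, `GuardrailError`, and `check_command`, and the context name `talos-lab`, are all hypothetical, and the sketch only covers `kubectl`.

```python
import shlex
from typing import List, Optional

# Hypothetical allow-list: only kubectl verbs that read state are permitted.
READ_ONLY_VERBS = frozenset({"get", "describe", "logs", "top", "explain", "version"})

# Hypothetical set of contexts the operator has explicitly registered.
KNOWN_CONTEXTS = frozenset({"talos-lab"})


class GuardrailError(Exception):
    """Raised when a proposed command violates a guardrail."""


def check_command(command: str, context: Optional[str]) -> List[str]:
    """Validate a proposed kubectl command and return the argv that would run.

    Fails closed: anything not explicitly allowed is refused.
    """
    # Guardrail: ask which cluster first.
    if context is None:
        raise GuardrailError("no cluster selected; ask which cluster first")
    # Guardrail: refuse an unrecognized context.
    if context not in KNOWN_CONTEXTS:
        raise GuardrailError(f"unrecognized context: {context!r}")

    argv = shlex.split(command)
    if len(argv) < 2 or argv[0] != "kubectl":
        raise GuardrailError("only kubectl commands are considered in this sketch")
    # The command may not override the context the operator chose.
    if any(a == "--context" or a.startswith("--context=") for a in argv[1:]):
        raise GuardrailError("the command may not set its own --context")

    # Guardrail: read-only by allow-list. The verb must come first, so a
    # leading flag (for example "-n") is refused rather than skipped over.
    verb = argv[1]
    if verb not in READ_ONLY_VERBS:
        raise GuardrailError(f"verb {verb!r} is not on the read-only allow-list")

    full = ["kubectl", "--context", context, *argv[1:]]
    # Guardrail: show every command before it runs.
    print("would run:", shlex.join(full))
    return full
```

Calling `check_command("kubectl get pods -A", "talos-lab")` returns the argv with the chosen context pinned, while `check_command("kubectl delete pod x", "talos-lab")` raises `GuardrailError`. Returning an argument list rather than a shell string is a deliberate choice in this sketch: it keeps shell metacharacters in the input from being interpreted when the command is eventually executed.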
Then it goes past the toy. You'll tackle vulnerabilities and fearless upgrades (Talos's dry-run preflight and atomic A/B rollback), the honest economics of leaving the cloud for bare metal and on-prem, running stateful databases and a data lakehouse on Kubernetes, and finally a sovereign, air-gapped AI operator that runs the model itself on your own hardware — so nothing, not even the reasoning, leaves the building.
Two commitments run through every page. Safety is a property of the system, not a slogan — you can read exactly why each action is or isn't allowed. And the numbers are honest — where the book quotes a result, it was measured on a real, hourly-rented bare-metal cluster, and where something wasn't measured, it says so. The companion repository holds every experiment so you can run it, break it, and measure it yourself.