Preface
This book started as a Slack message.
After nine months of building AI agents on AWS Bedrock in a large enterprise, I sent a message to my team lead: “We should write down everything we learned. Nobody else is writing about this stuff — the real stuff, not the tutorial stuff.”
The “real stuff” was everything that happened after the demo. The demo took five days. Getting the agent to production took the rest of the year. And the gap between those two milestones was filled with problems that no AWS documentation, no Medium article, no conference talk had prepared us for.
VPC endpoints that silently dropped connections. IAM policies with explicit denies that blocked every deployment path we tried. A managed service that promised stateful workflows but forgot your state every 60 seconds. A database that slowed to a crawl because we used DELETE instead of TTL. A proxy configuration change at 2 AM that took the entire agent offline — and left zero error logs to explain why.
None of this was in the Bedrock getting-started guide.
Who This Book Is For
You are an engineer, architect, or tech lead at a company with more than a few hundred engineers. Your company uses AWS. Someone — maybe you, maybe your VP — decided that AI agents are the next thing to ship. You built a demo, it worked great, and now you need to put it in production.
That is where this book picks up.
I am not going to teach you what a large language model is. I am not going to walk you through the AWS console to create your first agent. AWS has tutorials for that, and they are fine. What AWS does not have is a guide to surviving the enterprise reality: the IAM policies that block you, the networking that silently breaks, the state management you have to build yourself, the cost tracking you forgot until it was too late, and the security review that takes longer than the actual coding.
This book is that guide.
One piece of framing before we start: an AI agent is not a new kind of software. It is a Lambda function where the decision-making happens to be a language model instead of a switch statement. It reads inputs, picks an action, executes it, and checks the result. The LLM replaces the if/else tree — it does not replace the infrastructure around it. You still need IAM roles, VPC endpoints, deployment pipelines, monitoring, and cost controls. Everything you already know about building production systems still applies. The LLM is a component, not a revolution.
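As a minimal sketch of that framing, the loop below reads inputs, picks an action, executes it, and checks the result. Everything in it is hypothetical: `choose_action` is a rule-based stand-in for the model call (no Bedrock API is used), and the tool names and `run_agent` exist only for illustration. The point is the shape: the decision step is swappable, while the loop around it is ordinary code.

```python
from typing import Callable, Dict

# Hypothetical tools; in a real system these would call internal APIs.
TOOLS: Dict[str, Callable[[dict], dict]] = {
    "lookup_ticket": lambda state: {**state, "ticket": "found"},
    "close_ticket": lambda state: {**state, "status": "closed"},
}


def choose_action(state: dict) -> str:
    """Stand-in for the LLM: pick the next action from the current state.

    A rule-based version is used so the sketch runs without a model.
    """
    if "ticket" not in state:
        return "lookup_ticket"
    if state.get("status") != "closed":
        return "close_ticket"
    return "done"


def run_agent(state: dict, max_steps: int = 5) -> dict:
    """Read inputs, pick an action, execute it, check the result."""
    for _ in range(max_steps):
        action = choose_action(state)
        if action == "done":
            return state
        state = TOOLS[action](state)
    # The step budget is ordinary defensive code, not an agent feature.
    raise RuntimeError("step budget exhausted")


result = run_agent({"request": "close ticket 123"})
```

Swapping `choose_action` for a model call changes one function. The tool registry, the step budget, and the error handling stay the same, which is the sense in which the LLM is a component.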
This matters because the marketing around AI agents implies you are building something fundamentally new. You are not. You are building automation. The sooner your team internalizes that framing, the faster you will ship.
But here is what makes this automation different from every integration you have built before: it can reason. A traditional automation fails on an unexpected API response and calls you at 3 AM. An agent reads the error, correlates it with what it knows about the system, and decides whether to retry, escalate, or take a different path — all with guardrails that prevent it from doing anything destructive. That gap between “automation that follows a script” and “automation that understands the situation” is where enterprise AI agents live. And that gap is worth the engineering effort it takes to get there.
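A rough sketch of that retry, escalate, or reroute decision, with a guardrail in front of it, might look like the following. The rules in `decide` are a deterministic stand-in for the model's reasoning, and the status codes, action names, and allowlist are assumptions made up for illustration, not a description of how Bedrock Agents implement guardrails.

```python
from enum import Enum


class Decision(Enum):
    RETRY = "retry"
    ESCALATE = "escalate"
    FALLBACK = "fallback"


# Hypothetical guardrail: the only actions the agent may take on its own.
ALLOWED_ACTIONS = {"retry_request", "use_cached_data", "page_on_call"}


def decide(status_code: int, attempts: int, max_attempts: int = 3) -> Decision:
    """Stand-in for the model's reasoning about a failed API call."""
    if status_code in (429, 503) and attempts < max_attempts:
        return Decision.RETRY      # transient failure: worth another try
    if status_code == 404:
        return Decision.FALLBACK   # resource missing: take a different path
    return Decision.ESCALATE       # anything else goes to a human


def guarded(action: str) -> str:
    """Reject any proposed action that is not on the allowlist."""
    if action not in ALLOWED_ACTIONS:
        raise PermissionError(f"action not permitted: {action}")
    return action
```

In this sketch the allowlist check runs in plain code outside the decision step, so a bad decision cannot widen its own permissions.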
How This Book Is Organized
Part 1 (Chapters 1-2) sets the stage. What makes enterprise different, how Bedrock Agents actually work under the hood, and where the managed service stops and your code begins.
Part 2 (Chapters 3-5) covers building the agent. Prompt engineering that works in production (not in demos), action groups and tool integration — including the “Agent Factory” pattern that lets non-coders build agents — and data architecture for real-time workflows.
Part 3 (Chapters 6-9) is the hard part. IAM and security, enterprise networking, deployment automation, and cost engineering. These chapters exist because this is where we spent 70% of our time. The code was the easy part. Getting it through security review, deploying it behind VPC endpoints, and keeping the LLM bill under control: that was the job.
Part 4 (Chapters 10-13) covers production operations. Testing non-deterministic systems, observability that actually helps you debug agent behavior, a production checklist, and the full list of lessons we learned the hard way.
The Appendices contain ready-to-use material: CloudFormation templates, IAM policies, agent instruction samples, a cost calculator, and a troubleshooting guide covering the 13 most common errors we hit.
A Note on Code Samples
All code in this book reflects real-world enterprise patterns that I have encountered and worked with. Company names, internal URLs, application identifiers, and API schemas are fictional — created to illustrate the architecture without referencing any specific organization. The patterns, the failure modes, and the solutions are drawn from hands-on experience. The names and numbers are made up.
Every CloudFormation template, IAM policy, and code sample uses parameterized placeholders (marked with # REPLACE comments). Copy them, fill in your values, and they work. We designed them that way on purpose — a template with your company’s ARNs hardcoded is useless to everyone else. A template with ${AWS::AccountId} placeholders is useful to everyone.
Acknowledgments
This book would not exist without the team that built the system it describes. We argued about architecture in pull request comments, debugged proxy issues at midnight, and spent three weeks in meetings about a single IAM permission. The lessons in this book are theirs as much as mine.
Thanks also to the enterprise platform and security teams who said “no” to our first seventeen deployment attempts. You made the final architecture better. We did not enjoy the process.
This book reflects the state of AWS Bedrock Agents as of early 2026. AWS moves fast. Some limitations described here may be addressed by the time you read this. The architectural patterns and enterprise constraints, however, tend to stick around.