A running joke in the developer community is the idea of throwing your computer out the window when debugging seems endless. AI took this joke a step further. It deleted an entire production environment and started over. After 13 hours of downtime, no one was laughing.
To add fuel to the fire, Amazon's official statement was that this was "a coincidence that AI tools were involved" and "user error, not AI error."
Whether you're on the AI hype train or in the skeptic corner, I think we can all agree this is a wake-up call, not a one-off outage. It's a turning point that forces us to ask what happens when every major tech company pushes 80% AI adoption targets without building the guardrails to match.
In plain English: if we wouldn't give the intern this level of access, why are we giving it to AI?
So What Had Happened Was...
Back in July 2025 (keep note of this date), Amazon launched Kiro, its own internal agentic AI tool. Fast-forward to December 2025: engineers were using the tool, and one gave Kiro access to make changes to a production system. Kiro decided the best way to fix the issue at hand was to delete the environment and start over.
The deletion itself wasn't the issue. That's expected behavior from AI.
The problem stems from Kiro being designed as an extension of the engineer who invoked it. That meant it inherited all the permissions of the senior developer who ran it, with no additional approvals required to execute. So it did its job, which ended in a 13-hour production outage in mainland China regions.
The worst part of this recap: according to multiple Amazon employees, this isn't the first time an AI-related production issue has happened. Despite the company's goal of 80% weekly AI tool adoption, many engineers reported they were hesitant to use these tools because they were known to create more issues than they solved. Even with that internal sentiment, and the tool having been out since July 2025, safeguards weren't in place until after the December outage.
"User Error vs. AI Error" Is the Wrong Frame
Let's take a moment to explore the reality of AI vs. human coding. AI agents that can autonomously chain commands are optimized for efficiency: make decisions as quickly as possible, execute as fast as possible, all to move on to the next to-do item.
By comparison, humans seem flawed until we get to errors. We move slower, type things out by hand, and pause to think. We reach out to other humans. Most of us would feel anxious about deleting a production environment and would take a natural pause to question whether it was the right direction. It wouldn't take a human long to know deleting prod is never the answer.
AI changes the risk profile of what gets delivered, and if we don't acknowledge that, more 13-hour outages are coming.
Most of us using AI tools to code have taken on the mantra that AI is just another tool in our toolbox as engineers. And that's clearly reflected in the way Kiro was designed: by design, the user who invokes the AI also gives it the privileges they have. So for Amazon to classify this as "user error" is like blaming the driver for a crash when the car shipped without brakes. The system was designed to allow this to happen when it could have been prevented.
Calling it "user error" hides the systemic failure to save face, and it misses the opportunity to address the real issue at a moment when the whole industry could learn from it.
Amazon pushed for 80% adoption rates without equally pushing for quality, safety, and oversight. The things we track are the qualities we promote and the culture we create. If you don't explicitly track quality, safety, security, and responsible AI use with adoption, you're building a culture that has to unlearn bad habits before it can reach a sustainable goal.
That's why Amazon's framing of this as user error didn't land. From the beginning, the company pushed adoption without pushing safeguards. This was bound to happen.
So What Should Have Been in Place?
I've spent years shipping ML and AI solutions in highly regulated spaces: healthcare, education, insurance. The consequences of something going wrong aren't just a 13-hour outage. They're a wrong diagnosis. The incorrect medication sent out.
This experience led me to create a framework for breaking down AI workflows by risk level. Having a framework like this is just the starting point for safeguards that could have been in place.
The Phoenixing Oversight Framework
This framework started as a necessity. As an early adopter of AI coding tools, I quickly learned they were most reliable for simple tasks and grunt work: linting, formatting, boilerplate. The more complex and higher risk the task, the more human oversight was needed.
What began as a practical way of working has evolved into a philosophy for how teams, especially in regulated industries, can responsibly increase the amount of AI-generated code that's pushed to production. It's a realistic way of working that acknowledges these models will have major flaws before they're reliable enough for full autonomy.
The question isn't whether to use AI. It's how much oversight each task demands.
The Five Levels
Level 0: Full Autonomy - AI acts, logs, human reviews later
- Use for: Linting, formatting, auto-generated test scaffolding, documentation updates
- Risk profile: Low. Reversible. No production impact.
- Example: AI auto-formats code on commit. Nobody needs to approve this.
Level 1: Notify and Proceed - AI acts, immediately notifies human
- Use for: Non-critical code changes, dependency updates, minor refactors
- Risk profile: Medium-low. Reversible with minor effort.
- Example: AI updates a non-breaking dependency and pings the team channel.
Level 2: Propose and Wait - AI proposes, human approves before execution
- Use for: Database migrations, API contract changes, configuration updates
- Risk profile: Medium. May require coordination. Not trivially reversible.
- Example: AI drafts a migration script and opens a PR. Human reviews and merges.
Level 3: Collaborative - AI drafts, human modifies, AI executes
- Use for: Infrastructure changes, security-sensitive modifications, cross-service changes
- Risk profile: High. Affects multiple systems. Hard to reverse.
- Example: AI proposes an infrastructure change. Human reviews, adjusts scope, and AI executes the approved version with a human watching.
Level 4: Human Only with AI Assist - Human decides and acts, AI just provides context
- Use for: Production deployments, data deletion, environment management, access control changes
- Risk profile: Critical. Irreversible or extremely costly to reverse.
- Example: Human decides to modify production. AI surfaces relevant context, flags risks, suggests approach. Human executes.
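To make the levels concrete, here's a minimal sketch of the framework as an execution gate an AI tool would have to pass before acting. Everything in it (the level names, the task-to-level mapping, the `ai_may_execute` helper) is illustrative, not part of Kiro or any real Amazon API:

```python
from enum import IntEnum

class OversightLevel(IntEnum):
    FULL_AUTONOMY = 0       # AI acts, logs, human reviews later
    NOTIFY_AND_PROCEED = 1  # AI acts, then immediately notifies
    PROPOSE_AND_WAIT = 2    # human approves before execution
    COLLABORATIVE = 3       # human modifies the draft, AI executes
    HUMAN_ONLY = 4          # human decides and acts, AI only assists

# Illustrative mapping of task types to the oversight each demands.
TASK_LEVELS = {
    "lint": OversightLevel.FULL_AUTONOMY,
    "dependency_update": OversightLevel.NOTIFY_AND_PROCEED,
    "db_migration": OversightLevel.PROPOSE_AND_WAIT,
    "infra_change": OversightLevel.COLLABORATIVE,
    "prod_environment": OversightLevel.HUMAN_ONLY,
}

def ai_may_execute(task: str, approved_by_human: bool = False) -> bool:
    """Return True if the AI agent may run this task itself."""
    # Unknown tasks default to the strictest level.
    level = TASK_LEVELS.get(task, OversightLevel.HUMAN_ONLY)
    if level <= OversightLevel.NOTIFY_AND_PROCEED:
        return True  # Levels 0-1: AI acts on its own
    if level <= OversightLevel.COLLABORATIVE:
        return approved_by_human  # Levels 2-3: human sign-off first
    return False  # Level 4: never executed by AI, no exceptions

print(ai_may_execute("lint"))                             # True
print(ai_may_execute("db_migration"))                     # False
print(ai_may_execute("prod_environment", approved_by_human=True))  # False
```

Note the design choice in the last line: for Level 4 tasks, even an explicit human approval flag doesn't let the AI execute, because the human is the one who acts.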
Using this framework, the Amazon incident is clear: Kiro had the ability to operate at Level 0 on a Level 4 task. That's not "user error." That's calibration failure.
Working in healthcare has made me double down on getting this calibration right. Here's how the framework could apply to real scenarios:
- A model that generates summaries? Level 0.
- A model that flags potential health risk? Level 2 at minimum.
- A model that influences a clinical decision? Level 4. No exceptions.
Risk determines the oversight. Not convenience, not speed, not adoption targets. If it's an afterthought, it's too late.
The Pattern Is Here. It's Not Just Amazon.
The tech industry has always tried to move fast and break things. But before, we were moving at human speed. Now it's powered by AI. The pattern is becoming impossible to ignore: as more companies ship AI-generated code with fewer humans in the loop, the scale of mistakes grows with it.
Company after company is racing to hit that perfect adoption number. They want to prove that AI investments are paying off. They want to show that they can move faster and have bigger impact by leveraging these new tools. But adoption targets without guardrails aren't a measure of progress. They're a measure of risk exposure.
- Microsoft: ~30% of code now AI-generated. Satya Nadella has been public about this.
- NVIDIA: Mandating AI tool use for 30,000 engineers.
- GitHub outage (Feb 9, 2026): When GitHub went down, AI coding agents across the industry couldn't push code, run CI/CD, or open PRs, revealing how deeply integrated (and dependent) these tools have become.
- SailPoint survey: 80% of IT professionals have seen AI agents act unexpectedly or perform unauthorized actions.
- The adoption guardrails gap: Every company is racing to adopt. Almost none are racing to build the oversight infrastructure.
The Phoenixing Oversight Framework isn't optional. It's the minimum.
It's not a coincidence that NIST launched an AI agent standards initiative February 17th to help focus on identity, security, and authorization. If the federal government can see what's coming next, can you?
Now What?
Whether you're an engineering manager, director, VP, or an engineer looking for answers, here are four things to implement this week:
1. Audit your AI tool permissions. Map every AI tool your team uses to the level of access it has. If any tool has write access to production without human approval, fix that today.
2. Classify your workflows by risk level. Use the Phoenixing Oversight Framework. Not every task needs Level 4 oversight (that wouldn't help the metrics leadership wants to see from AI coding tools), but every task needs the RIGHT level.
3. Require peer review for AI-initiated production changes. Amazon implemented this AFTER their outage. You can implement it before yours.
4. Track quality, not just adoption. It's fine to say "80% of engineers should use AI tools weekly." It's dangerous to leave out quality metrics and mix adoption with the belief that AI tools should act autonomously. What you track becomes the norm. Adoption without quality is a recipe for burnout and poor system health.
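The permissions audit is the one you can start today. A hedged sketch of what it might look like, using the framework's level numbers (0 = full autonomy through 4 = human only). The tool names and inventory format here are hypothetical, invented for illustration; the point is the comparison between the autonomy a tool operates at and the oversight its riskiest task demands:

```python
# Hypothetical inventory: for each tool, the oversight level it
# effectively operates at, and the strictest task it can touch.
inventory = [
    {"tool": "lint-bot",          "operating": 0, "required": 0},
    {"tool": "dep-updater",       "operating": 1, "required": 1},
    {"tool": "agentic-assistant", "operating": 0, "required": 4},  # the Kiro pattern
]

def audit(tools):
    """Flag tools whose autonomy exceeds what their riskiest task allows.

    A tool operating at level N should only touch tasks that require
    oversight level N or lower; anything stricter is a calibration gap.
    """
    findings = []
    for t in tools:
        if t["required"] > t["operating"]:
            findings.append(
                f"{t['tool']}: operates at level {t['operating']} "
                f"on level {t['required']} tasks"
            )
    return findings

for finding in audit(inventory):
    print(finding)
```

Run against this inventory, the audit flags only the agentic assistant: full autonomy on a human-only task, which is exactly the mismatch behind the outage.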
Final Thoughts
When AI fails at this stage, it's because the system was designed to let it fail. Pushing the wrong metric creates oversight gaps. Mismatched permissions in production environments become the perfect storm for major outages.
Amazon's AI didn't go rogue. It did what it was designed to do, with the permissions it was given, in the environment it was pointed at.
Humans can only make the mistakes that systems allow. Errors like these teach us where the cracks in our systems are. If we focus on risk from the start, we'll have fewer incidents like this in the future.
If we don't, the next outage won't be 13 hours. It'll be worse.
Want help implementing AI governance frameworks for your team? Book a discovery call to discuss how to ship AI safely.