
How Cloudflare Strengthened Its Network: The Inside Story of 'Code Orange: Fail Small'

Published: 2026-05-06 20:29:05 | Category: Education & Careers

Cloudflare recently completed a major engineering initiative called Code Orange: Fail Small, aimed at making its infrastructure more resilient, secure, and reliable. After global outages in late 2025, the team focused on safer configuration changes, reducing failure impact, improving break-glass procedures, and enhancing customer communication. This Q&A dives into the key changes and what they mean for users.

1. What was Code Orange: Fail Small, and why did Cloudflare undertake it?

Code Orange: Fail Small was an intensive, multi-quarter engineering effort starting after Cloudflare experienced two major global outages on November 18, 2025, and December 5, 2025. The project aimed to prevent similar incidents by making the network more resilient. Key areas included safer configuration deployments, reducing the blast radius of failures, revising emergency access procedures, improving incident management, and preventing configuration drift over time. The team designed new tools and methodologies so that internal changes no longer reach the network instantly; instead, they roll out progressively with real-time health monitoring. While resiliency is never truly “done,” the specific work that would have avoided those two outages is now complete. This means that future configuration changes are far less likely to cause widespread disruption, and if problems do occur, automated systems can catch and revert them before they affect customer traffic.

Source: blog.cloudflare.com

2. How did Cloudflare make configuration changes safer?

Configuration changes were a root cause of both outages. Previously, some changes propagated instantly across Cloudflare’s global network, leaving no time to catch issues. The fix was to adopt health-mediated deployment for all high-risk configuration pipelines. Instead of instant updates, changes are now bundled into packages and released gradually to a subset of servers while real-time health metrics are monitored. If a problem is detected, the system automatically rolls back the change before it can impact more customers. A new internal component called Snapstone unifies this process. Snapstone allows teams to define any unit of configuration — whether a data file (like the one behind the November outage) or a control flag (like the one behind December’s) — and apply progressive rollout, health monitoring, and automated rollback. This replaces the previous ad-hoc approach, which required significant per-team effort and was inconsistently applied across the network.
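To make the idea concrete, here is a minimal sketch of health-mediated deployment — progressive rollout with automatic rollback on a degraded health metric. All names (`progressive_rollout`, the stage fractions, the error-rate threshold) are illustrative assumptions, not Cloudflare's actual implementation:

```python
def healthy(error_rate: float, threshold: float = 0.01) -> bool:
    """A health check passes while the observed error rate stays below a threshold."""
    return error_rate < threshold

def progressive_rollout(servers, apply_change, revert_change, observe_error_rate,
                        stages=(0.01, 0.05, 0.25, 1.0)):
    """Roll a change out to progressively larger fractions of the fleet.

    After each stage, consult a real-time health signal; if it degrades,
    revert every server touched so far and abort before the change spreads.
    """
    deployed = []
    for fraction in stages:
        # Extend the deployment up to the next fraction of the fleet.
        target = servers[len(deployed):int(len(servers) * fraction)]
        for server in target:
            apply_change(server)
            deployed.append(server)
        # Gate the next stage on live health metrics.
        if not healthy(observe_error_rate(deployed)):
            for server in reversed(deployed):
                revert_change(server)
            return False  # rolled back before reaching the whole fleet
    return True  # change is live everywhere
```

The contrast with the old behavior is the loop itself: a bad change stops at the first small stage instead of reaching every server at once.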

3. What is Snapstone, and how does it help prevent outages?

Snapstone is a new internal system Cloudflare built to bring health-mediated deployment to configuration changes. Before Snapstone, applying progressive rollout and health monitoring to configuration was possible but difficult; each team had to build custom tooling, leading to inconsistent coverage. Snapstone provides a unified platform where teams can define any unit of configuration — whether it’s a data file or a control flag — and have it deployed gradually with real-time health checks. If the health metrics degrade, Snapstone automatically triggers a rollback. This flexibility means it’s not a fix for specific past failures but a general mechanism that adapts to any future configuration risk. By using Snapstone, Cloudflare ensures that even high-risk changes are now deployed with the same safeguards as software releases, dramatically reducing the chance of a global outage caused by a bad configuration push.
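Snapstone's internals are not public, but its key property — any unit of configuration, deployed through one stage-then-commit path with rollback on a failed health check — can be sketched. `ConfigUnit`, `ConfigStore`, and `deploy` below are hypothetical names for illustration only:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ConfigUnit:
    """One unit of configuration: a data file's contents, a control flag,
    or any other value a team wants rolled out safely."""
    name: str
    value: Any

class ConfigStore:
    """Hypothetical store in the spirit of Snapstone: every change, regardless
    of type, goes through the same guarded path."""
    def __init__(self):
        self._live: dict[str, Any] = {}

    def deploy(self, unit: ConfigUnit, health_check: Callable[[], bool]) -> bool:
        previous = self._live.get(unit.name)
        self._live[unit.name] = unit.value  # stage the new value
        if health_check():
            return True  # commit: the new value stays live
        # Health degraded: automatically restore the prior state.
        if previous is None:
            del self._live[unit.name]
        else:
            self._live[unit.name] = previous
        return False

    def get(self, name: str) -> Any:
        return self._live.get(name)
```

The design point is that the safeguards live in the store, not in per-team tooling, so a data file and a feature flag get identical protection without custom work.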

4. What other areas did the project improve besides configuration changes?

Beyond safer configuration changes, Code Orange: Fail Small addressed several other critical areas to strengthen the network. Reducing the impact of failure meant designing systems so that when something goes wrong, only a small portion of traffic is affected, not the entire network. Teams revised break-glass procedures — emergency access protocols for engineers to fix critical issues quickly — to ensure they are both secure and effective while minimizing risk. Incident management was overhauled with clearer roles, faster escalation paths, and better coordination across teams. To prevent drift and regressions, the project introduced automated testing and monitoring that continuously checks for deviations from the desired configuration state. Finally, communication during outages was strengthened: Cloudflare now provides more transparent, timely updates to customers, including clearer explanations of root causes and expected resolution times, so users are never left in the dark.
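The drift-checking idea mentioned above reduces to continuously diffing the live state against the desired state. A minimal sketch (the function name and the shape of the report are assumptions, not Cloudflare's tooling):

```python
def detect_drift(desired: dict, actual: dict) -> dict:
    """Return every setting whose live value differs from the desired state,
    including settings missing on either side."""
    drift = {}
    for key in desired.keys() | actual.keys():
        want, have = desired.get(key), actual.get(key)
        if want != have:
            drift[key] = {"desired": want, "actual": have}
    return drift
```

Run on a schedule, a check like this turns silent configuration drift into an alert (or an automated correction) before it compounds into a regression.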


5. How does this affect Cloudflare customers?

For most customers, the benefits are invisible but impactful. The most important change is that internal configuration changes no longer reach the Cloudflare network instantly. Instead, they are rolled out progressively with real-time health monitoring, meaning problems can be caught and reverted before affecting customer traffic. This reduces the probability of large-scale outages that disrupt websites, APIs, or other services. Even if a change does go wrong, the blast radius is smaller, so fewer customers experience issues. Additionally, improved incident management and customer communication mean that if an incident does occur, you’ll receive clearer, more timely updates about what’s happening and when services will resume. In short, Cloudflare’s network is now more resilient, secure, and reliable — and customers can expect fewer surprises and faster resolution if issues arise.

6. What were the root causes of the November and December 2025 outages?

The November 18, 2025 outage was caused by a configuration change that involved a data file. This file was pushed out instantly across Cloudflare’s network, and it contained an error that affected traffic processing. Because there was no health-mediated rollout, the bad configuration spread globally within seconds, resulting in widespread disruption. The December 5, 2025 outage had a similar pattern but involved a control flag in Cloudflare’s global configuration system. Again, the flag was changed without progressive deployment or monitoring, and the impact was immediate and network-wide. Both incidents highlighted the lack of safety mechanisms for configuration changes — exactly what Code Orange: Fail Small was designed to fix. The project’s completed work ensures that both data files and control flags are now deployed via Snapstone with gradual rollout, health checks, and automatic rollback, so such errors cannot propagate unchecked.

7. Will Cloudflare ever be 100% outage-free?

No system can guarantee zero outages, and Cloudflare acknowledges that improving resiliency is an ongoing process, never truly “done.” Code Orange: Fail Small focused specifically on the type of failures that caused the November and December incidents — configuration changes that went bad. Those risks are now greatly mitigated by Snapstone and health-mediated deployment. However, other failure modes — such as network fiber cuts, hardware failures, or software bugs in third-party components — can still cause disruptions. Cloudflare continues to invest in redundancy, disaster recovery, and chaos engineering to minimize impact. The project also introduced measures to prevent drift and regressions over time, so the improvements are sustained. While customers can expect fewer and shorter outages, absolute perfection is not realistic. The key is that Cloudflare now has stronger safeguards in place, faster detection, and better communication — making the network more resilient than ever.