
5 Critical Insights into a CUBIC Congestion Control Bug in QUIC

Published: 2026-05-16 15:03:07 | Category: Linux & DevOps

When you hear the word 'idle,' you probably think of inactivity—a quiet network with no data flowing. But in the world of congestion control, 'idle' can trigger a cascade of hidden behaviors. This article dives into a fascinating bug where a Linux kernel optimization for CUBIC—the default congestion controller in most TCP and QUIC connections—backfired when ported to Cloudflare's QUIC implementation, quiche. The result? The congestion window got permanently stuck at its minimum, preventing recovery from network congestion. We'll walk through five key takeaways from this detective story.

1. How CUBIC Manages Network Congestion

At the heart of every loss-based congestion control algorithm (CCA) is the congestion window (cwnd)—a sender's cap on how many bytes can be in flight at once. CUBIC, standardized in RFC 9438, grows cwnd aggressively when the network is healthy and shrinks it on detecting packet loss. This 'probe and back off' logic aims to maximize throughput without overwhelming the network. In its normal state, CUBIC gracefully balances bandwidth probing and loss recovery. But behind this simplicity lies a delicate state machine, especially when the connection becomes application-limited—meaning the sender has no data to push, a common scenario in QUIC's multiplexed streams.
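The 'probe and back off' behavior above comes from CUBIC's signature growth curve. A minimal sketch in Python, using the constants recommended in RFC 9438 (C = 0.4, beta = 0.7) and simplified units of segments rather than bytes:

```python
# Minimal sketch of CUBIC's window-growth curve from RFC 9438.
# Constants use the RFC's recommended values; units are simplified.
C = 0.4      # scaling constant controlling growth aggressiveness
BETA = 0.7   # multiplicative decrease factor applied on loss

def cubic_k(w_max: float) -> float:
    """Time K (seconds) for the curve to climb back to w_max after a loss."""
    return ((w_max * (1 - BETA)) / C) ** (1 / 3)

def cubic_window(t: float, w_max: float) -> float:
    """W_cubic(t) = C * (t - K)^3 + w_max, where t is the time in seconds
    since the current congestion-avoidance epoch started."""
    return C * (t - cubic_k(w_max)) ** 3 + w_max

# Right after a loss the window sits below w_max (concave region),
# plateaus near w_max around t = K, then probes above it (convex region).
w_max = 100.0
assert cubic_window(0.0, w_max) < w_max
assert abs(cubic_window(cubic_k(w_max), w_max) - w_max) < 1e-9
assert cubic_window(2 * cubic_k(w_max), w_max) > w_max
```

The key input is `t`, the elapsed time within the current epoch; the bug described below is entirely about how that elapsed time gets measured.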

Source: blog.cloudflare.com

2. A Test That Failed 61% of the Time

Cloudflare's integration tests for its ingress proxy juggle dozens of congestion control scenarios. One particular test, designed to simulate heavy packet loss early in a connection, kept failing—61% of the time. This wasn't a fluke; it was a systematic bug. Recovery from congestion collapse is an edge case rarely explored in CCA tests, which usually focus on steady-state throughput. But here, after the connection was beaten down to its minimum cwnd, the congestion window never grew back. The controller remained stuck, effectively halting data transfer. This symptom pointed to something deeper: a mismatch between how CUBIC handles 'idle' periods and how quiche interprets time since last data transmission.

3. A Linux Kernel Optimization Goes Wrong

The root cause traces back to a Linux kernel patch intended to bring CUBIC in line with RFC 9438's provisions for application-limited flows. In TCP, when the sender is app-limited (it has no data to send), the algorithm should not count that idle period as time available for window growth. The kernel fix correctly excluded these intervals from CUBIC's epoch calculations. However, when quiche—Cloudflare's QUIC implementation—ported that same logic, it surfaced an unintended consequence. QUIC's multiplexed nature means a connection can be app-limited on one stream while other streams are active. The ported code misinterpreted these states, and as a result the controller never refreshed the timestamp it used to measure idle time after loss recovery.
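The mechanics of the exclusion can be sketched as follows. Time spent app-limited is subtracted from the elapsed time fed into the growth curve, equivalent to the kernel's trick of shifting the epoch start forward by the idle gap. This is an illustrative model, not quiche's actual code; the field names are made up:

```python
# Hedged sketch of the app-limited exclusion: idle time is accumulated
# and subtracted from the epoch's elapsed time, so an idle sender does
# not "bank" window growth. Names here are illustrative, not quiche's.
from dataclasses import dataclass

@dataclass
class Epoch:
    start: float             # when this congestion-avoidance epoch began (s)
    idle_accum: float = 0.0  # total app-limited time excluded from growth (s)

    def note_idle(self, idle_duration: float) -> None:
        # Equivalent to shifting epoch_start forward by the idle gap.
        self.idle_accum += idle_duration

    def effective_t(self, now: float) -> float:
        # The t actually fed into W_cubic(t).
        return (now - self.start) - self.idle_accum

epoch = Epoch(start=0.0)
epoch.note_idle(5.0)                    # 5 s spent with nothing to send
assert epoch.effective_t(8.0) == 3.0    # only 3 s count toward growth
```

The correctness of this scheme hinges on accurately classifying intervals as idle, which is exactly where the port went wrong.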


4. Root Cause: When 'Idle' Is Not Idle

The key insight is that in QUIC, the notion of 'idle' is more nuanced than in TCP. The Linux patch assumed that if a connection is app-limited, no congestion window growth should occur. But quiche's state machine tracked a single 'last successful send' timestamp, and after a loss event this variable stopped being updated once the connection was deemed 'idle' by the app-limited exclusion. With the timestamp frozen in the past, every subsequent interval looked like idle time and was excluded from the epoch calculation, so the effective elapsed time never accumulated enough for the window to grow. The cwnd remained pinned at its minimum indefinitely: each loss recovery phase started a new epoch only to stall again, and the cycle repeated endlessly—a classic 'stuck at minimum' bug.
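A toy model makes the stuck cycle concrete. Assuming a hypothetical `last_send` timestamp that, per the bug, is never refreshed, every ack-to-ack gap is classified as idle and excluded, so the effective epoch time never advances:

```python
# Toy reproduction of the "stuck at minimum" cycle. The variable names
# (last_send, MIN_CWND) are illustrative, not quiche's actual fields.
MIN_CWND = 2.0

def grow(cwnd: float, effective_t: float) -> float:
    # Stand-in for CUBIC growth: any positive effective time grows cwnd.
    return cwnd + effective_t

cwnd = MIN_CWND
last_send = 0.0              # the bug: this is never updated again
for now in (1.0, 2.0, 3.0):  # three rounds of loss recovery
    idle_gap = now - last_send      # ever-growing because last_send is stale
    excluded = idle_gap             # whole interval excluded as "app-limited"
    effective_t = now - excluded    # always 0.0
    cwnd = grow(cwnd, effective_t)

assert cwnd == MIN_CWND  # the window never recovers
```

Data is flowing in this scenario, yet the stale timestamp makes the connection look perpetually idle, which is precisely the misclassification the article describes.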

5. The One-Line Fix That Saved the Day

The solution was surprisingly elegant. The quiche team realized that the app-limited exclusion should only skip window growth if the connection had been truly idle—i.e., no data was sent at all. By adding a single check to update the 'last successful' timestamp even during app-limited periods, they broke the cycle. The fix allowed the congestion window to recover normally after loss, even when the application had brief gaps in data transmission. This near-one-line code change restored reliability to Cloudflare's QUIC traffic and highlighted how subtle protocol interactions can cause dramatic failures. The story also underscores the value of testing edge cases in congestion control—especially the 'idle' state that turns out to be anything but idle.
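Applying the fix to the same toy model shows the cycle breaking. The one changed line refreshes the timestamp every round, so only genuine gaps are excluded and the effective epoch time advances again. As before, the names and the idle threshold are illustrative assumptions, not quiche's real code:

```python
# Same toy loop as the bug model, with the (hypothetical) fix applied:
# refresh last_send each round so only genuine idle gaps are excluded.
MIN_CWND = 2.0

def grow(cwnd: float, effective_t: float) -> float:
    return cwnd + effective_t

cwnd = MIN_CWND
last_send = 0.0
excluded_total = 0.0
for now in (1.0, 2.0, 3.0):      # data is sent every round in this scenario
    gap = now - last_send
    last_send = now              # the fix: keep the timestamp fresh
    if gap > 1.0:                # illustrative threshold for "truly idle"
        excluded_total += gap
    effective_t = now - excluded_total
    cwnd = grow(cwnd, effective_t)

assert cwnd > MIN_CWND  # the window grows again after loss
```

With the timestamp kept current, a connection that is merely app-limited between bursts no longer masquerades as fully idle, and CUBIC's epoch clock runs normally.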

Conclusion

This bug serves as a powerful reminder that even well-tested algorithms like CUBIC can behave unpredictably when ported across protocols. The 'idle' assumption from TCP doesn't map cleanly to QUIC's multiplexed streams, and a single misaligned timestamp can freeze an entire connection. Thanks to the quiche team's detective work, the fix was simple, but the journey to find it was rich with insight. If you're building on QUIC, pay close attention to how your congestion controller treats app-limited states—never assume that 'no data' means 'no problem.'