fastapi-streaming-client-disconnect-cleanup

A streaming endpoint that survives a happy-path curl test can still leak resources, hold database transactions open, or pile up orphaned subprocesses the moment a real client closes the connection mid-flight. I've been bitten by this exact bug enough times that I now write the disconnect test before I write the feature. It's one of those quiet failures that doesn't show up in unit tests, doesn't raise an alarm in development, and only appears in production as a slow climb in open connection count, file descriptor exhaustion, or a worker that just refuses to take new requests until you restart it. The root cause is almost always the same: an async generator yielded from a StreamingResponse that has no idea the client has gone away.

If you're building anything that streams — server-sent events, chunked LLM token deltas, ndjson rows from a large query, or a slow file proxy — you eventually have to engineer the disconnect path with the same care you gave the success path. The good news is that FastAPI, by virtue of being a thin layer over Starlette and uvicorn, gives you the primitives. The catch is that the primitives are scattered across three layers, and the docs tend to show the happy path in isolation. Let me walk through what actually happens when a streaming client drops, why the obvious code does the wrong thing, and the patterns I've seen hold up under load.

Why streaming is different

What does your handler actually own once it has handed an async iterable to the framework — and at what moment does that ownership end? A non-streaming endpoint has a simple contract. The handler runs, returns a response object, and the framework serializes it and writes it to the socket. If the client disconnects after the handler returns but before the bytes are flushed, that's the framework's problem — the handler has already finished. There's nothing to clean up on your side.

A streaming endpoint flips the timing. Your generator is producing bytes lazily, frame by frame, while the socket is still open. The handler doesn't return in the traditional sense; it hands an async iterable to the response object, which calls __anext__ on it repeatedly and writes each chunk. If the client closes the TCP connection — or in HTTP/2 sends a RST_STREAM — the write to the socket will start failing. But your generator doesn't know about the socket. It will happily keep computing the next chunk, holding whatever resources it needs (a row cursor, an upstream HTTP connection, a subprocess pipe), until either it finishes naturally or someone tells it to stop.

The ASGI spec, which FastAPI implements via Starlette, defines exactly how that someone-tells-it-to-stop message arrives. When the client disconnects, the ASGI server (uvicorn, hypercorn, daphne) sends an http.disconnect message to the application. Starlette translates that into cancellation of the task running your generator. Cancellation in asyncio is delivered as a CancelledError raised at the next await point. If your generator is awaiting on something — a database fetch, an LLM call, an asyncio.sleep — the exception arrives and propagates up. If your generator is in a pure CPU loop with no await, cancellation has nowhere to land and you've effectively built a small DoS surface for yourself.

This is the first invariant to internalize: cooperative cancellation only works if you cooperate. Every long-running step inside a streaming generator needs an await checkpoint, either real I/O or an explicit await asyncio.sleep(0) yield to the event loop.
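Here is a stdlib-only sketch of the difference a checkpoint makes. busy_chunks is a hypothetical generator that does pure CPU work between yields, and consume_then_cancel starts a consumer task and cancels it almost immediately. With checkpoint=True, the generator awaits asyncio.sleep(0) every thousand iterations, so the cancellation lands inside the first chunk and the finally block runs. With checkpoint=False, the consumer never gives the event loop a turn, so the cancel request can't even be issued until every chunk has been computed.

```python
import asyncio

async def busy_chunks(n_chunks, work_per_chunk, checkpoint, log):
    """Hypothetical CPU-bound stream: sums integers between yields."""
    try:
        for i in range(n_chunks):
            total = 0
            for j in range(work_per_chunk):
                total += j
                if checkpoint and j % 1000 == 0:
                    # Hand control back to the event loop; a pending
                    # cancellation is delivered at this await.
                    await asyncio.sleep(0)
            log.append(("chunk", i))
            yield total
    finally:
        log.append(("cleanup",))

async def consume_then_cancel(checkpoint):
    log = []

    async def consumer():
        async for _ in busy_chunks(3, 200_000, checkpoint, log):
            pass

    task = asyncio.create_task(consumer())
    await asyncio.sleep(0)  # give the consumer one turn on the event loop
    task.cancel()           # no-op if the task has already finished
    try:
        await task
    except asyncio.CancelledError:
        pass
    return log
```

The every-thousand-iterations interval is arbitrary. What matters is that cancellation latency is bounded by the longest stretch of code between awaits.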

The naive pattern that breaks

Last month a teammate showed me a Grafana panel where the upstream-token-spend line kept climbing for forty seconds after the user-facing request count had dropped to zero — the gap turned out to be exactly the shape of this bug. Here's how it usually appears in a codebase. A team builds an SSE endpoint that proxies tokens from an upstream LLM provider. The first version looks fine. Then someone notices that closing the browser tab doesn't stop the upstream API call, and the bill from the provider starts including completions nobody ever read.

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import httpx

app = FastAPI()

async def stream_completion(prompt: str):
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", "https://api.example.com/v1/complete",
                                 json={"prompt": prompt}) as upstream:
            async for chunk in upstream.aiter_bytes():
                yield f"data: {chunk.decode()}\n\n"

@app.get("/complete")
async def complete(prompt: str):
    return StreamingResponse(stream_completion(prompt), media_type="text/event-stream")

There are three problems hiding here, and they only show up under partial disconnects.

First, the async with httpx.AsyncClient block looks defensive, but its cleanup only runs when something unwinds or closes the generator. When the client disconnects, Starlette cancels the task that is iterating the generator, and where that CancelledError lands depends on what the generator was doing at the time. If it was awaiting inside aiter_bytes, the exception is raised right there and both async with blocks unwind immediately. If it was parked at its yield while the framework wrote the previous chunk to the socket, the exception is raised in the framework's code instead, and the generator's cleanup waits until something closes the generator. In practice this usually works, but the timing is out of your hands, and it gets more fragile under backpressure.

Second, timeout=None was added because the LLM stream can be long, but it disables every httpx timeout (connect, read, write, and pool), not just the one that was getting in the way. Combined with no explicit disconnect check, this means the request to the upstream provider has no upper bound at all. If your client disconnects and somehow the cancellation doesn't propagate (because of a bug in the upstream client library, or because you forgot an await), you've got a generator that will run until the upstream times out on its own, which may be never.

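Third, and this is the subtle one: even if cancellation does propagate cleanly, the write behind each yield can be where you get stuck. If the client vanished without sending a FIN or RST (a phone dropping off the network, say) and the server's send buffer is full, the server is waiting on a write that will never complete, and nothing has told it the peer is gone. On Linux, unacknowledged data is retransmitted up to net.ipv4.tcp_retries2 times (default 15), which works out to roughly fifteen minutes before the kernel declares the connection dead; TCP keepalive, if it is enabled at all, waits two hours of idleness before its first probe by default. HTTP idle timeouts vary by proxy.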
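For the second problem, keep in mind that httpx's timeouts apply to individual operations (connecting, each read, each write), not to the response as a whole, so a read timeout alone can't cap a stream that keeps trickling bytes. One way to bound the whole stream is to wrap the iterator in an overall deadline. Below is a stdlib-only sketch; with_deadline is a hypothetical helper name, and it raises asyncio.TimeoutError once the budget is spent.

```python
import asyncio

async def with_deadline(source, total_seconds):
    """Hypothetical helper: re-yield items from `source` until a total deadline."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + total_seconds
    it = source.__aiter__()
    try:
        while True:
            remaining = deadline - loop.time()
            if remaining <= 0:
                raise asyncio.TimeoutError("stream exceeded its total deadline")
            try:
                # Bound each wait by whatever is left of the overall budget.
                item = await asyncio.wait_for(it.__anext__(), timeout=remaining)
            except StopAsyncIteration:
                return
            yield item
    finally:
        # Close the source as well, so its own cleanup runs if we stop early.
        aclose = getattr(it, "aclose", None)
        if aclose is not None:
            await aclose()
```

In the endpoint you would iterate with_deadline(upstream.aiter_bytes(), 300) instead of the raw iterator, picking a budget that fits your longest legitimate completion.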

The defensive pattern

Most streaming-handler walkthroughs reach for try/except first and treat finally as an afterthought; in my experience the order should be inverted, and this section is about why. The fix has three parts: explicit disconnect polling on the request, structured exception handling around the streaming loop, and a finally block that does real cleanup. Here's the same endpoint rewritten with those concerns addressed.

import asyncio
import logging
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
import httpx

log = logging.getLogger(__name__)
app = FastAPI()

async def stream_completion(request: Request, prompt: str):
    client = httpx.AsyncClient(timeout=httpx.Timeout(60.0, read=None))
    try:
        async with client.stream("POST",
                                 "https://api.example.com/v1/complete",
                                 json={"prompt": prompt}) as upstream:
            async for chunk in upstream.aiter_bytes():
                if await request.is_disconnected():
                    log.info("client disconnected, aborting upstream stream")
                    break
                yield f"data: {chunk.decode()}\n\n"
    except asyncio.CancelledError:
        log.info("streaming generator cancelled")
        raise
    except Exception:
        log.exception("error in streaming generator")
        raise
    finally:
        await client.aclose()
        log.info("upstream httpx client closed")

@app.get("/complete")
async def complete(request: Request, prompt: str):
    return StreamingResponse(
        stream_completion(request, prompt),
        media_type="text/event-stream",
    )

A few details deserve unpacking. request.is_disconnected() is a Starlette method that tries to pull one pending message off the ASGI receive channel without waiting for it, and returns True once it has seen an http.disconnect. It doesn't block on the network; it only inspects what the ASGI server has already delivered. That makes it cheap to call inside the loop. The official Starlette source for this is worth reading; see the Starlette Request implementation for the exact semantics around message draining.

The explicit except asyncio.CancelledError: ... raise isn't strictly necessary — if you do nothing, the cancellation will still propagate — but it gives you an unambiguous log line that says this stopped because the client went away instead of an unhelpful traceback. Re-raising is critical; swallowing CancelledError is one of the textbook async footguns, called out in the Python asyncio task documentation under the cancellation section.
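A stdlib-only sketch of why the re-raise matters. swallows and reraises are hypothetical stand-ins for a streaming step; the one that swallows CancelledError turns a cancellation into a normal return, so the task reports that it was not cancelled and whoever cancelled it can't tell the work was cut short.

```python
import asyncio

async def swallows():
    try:
        await asyncio.sleep(10)
    except asyncio.CancelledError:
        return "finished normally?"  # footgun: the cancellation is lost

async def reraises():
    try:
        await asyncio.sleep(10)
    except asyncio.CancelledError:
        # Log here if you like, then let the cancellation continue.
        raise

async def cancel_and_inspect(coro_fn):
    task = asyncio.create_task(coro_fn())
    await asyncio.sleep(0)  # let the task reach its await
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return task.cancelled()
```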

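The finally block is the part that actually saves you from leaks. Whether the loop exited normally, broke on disconnect, or was cancelled, the httpx client gets closed. It also covers a case both except branches miss: if the generator was parked at its yield when the response was torn down, what eventually reaches it is GeneratorExit from aclose() (called explicitly, or by the event loop once the abandoned generator is garbage-collected), not CancelledError. If you were holding a database session, this is where you'd commit or roll back and return it to the pool. If you spawned a subprocess, this is where you terminate it and await its exit.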
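For the subprocess case, here is a stdlib-only sketch. stream_process_lines is a hypothetical generator that streams a child's stdout line by line; its finally block sends a terminate request, waits up to grace_seconds (five is an arbitrary default), and escalates to kill if the child ignores it. Closing the generator early with aclose(), which is what the framework does when it abandons a stream, still runs that finally block, awaits included.

```python
import asyncio
import contextlib

async def stream_process_lines(*argv, grace_seconds=5.0):
    """Hypothetical helper: stream a child's stdout, and always reap the child."""
    proc = await asyncio.create_subprocess_exec(*argv, stdout=asyncio.subprocess.PIPE)
    try:
        while True:
            line = await proc.stdout.readline()
            if not line:  # EOF: the child closed its stdout
                break
            yield line.decode()
    finally:
        if proc.returncode is None:
            with contextlib.suppress(ProcessLookupError):
                proc.terminate()  # ask politely first (SIGTERM on POSIX)
            try:
                await asyncio.wait_for(proc.wait(), timeout=grace_seconds)
            except asyncio.TimeoutError:
                with contextlib.suppress(ProcessLookupError):
                    proc.kill()  # escalate if it ignores the request
                await proc.wait()
```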

Cancellation versus polling

There's a stylistic debate in the community about whether to rely purely on cancellation propagation or to poll is_disconnected explicitly. Both work; they have different failure modes.

Pure cancellation is more idiomatic asyncio. You write your generator as if disconnects don't exist, and you trust the framework to cancel the task when the client goes away. The cleanup goes in finally blocks. The advantage is that your code stays linear and you don't pepper it with disconnect checks. The disadvantage is that cancellation arrives only at await points, so a generator doing CPU-bound work between yields can run for a while after the client is gone. It also means that if any upstream library catches CancelledError and doesn't re-raise (which is a bug in that library, but real libraries have it), your cancellation gets eaten.

Explicit polling is more defensive. You check is_disconnected before every yield, or before every expensive operation, and you break early. The advantage is that you have a deterministic exit path that doesn't depend on the upstream library's exception hygiene. The disadvantage is that the check isn't free, since every call goes through the ASGI receive machinery, and overusing it in tight loops can become measurable.

The pragmatic guideline I've settled on after a few production incidents is to use polling at coarse granularity (once per chunk, once per row batch, once per second) and rely on cancellation for the rest. This gives you fast detection without the overhead of checking before every byte.
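One way to get the once-per-second granularity without restructuring the loop is a small wrapper. ThrottledDisconnectCheck is hypothetical; it forwards to request.is_disconnected() at most once per interval and caches a positive result. It only assumes the request object has an async is_disconnected() method, which Starlette's Request does, and the clock is injectable so the behavior can be tested without sleeping.

```python
import time

class ThrottledDisconnectCheck:
    """Forward to request.is_disconnected() at most once per `interval` seconds.

    Hypothetical helper; `request` only needs an async is_disconnected().
    """

    def __init__(self, request, interval=1.0, clock=time.monotonic):
        self._request = request
        self._interval = interval
        self._clock = clock
        self._last_check = float("-inf")
        self._disconnected = False

    async def __call__(self):
        if self._disconnected:
            return True  # a disconnect never un-happens, so cache it
        now = self._clock()
        if now - self._last_check >= self._interval:
            self._last_check = now
            self._disconnected = await self._request.is_disconnected()
        return self._disconnected
```

Inside the generator you would create check = ThrottledDisconnectCheck(request) once and call await check() wherever you previously called await request.is_disconnected().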

Server-sent events and keepalive

For SSE specifically, there's a related pattern you want regardless of disconnect handling: a periodic comment frame to keep middleboxes from killing the idle connection. Many proxies will close a connection that hasn't seen bytes in thirty or sixty seconds, even if both endpoints think the connection is alive. SSE supports comment lines that start with a colon and are ignored by clients, which makes them perfect heartbeats.

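import asyncio
from fastapi import Request

_DONE = object()  # sentinel: the source iterator is exhausted

async def _next_event(events):  # plain coroutine, so it can run as a task
    try:
        return await events.__anext__()
    except StopAsyncIteration:
        return _DONE

async def sse_with_heartbeat(request: Request, source, interval: float = 15.0):
    events = source.__aiter__()
    pending = asyncio.create_task(_next_event(events))
    try:
        while True:
            done, _ = await asyncio.wait({pending}, timeout=interval)  # at most `interval` s
            if await request.is_disconnected():
                break
            if not done:  # idle source: an SSE comment keeps proxies from reaping us
                yield ": keepalive\n\n"
                continue
            if (event := pending.result()) is _DONE:
                break
            yield f"data: {event}\n\n"
            pending = asyncio.create_task(_next_event(events))
    finally:
        pending.cancel()  # don't leave a task pulling from the source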

The heartbeat doubles as your write-side disconnect detector. If the socket is gone, the write carrying the next frame will eventually fail, the server reports the disconnect, and your generator is cancelled or closed so that its cleanup runs, instead of the stream sitting idle until some proxy timeout reaps it.
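One framing detail the examples gloss over: they interpolate the payload straight into a single data: line, which breaks as soon as a payload contains a newline, because SSE is line-oriented and a blank line ends the event. Per the SSE specification, a multi-line payload is sent as several data: lines that the client rejoins with newlines, and lines starting with a colon are comments. A minimal sketch, with hypothetical helper names sse_event and sse_comment, assuming the payload is already a str:

```python
from typing import Optional

def sse_event(data: str, event: Optional[str] = None) -> str:
    """Frame `data` as one SSE event; embedded newlines become extra data: lines."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # SSE recognizes CRLF, LF, and CR as line endings (and nothing else),
    # so normalize to LF and split on it rather than using str.splitlines().
    normalized = data.replace("\r\n", "\n").replace("\r", "\n")
    lines.extend(f"data: {part}" for part in normalized.split("\n"))
    return "\n".join(lines) + "\n\n"

def sse_comment(text: str = "keepalive") -> str:
    """A comment frame: clients ignore it, but proxies still see bytes flowing."""
    return f": {text}\n\n"
```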

Background tasks and post-response cleanup

FastAPI offers BackgroundTasks for run-this-after-I-send-the-response work. For streaming responses, background tasks attached to the response run after the generator finishes — including after a disconnect-driven cancellation. This is the right place to put audit logging, metrics emission, and any non-critical cleanup that you don't want to block the response on. Critical cleanup (closing connections, returning sessions) still belongs in finally so it runs even if the background task subsystem fails.

Testing the disconnect path

The reason this bug ships so often is that the obvious test doesn't exercise it. An in-process async client will hold the connection open for the full stream. To actually test the disconnect path you need to start a real server, open a connection, read a few chunks, and then close the socket abruptly. The pattern I've used is uvicorn in a subprocess and then a raw asyncio.open_connection that closes early. Once you've got that harness, you can assert that your generator hit its finally block by checking a sentinel file or counter.

Alternatively, you can unit-test the generator directly by constructing a fake request whose is_disconnected returns true after N calls. This catches the polling logic but not the cancellation path. Both tests are worth having.
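A stdlib-only sketch of the second kind of test. FakeRequest is a hypothetical stand-in that reports a disconnect after a fixed number of is_disconnected() calls, and stream_rows is a hypothetical generator that follows the polling pattern from earlier. The test asserts both that streaming stopped early and that the finally block ran; as noted, it exercises polling, not cancellation.

```python
import asyncio

class FakeRequest:
    """Stand-in for Starlette's Request: disconnects after `after` checks."""

    def __init__(self, after: int):
        self.after = after
        self.calls = 0

    async def is_disconnected(self) -> bool:
        self.calls += 1
        return self.calls > self.after

async def stream_rows(request, rows, cleaned_up: list):
    """Hypothetical generator under test: polls once per row, cleans up in finally."""
    try:
        for row in rows:
            if await request.is_disconnected():
                break
            yield f"{row}\n"
    finally:
        cleaned_up.append(True)

async def run_case(after: int):
    cleaned_up = []
    request = FakeRequest(after)
    out = [chunk async for chunk in stream_rows(request, range(100), cleaned_up)]
    return out, cleaned_up, request.calls
```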

Production posture

The thing I tell every team that ships a streaming endpoint for the first time: write the disconnect test before you write the feature. The disconnect path isn't a polish item, it's the path your code will take more often than you expect, because users close tabs and mobile networks drop and load balancers reap connections. If you treat it as the primary path and the success case as the lucky variant, your streaming endpoints will be boring instead of being the source of three-AM pages.

The shape of a robust streaming handler is small once you internalize it: a generator that owns its resources in a try/finally, an explicit disconnect check at chunk boundaries, a heartbeat if the stream can be idle, and a logged re-raise of CancelledError so you know when and why each stream ended. Everything else is application logic.