Mocking AsyncAnthropic in pytest: autospec, Response Stubs, and Streaming

The Anthropic Python SDK ships an AsyncAnthropic client that unit tests should never actually call. A single messages.create round-trip costs cents and 800ms; multiply that by a CI matrix and you're paying for flaky network reruns instead of catching bugs. The fix is straightforward in principle (mock the client) and full of small traps in practice (autospec on async methods, stub objects that look enough like Message, and async iterators that survive a second async for pass).

This is the pattern that has held up across roughly forty test files using the SDK: autospec=True to keep the mock honest, plain dataclasses for response stubs, and a tiny async-iterator helper for streaming. No respx, no pytest-httpx, no recorded fixtures.

Why autospec=True instead of a bare AsyncMock

unittest.mock.AsyncMock() will happily let you call client.foo.bar.baz.qux() even when none of those attributes exist. That sounds convenient until the SDK renames messages to something else, your tests still pass, and production explodes. autospec=True builds the mock from the real class when the patch is applied, so a misspelled attribute raises AttributeError and a bad keyword argument raises TypeError as soon as the test touches it, instead of passing silently.

from unittest.mock import patch, AsyncMock
import pytest

@pytest.fixture
def mock_anthropic():
    # Patch the name where your code looks it up, not where the SDK defines it.
    with patch("your_project.llm.AsyncAnthropic", autospec=True) as cls:
        client = cls.return_value             # what AsyncAnthropic(...) returns
        client.messages.create = AsyncMock()  # make create awaitable
        yield client

Three things to notice. First, you patch the class at the import site (your_project.llm.AsyncAnthropic), not at anthropic.AsyncAnthropic. If your module does from anthropic import AsyncAnthropic, patching the source module leaves that already-bound name untouched, and the tests quietly run against the real SDK. Second, cls.return_value is the instance that any AsyncAnthropic(api_key=...) call inside your code returns. Third, messages.create still needs to be reassigned to a fresh AsyncMock(): autospec builds its spec from the class, not from a constructed instance, so it cannot be relied on to make client.messages.create awaitable, and awaiting a plain MagicMock raises TypeError: object MagicMock can't be used in 'await' expression.
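
To see what autospec buys you without touching the real SDK, the sketch below compares a bare AsyncMock with create_autospec against a stand-in class. FakeMessages and its create signature are hypothetical, invented only for illustration; they are not the SDK's actual classes.

```python
import asyncio
from unittest.mock import AsyncMock, create_autospec


class FakeMessages:
    """Hypothetical stand-in for an SDK resource; not the real Anthropic class."""

    async def create(self, *, model: str, max_tokens: int) -> str:
        raise RuntimeError("would hit the network")


async def bare_mock_accepts_anything() -> bool:
    bare = AsyncMock()
    # Misspelled attribute and misspelled keyword both go through silently.
    await bare.mesages.create(modle="x")
    return True


async def autospec_rejects_mistakes() -> list[str]:
    spec = create_autospec(FakeMessages, instance=True)
    seen = []
    try:
        spec.creat  # attribute typo: not present on FakeMessages
    except AttributeError:
        seen.append("AttributeError")
    try:
        await spec.create(modle="x")  # keyword that create() does not accept
    except TypeError:
        seen.append("TypeError")
    # A call that matches the real signature is accepted and can be awaited.
    await spec.create(model="m", max_tokens=16)
    seen.append(f"awaited={spec.create.await_count}")
    return seen
```

Because FakeMessages.create is declared with async def, the autospecced attribute is awaitable out of the box; the manual AsyncMock() assignment in the fixture covers the case where autospec cannot see that far.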

Response stubs: a dataclass that quacks like Message

The real return type is anthropic.types.Message, which is a Pydantic model with about a dozen fields. Trying to construct a real Message in tests means importing several sub-types and supplying token counts you don't care about. A tiny dataclass with only the fields your production code actually reads is far cheaper to write and maintain:

from dataclasses import dataclass, field

@dataclass
class StubTextBlock:
    text: str
    type: str = "text"

@dataclass
class StubUsage:
    # An object rather than a dict, so code that reads usage.input_tokens works.
    input_tokens: int = 10
    output_tokens: int = 20

@dataclass
class StubMessage:
    content: list[StubTextBlock]
    stop_reason: str = "end_turn"
    model: str = "claude-sonnet-4-6"
    role: str = "assistant"
    usage: StubUsage = field(default_factory=StubUsage)

def stub_text(text: str) -> StubMessage:
    # Convenience constructor for the common single-text-block response.
    return StubMessage(content=[StubTextBlock(text=text)])

Now a test reads like the code it's testing:

@pytest.mark.asyncio  # assumes pytest-asyncio; redundant if asyncio_mode = auto
async def test_summarize_calls_anthropic(mock_anthropic):
    mock_anthropic.messages.create.return_value = stub_text("ok")

    result = await summarize("hello world")

    assert result == "ok"
    mock_anthropic.messages.create.assert_awaited_once()
    call = mock_anthropic.messages.create.await_args
    assert call.kwargs["model"].startswith("claude-")
    assert call.kwargs["max_tokens"] <= 4096

assert_awaited_once over assert_called_once matters: AsyncMock distinguishes "I was called but never awaited" from "I was awaited", and the former is almost always a bug in the code under test (forgotten await).
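
The difference is easy to reproduce in isolation. In this sketch, forgot_await is a hypothetical buggy caller that creates the coroutine but never awaits it; assert_called_once passes for it while assert_awaited_once does not.

```python
import asyncio
from unittest.mock import AsyncMock


async def forgot_await(create) -> None:
    """Hypothetical buggy caller: builds the coroutine but never awaits it."""
    coro = create(model="m")
    coro.close()  # closed only to avoid a "never awaited" warning in this demo


async def remembered_await(create) -> None:
    await create(model="m")


def awaited_once(create) -> bool:
    """Report whether assert_awaited_once() would pass, without raising."""
    try:
        create.assert_awaited_once()
    except AssertionError:
        return False
    return True


buggy, correct = AsyncMock(), AsyncMock()
asyncio.run(forgot_await(buggy))
asyncio.run(remembered_await(correct))

# Both mocks were called, so the weaker assertion cannot tell them apart.
buggy.assert_called_once()
correct.assert_called_once()
```

Only awaited_once(correct) returns True here, which is why the stronger assertion is the one worth standardizing on.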

Streaming completions: async iterators that replay cleanly

client.messages.stream(...) returns an async context manager whose body yields event objects. The mock has two layers to fake: the context manager and the iterator. The cleanest shape is a small helper:

from contextlib import asynccontextmanager

@dataclass
class StubTextDelta:
    text: str
    type: str = "text_delta"

@dataclass
class StubEvent:
    delta: StubTextDelta
    type: str = "content_block_delta"

def stub_stream(chunks: list[str]):
    events = [StubEvent(delta=StubTextDelta(text=c)) for c in chunks]

    @asynccontextmanager
    async def _ctx(**kwargs):
        class _Stream:
            def __aiter__(self):
                return self._gen()
            async def _gen(self):
                for ev in events:
                    yield ev
            async def get_final_message(self):
                return stub_text("".join(chunks))
        yield _Stream()
    return _ctx

Wire it up like this:

@pytest.mark.asyncio  # assumes pytest-asyncio; redundant if asyncio_mode = auto
async def test_streaming_summarize(mock_anthropic):
    mock_anthropic.messages.stream = stub_stream(["hel", "lo ", "world"])

    parts = []
    async with mock_anthropic.messages.stream(model="claude-sonnet-4-6") as s:
        async for ev in s:
            parts.append(ev.delta.text)
        final = await s.get_final_message()

    assert "".join(parts) == "hello world"
    assert final.content[0].text == "hello world"

Having __aiter__ return a fresh generator on each call is the part everyone gets wrong on the first try. If you store a single generator and return it from __aiter__, a second async for over the same stream yields zero events, which silently passes most tests but breaks retry logic and snapshot-builder code in production.
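
A minimal side-by-side sketch of the two shapes makes the failure mode concrete. SingleShotStream and ReplayableStream are illustrative names, not SDK types.

```python
import asyncio


class SingleShotStream:
    """Stores one generator; a second pass finds it already exhausted."""

    def __init__(self, events):
        self._gen = self._make(list(events))

    @staticmethod
    async def _make(events):
        for ev in events:
            yield ev

    def __aiter__(self):
        return self._gen  # same generator object every time


class ReplayableStream:
    """Builds a fresh generator on every __aiter__ call."""

    def __init__(self, events):
        self._events = list(events)

    async def _gen(self):
        for ev in self._events:
            yield ev

    def __aiter__(self):
        return self._gen()  # new generator per iteration pass


async def drain_twice(stream):
    """Iterate the same stream object twice and return both passes."""
    first = [ev async for ev in stream]
    second = [ev async for ev in stream]
    return first, second
```

Running drain_twice over each class shows the single-shot version returning an empty second pass without raising anything, which is exactly why the bug is easy to miss.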

Tool use, retries, and one comparative note

For tool-use tests, return a StubMessage with stop_reason="tool_use" and a content block whose type="tool_use". The same dataclass approach scales — add a StubToolUseBlock with name, input, and id, and your content list becomes mixed-type. Production code that walks content by .type will be exercised identically.
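
One possible shape for the tool-use stub is sketched below. The default id value, the stub_tool_use helper, and requested_tools (standing in for production code that walks content by .type) are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class StubTextBlock:
    text: str
    type: str = "text"


@dataclass
class StubToolUseBlock:
    name: str
    input: dict
    id: str = "toolu_stub"  # placeholder, not a real tool-use id
    type: str = "tool_use"


@dataclass
class StubMessage:
    content: list  # mixed list of text and tool-use blocks
    stop_reason: str = "end_turn"


def stub_tool_use(name: str, tool_input: dict, preamble: str = "") -> StubMessage:
    """Build a response that asks for one tool call, optionally after some text."""
    blocks = [StubTextBlock(text=preamble)] if preamble else []
    blocks.append(StubToolUseBlock(name=name, input=tool_input))
    return StubMessage(content=blocks, stop_reason="tool_use")


def requested_tools(message) -> list[tuple[str, dict]]:
    """Hypothetical production helper that dispatches on block.type."""
    return [(b.name, b.input) for b in message.content if b.type == "tool_use"]
```

Since the helper only reads .type, .name, and .input, it cannot tell the stub apart from a real response, which is the property the dataclass approach relies on.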

For retry tests, set side_effect to a list:

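import httpx
from anthropic import APIError

# APIError expects an httpx.Request; a throwaway one is enough for a stub.
request = httpx.Request("POST", "https://api.anthropic.com/v1/messages")

mock_anthropic.messages.create.side_effect = [
    APIError("transient", request=request, body=None),  # first await raises
    stub_text("recovered"),                              # second await returns
]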
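
To show how a side_effect list drives a retry path end to end, here is a self-contained sketch. TransientError and call_with_retry are hypothetical stand-ins for an SDK exception and your own retry wrapper; only the AsyncMock behavior is the point.

```python
import asyncio
from unittest.mock import AsyncMock


class TransientError(Exception):
    """Hypothetical stand-in for a retryable SDK exception."""


async def call_with_retry(create, attempts: int = 3, **kwargs):
    """Hypothetical wrapper: retry on TransientError, re-raise on the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return await create(**kwargs)
        except TransientError:
            if attempt == attempts:
                raise


# Each await consumes the next item: exceptions are raised, other values returned.
create = AsyncMock(side_effect=[TransientError("transient"), "recovered"])
result = asyncio.run(call_with_retry(create, model="m"))
```

Asserting on create.await_count afterwards confirms the wrapper retried exactly once rather than succeeding by accident.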

Mocking at the client level is roughly 3× faster in CI than pytest-httpx for the same coverage, because there's no HTTP layer to mount and tear down per test: the patch costs about 50ms, the HTTP fixture roughly 150ms. pytest-httpx is the right tool when you're testing your custom retry adapter at the transport layer or asserting exact request bodies as they leave the wire. For everything above the SDK (agents, summarizers, planners), autospec plus dataclass stubs is faster to write and faster to run.

The trade-off you accept with autospec: when the SDK adds a new method (say messages.batches.create), your existing mocks won't know about it until you bump the SDK, because the spec is rebuilt from the installed class each time the patch is applied. That's a feature. It forces the test suite to acknowledge new API surface area instead of silently mocking through it.

A note on fixture scope

Keep the mock_anthropic fixture function-scoped. With a session-scoped mock, call and await counts accumulate across tests, so assert_awaited_once starts failing depending on test order, and the failures are confusing to trace. The 5ms saved per test by sharing the patch isn't worth the debugging.
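
The accumulation is simple to reproduce outside pytest. The sketch below simulates two "tests" sharing one mock versus each getting its own; code_under_test is a hypothetical function that awaits the mock once.

```python
import asyncio
from unittest.mock import AsyncMock


def passes_awaited_once(mock) -> bool:
    """Report whether assert_awaited_once() would pass, without raising."""
    try:
        mock.assert_awaited_once()
    except AssertionError:
        return False
    return True


async def code_under_test(create) -> None:
    await create(model="m")


# Session-style sharing: one mock reused by two simulated tests.
shared = AsyncMock()
asyncio.run(code_under_test(shared))
first_test_ok = passes_awaited_once(shared)
asyncio.run(code_under_test(shared))
second_test_ok = passes_awaited_once(shared)  # await_count is now 2

# Function-style scoping: each simulated test gets its own mock.
per_test_results = []
for _ in range(2):
    fresh = AsyncMock()
    asyncio.run(code_under_test(fresh))
    per_test_results.append(passes_awaited_once(fresh))
```

The second shared check fails even though the code under test behaved identically both times, which is the order-dependent failure described above.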

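If you're testing an agent loop that constructs its own AsyncAnthropic instance per request (a common pattern for per-tenant API keys), patch the constructor and call cls.assert_called_with(api_key="...") once the loop completes. Do not wrap that call in assert: the method returns None and raises on a mismatch by itself. Here cls is the patched class from the with patch(...) as cls line, distinct from cls.return_value, which is the instance. The fixture above yields only the instance, so yield cls as well (for example as a tuple) if a test needs both.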
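
A sketch of the constructor assertion, under the assumption that the module under test looks the class up by name at call time. FakeAsyncClient, the llm namespace, and handle_requests are hypothetical stand-ins; in a real suite the target would be the "your_project.llm.AsyncAnthropic" string used in the fixture.

```python
import types
from unittest.mock import call, patch


class FakeAsyncClient:
    """Hypothetical stand-in for the SDK client class."""

    def __init__(self, *, api_key: str) -> None:
        self.api_key = api_key


# Stand-in for the module under test, so the sketch needs no real package.
llm = types.SimpleNamespace(AsyncAnthropic=FakeAsyncClient)


def handle_requests(tenant_keys: list[str]) -> None:
    """Hypothetical agent loop that builds one client per request."""
    for key in tenant_keys:
        llm.AsyncAnthropic(api_key=key)


with patch.object(llm, "AsyncAnthropic", autospec=True) as cls:
    handle_requests(["key-a", "key-b"])
    constructor_calls = list(cls.call_args_list)
    cls.assert_called_with(api_key="key-b")  # checks only the most recent call
    # Every construction returns the same instance mock, cls.return_value.
    instance_is_shared = cls.return_value is llm.AsyncAnthropic(api_key="key-c")
```

Because assert_called_with only inspects the last call, comparing call_args_list is the safer check when the loop is expected to build several clients with different keys.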

What to skip

Don't mock count_tokens unless your code calls it explicitly — most agents rely on usage.input_tokens from the response, which the dataclass already supplies. Don't try to mock anthropic.types.Message with spec=Message: Pydantic v2 models defeat unittest.mock.create_autospec because their __init__ signature is dynamic. Dataclass stubs sidestep that entirely.
