Mocking AsyncAnthropic in pytest: autospec, Response Stubs, and Streaming
Mock the async Anthropic Python SDK in pytest with autospec=True, typed response stubs for messages.create, and async-iterator stubs for streaming completions.
The Anthropic Python SDK ships an AsyncAnthropic client that real tests should never call. A single messages.create round-trip costs cents and 800ms; multiply that by a CI matrix and you're paying for flaky network reruns instead of catching bugs. The fix is straightforward in principle (mock the client) and full of small traps in practice (autospec on async methods, stub objects that look enough like Message, and async iterators that survive a second async for pass).
This is the pattern that has held up across roughly forty test files using the SDK: autospec=True to keep the mock honest, plain dataclasses for response stubs, and a tiny async-iterator helper for streaming. No respx, no pytest-httpx, no recorded fixtures.
Why autospec=True instead of a bare AsyncMock
unittest.mock.AsyncMock() will happily let you call client.foo.bar.baz.qux() even when none of those attributes exist. That sounds convenient until the SDK renames messages to something else, your tests still pass, and production explodes. autospec=True copies the attributes and call signatures of the real class when the patch starts, so an attribute typo raises AttributeError and a bad keyword argument raises TypeError as soon as the test touches it, instead of slipping through to production.
from unittest.mock import AsyncMock, patch

import pytest


@pytest.fixture
def mock_anthropic():
    # Patch the name your code imports, not anthropic.AsyncAnthropic itself.
    with patch("your_project.llm.AsyncAnthropic", autospec=True) as cls:
        client = cls.return_value  # what AsyncAnthropic(...) returns in your code
        client.messages.create = AsyncMock()  # explicit awaitable mock
        yield client
Three things to notice. First, you patch the class at the import site (your_project.llm.AsyncAnthropic), not at anthropic.AsyncAnthropic. If your module does from anthropic import AsyncAnthropic, patching the source module leaves your module's reference untouched, and the tests quietly run against the real SDK. Second, cls.return_value is the instance that any AsyncAnthropic(api_key=...) call inside your code returns. Third, messages.create is reassigned to a fresh AsyncMock() explicitly. autospec only produces an awaitable mock for attributes it can recognize as async functions on the class, and messages is a sub-resource object rather than a plain method, so depending on the SDK version create may come back as a non-awaitable MagicMock. Awaiting a MagicMock raises TypeError: object MagicMock can't be used in 'await' expression.
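To see the difference outside the SDK, here is a minimal sketch. The Messages class is a hypothetical stand-in, not the SDK's class; only the unittest.mock calls are real. The bare AsyncMock accepts a misspelled method and a misspelled keyword, while the autospecced mock rejects both.

```python
import asyncio
from unittest.mock import AsyncMock, create_autospec


class Messages:
    """Hypothetical stand-in for an SDK resource; not the real class."""

    async def create(self, *, model: str, max_tokens: int) -> str:
        return "real response"


async def call_create(messages, **kwargs):
    return await messages.create(**kwargs)


loose = AsyncMock()                                # no spec at all
strict = create_autospec(Messages, instance=True)  # spec copied from Messages

# The bare mock invents attributes on demand, so a typo goes unnoticed.
asyncio.run(loose.creat(modle="claude-x"))

# The autospecced mock only exposes what Messages really has.
try:
    strict.creat
    typo_caught = False
except AttributeError:
    typo_caught = True

# It also checks keyword arguments against the real signature.
try:
    asyncio.run(call_create(strict, modle="claude-x", max_tokens=16))
    bad_kwarg_caught = False
except TypeError:
    bad_kwarg_caught = True

# A correctly spelled call still works and can be stubbed as usual.
strict.create.return_value = "stubbed"
result = asyncio.run(call_create(strict, model="claude-x", max_tokens=16))
```

patch(..., autospec=True) builds its mock with the same create_autospec machinery, so the fixture inherits these checks for whatever autospec can see on the class.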
Response stubs: a dataclass that quacks like Message
The real return type is anthropic.types.Message, which is a Pydantic model with about a dozen fields. Trying to construct a real Message in tests means importing five sub-types and supplying token counts you don't care about. A tiny dataclass with only the fields your production code actually reads is far cheaper to write and maintain:
from dataclasses import dataclass, field


@dataclass
class StubTextBlock:
    text: str
    type: str = "text"


@dataclass
class StubUsage:
    # An object, not a dict, so usage.input_tokens works as on the real Message.
    input_tokens: int = 10
    output_tokens: int = 20


@dataclass
class StubMessage:
    content: list[StubTextBlock]
    stop_reason: str = "end_turn"
    model: str = "claude-sonnet-4-6"
    role: str = "assistant"
    usage: StubUsage = field(default_factory=StubUsage)


def stub_text(text: str) -> StubMessage:
    return StubMessage(content=[StubTextBlock(text=text)])
Now a test reads like the code it's testing:
import pytest

from your_project.llm import summarize  # assumed location of the code under test


@pytest.mark.asyncio  # assumes pytest-asyncio; drop if asyncio_mode = "auto"
async def test_summarize_calls_anthropic(mock_anthropic):
    mock_anthropic.messages.create.return_value = stub_text("ok")
    result = await summarize("hello world")
    assert result == "ok"
    mock_anthropic.messages.create.assert_awaited_once()
    call = mock_anthropic.messages.create.await_args
    assert call.kwargs["model"].startswith("claude-")
    assert call.kwargs["max_tokens"] <= 4096
assert_awaited_once over assert_called_once matters: AsyncMock distinguishes "I was called but never awaited" from "I was awaited", and the former is almost always a bug in the code under test (forgotten await).
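A minimal sketch of that distinction, using only unittest.mock. The forgot_await function is a hypothetical piece of buggy code under test: it calls the mock but never awaits the result.

```python
import asyncio
from unittest.mock import AsyncMock


async def forgot_await(create):
    # Bug under test: the coroutine is created but never awaited.
    coro = create(model="claude-x")
    coro.close()  # closed here only to silence the "never awaited" warning


create = AsyncMock(return_value="ok")
asyncio.run(forgot_await(create))

create.assert_called_once()  # passes: the mock was called

try:
    create.assert_awaited_once()
    missing_await_detected = False
except AssertionError:
    missing_await_detected = True  # await_count is still 0
```

A test that only used assert_called_once would have passed against this bug.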
Streaming completions: async iterators that replay cleanly
client.messages.stream(...) returns an async context manager whose body yields event objects. The mock has two layers to fake: the context manager and the iterator. The cleanest shape is a small helper:
from contextlib import asynccontextmanager
from dataclasses import dataclass


@dataclass
class StubTextDelta:
    text: str
    type: str = "text_delta"


@dataclass
class StubEvent:
    delta: StubTextDelta
    type: str = "content_block_delta"


def stub_stream(chunks: list[str]):
    events = [StubEvent(delta=StubTextDelta(text=c)) for c in chunks]

    @asynccontextmanager
    async def _ctx(**kwargs):
        # Accepts and ignores the keyword arguments passed to messages.stream(...).
        class _Stream:
            def __aiter__(self):
                # A fresh generator per call, so the stream can be iterated again.
                return self._gen()

            async def _gen(self):
                for ev in events:
                    yield ev

            async def get_final_message(self):
                return stub_text("".join(chunks))

        yield _Stream()

    return _ctx
Wire it up like this:
@pytest.mark.asyncio  # assumes pytest-asyncio; drop if asyncio_mode = "auto"
async def test_streaming_summarize(mock_anthropic):
    mock_anthropic.messages.stream = stub_stream(["hel", "lo ", "world"])
    parts = []
    async with mock_anthropic.messages.stream(model="claude-sonnet-4-6") as s:
        async for ev in s:
            parts.append(ev.delta.text)
        final = await s.get_final_message()
    assert "".join(parts) == "hello world"
    assert final.content[0].text == "hello world"
The __aiter__ returning a fresh generator each call is the part everyone gets wrong on the first try. If you store a single generator and return it from __aiter__, a second async for over the same stream yields zero events, which silently passes most tests but breaks retry logic and snapshot-builder code in production.
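The failure mode is easy to reproduce in isolation. In this sketch, OneShot and Replayable are hypothetical names for the two shapes: one stores a single generator, the other builds a new generator on every __aiter__ call.

```python
import asyncio


class OneShot:
    """Stores one generator; a second pass finds it already exhausted."""

    def __init__(self, items):
        self._it = self._gen(list(items))

    async def _gen(self, items):
        for item in items:
            yield item

    def __aiter__(self):
        return self._it


class Replayable:
    """Builds a new generator on every __aiter__ call."""

    def __init__(self, items):
        self._items = list(items)

    def __aiter__(self):
        return self._gen()

    async def _gen(self):
        for item in self._items:
            yield item


async def two_passes(stream):
    first = [x async for x in stream]
    second = [x async for x in stream]
    return first, second


one_shot = asyncio.run(two_passes(OneShot(["hel", "lo"])))
replayable = asyncio.run(two_passes(Replayable(["hel", "lo"])))
# one_shot   -> (["hel", "lo"], [])
# replayable -> (["hel", "lo"], ["hel", "lo"])
```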
Tool use, retries, and one comparative note
For tool-use tests, return a StubMessage with stop_reason="tool_use" and a content block whose type="tool_use". The same dataclass approach scales: add a StubToolUseBlock with name, input, and id, and your content list becomes mixed-type. Production code that walks content by .type will be exercised identically.
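A sketch of what that could look like. The stubs are redefined here so the block runs on its own. stub_tool_use and requested_tools are hypothetical helpers, and the default id is a made-up value for illustration.

```python
from dataclasses import dataclass


@dataclass
class StubTextBlock:
    text: str
    type: str = "text"


@dataclass
class StubToolUseBlock:
    name: str
    input: dict
    id: str = "toolu_stub_1"  # made-up id for illustration
    type: str = "tool_use"


@dataclass
class StubMessage:
    content: list  # mixed: text blocks and tool-use blocks
    stop_reason: str = "end_turn"


def stub_tool_use(name: str, tool_input: dict) -> StubMessage:
    return StubMessage(
        content=[
            StubTextBlock(text="Let me check."),
            StubToolUseBlock(name=name, input=tool_input),
        ],
        stop_reason="tool_use",
    )


def requested_tools(message) -> list:
    # Hypothetical production helper: walks content by .type.
    return [(b.name, b.input) for b in message.content if b.type == "tool_use"]


msg = stub_tool_use("get_weather", {"city": "Oslo"})
calls = requested_tools(msg)
```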
For retry tests, set side_effect to a list:
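import httpx
from anthropic import APIError

# APIError expects an httpx.Request; the URL here is only a placeholder.
request = httpx.Request("POST", "https://api.anthropic.com/v1/messages")
mock_anthropic.messages.create.side_effect = [
    APIError("transient", request=request, body=None),
    stub_text("recovered"),
]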
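The mechanics of a side_effect list can be checked without the SDK installed. In this sketch, TransientError and call_with_retry are hypothetical stand-ins for an SDK error and your own retry wrapper. The mock raises the first item because it is an exception and returns the second.

```python
import asyncio
from unittest.mock import AsyncMock


class TransientError(Exception):
    """Hypothetical stand-in for a retryable SDK error."""


async def call_with_retry(create, attempts: int = 2, **kwargs):
    # Hypothetical retry wrapper: retry on TransientError, up to `attempts` tries.
    for attempt in range(attempts):
        try:
            return await create(**kwargs)
        except TransientError:
            if attempt == attempts - 1:
                raise


create = AsyncMock(side_effect=[TransientError("transient"), "recovered"])
result = asyncio.run(call_with_retry(create, model="claude-x"))
```

Asserting on create.await_count afterwards confirms that the wrapper retried exactly once.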
This is roughly 3× faster in CI than pytest-httpx for the same coverage, because there's no HTTP layer to mount and tear down per test: the patch costs about 50ms, the HTTP fixture roughly 150ms. pytest-httpx is the right tool when you're testing your custom retry adapter at the transport layer or asserting exact request bodies as they leave the wire. For everything above the SDK (agents, summarizers, planners), autospec plus dataclass stubs is faster to write and faster to run.
The trade-off you accept with autospec: when the SDK adds a new method (say messages.batches.create), your existing mocks won't know about it until you bump the SDK, at which point autospec picks it up on the next test run. That's a feature. It forces the test suite to acknowledge new API surface area instead of silently mocking through it.
A note on fixture scope
Keep the mock_anthropic fixture function-scoped. With a session-scoped mock, call and await counts accumulate across tests, so assert_awaited_once starts failing in whichever test runs second, and the failures move around when test order changes. The 50ms or so of patch setup saved per test by sharing the patch isn't worth the debugging.
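The accumulation is easy to reproduce with one shared AsyncMock. In this sketch, fake_test is a hypothetical stand-in for two tests that each await the mock once and then assert on it.

```python
import asyncio
from unittest.mock import AsyncMock

shared_create = AsyncMock(return_value="ok")  # plays the session-scoped mock


async def fake_test(create):
    await create(model="claude-x")
    create.assert_awaited_once()


asyncio.run(fake_test(shared_create))  # first "test" passes

try:
    asyncio.run(fake_test(shared_create))  # same code, second run
    second_run_failed = False
except AssertionError:
    second_run_failed = True  # await_count is now 2, not 1
```

A function-scoped fixture avoids this because each test gets a fresh mock with zeroed counts.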
If you're testing an agent loop that constructs its own AsyncAnthropic instance per request (a common pattern for per-tenant API keys), patch the constructor and call cls.assert_called_with(api_key="...") once the loop completes. Here cls is the patched class itself, distinct from cls.return_value, which is the instance. The mock_anthropic fixture as written yields only the instance, so for these tests have the fixture yield cls and reach the instance through cls.return_value.
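A minimal sketch of the class-versus-instance distinction. Client and client_for_tenant are hypothetical stand-ins for AsyncAnthropic and the per-tenant code path. create_autospec is called directly here so the block runs on its own; patch(..., autospec=True) produces the same kind of class mock.

```python
from unittest.mock import create_autospec


class Client:
    """Hypothetical stand-in for AsyncAnthropic; not the real class."""

    def __init__(self, *, api_key: str):
        self.api_key = api_key


def client_for_tenant(client_cls, tenant_key: str):
    # Hypothetical per-tenant code path: builds a new client per request.
    return client_cls(api_key=tenant_key)


cls = create_autospec(Client)  # the patched class
instance = client_for_tenant(cls, "tenant-a-key")

cls.assert_called_with(api_key="tenant-a-key")  # checks the most recent call only
same_instance = instance is cls.return_value    # every construction returns this
```

Because assert_called_with only looks at the last call, a loop that builds several clients is better checked through cls.call_args_list.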
What to skip
Don't mock count_tokens unless your code calls it explicitly: most agents rely on usage.input_tokens from the response, which the stub already supplies. Don't try to mock anthropic.types.Message with spec=Message either: Pydantic v2 models tend to defeat unittest.mock.create_autospec because their __init__ signature is generated dynamically. Dataclass stubs sidestep that entirely.