A customer asks your support bot to look up their order, check the return policy, and process a refund. The bot looks up the order, then stops. Returns a half-finished answer. The customer starts over. This is not a model problem — it is an architecture problem. The agent's loop terminated early because the code checked for text content instead of the actual completion signal.
That single bug — checking the wrong field — is the gap between a demo and a production system. The patterns below are grounded in what Anthropic tests on the certification exam, and they come up constantly in real builds.
The agentic loop lifecycle
Every Claude-based agent runs the same four-step cycle:
- Send a request to Claude via the Messages API, including the full conversation history and any tool results from the previous iteration.
- Check stop_reason in the response — this is the only reliable signal for what happens next.
- If stop_reason is "tool_use": execute the requested tool(s), append the results to the conversation history, and send the updated conversation back to Claude.
- If stop_reason is "end_turn": the agent is finished. Return the final response.
The core loop in TypeScript:
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function runAgent(userPrompt: string, tools: Anthropic.Tool[]) {
  const messages: Anthropic.MessageParam[] = [
    { role: "user", content: userPrompt },
  ];

  while (true) {
    const response = await client.messages.create({
      model: "claude-sonnet-4-20250514",
      max_tokens: 4096,
      tools,
      messages,
    });

    // The ONLY reliable completion signal
    if (response.stop_reason === "end_turn") {
      const text = response.content.find((b) => b.type === "text");
      return text?.text ?? "";
    }

    // Append the assistant response to history
    messages.push({ role: "assistant", content: response.content });

    // Execute each tool call and append the results as a user message.
    // executeTool is your application's tool dispatcher.
    const toolResults = response.content
      .filter((b) => b.type === "tool_use")
      .map((toolUse) => ({
        type: "tool_result" as const,
        tool_use_id: toolUse.id,
        content: executeTool(toolUse.name, toolUse.input),
      }));

    messages.push({ role: "user", content: toolResults });
  }
}
```
Step 2 is where things go wrong. The stop_reason field is deterministic and unambiguous. Three common anti-patterns break this, and they show up in production more often than you'd expect:
Checking response.content[0].type === "text" — Claude can return text alongside tool calls in the same response. A message like "Let me look up your order" followed by a tool_use block has text at position 0, but the agent is nowhere near finished.
Parsing natural language like "I'm done" or "task complete" — Claude might say "I've finished analysing the first file" while fully intending to continue with more files. Natural language is ambiguous. stop_reason is not.
Arbitrary iteration caps as the primary stop — setting "stop after 10 loops" either cuts off useful work or runs unnecessary iterations. Caps are fine as a safety net. They are terrible as the primary control mechanism. If your cap triggers during normal operation, something else is broken.
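To make the first anti-pattern concrete, here is a minimal sketch using a simplified response type (the real SDK types are richer) showing why a position-zero text check bails early while a stop_reason check does not:

```typescript
// Simplified response shape for illustration only.
type Block = { type: "text"; text: string } | { type: "tool_use"; id: string };
type Response = { stop_reason: "end_turn" | "tool_use"; content: Block[] };

// Anti-pattern: "text at position 0 means we're done". Wrong whenever
// Claude narrates ("Let me look up your order") before calling a tool.
function isDoneBuggy(r: Response): boolean {
  return r.content[0]?.type === "text";
}

// Correct: stop_reason is the only reliable completion signal.
function isDone(r: Response): boolean {
  return r.stop_reason === "end_turn";
}

// A mid-task response: narration text followed by a tool call.
const midTask: Response = {
  stop_reason: "tool_use",
  content: [
    { type: "text", text: "Let me look up your order." },
    { type: "tool_use", id: "toolu_01" },
  ],
};
// isDoneBuggy(midTask) is true, so the buggy agent exits with work pending;
// isDone(midTask) is false, so the correct agent keeps looping.
```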
For the full deep dive, see Agentic Loops.
Single-agent vs multi-agent: when to split
Most problems do not need multiple agents. A single agent with well-chosen tools handles the majority of tasks — and it is dramatically simpler to build, debug, and maintain. The temptation to over-architect is real. Resist it until the evidence forces your hand.
Split into multiple agents when:
- The task has distinct specialisms. Research that requires web search, document analysis, and synthesis benefits from agents optimised for each role. One generalist trying to juggle all three will produce mediocre results across the board.
- Context isolation matters. A code review agent should not see the entire conversation history of a planning session. Isolated agents receive only what is relevant to their job.
- Parallel execution adds value. Independent subtasks — searching three different sources, for instance — can run concurrently across agents and finish in a fraction of the time.
Stay with a single agent when:
- The tools are closely related and the workflow is linear.
- The total context fits comfortably in one conversation.
- You cannot articulate a concrete benefit from the added coordination complexity.
Orchestration patterns compared
| Pattern | How it works | Best for | Watch out for |
|---|---|---|---|
| Single agent | One agent with multiple tools | Linear workflows, straightforward tasks | Context bloat on complex tasks |
| Sequential pipeline | Agent A output feeds Agent B input | Document processing chains | Single point of failure at each stage |
| Parallel fan-out | Multiple agents run concurrently, results aggregated | Independent research subtasks | Coordination overhead on aggregation |
| Coordinator/specialist | Central coordinator delegates to specialist subagents | Complex, multi-domain tasks | Coordinator becomes a bottleneck if poorly designed |
The exam heavily favours the coordinator/specialist pattern for complex scenarios. That is the one worth understanding inside and out.
The coordinator/specialist pattern
Think hub-and-spoke. A coordinator agent sits at the centre, receives the task, decomposes it into subtasks, delegates to specialist subagents, aggregates results, and handles errors.
One cardinal rule: all communication flows through the coordinator. Subagents never talk to each other directly. Not for efficiency. Not for convenience. Not for any reason. This is non-negotiable.
Why? Three reasons, and they compound:
- Observability — every message passes through one place. You can log and monitor the entire system from a single point. When something breaks at 3 AM, you will be grateful for this.
- Consistent error handling — the coordinator applies uniform recovery policies instead of each agent improvising its own.
- Controlled information flow — the coordinator decides what context each subagent receives. No agent sees more than it needs.
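A minimal sketch of the hub-and-spoke flow, using a hypothetical dispatch helper and stubbed subagents in place of real API calls, shows how routing everything through the coordinator yields a single audit point:

```typescript
// Every message between coordinator and subagents lands in one log.
type AuditEntry = {
  agent: string;
  direction: "request" | "response";
  payload: string;
};

const auditLog: AuditEntry[] = [];

// Stub subagents for illustration; in a real system these would be
// separate Claude conversations.
const subagents: Record<string, (input: string) => string> = {
  search: (q) => `results for: ${q}`,
  synthesis: (notes) => `summary of: ${notes}`,
};

// The single chokepoint: all subagent traffic flows through here.
function dispatch(agent: string, input: string): string {
  auditLog.push({ agent, direction: "request", payload: input });
  const output = subagents[agent](input);
  auditLog.push({ agent, direction: "response", payload: output });
  return output;
}

// Subagents never call each other: the coordinator chains them.
const findings = dispatch("search", "AI in creative industries");
const report = dispatch("synthesis", findings);
// auditLog now holds all four messages, observable from one place.
```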
The isolation principle
This catches most people off guard: subagents do not inherit the coordinator's conversation history. When the coordinator spawns a subagent, that subagent starts with only what the coordinator explicitly includes in its prompt. No access to the coordinator's system prompt. No access to previous messages. No access to results from other subagents. Nothing implicit.
Subagents also have no memory between invocations. Call the search agent twice and the second call has zero knowledge of the first. Blank slate.
This forces the coordinator to be deliberate about context passing. If the synthesis agent needs search results, the coordinator must pass them explicitly. There is no shared memory or global state to fall back on. This feels restrictive at first, but it eliminates an entire class of bugs around stale or leaked context.
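As a sketch (the message shape and helper name are illustrative, not an SDK API), explicit context passing might look like this: the subagent's entire world is the first user message the coordinator builds for it.

```typescript
// Simplified conversation message; mirrors the Messages API format.
type Message = { role: "user" | "assistant"; content: string };

// A subagent invocation starts from a blank slate: the ONLY context it
// receives is what the coordinator packs into this opening message.
function buildSubagentConversation(
  task: string,
  context: string[], // e.g. search results the coordinator chose to forward
): Message[] {
  const contextBlock =
    context.length > 0
      ? `Relevant findings from earlier steps:\n${context
          .map((c) => `- ${c}`)
          .join("\n")}`
      : "No prior findings were provided.";
  return [{ role: "user", content: `${task}\n\n${contextBlock}` }];
}

// The synthesis agent sees the search results only because the
// coordinator passed them explicitly. Nothing is inherited.
const conversation = buildSubagentConversation(
  "Synthesise the findings into a summary.",
  ["Result A from search agent", "Result B from search agent"],
);
```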
The narrow decomposition failure
The exam tests this pattern directly. A coordinator is asked to research "the impact of AI on creative industries." It decomposes the topic into visual arts subtopics only — missing music, writing, film, and gaming entirely.
The output is thorough on visual arts. Empty on everything else. Where did it go wrong?
Not the search agent — it searched exactly what it was assigned. Not the synthesis agent — it accurately combined everything it received. The failure sits squarely in the coordinator's task decomposition. It assigned only one category of creative industries.
This is the exam lesson worth memorising: when a multi-agent system produces output that is incomplete in scope (not depth), trace the failure to the coordinator's decomposition. Throwing more subagents at the problem or improving search queries will not fix a decomposition that was wrong from the start.
The Claude Agent SDK
The Agent SDK provides a structured framework for building agents. The key concepts:
Agent definitions specify the model, system prompt, tools, and constraints for each agent:
```python
from claude_agent_sdk import Agent

research_agent = Agent(
    name="research",
    model="claude-sonnet-4-20250514",
    system_prompt="You are a research specialist. Search thoroughly and return structured findings with sources.",
    tools=["web_search", "document_reader"],
    max_tokens=4096,
)
```
Lifecycle hooks let you intercept and control agent behaviour at specific points:
- PreToolUse — fires before a tool executes. Use for validation, logging, or blocking dangerous operations.
- PostToolUse — fires after a tool returns. Use for result validation or transformation.
- SubagentStart / SubagentStop — fire when subagents are spawned and completed. Use for tracking and controlling the coordination layer.
Programmatic enforcement is where hooks earn their keep. For financial, security, or compliance operations, prompt instructions are not enough. An 8% failure rate on identity verification before refunds is unacceptable — full stop. A PreToolUse hook on the process_refund tool can check whether get_customer has returned a verified ID, blocking the operation deterministically regardless of what the model decides to do.
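The hook wiring is SDK-specific, but the gating logic is plain code. A sketch of a PreToolUse-style check (the tool names and state shape are hypothetical) might look like:

```typescript
// Workflow state tracked by the application, not by the model.
interface SessionState {
  verifiedCustomerId: string | null; // set only after get_customer succeeds
}

type HookResult = { allow: true } | { allow: false; reason: string };

// Runs before every tool call. Blocks process_refund deterministically
// unless identity verification has already happened, regardless of what
// the model decided to do.
function preToolUseGate(toolName: string, state: SessionState): HookResult {
  if (toolName === "process_refund" && state.verifiedCustomerId === null) {
    return {
      allow: false,
      reason: "Refund blocked: customer identity has not been verified.",
    };
  }
  return { allow: true };
}
```

Because the gate is ordinary code, its failure rate on this rule is zero: there is no prompt phrasing, temperature, or model update that can route around it.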
The exam draws a hard line on this: prompt-based guidance works 90-95% of the time. Programmatic enforcement works every time. For high-stakes operations, the answer is always programmatic enforcement. Always. The exam will try to tempt you with "add a few-shot example" or "rewrite the system prompt" as alternatives. Do not fall for it.
For the full breakdown, see Workflow Enforcement and Handoff.
Guardrails that matter in production
Getting agents to work in a demo is the easy part. Production is where things get interesting.
Safety iteration caps — set a maximum loop count (e.g., 20) as a fallback to prevent runaway agents. This is a safety net, not the primary stopping mechanism. If it triggers in normal operation, your loop logic is broken and the cap just masked the real problem.
Prerequisite gates — block downstream tools until preconditions are met. The refund-before-verification pattern is the canonical example, but the same principle applies anywhere you have multi-step workflows with dependencies. Gate the steps. Do not trust the model to sequence them correctly every time.
Structured handoff protocols — when an agent escalates to a human, the handoff must include customer ID, conversation summary, root cause analysis, and recommended action. The human agent does not have access to the conversation transcript. The handoff summary is all they get. If it is incomplete, the customer explains everything again.
Threshold enforcement — cap financial operations, rate-limit API calls, restrict scope. These are code-level controls that the model cannot override, and that is exactly the point.
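As an illustrative sketch (the cap value and function name are invented for this example), a threshold check is just ordinary code sitting between the model's request and the side effect:

```typescript
// Code-level cap the model cannot override. Illustrative value: $50.
const MAX_AUTO_REFUND_CENTS = 50_00;

function enforceRefundThreshold(
  amountCents: number,
): { allowed: boolean; reason?: string } {
  if (amountCents > MAX_AUTO_REFUND_CENTS) {
    return {
      allowed: false,
      reason: "Amount exceeds the automatic refund cap; escalate to a human.",
    };
  }
  return { allowed: true };
}
```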
What Domain 1 actually tests
Domain 1 (Agentic Architecture & Orchestration) carries 27% of the exam — the heaviest single domain. The questions are scenario-based. You get a situation with a broken or suboptimal agent system, four plausible answers, and one correct diagnosis.
Five patterns appear over and over:
- Premature loop termination — the fix is always stop_reason, never iteration caps or text parsing. Every single time.
- Multi-agent coverage gaps — trace to the coordinator's decomposition, not to downstream agents.
- Subagent context failures — the coordinator did not pass enough context. The subagent is not broken; it just never received what it needed.
- Compliance failures on financial operations — programmatic enforcement, never enhanced prompts or few-shot examples. No exceptions.
- Direct inter-subagent communication proposed as an improvement — reject it. All communication flows through the coordinator. The exam will make the direct communication option sound efficient and sensible. It is still wrong.
If you can spot these five patterns on sight, you will handle the majority of Domain 1 questions without breaking a sweat.
Start with the full Domain 1 curriculum, or jump straight to the mock exam to test yourself under exam conditions. The study guide covers how to allocate prep time across all five domains.