Domain 2
Task 2.2

MCP Server Implementation

What You Need to Know

When an MCP tool fails, the error response it returns determines whether the agent can recover intelligently or flails blindly. Generic error messages like "Operation failed" are useless to an LLM — they provide no signal about what went wrong, whether to retry, or what alternative path to take.

MCP provides the isError flag specifically for communicating tool failures back to the agent. Setting isError: true in the tool result tells the model that execution failed, so it can reason about recovery rather than treating the error text as a normal successful result.

The Four Error Categories

Every tool failure falls into one of four categories. Each demands a different recovery strategy, and the agent needs structured metadata to distinguish them.

1. Transient Errors

Timeouts, service unavailability, rate limits. The underlying system is temporarily unreachable, but the request itself is valid. Recovery: retry after a brief delay.

json
{
  "isError": true,
  "content": [{
    "type": "text",
    "text": "Service temporarily unavailable"
  }],
  "errorCategory": "transient",
  "isRetryable": true,
  "description": "The order database is experiencing high load. The request is valid and should succeed on retry."
}

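2. Validation Errors

Invalid input format, missing required fields, out-of-range values. The request itself is malformed. Recovery: fix the input and retry. Here isRetryable: true means a corrected request can succeed; resending the same malformed input will fail again.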

json
{
  "isError": true,
  "content": [{
    "type": "text",
    "text": "Invalid order ID format"
  }],
  "errorCategory": "validation",
  "isRetryable": true,
  "description": "Order ID must be in format #NNNNN (e.g. #12345). Received: 'order-abc'. Please reformat and retry."
}
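
A minimal sketch of how a tool might produce the validation error above, assuming the #NNNNN order ID format from the example. The function name validate_order_id and the convention of returning None on success are illustrative assumptions, not part of MCP.

```python
import re
from typing import Optional

# Assumed format from the example above: '#' followed by exactly five digits.
ORDER_ID_PATTERN = re.compile(r"#\d{5}")


def validate_order_id(order_id: str) -> Optional[dict]:
    """Return a structured validation error, or None if the ID is well formed.

    Hypothetical helper: the name and return convention are illustrative.
    """
    if ORDER_ID_PATTERN.fullmatch(order_id):
        return None
    return {
        "isError": True,
        "content": [{"type": "text", "text": "Invalid order ID format"}],
        "errorCategory": "validation",
        # Retryable only after the input has been corrected.
        "isRetryable": True,
        "description": (
            "Order ID must be in format #NNNNN (e.g. #12345). "
            f"Received: {order_id!r}. Please reformat and retry."
        ),
    }
```

Echoing the received value back in the description gives the agent what it needs to see why the input was rejected and how to reformat it.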

3. Business Errors

Policy violations, limit exceedances, business rule conflicts. The request is technically valid but violates a business constraint. Recovery: do NOT retry, because the same request will always fail. The agent needs an alternative workflow.

json
{
  "isError": true,
  "content": [{
    "type": "text",
    "text": "Refund exceeds policy limit"
  }],
  "errorCategory": "business",
  "isRetryable": false,
  "description": "Refund amount of £750 exceeds the £500 automatic refund limit. This requires manager approval. Please escalate to a human agent with the refund details."
}

Note the isRetryable: false flag. This is critical. Business errors are never resolved by retrying. The agent must take a fundamentally different path — typically escalation or an alternative workflow. Including a customer-friendly explanation enables the agent to communicate appropriately.

4. Permission Errors

Access denied, insufficient credentials, authorisation failures. The tool cannot execute because the caller lacks the required permissions. Recovery: escalate or use different credentials.

json
{
  "isError": true,
  "content": [{
    "type": "text",
    "text": "Access denied"
  }],
  "errorCategory": "permission",
  "isRetryable": false,
  "description": "The current service account does not have permission to access financial records. Escalate to a senior agent with financial system access."
}
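
The four response shapes above share one structure, so a server could build them through a single helper. The following is a sketch under that assumption: make_error_result and RETRYABLE_BY_CATEGORY are hypothetical names, and the retry defaults simply mirror the isRetryable values shown in the four examples.

```python
from typing import Dict

# Retry guidance per category, mirroring the four examples in this section.
RETRYABLE_BY_CATEGORY: Dict[str, bool] = {
    "transient": True,    # request is valid; try again after a delay
    "validation": True,   # retry only after correcting the input
    "business": False,    # same request will always fail
    "permission": False,  # needs escalation or different credentials
}


def make_error_result(category: str, summary: str, description: str) -> dict:
    """Build a structured tool error (hypothetical helper, not an MCP API)."""
    if category not in RETRYABLE_BY_CATEGORY:
        raise ValueError(f"Unknown error category: {category!r}")
    return {
        "isError": True,
        "content": [{"type": "text", "text": summary}],
        "errorCategory": category,
        "isRetryable": RETRYABLE_BY_CATEGORY[category],
        "description": description,
    }
```

Deriving isRetryable from the category in one place makes it harder for an individual tool to mark a business error as retryable by mistake.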

Access Failure vs Valid Empty Result

This distinction is one of the most critical concepts in the entire domain, and the exam tests it directly.

Access failure: The tool could not reach the data source. A timeout occurred, authentication failed, or the service was down. The data might exist, but the tool could not check. The agent needs to decide whether to retry.

Valid empty result: The tool successfully queried the data source and found no matches. The query executed correctly — there simply is no data matching the criteria. The agent should NOT retry. The answer is "no results found."

Confusing these two breaks recovery logic entirely. Consider this scenario:

A tool returns an empty array after a customer lookup. The agent retries 3 times, then escalates to a human. Analysis reveals the customer's account simply does not exist.

The tool succeeded. It queried the database, found no matching customer, and correctly returned an empty result. But because the response does not distinguish between "I could not reach the database" and "I reached the database and found nothing", the agent treats both the same way — as a failure worth retrying.

The fix: structure your tool responses so that a successful query with no results looks fundamentally different from a failed query.

Valid empty result (NOT an error):

json
{
  "isError": false,
  "content": [{
    "type": "text",
    "text": "No customer found matching email 'john@example.com'. The query executed successfully but returned no matches."
  }],
  "resultCount": 0
}

Access failure (IS an error):

json
{
  "isError": true,
  "content": [{
    "type": "text",
    "text": "Could not reach customer database"
  }],
  "errorCategory": "transient",
  "isRetryable": true,
  "description": "Connection to the customer database timed out after 5 seconds. The query did not execute."
}
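
As a rough sketch of how a tool handler might keep these two outcomes apart: the query callable below is a hypothetical stand-in for a real database client, assumed to return a list of rows or raise TimeoutError or ConnectionError when the database cannot be reached.

```python
from typing import Callable, List


def customer_lookup(email: str, query: Callable[[str], List[dict]]) -> dict:
    """Look up a customer; `query` is a hypothetical database call."""
    try:
        rows = query(email)
    except (TimeoutError, ConnectionError) as exc:
        # Access failure: the query never ran, so the data may still exist.
        return {
            "isError": True,
            "content": [{"type": "text", "text": "Could not reach customer database"}],
            "errorCategory": "transient",
            "isRetryable": True,
            "description": f"The query did not execute: {exc!r}",
        }
    if not rows:
        # Valid empty result: the query ran and matched nothing.
        return {
            "isError": False,
            "content": [{
                "type": "text",
                "text": (
                    f"No customer found matching email '{email}'. "
                    "The query executed successfully but returned no matches."
                ),
            }],
            "resultCount": 0,
        }
    return {
        "isError": False,
        "content": [{"type": "text", "text": f"Found {len(rows)} matching customer(s)."}],
        "resultCount": len(rows),
    }
```

The key design choice is that the empty-list branch is reached only after the query has returned, so an empty result can never be confused with a connection problem.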

Error Propagation in Multi-Agent Systems

In multi-agent architectures, error handling follows a principle of local recovery with selective propagation:

  1. Subagents implement local recovery for transient failures. If a web search times out, the search subagent retries before bothering the coordinator.
  2. Only propagate errors that cannot be resolved locally. If all retries fail, the subagent reports the failure upward.
  3. Include partial results and what was attempted. The coordinator needs context: "I searched 3 of 5 sources successfully. Sources 4 and 5 timed out. Here are partial results from the 3 successful sources."

This prevents two anti-patterns: silently suppressing errors (returning empty results as success) and terminating entire workflows on a single failure. Both destroy the coordinator's ability to make intelligent decisions.
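
The three steps above can be sketched as a subagent routine that retries each source locally and then reports both what succeeded and what failed. This is an illustration only: search_sources, the fetch callable, and the shape of the returned report are assumptions, not a prescribed interface.

```python
import time
from typing import Callable, Dict, List


def search_sources(
    sources: List[str],
    fetch: Callable[[str], str],
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
) -> dict:
    """Retry each source locally, then report partial results upward.

    `fetch` is a hypothetical per-source search call that raises
    TimeoutError on a transient failure.
    """
    results: Dict[str, str] = {}
    failed: Dict[str, str] = {}
    for source in sources:
        for attempt in range(1, max_attempts + 1):
            try:
                results[source] = fetch(source)
                break
            except TimeoutError:
                if attempt == max_attempts:
                    # Local recovery is exhausted; record it for the coordinator.
                    failed[source] = f"timed out after {max_attempts} attempts"
                else:
                    time.sleep(delay_seconds)
    if not failed:
        status = "complete"
    elif results:
        status = "partial"
    else:
        status = "failed"
    return {
        "status": status,
        "results": results,  # partial results are kept, not discarded
        "failed": failed,    # what was attempted and why it failed
        "summary": f"Searched {len(results)} of {len(sources)} sources successfully.",
    }
```

Returning the failed sources next to the partial results avoids both anti-patterns: nothing is silently suppressed, and one bad source does not abort the whole search.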

Key Concept

The distinction between access failures (tool could not reach the data source) and valid empty results (tool successfully queried and found nothing) is critical. Confusing the two causes wasted retries and incorrect escalations. The exam tests this directly.

Exam Traps

EXAM TRAP

Retrying when a tool returns an empty result from a successful query

An empty result from a successful query means 'no data matches your criteria.' Retrying will produce the same empty result. The agent should accept the result and respond accordingly.

EXAM TRAP

Using generic error messages like 'Operation failed' without structured metadata

Without errorCategory, isRetryable, and a description, the agent cannot distinguish transient failures from business rule violations. It cannot make appropriate recovery decisions.

EXAM TRAP

Treating business errors as retryable

Business errors (e.g. refund exceeds policy limit) will never resolve through retry. The same policy violation applies every time. The agent must take an alternative path such as escalation.

EXAM TRAP

Silently suppressing subagent errors by returning empty results as success

This hides failure information from the coordinator, preventing intelligent recovery. The coordinator cannot distinguish 'found nothing' from 'could not search' and may produce incomplete or inaccurate output.

Practice Scenario

A tool returns an empty array after a customer lookup. The agent retries 3 times, then escalates to a human agent. Analysis shows the customer's account simply does not exist. What is the root cause of this wasted effort?

Build Exercise

Build Structured Error Responses for All Four Categories

Intermediate
45 minutes

What you'll learn

  • Implement structured error responses with errorCategory, isRetryable, and description metadata
  • Distinguish between access failures (isError: true) and valid empty results (isError: false)
  • Categorise tool failures into transient, validation, business, and permission types
  • Build agent recovery logic that takes different actions based on error metadata

  1. Create an MCP tool that queries a mock customer database with simulated failure modes

    Why: Simulating failure modes in a controlled environment lets you observe how agents behave when errors lack structure. The exam tests your understanding of how poor error responses cause wasted retries and incorrect escalations.

    You should see: An MCP server running with a customer_lookup tool that accepts a customer identifier and a failure_mode parameter to trigger specific error conditions on demand.

  2. Implement four error response types: transient (simulated timeout), validation (invalid input format), business (refund exceeds policy limit), and permission (access denied)

    Why: Each error category demands a different recovery strategy. The exam tests whether you can identify which category an error belongs to and what recovery action is appropriate. Transient errors are retryable; business errors never are.

    You should see: Four distinct error responses, each with isError: true, a specific errorCategory value, the correct isRetryable boolean, and a descriptive message explaining what went wrong and what to do next.

  3. Include structured metadata in each error: errorCategory, isRetryable boolean, and a human-readable description

    Why: Structured metadata is what enables intelligent recovery. Without these fields, the agent cannot distinguish a transient timeout from a permanent policy violation. The exam specifically tests whether you know that isRetryable: false means the agent must take an alternative path, not retry.

    You should see: Each error response parses to a JSON object that carries, alongside isError and content, three metadata fields: errorCategory (one of transient, validation, business, permission), isRetryable (boolean), and description (a sentence explaining the error and suggesting recovery).

  4. Implement a valid empty result response (isError: false, resultCount: 0) clearly distinguished from an access failure

    Why: This is one of the most critical distinctions in Domain 2. Confusing access failures with valid empty results causes wasted retries and incorrect escalations. The exam tests this directly — an agent retrying a successful empty query is the canonical anti-pattern.

    You should see: Two structurally different responses: a valid empty result with isError: false and resultCount: 0 (indicating the query ran successfully but found nothing), and an access failure with isError: true, errorCategory: transient, and isRetryable: true.

  5. Write an agent loop that reads the error metadata and takes appropriate action: retry for transient, fix input for validation, escalate for business, and request credentials for permission

    Why: The agent loop demonstrates the practical outcome of structured error metadata. Each error category maps to a specific recovery action, and the loop must branch correctly. This is exactly the kind of decision logic the exam expects you to design.

    You should see: An agent loop that parses the error metadata, branches on errorCategory, retries transient errors up to 3 times with backoff, reformats input for validation errors, escalates business errors to a human, and requests elevated credentials for permission errors.
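
One possible shape for the recovery logic in step 5 is sketched below. It is a starting point rather than a reference solution: run_with_recovery, call_tool, fix_input, and the action labels are all hypothetical names, and limiting validation repair to a single attempt is an assumption made to keep the loop bounded.

```python
import time
from typing import Callable


def run_with_recovery(
    call_tool: Callable[[dict], dict],
    args: dict,
    fix_input: Callable[[dict, dict], dict],
    max_retries: int = 3,
    base_delay: float = 0.0,
) -> dict:
    """Call a tool and branch on errorCategory (all names are illustrative)."""
    retries = 0
    input_fixed = False
    while True:
        result = call_tool(args)
        if not result.get("isError", False):
            # Includes valid empty results: accept them, do not retry.
            return {"action": "done", "result": result}
        category = result.get("errorCategory")
        if category == "transient" and retries < max_retries:
            time.sleep(base_delay * (2 ** retries))  # exponential backoff
            retries += 1
            continue
        if category == "validation" and not input_fixed:
            args = fix_input(args, result)  # reformat using the description
            input_fixed = True
            continue
        if category == "business":
            return {"action": "escalate_to_human", "result": result}
        if category == "permission":
            return {"action": "request_credentials", "result": result}
        # Retries exhausted, repeated validation failure, or unknown category.
        return {"action": "report_failure", "result": result}
```

With the defaults, a persistently transient failure results in one initial call plus three retries before the failure is reported upward.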
