Domain 5
15%

Quick Reference: Domain 5 — Context Management & Reliability

Progressive Summarisation Trap

The most common mistake: Relying on the model to progressively summarise earlier context as the conversation grows. Each summarisation pass loses specific details — names, numbers, exact quotes, precise timestamps.

The fix: Extract concrete facts into persistent structured blocks that are carried forward verbatim. Do not ask the model to "summarise what we've discussed" — instead, maintain a structured facts section:

```
## Persistent Facts
- Customer: Jane Smith (ID: 12847)
- Order: #ORD-2024-8821, placed 2024-01-15
- Issue: Delivery address incorrect (postcode SW1A 1AA, should be SW1A 2AA)
- Status: Refund approved, reshipment pending
```

This block is copied forward exactly, not summarised. The model cannot lose these details.
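A minimal sketch of the carry-forward mechanic, assuming a simple prompt-assembly helper (`build_prompt` is illustrative, not from any specific framework). The fact block is a stored string inserted exactly as-is on every turn; only the surrounding conversation is ever summarised:

```python
# Persistent fact block: carried forward verbatim on every turn,
# never passed through a summarisation step.
PERSISTENT_FACTS = """\
## Persistent Facts
- Customer: Jane Smith (ID: 12847)
- Order: #ORD-2024-8821, placed 2024-01-15
- Issue: Delivery address incorrect (postcode SW1A 1AA, should be SW1A 2AA)
- Status: Refund approved, reshipment pending
"""

def build_prompt(summary: str, latest_turns: list[str]) -> str:
    """Assemble context: the fact block goes in exactly as stored,
    while older conversation may be replaced by a summary around it."""
    return "\n\n".join([PERSISTENT_FACTS, summary, *latest_turns])

prompt = build_prompt("Summary: customer reported a wrong delivery address.",
                      ["User: any update on the reshipment?"])
assert "ID: 12847" in prompt      # concrete detail survives verbatim
assert "SW1A 2AA" in prompt
```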

Scratchpad Files

Use scratchpad files to persist findings across context boundaries during multi-step exploration.

When to use: Long-running tasks that may exceed the context window, or tasks where intermediate findings need to survive a context reset.

Pattern:

  1. Write findings to a scratchpad file as you discover them
  2. At context boundaries, read the scratchpad to restore state
  3. The scratchpad is a file on disk — it survives any context reset

Key distinction: Scratchpads are external persistence (files). Conversation history is internal persistence (context window). When the context window fills up, conversation history is lost — scratchpad files are not.
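The pattern above can be sketched with plain file I/O; the filename and helper names here are hypothetical, and a real agent harness would likely namespace the scratchpad per task:

```python
import json
from pathlib import Path

SCRATCHPAD = Path("scratchpad.json")  # on disk, so it survives context resets

def record_finding(key: str, value) -> None:
    """Step 1: write each finding to the scratchpad as it is discovered."""
    state = json.loads(SCRATCHPAD.read_text()) if SCRATCHPAD.exists() else {}
    state[key] = value
    SCRATCHPAD.write_text(json.dumps(state, indent=2))

def restore_state() -> dict:
    """Step 2: at a context boundary, reload everything discovered so far."""
    return json.loads(SCRATCHPAD.read_text()) if SCRATCHPAD.exists() else {}

record_finding("entry_point", "src/main.py")
record_finding("config_loader", "src/config.py:load_settings")
# ...context reset happens here; conversation history is gone...
state = restore_state()
assert state["entry_point"] == "src/main.py"
```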

Error Propagation

Structured error context must include:

| Field | Purpose | Example |
|---|---|---|
| Failure type | What category of failure | "network_timeout", "validation_error" |
| Partial results | What was successfully retrieved | { "orders": [...], "payments": null } |
| Alternatives tried | What recovery was attempted | "Retried 2x, tried fallback endpoint" |
| Suggested action | What should happen next | "Escalate to support with order IDs" |

Never propagate generic error messages. "Something went wrong" gives the next agent (or human) nothing to work with. Structured error context enables intelligent recovery.
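One way to enforce those four fields is a small dataclass; this is a sketch, and the field names simply mirror the table above rather than any standard schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class ToolError:
    """Structured error context propagated instead of a generic message."""
    failure_type: str        # category, e.g. "network_timeout"
    partial_results: dict    # what was successfully retrieved before failing
    alternatives_tried: list # recovery steps already attempted
    suggested_action: str    # what the next agent (or human) should do

err = ToolError(
    failure_type="network_timeout",
    partial_results={"orders": [{"id": "ORD-2024-8821"}], "payments": None},
    alternatives_tried=["Retried 2x", "Tried fallback endpoint"],
    suggested_action="Escalate to support with order IDs",
)
payload = asdict(err)  # serialisable dict, ready to pass downstream
assert payload["failure_type"] == "network_timeout"
assert payload["partial_results"]["payments"] is None
```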

Valid Empty Results vs Access Failures

| Situation | Type | Correct Action |
|---|---|---|
| Search returns 0 results | Valid empty | Accept — absence of data is the answer |
| API returns 401 Unauthorised | Access failure | Retry with correct credentials or escalate |
| Database query returns empty set | Valid empty | Accept — the record does not exist |
| Network timeout | Access failure | Retry or use fallback |
| Filter matches no items | Valid empty | Accept — refine the filter or report no matches |
| Rate limit (429) | Access failure | Wait and retry (respect Retry-After) |

The exam tests this distinction repeatedly. If the tool executed successfully and returned nothing, that is a valid result. If the tool failed to execute, that is an error requiring recovery.
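The distinction can be captured in a single classifier over an HTTP-style outcome; the status-code mapping here is a simplified assumption (real APIs vary, e.g. whether 404 means "missing record" or "bad endpoint"):

```python
def classify_tool_outcome(status_code: int, results):
    """Distinguish a valid empty result (accept) from an access failure (retry)."""
    if status_code in (401, 403):
        return "access_failure:credentials"  # fix auth or escalate
    if status_code == 429:
        return "access_failure:rate_limit"   # wait, honour Retry-After
    if status_code >= 500:
        return "access_failure:server"       # retry or use fallback
    if status_code == 404:
        return "valid_empty"                 # the record does not exist
    if status_code == 200 and not results:
        return "valid_empty"                 # tool ran fine; absence IS the answer
    return "valid_result"

assert classify_tool_outcome(200, []) == "valid_empty"
assert classify_tool_outcome(429, None) == "access_failure:rate_limit"
assert classify_tool_outcome(200, [{"id": 1}]) == "valid_result"
```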

Confidence Calibration & Stratified Validation

Aggregate metrics mask per-type issues. An overall accuracy of 95% can hide the fact that one category has 60% accuracy.

Stratified validation pattern:

  1. Group extractions by type/category
  2. Calculate per-type accuracy
  3. Identify categories where the model underperforms
  4. Apply targeted fixes (better examples, tighter schema) to weak categories

Stratified random sampling: Draw validation samples from every category, proportional to its size. This surfaces error patterns in small categories that simple random sampling across the whole dataset would likely miss.
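Steps 1–2 of the pattern above can be sketched in a few lines; the toy numbers are invented to show how a weak category hides inside a respectable aggregate:

```python
from collections import defaultdict

def per_type_accuracy(samples):
    """samples: list of (category, is_correct) pairs.
    Returns accuracy broken down by category, exposing weak spots
    that the aggregate figure hides."""
    totals, correct = defaultdict(int), defaultdict(int)
    for category, ok in samples:
        totals[category] += 1
        correct[category] += int(ok)
    return {c: correct[c] / totals[c] for c in totals}

samples = ([("invoice", True)] * 19 + [("invoice", False)]      # 95% accurate
         + [("receipt", True)] * 3 + [("receipt", False)] * 2)  # 60% accurate
overall = sum(ok for _, ok in samples) / len(samples)
by_type = per_type_accuracy(samples)
assert round(overall, 2) == 0.88     # aggregate looks healthy
assert by_type["receipt"] == 0.6     # the weak category is exposed
```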

Key exam point: When the question mentions "high overall accuracy but users report errors", the answer is stratified validation — break down accuracy by category to find the weak spots.

Source Attribution

Structured claim-source mappings survive synthesis. Inline links do not.

| Approach | Survives Synthesis? | Use When |
|---|---|---|
| Structured mapping: { claim: "...", source: "...", url: "...", date: "..." } | Yes | Production systems, multi-step pipelines |
| Inline links in text | No — lost during summarisation | Quick prototyping only |

Conflicting sources: Annotate both with provenance and dates. Never average conflicting claims or silently pick one. Present both with their sources and let the consumer decide.

```
Claim: "Context window is 200K tokens"
Source A: Anthropic docs (2024-03-15) — 200K tokens
Source B: Blog post (2023-11-20) — 100K tokens
Resolution: Annotate both with provenance and dates. Source A is newer and from
the authoritative publisher, but present both so the consumer can weigh the
discrepancy rather than silently discarding Source B.
```
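The conflicting-sources example can be represented as a structured record; the exact schema here is an illustrative assumption, the point is that both sources are retained with provenance and the conflict is flagged, not resolved silently:

```python
# Structured claim-source mapping: both conflicting sources are kept,
# annotated with provenance and dates, rather than averaged or pruned.
claim_record = {
    "claim": "Context window is 200K tokens",
    "sources": [
        {"source": "Anthropic docs", "date": "2024-03-15", "value": "200K tokens"},
        {"source": "Blog post",      "date": "2023-11-20", "value": "100K tokens"},
    ],
    "conflict": True,  # surfaced to the consumer, not resolved silently
}

# The consumer can weigh recency without losing either source.
newest = max(claim_record["sources"], key=lambda s: s["date"])
assert newest["source"] == "Anthropic docs"
assert len(claim_record["sources"]) == 2  # nothing was discarded
```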

Monitoring & Observability

Per-type error rates, not just aggregate accuracy. This is the monitoring equivalent of stratified validation.

What to track:

  • Error rate per extraction type/category
  • Error rate per source type (different APIs, different document formats)
  • Confidence scores vs. actual accuracy (calibration curve)
  • Context window utilisation over conversation length
  • Retry rates and recovery success rates

Alert on: Per-category accuracy drops, not just overall accuracy drops. A 2% overall drop might represent a 30% drop in one category.
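A per-category alert check is only a few lines; the threshold and category names below are illustrative, not a recommended production setting:

```python
def category_alerts(baseline: dict, current: dict, threshold: float = 0.05):
    """Flag every category whose accuracy dropped by more than `threshold`,
    even when the aggregate figure still looks stable."""
    return [c for c in baseline
            if baseline[c] - current.get(c, 0.0) > threshold]

baseline = {"invoice": 0.96, "receipt": 0.94, "contract": 0.95}
current  = {"invoice": 0.96, "receipt": 0.65, "contract": 0.95}
# Two of three categories are unchanged; only the degraded one fires.
assert category_alerts(baseline, current) == ["receipt"]
```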

Context Window Strategies

| Strategy | When to Use |
|---|---|
| Persistent fact blocks | Critical details that must survive summarisation |
| Scratchpad files | Multi-step exploration across context boundaries |
| Per-file passes + integration | Large codebases — avoid attention dilution |
| Fresh start + summary injection | Context has become stale or contradictory |
| Prompt caching | Repeated system prompts across requests (cost + latency savings) |

Context window pressure signals: Model starts repeating itself, contradicts earlier statements, or ignores recent information. These indicate the window is too full or the context is stale.

Decision Rules for the Exam

| If the question says... | The answer is likely... |
|---|---|
| "details lost over long conversation" | Persistent fact blocks, not progressive summarisation |
| "findings need to survive context reset" | Scratchpad files (external persistence) |
| "tool returned nothing" | Distinguish: valid empty (accept) vs access failure (retry) |
| "high overall accuracy, users report errors" | Stratified validation — check per-type accuracy |
| "sources disagree" | Annotate both with provenance and dates |
| "generic error message" | Replace with structured error context |
| "model contradicts itself" | Fresh start + summary injection |
| "inline citations lost after processing" | Switch to structured claim-source mappings |
| "monitoring shows stable metrics but quality complaints" | Per-type error rates instead of aggregate |
| "long task exceeds context window" | Scratchpad files + context boundary management |

Common Exam Traps

| Trap | Correct Answer |
|---|---|
| "Summarise the conversation to save context" | Wrong — summarisation loses details; use persistent fact blocks |
| "Overall 95% accuracy means the system is reliable" | Wrong — check per-type accuracy; aggregate masks category issues |
| "Pick the more recent source when they conflict" | Wrong — annotate both with provenance; let consumer decide |
| "Return 'No results found — please try again'" | Wrong if search legitimately found nothing — accept the empty result |
| "Inline citations in the text are sufficient" | Wrong for production — they are lost during synthesis; use structured mappings |
| "Monitor overall error rate for quality" | Wrong — monitor per-type error rates to catch category-specific degradation |
| "Keep all context in the conversation history" | Wrong — use scratchpad files for external persistence across boundaries |