Quick Reference: Domain 5 — Context Management & Reliability

Progressive Summarisation Trap

The most common mistake: Relying on the model to progressively summarise earlier context as the conversation grows. Each summarisation pass loses specific details — names, numbers, exact quotes, precise timestamps.

The fix: Extract concrete facts into persistent structured blocks that are carried forward verbatim. Do not ask the model to "summarise what we've discussed" — instead, maintain a structured facts section:

## Persistent Facts
- Customer: Jane Smith (ID: 12847)
- Order: #ORD-2024-8821, placed 2024-01-15
- Issue: Delivery address incorrect (postcode SW1A 1AA, should be SW1A 2AA)
- Status: Refund approved, reshipment pending

This block is copied forward exactly, not summarised. The model cannot lose these details.

Scratchpad Files

Use scratchpad files to persist findings across context boundaries during multi-step exploration.

When to use: Long-running tasks that may exceed the context window, or tasks where intermediate findings need to survive a context reset.

Pattern:

Write findings to a scratchpad file as you discover them
At context boundaries, read the scratchpad to restore state
The scratchpad is a file on disk — it survives any context reset

Key distinction: Scratchpads are external persistence (files). Conversation history is internal persistence (context window). When the context window fills up, conversation history is lost — scratchpad files are not.

Error Propagation

Structured error context must include:

Field	Purpose	Example
Failure type	What category of failure	`"network_timeout"`, `"validation_error"`
Partial results	What was successfully retrieved	`{ "orders": [...], "payments": null }`
Alternatives tried	What recovery was attempted	`"Retried 2x, tried fallback endpoint"`
Suggested action	What should happen next	`"Escalate to support with order IDs"`

Never propagate generic error messages. "Something went wrong" gives the next agent (or human) nothing to work with. Structured error context enables intelligent recovery.

Valid Empty Results vs Access Failures

Situation	Type	Correct Action
Search returns 0 results	Valid empty	Accept — absence of data is the answer
API returns 401 Unauthorised	Access failure	Retry with correct credentials or escalate
Database query returns empty set	Valid empty	Accept — the record does not exist
Network timeout	Access failure	Retry or use fallback
Filter matches no items	Valid empty	Accept — refine the filter or report no matches
Rate limit (429)	Access failure	Wait and retry (respect `Retry-After`)

The exam tests this distinction repeatedly. If the tool executed successfully and returned nothing, that is a valid result. If the tool failed to execute, that is an error requiring recovery.

Confidence Calibration & Stratified Validation

Aggregate metrics mask per-type issues. An overall accuracy of 95% can hide the fact that one category has 60% accuracy.

Stratified validation pattern:

Group extractions by type/category
Calculate per-type accuracy
Identify categories where the model underperforms
Apply targeted fixes (better examples, tighter schema) to weak categories

Stratified random sampling: Select validation samples from each category proportionally. This detects novel error patterns that random sampling across all categories would miss.

Key exam point: When the question mentions "high overall accuracy but users report errors", the answer is stratified validation — break down accuracy by category to find the weak spots.

Source Attribution

Structured claim-source mappings survive synthesis. Inline links do not.

Approach	Survives Synthesis?	Use When
Structured mapping: `{ claim: "...", source: "...", url: "...", date: "..." }`	Yes	Production systems, multi-step pipelines
Inline links in text	No — lost during summarisation	Quick prototyping only

Conflicting sources: Annotate both with provenance and dates. Never average conflicting claims or silently pick one. Present both with their sources and let the consumer decide.

Claim: "Context window is 200K tokens"
  Source A: Anthropic docs (2024-03-15) — 200K tokens
  Source B: Blog post (2023-11-20) — 100K tokens
  Resolution: Source A is newer and authoritative. Source B is outdated.

Monitoring & Observability

Per-type error rates, not just aggregate accuracy. This is the monitoring equivalent of stratified validation.

What to track:

Error rate per extraction type/category
Error rate per source type (different APIs, different document formats)
Confidence scores vs. actual accuracy (calibration curve)
Context window utilisation over conversation length
Retry rates and recovery success rates

Alert on: Per-category accuracy drops, not just overall accuracy drops. A 2% overall drop might represent a 30% drop in one category.

Context Window Strategies

Strategy	When to Use
Persistent fact blocks	Critical details that must survive summarisation
Scratchpad files	Multi-step exploration across context boundaries
Per-file passes + integration	Large codebases — avoid attention dilution
Fresh start + summary injection	Context has become stale or contradictory
Prompt caching	Repeated system prompts across requests (cost + latency savings)

Context window pressure signals: Model starts repeating itself, contradicts earlier statements, or ignores recent information. These indicate the window is too full or the context is stale.

Decision Rules for the Exam

If the question says...	The answer is likely...
"details lost over long conversation"	Persistent fact blocks, not progressive summarisation
"findings need to survive context reset"	Scratchpad files (external persistence)
"tool returned nothing"	Distinguish: valid empty (accept) vs access failure (retry)
"high overall accuracy, users report errors"	Stratified validation — check per-type accuracy
"sources disagree"	Annotate both with provenance and dates
"generic error message"	Replace with structured error context
"model contradicts itself"	Fresh start + summary injection
"inline citations lost after processing"	Switch to structured claim-source mappings
"monitoring shows stable metrics but quality complaints"	Per-type error rates instead of aggregate
"long task exceeds context window"	Scratchpad files + context boundary management

Common Exam Traps

Trap	Correct Answer
"Summarise the conversation to save context"	Wrong — summarisation loses details; use persistent fact blocks
"Overall 95% accuracy means the system is reliable"	Wrong — check per-type accuracy; aggregate masks category issues
"Pick the more recent source when they conflict"	Wrong — annotate both with provenance; let consumer decide
"Return 'No results found — please try again'"	Wrong if search legitimately found nothing — accept the empty result
"Inline citations in the text are sufficient"	Wrong for production — they are lost during synthesis; use structured mappings
"Monitor overall error rate for quality"	Wrong — monitor per-type error rates to catch category-specific degradation
"Keep all context in the conversation history"	Wrong — use scratchpad files for external persistence across boundaries