Context Window
The maximum number of tokens (input plus output) that Claude can process in a single API call. Different Claude models have different context window sizes (e.g., 200K tokens for Claude 3.5 Sonnet). Everything in the conversation — system prompt, message history, tool definitions, and the response — must fit within this limit.
Exam context: Know the context window sizes for current Claude models and how to calculate total token usage. The exam tests strategies for staying within limits: summarisation, message pruning, and content prioritisation.
See also: 5.1 Context Window Management
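The budgeting arithmetic can be sketched as a pre-flight check (a minimal sketch; `fits_in_context` is a hypothetical helper, not an SDK function, and the 200K figure matches the example above):

```python
# Rough pre-flight check that a request fits in the context window.
# CONTEXT_WINDOW matches the 200K example above; adjust per model.
CONTEXT_WINDOW = 200_000

def fits_in_context(input_tokens: int, max_output_tokens: int,
                    window: int = CONTEXT_WINDOW) -> bool:
    """The input plus the reserved output budget must fit in the window."""
    return input_tokens + max_output_tokens <= window

# A 150K-token prompt leaves room for a 4,096-token response...
print(fits_in_context(150_000, 4_096))   # True
# ...but a 199K-token prompt does not.
print(fits_in_context(199_000, 4_096))   # False
```

Note that the output budget (`max_tokens`) must be reserved up front, which is why a prompt close to the window size fails even before any response is generated.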
Prompt Caching
An API feature that allows frequently reused prompt content to be cached, reducing latency and cost on subsequent requests. Cached content (such as system prompts, tool definitions, or large documents) is marked with cache_control and stored for a time-to-live (TTL) period (by default five minutes, refreshed each time the cached content is reused). Cache hits are charged at a reduced rate.
Exam context: Know how to enable caching (cache_control: { type: "ephemeral" }), what can be cached (system prompts, tools, messages), and the ordering requirement — cached content must come before non-cached content. Understand cache TTL behaviour.
See also: 5.2 Prompt Caching
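The shape of a caching-enabled request can be sketched as a plain payload (field names follow the public Messages API; the model name and text are placeholders):

```python
# Sketch of a Messages API request body with prompt caching enabled.
# The system prompt block is marked with cache_control so that later
# requests reusing the same prefix can be served from cache.
request = {
    "model": "claude-3-5-sonnet-latest",   # placeholder model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a support assistant. <large policy document>",
            "cache_control": {"type": "ephemeral"},  # cache up to this block
        }
    ],
    "messages": [
        {"role": "user", "content": "What is the refund policy?"}
    ],
}
```

Because caching is prefix-based, everything before the `cache_control` marker must be byte-identical across requests for a cache hit to occur.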
Rate Limiting
Controls imposed by the API to restrict the number of requests, input tokens, or output tokens a user can consume within a time window. Rate limits are applied per organisation and per model. When limits are exceeded, the API returns a 429 status code. Proper handling includes exponential backoff and request queuing.
Exam context: Know the difference between requests-per-minute (RPM) and tokens-per-minute (TPM) limits. The exam tests rate limit handling strategies: retries with backoff, request queuing, and load distribution across time windows.
See also: 5.4 Rate Limiting & Quotas
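The backoff strategy mentioned above can be sketched as a delay schedule (a minimal sketch; `backoff_delay` is a hypothetical helper you would call after receiving a 429):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  jitter: bool = False) -> float:
    """Delay before retry `attempt` (0-based): 1s, 2s, 4s, ... capped at `cap`."""
    delay = min(cap, base * (2 ** attempt))
    if jitter:
        # Full jitter spreads retries out so clients don't retry in lockstep.
        delay = random.uniform(0, delay)
    return delay

print(backoff_delay(0))    # 1.0
print(backoff_delay(3))    # 8.0
print(backoff_delay(10))   # 60.0 (capped)
```

In practice the retry loop would sleep for this delay after each 429 response and give up after a fixed number of attempts; jitter matters most when many clients share the same rate limit.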
Quotas
Usage limits that cap total API consumption over a billing period (e.g., maximum spend per month). Unlike rate limits, which restrict throughput, quotas bound cumulative usage. Quotas are configured in the Anthropic Console and can be set at the organisation or workspace level.
Exam context: Understand the difference between rate limits (throughput) and quotas (total usage). Know how to monitor quota consumption and set up alerts before hitting limits.
See also: 5.4 Rate Limiting & Quotas
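Client-side quota tracking can be sketched as follows (a hypothetical helper; the per-token prices are illustrative placeholders, not current Anthropic pricing):

```python
# Hypothetical client-side spend tracker. Prices are illustrative only.
INPUT_USD_PER_MTOK = 3.00    # placeholder: USD per million input tokens
OUTPUT_USD_PER_MTOK = 15.00  # placeholder: USD per million output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Approximate USD cost of one request from its usage counts."""
    return (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000

def near_quota(spend_usd: float, quota_usd: float,
               threshold: float = 0.8) -> bool:
    """Alert once cumulative spend crosses a fraction of the monthly quota."""
    return spend_usd >= threshold * quota_usd
```

Accumulating `request_cost` across requests and checking `near_quota` is one way to trigger alerts before the hard limit is reached, as the exam context suggests.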
Monitoring
The practice of tracking API usage, performance metrics, and error rates in production Claude applications. Key metrics include response latency, token usage, error rates, tool call success rates, and cost per request. Monitoring enables early detection of issues and data-driven optimisation.
Exam context: Know which metrics to monitor for a production Claude deployment. The exam tests whether you can design a monitoring strategy that covers performance, cost, and reliability.
See also: 5.5 Monitoring & Observability
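The key metrics listed above can be collected with a simple in-process aggregator (a sketch; a production deployment would export these to a metrics backend rather than keep them in memory):

```python
from statistics import mean

class Metrics:
    """Minimal aggregator for latency, token usage, and error rate."""

    def __init__(self) -> None:
        self.latencies_ms: list[float] = []
        self.requests = 0
        self.errors = 0
        self.total_tokens = 0

    def record(self, latency_ms: float, tokens: int,
               error: bool = False) -> None:
        self.requests += 1
        self.latencies_ms.append(latency_ms)
        self.total_tokens += tokens
        if error:
            self.errors += 1

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

    @property
    def avg_latency_ms(self) -> float:
        return mean(self.latencies_ms) if self.latencies_ms else 0.0
```

Recording one sample per API call gives the aggregate view (performance, cost, reliability) that the exam context describes.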
Observability
The ability to understand the internal state of a Claude-powered system by examining its outputs. Observability goes beyond monitoring by providing detailed traces of individual requests, including prompt content, tool calls, model responses, and latency breakdowns. Tracing frameworks and structured logging are core observability tools.
Exam context: Understand the difference between monitoring (aggregate metrics) and observability (individual request traces). The exam may ask about implementing tracing for agentic loops where multiple API calls occur per user request.
See also: 5.5 Monitoring & Observability
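Tracing an agentic loop can be sketched as one structured log line per API call, tied together by a shared trace ID (a sketch of the idea; the field names are illustrative, not a specific tracing framework's schema):

```python
import json
import time

def trace_span(trace_id: str, step: int, prompt_tokens: int,
               output_tokens: int, tool_calls: list[str],
               latency_ms: float) -> str:
    """Emit one span of an agentic loop as a JSON line for a log aggregator."""
    return json.dumps({
        "trace_id": trace_id,        # shared across all steps of one request
        "step": step,                # position within the agentic loop
        "prompt_tokens": prompt_tokens,
        "output_tokens": output_tokens,
        "tool_calls": tool_calls,    # names of tools invoked this step
        "latency_ms": latency_ms,
        "ts": time.time(),
    })
```

Grouping spans by `trace_id` reconstructs the full multi-call trace for a single user request, which is the capability that distinguishes observability from aggregate monitoring.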
Token Counting
The process of measuring how many tokens a prompt or response consumes. Tokens are sub-word units — roughly 3-4 characters per token in English. The API response includes input_tokens and output_tokens in the usage field. Accurate token counting is essential for context window management and cost estimation.
Exam context: Know how to read the usage field in API responses. The exam tests awareness that token counts include all content (system prompt, messages, tool definitions) and that different content types have different token-per-character ratios.
See also: 5.1 Context Window Management
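Reading the usage field can be sketched against a response-shaped dict (the token values are illustrative; the field names match the public API):

```python
# Shape of the usage field in a Messages API response (values illustrative).
response = {
    "usage": {
        "input_tokens": 2095,   # covers system prompt, messages, tool defs
        "output_tokens": 503,
    },
}

total_tokens = (response["usage"]["input_tokens"]
                + response["usage"]["output_tokens"])
print(total_tokens)  # 2598
```

Summing both counts per request is the basis for both cost estimation and checking consumption against the context window.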
Cache Breakpoints
Markers in the message array that define where cached content ends and new content begins. The cache_control field with type: "ephemeral" is placed on the last cached message or content block. Content before the breakpoint can be served from cache; content after it is processed fresh on each request.
Exam context: Know the placement rules for cache breakpoints. The exam tests understanding that breakpoints must be placed strategically — too many breakpoints waste cache storage, while too few reduce cache hit rates.
See also: 5.2 Prompt Caching
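Breakpoint placement in a multi-turn conversation can be sketched as follows (payload shape follows the public API; model name and content are placeholders). The marker sits on the last stable content block, so the entire prefix is cacheable for the next turn:

```python
# Sketch: cache_control on the latest stable user turn. Everything up to
# and including that block can be served from cache on the next request;
# turns appended after it are processed fresh.
request = {
    "model": "claude-3-5-sonnet-latest",  # placeholder model name
    "max_tokens": 1024,
    "messages": [
        {"role": "user", "content": "Here is the contract text: <document>"},
        {"role": "assistant", "content": "I've read the contract."},
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "List the termination clauses.",
                    "cache_control": {"type": "ephemeral"},  # breakpoint
                }
            ],
        },
        # A later turn appended here would fall after the breakpoint
        # and be processed fresh on each request.
    ],
}
```

Moving the breakpoint forward as the conversation grows keeps the hit rate high without scattering markers across every message.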
Extended Thinking
A Claude feature that allocates a dedicated "thinking" budget before the model produces its visible response. When enabled, Claude uses internal reasoning tokens (not visible in the final output by default) to work through complex problems. This improves accuracy on tasks requiring multi-step reasoning, analysis, or planning.
Exam context: Know how to enable extended thinking via the API (thinking parameter with a budget_tokens value). Understand that thinking tokens count against the output token limit and are billed accordingly. Know when extended thinking helps versus when it is unnecessary overhead.
See also: 5.1 Context Window Management
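Enabling extended thinking can be sketched as a request payload (field names follow the public API; the model name is a placeholder):

```python
# Sketch: enabling extended thinking. The thinking budget is carved out
# of the output token limit, so budget_tokens must be below max_tokens.
request = {
    "model": "claude-3-7-sonnet-latest",  # placeholder model name
    "max_tokens": 16_000,
    "thinking": {
        "type": "enabled",
        "budget_tokens": 10_000,  # internal reasoning budget
    },
    "messages": [
        {"role": "user", "content": "Plan a three-phase database migration."}
    ],
}

# Consistency check on the budget relationship described above:
assert request["thinking"]["budget_tokens"] < request["max_tokens"]
```

Because thinking tokens count against the output limit and are billed, the budget should be sized to the task: generous for multi-step reasoning, omitted entirely for simple lookups.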
Batches API
An API endpoint for submitting large volumes of requests to be processed asynchronously at a reduced cost. Batch requests are queued and processed within a 24-hour window. Results are retrieved by polling the batch status. This is ideal for offline processing tasks like bulk content generation, evaluation, or data extraction.
Exam context: Know the Batches API's cost advantage (typically 50% cheaper than real-time requests) and its trade-off: results are asynchronous, with no completion guarantee tighter than the 24-hour window. Understand when batch processing is appropriate versus when real-time processing is required.
See also: 5.4 Rate Limiting & Quotas
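A batch submission can be sketched as a payload of independent requests (shape follows the public Batches API; model name and content are placeholders). Each entry carries a custom_id so results, which may come back in any order, can be matched to their requests:

```python
# Sketch of a batch submission payload: one entry per offline task,
# each with a custom_id for matching results back to requests.
batch = {
    "requests": [
        {
            "custom_id": f"doc-{i}",
            "params": {
                "model": "claude-3-5-sonnet-latest",  # placeholder
                "max_tokens": 512,
                "messages": [
                    {"role": "user",
                     "content": f"Extract key dates from document {i}."}
                ],
            },
        }
        for i in range(3)
    ]
}
```

After submission, the client polls the batch's status until processing has ended, then downloads the results file and joins rows back to inputs via custom_id.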