There's no secret syntax that makes Claude behave. Prompt engineering is about giving the model explicit criteria, hard constraints, and concrete examples so it does what you actually need, every time. This guide covers the patterns that work in production and the ones the Claude Certified Architect exam will test you on.
System prompts: explicit criteria beat vague instructions
The single biggest mistake people make with production prompts? Vague system instructions. Telling Claude to "be conservative" or "only report high-confidence findings" gives it no actual decision boundary. It'll interpret those phrases differently every time.
Compare these two approaches to a code review system prompt:
Vague (unreliable):
You are a code reviewer. Be conservative and only flag important issues.
Report findings with high confidence.
Explicit criteria (reliable):
You are a code reviewer. Report findings in these categories only:
BUGS: Flag when code will produce incorrect results at runtime.
- Report: null dereference, off-by-one errors, unchecked return values
- Skip: style preferences, naming conventions, missing comments
SECURITY: Flag when code introduces an exploitable vulnerability.
- Report: SQL injection, XSS, hardcoded credentials, path traversal
- Skip: theoretical attacks requiring physical access
Do not report findings outside these categories.
The second version tells Claude exactly what counts and what doesn't. No wiggle room. That's what makes it reliable across thousands of invocations, and it's the pattern the exam tests in Task 4.1 — System Prompts.
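Explicit categories also pay off on the application side, because you can police them in code. As a minimal sketch (the `ALLOWED_CATEGORIES` set and the finding shape are illustrative assumptions, not part of any Anthropic API), a post-filter can drop anything the model reports outside the whitelist:

```python
# Hypothetical post-filter: drop any finding whose category is not one
# the system prompt explicitly allows. Assumes each finding is a dict
# like {"category": "BUGS", "message": "possible null dereference"}.
ALLOWED_CATEGORIES = {"BUGS", "SECURITY"}

def filter_findings(findings: list[dict]) -> list[dict]:
    """Keep only findings whose category is in the allowed set."""
    return [f for f in findings if f.get("category") in ALLOWED_CATEGORIES]
```

This is belt-and-braces: the prompt says "do not report findings outside these categories", and the filter guarantees it even when the model slips.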
Severity calibration with code examples
You can describe severity levels in prose ("critical means the application will crash"), but it's less reliable than showing actual code:
Severity levels:
CRITICAL — Application crashes or data loss:
Example: `db.execute(f"DELETE FROM users WHERE id = {user_input}")`
→ SQL injection allowing arbitrary data deletion
MINOR — Suboptimal but functional:
Example: `items = [x for x in big_list if x > 0]; items = list(filter(...))`
→ Redundant list creation, minor memory overhead
Why does this work better? Claude can pattern-match against real code. It doesn't have to guess what you meant by "critical" — it can see what critical looks like.
Few-shot examples: when they help and when they waste tokens
Few-shot examples are great for locking down output format and covering edge cases. They're not a fix for broken prompt design.
When few-shot works
The sweet spot is consistent output formatting. Show Claude what the result should look like:
import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-6-20250514",
    max_tokens=1024,
    system="""Extract structured data from customer messages.
Return JSON with these fields: intent, product, urgency (low/medium/high).
Example input: "My laptop screen cracked yesterday, I need this fixed ASAP"
Example output: {"intent": "repair", "product": "laptop", "urgency": "high"}
Example input: "Just wondering what colour options the new headphones come in"
Example output: {"intent": "inquiry", "product": "headphones", "urgency": "low"}""",
    messages=[
        {"role": "user", "content": "The printer I bought last week keeps jamming and I have a deadline tomorrow"}
    ],
)
Those two examples do a lot of work. They lock in the JSON shape and show Claude how to map language cues to urgency ("ASAP" → high, "just wondering" → low). Covered in depth in Task 4.4 — Few-Shot & Examples.
When few-shot is the wrong fix
If Claude keeps calling get_customer when it should call lookup_order, throwing more few-shot examples at the problem won't help. The root cause is overlapping tool descriptions. Fix the descriptions, not the prompt.
This catches people on the exam. The exam always favours the low-effort, high-leverage fix. Cleaning up tool descriptions beats adding few-shot examples for routing problems. Every time.
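As a hedged sketch of what "fix the descriptions" means in practice (the tool names come from the scenario above; the schemas and wording are illustrative, not from any real product), the goal is descriptions that state both when to use the tool and when not to:

```python
# Illustrative tool definitions with deliberately non-overlapping
# descriptions, in the shape the Anthropic Messages API tools
# parameter expects.
tools = [
    {
        "name": "lookup_order",
        "description": (
            "Look up a specific order by its order ID. Use ONLY when the "
            "user references an order number, shipment, or purchase. "
            "Do NOT use for questions about the customer's account."
        ),
        "input_schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
    {
        "name": "get_customer",
        "description": (
            "Fetch a customer's account profile by email address. Use ONLY "
            "for account-level questions (contact details, plan, billing). "
            "Do NOT use to answer questions about individual orders."
        ),
        "input_schema": {
            "type": "object",
            "properties": {"email": {"type": "string"}},
            "required": ["email"],
        },
    },
]
```

Each description carves out a disjoint territory, so the model has a clean decision boundary instead of two plausible options.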
Structured output with validation
If your application code can't parse Claude's output, it doesn't matter how good the answer was. Production systems need structured output, which you get from Claude by prompting for strict JSON and validating the result, or by constraining generation with a tool's input schema.
JSON output pattern
import anthropic
import json

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-6-20250514",
    max_tokens=1024,
    system="""Analyse the given text and return a JSON object with this exact structure:
{
  "sentiment": "positive" | "negative" | "neutral",
  "confidence": 0.0-1.0,
  "key_topics": ["string"],
  "summary": "string (one sentence)"
}
Return ONLY the JSON object. No explanation, no markdown formatting.""",
    messages=[
        {"role": "user", "content": "The new API is significantly faster but the documentation is still incomplete."}
    ],
)

result = json.loads(message.content[0].text)
Output validation pipeline
Never trust model output without validation. It'll be correct 95% of the time, and that remaining 5% will break your downstream pipeline at 2am on a Saturday. Validate before passing data anywhere:
from pydantic import BaseModel, ValidationError, field_validator
from typing import Literal

class SentimentAnalysis(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float
    key_topics: list[str]
    summary: str

    @field_validator("confidence")
    @classmethod
    def confidence_in_range(cls, v: float) -> float:
        if not 0.0 <= v <= 1.0:
            raise ValueError("Confidence must be between 0 and 1")
        return v

    @field_validator("key_topics")
    @classmethod
    def topics_not_empty(cls, v: list[str]) -> list[str]:
        if len(v) == 0:
            raise ValueError("At least one topic required")
        return v

# Validate the model output
try:
    analysis = SentimentAnalysis.model_validate_json(message.content[0].text)
except ValidationError as e:
    # Handle validation failure — retry, fall back, or escalate
    print(f"Validation failed: {e}")
Generate, validate, handle failure. That's the loop, and it's what the exam tests in Task 4.6 — Output Validation. The principle is simple: validate at the boundary between the model and your application code. Don't let unvalidated JSON propagate.
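That loop can be sketched generically. Here `generate` stands in for whatever callable hits the model (for example, a closure around `client.messages.create`); the function itself and its retry count are illustrative, not an Anthropic API:

```python
from pydantic import BaseModel, ValidationError
from typing import Callable, Type, TypeVar

T = TypeVar("T", bound=BaseModel)

def generate_validated(generate: Callable[[], str],
                       schema: Type[T],
                       max_retries: int = 2) -> T:
    """Generate, validate against `schema`, and retry on malformed output.

    `generate` is any zero-argument callable returning the model's raw
    text; `schema` is a Pydantic model such as SentimentAnalysis above.
    """
    last_error: ValidationError | None = None
    for _ in range(max_retries + 1):
        raw = generate()
        try:
            return schema.model_validate_json(raw)
        except ValidationError as e:
            last_error = e  # malformed output: try again
    raise RuntimeError(f"Validation failed after retries: {last_error}")
```

Keeping the model call behind a callable means the same retry logic works for any schema and any prompt in your system.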
Prompt chaining for complex tasks
When one prompt can't reliably do the whole job, break it into steps where each step's output feeds the next:
- Extract — Pull structured data from unstructured input
- Transform — Apply business logic or classification
- Validate — Check the output against constraints
- Format — Produce the final output
Each step has a narrower scope, so you can write tighter criteria and catch failures earlier. A 4-step chain where each step is simple and validated will beat a single monolithic prompt trying to do everything at once. It's not even close.
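One way to sketch that chain, with each model call stubbed out as a plain function (the step bodies here are placeholders standing in for narrow, validated prompts, not a real pipeline):

```python
def extract(raw: str) -> dict:
    # Step 1 (stub): in production, a narrow prompt pulls structured
    # fields out of the raw input.
    return {"text": raw.strip()}

def transform(record: dict) -> dict:
    # Step 2 (stub): apply business logic or classification.
    record["label"] = "long" if len(record["text"]) > 20 else "short"
    return record

def validate(record: dict) -> dict:
    # Step 3: fail fast, before the final step ever runs.
    if not record["text"]:
        raise ValueError("empty extraction")
    return record

def format_output(record: dict) -> str:
    # Step 4: produce the final output.
    return f"[{record['label']}] {record['text']}"

def run_chain(raw: str) -> str:
    # Each step's output feeds the next; a failure stops the chain early.
    return format_output(validate(transform(extract(raw))))
```

Because each stage is an ordinary function boundary, you can test, log, and retry the stages independently instead of debugging one opaque mega-prompt.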
The exam tests this in Task 4.2 — Structured Output.
The false positive problem
Here's a scenario the exam loves. A CI/CD pipeline flags too many false positives. Developers start ignoring all findings, not just the bad ones. Trust erodes across every category.
The tempting answer is "add a confidence threshold." Don't pick it. Self-reported confidence scores are poorly calibrated. Here's what actually works:
- Identify which categories have high false positive rates
- Temporarily disable those categories to restore trust in the ones that work
- Rewrite prompts for the disabled categories with explicit criteria and code examples
- Re-enable one at a time, measuring false positive rates as you go
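The measurement in the last step is straightforward to automate. A minimal sketch, assuming you have human-labelled findings as (category, was_real_issue) pairs; the 25% threshold is an arbitrary illustration, not a recommendation:

```python
from collections import defaultdict

def false_positive_rates(labelled: list[tuple[str, bool]]) -> dict[str, float]:
    """Per-category false positive rate from human-labelled findings,
    given as (category, was_real_issue) pairs."""
    total: dict[str, int] = defaultdict(int)
    false_pos: dict[str, int] = defaultdict(int)
    for category, was_real in labelled:
        total[category] += 1
        if not was_real:
            false_pos[category] += 1
    return {c: false_pos[c] / total[c] for c in total}

def categories_to_disable(rates: dict[str, float],
                          threshold: float = 0.25) -> set[str]:
    # Disable any category whose FP rate exceeds the (illustrative) threshold.
    return {c for c, r in rates.items() if r > threshold}
```

Run it on a sample of triaged findings before and after each prompt rewrite, and you have the evidence the re-enable step depends on.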
That's the systematic approach the exam expects. Vague instructions, confidence filtering, temperature adjustments? All distractors.
Key takeaways for the exam
- Explicit categorical criteria beat vague instructions. Always.
- Few-shot fixes formatting, not routing. If tools are misrouted, fix the tool descriptions first.
- Pydantic (or similar schema validation) catches malformed responses before they hit your users. Use it.
- When an exam question asks for the best fix, look for the low-effort, high-leverage option. It's almost always the answer.
- Severity calibration with code examples outperforms prose descriptions because the model can pattern-match against real code.
- Don't trust self-reported confidence scores. Validate output at the boundary.
Start with the Domain 4 lessons for the full curriculum, or go straight to the practice questions to test yourself.