There's no secret syntax that makes Claude behave. Prompt engineering is about giving the model explicit criteria, hard constraints, and concrete examples so it does what you actually need, every time. This guide covers the patterns that work in production and the ones the Claude Certified Architect exam will test you on.
System prompts: explicit criteria beat vague instructions
The single biggest mistake people make with production prompts? Vague system instructions. Telling Claude to "be conservative" or "only report high-confidence findings" gives it no actual decision boundary. It'll interpret those phrases differently every time.
Compare these two approaches to a code review system prompt:
Vague (unreliable):
You are a code reviewer. Be conservative and only flag important issues.
Report findings with high confidence.
Explicit criteria (reliable):
You are a code reviewer. Report findings in these categories only:
BUGS: Flag when code will produce incorrect results at runtime.
- Report: null dereference, off-by-one errors, unchecked return values
- Skip: style preferences, naming conventions, missing comments
SECURITY: Flag when code introduces an exploitable vulnerability.
- Report: SQL injection, XSS, hardcoded credentials, path traversal
- Skip: theoretical attacks requiring physical access
Do not report findings outside these categories.
The second version tells Claude exactly what counts and what doesn't. No wiggle room. That's what makes it reliable across thousands of invocations, and it's the pattern the exam tests in Task 4.1 — System Prompts.
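Explicit categories also pay off on the application side, because you can police them in code. As a minimal sketch (the `ALLOWED_CATEGORIES` set and the finding shape are illustrative assumptions, not part of any Anthropic API), a post-filter can drop anything the model reports outside the whitelist:

```python
# Hypothetical post-filter: drop any finding whose category is not one
# the system prompt explicitly allows. Assumes each finding is a dict
# like {"category": "BUGS", "message": "possible null dereference"}.
ALLOWED_CATEGORIES = {"BUGS", "SECURITY"}

def filter_findings(findings: list[dict]) -> list[dict]:
    """Keep only findings whose category is in the allowed set."""
    return [f for f in findings if f.get("category") in ALLOWED_CATEGORIES]
```

This is belt-and-braces: the prompt says "do not report findings outside these categories", and the filter guarantees it even when the model slips.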
Severity calibration with code examples
You can describe severity levels in prose ("critical means the application will crash"), but it's less reliable than showing actual code:
Severity levels:
CRITICAL — Application crashes or data loss:
Example: `db.execute(f"DELETE FROM users WHERE id = {user_input}")`
→ SQL injection allowing arbitrary data deletion
MINOR — Suboptimal but functional:
Example: `items = [x for x in big_list if x > 0]; items = list(filter(...))`
→ Redundant list creation, minor memory overhead
Why does this work better? Claude can pattern-match against real code. It doesn't have to guess what you meant by "critical" — it can see what critical looks like.
Few-shot examples: when they help and when they waste tokens
Few-shot examples are great for locking down output format and covering edge cases. They're not a fix for broken prompt design.
When few-shot works
The sweet spot is consistent output formatting. Show Claude what the result should look like:
import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-6-20250514",
    max_tokens=1024,
    system="""Extract structured data from customer messages.
Return JSON with these fields: intent, product, urgency (low/medium/high).
Example input: "My laptop screen cracked yesterday, I need this fixed ASAP"
Example output: {"intent": "repair", "product": "laptop", "urgency": "high"}
Example input: "Just wondering what colour options the new headphones come in"
Example output: {"intent": "inquiry", "product": "headphones", "urgency": "low"}""",
    messages=[
        {"role": "user", "content": "The printer I bought last week keeps jamming and I have a deadline tomorrow"}
    ],
)
Those two examples do a lot of work. They lock in the JSON shape and show Claude how to map language cues to urgency ("ASAP" → high, "just wondering" → low). Covered in depth in Task 4.4 — Few-Shot & Examples.
When few-shot is the wrong fix
If Claude keeps calling get_customer when it should call lookup_order, throwing more few-shot examples at the problem won't help. The root cause is overlapping tool descriptions. Fix the descriptions, not the prompt.
This catches people on the exam. The exam always favours the low-effort, high-leverage fix. Cleaning up tool descriptions beats adding few-shot examples for routing problems. Every time.
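As a hedged sketch of what "fix the descriptions" means in practice (the tool names come from the scenario above; the schemas and wording are illustrative, not from any real product), the goal is descriptions that state both when to use the tool and when not to:

```python
# Illustrative tool definitions with deliberately non-overlapping
# descriptions, in the shape the Anthropic Messages API tools
# parameter expects.
tools = [
    {
        "name": "lookup_order",
        "description": (
            "Look up a specific order by its order ID. Use ONLY when the "
            "user references an order number, shipment, or purchase. "
            "Do NOT use for questions about the customer's account."
        ),
        "input_schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
    {
        "name": "get_customer",
        "description": (
            "Fetch a customer's account profile by email address. Use ONLY "
            "for account-level questions (contact details, plan, billing). "
            "Do NOT use to answer questions about individual orders."
        ),
        "input_schema": {
            "type": "object",
            "properties": {"email": {"type": "string"}},
            "required": ["email"],
        },
    },
]
```

Each description carves out a disjoint territory, so the model has a clean decision boundary instead of two plausible options.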
Structured output with validation
If your application code can't parse Claude's output, it doesn't matter how good the answer was. Production systems need structured output, which you get from Claude by prompting for strict JSON and validating the result, or by constraining generation with a tool's input schema.
JSON output pattern
import anthropic
import json

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-6-20250514",
    max_tokens=1024,
    system="""Analyse the given text and return a JSON object with this exact structure:
{
  "sentiment": "positive" | "negative" | "neutral",
  "confidence": 0.0-1.0,
  "key_topics": ["string"],
  "summary": "string (one sentence)"
}
Return ONLY the JSON object. No explanation, no markdown formatting.""",
    messages=[
        {"role": "user", "content": "The new API is significantly faster but the documentation is still incomplete."}
    ],
)

result = json.loads(message.content[0].text)
Output validation pipeline
Never trust model output without validation. It'll be correct 95% of the time, and that remaining 5% will break your downstream pipeline at 2am on a Saturday. Validate before passing data anywhere:
from pydantic import BaseModel, ValidationError, field_validator
from typing import Literal

class SentimentAnalysis(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float
    key_topics: list[str]
    summary: str

    @field_validator("confidence")
    @classmethod
    def confidence_in_range(cls, v: float) -> float:
        if not 0.0 <= v <= 1.0:
            raise ValueError("Confidence must be between 0 and 1")
        return v

    @field_validator("key_topics")
    @classmethod
    def topics_not_empty(cls, v: list[str]) -> list[str]:
        if len(v) == 0:
            raise ValueError("At least one topic required")
        return v

# Validate the model output
try:
    analysis = SentimentAnalysis.model_validate_json(message.content[0].text)
except ValidationError as e:
    # Handle validation failure — retry, fall back, or escalate
    print(f"Validation failed: {e}")
Generate, validate, handle failure. That's the loop, and it's what the exam tests in Task 4.6 — Output Validation. The principle is simple: validate at the boundary between the model and your application code. Don't let unvalidated JSON propagate.
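That loop can be sketched generically. Here `generate` stands in for whatever callable hits the model (for example, a closure around `client.messages.create`); the function itself and its retry count are illustrative, not an Anthropic API:

```python
from pydantic import BaseModel, ValidationError
from typing import Callable, Type, TypeVar

T = TypeVar("T", bound=BaseModel)

def generate_validated(generate: Callable[[], str],
                       schema: Type[T],
                       max_retries: int = 2) -> T:
    """Generate, validate against `schema`, and retry on malformed output.

    `generate` is any zero-argument callable returning the model's raw
    text; `schema` is a Pydantic model such as SentimentAnalysis above.
    """
    last_error: ValidationError | None = None
    for _ in range(max_retries + 1):
        raw = generate()
        try:
            return schema.model_validate_json(raw)
        except ValidationError as e:
            last_error = e  # malformed output: try again
    raise RuntimeError(f"Validation failed after retries: {last_error}")
```

Keeping the model call behind a callable means the same retry logic works for any schema and any prompt in your system.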
Prompt chaining for complex tasks
When one prompt can't reliably do the whole job, break it into steps where each step's output feeds the next:
- Extract — Pull structured data from unstructured input
- Transform — Apply business logic or classification
- Validate — Check the output against constraints
- Format — Produce the final output
Each step has a narrower scope, so you can write tighter criteria and catch failures earlier. A 4-step chain where each step is simple and validated will beat a single monolithic prompt trying to do everything at once. It's not even close.
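One way to sketch that chain, with each model call stubbed out as a plain function (the step bodies here are placeholders standing in for narrow, validated prompts, not a real pipeline):

```python
def extract(raw: str) -> dict:
    # Step 1 (stub): in production, a narrow prompt pulls structured
    # fields out of the raw input.
    return {"text": raw.strip()}

def transform(record: dict) -> dict:
    # Step 2 (stub): apply business logic or classification.
    record["label"] = "long" if len(record["text"]) > 20 else "short"
    return record

def validate(record: dict) -> dict:
    # Step 3: fail fast, before the final step ever runs.
    if not record["text"]:
        raise ValueError("empty extraction")
    return record

def format_output(record: dict) -> str:
    # Step 4: produce the final output.
    return f"[{record['label']}] {record['text']}"

def run_chain(raw: str) -> str:
    # Each step's output feeds the next; a failure stops the chain early.
    return format_output(validate(transform(extract(raw))))
```

Because each stage is an ordinary function boundary, you can test, log, and retry the stages independently instead of debugging one opaque mega-prompt.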
The exam tests this in Task 4.2 — Structured Output.
The false positive problem
Here's a scenario the exam loves. A CI/CD pipeline flags too many false positives. Developers start ignoring all findings, not just the bad ones. Trust erodes across every category.
The tempting answer is "add a confidence threshold." Don't pick it. Self-reported confidence scores are poorly calibrated. Here's what actually works:
- Identify which categories have high false positive rates
- Temporarily disable those categories to restore trust in the ones that work
- Rewrite prompts for the disabled categories with explicit criteria and code examples
- Re-enable one at a time, measuring false positive rates as you go
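The measurement in the last step is straightforward to automate. A minimal sketch, assuming you have human-labelled findings as (category, was_real_issue) pairs; the 25% threshold is an arbitrary illustration, not a recommendation:

```python
from collections import defaultdict

def false_positive_rates(labelled: list[tuple[str, bool]]) -> dict[str, float]:
    """Per-category false positive rate from human-labelled findings,
    given as (category, was_real_issue) pairs."""
    total: dict[str, int] = defaultdict(int)
    false_pos: dict[str, int] = defaultdict(int)
    for category, was_real in labelled:
        total[category] += 1
        if not was_real:
            false_pos[category] += 1
    return {c: false_pos[c] / total[c] for c in total}

def categories_to_disable(rates: dict[str, float],
                          threshold: float = 0.25) -> set[str]:
    # Disable any category whose FP rate exceeds the (illustrative) threshold.
    return {c for c, r in rates.items() if r > threshold}
```

Run it on a sample of triaged findings before and after each prompt rewrite, and you have the evidence the re-enable step depends on.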
That's the systematic approach the exam expects. Vague instructions, confidence filtering, temperature adjustments? All distractors.
Key takeaways for the exam
- Explicit categorical criteria beat vague instructions. Always.
- Few-shot fixes formatting, not routing. If tools are misrouted, fix the tool descriptions first.
- Pydantic (or similar schema validation) catches malformed responses before they hit your users. Use it.
- When an exam question asks for the best fix, look for the low-effort, high-leverage option. It's almost always the answer.
- Severity calibration with code examples outperforms prose descriptions because the model can pattern-match against real code.
- Don't trust self-reported confidence scores. Validate output at the boundary.
Start with the Domain 4 lessons for the full curriculum, or go straight to the practice questions to test yourself.