Prompt Engineering & Structured Output

6 build exercises to practise the concepts in this domain.

4.1 Build an Explicit Criteria Code Review Prompt

Intermediate
45 minutes
  • Understand why vague instructions ("be conservative", "high-confidence only") fail in production prompts
  • Design explicit categorical criteria that define what to flag and what to skip
  • Calibrate severity levels using concrete code examples rather than prose descriptions
  • Measure false positive rates and apply the trust recovery strategy of disabling problematic categories
  • Recognise the hierarchy: explicit criteria first, confidence-based routing second
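The explicit-criteria and trust-recovery ideas above can be sketched in a few lines. The category names, rules, and the 0.3 false-positive threshold below are illustrative, not part of the exercise:

```python
# Sketch of an explicit-criteria review prompt with trust recovery.
# Category names, rules, and the threshold are illustrative assumptions.

REVIEW_CATEGORIES = {
    "sql_injection": "Flag string-concatenated SQL queries; skip parameterised queries.",
    "hardcoded_secrets": "Flag literal API keys or passwords; skip references to env vars.",
    "unbounded_loops": "Flag loops whose exit condition never changes; skip iterator-driven loops.",
}

def build_review_prompt(categories: dict[str, str]) -> str:
    """Assemble a prompt from explicit what-to-flag / what-to-skip criteria."""
    rules = "\n".join(f"- {name}: {rule}" for name, rule in categories.items())
    return (
        "Review the code below. Flag ONLY issues matching these categories:\n"
        f"{rules}\n"
        "Do not report anything outside these categories."
    )

def disable_noisy_categories(fp_rates: dict[str, float], threshold: float = 0.3) -> dict[str, str]:
    """Trust recovery: drop categories whose measured false positive rate exceeds the threshold."""
    return {k: v for k, v in REVIEW_CATEGORIES.items() if fp_rates.get(k, 0.0) <= threshold}
```

Explicit categories come first; only once these are measured against real reviews does confidence-based routing become worth adding.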

4.2 Build a Structured Extraction Tool with JSON Schema

Intermediate
45 minutes
  • Design JSON schemas with optional/nullable fields to prevent fabrication of missing data
  • Understand the three tool_choice modes (auto, any, forced) and when to use each
  • Recognise that tool_use eliminates syntax errors but not semantic errors
  • Apply schema design patterns: an explicit "unclear" enum value for ambiguous cases, an "other" option paired with a detail string, and format normalisation
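A minimal sketch of these patterns in the Anthropic tools format. The field names (invoice_number, payment_method, and so on) are illustrative, not a required schema:

```python
# Sketch of an extraction tool definition; field names are illustrative.

extract_invoice = {
    "name": "extract_invoice",
    "description": "Extract structured fields from an invoice. Use null when a field is absent.",
    "input_schema": {
        "type": "object",
        "properties": {
            # Nullable field: absent data stays null rather than being fabricated
            "invoice_number": {"type": ["string", "null"]},
            "payment_method": {
                "type": "string",
                # "unclear" is an honest escape hatch; "other" pairs with a detail string
                "enum": ["card", "bank_transfer", "cash", "other", "unclear"],
            },
            "payment_method_detail": {"type": ["string", "null"]},
            # Format normalisation expressed in the field description
            "issue_date": {"type": ["string", "null"], "description": "Normalise to YYYY-MM-DD"},
        },
        "required": ["invoice_number", "payment_method", "payment_method_detail", "issue_date"],
    },
}

# The three tool_choice modes:
tool_choice_auto = {"type": "auto"}                              # model decides whether to call a tool
tool_choice_any = {"type": "any"}                                # model must call some tool
tool_choice_forced = {"type": "tool", "name": "extract_invoice"} # model must call this specific tool
```

Forcing the tool guarantees syntactically valid JSON, but a model can still fill the schema with wrong values, which is the semantic-error boundary the exercise explores.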

4.3 Build a Validation-Retry Loop for Document Extraction

Advanced
60 minutes
  • Implement the retry-with-error-feedback pattern: original document + failed extraction + specific validation error
  • Distinguish fixable errors (format, structural, mathematical) from unfixable errors (absent information)
  • Design self-correction schemas with calculated_total vs stated_total and conflict_detected booleans
  • Build systematic improvement loops using detected_pattern fields and dismissal tracking
  • Understand the boundary between schema syntax errors (eliminated by tool_use) and semantic validation errors (require retry loops)

4.4 Build a Few-Shot Enhanced Extraction Prompt

Intermediate
45 minutes
  • Identify the three triggers for deploying few-shot examples: inconsistent formatting, ambiguous judgement calls, and empty fields for existing data
  • Construct effective few-shot examples with reasoning, not just input-output pairs
  • Use 2-4 targeted examples covering the specific failing scenarios
  • Distinguish when few-shot examples are the right technique versus schema changes or validation loops
  • Measure the impact of few-shot examples on empty field rates and format consistency
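A sketch of assembling a few-shot prompt where each example carries reasoning rather than a bare input-output pair. The two examples below are illustrative:

```python
# Sketch of a few-shot prompt builder; the examples are illustrative.

FEW_SHOT_EXAMPLES = [
    {
        "input": "Invoice total: twelve hundred dollars",
        "reasoning": "The amount is written in words; normalise it to a number.",
        "output": '{"total": 1200}',
    },
    {
        "input": "Total: see attached schedule",
        "reasoning": "The value is referenced but not present; emit null rather than guessing.",
        "output": '{"total": null}',
    },
]

def build_few_shot_prompt(task: str, examples: list[dict], document: str) -> str:
    """Interleave task, worked examples with reasoning, and the new document."""
    parts = [task]
    for ex in examples:
        parts.append(f"Input: {ex['input']}\nReasoning: {ex['reasoning']}\nOutput: {ex['output']}")
    parts.append(f"Input: {document}\nOutput:")
    return "\n\n".join(parts)
```

Note that both examples target specific failure modes (format inconsistency, fields left empty for present data), matching the 2-4 targeted-examples guideline above.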

4.5 Design a Batch Processing Strategy

Intermediate
45 minutes
  • Classify workflows by latency requirement: blocking (synchronous) versus latency-tolerant (batch-eligible)
  • Use the Message Batches API with custom_id fields for request-response correlation
  • Implement failure handling that resubmits only failed documents with targeted modifications
  • Calculate batch submission frequency against SLA constraints accounting for the 24-hour processing window
  • Apply the prompt refinement workflow: sample set testing before full batch submission
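The custom_id correlation and failure-only resubmission steps can be sketched without calling the API. The request shape follows the Message Batches API (custom_id plus params); the model name and document IDs are illustrative:

```python
# Sketch of batch request construction and failure-only resubmission.
# Result handling is stubbed; model name and IDs are illustrative.

def build_batch_requests(documents: dict[str, str], prompt_template: str) -> list[dict]:
    """One batch request per document, keyed by custom_id for later correlation."""
    return [
        {
            "custom_id": doc_id,  # correlates each response back to its source document
            "params": {
                "model": "claude-sonnet-4-5",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": prompt_template.format(doc=text)}],
            },
        }
        for doc_id, text in documents.items()
    ]

def failed_ids(results: list[dict]) -> list[str]:
    """Collect custom_ids whose result did not succeed, for targeted resubmission."""
    return [r["custom_id"] for r in results if r["result"]["type"] != "succeeded"]
```

Resubmitting only `failed_ids` keeps retries cheap, and the same helper supports the refinement workflow: run a small sample batch, inspect failures, adjust the prompt, then submit the full set.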

4.6 Build a Multi-Pass Code Review System

Advanced
60 minutes
  • Understand why self-review in the same session retains reasoning context and is less effective than independent review
  • Design multi-pass review architectures with per-file local analysis and cross-file integration passes
  • Identify and mitigate attention dilution in large multi-file reviews
  • Implement confidence-based routing with calibrated thresholds from labelled validation sets
  • Distinguish uncalibrated raw confidence from calibrated thresholds suitable for automated routing
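Calibrating a routing threshold from a labelled validation set can be sketched as a small search. Each record pairs the model's raw confidence with a human label of whether the finding was real; the 0.9 target precision is an illustrative choice:

```python
# Sketch of threshold calibration from a labelled validation set.
# The target precision is an illustrative assumption.

def calibrate_threshold(validation: list[tuple[float, bool]], target_precision: float = 0.9) -> float:
    """Lowest confidence threshold whose auto-routed findings meet the target precision."""
    candidates = sorted({conf for conf, _ in validation})
    for threshold in candidates:
        routed = [label for conf, label in validation if conf >= threshold]
        if routed and sum(routed) / len(routed) >= target_precision:
            return threshold
    return 1.0  # nothing meets the bar: route everything to human review

def route(confidence: float, threshold: float) -> str:
    """Route a finding using the calibrated threshold, never the raw score alone."""
    return "auto_accept" if confidence >= threshold else "human_review"
```

The point of the calibration step is that a raw confidence of, say, 0.8 means nothing until a labelled set shows what precision that score actually delivers.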