What You Need to Know
Human review is the safety net for automated extraction and classification systems. The exam tests your understanding of when and how to deploy human reviewers effectively. The core challenge is not whether to use human review, but how to allocate limited reviewer capacity to maximise accuracy while minimising cost. This requires understanding confidence calibration, the trap of aggregate metrics, and stratified sampling strategies.
The Aggregate Metrics Trap
This is the most dangerous misconception in production extraction systems. A system reports 97% overall accuracy. The team celebrates. Management approves full automation for all high-confidence extractions.
The problem: that 97% hides catastrophic failure rates on specific document types. The system extracts dates from standard invoices at 99.5% accuracy. But handwritten receipts? 60%. Scanned PDFs with poor OCR? 72%. International documents with non-standard formatting? 45%.
The aggregate masks the segments where the system fails most. And those segments are often the ones where errors have the highest business impact — handwritten receipts from field staff, international invoices from new suppliers, scanned historical documents for compliance audits.
The rule: always validate accuracy by document type AND field segment before automating. Never make automation decisions based on aggregate metrics alone.
| Document Type | Date Accuracy | Amount Accuracy | Name Accuracy |
|---|---|---|---|
| Standard invoices | 99.5% | 98.2% | 97.8% |
| Handwritten receipts | 60.1% | 55.3% | 71.2% |
| Scanned PDFs | 72.4% | 69.8% | 80.1% |
| International formats | 45.2% | 52.1% | 63.4% |
| Aggregate | 97.0% | 96.1% | 95.8% |
The aggregate looks excellent because standard invoices dominate the volume. But three document types have unacceptable accuracy, hidden by the volume-weighted average.
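The arithmetic behind that masking is easy to reproduce. In the sketch below, the per-type date accuracies come from the table above, while the volume shares are illustrative assumptions (the source table does not give them):

```python
# Volume-weighted aggregate over per-type date accuracy. The accuracy
# figures come from the table above; the volume shares are assumptions.
per_type = {
    "standard_invoices":    (0.995, 0.94),  # (date accuracy, volume share)
    "handwritten_receipts": (0.601, 0.02),
    "scanned_pdfs":         (0.724, 0.02),
    "international":        (0.452, 0.02),
}

aggregate = sum(acc * share for acc, share in per_type.values())
worst_type, (worst_acc, _) = min(per_type.items(), key=lambda kv: kv[1][0])
print(f"Aggregate: {aggregate:.1%}, worst segment: {worst_type} at {worst_acc:.1%}")
```

With standard invoices at 94% of volume, the weighted average lands near 97% even though one segment sits below 50%.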
Stratified Random Sampling
Even after validating by document type and field, you need ongoing verification. Stratified random sampling means selecting a representative sample from each stratum (document type, confidence band, field type) and having humans verify it.
The critical insight is that you must sample high-confidence extractions, not just low-confidence ones. Low-confidence items are already routed to human review. High-confidence items are automated. If the model develops a novel error pattern that affects high-confidence extractions, only stratified sampling will catch it.
Stratified sampling serves two purposes:
- Ongoing accuracy measurement — confirm that each segment maintains its validated accuracy rate.
- Novel error pattern detection — discover new failure modes that did not exist in the original validation set.
Without stratified sampling, you are flying blind on your automated extractions. The system could develop a systematic error on a new document format and you would not know until downstream business processes fail.
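A sketch of such a sampler; the confidence-band cutoffs, sampling rate, and minimum sample per stratum are all assumptions to be tuned for a real system:

```python
import random
from collections import defaultdict

def stratified_sample(extractions, rate=0.02, min_per_stratum=5, seed=42):
    """Select a verification sample from every (doc type, confidence band)
    stratum, including the high-confidence band that is otherwise automated."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in extractions:
        c = item["confidence"]
        band = "high" if c >= 0.9 else "mid" if c >= 0.7 else "low"
        strata[(item["doc_type"], band)].append(item)
    sample = []
    for items in strata.values():
        # Proportional to stratum volume, with a floor so small strata
        # (e.g. rare document types) are never skipped entirely.
        n = max(min_per_stratum, int(len(items) * rate))
        sample.extend(rng.sample(items, min(n, len(items))))
    return sample
```

Because every stratum contributes, the high-confidence automated band is checked on an ongoing basis rather than trusted indefinitely.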
Field-Level Confidence Calibration
The model can output confidence scores per field. For an invoice extraction, it might report:
{
  "vendorName": {"value": "Acme Corp", "confidence": 0.98},
  "invoiceDate": {"value": "2024-03-15", "confidence": 0.95},
  "totalAmount": {"value": "$1,247.83", "confidence": 0.72},
  "lineItems": {"value": [...], "confidence": 0.61}
}
But raw model confidence scores are not calibrated. A model that reports 0.95 confidence might actually be correct 88% of the time on certain field types. Or 99% of the time on others. The confidence score is relative, not absolute.
Calibration requires labelled validation sets (ground truth data). You take a set of documents with known correct extractions, run the model, compare its confidence scores to actual accuracy, and build a calibration curve. This tells you: "When the model reports 0.90 confidence on date fields, it is actually correct 94% of the time. When it reports 0.90 on amount fields, it is actually correct 82% of the time."
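One way to sketch this calibration step. The field names and bin width are assumptions, and each item's `correct` flag is assumed to come from comparison against ground truth:

```python
def build_calibration(validation_items, bins=20):
    """Bucket labelled validation items by reported confidence, then compute
    the observed accuracy for each (field type, confidence bucket)."""
    stats = {}
    for item in validation_items:
        bucket = min(int(item["confidence"] * bins), bins - 1) / bins
        key = (item["field"], bucket)
        hits, total = stats.get(key, (0, 0))
        stats[key] = (hits + int(item["correct"]), total + 1)
    # Calibration curve: (field type, reported-confidence bucket) -> accuracy.
    return {key: hits / total for key, (hits, total) in stats.items()}
```

Looking up `curve[("date", 0.9)]` then answers the question in the text directly: what a reported 0.90 actually means for date fields, as opposed to amount fields.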
Calibrated thresholds then drive routing:
- Fields above the calibrated threshold → automated (with stratified sampling)
- Fields below the calibrated threshold → human review
- Fields in the ambiguous zone → prioritised human review
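A minimal routing sketch under these rules. The threshold values are illustrative assumptions, and `calibrated_accuracy` is assumed to come from a calibration-curve lookup, not from the raw model score:

```python
def route(calibrated_accuracy, auto_threshold=0.95, review_floor=0.85):
    """Route one field by its calibrated accuracy estimate.
    Thresholds are illustrative; tune them per field and document type."""
    if calibrated_accuracy >= auto_threshold:
        return "automate"          # still covered by stratified sampling
    if calibrated_accuracy < review_floor:
        return "human_review"      # clearly below the calibrated threshold
    return "priority_review"       # ambiguous zone near the threshold
```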
Reviewer Capacity Prioritisation
Human reviewers are expensive and limited. The exam tests whether you understand how to allocate their capacity effectively.
Route the highest-uncertainty items to reviewers first. This means:
- Low model confidence fields
- Extractions from ambiguous or contradictory source documents
- Document types with historically poor accuracy
- Fields where the model expresses uncertainty (e.g., multiple possible interpretations)
Do NOT spread reviewer capacity evenly across all extractions. An even distribution wastes time reviewing high-confidence items that the model handles well while leaving insufficient capacity for the uncertain items that actually need human judgement.
The prioritisation should be dynamic, not static. As the system processes documents, the queue of items awaiting human review should be ordered by uncertainty. When a reviewer finishes one item, the next item in their queue should be the highest-uncertainty item remaining, not simply the next in chronological order.
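The dynamic ordering can be sketched with a min-heap keyed on confidence; this is a simplified model, and the item IDs and scores are hypothetical:

```python
import heapq
import itertools

class ReviewQueue:
    """Uncertainty-ordered review queue: the lowest-confidence item is always
    served next, regardless of arrival order. A simplified sketch."""

    def __init__(self):
        self._heap = []
        self._tiebreak = itertools.count()  # stable order for equal confidence

    def add(self, item_id, confidence):
        # Min-heap on confidence, so the most uncertain item surfaces first.
        heapq.heappush(self._heap, (confidence, next(self._tiebreak), item_id))

    def next_for_reviewer(self):
        if not self._heap:
            return None
        _, _, item_id = heapq.heappop(self._heap)
        return item_id
```

Because new arrivals are pushed into the same heap, an item that arrives late but with very low confidence jumps ahead of everything already waiting.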
Validation Before Automation
The sequence matters:
1. Measure accuracy by document type and field segment, not aggregate.
2. Calibrate confidence scores using labelled validation sets.
3. Set calibrated thresholds for automation versus human review.
4. Implement stratified random sampling for ongoing verification of automated extractions.
5. Only then reduce human review on segments that demonstrate consistent, validated accuracy.
Skipping to step 5 based on aggregate metrics is the trap. Every step in this sequence exists to prevent a specific failure mode.
Key Concept
97% aggregate accuracy can hide 40% error rates on specific document types. Validate accuracy by document type AND field segment. Calibrate confidence scores using labelled validation sets. Sample high-confidence extractions through stratified sampling. Prioritise limited reviewer capacity on the highest-uncertainty items.
Exam Traps
Using aggregate accuracy (e.g., 97%) to justify automating all high-confidence extractions
Aggregate metrics hide per-type performance. 97% overall can mean 40% accuracy on specific document types. Validate by document type and field segment before automating.
Only sampling low-confidence extractions for human review
High-confidence extractions are automated. If a novel error pattern affects them, only stratified random sampling of high-confidence items will detect it.
Using raw model confidence scores without calibration
Raw confidence scores are not calibrated. 0.90 confidence on dates might mean 94% actual accuracy, while 0.90 on amounts might mean only 82%. Calibrate using labelled validation sets.
Spreading reviewer capacity evenly across all extractions
Even distribution wastes time on high-confidence items. Prioritise limited reviewer capacity on the highest-uncertainty items where human judgement adds the most value.
Practice Scenario
A structured data extraction system achieves 97% overall accuracy across all document types. The team proposes automating all extractions where model confidence exceeds 95% to reduce human review costs. What is the critical risk in this approach?
Build Exercise
Build a Confidence-Calibrated Review Router
What you'll learn
- Recognise the aggregate metrics trap: 97% overall accuracy can hide 40% error rates on specific document types
- Implement accuracy tracking broken down by document type AND field segment
- Calibrate raw confidence scores using labelled validation sets to produce reliable routing thresholds
- Design stratified random sampling that includes high-confidence extractions for ongoing verification
- Prioritise limited reviewer capacity on the highest-uncertainty items with dynamic queue ordering
Build steps
- Create a mock extraction system that outputs field-level confidence scores for different document types (invoices, receipts, scanned PDFs, international documents)
Why: Field-level confidence scores are the foundation of intelligent review routing. The exam tests whether you know that raw model confidence is not calibrated and must be validated against ground truth before use. Building the mock system gives you data to calibrate against.
You should see: An extraction function that returns each field with its value and a confidence score between 0.0 and 1.0. The system should process at least 4 document types with noticeably different confidence distributions per type.
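A possible starting point for the mock system. The document types, field names, and confidence means are assumptions chosen to echo the accuracy table earlier in the section:

```python
import random

# Hypothetical mean confidence per document type (assumptions, chosen so
# invoices look confident and international formats do not).
PROFILES = {"invoice": 0.96, "receipt": 0.62, "scanned_pdf": 0.73, "international": 0.55}
FIELDS = ("vendorName", "invoiceDate", "totalAmount")

def extract(doc_type, seed=None):
    """Mock extraction: each field gets a placeholder value and a
    type-dependent confidence score clamped to [0.0, 1.0]."""
    rng = random.Random(seed)
    mean = PROFILES[doc_type]
    return {
        field: {
            "value": f"<{field}>",
            "confidence": round(min(1.0, max(0.0, rng.gauss(mean, 0.08))), 2),
        }
        for field in FIELDS
    }
```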
- Implement accuracy tracking broken down by document type and field segment — not just aggregate metrics
Why: The aggregate metrics trap is the most dangerous misconception in production extraction systems. 97% overall accuracy can hide catastrophic failure rates on specific document types because standard invoices dominate the volume. The exam tests whether you know to validate by document type AND field segment before automating.
You should see: An accuracy table showing each document type and field combination separately. Standard invoices should show 95%+ accuracy while handwritten receipts and international documents show 40-70%. The aggregate should look excellent (90%+) despite the poor per-type numbers.
- Build a calibration module that takes a labelled validation set (ground truth) and produces calibrated confidence thresholds per field type per document type
Why: Raw model confidence scores are not calibrated. A model reporting 0.90 confidence might actually be correct 94% of the time on date fields but only 82% on amount fields. Calibration using labelled validation sets is required before confidence scores can drive automated routing decisions.
You should see: A calibration curve for each field type per document type, mapping reported confidence ranges to actual accuracy percentages. The curve should reveal that the same confidence score means different things for different field-document combinations.
- Implement stratified random sampling that selects high-confidence extractions for ongoing verification, sampling proportionally across all document types
Why: High-confidence extractions are automated and not reviewed. If the model develops a novel error pattern affecting high-confidence items, only stratified sampling will catch it. Sampling only low-confidence items leaves you blind to systematic errors in automated extractions.
You should see: A sampling function that selects a representative subset from each stratum (document type and confidence band), including samples from the high-confidence automated extractions. The sample should be proportional to the volume in each stratum.
- Build a review router that prioritises limited reviewer capacity on the highest-uncertainty items, dynamically reordering the review queue as new extractions arrive
Why: Human reviewers are expensive and limited. Spreading capacity evenly across all extractions wastes time on high-confidence items while leaving insufficient capacity for uncertain items that need human judgement. Dynamic priority ordering ensures the most uncertain items are always reviewed first.
You should see: A priority queue that orders items by uncertainty (lowest confidence first), dynamically reorders as new extractions arrive, and serves the next-highest-uncertainty item to each available reviewer. The queue should never serve items in chronological order.