AI Learning Ramp | Course 10

System-Design Frame

Assume your BigQuery copilot already retrieves table context, drafts SQL, runs dry runs, asks for approval, executes jobs, and summarizes results. The interview question is how you ship prompt, model, retrieval, and tool changes without relying on vibes. The answer has five parts: define a golden set from real analyst work, grade outcomes with deterministic checks and calibrated LLM judges, route ambiguous cases to humans, slice regressions by task type and risk, and gate releases with clear thresholds.

Course 10: Eval-Driven Development

One-hour objective: design an evaluation and release-gating loop for a production AI query engine, then explain how it catches regressions before they reach analysts.

0-5 min

Define the release risk.

Write three changes that could break a BigQuery copilot: model swap, prompt rewrite, and retrieval policy update. For each, name the failure you need an eval to catch.

5-18 min

Read OpenAI's eval best practices.

Focus on task-specific datasets, evaluator design, ground truth, model-graded evals, human calibration, and the habit of making evals part of the development loop.
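One way to make "human calibration" concrete is to measure chance-corrected agreement between judge labels and human labels on the same cases. The sketch below computes Cohen's kappa in plain Python. The labels, the `MIN_KAPPA` threshold of 0.6, and all names are illustrative assumptions, not values taken from the reading.

```python
from collections import Counter


def cohen_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between two label sequences."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Fraction of cases where the judge and the human gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    human_freq = Counter(human)
    judge_freq = Counter(judge)
    expected = sum(human_freq[c] * judge_freq[c] for c in human_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical calibration sample: the same eight cases, labeled twice.
human_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge_labels = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass"]

MIN_KAPPA = 0.6  # assumed bar for trusting the judge in a release gate
kappa = cohen_kappa(human_labels, judge_labels)
judge_trusted = kappa >= MIN_KAPPA
print(f"kappa={kappa:.3f} trusted={judge_trusted}")
```

Raw agreement here is 6 of 8, but kappa is lower because both raters say "pass" most of the time. That gap is the reason to prefer a chance-corrected measure when the label distribution is skewed.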

18-32 min

Study Anthropic's agent-evals framing.

Look for how to evaluate agent outcomes under nondeterminism, include realistic tasks, inspect transcripts, use partial credit, and avoid brittle path-based grading.
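A minimal sketch of outcome-focused grading with partial credit: the grader looks only at the final result rows and the final answer, so two runs that reach the same result through different tool paths score the same. The 0.7/0.3 weights, the function name, and the substring-based fact check are assumptions for illustration. A real grader would likely replace the substring check with something less brittle.

```python
def grade_outcome(expected_rows, actual_rows, answer_text, required_facts):
    """Return a score in [0, 1] from final artifacts only; tool-call order is ignored."""
    expected = set(map(tuple, expected_rows))
    actual = set(map(tuple, actual_rows))
    # Jaccard overlap of result rows, insensitive to row order.
    union = expected | actual
    row_score = len(expected & actual) / len(union) if union else 1.0
    # Crude fact check: does the answer text mention each required value?
    text = answer_text.lower()
    found = sum(fact.lower() in text for fact in required_facts)
    fact_score = found / len(required_facts) if required_facts else 1.0
    # Assumed weighting: result correctness matters more than wording.
    return 0.7 * row_score + 0.3 * fact_score


expected = [("US", 120), ("DE", 80)]

# Same rows in a different order, both figures reported: full credit.
full = grade_outcome(
    expected, [("DE", 80), ("US", 120)],
    "US led with 120 orders; DE had 80.", ["120", "80"],
)

# One row and one figure missing: partial credit instead of a hard fail.
partial = grade_outcome(
    expected, [("US", 120)],
    "US led with 120 orders.", ["120", "80"],
)
print(full, partial)
```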

32-42 min

Map LangSmith concepts to release gates.

Translate datasets, experiments, online feedback, code evaluators, LLM-as-judge evaluators, and human review queues into a regression workflow for query agents.

42-50 min

Sketch the eval stack.

Draw the golden-set store, trace capture, replay runner, SQL validators, BigQuery dry-run checks, LLM judge service, human-review queue, metrics board, and release decision record.

50-56 min

Choose slices and gates.

Set minimum bars for schema grounding, SQL execution safety, answer faithfulness, sensitive-data access, cost controls, high-value customer tasks, and previously fixed incidents.
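The minimum bars can be written down as data so that the release decision is reproducible. The sketch below assumes each slice has a pass rate for the baseline and for the candidate, and that a gate combines an absolute floor with a maximum allowed drop from baseline. The slice names follow the list above; every number is a made-up example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    slice_name: str
    min_pass_rate: float   # absolute floor the candidate must reach
    max_regression: float  # largest allowed drop versus the baseline


def evaluate_gates(gates, baseline, candidate):
    """Return one human-readable failure per violated gate (empty list means ship)."""
    failures = []
    for gate in gates:
        base = baseline[gate.slice_name]
        cand = candidate[gate.slice_name]
        if cand < gate.min_pass_rate:
            failures.append(
                f"{gate.slice_name}: {cand:.2f} is below floor {gate.min_pass_rate:.2f}"
            )
        elif base - cand > gate.max_regression:
            failures.append(
                f"{gate.slice_name}: dropped {base - cand:.2f} from baseline {base:.2f}"
            )
    return failures


# Hypothetical gates: safety-critical slices get a floor of 1.0 and no allowed drop.
gates = [
    Gate("schema_grounding", 0.95, 0.01),
    Gate("sensitive_data_access", 1.00, 0.00),
    Gate("cost_controls", 0.98, 0.01),
    Gate("fixed_incidents", 1.00, 0.00),
]
baseline = {"schema_grounding": 0.97, "sensitive_data_access": 1.0,
            "cost_controls": 0.99, "fixed_incidents": 1.0}
candidate = {"schema_grounding": 0.98, "sensitive_data_access": 1.0,
             "cost_controls": 0.97, "fixed_incidents": 1.0}

failures = evaluate_gates(gates, baseline, candidate)
ship = not failures
print(ship, failures)
```

In this example the candidate improves schema grounding but still fails the release because the cost-controls slice falls below its floor. An aggregate score would have hidden that.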

56-60 min

Deliver the interview synthesis.

Explain that eval-driven development makes AI releases operational: every model or prompt change must beat a known baseline on the tasks and risks the product actually owns.

Course 10 Reading List

Use exactly three required readings: one OpenAI source for eval craft, one Anthropic source for agent-specific evaluation, and one workflow source for datasets, evaluators, regressions, and feedback loops.

Required

OpenAI: Evaluation Best Practices

OpenAI's guide to building useful evals: task-specific datasets, clear grader criteria, baseline comparisons, model-graded evaluation, human review, and iterative evaluation design.

Read for: how to turn a vague product-quality goal into a concrete eval set with graders and thresholds.

Required

Anthropic Engineering: Demystifying Evals for AI Agents

A practical post on evaluating agents with realistic tasks, nondeterministic runs, partial credit, transcript review, model graders, human calibration, and outcome-focused scoring.

Read for: how to evaluate an agent that may solve the same query task through different valid tool paths.

Required

LangSmith: Evaluation

LangSmith's evaluation docs cover datasets, experiments, online and offline evaluation, code and LLM-as-judge evaluators, human review, and comparison workflows.

Read for: how regression tests, production traces, feedback, and release comparisons fit into one AI development loop.

Optional Refresher

Google Cloud: Define Your Evaluation Metrics

A short Google Cloud reference for GenAI metric design, useful if you want to map the lesson into Vertex AI or Gemini Agent Platform vocabulary.

Skim for: metric-selection language when translating BigQuery-agent evals into Google Cloud tooling.

Readiness Checklist

You are ready for the interview version of this topic when you can defend the eval loop as a release system, not just a benchmark.

You can define a golden case for a BigQuery copilot with user intent, table context, allowed data scope, SQL invariants, expected evidence, and final-answer criteria.
You can choose deterministic graders for SQL parseability, dry-run success, referenced tables, row limits, cost ceilings, permissions, and required citations.
You can explain when to use an LLM judge, how to write a rubric, and how to calibrate judge scores against human labels before trusting them in a release gate.
You can slice eval results by query type, customer tier, table domain, sensitive-data risk, retrieval dependence, long-context stress, and previously fixed incidents.
You can design a regression workflow that compares candidate prompts, models, retrievers, and tool policies against a locked baseline before rollout.
You can route low-confidence, high-impact, or judge-disagreement cases into human review and feed adjudicated examples back into the golden set.
You can describe production feedback capture: trace sampling, user corrections, incident postmortems, approval blocks that turned out to be false positives, and nightly replay on fresh failures.
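The first two checklist items can be sketched as a golden-case record plus a handful of deterministic checks. The field names, the regex-based table extraction, and the example tables are all hypothetical, and a regex is only a stand-in for a real SQL parser. The byte estimate is passed in rather than fetched. With the google-cloud-bigquery client it would typically come from a job run with `QueryJobConfig(dry_run=True)` and read from `total_bytes_processed`.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    user_intent: str
    allowed_tables: frozenset[str]        # allowed data scope
    max_bytes_processed: int              # cost ceiling for the dry run
    require_limit: bool = True            # SQL invariant
    answer_criteria: tuple[str, ...] = () # final-answer criteria for a judge


# Naive extraction of backtick-quoted tables after FROM or JOIN (assumption).
TABLE_REF = re.compile(r"\b(?:from|join)\s+`([^`]+)`", re.IGNORECASE)
LIMIT_CLAUSE = re.compile(r"\blimit\s+\d+\b", re.IGNORECASE)


def deterministic_checks(case: GoldenCase, sql: str, dry_run_bytes: int) -> dict[str, bool]:
    """Run cheap, repeatable checks that need no model in the loop."""
    referenced = set(TABLE_REF.findall(sql))
    return {
        "references_a_table": bool(referenced),
        "tables_in_scope": referenced <= case.allowed_tables,
        "has_row_limit": (not case.require_limit) or bool(LIMIT_CLAUSE.search(sql)),
        "within_cost_ceiling": dry_run_bytes <= case.max_bytes_processed,
    }


case = GoldenCase(
    case_id="orders-by-region-001",
    user_intent="Count orders per region for last quarter",
    allowed_tables=frozenset({"proj.sales.orders"}),
    max_bytes_processed=10**9,
    answer_criteria=("names the top region", "states the time window"),
)

# Candidate SQL that joins a table outside the allowed scope.
sql = (
    "SELECT region, COUNT(*) AS n FROM `proj.sales.orders` "
    "JOIN `proj.hr.salaries` USING (user_id) GROUP BY region LIMIT 100"
)
results = deterministic_checks(case, sql, dry_run_bytes=5 * 10**8)
print(results)
```

Keeping each check as a named boolean makes it easy to report which invariant failed and to treat some failures, such as an out-of-scope table, as release blockers on their own.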

Interview Drill: AI Infra System Design

Prompt: design the eval and release-gating platform for a BigQuery GenAI query engine ahead of an upgrade from the current production model to a stronger frontier model.

Start with requirements: catch SQL regressions, hallucinated schema references, unsafe access, cost explosions, weak summaries, latency regressions, and failures on strategic customer workloads.
Define the data model: golden cases, synthetic edge cases, production-derived cases, task metadata, expected artifacts, grader configs, human labels, baselines, and release decisions.
Design the execution path: replay traces through candidate model and prompt versions, run retrieval and tool calls in a sandbox, dry-run SQL in BigQuery, and store all artifacts for audit.
Pick graders: deterministic validators for syntax and policy, BigQuery dry-run checks for feasibility and cost, LLM judges for faithfulness and explanation quality, and human adjudication for disputed cases.
Set gates: no regression on critical slices, bounded aggregate quality deltas, zero high-severity policy failures, latency and cost budgets, and manual approval for waivers.
Handle nondeterminism: run high-risk cases multiple times, track variance, measure pass rates by slice, and require stability for approval-sensitive or expensive-query paths.
Close with operations: dashboard comparisons, nightly replay, canary sampling, alerting on production drift, incident-to-eval backfills, and a documented owner for each release decision.
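The nondeterminism step can be sketched as a stability check over repeated trials. The assumption here is that high-risk cases must pass every trial, while other cases only need to meet a minimum pass rate. The case names, trial outcomes, and the 0.8 bar are invented for illustration.

```python
def pass_rate(trials: list[bool]) -> float:
    """Fraction of trials that passed."""
    if not trials:
        raise ValueError("need at least one trial")
    return sum(trials) / len(trials)


def is_stable(trials: list[bool], min_rate: float, high_risk: bool) -> bool:
    """High-risk cases must pass every trial; others need pass_rate >= min_rate."""
    if high_risk:
        return pass_rate(trials) == 1.0
    return pass_rate(trials) >= min_rate


# Hypothetical replay results: five trials per case against the candidate model.
runs = {
    "revenue_by_region": {"high_risk": False,
                          "trials": [True, True, False, True, True]},
    "pii_export_request": {"high_risk": True,
                           "trials": [True, True, True, False, True]},
}

MIN_RATE = 0.8  # assumed bar for ordinary cases
report = {
    name: {
        "pass_rate": pass_rate(run["trials"]),
        "stable": is_stable(run["trials"], MIN_RATE, run["high_risk"]),
    }
    for name, run in runs.items()
}
print(report)
```

Both cases pass four of five trials, but only the ordinary case counts as stable. A single failure on an approval-sensitive path is treated as a blocker rather than averaged away.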

Eval-driven development for AI query systems.
