Use exactly three required readings: one OpenAI source for eval craft, one Anthropic source for agent-specific evaluation, and one workflow source for datasets, evaluators, regressions, and feedback loops.
Required
OpenAI's guide to building useful evals covers task-specific datasets, clear grader criteria, baseline comparisons, model-graded evaluation, human review, and iterative evaluation design.
Read for: how to turn a vague product-quality goal into a concrete eval set with graders and thresholds.
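As a rough illustration of that step, here is a minimal sketch of an eval set with a grader and a pass threshold. Everything in it is hypothetical and chosen only for illustration: `EvalCase`, `exact_match_grader`, `run_eval`, the stand-in `toy_model`, and the 0.6 threshold are assumptions, not part of any OpenAI API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EvalCase:
    """One task-specific example: an input and the answer we expect."""
    prompt: str
    expected: str


def exact_match_grader(output: str, expected: str) -> bool:
    """A deliberately simple grader: case-insensitive exact match."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(
    model: Callable[[str], str],
    cases: List[EvalCase],
    grader: Callable[[str, str], bool],
    threshold: float,
) -> Tuple[float, bool]:
    """Return (pass_rate, meets_threshold) for a model over an eval set."""
    passed = sum(grader(model(case.prompt), case.expected) for case in cases)
    pass_rate = passed / len(cases)
    return pass_rate, pass_rate >= threshold


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    answers = {"2 + 2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")


CASES = [
    EvalCase("2 + 2", "4"),
    EvalCase("capital of France", "paris"),
    EvalCase("largest planet", "Jupiter"),
]

pass_rate, ok = run_eval(toy_model, CASES, exact_match_grader, threshold=0.6)
print(f"pass rate: {pass_rate:.2f}, meets threshold: {ok}")
```

In practice the grader is the part that changes most: exact match suits closed-form answers, while open-ended outputs usually need a rubric applied by a model grader or a human reviewer.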
Required
A practical Anthropic post on evaluating agents with realistic tasks, nondeterministic runs, partial credit, transcript review, model graders, human calibration, and outcome-focused scoring.
Read for: how to evaluate an agent that may solve the same query task through different valid tool paths.
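One way to picture outcome-focused scoring is to grade the final result of each run and ignore which tools were called along the way. The sketch below assumes a query agent whose output is a list of rows. The run records, tool names, and `score_outcome` helper are hypothetical examples and are not taken from the post.

```python
from statistics import mean
from typing import List, Sequence, Tuple

Row = Tuple[str, int]


def score_outcome(result_rows: Sequence[Row], expected_rows: Sequence[Row]) -> float:
    """Partial credit: fraction of expected rows found, ignoring row order.

    Extra rows are not penalized here; a stricter scorer could do so.
    """
    expected = set(expected_rows)
    if not expected:
        return 1.0 if not result_rows else 0.0
    return len(expected & set(result_rows)) / len(expected)


# Three hypothetical runs of the same task. The tool paths differ, and the
# scorer never looks at them; only the returned rows matter.
RUNS = [
    {"tools": ["list_tables", "run_query"], "rows": [("US", 10), ("DE", 4)]},
    {"tools": ["get_schema", "dry_run", "run_query"], "rows": [("DE", 4), ("US", 10)]},
    {"tools": ["run_query"], "rows": [("US", 10)]},
]
EXPECTED: List[Row] = [("US", 10), ("DE", 4)]

scores = [score_outcome(run["rows"], EXPECTED) for run in RUNS]
mean_score = mean(scores)
full_pass_rate = sum(s == 1.0 for s in scores) / len(scores)

print("per-run scores:", scores)
print(f"mean score: {mean_score:.2f}, fully correct runs: {full_pass_rate:.2f}")
```

Because agent runs are nondeterministic, the sketch reports both a mean score and the share of fully correct runs. The two numbers can diverge, and transcript review is what explains why a given run lost credit.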
Required
LangSmith's evaluation docs cover datasets, experiments, online and offline evaluation, code and LLM-as-judge evaluators, human review, and comparison workflows.
Read for: how regression tests, production traces, feedback, and release comparisons fit into one AI development loop.
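The release-comparison idea can be sketched without any particular tool: score the same dataset under a baseline and a candidate, then flag examples whose score dropped. This is plain Python, not the LangSmith API. The `find_regressions` helper, the example IDs, and the scores are all made up for illustration.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.0,
) -> List[str]:
    """Return IDs of examples where the candidate scores below the baseline.

    An example missing from the candidate results counts as a score of 0.0.
    """
    return sorted(
        example_id
        for example_id, base_score in baseline.items()
        if candidate.get(example_id, 0.0) < base_score - tolerance
    )


# Hypothetical per-example scores from two experiments on the same dataset.
BASELINE = {"q1": 1.0, "q2": 0.5, "q3": 1.0}
CANDIDATE = {"q1": 1.0, "q2": 1.0, "q3": 0.0}

regressions = find_regressions(BASELINE, CANDIDATE)
print("regressed examples:", regressions)
```

Here the candidate improves on `q2` but breaks `q3`, so an average alone would hide the regression. Failing examples found this way, along with failures seen in production traces, are natural additions to the offline dataset.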
Optional Refresher
A short Google Cloud reference for GenAI metric design, useful if you want to map the lesson into Vertex AI or Gemini Agent Platform vocabulary.
Skim for: metric-selection language when translating BigQuery-agent evals into Google Cloud tooling.