
AI Learning Ramp

Observability for agentic query systems.

Course 11 is a one-hour systems session on instrumenting a BigQuery copilot so incidents become explainable: traces, tool receipts, state diffs, latency metrics, and replayable failure evidence.

Course 11 of 24 · Published June 17, 2026 · Focus: observability · Target: OpenAI / Anthropic interviews

System-Design Frame

Assume your BigQuery copilot now has retrieval, SQL drafting, dry runs, approval gates, execution, summarization, and eval-based release checks. The interview question is how you debug production behavior when a run intermittently loops, chooses the wrong table, violates a latency SLO, or gives an unsupported answer. Design the observability plane: structured traces for every model, tool, retrieval, policy, and state transition; receipts that prove what actually ran; metrics that connect latency, tokens, cost, and quality signals; and replay bundles that let engineers reproduce failures without exposing unnecessary customer data.
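One way to make "receipts that prove what actually ran" concrete is a small immutable record written by the tool wrapper at execution time. The sketch below is illustrative only: the field names, the `make_receipt` helper, and the whitespace-collapsing fingerprint are assumptions, not part of any BigQuery or agent-framework API.

```python
import hashlib
from dataclasses import dataclass


def sql_fingerprint(sql: str) -> str:
    """Hash SQL after collapsing whitespace so reformatting does not change the hash.

    Simplification: this also collapses whitespace inside string literals.
    """
    normalized = " ".join(sql.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ToolReceipt:
    """Hypothetical receipt emitted by the tool wrapper, not by the model."""
    trace_id: str
    tool_name: str
    sql_sha256: str
    bigquery_job_id: str
    dry_run: bool
    bytes_processed: int
    status: str


def make_receipt(trace_id: str, tool_name: str, executed_sql: str, job_id: str,
                 dry_run: bool, bytes_processed: int, status: str) -> ToolReceipt:
    return ToolReceipt(
        trace_id=trace_id,
        tool_name=tool_name,
        sql_sha256=sql_fingerprint(executed_sql),
        bigquery_job_id=job_id,
        dry_run=dry_run,
        bytes_processed=bytes_processed,
        status=status,
    )


def ran_as_drafted(receipt: ToolReceipt, drafted_sql: str) -> bool:
    """True when the executed SQL matches what the model drafted."""
    return receipt.sql_sha256 == sql_fingerprint(drafted_sql)
```

Because the receipt is produced outside the model, a mismatch between the drafted SQL and the receipt hash is direct evidence that something rewrote the query between drafting and execution.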

Course 11: Observability

One-hour objective: design the telemetry and debugging workflow for a production AI query engine, then explain how traces, receipts, metrics, and replay shorten incident response.

Start from one incident.

Write a realistic failure: the copilot answers from the wrong table, gets stuck retrying a tool, or misses a p95 latency SLO. List the evidence you would need within 15 minutes.
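For the "stuck retrying a tool" incident, one piece of evidence worth having within minutes is a count of repeated failing calls with identical arguments. A minimal sketch, assuming each tool span has already been reduced to a hypothetical `ToolCall` tuple and that `args_fingerprint` is a hash of the normalized arguments:

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Tuple


class ToolCall(NamedTuple):
    tool_name: str
    args_fingerprint: str  # assumed: hash of the normalized tool arguments
    status: str            # "ok" or "error"


def find_retry_loops(calls: Iterable[ToolCall],
                     max_repeats: int = 3) -> Dict[Tuple[str, str], int]:
    """Return (tool, fingerprint) pairs that failed more than max_repeats times."""
    failures = Counter(
        (c.tool_name, c.args_fingerprint) for c in calls if c.status == "error"
    )
    return {key: n for key, n in failures.items() if n > max_repeats}
```

The threshold of three repeats is an arbitrary starting point; the useful property is that the check runs on trace data alone, without re-executing anything.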

Read OpenTelemetry's GenAI conventions.

Focus on common names for model requests, token usage, operations, prompts, completions, tools, embeddings, agent steps, errors, and metrics. Decide what your standard span vocabulary should be.
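A span vocabulary can start as a few helper functions that every component calls, so names stay consistent. The sketch below uses attribute and operation names as I understand them from the OpenTelemetry GenAI conventions (`gen_ai.operation.name`, `gen_ai.request.model`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.tool.name`). Those conventions are still evolving, so treat the exact strings as assumptions to verify against the current spec; the helper functions and the model and tool names are hypothetical.

```python
from typing import Dict, Union

# Operation names assumed from the GenAI conventions; check the current spec.
KNOWN_OPERATIONS = {"chat", "embeddings", "execute_tool", "invoke_agent"}

AttributeValue = Union[str, int]


def span_name(operation: str, target: str) -> str:
    """Build a span name of the form '<operation> <model-or-tool>'."""
    if operation not in KNOWN_OPERATIONS:
        raise ValueError(f"unknown operation: {operation!r}")
    return f"{operation} {target}"


def model_call_attributes(model: str, input_tokens: int,
                          output_tokens: int) -> Dict[str, AttributeValue]:
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


def tool_call_attributes(tool_name: str) -> Dict[str, AttributeValue]:
    return {
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": tool_name,
    }
```

Rejecting unknown operation names at the helper boundary is one way to stop ad hoc span names from creeping into traces.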

Study OpenAI agent tracing.

Look for how traces group agent runs and how spans represent agent execution, model generations, tool calls, handoffs, guardrails, and custom application work.
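To build intuition for the trace-and-span structure before reading the SDK documentation, it can help to model it in a few lines of standard-library Python. This is not the Agents SDK API: the `Tracer` and `Span` classes and the span kinds ("agent", "generation", "function") are a hypothetical stand-in for the general idea of one trace grouping nested spans.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Span:
    name: str
    kind: str            # e.g. "agent", "generation", "function"
    trace_id: str
    children: List["Span"] = field(default_factory=list)
    duration_ms: float = 0.0
    error: Optional[str] = None


class Tracer:
    """One Tracer instance represents one trace (one user-visible run)."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.roots: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str, kind: str) -> Iterator[Span]:
        s = Span(name=name, kind=kind, trace_id=self.trace_id)
        # Attach to the currently open span, or record as a root span.
        (self._stack[-1].children if self._stack else self.roots).append(s)
        self._stack.append(s)
        start = time.perf_counter()
        try:
            yield s
        except Exception as exc:
            s.error = repr(exc)  # keep the failure on the span, then re-raise
            raise
        finally:
            s.duration_ms = (time.perf_counter() - start) * 1000
            self._stack.pop()


def outline(span: Span, depth: int = 0) -> List[str]:
    """Render the span tree as indented 'kind:name' lines."""
    lines = [f"{'  ' * depth}{span.kind}:{span.name}"]
    for child in span.children:
        lines.extend(outline(child, depth + 1))
    return lines
```

Wrapping an agent run in `tracer.span("answer_question", "agent")` and each model or tool call in a nested `span(...)` yields a tree that mirrors the nested work behind a single answer.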

Map LangSmith observability to operations.

Translate traces, runs, metadata, feedback, errors, dashboards, and production monitoring into a workflow for debugging query-agent failures after deployment.

Sketch the telemetry schema.

Define trace IDs, tenant and user risk labels, model and prompt hashes, retrieval IDs, BigQuery job IDs, tool receipts, state diffs, latency, tokens, cost, cache hits, and redaction rules.
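The schema in this step can be drafted as a single per-run record plus two helpers, one for state diffs and one for redaction at export time. Everything here is a sketch under assumptions: the field names, the `REDACTED_FIELDS` rule, and the choice to replace redacted text with a hash are illustrative, not a prescribed design.

```python
import hashlib
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List

# Hypothetical redaction rule: these fields never leave the service in clear text.
REDACTED_FIELDS = {"user_question"}


def stable_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_state(before: Dict[str, Any], after: Dict[str, Any]) -> Dict[str, list]:
    """Map each changed key to [old_value, new_value]; missing keys appear as None."""
    keys = before.keys() | after.keys()
    return {k: [before.get(k), after.get(k)]
            for k in sorted(keys) if before.get(k) != after.get(k)}


@dataclass
class RunRecord:
    trace_id: str
    tenant_id: str
    user_risk_label: str
    model: str
    prompt_hash: str
    retrieval_ids: List[str]
    bigquery_job_ids: List[str]
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cost_usd: float
    cache_hit: bool
    user_question: str
    state_diff: Dict[str, list] = field(default_factory=dict)


def export(record: RunRecord) -> Dict[str, Any]:
    """Convert a record to a dict, replacing redacted fields with a hash."""
    row = asdict(record)
    for name in REDACTED_FIELDS:
        row[name] = "sha256:" + stable_hash(str(row[name]))
    return row
```

Hashing the prompt and the redacted question keeps runs comparable (same hash, same input) without storing the raw text in the telemetry backend.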

Choose SLOs and alerts.

Set thresholds for p95 end-to-end latency, time to first token, tool failure rate, loop count, SQL dry-run errors, policy denials, cost spikes, user corrections, and unresolved escalations.
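A few of these thresholds can be checked with a short function over per-run statistics. The sketch below covers p95 latency, tool failure rate, and loop count only; the threshold values are placeholders chosen for illustration, and the percentile uses the nearest-rank method, which is one of several common definitions.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile of empty sequence")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


@dataclass
class RunStats:
    latency_ms: float
    tool_calls: int
    tool_failures: int
    loop_count: int


# Placeholder thresholds; real values depend on the product's SLOs.
THRESHOLDS: Dict[str, float] = {
    "p95_latency_ms": 8000.0,
    "tool_failure_rate": 0.05,
    "max_loop_count": 5,
}


def evaluate_slos(runs: Sequence[RunStats],
                  thresholds: Dict[str, float] = THRESHOLDS) -> List[str]:
    """Return one alert message per violated threshold."""
    if not runs:
        return []
    alerts: List[str] = []
    p95 = percentile([r.latency_ms for r in runs], 95)
    if p95 > thresholds["p95_latency_ms"]:
        alerts.append(f"p95 latency {p95:.0f} ms over limit")
    calls = sum(r.tool_calls for r in runs)
    failures = sum(r.tool_failures for r in runs)
    rate = failures / calls if calls else 0.0
    if rate > thresholds["tool_failure_rate"]:
        alerts.append(f"tool failure rate {rate:.1%} over limit")
    worst_loop = max(r.loop_count for r in runs)
    if worst_loop > thresholds["max_loop_count"]:
        alerts.append(f"loop count {worst_loop} over limit")
    return alerts
```

Note how p95 behaves on small windows: with 20 runs, a single slow run does not move the nearest-rank p95, but two do, which is worth stating when defending alert thresholds.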

Deliver the interview synthesis.

Explain that AI observability is the operational bridge between evals and incidents: it turns nondeterministic agent behavior into inspectable traces, comparable metrics, and safe replay.
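"Safe replay" can be illustrated with a bundle that stores redacted tool inputs and outputs, plus a stub that serves those recorded outputs instead of calling live tools. This is a minimal sketch: the bundle layout, the `ReplayTools` class, and the email-only redaction pattern are assumptions, and a real system would need a much broader redaction policy.

```python
import re
from typing import Any, Dict, List, Optional, Tuple

# Deliberately narrow example pattern; real redaction covers far more than emails.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    return EMAIL.sub("<email>", text)


def build_bundle(trace_id: str, steps: List[Dict[str, str]]) -> Dict[str, Any]:
    """Store each tool step with redacted input and output."""
    return {
        "trace_id": trace_id,
        "steps": [
            {"tool": s["tool"], "input": redact(s["input"]), "output": redact(s["output"])}
            for s in steps
        ],
    }


class ReplayTools:
    """Serves recorded outputs; any call not in the bundle is logged as a divergence."""

    def __init__(self, bundle: Dict[str, Any]) -> None:
        self._recorded = {(s["tool"], s["input"]): s["output"] for s in bundle["steps"]}
        self.divergences: List[Tuple[str, str]] = []

    def call(self, tool: str, tool_input: str) -> Optional[str]:
        key = (tool, redact(tool_input))
        if key not in self._recorded:
            self.divergences.append(key)
            return None
        return self._recorded[key]
```

During replay the agent code runs against `ReplayTools` instead of BigQuery, so a non-empty `divergences` list shows exactly where the rerun departed from the original incident.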

Course 11 Reading List

Use exactly three required readings: one open standard for telemetry, one OpenAI agent tracing source, and one production observability workflow source.

Required

OpenTelemetry: Semantic Conventions for GenAI Systems

The emerging shared vocabulary for GenAI telemetry: operations, model calls, token usage, prompts, completions, tools, embeddings, agents, errors, and metrics.

Read for: how to avoid inventing custom span names when standard attributes can make traces portable across vendors and teams.

Required

OpenAI Agents SDK: Tracing

OpenAI's tracing model for agent runs, including traces, spans, agent spans, generation spans, function spans, handoffs, guardrails, custom spans, and trace processors.

Read for: how an agent framework represents the nested work behind a single user-visible answer.

Required

LangSmith: Observability

LangSmith's observability guide for tracing LLM applications, inspecting runs, adding metadata and feedback, monitoring production behavior, and debugging failures.

Read for: how production traces become dashboards, investigations, feedback loops, and regression candidates.

Optional Refresher

Anthropic Engineering: How We Built Our Multi-Agent Research System

A systems post with useful operational lessons on tracing, prompt iteration, tool design, error modes, and debugging multi-agent behavior at scale.

Skim for: production-agent debugging instincts when many tool calls and subagents make failure attribution hard.

Readiness Checklist

You are ready for the interview version of this topic when you can defend observability as a product reliability system, not just log collection.

Interview Drill: AI Infra System Design

Prompt: design observability for a BigQuery GenAI query engine that uses a planner, retrieval, SQL generation, dry runs, approval gates, execution tools, and final-answer summarization.

Sources

  1. OpenTelemetry: Semantic Conventions for GenAI Systems
  2. OpenAI Agents SDK: Tracing
  3. LangSmith: Observability
  4. Anthropic Engineering: How We Built Our Multi-Agent Research System