AI Learning Ramp | Course 11

System-Design Frame

Assume your BigQuery copilot now has retrieval, SQL drafting, dry runs, approval gates, execution, summarization, and eval-based release checks. The interview question is how you debug production behavior when a run intermittently loops, chooses the wrong table, violates a latency SLO, or gives an unsupported answer. Design the observability plane: structured traces covering every model call, tool call, retrieval, policy decision, and state transition; receipts that prove what actually ran; metrics that connect latency, tokens, cost, and quality signals; and replay bundles that let engineers reproduce failures without exposing unnecessary customer data.

Course 11: Observability

One-hour objective: design the telemetry and debugging workflow for a production AI query engine, then explain how traces, receipts, metrics, and replay shorten incident response.

0-5 min

Start from one incident.

Write a realistic failure: the copilot answers from the wrong table, gets stuck retrying a tool, or misses a p95 latency SLO. List the evidence you would need in hand within the first 15 minutes of the investigation.

5-19 min

Read OpenTelemetry's GenAI conventions.

Focus on common names for model requests, token usage, operations, prompts, completions, tools, embeddings, agent steps, errors, and metrics. Decide what your standard span vocabulary should be.
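One way to pin down a span vocabulary is to write the attribute builders first. The sketch below is plain Python with no OpenTelemetry dependency. The `gen_ai.*` names are believed to match the GenAI semantic conventions, but those conventions are still evolving, so check each name against the current spec. The `copilot.bigquery.job_id` attribute, the model name, and the helper functions are hypothetical.

```python
# Attribute names under "gen_ai." are believed to match the OpenTelemetry GenAI
# semantic conventions; verify them against the current spec before adopting.

def model_call_attributes(model: str, input_tokens: int, output_tokens: int) -> dict:
    """Attributes for one chat-style model request span."""
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


def tool_call_attributes(tool_name: str, job_id: str) -> dict:
    """Attributes for one tool execution span."""
    return {
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": tool_name,
        # Hypothetical application-specific attribute, not part of any standard.
        "copilot.bigquery.job_id": job_id,
    }


def span_name(attrs: dict) -> str:
    """Name a span '<operation> <model or tool>' so span names stay low-cardinality."""
    target = attrs.get("gen_ai.request.model") or attrs.get("gen_ai.tool.name", "")
    return f"{attrs['gen_ai.operation.name']} {target}".strip()
```

Keeping IDs such as the job ID in attributes rather than in the span name means dashboards can group by name while investigations can still filter by ID.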

19-32 min

Study OpenAI agent tracing.

Look for how traces group agent runs and how spans represent agent execution, model generations, tool calls, handoffs, guardrails, and custom application work.
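To make the nesting concrete, here is a minimal sketch of a trace that groups spans by parent. It is a hypothetical stand-in written in plain Python, not the OpenAI Agents SDK API. The span kinds ("agent", "generation", "function") simply mirror the span types named above, and the `Trace` class and its fields are assumptions for illustration.

```python
import time
import uuid
from contextlib import contextmanager


class Trace:
    """Collects spans for one user task (hypothetical, not a real SDK class)."""

    def __init__(self, name: str):
        self.trace_id = uuid.uuid4().hex
        self.name = name
        self.spans = []    # span records in start order
        self._stack = []   # IDs of currently open spans, innermost last

    @contextmanager
    def span(self, kind: str, name: str):
        record = {
            "span_id": uuid.uuid4().hex[:16],
            "parent_id": self._stack[-1] if self._stack else None,
            "kind": kind,
            "name": name,
            "error": None,
        }
        self.spans.append(record)
        self._stack.append(record["span_id"])
        start = time.perf_counter()
        try:
            yield record
        except Exception as exc:
            record["error"] = type(exc).__name__  # keep the class, not the message
            raise
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            self._stack.pop()


# One user-visible answer becomes one trace with nested spans.
trace = Trace("answer revenue question")
with trace.span("agent", "sql_copilot"):
    with trace.span("generation", "draft_sql"):
        pass
    with trace.span("function", "bigquery_dry_run"):
        pass
```

Recording only the exception class in `error` is a deliberate choice here: error messages often contain SQL text or row values that should not land in broadly readable traces.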

32-44 min

Map LangSmith observability to operations.

Translate traces, runs, metadata, feedback, errors, dashboards, and production monitoring into a workflow for debugging query-agent failures after deployment.
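A small triage function can illustrate the step from production runs to a debugging queue. This is a sketch under assumptions: the run dictionaries and their keys (`error`, `user_corrected`, `risk_label`) are hypothetical and do not represent LangSmith's data model or API.

```python
def regression_candidates(runs, max_candidates=50):
    """Pick trace IDs worth investigating: errors first, then corrections, then high risk."""

    def priority(run):
        return (
            run.get("error") is not None,
            run.get("user_corrected", False),
            run.get("risk_label") == "high",
        )

    flagged = [r for r in runs if any(priority(r))]
    # Tuples of booleans sort so that errored runs outrank corrected ones, and so on.
    flagged.sort(key=priority, reverse=True)
    return [r["trace_id"] for r in flagged[:max_candidates]]
```

The same ordering could feed an on-call queue during an incident and a regression backlog afterwards.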

44-51 min

Sketch the telemetry schema.

Define trace IDs, tenant and user risk labels, model and prompt hashes, retrieval IDs, BigQuery job IDs, tool receipts, state diffs, latency, tokens, cost, cache hits, and redaction rules.
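The schema can be sketched as a single record type plus a redacted view. Everything below is illustrative: the `StepTelemetry` fields are one possible reading of the list above, the risk labels are placeholders, and the 16-character hash length is an arbitrary choice.

```python
import hashlib
from dataclasses import asdict, dataclass, field
from typing import Optional


def short_hash(text: str) -> str:
    """Stable fingerprint so prompts can be compared without storing their text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


@dataclass
class StepTelemetry:
    trace_id: str
    tenant_id: str
    risk_label: str                       # e.g. "low" or "high" (illustrative labels)
    model_id: str
    prompt_hash: str
    retrieval_ids: list = field(default_factory=list)
    bigquery_job_id: Optional[str] = None
    latency_ms: float = 0.0
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0
    cache_hit: bool = False


def redacted_view(record: StepTelemetry, fields_to_hide: set) -> dict:
    """Return a copy safe for broad access, noting which fields were hidden."""
    view = asdict(record)
    hidden = sorted(name for name in fields_to_hide if name in view)
    for name in hidden:
        view[name] = "[REDACTED]"
    view["redacted_fields"] = hidden
    return view
```

For example, `redacted_view(record, {"tenant_id"})` leaves the original record untouched and returns a dictionary whose `redacted_fields` entry tells the reader what was withheld.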

51-56 min

Choose SLOs and alerts.

Set thresholds for p95 end-to-end latency, time to first token, tool failure rate, loop count, SQL dry-run errors, policy denials, cost spikes, user corrections, and unresolved escalations.
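A threshold check over a window of run summaries might look like the sketch below. The threshold values are placeholders rather than recommendations, the run-summary keys are hypothetical, and the percentile uses the nearest-rank method, which is only one of several common definitions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; assumes a non-empty list and an integer pct in 1..100."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Illustrative thresholds only; real values come from product requirements.
SLO = {"p95_latency_ms": 8000, "tool_failure_rate": 0.02, "max_loop_count": 5}


def evaluate_slos(runs, slo=SLO):
    """Return the names of SLOs breached over a window of run summaries."""
    breaches = []
    if percentile([r["latency_ms"] for r in runs], 95) > slo["p95_latency_ms"]:
        breaches.append("p95_latency_ms")
    tool_calls = sum(r["tool_calls"] for r in runs)
    tool_failures = sum(r["tool_failures"] for r in runs)
    if tool_calls and tool_failures / tool_calls > slo["tool_failure_rate"]:
        breaches.append("tool_failure_rate")
    if any(r["loop_count"] > slo["max_loop_count"] for r in runs):
        breaches.append("max_loop_count")
    return breaches
```

Rates such as tool failures are computed over the whole window rather than per run, so a single short run with one failed call does not trip the alert by itself.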

56-60 min

Deliver the interview synthesis.

Explain that AI observability is the operational bridge between evals and incidents: it turns nondeterministic agent behavior into inspectable traces, comparable metrics, and safe replay.

Course 11 Reading List

Use exactly three required readings: one open telemetry standard, one OpenAI agent tracing source, and one production observability workflow source.

Required

OpenTelemetry: Semantic Conventions for GenAI Systems

The emerging shared vocabulary for GenAI telemetry: operations, model calls, token usage, prompts, completions, tools, embeddings, agents, errors, and metrics.

Read for: how to avoid inventing custom span names when standard attributes can make traces portable across vendors and teams.

Required

OpenAI Agents SDK: Tracing

OpenAI's tracing model for agent runs, including traces, spans, agent spans, generation spans, function spans, handoffs, guardrails, custom spans, and trace processors.

Read for: how an agent framework represents the nested work behind a single user-visible answer.

Required

LangSmith: Observability

LangSmith's observability guide for tracing LLM applications, inspecting runs, adding metadata and feedback, monitoring production behavior, and debugging failures.

Read for: how production traces become dashboards, investigations, feedback loops, and regression candidates.

Optional Refresher

Anthropic Engineering: How We Built Our Multi-Agent Research System

A systems post with useful operational lessons on tracing, prompt iteration, tool design, error modes, and debugging multi-agent behavior at scale.

Skim for: production-agent debugging instincts when many tool calls and subagents make failure attribution hard.

Readiness Checklist

You are ready for the interview version of this topic when you can defend observability as a product reliability system, not just log collection.

You can define the core spans in a BigQuery copilot run: user request, retrieval, planning, model generation, SQL draft, dry run, approval, execution, summary, and policy checks.
You can describe a tool receipt schema with tool name, arguments, caller span, auth scope, BigQuery job ID, result summary, error class, latency, cost, and redaction status.
You can separate high-cardinality debug attributes from low-cardinality metrics that should drive dashboards and alerts.
You can choose latency and cost metrics for interactive query assistance: time to first token, end-to-end latency, tool latency, queue time, token counts, cache hits, and BigQuery bytes processed.
You can explain how to capture state diffs between agent steps without logging raw sensitive data unnecessarily.
You can design a failure replay package with prompt hash, model version, retrieval IDs, schema snapshot, tool receipts, policy decisions, sampled artifacts, and expected access controls.
You can connect observability back to evals by turning production incidents, user corrections, and high-risk traces into new regression cases.
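Two of the checklist items, the tool receipt and the state diff, can be sketched together because both rely on hashing values that should not be logged raw. The key names in `SENSITIVE_KEYS`, the receipt fields, and the 12-character digest prefix are assumptions for illustration.

```python
import hashlib
import json

SENSITIVE_KEYS = {"sql_text", "result_rows"}  # hypothetical state keys


def fingerprint(value) -> str:
    """Short, stable digest of any JSON-serializable value."""
    blob = json.dumps(value, sort_keys=True, default=str)
    return "sha256:" + hashlib.sha256(blob.encode("utf-8")).hexdigest()[:12]


def state_diff(before: dict, after: dict) -> dict:
    """Record which keys changed between agent steps, hashing sensitive values.

    A missing key and a key set to None are treated the same in this sketch.
    """
    diff = {}
    for key in sorted(set(before) | set(after)):
        old, new = before.get(key), after.get(key)
        if old == new:
            continue
        if key in SENSITIVE_KEYS:
            old = fingerprint(old) if old is not None else None
            new = fingerprint(new) if new is not None else None
        diff[key] = {"old": old, "new": new}
    return diff


def make_receipt(tool_name, arguments, caller_span_id, job_id, status, latency_ms):
    """Proof of what ran: arguments are hashed, identifiers are kept in the clear."""
    return {
        "tool_name": tool_name,
        "arguments_hash": fingerprint(arguments),
        "caller_span_id": caller_span_id,
        "bigquery_job_id": job_id,
        "status": status,
        "latency_ms": latency_ms,
        "redaction_status": "arguments_hashed",
    }
```

Hashing still lets an engineer confirm that two steps saw the same SQL or the same result set, which is often enough to localize a loop or a wrong-table answer without reading the data itself.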

Interview Drill: AI Infra System Design

Prompt: design observability for a BigQuery GenAI query engine that uses a planner, retrieval, SQL generation, dry runs, approval gates, execution tools, and final-answer summarization.

Start with requirements: debug wrong-table answers, SQL dry-run failures, tool loops, policy denials, high latency, token and BigQuery cost spikes, and user-visible unsupported claims.
Define trace boundaries: one trace per user task, spans for planner decisions, retrieval, model calls, tool calls, approvals, retries, execution, summarization, and user feedback.
Define telemetry fields: model ID, prompt hash, schema snapshot ID, retrieval corpus version, table IDs, BigQuery job IDs, token counts, bytes processed, tool status, error class, and redaction policy.
Pick metrics and SLOs: p95 end-to-end latency, time to first token, tool error rate, loop count, dry-run failure rate, approval rate, cost per successful task, and correction rate.
Design storage and privacy: raw traces with access control and retention, redacted trace views for broad debugging, low-cardinality metrics for dashboards, and sampled artifacts for replay.
Explain failure replay: reconstruct the prompt context, retrieval candidates, schema state, model version, tool receipts, and policy decisions in a sandbox before creating a regression eval.
Close with operations: on-call dashboards, alert routing, incident runbooks, canary comparison traces, eval backfills, customer-safe exports, and named owners for telemetry schema changes.
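The replay step can be sketched as a bundle of references plus an integrity digest. This is a minimal illustration, assuming the raw prompt and artifacts live in access-controlled storage and only hashes and IDs travel in the bundle. The function and field names are hypothetical.

```python
import hashlib
import json


def _digest(body: dict) -> str:
    canonical = json.dumps(body, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_replay_bundle(trace_id, prompt_text, model_version, retrieval_ids,
                        schema_snapshot_id, tool_receipts, policy_decisions):
    """Assemble what a sandbox needs to reproduce a run, without the raw prompt."""
    body = {
        "trace_id": trace_id,
        "prompt_hash": hashlib.sha256(prompt_text.encode("utf-8")).hexdigest(),
        "model_version": model_version,
        "retrieval_ids": list(retrieval_ids),
        "schema_snapshot_id": schema_snapshot_id,
        "tool_receipts": list(tool_receipts),
        "policy_decisions": list(policy_decisions),
    }
    body["bundle_digest"] = _digest(body)  # computed before the digest key is added
    return body


def verify_bundle(bundle: dict) -> bool:
    """Check that nothing in the bundle changed after it was built."""
    body = {k: v for k, v in bundle.items() if k != "bundle_digest"}
    return _digest(body) == bundle["bundle_digest"]
```

Verifying the digest before a replay gives some assurance that the regression eval derived from the incident reflects what actually ran, not a bundle that was edited along the way.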

Observability for agentic query systems.
