AI Learning Ramp | Course 9

System-Design Frame

Assume the BigQuery copilot now performs multi-step work that may outlive one request: clarify intent, inspect schemas, draft SQL, wait for human approval, run approved jobs, summarize results, and open follow-up tasks. The interview question is how you make that execution durable without hiding policy decisions inside model context. The building blocks are explicit state, queued work, checkpointed progress, replayable traces, interruptible approval steps, cancellation, and compensation for side effects.

Course 9: Durable Agent Execution

One-hour objective: design a durable execution layer for an AI analytics agent and explain how state machines, queues, checkpoints, human approval, replay, and cancellation fit together in production.

0-5 min

Write the failure contract.

List what users and operators should see after a crash, timeout, rate limit, human non-response, unsafe SQL plan, cancelled request, and partial BigQuery job.
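A failure contract is easier to review when it is written as data. The sketch below is one possible shape in Python: the failure names come from the list above, while the `Contract` fields and all status wording are illustrative assumptions, not taken from any of the readings.

```python
from dataclasses import dataclass
from enum import Enum


class Failure(Enum):
    CRASH = "crash"
    TIMEOUT = "timeout"
    RATE_LIMIT = "rate limit"
    HUMAN_NON_RESPONSE = "human non-response"
    UNSAFE_SQL_PLAN = "unsafe SQL plan"
    CANCELLED = "cancelled request"
    PARTIAL_BIGQUERY_JOB = "partial BigQuery job"


@dataclass(frozen=True)
class Contract:
    user_sees: str       # status text shown in the UI (illustrative wording)
    operator_sees: str   # signal for on-call or dashboards (illustrative wording)
    auto_retry: bool     # whether the platform retries without a human


# Hypothetical contract entries; a real team would negotiate each row.
FAILURE_CONTRACT = {
    Failure.CRASH: Contract("Resuming from the last saved step", "worker lost, workflow rescheduled", True),
    Failure.TIMEOUT: Contract("Step is slow, retrying", "activity timeout", True),
    Failure.RATE_LIMIT: Contract("Waiting for capacity", "backoff in progress", True),
    Failure.HUMAN_NON_RESPONSE: Contract("Approval request expired", "approval timeout", False),
    Failure.UNSAFE_SQL_PLAN: Contract("Plan blocked by policy", "policy rejection with plan id", False),
    Failure.CANCELLED: Contract("Cancelled, completed steps listed", "cancellation propagated", False),
    Failure.PARTIAL_BIGQUERY_JOB: Contract("Query stopped before finishing", "job id recorded for cleanup", False),
}


def missing_entries() -> list:
    """Return failures that have no contract yet, so gaps show up in review."""
    return [f for f in Failure if f not in FAILURE_CONTRACT]
```

Treating `missing_entries()` as a unit test keeps the contract complete as new failure modes are added.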

5-20 min

Study durable agent execution in Temporal.

Focus on the practical split between the workflow, model/tool activities, retries, event history, and how an OpenAI Agents SDK loop can survive process failure.
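To make the workflow/activity split concrete before reading the recipe, here is a standard-library Python model of replay from an event history. It is not Temporal's API: `EventHistory`, `WorkflowRun`, and the two activity functions are hypothetical names used only to show why a recorded activity result is returned on replay instead of being executed again.

```python
from typing import Any, Callable


class EventHistory:
    """Append-only list of (activity name, result) for completed activities."""

    def __init__(self) -> None:
        self.events: list = []


class WorkflowRun:
    """One execution attempt of the workflow code against a shared history."""

    def __init__(self, history: EventHistory) -> None:
        self.history = history
        self.cursor = 0  # index of the next event to replay

    def activity(self, name: str, fn: Callable[[], Any]) -> Any:
        if self.cursor < len(self.history.events):
            recorded_name, result = self.history.events[self.cursor]
            if recorded_name != name:
                raise RuntimeError(f"non-deterministic replay: {recorded_name} != {name}")
            self.cursor += 1
            return result  # replayed from history, fn is not called
        result = fn()  # first execution; an exception leaves history unchanged
        self.history.events.append((name, result))
        self.cursor += 1
        return result


calls = {"draft_sql": 0, "run_query": 0}


def draft_sql() -> str:
    calls["draft_sql"] += 1
    return "SELECT 1"


def run_query(sql: str) -> dict:
    calls["run_query"] += 1
    if calls["run_query"] == 1:
        raise ConnectionError("simulated worker crash")
    return {"rows": 1, "sql": sql}


def analysis_workflow(run: WorkflowRun) -> dict:
    sql = run.activity("draft_sql", draft_sql)
    return run.activity("run_query", lambda: run_query(sql))


history = EventHistory()
try:
    analysis_workflow(WorkflowRun(history))  # first attempt crashes mid-way
except ConnectionError:
    pass
result = analysis_workflow(WorkflowRun(history))  # second attempt replays, then continues
```

After the second attempt, `draft_sql` has run once and `run_query` twice: the completed step was replayed from history and only the failed step was re-executed.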

20-35 min

Read LangGraph persistence as checkpointing vocabulary.

Track threads, checkpoints, state snapshots, long-term store, human-in-the-loop pauses, time travel, fault tolerance, and how to resume from known state.
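The vocabulary is easier to retain with a toy model. The following Python sketch borrows the terms thread, checkpoint, and time travel, but it is not LangGraph's API: `CheckpointStore`, `save`, `latest`, and `fork_from` are hypothetical names for an in-memory stand-in.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    checkpoint_id: int
    step: str
    state: dict  # snapshot of agent state at this step


@dataclass
class CheckpointStore:
    threads: dict = field(default_factory=dict)  # thread_id -> list of Checkpoint

    def save(self, thread_id: str, step: str, state: dict) -> Checkpoint:
        history = self.threads.setdefault(thread_id, [])
        # Deep copy so later mutation of the live state cannot alter the snapshot.
        cp = Checkpoint(len(history), step, copy.deepcopy(state))
        history.append(cp)
        return cp

    def latest(self, thread_id: str) -> Checkpoint:
        """Resume point after a crash: the most recent snapshot for the thread."""
        return self.threads[thread_id][-1]

    def fork_from(self, thread_id: str, checkpoint_id: int, new_thread_id: str) -> Checkpoint:
        """Time travel: start a new thread from an earlier snapshot."""
        source = self.threads[thread_id][checkpoint_id]
        return self.save(new_thread_id, source.step, source.state)


store = CheckpointStore()
state = {"intent": "weekly revenue", "sql": None}
store.save("t1", "clarified", state)
state["sql"] = "SELECT 1"
store.save("t1", "drafted", state)
fork = store.fork_from("t1", 0, "t1-debug")
```

Forking into a new thread, rather than rewinding the original, keeps the production history intact while an incident is debugged.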

35-48 min

Ground approval gates in containment.

Use Anthropic's containment post to separate "ask the user" from enforceable boundaries: permissions, sandboxing, confirmation UI, and attacks that target human approval.
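One way to express the difference between asking the user and enforcing a boundary is a gate that checks policy first and then checks that the approval covers exactly the plan about to run. This is a minimal Python sketch under stated assumptions: `ALLOWED_DATASETS`, `MAX_BYTES`, and the plan fields are hypothetical, and the logic is not taken from the Anthropic post.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional

ALLOWED_DATASETS = {"analytics.sales"}  # hypothetical tenant allow-list
MAX_BYTES = 10 * 1024**3                # hypothetical scan cap (10 GiB)


def plan_digest(plan: dict) -> str:
    """Stable hash of the plan so an approval binds to exact content."""
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Approval:
    approver: str
    digest: str        # digest of the plan that was shown to the approver
    expires_at: float  # epoch seconds


def policy_allows(plan: dict) -> bool:
    """Enforceable boundary: holds no matter what a human clicks."""
    return set(plan["datasets"]) <= ALLOWED_DATASETS and plan["estimated_bytes"] <= MAX_BYTES


def may_execute(plan: dict, approval: Optional[Approval], now: float) -> bool:
    if not policy_allows(plan):
        return False  # approval cannot override policy
    if approval is None or now >= approval.expires_at:
        return False  # missing or stale approval
    return approval.digest == plan_digest(plan)  # plan must be unchanged


plan = {"sql": "SELECT 1", "datasets": ["analytics.sales"], "estimated_bytes": 1024}
approval = Approval("alice", plan_digest(plan), expires_at=1000.0)
edited = dict(plan, sql="SELECT * FROM analytics.sales")
off_policy = dict(plan, datasets=["finance.payroll"])
```

With these values, `may_execute(plan, approval, now=500.0)` is true, while the edited plan, an off-policy plan carrying its own fresh approval, and any call at or after `now=1000.0` are all refused.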

48-55 min

Draw the durable execution architecture.

Sketch the state store, job queue, workflow runner, tool activities, approval interrupt, cancellation path, receipts log, and replay/debug view for a BigQuery analysis task.
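Before drawing boxes, it can help to see the queue, cancellation path, and receipts log interact in a few lines. The Python sketch below is a single-process, in-memory stand-in, not a production design: `Runner`, `Task`, and the receipt strings are hypothetical, and a real system would use a durable queue and store.

```python
import queue
from dataclasses import dataclass, field


@dataclass
class Task:
    workflow_id: str
    steps: list  # activity names in execution order


@dataclass
class Runner:
    jobs: queue.Queue = field(default_factory=queue.Queue)
    cancelled: set = field(default_factory=set)
    receipts: list = field(default_factory=list)  # (workflow_id, step, outcome)

    def submit(self, task: Task) -> None:
        self.jobs.put(task)

    def cancel(self, workflow_id: str) -> None:
        self.cancelled.add(workflow_id)

    def drain(self, activities: dict) -> None:
        """Run queued tasks, checking the cancellation flag before every step."""
        while not self.jobs.empty():
            task = self.jobs.get()
            for step in task.steps:
                if task.workflow_id in self.cancelled:
                    self.receipts.append((task.workflow_id, step, "skipped:cancelled"))
                    continue
                activities[step](task)
                self.receipts.append((task.workflow_id, step, "done"))


runner = Runner()
activities = {"draft_sql": lambda task: None, "run_query": lambda task: None}
runner.submit(Task("wf-1", ["draft_sql", "run_query"]))
runner.submit(Task("wf-2", ["draft_sql", "run_query"]))
runner.cancel("wf-2")  # cancelled while still queued
runner.drain(activities)
```

The receipts log records skipped steps as well as completed ones, which gives the replay/debug view an explicit answer to what did and did not happen.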

55-60 min

Deliver the interview synthesis.

Explain why durable agents are stateful distributed systems: the model can choose steps, but the platform owns state transitions, policy gates, retries, and side effects.
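A compact way to demonstrate that split in an interview is a transition table the platform enforces against whatever step the model proposes. In this Python sketch the state names follow the interview drill's transition list; the specific allowed edges and the `apply_transition` helper are assumptions made for illustration.

```python
# Hypothetical allowed edges; terminal states have no outgoing transitions.
ALLOWED = {
    "received": {"clarifying", "planning", "cancelled"},
    "clarifying": {"planning", "needs-human-follow-up", "cancelled"},
    "planning": {"waiting-for-approval", "failed", "cancelled"},
    "waiting-for-approval": {"queued-for-execution", "needs-human-follow-up", "cancelled"},
    "queued-for-execution": {"running-query", "cancelled"},
    "running-query": {"summarizing", "failed", "cancelled"},
    "summarizing": {"complete", "failed", "cancelled"},
}


class IllegalTransition(Exception):
    pass


def apply_transition(current: str, proposed: str, log: list) -> str:
    """Accept the model's proposed next state only if the platform allows it."""
    if proposed not in ALLOWED.get(current, set()):
        log.append((current, proposed, "rejected"))
        raise IllegalTransition(f"{current} -> {proposed}")
    log.append((current, proposed, "accepted"))
    return proposed


log = []
state = apply_transition("planning", "waiting-for-approval", log)
try:
    # The model tries to skip approval; the platform refuses and records it.
    apply_transition("planning", "running-query", log)
except IllegalTransition:
    pass
```

The rejected proposal still lands in the log, so an approval-bypass attempt is visible in the trace rather than silently dropped.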

Course 9 Reading List

Use three required sources: one durable OpenAI-agent recipe, one checkpointing reference, and one containment engineering post for approval-boundary realism. Keep the optional refresher for core durable-execution vocabulary only.

Required

Temporal AI Cookbook: Durable Agent With Tools - OpenAI Agents SDK

A practical Temporal recipe for wrapping an OpenAI Agents SDK loop in durable workflow execution, with model calls and tools treated as activities that can be retried, recorded, and resumed.

Read for: how to translate an agent loop into workflow state, activity boundaries, retries, and event history.

Required

LangGraph: Persistence

Official LangGraph guidance on checkpointers, threads, checkpoints, state snapshots, long-term stores, human-in-the-loop flows, time travel, and fault tolerance.

Read for: the checkpoint and resume vocabulary needed to discuss production agent state without hand-waving.

Required

Anthropic: How We Contain Claude Across Products

A recent Anthropic engineering post on containment across products, including why human approval prompts can fail when the surrounding system allows coercion, confusion, or unsafe authority transfer.

Read for: how to make approval gates operationally meaningful instead of treating the human as a magic safety layer.

Optional Refresher

Temporal: Durable Execution

A concise refresher on the durable-execution model: event history, replay, recovery after worker failure, and long-running workflow semantics.

Skim for: core replay language if workflow engines are not already fresh in memory.

Readiness Checklist

You are ready for the interview version of this topic when you can defend durable execution as a distributed-systems design, not an SDK feature flag.

You can explain the difference between a best-effort request loop, a background job, and a durable workflow with persisted state transitions.
You can define the state record for a BigQuery agent: intent, plan, selected datasets, approvals, current step, tool receipts, retry counters, cancellation status, and final answer.
You can choose checkpoint boundaries that support resume, replay, user-visible status, human approval, and safe incident debugging.
You can keep side effects outside replay-sensitive logic by using idempotency keys, recorded tool results, BigQuery job IDs, and compensation steps where rollback is impossible.
You can design human approval as an interrupt and resume point with immutable context, policy checks, timeout behavior, and a clear record of what was approved.
You can specify cancellation semantics for queued work, in-flight tool calls, long-running BigQuery jobs, downstream notifications, and partially completed analyses.
You can propose evals and operations checks for stuck workflows, retry storms, approval bypasses, duplicate tool calls, replay correctness, queue backpressure, and trace completeness.
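The state-record and side-effect items above can be rehearsed with a short Python sketch. The field names mirror the checklist; the `idempotency_key` scheme, the `run_once` helper, and the `submit_job` stand-in are hypothetical and do not call any BigQuery client.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class AgentState:
    workflow_id: str
    intent: str
    plan: list = field(default_factory=list)
    selected_datasets: list = field(default_factory=list)
    approvals: list = field(default_factory=list)
    current_step: str = "received"
    receipts: dict = field(default_factory=dict)      # idempotency key -> recorded result
    retry_counts: dict = field(default_factory=dict)  # step -> attempts so far
    cancel_requested: bool = False
    final_answer: Optional[str] = None


def idempotency_key(workflow_id: str, step: str, payload: str) -> str:
    """Derive a stable key so the same logical action maps to the same receipt."""
    raw = f"{workflow_id}:{step}:{payload}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:32]


def run_once(state: AgentState, step: str, payload: str, fn: Callable[[str], dict]) -> dict:
    """Execute a side effect at most once per key; later calls return the receipt."""
    key = idempotency_key(state.workflow_id, step, payload)
    if key in state.receipts:
        return state.receipts[key]
    result = fn(payload)
    state.receipts[key] = result
    return result


submitted = []


def submit_job(sql: str) -> dict:
    # Stand-in for a real job submission; records each actual call.
    submitted.append(sql)
    return {"job": f"job-{len(submitted)}", "sql": sql}


state = AgentState(workflow_id="wf-1", intent="weekly revenue")
first = run_once(state, "run_query", "SELECT 1", submit_job)
second = run_once(state, "run_query", "SELECT 1", submit_job)  # e.g. after a retry
```

In a real deployment the derived key could also serve as a client-supplied BigQuery job ID, so that a resubmission after an ambiguous failure is rejected as a duplicate by the service rather than by local state alone; treat that as a design option to verify against the BigQuery documentation.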

Interview Drill: Agentic AI System Design

Prompt: design durable execution for a BigQuery analytics agent that may spend minutes or hours clarifying a request, inspecting metadata, waiting for approval, running approved SQL, and recovering from model or worker failures.

Start with requirements: user-visible status, resumability after crash, approval before expensive or sensitive queries, cancellation, auditability, tenant isolation, latency targets, and cost caps.
Draw the control plane: API request creates a workflow instance, state transitions persist to a store, queue workers run activities, and the UI subscribes to status and approval interrupts.
Define state transitions: received, clarifying, planning, waiting-for-approval, queued-for-execution, running-query, summarizing, complete, failed, cancelled, and needs-human-follow-up.
Separate deterministic orchestration from side effects: tool calls, model calls, BigQuery dry runs, BigQuery jobs, notifications, and ticket creation run as activities with timeouts and receipts.
Handle replay and retries: replay workflow decisions from persisted history, do not repeat completed external actions, retry transient failures with caps, and escalate non-idempotent uncertainty to a human.
Handle approval and cancellation: freeze the plan shown to the approver, record approval scope, expire stale approvals, propagate cancellation to queued and in-flight work, and mark completed side effects explicitly.
Close with operations: traces across model and tool steps, queue-depth alerts, stuck-workflow sweeps, replay debugging, duplicate-job detection, approval-bypass tests, and golden long-running analytics scenarios.
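The retry step of the drill can be backed with a small Python sketch: capped retries with exponential backoff for failures known to be safe, and escalation when a non-idempotent call has an unknown outcome. The two exception classes, the outcome strings, and `run_with_retries` are hypothetical names chosen for this example.

```python
import time


class TransientError(Exception):
    """The call definitely did not take effect (for example, a rate limit)."""


class UnknownOutcome(Exception):
    """The call may or may not have taken effect (for example, a lost response)."""


def run_with_retries(fn, *, idempotent, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Return (outcome, value) where outcome is ok, failed, or needs-human-follow-up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return "ok", fn()
        except UnknownOutcome:
            if not idempotent:
                # Retrying could duplicate the side effect, so hand off to a human.
                return "needs-human-follow-up", None
        except TransientError:
            pass
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    return "failed", None  # cap reached


attempts = []


def flaky_dry_run():
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("rate limited")
    return "dry run ok"


def lost_notification():
    raise UnknownOutcome("response lost after send")


recovered = run_with_retries(flaky_dry_run, idempotent=True, sleep=lambda _: None)
escalated = run_with_retries(lost_notification, idempotent=False, sleep=lambda _: None)
```

Passing `sleep` as a parameter keeps the backoff testable without real delays; the cap prevents a single stuck activity from turning into a retry storm.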

Durable agent execution for long-running analytics work.
