AI Learning Ramp | Course 12

System-Design Frame

Assume your BigQuery copilot is now used by analysts for interactive SQL help and by scheduled agents for recurring investigations. Reliability is no longer just uptime: the system must preserve trust when OpenAI or Anthropic rate limits bite, retrieval or BigQuery dependencies slow down, a model or prompt change shifts behavior, or an agent starts retrying a costly tool loop. Design the production controls: quotas, admission control, degradation paths, circuit breakers, canary gates, rollback triggers, and incident learning loops.

Course 12: Production Reliability

One-hour objective: design the reliability envelope around a production AI query engine, then explain how overload controls, provider-limit handling, rollout gates, and incident loops keep agentic behavior bounded.

0-5 min

Pick a failure mode.

Choose one incident: rate-limit storms, expensive SQL loops, degraded retrieval, a bad prompt rollout, model drift on finance queries, or a regional dependency outage.

5-18 min

Read OpenAI rate limits.

Focus on organization and project limits, model-specific limits, response headers, user-level usage caps, exponential backoff, jitter, retry caps, and why failed requests still count against the per-minute limit, so a tight retry loop makes a rate-limit storm worse.
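The backoff, jitter, and retry-cap ideas above can be sketched in a few lines of Python. This is a minimal illustration, not any provider's SDK: `RateLimitError`, `backoff_delay`, and `call_with_backoff` are hypothetical names, and the base delay, cap, and retry count are assumed values chosen only for the example.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 response."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds suggested by the provider, if any


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full-jitter delay: a random value in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_backoff(fn, max_retries=4, sleep=time.sleep, rng=random.random):
    """Call fn(); on RateLimitError retry at most max_retries times, then re-raise."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError as err:
            if attempt == max_retries:
                raise  # retry cap reached: fail fast instead of spending more quota
            delay = backoff_delay(attempt, rng=rng)
            if err.retry_after is not None:
                delay = max(delay, err.retry_after)  # never retry sooner than advised
            sleep(delay)
```

The `sleep` and `rng` parameters are injected so the retry policy can be tested without real waiting. In an agent, the same cap would ideally apply once per task rather than once per nested call, so that retries do not multiply across layers.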

18-33 min

Study SRE overload handling.

Translate degraded responses, per-customer limits, client-side throttling, request criticality, utilization signals, and retry budgets into AI-agent admission control.
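One way to picture this translation is a small admission check that combines request criticality with a utilization signal, plus the client-side rejection probability described in the SRE chapter, max(0, (requests - K * accepts) / (requests + 1)). The sketch below is an illustration under assumptions: the criticality names follow the SRE book, but the shedding thresholds and the function names `admit` and `client_reject_probability` are made up for this example.

```python
from enum import IntEnum


class Criticality(IntEnum):
    SHEDDABLE = 0        # e.g. scheduled agent investigations
    SHEDDABLE_PLUS = 1
    CRITICAL = 2         # e.g. interactive analyst requests
    CRITICAL_PLUS = 3


# Assumed thresholds: a class is shed once utilization reaches its value.
SHED_AT = {
    Criticality.SHEDDABLE: 0.70,
    Criticality.SHEDDABLE_PLUS: 0.80,
    Criticality.CRITICAL: 0.95,
    Criticality.CRITICAL_PLUS: 1.00,
}


def admit(criticality, utilization):
    """Admit a request only while utilization is below its class threshold."""
    return utilization < SHED_AT[criticality]


def client_reject_probability(requests, accepts, k=2.0):
    """Probability that the client drops a request locally before sending it.

    requests: calls attempted in the recent window
    accepts:  calls the backend accepted in the same window
    """
    return max(0.0, (requests - k * accepts) / (requests + 1))
```

With these assumed thresholds, scheduled work is shed first at 70% utilization while interactive work keeps being admitted until 95%.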

33-45 min

Study SRE canarying.

Map canary population, duration, metric choice, control comparisons, isolation, rollback, and error-budget risk to model, prompt, retriever, and tool releases.
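As a rough sketch of a control comparison, the function below compares a canary population's error rate with a control population's and returns a verdict. It is deliberately simplistic: the name `canary_verdict`, the 1% allowed increase, and the 200-sample minimum are assumptions for illustration, and a real gate would likely use several metrics and a proper statistical test.

```python
def error_rate(errors, total):
    """Fraction of failed requests; 0.0 when there is no traffic."""
    return errors / total if total else 0.0


def canary_verdict(control, canary, max_abs_increase=0.01, min_samples=200):
    """Compare (errors, total) pairs for control and canary populations.

    Returns 'wait' until both populations have enough samples, 'rollback' if
    the canary error rate exceeds control by more than max_abs_increase,
    and 'promote' otherwise.
    """
    control_errors, control_total = control
    canary_errors, canary_total = canary
    if control_total < min_samples or canary_total < min_samples:
        return "wait"
    delta = error_rate(canary_errors, canary_total) - error_rate(control_errors, control_total)
    return "rollback" if delta > max_abs_increase else "promote"
```

The same shape applies to prompt, retriever, and tool releases if "error" is replaced with a task-level signal such as dry-run failure or user correction.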

45-52 min

Define the reliability control plane.

Sketch rate-limit adapters, tenant quotas, queue classes, circuit breakers, fallback responses, drift monitors, safe replay, and rollback paths for the BigQuery copilot.
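Of the components listed, the circuit breaker is the easiest to show in code. The following is a minimal sketch, assuming a consecutive-failure threshold and a fixed cooldown; the class name, method names, and default values are hypothetical rather than taken from any library.

```python
import time


class CircuitBreaker:
    """Stops calls to a failing tool after repeated consecutive failures."""

    def __init__(self, failure_threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the breaker is closed

    def allow(self):
        """Return True if a call may proceed."""
        if self.opened_at is None:
            return True
        # Open: block until the cooldown elapses, then let probe calls through.
        return self.clock() - self.opened_at >= self.cooldown_s

    def record_success(self):
        """A successful call closes the breaker and clears the failure count."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        """Count a failure; at the threshold, open (or re-open) the breaker."""
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

An agent loop would check `allow()` before each tool call and return a degraded answer when it is False, rather than retrying a costly BigQuery step indefinitely.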

52-56 min

Connect incidents to releases.

Decide which production traces become regression evals, which canary metrics become release gates, and which incident actions need runbooks or automation.

56-60 min

Deliver the interview synthesis.

Frame reliability as an envelope around nondeterminism: bound demand, reduce blast radius, degrade honestly, roll out with evidence, and convert incidents into better gates.

Course 12 Reading List

There are exactly three required readings: one provider-limit source, one overload-control source, and one rollout-safety source.

Required

OpenAI API: Rate Limits

OpenAI's guide to organization, project, model, and usage limits, including response headers, error mitigation, per-user caps, and retry behavior.

Read for: how provider limits should shape admission control, backoff, retry budgets, customer messaging, and workload isolation.

Required

Google SRE Book: Handling Overload

A reliability classic on degraded responses, direct capacity signals, per-customer limits, client-side throttling, criticality, utilization, and retry budgets.

Read for: the vocabulary to discuss backpressure, load shedding, and safe degradation for expensive AI query workloads.

Required

Google SRE Workbook: Canary Release

A practical deployment-safety guide covering canary population, duration, SLO and error-budget risk, metric choice, control comparison, isolation, and rollback.

Read for: how to safely ship model, prompt, retriever, tool, and policy changes without treating production traffic as one giant experiment.

Optional Refresher

Google SRE: AI Engineering - Reliable Operations

A current SRE perspective on AI agents in production operations, including transparency, real-time risk evaluation, progressive authorization, memory, eval data, and tool guardrails.

Skim for: language that connects agentic autonomy, production controls, and incident response in an AI infra interview.

Readiness Checklist

You are ready for the interview version of this topic when you can turn reliability from a slogan into concrete controls and tradeoffs.

You can define admission control before expensive work: tenant quotas, per-user caps, token budgets, BigQuery bytes limits, tool fanout limits, and provider rate-limit headers.
You can explain when to retry, queue, shed, or fail fast, including retry caps, jitter, timeout budgets, and the risk of multiplying load through nested agent retries.
You can design graceful degradation paths: cached metadata, narrower retrieval, smaller model, dry-run-only SQL, delayed execution, partial summaries, or human approval for risky actions.
You can separate interactive analyst tasks from scheduled agent tasks with different criticality, latency SLOs, queue priorities, and cancellation behavior.
You can name model-drift signals for a query copilot: eval regression, user corrections, SQL dry-run errors, table-selection shifts, policy denials, cost spikes, and unsupported-answer rates.
You can canary model, prompt, retriever, and tool changes with control comparisons, representative traffic, release gates, rollback triggers, and post-release monitoring.
You can close the incident loop by turning traces, failed canaries, user escalations, and outage notes into runbooks, regression evals, quota changes, and safer defaults.
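The first checklist item, admission control before expensive work, can be made concrete with a per-tenant budget check. This is a sketch under assumptions: `TenantBudget` and `admit_task` are hypothetical names, the budget fields are illustrative, and the byte estimate is assumed to come from something like a BigQuery dry run performed before execution.

```python
from dataclasses import dataclass


@dataclass
class TenantBudget:
    tokens_left: int       # remaining model tokens for this window
    bytes_left: int        # remaining BigQuery bytes for this window
    max_tool_calls: int    # fanout limit for a single task


def admit_task(budget, est_tokens, est_bytes, planned_tool_calls):
    """Check every limit before any expensive work starts.

    Returns (admitted, reason). The budget is only debited on admission,
    so a rejected task costs the tenant nothing.
    """
    if planned_tool_calls > budget.max_tool_calls:
        return False, "tool fanout limit exceeded"
    if est_tokens > budget.tokens_left:
        return False, "token budget exhausted"
    if est_bytes > budget.bytes_left:
        return False, "BigQuery bytes budget exhausted"
    budget.tokens_left -= est_tokens
    budget.bytes_left -= est_bytes
    return True, "admitted"
```

Returning a reason string alongside the decision makes it easier to give callers an honest message about why a task was deferred or refused.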

Interview Drill: AI Infra System Design

Prompt: design production reliability for a BigQuery GenAI query engine that calls frontier models, retrieval services, policy checks, dry-run validation, and BigQuery execution tools.

Start with SLOs: availability, p95 latency, time to first token, successful task completion, cost per task, dry-run success, correction rate, and unsupported-answer rate.
Define demand controls: tenant quotas, provider-limit adapters, per-user caps, token and BigQuery byte budgets, queue classes, concurrency limits, and burst protection.
Design overload behavior: classify requests by criticality, shed scheduled work first, cap retries, return honest degraded answers, and prevent tool loops with circuit breakers.
Design provider-failure handling: read rate-limit headers, use jittered backoff, switch to smaller or alternate models only when quality risk is acceptable, and expose retry-after state to callers.
Design rollout safety: canary model, prompt, retriever, SQL tool, and policy changes with control populations, eval gates, SLO gates, cost gates, safety gates, and automatic rollback.
Design drift detection: compare live traces against golden tasks, monitor table-selection and dry-run error shifts, sample high-risk domains, and require review for sudden behavioral changes.
Close with operations: dashboards, alert routes, runbooks, incident commander handoff, customer-safe status language, replay bundles, postmortems, and new regression tests from every serious incident.
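For the drift-detection step, one simple way to quantify a table-selection shift is the total variation distance between the tables chosen on a baseline window and on live traffic. The sketch below assumes each trace has already been reduced to the name of the table it selected; the function names and the 0.2 alert threshold are illustrative assumptions, not recommended values.

```python
from collections import Counter


def distribution(items):
    """Turn a list of table names into relative frequencies."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}


def total_variation_distance(baseline, live):
    """0.0 means identical table-selection mixes; 1.0 means fully disjoint."""
    p, q = distribution(baseline), distribution(live)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def drift_alert(baseline, live, threshold=0.2):
    """Flag the live window for review when the shift exceeds the threshold."""
    return total_variation_distance(baseline, live) > threshold
```

For example, a baseline of 80% `orders` and 20% `users` against a live mix of 50% `orders`, 30% `users`, and 20% `refunds` gives a distance of 0.3, which would trip the assumed threshold and route the change to review.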

