AI Learning Ramp | Course 3

System-Design Frame

Assume your BigQuery-connected GenAI analyst serves three request classes: interactive dashboard Q&A, longer schema-heavy SQL generation, and tool-using agentic investigations. Your job is to keep interactive p95 latency tight while preventing long-running work from wrecking GPU cost and queue health.

Course 3: Latency And Cost Models

One-hour objective: defend a latency-and-cost operating model for mixed interactive and agentic workloads using explicit metrics, queue policy, and prompt-caching strategy.

0-8 min

Define the scorecard.

Write down the four numbers you will manage to: p95 TTFT, p95 end-to-end latency, per-user TPS (tokens per second), and cost per successful request.
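
As a rough illustration, the scorecard can be computed from per-request logs. This is a minimal sketch, not a reference implementation: the `RequestRecord` fields, the nearest-rank percentile, and the choice to report median per-user TPS are all assumptions made for the example.

```python
from dataclasses import dataclass
import math


@dataclass
class RequestRecord:
    # Hypothetical log schema; field names are assumptions for this sketch.
    ttft_s: float        # time to first token, in seconds
    total_s: float       # end-to-end latency, in seconds
    output_tokens: int   # tokens generated for this request
    cost_usd: float      # spend attributed to this request
    ok: bool             # whether the request succeeded


def percentile(values, pct):
    """Nearest-rank percentile, with pct in (0, 100]."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def scorecard(records):
    ok = [r for r in records if r.ok]
    if not ok:
        raise ValueError("no successful requests")
    # Per-user decode speed: output tokens over the time after the first token.
    tps = [r.output_tokens / (r.total_s - r.ttft_s)
           for r in ok if r.total_s > r.ttft_s]
    return {
        "p95_ttft_s": percentile([r.ttft_s for r in ok], 95),
        "p95_e2e_s": percentile([r.total_s for r in ok], 95),
        "median_user_tps": percentile(tps, 50) if tps else 0.0,
        # Failed requests still cost money, so total spend is divided
        # by the number of successes only.
        "cost_per_success_usd": sum(r.cost_usd for r in records) / len(ok),
    }
```

Dividing total spend by successful requests only is a deliberate choice here: retries and failures show up as a higher unit cost instead of disappearing from the number.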

8-22 min

Read the OpenAI latency taxonomy.

Use the seven-principles framework to separate token-speed issues from request-count, prompt-shape, and UX choices.

22-34 min

Anchor on Anthropic's latency guidance.

Focus on TTFT, model choice, output-length control, and streaming so you can explain real versus perceived responsiveness.
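
To make the real-versus-perceived distinction concrete, the sketch below derives TTFT, mean inter-token latency, and total latency from token arrival timestamps. The function name and the returned keys are assumptions for illustration; the point is that streaming leaves total generation time unchanged while moving the first visible output from the end of the response to the first token.

```python
def stream_timings(request_start_s, token_times_s):
    """Summarize one response from its token arrival timestamps (seconds)."""
    if not token_times_s:
        raise ValueError("no tokens received")
    ttft = token_times_s[0] - request_start_s
    total = token_times_s[-1] - request_start_s
    # Gaps between consecutive tokens approximate inter-token latency (ITL).
    gaps = [b - a for a, b in zip(token_times_s, token_times_s[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return {
        "ttft_s": ttft,
        "mean_itl_s": mean_itl,
        "total_s": total,
        # With streaming the user sees output at the first token;
        # without it, nothing is visible until the last token arrives.
        "first_visible_streaming_s": ttft,
        "first_visible_buffered_s": total,
    }
```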

34-46 min

Study prompt caching mechanics.

Map exact-prefix reuse, retention, and cache-routing behavior to repeated agent workflows and shared system prompts.
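
One way to reason about exact-prefix reuse is to assemble prompts with stable content first and volatile content last, then check how much of the prefix two requests share. The sketch below works at the level of whole segments, which is a simplification: real caches match on tokens and apply provider-specific rules. `build_prompt` and `shared_prefix_segments` are hypothetical helpers, not part of any SDK.

```python
def build_prompt(system_prompt, tool_definitions, schema_context, user_question):
    """Order segments from most stable to most volatile."""
    # Sorting tool definitions keeps their order deterministic across requests,
    # so an incidental reordering does not break the shared prefix.
    return [system_prompt, *sorted(tool_definitions), schema_context, user_question]


def shared_prefix_segments(prompt_a, prompt_b):
    """Count leading segments that are identical in both prompts."""
    count = 0
    for a, b in zip(prompt_a, prompt_b):
        if a != b:
            break
        count += 1
    return count
```

With this ordering, two requests that differ only in the user question share every segment except the last. Putting the question first would leave nothing to reuse, because the prefix diverges at the first segment.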

46-54 min

Refresh the metric glossary if needed.

Use the optional metrics guide only if you want sharper language for TTFT, ITL, TPS, RPS, and goodput before the drill.

54-60 min

Deliver the interview synthesis.

State your routing lanes, admission policy, cache strategy, and one explicit tradeoff you are making to keep cost bounded.

Course 3 Reading List

Keep this session tight: exactly three required readings (about 38 minutes total) and one optional refresher.

Required - 14 min

OpenAI: Latency Optimization

Official OpenAI guide that frames latency work as seven levers, not just faster inference. It is the cleanest source for thinking about token speed, request count, parallelism, and product-side responsiveness together.

Extract: which levers change TTFT, which change total latency, and which remove work from the model entirely.

Required - 12 min

Anthropic: Reducing Latency

Official Anthropic guidance on measuring baseline latency and TTFT, then reducing it through model choice, shorter prompts and outputs, and streaming.

Extract: which changes improve actual generation speed versus which mostly improve perceived responsiveness.

Required - 12 min

OpenAI: Prompt Caching

Official mechanics for cache-friendly prompt design, including exact-prefix matching, retention policies, and the latency/cost impact of repeated long prefixes.

Extract: how prompt assembly order changes both p95 latency and input-token cost for repeated agent or analytics workflows.

Optional - 8 min

Anyscale: Understand LLM Latency And Throughput Metrics

Use this only if you want a tighter metric glossary before the drill. It cleanly distinguishes TTFT, ITL, TPOT, TPS, RPS, goodput, and p95/p99 behavior.

Extract: which two metrics you will use for user experience and which two you will use for fleet efficiency.

Readiness Checklist

You can explain why p95 TTFT, ITL or TPOT, TPS, RPS, and goodput should not be collapsed into a single performance number.
You can defend when to separate interactive and asynchronous workloads into different serving lanes instead of one mixed queue.
You can describe how prompt prefix structure affects cache hit rate, latency, and input-token spend.
You can reason through what happens when prompt length or concurrency doubles for a schema-heavy AI analyst workflow.
You can define an admission-control or degradation policy for long agentic requests during peak interactive traffic.
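
As a starting point for the admission-control item, here is a minimal sketch of a policy that protects the interactive lane. The class names, the concurrency cap, and the TTFT budget are all illustrative assumptions; a real policy would be tuned against measured traffic.

```python
from dataclasses import dataclass


@dataclass
class LaneState:
    # Hypothetical snapshot of fleet state at decision time.
    agentic_in_flight: int
    interactive_p95_ttft_s: float


def admit(request_class, state, *, agentic_cap=8, ttft_budget_s=1.0):
    """Return 'admit', 'queue', or 'defer' for an incoming request."""
    if request_class == "interactive":
        # Interactive traffic is never held back by this policy.
        return "admit"
    if request_class != "agentic":
        raise ValueError(f"unknown request class: {request_class}")
    if state.interactive_p95_ttft_s > ttft_budget_s:
        # Interactive latency is already over budget: push long work to later.
        return "defer"
    if state.agentic_in_flight >= agentic_cap:
        # Under budget but at the concurrency cap: wait for a free slot.
        return "queue"
    return "admit"
```

The explicit tradeoff in this sketch is that agentic requests absorb all of the delay during peaks, which bounds interactive p95 TTFT at the price of slower investigations.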

Interview Drill: AI Infra System Design

Prompt: "Design the serving policy for a BigQuery-native AI analyst that supports interactive Q&A, scheduled batch enrichment, and agentic investigations without blowing p95 latency or GPU spend."

Metrics contract: Pick the 4 metrics you will show leadership and the 4 metrics you will page on, including at least one user-facing latency metric and one cost-efficiency metric.
Workload segmentation: Decide whether interactive chat, long-context SQL generation, and async agent runs share a fleet or get separate lanes, then justify the split.
Queue and admission control: Define concurrency caps, backpressure behavior, and what gets delayed, downgraded, or rejected when prefills spike.
Cache strategy: Explain how you structure prompts, tool definitions, and reusable context so repeated workloads get higher cache hit rates.
Capacity and unit economics: Convert request mix into GPU demand and state when smaller models, shorter outputs, or asynchronous processing lower cost without violating the UX bar.
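
For the capacity item, a back-of-the-envelope conversion from request mix to GPU count can look like the sketch below. It assumes a deliberately crude model: each GPU has a fixed prefill throughput and a fixed decode throughput in tokens per second, the two phases time-share the same GPU, and the fleet runs at a target utilization to leave headroom. All throughput figures in the usage example are made up for illustration.

```python
import math


def gpus_needed(mix, prefill_tps_per_gpu, decode_tps_per_gpu, target_utilization=0.5):
    """Estimate GPU count for a request mix.

    mix: list of (requests_per_second, input_tokens, output_tokens) tuples.
    """
    prefill_demand = sum(rps * inp for rps, inp, _ in mix)   # input tokens/s
    decode_demand = sum(rps * out for rps, _, out in mix)    # output tokens/s
    # GPU-seconds of work arriving per wall-clock second.
    busy_gpus = (prefill_demand / prefill_tps_per_gpu
                 + decode_demand / decode_tps_per_gpu)
    return math.ceil(busy_gpus / target_utilization)


# Illustrative mix: interactive Q&A plus schema-heavy SQL generation.
mix = [(2.0, 1000, 200), (0.5, 8000, 400)]
baseline = gpus_needed(mix, prefill_tps_per_gpu=10_000, decode_tps_per_gpu=1_000)

# Same traffic with every prompt twice as long: only prefill demand doubles.
doubled = [(rps, 2 * inp, out) for rps, inp, out in mix]
with_long_prompts = gpus_needed(doubled, prefill_tps_per_gpu=10_000, decode_tps_per_gpu=1_000)
```

Under these assumed numbers, doubling prompt length raises the estimate from 3 GPUs to 4, while shortening outputs would instead reduce the decode term. The model ignores batching effects and cache hits, so treat it as a first-order bound only.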
