
AI Learning Ramp

Latency and cost models for interview-grade LLM serving decisions.

Course 3 is a one-hour systems session on TTFT, ITL, throughput, prompt caching, and unit economics, built to help you defend batching, routing, and admission-control choices in a frontier AI interview.

Course 3 of 24 | Published June 4, 2026 | Focus: latency + cost modeling | Target: OpenAI / Anthropic interviews

System-Design Frame

Assume your BigQuery-connected GenAI analyst serves three request classes: interactive dashboard Q&A, longer schema-heavy SQL generation, and tool-using agentic investigations. Your job is to keep interactive p95 latency tight while preventing long-running work from wrecking GPU cost and queue health.

Course 3: Latency And Cost Models

One-hour objective: defend a latency-and-cost operating model for mixed interactive and agentic workloads using explicit metrics, queue policy, and prompt-caching strategy.

Define the scorecard.

Write down the few numbers you will manage to: p95 TTFT, p95 end-to-end latency, per-user output tokens per second (TPS), and cost per successful request.
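As a rough illustration, the scorecard can be computed from per-request logs. The sketch below is not tied to any real logging schema: the `RequestLog` fields, the nearest-rank percentile, and the use of the median for per-user TPS are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestLog:
    """One served request. Field names are hypothetical."""
    ttft_s: float        # time to first token, in seconds
    e2e_s: float         # end-to-end latency, in seconds
    output_tokens: int   # tokens generated for this request
    cost_usd: float      # cost attributed to this request
    success: bool        # whether the request met its quality bar


def percentile(values, pct):
    """Nearest-rank percentile for pct in (0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of empty data")
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def scorecard(logs):
    """Summarize the four scorecard numbers from a non-empty list of logs."""
    # Per-user decode speed: tokens after the first, divided by time after TTFT.
    decode_rates = [
        (r.output_tokens - 1) / (r.e2e_s - r.ttft_s)
        for r in logs
        if r.output_tokens > 1 and r.e2e_s > r.ttft_s
    ]
    successes = sum(1 for r in logs if r.success)
    total_cost = sum(r.cost_usd for r in logs)
    return {
        "p95_ttft_s": percentile([r.ttft_s for r in logs], 95),
        "p95_e2e_s": percentile([r.e2e_s for r in logs], 95),
        "median_user_tps": percentile(decode_rates, 50) if decode_rates else None,
        # Failed requests still cost money, so divide total cost by successes only.
        "cost_per_success_usd": total_cost / successes if successes else float("inf"),
    }
```

Dividing total cost by successful requests only is a deliberate choice: retries and failures then show up as a worse unit cost instead of disappearing from the metric.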

Read the OpenAI latency taxonomy.

Use the seven-principles framework to separate token-speed issues from request-count, prompt-shape, and UX choices.

Anchor on Anthropic's latency guidance.

Focus on TTFT, model choice, output-length control, and streaming so you can explain real versus perceived responsiveness.
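A simple way to keep real and perceived responsiveness apart is a two-term latency model. The sketch below assumes a constant inter-token latency after the first token, which real systems only approximate; the function names are illustrative.

```python
def end_to_end_latency(ttft_s, itl_s, output_tokens):
    """First token arrives after ttft_s, then one token every itl_s seconds."""
    return ttft_s + itl_s * max(output_tokens - 1, 0)


def time_to_first_visible_text(ttft_s, itl_s, output_tokens, streaming):
    """With streaming the user sees text at TTFT; without it, only when generation ends."""
    if streaming:
        return ttft_s
    return end_to_end_latency(ttft_s, itl_s, output_tokens)
```

Under this model, output-length control shrinks the second term and so cuts real latency, while streaming leaves the total unchanged and only moves the moment the user first sees text.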

Study prompt caching mechanics.

Map exact-prefix reuse, retention, and cache-routing behavior to repeated agent workflows and shared system prompts.
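Exact-prefix reuse mostly comes down to assembly order: stable, shared content first and per-request content last. The sketch below is illustrative only. `build_prompt` and `shared_prefix_chars` are hypothetical helpers, and character overlap is a rough proxy, since real caches match on tokens and may apply minimum lengths or block granularity.

```python
def build_prompt(system_prompt, tool_definitions, schema_context, user_question):
    """Put static, shared parts first and the per-request question last."""
    return "\n\n".join([system_prompt, tool_definitions, schema_context, user_question])


def shared_prefix_chars(a, b):
    """Length of the common leading substring of two prompts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


SYSTEM = "You are a careful SQL analyst."
TOOLS = "tools: run_query, describe_table"
SCHEMA = "table orders(id, total)"

first = build_prompt(SYSTEM, TOOLS, SCHEMA, "What was revenue last week?")
second = build_prompt(SYSTEM, TOOLS, SCHEMA, "How many orders shipped?")

# Everything before the question is identical across the two requests.
reusable = shared_prefix_chars(first, second)
```

If the question or a timestamp were placed at the front instead, the shared prefix would drop to near zero even though most of the prompt is unchanged.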

Refresh the metric glossary if needed.

Use the optional metrics guide only if you want sharper language for TTFT, ITL, TPS, RPS, and goodput before the drill.

Deliver the interview synthesis.

State your routing lanes, admission policy, cache strategy, and one explicit tradeoff you are making to keep cost bounded.
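One way to make the admission policy concrete is a per-lane concurrency cap with a bounded queue. The sketch below is a toy model under stated assumptions: the `Lane` and `AdmissionController` names, the lane caps, and the three outcomes ("admit", "queue", "reject") are invented for illustration and are not taken from any serving framework.

```python
from dataclasses import dataclass


@dataclass
class Lane:
    max_in_flight: int   # concurrency cap for this lane
    max_queued: int      # backpressure limit before requests are shed
    in_flight: int = 0
    queued: int = 0


class AdmissionController:
    def __init__(self, lanes):
        self.lanes = lanes   # mapping of lane name to Lane

    def admit(self, lane_name):
        """Return 'admit', 'queue', or 'reject' for a new request on a lane."""
        lane = self.lanes[lane_name]
        if lane.in_flight < lane.max_in_flight:
            lane.in_flight += 1
            return "admit"
        if lane.queued < lane.max_queued:
            lane.queued += 1
            return "queue"
        return "reject"

    def finish(self, lane_name):
        """Release a slot; a queued request, if any, takes it immediately."""
        lane = self.lanes[lane_name]
        if lane.queued > 0:
            lane.queued -= 1          # in_flight stays the same
        elif lane.in_flight > 0:
            lane.in_flight -= 1
```

A plausible configuration gives the interactive lane no queue at all, so overload surfaces as a fast rejection or downgrade, while the agentic lane accepts a deeper queue because its callers tolerate delay.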

Course 3 Reading List

Keep this session tight: exactly three required readings (about 38 minutes total) and one optional refresher.

Required - 14 min

OpenAI: Latency Optimization

Official OpenAI guide that frames latency work as seven levers, not just faster inference. It is the cleanest source for thinking about token speed, request count, parallelism, and product-side responsiveness together.

Extract: which levers change TTFT, which change total latency, and which remove work from the model entirely.

Required - 12 min

Anthropic: Reducing Latency

Official Anthropic guidance on measuring baseline latency and TTFT, then reducing it through model choice, shorter prompts and outputs, and streaming.

Extract: which changes improve actual generation speed versus which mostly improve perceived responsiveness.

Required - 12 min

OpenAI: Prompt Caching

Official mechanics for cache-friendly prompt design, including exact-prefix matching, retention policies, and the latency/cost impact of repeated long prefixes.

Extract: how prompt assembly order changes both p95 latency and input-token cost for repeated agent or analytics workflows.
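The cost side of that extract can be approximated with one line of arithmetic. In the sketch below, the price per million tokens and the cached-token price ratio are placeholder parameters, not published rates; check the provider's current pricing before relying on any number.

```python
def input_cost_usd(prompt_tokens, cached_tokens, price_per_mtok, cached_price_ratio):
    """Input cost when cached tokens are billed at a fraction of the normal rate."""
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached = prompt_tokens - cached_tokens
    billable = uncached + cached_tokens * cached_price_ratio
    return billable * price_per_mtok / 1_000_000


# Hypothetical numbers: a 10,000-token prompt, 8,000 tokens served from cache,
# $2.00 per million input tokens, cached tokens billed at half price.
cold = input_cost_usd(10_000, 0, 2.0, 0.5)
warm = input_cost_usd(10_000, 8_000, 2.0, 0.5)
```

With these assumed values the warm request costs 60% of the cold one, which is why a long shared prefix matters most for workflows that repeat many times per hour.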

Optional - 8 min

Anyscale: Understand LLM Latency and Throughput Metrics

Use this only if you want a tighter metric glossary before the drill. It cleanly distinguishes TTFT, ITL, TPOT, TPS, RPS, goodput, and p95/p99 behavior.

Extract: which two metrics you will use for user experience and which two you will use for fleet efficiency.


Interview Drill: AI Infra System Design

Prompt: "Design the serving policy for a BigQuery-native AI analyst that supports interactive Q&A, scheduled batch enrichment, and agentic investigations without blowing p95 latency or GPU spend."

  1. Metrics contract: Pick the 4 metrics you will show leadership and the 4 metrics you will page on, including at least one user-facing latency metric and one cost-efficiency metric.
  2. Workload segmentation: Decide whether interactive chat, long-context SQL generation, and async agent runs share a fleet or get separate lanes, then justify the split.
  3. Queue and admission control: Define concurrency caps, backpressure behavior, and what gets delayed, downgraded, or rejected when prefills spike.
  4. Cache strategy: Explain how you structure prompts, tool definitions, and reusable context so repeated workloads get higher cache hit rates.
  5. Capacity and unit economics: Convert request mix into GPU demand and state when smaller models, shorter outputs, or asynchronous processing lower cost without violating the UX bar.
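For step 5, a back-of-envelope conversion from request mix to GPU demand might look like the sketch below. It assumes each GPU sustains a fixed aggregate prefill rate and decode rate in tokens per second, and that prefill and decode time simply add. Real fleets batch, overlap, and vary by model, so treat every number as a placeholder.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestClass:
    name: str
    rps: float            # arrival rate, requests per second
    prompt_tokens: int    # average input length
    output_tokens: int    # average output length


def gpu_seconds_per_request(rc, prefill_tok_per_s, decode_tok_per_s):
    """GPU time one request consumes under the additive prefill + decode model."""
    return rc.prompt_tokens / prefill_tok_per_s + rc.output_tokens / decode_tok_per_s


def gpus_needed(mix, prefill_tok_per_s, decode_tok_per_s, target_utilization):
    """GPUs required so that average busy time stays at the target utilization."""
    busy = sum(
        rc.rps * gpu_seconds_per_request(rc, prefill_tok_per_s, decode_tok_per_s)
        for rc in mix
    )
    return math.ceil(busy / target_utilization)


def cost_per_request_usd(rc, prefill_tok_per_s, decode_tok_per_s, gpu_hourly_usd):
    """Raw GPU cost of one request, ignoring idle capacity and overhead."""
    seconds = gpu_seconds_per_request(rc, prefill_tok_per_s, decode_tok_per_s)
    return seconds * gpu_hourly_usd / 3600


# Hypothetical mix and hardware rates, chosen only to make the arithmetic visible.
mix = [
    RequestClass("interactive", rps=10, prompt_tokens=2_000, output_tokens=200),
    RequestClass("agentic", rps=1, prompt_tokens=20_000, output_tokens=2_000),
]
fleet = gpus_needed(mix, prefill_tok_per_s=20_000, decode_tok_per_s=2_000,
                    target_utilization=0.6)
```

In this made-up mix the agentic lane sends one tenth of the requests yet consumes as much GPU time as the interactive lane, which is the kind of observation that justifies shorter outputs, a smaller model, or asynchronous processing for the heavy class.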

Course 3 Sources

  1. OpenAI: Latency Optimization
  2. Anthropic: Reducing Latency
  3. OpenAI: Prompt Caching
  4. Anyscale: Understand LLM Latency and Throughput Metrics