Keep this session tight: exactly three required readings (about 38 minutes total) and one optional refresher.
Required - 14 min
Official OpenAI guide that frames latency work as seven levers, not just faster inference. It is the cleanest source for thinking about token speed, request count, parallelism, and product-side responsiveness together.
Extract: which levers change TTFT (time to first token), which change total latency, and which remove work from the model entirely.
Required - 12 min
Official Anthropic guidance on measuring baseline latency and TTFT, then reducing them through model choice, shorter prompts and outputs, and streaming.
Extract: which changes improve actual generation speed versus which mostly improve perceived responsiveness.
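To make that split concrete, it can help to time a streamed response yourself. The sketch below is an illustration under stated assumptions, not any vendor's API: `time_stream` and `StreamTiming` are hypothetical names, and `chunks` stands in for whatever iterator your SDK's streaming call returns. It assumes the iterator is lazy, so the request is actually sent inside the timed window.

```python
import time
from typing import Callable, Iterable, NamedTuple


class StreamTiming(NamedTuple):
    ttft_s: float   # time until the first chunk arrived
    total_s: float  # time until the stream was exhausted
    chunks: int     # number of chunks received


def time_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Consume a streamed response and record TTFT and total latency.

    `chunks` is assumed to be a lazy iterator over response chunks; the
    clock is injectable so the function can be checked without sleeping.
    """
    start = clock()
    first_seen = None
    count = 0
    for _ in chunks:
        if first_seen is None:
            first_seen = clock()  # first chunk marks time to first token
        count += 1
    end = clock()
    # An empty stream has no first chunk; fall back to the total time.
    ttft = (first_seen - start) if first_seen is not None else (end - start)
    return StreamTiming(ttft_s=ttft, total_s=end - start, chunks=count)
```

Streaming leaves `total_s` roughly where it was and only makes `ttft_s` visible to the user sooner, whereas a shorter output or a faster model should move `total_s` itself. Comparing the two numbers before and after a change is one way to tell which kind of improvement you made.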
Required - 12 min
Official documentation of the mechanics behind cache-friendly prompt design, including exact-prefix matching, retention policies, and the latency and cost impact of repeated long prefixes.
Extract: how prompt assembly order changes both p95 latency and input-token cost for repeated agent or analytics workflows.
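The effect of assembly order on exact-prefix matching can be illustrated with a toy comparison. This is a minimal sketch with loud assumptions: `STATIC_PREFIX`, `cache_friendly`, `cache_hostile`, and `shared_prefix_len` are hypothetical names, and the comparison is character-level, whereas real providers match on tokens and may apply minimum lengths or block granularity that this ignores.

```python
import os

# Hypothetical stable content: instructions and tool list that never change
# between requests in the same workflow.
STATIC_PREFIX = "You are an analytics agent.\nTools: run_sql, plot_chart\n"


def cache_friendly(user_query: str, timestamp: str) -> str:
    """Stable content first, per-request content last."""
    return STATIC_PREFIX + f"Time: {timestamp}\nQuestion: {user_query}\n"


def cache_hostile(user_query: str, timestamp: str) -> str:
    """Per-request content first, which changes the prefix on every call."""
    return f"Time: {timestamp}\n" + STATIC_PREFIX + f"Question: {user_query}\n"


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the longest common leading substring (character-level)."""
    return len(os.path.commonprefix([a, b]))


friendly = shared_prefix_len(
    cache_friendly("Top customers?", "09:00"),
    cache_friendly("Churn by region?", "17:30"),
)
hostile = shared_prefix_len(
    cache_hostile("Top customers?", "09:00"),
    cache_hostile("Churn by region?", "17:30"),
)
print(f"shared prefix: friendly={friendly} chars, hostile={hostile} chars")
```

In the friendly ordering the two requests share at least the whole static block; in the hostile ordering the shared prefix ends at the first differing timestamp character, so the long static block that follows cannot be reused under exact-prefix matching.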
Optional - 8 min
Use this only if you want a tighter metric glossary before the drill. It cleanly distinguishes TTFT, ITL, TPOT, TPS, RPS, goodput, and p95/p99 behavior.
Extract: which two metrics you will use for user experience and which two you will use for fleet efficiency.
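If you want to try a couple of these metrics on your own numbers before the drill, the sketch below shows one common way to compute them. The function names are hypothetical, and the definitions are assumptions: TPOT is taken as decode time divided by the output tokens after the first, and the percentile uses the nearest-rank method. Glossaries differ on both, so check the definitions used in the reading.

```python
from typing import Sequence


def tpot_s(ttft_s: float, total_s: float, output_tokens: int) -> float:
    """Time per output token, excluding the first token (assumed definition)."""
    if output_tokens < 2:
        raise ValueError("need at least two output tokens to compute TPOT")
    return (total_s - ttft_s) / (output_tokens - 1)


def percentile_nearest_rank(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(values)
    # Integer ceiling of p * n / 100 avoids floating-point rounding surprises.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Illustrative, made-up latencies in seconds; not measured data.
latencies = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.6, 2.5, 6.0]
print("p50:", percentile_nearest_rank(latencies, 50))
print("p95:", percentile_nearest_rank(latencies, 95))
```

The made-up sample shows why tail percentiles are tracked separately from the median: a single slow request dominates p95 while leaving p50 untouched.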