AI Learning Ramp | Course 2

System-Design Frame

Assume you run a BigQuery-connected GenAI analyst that serves mixed workloads: short dashboard questions, long schema-heavy SQL generation, and occasional agentic multi-step investigations. Your job is to pick the serving stack per request class, keep p95 latency inside SLO, and control GPU cost while preserving reliability.
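To make "pick the serving stack per request class" concrete, here is a minimal routing sketch. Everything in it is an assumption for illustration: the lane names, the 4000-token threshold, and the `Request` fields are hypothetical and not taken from any engine's API.

```python
from dataclasses import dataclass
from typing import Optional, Set

@dataclass(frozen=True)
class Request:
    prompt_tokens: int
    agentic: bool = False

# Hypothetical lanes per request class, listed in order of preference.
ROUTES = {
    "short_qa": ["default_lane", "fallback_lane"],
    "long_sql": ["long_context_lane", "default_lane"],
    "agentic": ["agent_lane", "default_lane"],
}

LONG_PROMPT_TOKENS = 4000  # assumed cutoff for schema-heavy prompts

def classify(req: Request) -> str:
    """Map a request to one of the three workload classes in the frame above."""
    if req.agentic:
        return "agentic"
    if req.prompt_tokens >= LONG_PROMPT_TOKENS:
        return "long_sql"
    return "short_qa"

def pick_lane(req: Request, healthy: Set[str]) -> Optional[str]:
    """Return the first healthy lane for the request's class, or None."""
    for lane in ROUTES[classify(req)]:
        if lane in healthy:
            return lane
    return None  # no healthy lane: the caller should queue or shed load

lane = pick_lane(Request(prompt_tokens=9000), {"default_lane"})
# lane == "default_lane" because the preferred long-context lane is not healthy
```

Keeping classification separate from lane selection means the fallback order can change without touching the classifier.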

Course 2: Serving Engine Tradeoffs

One-hour objective: defend an engine-selection strategy for vLLM, TGI, TensorRT-LLM, and SGLang under explicit latency, throughput, reliability, and operability constraints.

0-8 min

Define decision criteria.

Write your scorecard: p95 TTFT, tokens/s/GPU, memory efficiency, failover model, and ops complexity.
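One possible way to turn the scorecard into a comparable number is a weighted sum. In this sketch the weights, the 1-5 ratings, and the names `engine_a` and `engine_b` are placeholders, not measurements of any real engine; substitute your own benchmark results.

```python
# Hypothetical weights for the five scorecard criteria (they sum to 1.0).
WEIGHTS = {
    "p95_ttft": 0.30,
    "tokens_per_s_per_gpu": 0.25,
    "memory_efficiency": 0.15,
    "failover_model": 0.15,
    "ops_complexity": 0.15,
}

def weighted_score(ratings):
    """Combine 1-5 ratings (5 is best) into one weighted score."""
    missing = set(WEIGHTS) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings: {sorted(missing)}")
    return sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS)

# Placeholder ratings for two unnamed candidates.
candidates = {
    "engine_a": {"p95_ttft": 4, "tokens_per_s_per_gpu": 4,
                 "memory_efficiency": 5, "failover_model": 3,
                 "ops_complexity": 4},
    "engine_b": {"p95_ttft": 5, "tokens_per_s_per_gpu": 5,
                 "memory_efficiency": 4, "failover_model": 3,
                 "ops_complexity": 2},
}

# Highest score first.
ranked = sorted(candidates, key=lambda e: weighted_score(candidates[e]),
                reverse=True)
```

With these placeholder numbers `engine_b` edges out `engine_a` (about 4.1 versus 4.0), and raising the weight on ops complexity flips the order. That sensitivity is the reason to write the weights down before you benchmark.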

8-22 min

Study vLLM fundamentals.

Review PagedAttention and why memory fragmentation directly affects usable batching and practical throughput.
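The fragmentation argument can be checked with a toy accounting model. This sketch is not vLLM's allocator: it only compares KV slots reserved but never used when each request pre-reserves a contiguous maximum-length region versus when memory is handed out in fixed-size blocks. The sequence lengths, `max_len`, and `block_size` are assumed values.

```python
def contiguous_waste(seq_lens, max_len):
    """Unused token slots when every request reserves max_len slots up front."""
    return sum(max_len - n for n in seq_lens)

def paged_waste(seq_lens, block_size):
    """Unused token slots when memory is allocated in fixed-size blocks.

    Only the last block of each sequence can be partially empty.
    """
    total = 0
    for n in seq_lens:
        blocks = (n + block_size - 1) // block_size  # integer ceiling
        total += blocks * block_size - n
    return total

seq_lens = [120, 900, 45, 2048, 310]   # assumed actual lengths in tokens
max_len = 2048                         # assumed per-request reservation
block_size = 16                        # assumed block size in tokens

wasted_contiguous = contiguous_waste(seq_lens, max_len)  # 6817 slots
wasted_paged = paged_waste(seq_lens, block_size)         # 33 slots
```

In this example about two thirds of the contiguously reserved slots (6817 of 10240) are never used, while block allocation wastes 33. Memory that is not wasted can hold more concurrent sequences, which is how the allocation strategy turns into batch size and throughput.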

22-34 min

Understand TGI positioning.

Use the official Hugging Face engine guidance to position TGI against newer engine options in your platform strategy.

34-46 min

Read TensorRT-LLM deployment tradeoffs.

Focus on disaggregated serving and where specialized performance optimization justifies operational complexity.

46-54 min

Map SGLang into your stack.

Skim the optional runtime docs to understand when SGLang should be tested as a candidate in your routing layer.

54-60 min

Final interview synthesis.

Deliver a 90-second recommendation with one primary engine, one fallback path, and one risk mitigation plan.

Course 2 Reading List

Keep this session tight: exactly three required readings (about 38 minutes total) and one optional 8-minute refresher.

Required - 14 min

Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM paper)

Primary research source for vLLM's core design. It gives interview-safe language for explaining memory fragmentation, throughput limits, and why paging-style KV management matters.

Extract: how KV cache allocation strategy becomes a first-order performance constraint.
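A back-of-the-envelope calculation shows why KV allocation is a first-order constraint. The per-token formula below is the standard one for a decoder-only transformer; the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements), the 20 GiB KV budget, and the 4096-token context are assumptions chosen for illustration.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """KV cache bytes needed to store one token across all layers."""
    # The factor of 2 covers the key tensor and the value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

def max_cached_tokens(kv_budget_bytes, per_token_bytes):
    """How many tokens fit in the memory left over for KV cache."""
    return kv_budget_bytes // per_token_bytes

# Assumed model shape and memory budget (hypothetical, not a specific model).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8,
                               head_dim=128, bytes_per_elem=2)
budget = 20 * 1024**3                 # 20 GiB left after weights
tokens = max_cached_tokens(budget, per_token)
concurrent_4k = tokens // 4096        # full 4096-token sequences that fit

# per_token == 131072 bytes (128 KiB), tokens == 163840, concurrent_4k == 40
```

Under these assumptions the GPU can hold roughly 40 full-length sequences at once, so any KV memory lost to fragmentation directly lowers the achievable batch size.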

Required - 10 min

Hugging Face TGI Engine Guide

Official engine guidance from Hugging Face, including current positioning of TGI and migration context to newer engines such as vLLM and SGLang.

Extract: when TGI is operationally acceptable versus when modern engine alternatives are better long-term choices.

Required - 14 min

TensorRT-LLM: Disaggregated Serving

Official NVIDIA technical docs on splitting prefill and decode into separate serving groups. High-signal for advanced latency and utilization tradeoffs at scale.

Extract: when disaggregated prefill/decode architecture improves SLO adherence and fleet efficiency.
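The motivation for splitting prefill and decode can be illustrated with a toy timeline. This is a simplified model, not TensorRT-LLM behavior: it assumes a prefill for another request fully blocks decode steps on a shared worker, ignores chunked prefill and batching, and uses made-up durations (20 ms per decode step, 400 ms per prefill).

```python
def inter_token_gaps_ms(schedule, decode_step_ms, prefill_ms):
    """Gaps between consecutive tokens of one in-flight request.

    schedule is a string of "D" (a decode step for our request) and
    "P" (a prefill for some other request that blocks the same worker).
    """
    gaps = []
    blocked = 0
    for item in schedule:
        if item == "P":
            blocked += prefill_ms
        elif item == "D":
            gaps.append(blocked + decode_step_ms)
            blocked = 0
        else:
            raise ValueError(f"unknown schedule item: {item!r}")
    return gaps

# Shared worker: prefills from other requests interleave with our decode steps.
colocated = inter_token_gaps_ms("DDPDDPPD", decode_step_ms=20, prefill_ms=400)
# Dedicated decode worker: no prefills land between our decode steps.
disaggregated = inter_token_gaps_ms("DDDDD", decode_step_ms=20, prefill_ms=400)

# colocated == [20, 20, 420, 20, 820]; disaggregated == [20, 20, 20, 20, 20]
```

The price of the smoother decode path is that the KV cache produced by the prefill group has to be transferred to the decode group, and that two pools now need separate capacity planning. Whether that trade pays off depends on how prefill-heavy the traffic is.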

Optional - 8 min

SGLang Documentation (Overview + Runtime)

Use only if you need a quick refresh on SGLang runtime capabilities before the drill. Treat it as context, not mandatory reading for this hour.

Extract: where SGLang can fit as an alternative serving/runtime layer in a multi-engine strategy.

Readiness Checklist

You can explain why KV memory strategy changes real throughput, not just theoretical throughput.
You can state one reason to choose vLLM, one reason to keep TGI, and one reason to adopt TensorRT-LLM for a specific workload slice.
You can describe when prefill/decode disaggregation is worth the added operational complexity.
You can propose a routing policy by query class (short Q&A, long SQL generation, multi-turn agent workflows).
You can name the top 3 telemetry signals to judge engine health: TTFT, inter-token latency, and GPU memory/utilization pressure.
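As a reference for the telemetry item, TTFT and inter-token latency can both be derived from per-token timestamps. This is a minimal sketch that assumes you already log a request start time and each token's emit time in integer milliseconds; the nearest-rank percentile is one common convention, and your metrics backend may interpolate differently.

```python
def ttft_ms(request_start_ms, token_times_ms):
    """Time to first token: first token's emit time minus request start."""
    return token_times_ms[0] - request_start_ms

def inter_token_latencies_ms(token_times_ms):
    """Gaps between consecutive token emit times."""
    return [b - a for a, b in zip(token_times_ms, token_times_ms[1:])]

def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = -(-pct * len(ordered) // 100)  # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]

# Assumed timestamps for one streamed response (milliseconds).
start = 0
token_times = [180, 205, 231, 330, 355]

first_token = ttft_ms(start, token_times)          # 180
gaps = inter_token_latencies_ms(token_times)       # [25, 26, 99, 25]
p95_gap = percentile_nearest_rank(gaps, 95)        # 99
```

In practice you would aggregate these per engine and per request class, so that a slow lane is visible instead of being averaged away.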

Interview Drill: AI Infra System Design

Prompt: "Design the serving layer for a BigQuery-native AI analyst with strict p95 latency SLOs and variable query complexity." Keep answers explicit about engine routing and failure handling.

Architecture: Describe gateway, tenant-aware prompt assembly, model/engine router, inference fleet, streaming layer, and observability.
Engine policy: Propose a default engine and at least one specialized lane (for example, optimized long-context or high-throughput decode lanes).
SLO strategy: Explain admission control, backpressure, and overload behavior during schema-heavy traffic spikes.
Capacity plan: Show how you convert target QPS and token distributions into GPU count and buffer capacity.
Reliability path: Define fallback order when an engine is degraded and how to preserve user-visible behavior during failover.
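For the capacity-plan item, one simple conversion from QPS and token counts to GPU count looks like the sketch below. All inputs are hypothetical: the per-class QPS and mean token counts, the per-GPU prefill and decode throughputs, and the 30% headroom. Using means also hides tail behavior, so treat the result as a first estimate to validate with load tests.

```python
import math

# Assumed traffic mix: (QPS, mean input tokens, mean output tokens) per class.
TRAFFIC = {
    "short_qa": (20, 500, 100),
    "long_sql": (5, 6000, 400),
    "agentic": (1, 8000, 1500),
}

def token_demand(traffic):
    """Return (prefill tokens/s, decode tokens/s) summed over all classes."""
    prefill = sum(qps * tin for qps, tin, _ in traffic.values())
    decode = sum(qps * tout for qps, _, tout in traffic.values())
    return prefill, decode

def gpus_needed(traffic, prefill_tps_per_gpu, decode_tps_per_gpu, headroom):
    """GPU count = (prefill share + decode share) * (1 + headroom), rounded up."""
    prefill, decode = token_demand(traffic)
    base = prefill / prefill_tps_per_gpu + decode / decode_tps_per_gpu
    return math.ceil(base * (1 + headroom))

prefill_demand, decode_demand = token_demand(TRAFFIC)   # 48000, 5500
gpus = gpus_needed(TRAFFIC,
                   prefill_tps_per_gpu=16000,   # assumed sustained rate
                   decode_tps_per_gpu=2000,     # assumed sustained rate
                   headroom=0.30)               # assumed buffer
# 48000/16000 + 5500/2000 = 5.75 GPU-equivalents; with 30% headroom, gpus == 8
```

Keeping prefill and decode demand separate makes it visible that schema-heavy traffic is dominated by input tokens, which is the same signal that informs a disaggregation decision.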
