AI Learning Ramp

Serving engine tradeoffs for interview-grade AI infra decisions.

Course 2 is a one-hour systems session on when to use vLLM, TGI, TensorRT-LLM, or SGLang, and how to defend routing, SLO, and capacity choices in a frontier AI interview.

Course 2 of 24 | Published May 29, 2026 | Focus: AI infra system design | Target: OpenAI / Anthropic interviews

System-Design Frame

Assume you run a BigQuery-connected GenAI analyst that serves mixed workloads: short dashboard questions, long schema-heavy SQL generation, and occasional agentic multi-step investigations. Your job is to pick the serving stack per request class, keep p95 latency inside SLO, and control GPU cost while preserving reliability.
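To make "pick the serving stack per request class" concrete, here is a minimal routing sketch. The token threshold, lane names, and latency budgets are all hypothetical placeholders, not recommendations; the point is only that each of the three workload classes maps to an explicit lane with its own p95 budget.

```python
from dataclasses import dataclass

# Hypothetical cutoff between "short" and "schema-heavy" prompts.
LONG_PROMPT_TOKENS = 4000


@dataclass(frozen=True)
class Request:
    prompt_tokens: int
    is_agentic: bool = False


def classify(req: Request) -> str:
    """Assign a request to one of the three workload classes in the frame."""
    if req.is_agentic:
        return "agentic_investigation"
    if req.prompt_tokens >= LONG_PROMPT_TOKENS:
        return "schema_heavy_sql"
    return "short_dashboard"


# Hypothetical lane table: class -> (lane name, p95 TTFT budget in ms).
LANES = {
    "short_dashboard": ("default_lane", 500),
    "schema_heavy_sql": ("long_context_lane", 2000),
    "agentic_investigation": ("agentic_lane", 3000),
}


def route(req: Request) -> tuple[str, int]:
    """Return the lane and latency budget for a request."""
    return LANES[classify(req)]


print(route(Request(prompt_tokens=300)))
print(route(Request(prompt_tokens=12000)))
print(route(Request(prompt_tokens=800, is_agentic=True)))
```

In an interview, say which engine backs each lane and why; the table structure stays the same whichever engines you choose.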

Course 2: Serving Engine Tradeoffs

One-hour objective: defend an engine-selection strategy for vLLM, TGI, TensorRT-LLM, and SGLang under explicit latency, throughput, reliability, and operability constraints.

Define decision criteria.

Write your scorecard: p95 TTFT, tokens/s/GPU, memory efficiency, failover model, and ops complexity.
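One way to turn the scorecard into something you can defend is a weighted score per candidate. The sketch below assumes a 1-to-5 rating per criterion (higher is better, so rate low latency and low ops complexity high); the weights and the candidate ratings are illustrative placeholders, not measurements of any real engine.

```python
# Hypothetical weights for the five scorecard criteria; they should sum to 1.
CRITERIA_WEIGHTS = {
    "p95_ttft": 0.30,
    "tokens_per_s_per_gpu": 0.25,
    "memory_efficiency": 0.15,
    "failover_model": 0.15,
    "ops_complexity": 0.15,
}


def weighted_score(ratings: dict[str, float]) -> float:
    """Combine 1-5 ratings (higher is better) into one weighted score."""
    missing = set(CRITERIA_WEIGHTS) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    return sum(CRITERIA_WEIGHTS[name] * ratings[name] for name in CRITERIA_WEIGHTS)


# Placeholder ratings for two unnamed candidates; fill in from your own benchmarks.
candidates = {
    "candidate_a": {
        "p95_ttft": 4, "tokens_per_s_per_gpu": 4, "memory_efficiency": 5,
        "failover_model": 3, "ops_complexity": 4,
    },
    "candidate_b": {
        "p95_ttft": 5, "tokens_per_s_per_gpu": 5, "memory_efficiency": 4,
        "failover_model": 3, "ops_complexity": 2,
    },
}

ranked = sorted(candidates, key=lambda name: weighted_score(candidates[name]), reverse=True)
for name in ranked:
    print(name, round(weighted_score(candidates[name]), 2))
```

The weights are the part worth arguing about: stating them out loud shows the interviewer which constraint you consider binding.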

Study vLLM fundamentals.

Review PagedAttention and why memory fragmentation directly affects usable batching and practical throughput.
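A back-of-envelope calculation shows why fragmentation limits batching. The sketch below compares reserving a contiguous KV-cache region sized for the maximum sequence length against allocating fixed-size blocks on demand, which is the idea behind PagedAttention. All numbers (cache capacity in token slots, sequence lengths, block size) are illustrative assumptions, and the model is a snapshot that ignores sequences growing during decode.

```python
import math


def contiguous_slots(max_seq_len: int) -> int:
    """Slots reserved per sequence when the full maximum length is preallocated."""
    return max_seq_len


def paged_slots(actual_len: int, block_size: int) -> int:
    """Slots used per sequence with block allocation; waste is at most one partial block."""
    return math.ceil(actual_len / block_size) * block_size


def max_batch(total_slots: int, per_seq_slots: int) -> int:
    """How many sequences fit in the cache at once."""
    return total_slots // per_seq_slots


TOTAL_SLOTS = 65536   # assumed KV-cache capacity, in token slots
MAX_SEQ_LEN = 2048    # assumed configured maximum sequence length
ACTUAL_LEN = 300      # assumed typical live sequence length
BLOCK_SIZE = 16       # assumed block size, in tokens

print(max_batch(TOTAL_SLOTS, contiguous_slots(MAX_SEQ_LEN)))        # 32
print(max_batch(TOTAL_SLOTS, paged_slots(ACTUAL_LEN, BLOCK_SIZE)))  # 215
```

Under these assumptions the same memory holds 215 sequences instead of 32, which is the link between memory management and practical throughput that you should be able to explain.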

Understand TGI positioning.

Use the official Hugging Face engine guidance to position TGI against newer engine options in your platform strategy.

Read TensorRT-LLM deployment tradeoffs.

Focus on disaggregated serving and where specialized performance optimization justifies operational complexity.

Map SGLang into your stack.

Skim the optional runtime docs to understand when SGLang should be tested as a candidate in your routing layer.

Final interview synthesis.

Deliver a 90-second recommendation with one primary engine, one fallback path, and one risk mitigation plan.
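The "one primary engine, one fallback path" structure can be written down as an ordered preference list checked against health signals. This is a minimal sketch: the lane names and the health map are hypothetical, and a real router would feed the health map from probes or error-rate monitors.

```python
from typing import Optional


def pick_lane(preference_order: list[str], healthy: dict[str, bool]) -> Optional[str]:
    """Return the first healthy lane in preference order, or None if all are down."""
    for lane in preference_order:
        if healthy.get(lane, False):
            return lane
    return None


# Hypothetical lane names standing in for your primary and fallback engines.
ORDER = ["primary_lane", "fallback_lane"]

print(pick_lane(ORDER, {"primary_lane": True, "fallback_lane": True}))
print(pick_lane(ORDER, {"primary_lane": False, "fallback_lane": True}))
print(pick_lane(ORDER, {"primary_lane": False, "fallback_lane": False}))
```

The `None` case is where your risk mitigation plan belongs: decide in advance whether to queue, shed load, or return a degraded answer.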

Course 2 Reading List

Keep this session tight: exactly three required readings (about 38-42 minutes total) and one optional refresher.

Required - 10 min

Hugging Face TGI Engine Guide

Official engine guidance from Hugging Face, including current positioning of TGI and migration context to newer engines such as vLLM and SGLang.

Extract: when TGI is operationally acceptable versus when modern engine alternatives are better long-term choices.

Required - 14 min

TensorRT-LLM: Disaggregated Serving

Official NVIDIA technical docs on splitting prefill and decode into separate serving groups. High-signal for advanced latency and utilization tradeoffs at scale.

Extract: when disaggregated prefill/decode architecture improves SLO adherence and fleet efficiency.

Optional - 8 min

SGLang Documentation (Overview + Runtime)

Use only if you need a quick refresh on SGLang runtime capabilities before the drill. Treat it as context, not mandatory reading for this hour.

Extract: where SGLang can fit as an alternative serving/runtime layer in a multi-engine strategy.

Interview Drill: AI Infra System Design

Prompt: "Design the serving layer for a BigQuery-native AI analyst with strict p95 latency SLOs and variable query complexity." Keep answers explicit about engine routing and failure handling.

  1. Architecture: Describe gateway, tenant-aware prompt assembly, model/engine router, inference fleet, streaming layer, and observability.
  2. Engine policy: Propose a default engine and at least one specialized lane (for example, optimized long-context or high-throughput decode lanes).
  3. SLO strategy: Explain admission control, backpressure, and overload behavior during schema-heavy traffic spikes.
  4. Capacity plan: Show how you convert target QPS and token distributions into GPU count and buffer capacity.
  5. Reliability path: Define fallback order when an engine is degraded and how to preserve user-visible behavior during failover.
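For the capacity-plan step, it helps to have the arithmetic ready. The sketch below is a deliberately simplified, decode-only model: it ignores prefill cost, batching effects, and burstiness, and every number in it (traffic mix, per-GPU throughput, utilization target, spare count) is an assumption to be replaced with measured values.

```python
import math


def mean_output_tokens(mix: dict[str, tuple[float, float]]) -> float:
    """Weighted mean output tokens per request from {class: (traffic share, mean tokens)}."""
    return sum(share * tokens for share, tokens in mix.values())


def gpus_needed(
    qps: float,
    tokens_per_request: float,
    tokens_per_s_per_gpu: float,
    target_utilization: float = 0.6,
    spare_gpus: int = 1,
) -> int:
    """GPUs required to serve the decode load at a target utilization, plus spares."""
    demand = qps * tokens_per_request                    # generated tokens per second
    usable = tokens_per_s_per_gpu * target_utilization   # headroom for spikes
    return math.ceil(demand / usable) + spare_gpus


# Assumed traffic mix: class -> (share of requests, mean output tokens).
MIX = {
    "short_dashboard": (0.70, 150),
    "schema_heavy_sql": (0.25, 600),
    "agentic_investigation": (0.05, 2000),
}

tokens = mean_output_tokens(MIX)  # about 355 tokens per request
print(gpus_needed(qps=20, tokens_per_request=tokens, tokens_per_s_per_gpu=2500))  # 6
```

Here 20 QPS at roughly 355 tokens per request is about 7,100 tokens/s; at 1,500 usable tokens/s per GPU that rounds up to 5 GPUs, plus 1 spare for failover. In the drill, state the simplifications before the interviewer asks about them.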

Course 2 Sources

  1. vLLM paper: Efficient Memory Management for Large Language Model Serving with PagedAttention
  2. Hugging Face: TGI Engine Guide
  3. NVIDIA TensorRT-LLM: Disaggregated Serving
  4. SGLang Documentation