System-Design Frame
Assume you run a BigQuery-connected GenAI analyst that serves mixed workloads: short dashboard questions, long schema-heavy SQL generation, and occasional agentic multi-step investigations. Your job is to pick the serving stack per request class, keep p95 latency inside SLO, and control GPU cost while preserving reliability.