Design an AI inference scaling architecture.
### Signal to interviewer
I can design inference systems that scale elastically while preserving reliability under bursty, heterogeneous workloads.
### Clarify
I would clarify request volume and burst shape, latency tiers (interactive vs. background), model sizes, regional compliance constraints, and traffic criticality classes.
### Approach
Use a capacity-cell architecture: isolate serving pools by workload class, route requests by policy, and allow spillover to adjacent cells when pressure rises (a routing sketch follows).
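A minimal routing sketch in Python, assuming a hypothetical cell registry with per-cell load and spill thresholds; the cell names, policy table, and threshold values are illustrative, not prescribed:

```python
# Minimal routing sketch; the cell names, loads, and spill order below
# are made-up examples of the policy, not values from the source.
from dataclasses import dataclass

@dataclass
class Cell:
    name: str
    workload_class: str      # e.g. "interactive" or "batch"
    load: float              # 0.0-1.0 fraction of capacity in use
    spill_threshold: float   # spill to a neighbor above this load

# Illustrative topology: each class has a home cell and an ordered spill list.
CELLS = {
    "interactive-us": Cell("interactive-us", "interactive", 0.62, 0.80),
    "interactive-eu": Cell("interactive-eu", "interactive", 0.45, 0.80),
    "batch-us":       Cell("batch-us", "batch", 0.90, 0.95),
}
SPILL_ORDER = {
    "interactive-us": ["interactive-eu"],  # never spill interactive into batch
    "batch-us": [],                        # batch queues rather than spills
}

def route(workload_class: str, home: str) -> str | None:
    """Return the cell to serve a request, spilling only within policy."""
    candidates = [home] + SPILL_ORDER.get(home, [])
    for name in candidates:
        cell = CELLS[name]
        if cell.workload_class == workload_class and cell.load < cell.spill_threshold:
            return cell.name
    return None  # reject or queue: admission control takes over

print(route("interactive", "interactive-us"))  # -> interactive-us (under threshold)
```

Returning `None` rather than force-placing the request keeps the spillover bounded, which is what protects adjacent cells from cascading overload.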
### Metrics & instrumentation
Primary metric: successful responses per compute dollar. Secondary metrics: queue wait time, autoscale reaction latency, and effective GPU utilization (time spent on useful work, not just allocated). Guardrails: tail-latency SLO violations, dropped-request rate, and hotspot saturation.
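A back-of-envelope sketch of how these metrics could be computed from window counters; the $/GPU-hour rate, drop budget, and sample numbers are assumptions for illustration:

```python
# Back-of-envelope metric computation; the rate and counter values are
# placeholder inputs, not figures from the source.
def responses_per_dollar(successes: int, gpu_hours: float,
                         dollars_per_gpu_hour: float) -> float:
    """Primary metric: successful responses per compute dollar."""
    return successes / (gpu_hours * dollars_per_gpu_hour)

def guardrails_ok(p99_ms: float, slo_ms: float, dropped: int, total: int,
                  max_drop_rate: float = 0.001) -> bool:
    """Guardrails: tail latency within SLO and drop rate under budget."""
    return p99_ms <= slo_ms and (dropped / max(total, 1)) <= max_drop_rate

print(responses_per_dollar(successes=1_200_000, gpu_hours=400,
                           dollars_per_gpu_hour=2.5))           # -> 1200.0
print(guardrails_ok(p99_ms=850, slo_ms=1000,
                    dropped=90, total=1_200_000))               # -> True
```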
### Tradeoffs
Higher utilization cuts cost but reduces buffer for sudden spikes. Aggressive autoscaling improves elasticity but may introduce cold-start delays.
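As a rough illustration with assumed numbers: a cell held at 60% steady-state utilization can absorb roughly a 1.67x burst before saturating (1 / 0.60), while one run at 85% tolerates only about 1.18x; the cost savings of the second come directly out of burst headroom.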
### Risks & mitigations
- Risk: regional imbalance during incidents. Mitigation: cross-cell overflow limits.
- Risk: noisy-neighbor contention between co-located models. Mitigation: workload isolation per cell.
- Risk: unstable scaling loops. Mitigation: hysteresis and rate limits (see the sketch below).
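One way to implement the hysteresis-and-rate-limit mitigation is a dead band between scale-up and scale-down thresholds plus a cooldown; the thresholds, step size, and cooldown below are placeholder values, not tuned recommendations:

```python
# Hysteresis + rate-limit sketch for the scaling loop; all parameters here
# are assumed defaults for illustration.
import time

class HysteresisScaler:
    def __init__(self, up_at=0.75, down_at=0.45, max_step=2, cooldown_s=120):
        self.up_at = up_at          # scale up only above this utilization
        self.down_at = down_at      # scale down only below this (gap = hysteresis)
        self.max_step = max_step    # rate limit: replicas changed per decision
        self.cooldown_s = cooldown_s
        self._last_change = 0.0

    def decide(self, utilization: float, replicas: int,
               now: float | None = None) -> int:
        now = time.monotonic() if now is None else now
        if now - self._last_change < self.cooldown_s:
            return replicas                       # still cooling down; hold
        if utilization > self.up_at:
            target = replicas + self.max_step
        elif utilization < self.down_at:
            target = max(1, replicas - self.max_step)
        else:
            return replicas                       # dead band: no oscillation
        self._last_change = now
        return target

scaler = HysteresisScaler()
print(scaler.decide(utilization=0.82, replicas=10, now=1000.0))  # -> 12
print(scaler.decide(utilization=0.40, replicas=12, now=1010.0))  # -> 12 (cooldown holds)
```

The dead band prevents flapping when utilization hovers near a single threshold, and the cooldown plus step cap bound how fast the loop can move even if the signal is noisy.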
### Example
For a global writing assistant, low-latency interactive prompts stay in premium cells while background summarization routes to cost-efficient, batch-oriented cells.
### 90-second version
Scale inference with capacity cells, route requests by SLO class, and protect reliability with admission control plus reserved headroom (sketched below). Optimize cost without sacrificing burst resilience.
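A sketch of admission control with reserved headroom; the 20% reservation and the two-tier priority split are assumed policy choices, not figures from the source:

```python
# Admission-control sketch with reserved headroom; the reservation fraction
# is an assumed policy value.
def admit(priority: bool, in_flight: int, capacity: int,
          reserved_frac: float = 0.20) -> bool:
    """Best-effort traffic may fill capacity only up to (1 - reserved_frac);
    priority traffic can use the full capacity, including the reserve."""
    if priority:
        return in_flight < capacity
    return in_flight < int(capacity * (1 - reserved_frac))

print(admit(priority=False, in_flight=81, capacity=100))  # -> False (reserve protected)
print(admit(priority=True,  in_flight=81, capacity=100))  # -> True  (dips into reserve)
```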
### Follow-up questions
- Which workload classes should get dedicated capacity cells first?
- What reserve headroom policy would you set for priority traffic?
- How would you tune autoscaling to avoid oscillation under spiky demand?
- What telemetry is required to detect contention between model workloads?