Design a real-time AI chatbot system.
### Signal to interviewer
I can design real-time AI systems by treating latency as a first-class architectural contract without sacrificing answer quality.
### Clarify
I would clarify concurrency targets, acceptable response delay, expected request complexity, and moderation requirements.
### Approach
Decompose the end-to-end latency budget across pipeline stages and define a fallback strategy for each stage. Stream the response early while running non-critical enrichments asynchronously.
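A minimal sketch of per-stage budget enforcement, assuming hypothetical stage names and budget values (the specific numbers and `run_with_budget` helper are illustrative, not a prescribed API): each stage runs under its slice of the budget and degrades to a precomputed fallback on timeout.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

# Hypothetical per-stage budgets (ms) summing to a ~900 ms pre-stream target.
STAGE_BUDGETS_MS = {"moderation": 100, "retrieval": 300, "generation_first_token": 500}

def run_with_budget(executor, fn, budget_ms, fallback):
    """Run one pipeline stage under its latency budget; on timeout, return its fallback."""
    future = executor.submit(fn)
    try:
        return future.result(timeout=budget_ms / 1000)
    except FuturesTimeout:
        future.cancel()
        return fallback

with ThreadPoolExecutor(max_workers=2) as pool:
    # Fast stage completes within its budget and returns the real result.
    ok = run_with_budget(pool, lambda: "fresh-context",
                         STAGE_BUDGETS_MS["retrieval"], "cached-context")
    # Slow stage blows its budget and degrades to the fallback instead of blocking.
    slow = run_with_budget(pool, lambda: time.sleep(1) or "late-context",
                           100, "cached-context")
```

The key design point this illustrates: the caller never waits past the stage's budget, so a slow dependency costs quality (the fallback), not latency.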
### Metrics & instrumentation
Primary metric: first-token latency percentile. Secondary metrics: turn-completion quality, timeout rate, and handoff success. Guardrails: moderation misses, degraded-region error spikes, and queue saturation.
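The primary metric above can be computed as a simple nearest-rank percentile over per-request first-token timings; this is a sketch with made-up sample latencies, not production instrumentation.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile, e.g. pct=95 for p95 first-token latency."""
    ranked = sorted(samples)
    rank = math.ceil(pct / 100 * len(ranked))
    return ranked[rank - 1]

# Hypothetical first-token latencies in milliseconds; note the heavy tail.
first_token_ms = [120, 95, 110, 480, 130, 105, 2200, 140, 115, 125]
p50 = percentile(first_token_ms, 50)  # typical experience
p95 = percentile(first_token_ms, 95)  # tail the SLO must protect
```

Tracking p50 and p95 together matters because a healthy median can mask exactly the tail blowups the risks section warns about.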
### Tradeoffs
Strict latency goals improve UX but can reduce depth of reasoning. Richer context improves quality but increases compute and response delay.
### Risks & mitigations
Risk: tail latency blowups during spikes; mitigate with admission control and queue shaping. Risk: degraded moderation under load; mitigate with prioritized safety pipeline. Risk: context retrieval timeouts; mitigate with cache-first fallback.
### Example
In customer support chat, an order-status lookup takes a fast cache path, while long-form troubleshooting invokes deeper retrieval only after the first response starts streaming.
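The cache-first fallback in this example can be sketched as follows; the function names and order shapes are hypothetical, and the point is that the hot path never pays the deep-retrieval cost.

```python
def lookup_order(order_id, cache, fetch_deep):
    """Cache-first lookup: serve a cache hit immediately, fall back to deep retrieval."""
    hit = cache.get(order_id)
    if hit is not None:
        return hit, "cache"
    # Cache miss: pay the slower retrieval cost, then warm the cache for next time.
    result = fetch_deep(order_id)
    cache[order_id] = result
    return result, "deep"

cache = {"A1": {"status": "shipped"}}
fast, fast_src = lookup_order("A1", cache, lambda oid: {"status": "unknown"})
slow, slow_src = lookup_order("B2", cache, lambda oid: {"status": "processing"})
```

Here the second call both serves the miss and warms the cache, so repeated lookups for the same order converge onto the fast path.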
### 90-second version
Design real-time chat by allocating latency budgets per stage, streaming early, and using adaptive depth. Measure first-token speed and resolution quality together while protecting safety under load.
### Follow-up questions
- What latency threshold defines real-time for your target users?
- Which chat intents require deep context versus fast direct answers?
- How would you implement adaptive routing without unstable oscillation?
- What backpressure strategy would you use during regional traffic spikes?