Explore topic-wise interview questions and answers.
Inference Optimization
QUESTION 01
What is quantization in LLMs and what are the main types (INT8, INT4, GPTQ, AWQ)?
DEFINITION:
Quantization is a model compression technique that reduces the numerical precision of model weights and activations from higher-bit formats (like FP16) to lower-bit formats (INT8, INT4), dramatically decreasing memory usage and accelerating inference while attempting to preserve model quality. It's essential for deploying large models on resource-constrained hardware.
HOW IT WORKS:
Quantization maps a range of floating-point values to a smaller set of integer values. Post-training quantization (PTQ) applies this after training: weights are analyzed to determine optimal scaling factors, then converted to integers. Quantization-aware training (QAT) simulates quantization during training, often yielding better quality. Main types: INT8 (8-bit) reduces memory by 2× with minimal quality loss, standard for inference. INT4 (4-bit) reduces by 4× but requires more careful techniques like GPTQ and AWQ. GPTQ uses approximate second-order information to optimally quantize weights layer-by-layer. AWQ (Activation-aware Weight Quantization) protects salient weights by analyzing activation distributions, achieving better INT4 quality.
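The float-to-integer mapping can be sketched in a few lines of NumPy. This is plain symmetric per-tensor PTQ, not GPTQ or AWQ (which additionally use calibration data, per-group scales, and error compensation); the weight tensor is random, for illustration only:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: map floats to [-127, 127]."""
    scale = np.max(np.abs(w)) / 127.0        # one scaling factor per tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats for use in matmuls."""
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
err = np.mean(np.abs(w - w_hat))             # mean error is bounded by scale/2
```

The INT8 tensor occupies a quarter of the FP32 original (half of FP16), and the rounding error per element is at most half the scale, which is why quality loss stays small.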
WHY IT MATTERS:
Quantization enables models that would otherwise require multiple GPUs to run on single devices, dramatically reducing cost. A 70B model in FP16 requires 140GB - impossible on consumer hardware. Quantized to INT4 it shrinks to 35GB, which a high-end consumer GPU (RTX 4090, 24GB) can run with CPU offloading. Inference speed improves 2-4× due to reduced memory bandwidth and faster integer arithmetic. This democratizes access to large models for researchers, enables on-device deployment, and cuts cloud costs. Quality loss is often minimal (1-2% accuracy drop) with modern techniques.
EXAMPLE:
LLaMA-65B (130GB FP16) quantized with GPTQ to 4-bit: 32.5GB, runs on single 48GB GPU. Inference speed: 20 tokens/s vs 5 tokens/s FP16 on same hardware. Perplexity increases from 5.2 to 5.4 - barely noticeable. For production, serving 100 QPS with FP16 requires 20 GPUs; with INT4, 5 GPUs suffice, saving $75k/year. This is why quantization is standard practice - it turns impossible deployments into cost-effective reality.
QUESTION 02
What is the KV cache and how does it speed up autoregressive inference?
DEFINITION:
The KV cache is a memory structure that stores the key and value vectors from previous tokens during autoregressive generation, eliminating redundant recomputation. Instead of recomputing attention for all tokens at each step, the model caches these vectors and only computes attention for the new token, reducing inference complexity from O(n³) to O(n²) overall.
HOW IT WORKS:
During generation, the first forward pass processes the entire prompt, computing keys and values for all positions. These are stored in the KV cache (per layer, per head). For each subsequent token, the model: 1) computes the query for the new token, 2) retrieves all previous keys/values from the cache, 3) computes attention between the new query and all cached keys, 4) generates the new token, 5) computes the new key/value and appends it to the cache. This avoids recomputing previous token representations, which would require O(n²) work per step. Cache size grows linearly with sequence length, storing n_layers × 2 × n_heads × d_head × seq_len values.
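The caching loop can be illustrated with a toy single-head attention in NumPy. The projection matrices and token embeddings here are random stand-ins, and a real model keeps one such cache per layer and per head:

```python
import numpy as np

d = 16
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

K_cache, V_cache = [], []          # grows by one entry per generated token

def decode_step(x):
    """x: embedding of the newest token, shape (d,)."""
    q = x @ Wq
    K_cache.append(x @ Wk)         # only the NEW key/value are computed
    V_cache.append(x @ Wv)
    K = np.stack(K_cache)          # (t, d): all cached keys so far
    V = np.stack(V_cache)
    attn = softmax(q @ K.T / np.sqrt(d))
    return attn @ V                # attention output for the new token only

for t in range(5):                 # five decode steps, one token each
    out = decode_step(rng.standard_normal(d))
```

Each step does O(t) work against the cache instead of recomputing keys and values for all t previous tokens from scratch.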
WHY IT MATTERS:
Without the KV cache, generating 1000 tokens would require 1000 full forward passes over growing sequences - O(n³) total complexity, prohibitively slow. With the KV cache, each new token costs O(n) in the current length (only new attention computations), making long generations practical. For a 1000-token generation, the speedup is ~500×. The KV cache enables real-time chat applications where low latency is critical. Memory overhead is manageable: for a 7B model with 2k context, the KV cache is ~1GB. Trade-offs: the cache limits batch size and sequence length, but optimizations like PagedAttention reduce fragmentation.
EXAMPLE:
Generating 1000 tokens from a 20-token prompt with a 7B model. Without KV cache: step 1 processes 20 tokens, step 2 processes 21, ... step 1000 processes 1019 tokens - about 520,000 token-passes in total. With KV cache: the prompt pass processes 20 tokens (caching K,V), then steps 2-1000 each process 1 new token attending to the cached K,V - about 1,020 token-passes. That is roughly 500× less compute, and latency drops from minutes to seconds. This is why every production system uses a KV cache.
QUESTION 03
What is speculative decoding and how does it reduce latency?
DEFINITION:
Speculative decoding is a technique that accelerates autoregressive generation by using a smaller, faster draft model to propose multiple candidate tokens, which are then verified in parallel by the larger target model. This trades increased computation for reduced latency by exploiting the fact that verification can be parallelized while generation is sequential.
HOW IT WORKS:
Speculative decoding uses two models: a fast draft model (e.g., 100M params) and the target model (e.g., 7B). At each step, the draft model autoregressively generates K candidate tokens (typically 3-5). These K tokens and their probabilities are passed to the target model, which scores all K candidates in a single parallel forward pass, obtaining its own probability distributions. The system accepts the longest prefix of candidates consistent with the target model (under greedy decoding, the positions where predictions agree; under sampling, via a rejection-sampling rule that exactly preserves the target distribution). At the first rejection, the target model's own distribution supplies a replacement token and the remaining candidates are discarded. This verifies multiple tokens per target forward pass.
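The accept/reject loop can be sketched with trivial stand-ins for both models. The draft and target functions below are hypothetical placeholders (real ones are neural networks, and the sampling variant uses a probabilistic rejection rule rather than exact agreement); the 70% agreement rate is simulated:

```python
import numpy as np

rng = np.random.default_rng(0)

def draft_propose(prefix, k=5):
    # stand-in for a small draft model: k autoregressive candidate tokens
    return [(len(prefix) + i) % 100 for i in range(k)]

def target_argmax(prefix, candidates):
    # stand-in for ONE batched target forward pass scoring all candidates;
    # agrees with the draft ~70% of the time in this simulation
    return [c if rng.random() < 0.7 else (c + 1) % 100 for c in candidates]

def speculative_step(prefix, k=5):
    cand = draft_propose(prefix, k)
    target = target_argmax(prefix, cand)
    accepted = []
    for c, t in zip(cand, target):
        if c == t:
            accepted.append(c)       # draft and target agree: accept
        else:
            accepted.append(t)       # first disagreement: take target's token...
            break                    # ...and discard the remaining candidates
    return prefix + accepted

seq = []
while len(seq) < 50:
    seq = speculative_step(seq)      # always makes progress (>= 1 token/step)
```

Because even a full rejection still yields one target-quality token, each target forward pass emits between 1 and K+0 tokens, never fewer than plain decoding.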
WHY IT MATTERS:
Speculative decoding can achieve 2-3× speedup in generation latency without quality loss. It's most effective when the draft model is much faster than the target (10-20×) and the acceptance rate is high (60-80%). The technique works because many tokens are easy to predict (common words, grammatical particles) - the draft model gets them right, the target verifies in parallel. It requires no model changes, just orchestration. This is particularly valuable for latency-sensitive applications like chatbots, where each saved millisecond improves user experience.
EXAMPLE:
Generating a response with a 7B target and a 100M draft model (20× faster). Without speculation: 100 tokens × 20ms = 2000ms latency. With speculation (K=5, acceptance rate 70%): target forward passes ≈ 100/(5×0.7) ≈ 29 passes × 20ms = 580ms. The draft runs 29×5×1ms = 145ms. Total 725ms - a 2.8× speedup. The user perceives faster responses. Quality is identical because the target verifies all accepted tokens. This is why speculative decoding is increasingly standard in production inference systems.
QUESTION 04
What is batching in LLM inference and why does it improve throughput?
DEFINITION:
Batching in LLM inference groups multiple independent requests together for simultaneous processing, leveraging GPU parallelism to improve throughput (requests per second). Instead of processing requests sequentially, the system combines them into a single batch that runs through the model together, amortizing overhead and maximizing hardware utilization.
HOW IT WORKS:
For a batch of B requests, the system pads input sequences to the same length (or uses careful scheduling to group similar lengths). The batched tensors have shape [B, seq_len] for inputs, and attention masks handle variable lengths. During forward pass, matrix multiplications become [B, ...] operations, utilizing GPU tensor cores efficiently. For autoregressive generation, KV caches are maintained separately per request, and generation proceeds in lockstep (all requests generate token i before moving to i+1). Throughput scales nearly linearly with batch size until hitting memory or compute limits.
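Padding variable-length requests into one [B, max_len] tensor with an attention mask might look like this (token IDs and the pad ID are arbitrary, chosen for the example):

```python
import numpy as np

def pad_batch(token_lists, pad_id=0):
    """Pad variable-length requests to one [B, max_len] tensor plus a mask."""
    max_len = max(len(t) for t in token_lists)
    batch = np.full((len(token_lists), max_len), pad_id, dtype=np.int64)
    mask = np.zeros((len(token_lists), max_len), dtype=bool)
    for i, toks in enumerate(token_lists):
        batch[i, :len(toks)] = toks
        mask[i, :len(toks)] = True   # attention ignores padded positions
    return batch, mask

# Three requests of lengths 3, 2, and 5 become one 3x5 batch
requests = [[5, 6, 7], [1, 2], [9, 9, 9, 9, 9]]
batch, mask = pad_batch(requests)
```

The mask is what lets one matmul serve all three requests: padded positions receive zero attention weight, so they waste a little compute but never corrupt the outputs.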
WHY IT MATTERS:
Batching is essential for cost-effective serving. GPUs are designed for parallel computation - running single requests utilizes only 1-5% of peak FLOPs. A batch of 32-64 requests can achieve 80-90% utilization, increasing throughput 20-30×. This directly translates to lower cost per request. For example, serving 1000 QPS with batch size 1 might need 100 GPUs; with batch size 32, 4 GPUs suffice - a 96% cost reduction. Batching also enables economies of scale in cloud deployments. The trade-off is increased latency (requests wait for the batch to fill), managed via dynamic batching (timeout-based).
EXAMPLE:
Serving a 7B model on an A100. Single request: 20ms prefill + 20ms per token = 220ms for a 10-token response ≈ 4.5 QPS per GPU. With batch size 32: prefill processes 32 requests in parallel (still 20ms), generation processes 32 tokens in parallel (still 20ms per token). Total time for the batch = 220ms for 32 requests ≈ 145 QPS - 32× the throughput. Cost per request drops from $0.001 to $0.00003. This is why production systems aggressively batch - it's the most impactful optimization for throughput.
QUESTION 05
What is continuous batching and how does it differ from static batching?
DEFINITION:
Continuous batching (also called iteration-level batching) is an advanced scheduling technique where requests are added to and removed from the batch dynamically at each generation step, rather than waiting for entire batches to complete. This eliminates the 'straggler effect' where fast requests wait for slow ones, improving both latency and throughput.
HOW IT WORKS:
In static batching, a batch of requests starts together and all must finish generation before new requests can be added. If one request generates 100 tokens while others generate 10, the 10-token requests wait idle. Continuous batching maintains a pool of active requests. At each iteration, the scheduler selects which requests to process based on their current generation state. Completed requests are removed, new requests are added immediately. KV caches are managed dynamically, with memory paging to handle variable sequence lengths. This requires careful memory management (PagedAttention) to avoid fragmentation.
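The difference can be simulated by counting decode steps under both schedulers. Request lengths and the batch capacity below are invented for the example:

```python
from collections import deque

def static_batching(lengths, batch_size):
    """Steps used when a batch must finish together (slowest member gates it)."""
    steps, queue = 0, list(lengths)
    while queue:
        batch, queue = queue[:batch_size], queue[batch_size:]
        steps += max(batch)            # whole batch waits for its longest request
    return steps

def continuous_batching(lengths, batch_size):
    """Finished requests leave each step; waiting requests join immediately."""
    waiting, active, steps = deque(lengths), [], 0
    while waiting or active:
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())       # admit new requests mid-flight
        active = [r - 1 for r in active if r > 1]  # one decode step for everyone
        steps += 1
    return steps
```

For the mixed workload [200, 10, 200, 10] with capacity 2, static batching takes 400 steps while continuous batching takes 210, because short requests leave and free their slots immediately.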
WHY IT MATTERS:
Continuous batching dramatically improves GPU utilization and reduces latency variability. In static batching, the slowest request determines batch completion time, wasting compute. With continuous batching, the GPU always works on available requests. Throughput gains of 2-4× over static batching are common. Latency improves because requests don't wait for a batch to fill before starting. This technique enabled high-performance inference engines like vLLM, TensorRT-LLM, and TGI. It's essential for serving mixtures of short and long requests efficiently.
EXAMPLE:
Static batch of 8 requests: 2 short (10 tokens), 2 medium (50 tokens), 4 long (200 tokens). Total time = 200 tokens × generation time. Short requests wait 200 steps for completion - terrible latency. The GPU idles during later steps when only long requests remain. Continuous batching: short requests complete in 10 steps and leave, new requests join immediately. The GPU stays fully utilized. Throughput is the same or better, but short-request latency is 10 steps vs 200 steps - a 20× improvement. For real workloads with mixed lengths, continuous batching is essential for good user experience.
QUESTION 06
What is tensor parallelism and how is it used to serve large models?
DEFINITION:
Tensor parallelism is a distributed inference technique that splits individual model layers across multiple GPUs, with each GPU holding a portion of the weights and computing part of each operation. It enables serving models too large for single GPU memory by distributing the workload, with communication between GPUs after each layer to combine results.
HOW IT WORKS:
For a transformer layer, the attention heads are split across GPUs (each GPU handles a subset of heads). For MLP layers, the weight matrices are partitioned column-wise (first projection) and row-wise (second projection). During the forward pass, each GPU computes its portion of the layer. After the attention output projection and after the MLP, an all-reduce combines the partial results across GPUs. Communication overhead is significant but necessary for models exceeding single GPU memory. Pipeline parallelism (layer-wise splitting) is often combined with tensor parallelism at extreme scale.
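The column-then-row MLP split can be checked numerically: each simulated GPU computes a partial product, and summing the partials (the all-reduce) reproduces the unsharded result. Shapes are toy-sized and the nonlinearity is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff, n_gpus = 8, 32, 4

x = rng.standard_normal(d)
W1 = rng.standard_normal((d, d_ff))   # up-projection
W2 = rng.standard_normal((d_ff, d))   # down-projection

# Reference: single-device MLP
ref = (x @ W1) @ W2

# Megatron-style split: W1 column-wise, W2 row-wise across "GPUs"
W1_shards = np.split(W1, n_gpus, axis=1)
W2_shards = np.split(W2, n_gpus, axis=0)

# Each GPU computes a partial output on its shards, with no communication
partials = [(x @ w1) @ w2 for w1, w2 in zip(W1_shards, W2_shards)]

# One all-reduce (here: a sum) combines the partials into the full output
out = np.sum(partials, axis=0)
```

This split is chosen precisely because an element-wise activation between the two projections would apply to each GPU's local slice, so the only communication needed is the single sum at the end.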
WHY IT MATTERS:
Tensor parallelism enables serving models that would otherwise be impossible. A 175B model in FP16 requires 350GB - far beyond any single GPU. With tensor parallelism across 8 A100s (80GB each), each holds 44GB, fitting comfortably. Inference would be impossible without it. The trade-off is communication overhead - each layer requires GPU-to-GPU transfers, which can become bottleneck. However, modern interconnects (NVLink, InfiniBand) make this feasible. Tensor parallelism is standard for large model serving, often combined with quantization for further memory reduction.
EXAMPLE:
Serving GPT-3 175B on 8× A100 GPUs with tensor parallelism. Each GPU holds 1/8 of the weights (~44GB). Each GPU processes its subset of attention heads in parallel; after attention and the MLP, activations are all-reduced across NVLink (600GB/s), taking roughly 0.6ms per layer. With 96 layers, total communication overhead is ~58ms per forward pass - significant but acceptable for batch inference. Without tensor parallelism, the model wouldn't fit at all. For smaller models (7B), tensor parallelism is unnecessary - they fit on a single GPU. The choice depends on model size vs GPU memory.
QUESTION 07
What is model distillation and how does it produce smaller, faster models?
DEFINITION:
Model distillation is a compression technique where a smaller 'student' model is trained to mimic the behavior of a larger 'teacher' model, capturing its knowledge and capabilities in a more efficient architecture. The student learns from the teacher's soft probabilities rather than just hard labels, acquiring nuanced understanding that enables comparable performance with far fewer parameters.
HOW IT WORKS:
The teacher model (e.g., 175B) generates outputs on a large dataset. For each input, the teacher produces probability distributions over vocabulary (soft targets) that contain rich information about relationships between classes (e.g., 'dog' and 'cat' have higher probabilities than 'car'). The student model (e.g., 7B) is trained to match these soft targets via KL divergence loss, often combined with standard cross-entropy on ground truth. Temperature scaling smooths distributions to emphasize relationships. The student learns not just correct answers but the teacher's reasoning patterns and uncertainty estimates. Multiple teacher models can be ensembled.
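The combined loss for one example can be sketched as follows; the logits, temperature, and alpha below are illustrative values, not tuned settings:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, label, T=2.0, alpha=0.5):
    """alpha-weighted mix of KL(teacher || student) at temperature T
    and cross-entropy on the hard label."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)))
    ce = -np.log(softmax(student_logits)[label])
    return alpha * (T ** 2) * kl + (1 - alpha) * ce  # T^2 rescales the KD term

t = np.array([4.0, 2.0, 0.5])    # teacher: confident but informative soft target
s = np.array([3.0, 2.5, 0.2])    # student: close, not identical
loss = distill_loss(s, t, label=0)
```

The temperature softens both distributions so the student is penalized for getting the *relationships* between classes wrong, not just the argmax; the loss drops to its cross-entropy floor when the student's logits match the teacher's.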
WHY IT MATTERS:
Distillation produces models that retain most of the teacher's capability at a fraction of the size and cost. A distilled 7B model can match 70B model performance on many tasks while running 10× faster and using 10× less memory. This enables deployment on edge devices, reduces cloud costs, and improves latency. Distillation is how companies create efficient model families (DistilBERT, MiniLM, Phi series). It's also used for domain specialization - distill general knowledge into smaller domain-specific models. The trade-off is some quality loss, but it is often surprisingly small.
EXAMPLE:
Distilling GPT-4 (rumored ~1.8T parameters) into a 7B student. The teacher generates outputs on 100M diverse prompts; the student is trained to match the teacher's distributions. The resulting model scores 85% on MMLU vs GPT-4's 90% - a 5% loss but ~250× smaller. Inference cost: $0.001 per query vs $0.03 - 30× cheaper. For many applications, 85% accuracy is sufficient, making distillation highly cost-effective. Hugging Face's DistilBERT reduced BERT's size by 40% while retaining 97% of its performance, becoming standard for production. This efficiency gain enables AI deployment at scale.
QUESTION 08
What are the trade-offs between latency, throughput, and cost in LLM serving?
DEFINITION:
Latency, throughput, and cost form the fundamental trilemma in LLM serving - optimizing one typically harms others. Latency measures time per request (user experience), throughput measures requests per second (capacity), and cost measures dollars per request (economics). Serving systems must balance these based on application requirements.
HOW IT WORKS:
Low latency requires small batches (less waiting) and more resources (over-provisioning), increasing cost. High throughput requires large batches, which increases latency (requests wait). Cost optimization pushes toward maximal utilization (large batches, slower hardware), harming both latency and throughput. Techniques trade off: quantization reduces cost and latency (smaller models) but may slightly harm quality. Batching improves throughput and cost but hurts latency. Model distillation improves all three but requires upfront investment. Hardware choice balances: A100 expensive but fast, T4 cheaper but slower.
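The economics reduce to simple arithmetic. The GPU price, GPU counts, and throughputs below are hypothetical, chosen only to show how batching shifts cost per request:

```python
def cost_per_request(gpu_hourly_usd, n_gpus, throughput_qps):
    """Dollars per request for a serving configuration."""
    return (gpu_hourly_usd * n_gpus) / (throughput_qps * 3600)

# Over-provisioned low-latency config: many GPUs, tiny batches, low QPS
low_latency = cost_per_request(gpu_hourly_usd=2.0, n_gpus=10, throughput_qps=50)

# Throughput-oriented config: few GPUs, large batches, high QPS
batched = cost_per_request(gpu_hourly_usd=2.0, n_gpus=2, throughput_qps=400)
```

The batched configuration is far cheaper per request; what the formula cannot show is the latency price paid to get there, which is exactly the trilemma.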
WHY IT MATTERS:
Different applications prioritize differently. Chatbots need low latency (<500ms) - willing to pay more, use smaller batches, over-provision. Batch processing (offline summarization) prioritizes throughput and cost - use large batches, accept higher latency. Real-time APIs need balance - typical SLO: p95 latency <2s, throughput maximized within that constraint. Understanding trade-offs guides architecture: choose model size, quantization, batching strategy, hardware based on requirements. Misalignment causes either poor user experience or excessive costs.
EXAMPLE:
Three serving configurations for 7B model. Config A (low latency): batch size 1, 10 GPUs, p95 latency 200ms, throughput 50 QPS, cost $0.05/request. Config B (balanced): batch size 32, 2 GPUs, latency 1.5s, throughput 400 QPS, cost $0.002/request. Config C (cost-optimized): batch size 128, 1 GPU, latency 5s, throughput 600 QPS, cost $0.0005/request. Chat app chooses A, batch processing chooses C, API service chooses B. Each valid for different use cases. The art is matching configuration to requirements.
QUESTION 09
What is time-to-first-token (TTFT) and why does it matter for user experience?
DEFINITION:
Time-to-first-token (TTFT) is the latency between submitting a prompt and receiving the first output token from the model. It measures how quickly the model begins responding, critically impacting perceived responsiveness in interactive applications like chatbots, where users experience waiting time until they see any response.
HOW IT WORKS:
TTFT consists of: network latency (request transmission), prefill phase (processing input prompt through model to compute initial KV cache), and first token generation. Prefill dominates - for long prompts, this requires full forward pass over all input tokens. TTFT scales with input length (O(n²) attention) and model size. Factors affecting TTFT: model size (larger = slower), input length (longer = slower), hardware speed, batching (waiting for batch to fill), and optimization techniques like prefix caching.
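Measuring TTFT from a streaming response is straightforward; the generator below fakes a 50ms prefill so the measurement has something to detect:

```python
import time

def measure_ttft(stream):
    """Time from request start to the first yielded token of a stream."""
    start = time.perf_counter()
    first = next(stream)
    return first, time.perf_counter() - start

def fake_stream(prefill_s, n_tokens, per_token_s):
    time.sleep(prefill_s)            # stand-in for the prefill phase
    for i in range(n_tokens):
        yield f"tok{i}"
        time.sleep(per_token_s)      # stand-in for per-token decode latency

token, ttft = measure_ttft(fake_stream(prefill_s=0.05, n_tokens=3,
                                       per_token_s=0.001))
```

Note that the measurement only touches the first `next()` call: everything after the first token belongs to the decode-speed (TPS) metric, not TTFT.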
WHY IT MATTERS:
TTFT directly shapes user perception of responsiveness. Studies show users notice delays >300ms; >2s causes frustration and abandonment. For chatbots, TTFT determines how natural conversation feels - long TTFT breaks flow. Applications like real-time translation or voice assistants require very low TTFT (<200ms). TTFT is often more noticeable than per-token speed because users wait for initial response without feedback. Optimizing TTFT involves: smaller models, faster hardware, prompt caching, and avoiding unnecessary batching before starting inference.
EXAMPLE:
User asks chatbot with 2000-token conversation history. Model A (optimized): prefill 300ms, TTFT 320ms - feels instant. Model B (unoptimized): prefill 1500ms, TTFT 1520ms - user wonders if system froze. Same per-token speed (30ms) after first token, but initial experience dramatically different. User likely abandons Model B. This is why streaming APIs return first token as fast as possible, even if rest slower - perceived responsiveness matters more than total generation time.
QUESTION 10
What is tokens-per-second (TPS) and how is it measured?
DEFINITION:
Tokens-per-second (TPS) measures the generation speed of a language model, indicating how many output tokens are produced per second after the first token. It's the primary metric for inference throughput and directly impacts user experience for longer generations and system capacity planning.
HOW IT WORKS:
TPS is measured by timing generation of a sequence of tokens and dividing count by time. For example, generating 100 tokens in 2 seconds = 50 TPS. Important distinctions: prefill time (processing input) is excluded - TPS measures generation phase only. Peak TPS vs average TPS can differ due to variable sequence lengths. Measurement must account for: batch size (per-request TPS vs system TPS), hardware type, quantization, and optimization techniques. For streaming applications, TPS determines how fast text appears to user.
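A decode-phase TPS measurement that excludes time-to-first-token can be sketched like this (the stream is simulated with sleeps, so the result is approximate):

```python
import time

def measure_tps(stream):
    """Tokens per second over the decode phase, clock starting at first token."""
    count, first_t = 0, None
    for _ in stream:
        if first_t is None:
            first_t = time.perf_counter()  # exclude prefill / TTFT
        count += 1
    elapsed = time.perf_counter() - first_t
    return (count - 1) / elapsed if elapsed > 0 else float("inf")

def fake_stream(n_tokens, per_token_s):
    for i in range(n_tokens):
        yield i
        time.sleep(per_token_s)            # stand-in for decode latency

tps = measure_tps(fake_stream(n_tokens=20, per_token_s=0.005))
```

With 5ms per token the measured rate lands a little under the ideal 200 TPS, since the interval also absorbs scheduling jitter; production benchmarks average over many long generations for the same reason.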
WHY IT MATTERS:
TPS affects both user experience and system economics. For users, higher TPS means faster responses - 100 TPS feels instant, 10 TPS feels sluggish. For systems, TPS determines capacity: 10 TPS per request with 100 concurrent users requires 1000 TPS system throughput. TPS varies dramatically: small models (7B) can achieve 100+ TPS on high-end GPUs, large models (70B) 10-20 TPS, and quantized models run 2-3× faster. Applications like code completion need high TPS (50+) to keep pace with typing; batch processing can tolerate lower TPS.
EXAMPLE:
Comparing configurations for 7B model: FP16 on A100: 40 TPS, INT8 on A100: 70 TPS, INT4 on RTX 4090: 120 TPS. For chat application targeting 50 TPS minimum, all work but INT4 cheapest. For batch summarization of 10k documents, 40 TPS sufficient - choose FP16 for quality. For real-time translation, 120 TPS needed - must optimize heavily. TPS measurement guides hardware selection, optimization priorities, and capacity planning.
QUESTION 11
What is vLLM and what problem does PagedAttention solve?
DEFINITION:
vLLM is a high-throughput inference engine that introduces PagedAttention, a memory management technique inspired by operating system virtual memory. It solves the problem of KV cache fragmentation and inefficient memory usage in traditional systems, enabling much larger batch sizes and higher throughput for LLM serving.
HOW IT WORKS:
Traditional systems allocate contiguous memory blocks for each request's KV cache, leading to fragmentation as requests start and complete at different times. PagedAttention divides KV cache into fixed-size blocks (pages) that can be non-contiguous in memory. The attention mechanism is modified to work with block-wise KV cache, allowing flexible allocation. Blocks are mapped via page tables, similar to OS virtual memory. This enables: 1) Efficient memory usage (no fragmentation), 2) Sharing of KV cache across requests (e.g., for parallel sampling), 3) Larger batch sizes, 4) Higher throughput. vLLM implements this with CUDA kernels optimized for block-wise attention.
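The page-table idea can be sketched without any attention math: fixed-size blocks are allocated on demand and returned to a shared pool on completion, so no request ever reserves a contiguous worst-case region. The 4-token block size is for illustration (vLLM's default is 16):

```python
BLOCK_SIZE = 4   # tokens per KV-cache block

class BlockAllocator:
    """Toy page table: maps each request's tokens to physical blocks,
    allocating non-contiguous fixed-size blocks on demand."""
    def __init__(self, n_blocks):
        self.free = list(range(n_blocks))
        self.table = {}      # request id -> list of physical block ids
        self.count = {}      # request id -> number of cached tokens

    def append_token(self, req):
        n = self.count.get(req, 0)
        if n % BLOCK_SIZE == 0:                 # current block full: grab a new one
            self.table.setdefault(req, []).append(self.free.pop())
        self.count[req] = n + 1

    def release(self, req):
        self.free.extend(self.table.pop(req))   # blocks return to the pool
        del self.count[req]

alloc = BlockAllocator(n_blocks=8)
for _ in range(10):
    alloc.append_token("req-A")        # 10 tokens -> ceil(10/4) = 3 blocks
blocks_used = len(alloc.table["req-A"])
alloc.release("req-A")                 # all 3 blocks return to the free pool
blocks_free = len(alloc.free)
```

Per-request waste is bounded by one partially filled block, and freed blocks are immediately reusable by any other request - the property contiguous allocation cannot give.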
WHY IT MATTERS:
Memory fragmentation in traditional systems wastes 30-50% of GPU memory, limiting batch sizes and throughput. PagedAttention eliminates this waste, enabling 2-4× higher throughput. For example, serving a 13B model on an A100, a traditional system might handle 8 concurrent requests; vLLM handles 20-30. This directly translates to lower cost per request. PagedAttention also enables novel features like parallel sampling with a shared prefix cache, further improving efficiency. vLLM has become the standard for high-performance inference due to these innovations.
EXAMPLE:
Serving 100 QPS with 7B model. Traditional inference engine: memory fragmentation limits batch size to 16, requires 4 GPUs to handle load. vLLM with PagedAttention: memory utilization 95%, batch size 48, handles load on 2 GPUs - 50% cost reduction. For long sequences (32k tokens), fragmentation worse in traditional systems - vLLM advantage grows. This is why vLLM adoption exploded - it's essentially free performance through smarter memory management.
QUESTION 12
How does prompt caching work and when does it save cost?
DEFINITION:
Prompt caching stores and reuses the KV cache of repeated prompt prefixes across multiple requests, avoiding redundant computation of the same prefix. When many requests share common prefixes (system prompts, few-shot examples, conversation history), caching can dramatically reduce latency and cost by eliminating repeated prefill computation.
HOW IT WORKS:
When a request arrives, the system checks if the prompt prefix matches any cached entry. If so, it retrieves the cached KV cache for that prefix and only computes attention for the new suffix. The cache stores key-value vectors for each layer and head, exactly as computed during prefill. Cache entries have timestamps and are evicted via LRU when memory fills. Advanced systems implement hierarchical caching: system prompts cached permanently, conversation history cached per session. PagedAttention enables efficient cache management through block-based storage.
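A toy prefix lookup illustrates the mechanism. Real systems hash block-aligned prefixes rather than scanning every possible length, and store actual KV tensors rather than a placeholder string:

```python
class PrefixCache:
    """Toy prompt cache: stores 'KV state' keyed by token prefix; lookup
    returns the longest cached prefix so only the suffix needs prefill."""
    def __init__(self):
        self.store = {}                      # tuple(prefix tokens) -> kv state

    def put(self, tokens, kv_state):
        self.store[tuple(tokens)] = kv_state

    def longest_prefix(self, tokens):
        for end in range(len(tokens), 0, -1):
            hit = self.store.get(tuple(tokens[:end]))
            if hit is not None:
                return end, hit              # tokens[end:] still needs prefill
        return 0, None

cache = PrefixCache()
system_prompt = [101, 102, 103, 104]
cache.put(system_prompt, kv_state="kv-for-system-prompt")

request = system_prompt + [7, 8, 9]          # shared prefix + new user message
cached_len, kv = cache.longest_prefix(request)
to_prefill = len(request) - cached_len       # only the 3 new tokens
```

Because attention is causal, the KV entries for a prefix are identical regardless of what follows it, which is exactly why the cached state can be reused verbatim.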
WHY IT MATTERS:
Prompt caching saves significant compute in common scenarios. For chatbots, each conversation turn re-processes the entire history - with caching, only the new message needs computation. For applications with fixed system prompts, caching eliminates 90% of prefill compute. For few-shot prompting, cached examples reduce repeated work. This translates to 2-5× latency reduction and proportional cost savings. Cache hit rates of 70% are common in production, making it one of the most impactful optimizations.
EXAMPLE:
Chat application with a 2000-token conversation history and a 100-token new message. Without caching: prefill 2000+100 tokens (2100) on this turn - expensive, and the history grows every turn. With caching: after the first turn, the KV cache for the history is stored; each later turn only prefills its new tokens and reuses the cached history. For a 10-turn conversation where each turn adds 100 tokens, traditional prefill totals ≈ 25,500 tokens versus ≈ 3,000 with caching - over 8× less compute. Over millions of conversations, this saves millions in compute costs while making responses faster.
QUESTION 13
What is the prefill vs. decode phase in LLM inference?
DEFINITION:
LLM inference consists of two distinct phases: prefill (processing the input prompt to build initial KV cache) and decode (generating output tokens one by one). These phases have different computational characteristics and optimization strategies, requiring different approaches for efficient serving.
HOW IT WORKS:
Prefill phase processes the entire input prompt in one forward pass. It computes attention over all prompt tokens, generating KV cache for each position. This is compute-intensive (O(n²) attention over prompt length) but highly parallelizable. Decode phase generates tokens sequentially. Each step processes one new token, attending to all previous tokens via cached K,V. This is memory-bandwidth bound (loading KV cache) and latency-sensitive. Prefill uses dense matrix multiplications; decode uses memory-heavy attention operations. The ratio of prefill to decode cost depends on input/output lengths.
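Back-of-envelope arithmetic makes the asymmetry concrete. The 2×params-FLOPs-per-token rule, the 300 TFLOP/s effective compute, and the 2 TB/s bandwidth are rough assumptions, not measured figures:

```python
# Rough matmul FLOPs: ~2 * params per token (attention terms ignored)
params = 7e9
prompt_len, gen_len = 2000, 500

prefill_flops = 2 * params * prompt_len  # one parallel pass over the prompt
decode_flops = 2 * params * gen_len      # 500 sequential single-token passes

# Wall-clock: prefill is one compute-bound pass; decode is gen_len
# bandwidth-bound steps, each reloading all weights (14 GB at FP16)
bytes_per_step = params * 2                       # FP16 weights
decode_time_s = gen_len * bytes_per_step / 2e12   # at ~2 TB/s bandwidth
prefill_time_s = prefill_flops / 300e12           # at ~300 TFLOP/s effective
```

Under these assumptions prefill has 4× the FLOPs (28 vs 7 TFLOPs) yet takes well under a second, while decode takes seconds - the standard "prefill is compute-bound, decode is bandwidth-bound" picture.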
WHY IT MATTERS:
Optimizing each phase requires different strategies. Prefill benefits from large batches (multiple prompts processed together) and tensor parallelism. Decode benefits from KV cache, attention optimizations, and speculative decoding. Memory allocation differs: prefill needs temporary buffers for attention matrices; decode needs persistent KV cache. Systems must balance both: long prompts make prefill dominant; long generations make decode dominant. Understanding phases guides hardware selection (compute for prefill, bandwidth for decode) and batching strategies (continuous batching handles both).
EXAMPLE:
Processing a 2000-token prompt and generating 500 tokens with a 7B model. Prefill: one parallel pass over 2000 tokens, roughly 2 × 7B × 2000 ≈ 28 TFLOPs. Decode: 500 sequential steps of roughly 2 × 7B ≈ 14 GFLOPs each, about 7 TFLOPs total. Prefill dominates compute (4× more FLOPs) but decode dominates wall-clock time, because each step must load the full weights and KV cache from memory. Optimizations differ accordingly: prefill benefits from tensor parallelism (faster compute), decode from the KV cache and speculative decoding (fewer steps). The system must handle both efficiently.
QUESTION 14
What hardware (GPU/TPU) considerations matter for deploying LLMs at scale?
DEFINITION:
Deploying LLMs at scale requires careful hardware selection balancing compute capacity, memory bandwidth, interconnects, and cost. Key considerations include memory capacity (model size Ć quantization), memory bandwidth (tokens/sec), FLOPs (compute throughput), inter-GPU communication (for parallelism), and total cost of ownership.
HOW IT WORKS:
GPU selection: A100 (80GB) and H100 (80GB) are standard for large models, offering high memory bandwidth (2-3TB/s) and tensor cores. Consumer GPUs (RTX 4090 24GB) viable for smaller/quantized models at lower cost. TPUs offer high compute but require model adaptation. Memory capacity determines if model fits: 70B FP16 needs 140GB - requires multiple GPUs or quantization. Memory bandwidth determines generation speed: higher bandwidth = faster token generation. Interconnects (NVLink, InfiniBand) matter for tensor parallelism. Power and cooling affect operational costs.
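A first-pass sizing check reduces to a few lines; the 30% KV-cache headroom below is a rule of thumb for this sketch, not a vendor figure:

```python
def fits(params_b, bytes_per_param, gpu_mem_gb, kv_budget_frac=0.3):
    """Do the weights fit on one GPU with headroom left for the KV cache?"""
    weights_gb = params_b * bytes_per_param
    return weights_gb <= gpu_mem_gb * (1 - kv_budget_frac)

fp16_fits = fits(70, 2.0, 80)   # 140 GB of FP16 weights vs one 80 GB GPU
int4_fits = fits(70, 0.5, 80)   # ~35 GB after INT4 quantization
```

A 70B model in FP16 fails the check for a single 80GB GPU (hence tensor parallelism), while the INT4 version passes with room for batching - the same conclusion the prose reaches, now as a reusable calculation.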
WHY IT MATTERS:
Hardware choices directly impact serving cost and capability. Wrong choice: model doesn't fit, too slow, or unnecessarily expensive. For 7B models, RTX 4090 (24GB) often optimal - cheap, fast enough. For 70B, need A100/H100 with quantization or multiple GPUs. For 175B+, multiple H100s with NVLink essential. Cloud vs on-premises trade-offs: cloud offers flexibility but higher long-term cost; on-premises requires upfront investment but lower marginal cost at scale. Hardware selection should align with workload patterns (batch size, latency requirements).
EXAMPLE:
Comparing options for serving a 70B model. Option A: 8× A100 80GB (tensor parallelism), cost $80/hour cloud, latency 100ms, throughput 500 QPS. Option B: 2× H100 80GB with INT4 quantization, cost $60/hour, latency 150ms, throughput 300 QPS. Option C: 1× H100 with GPTQ 4-bit, cost $30/hour, latency 200ms, throughput 150 QPS. The choice depends on QPS requirements and budget. For most production workloads, Option B offers the best price-performance. Hardware decisions are multi-dimensional, requiring careful analysis of workload characteristics.
QUESTION 15
What is GGUF format and how does it enable CPU inference?
DEFINITION:
GGUF (GPT-Generated Unified Format) is a file format designed for efficient storage and execution of quantized LLMs, particularly optimized for CPU inference. It organizes model weights, quantization parameters, and metadata in a memory-mappable structure that enables fast loading and execution on consumer hardware without GPUs.
HOW IT WORKS:
GGUF stores models with advanced quantization (2-8 bits) using techniques like k-quants that adapt precision based on parameter importance. The format is designed for memory mapping - instead of loading entire model into RAM, the operating system can load pages on-demand as needed. It includes metadata about architecture, tokenizer, and quantization parameters. Inference engines like llama.cpp read GGUF files and execute optimized CPU kernels (AVX, NEON) for matrix multiplication. This enables running models on laptops, phones, and servers without GPUs.
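The memory-mapping mechanism GGUF relies on can be demonstrated with the standard library alone: the OS faults pages in on demand, so reading one value from a large weights file does not load the whole file into RAM. The file here is a stand-in written by the example itself:

```python
import mmap
import os
import struct
import tempfile

# Write a stand-in "weights" file of 1000 float32 values
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
values = [float(i) for i in range(1000)]
with open(path, "wb") as f:
    f.write(struct.pack(f"{len(values)}f", *values))

# Memory-map it and read a single value; only the touched page is loaded
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    (val,) = struct.unpack_from("f", mm, offset=500 * 4)  # 500th float32
    mm.close()
```

llama.cpp applies the same idea at gigabyte scale: the GGUF file is mapped once, and tensor data is paged in as layers are executed, which also lets multiple processes share one copy of the weights.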
WHY IT MATTERS:
GGUF democratizes LLM access by enabling inference on commodity hardware. A 7B model quantized to 4-bit in GGUF format is ~4GB - fits in RAM of any modern laptop or phone. CPU inference runs at 5-20 tokens/s - usable for many applications. This eliminates GPU dependency, reducing cost and enabling edge deployment. GGUF's memory-mapping allows running models larger than RAM (slow but possible). The format's flexibility supports multiple quantization types, allowing users to trade quality for speed. It's become the standard for local LLM deployment through tools like Ollama, LM Studio, and GPT4All.
EXAMPLE:
A user runs LLaMA-7B on a 2020 MacBook Air: download the 4-bit GGUF file (3.8GB), launch llama.cpp, and get 8 tokens/s generation - sufficient for chat, summarization, and coding help. No GPU, no cloud costs, and the data stays private. The same model on an iPhone 15 Pro via GGUF: 5 tokens/s - usable. This accessibility enables private, offline AI applications impossible with cloud-dependent solutions. GGUF is why local LLMs are viable.
QUESTION 16
What is the impact of context window length on inference cost and latency?
DEFINITION:
Context window length directly affects inference cost and latency through quadratic attention complexity in prefill (O(n²)) and linear KV cache growth in decode (O(n)). Longer contexts dramatically increase computational requirements, memory usage, and generation time, creating fundamental trade-offs between capability and efficiency.
HOW IT WORKS:
Prefill phase: attention computation scales as O(n²) with context length n, so doubling the context quadruples the attention compute; memory for intermediate attention matrices can also grow as O(n²). Decode phase: each generated token attends to all n cached tokens, so the per-token attention cost is O(n) - longer context slows generation. The KV cache grows linearly with n, consuming GPU memory and limiting batch size. For a 32k context, the KV cache for a 7B-class model is roughly 8-16GB per request (depending on attention layout and cache precision) - significantly reducing concurrency.
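The KV cache size follows directly from the description above. This sketch assumes a generic 7B-class configuration (32 layers, 32 KV heads, head dimension 128), not any specific model:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elt=2):
    """Per-request KV cache size: a K and a V tensor for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elt * seq_len

# Hypothetical 7B-class shape in FP16 (bytes_per_elt=2).
size_gb = kv_cache_bytes(32, 32, 128, 32_768) / 2**30
print(f"{size_gb:.0f} GB for one 32k-token request")
```

With an 8-bit KV cache or grouped-query attention (fewer KV heads), the FP16 figure roughly halves, which is where ~8GB estimates for 32k contexts come from.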
WHY IT MATTERS:
Context length decisions impact everything. Prefilling a 128k context costs at least 64× more than a 2k context (with the attention term growing quadratically on top of that), and each generated token attends to a cache up to 64× larger, slowing decode accordingly. This limits practical deployment - you pay for capability you may not need. For applications processing long documents, the cost may be prohibitive. Techniques like sparse attention, sliding windows, and retrieval attempt to mitigate this, but full attention remains expensive. Practitioners must right-size context to actual needs - provisioning 128k for queries that average 1k wastes resources.
EXAMPLE:
Comparing the same model with different context limits. 2k context: prefill 10ms, generation 20ms per token, batch size 64, cost $0.001/request. 32k context: prefill 160ms, generation 320ms per token, batch size 8, cost $0.02/request. For typical chat (2k context), the 32k capability is unnecessary but costs 20× more if deployed the same way. This is why many services offer different context tiers - match cost to need. Understanding this drives efficient deployment.
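The asymptotics above can be summarized with a tiny helper showing how each cost component scales when moving from a 2k to a 32k context (illustrative only; constant factors differ per model and kernel):

```python
def cost_ratios(n_short, n_long):
    """How key inference costs scale when the context grows from n_short to n_long."""
    r = n_long / n_short
    return {
        "kv_cache_size": r,               # O(n): linear in context length
        "per_token_attention": r,         # O(n): each new token reads the whole cache
        "prefill_attention_term": r**2,   # O(n^2): every token attends to every token
    }

print(cost_ratios(2_048, 32_768))
```

A 16× longer context means 16× the KV cache and per-token attention work, but a 256× larger quadratic prefill term - which is why prefill latency and batch size degrade fastest.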
QUESTION 17
How do you choose between self-hosted inference and managed API providers?
DEFINITION:
The choice between self-hosted inference and managed API providers (OpenAI, Anthropic, etc.) involves trade-offs in cost, control, latency, privacy, and complexity. Self-hosting offers long-term cost savings and data privacy at the expense of operational overhead; APIs provide simplicity and scalability but with higher per-request costs and potential data concerns.
HOW IT WORKS:
Self-hosting requires: hardware procurement or cloud instances, model deployment (vLLM, TGI), scaling infrastructure, monitoring, and maintenance. Costs: fixed (hardware) plus variable (power, ops). APIs provide instant access, no maintenance, pay-per-token pricing, and automatic updates, but per-request costs are higher at scale. Decision factors: query volume (break-even point typically 1-10M tokens/day), latency requirements (self-hosting can be tuned), data privacy (self-hosting keeps data on-premises), customization needs (fine-tuning, specific quantization), and team expertise.
WHY IT MATTERS:
The wrong choice wastes money or creates operational headaches. At low volume (<1M tokens/day), APIs are almost always cheaper and simpler. At high volume (>10M tokens/day), self-hosting is typically 5-10× cheaper. Data-sensitive applications (healthcare, finance) may require self-hosting regardless of cost. Latency-sensitive apps may benefit from self-hosting to avoid API variability. Many companies use a hybrid: APIs for prototyping, self-hosted for production at scale.
EXAMPLE:
An application processing 100M tokens/month (≈3.3M/day). API cost: GPT-4 at $30/1M tokens = $3,000/month. Self-hosted: 2× A100 GPUs ($3/GPU-hour in the cloud = $4,320/month) or on-premises hardware amortized at $2,000/month plus ops. Break-even is around 5-10M tokens/day. At this volume, on-premises self-hosting saves $1,000+/month. But if the team lacks ML engineers, the API premium may be worth it. A healthcare startup with patient data: self-hosting is mandatory regardless of cost. The decision requires total cost analysis including team resources.
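The break-even arithmetic in the example is easy to reproduce; the figures below are the illustrative ones from the text, not real price quotes:

```python
def monthly_api_cost(tokens_per_month, usd_per_million_tokens):
    """Pay-per-token API pricing."""
    return tokens_per_month / 1e6 * usd_per_million_tokens

def monthly_gpu_cost(n_gpus, usd_per_gpu_hour, hours_per_month=720):
    """Always-on rented GPU capacity (30 days x 24 hours)."""
    return n_gpus * usd_per_gpu_hour * hours_per_month

api = monthly_api_cost(100e6, 30.0)   # API at $30 per 1M tokens
cloud = monthly_gpu_cost(2, 3.0)      # 2x A100 rented at $3/GPU-hour
onprem = 2000.0                        # amortized on-prem hardware, ops excluded
print(f"API ${api:,.0f}  cloud ${cloud:,.0f}  on-prem ${onprem:,.0f} per month")
```

Here the API beats renting cloud GPUs but loses to amortized on-premises hardware, matching the example's conclusion that break-even depends heavily on how the self-hosted capacity is priced and utilized.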
QUESTION 18
What is structured output generation (constrained decoding) and how is it implemented?
DEFINITION:
Structured output generation, or constrained decoding, ensures that LLM outputs follow a specific format like JSON, XML, or a particular schema by constraining the token generation process. Instead of hoping the model follows instructions, it actively prevents invalid tokens, guaranteeing syntactically correct outputs essential for production systems.
HOW IT WORKS:
Several implementation approaches: 1) Grammar-based decoding: Define a context-free grammar (e.g., JSON schema) and at each step, compute which tokens are valid given the grammar and already generated tokens. Sample only from allowed tokens. 2) Regex constraints: Use finite-state machines to enforce regex patterns. 3) Parsing-based: Generate, parse, regenerate if invalid (inefficient). 4) Fine-tuning: Train model to output structured formats reliably. Libraries like LMQL, Guidance, Outlines implement these by integrating constraints into the decoding loop, modifying logits before sampling. For JSON, the grammar ensures proper braces, quotes, and key-value structure.
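A toy version of the logit-masking loop (approach 1) can be written in a few lines. This sketch hand-writes the allowed-token table for a tiny JSON fragment; the vocabulary and state names are invented for illustration, and libraries like Outlines or Guidance derive such masks automatically from a schema, grammar, or regex:

```python
import math
import random

# Invented 5-token vocabulary for a tiny JSON fragment.
VOCAB = ["{", "}", '"name"', ":", '"John"']
ALLOWED_AFTER = {            # current state -> indices of legal next tokens
    "start":   [0],          # output must open with '{'
    "{":       [1, 2],       # close immediately, or emit a schema key
    '"name"':  [3],          # a key must be followed by ':'
    ":":       [4],          # then a value of the right type
    '"John"':  [1],          # then the object must close
}

def constrained_pick(logits, state):
    """Mask illegal tokens to -inf, then pick greedily among the rest."""
    allowed = set(ALLOWED_AFTER[state])
    masked = [x if i in allowed else -math.inf for i, x in enumerate(logits)]
    return max(range(len(masked)), key=masked.__getitem__)

state, out = "start", []
while state != "}":
    fake_logits = [random.gauss(0.0, 1.0) for _ in VOCAB]  # stands in for the model
    idx = constrained_pick(fake_logits, state)
    out.append(VOCAB[idx])
    state = VOCAB[idx]

print("".join(out))  # always syntactically valid JSON
```

However random the (fake) logits are, the output can only ever be a well-formed object - that guarantee, not better prompting, is the point of constrained decoding. A real sampler would apply softmax and sample rather than pick greedily.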
WHY IT MATTERS:
Production systems need reliable, parseable outputs. Unconstrained LLMs frequently produce malformed JSON, missing fields, or extra text, causing downstream failures. Constrained decoding guarantees validity, eliminating parsing errors and enabling safe integration with type-safe systems. It also improves reliability for agent tool calls, API responses, and data extraction. The slight latency cost (10-20% slower) is worth the reliability gain. This technique is essential for moving from chat prototypes to production applications.
EXAMPLE:
Extracting structured data from resumes: we need JSON with 'name', 'experience', and 'skills' fields. An unconstrained model might output "The person's name is John..." or valid JSON with missing fields. Constrained decoding with a JSON schema: at each step, only tokens that maintain valid JSON structure are allowed. After '{', only keys from the schema or '}' are allowed. After a key, only ':' is allowed. After ':', only valid value tokens for that field's type. The result is guaranteed valid JSON matching the schema, so the downstream database insert never fails. This reliability is why constrained decoding is standard in production.
QUESTION 19
How would you benchmark and compare two LLM serving frameworks?
DEFINITION:
Benchmarking LLM serving frameworks (vLLM, TGI, TensorRT-LLM, etc.) requires systematic measurement of throughput, latency, scalability, and resource efficiency under realistic workloads. Proper benchmarking reveals which framework best meets application requirements, avoiding costly deployment mistakes.
HOW IT WORKS:
Benchmark methodology: 1) Define workload - request rate, prompt length distribution, output length distribution, concurrency levels. 2) Select metrics - throughput (requests/sec, tokens/sec), latency (TTFT, TBT, end-to-end), hardware utilization (GPU memory, compute), and cost per request. 3) Controlled environment - same hardware, model, quantization for fair comparison. 4) Load testing - vary concurrency from low to saturation, measure latency vs throughput curves. 5) Measure tail latency (p50, p95, p99) not just averages. 6) Test edge cases - long prompts, long generations, mixed lengths. 7) Monitor resource usage - memory fragmentation, KV cache efficiency.
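Step 5 (tail latency) is worth showing concretely, since averages hide exactly the behavior benchmarks should expose. This sketch uses only Python's standard library and synthetic latencies:

```python
import statistics

def summarize(latencies_ms):
    """p50/p95/p99 and mean from raw latency samples."""
    qs = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98],
            "mean": statistics.fmean(latencies_ms)}

# Synthetic workload: 95% of requests at 100ms, a 5% slow tail at 900ms.
samples = [100.0] * 95 + [900.0] * 5
stats = summarize(samples)
print(stats)
```

Here the mean and p95 are both 140ms while the p99 is 900ms - a service can look fine on average and still badly violate its SLO for 1% of users, which is why the methodology above insists on percentiles.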
WHY IT MATTERS:
Frameworks have different strengths. vLLM excels at throughput via PagedAttention. TGI offers a good feature set and HuggingFace integration. TensorRT-LLM provides maximum performance on NVIDIA hardware but is complex to set up. Choosing wrongly can cost 2-5× in performance or require over-provisioning. Benchmarking reveals which framework supports your concurrency needs, how latency degrades under load, memory efficiency, and ease of use. Without benchmarks, decisions based on anecdotes lead to suboptimal deployments.
EXAMPLE:
Comparing vLLM vs TGI for 7B model with 100 QPS workload. vLLM: p95 latency 800ms, throughput 120 QPS, memory usage 14GB. TGI: p95 latency 1200ms, throughput 95 QPS, memory usage 16GB. vLLM wins for this workload. But for low-latency requirement (<500ms p95), both saturate at 50 QPS - need more GPUs. Another test with long prompts (8k tokens): vLLM maintains throughput better due to PagedAttention. Without benchmarking, might choose TGI for its feature set, then fail to meet SLOs. Data-driven decisions prevent this.
QUESTION 20
What strategies would you use to reduce inference costs by 50% without significant quality loss?
DEFINITION:
Reducing inference costs by half while maintaining quality requires combining multiple optimization techniques that target different inefficiencies: quantization reduces memory and compute, batching improves utilization, caching eliminates redundant work, and model selection matches capability to need. A systematic approach yields dramatic savings.
HOW IT WORKS:
Strategy combination: 1) Quantization - INT8 typically gives a 2× memory reduction with <1% quality loss; INT4 gives 4× with 2-3% loss, often acceptable. 2) Batching optimization - continuous batching increases throughput 2-4×, reducing per-request cost. 3) Prompt caching - for repeated prefixes, eliminate 50-90% of prefill compute. 4) Model distillation - use a smaller model where quality suffices (e.g., 7B instead of 70B for many tasks). 5) Request batching - aggregate multiple user requests where latency allows. 6) Hardware right-sizing - match the GPU to the workload (T4 for light, A100 for heavy). 7) Spot/preemptible instances for batch workloads.
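Item 1 can be illustrated with the simplest form of post-training quantization - symmetric absmax round-to-nearest. GPTQ and AWQ improve on exactly this baseline; the sketch below is the naive version, not either of those algorithms:

```python
def quantize_int8(weights):
    """Naive symmetric absmax PTQ: map floats to int8 with a single scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]   # note: Python rounds half-to-even
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

w = [0.42, -1.27, 0.003, 0.9]        # toy "weights"
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
print(q, round(max_err, 6))
```

The reconstruction error is bounded by half the scale step, which is why quality loss grows as bit width (and thus the number of representable levels) shrinks - and why INT4 needs the more careful, activation-aware methods listed above.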
WHY IT MATTERS:
A 50% cost reduction at scale means millions in annual savings for large deployments. Quality must be monitored to ensure the user experience doesn't degrade. The key is layering optimizations - each contributes incremental savings, and the effects multiply. Many organizations leave 2-5× efficiency on the table by not optimizing. A systematic approach identifies the biggest opportunities first.
EXAMPLE:
Current deployment: 70B FP16 on 8× A100, $100/hour, serving 500 QPS. Target: 50% cost reduction. Steps: 1) INT4 quantization (GPTQ) - reduces to 4× A100, $50/hour, with a 2% quality drop deemed acceptable. 2) Continuous batching - doubles throughput, so 2× A100 at $25/hour serves the same 500 QPS. 3) Prompt caching - a 60% cache hit rate cuts compute ~40%, an effective 1.2× A100 at $15/hour. 4) Distillation where possible - 30% of requests use a 7B model, a further reduction to $10/hour - a 90% total reduction, far exceeding the target. Quality monitoring shows overall user satisfaction unchanged. This layered approach achieved far more than 50% savings through systematic optimization.
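The layered arithmetic above multiplies out as follows (a sketch using the example's illustrative figures; the final 10/15 factor simply encodes the example's stated endpoint for routing 30% of traffic to a 7B model, not a derived quantity):

```python
# Each step is expressed as a cost multiplier applied to the running total.
steps = [
    ("INT4 quantization (8 -> 4 GPUs)",              0.5),
    ("continuous batching (2x throughput)",          0.5),
    ("prompt caching (60% hits, ~40% less compute)", 0.6),
    ("route 30% of requests to a distilled 7B",      10 / 15),
]

cost = 100.0  # $/hour baseline: 70B FP16 on 8x A100
for name, factor in steps:
    cost *= factor
    print(f"after {name}: ${cost:.2f}/hour")

savings = 1 - cost / 100.0
print(f"total reduction: {savings:.0%}")
```

Because the factors multiply rather than add, four individually modest optimizations compound to a 90% reduction - the core argument for layering them.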