Explore topic-wise interview questions and answers.
RAG Evaluation
QUESTION 01
What are the key dimensions to evaluate in a RAG system?
DEFINITION:
Evaluating a RAG system requires assessing multiple dimensions across the pipeline: retrieval quality (does it find relevant documents?), generation quality (are answers correct and fluent?), and their interaction (does the model use retrieved context appropriately?). Each dimension requires different metrics and evaluation approaches.
HOW IT WORKS:
Key dimensions: 1) Retrieval quality - metrics like recall@k (did we find relevant docs?), precision@k (are retrieved docs relevant?), MRR (how early did the first relevant doc appear?). 2) Generation quality - answer correctness (factual accuracy), faithfulness (grounded in retrieved context), completeness (covers all aspects), fluency (well-written). 3) Context utilization - does the model actually use retrieved information? Measured by citation accuracy or by comparing answers with/without context. 4) End-to-end metrics - overall task success, user satisfaction. 5) Latency and cost - response time, token usage, compute resources. 6) Safety and bias - harmful content, fairness across groups.
WHY IT MATTERS:
A RAG system can fail in many ways: retrieval may miss relevant docs (low recall), retrieved docs may be irrelevant (low precision), model may ignore context (hallucination), or answers may be incomplete. Measuring all dimensions identifies which part needs improvement. For example, if answers are wrong but retrieval is good, the problem is generation. If retrieval is poor, generation can't help. Comprehensive evaluation guides systematic optimization.
EXAMPLE:
RAG system evaluation reveals: retrieval recall@5 0.92 (good), precision@5 0.65 (poor - many irrelevant docs). Generation faithfulness 0.88 (good when context relevant). Diagnosis: low precision causes model to sometimes get irrelevant context, hurting answers. Fix: add reranker to improve precision. After fix, precision@5 0.85, overall answer accuracy improves from 0.82 to 0.89. Without dimensional evaluation, might have blamed generation and wasted effort.
QUESTION 02
What is the RAGAS framework and what metrics does it provide?
DEFINITION:
RAGAS (Retrieval-Augmented Generation Assessment) is an evaluation framework specifically designed for RAG systems that provides automated metrics using LLMs as judges. It measures key dimensions of RAG quality: faithfulness (answer grounded in context), answer relevancy (how well answer addresses question), context relevancy (retrieved context relevance), and context recall (whether context contains all needed information).
HOW IT WORKS:
RAGAS uses LLM-based evaluation with carefully designed prompts. Metrics: 1) Faithfulness - measures if answer claims can be inferred from retrieved context. LLM extracts claims from answer, checks each against context. Score = fraction of claims supported. 2) Answer relevancy - measures how well answer addresses question. LLM generates artificial questions from answer, computes similarity with original question. 3) Context relevancy - measures if retrieved context contains relevant information. LLM extracts sentences relevant to question, computes proportion. 4) Context recall - measures if context contains all information needed to answer. LLM checks if each fact in ground truth answer appears in context. 5) Context precision - measures if retrieved context is concise (no irrelevant info). All metrics scaled 0-1.
WHY IT MATTERS:
RAGAS provides a systematic, automated way to evaluate RAG quality without requiring ground truth answers for every query. It's particularly valuable for iterative development and monitoring. Studies show RAGAS metrics correlate reasonably with human judgments (0.7-0.8). The framework enables comparing different RAG configurations, detecting regressions, and pinpointing failures (e.g., low faithfulness indicates hallucination).
EXAMPLE:
Evaluating RAG system with RAGAS on 100 queries. Results: faithfulness 0.92 (good), answer relevancy 0.88 (good), context relevancy 0.75 (moderate), context recall 0.82 (good). Analysis: context relevancy lower because retrieved chunks contain some irrelevant sentences. Action: implement context compression to filter irrelevant content. After fix, context relevancy improves to 0.88, answer relevancy to 0.92. RAGAS metrics guided targeted improvement.
QUESTION 03
What is faithfulness in RAG evaluation and how is it measured?
DEFINITION:
Faithfulness measures whether the generated answer is grounded in the retrieved context and does not contain information outside of it. It's a critical RAG metric because unfaithful answers (hallucinations) undermine trust even when they happen to be factually correct: the information came from the model's knowledge rather than the provided documents.
HOW IT WORKS:
Faithfulness evaluation typically uses LLM-as-judge: 1) Extract all claims from the generated answer (facts asserted). 2) For each claim, check if it can be inferred from the retrieved context. 3) Calculate faithfulness score = (number of supported claims) / (total claims). Claims may be partially supported (e.g., claim has multiple parts). More sophisticated approaches use NLI (natural language inference) models to check entailment. Human evaluation remains gold standard but expensive. Faithfulness is distinct from factual accuracy - a claim could be true in the world but not in context (still unfaithful).
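Steps 1-3 can be sketched in a few lines. The claim list is assumed to come from an upstream LLM extraction step not shown, and `keyword_judge` is a crude stand-in for a real NLI model or LLM-as-judge support check:

```python
def faithfulness_score(claims, context, is_supported):
    """Fraction of answer claims the judge finds supported by the context."""
    if not claims:
        return 1.0  # a claim-free answer cannot contradict the context
    supported = sum(1 for c in claims if is_supported(c, context))
    return supported / len(claims)

# Stub judge: in practice this would be an NLI model or an LLM-as-judge call.
def keyword_judge(claim, context):
    return claim.lower() in context.lower()

context = "Return policy: items can be returned within 30 days."
claims = ["items can be returned within 30 days", "shipping is free"]
print(faithfulness_score(claims, context, keyword_judge))  # 0.5
```

Swapping `keyword_judge` for an entailment model changes only the callable, not the scoring harness.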
WHY IT MATTERS:
Unfaithful answers mislead users. In enterprise RAG, answering based on model knowledge rather than provided documents violates the contract that answers come from trusted sources. For example, a customer support bot answering based on general knowledge rather than current policy could give wrong information. Faithfulness is often more important than accuracy because it ensures answers are grounded in authoritative sources. Low faithfulness indicates the model is ignoring context and hallucinating.
EXAMPLE:
Query about company policy. Retrieved context: 'Return policy: items can be returned within 30 days.' Generated answer: 'Items can be returned within 30 days, and shipping is free.' Claim extraction: ['items can be returned within 30 days', 'shipping is free']. Check first claim: supported by context. Second claim: not in context. Faithfulness score = 1/2 = 0.5. Even if shipping actually is free (true in world), answer is unfaithful because it used external knowledge. In production, this matters because policy might change, but model knowledge outdated.
QUESTION 04
What is answer relevancy in RAGAS and how is it computed?
DEFINITION:
Answer relevancy in RAGAS measures how well the generated answer addresses the user's question, regardless of whether it's correct. A highly relevant answer directly responds to the query, while an irrelevant answer may be off-topic, evasive, or miss the point. It captures whether the model understood and responded to the query intent.
HOW IT WORKS:
RAGAS computes answer relevancy using a clever indirect method: 1) Take the generated answer and prompt an LLM to generate several artificial questions that this answer would be a good response to. 2) Compute the similarity (embedding cosine similarity) between these generated questions and the original user question. 3) Answer relevancy score = average similarity. The intuition: a relevant answer should be a good response to the original question, so questions it answers well should be similar to the original. This avoids needing a ground truth answer. Score ranges 0-1, with higher better.
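The indirect method can be sketched as below. The bag-of-words `embed` function is a toy stand-in for a real sentence-embedding model, and the generated questions are assumed to come from an LLM step not shown:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def embed(text):
    """Toy stand-in for a sentence-embedding model."""
    return Counter(text.lower().split())

def answer_relevancy(original_question, generated_questions):
    """Mean similarity between LLM-generated questions and the original."""
    q = embed(original_question)
    sims = [cosine(q, embed(g)) for g in generated_questions]
    return sum(sims) / len(sims)

original = "what is the return policy for electronics"
# Questions an LLM generated back from the answer (assumed step, see text).
generated = ["what is the electronics return policy",
             "how long do I have to return electronics"]
print(round(answer_relevancy(original, generated), 2))
```

With a real embedding model only `embed` changes; the averaging logic is the same.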
WHY IT MATTERS:
Answer relevancy captures whether the model is on-topic. A model might generate fluent, grammatically correct text that completely misses the question - that's a failure. Relevancy is especially important for open-ended queries where there's no single correct answer. It complements faithfulness (grounding) and correctness (accuracy). In production, low relevancy indicates the model doesn't understand the query or is ignoring it.
EXAMPLE:
User question: 'What is the return policy for electronics?' Generated answer: 'Our electronics include smartphones, laptops, and tablets. They come with a 1-year warranty.' This answer is about electronics but doesn't address return policy. RAGAS generates questions this answer fits: 'What products are considered electronics?' 'What warranty do electronics have?' Similarity to original question is low → low answer relevancy (0.3). Good answer: 'Electronics can be returned within 30 days with receipt.' Generates questions like 'What is the electronics return policy?' High similarity → high relevancy (0.95).
QUESTION 05
What is context recall and context precision in RAG evaluation?
DEFINITION:
Context recall measures whether the retrieved context contains all the information needed to answer the question (completeness). Context precision measures whether the retrieved context is concise - containing relevant information without too much irrelevant content (noise). Together they characterize retrieval quality from an answer-centric perspective.
HOW IT WORKS:
Context recall: Given a ground truth answer (or set of facts needed), check if each fact can be found in the retrieved context. Often computed by having an LLM extract claims from ground truth, then check each claim against context. Score = fraction of claims supported. Context precision: Given retrieved context, have LLM identify sentences that are relevant to answering the question. Score = (number of relevant sentences) / (total sentences in context). Alternatively, compute information density: useful information per token. Both metrics range 0-1.
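Both computations reduce to a fraction over judged items. In the sketch below the `support` and `relevant` lambdas are stubs for the LLM/NLI checks described above; the sentences mirror the warranty example in this question:

```python
def context_recall(ground_truth_claims, context, is_supported):
    """Fraction of ground-truth claims that the retrieved context covers."""
    hits = sum(1 for c in ground_truth_claims if is_supported(c, context))
    return hits / len(ground_truth_claims)

def context_precision(context_sentences, is_relevant):
    """Fraction of context sentences judged relevant to the question."""
    hits = sum(1 for s in context_sentences if is_relevant(s))
    return hits / len(context_sentences)

# Stub judges; a real system would call an LLM or NLI model here.
support = lambda claim, ctx: claim.lower() in ctx.lower()
relevant = lambda s: "warranty" in s.lower()

sentences = ["Product X has 1-year warranty.", "Product Y has 2-year warranty.",
             "Our store is open 9-5.", "Free shipping on orders over $50."]
context = " ".join(sentences)
print(context_recall(["Product X has 1-year warranty"], context, support))  # 1.0
print(context_precision(sentences, relevant))  # 0.5
```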
WHY IT MATTERS:
High recall ensures the model has necessary information; low recall means retrieval missed key content, making good answers impossible regardless of generation. High precision ensures the model isn't distracted by irrelevant content; low precision may cause the model to ignore relevant parts or get confused. Together they diagnose retrieval issues: low recall suggests need to improve retrieval (chunking, embedding, search), low precision suggests need for reranking or compression.
EXAMPLE:
Query about product warranty. Retrieved context chunk: 'Product X has 1-year warranty. Product Y has 2-year warranty. Our store is open 9-5. Free shipping on orders over $50.' Ground truth answer requires only warranty info. Context recall: if both warranties present, recall 1.0 (good). Context precision: relevant sentences = first two (warranty info), total 4 sentences ā precision 0.5. Diagnosis: retrieval found relevant info but with noise. Fix: use context compression to extract only warranty sentences. After fix, precision improves to 1.0, answer quality likely improves.
QUESTION 06
What is context relevance and why does it matter?
DEFINITION:
Context relevance measures whether the retrieved information is pertinent to the user's question, focusing on the proportion of retrieved content that is actually useful. Unlike precision, which counts sentences, relevance often considers the degree of usefulness - some information may be tangentially related but not directly helpful.
HOW IT WORKS:
Context relevance is typically measured by having an LLM evaluate each retrieved chunk or sentence on a relevance scale (e.g., 0-3) based on how directly it helps answer the question. Alternatively, use RAGAS approach: have LLM extract sentences from context that are relevant, compute proportion. More sophisticated methods consider: 1) Direct relevance - directly answers the question. 2) Supporting relevance - provides context that helps understand the answer. 3) Irrelevant - no value. Scores can be aggregated as average relevance or weighted by position.
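Aggregating graded scores (plain average, and the position-weighted variant mentioned above) might look like this. The 0-3 grader is a stub for an LLM rubric call, and the chunk names are invented:

```python
def context_relevance(chunks, grade, max_grade=3):
    """Average graded relevance (0..max_grade), normalized to 0-1."""
    scores = [grade(c) for c in chunks]
    return sum(scores) / (len(scores) * max_grade)

def position_weighted_relevance(chunks, grade, max_grade=3):
    """Same, but earlier-ranked chunks carry more weight (1/rank)."""
    weights = [1 / (i + 1) for i in range(len(chunks))]
    total = sum(w * grade(c) for w, c in zip(weights, chunks))
    return total / (sum(weights) * max_grade)

# Stub grader on a 0-3 scale; a real grader would prompt an LLM with a rubric.
grades = {"battery life chunk": 3, "color options chunk": 0, "battery tech overview": 1}
grade = grades.get
chunks = ["battery life chunk", "color options chunk", "battery tech overview"]
print(round(context_relevance(chunks, grade), 2))            # 0.44
print(round(position_weighted_relevance(chunks, grade), 2))  # 0.61
```

Note the position-weighted score is higher here because the most relevant chunk was ranked first.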
WHY IT MATTERS:
Low context relevance means the model is wasting context window on unhelpful information, potentially missing useful content due to limited space. It also increases the chance the model will be distracted or incorporate irrelevant details. In production, improving context relevance often yields significant answer quality gains without changing the model - just by providing cleaner context. It's a key diagnostic: if relevance low, focus on retrieval improvements (reranking, better chunking, query rewriting).
EXAMPLE:
Query: 'What is the battery life of iPhone 14?' Retrieved chunks: Chunk A: 'iPhone 14 battery life up to 20 hours video playback' (highly relevant). Chunk B: 'iPhone 14 comes in blue, purple, and midnight colors' (irrelevant). Chunk C: 'Battery technology overview' (tangential). Context relevance score (if using sentence extraction): relevant sentences from A (1), irrelevant from B (1), tangential from C (1). Total relevant: 1 of 3 chunks → relevance 0.33. Model might still answer correctly if A is used, but the context window is inefficient. Better retrieval would rank A first and exclude B and C, improving relevance to 1.0.
QUESTION 07
How do you evaluate the retrieval component independently of the generation component?
DEFINITION:
Evaluating retrieval independently isolates the performance of the search system from the generation model, allowing targeted optimization. This is done by creating a test set of queries with known relevant documents (ground truth) and measuring retrieval metrics without involving the LLM.
HOW IT WORKS:
Process: 1) Create a retrieval test set: for each query, identify which documents (or chunks) in the corpus are relevant. This can be done by human annotation, using existing judgments, or automatically (e.g., treat documents that contain answer to known QA pairs as relevant). 2) Run retrieval system on each query, getting ranked list of documents. 3) Compute retrieval metrics: recall@k (fraction of relevant docs in top-k), precision@k (fraction of top-k that are relevant), MRR (reciprocal rank of first relevant), nDCG@k (graded relevance). 4) Analyze results across query types. This evaluation is independent of generation - you're measuring whether the right documents are found, regardless of what an LLM would do with them.
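The metrics in step 3 are a few lines of plain Python each (the doc IDs and relevance judgments below are invented for illustration):

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the gold-relevant docs that appear in the top-k results."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are gold-relevant."""
    return len(set(ranked[:k]) & relevant) / k

def mrr(ranked, relevant):
    """Reciprocal rank of the first relevant document (0 if none found)."""
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked, grades, k):
    """nDCG@k for graded relevance; `grades` maps doc id -> relevance grade."""
    dcg = sum(grades.get(d, 0) / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1))
    ideal = sorted(grades.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

# Toy query: system returns d3, d1, d7, ...; gold-relevant docs are d1 and d2.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2"}
print(recall_at_k(ranked, relevant, 5))     # 1.0 - both relevant docs in top-5
print(precision_at_k(ranked, relevant, 5))  # 0.4
print(mrr(ranked, relevant))                # 0.5 - first relevant at rank 2
```

Running these over the whole test set and averaging per query gives the system-level retrieval scores, with no LLM involved.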
WHY IT MATTERS:
If end-to-end RAG quality is poor, you need to know whether retrieval is the problem. Independent retrieval evaluation pinpoints this. If retrieval metrics are good but end-to-end poor, generation is the issue. If retrieval metrics poor, focus on improving retrieval first. This separation enables systematic debugging and optimization. For production, maintain a retrieval test set and monitor metrics over time.
EXAMPLE:
RAG system with 80% end-to-end accuracy. Retrieval evaluation on same queries shows recall@5 = 0.95 (excellent), precision@5 = 0.60 (moderate). Diagnosis: retrieval finds relevant docs (high recall) but also many irrelevant (low precision). The irrelevant docs are confusing the generator. Fix: add reranker to improve precision. After fix, precision@5 improves to 0.85, end-to-end accuracy to 88%. Without independent retrieval evaluation, might have tried improving recall (already good) and wasted effort.
QUESTION 08
What is answer correctness vs. answer faithfulness?
DEFINITION:
Answer correctness measures whether the generated answer is factually true according to world knowledge or a trusted reference. Answer faithfulness measures whether the answer is grounded in the retrieved context, regardless of whether it's true in the world. They are distinct concepts and both important for RAG evaluation.
HOW IT WORKS:
Correctness evaluation requires a ground truth answer or trusted knowledge source. Compare generated answer to ground truth using exact match, semantic similarity, or LLM judgment. Faithfulness evaluation requires only the retrieved context and the answer. Check if each claim in answer can be inferred from context. A claim can be correct (true in world) but unfaithful (not in context). Conversely, a claim can be faithful (in context) but incorrect if context contains wrong information. Both scenarios are problematic: unfaithful answers undermine trust in grounding; incorrect answers provide wrong information.
WHY IT MATTERS:
In enterprise RAG, both matter. Faithfulness ensures answers come from trusted company documents, not the model's potentially outdated knowledge. Correctness ensures those documents themselves are accurate. A faithful but incorrect answer means the knowledge base is wrong. A correct but unfaithful answer means the model is using its own knowledge, which may be fine for general questions but problematic for company-specific information. The distinction guides action: low faithfulness → improve grounding (prompting, retrieval quality); low correctness with high faithfulness → update the knowledge base.
EXAMPLE:
Company policy document (outdated) says 'Return window is 30 days'. Actual policy (updated) is 45 days. Query about return policy. Faithful answer: '30 days' (based on context) - faithful but incorrect. Unfaithful answer: '45 days' (model's knowledge) - correct but unfaithful. Which is better? Depends: if you prioritize grounding in company docs, faithful but wrong; if you prioritize accuracy, correct but unfaithful. Ideally, both: update knowledge base so faithful answer is correct. Evaluation distinguishing these helps decide whether to update docs or improve grounding.
QUESTION 09
How do you build a ground-truth dataset for RAG evaluation?
DEFINITION:
A ground-truth dataset for RAG evaluation consists of query-answer pairs with relevance judgments linking each query to relevant documents/chunks in the knowledge base. Building it requires careful annotation to ensure it represents real user queries and captures the information needed for both retrieval and generation evaluation.
HOW IT WORKS:
Steps: 1) Collect representative queries - sample from user logs, create synthetic queries for coverage, include edge cases. Aim for 200-1000 queries. 2) For each query, identify relevant documents/chunks in corpus. This can be done by: human annotators searching corpus, using existing relevance judgments, or treating documents that contain answer to query as relevant. 3) Create ground truth answers - have experts write ideal answers based on relevant documents, or use strong LLM with human verification. 4) Annotate relevance at chunk level (not just document) for retrieval evaluation. 5) Split into development and test sets. 6) Document annotation guidelines to ensure consistency. Quality control: multiple annotators per query, measure agreement.
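For the quality-control step (multiple annotators, measure agreement), Cohen's kappa is a common choice. A minimal sketch for binary relevance judgments; the annotator labels below are invented:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' binary relevance judgments."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label rates.
    pa = sum(labels_a) / n
    pb = sum(labels_b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)

# Two annotators judged the same 10 (query, article) pairs: relevant (1) or not (0).
ann_a = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
ann_b = [1, 1, 0, 1, 1, 1, 1, 0, 0, 0]
print(round(cohens_kappa(ann_a, ann_b), 2))  # 0.58 - moderate agreement
```

Low kappa signals that the annotation guidelines need tightening before trusting the judgments.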
WHY IT MATTERS:
A good ground-truth dataset is the foundation of reliable evaluation. Poor datasets (unrepresentative queries, missing relevance judgments, low-quality answers) lead to misleading metrics and wrong optimization decisions. Building it requires investment but pays off through faster development and better production performance. Without it, you're flying blind.
EXAMPLE:
Building dataset for customer support RAG. Collect 500 real user queries from logs. Stratify by category: returns (20%), technical issues (30%), product info (25%), account (15%), other (10%). For each, have support agents: identify relevant support articles (1-5 each), write ideal answer (2-3 sentences). Two agents per query, resolve disagreements. Result: 500 queries with 3.2 relevant articles on average, high-quality answers. This dataset enables measuring retrieval recall (did we find those articles?) and answer correctness (compare to ideal). Investment: 100 agent-hours, $5000. Pays off by enabling systematic improvement.
QUESTION 10
What is the LLM-as-judge approach in RAG evaluation and what are its biases?
DEFINITION:
LLM-as-judge uses a powerful language model (e.g., GPT-4) to evaluate RAG outputs by scoring them on dimensions like correctness, faithfulness, or relevance according to rubrics. It enables scalable, consistent evaluation but comes with inherent biases that must be understood and mitigated.
HOW IT WORKS:
Process: Define evaluation criteria with clear rubrics (e.g., 'Score 1-5 on faithfulness: 1=multiple claims not in context, 5=all claims supported'). For each (query, context, answer) triple, prompt judge LLM to output score and explanation. Use techniques to reduce bias: 1) Position bias - randomize order when comparing two answers. 2) Verbosity bias - longer answers often scored higher; control by normalizing. 3) Self-preference - judges prefer answers that match their own style; use diverse judges. 4) Calibration - compare judge scores to human judgments on sample to detect bias. 5) Chain-of-thought - have judge explain reasoning before scoring.
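The position-bias mitigation (technique 1) can be sketched as a wrapper that randomizes answer order and un-flips the verdict. `biased_judge` is a deliberately broken stub - it always prefers whichever answer appears first - standing in for a real LLM judge call:

```python
import random

def pairwise_judgment(query, answer_a, answer_b, judge, rng=random):
    """Compare two answers with a judge, randomizing order to cancel position bias."""
    flipped = rng.random() < 0.5
    first, second = (answer_b, answer_a) if flipped else (answer_a, answer_b)
    verdict = judge(query, first, second)  # judge returns "first" or "second"
    if flipped:
        verdict = "second" if verdict == "first" else "first"
    return "A" if verdict == "first" else "B"

# Stub judge with maximal position bias: always prefers the first answer shown.
biased_judge = lambda q, a1, a2: "first"

random.seed(0)
votes = [pairwise_judgment("q", "answer A", "answer B", biased_judge)
         for _ in range(1000)]
print(votes.count("A") / len(votes))  # close to 0.5: the bias washes out
```

With a genuinely better answer A, a real judge would still pick A most of the time; only the spurious position preference is averaged away.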
WHY IT MATTERS:
LLM-as-judge makes large-scale evaluation practical. Human evaluation costs $5-20 per example; LLM costs pennies. But biases can distort results. A verbose but less accurate answer may score higher than a concise correct one. A judge may consistently prefer its own style. Without bias mitigation, you may optimize for the wrong things. Understanding biases enables proper use: validate judge against humans, use multiple judges, be aware of limitations. For critical decisions, augment with human evaluation.
EXAMPLE:
Comparing two RAG answers with GPT-4 judge. Answer A: concise, correct, 50 words. Answer B: verbose, repeats information, includes fluff, 200 words. GPT-4 consistently scores B higher (verbosity bias). Human evaluation shows A preferred. Mitigation: add instruction to penalize verbosity, or use multiple judges and average. After calibration, judge aligns better. Without awareness, would wrongly conclude B better and deploy inferior system. This is why understanding judge biases is essential.
QUESTION 11
How do you detect hallucination that is not caught by standard RAG metrics?
DEFINITION:
Standard RAG metrics like faithfulness check if answer claims are in retrieved context, but hallucinations can be subtle: plausible-sounding claims that are neither in context nor obviously false, or claims that combine context elements incorrectly. Detecting these requires more sophisticated techniques.
HOW IT WORKS:
Advanced hallucination detection methods: 1) Claim decomposition - break answer into atomic claims, verify each against context using NLI models or LLM. 2) Entailment checking - use specialized NLI models (TrueTeacher, AlignScore) fine-tuned for factual consistency. 3) Self-consistency - generate multiple answers, check for contradictions. 4) Counterfactual probing - modify query slightly, see if answer changes appropriately. 5) Cross-examination - have LLM ask itself questions about the answer and verify. 6) Knowledge base contradiction - for facts not in context, check against trusted KB. 7) Human review of samples - still gold standard.
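Claim decomposition plus entailment checking (methods 1-2) might be harnessed as below. `toy_entails` is a crude word-co-occurrence stand-in for a real NLI model such as AlignScore; the Acme/Beta claims mirror the example in this question:

```python
def flag_hallucinations(claims, context, entails):
    """Return the claims the entailment checker cannot ground in the context."""
    return [c for c in claims if not entails(context, c)]

# Stub entailment check: a claim counts as grounded only if all of its words
# co-occur within a single context sentence. A real checker would be an NLI
# model (e.g., AlignScore) or an LLM judge.
def toy_entails(context, claim):
    return any(all(w in sent.lower() for w in claim.lower().split())
               for sent in context.split("."))

context = ("Acme Corp acquired Beta Inc in 2020. "
           "Beta Inc developed the Gamma product")
claims = ["beta inc developed the gamma product",
          "acme corp developed the gamma product"]
print(flag_hallucinations(claims, context, toy_entails))
# ['acme corp developed the gamma product'] - the incorrect synthesis is flagged
```

Even this crude checker flags the mis-attributed claim, because no single sentence asserts that Acme developed Gamma.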
WHY IT MATTERS:
Simple faithfulness metrics miss subtle hallucinations. An answer might correctly use context but combine facts incorrectly (e.g., 'Drug X treats condition Y' from separate sentences 'Drug X is approved' and 'Condition Y affects millions' - incorrect synthesis). Or it might add plausible but unsupported details. These subtle hallucinations erode trust gradually but dangerously. Advanced detection catches them, enabling improvement.
EXAMPLE:
Context: 'Acme Corp acquired Beta Inc in 2020. Beta Inc developed the Gamma product.' Answer: 'Acme Corp developed the Gamma product after acquiring Beta Inc in 2020.' Simple faithfulness: all facts in context (acquisition, Gamma, dates). But answer incorrectly attributes development to Acme (Beta developed it). This subtle hallucination requires inference: from acquisition, did development transfer? Not stated. Advanced detection using NLI would flag that 'Acme developed Gamma' is not entailed by context. Without this, hallucination goes undetected.
QUESTION 12
What is end-to-end RAG evaluation vs. component-level evaluation?
DEFINITION:
End-to-end evaluation measures the final output quality of the complete RAG system - the answers users actually see. Component-level evaluation measures individual parts (retrieval, reranking, generation) in isolation. Both are necessary: end-to-end tells you if the system is working overall; component-level tells you why and where to fix it.
HOW IT WORKS:
End-to-end evaluation: take queries, run through full RAG pipeline, evaluate final answers on metrics like correctness, faithfulness, relevance. Use ground truth answers, human evaluation, or LLM-as-judge. Tells you overall performance but not which component caused failures. Component-level evaluation: 1) Retrieval - measure recall/precision on ground-truth relevant docs. 2) Reranking - measure nDCG improvement over raw retrieval. 3) Generation - measure faithfulness given fixed (ideal) context. 4) Each component evaluated independently with appropriate test sets.
WHY IT MATTERS:
End-to-end metrics tell you if you're winning; component-level tells you why you're losing. If end-to-end accuracy drops, component-level reveals whether retrieval got worse (recall drop), generation got worse (faithfulness drop with same context), or both. This targeted diagnosis enables efficient fixes. Without component-level, you're guessing. Many teams do both: continuous end-to-end monitoring for overall health, component-level analysis when investigating issues.
EXAMPLE:
RAG system accuracy drops from 90% to 85%. End-to-end shows drop. Component analysis: retrieval recall@5 unchanged (0.92), precision@5 unchanged (0.75), but generation faithfulness on fixed context drops from 0.95 to 0.88. Diagnosis: generation model changed (maybe updated version) and now ignores context more. Fix: adjust generation prompt or rollback model. Without component-level, might have wasted time optimizing retrieval that wasn't the problem.
QUESTION 13
What is TruLens and how does it support RAG evaluation?
DEFINITION:
TruLens is an open-source evaluation and tracking library for LLM applications, including RAG systems. It provides a framework for defining and computing feedback functions (metrics) that assess quality, and tools for experimenting, tracking, and comparing different RAG configurations.
HOW IT WORKS:
TruLens core concepts: 1) Feedback functions - modular metrics that evaluate inputs, outputs, or intermediate results. Can use LLM-as-judge, NLI models, or heuristics. Built-in functions for groundedness (faithfulness), answer relevance, context relevance, and more. 2) Recording - tracks application runs, storing inputs, outputs, intermediate steps, and feedback scores. 3) Experimentation - compare different RAG versions (different retrievers, prompts, models) side-by-side. 4) Visualization - dashboard showing performance over time, comparisons. 5) Integration - works with LangChain, LlamaIndex, custom apps. Users define their own feedback functions for custom metrics.
WHY IT MATTERS:
TruLens systematizes RAG evaluation, moving from ad-hoc scripts to structured, reproducible measurement. It enables: 1) Continuous evaluation - track metrics as you develop. 2) Comparison - objectively compare RAG variants. 3) Debugging - see which component caused failures. 4) Standardization - use community-vetted metrics. For teams building RAG, TruLens provides the infrastructure for data-driven optimization.
EXAMPLE:
Developer building RAG system uses TruLens to compare two retrievers. Records 100 queries through both systems, computing groundedness, answer relevance, context relevance. Dashboard shows Retriever A has higher groundedness (0.92 vs 0.88) but lower answer relevance (0.85 vs 0.90). Drills down to see Retriever A finds more precise chunks (higher groundedness) but sometimes misses broad context (lower relevance). Chooses Retriever A and adds query expansion to improve recall. Without TruLens, would have had to manually compute metrics or guess.
QUESTION 14
What is the RAG Triad and what three checks does it involve?
DEFINITION:
The RAG Triad, popularized by TruEra, consists of three fundamental checks for RAG quality: context relevance (are retrieved documents relevant?), groundedness (is the answer supported by context?), and answer relevance (does the answer address the question?). Together they provide a minimal but comprehensive quality assessment.
HOW IT WORKS:
The three checks: 1) Context relevance - measures whether retrieved information is pertinent to the query. Low relevance indicates retrieval failure. 2) Groundedness (faithfulness) - measures whether answer claims are supported by context. Low groundedness indicates hallucination or poor context use. 3) Answer relevance - measures whether answer actually answers the question. Low relevance indicates the model misunderstood or went off-topic. These three metrics capture the main failure modes: wrong context, ignoring context, or wrong answer. Each can be computed with LLM-as-judge using appropriate prompts.
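The triage logic the triad enables can be sketched directly; the 0.7 threshold and advice strings are illustrative, not standard values:

```python
def triad_triage(context_relevance, groundedness, answer_relevance, threshold=0.7):
    """Map the lowest failing RAG Triad score to the component to fix."""
    checks = [
        (context_relevance, "retrieval: improve search, chunking, or reranking"),
        (groundedness, "generation: strengthen grounding instructions in prompt"),
        (answer_relevance, "query handling: improve query understanding"),
    ]
    failing = [(score, advice) for score, advice in checks if score < threshold]
    if not failing:
        return "all checks pass"
    return min(failing)[1]  # worst-scoring check first

# Scores from the example below: good context, poor groundedness.
print(triad_triage(0.95, 0.60, 0.85))
# -> generation: strengthen grounding instructions in prompt
```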
WHY IT MATTERS:
The RAG Triad provides a minimal set of metrics that diagnose the most common RAG failures. If all three are high, the system is likely working well. If any is low, you know where to focus: context relevance low → improve retrieval; groundedness low → improve prompting or generation; answer relevance low → improve query understanding or generation. This triage enables rapid iteration. The triad has become a standard starting point for RAG evaluation.
EXAMPLE:
RAG system scores: context relevance 0.95 (high), groundedness 0.60 (low), answer relevance 0.85 (moderate). Diagnosis: retrieval finds good context (high relevance) but model often ignores it (low groundedness), occasionally going off-topic (moderate answer relevance). Fix: strengthen prompt instructions to use only context, add citation requirement. After fix, groundedness improves to 0.90, answer relevance to 0.92. Triad guided focus to generation issue, not retrieval.
QUESTION 15
How do you handle evaluation for multi-turn RAG conversations?
DEFINITION:
Evaluating multi-turn RAG conversations adds complexity beyond single-turn evaluation because each turn depends on conversation history, and overall session quality matters beyond individual turns. Metrics must assess turn-level quality, context maintenance, and overall session success.
HOW IT WORKS:
Multi-turn evaluation approaches: 1) Turn-level metrics - evaluate each response individually on relevance, groundedness, etc., but with context including conversation history. 2) Context maintenance - does the model remember and correctly use information from earlier turns? Test by asking follow-ups that require recall. 3) Coherence - do responses flow naturally and maintain consistency? 4) Session success - did the conversation achieve user's goal? (e.g., issue resolved). 5) Efficiency - number of turns to resolution. 6) Human evaluation of full conversations - gold standard but expensive. 7) Simulated users - use LLM to play user role, evaluate interaction quality.
WHY IT MATTERS:
Multi-turn conversations are the norm in production (chatbots, support). Single-turn metrics miss failures that accumulate: the model might forget earlier context, contradict itself, or fail to resolve issues over multiple turns. Evaluating multi-turn requires different methodologies and is more complex but essential for conversational applications.
EXAMPLE:
Customer support conversation: Turn 1: 'My phone won't turn on.' System: 'Try holding power button for 10 seconds.' Turn 2: 'Still not working.' System: 'Have you tried charging it?' (forgets previous attempt). This is context maintenance failure. Turn-level evaluation of turn 2 alone might score relevance high (addressing problem), but misses that it ignored history. Multi-turn evaluation would catch this by checking if system used information from turn 1. Session success: if issue eventually resolved, counts; if not, failure. Multi-turn metrics provide complete picture.
QUESTION 16
What is citation accuracy and why is it important for RAG systems?
DEFINITION:
Citation accuracy measures whether the sources cited by a RAG system actually support the claims they're attached to. It goes beyond faithfulness by requiring explicit source attribution and verifying that each claim's cited source indeed contains that information. High citation accuracy enables verifiability and trust.
āļø HOW IT WORKS:
Evaluation process: 1) Parse answer to identify claims and their associated citations (e.g., [1], [DocA]). 2) For each claim-citation pair, check if the cited document/chunk contains the claim. This can be done via: exact match, semantic similarity, NLI, or LLM judgment. 3) Calculate citation accuracy = (number of supported claim-citation pairs) / (total claim-citation pairs). Consider also citation recall: are all claims that need citations actually cited? Citation hallucinations (citing non-existent sources) are particularly harmful.
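The calculation in step 3 can be sketched as follows. The support check here is plain substring matching for brevity; a real evaluator would substitute semantic similarity, NLI, or an LLM judge as the text notes. Document IDs and texts are invented for illustration.

```python
# Sketch of citation accuracy: supported claim-citation pairs / total pairs.
# `pairs` holds (claim, cited_doc_id) tuples parsed from the answer;
# `docs` maps doc IDs to their text.

def citation_accuracy(pairs: list[tuple[str, str]], docs: dict[str, str]) -> float:
    """Fraction of claim-citation pairs where the cited doc contains the claim."""
    if not pairs:
        return 0.0
    supported = sum(1 for claim, doc_id in pairs
                    if claim.lower() in docs.get(doc_id, "").lower())
    return supported / len(pairs)

docs = {"Doc1": "We offer a 30-day return policy on all items.",
        "Doc2": "Shipping policy: orders ship within 2 business days."}
# The answer "The return policy is 30 days [1][2]" yields two pairs:
pairs = [("30-day return policy", "Doc1"), ("30-day return policy", "Doc2")]
print(citation_accuracy(pairs, docs))  # 0.5 -> one of the two citations is unsupported
```

Note that `docs.get(doc_id, "")` also handles citation hallucinations: a citation to a non-existent source scores as unsupported rather than crashing.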
š” WHY IT MATTERS:
Citations enable users to verify information, building trust. In enterprise, citations are often required for compliance and auditability. A RAG system may be faithful (claims in context) but cite wrong sources, undermining trust. High citation accuracy ensures users can check sources and rely on the system. It also deters hallucination by forcing attribution. For legal, medical, and research applications, citation accuracy is as important as answer correctness.
š EXAMPLE:
Answer: 'The return policy is 30 days [1][2].' Claim: 'return policy is 30 days'. Check Doc1: contains '30-day return policy' → supported. Doc2: contains 'shipping policy' only → not supported. Citation accuracy = 1/2 = 0.5 (one incorrect citation). Even though the claim is true and present in context (Doc1), the incorrect citation to Doc2 misleads the user about where to verify it. A user checking Doc2 won't find the policy, eroding trust. Fix: ensure the model cites only sources that actually contain the claim. High citation accuracy prevents this.
QUESTION 17
How do you monitor RAG quality over time in production?
š DEFINITION:
Monitoring RAG quality in production requires continuous measurement of key metrics on live traffic, with alerting for degradation and dashboards for trends. Unlike offline evaluation with fixed test sets, production monitoring must handle distribution shift and detect issues before they impact users significantly.
āļø HOW IT WORKS:
Components: 1) Online metrics - sample production traffic (e.g., 10% of queries) and run LLM-as-judge evaluations for faithfulness, relevance, etc. Store results with timestamps. 2) User feedback - collect explicit (thumbs up/down) and implicit (follow-up rate, session length) signals. 3) Golden dataset - run a fixed test set periodically (e.g., daily) to detect regressions independent of traffic changes. 4) Drift detection - monitor input query distribution and retrieval score distributions for shifts. 5) Alerting - set thresholds on key metrics (e.g., an alert fires if faithfulness stays below 0.85 for an hour). 6) Dashboard - visualize trends over time, compare across model versions, slice by query type.
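Components 1 and 5 combine into a simple sampling-and-alerting loop, sketched below. The judge call is stubbed out, and the metric name, sample rate, and threshold are illustrative; a real system would persist scores with timestamps and route the alert to an on-call channel rather than printing.

```python
# Sketch of a production monitoring loop: judge a sample of live queries,
# keep a rolling window of scores, and alert when the average drops below
# a floor. `judge_faithfulness` is a placeholder for an LLM-as-judge call.
import random
from collections import deque

SAMPLE_RATE = 0.10          # evaluate ~10% of production queries
FAITHFULNESS_FLOOR = 0.85   # alert threshold
WINDOW = 100                # rolling window of judged queries

scores = deque(maxlen=WINDOW)

def judge_faithfulness(query: str, answer: str, context: str) -> float:
    """Placeholder for an LLM-as-judge call returning a 0-1 score."""
    return 0.9  # stub value for the sketch

def on_query(query: str, answer: str, context: str) -> None:
    """Hook called for every production query."""
    if random.random() < SAMPLE_RATE:
        scores.append(judge_faithfulness(query, answer, context))
        if len(scores) == WINDOW and sum(scores) / WINDOW < FAITHFULNESS_FLOOR:
            print("ALERT: rolling faithfulness below floor")  # wire to alerting here
```

The rolling window trades sensitivity for stability: a larger `WINDOW` smooths noise from individual judgments but delays detection of a genuine drop.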
š” WHY IT MATTERS:
RAG systems degrade over time due to: knowledge base changes, user query drift, model updates, or external factors. Without monitoring, degradation goes unnoticed until user complaints spike. Proactive monitoring catches issues early, often before users notice. It also provides data for continuous improvement and for evaluating changes.
š EXAMPLE:
E-commerce RAG dashboard shows faithfulness dropping from 0.92 to 0.88 over 2 days. Alert triggered. Investigation reveals new product descriptions added with different format; retrieval finds them but model doesn't ground answers well. Fix: update prompt to handle new format. Faithfulness recovers. Without monitoring, would have lost user trust for 2+ weeks. Also tracks: answer relevance stable (0.90), user satisfaction down 5% (correlates with faithfulness drop). Monitoring enables rapid response.
QUESTION 18
What is a regression test suite for a RAG system and how do you build one?
š DEFINITION:
A regression test suite for RAG is a collection of test cases designed to catch when changes to the system (retrieval updates, prompt changes, model upgrades) degrade performance on critical queries. It ensures that improvements don't accidentally break existing functionality.
āļø HOW IT WORKS:
Building process: 1) Identify critical query types - from user logs, pick queries that are frequent, business-critical, or historically problematic. 2) Create test cases - for each query, define: input, expected behavior (answer should contain X, should cite sources, should not contain Y), and retrieval relevance judgments. 3) Automate evaluation - run test suite automatically on each candidate change, compute pass/fail rates. 4) Version control - store test suite alongside code, update as system evolves. 5) Thresholds - define minimum pass rate (e.g., 95%) for deployment. 6) Continuous integration - run tests in CI pipeline before deployment.
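Steps 2 and 3 above can be sketched as data-driven test cases plus a pass-rate gate. The `rag_answer` function stands in for the real system under test, and the queries, expected substrings, and 95% threshold are illustrative values mirroring the text.

```python
# Sketch of a RAG regression suite: each case defines a query, strings the
# answer must contain, and strings it must not contain. CI runs the suite
# on every candidate change and gates deployment on the pass rate.

TEST_CASES = [
    {"query": "What is the return window?",
     "must_contain": ["30 days"], "must_not_contain": ["no returns"]},
    {"query": "Do you ship internationally?",
     "must_contain": ["international"], "must_not_contain": []},
]

def rag_answer(query: str) -> str:
    """Placeholder for the system under test."""
    return "Our return window is 30 days. We ship internationally."

def run_suite(threshold: float = 0.95) -> bool:
    """Return True if the pass rate meets the deployment threshold."""
    passed = 0
    for case in TEST_CASES:
        answer = rag_answer(case["query"])
        ok = (all(s in answer for s in case["must_contain"])
              and not any(s in answer for s in case["must_not_contain"]))
        passed += ok
    return passed / len(TEST_CASES) >= threshold
```

Keeping cases as data (rather than hand-written asserts) makes it easy to grow the suite from user logs and to version it alongside the code, as steps 1 and 4 suggest.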
š” WHY IT MATTERS:
Without regression tests, improvements often break edge cases. A change that improves average performance might break queries about rare products or specific policies. Regression tests catch these, ensuring system reliability. They're especially important for production systems where failures have business impact. Building a suite is an investment in quality assurance.
š EXAMPLE:
RAG system regression suite includes 200 test cases: 100 frequent queries, 50 edge cases (complex queries, rare products), 50 historically problematic. CI pipeline runs suite on each PR. PR to improve retrieval for product specs passes 198/200 tests; fails on two policy queries where new retrieval ranks policy docs lower. Developer adjusts ranking, passes all tests before merge. Without suite, would have deployed and broken policy queries, causing support issues. Suite caught regression before production.
QUESTION 19
How do you handle evaluation when the ground truth changes over time?
š DEFINITION:
When ground truth (correct answers) changes over time due to knowledge base updates, policy changes, or new information, evaluation datasets must evolve to remain relevant. Static evaluation becomes misleading as old ground truth becomes incorrect. This requires versioned evaluation and strategies for handling temporal dynamics.
āļø HOW IT WORKS:
Approaches: 1) Versioned golden datasets - maintain dataset versions aligned with knowledge base versions. When knowledge updates, create new golden set with updated ground truth. Compare models against appropriate version. 2) Temporal splitting - in evaluation, note date of each query; only consider knowledge available as of that date as ground truth. 3) Continuous ground truth updates - when knowledge changes, automatically update affected golden examples (e.g., via LLM regeneration with new context). 4) Human-in-loop review - periodically review golden set for stale examples, update manually. 5) Live evaluation - use production traffic with implicit feedback, which naturally reflects current ground truth.
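Approach 1 (versioned golden datasets) can be sketched as golden sets keyed to knowledge-base versions. The names, queries, and the 30-to-45-day policy change are illustrative, echoing the example in this section; the correctness check is simple substring matching for brevity.

```python
# Sketch of versioned golden sets: each knowledge-base version gets its own
# query -> expected-answer mapping, and models are scored against the set
# matching the KB version they were built on.

GOLDEN_SETS = {
    "kb_v1": {"What is the return policy?": "30 days"},
    "kb_v2": {"What is the return policy?": "45 days"},  # policy updated
}

def evaluate(answers: dict[str, str], kb_version: str) -> float:
    """Fraction of golden queries the model answers correctly, where
    'correct' means the expected string appears in the model's answer."""
    golden = GOLDEN_SETS[kb_version]
    correct = sum(1 for query, expected in golden.items()
                  if expected in answers.get(query, ""))
    return correct / len(golden)

new_model = {"What is the return policy?": "Returns are accepted for 45 days."}
print(evaluate(new_model, "kb_v1"))  # 0.0 against the stale v1 set
print(evaluate(new_model, "kb_v2"))  # 1.0 against the current v2 set
```

The same model scores 0% or 100% depending on which golden version is used, which is exactly the misleading-conclusion failure the text warns about: always pair a model with the golden set for its knowledge-base version.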
š” WHY IT MATTERS:
Using outdated ground truth leads to wrong conclusions. A model that correctly gives new policy information might be penalized by old golden set, appearing worse than it is. Conversely, a model using old policy might score well on stale test but fail in production. Evaluation must reflect current reality to guide decisions. Versioning and temporal awareness are essential for long-lived systems.
š EXAMPLE:
Return policy changes from 30 days to 45 days. Old golden set has 'return policy is 30 days' as correct. New model trained on updated docs answers '45 days'. Evaluation against old set scores it 0% on those queries - misleading. Solution: create v2 golden set with updated answers. Compare old model (scores 0% on v2) vs new model (100% on v2). Correct conclusion: new model better for current policy. Without versioning, might wrongly reject new model. Versioned evaluation prevents this.
QUESTION 20
How would you report RAG system quality to a non-technical stakeholder?
š DEFINITION:
Reporting RAG quality to non-technical stakeholders requires translating technical metrics into business outcomes: user satisfaction, task completion rates, cost savings, and risk reduction. Avoid jargon; focus on what matters to the business and how the system is performing against goals.
āļø HOW IT WORKS:
Key translation: 1) Accuracy ā 'Answers are correct 92% of the time, meaning users get right information in 9 out of 10 queries.' 2) Hallucination rate ā 'Only 3% of answers contain made-up information, down from 8% last quarter.' 3) Retrieval quality ā 'We find the right documents 95% of the time, ensuring answers are based on our actual data.' 4) Business metrics: 'Self-service resolution rate increased 15%, saving $500k in support costs.' 5) Trends: 'Quality has improved 5% this quarter.' 6) Comparisons: 'Our system outperforms the previous version by 10% on user satisfaction.' Use visuals: dashboards with trend lines, simple charts.
š” WHY IT MATTERS:
Non-technical stakeholders (executives, product managers) make decisions based on business impact. If you report recall@5=0.92, they don't know if that's good or what it means. If you report 'users find answers 92% of the time, leading to 15% fewer support tickets', they understand value. Good reporting aligns technical work with business goals and secures continued investment.
š EXAMPLE:
Quarterly report to VP of Product: 'Our customer support RAG system now handles 65% of queries automatically, up from 55% last quarter. Accuracy is 94% (vs 90%), meaning customers get correct answers more often. This has reduced average handle time by 20% and saved $200k in support costs. The main improvement came from better retrieval of policy documents. Next quarter we're targeting 70% automation. Here's a chart showing steady improvement.' This resonates. Compare to: 'Recall@5 improved from 0.88 to 0.92.' Which gets funding? The former.