Question 1

What are the main differences between naive RAG and advanced RAG?

Accepted Answer

🔍 DEFINITION: Naive RAG is the basic implementation: chunk documents, embed them, retrieve top-k chunks based on query similarity, and stuff them into a prompt for generation. Advanced RAG incorporates multiple optimizations across the pipeline - pre-retrieval (query rewriting, expansion), retrieval (hybrid search, reranking), and post-retrieval (context compression, reordering) - to improve accuracy, relevance, and reliability.

⚙️ HOW IT WORKS: Naive RAG pipeline: simple chunking → single-vector retrieval → top-k concatenation → generation. Advanced RAG adds: Pre-retrieval: query rewriting (expanding acronyms, correcting spelling), query expansion (generating multiple query variants), HyDE (hypothetical document generation). Retrieval: hybrid search (combining vector + keyword), metadata filtering, multi-stage retrieval (retrieve more, then rerank with cross-encoder). Post-retrieval: context compression (extracting relevant sentences), reordering chunks (most relevant at edges to combat lost-in-middle), iterative retrieval (multi-hop for complex questions).
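
The chunk-reordering step mentioned above can be sketched in a few lines. This is a minimal illustration, assuming the chunks arrive already sorted best-first; the function name `reorder_for_edges` is made up for this example.

```python
def reorder_for_edges(chunks_by_relevance):
    """Place the most relevant chunks at the start and end of the context.

    `chunks_by_relevance` is sorted best-first. Items are dealt alternately
    to the front and the back, so the weakest chunks land in the middle.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]


ranked = ["c1", "c2", "c3", "c4", "c5"]  # c1 is the most relevant
reordered = reorder_for_edges(ranked)
print(reordered)  # ['c1', 'c3', 'c5', 'c4', 'c2']
```

The two strongest chunks (c1, c2) end up first and last, and the weakest (c5) sits in the middle, where models are said to attend least.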

💡 WHY IT MATTERS: Naive RAG works for simple queries but often fails on complex ones, specialized domains, or large knowledge bases. Advanced RAG techniques can improve accuracy by 10-30% by addressing specific failure modes, though the gain depends heavily on the dataset and the baseline. Query rewriting helps when user queries are poorly phrased. Reranking puts the most relevant chunks in front of the model. Context compression fits more information into a limited context window. The right techniques depend on your data and query types - not all are needed for every application, but understanding them enables systematic improvement.

📋 EXAMPLE: Medical RAG with naive vs advanced. Naive: user query 'HTN treatment' retrieves chunks with 'HTN' but misses those with 'hypertension'. Fails. Advanced with query expansion: expands 'HTN' to 'hypertension' and 'high blood pressure', retrieves more relevant results. Reranking with cross-encoder ensures treatment guidelines prioritized over general information. Context compression extracts only treatment sentences, fitting more guidelines in context. Result: comprehensive treatment recommendations vs partial or missed information. The 20% accuracy improvement justifies additional complexity.

Question 2

What is HyDE (Hypothetical Document Embeddings) and how does it improve retrieval?

Accepted Answer

🔍 DEFINITION: HyDE (Hypothetical Document Embeddings) is a technique that improves retrieval by first using an LLM to generate a hypothetical document that would answer the query, then using that generated document's embedding for retrieval instead of the original query. This bridges the lexical gap between short queries and relevant documents by expanding the query into document-like text.

⚙️ HOW IT WORKS: Process: 1) User query is sent to an LLM with instruction: 'Generate a passage that answers this query.' 2) LLM generates a hypothetical document containing relevant information, even if not entirely accurate. 3) This hypothetical document is embedded using the same embedding model used for the document corpus. 4) The hypothetical document's embedding is used to retrieve actual documents via similarity search. The intuition: the hypothetical document is closer in length, style, and vocabulary to relevant documents than the original query is, so its embedding lands nearer to theirs. This helps most when the embedding model was trained without query-document relevance labels (the zero-shot setting HyDE was proposed for); encoders already fine-tuned on query-document pairs may benefit less. Factual errors in the generated passage usually matter little, because it is used only for matching and is never shown to the user.
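
A toy sketch of the four steps above, under loud assumptions: `embed` is a bag-of-words counter standing in for a real dense embedding model, `fake_llm` is a hypothetical stub that returns a canned passage instead of calling an LLM, and the two-document corpus is invented for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a real system would use a dense model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fake_llm(query):
    """Hypothetical stand-in for an LLM call that drafts an answer passage."""
    return ("To fix a leaky faucet, first turn off water supply. Then disassemble "
            "handle, remove old washer, replace with new washer of same size.")


def retrieve(query_text, corpus):
    """Return the single corpus document most similar to `query_text`."""
    q = embed(query_text)
    return max(corpus, key=lambda doc: cosine(q, embed(doc)))


corpus = [
    "Stop a dripping tap: turn off the water supply, remove the old washer and fit a new washer.",
    "Choosing a kitchen faucet finish: chrome, brass, or matte black.",
]
query = "How do I fix a leaky faucet?"
direct_hit = retrieve(query, corpus)          # embeds the raw query
hyde_hit = retrieve(fake_llm(query), corpus)  # embeds the hypothetical passage
```

With this toy setup the raw query matches the faucet-finish document (it shares the word 'faucet'), while the hypothetical passage matches the dripping-tap repair guide. A real dense encoder would behave less crudely, but the mechanism is the same.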

💡 WHY IT MATTERS: HyDE significantly improves retrieval recall, especially for complex or rare queries where the lexical gap is large. It's particularly effective when: queries are short, terminology differs between queries and documents, or when embedding model is document-focused. Studies show HyDE can improve recall by 10-20% over direct query embedding. The trade-off is additional latency (one LLM call) and cost, but for many applications, the quality gain justifies it.

📋 EXAMPLE: User query: 'How do I fix a leaky faucet?' Direct embedding retrieves documents with 'leaky faucet' but misses those with 'dripping tap' or 'washer replacement'. HyDE: LLM generates hypothetical document: 'To fix a leaky faucet, first turn off water supply. Then disassemble handle, remove old washer, replace with new washer of same size. Reassemble and test.' This hypothetical document contains terms like 'washer replacement', 'disassemble', 'water supply' that appear in actual repair guides. Its embedding retrieves relevant documents containing these terms, even if they don't mention 'leaky faucet' explicitly. Recall improves from 0.70 to 0.85.

Question 3

What is query rewriting and how does it improve RAG performance?

Accepted Answer

🔍 DEFINITION: Query rewriting modifies the original user query to make it more effective for retrieval, addressing issues like ambiguity, missing context, terminology mismatch, or poor phrasing. It's a pre-retrieval optimization that transforms queries into forms better suited for embedding similarity and keyword search.

⚙️ HOW IT WORKS: Multiple rewriting techniques: 1) Spelling correction - fix typos using LLM or dedicated models. 2) Acronym expansion - 'HTN' → 'hypertension'. 3) Synonym expansion - add common synonyms for key terms. 4) Disambiguation - add context to resolve ambiguous terms ('Apple' → 'Apple company' if context suggests tech). 5) Query expansion - generate multiple query variants and combine results. 6) Multi-hop decomposition - break complex queries into sub-queries. 7) Conversational rewriting - for multi-turn, incorporate conversation history into current query. Rewriting can be done with small models (fast, cheap) or LLMs (more accurate but slower).
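
Two of the techniques above (acronym expansion and conversational rewriting) can be sketched with plain rules. This is only an illustration: the `ACRONYMS` table and both function names are assumptions, and a production system would more likely use a domain glossary or an LLM for these steps.

```python
import re

ACRONYMS = {"htn": "hypertension"}  # illustrative; a real table is domain-specific


def expand_acronyms(query):
    """Append the long form after each known acronym, keeping the original term."""
    def repl(match):
        word = match.group(0)
        long_form = ACRONYMS.get(word.lower())
        return f"{word} ({long_form})" if long_form else word
    return re.sub(r"[A-Za-z]+", repl, query)


def add_conversation_context(query, topic):
    """Naive conversational rewrite: attach the topic carried over from earlier turns."""
    return f"{query.rstrip('?')} for {topic}?"


expanded = expand_acronyms("HTN treatment")
contextual = add_conversation_context("What about the refund policy?", "iPhone 13 purchase")
print(expanded)    # HTN (hypertension) treatment
print(contextual)  # What about the refund policy for iPhone 13 purchase?
```

Keeping the original acronym alongside its expansion lets keyword search still match documents that use the short form.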

💡 WHY IT MATTERS: Raw user queries are often suboptimal for retrieval. Typos cause misses, acronyms may not match documents, ambiguous terms retrieve wrong content. Query rewriting can improve recall by 5-15% with minimal latency impact. For conversational RAG, rewriting is essential - without it, each turn loses context. The technique is especially valuable for specialized domains where terminology varies.

📋 EXAMPLE: Customer support query: 'My iphone 13 won't charge fix?' Raw query retrieval can miss documents about 'iPhone 13 charging issues' because of the informal, fragmentary phrasing (and, in case-sensitive keyword indexes, the lowercase 'iphone'). Rewritten: 'iPhone 13 charging problems troubleshooting' (normalized casing, formalized). Retrieves relevant support articles. Another: user in multi-turn conversation asks 'What about the refund policy?' Without rewriting, retrieves general refund info. With rewriting incorporating history: 'Refund policy for iPhone 13 purchase' retrieves specific policy. Query rewriting turns ambiguous queries into precise retrieval targets.

Question 4

What is multi-query retrieval and why does it improve recall?

Accepted Answer

🔍 DEFINITION: Multi-query retrieval generates multiple variations of the original query, retrieves documents for each, and combines the results. This improves recall by capturing different aspects of the information need and overcoming the limitations of any single query formulation. It's particularly effective for complex queries with multiple facets.

⚙️ HOW IT WORKS: Process: 1) Use an LLM to generate N (typically 3-5) different query formulations based on the original. For example, for 'machine learning applications in healthcare', generate variants like 'healthcare AI use cases', 'medical machine learning implementations', 'clinical applications of ML'. 2) Perform retrieval for each query variant separately, obtaining top-k documents per variant. 3) Merge results, often with reciprocal rank fusion (RRF) or score averaging. 4) Remove duplicates, re-rank by combined relevance. The diverse queries capture different terminology and aspects, increasing chance of finding all relevant documents.
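
Steps 2 to 4 can be sketched as follows. The assumptions are stated up front: the query variants are hard-coded rather than LLM-generated, `keyword_retrieve` is a toy word-overlap retriever standing in for a vector search, and merging is a simple first-seen union rather than RRF.

```python
import re


def words(text):
    """Lowercased set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_retrieve(query, corpus, k=2):
    """Toy retriever: rank documents by how many query words they share."""
    q = words(query)
    scored = [(len(q & words(text)), doc_id) for doc_id, text in corpus.items()]
    scored = [(s, d) for s, d in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


def multi_query_retrieve(variants, corpus, k=2):
    """Run each variant separately and merge, dropping duplicates in first-seen order."""
    merged = []
    for variant in variants:
        for doc_id in keyword_retrieve(variant, corpus, k):
            if doc_id not in merged:
                merged.append(doc_id)
    return merged


corpus = {
    "d1": "machine learning applications in healthcare",
    "d2": "clinical decision support using ML",
    "d3": "AI use cases for hospitals",
}
variants = [
    "machine learning applications in healthcare",
    "clinical applications of ML",
    "healthcare AI use cases",
]
single = keyword_retrieve(variants[0], corpus)
combined = multi_query_retrieve(variants, corpus)
```

Here the original query alone finds only `d1`, while the three variants together reach all three documents.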

💡 WHY IT MATTERS: Single queries can miss relevant documents due to vocabulary mismatch or focusing on one aspect. Multi-query retrieval increases recall by exploring the semantic space more thoroughly. Studies show 5-15% recall improvement over single-query retrieval. The trade-off is increased latency (N× retrieval time) and cost, but queries can run in parallel, minimizing latency impact. Especially valuable for complex, multi-faceted information needs.

📋 EXAMPLE: Legal query: 'precedent cases for breach of contract in software development'. Single query retrieves cases with 'breach of contract software'. Multi-query variants: 'software development contract violation cases', 'IT project breach precedents', 'court rulings software contract disputes'. Each variant retrieves different cases. Combined results cover more of the relevant precedents: recall improves from 0.75 to 0.88. For legal research, recovering an extra 13 percentage points of relevant cases is significant - multi-query is worth the extra cost.

Question 5

What is re-ranking and which models are commonly used (Cohere Rerank, ColBERT)?

Accepted Answer

🔍 DEFINITION: Re-ranking is a two-stage retrieval process where an efficient first-stage retriever (bi-encoder) fetches many candidate documents, then a more accurate but slower second-stage model (cross-encoder or late interaction) reorders them by relevance. This combines the scalability of bi-encoders with the precision of deep interaction models.

⚙️ HOW IT WORKS: Stage 1 (retrieval): bi-encoder retrieves top-k candidates (typically 100-200) quickly. Stage 2 (reranking): for each candidate, compute more accurate relevance score using: 1) Cross-encoder (e.g., Cohere Rerank, monoBERT) - processes query and document together through transformer, outputs relevance score. Slow but accurate. 2) Late interaction (ColBERT) - preserves token-level matching with efficient scoring. 3) Lightweight models - distilled cross-encoders for faster reranking. Top-n (10-20) after reranking used for generation.
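
The two-stage shape described above can be sketched generically. Everything concrete here is an assumption for illustration: `word_overlap` stands in for a bi-encoder, and `FAKE_CROSS_ENCODER` is a hard-coded score table standing in for a real cross-encoder or a hosted reranking API.

```python
def retrieve_then_rerank(query, docs, cheap_score, precise_score, k=100, n=10):
    """Two-stage retrieval: a cheap scorer shortlists k docs, a costly scorer picks n."""
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:k]
    return sorted(candidates, key=lambda d: precise_score(query, d), reverse=True)[:n]


def word_overlap(query, doc):
    """Stage-1 stand-in: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


# Hypothetical relevance scores a cross-encoder might assign for this query.
FAKE_CROSS_ENCODER = {
    "printer paper jam": 0.4,
    "printer power issues troubleshooting guide": 0.9,
}

docs = [
    "printer paper jam",
    "scanner driver installation",
    "printer power issues troubleshooting guide",
]
top = retrieve_then_rerank(
    "printer not working",
    docs,
    cheap_score=word_overlap,
    precise_score=lambda q, d: FAKE_CROSS_ENCODER.get(d, 0.0),
    k=2,
    n=1,
)
```

The expensive scorer only ever sees the k shortlisted documents, which is what keeps the second stage affordable.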

💡 WHY IT MATTERS: Re-ranking can substantially improve precision. A bi-encoder may rank a partially relevant document at position 5 while a highly relevant one sits at position 20. Reranking corrects this, ensuring the most relevant documents are used. It can improve answer quality by 10-20% by providing better context. Cohere Rerank is a popular commercial option; open-source alternatives include MiniLM cross-encoders. The additional latency (50-200ms) is usually worth the quality gain.

📋 EXAMPLE: RAG for technical support. Stage 1 retrieves 100 documents. At position 3: document about 'printer paper jam' (query about 'printer not working'). At position 15: detailed troubleshooting guide for 'printer power issues' (more relevant). Reranker identifies guide as more relevant, moves to position 1. Generation now uses best document, answer quality improves. Without reranking, might use less relevant paper jam guide and give wrong solution. This is why reranking is standard in production RAG.

Question 6

What is contextual compression and how does it improve the quality of retrieved context?

Accepted Answer

🔍 DEFINITION: Contextual compression extracts only the most relevant parts from retrieved documents, reducing noise and fitting more useful information into the limited context window. Instead of stuffing entire chunks, it filters, compresses, or summarizes content to focus on what matters for answering the query.

⚙️ HOW IT WORKS: Techniques: 1) Extractive compression - use an LLM or smaller model to extract sentences most relevant to the query. 2) Summarization - generate concise summaries of each retrieved chunk, query-focused. 3) Filtering - remove chunks below relevance threshold. 4) Hierarchical compression - for very long documents, compress sections iteratively. Compression can be performed per chunk (extract key sentences) or across chunks (synthesize common themes). The compressed output typically 30-50% of original size, allowing more chunks in context.
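
Extractive compression (technique 1) can be approximated with a keyword filter. This is a deliberately crude sketch: the `STOPWORDS` set, the `min_overlap` threshold, and the regex sentence splitter are all assumptions, and a real compressor would typically score sentences with a model.

```python
import re

STOPWORDS = {"what", "are", "the", "of", "a", "is", "to", "and", "in"}


def compress(query, chunk, min_overlap=2):
    """Keep only sentences sharing at least `min_overlap` content words with the query."""
    q_terms = set(re.findall(r"[a-z]+", query.lower())) - STOPWORDS
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [
        s for s in sentences
        if len(q_terms & set(re.findall(r"[a-z]+", s.lower()))) >= min_overlap
    ]
    return " ".join(kept)


chunk = (
    "Aspirin, also known as acetylsalicylic acid, is a medication used to treat pain, "
    "fever, and inflammation. It works by inhibiting cyclooxygenase. Common side effects "
    "include upset stomach, heartburn, drowsiness, and mild headache. More serious side "
    "effects like stomach bleeding may occur with long-term use. Patients should consult "
    "their doctor before use."
)
compressed = compress("What are the side effects of aspirin?", chunk)
```

On this chunk the filter keeps exactly the two sentences that mention 'side effects' and drops the introduction, the mechanism, and the closing advice.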

💡 WHY IT MATTERS: Retrieved chunks often contain irrelevant information - introductions, boilerplate, tangential content. This dilutes attention and wastes context window. Compression focuses the model on what matters, improving answer quality and reducing hallucination. Studies show compression can improve RAG accuracy by 5-10% by providing cleaner context. It also enables including more documents by reducing per-document footprint.

📋 EXAMPLE: Query: 'What are the side effects of aspirin?' Retrieved chunk: 'Aspirin, also known as acetylsalicylic acid, is a medication used to treat pain, fever, and inflammation. It works by inhibiting cyclooxygenase. Common side effects include upset stomach, heartburn, drowsiness, and mild headache. More serious side effects like stomach bleeding may occur with long-term use. Patients should consult their doctor before use.' Compression extracts: 'Common side effects include upset stomach, heartburn, drowsiness, and mild headache. More serious side effects like stomach bleeding may occur with long-term use.' This removes the introductory text, the mechanism, and the closing advice, focusing on the answer. The chunk shrinks from 51 words to 23 (about a 55% reduction) without losing relevant information.

Question 7

What is parent-child chunking and what problem does it solve?

Accepted Answer

🔍 DEFINITION: Parent-child chunking is a retrieval strategy that stores two levels of chunks: smaller 'child' chunks for precise retrieval and larger 'parent' chunks for context. It solves the problem of retrieving small, precise chunks that lack surrounding context, by retrieving children but returning their parent chunks to generation.

⚙️ HOW IT WORKS: During indexing, documents are split into larger parent chunks (e.g., 1000 tokens) and further subdivided into smaller child chunks (e.g., 200 tokens) with overlap. Child chunks are embedded and stored with reference to their parent. During retrieval, similarity search runs on child embeddings, finding the most relevant small pieces. Instead of returning these small chunks, the system retrieves their parent chunks, which contain surrounding context. This combines precise retrieval (child-level) with comprehensive context (parent-level).
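
A minimal sketch of the index-children, return-parents idea. Assumptions: chunk sizes are counted in words rather than tokens, children do not overlap (unlike the description above), matching is by word overlap instead of embeddings, and the two parent texts are invented for illustration.

```python
import re


def words(text):
    """Lowercased set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def build_index(parents, child_size=8):
    """Split each parent into word-window children that remember their parent id."""
    children = []
    for parent_id, text in parents.items():
        tokens = text.split()
        for start in range(0, len(tokens), child_size):
            children.append((parent_id, " ".join(tokens[start:start + child_size])))
    return children


def retrieve_parent(query, children, parents):
    """Match on the small child chunks, but return the larger parent chunk."""
    q = words(query)
    best_parent_id, _ = max(children, key=lambda c: len(q & words(c[1])))
    return parents[best_parent_id]


parents = {
    "p1": ("The treatment for hypertension includes lifestyle changes. Specific medications: "
           "lisinopril (ACE inhibitor), amlodipine (calcium channel blocker), and "
           "hydrochlorothiazide (diuretic). Patients should consult their doctor for "
           "personalized recommendations."),
    "p2": "Aspirin is a medication used to treat pain, fever, and inflammation.",
}
children = build_index(parents)
context = retrieve_parent("What medications are used for hypertension?", children, parents)
```

The similarity search touches only the small windows, yet the text handed to generation is the full parent, including the doctor-consultation sentence that no single matching child contains.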

💡 WHY IT MATTERS: Standard chunking faces a dilemma: small chunks (200 tokens) are precise for retrieval but may lack context needed to answer. Large chunks (1000 tokens) have context but lower retrieval precision. Parent-child chunking gets best of both: child chunks ensure precise matching; parent chunks provide full context. This is especially valuable for question answering where answer may depend on surrounding text. Improves answer quality by 5-15% on tasks requiring contextual understanding.

📋 EXAMPLE: Document section: 'The treatment for hypertension includes lifestyle changes. Specific medications: lisinopril (ACE inhibitor), amlodipine (calcium channel blocker), and hydrochlorothiazide (diuretic). Patients should consult their doctor for personalized recommendations.' Query: 'What medications are used for hypertension?' The child chunk containing the medication list (e.g., 200 tokens) matches the query precisely. Retrieval returns its parent chunk (e.g., 1000 tokens, matching the sizes above), which contains the full context including lifestyle changes and doctor consultation. Generation can now answer with medications and appropriate medical disclaimer. Without parent-child, might retrieve small chunk and miss important context, or retrieve large chunk with lower precision.

Question 8

What is the difference between pre-retrieval, retrieval, and post-retrieval optimization in RAG?

Accepted Answer

🔍 DEFINITION: RAG optimization spans three stages: pre-retrieval (optimizing queries before search), retrieval (improving the search itself), and post-retrieval (optimizing context after retrieval). Each stage addresses different failure modes and requires different techniques, forming a comprehensive optimization framework.

⚙️ HOW IT WORKS: Pre-retrieval optimizations: query rewriting (spelling, expansion), query decomposition, HyDE, query routing (choose right index). Goal: make query more effective for retrieval. Retrieval optimizations: hybrid search (vector + keyword), metadata filtering, fine-tuned embedding models, multi-query retrieval, HNSW parameter tuning. Goal: find more relevant documents. Post-retrieval optimizations: reranking, context compression, chunk reordering (combat lost-in-middle), duplicate removal, citation verification. Goal: prepare optimal context for generation.

💡 WHY IT MATTERS: Each stage addresses specific failure modes. Poor query → pre-retrieval fixes. Low recall → retrieval fixes. Low precision or poor context use → post-retrieval fixes. Systematic optimization requires measuring which stage is failing. A comprehensive RAG system implements all three, but the degree depends on needs. Understanding the framework enables targeted improvements rather than random tweaks.

📋 EXAMPLE: RAG system with 80% accuracy. Analysis reveals: queries have typos and acronyms (pre-retrieval issue) → add query rewriting → 83%. Still missing relevant docs → add hybrid search (retrieval) → 87%. Model still ignores some relevant chunks due to position → add reranking and reordering (post-retrieval) → 91%. Each stage contributed 3-4 percentage points, 11 in total. Without systematic approach, might have only tried one fix and stopped at 83%, missing additional gains.

Question 9

What is self-RAG and how does it use retrieval as a decision process?

Accepted Answer

🔍 DEFINITION: Self-RAG is an advanced framework where the model decides when to retrieve, what to retrieve, and how to use retrieved information through special reflection tokens. Instead of always retrieving, the model dynamically chooses whether retrieval is needed for each step, making the process more efficient and targeted.

⚙️ HOW IT WORKS: Self-RAG trains the model to output special tokens that control retrieval: [Retrieve] token indicates need for retrieval; [No Retrieve] skips it. When [Retrieve] is generated, the system pauses generation, retrieves relevant passages, and resumes with retrieved content. Additional tokens evaluate retrieved passages: [Relevant]/[Irrelevant] judges usefulness; [Support]/[Partially]/[No Support] checks if passage supports generation. This creates a controlled, transparent process where retrieval decisions are explicit and auditable. Training is supervised: in the original Self-RAG work, a critic model annotates the training corpus with these tokens offline, and the generator is then fine-tuned to predict them like any other next token.
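
The control flow around the [Retrieve] / [No Retrieve] decision can be sketched as below. This is a simplification under stated assumptions: `scripted_model` is a hypothetical stub that fakes a fine-tuned model's output, the decision is made once per turn rather than interleaved within generation, and the passage-evaluation tokens are left out.

```python
def scripted_model(user_msg, passages=None):
    """Hypothetical stand-in for a model fine-tuned to emit reflection tokens."""
    if passages is not None:
        return f"Answer grounded in: {passages[0]}"
    if "population" in user_msg.lower():
        return "[Retrieve]"
    return "[No Retrieve] Answered from parametric memory."


def toy_retriever(query):
    """Placeholder retriever returning a single canned passage."""
    return ["(retrieved demographics passage)"]


def answer_turn(user_msg, model, retriever):
    """Call the retriever only when the model's output starts with [Retrieve]."""
    first_pass = model(user_msg)
    if first_pass.startswith("[Retrieve]"):
        return model(user_msg, passages=retriever(user_msg)), True
    return first_pass.replace("[No Retrieve]", "", 1).strip(), False


turns = ["What's the capital of France?", "What about its population?", "Thanks!"]
results = [answer_turn(t, scripted_model, toy_retriever) for t in turns]
retrieval_calls = sum(1 for _, retrieved in results if retrieved)
```

Across the three turns only the population question triggers a retrieval, so `retrieval_calls` is 1 instead of 3.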

💡 WHY IT MATTERS: Traditional RAG always retrieves, wasting compute when not needed (simple questions, conversational follow-ups). Self-RAG reduces unnecessary retrieval by 30-50%, lowering latency and cost. It also improves quality by retrieving only when helpful and by evaluating retrieved content. The explicit reasoning tokens make the process interpretable - you can see why the model retrieved and whether it found relevant information.

📋 EXAMPLE: Multi-turn conversation. User: 'What's the capital of France?' Model generates [No Retrieve] (knows from memory) and answers 'Paris'. User: 'What about its population?' Model generates [Retrieve] (doesn't know current population), retrieves updated demographics, then answers. User: 'Thanks!' Model generates [No Retrieve] and polite response. Without self-RAG, would retrieve for every turn, wasting compute. With self-RAG, retrieval is used only when needed, cutting retrieval calls from three to one in this conversation while maintaining accuracy.

Question 10

What is corrective RAG (CRAG) and how does it handle low-quality retrievals?

Accepted Answer

🔍 DEFINITION: Corrective RAG (CRAG) is a framework that evaluates retrieved documents and takes corrective actions when quality is low. Instead of blindly using retrieved content, CRAG assesses relevance and confidence, then decides to proceed, refine retrieval, or fall back to alternative strategies.

⚙️ HOW IT WORKS: CRAG adds an evaluation step after retrieval. A relevance judge (LLM or smaller model) scores each retrieved chunk: high confidence (use as-is), low confidence (trigger correction), or no relevant content (trigger fallback). Correction strategies: 1) Query reformulation - rewrite query and retrieve again. 2) Web search - supplement with external search if internal docs insufficient. 3) Decomposition - break query into sub-questions and retrieve for each. 4) Knowledge graph lookup - if available. 5) LLM knowledge fallback - use model's parametric knowledge as last resort. The process continues until sufficient quality achieved or maximum attempts reached.
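
The evaluate-then-correct loop can be sketched with the judge, retriever, and reformulator passed in as plain functions. All three are hypothetical stubs here, the 0.7 threshold is an arbitrary assumption, and only the query-reformulation correction is shown; web search and the other fallbacks are left to the caller.

```python
def corrective_retrieve(query, retrieve, judge, reformulate, threshold=0.7, max_attempts=3):
    """Retry retrieval with a rewritten query until the judge accepts some document."""
    for _ in range(max_attempts):
        good = [d for d in retrieve(query) if judge(query, d) >= threshold]
        if good:
            return good, query
        query = reformulate(query)
    return [], query  # nothing passed the judge; caller should fall back


# Toy stand-ins mirroring an HR-policy scenario.
INDEX = {
    "Can I take unpaid leave for volunteering?": ["Vacation policy"],
    "unpaid leave volunteer policy": ["Volunteer leave policy"],
}
JUDGE_SCORES = {"Vacation policy": 0.3, "Volunteer leave policy": 0.9}

docs, final_query = corrective_retrieve(
    "Can I take unpaid leave for volunteering?",
    retrieve=lambda q: INDEX.get(q, []),
    judge=lambda q, d: JUDGE_SCORES[d],
    reformulate=lambda q: "unpaid leave volunteer policy",
)
```

Returning an empty list after `max_attempts` gives the caller a clear signal to fall back or to say the information is unavailable.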

💡 WHY IT MATTERS: Retrieval sometimes fails - no relevant documents, or retrieved content is tangential. Standard RAG proceeds anyway, leading to poor answers or hallucination. CRAG detects these failures and corrects them, improving reliability. Studies show CRAG can increase answer accuracy by 10-20% on challenging queries. It also provides graceful degradation - when information truly unavailable, it can say so rather than guessing.

📋 EXAMPLE: Enterprise RAG for HR policies. Query: 'Can I take unpaid leave for volunteering?' Initial retrieval returns documents about vacation policy (somewhat related but not correct). CRAG judge scores relevance low (0.3). Triggers correction: query reformulation 'unpaid leave volunteer policy', retrieve again. Now finds correct policy document about volunteer leave. Judge scores high (0.9), proceed to generation. Without CRAG, would have answered based on vacation policy, giving wrong information. Correction turned failure into success.

Question 11

What is fusion retrieval and how does it combine multiple retrieval strategies?

Accepted Answer

🔍 DEFINITION: Fusion retrieval combines results from multiple retrieval strategies (vector search, keyword search, structured queries) using algorithms like Reciprocal Rank Fusion (RRF) to produce a single, more robust ranked list. It leverages the complementary strengths of different methods - vector for semantic, keyword for exact matches, metadata for precise filtering.

⚙️ HOW IT WORKS: Process: 1) Execute multiple retrievers in parallel: dense vector search (semantic), sparse keyword search (BM25), and possibly metadata-filtered search. 2) Each retriever returns top-k results with scores or ranks. 3) Fusion algorithm combines results. RRF is popular: for each document, score = Σ 1/(k + rank_r(d)) where rank_r(d) is document's rank in retriever r, k is constant (typically 60). This gives higher weight to documents ranked well across multiple methods. 4) Combined list reranked by fusion score. Documents appearing in multiple retrievers get boosted, those unique to one may still appear if highly ranked.
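
The RRF formula above translates almost directly into code. A minimal sketch, assuming ranks start at 1 and that ties are broken alphabetically by document id; the document ids are invented for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank), rank starting at 1."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; alphabetical order breaks ties deterministically.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["d_ace_full_name", "d_both", "d_other"]  # dense (semantic) results
bm25_hits = ["d_both", "d_exact_terms"]                 # sparse (keyword) results
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)  # ['d_both', 'd_ace_full_name', 'd_exact_terms', 'd_other']
```

`d_both` wins because it collects 1/62 + 1/61 from the two lists, while each single-list document gets at most 1/61. Because RRF uses only ranks, the raw scores of the two retrievers never need to be put on a common scale.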

💡 WHY IT MATTERS: Different retrievers have different strengths. Vector search finds semantically similar content but may miss exact terminology matches. BM25 excels at exact keyword matches but misses synonyms. Fusion combines them, achieving higher recall and precision than either alone. Studies show 5-15% improvement over single-method retrieval. It's especially valuable for domains with both semantic concepts and precise terminology.

📋 EXAMPLE: Medical literature search. Query: 'ACE inhibitors for hypertension'. Vector search finds documents about 'angiotensin-converting enzyme inhibitors for high blood pressure' (semantic match). BM25 finds documents with exact 'ACE inhibitors' and 'hypertension'. Fusion combines: documents that appear in both (mention both terms and concepts) get highest rank; those only in one still appear if highly relevant. Result: comprehensive set including both terminology variants. Without fusion, might miss documents using 'angiotensin-converting enzyme' exclusively.

Question 12

What is iterative RAG and when is it useful?

Accepted Answer

🔍 DEFINITION: Iterative RAG performs multiple retrieval-generation cycles, using information from previous turns to refine subsequent retrieval. Instead of one-shot retrieval, it enables multi-step reasoning where the model identifies what additional information is needed, retrieves it, and progressively builds understanding.

⚙️ HOW IT WORKS: Process: 1) Initial query processed, initial retrieval and partial answer. 2) Model identifies gaps or needs: 'To answer this, I need information about X.' 3) System formulates new query for X, retrieves additional documents. 4) Combines new information with previous context, updates answer. 5) Repeats until answer complete or max iterations reached. This is similar to multi-hop QA but with explicit retrieval steps. Implementation can use ReAct-style prompting where model outputs thoughts and retrieval actions.
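
The retrieve, find-the-gap, retrieve-again loop can be sketched as follows. The gap detector `next_query` would normally be an LLM call; here it is a scripted stub, and the toy knowledge base and follow-up queries are assumptions made for illustration.

```python
def iterative_rag(question, retrieve, next_query, max_iters=3):
    """Alternate retrieval and gap detection until no further query is proposed."""
    context, query = [], question
    for _ in range(max_iters):
        context.extend(retrieve(query))
        query = next_query(question, context)  # returns None when nothing is missing
        if query is None:
            break
    return context


QUESTION = "How does climate change affect coffee production in Brazil?"
FOLLOW_UPS = [
    "Brazil coffee production statistics",
    "climate projections for Brazil coffee regions",
]
KB = {
    QUESTION: ["general climate impacts on agriculture"],
    FOLLOW_UPS[0]: ["Brazil coffee production figures"],
    FOLLOW_UPS[1]: ["temperature and rainfall projections"],
}


def scripted_next_query(question, context):
    """Hypothetical gap detector: asks the next scripted follow-up, then stops."""
    done = len(context)
    return FOLLOW_UPS[done - 1] if done <= len(FOLLOW_UPS) else None


gathered = iterative_rag(QUESTION, lambda q: KB.get(q, []), scripted_next_query)
```

The `max_iters` cap matters in practice: without it, a model that keeps proposing new sub-queries would never terminate.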

💡 WHY IT MATTERS: Complex questions often require information from multiple sources that can't be retrieved in one pass. First retrieval may provide partial info, revealing need for more. Iterative RAG enables this multi-step reasoning, improving accuracy on complex, multi-faceted queries by 15-25% over single-shot RAG. It's essential for research, analysis, and troubleshooting tasks requiring synthesis across documents.

📋 EXAMPLE: User asks: 'How does climate change affect coffee production in Brazil?' Iteration 1: retrieve general climate change impacts on agriculture. Identifies need for Brazil-specific data. Iteration 2: retrieve Brazil coffee production statistics. Identifies need for specific climate factors. Iteration 3: retrieve temperature and rainfall projections for Brazil coffee regions. Combine all to answer: 'Climate change is expected to reduce suitable coffee-growing areas in Brazil by 30% by 2050 due to temperature increases, with key regions like Minas Gerais experiencing more frequent droughts...' Single-shot retrieval would miss the multi-step synthesis. Iterative RAG builds comprehensive answer through progressive information gathering.

Question 13

How do you handle multi-hop questions in RAG (questions requiring reasoning across multiple documents)?

Accepted Answer

🔍 DEFINITION: Multi-hop questions require information from multiple documents that must be connected through reasoning - answering requires finding, synthesizing, and reasoning across pieces of evidence that aren't all in one place. Handling them requires retrieval strategies that can follow chains of information across the knowledge base.

⚙️ HOW IT WORKS: Approaches: 1) Iterative retrieval - retrieve initial documents, extract new query terms, retrieve again. 2) Graph-based retrieval - build knowledge graph connecting entities across documents, traverse graph to find related information. 3) Multi-query decomposition - break question into sub-questions, answer each, combine. 4) Joint retrieval - encode question and use cross-encoder to score document pairs simultaneously. 5) ReAct-style agents - model decides what to retrieve next based on current information. 6) Self-ask - model explicitly asks and answers sub-questions before final answer.

💡 WHY IT MATTERS: Real-world questions often require connecting information: 'What movies has the director of Inception also directed?' requires finding director of Inception (one doc), then finding other movies by that director (other docs). Single-hop RAG fails because no single document contains both. Multi-hop capability is essential for comprehensive question answering, especially in research, legal, and analytical applications.
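
The Inception question above can be traced as two explicit hops. This is a toy sketch: `DOCS` is a tiny structured stand-in for a document store, and both lookups would be retrieval calls plus extraction in a real system.

```python
DOCS = {
    "doc_inception": {"title": "Inception", "director": "Christopher Nolan"},
    "doc_interstellar": {"title": "Interstellar", "director": "Christopher Nolan"},
    "doc_dunkirk": {"title": "Dunkirk", "director": "Christopher Nolan"},
    "doc_arrival": {"title": "Arrival", "director": "Denis Villeneuve"},
}


def find_director(title):
    """Hop 1: look up the bridging entity (the director) for a given film."""
    for doc in DOCS.values():
        if doc["title"] == title:
            return doc["director"]
    return None


def other_films_by(director, exclude):
    """Hop 2: use the hop-1 result as the new query."""
    return sorted(
        d["title"] for d in DOCS.values()
        if d["director"] == director and d["title"] != exclude
    )


director = find_director("Inception")
films = other_films_by(director, exclude="Inception")
print(films)  # ['Dunkirk', 'Interstellar']
```

The second query cannot even be written until the first hop has produced the director's name, which is why a single retrieval pass falls short.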

📋 EXAMPLE: Question: 'What medications are contraindicated for patients with the condition treated by the drug mentioned in this clinical trial?' Trial document mentions drug X treats condition Y. Need to: 1) Extract condition Y from trial document. 2) Find documents about condition Y's contraindications. 3) Retrieve medications contraindicated. Multi-hop system: first hop retrieves trial doc, extracts condition Y. Second hop formulates query 'contraindications for condition Y', retrieves relevant docs. Synthesizes answer: 'Patients with Y should avoid medications A, B, and C.' Without multi-hop, cannot answer despite information existing in knowledge base.

Question 14

What is RAG fusion and how does it use reciprocal rank fusion?

Accepted Answer

🔍 DEFINITION: RAG Fusion combines multi-query retrieval with reciprocal rank fusion (RRF) to improve retrieval quality. It generates multiple query variations, retrieves documents for each, and uses RRF to merge results, giving higher weight to documents that appear in multiple result sets. This increases both recall and precision by capturing diverse query formulations.

⚙️ HOW IT WORKS: Process: 1) Generate N query variations from original using LLM (e.g., 'machine learning healthcare', 'AI medical applications', 'healthcare ML use cases'). 2) For each query, perform retrieval, obtaining ranked lists R1...RN. 3) Apply RRF: for each document d, fusion score = Σ 1/(k + rank_i(d)) where rank_i(d) is rank in list i (∞ if not present), k=60 constant. 4) Documents ranked by fusion score. 5) Top fused results used for generation. RRF gives high scores to documents consistently highly ranked across queries, reducing noise from individual query variations.

💡 WHY IT MATTERS: RAG Fusion improves both recall (more queries find diverse relevant docs) and precision (consistently ranked docs more likely truly relevant). Studies show 10-20% improvement over single-query retrieval. It's particularly effective for ambiguous queries or when terminology varies. The technique adds latency (multiple retrievals) but queries can run in parallel, minimizing impact.

📋 EXAMPLE: Query: 'ways to reduce carbon footprint'. Variations: 'carbon footprint reduction methods', 'how to lower carbon emissions', 'reduce environmental impact carbon'. Retrieval for each yields different documents. RRF fusion: document about 'home energy efficiency' appears in all three lists (rank 5, 7, 4) → high fusion score. Document about 'electric vehicles' appears in two lists → medium score. Document about 'recycling' appears in one → lower. Final top results balance coverage and consensus, providing comprehensive, well-supported answer about multiple reduction methods.
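
As a worked check of the example, using the formula and k = 60 from above, the 'home energy efficiency' document at ranks 5, 7, and 4 scores:

```latex
\[
\mathrm{score}(d) = \frac{1}{60+5} + \frac{1}{60+7} + \frac{1}{60+4}
                  = \frac{1}{65} + \frac{1}{67} + \frac{1}{64}
                  \approx 0.0154 + 0.0149 + 0.0156
                  \approx 0.0459
\]
```

For comparison, a document that appears in only two lists can score at most 2/61 ≈ 0.0328 (rank 1 in both), and a document in a single list at most 1/61 ≈ 0.0164. So with three lists and k = 60, this document outranks anything found by fewer than three variants, even though it was never ranked higher than fourth.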

Question 15

What is the role of a query router in an advanced RAG system?

Accepted Answer

🔍 DEFINITION: A query router directs each incoming query to the most appropriate retrieval strategy, index, or processing pipeline based on query characteristics. Instead of one-size-fits-all retrieval, it enables specialized handling for different query types, improving efficiency and accuracy.

⚙️ HOW IT WORKS: Router analyzes query to determine: 1) Domain or category (technical support, policy, product info). 2) Query type (factual, how-to, comparative, conversational). 3) Required recency (real-time data vs static knowledge). 4) Language. Based on classification, routes to: specific vector index (product docs vs HR policies), retrieval strategy (keyword for IDs, vector for concepts), data source (internal vs web search), or processing mode (simple QA vs multi-hop). Routing can use LLM-based classification, smaller NLP models, or rule-based systems.
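
The rule-based option mentioned above can be sketched with a handful of checks. The route names and the rules themselves are assumptions for illustration; an LLM or a small classifier could replace the body of `route` without changing its callers.

```python
import re


def route(query):
    """Rule-based router; the first matching rule wins."""
    q = query.lower()
    if re.search(r"#\d+", q):               # order numbers: structured lookup, not RAG
        return "order_database"
    if re.search(r"\b(compare|vs)\b", q):   # comparisons need several retrievals
        return "multi_hop_pipeline"
    if "policy" in q:                       # policy text benefits from hybrid search
        return "policy_index_hybrid"
    return "product_index_vector"           # default: semantic search over product docs


queries = [
    "iPhone 15 specs",
    "return policy for electronics",
    "compare iPhone 15 vs Samsung S24",
    "track order #12345",
]
routes = [route(q) for q in queries]
```

Rule order matters: the most specific patterns come first, and the cheapest general-purpose route serves as the default.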

💡 WHY IT MATTERS: Different queries need different handling. Product code queries need exact keyword match, not semantic. Time-sensitive questions need fresh data. Complex questions need multi-hop. Using same pipeline for all degrades performance. Router enables specialized optimization, improving accuracy by 5-15% while potentially reducing cost by using cheaper methods where appropriate.

📋 EXAMPLE: E-commerce RAG with router. Query 'iPhone 15 specs' → routed to product specs index with fast vector search. Query 'return policy for electronics' → routed to policy documents with hybrid search. Query 'compare iPhone 15 vs Samsung S24' → routed to multi-hop pipeline that retrieves specs for both, then synthesis. Query 'track order #12345' → routed to order database (structured query), not RAG at all. Each gets optimal handling. Without router, would treat all same - order query would fail, comparison would be shallow.

Question 16

How do you implement step-back prompting to improve RAG retrieval?

Accepted Answer

🔍 DEFINITION: Step-back prompting is a technique that improves retrieval by first generating a more abstract, conceptual version of the query, retrieving based on that, then using the retrieved concepts to inform more specific retrieval. It helps overcome the tendency to retrieve too narrowly by first understanding the broader context.

⚙️ HOW IT WORKS: Three-step process: 1) Step-back: given the original query, prompt the LLM to generate a more general, conceptual version. For 'What causes migraines?', the step-back query might be 'What are the general mechanisms of headache disorders?' 2) Retrieve documents using both the original and step-back queries (separately or combined). Step-back retrieval finds foundational concepts; original retrieval finds specific details. 3) Combine results, potentially with reranking. The step-back context helps the model understand the broader domain before diving into specifics, improving answer quality.
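The process above can be sketched as a small function that takes the LLM and the retriever as plain callables. This is a minimal sketch under stated assumptions: `generate` and `retrieve` are hypothetical interfaces (not any specific library's API), the prompt wording is illustrative, and the stub LLM and stub index exist only so the example runs end to end.

```python
from typing import Callable, Dict, List


def step_back_retrieve(
    query: str,
    generate: Callable[[str], str],             # assumed LLM interface: prompt -> text
    retrieve: Callable[[str, int], List[str]],  # assumed retriever: (query, k) -> doc ids
    k: int = 3,
) -> List[str]:
    """Retrieve with a step-back query and the original query, then merge."""
    prompt = (
        "Rewrite the question as a more general question about the "
        "underlying concepts.\n"
        f"Question: {query}\nGeneral question:"
    )
    step_back_query = generate(prompt).strip()

    # Retrieve separately for each query; foundational results come first.
    merged: List[str] = []
    seen = set()
    for q in (step_back_query, query):
        for doc in retrieve(q, k):
            if doc not in seen:  # drop documents returned by both queries
                seen.add(doc)
                merged.append(doc)
    return merged


# --- Stand-ins so the sketch is runnable (hypothetical data) ---
def stub_llm(prompt: str) -> str:
    return "What are the general mechanisms of headache disorders?"


STUB_INDEX: Dict[str, List[str]] = {
    "What are the general mechanisms of headache disorders?":
        ["headache-overview", "migraine-triggers"],
    "What causes migraines?":
        ["migraine-triggers", "migraine-genetics"],
}


def stub_retrieve(query: str, k: int) -> List[str]:
    return STUB_INDEX.get(query, [])[:k]
```

Merging by simple de-duplication keeps the sketch short; a reranker over the merged list, as mentioned in step 3, would be the natural next stage.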

💡 WHY IT MATTERS: Direct retrieval often misses foundational concepts because queries are too specific. Step-back ensures these concepts are retrieved, providing context that improves answer accuracy and depth. Reported gains vary by model and benchmark; improvements in the range of 5-15% on complex questions are commonly cited, but the figure should be validated on your own evaluation set.

📋 EXAMPLE: Physics question: 'How does quantum entanglement enable quantum computing?' Direct retrieval finds documents about entanglement and computing. Step-back: 'What is quantum entanglement and its properties?' retrieves foundational explanations. Combined context: model understands both fundamentals and application, and can explain how entanglement (non-classical correlations between qubits), together with superposition, is used in quantum computing. Without step-back, retrieval might return only applied papers that skip the basic explanation, so the answer would assume knowledge the user may not have. Step-back supports a more comprehensive, accessible answer.

Question 17

What is agentic RAG and how does it differ from standard RAG?

Accepted Answer

🔍 DEFINITION: Agentic RAG replaces the fixed retrieval-generation pipeline with an AI agent that has tools (retrieval, search, calculator, etc.) and autonomously decides how to fulfill the user's request. The agent can plan, use multiple tools, iterate, and adapt based on results, handling complex tasks that standard RAG cannot.

⚙️ HOW IT WORKS: Agentic RAG uses a framework (LangChain, AutoGen) where an LLM agent has access to: retrieval tools (vector DB, web search), reasoning tools (calculator, code interpreter), and action tools (APIs, databases). The agent: 1) Receives user query. 2) Creates plan: what tools to use, in what order. 3) Executes tools, observes results. 4) Iterates, refining approach based on findings. 5) Synthesizes final answer. The agent can retrieve multiple times, combine information, and even write code to analyze data. This contrasts with standard RAG's fixed pipeline of retrieve-once-then-generate.
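The plan-execute-observe-iterate loop above can be sketched without any framework. This is a minimal sketch, not how LangChain or AutoGen implement agents: the `decide` callable stands in for the LLM's planning step, the tool names are hypothetical, and the scripted policy and toy tools below exist only to make the loop runnable.

```python
from typing import Callable, Dict, List, Tuple, Union

Action = Tuple[str, str]              # (tool_name, tool_input)
Decision = Union[Action, str]         # next tool call, or a final answer string
Observation = Tuple[str, str, str]    # (tool_name, tool_input, tool_output)


def run_agent(
    query: str,
    decide: Callable[[str, List[Observation]], Decision],  # stands in for the LLM
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> str:
    """Loop: decide -> call tool -> record observation, until a final answer."""
    observations: List[Observation] = []
    for _ in range(max_steps):
        decision = decide(query, observations)
        if isinstance(decision, str):  # the policy produced a final answer
            return decision
        tool_name, tool_input = decision
        tool = tools.get(tool_name)
        output = tool(tool_input) if tool else f"unknown tool: {tool_name}"
        observations.append((tool_name, tool_input, output))
    return "Stopped: step limit reached without a final answer."


# --- Stand-ins so the sketch is runnable (hypothetical data and tools) ---
DOCS = {"XPS 13 price": "999", "MacBook Air price": "1099"}


def retrieve_tool(key: str) -> str:
    return DOCS.get(key, "not found")


def subtract_tool(expr: str) -> str:
    left, right = (int(part) for part in expr.split("-"))
    return str(left - right)


TOOLS = {"retrieve": retrieve_tool, "calculator": subtract_tool}


def scripted_policy(query: str, observations: List[Observation]) -> Decision:
    """Fixed plan used in place of a real LLM: two lookups, one computation."""
    step = len(observations)
    if step == 0:
        return ("retrieve", "XPS 13 price")
    if step == 1:
        return ("retrieve", "MacBook Air price")
    if step == 2:
        return ("calculator", f"{observations[1][2]} - {observations[0][2]}")
    return f"Price difference: ${observations[2][2]}"
```

The `max_steps` cap is the important design choice: an agent that decides its own next action needs a hard stop so that a confused policy cannot loop indefinitely.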

💡 WHY IT MATTERS: Standard RAG handles simple Q&A but fails on multi-step tasks requiring analysis, comparison, or synthesis across diverse sources. Agentic RAG can: compare products across specifications (retrieve multiple specs, compute differences), answer questions requiring current data (search web, then retrieve internal docs), or perform analysis (retrieve data, run calculations). It expands RAG from simple Q&A to general task automation.

📋 EXAMPLE: User asks: 'Which laptop has better battery life, the Dell XPS 13 or MacBook Air M3, and what's the price difference?' Agentic RAG: 1) Retrieves XPS 13 specs → finds 12 hours claimed battery, $999 price. 2) Retrieves MacBook Air specs → finds 15 hours claimed battery, $1099 price. 3) Computes difference: 3 hours more on MacBook, $100 more expensive. 4) Searches for real-world battery test reviews to validate claims. 5) Synthesizes answer with both claimed and real-world numbers, price comparison. Standard RAG might retrieve one laptop's info only or miss price comparison. Agentic RAG handles the full multi-step task.

Question 18

How does knowledge graph integration enhance RAG retrieval quality?

Accepted Answer

🔍 DEFINITION: Knowledge graph integration combines vector retrieval with structured knowledge from graphs (entities, relationships, properties). It enhances RAG by enabling retrieval based on relationships and structured facts, not just text similarity. This improves accuracy for queries involving specific entities, relationships, or multi-hop connections.

⚙️ HOW IT WORKS: Approaches: 1) Graph-enhanced retrieval - first retrieve entities via vector search, then traverse graph to find related entities and their documents. 2) Graph-based query expansion - extract entities from query, use graph to find related concepts for expansion. 3) Hybrid retrieval - combine vector similarity with graph traversal scores. 4) Graph RAG - build graph from documents (entities and relationships), retrieve by traversing from query entities. 5) Knowledge graph as additional context - include relevant subgraph in prompt alongside retrieved documents.

💡 WHY IT MATTERS: Vector search finds semantically similar text but misses explicit relationships. Knowledge graphs capture facts such as 'Apple acquired Beats' - a relationship that may not be obvious from text similarity. For queries like 'What companies has Apple acquired?', graph retrieval directly finds all acquisition relationships stored in the graph, while vector search might miss some. Graph integration can improve recall for relationship-heavy queries (gains of 15-30% are sometimes cited, but they depend on graph coverage and quality).

📋 EXAMPLE: Query: 'Who are the founders of companies that compete with Tesla in electric vehicles?' Vector search retrieves documents about Tesla competitors. Graph RAG: 1) Find competitors of Tesla in graph (Rivian, Lucid, Nio). 2) For each, find founder relationships. 3) Retrieve documents about those founders. 4) Synthesize answer: 'Rivian was founded by RJ Scaringe, Lucid by Bernard Tse and Sam Weng, Nio by William Li.' Vector search alone might miss some founders if documents don't explicitly mention 'founder' together with the company name in the same chunk. The graph provides structured, multi-hop retrieval that is hard to achieve with vectors alone.
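The two-hop traversal in this example can be sketched with an in-memory triple store. This is a minimal sketch under stated assumptions: the relation names (`competes_with`, `founded_by`) and the dictionary index are hypothetical, the triples simply restate the facts in the example, and a real system would use a graph database and then fetch documents for the resulting entities.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

# Facts taken from the example above; relation names are illustrative.
TRIPLES: List[Triple] = [
    ("Tesla", "competes_with", "Rivian"),
    ("Tesla", "competes_with", "Lucid"),
    ("Tesla", "competes_with", "Nio"),
    ("Rivian", "founded_by", "RJ Scaringe"),
    ("Lucid", "founded_by", "Bernard Tse"),
    ("Lucid", "founded_by", "Sam Weng"),
    ("Nio", "founded_by", "William Li"),
]

Index = Dict[Tuple[str, str], List[str]]


def build_index(triples: List[Triple]) -> Index:
    """Index objects by (subject, relation) for constant-time hops."""
    index: Index = {}
    for subject, relation, obj in triples:
        index.setdefault((subject, relation), []).append(obj)
    return index


def follow(index: Index, entity: str, relation: str) -> List[str]:
    """One hop: all objects linked to `entity` by `relation`."""
    return list(index.get((entity, relation), []))


def founders_of_competitors(index: Index, company: str) -> Dict[str, List[str]]:
    """Two hops: company -> competitors -> founders of each competitor."""
    return {
        competitor: follow(index, competitor, "founded_by")
        for competitor in follow(index, company, "competes_with")
    }


INDEX = build_index(TRIPLES)
```

The founder names returned here would then serve as retrieval keys (or metadata filters) for step 3, fetching the supporting documents before synthesis.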

Question 19

What is the FLARE (Forward-Looking Active Retrieval) technique?

Accepted Answer

🔍 DEFINITION: FLARE (Forward-Looking Active Retrieval) is a method where the model actively decides when to retrieve during generation by evaluating its own confidence. It generates sentences, checks if it's uncertain about upcoming tokens, and triggers retrieval to get relevant information before continuing, enabling dynamic, need-based retrieval.

⚙️ HOW IT WORKS: Process: 1) Model begins generating the answer. 2) Before committing each sentence, it generates a tentative next sentence and checks the token probabilities. If low-confidence tokens are detected (indicating uncertainty), it pauses. 3) The tentative sentence, rather than the text generated so far, is used as the query to retrieve relevant documents (optionally with the low-confidence tokens masked out); this is the 'forward-looking' part. 4) Retrieved information is added to the context. 5) The sentence is regenerated using the new information, and generation continues. This contrasts with standard RAG (retrieve once at start) and iterative RAG (retrieve at fixed intervals). FLARE retrieves only when needed, reducing unnecessary retrievals while improving accuracy on uncertain parts.
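The confidence-triggered loop can be sketched as follows. This is a simplified sketch, not the published FLARE implementation: it assumes the tentative next sentence (taken as a whole, with no token masking) serves as the retrieval query, that the language model exposes per-token probabilities through a hypothetical `next_sentence` callable, and that a single threshold decides when to retrieve. The scripted model and retriever are stand-ins so the loop runs.

```python
from typing import Callable, List, Optional, Tuple

Draft = Tuple[str, List[float]]  # sentence text and per-token probabilities


def flare_generate(
    question: str,
    next_sentence: Callable[[str, List[str], List[str]], Optional[Draft]],
    retrieve: Callable[[str], List[str]],
    threshold: float = 0.5,   # assumed confidence cut-off
    max_sentences: int = 10,
) -> Tuple[str, int]:
    """Generate sentence by sentence, retrieving only on low-confidence drafts."""
    answer: List[str] = []
    docs: List[str] = []
    retrievals = 0
    for _ in range(max_sentences):
        draft = next_sentence(question, answer, docs)
        if draft is None:  # model signals it is done
            break
        text, probs = draft
        if probs and min(probs) < threshold:
            # Forward-looking step: the tentative sentence is the query.
            docs = retrieve(text)
            retrievals += 1
            regenerated = next_sentence(question, answer, docs)
            if regenerated is not None:
                text = regenerated[0]
        answer.append(text)
    return " ".join(answer), retrievals


# --- Stand-ins so the sketch is runnable (hypothetical model behaviour) ---
def scripted_lm(question: str, answer: List[str], docs: List[str]) -> Optional[Draft]:
    if len(answer) == 0:
        return ("Qubits use superposition.", [0.9, 0.95, 0.9])
    if len(answer) == 1:
        if docs:  # grounded regeneration after retrieval
            return ("Shor's algorithm threatens RSA.", [0.9, 0.9, 0.9, 0.9])
        return ("Some algorithm affects cryptography.", [0.9, 0.3, 0.8, 0.9])
    return None


def scripted_retrieve(query: str) -> List[str]:
    return ["Shor's algorithm factors integers, which threatens RSA."]
```

With these stand-ins, the first sentence is accepted without retrieval and the second triggers exactly one retrieval before being regenerated, which mirrors the need-based behaviour described above.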

💡 WHY IT MATTERS: Not all parts of an answer need retrieval. Simple facts the model knows don't require it; complex or uncertain parts do. FLARE targets this, and can reduce retrieval cost compared to retrieving at every step (savings of 30-50% are sometimes quoted, but they depend on the confidence threshold and the task) while maintaining or improving accuracy. Note that generating and then regenerating low-confidence sentences adds its own latency. It's particularly effective for long-form generation where information needs vary across the response.

📋 EXAMPLE: Answering 'Explain quantum computing and its applications.' Model confidently generates explanation of superposition and entanglement (no retrieval). When it drafts a sentence about applications in cryptography, low-confidence tokens indicate uncertainty about specific algorithms. FLARE uses that draft sentence as the retrieval query, retrieves info about Shor's algorithm, regenerates the sentence with it, and continues. Final answer is accurate on both fundamentals and applications, with retrieval only for the uncertain part. Without FLARE, the system would either retrieve unnecessarily for fundamentals (waste) or miss details on applications (inaccurate).

Question 20

How would you architect a production-grade advanced RAG system from scratch?

Accepted Answer

🔍 DEFINITION: Architecting a production-grade advanced RAG system requires designing each component for scalability, reliability, and quality, with clear interfaces, monitoring, and fallbacks. The architecture must handle varying query types, scale to millions of documents, and maintain performance under load.

⚙️ HOW IT WORKS: Components: 1) Ingestion pipeline: document processors (parsers for PDF, HTML, etc.), chunking with configurable strategies, embedding generation (parallelized), vector DB storage with metadata. Includes versioning and update handling. 2) Query processing: router to classify query type, query rewriting/enhancement, multi-query generation. 3) Retrieval: hybrid search (vector search over HNSW indexes + keyword search such as BM25 over an inverted index), metadata filtering, multi-stage retrieval (retrieve more, rerank with cross-encoder). Caching layer for frequent queries. 4) Post-processing: context compression, chunk reordering, citation verification. 5) Generation: prompt management with versioning, LLM orchestration (fallback models, retries), streaming support. 6) Monitoring: golden dataset evaluation, latency tracking, cost per query, user feedback collection. 7) Orchestration: workflow engine (e.g., LangGraph) to coordinate steps, handle errors, implement retries.
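Two of the reliability pieces listed above, the caching layer and LLM orchestration with fallback models and retries, can be sketched in a few lines. This is a minimal sketch under stated assumptions: models and the retriever are plain callables (hypothetical interfaces), the cache is an unbounded in-process dictionary keyed on a normalized query, and the flaky primary model is a stand-in. A production system would add timeouts, backoff, cache expiry, and a shared cache.

```python
from typing import Callable, Dict, List, Optional, Sequence


def generate_with_fallback(
    prompt: str,
    models: Sequence[Callable[[str], str]],  # ordered: primary first, then fallbacks
    retries_per_model: int = 2,
) -> str:
    """Try each model in order, retrying before moving to the next one."""
    last_error: Optional[Exception] = None
    for model in models:
        for _ in range(retries_per_model):
            try:
                return model(prompt)
            except Exception as exc:  # broad on purpose: any failure triggers fallback
                last_error = exc
    raise RuntimeError("all models failed") from last_error


class CachedPipeline:
    """Retrieve -> generate, with an exact-match cache in front."""

    def __init__(
        self,
        retrieve: Callable[[str], List[str]],
        models: Sequence[Callable[[str], str]],
    ) -> None:
        self._retrieve = retrieve
        self._models = models
        self._cache: Dict[str, str] = {}
        self.cache_hits = 0

    def answer(self, query: str) -> str:
        key = " ".join(query.lower().split())  # normalize case and whitespace
        if key in self._cache:
            self.cache_hits += 1
            return self._cache[key]
        context = "\n".join(self._retrieve(query))
        prompt = f"Context:\n{context}\n\nQuestion: {query}"
        result = generate_with_fallback(prompt, self._models)
        self._cache[key] = result
        return result


# --- Stand-ins so the sketch is runnable (hypothetical components) ---
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary model unavailable")


def backup_model(prompt: str) -> str:
    return "answer from backup"


def stub_retriever(query: str) -> List[str]:
    return ["RAG combines retrieval with generation."]


pipeline = CachedPipeline(stub_retriever, [flaky_primary, backup_model])
```

Placing the cache in front of retrieval as well as generation means a hit skips the whole pipeline, which is where most of the load reduction comes from.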

💡 WHY IT MATTERS: Production systems must be reliable, observable, and maintainable. Naive RAG fails under real-world conditions. Advanced architecture provides: graceful degradation (fallbacks when retrieval fails), scalability (parallel processing, caching), quality (reranking, compression), and safety (guardrails, monitoring). The investment in architecture pays off through consistent performance and easier maintenance.

📋 EXAMPLE: Production RAG for customer support with 10M documents, 100 QPS. Architecture: ingestion runs hourly, processing 100k new docs with Spark, updating Qdrant. Query router classifies: 70% simple FAQs → fast path (simple retrieval, 7B model), 20% complex → full pipeline (reranking, 34B model), 10% out-of-scope → fallback to human. Cache hits 30%, reducing load. Monitoring alerts when golden dataset recall drops below 0.90. Latency p95 1.2s, cost $0.02/query. This architecture scales and maintains quality, unlike naive RAG that would collapse under load.
