Explore topic-wise interview questions and answers.
Embedding Models
QUESTION 01
What is an embedding model and how does it differ from a generative LLM?
🔍 DEFINITION:
An embedding model is a neural network trained to convert text (or other data) into dense vector representations that capture semantic meaning, where similar texts produce vectors close together in the embedding space. Unlike generative LLMs that produce variable-length text outputs, embedding models produce fixed-length vectors optimized for comparison, search, and retrieval tasks.
⚙️ HOW IT WORKS:
Embedding models are typically encoder-only transformers (like BERT) trained with contrastive objectives: pulling semantically similar texts together in vector space while pushing dissimilar ones apart. During inference, text passes through the model, and the output (often the [CLS] token or mean pooling) becomes the embedding vector. Generative LLMs (decoder-only) are trained for next-token prediction and generate text autoregressively. Embedding models are smaller, faster, and specialized for representation; generative models are larger, slower, and specialized for creation.
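The pooling step above can be sketched concretely. The snippet below mean-pools a matrix of token embeddings under an attention mask to produce one fixed-length vector, then compares vectors with cosine similarity. The token embeddings are random stand-ins for real encoder output, so only the mechanics (not the semantics) are illustrated.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padding positions (mask == 0)."""
    mask = attention_mask[:, None].astype(float)    # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)  # (dim,)
    return summed / mask.sum()

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Mock encoder output: 6 tokens (last 2 are padding), 8-dim vectors.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))
mask = np.array([1, 1, 1, 1, 0, 0])

embedding = mean_pool(tokens, mask)  # fixed-length vector, shape (8,)
```

Whatever the input length, the output is a single fixed-size vector — exactly the property that makes embeddings comparable and indexable.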
💡 WHY IT MATTERS:
Embedding models enable efficient semantic search, clustering, and retrieval at scale. A 100M-parameter embedding model can process thousands of documents per second on CPU, while a 7B generative model would be too slow and expensive for retrieval. In RAG systems, embedding models handle the retrieval step and generative models handle answer synthesis. Understanding the distinction prevents misapplication: using a generative model for retrieval would be costly and slow; using an embedding model for generation is impossible.
📋 EXAMPLE:
For a RAG system answering customer questions: an embedding model (e.g., 100M parameters) converts all support documents to vectors (once) and user queries to vectors (real-time) for fast retrieval (~10ms). A generative model (7B parameters) then answers based on retrieved documents (~500ms). Each model does what it's best at: embedding for speed and comparison, generative for synthesis and fluency. This division of labor enables both fast retrieval and high-quality answers.
QUESTION 02
What is the difference between OpenAI embeddings, Cohere embeddings, and open-source models like BGE or E5?
🔍 DEFINITION:
Different embedding model families offer varying trade-offs in quality, cost, latency, and customization. OpenAI's text-embedding-ada-002 (1536-dim) is a popular closed-source API; Cohere offers multilingual and specialized models; open-source models (BGE, E5, GTE) provide flexibility and self-hosting at lower cost.
⚙️ HOW IT WORKS:
OpenAI embeddings: API-based, 1536 dimensions, strong general performance, $0.13/1M tokens, no fine-tuning access. Cohere: API or self-hosted, multiple models (embed-english-v3.0, embed-multilingual-v3.0), 1024 dimensions, strong on RAG benchmarks. BGE (BAAI): open-source, multiple sizes (BGE-base 768-dim, BGE-large 1024-dim), permissive license, can be self-hosted and fine-tuned. E5 (Microsoft): open-source, trained with contrastive learning, strong performance. GTE (Alibaba): open-source, multilingual. Open-source models require infrastructure but offer lower cost at scale and customization.
💡 WHY IT MATTERS:
Choice impacts cost, control, and performance. For prototyping, OpenAI's API is simplest. For production at scale (>10M queries/month), open-source can save $50k+/year. For specialized domains, a fine-tuned open-source model may outperform general APIs. For data privacy, self-hosting is essential. Quality differences: the MTEB leaderboard shows top open-source models (BGE, E5) are competitive with commercial APIs. The best choice depends on scale, domain, privacy needs, and budget.
📋 EXAMPLE:
Company with 100M documents (~1,000 tokens each) that re-indexes monthly, plus 1M queries/month. OpenAI: each full re-index is ~100B tokens × $0.13/1M ≈ $13,000; query embedding adds only ~$26/month (1M queries × 200 tokens). Self-hosted BGE on 2 GPUs: ~$2,000/month cloud cost. Quality difference <2% on internal benchmarks. The switch saves roughly $132k/year. For another company with sensitive legal data, self-hosting is non-negotiable regardless of cost. Open-source enables both savings and control.
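Estimates like these are easy to sanity-check in code. The prices and token counts below are illustrative assumptions, not vendor quotes.

```python
def embedding_api_cost(num_texts, avg_tokens, price_per_million_tokens):
    """Dollar cost to embed num_texts texts via a per-token API."""
    return num_texts * avg_tokens / 1_000_000 * price_per_million_tokens

# Illustrative assumptions: $0.13 per 1M tokens, 100M docs of ~1,000 tokens each,
# and 1M queries/month of ~200 tokens each.
index_cost = embedding_api_cost(100_000_000, 1_000, 0.13)  # full index build, ~$13,000
query_cost = embedding_api_cost(1_000_000, 200, 0.13)      # monthly queries, ~$26
```

Note the asymmetry: at this scale the document-side embedding dominates cost, while query embedding is almost free — so re-indexing frequency matters more than query volume.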
QUESTION 03
What is sentence-transformers and how is it used?
🔍 DEFINITION:
Sentence-Transformers is a Python framework built on PyTorch and Hugging Face that provides easy-to-use models and training tools for generating sentence and text embeddings. It simplifies using and fine-tuning embedding models for semantic search, clustering, and similarity tasks.
⚙️ HOW IT WORKS:
The library wraps transformer models (BERT, RoBERTa) with pooling layers to produce fixed-size sentence embeddings. It provides: 1) Pre-trained models for many languages and domains. 2) Simple API: model.encode(texts) returns embeddings. 3) Training utilities for fine-tuning on custom datasets using contrastive learning (MultipleNegativesRankingLoss, TripletLoss). 4) Evaluation tools for semantic search benchmarks. Models can be used for semantic textual similarity (STS), paraphrase mining, and retrieval.
💡 WHY IT MATTERS:
Sentence-Transformers democratized embedding models. Before it, generating sentence embeddings required complex BERT handling and pooling. Now, a few lines of code give production-quality embeddings. The library enabled rapid experimentation and fine-tuning for domain adaptation. It's the foundation of many RAG systems and the go-to tool for embedding work. Pre-trained models like all-MiniLM-L6-v2 (384-dim) are widely used for their speed-quality trade-off.
📋 EXAMPLE:
Building a semantic search system for customer support. Using sentence-transformers: model = SentenceTransformer('all-MiniLM-L6-v2'); docs_emb = model.encode(documents); query_emb = model.encode(query); similarities = util.cos_sim(query_emb, docs_emb). 10 lines of code gives working prototype. For domain adaptation: fine-tune on 10k labeled pairs with MultipleNegativesRankingLoss. All within same framework. This ease of use is why sentence-transformers is ubiquitous.
QUESTION 04
What is the MTEB leaderboard and how do you use it to choose an embedding model?
🔍 DEFINITION:
MTEB (Massive Text Embedding Benchmark) is a comprehensive leaderboard that evaluates embedding models across 8 task types and 58 datasets, including retrieval, semantic similarity, classification, clustering, and reranking. It provides a standardized way to compare models across diverse capabilities, helping practitioners select the best model for their needs.
⚙️ HOW IT WORKS:
MTEB runs models through standardized tasks: retrieval (datasets like NFCorpus and SciFact), STS (semantic similarity), classification (tasks like Amazon reviews), clustering, pair classification, reranking, and summarization. Each task produces scores (nDCG, accuracy, etc.) averaged across datasets. The overall MTEB score is the average across all tasks. Models are ranked, with breakdowns by task category. The leaderboard includes model size, embedding dimension, and max tokens.
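The aggregation scheme is simple enough to sketch: average within each task category, then average across categories. Task names and scores below are made up for illustration, not real leaderboard numbers.

```python
def overall_score(results):
    """results: {task_category: {dataset: score}} -> mean of per-category means."""
    category_means = [sum(scores.values()) / len(scores)
                      for scores in results.values()]
    return sum(category_means) / len(category_means)

results = {
    "retrieval": {"SciFact": 0.70, "NFCorpus": 0.34},   # illustrative numbers
    "sts": {"STS12": 0.78, "STS13": 0.84},
    "classification": {"AmazonPolarity": 0.92},
}
score = overall_score(results)  # ≈ 0.75
```

Averaging within categories first prevents categories with many datasets (like classification) from dominating the overall ranking.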
💡 WHY IT MATTERS:
MTEB enables data-driven model selection. Instead of guessing which model is best, you can see performance on tasks similar to your use case. Need strong retrieval? Look at retrieval scores. Need multilingual? Check multilingual tasks. MTEB also reveals trade-offs: smaller models (384-dim) often 95% of large model performance at 1/10 the cost. The leaderboard is updated as new models release, keeping comparisons current.
📋 EXAMPLE:
Choosing embedding model for RAG system focused on scientific papers. Check MTEB retrieval scores on SciFact and NFCorpus. Top models: voyage-2 (0.89), BGE-large (0.87), ada-002 (0.86). BGE-large open-source, 1024-dim, can self-host. For budget-conscious, GTE-small (384-dim) scores 0.82 - 5% drop but 10x cheaper. Based on MTEB, choose BGE-large for best quality or GTE-small for cost. Without MTEB, would rely on marketing claims.
QUESTION 05
What is the difference between symmetric and asymmetric semantic search?
🔍 DEFINITION:
Symmetric semantic search involves queries and documents that are similar in nature - like finding similar questions, duplicate detection, or paraphrase retrieval. Asymmetric semantic search involves queries that are short (questions) and documents that are long (passages, articles) - the typical RAG scenario. Different embedding models and training approaches optimize for each.
⚙️ HOW IT WORKS:
In symmetric search, both sides are comparable (e.g., question-question). Models can use bi-encoders with symmetric training: pairs of similar texts pulled together. In asymmetric search, queries and documents have different distributions (short keywords/phrases vs long descriptive text). Models often trained with query-document pairs, sometimes using different encoders for each side. Training data for asymmetric includes (query, relevant document) pairs from search logs or Q&A datasets.
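Some asymmetric-trained open-source models (E5 is a well-known example) expect distinct prefixes on each side, so the encoder knows whether it is embedding a query or a passage. A small helper sketch of that convention:

```python
def format_for_asymmetric_search(text, side):
    """E5-style convention: prefix queries and passages differently
    so the model embeds each side with its own distribution in mind."""
    if side == "query":
        return f"query: {text}"
    if side == "passage":
        return f"passage: {text}"
    raise ValueError("side must be 'query' or 'passage'")

q = format_for_asymmetric_search("reset my router", "query")
p = format_for_asymmetric_search("To restart the device, hold the power button for 10 seconds.", "passage")
```

Forgetting these prefixes is a common silent failure: the model still returns vectors, but retrieval quality drops because inputs fall outside the distribution each side was trained on.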
💡 WHY IT MATTERS:
Using symmetric model for asymmetric task reduces accuracy. A model trained to match question-question may fail to match short queries to long documents because distributions differ. MTEB reports both symmetric and asymmetric task performance. Many general models now support both, but specialized asymmetric models (e.g., E5, BGE) often perform better for RAG. Understanding the distinction ensures choosing right model for use case.
📋 EXAMPLE:
RAG system: short user query (5-10 words) vs long documentation (500 words). Asymmetric search. Using symmetric model (trained on STS) gives recall@10 0.75. Using asymmetric-trained E5 gives recall 0.88 - 13% improvement. For FAQ matching (question to similar question), symmetric works best. The use case determines model choice. This is why embedding model selection must consider search type.
QUESTION 06
How do you fine-tune an embedding model for a domain-specific use case?
🔍 DEFINITION:
Fine-tuning an embedding model adapts a general-purpose model to a specific domain (legal, medical, technical) by training on domain-specific pairs of related texts. This improves retrieval accuracy for that domain by teaching the model domain terminology, relationships, and relevance patterns.
⚙️ HOW IT WORKS:
Process: 1) Create training data - pairs of (query, relevant document) from your domain. Need 1k-100k examples. Can use existing labels, search logs, or synthetic generation (LLM creates Q&A pairs from documents). 2) Choose base model - typically a strong open-source model (BGE, E5). 3) Define loss function - MultipleNegativesRankingLoss works well: for each query, has one positive document, treats other documents in batch as negatives. 4) Train with sentence-transformers or similar framework. Use contrastive learning to pull positive pairs together, push negatives apart. 5) Evaluate on held-out test set, compare to base model. 6) Iterate with more data or different hyperparameters.
💡 WHY IT MATTERS:
Domain adaptation can improve retrieval by 5-15% over general models. For specialized domains (medical, legal, technical), general models miss terminology and relationships. Fine-tuning on as few as 1k examples can yield significant gains. This is often more cost-effective than switching to larger models. Fine-tuned models also better handle domain-specific query patterns.
📋 EXAMPLE:
Medical RAG with general BGE model recall@10 0.78. Fine-tune on 10k (query, relevant medical abstract) pairs from PubMed. After fine-tuning, recall@10 0.87 - 9% improvement. The model now understands that 'MI' means myocardial infarction, that 'HTN' relates to hypertension, and that certain treatments pair with conditions. This improvement directly translates to better patient answers. Fine-tuning turned a good model into a great medical model.
QUESTION 07
What is contrastive learning and how is it used to train embedding models?
🔍 DEFINITION:
Contrastive learning is a training technique that teaches models to pull semantically similar examples closer together in embedding space while pushing dissimilar examples apart. It's the foundation of modern embedding models, enabling them to learn rich representations without explicit labels, just from pairs or triples of related/unrelated texts.
⚙️ HOW IT WORKS:
The model is trained on batches containing positive pairs (similar texts) and negative samples (dissimilar texts). Loss functions like MultipleNegativesRankingLoss treat, for each query, the matching document as positive and all other documents in batch as negatives. The model learns to maximize similarity for positive pairs and minimize for negatives. InfoNCE loss: -log(exp(sim(q,p)) / Σ exp(sim(q,all))). This pushes positive similarity high relative to all negatives. Training uses large batches (many negatives) for effectiveness. Data can be naturally occurring pairs: (query, clicked document), (question, answer), (sentence, paraphrase).
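The in-batch loss described above can be sketched in NumPy. This version adds a temperature term (standard in InfoNCE, though omitted from the formula above) and is a toy forward pass only — real training backpropagates this loss through the encoder.

```python
import numpy as np

def info_nce_loss(q, d, temperature=0.05):
    """In-batch contrastive loss: row i of d is the positive for row i of q;
    every other row in the batch serves as a negative."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    sim = (q @ d.T) / temperature                  # (B, B) scaled cosine similarities
    sim -= sim.max(axis=1, keepdims=True)          # numerical stability
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))   # -log p(positive | query)

rng = np.random.default_rng(0)
queries = rng.normal(size=(8, 16))
positives = queries + 0.1 * rng.normal(size=(8, 16))  # mock "relevant" docs
unrelated = rng.normal(size=(8, 16))

aligned_loss = info_nce_loss(queries, positives)
random_loss = info_nce_loss(queries, unrelated)
```

Aligned pairs incur much lower loss than unrelated ones, which is exactly the gradient signal that shapes the embedding space.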
💡 WHY IT MATTERS:
Contrastive learning enabled the leap from static embeddings (Word2Vec) to task-optimized embeddings. It doesn't require expensive labeled data - just indications of relevance. This scales to web-scale training. The technique produces embeddings where distance directly corresponds to semantic relevance, perfect for search and retrieval. Most top embedding models (E5, BGE, GTE) use contrastive learning.
📋 EXAMPLE:
Training batch of 64 (query, relevant document) pairs. For query 1, its document is positive; the other 63 documents are negatives. Model computes similarities, loss encourages sim(q1, doc1) >> sim(q1, doc63). Over millions of examples, model learns that 'capital of France' should be close to 'Paris' and far from 'pasta recipe'. This contrastive pressure creates the semantic space used in RAG.
QUESTION 08
What are the trade-offs between embedding model size and retrieval quality?
🔍 DEFINITION:
Larger embedding models (more parameters, higher dimensions) generally achieve better retrieval quality but incur higher costs in inference latency, memory usage, storage, and compute. Understanding these trade-offs is crucial for cost-effective RAG system design, as the optimal size depends on application requirements and scale.
⚙️ HOW IT WORKS:
Model size scales with parameters (100M to 1B) and output dimension (384 to 1536). Larger models capture more nuanced semantics, improving retrieval recall by 2-8% typically. Costs: inference latency (5ms vs 20ms on CPU), memory for model (500MB vs 2GB), vector storage (384-dim vs 1536-dim: 4x difference), and search latency (higher dimensions slower). At scale, these multiply: 100M vectors at 1536-dim = 600GB storage vs 150GB at 384-dim.
💡 WHY IT MATTERS:
For many applications, a 384-dim model (all-MiniLM-L6-v2) achieves 90-95% of the quality of a 1536-dim model (ada-002) at 1/10 the cost. The last few percent quality may not justify 4-10x infrastructure costs. However, for high-stakes applications (medical, legal) where every percentage point matters, larger models worth it. The trade-off analysis should consider: quality difference on your specific data, cost at scale, latency requirements, and business impact of errors.
📋 EXAMPLE:
50M document RAG system. Option A (384-dim MiniLM): recall@10 0.88, storage 75GB, CPU inference 5ms, annual storage $2k. Option B (1024-dim BGE-large): recall 0.92, storage 200GB, GPU inference needed, annual storage $6k + GPU costs $10k. For customer support chatbot, 4% recall improvement may reduce escalations by 2% - worth $5k/year. Net benefit positive. For internal knowledge base with lower stakes, Option A better. Trade-off analysis guides decision.
QUESTION 09
What is Matryoshka Representation Learning (MRL) and why is it useful?
🔍 DEFINITION:
Matryoshka Representation Learning (MRL) is a technique that trains embeddings to be useful at multiple dimensions simultaneously, allowing flexible truncation to smaller dimensions without retraining or significant quality loss. Named after Russian nesting dolls, it creates representations where early dimensions contain most important information, later dimensions add refinement.
⚙️ HOW IT WORKS:
During training, MRL computes loss not just on full embedding but also on multiple nested subsets. For example, with target dimension 768, losses computed on first 64, 128, 256, 384, 512, and 768 dimensions. Model learns to pack information hierarchically so early dimensions capture core semantics, later dimensions encode finer distinctions. This is achieved through multi-task learning with shared encoder and multiple projection heads or a single head with multi-level loss. After training, embedding can be truncated to any dimension (e.g., 128) and still provide good performance.
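At inference time, using an MRL embedding at a smaller dimension is just truncation plus renormalization. The sketch below shows the mechanics; the quality claims only hold if the model was actually trained with Matryoshka-style nested losses.

```python
import numpy as np

def truncate_embedding(vec, dim):
    """Keep the first `dim` dimensions and renormalize for cosine search."""
    small = vec[:dim]
    return small / np.linalg.norm(small)

rng = np.random.default_rng(0)
full = rng.normal(size=768)          # stand-in for a 768-dim MRL embedding
short = truncate_embedding(full, 128)  # unit-norm 128-dim version
```

Renormalizing matters: after truncation the vector's norm shrinks, and cosine/dot-product search assumes unit-length vectors for scores to be comparable.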
💡 WHY IT MATTERS:
MRL enables flexible deployment where one model serves multiple use cases. High-precision search can use full dimensions for maximum recall. Real-time recommendation with strict latency can use 128-dim from same model, achieving 6x faster search and 6x less storage with only slight quality degradation. This eliminates need for multiple models. Storage savings at scale: 10B documents at 768-dim = 30.7TB, at 128-dim = 5.1TB. The technique enables adaptive trade-offs without retraining.
📋 EXAMPLE:
E-commerce platform with 100M products. Full 768-dim embeddings: storage 307GB, search latency 50ms. 128-dim truncation: storage 51GB, latency 10ms. With MRL-trained model, same model serves both: backend search uses full dims for high recall (0.95), real-time recommendations use 128-dim (0.92 recall) - 6x faster, 6x cheaper storage. Without MRL, would need two separate embedding models. This flexibility is why MRL is increasingly adopted.
QUESTION 10
How does embedding model dimensionality affect storage costs?
🔍 DEFINITION:
Embedding dimensionality directly determines the storage footprint of vector databases. Each dimension stores a floating-point number (typically 4 bytes for float32), so total storage = number of vectors × dimensions × bytes per dimension. Higher dimensions increase storage linearly, with significant cost implications at scale.
⚙️ HOW IT WORKS:
For a dataset of N vectors with dimension D using float32 (4 bytes), storage = N × D × 4 bytes. Example: 1M vectors at 384-dim = 1M × 384 × 4 = 1.54GB. At 1536-dim = 1M × 1536 × 4 = 6.14GB - 4x larger. At 10M vectors: 384-dim = 15.4GB, 1536-dim = 61.4GB. At 1B vectors: 384-dim = 1.54TB, 1536-dim = 6.14TB. Cloud SSD at $0.10/GB/month: 1B vectors at 384-dim ≈ $154/month, at 1536-dim ≈ $614/month - a $460/month difference for raw vectors alone, before replication, index overhead, and backups multiply it. RAM costs even higher for in-memory databases.
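The formula above in code form (float32, raw vectors only, no index overhead); the $0.10/GB/month price is the same illustrative figure used in the text.

```python
def vector_storage_gb(num_vectors, dim, bytes_per_value=4):
    """Raw float32 vector storage in GB (1 GB = 1e9 bytes, matching the text)."""
    return num_vectors * dim * bytes_per_value / 1e9

def monthly_storage_cost(num_vectors, dim, price_per_gb_month=0.10):
    return vector_storage_gb(num_vectors, dim) * price_per_gb_month

gb_1m_384 = vector_storage_gb(1_000_000, 384)            # ≈ 1.54 GB
cost_1b_384 = monthly_storage_cost(1_000_000_000, 384)   # ≈ $154/month
cost_1b_1536 = monthly_storage_cost(1_000_000_000, 1536) # ≈ $614/month
```

Because storage scales linearly in D, every cost in this function quadruples when moving from 384 to 1536 dimensions — the core trade-off of this question.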
💡 WHY IT MATTERS:
Dimensionality choice is a major cost driver at scale. A seemingly small increase from 384 to 768 doubles storage; to 1536 quadruples it. For large-scale applications (hundreds of millions to billions of vectors), this compounds into substantial infrastructure costs once replication, index overhead, and in-memory serving are included. The trade-off: higher dimensions may improve retrieval quality by 1-5%, but at 4x cost. Many applications find 384-dim models (all-MiniLM-L6-v2) provide 90-95% of quality at 1/4 the cost. Storage costs also affect backup costs, transfer costs, and recovery time. Choosing the right dimensionality requires cost-benefit analysis based on your scale and quality requirements.
📋 EXAMPLE:
E-commerce company with 500M product vectors. Option A (384-dim): storage 500M × 384 × 4 = 768GB, cloud SSD cost $77/month, plus vector database overhead $200/month = $277/month. Option B (1536-dim): storage 3TB, cost $307/month + overhead $500/month = $807/month. Annual difference $6,360. For 10-year lifespan, $63,600 savings with Option A. If quality difference is 2% (recall 0.92 vs 0.94), is 2% worth $63k? For many businesses, no. This is why dimensionality optimization is crucial for cost-effective scaling.
QUESTION 11
What is bi-encoder vs. cross-encoder architecture and when do you use each?
🔍 DEFINITION:
Bi-encoders encode queries and documents independently into vectors, enabling fast similarity search via vector comparison. Cross-encoders process query-document pairs together through a transformer, producing a relevance score directly. Bi-encoders are efficient for retrieval (millions of documents), cross-encoders are accurate but slow, used for reranking.
⚙️ HOW IT WORKS:
Bi-encoder: query → encoder → vector q; document → same encoder → vector d. Similarity = cos(q,d). Can pre-compute all document vectors, store in vector DB. Search: encode query once, compare to all document vectors via ANN. O(1) per document at scale. Cross-encoder: query + document concatenated (e.g., '[CLS] query [SEP] document [SEP]') passed through transformer, output score via classification head. Must process each query-document pair individually - O(n) per query, too slow for large-scale retrieval but accurate for reranking top candidates.
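A toy version of the bi-encoder search path: pre-computed document vectors, one query encode, and brute-force dot-product top-k (a production system would swap in an ANN index here). The vectors are random stand-ins for encoder output.

```python
import numpy as np

def top_k(query_vec, doc_matrix, k):
    """Indices of the k most similar docs by dot product (vectors pre-normalized)."""
    scores = doc_matrix @ query_vec
    idx = np.argpartition(-scores, k)[:k]    # unordered top-k in O(n)
    return idx[np.argsort(-scores[idx])]     # then sort just those k

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 64))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)  # pre-computed corpus vectors

query = docs[42] + 0.01 * rng.normal(size=64)        # near-duplicate of doc 42
query /= np.linalg.norm(query)

results = top_k(query, docs, 5)  # doc 42 should rank first
```

The key property is that the expensive part (embedding the corpus) happens once at index time; each query costs one encoder pass plus a similarity scan.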
💡 WHY IT MATTERS:
The choice is about speed-accuracy trade-off. Bi-encoders enable billion-scale retrieval in milliseconds - essential for RAG. Cross-encoders provide more accurate relevance judgments because they can model query-document interactions deeply. Typical RAG pipeline uses bi-encoder for initial retrieval (top-100), then cross-encoder for reranking (top-10) to combine speed and accuracy. Cross-encoders also used for training data generation (labeling pairs).
📋 EXAMPLE:
RAG with 10M documents. Bi-encoder retrieves top-100 in 50ms. Cross-encoder reranks those 100 pairs in 100ms (1ms per pair). Final top-10 used for generation. Without bi-encoder, cross-encoder on 10M would take 10M × 1ms = 10,000 seconds - impossible. Without cross-encoder, retrieval quality may be lower. The combination leverages both strengths.
QUESTION 12
How do you handle multilingual embeddings for a global application?
🔍 DEFINITION:
Multilingual embeddings enable semantic search across multiple languages by mapping texts from different languages into a shared vector space where similar meanings are close regardless of language. This allows queries in one language to retrieve relevant documents in another, essential for global applications.
⚙️ HOW IT WORKS:
Approaches: 1) Multilingual models trained on parallel data (translation pairs) to align languages (LaBSE, multilingual E5). 2) Cross-lingual transfer from multilingual pretraining (mBERT, XLM-R) fine-tuned on multilingual tasks. 3) Separate monolingual models with alignment (less common). During indexing, documents in any language embedded with same model. Query in any language embedded similarly. Vector search finds semantically similar documents across languages. Performance varies by language - high-resource languages (English, Spanish) better than low-resource.
💡 WHY IT MATTERS:
Global applications can't assume all content in one language. Customer support needs answers in user's language; product search should work across languages; knowledge bases may have documents in multiple languages. Multilingual embeddings enable this without language detection and translation pipelines. However, quality gaps exist - a model may perform well on English but poorly on Thai. Testing on target languages essential.
📋 EXAMPLE:
Global e-commerce with product descriptions in English, Spanish, Japanese, Chinese. User in Japan searches in Japanese. Multilingual E5 embeds query, finds relevant products regardless of description language. Spanish user sees same product via Spanish query. Without multilingual embeddings, would need separate indexes per language or translation layer, adding complexity and latency. Multilingual model simplifies architecture.
QUESTION 13
What is late interaction (ColBERT) and how does it differ from bi-encoder retrieval?
🔍 DEFINITION:
ColBERT (Contextualized Late Interaction over BERT) is a retrieval architecture that combines bi-encoder efficiency with cross-encoder-like accuracy by encoding queries and documents into sets of token embeddings and using a lightweight interaction mechanism ('MaxSim') to score relevance. It preserves fine-grained token-level matching while enabling pre-computation of document representations.
⚙️ HOW IT WORKS:
Query encoded into bag of token vectors Q = [q1, q2, ..., qm]. Document encoded into bag of token vectors D = [d1, d2, ..., dn] (pre-computed and stored). Relevance score = Σ_i max_j (qi · dj) - for each query token, find maximum similarity with any document token, sum across query tokens. This captures fine-grained matches (e.g., query 'machine learning' matching 'learning' in document) while remaining efficient: only m×n comparisons (m small, n moderate) per query-document pair. Can use with inverted indexes for scalability.
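The MaxSim rule is compact enough to state directly in NumPy. Token vectors below are hand-crafted (not real ColBERT output) so the arithmetic is visible.

```python
import numpy as np

def maxsim_score(Q, D):
    """ColBERT late interaction: for each query token, take its best
    document-token similarity, then sum over query tokens.
    Q: (m, dim) query token embeddings; D: (n, dim) document token embeddings."""
    sim = Q @ D.T                     # (m, n) token-level similarity matrix
    return float(sim.max(axis=1).sum())

Q = np.array([[1.0, 0.0],   # query token A
              [0.0, 1.0]])  # query token B
D = np.array([[1.0, 0.0],   # doc token matching A exactly (similarity 1.0)
              [0.5, 0.5]])  # doc token partially matching both (0.5 each)

score = maxsim_score(Q, D)  # 1.0 + 0.5 = 1.5
```

Because document token vectors are pre-computed, only the small m×n similarity matrix is evaluated per candidate at query time — the "late" part of late interaction.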
💡 WHY IT MATTERS:
ColBERT bridges gap between bi-encoder speed and cross-encoder accuracy. Bi-encoders compress full text to single vector, losing token-level detail. Cross-encoders capture full interaction but are slow. ColBERT preserves token-level matching while enabling pre-computation, achieving state-of-the-art retrieval quality. It's particularly good for queries with multiple concepts where exact token matching matters. Used in many advanced RAG systems.
📋 EXAMPLE:
Query 'presidential election results 2020'. Bi-encoder might match documents about '2020 election' generally. ColBERT matches 'presidential', 'election', 'results', '2020' individually, rewarding documents that cover all concepts. If a document has '2020 presidential election' but never mentions 'results', the 'results' query token finds only a weak best match, appropriately lowering the score. This token-level interaction captures nuanced relevance better than single-vector compression, improving retrieval quality by 5-10% on complex queries.
QUESTION 14
What is the impact of embedding model updates on an existing vector index?
🔍 DEFINITION:
Updating an embedding model changes the vector representations of all documents, rendering an existing vector index invalid because vectors are no longer comparable across versions. This creates significant operational challenges for production RAG systems, requiring strategies to manage model version changes.
⚙️ HOW IT WORKS:
When embedding model changes (new version, fine-tuned model, different architecture), all document embeddings must be regenerated because the vector space shifts. New queries embedded with new model won't align with old document vectors - similarity scores meaningless. Options: 1) Full re-indexing - regenerate all embeddings, rebuild indexes. Takes time and compute. 2) Versioned indexes - maintain separate indexes for different model versions, route queries accordingly during transition. 3) Gradual migration - dual-write new documents with both models, backfill old documents over time. 4) Embedding model freezing - avoid updates unless necessary.
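Option 2 above (versioned indexes) is essentially a routing table keyed by model version. A minimal sketch, with plain dicts standing in for real vector indexes:

```python
class VersionedIndexRouter:
    """Route queries to the index built with the matching embedding model version.
    Vectors are only comparable within one version's embedding space."""

    def __init__(self):
        self.indexes = {}  # model_version -> index handle (dicts here)

    def register(self, model_version, index):
        self.indexes[model_version] = index

    def search(self, model_version, query_id):
        if model_version not in self.indexes:
            raise KeyError(f"no index for model version {model_version}")
        return self.indexes[model_version].get(query_id)

router = VersionedIndexRouter()
router.register("bge-large-v1", {"q1": ["doc7", "doc2"]})
router.register("bge-large-v2", {"q1": ["doc2", "doc9"]})  # same query, new space

hits = router.search("bge-large-v2", "q1")
```

During migration, traffic can shift gradually from v1 to v2 by changing which version tag requests carry; once v2 serves 100%, the v1 index is decommissioned.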
💡 WHY IT MATTERS:
Model updates can break production systems if not managed. A naive switch without re-indexing leads to retrieval failure. Full re-indexing of billion-scale datasets can take days and cost thousands. Versioning adds complexity. This is why model selection is critical - frequent updates impractical. Ideally, choose a stable, well-performing model and update infrequently. When updates necessary, plan migration carefully with validation to ensure quality improvement justifies cost.
📋 EXAMPLE:
Company using ada-002 decides to switch to BGE-large for cost savings. 100M documents need re-embedding. Process: spin up parallel index, re-embed 100M docs (2 weeks), validate on test queries (recall improved from 0.88 to 0.92), gradually shift traffic (10% daily), monitor, then decommission old index. Total migration time 3 weeks, cost $10k in compute. The quality improvement and long-term savings justify migration. Without planning, could cause downtime or quality degradation.
QUESTION 15
How do you measure embedding model quality for your specific dataset?
🔍 DEFINITION:
Measuring embedding model quality for your specific dataset requires creating a domain-specific evaluation set with queries and relevant documents, then computing retrieval metrics to assess how well the model captures your domain's semantics. This is essential because general benchmarks may not reflect performance on specialized terminology and concepts.
⚙️ HOW IT WORKS:
Process: 1) Create evaluation dataset - collect 200-1000 representative queries from your domain. For each query, manually label 5-20 relevant documents from your corpus (or use click logs, expert judgments). Ensure coverage of domain concepts and edge cases. 2) Generate embeddings for all documents and queries using candidate models. 3) For each query, retrieve top-k documents and compute metrics: recall@k (proportion of relevant documents retrieved), precision@k (proportion of retrieved that are relevant), MRR (mean reciprocal rank, measures first relevant rank), and nDCG (discounted cumulative gain, accounts for ranking quality). 4) Compare across models including baselines (BM25, general embeddings). 5) Statistical significance testing to ensure differences meaningful.
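The core metrics in step 3 are short functions. The sketch below implements recall@k and MRR over ranked document-ID lists (nDCG is omitted for brevity).

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant set found in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mrr(all_ranked, all_relevant):
    """Mean reciprocal rank of the first relevant result per query."""
    total = 0.0
    for ranked, relevant in zip(all_ranked, all_relevant):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(all_ranked)

ranked = ["d3", "d1", "d9", "d2"]   # retrieval output for one query
relevant = {"d1", "d2"}             # labeled relevant docs

r2 = recall_at_k(ranked, relevant, 2)  # only d1 found in top-2 -> 0.5
m = mrr([ranked], [relevant])          # first relevant at rank 2 -> 0.5
```

Running these over a few hundred labeled queries per candidate model is usually enough to separate models that MTEB averages would rank identically.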
💡 WHY IT MATTERS:
General benchmarks (MTEB) test on Wikipedia and news - but your domain may have different characteristics. Medical terminology, legal jargon, product codes may be poorly handled by general models. A model scoring 0.90 on MTEB might score 0.60 on your data if it never saw similar terms. This leads to failed RAG systems. Domain evaluation identifies gaps, guides model selection, and provides baseline for improvement through fine-tuning.
📋 EXAMPLE:
Legal tech company evaluating embedding models. MTEB scores: Model A 0.89, Model B 0.87. Domain test (500 legal queries): Model A recall@10 0.72, Model B 0.81. Model B better on legal despite lower MTEB because trained on more legal data. Without domain evaluation, would pick wrong model and build failing system. The 9% recall difference translates to missing 9% of relevant cases in production - unacceptable.
QUESTION 16
What is negative mining in embedding model training?
🔍 DEFINITION:
Negative mining is a technique in contrastive learning that selects challenging negative examples - documents that are similar to the query but not relevant - to make the training more effective. Instead of using random negatives (too easy), negative mining forces the model to learn fine-grained distinctions between closely related but irrelevant items.
⚙️ HOW IT WORKS:
Standard contrastive learning uses in-batch negatives - other documents in same batch. These may be too easy (different topics). Negative mining strategies: 1) Hard negative mining - retrieve top documents that are similar but not relevant using a base model. Add these as explicit negatives. 2) Cross-batch negative mining - maintain queue of recent negatives from other batches. 3) Synthetic negatives - use LLM to generate similar-but-irrelevant texts. 4) Hierarchical negatives - negatives at different difficulty levels. The model learns to distinguish between, e.g., 'Python programming' and 'Python snake' - both contain 'Python' but meanings differ.
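Strategy 1 (hard negative mining with a base model) can be sketched with brute-force similarity: for each query, take the most similar corpus documents that are not labeled relevant. Embeddings here are random stand-ins for base-model output.

```python
import numpy as np

def mine_hard_negatives(query_vec, doc_matrix, positive_ids, num_negatives):
    """Return indices of the docs most similar to the query that are NOT
    known positives - the 'similar but irrelevant' training signal."""
    scores = doc_matrix @ query_vec
    ranked = np.argsort(-scores)  # most similar first
    negatives = [int(i) for i in ranked if int(i) not in positive_ids]
    return negatives[:num_negatives]

rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 32))
query = docs[0] + 0.01 * rng.normal(size=32)  # doc 0 is the labeled positive

hard = mine_hard_negatives(query, docs, positive_ids={0}, num_negatives=5)
```

These mined indices become explicit negatives in the next training round, forcing the model to separate the positive from its closest distractors rather than from easy random documents.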
💡 WHY IT MATTERS:
Without hard negatives, models learn to separate different topics but fail on subtle distinctions. In production RAG, subtle distinctions matter: 'diabetes treatment' vs 'diabetes symptoms' are different but related. Hard negatives teach these nuances. Models trained with hard negatives significantly outperform those without on retrieval tasks, especially for fine-grained domains. The technique is essential for state-of-the-art embedding models.
📋 EXAMPLE:
Training medical embedding model. Random negatives: 'diabetes' vs 'cancer' (easy). Hard negatives: 'type 1 diabetes treatment' vs 'type 2 diabetes treatment' (difficult). Model must learn that insulin and metformin are different treatments. By including hard negatives from retrieval results, model learns these distinctions. Final model better at distinguishing between closely related medical concepts, improving retrieval precision for specific queries.
QUESTION 17
What is the difference between text embeddings and multimodal embeddings?
🔍 DEFINITION:
Text embeddings represent only text in vector space. Multimodal embeddings represent multiple modalities (text, images, audio, video) in a shared vector space where similar concepts are close regardless of modality. This enables cross-modal search: searching images with text, finding text that describes an image, or comparing across modalities.
⚙️ HOW IT WORKS:
Multimodal embeddings are trained on paired data (image-caption pairs, video-transcript pairs) using contrastive learning. Models like CLIP (Contrastive Language-Image Pre-training) have separate encoders for each modality but project into a shared space. During training, matching pairs are pulled together and non-matching pairs pushed apart. After training, any image can be compared to any text via cosine similarity in the shared space. This enables zero-shot classification, cross-modal retrieval, and multimodal search.
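Zero-shot classification in the shared space reduces to a nearest-text lookup. A minimal sketch with hand-made toy vectors standing in for encoder outputs (in practice both sides would come from CLIP's image and text encoders):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def zero_shot_classify(image_vec, label_vecs):
    """CLIP-style zero-shot classification: the predicted label is the
    text embedding closest to the image embedding in the shared space."""
    sims = {label: cosine(image_vec, v) for label, v in label_vecs.items()}
    return max(sims, key=sims.get)

# Toy embeddings (illustrative values, not real CLIP outputs).
image_of_dog = [0.8, 0.1, 0.1]
labels = {
    "a photo of a dog": [0.9, 0.05, 0.05],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
print(zero_shot_classify(image_of_dog, labels))  # a photo of a dog
```

No dog classifier was ever trained here: the label set is free text, which is what makes the approach "zero-shot".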
💡 WHY IT MATTERS:
Real-world data is multimodal - documents contain images, products have photos, videos have audio. Multimodal embeddings enable unified understanding. E-commerce can search products by text or image; content moderation can flag images based on text policies; accessibility tools can generate image descriptions. As models evolve, multimodal understanding becomes essential for comprehensive AI systems.
📋 EXAMPLE:
E-commerce search: user searches 'red dress with floral pattern'. Text embedding finds products with those words. Multimodal (CLIP) can also find products where the image matches 'red floral dress' even if description says 'scarlet flowered gown'. User can also upload photo of dress they like, find similar products via image-to-image or image-to-text search. This multimodal capability enhances search beyond text-only.
QUESTION 18
How would you handle embedding drift over time in a production system?
🔍 DEFINITION:
Embedding drift refers to the gradual change in embedding distributions over time due to shifts in data (new products, changed terminology, evolving language) or model updates. This can degrade retrieval quality if not monitored and addressed. Handling drift requires detection, analysis, and mitigation strategies.
⚙️ HOW IT WORKS:
Detection methods: 1) Monitor query embedding distribution statistics (mean, variance) over time for significant shifts. 2) Track retrieval metrics on a golden dataset - if recall drops, drift may be the cause. 3) Compare embeddings of new vs old documents on the same topics - if distances increase, drift is occurring. 4) User feedback signals - a drop in satisfaction may indicate retrieval degradation. Mitigation: 1) Regularly index new data with the current embedding model. 2) Periodic full re-indexing (e.g., quarterly) to keep all vectors aligned. 3) Fine-tune the model on new data if the distribution shift is significant. 4) Versioned indexes with gradual migration.
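Detection method 1 can be sketched as a centroid-shift monitor. Toy vectors below; the alert threshold is an assumption that would be tuned against historical week-to-week variation:

```python
import math

def centroid(vectors):
    """Mean embedding of a window of vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def drift_score(baseline_embeddings, recent_embeddings):
    """Euclidean distance between the mean embedding of a baseline window
    and a recent window; a rising score suggests distribution drift."""
    c_old = centroid(baseline_embeddings)
    c_new = centroid(recent_embeddings)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c_old, c_new)))

# Baseline queries cluster in one region; recent ones have shifted.
baseline = [[1.0, 0.0], [0.9, 0.1], [1.1, -0.1]]
recent = [[0.2, 0.9], [0.3, 1.0], [0.1, 1.1]]
ALERT_THRESHOLD = 0.5  # assumption: tuned on historical variation
if drift_score(baseline, recent) > ALERT_THRESHOLD:
    print("drift alert: consider re-indexing or fine-tuning")
```

A centroid shift is a coarse signal - it misses drift that preserves the mean - so in practice it would be paired with the golden-dataset recall tracking described above.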
💡 WHY IT MATTERS:
Ignoring drift leads to silent degradation. Users start getting worse results, but no alerts trigger because system appears functional. For e-commerce, new products may not match old queries well. For news, terminology evolves. Proactive monitoring prevents this. Drift also affects A/B testing - comparing models across time periods becomes invalid if distributions changed.
📋 EXAMPLE:
E-commerce RAG with 5M products. After 6 months, 20% of products are new. Golden dataset recall drops from 0.92 to 0.88. Investigation shows the new products have a different embedding distribution (modern terminology). Solution: re-index all products with current embeddings (2 days). Recall is restored. Implement quarterly re-indexing and weekly drift monitoring to catch future shifts early. Without monitoring, the degraded system would have served users for months.
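The recall tracking in this example reduces to computing recall@k on the golden dataset each week. A minimal sketch with made-up query and document ids:

```python
def recall_at_k(golden, retrieved, k=5):
    """Fraction of golden queries whose relevant document appears in the
    top-k retrieved results; tracked over time to catch silent drift."""
    hits = sum(1 for q, relevant in golden.items()
               if relevant in retrieved.get(q, [])[:k])
    return hits / len(golden)

# Golden dataset: query id -> the document id that should be retrieved.
golden = {"q1": "doc_a", "q2": "doc_b", "q3": "doc_c", "q4": "doc_d"}
# This week's retrieval output (top results per query, illustrative).
retrieved = {
    "q1": ["doc_a", "doc_x"],
    "q2": ["doc_x", "doc_b"],
    "q3": ["doc_x", "doc_y"],  # miss: possible drift on this topic
    "q4": ["doc_d"],
}
print(recall_at_k(golden, retrieved))  # 0.75
```

Plotting this number weekly turns "silent degradation" into a visible trend line, and the misses (here q3) point directly at the drifting topics.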
QUESTION 19
What are the cost implications of calling a third-party embedding API at scale?
🔍 DEFINITION:
Using third-party embedding APIs (OpenAI, Cohere) at scale involves significant costs that grow linearly with usage. For high-volume applications, these costs can become substantial and often justify transitioning to self-hosted open-source models. Understanding the economics is crucial for sustainable RAG deployment.
⚙️ HOW IT WORKS:
Pricing is typically per 1M tokens (e.g., OpenAI $0.13, Cohere $0.10). For a typical document (500 words ≈ 670 tokens) and query (10 words ≈ 13 tokens): indexing 10M documents once costs 10M × 670 tokens = 6.7B tokens ≈ $871 (OpenAI); serving 1M queries/month costs 1M × 13 tokens = 13M tokens ≈ $1.69/month. Scaling up, 100M documents is ≈ $8,710 per full indexing pass, while even 10M queries/month is only ≈ $16.90. Query-side embedding is cheap; the recurring cost that grows at scale is (re-)indexing large, changing corpora - which, along with privacy and latency, is why large-scale users consider self-hosting.
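The arithmetic above is a one-line function, handy for budgeting; the default price is the assumed $0.13-per-1M-tokens figure from the text:

```python
def embedding_api_cost(n_texts, tokens_per_text, price_per_million=0.13):
    """API cost in dollars for embedding n_texts of a given token length."""
    return n_texts * tokens_per_text * price_per_million / 1_000_000

# Reproducing the figures from the text (illustrative pricing):
index_10m_docs = embedding_api_cost(10_000_000, 670)    # ~$871 one-time
queries_1m = embedding_api_cost(1_000_000, 13)          # ~$1.69/month
index_100m_docs = embedding_api_cost(100_000_000, 670)  # ~$8,710 one-time
```

Running the same function over projected corpus growth and re-indexing cadence gives the total spend to compare against self-hosting.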
💡 WHY IT MATTERS:
At moderate scale, APIs are cost-effective and simple; query embedding stays cheap even at high volume (100M queries/month ≈ $169). Corpus-side costs are what grow: quarterly re-indexing of 100M documents is ≈ 4 × $8,710 ≈ $35k/year, and heavy continuous ingestion scales the same way. Self-hosting requires infrastructure (GPUs/CPUs), engineering time, and maintenance, so the break-even analysis should compare total API spend (indexing + queries) against, e.g., 2 GPUs ≈ $3,000/month cloud + ~$2k/month engineering ≈ $5k/month. Re-embedding on the order of 60M documents a month (~40B tokens ≈ $5,200) already matches that; beyond it, self-hosting wins on cost, and privacy or latency requirements often tip the decision earlier.
📋 EXAMPLE:
A startup's catalog grows from 1M to 100M documents over 2 years, re-embedded monthly as listings change (illustrative figures). Initially, OpenAI embeddings are fine: ≈ $87/month. At 100M documents, re-indexing costs ≈ $8,710/month. Switching to self-hosted BGE on 2 GPUs (≈ $3,000/month) saves ≈ $5,700/month, which funds additional engineering. Without cost monitoring, the team would have overpaid by ~$68k/year. Cost-aware architecture evolves with scale.
QUESTION 20
When would you choose a local, self-hosted embedding model over a cloud API?
🔍 DEFINITION:
Choosing between self-hosted and cloud embedding APIs involves trade-offs in cost, privacy, latency, control, and engineering effort. Self-hosted becomes preferable at sufficient scale, for sensitive data, or when low latency and full control are required.
⚙️ HOW IT WORKS:
Decision factors: 1) Scale - break-even depends on total embedding volume (indexing + queries). Below it, the API is simpler; above it, self-hosted is cheaper. 2) Privacy - if data is sensitive (medical, legal, proprietary), self-hosting keeps it on-premises. 3) Latency - self-hosted can be optimized (CPU inference, batching), potentially beating API calls with network overhead. 4) Control - ability to fine-tune models for the domain, pin versions, and avoid API changes. 5) Compliance - some industries require that data not leave the jurisdiction. 6) Engineering resources - self-hosting requires a team to maintain it.
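These factors can be condensed into a rule-of-thumb chooser. The ordering (compliance first, then team capacity, then cost) follows the text; the 1.5× savings margin is an assumption for illustration, not an industry rule:

```python
def choose_hosting(monthly_api_cost, self_host_cost, data_sensitive=False,
                   must_stay_onprem=False, has_ml_ops_team=True):
    """Rule-of-thumb decision sketch: privacy and compliance override
    economics; otherwise compare monthly spend with a safety margin."""
    if must_stay_onprem or data_sensitive:
        return "self-hosted"  # compliance/privacy trumps cost
    if not has_ml_ops_team:
        return "api"          # nobody to maintain the infrastructure
    # Require clear savings before taking on the operational burden.
    return "self-hosted" if monthly_api_cost > 1.5 * self_host_cost else "api"

print(choose_hosting(170, 5000))                        # api
print(choose_hosting(17000, 11000))                     # self-hosted
print(choose_hosting(170, 5000, data_sensitive=True))   # self-hosted
```

The point of encoding the decision is less the thresholds than the ordering: a cost comparison is only reached after the non-negotiable constraints have been checked.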
💡 WHY IT MATTERS:
The wrong choice wastes money or risks data. At low scale, the API simplifies development. At high scale, self-hosting becomes essential for cost control. For sensitive data, self-hosting is the only option regardless of cost. The decision should be revisited as scale changes - many companies start with an API and migrate to self-hosted at growth milestones.
📋 EXAMPLE:
A healthcare startup with 50k queries/month handles patient data. Privacy is paramount - data cannot be sent to third-party APIs. Self-hosted BGE on a single GPU (~$300/month) is the only option despite the modest scale; compliance, not cost, dictates the choice. Another startup has public product data and a 200M-document catalog re-embedded monthly as listings change (illustrative figures): API cost ≈ 134B tokens ≈ $17,400/month. Self-hosted on 4 GPUs ($6,000/month) + $5k engineering = $11k/month, saving ≈ $6,400/month - the migration pays for itself within months. Each scenario is different.