Explore topic-wise interview questions and answers.
Vector Databases
QUESTION 01
What is a vector database and how does it differ from a traditional relational database?
DEFINITION:
A vector database is a specialized database designed to store, index, and query high-dimensional vector embeddings efficiently using similarity search. Unlike traditional relational databases that excel at exact matches on structured data (names, dates, IDs), vector databases are optimized for finding 'similar' items based on vector distance, enabling semantic search, recommendations, and AI-powered retrieval.
HOW IT WORKS:
Vector databases store embeddings - dense numerical arrays (typically 384-1536 dimensions) that represent the semantic meaning of text, images, or other data. They use specialized indexing structures (HNSW, IVF, PQ) that enable Approximate Nearest Neighbor (ANN) search, finding vectors closest to a query vector without exhaustive comparison. This is fundamentally different from relational databases' B-tree indexes for exact equality or range queries. Vector databases also store metadata alongside vectors, enabling hybrid queries (e.g., 'find similar items where category = X').
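The core operation is easy to see in a few lines of plain Python. The sketch below ranks products by cosine similarity over toy 3-dimensional "embeddings"; the product names and vector values are invented for illustration, and a real system would get vectors from an embedding model and use an ANN index rather than a full scan:

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|): 1.0 means same direction, ~0 means unrelated
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy 3-dim "embeddings" (real embedding models produce hundreds of dimensions;
# these vectors are made up so the example stays readable).
products = {
    "cushioned athletic footwear": [0.9, 0.9, 0.1],
    "trail hiking boots":          [0.7, 0.8, 0.3],
    "stainless steel cookware":    [0.1, 0.2, 0.9],
}
query = [0.9, 0.8, 0.1]  # pretend embedding of "comfortable running shoes"

ranked = sorted(products, key=lambda p: cosine_similarity(query, products[p]), reverse=True)
print(ranked)  # most semantically similar product first
```

Note that "cushioned athletic footwear" ranks first even though it shares no words with the query - the similarity lives in the vectors, not the strings.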
WHY IT MATTERS:
Traditional databases can't efficiently answer queries like 'find documents similar to this one' or 'recommend products like this' - they require exact matches on keywords or categories. Vector databases enable semantic understanding: searching by meaning rather than exact words. This powers modern AI applications: RAG (retrieving relevant context), recommendation systems (finding similar items), image search (finding visually similar images), and anomaly detection (finding outliers). As AI generates more embeddings, vector databases have become essential infrastructure.
EXAMPLE:
Product search in an e-commerce database. Traditional SQL: SELECT * FROM products WHERE category = 'electronics' AND price < 500 - works for structured filters. But for 'find comfortable running shoes', traditional search fails because it can't match 'comfortable' semantically. Vector database: product descriptions embedded, user query embedded, returns products semantically similar to 'comfortable running shoes' even if they use different words ('cushioned athletic footwear'). This semantic matching is impossible with keyword-only traditional databases.
QUESTION 02
What is Approximate Nearest Neighbor (ANN) search and why is it used instead of exact search?
DEFINITION:
Approximate Nearest Neighbor (ANN) search is a technique that finds vectors close to a query vector without guaranteeing the exact nearest neighbors, trading a small amount of accuracy for dramatic gains in speed and scalability. Instead of comparing against every vector (exact search, O(n)), ANN uses indexing structures to narrow the search space, making it feasible for billion-scale datasets.
HOW IT WORKS:
Exact nearest neighbor search computes distance between query and every vector in database - O(n) time, acceptable for thousands of vectors but impossible for millions or billions (would take seconds to minutes per query). ANN pre-builds index structures that organize vectors for efficient approximation: trees (partition space), graphs (connect nearby vectors), or quantization (compress vectors). During search, the index guides the query to promising regions, examining only a fraction of vectors. Results are typically 90-99% accurate (recall) but 100-1000x faster.
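The O(n) baseline that ANN avoids can be sketched directly. This is a minimal exhaustive k-NN over random toy vectors (sizes are illustrative, not benchmarks):

```python
import math, random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_knn(query, vectors, k):
    # Exhaustive O(n*d) scan: compute the distance to every stored vector,
    # then keep the k closest. This is exactly the work ANN indexes avoid.
    order = sorted(range(len(vectors)), key=lambda i: euclidean(query, vectors[i]))
    return order[:k]

random.seed(0)
vectors = [[random.random() for _ in range(8)] for _ in range(1000)]
query = vectors[42]  # query with a vector we know is stored

top = exact_knn(query, vectors, k=3)
print(top[0])  # 42: a stored vector's nearest neighbor is itself (distance 0)
```

At 1,000 vectors this is instant; at a billion, the same loop is the "1000 seconds per query" problem described below, which is why ANN indexes examine only a fraction of the data.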
WHY IT MATTERS:
ANN makes vector search practical at scale. A billion-vector dataset with exact search would require 1 billion distance calculations per query - at 1μs per calculation, that's 1000 seconds per query. ANN with HNSW can achieve 0.99 recall in <10ms - 100,000x faster. This enables real-time applications: semantic search on massive document collections, recommendation systems with millions of users, and RAG on enterprise knowledge bases. Without ANN, vector databases wouldn't scale beyond toy datasets.
EXAMPLE:
E-commerce with 10 million products. Exact search: compare query vector against all 10M (10M operations) - 1 second per query, can't handle 100 QPS. ANN with HNSW: examines ~1000 candidates (0.01% of data) to find nearest neighbors - 1ms per query, handles 1000 QPS. Recall 98% - users won't notice the slight approximation, but system scales. This trade-off (2% accuracy loss for 1000x speed) enables production deployment.
QUESTION 03
What are the most popular vector databases and how do they compare (Pinecone, Weaviate, Chroma, Qdrant, pgvector)?
DEFINITION:
The vector database landscape includes managed services (Pinecone), open-source self-hosted options (Weaviate, Qdrant, Chroma), and extensions to traditional databases (pgvector). Each has different trade-offs in scalability, ease of use, features, and cost, making choice dependent on specific requirements.
HOW IT WORKS:
Pinecone: fully managed, serverless, handles scaling automatically, excellent performance, but proprietary and costlier at scale. Weaviate: open-source with cloud option, built-in modules for embedding and hybrid search, strong filtering. Qdrant: Rust-based, highly performant, fine-grained control over indexing, good for high-scale self-hosted. Chroma: lightweight, Python-native, designed for development and local use, less scalable. pgvector: PostgreSQL extension, adds vector support to existing DB, easy if already using Postgres, but limited performance at extreme scale. Milvus/Zilliz: enterprise-focused, highly scalable, complex to operate.
WHY IT MATTERS:
Choice affects development speed, production scalability, and cost. For startups prototyping, Chroma offers quick start. For production with moderate scale, pgvector simplifies stack (no new DB). For high-scale managed service, Pinecone reduces ops burden. For cost-sensitive large scale, self-hosted Qdrant or Weaviate may be best. Consider: vector dimension, dataset size, QPS requirements, need for hybrid search, filtering complexity, and team expertise.
EXAMPLE:
Startup building RAG for 1M documents. Phase 1 (prototype): Chroma for quick iteration. Phase 2 (launch): pgvector if already using Postgres, avoiding new infrastructure. Phase 3 (scale to 10M docs, 100 QPS): migrate to Qdrant self-hosted for better performance and cost control. Or if team small, Pinecone managed reduces ops burden despite higher cost. The right choice evolves with needs.
QUESTION 04
What is HNSW (Hierarchical Navigable Small World) indexing and how does it work?
DEFINITION:
HNSW (Hierarchical Navigable Small World) is a graph-based indexing algorithm for approximate nearest neighbor search that constructs a multi-layer graph structure where each layer is a progressively sparser set of vectors. Search navigates from top layer (coarse) down to bottom layer (fine), efficiently finding nearest neighbors with logarithmic complexity.
HOW IT WORKS:
HNSW builds a hierarchy of graphs. Bottom layer (layer 0) contains all vectors with connections to nearby neighbors. Each higher layer contains a random subset of vectors (e.g., layer 1 has 1/10 of vectors, layer 2 has 1/100). Connections within each layer form a small-world graph (short average path length). Search starts at top layer, finds closest node, moves to next layer using that node as entry point, repeats. This greedy navigation quickly converges to nearest neighbors. During index construction, parameters control: M (max connections per node), efConstruction (search depth during build), and efSearch (search depth during query).
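The navigation idea can be sketched with a single layer. The toy below links each vector to its M nearest neighbors, then greedily hops toward the query - a highly simplified stand-in for HNSW (real HNSW builds multiple layers incrementally and keeps a beam of efSearch candidates instead of a single greedy walker, precisely so it doesn't get stuck in local minima):

```python
import math, random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

random.seed(1)
vectors = [[random.random() for _ in range(4)] for _ in range(200)]
M = 8  # max connections per node (HNSW's M parameter)

# Build a single "layer": link every node to its M nearest neighbors.
# (Real HNSW builds several such layers incrementally, coarse to fine.)
graph = {
    i: sorted((j for j in range(len(vectors)) if j != i),
              key=lambda j: dist(v, vectors[j]))[:M]
    for i, v in enumerate(vectors)
}

def greedy_search(query, entry=0):
    # Hop to whichever neighbor is closer to the query; stop at a local minimum.
    current = entry
    while True:
        best = min(graph[current], key=lambda j: dist(query, vectors[j]))
        if dist(query, vectors[best]) >= dist(query, vectors[current]):
            return current
        current = best

query = vectors[123]
found = greedy_search(query)
# The walk only ever moves closer, so the result is at least as good as the entry point.
print(dist(query, vectors[found]) <= dist(query, vectors[0]))  # True
```

Each hop evaluates only M neighbors, so the total work is proportional to hops × M rather than to the dataset size - the source of HNSW's logarithmic search cost.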
WHY IT MATTERS:
HNSW is the most popular ANN algorithm due to its excellent trade-offs: high recall (0.99+), low latency (milliseconds), and reasonable memory usage. It consistently outperforms tree-based and quantization methods on benchmark datasets. Most vector databases (Pinecone, Qdrant, Weaviate) use HNSW variants. The main downside is memory - graph structure can be larger than the vectors themselves. For billion-scale datasets, memory costs can be significant.
EXAMPLE:
Searching 10M vectors with HNSW. Index built with M=16, efConstruction=200. Query: navigate the top layer (1000 nodes) in ~5 hops, the middle layer (100K nodes) in ~10 hops, the bottom layer (10M nodes) in ~50 hops. Each hop evaluates up to M neighbors, so the whole search costs on the order of 1,000 distance calculations vs 10M for exhaustive search - a roughly 10,000x reduction. Recall 0.99. This enables real-time search at scale. Without HNSW, searching 10M vectors would be impractical.
QUESTION 05
What is IVF (Inverted File Index) indexing and when is it used?
DEFINITION:
IVF (Inverted File Index) is a clustering-based ANN indexing method that partitions the vector space into regions (Voronoi cells) using k-means clustering. During search, only vectors in the closest cells to the query are examined, dramatically reducing search space. It's a simpler alternative to graph-based methods like HNSW.
HOW IT WORKS:
During indexing, vectors are clustered into nlist groups using k-means. Each cluster has a centroid, and vectors are assigned to their nearest centroid. An inverted index maps each centroid to its member vectors. During search, the query finds the nprobe closest centroids, then searches only vectors in those clusters. Parameters: nlist (number of clusters) controls granularity; nprobe (clusters to search) trades speed for accuracy. IVF is often combined with Product Quantization (IVF-PQ) to compress vectors and reduce memory.
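The cluster-then-probe flow above fits in a short sketch. This toy replaces k-means "training" with randomly sampled centroids to stay brief (all sizes are illustrative); the inverted-list structure and the nprobe mechanics are the real IVF idea:

```python
import math, random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

random.seed(2)
dim, n, nlist, nprobe = 4, 500, 10, 3
vectors = [[random.random() for _ in range(dim)] for _ in range(n)]

# Crude "training": use nlist random vectors as centroids
# (real IVF runs k-means to convergence here).
centroids = random.sample(vectors, nlist)

# Inverted lists: each cluster id maps to the ids of its member vectors.
lists = {c: [] for c in range(nlist)}
for i, v in enumerate(vectors):
    nearest = min(range(nlist), key=lambda c: dist(v, centroids[c]))
    lists[nearest].append(i)

def ivf_search(query, k):
    # Probe only the nprobe closest clusters instead of scanning all n vectors.
    probe = sorted(range(nlist), key=lambda c: dist(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    top = sorted(candidates, key=lambda i: dist(query, vectors[i]))[:k]
    return top, len(candidates)

query = vectors[7]
top, scanned = ivf_search(query, k=5)
print(top[0], scanned)  # finds vector 7 while scanning far fewer than 500 vectors
```

Raising nprobe widens the probe set, trading latency for recall - the same knob discussed in Question 06.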
WHY IT MATTERS:
IVF offers several advantages: lower memory than HNSW (especially with PQ), faster indexing (simpler structure), and good scalability. It's particularly useful when memory is constrained or for very large datasets where HNSW's graph memory overhead is prohibitive. IVF with PQ can handle billion-scale datasets on a single machine. The trade-off: lower recall than HNSW for same speed, or slower for same recall. Choice between IVF and HNSW depends on dataset size, memory budget, and accuracy requirements.
EXAMPLE:
100M vector dataset, memory constrained (32GB). An HNSW graph would require ~100M × 16 connections × 4 bytes ≈ 6.4GB, plus the raw vectors at 100M × 768 × 4 bytes ≈ 307GB - impossible. IVF-PQ with 262,144 clusters and 8-byte PQ codes (m=8 subvectors, 8 bits each) compresses vectors to 100M × 8 bytes = 800MB, with index overhead ~1GB - fits. Search with nprobe=64 examines ~24K vectors, 5ms latency, recall 0.90. This makes billion-scale search practical on modest hardware. IVF is the workhorse for large-scale, memory-constrained deployments.
QUESTION 06
What is the trade-off between recall, precision, and latency in ANN search?
DEFINITION:
In ANN search, recall (fraction of true nearest neighbors found), precision (fraction of retrieved vectors that are true neighbors), and latency form fundamental trade-offs governed by index parameters and search effort. Improving any typically degrades another, requiring careful tuning based on application requirements.
HOW IT WORKS:
Key parameters control the trade-off: For HNSW, efSearch (number of candidates examined) - higher efSearch increases recall but increases latency. For IVF, nprobe (clusters searched) - higher nprobe increases recall but increases latency. Index build parameters also affect base trade-off: more connections (HNSW M) or more clusters (IVF nlist) improve potential recall but increase memory and indexing time. The relationship is logarithmic: to increase recall from 0.90 to 0.99 typically requires 2-5x more latency. Precision-latency trade-off similar.
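The effort-vs-recall relationship can be simulated without any index. In this sketch the "ANN index" is just a random candidate sample whose size plays the role of efSearch/nprobe - a loose analogy, not how a real index chooses candidates, but it makes the recall curve concrete:

```python
import math, random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

random.seed(3)
n, dim, k = 2000, 6, 10
vectors = [[random.random() for _ in range(dim)] for _ in range(n)]
query = [random.random() for _ in range(dim)]

# Ground truth: the true top-k from an exhaustive scan.
exact = set(sorted(range(n), key=lambda i: dist(query, vectors[i]))[:k])

def recall_at_budget(budget):
    # Stand-in for an ANN index: rank only `budget` randomly chosen candidates.
    # Raising the budget mimics raising efSearch or nprobe: more candidates
    # examined means higher recall but more distance computations (latency).
    candidates = random.sample(range(n), budget)
    approx = set(sorted(candidates, key=lambda i: dist(query, vectors[i]))[:k])
    return len(approx & exact) / k

for budget in (100, 500, 2000):
    print(budget, recall_at_budget(budget))
# Examining every vector (budget = n) always yields recall 1.0.
```

Real indexes are far better than random sampling at the same budget, but the shape of the trade-off - diminishing recall gains per unit of extra work - is the same.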
WHY IT MATTERS:
Different applications need different balances. Recommendation systems may tolerate 0.80 recall if speed critical. RAG systems often need high recall (0.95+) because missing relevant documents directly harms answer quality. Real-time search needs low latency (<10ms), batch processing can accept higher latency. Understanding trade-offs enables tuning: start with target latency, maximize recall within that bound, or target recall, minimize latency. Production systems often tune per query type.
EXAMPLE:
RAG system with 10M vectors. Target: recall@10 0.95. HNSW with efSearch=128: recall 0.96, latency 8ms. efSearch=64: recall 0.91, latency 4ms. efSearch=256: recall 0.98, latency 15ms. Choose efSearch=128 meeting target. For another application (real-time recommendations) needing <5ms latency, accept recall 0.91 with efSearch=64. The same index serves both with different search parameters - flexibility of ANN tuning.
QUESTION 07
How does metadata filtering work in vector databases?
DEFINITION:
Metadata filtering in vector databases combines vector similarity search with structured filters on associated metadata (date, category, source, permissions). This enables queries like 'find documents similar to X where date > 2023 and category = "policy"', combining semantic understanding with precise constraints.
HOW IT WORKS:
Two main approaches: pre-filtering and post-filtering. Pre-filtering applies metadata filters before vector search: only vectors matching metadata criteria are considered in the ANN search. This is efficient but requires the index to support filtered search, which can be complex for some ANN structures. Post-filtering does vector search first, then filters results by metadata. Simpler but may waste compute if many results filtered out, and may miss relevant vectors if top-k don't include filtered matches. Advanced implementations use: 1) Filtered IVF - only search clusters where metadata might match. 2) Bitmap indices for fast filtering. 3) Combined indexes that store metadata alongside vectors for efficient lookup.
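Pre- vs post-filtering is easy to contrast in code. This sketch uses a made-up category field and brute-force ranking (a real database would do this inside the ANN index):

```python
import math, random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

random.seed(4)
vectors = [[random.random() for _ in range(4)] for _ in range(300)]
category = [random.choice(["policy", "memo"]) for _ in range(300)]

def pre_filter_search(query, want, k):
    # Pre-filtering: restrict the candidate set by metadata, then rank by distance.
    allowed = [i for i in range(len(vectors)) if category[i] == want]
    return sorted(allowed, key=lambda i: dist(query, vectors[i]))[:k]

def post_filter_search(query, want, k, fetch=30):
    # Post-filtering: rank everything first, then drop non-matching results.
    # If fewer than k of the fetched results match, the answer comes up short -
    # the failure mode described above for highly selective filters.
    top = sorted(range(len(vectors)), key=lambda i: dist(query, vectors[i]))[:fetch]
    return [i for i in top if category[i] == want][:k]

query = [0.5] * 4
pre = pre_filter_search(query, "policy", k=5)
post = post_filter_search(query, "policy", k=5)
print(len(pre), len(post))
```

With a selective filter, `fetch` must grow large for post-filtering to fill k results, which is where its cost blows up; pre-filtering never has that problem.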
WHY IT MATTERS:
Metadata filtering is essential for production RAG. Without it, you can't: restrict to authorized documents, filter by date (ensure recent info), limit to specific document types, or implement multi-tenancy. Poor filtering leads to wrong results or security issues. Performance impact varies: pre-filtering with good indexes adds 10-20% latency; post-filtering can be 10x slower if many results filtered out. Choice depends on selectivity: highly selective filters favor pre-filtering; low selectivity favors post-filtering.
EXAMPLE:
Legal RAG with 1M documents, filtered by jurisdiction. Query: 'precedent for contract disputes' + filter 'jurisdiction = California'. Pre-filtering: only search California documents (200K). Latency 5ms. Post-filtering: search all 1M (4ms), then filter (1ms) - 5ms total, same. But if filter 'jurisdiction = California AND year = 2024' (very selective, 1000 docs), post-filtering would search 1M to find 1000 matches - inefficient. Pre-filtering essential here. Modern vector databases optimize with hybrid approaches.
QUESTION 08
What is the difference between in-memory and disk-based vector storage?
DEFINITION:
In-memory vector databases store all vectors and indexes in RAM for fastest access, while disk-based systems store data on SSD/HDD and load portions as needed. The choice dramatically affects query latency, cost, and scalability, with trade-offs between speed and capacity.
HOW IT WORKS:
In-memory: all vectors loaded into RAM at startup. Queries hit memory - sub-millisecond latency. Indexes (HNSW graphs) also in memory for fast navigation. Limits: dataset size constrained by RAM (e.g., 10M 768-dim vectors = 30GB). Cost: RAM expensive. Disk-based: vectors stored on disk, index may be partly in memory. Queries may need disk I/O, increasing latency to milliseconds or more. Can handle datasets much larger than RAM using memory mapping or caching. Some systems use hybrid: index in memory, vectors on disk, loading only candidates.
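The disk-based access pattern - compute an offset, seek, read one vector - can be shown with a flat binary file. This is a stdlib-only sketch of the layout that mmap-based engines automate (file name and sizes are illustrative):

```python
import os, struct, tempfile

dim = 4
vectors = [[float(i)] * dim for i in range(100)]  # 100 toy float32 vectors

# Flat binary file layout: vector i starts at byte offset i * dim * 4.
path = os.path.join(tempfile.mkdtemp(), "vectors.f32")
with open(path, "wb") as f:
    for v in vectors:
        f.write(struct.pack(f"{dim}f", *v))

def read_vector(idx):
    # Disk-based access: seek to the offset and read a single vector,
    # instead of holding the whole dataset in RAM.
    with open(path, "rb") as f:
        f.seek(idx * dim * 4)
        return list(struct.unpack(f"{dim}f", f.read(dim * 4)))

print(read_vector(42))        # [42.0, 42.0, 42.0, 42.0]
print(os.path.getsize(path))  # 1600 bytes: 100 vectors * 4 dims * 4 bytes
```

A hybrid system keeps the ANN index in RAM to pick candidate ids, then uses exactly this kind of offset read to fetch only those candidates from disk.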
WHY IT MATTERS:
Choice depends on dataset size, latency requirements, and budget. In-memory essential for sub-10ms latency at high QPS - powers real-time applications. Disk-based enables billion-scale datasets that wouldn't fit in RAM - powers offline analysis and large-scale RAG where latency less critical. Cost difference: RAM ~$5/GB/month cloud, SSD ~$0.2/GB/month. For 100GB dataset, in-memory $500/month, disk-based $20/month - 25x difference.
EXAMPLE:
E-commerce with 50M products (150GB vectors). Real-time recommendations need <10ms latency - in-memory required, costs $750/month. Product search for internal analytics can tolerate 100ms - disk-based, costs $30/month. For RAG on 1B documents (3TB vectors), in-memory impossible (would cost $15k/month per replica). Disk-based with memory-mapped files makes it feasible. Each use case chooses based on requirements.
QUESTION 09
How do you handle updates and deletions in a vector database?
DEFINITION:
Handling updates and deletions in vector databases is challenging because ANN indexes (HNSW, IVF) are optimized for static data and degrade with frequent modifications. Strategies range from real-time updates (complex, may fragment index) to batch rebuilding (simple but creates staleness). The approach affects system freshness and operational complexity.
HOW IT WORKS:
Options: 1) Real-time updates - modify index directly: for HNSW, insert new vectors by finding their neighbors and updating connections; deletions marked as invalid and periodically removed. Can cause index degradation over time. 2) Soft deletion - mark vectors as deleted, filter them out during search, rebuild index periodically. 3) Versioned indexes - maintain multiple index segments, merge periodically (like LSM trees). 4) Stream processing - use streaming platforms to update incrementally. 5) Batch rebuild - rebuild entire index periodically (hourly/daily), simple but data stale between rebuilds.
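Option 2 (soft deletion plus periodic rebuild) is the easiest to sketch. The toy class below uses brute-force search in place of a real ANN index, but the tombstone-filter-compact lifecycle is the same:

```python
import math

class SoftDeleteIndex:
    """Toy index illustrating soft deletion with tombstones and compaction."""
    def __init__(self):
        self.vectors = {}     # id -> vector
        self.deleted = set()  # tombstones: ids marked deleted but not yet removed

    def upsert(self, vid, vector):
        self.vectors[vid] = vector
        self.deleted.discard(vid)

    def delete(self, vid):
        self.deleted.add(vid)  # mark only; the vector still occupies space

    def search(self, query, k):
        # Tombstoned ids are filtered out at query time.
        live = (v for v in self.vectors if v not in self.deleted)
        return sorted(live, key=lambda vid: math.dist(query, self.vectors[vid]))[:k]

    def compact(self):
        # Periodic rebuild step: physically drop tombstoned vectors.
        for vid in self.deleted:
            self.vectors.pop(vid, None)
        self.deleted.clear()

idx = SoftDeleteIndex()
idx.upsert("a", [0.0, 0.0]); idx.upsert("b", [1.0, 1.0]); idx.upsert("c", [2.0, 2.0])
idx.delete("a")
print(idx.search([0.0, 0.0], k=2))  # ['b', 'c']: 'a' is hidden immediately
idx.compact()
print(len(idx.vectors))             # 2: space reclaimed
```

Deletions become visible instantly (the filter) while the expensive structural work (the compact) is deferred - the freshness/performance balance described above.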
WHY IT MATTERS:
Application requirements determine acceptable staleness. News search needs updates within minutes; product catalog can tolerate hourly updates; archival search may be daily. Real-time updates increase complexity and may reduce query performance. Soft deletion with periodic rebuild balances freshness and performance for many use cases. The choice affects: query latency (real-time updates may slow), index size (deleted vectors waste space), and operational load (rebuild frequency).
EXAMPLE:
E-commerce catalog with 1M products, prices change daily. Soft deletion approach: updates write new vector, mark old as deleted. Queries filter out deleted (post-filter). Nightly rebuild: compact index, remove deleted vectors permanently. Query performance: 10ms during day, slightly slower due to filtering. After rebuild, back to 8ms. Freshness: updates visible within seconds (soft delete) but space grows until rebuild. This balances freshness and performance without complex real-time indexing.
QUESTION 10
What is the role of dimensionality in vector storage and search efficiency?
DEFINITION:
Vector dimensionality (e.g., 384, 768, 1536) directly impacts storage size, search speed, and retrieval quality. Higher dimensions capture more nuanced semantics but suffer from the 'curse of dimensionality' - distances become less discriminative, and storage/compute costs increase linearly. Choosing dimensionality involves trade-offs between expressiveness and efficiency.
HOW IT WORKS:
Storage: vectors of dimension d require d × 4 bytes (float32) each. 1M 1536-dim vectors = 6.1GB, 384-dim = 1.5GB. Search: distance calculations are O(d), so higher dimensions are proportionally slower. Index structures are also affected: HNSW with higher d needs more comparisons. Quality: higher dimensions can capture more semantic nuance, but beyond a point (typically 768-1024) returns diminish. Curse of dimensionality: in high dimensions, all vectors become nearly equidistant, making similarity search less meaningful. Dimensionality reduction (PCA, Matryoshka) can compress vectors while preserving semantics.
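The storage arithmetic is worth having as a one-liner when sizing a deployment:

```python
def storage_gb(n_vectors, dim, bytes_per_value=4):
    # float32 vectors: n * d * 4 bytes, reported in GB (1e9 bytes)
    return n_vectors * dim * bytes_per_value / 1e9

# Storage for 1M vectors at common embedding dimensions
for dim in (384, 768, 1536):
    print(dim, storage_gb(1_000_000, dim))  # 1.536, 3.072, 6.144 GB
```

Note this counts raw vectors only; index overhead (HNSW graph links, IVF centroids) comes on top, as the earlier questions quantify.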
WHY IT MATTERS:
Dimensionality choice affects system cost and performance. A 1536-dim model (text-embedding-ada-002) provides excellent quality but 4x storage and compute vs 384-dim models (MiniLM). For 100M vectors, that's 600GB vs 150GB storage, $3k/month vs $750/month cloud costs. Search latency: 1536-dim 20ms vs 384-dim 5ms. Quality difference may be 2-3% on retrieval metrics. Many applications find 384-dim sufficient, saving 75% costs.
EXAMPLE:
RAG for customer support with 50M documents. Compare 1536-dim (ada-002) vs 384-dim (MiniLM). Ada-002: recall@10 0.93, storage 300GB, latency 15ms, cost $1500/month. MiniLM: recall@10 0.90, storage 75GB, latency 5ms, cost $375/month. For this use case, 3% recall drop acceptable for 75% cost reduction and 3x faster search. Dimensionality choice optimized for business requirements, not maximum quality.
QUESTION 11
How do you choose between a dedicated vector database and a vector extension (pgvector)?
DEFINITION:
Choosing between dedicated vector databases (Pinecone, Weaviate, Qdrant) and vector extensions like pgvector involves trade-offs in scalability, feature set, operational complexity, and integration with existing infrastructure. The decision depends on dataset size, query volume, required features, and team expertise.
HOW IT WORKS:
pgvector adds vector similarity search to PostgreSQL, enabling hybrid SQL+vector queries in a familiar environment. Scales to millions of vectors with reasonable performance, uses existing PostgreSQL infrastructure. Dedicated vector databases offer: better scalability (billions of vectors), more ANN algorithms (HNSW, IVF), advanced features (metadata filtering, hybrid search), managed options, and optimized performance at scale. However, they add new infrastructure, require learning new APIs, and increase operational complexity.
WHY IT MATTERS:
For startups and moderate scale (<10M vectors, <100 QPS), pgvector often wins: simpler stack, no new DB to learn, ACID compliance, easy integration. For large scale (>100M vectors, >1000 QPS), dedicated databases become necessary: pgvector's indexes (ivfflat, and hnsw in recent versions) degrade at extreme scale, and it lacks the advanced features of dedicated engines. Also consider: need for hybrid search (pgvector supports it via SQL), real-time updates, multi-tenancy, and managed vs self-hosted preferences.
EXAMPLE:
Company with 5M product catalog, 50 QPS, already using PostgreSQL. pgvector: add column, create index, same SQL queries, no new infrastructure. Works well, recall 0.95, latency 20ms. Scales to needs. Another company with 200M documents, 1000 QPS, needs hybrid search with complex metadata filtering. pgvector would struggle; Qdrant self-hosted provides needed scale and features. The choice is about right-sizing infrastructure to requirements.
QUESTION 12
What is Product Quantization (PQ) and how does it compress vectors?
DEFINITION:
Product Quantization (PQ) is a lossy compression technique that reduces vector storage by splitting vectors into subvectors and quantizing each subvector independently using a learned codebook. It can compress vectors by 10-100x while preserving enough information for accurate similarity estimation, enabling billion-scale vector search on limited hardware.
HOW IT WORKS:
Process: 1) Split each d-dimensional vector into m subvectors (e.g., 768-dim → 8 subvectors of 96 dims each). 2) For each subvector space, run k-means clustering (typically k=256) to create a codebook of centroids. 3) Replace each subvector with the ID of its nearest centroid (8 bits if k=256). 4) Store only the m 8-bit codes per vector (total 8m bits vs the original d × 32 bits). For 768-dim, original 3072 bytes → compressed 8 bytes - 384x compression. During search, distances are approximated by summing precomputed distances from codebooks.
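The split-quantize-reconstruct cycle fits in a small sketch. To stay short, this toy "trains" codebooks by sampling subvectors instead of running k-means, and uses tiny dimensions (8-dim vectors, m=4); the encode/decode mechanics are the real PQ idea:

```python
import math, random

random.seed(5)
dim, m, k = 8, 4, 16   # 8-dim vectors, m=4 subvectors of 2 dims, k=16 centroids each
sub = dim // m

def split(v):
    return [v[i * sub:(i + 1) * sub] for i in range(m)]

def dsub(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# "Train" codebooks: real PQ runs k-means per subspace; sampling k training
# subvectors as centroids keeps this sketch short.
train = [[random.random() for _ in range(dim)] for _ in range(200)]
codebooks = [random.sample([split(t)[j] for t in train], k) for j in range(m)]

def encode(v):
    # Each subvector becomes the index of its nearest centroid:
    # m small codes (here 4 bytes) instead of dim floats (here 32 bytes).
    return [min(range(k), key=lambda c: dsub(s, codebooks[j][c]))
            for j, s in enumerate(split(v))]

def decode(code):
    # Lossy reconstruction: concatenate the chosen centroids.
    return [x for j, c in enumerate(code) for x in codebooks[j][c]]

v = [random.random() for _ in range(dim)]
code = encode(v)
err = math.sqrt(sum((a - b) ** 2 for a, b in zip(v, decode(code))))
print(code)           # 4 codebook ids, each fitting in one byte
print(round(err, 3))  # reconstruction error from the lossy compression
```

With larger m (more subvectors) the reconstruction error shrinks but the code grows - the compression-vs-accuracy knob described above.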
WHY IT MATTERS:
PQ makes billion-scale vector search practical. 1B 768-dim vectors uncompressed = 3TB - too large for memory. With PQ (m=8), compressed = 8GB - fits in memory. Search accuracy typically 80-95% of uncompressed, depending on parameters. IVF-PQ combines clustering (IVF) with compression (PQ) for state-of-the-art large-scale search. Trade-off: compression ratio vs accuracy - more subvectors (higher m) better accuracy but less compression. Essential for cost-effective large-scale deployments.
EXAMPLE:
1B image vectors (768-dim) for visual search. Uncompressed: 3TB RAM - impossible. IVF-PQ with 262k clusters, m=16 (128-bit): index 8GB + vectors 16GB = 24GB - fits on single machine. Search examines ~50k vectors, 5ms latency, recall@10 0.85. Good enough for production. Without PQ, this scale would require dozens of machines. PQ democratizes large-scale vector search.
QUESTION 13
How do you benchmark the performance of a vector database for your use case?
DEFINITION:
Benchmarking vector databases requires measuring recall, latency, throughput, and scalability under workloads that match your application's query patterns, dataset characteristics, and performance requirements. Systematic benchmarking prevents choosing a database that looks good on paper but fails in production.
HOW IT WORKS:
Process: 1) Create representative dataset - sample from your actual data (or use public benchmark like SIFT, GIST, DEEP). Size should match production scale (or scaled down proportionally). 2) Define ground truth - for a subset of queries, compute exact nearest neighbors (exhaustive search) to measure recall. 3) Define workload - query distribution (similar to real queries), QPS target, latency requirements. 4) Test configurations - vary index parameters (M, efConstruction for HNSW; nlist, nprobe for IVF), hardware, batch sizes. 5) Measure metrics: recall@k, latency (p50, p95, p99), throughput (QPS at target latency), memory usage, indexing time. 6) Compare across databases under same conditions.
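The measurement harness for steps 2 and 5 can be sketched generically. Here the "system under test" is a deliberately crude placeholder (it ranks a random candidate subset); in a real benchmark you would call your database client at that point and keep the surrounding recall/latency bookkeeping unchanged:

```python
import math, random, statistics, time

def dist(a, b):
    return math.dist(a, b)

random.seed(6)
n, dim, k = 3000, 8, 10
db = [[random.random() for _ in range(dim)] for _ in range(n)]
queries = [[random.random() for _ in range(dim)] for _ in range(20)]

def exact(q):
    # Ground truth via exhaustive search (step 2 above)
    return sorted(range(n), key=lambda i: dist(q, db[i]))[:k]

def system_under_test(q, budget=600):
    # Placeholder for the database client being benchmarked.
    cand = random.sample(range(n), budget)
    return sorted(cand, key=lambda i: dist(q, db[i]))[:k]

recalls, latencies_ms = [], []
for q in queries:
    truth = set(exact(q))
    t0 = time.perf_counter()
    result = system_under_test(q)
    latencies_ms.append((time.perf_counter() - t0) * 1000)
    recalls.append(len(set(result) & truth) / k)

print("mean recall@10:", round(statistics.mean(recalls), 2))
print("p95 latency ms:", round(sorted(latencies_ms)[int(0.95 * len(latencies_ms))], 3))
```

Run the same harness against each candidate database and parameter configuration so the numbers are directly comparable.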
WHY IT MATTERS:
Vector databases perform differently based on data characteristics. Some excel at high-dimensional, some at low. Some handle filtering well, some don't. Recall-latency trade-offs vary. Without benchmarking, you might choose a database that gives 0.95 recall at 10ms on public benchmarks but 0.80 recall at 50ms on your data due to different dimensionality or distribution. Systematic benchmarking reveals true performance and guides parameter tuning.
EXAMPLE:
Comparing Qdrant vs Weaviate for 5M 768-dim vectors with metadata filtering. Results: Qdrant recall@10 0.95 at 8ms with filtering; Weaviate 0.92 at 12ms. Throughput: Qdrant 1200 QPS, Weaviate 800 QPS at p95<20ms. Indexing: Qdrant 2 hours, Weaviate 3 hours. Based on this, Qdrant chosen. Without benchmarking, would rely on marketing claims and potentially choose wrong database costing thousands in wasted performance.
QUESTION 14
What is multi-tenancy in vector databases and how is it implemented?
DEFINITION:
Multi-tenancy in vector databases enables multiple independent users or organizations to share the same database infrastructure while maintaining complete data isolation, security, and performance separation. Each tenant's vectors are invisible to others, and queries only return results from that tenant's data, essential for SaaS applications.
HOW IT WORKS:
Implementation approaches: 1) Separate indexes per tenant - each tenant gets their own physical index. Simple, excellent isolation, but scales poorly with many tenants (file handles, memory overhead). 2) Partitioned single index - all tenants in one index with tenant ID metadata. Queries filter by tenant ID (pre-filtering). Scales to many tenants but filtering overhead. 3) Hybrid - groups of tenants share indexes based on size. 4) Namespaces/collections - logical separation within same database (Pinecone namespaces, Qdrant collections). Performance considerations: separate indexes provide best isolation but resource waste; partitioned scales better but filtering latency.
WHY IT MATTERS:
Multi-tenancy is essential for SaaS products: each customer's data must be isolated for privacy and security. Choice affects scalability: with 10k tenants, separate indexes impossible (too many file handles). Partitioned with tenant ID filtering works but requires careful tuning to maintain performance. Also impacts cost: shared infrastructure reduces per-tenant cost. Security critical: must prevent cross-tenant leakage.
EXAMPLE:
SaaS platform for legal document search with 500 law firm tenants. Approach: partitioned single index with tenant_id field. Queries: 'find similar documents' + filter 'tenant_id = X'. Pre-filtering with tenant_id index ensures only firm's documents searched. 500 tenants, 2M docs each (1B total) - works well. Latency: 15ms vs 10ms without filtering - acceptable. Security: tenant_id filter ensures isolation. For 50k smaller tenants, would need different approach (tenant groups) due to filtering overhead. Multi-tenancy design critical for scalability.
QUESTION 15
How do you scale a vector database to billions of vectors?
DEFINITION:
Scaling vector databases to billions of vectors requires distributed architectures that partition data across multiple machines while maintaining fast search, high availability, and consistency. This involves sharding, replication, and sophisticated query routing to achieve linear scalability.
HOW IT WORKS:
Key techniques: 1) Sharding - partition vectors across nodes based on vector ID or hash. Each node responsible for subset of data. Queries broadcast to all shards, results merged. 2) Replication - copies of shards for high availability and read scaling. 3) Distributed indexing - coordinate index builds across nodes. 4) Query routing - for ANN, may use two-level approach: coarse quantization (cluster) to identify relevant shards, then search only those. 5) Load balancing - distribute queries evenly. 6) Dynamic resharding - add/remove nodes without downtime. Systems like Milvus, Qdrant Cloud, Pinecone implement these with Kubernetes for orchestration.
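The scatter-gather pattern from points 1 and 4 can be sketched with in-process "shards" (lists standing in for nodes; all sizes are illustrative). The key property shown is that merging per-shard top-k lists recovers the exact global top-k:

```python
import heapq, math, random

def dist(a, b):
    return math.dist(a, b)

random.seed(7)
n_shards, per_shard, dim, k = 4, 250, 4, 5
shards = [[[random.random() for _ in range(dim)] for _ in range(per_shard)]
          for _ in range(n_shards)]

def shard_search(shard_id, query, k):
    # Each node ranks only its local partition and returns (distance, global_id).
    scored = [(dist(query, v), shard_id * per_shard + i)
              for i, v in enumerate(shards[shard_id])]
    return heapq.nsmallest(k, scored)

def distributed_search(query, k):
    # Scatter: query every shard for its local top-k.
    partials = [shard_search(s, query, k) for s in range(n_shards)]
    # Gather: merge partial lists. The global top-k is always contained in the
    # union of per-shard top-k lists, so no true neighbor is lost in the merge.
    return heapq.nsmallest(k, (hit for p in partials for hit in p))

query = shards[2][10]  # a vector stored on shard 2, global id 2*250 + 10 = 510
top = distributed_search(query, k)
print(top[0])  # (0.0, 510): the exact match survives the merge
```

In production the scatter happens over the network, which is where the example's extra ~10ms of latency comes from.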
WHY IT MATTERS:
Billion-scale vector search enables applications impossible with single-node: internet-scale search, large enterprise knowledge bases, recommendation at Spotify/Netflix scale. Without distribution, a billion 768-dim vectors would require 3TB RAM and single-node search would be too slow. Distribution enables linear scaling: double nodes, double QPS capacity. However, distributed systems introduce complexity: consistency challenges, network overhead, and cost.
EXAMPLE:
E-commerce with 2B product vectors. Single node: impossible (6TB RAM). Distributed: 20 nodes each with 100M vectors (300GB RAM per node). HNSW indexes per node. Query: broadcast to all 20 nodes (each searches 100M, returns top-100), merge results for final top-10. Latency 30ms (20ms search + 10ms network). 1000 QPS achievable. With replication, can scale to 5000 QPS. This makes billion-scale practical.
QUESTION 16
What is a namespace or collection in a vector database?
DEFINITION:
A namespace or collection is a logical grouping of vectors within a vector database that provides isolation and organization. It's analogous to a table in relational databases or an index in Elasticsearch, allowing multiple independent datasets to coexist in the same database instance with separate configuration, indexing, and querying.
HOW IT WORKS:
Each namespace has: its own vector dimension, similarity metric (cosine, dot, L2), index parameters (HNSW M, efConstruction), and metadata schema. Vectors in different namespaces are physically separated (different index files) but share database resources. Operations (insert, search, delete) are scoped to a namespace. Namespaces can be created/dropped dynamically. In distributed setups, namespaces may span multiple shards. Some systems (Pinecone) use namespaces for multi-tenancy; others (Qdrant) use collections similarly.
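The scoping behavior is simple to model: every operation takes a namespace, and queries never touch other namespaces' data. A toy store (brute-force search, invented names) makes the isolation and clean-deletion properties concrete:

```python
import math

class VectorStore:
    """Toy store where each namespace is an isolated collection of vectors."""
    def __init__(self):
        self.namespaces = {}

    def upsert(self, ns, vid, vector):
        self.namespaces.setdefault(ns, {})[vid] = vector

    def search(self, ns, query, k):
        # Scoped to one namespace: other tenants' vectors are never examined.
        coll = self.namespaces.get(ns, {})
        return sorted(coll, key=lambda vid: math.dist(query, coll[vid]))[:k]

    def drop(self, ns):
        self.namespaces.pop(ns, None)  # clean deletion of all data in one call

store = VectorStore()
store.upsert("customer_a", "doc1", [0.1, 0.2])
store.upsert("customer_b", "doc9", [0.1, 0.2])
print(store.search("customer_a", [0.1, 0.2], k=5))  # ['doc1'] only
store.drop("customer_b")
print("customer_b" in store.namespaces)  # False
```

Real databases add per-namespace index configuration and physical separation on top, but the API surface looks much like this.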
WHY IT MATTERS:
Namespaces simplify data organization and management. Instead of running separate database instances for each project or tenant, you use namespaces, reducing operational overhead. They enable: 1) Multi-tenancy - each customer gets namespace. 2) Testing - separate dev/prod namespaces. 3) Experimentation - different index configurations for different data types. 4) Clean deletion - drop namespace to remove all data. Performance isolation varies: some databases share resources fairly, others may have contention.
EXAMPLE:
SaaS company with 50 customers, each with 1M vectors. Instead of 50 database instances, use one database with 50 namespaces. Customer A queries: search namespace 'customer_a'. Resources shared but isolation maintained. Dev team has 'staging' namespace for testing. New customer onboarded: create namespace, start indexing. Operations simplified: one database to monitor, backup, scale. Namespaces make multi-tenant SaaS practical.
QUESTION 17
How would you migrate from one vector database to another?
📖 DEFINITION:
Migrating between vector databases involves transferring vectors, metadata, and index configurations while minimizing downtime and ensuring data integrity. This complex operation requires careful planning, validation, and often dual-running systems to maintain availability during transition.
⚙️ HOW IT WORKS:
Migration steps: 1) Assessment - compare schemas, feature support, index parameters. Map metadata fields, vector dimensions, similarity metrics. 2) Export - extract vectors and metadata from source database. May need pagination for large datasets. 3) Transformation - convert to target format, handle any incompatibilities (different ID formats, metadata types). 4) Import - bulk load into target database. May need to tune batch sizes for performance. 5) Index build - build ANN indexes (can take hours for billion-scale). 6) Validation - compare query results between source and target on sample queries (recall@k). 7) Dual-run - run both systems in parallel, compare results, gradually shift traffic. 8) Cutover - switch production traffic, monitor closely. 9) Decommission - after confidence, shut down source.
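Steps 2 and 6 above are the most mechanical and easiest to script. A sketch, using a hypothetical `StubStore` client in place of real source/target SDKs (substitute the actual fetch and search calls for your databases):

```python
class StubStore:
    """Toy stand-in for a vector DB client, used to exercise the helpers."""
    def __init__(self, rows):
        self.rows = rows  # list of (id, vector) pairs

    def fetch(self, offset, limit):
        return self.rows[offset:offset + limit]

    def search(self, query, k):
        ranked = sorted(self.rows, key=lambda r: sum((a - b) ** 2
                                                     for a, b in zip(r[1], query)))
        return [rid for rid, _ in ranked[:k]]

def export_in_pages(source, page_size=1000):
    """Step 2: pull vectors out in pages to bound memory on large datasets."""
    offset = 0
    while True:
        page = source.fetch(offset=offset, limit=page_size)
        if not page:
            break
        yield from page
        offset += page_size

def recall_at_k(source, target, sample_queries, k=10):
    """Step 6: average fraction of source top-k IDs the target also returns."""
    overlaps = []
    for q in sample_queries:
        src_ids = set(source.search(q, k))
        tgt_ids = set(target.search(q, k))
        overlaps.append(len(src_ids & tgt_ids) / k)
    return sum(overlaps) / len(overlaps)

source = StubStore([("v1", (0.0, 0.0)), ("v2", (1.0, 1.0)), ("v3", (2.0, 2.0))])
target = StubStore(list(export_in_pages(source, page_size=2)))
print(recall_at_k(source, target, [(0.0, 0.0), (2.0, 2.0)], k=2))
```

On a real migration the same `recall_at_k` comparison runs on a held-out query sample before and during the dual-run phase, flagging cutover only when the overlap stays above your threshold.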
💡 WHY IT MATTERS:
Migration risks include data loss, downtime, performance degradation, and incorrect search results. For production systems, zero-downtime migration requires dual-running. Validation critical: if recall drops from 0.95 to 0.90, user experience suffers. Migration may also be opportunity to re-evaluate chunking, embedding models, or index parameters. Cost of migration must be weighed against benefits of new database.
📌 EXAMPLE:
Migrating from Pinecone (managed) to self-hosted Qdrant for cost savings at scale. Process: export 100M vectors (3 days), transform (1 day), import to Qdrant (2 days), build indexes (1 day). Dual-run for 1 week comparing results (recall 0.96 vs 0.97 - Qdrant better). Gradually shift 10% traffic daily, monitoring latency and recall. After full cutover, decommission Pinecone. Monthly cost drops from $10k to $2k. Migration complex but justified by savings.
QUESTION 18
What security and access control features should a production vector database have?
📖 DEFINITION:
Production vector databases require comprehensive security features including authentication, authorization, encryption, audit logging, and network security to protect sensitive vector data and ensure compliance with regulations (GDPR, HIPAA, SOC2). These features are essential for enterprise deployments.
⚙️ HOW IT WORKS:
Key security features: 1) Authentication - verify identity (API keys, IAM, OAuth, mTLS). 2) Authorization - role-based access control (RBAC) at collection/namespace level, API key permissions. 3) Encryption at rest - vectors encrypted in storage (AES-256). 4) Encryption in transit - TLS for all network communication. 5) Network security - VPC support, private endpoints, IP whitelisting. 6) Audit logging - track all access and modifications. 7) Data isolation - multi-tenancy with strict separation. 8) Compliance certifications - SOC2, HIPAA readiness, GDPR tools. 9) Backup/restore - encrypted backups with retention policies.
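Feature 2 (authorization) can be illustrated with a minimal sketch: an API key maps to a role, and every operation is checked against both the role's grants and the collections the key may touch. The role table and key names here are invented for illustration; production systems back this with IAM/OAuth and audit-log every decision.

```python
ROLE_GRANTS = {
    "reader": {"search"},
    "writer": {"search", "upsert"},
    "admin":  {"search", "upsert", "delete", "create_collection"},
}

API_KEYS = {
    "key-researcher": {"role": "reader", "collections": {"papers"}},
    "key-ingest":     {"role": "writer", "collections": {"papers", "notes"}},
}

def authorize(api_key, action, collection):
    """Allow only if the key is known, covers the collection, and its role grants the action."""
    ident = API_KEYS.get(api_key)
    if ident is None:
        return False                              # authentication failure
    if collection not in ident["collections"]:
        return False                              # collection-level isolation
    return action in ROLE_GRANTS[ident["role"]]   # role-based permission

print(authorize("key-researcher", "search", "papers"))   # allowed
print(authorize("key-researcher", "upsert", "papers"))   # denied: reader role
```

The key point is that the check is enforced at the database boundary, per collection/namespace, rather than trusted to application code.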
💡 WHY IT MATTERS:
Vector databases often store sensitive data: customer information, proprietary documents, personal data. Breaches can cause legal liability, reputational damage, and regulatory fines. For healthcare (HIPAA), financial (SOC2), or EU customers (GDPR), specific features required. Even without compliance mandates, security best practices protect against common threats. Enterprise customers won't adopt without these features.
📌 EXAMPLE:
Healthcare RAG system storing patient records. Vector database must: encrypt data at rest (HIPAA requirement), support VPC (no public internet), have audit logs (track all access), provide role-based access (researchers vs clinicians), and maintain HIPAA compliance documentation. Without these, cannot deploy. Qdrant Enterprise or Pinecone with SOC2 compliance chosen. Security features non-negotiable for production.
QUESTION 19
What is the relationship between the embedding model and the vector database index?
📖 DEFINITION:
The embedding model and vector database index are deeply interconnected: the model's output dimensionality, distribution, and similarity metric directly determine index performance, accuracy, and configuration. Understanding this relationship is crucial for optimizing RAG systems.
⚙️ HOW IT WORKS:
Key connections: 1) Dimensionality - higher dimensions increase index size, search latency, and memory. Index parameters (M in HNSW) may need tuning for high dimensions. 2) Similarity metric - model trained for cosine similarity (normalized embeddings) vs dot product affects distance calculations. Index must match. 3) Distribution - if embeddings poorly distributed (clustered or sparse), index efficiency changes. 4) Normalization - many models output unit vectors, enabling faster cosine similarity via dot product. 5) Quantization compatibility - some compression techniques (PQ) work better with certain embedding properties. 6) Model updates - changing embedding model requires re-indexing all vectors.
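Connection 4 is easy to verify numerically: once vectors are normalized to unit length, cosine similarity and dot product give identical scores, which is why an index configured for dot product on normalized embeddings behaves exactly like a cosine index while skipping the per-query norm computations.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [1.0, 2.0]
an, bn = normalize(a), normalize(b)
# On unit vectors, the two metrics agree to floating-point precision.
print(abs(cosine(a, b) - dot(an, bn)) < 1e-12)
```

This is also why the metric must be fixed before indexing: an index built for dot product over unnormalized vectors ranks by magnitude as well as direction, and will return different neighbors than a cosine index.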
💡 WHY IT MATTERS:
Choosing embedding model without considering index implications leads to suboptimal performance. A 4096-dim model may have excellent semantic quality but make search 10x slower and 10x more expensive than 768-dim model with similar quality. Some models produce embeddings where ANN indexes struggle (highly skewed distributions). Understanding this relationship enables co-design: pick model with good quality-to-dimension ratio, ensure metric matches, test index performance with sample embeddings before full-scale deployment.
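The cost side of that trade-off is simple arithmetic: raw vector memory scales linearly with dimension, so before any index overhead, a 1536-dim model costs twice the RAM of a 768-dim one at the same corpus size. A back-of-envelope sketch, assuming float32 (4 bytes per component):

```python
def raw_vector_bytes(n_vectors, dim, bytes_per_float=4):
    """Raw storage for the vectors alone, excluding index overhead and metadata."""
    return n_vectors * dim * bytes_per_float

for dim in (768, 1536, 4096):
    gb = raw_vector_bytes(100_000_000, dim) / 1e9
    print(f"{dim:>4} dims, 100M vectors: {gb:,.0f} GB raw")
```

These figures also explain earlier sizing examples in this guide: 100M vectors at 768 dims is roughly 300 GB per node before HNSW graph overhead, which typically adds a further constant per vector proportional to the M parameter.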
📌 EXAMPLE:
Two embedding models considered: Model A 1536-dim, cosine similarity, uniform distribution. Model B 768-dim, dot product, slightly clustered. Testing with HNSW: Model A recall@10 0.95 at 8ms; Model B recall@10 0.94 at 5ms. Model B 37% faster with similar quality - better choice. If Model A had a unique quality advantage (0.98 recall), might accept slower search. Without testing, might pick Model A blindly and over-provision hardware. The model-index relationship is critical for cost-performance optimization.
QUESTION 20
How do vector databases fit into a broader data architecture?
📖 DEFINITION:
Vector databases are one component in modern data architectures, working alongside data lakes, warehouses, streaming platforms, and other systems. They serve the specialized role of enabling semantic similarity search, complementing rather than replacing existing data infrastructure.
⚙️ HOW IT WORKS:
Typical architecture: 1) Data sources (databases, data lakes, streaming) feed into processing pipelines. 2) Data transformed, cleaned, and passed to embedding models. 3) Generated vectors stored in vector database with metadata. 4) Applications query vector DB for similarity search, often combining with other databases for additional context. 5) Results may be enriched from data warehouses or operational databases. 6) Feedback loops capture user interactions to improve embeddings. Vector databases integrate via APIs, change data capture (CDC), or batch ETL.
💡 WHY IT MATTERS:
Vector databases aren't standalone solutions - they're part of a system. They don't replace relational databases for transactions, data warehouses for analytics, or search engines for keyword search. Understanding integration patterns prevents architectural mistakes: don't store all data in vector DB (expensive, unnecessary), do use it alongside other systems. For RAG, vector DB retrieves relevant documents, which may then be fetched from object storage or data lake. For recommendations, vector DB finds candidates, filtered by business rules from relational DB.
📌 EXAMPLE:
E-commerce platform architecture: Product catalog in PostgreSQL (source of truth). Daily ETL job: extract products, generate embeddings, store in Qdrant with product IDs. Web app: user query → embed → Qdrant search → get product IDs → fetch full details from PostgreSQL → apply business logic (pricing, inventory) → return. Real-time inventory updates from PostgreSQL, embeddings refreshed nightly. This hybrid architecture leverages each system's strengths: PostgreSQL for transactions, Qdrant for similarity. Vector DB fits into, not replaces, existing stack.
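The query path in that architecture can be sketched end to end. Everything here is a hypothetical stand-in: `embed` fakes the embedding model, `VECTOR_INDEX` stands in for Qdrant, and `PRODUCTS` for PostgreSQL; the point is the division of labor, with the vector store returning candidate IDs, the relational store supplying authoritative details, and business rules filtering last.

```python
def embed(text):
    # Stand-in for an embedding model call (length and word-count features).
    return [float(len(text)), float(text.count(" "))]

VECTOR_INDEX = {           # "Qdrant": product_id -> embedding, refreshed nightly
    "p1": [20.0, 2.0],
    "p2": [5.0, 0.0],
}
PRODUCTS = {               # "PostgreSQL": source of truth for product details
    "p1": {"name": "Trail runner", "price": 120, "in_stock": True},
    "p2": {"name": "Flip flops", "price": 15, "in_stock": False},
}

def vector_search(query_vec, k=5):
    """Candidate generation: nearest product IDs by squared L2 distance."""
    ranked = sorted(VECTOR_INDEX, key=lambda pid: sum(
        (a - b) ** 2 for a, b in zip(VECTOR_INDEX[pid], query_vec)))
    return ranked[:k]

def handle_query(text):
    ids = vector_search(embed(text))                      # vector DB: candidates
    rows = [PRODUCTS[pid] | {"id": pid} for pid in ids]   # relational DB: details
    return [r for r in rows if r["in_stock"]]             # business logic: filter

print(handle_query("comfortable running shoes"))
```

Note that inventory filtering happens after retrieval against the live relational data, so nightly embedding refreshes never surface an out-of-stock item as purchasable.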