QUESTION 01: What is tokenization and why does it matter for LLMs?
🔍 DEFINITION:
Tokenization is the process of converting raw text into smaller units called tokens that the model can process mathematically. These tokens can be words, subwords, or characters, and each is mapped to a unique integer ID in the model's vocabulary. It serves as the critical bridge between human-readable text and the numerical representations that neural networks can understand and manipulate.
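The text-to-integer mapping can be sketched in a few lines. This is a minimal illustration with a tiny hypothetical vocabulary (real vocabularies hold tens of thousands of entries); the token IDs are made up for the example.

```python
# Minimal sketch of the token <-> ID mapping a tokenizer maintains.
# The vocabulary and IDs here are hypothetical, chosen only to illustrate.
vocab = {"I": 40, "love": 512, "AI": 789, "!": 3, "[UNK]": 0}

def encode(tokens):
    # Map each token to its integer ID, falling back to [UNK] when unknown.
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

def decode(ids):
    # Invert the vocabulary to recover tokens from IDs.
    inv = {i: t for t, i in vocab.items()}
    return [inv[i] for i in ids]

ids = encode(["I", "love", "AI", "!"])
print(ids)          # [40, 512, 789, 3]
print(decode(ids))  # ['I', 'love', 'AI', '!']
```

The model only ever sees the integer IDs; decode is what turns its output back into readable text.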
⚙️ HOW IT WORKS:
The tokenizer first normalizes text (for example Unicode normalization, and in some tokenizers lowercasing), then splits it according to learned rules. For example, the sentence "I love AI!" with a BPE tokenizer might become ['I', 'love', 'AI', '!'] - each mapped to an ID such as [40, 512, 789, 3]. Special tokens like [CLS] (classification), [SEP] (separator), and [PAD] (padding) are added as needed. The tokenizer maintains a vocabulary file mapping tokens to IDs, typically containing 32k-256k entries. During training, the tokenizer is built by analyzing corpus statistics to determine the most useful token splits; at inference, every input must pass through the exact same tokenizer used during training.
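How "corpus statistics determine token splits" can be seen in one step of BPE training: count adjacent symbol pairs across the corpus and merge the most frequent pair into a new token. This is a toy sketch (real BPE repeats the loop until the vocabulary reaches its target size, and handles word boundaries and byte-level fallbacks); the corpus is invented for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    # words: list of symbol sequences, e.g. [['u','n','h','a','p','p','y'], ...]
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    # Replace every occurrence of the pair with a single merged token.
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(w[i] + w[i + 1])
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged

corpus = [list("unhappy"), list("unhappiness"), list("happy")]
pair = most_frequent_pair(corpus)  # a frequent pair such as ('h', 'a')
corpus = merge_pair(corpus, pair)  # that pair now becomes one token everywhere
```

Each merge adds one entry to the vocabulary, so frequent character sequences in the training corpus end up as single tokens while rare ones stay split into smaller pieces.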
💡 WHY IT MATTERS:
Tokenization is the gateway between human language and model mathematics, affecting virtually every aspect of LLM performance. It determines vocabulary size, influences how rare words are handled, affects sequence length (and thus context window usage), and impacts model performance on different languages and domains. Poor tokenization leads to out-of-vocabulary issues, inefficient use of context window (costing money in API calls), and degraded understanding of specialized terminology. For multilingual models, tokenizer design determines fairness across languages - if English is over-optimized, other languages suffer. Tokenization also affects inference speed and cost since APIs charge per token. Understanding tokenization helps practitioners optimize prompts, manage budgets, and debug model failures.
📋 EXAMPLE:
The word "unhappiness" with a word tokenizer might be OOV (out-of-vocabulary) and become [UNK], losing meaning. With BPE subword tokenization, it becomes ['un', 'happiness'] - both known tokens, preserving meaning. For cost comparison: English "I love AI" = 4 tokens. The same meaning in Thai "ฉันรักเอไอ" might be 8 tokens with an English-optimized tokenizer, costing twice as much for the same semantic content and using twice the context window space. This is why multilingual models need carefully designed tokenizers.
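The contrast between word-level OOV and subword fallback can be sketched directly. Both vocabularies below are tiny and hypothetical, and the greedy longest-match split is a simplified stand-in for BPE/WordPiece, not a real tokenizer's algorithm.

```python
# Word-level vocabulary: whole words only; anything else becomes [UNK].
word_vocab = {"un", "happiness", "happy"}
# Subword pieces, matched greedily longest-first (simplified stand-in for BPE).
subwords = ["happiness", "happy", "ness", "un", "i"]

def word_tokenize(word):
    # Word-level: any word missing from the vocabulary collapses to [UNK].
    return [word] if word in word_vocab else ["[UNK]"]

def subword_tokenize(word):
    # Greedy longest-match-first split into known pieces.
    pieces, rest = [], word
    while rest:
        for piece in sorted(subwords, key=len, reverse=True):
            if rest.startswith(piece):
                pieces.append(piece)
                rest = rest[len(piece):]
                break
        else:
            return ["[UNK]"]  # no piece matches: give up on this word
    return pieces

print(word_tokenize("unhappiness"))     # ['[UNK]'] - meaning lost
print(subword_tokenize("unhappiness"))  # ['un', 'happiness'] - meaning preserved
```

The subword split keeps both morphemes as known tokens, so the model can relate "unhappiness" to "happiness" even though the full word never appeared in the vocabulary.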