QUESTION 01: What is tokenization and why does it matter for LLMs?
🔍 DEFINITION:
Tokenization is the process of converting raw text into smaller units called tokens that the model can process mathematically. These tokens can be words, subwords, or characters, and each is mapped to a unique integer ID in the model's vocabulary. It serves as the critical bridge between human-readable text and the numerical representations that neural networks can understand and manipulate.
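The text-to-integer mapping can be sketched in a few lines. This is a minimal illustration with a tiny hypothetical vocabulary (real vocabularies hold tens of thousands of entries); the token IDs are made up for the example.

```python
# Minimal sketch of the token <-> ID mapping a tokenizer maintains.
# The vocabulary and IDs here are hypothetical, chosen only to illustrate.
vocab = {"I": 40, "love": 512, "AI": 789, "!": 3, "[UNK]": 0}

def encode(tokens):
    # Map each token to its integer ID, falling back to [UNK] when unknown.
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

def decode(ids):
    # Invert the vocabulary to recover tokens from IDs.
    inv = {i: t for t, i in vocab.items()}
    return [inv[i] for i in ids]

ids = encode(["I", "love", "AI", "!"])
print(ids)          # [40, 512, 789, 3]
print(decode(ids))  # ['I', 'love', 'AI', '!']
```

The model only ever sees the integer IDs; decode is what turns its output back into readable text.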
⚙️ HOW IT WORKS:
The tokenizer first normalizes text (for example Unicode normalization, and in some tokenizers lowercasing), then splits it according to learned rules. For example, the sentence "I love AI!" with a BPE tokenizer might become ['I', 'love', 'AI', '!'] - each mapped to an ID such as [40, 512, 789, 3]. Special tokens like [CLS] (classification), [SEP] (separator), and [PAD] (padding) are added as needed. The tokenizer maintains a vocabulary file mapping tokens to IDs, typically containing 32k-256k entries. During training, the tokenizer is built by analyzing corpus statistics to determine the most useful token splits; at inference, every input must pass through the exact same tokenizer used during training.
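How "corpus statistics determine token splits" can be seen in one step of BPE training: count adjacent symbol pairs across the corpus and merge the most frequent pair into a new token. This is a toy sketch (real BPE repeats the loop until the vocabulary reaches its target size, and handles word boundaries and byte-level fallbacks); the corpus is invented for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    # words: list of symbol sequences, e.g. [['u','n','h','a','p','p','y'], ...]
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    # Replace every occurrence of the pair with a single merged token.
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(w[i] + w[i + 1])
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged

corpus = [list("unhappy"), list("unhappiness"), list("happy")]
pair = most_frequent_pair(corpus)  # a frequent pair such as ('h', 'a')
corpus = merge_pair(corpus, pair)  # that pair now becomes one token everywhere
```

Each merge adds one entry to the vocabulary, so frequent character sequences in the training corpus end up as single tokens while rare ones stay split into smaller pieces.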
💡 WHY IT MATTERS:
Tokenization is the gateway between human language and model mathematics, affecting virtually every aspect of LLM performance. It determines vocabulary size, influences how rare words are handled, affects sequence length (and thus context window usage), and impacts model performance on different languages and domains. Poor tokenization leads to out-of-vocabulary issues, inefficient use of context window (costing money in API calls), and degraded understanding of specialized terminology. For multilingual models, tokenizer design determines fairness across languages - if English is over-optimized, other languages suffer. Tokenization also affects inference speed and cost since APIs charge per token. Understanding tokenization helps practitioners optimize prompts, manage budgets, and debug model failures.
📋 EXAMPLE:
The word "unhappiness" with a word tokenizer might be OOV (out-of-vocabulary) and become [UNK], losing meaning. With BPE subword tokenization, it becomes ['un', 'happiness'] - both known tokens, preserving meaning. For cost comparison: English "I love AI" = 4 tokens. The same meaning in Thai "ฉันรักเอไอ" might be 8 tokens with an English-optimized tokenizer, costing twice as much for the same semantic content and using twice the context window space. This is why multilingual models need carefully designed tokenizers.
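The contrast between word-level OOV and subword fallback can be sketched directly. Both vocabularies below are tiny and hypothetical, and the greedy longest-match split is a simplified stand-in for BPE/WordPiece, not a real tokenizer's algorithm.

```python
# Word-level vocabulary: whole words only; anything else becomes [UNK].
word_vocab = {"un", "happiness", "happy"}
# Subword pieces, matched greedily longest-first (simplified stand-in for BPE).
subwords = ["happiness", "happy", "ness", "un", "i"]

def word_tokenize(word):
    # Word-level: any word missing from the vocabulary collapses to [UNK].
    return [word] if word in word_vocab else ["[UNK]"]

def subword_tokenize(word):
    # Greedy longest-match-first split into known pieces.
    pieces, rest = [], word
    while rest:
        for piece in sorted(subwords, key=len, reverse=True):
            if rest.startswith(piece):
                pieces.append(piece)
                rest = rest[len(piece):]
                break
        else:
            return ["[UNK]"]  # no piece matches: give up on this word
    return pieces

print(word_tokenize("unhappiness"))     # ['[UNK]'] - meaning lost
print(subword_tokenize("unhappiness"))  # ['un', 'happiness'] - meaning preserved
```

The subword split keeps both morphemes as known tokens, so the model can relate "unhappiness" to "happiness" even though the full word never appeared in the vocabulary.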