Explore topic-wise interview questions and answers.
Graph RAG
QUESTION 01
What is GraphRAG and how does it differ from standard vector-based RAG?
DEFINITION:
GraphRAG enhances retrieval by incorporating knowledge graphs - structures that represent entities as nodes and relationships as edges. Unlike standard RAG that only uses vector similarity on text chunks, GraphRAG can traverse relationships between entities, enabling retrieval based on connections and multi-hop reasoning that pure vector search cannot capture.
HOW IT WORKS:
GraphRAG typically maintains both vector indexes (for text chunks) and a knowledge graph (for entities and relationships). During retrieval, it may: 1) Identify entities in the query, 2) Find those entities in the graph, 3) Traverse graph relationships to find related entities and their associated documents, 4) Combine with vector search results. This enables finding information connected through paths like 'X acquired Y' or 'Z is a competitor of X' that vector similarity might miss. The graph can be pre-built from documents using entity extraction, or existing knowledge graphs can be integrated.
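The hybrid flow above can be sketched in plain Python; the toy graph, the entity-to-document mapping, and the stub `vector_search` are illustrative assumptions, not a real implementation:

```python
# Minimal sketch of GraphRAG-style hybrid retrieval: traverse an entity
# graph to collect documents, then union with (stubbed) vector results.

graph = {  # entity -> list of (relation, neighbor); toy data
    "Apple": [("acquired", "Beats"), ("acquired", "Shazam")],
    "Beats": [("makes", "headphones")],
}
entity_docs = {  # entity -> associated document ids
    "Beats": ["doc_beats_deal"], "Shazam": ["doc_shazam_deal"],
}

def graph_retrieve(query_entities, max_hops=1):
    """Collect documents attached to entities reachable from the query."""
    frontier, seen, docs = set(query_entities), set(query_entities), []
    for _ in range(max_hops):
        nxt = set()
        for e in frontier:
            for _rel, nb in graph.get(e, []):
                if nb not in seen:
                    seen.add(nb)
                    nxt.add(nb)
                    docs.extend(entity_docs.get(nb, []))
        frontier = nxt
    return docs

def vector_search(query):
    """Stub standing in for a real embedding-similarity search."""
    return ["doc_apple_overview"]

def hybrid_retrieve(query, query_entities):
    # Graph-derived documents first, then similarity hits, deduplicated.
    docs = graph_retrieve(query_entities) + vector_search(query)
    return list(dict.fromkeys(docs))
```

A real system would add entity linking for step 1 and a reranker over the merged list; the union-then-dedupe shape is the core idea.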
WHY IT MATTERS:
Standard RAG excels at finding semantically similar text but misses explicit relationships. For queries like 'What companies has Apple acquired?', vector search finds documents mentioning 'Apple acquisitions' but may miss specific acquisition documents if they don't use those exact words. GraphRAG explicitly models the 'acquired' relationship, ensuring all acquisitions are found via graph traversal. This can improve recall on relationship-heavy queries (gains of 15-30% have been reported) and enables multi-hop questions that standard RAG cannot answer.
EXAMPLE:
Query: 'Find drugs developed by companies that Pfizer acquired.' Standard RAG: search for 'Pfizer acquisitions' and 'drugs developed' - may find some but miss many. GraphRAG: 1) Find Pfizer node in graph. 2) Traverse 'acquired' edges to get acquired companies (Wyeth, Warner-Lambert, etc.). 3) For each, traverse 'developed' edges to get drugs. 4) Retrieve documents about those drugs. Result: comprehensive list including drugs from all acquisitions, even if never mentioned together in any document. This relationship-based retrieval is GraphRAG's power.
QUESTION 02
What is a knowledge graph and how is it constructed from unstructured text?
DEFINITION:
A knowledge graph is a structured representation of knowledge where entities (people, organizations, concepts) are nodes and relationships between them are edges, forming a network of interconnected information. Constructing it from unstructured text involves extracting entities, identifying relationships, and resolving references to build a coherent graph.
HOW IT WORKS:
Construction pipeline: 1) Entity extraction - use NER models to identify entities (person, organization, location, product, etc.) in text. 2) Coreference resolution - link mentions like 'he', 'the company' to specific entities. 3) Relationship extraction - identify relationships between entities (e.g., 'works_for', 'acquired', 'located_in') using relation extraction models or LLMs. 4) Entity resolution/deduplication - recognize that 'Apple Inc.' and 'Apple' refer to same entity. 5) Graph construction - build nodes for entities, edges for relationships, store in graph database (Neo4j, Amazon Neptune). 6) Continuous updates - as new documents added, extract new entities/relationships and integrate.
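Steps 4-5 of this pipeline can be sketched as follows; the alias table and triples are toy assumptions standing in for real extraction output:

```python
# Fold extracted (subject, relation, object) triples into a graph,
# merging known aliases into one canonical node so 'Apple Inc.' and
# 'Apple' don't become duplicate nodes.

ALIASES = {"Apple Inc.": "Apple", "Apple": "Apple"}  # toy alias table

def canonical(name):
    return ALIASES.get(name, name)

def build_graph(triples):
    graph = {}  # node -> list of (relation, node)
    for subj, rel, obj in triples:
        s, o = canonical(subj), canonical(obj)
        graph.setdefault(s, []).append((rel, o))
        graph.setdefault(o, [])  # ensure object exists as a node
    return graph

triples = [
    ("Apple Inc.", "acquired", "Beats"),
    ("Apple", "acquired", "Shazam"),
]
g = build_graph(triples)
# Both triples attach to the single canonical "Apple" node.
```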
WHY IT MATTERS:
Knowledge graphs capture explicit relationships that text alone obscures. They enable structured queries ('find all subsidiaries of Microsoft') and relationship traversal that vector search cannot do. For RAG, graphs provide a complementary retrieval mechanism: find entities, traverse relationships, then retrieve associated documents. This is especially valuable for enterprise data where relationships (org charts, product hierarchies, acquisition histories) are critical.
EXAMPLE:
From a news article 'Elon Musk, CEO of Tesla and SpaceX, announced new Starship plans', extraction yields entities: Elon Musk (person), Tesla (organization), SpaceX (organization). Relationships: Elon Musk → CEO_of → Tesla; Elon Musk → CEO_of → SpaceX. Another article, 'Tesla acquired Grohmann Engineering', adds: Tesla → acquired → Grohmann Engineering. Now the graph knows Grohmann is connected to Elon Musk via two hops (Musk → CEO_of → Tesla → acquired → Grohmann). This enables answering 'What companies has Elon Musk indirectly acquired?' - impossible from text alone.
QUESTION 03
What is entity extraction and relation extraction in the context of GraphRAG?
DEFINITION:
Entity extraction identifies and classifies named entities (people, organizations, locations, products) from text, while relation extraction identifies semantic relationships between those entities (e.g., 'works for', 'acquired', 'located in'). Together they form the foundation for building knowledge graphs from unstructured documents, enabling graph-based retrieval.
HOW IT WORKS:
Entity extraction uses NER models (BERT-based, spaCy, or LLMs) to find spans of text that refer to entities and classify them into types. Relation extraction goes further: given two entities in text, it predicts the relationship between them. This can be done with: 1) Fine-tuned models on relation classification tasks. 2) LLMs with prompting ('Extract relationships from this text as (entity1, relation, entity2) triples'). 3) Rule-based patterns for well-structured text. Extracted triples (subject, predicate, object) become edges in the knowledge graph. Confidence scores can be attached for filtering low-quality extractions.
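A minimal sketch of option 3, rule-based relation extraction, assuming a single illustrative pattern for the 'acquired' relation (real systems would maintain many patterns or use a model):

```python
import re

# One regex pattern per relation type; capitalized token runs stand in
# for proper entity spans. Purely illustrative, not production NER.

PATTERNS = [
    (re.compile(r"(?P<h>[A-Z][\w.]*(?: [A-Z][\w.]*)*) acquired "
                r"(?P<t>[A-Z][\w.]*(?: [A-Z][\w.]*)*)"), "acquired"),
]

def extract_triples(text):
    """Return (subject, relation, object) triples matched by patterns."""
    triples = []
    for pattern, relation in PATTERNS:
        for m in pattern.finditer(text):
            triples.append((m.group("h"), relation, m.group("t")))
    return triples
```

Triples from such rules would typically carry lower confidence scores than model-based extractions and be filtered accordingly.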
WHY IT MATTERS:
Without entity and relation extraction, GraphRAG can't exist - the graph must be built from somewhere. These techniques turn unstructured text into structured knowledge that enables relationship-based retrieval. Quality matters: missed entities mean missing graph nodes; wrong relations mean incorrect traversal. For domain-specific applications, models need fine-tuning on domain entities and relations. The combination of extraction and graph enables answering questions that require connecting information across documents.
EXAMPLE:
From sentence 'Pfizer acquired Biohaven in 2022 for $11.6 billion', entity extraction identifies: Pfizer (ORG), Biohaven (ORG), 2022 (DATE), $11.6 billion (MONEY). Relation extraction identifies: (Pfizer, acquired, Biohaven), (acquisition, date, 2022), (acquisition, value, $11.6B). These triples populate graph. Later, query 'What acquisitions did Pfizer make in 2022?' traverses graph from Pfizer along 'acquired' edges with date filter, finds Biohaven. Without extraction, this information would be buried in text, unreachable by relationship queries.
QUESTION 04
How does Microsoft's GraphRAG implementation work at a high level?
DEFINITION:
Microsoft's GraphRAG (from Microsoft Research) is a specific implementation that combines graph-based community detection with LLM summarization to enable global sense-making over large document collections. It builds a knowledge graph from documents, partitions it into communities, generates summaries for each community, and uses these for both local (specific queries) and global (thematic) question answering.
HOW IT WORKS:
Pipeline: 1) Entity and relationship extraction from documents using LLMs, building a knowledge graph. 2) Community detection using Leiden algorithm to partition graph into hierarchical communities of related entities. 3) For each community, generate summaries using LLM that capture key themes and information. 4) For user queries, two modes: Local search - retrieve relevant entities and their neighborhoods, answer based on connected text chunks. Global search - identify relevant communities via community summaries, synthesize answer across communities. This enables both detailed factual answers and high-level thematic responses.
WHY IT MATTERS:
Standard RAG struggles with global questions requiring synthesis across many documents ('What are the main themes in this year's shareholder letters?'). GraphRAG's community summaries provide a way to answer such questions by first understanding the high-level structure of the knowledge. Microsoft's implementation showed significant improvements on Q&A tasks requiring multi-document synthesis, especially for thematic questions. It's particularly valuable for analyzing large document collections where relationships and themes matter.
EXAMPLE:
Analyzing 1000 research papers on climate change. Local query: 'What are the effects of warming on coral reefs?' - retrieve relevant entities and papers. Global query: 'What are the main research themes in climate science this decade?' - GraphRAG uses community summaries to identify major topics (sea-level rise, extreme weather, agricultural impacts) and synthesize across papers. Standard RAG would retrieve papers but couldn't synthesize themes effectively. GraphRAG's community structure enables this global understanding.
QUESTION 05
What is community detection in GraphRAG and why is it useful?
DEFINITION:
Community detection in GraphRAG partitions the knowledge graph into clusters (communities) of closely related entities based on graph structure. These communities represent coherent topics or themes - groups of entities that are densely connected. Detecting them enables hierarchical understanding of the knowledge base and supports global summarization.
HOW IT WORKS:
Algorithms like Leiden or Louvain detect communities by optimizing modularity - density of connections within communities vs between them. In GraphRAG, after building entity graph from documents, community detection runs to identify natural groupings. This creates a hierarchy: small communities of tightly related entities, nested within larger communities representing broader themes. Each community can be summarized by an LLM, capturing its key information. The hierarchy enables drilling down from general to specific.
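Leiden and Louvain both optimize modularity; a minimal sketch of computing modularity for a candidate partition (toy dict-based graph, undirected edge list) makes the objective concrete:

```python
from collections import defaultdict

def modularity(edges, community):
    """Q = (1/2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j):
    fraction of edges inside communities minus the fraction expected
    if edges were rewired at random preserving degrees."""
    m = len(edges)
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    q = 0.0
    # Edge term: each undirected edge counted once contributes 1/m
    # (it appears twice in the ordered-pair sum, 2 * 1/(2m)).
    for u, v in edges:
        if community[u] == community[v]:
            q += 1.0 / m
    # Null-model term over all ordered node pairs in the same community.
    nodes = list(degree)
    for u in nodes:
        for v in nodes:
            if community[u] == community[v]:
                q -= degree[u] * degree[v] / (2.0 * m) ** 2
    return q
```

Putting every node in one community always yields Q = 0; a partition that isolates densely connected groups scores higher, which is what the optimizer searches for.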
WHY IT MATTERS:
Communities reveal the knowledge base's latent structure. In scientific literature, communities might represent research areas. In news, they might represent ongoing stories. In enterprise data, they might represent projects or topics. This structure enables: 1) Global summarization - summarize each community to understand major themes. 2) Efficient retrieval - for broad queries, identify relevant communities via their summaries. 3) Sense-making - users can explore knowledge by community. Without community detection, the graph is just nodes and edges; with it, you get organized knowledge.
EXAMPLE:
Knowledge graph from news articles about technology companies. Community detection finds: Community A (Apple-related entities: iPhone, Tim Cook, App Store), Community B (Google-related: Sundar Pichai, Android, Search), Community C (AI companies: OpenAI, Anthropic, AI safety). Communities reveal industry structure. For global query 'What are the major tech industry trends?', GraphRAG uses community summaries to identify themes: mobile ecosystem (A), search/advertising (B), AI race (C). Without communities, would have to synthesize from all documents directly - computationally expensive and less structured.
QUESTION 06
What are the trade-offs between GraphRAG and traditional RAG?
DEFINITION:
GraphRAG and traditional RAG offer different trade-offs in retrieval capabilities, implementation complexity, and the query types they excel at. Traditional RAG is simpler and effective for factual, similarity-based questions. GraphRAG adds relationship understanding and multi-hop reasoning at the cost of complexity and latency.
HOW IT WORKS:
Trade-off dimensions: 1) Query types - traditional RAG excels at 'find similar' and factual queries where the answer exists in text. GraphRAG excels at relationship queries ('which companies did X acquire?'), multi-hop queries ('suppliers of competitors'), and global synthesis ('themes'). 2) Implementation complexity - traditional RAG needs a vector DB and embeddings; GraphRAG needs entity extraction, relation extraction, a graph DB, and community detection - significantly more complex. 3) Latency - GraphRAG adds graph traversal time (10-100ms) plus potential multiple hops. 4) Freshness - GraphRAG requires rebuilding the graph when documents change, slower than vector index updates. 5) Interpretability - graph paths provide explainable reasoning ('found via acquisition relationship'), while vector search is a black box.
WHY IT MATTERS:
Choice depends on application. For customer support Q&A on product docs, traditional RAG sufficient - no complex relationships needed. For competitive intelligence, financial analysis, or research synthesis, GraphRAG's relationship understanding essential. The cost and complexity of GraphRAG must be justified by query types. Many systems combine both: vector for broad retrieval, graph for specific relationship queries.
EXAMPLE:
Legal research system. Traditional RAG: good for finding similar cases, statutes. GraphRAG adds: finding cases citing a precedent (citation graph), judges who ruled on related cases, lawyers who argued similar cases. For simple fact lookup, use vector. For complex relationship queries, use graph. Combined system serves all needs but with higher complexity. Trade-off analysis shows graph features justify cost for legal research where relationships matter.
QUESTION 07
When should you choose GraphRAG over standard RAG?
DEFINITION:
Choose GraphRAG when your data has rich relationships that matter for queries, when questions require multi-hop reasoning across entities, when you need to understand connections not explicit in text, or when global synthesis across documents is required. Standard RAG suffices for simpler, fact-oriented applications without complex relationships.
HOW IT WORKS:
Decision criteria: 1) Relationship density - does your data have important connections (acquisitions, reporting lines, citations, dependencies)? GraphRAG essential. 2) Query patterns - do users ask 'what X related to Y' or multi-hop questions? Graph needed. 3) Document interconnectedness - if documents reference each other heavily, graph captures this. 4) Need for explanation - graph paths provide explainable reasoning. 5) Global understanding - need to summarize themes across documents? Graph communities enable this. 6) Scale and complexity tolerance - can you handle graph construction overhead?
WHY IT MATTERS:
Using GraphRAG unnecessarily adds cost and complexity. Using standard RAG when relationships matter leads to missed information and poor answers. The decision should be based on data and query analysis. For enterprise knowledge graphs (org charts, product hierarchies, acquisition histories), GraphRAG is natural fit. For content recommendation based on similarity, standard RAG sufficient.
EXAMPLE:
Pharmaceutical company knowledge base. Data includes: drugs, their mechanisms, diseases they treat, clinical trials, manufacturers, research papers. Queries include: 'Find drugs that target the same pathway as Drug X' (relationship), 'What are the side effects of drugs developed by companies Pfizer acquired?' (multi-hop). Standard RAG fails on these. GraphRAG essential. Conversely, HR policy documents have few relationships - standard RAG fine. The choice driven by whether relationships are central to information needs.
QUESTION 08
What is local vs. global search in GraphRAG?
DEFINITION:
In GraphRAG, local search focuses on retrieving specific information related to entities in the query, traversing the immediate neighborhood in the graph to find relevant documents. Global search operates at the community level, using community summaries to answer broad, thematic questions that require synthesis across many entities and documents.
HOW IT WORKS:
Local search: given query, identify entities mentioned. Retrieve those entity nodes and their immediate neighbors (1-2 hops) in the graph. Gather text chunks associated with these entities. Use these as context for generation. Good for factual, entity-centric questions. Global search: identify which communities are relevant to query by matching against community summaries. Retrieve summaries of relevant communities. Use these summaries (which synthesize information across many entities) as context. Good for thematic, overview questions. Some implementations combine both: start local, if insufficient, expand to community level.
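Both modes can be sketched over toy data structures; the graph, chunks, community summaries, and the crude keyword-overlap scoring are simplifying assumptions (real systems score summaries with embeddings or an LLM):

```python
# Toy backing data for the two search modes.
graph = {"Tesla": ["Cybertruck", "Elon Musk"], "Cybertruck": []}
entity_chunks = {"Cybertruck": ["Cybertruck specs ..."],
                 "Elon Musk": ["Musk bio ..."]}
community_summaries = {
    "ev_manufacturing": "Trends in electric vehicle manufacturing ...",
    "social_media": "Platform moderation debates ...",
}

def local_search(query_entities):
    """Gather chunks for query entities plus their 1-hop neighbors."""
    context = []
    for e in query_entities:
        for node in [e] + graph.get(e, []):
            context.extend(entity_chunks.get(node, []))
    return context

def global_search(query):
    """Pick the community summary with the most word overlap."""
    words = set(query.lower().split())
    return max(community_summaries.values(),
               key=lambda s: len(words & set(s.lower().split())))
```

An implementation that combines both would try `local_search` first and fall back to `global_search` when too little entity-level context is found.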
WHY IT MATTERS:
Different query types need different scales of context. 'What is the stock price of Apple?' needs only Apple entity (local). 'What are the major trends in consumer electronics?' needs synthesis across many companies, products, analysts (global). GraphRAG's two modes handle both. Local provides precise, entity-grounded answers. Global provides big-picture understanding impossible from local retrieval alone. This duality makes GraphRAG suitable for both fact lookup and analysis.
EXAMPLE:
Query 'Tell me about Tesla's Cybertruck' - local search: find Tesla entity, traverse to Cybertruck product entity, retrieve associated documents (specs, reviews, announcements). Answer detailed. Query 'What are the emerging trends in electric vehicle manufacturing?' - global search: identify communities related to EV manufacturing (battery tech, automation, supply chain), use their summaries to synthesize answer about trends across industry. Local can't answer global; global can't answer local. GraphRAG provides both.
QUESTION 09
What graph databases are commonly used with GraphRAG (Neo4j, Amazon Neptune)?
DEFINITION:
Graph databases store and query graph-structured data (nodes, edges, properties) efficiently. For GraphRAG, they provide the backend for storing the knowledge graph and executing graph traversals. Popular choices include Neo4j (leading open-source graph DB), Amazon Neptune (managed cloud service), and specialized options like ArangoDB or TigerGraph.
HOW IT WORKS:
Graph databases store entities as nodes with properties, relationships as edges with types and properties. They support graph query languages: Cypher (Neo4j), Gremlin (multiple), or SPARQL (RDF stores). For GraphRAG, typical operations: 1) Find nodes matching entity names. 2) Traverse relationships (e.g., 'follow outgoing 'acquired' edges'). 3) Retrieve connected nodes and their associated document IDs. 4) Filter by node properties (date, type). Graph databases optimize these traversals with index-free adjacency - following edges is fast regardless of graph size. They can also compute community detection algorithms internally.
WHY IT MATTERS:
Choice affects performance, scalability, and integration. Neo4j is popular for on-premises deployments with rich ecosystem (APOC, Graph Data Science library). Amazon Neptune is fully managed, integrates with AWS, good for cloud-native apps. Performance varies with graph size and traversal patterns. For billion-node graphs, specialized distributed graph DBs may be needed. Integration with vector stores also matters - many GraphRAG implementations combine graph DB with vector DB, using graph traversal results to filter vector searches.
EXAMPLE:
GraphRAG with Neo4j: documents processed, entities and relationships stored in Neo4j with Cypher: CREATE (p:Company {name: 'Pfizer'})-[r:ACQUIRED]->(b:Company {name: 'Biohaven'}). Query: MATCH (p:Company {name: 'Pfizer'})-[r:ACQUIRED]->(acquired) RETURN acquired. Results used to fetch associated documents from vector DB. Neo4j's graph algorithms can also compute community detection for global search. This combination provides both graph traversal and vector similarity.
QUESTION 10
How do you handle entity resolution and disambiguation in knowledge graphs?
DEFINITION:
Entity resolution (also called deduplication or disambiguation) is the process of determining when different mentions refer to the same real-world entity and merging them into a single node. This is critical for knowledge graphs because without it, the graph becomes fragmented with duplicate nodes, breaking relationships and causing retrieval failures.
HOW IT WORKS:
Entity resolution approaches: 1) Rule-based - match on exact name, normalized forms ('Apple Inc.' vs 'Apple'). 2) Similarity-based - use string similarity (Jaccard, Levenshtein) on names. 3) Embedding-based - compute similarity of entity contexts. 4) Graph-based - if entities share many neighbors, likely same. 5) ML models - trained to predict coreference. Process: generate candidate pairs, score similarity, apply threshold, merge high-confidence matches. For ambiguous names ('Apple' the fruit vs 'Apple' the company), context (neighboring entities) helps disambiguate. Resolution can be done at ingestion or as separate step.
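Steps 1-2 above (normalized forms plus string similarity) can be sketched as follows; the suffix list, trigram similarity, and threshold are illustrative choices, not a production resolver:

```python
# Normalize corporate suffixes, then score remaining name pairs with
# character-trigram Jaccard similarity; merge above a threshold.

SUFFIXES = (" inc.", " inc", " corporation", " corp.", " corp")

def normalize(name):
    n = name.lower().strip()
    for s in SUFFIXES:
        if n.endswith(s):
            n = n[: -len(s)]
    return n

def trigrams(s):
    # Fall back to the whole string for very short names.
    return {s[i:i + 3] for i in range(len(s) - 2)} or {s}

def jaccard(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def same_entity(n1, n2, threshold=0.6):
    a, b = normalize(n1), normalize(n2)
    return a == b or jaccard(a, b) >= threshold
```

Note this purely lexical sketch would miss pairs like 'Microsoft'/'MSFT'; catching those requires the context- or graph-based signals listed above.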
WHY IT MATTERS:
Without resolution, graph is messy and unreliable. 'Apple Inc.' and 'Apple' as separate nodes means relationships to only one are incomplete. Queries for 'Apple acquisitions' miss half. 'Washington' as location vs person confused. Poor resolution leads to missed connections and wrong answers. For GraphRAG, entity resolution is as important as extraction - it ensures the graph accurately represents the world.
EXAMPLE:
Documents mention 'Microsoft', 'MSFT', 'Microsoft Corporation'. Resolution identifies these as same entity, merges into single node with all relationships attached. Now query 'Microsoft acquisitions' finds all acquisitions from any mention. Another: 'Amazon' as rainforest and 'Amazon' as company. Disambiguation uses context: if surrounded by 'rainforest', 'river', 'deforestation' - it's the forest; if 'CEO', 'stock', 'AWS' - it's the company. Correct resolution ensures right relationships attached to right entity. Without it, company Amazon's acquisitions would be wrongly linked to the forest.
QUESTION 11
What is the role of graph embeddings in graph-based retrieval?
DEFINITION:
Graph embeddings are vector representations of nodes in a graph that capture their structural position and neighborhood information. They enable similarity search on graph structure - finding nodes that are structurally similar (same role in different parts of graph) or have similar neighborhoods, complementing text-based embeddings.
HOW IT WORKS:
Graph embedding algorithms (Node2Vec, GraphSAGE, TransE, ComplEx) learn vector representations by: 1) Random walks - sample paths from a node, treat them as context, use word2vec-like training. 2) Graph neural networks - aggregate neighbor information through layers. 3) Knowledge graph embeddings - optimize to make (h,r,t) triples hold (head + relation ≈ tail). Resulting embeddings capture structural similarity: nodes with similar neighborhood patterns have similar vectors. For retrieval, they can: find nodes similar to query entities, find nodes that fill similar roles, or combine with text embeddings for hybrid entity retrieval.
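The random-walk step that Node2Vec-style methods begin with can be sketched as follows; the sampled walks would then feed a word2vec-style trainer (not shown), and the graph is a toy assumption:

```python
import random

def random_walks(graph, walk_length=5, walks_per_node=2, seed=0):
    """Sample fixed-length walks from each node of an adjacency-list
    graph; each walk is a 'sentence' for word2vec-style training."""
    rng = random.Random(seed)  # seeded for reproducibility
    walks = []
    for start in graph:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = graph.get(walk[-1], [])
                if not neighbors:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks
```

Node2Vec proper biases the neighbor choice with return/in-out parameters (p, q) to interpolate between breadth-first and depth-first exploration; this sketch uses uniform choice.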
WHY IT MATTERS:
Graph embeddings enable 'structural' similarity search. Two companies in different industries but both are market leaders might have similar graph roles (many subsidiaries, many products). Text embeddings wouldn't capture this structural similarity. Graph embeddings also enable link prediction (finding missing relationships) and node classification. In GraphRAG, they can be used to expand retrieval beyond explicit relationships to structurally similar entities.
EXAMPLE:
In a corporate graph, Node2Vec embeddings might place 'Tesla' and 'SpaceX' close because both have Elon Musk as CEO, both in innovation-heavy sectors, similar acquisition patterns. Query about 'companies like Tesla' could use graph embedding similarity to find structurally similar companies, even if text descriptions differ. This captures analogy: 'Tesla is to automotive as SpaceX is to aerospace'. Text embeddings alone might miss this structural parallel.
QUESTION 12
What is a community summary in GraphRAG and how is it generated?
DEFINITION:
A community summary in GraphRAG is a concise textual description generated by an LLM that captures the key information, themes, and relationships within a detected community of entities. These summaries enable global understanding and efficient retrieval for broad, thematic questions without examining every document.
HOW IT WORKS:
Generation process: 1) After community detection, each community contains a set of related entities and their associated text chunks (documents where they appear). 2) For each community, collect all text chunks from entities in that community (may be many). 3) Use an LLM with a prompt like: 'Summarize the key information, relationships, and themes in these texts. Focus on what this community is about, its main entities, and how they relate.' 4) LLM generates a concise summary (paragraph or bullet points). 5) Summaries are stored, indexed, and used for global search: when a query is broad, find relevant communities by matching against summaries, then use those summaries as context.
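Steps 2-3 can be sketched as prompt assembly; `call_llm` is a hypothetical stand-in for whatever LLM client is in use:

```python
def build_summary_prompt(community_entities, entity_chunks, max_chunks=50):
    """Gather a community's text chunks and wrap them in the
    summarization instruction; cap chunks to bound context size."""
    chunks = []
    for entity in community_entities:
        chunks.extend(entity_chunks.get(entity, []))
    body = "\n---\n".join(chunks[:max_chunks])
    return ("Summarize the key information, relationships, and themes "
            "in these texts. Focus on what this community is about, "
            "its main entities, and how they relate.\n\n" + body)

def summarize_community(community_entities, entity_chunks, call_llm):
    # call_llm: hypothetical function taking a prompt, returning text.
    return call_llm(build_summary_prompt(community_entities, entity_chunks))
```

In practice large communities exceed the context window, so chunks are summarized in stages (map-reduce style) rather than truncated as in this sketch.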
WHY IT MATTERS:
Community summaries enable answering global questions that would otherwise require synthesizing thousands of documents. Instead of retrieving and processing all documents, GraphRAG uses summaries as a high-level representation. This is both efficient (one summary per community vs thousands of docs) and effective (summaries capture themes, not just facts). For large knowledge bases, community summaries provide a way to 'see the forest for the trees'.
EXAMPLE:
Community of entities related to 'renewable energy' includes solar, wind, Tesla, solar panels, government incentives, etc. Its summary: 'This community covers renewable energy technologies including solar and wind power, key companies like Tesla and First Solar, government policies promoting adoption, and research into efficiency improvements. Main themes are technology advancement, cost reduction, and policy support.' Global query 'What are the main trends in renewable energy?' matches this summary, which provides synthesized answer directly, without retrieving hundreds of documents. The summary captures the forest, not just trees.
QUESTION 13
How does GraphRAG handle multi-hop reasoning across entities?
DEFINITION:
Multi-hop reasoning in GraphRAG traverses multiple steps through the knowledge graph to connect entities and answer questions that require indirect relationships. For example, answering 'What drugs are developed by companies that Pfizer acquired?' requires two hops: Pfizer → acquired → companies → developed → drugs. GraphRAG enables this by following edges in the graph.
HOW IT WORKS:
Process: 1) Identify starting entities from query (e.g., 'Pfizer'). 2) Parse query to understand desired traversal pattern - may need to infer relationship types ('acquired', 'developed') or use generic traversal. 3) Execute graph traversal: from start nodes, follow outgoing edges of specified types for specified number of hops. 4) Collect target entities (drugs). 5) Retrieve documents associated with target entities. 6) Use as context for generation. Graph databases optimize such traversals with index-free adjacency - following edges is fast. For complex questions, multiple traversal paths may be combined.
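The traversal in steps 3-4 can be sketched with an adjacency list of typed edges; the graph data is a toy assumption mirroring the Pfizer example:

```python
# entity -> list of (relation, target); toy data for illustration
graph = {
    "Pfizer": [("acquired", "Wyeth"), ("acquired", "Warner-Lambert")],
    "Wyeth": [("developed", "Prevnar")],
    "Warner-Lambert": [("developed", "Lipitor")],
}

def traverse(start_entities, relation_path):
    """Follow each relation in relation_path one hop at a time,
    returning the set of entities reached at the end of the path."""
    current = set(start_entities)
    for relation in relation_path:
        nxt = set()
        for entity in current:
            for rel, target in graph.get(entity, []):
                if rel == relation:
                    nxt.add(target)
        current = nxt
    return current
```

The two-hop query then becomes `traverse(["Pfizer"], ["acquired", "developed"])`; the resulting entities are used to look up their associated documents.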
WHY IT MATTERS:
Many real-world questions require multi-hop reasoning. 'Find competitors of companies that supply to Apple' (3 hops). 'What papers cite research that was funded by organizations Gates Foundation funded?' (2 hops). Standard RAG cannot do this because relationships aren't explicit in text. GraphRAG makes multi-hop possible, dramatically expanding question-answering capabilities. This is especially valuable in domains with rich relationship structures: corporate ownership, citation networks, supply chains.
EXAMPLE:
Query: 'Find all products developed by startups that Google acquired.' Graph traversal: 1) Start at Google node. 2) Follow 'acquired' edges to acquired companies (YouTube, Android, DeepMind, etc.). 3) From each, follow 'developed' edges to products (YouTube platform, Android OS, AlphaFold). 4) Return list of products. Result includes products from all acquisitions, even if never mentioned together in any document. Multi-hop reasoning connects information across multiple documents through explicit relationships.
QUESTION 14
What are the indexing costs and challenges of building a knowledge graph at scale?
DEFINITION:
Building a knowledge graph at scale involves significant computational costs and engineering challenges: entity extraction from millions of documents, relation extraction, entity resolution, graph storage, and maintaining freshness. These costs can exceed those of vector-only RAG by orders of magnitude.
HOW IT WORKS:
Cost components: 1) Extraction - running NER and relation extraction on millions of documents requires significant compute (GPU-hours). For 10M documents, could be $50k-$200k in cloud costs. 2) Entity resolution - comparing millions of entities pairwise to deduplicate is O(n²) if naive, requires blocking and scalable algorithms. 3) Graph storage - graph DBs can be memory-intensive; billion-node graphs need distributed storage. 4) Index maintenance - as documents update, extraction must rerun; graph must be updated consistently. 5) Community detection - algorithms like Leiden can be expensive on large graphs. 6) Summary generation - generating community summaries with LLMs for thousands of communities adds cost.
WHY IT MATTERS:
GraphRAG's benefits come with real costs. For small to medium knowledge bases (<1M docs), costs manageable. For enterprise scale (10M-100M docs), costs can be prohibitive. Organizations must weigh benefits against infrastructure investment. Many start with vector RAG, add graph only for specific high-value use cases. Incremental approaches: build graph for subset of entities (e.g., key people, companies) rather than all.
EXAMPLE:
Building graph for 10M scientific papers. Extraction: 10M papers Ć 2000 tokens Ć $0.0001/token (LLM) = $2M - prohibitive. Instead, use cheaper NER models (spaCy) for entities ($10k), relation extraction with fine-tuned models ($20k), entity resolution ($5k), graph DB ($2k/month). Total ~$40k + ongoing costs. This is feasible for research institutions but still significant. For 100M documents, scale 10x. Costs must be justified by use cases requiring graph capabilities.
QUESTION 15
How do you keep a knowledge graph up to date as new documents are added?
DEFINITION:
Keeping a knowledge graph current as new documents arrive requires incremental update strategies that add new entities and relationships, resolve them against existing graph, and maintain consistency without rebuilding from scratch. This is challenging because new information may connect to, extend, or contradict existing graph.
HOW IT WORKS:
Approaches: 1) Incremental extraction - process new documents through same extraction pipeline, producing new entity and relationship triples. 2) Entity resolution - match new entities against existing graph; if match found, merge (add aliases, attach new relationships to existing node); if not, create new node. 3) Relationship addition - add new edges, potentially connecting new to existing entities. 4) Conflict resolution - if new information contradicts existing (e.g., different acquisition date), need strategy (trust newer, flag for review, keep both with versions). 5) Graph maintenance - community detection may need recomputation if graph structure changes significantly. 6) Summary updates - affected community summaries may need regeneration.
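Steps 1-3 can be sketched as a merge routine; the alias map stands in for a real entity resolver, and the triples are toy extraction output:

```python
def merge_triples(graph, aliases, new_triples):
    """Merge new (subject, relation, object) triples into an existing
    graph (node -> set of (relation, node)), resolving each name
    against the alias map before adding nodes or edges."""
    for subj, rel, obj in new_triples:
        s = aliases.setdefault(subj, subj)  # reuse node if known
        o = aliases.setdefault(obj, obj)    # else register a new one
        graph.setdefault(s, set()).add((rel, o))
        graph.setdefault(o, set())
    return graph

graph = {"Microsoft": {("acquired", "GitHub")}, "GitHub": set()}
aliases = {"Microsoft": "Microsoft", "MSFT": "Microsoft",
           "GitHub": "GitHub"}
merge_triples(graph, aliases,
              [("MSFT", "acquired", "Activision Blizzard")])
# The new edge attaches to the existing Microsoft node, not a duplicate.
```

Using a set of edges makes re-processing the same article idempotent; conflict handling and summary regeneration (steps 4-6) would layer on top of this.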
WHY IT MATTERS:
Knowledge without freshness is dangerous. Outdated acquisition information leads to wrong answers, and if the graph is not updated, users lose trust. Incremental updates are essential for production systems but complex. Many systems accept eventual consistency: new documents are processed in daily batches and the graph is updated nightly, trading some staleness for operational simplicity.
EXAMPLE:
News organization adds article: 'Microsoft acquires Activision Blizzard'. Incremental update: extract entities Microsoft (ORG), Activision Blizzard (ORG). Resolve: Microsoft exists in the graph, Activision Blizzard is new. Add Activision Blizzard node. Add relationship Microsoft → acquired → Activision Blizzard with date 2022. Now queries about Microsoft acquisitions include Activision. Later article: 'Microsoft's acquisition of Activision complete' adds details. Process similarly. Graph stays current without a full rebuild. Without incremental updates, a daily full rebuild would be needed - expensive and slow.
QUESTION 16
What is a property graph and how does it differ from an RDF graph?
DEFINITION:
Property graphs and RDF (Resource Description Framework) graphs are two different data models for representing graph-structured information. Property graphs (used by Neo4j, Amazon Neptune) store nodes with properties (key-value pairs) and typed relationships. RDF graphs (used in semantic web) store triples (subject-predicate-object) and are designed for interoperability and reasoning.
HOW IT WORKS:
Property graph: nodes have labels (e.g., 'Person', 'Company') and properties (name: 'Elon Musk', age: 52). Relationships have types (e.g., 'CEO_OF') and can also have properties (since: 2008). Queried with Cypher or Gremlin. RDF graph: stores triples (e.g., <ElonMusk> <ceoOf> <Tesla>). Everything is identified by URIs. Supports inference (if A ceoOf B and B type Company, infer A worksFor B). Queried with SPARQL. RDF is a W3C standard and promotes data sharing; property graphs are more performant for traversal.
WHY IT MATTERS:
The choice affects GraphRAG implementation. Property graphs are more intuitive for many developers and performant for traversals (index-free adjacency). RDF excels when integrating data from multiple sources and when reasoning is needed. For GraphRAG, property graphs are more common due to performance and ease of use with entity-rich data; RDF is used mainly in academic and semantic web contexts.
EXAMPLE:
Property graph representation: (:Person {name: 'Elon Musk'})-[:CEO_OF {since: 2008}]->(:Company {name: 'Tesla'}). Simple, intuitive. RDF representation: <http://ex/ElonMusk> <http://ex/ceoOf> <http://ex/Tesla> . <http://ex/Tesla> <http://ex/name> 'Tesla' . More verbose, globally unique URIs. For query 'Who are CEOs of companies?', property graph: MATCH (p:Person)-[:CEO_OF]->(c) RETURN p.name, c.name. RDF: SELECT ?person ?company WHERE { ?person :ceoOf ?company . ?person a :Person . ?company a :Company }. Both work; property graph often simpler for application developers.
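The modeling difference can be made concrete in plain Python. This is an illustrative sketch only: the `Node`/`Rel` classes, the `http://ex/` URIs, and the CEO example mirror the snippets above and are not any real database's API.

```python
from dataclasses import dataclass, field

# Property-graph model: labeled nodes and typed relationships, each
# carrying arbitrary key-value properties.
@dataclass
class Node:
    label: str
    props: dict = field(default_factory=dict)

@dataclass
class Rel:
    rtype: str
    start: Node
    end: Node
    props: dict = field(default_factory=dict)

musk = Node("Person", {"name": "Elon Musk"})
tesla = Node("Company", {"name": "Tesla"})
ceo = Rel("CEO_OF", musk, tesla, {"since": 2008})  # edge property is natural

# RDF model: everything is a (subject, predicate, object) triple of URIs
# or literals; a qualifier like 'since' needs reification or RDF-star.
triples = [
    ("http://ex/ElonMusk", "http://ex/ceoOf", "http://ex/Tesla"),
    ("http://ex/ElonMusk", "http://ex/name", "Elon Musk"),
    ("http://ex/Tesla", "http://ex/name", "Tesla"),
]

# 'Who is CEO of which company?' in each model:
pg_answer = [(ceo.start.props["name"], ceo.end.props["name"])]
rdf_answer = [(s, o) for s, p, o in triples if p == "http://ex/ceoOf"]
```

Note how the property-graph version attaches `since: 2008` directly to the edge, while the triple store has no natural slot for it; this is the practical trade-off the question highlights.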
QUESTION 17
How would you evaluate the quality of a knowledge graph built from documents?
DEFINITION:
Evaluating knowledge graph quality involves measuring accuracy of entities and relationships extracted, completeness of coverage, and utility for downstream tasks. Unlike retrieval metrics (recall/precision), graph evaluation requires assessing structural properties and semantic correctness.
HOW IT WORKS:
Evaluation dimensions: 1) Entity extraction accuracy - precision (fraction of extracted entities that are correct), recall (fraction of entities in the documents that were extracted). Sample documents and have humans verify the extracted entities. 2) Relation extraction accuracy - for sampled entity pairs, verify the extracted relations are correct. 3) Entity resolution quality - check that true duplicates are merged (recall) and that distinct entities are not incorrectly merged (precision). 4) Graph completeness - do key expected entities/relations appear? Compare to a domain ontology if available. 5) Connectivity - are entities properly connected? Measure average degree and isolated nodes. 6) Downstream task performance - does the graph improve RAG quality on relationship queries? Compare GraphRAG vs vector-only on test queries requiring relationships.
WHY IT MATTERS:
A poor-quality graph harms rather than helps RAG. Wrong entities cause wrong answers; missing relationships mean failed queries. Evaluation quantifies quality and guides improvement. For production, set quality thresholds (e.g., entity precision >0.95, relation precision >0.90) before deployment. Without evaluation, you don't know if the graph is trustworthy.
EXAMPLE:
Medical knowledge graph evaluation: sample 1000 sentences and extract entities. Human review shows entity precision 0.92 (8% wrong) and recall 0.85 (15% missed). Relation extraction on 500 entity pairs: precision 0.88, recall 0.80. Entity resolution: 95% of duplicates correctly merged. Downstream: GraphRAG answers relationship queries with 0.82 accuracy vs 0.65 for vector-only - the improvement shows the graph adds value despite imperfect extraction. This evaluation identifies the need to improve relation extraction (0.88 precision is borderline). Without it, you would deploy with unknown quality.
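The precision/recall arithmetic behind numbers like those above is simple set comparison against a human-verified gold standard. A minimal sketch, with invented entity names; real evaluation would aggregate over many sampled sentences.

```python
def precision_recall(extracted, gold):
    """Compare extracted items against a human-verified gold set."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)  # true positives
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# One sampled sentence: annotators found 4 entities; the pipeline also
# found 4, but with one false positive and one miss.
gold = ["aspirin", "ibuprofen", "COX-1", "COX-2"]
extracted = ["aspirin", "ibuprofen", "COX-1", "prostaglandin"]

p, r = precision_recall(extracted, gold)
print(f"entity precision={p:.2f}, recall={r:.2f}")  # 0.75 and 0.75
```

The same function serves for relation extraction by passing sets of (subject, relation, object) triples instead of entity strings.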
QUESTION 18
What types of queries benefit most from a graph-based retrieval approach?
DEFINITION:
Graph-based retrieval excels at queries involving explicit relationships between entities, multi-hop connections, structural patterns, and global themes. These queries are difficult or impossible for standard vector RAG because they require understanding connections that aren't explicitly stated in individual text chunks.
HOW IT WORKS:
Query types benefiting: 1) Relationship queries - 'What companies did Microsoft acquire?' (graph finds all 'acquired' edges). 2) Multi-hop queries - 'What drugs are developed by companies Pfizer acquired?' (2 hops). 3) Path queries - 'Find paths between Elon Musk and Twitter' (graph traversal finds Musk → investor → Twitter). 4) Pattern queries - 'Find companies with more than 10 acquisitions' (graph pattern matching). 5) Community queries - 'What are the main research areas in this institution?' (community detection). 6) Influence/citation chains - 'Papers that cite papers that cite this paper' (graph traversal). 7) Global synthesis - 'Summarize major themes in corporate sustainability reports' (community summaries).
WHY IT MATTERS:
Understanding which queries benefit helps decide whether GraphRAG is worth investment. If your users ask relationship questions, graph is essential. If they ask only factual questions ('what is X'), vector may suffice. Many real-world domains (finance, law, healthcare, research) have rich relationship structures, making GraphRAG valuable.
EXAMPLE:
Financial analyst query: 'Find all board members of companies that are major suppliers to Tesla' (multi-hop: Tesla → suppliers → companies → board_members). Vector search might find documents mentioning Tesla suppliers and board members separately but cannot connect them systematically; graph traversal finds the exact set. Another: 'Show the ownership structure of company X' (a graph query for hierarchical relationships) - vector search cannot express this at all. These query types justify GraphRAG investment.
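The Tesla → suppliers → board_members traversal above reduces to repeated one-hop expansion over an adjacency list. A minimal sketch with invented data; real systems would run this as a Cypher or Gremlin query over a graph database.

```python
from typing import Dict, List, Set, Tuple

# graph[node] = list of (relation, neighbor) edges
Graph = Dict[str, List[Tuple[str, str]]]

def hop(graph: Graph, nodes: Set[str], relation: str) -> Set[str]:
    """Follow every edge of the given relation type one step outward."""
    return {nbr for n in nodes for rel, nbr in graph.get(n, []) if rel == relation}

graph: Graph = {
    "Tesla": [("supplier", "Panasonic"), ("supplier", "CATL")],
    "Panasonic": [("board_member", "Alice")],
    "CATL": [("board_member", "Bob"), ("board_member", "Carol")],
}

# Two hops: Tesla -> suppliers -> board members.
suppliers = hop(graph, {"Tesla"}, "supplier")
board = hop(graph, suppliers, "board_member")
print(sorted(board))  # ['Alice', 'Bob', 'Carol']
```

Each additional hop is just another `hop` call, which is why multi-hop questions that defeat vector similarity are a single composed traversal here.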
QUESTION 19
How does GraphRAG handle conflicting information from multiple sources?
DEFINITION:
Handling conflicting information in GraphRAG is challenging because different sources may state contradictory facts (e.g., different acquisition dates, contradictory relationships). The graph must represent these conflicts or resolve them, and the RAG system must handle uncertainty in answers.
HOW IT WORKS:
Approaches: 1) Versioning - store multiple values with provenance (source, date); the graph allows multiple edges of the same relationship type. 2) Confidence scoring - attach confidence scores to extractions; use the highest-confidence value in answers. 3) Temporal modeling - treat relationships as time-bound (valid from-to); conflicts are resolved by recency. 4) Consensus - if multiple sources agree, use that value. 5) Explicit contradiction - represent the claims as separate facts; generation can note 'sources disagree' and present both. 6) Manual resolution - for critical conflicts, human review. At retrieval time, the system can return the conflicting information and let the LLM synthesize or acknowledge uncertainty.
WHY IT MATTERS:
Real-world data contains contradictions. Ignoring them leads to wrong answers. GraphRAG must handle this gracefully. The approach affects answer quality and trust. For high-stakes domains (medical, legal), explicit handling with provenance is essential. For general knowledge, consensus or recency may suffice.
EXAMPLE:
Two news articles: one says 'Microsoft acquired Activision in January 2024', another says 'Microsoft's acquisition of Activision closed October 2023'. The graph stores both claims with dates and sources. Query: 'When did Microsoft acquire Activision?' Graph traversal returns both dates with sources. Generation: 'Source A reports January 2024, while Source B reports the deal closed in October 2023; the announcement and closing may have occurred at different times.' This honest handling builds trust. Without it, the system would pick one date arbitrarily, potentially misleading the user.
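The versioning-with-provenance approach (option 1 above) can be sketched as parallel edges that are never overwritten, so a query surfaces every variant for the generation layer to reconcile. The fact store, `add_fact`, and `query` helpers are illustrative, not a real library API.

```python
edges = []  # each entry: (subject, relation, object, provenance dict)

def add_fact(subj, rel, obj, source, **attrs):
    """Append a new edge; conflicting facts coexist as parallel edges."""
    edges.append((subj, rel, obj, {"source": source, **attrs}))

def query(subj, rel):
    """Return every stored variant of the fact, with its provenance."""
    return [(o, meta) for s, r, o, meta in edges if s == subj and r == rel]

# Two sources disagree on the acquisition date; keep both, never overwrite.
add_fact("Microsoft", "acquired", "Activision", "Source A", date="2024-01")
add_fact("Microsoft", "acquired", "Activision", "Source B", date="2023-10")

for obj, meta in query("Microsoft", "acquired"):
    # The generation step sees both rows and can note that sources disagree.
    print(obj, meta)
```

Swapping the conflict policy (recency, confidence, consensus) then becomes a post-filter over `query` results rather than a destructive write at ingestion time.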
QUESTION 20
How would you pitch GraphRAG as a solution to a product or engineering team?
DEFINITION:
Pitching GraphRAG requires translating its technical capabilities into concrete business outcomes: answering previously impossible questions, improving accuracy on relationship queries, providing explainable results, and enabling new product features that leverage connections in data.
HOW IT WORKS:
Key pitch points: 1) New capabilities - 'GraphRAG lets users ask questions like "What companies has our CEO invested in?" that our current system can't answer.' 2) Improved accuracy - 'For relationship queries, GraphRAG improves accuracy by 30% over vector-only.' 3) Explainability - 'We can show the path of connections that led to each answer, building user trust.' 4) Competitive advantage - 'Competitors can't answer multi-hop questions; we will.' 5) Use cases - specific examples relevant to business: 'For sales, find connections to prospects; for research, trace influence networks; for support, find related issues.' 6) Implementation path - phased approach: start with key entities, prove value, expand.
WHY IT MATTERS:
Engineering and product teams need to understand why GraphRAG is worth investment. Technical features alone don't sell. Benefits must tie to user needs, product roadmap, and business metrics. A successful pitch shows how graph capabilities solve real user pain points and open new opportunities.
EXAMPLE:
Pitch to healthcare product team: 'Our users (researchers) often ask questions like "What drugs target the same pathways as Drug X?" or "Which clinical trials cited this paper?" Current vector search misses these connections. GraphRAG would model drugs, pathways, trials, citations as a graph. Demo shows: query "Find pathways affected by drugs developed by Pfizer" returns comprehensive answer with traceable paths. This would save researchers hours, increase platform stickiness, and differentiate us from competitors. Implementation: start with drug-target-pathway graph (6 weeks), prove value, expand to trials and papers. Cost: $50k development. Expected benefit: 20% increase in research queries, $200k annual value.' This resonates.