Traditional AI tools treat code as text. That’s the problem. They miss structural relationships and dependencies that separate successful modernisation from production failure.
This technical exploration is part of our comprehensive guide on AI for legacy modernisation, where understanding old code has emerged as the killer app for enterprise AI adoption.
LLMs forget variable definitions and function dependencies separated by thousands of lines. The “Lost in the Middle” problem means AI loses attention on information buried in long contexts.
Knowledge graph architecture transforms code from linear text into a queryable network of entities—functions, variables, classes—and relationships like calls, imports, and dependencies. Abstract Syntax Trees parse code structure, Neo4j stores relationships, and GraphRAG retrieves logically connected context beyond vector similarity.
The result? AI that follows transitive dependencies and accelerates reverse engineering from 6 weeks to 2 weeks. Thoughtworks proved this with CodeConcise in production.
Why Do Knowledge Graphs Matter for Legacy Code Understanding?
Code is fundamentally relational. Functions call functions, variables flow through execution paths, modules depend on modules. Text documents can’t represent these networks properly.
Graph structure enables deterministic traversal. If A calls B and B calls C, the graph stores that A depends on C. No guessing—just verifiable relationships.
This solves “Lost in the Middle” where LLMs lose focus in long contexts. Instead of stuffing random chunks into context, you retrieve the logically connected subgraph: the function plus dependencies plus data flow sources.
COBOL migrations illustrate this. A copybook variable defined 5,000 lines away from its usage gets missed by text search. Graph traversal finds it by following the data flow and control flow edges that connect definition to use.
The business impact? Missed variable dependencies crash ATM networks. Understanding existing code provides as much value as generating new code—a core thesis of AI-assisted legacy modernisation that challenges the current focus on code generation tools.
Graph queries guarantee finding all callers of a function. Text search misses synonyms and aliases across decades-old modules.
Thoughtworks cut reverse engineering from 6 weeks to 2 weeks for a 10,000-line module. For entire mainframe programmes, that’s 240 FTE years saved—proof points detailed in our analysis of cutting legacy reverse engineering time by 66% with AI code comprehension.
What Is an Abstract Syntax Tree and How Does It Transform Code Structure?
An Abstract Syntax Tree is a hierarchical representation of code’s grammatical structure. Created during compilation, it’s the intermediary between raw text and executable instructions.
An AST captures nesting (functions contain statements, statements contain expressions) plus types and structural relationships. This enables deterministic analysis without executing the program.
It treats code as data. Language-specific parsers extract the intrinsic structure. Where plain text sees characters in parentheses, the AST knows “this is a function call with three arguments of specific types”.
The process: source code goes to a lexer for tokenisation, then to a parser building the AST. The tree shows logical structure, stripping concrete syntax details like semicolons and whitespace.
This distinction lets you split code at logical boundaries. AST-based semantic chunking breaks at function boundaries, not arbitrary token limits that cut mid-statement.
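As a concrete illustration, here is a minimal sketch of AST-based chunking using Python’s standard ast module; Tree-sitter plays the same role for COBOL, PL/I, or Java, and the sample source in the snippet is hypothetical.

```python
import ast

def chunk_at_function_boundaries(source: str) -> list[dict]:
    """Split Python source into complete function-level chunks via the AST."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append({
                "name": node.name,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                # ast.get_source_segment returns the exact text of the node,
                # so every chunk is a logically complete unit, never a mid-statement cut.
                "code": ast.get_source_segment(source, node),
            })
    return chunks

if __name__ == "__main__":
    sample = "def add(a, b):\n    return a + b\n\ndef sub(a, b):\n    return a - b\n"
    for chunk in chunk_at_function_boundaries(sample):
        print(chunk["name"], chunk["start_line"], "-", chunk["end_line"])
```

Each chunk carries its own line range, which later becomes the citation metadata attached to retrieved context.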
CodeConcise parses code into AST forests stored in graph databases. Tree-sitter parsers support COBOL, PL/I, Java, Python, and JavaScript.
One project used AST parsing to analyse 650 tables, 1,200 stored procedures across 24 business domains, and 350 screens. Another tackled thousands of assembly functions in compiled DLLs with missing source code.
AST provides the nodes—functions, classes, variables. Graph databases add edges—semantic relationships showing how entities connect. These architecture patterns underpin the tools implementing knowledge graph principles differently across the vendor landscape.
How Are Code Relationships Captured in Graph Databases?
Graph databases store nodes—functions, classes, variables, files—and edges representing relationships like CALLS, IMPORTS, DEFINES, and USES.
Neo4j provides query languages like Cypher for traversing relationships. Finding all functions depending on a variable? Follow edges from the variable node to every function connected by USES edges.
Symbol resolution merges duplicate references into canonical nodes. Same function referenced in 10 files becomes one node with 10 edges. No duplication.
Transitive closure calculations expand relationships beyond direct connections. The graph computes and stores the implicit A→C dependency when A calls B and B calls C.
This enables multi-hop queries like “show all code paths from this API endpoint to database queries”, traversing the graph through service calls, business logic, and data access.
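A minimal sketch of such queries using the official Neo4j Python driver; the node labels, edge types, connection details, and example names are assumptions that mirror the schema described above rather than a fixed standard.

```python
from neo4j import GraphDatabase

# Connection details are placeholders.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Every function that reads or writes a given variable, via USES edges.
USES_QUERY = """
MATCH (f:Function)-[:USES]->(v:Variable {name: $var_name})
RETURN f.name AS function, f.file AS file
"""

# Transitive impact: everything reachable from an entry point through CALLS edges,
# up to five hops, covering the implicit A -> C dependency when A calls B and B calls C.
IMPACT_QUERY = """
MATCH path = (entry:Function {name: $entry_point})-[:CALLS*1..5]->(dep:Function)
RETURN DISTINCT dep.name AS dependency, length(path) AS hops
ORDER BY hops
"""

with driver.session() as session:
    for record in session.run(USES_QUERY, var_name="ACCOUNT-BALANCE"):
        print(record["function"], record["file"])
    for record in session.run(IMPACT_QUERY, entry_point="process_payment"):
        print(record["dependency"], record["hops"])

driver.close()
```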
Different codebases need different emphasis. Heavy inheritance requires more INHERITS_FROM edges versus COMPOSED_OF edges in composition-favouring codebases.
Edge types capture structural and behavioural relationships. CALLS for function invocations. IMPORTS for module dependencies. DEFINES for variable declarations. ACCESSES for state modification.
Graph traversal at a granular level reduces noise in the LLM context, showing which conditional branch in one file transfers control to code in another.
Persistent storage allows incremental updates. Parse only changed files, recompute affected subgraphs. Graph stays fresh without prohibitive costs.
What Role Does Retrieval-Augmented Generation Play in AI Code Comprehension?
RAG provides LLMs with relevant external context before generating responses. Retrieval finds relevant code, then generation has the LLM explain it.
This reduces hallucination by grounding answers in the actual codebase. RAG overcomes context window limits by sending 20 relevant functions instead of 500,000 lines.
Vector embeddings enable similarity search finding semantically similar code despite different variable names. Encoder models learn that “authentication”, “login”, and “verify credentials” are conceptually related.
Semantic chunking via AST ensures retrieved chunks are logically complete. You get entire functions, not mid-statement truncation. The index itself stays lightweight: it holds identifiers and loads the underlying code dynamically at runtime.
The maths: GPT-4’s 128,000 tokens equals roughly 30,000 lines. Enterprise codebases run 500,000 to 5 million lines. Retrieval becomes necessary.
Just-in-time strategies mirror human cognition. We create indexing systems like file systems and bookmarks, retrieving what we need when we need it.
RAG responses cite source file and line number for verification. You can check whether AI explanations match actual code behaviour.
Why external retrieval matters: LLMs train on public code, not your proprietary codebase. Without retrieval, the model doesn’t know your implementation details, naming conventions, or architectural decisions.
Progressive disclosure lets agents incrementally discover context. File sizes suggest complexity, naming conventions hint at purpose, timestamps proxy for relevance.
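A minimal sketch of that retrieve-then-generate loop; search_index and call_llm are hypothetical stand-ins for a vector store query and an LLM client.

```python
def answer_with_rag(question: str, search_index, call_llm, top_k: int = 20) -> str:
    """Retrieve relevant code chunks, then ask the LLM to explain them with citations."""
    # Retrieval: top-k semantically relevant chunks instead of the whole codebase.
    hits = search_index(question, top_k=top_k)

    # Each chunk carries its source file and line range so answers stay verifiable.
    context = "\n\n".join(
        f"# {h['file']}:{h['start_line']}-{h['end_line']}\n{h['code']}" for h in hits
    )

    prompt = (
        "Answer using only the code below. Cite file and line numbers.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    # Generation: the LLM explains code it was actually shown, reducing hallucination.
    return call_llm(prompt)
```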
How Does GraphRAG Differ from Vector-Based RAG for Code Understanding?
GraphRAG uses graph traversal—following relationships—not just vector similarity. Vector RAG finds textually similar code. GraphRAG finds logically connected code through dependencies, callers, and implementations.
GraphRAG follows chains where function A calls B and B uses variable C—connections vector search misses because similarity between A and C might be low despite real dependencies.
Query “how does authorisation work when viewing card details?” Vector search returns scattered functions with “auth”. GraphRAG starts with the entry point and follows CALLS edges retrieving the complete flow: endpoint function, called services, database queries, data validation.
This solves “Lost in the Middle”. Instead of scattered chunks creating incoherent jumble, graph retrieval assembles complete dependency subgraphs.
With behavioural and structural edges, you can include information from called methods, surrounding packages, and the data structures passed into the code.
Transitive closure finds all code affected by changing a variable definition. Forward traversal from DEFINES edges shows every usage. Reverse traversal of CALLS edges answers “if I change this function signature, what breaks?”
Hybrid approaches combine both: vector search for initial relevance, graph expansion for structural completeness.
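A sketch of that hybrid pattern; vector_search and the graph adjacency mapping are hypothetical stand-ins for a vector store and a graph-database-backed dependency graph.

```python
from collections import deque

def graphrag_retrieve(query: str, vector_search, graph: dict, max_hops: int = 2) -> list[str]:
    """Seed with vector similarity, then expand along dependency edges."""
    # Step 1: vector search finds textually relevant entry points.
    seeds = vector_search(query, top_k=5)  # e.g. ["view_card_details"]

    # Step 2: breadth-first expansion follows CALLS/USES edges so the retrieved
    # set is logically connected, not just textually similar.
    retrieved, queue = set(seeds), deque((s, 0) for s in seeds)
    while queue:
        node, hops = queue.popleft()
        if hops >= max_hops:
            continue
        for neighbour in graph.get(node, []):
            if neighbour not in retrieved:
                retrieved.add(neighbour)
                queue.append((neighbour, hops + 1))
    return sorted(retrieved)
```

The hop limit is the design lever: one hop captures direct callers and callees, two hops pulls in the supporting services and data access code around them.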
Vector search suffices for documentation search and finding examples. GraphRAG becomes necessary for dependency analysis, impact assessment, and understanding complete flows in legacy modernisation.
GraphRAG provides explainable retrieval paths. “I included function X because it’s called by Y” versus “I included X because it had high vector similarity”.
How Does CodeConcise Implement Knowledge Graph Principles?
CodeConcise is Thoughtworks’ internal modernisation accelerator, developed through three generations over 18 months to tackle legacy system challenges.
The tool combines an LLM with a knowledge graph derived from codebase ASTs. It extracts structure and dependencies, builds the graph in vector and graph databases, and integrates via MCP servers with systems like JIRA and Confluence.
Multi-pass enrichment breaks analysis into layers, with each pass navigating the graph and enriching a function's context using its parents or children. Pass 1 extracts function signatures. Pass 2 analyses logic and business rules. Pass 3 resolves data flows and dependencies.
The approach breaks artifacts into manageable chunks, extracts partial insights, and progressively builds context. This reduces hallucination risk.
Human-in-the-loop validation prevents unchecked assumptions. Engineers review AI-generated specifications against source code. Corrections feed back for iterative refinement.
Output is functional specifications describing what the system does—business logic as requirements—not implementation details. This matters for preserving business rules while changing technical implementation.
A proof of concept cut reverse engineering from 6 weeks to 2 weeks for a 10,000-line module. The accelerator was extended for COBOL/IDMS tech stacks.
One project narrowed from 4,000+ functions to 40+ through the multi-pass approach.
The principle: “Don’t try to recover the code—reconstruct the functional intent”. This matters when requirements are lost and comprehension takes months.
Preserving lineage creates audit trails. Track where inferred knowledge comes from—UI screen, schema field, binary function—preventing false assumptions.
Triangulation confirms every hypothesis across at least two independent sources. Call stack analysis, signature validation, and cross-checks against the UI layer all build confidence.
The multi-lens approach starts from visible artifacts like UI, databases, and logs. AI accelerates archaeology layer by layer but cannot replace domain understanding.
How Do Vector Search and Semantic Retrieval Find Relevant Code Context?
Vector embeddings convert code chunks into numerical representations capturing semantic meaning. Encoder models like CodeBERT train on millions of code examples to learn patterns.
Models learn that “authentication”, “login”, and “verify credentials” are semantically similar. This enables semantic search finding conceptually related code despite different names across files.
The process: function code goes through tokenisation, then through CodeBERT to produce a 768-dimensional vector. Vectors go into databases like Pinecone, Weaviate, or Qdrant for similarity search.
Cosine similarity measures the angle between vectors. Closer angle means more related content. Query “user authentication” retrieves “verify_credentials”, “check_login”, and “authenticate_user” without exact matches.
This handles synonyms, abbreviations, and different naming conventions. Systems built over decades have multiple names for similar operations. Semantic search cuts through variation.
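A minimal sketch of that pipeline using the public microsoft/codebert-base checkpoint from Hugging Face; the snippets being compared are hypothetical, and a production system would batch this work and push the vectors into a store such as Qdrant or Weaviate.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(code: str) -> torch.Tensor:
    """Encode a code chunk into a single 768-dimensional vector."""
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, tokens, 768)
    return hidden.mean(dim=1).squeeze(0)            # mean-pool the token vectors

query = embed("verify user credentials before granting access")
candidates = {
    "authenticate_user": embed("def authenticate_user(name, pwd): ..."),
    "render_invoice":    embed("def render_invoice(order): ..."),
}
for name, vec in candidates.items():
    score = torch.cosine_similarity(query, vec, dim=0).item()
    print(f"{name}: {score:.3f}")  # the authentication function should score higher
```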
Vector search in the graph leverages graph structure: after finding matches through vector similarity, the system traverses neighbouring nodes to access LLM-generated explanations.
Vector database indexing uses approximate nearest neighbour algorithms enabling fast search across millions of embeddings. ANN algorithms trade slight accuracy for speed.
AST-based splits ensure functions are embedded as complete units. Embedding complete functions produces more meaningful vectors than arbitrary token windows.
Each interaction yields context informing the next decision. File sizes suggest complexity. Naming conventions hint at purpose. Timestamps proxy for relevance.
How Do You Overcome LLM Context Window Limits with Knowledge Graphs?
LLM context windows of 32,000 to 128,000 tokens can’t fit enterprise codebases. 128,000 tokens equals roughly 30,000 lines. Enterprise applications run 500,000 to 5 million lines.
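The arithmetic behind that gap, as a back-of-envelope check (the tokens-per-line ratio is an assumption derived from the figures above):

```python
tokens_per_line = 128_000 / 30_000          # roughly 4.3 tokens per line, per the estimate above
codebase_lines = 2_000_000                  # a mid-sized enterprise system
codebase_tokens = codebase_lines * tokens_per_line
print(round(codebase_tokens / 128_000, 1))  # roughly 66.7 full context windows for a single pass
```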
The “Lost in the Middle” phenomenon shows LLMs lose attention on information mid-context. They remember start and end better, missing dependencies buried in thousands of intervening lines.
Knowledge graphs solve this through targeted retrieval. Send only relevant subgraphs, not entire codebases. Breadth-first search retrieves immediate dependencies. Depth-first search follows complete call chains.
Prioritise direct dependencies over distant ones, variable definitions over comments, called functions over siblings.
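A sketch of what that prioritised, budget-aware assembly might look like; the edge priorities, graph representation, and token counts are illustrative assumptions, not a prescribed algorithm.

```python
import heapq

# Lower number = higher priority, mirroring the ordering described above.
EDGE_PRIORITY = {"DEFINES": 0, "CALLS": 1, "IMPORTS": 2, "SIBLING": 3}

def assemble_context(start: str, edges: dict, token_counts: dict, budget: int) -> list[str]:
    """Greedily pull the highest-priority, nearest neighbours until the token budget is spent."""
    selected, used = [start], token_counts.get(start, 0)
    heap = [(EDGE_PRIORITY.get(etype, 9), 1, node) for etype, node in edges.get(start, [])]
    heapq.heapify(heap)
    seen = {start}
    while heap and used < budget:
        priority, hops, node = heapq.heappop(heap)
        if node in seen:
            continue
        seen.add(node)
        cost = token_counts.get(node, 0)
        if used + cost > budget:
            continue  # skip chunks that would overflow the attention budget
        selected.append(node)
        used += cost
        for etype, nxt in edges.get(node, []):
            # Distant, low-priority edges sink towards the bottom of the queue.
            heapq.heappush(heap, (EDGE_PRIORITY.get(etype, 9) + hops, hops + 1, nxt))
    return selected
```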
Context is a finite resource with diminishing returns. LLMs have an “attention budget”. Every token depletes this budget.
Attention scarcity stems from transformer architecture. Every token attends to every other token—n² pairwise relationships. As context grows, the model’s ability to capture relationships thins.
Good context engineering means finding the smallest set of high-signal tokens maximising desired outcomes.
Graph traversal reduces noise in the context, letting LLMs stay focused and use their limited space efficiently.
The deterministic process lets you analyse code independent of how it’s organised. Files might be structured for historical reasons not matching logical dependencies. Graph traversal follows actual relationships.
For API endpoint analysis, you need the endpoint function plus called services plus database queries plus validation logic. Assemble the logically connected subgraph, not random chunks.
Iterative dialogue enables refinement. The LLM asks “what does function X do?”; the graph retrieves X plus its callees. Progressive disclosure keeps context focused on current needs.
What Are Multi-Pass Enrichment Techniques for Building Code Context?
Multi-pass enrichment builds knowledge graphs in layers, progressively adding detail from structure to semantics. Each pass validates the previous layer, preventing error propagation.
Pass 1: Extract function signatures, class definitions, module boundaries from AST. No semantic interpretation—just grammatical structure. Fast because it’s pure parsing.
Pass 2: Resolve symbols across files, build call graph, identify data flow paths. Uses static analysis, not LLM inference. Map who calls whom, who imports what, which variables flow where.
Pass 3: Apply the LLM to analyse business logic within identified functions. Computationally expensive but runs only on targeted code after structural filtering.
Why multi-pass matters: analysing everything at once overwhelms the compute budget and produces inaccurate results.
Incremental validation happens after each pass. Pass 1: verify AST completeness. Pass 2: validate call graph has no dangling references. Pass 3: check LLM explanations align with code behaviour.
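A structural sketch of such a pipeline; the graph object and the parse, resolve, and explain functions are hypothetical placeholders for AST parsing, static analysis, and LLM calls.

```python
def build_graph_multipass(files, graph, parse_ast, resolve_symbols, explain_with_llm):
    """Three passes: structure first, then relationships, then targeted semantic enrichment."""
    # Pass 1: pure parsing, no interpretation.
    for path in files:
        graph.add_nodes(parse_ast(path))
    assert graph.all_files_parsed(files), "Pass 1: incomplete AST coverage"

    # Pass 2: static analysis builds CALLS/IMPORTS/USES edges across files.
    graph.add_edges(resolve_symbols(graph.nodes()))
    assert not graph.dangling_references(), "Pass 2: unresolved call targets"

    # Pass 3: run the LLM only on functions that survived structural filtering,
    # e.g. those reachable from the entry points under analysis.
    for fn in graph.reachable_functions():
        fn.explanation = explain_with_llm(fn.code, graph.neighbours(fn))
    return graph
```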
The comprehension pipeline traverses the graph using algorithms like depth-first search with backtracking, enriching with LLM-generated explanations at various depths.
One project narrowed down from 4,000+ functions to 40+ through staged filtering.
Automatically generated documentation is valuable for ongoing maintenance and knowledge transfer, not just modernisation.
FAQ Section
Can knowledge graphs handle dynamic languages like Python and JavaScript where types are implicit?
Yes, through gradual typing and runtime behaviour analysis. The graph stores observed types from test executions, type hints, and LLM inference. Less precise than statically typed languages, but it captures actual usage patterns.
Hybrid approaches combine static AST parsing with dynamic profiling. The static pass extracts structure and relationships it can verify. The dynamic pass runs test suites and observes actual type behaviour at runtime. Type hints in Python and TypeScript definitions get incorporated where available. For completely untyped code, the LLM infers types based on usage context—if a variable is passed to a function expecting a string, it’s probably a string.
How do you keep knowledge graphs synchronised when codebases change daily?
Incremental update strategies parse only changed files and recompute affected subgraphs. MCP (Model Context Protocol) enables real-time graph serving for continuous synchronisation.
Most teams update nightly or per-commit in CI/CD pipelines. The trade-off between freshness and computational cost determines frequency. Per-commit updates provide real-time accuracy but consume significant compute resources. Nightly updates batch the work when systems are idle. For active development, per-commit makes sense. For stable maintenance-mode systems, nightly suffices.
Git hooks trigger the update process. Changed files get parsed, new AST nodes created or updated, affected edges recomputed. If function X changes signature, the graph recomputes all CALLS edges pointing to X. Symbol resolution runs again on modified modules. The entire process completes in minutes for typical commits touching a handful of files.
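A minimal sketch of that per-commit flow, assuming a local git checkout; the graph object and the reparse and recompute helpers are hypothetical placeholders.

```python
import subprocess

def update_graph_for_commit(repo_path, graph, reparse_file, recompute_edges):
    """Re-parse only the files touched by the latest commit and refresh affected edges."""
    changed = subprocess.run(
        ["git", "-C", repo_path, "diff", "--name-only", "HEAD~1", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()

    for path in changed:
        graph.replace_nodes(reparse_file(path))   # new or updated AST nodes
    # Recompute only the edges whose endpoints live in the changed files,
    # e.g. CALLS edges pointing at a function whose signature changed.
    recompute_edges(graph, changed)
```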
What’s the difference between control flow graphs and call graphs in knowledge graph architectures?
Call graphs map function invocation relationships: who calls whom. Control flow graphs map execution paths within a function: branches, loops, jumps. Both are stored in the knowledge graph as different edge types. Call graphs serve dependency analysis; control flow graphs help untangle spaghetti code riddled with GOTOs in legacy systems.
How much storage do code knowledge graphs require compared to the source code?
Typically 2-10x the source code size, depending on relationship density. For example, 1GB of source code might require a 5-8GB graph database covering AST nodes, relationships, embeddings, and metadata.
Neo4j compression helps reduce overhead. The multiplier depends on code characteristics. Object-oriented codebases with deep inheritance hierarchies generate more edges than procedural code. Microservices with many inter-service calls create dense graphs. Legacy monoliths with minimal modularisation have sparser graphs.
The trade-off is worthwhile: storage is cheap, engineer time understanding code is expensive. The graph enables queries impossible with grep. “Find all code paths that modify this database table” takes seconds with graphs, days with manual investigation.
Can you build knowledge graphs for codebases with missing dependencies or incomplete source?
Yes, with limitations. The graph marks unresolved symbols as “external” or “unknown” nodes. Partial graphs are still valuable for analysing the available code. Heuristics infer missing types from usage context. Better than nothing, but completeness suffers. The ideal case includes full source plus third-party dependencies for a complete graph.
How does GraphRAG handle polymorphism and inheritance in object-oriented code?
The graph stores inheritance edges like EXTENDS and IMPLEMENTS, enabling traversal of type hierarchies. When retrieving callers of a virtual method, the graph follows the inheritance tree to find all overrides.
This gets complex fast. A call to vehicle.move() might invoke Car.move(), Truck.move(), or Motorcycle.move() depending on the runtime type. The graph stores all possibilities. When assembling context for the LLM, it includes all implementations the type system allows.
This is more complex than static calls because it requires a type resolution pass determining the possible types at each call site. GraphCodeBERT encoders understand OOP patterns, improving semantic search across hierarchies. The embeddings capture that Car.move() and Truck.move() are semantically related despite different implementations.
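A sketch of that hierarchy traversal as a Cypher query issued through the Neo4j Python driver; the labels and edge names are assumptions consistent with the schema used earlier in this article.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# All concrete implementations a vehicle.move() call could dispatch to:
# walk down the EXTENDS/IMPLEMENTS hierarchy, then into each subtype's override.
OVERRIDES_QUERY = """
MATCH (base:Class {name: $class_name})<-[:EXTENDS|IMPLEMENTS*1..]-(sub:Class)
MATCH (sub)-[:DEFINES]->(m:Method {name: $method_name})
RETURN sub.name AS subtype, m.file AS file, m.start_line AS line
"""

with driver.session() as session:
    for record in session.run(OVERRIDES_QUERY, class_name="Vehicle", method_name="move"):
        print(record["subtype"], record["file"], record["line"])

driver.close()
```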
What prevents knowledge graphs from becoming outdated as code evolves?
The comprehension pipeline stays extensible so it can extract the knowledge most valuable to users in their specific domain context. Continuous integration triggers graph updates on commits. Timestamp metadata tracks last modification dates. Staleness detection lets queries filter by recency. Human-in-the-loop validation flags outdated specifications. Perfect freshness is impossible, but acceptably recent suffices for most use cases.
How do you validate that AI-generated explanations from GraphRAG are accurate?
Grounding means every explanation cites source file and line number for verification. You can read the actual code the AI referenced and confirm the explanation matches reality.
Triangulate everything: never rely on single artifacts, confirm every hypothesis across at least two independent sources. If the AI says function X handles authentication based on its name, verify by checking what it actually calls, what data it accesses, and what the UI layer expects.
Preserve lineage by tracking where every piece of inferred knowledge comes from. The AI might infer business rules from database constraints, UI validation logic, and function implementations. Knowing which sources contributed to each conclusion helps you assess confidence.
Human review validates sections where errors would be costly. Test generation verifies specs match implementation—generate unit tests from the AI’s understanding and see if they pass. Multi-model consensus compares outputs from different LLMs for critical decisions.
Can knowledge graphs support real-time code analysis during development?
Emerging capability with tools demonstrating commit-level analysis. Requires highly optimised incremental updates and caching. MCP servers expose graphs to IDE extensions for live queries. Real-time is less relevant for legacy modernisation than batch analysis. More relevant for code review and documentation use cases. Performance improving as graph databases optimise.
What’s the learning curve for teams adopting knowledge graph approaches?
Tools abstract complexity: engineers query graphs through natural language, not Cypher. Graph concepts—nodes, edges, traversal—are intuitive to developers who already think in terms of function calls and dependencies.
The steeper curve is graph database administration: indexing strategies, query optimisation, managing graph size as codebases grow. But most teams consume graphs rather than build the infrastructure. CodeConcise and similar tools provide the interface: you ask questions in English and get answers with citations.
Most teams are productive within weeks for consumption, and within months for graph construction if they build their own infrastructure. Vendors provide managed services that eliminate the infrastructure burden: you point the tool at your codebase and it handles AST parsing, graph construction, embedding generation, and query optimisation.
How do knowledge graphs handle code comments and documentation?
Comments are stored as properties on function and class nodes, indexed for semantic search. Documentation embeddings link to the code entities they describe. Graphs can flag outdated comments by comparing documentation content to the implementation when function signatures change but comments don't. When the codebase includes documentation, it provides additional contextual knowledge that enables LLMs to generate higher-quality answers.
Are there open-source alternatives to Neo4j for building code knowledge graphs?
Yes. Memgraph for graph databases, Apache AGE as a PostgreSQL graph extension, JanusGraph for distributed graphs. Vector databases include Qdrant, Weaviate, and Chroma for embeddings. Graph construction tools include Tree-sitter for parsing. MCP has a growing ecosystem of servers for various systems, including AWS services and Atlassian products. The full stack requires assembly, whereas commercial tools provide an integrated experience.
Understanding Knowledge Graphs in the Broader AI Modernisation Context
The knowledge graph architecture principles explored here form the technical foundation for AI-assisted legacy modernisation. They enable AI to understand code structure and relationships rather than treating codebases as unstructured text, delivering the precision that separates successful modernisation programmes from failed experiments.
For teams evaluating which tools best implement these principles, our comparison of code comprehension vs code generation tools provides vendor landscape analysis and build-versus-buy frameworks. The technical capabilities outlined in this article translate directly to the 66% reduction in reverse engineering timelines achieved through GraphRAG approaches that maintain context while respecting LLM attention budgets.