Legacy modernisation programmes don’t stall because your team can’t write code. They stall because nobody knows what the old code actually does. You can rewrite systems all day, but if you don’t know what business logic is buried in those 30-year-old COBOL modules, you’re building on quicksand.
This is the core insight driving AI-assisted legacy modernisation: understanding old code is more valuable than writing new code.
Reverse engineering legacy code typically takes six weeks per module: four weeks of developer effort plus wait time for scarce expert reviews. Multiply that by hundreds or thousands of modules and you’re looking at multi-year delays before coding even begins.
Thoughtworks compressed that timeline from six weeks to two using their CodeConcise tool. That’s a two-thirds reduction. The case study covered 10,000-line COBOL/IDMS modules in a mainframe programme with around 1,500 modules. Scale that and you’re looking at 240 FTE-years in potential savings.
The methodology is multi-pass enrichment using knowledge graphs from abstract syntax trees. The AI understands code relationships, call chains, and data flows—not just text. It can infer business rules from decades-old undocumented systems.
What follows lays out the proof: Thoughtworks’ case study, the methodology covering multi-pass enrichment and binary archaeology, validation approaches, and evaluation criteria for your own modernisation efforts.
Why Does Reverse Engineering Create Modernisation Bottlenecks?
Before you modernise, you need to understand what the system does. That means extracting functional specifications and business rules from existing code—reverse engineering.
The problem? Documentation is missing or hopelessly out of sync, and the people who wrote the code left years ago. Your team stares at COBOL trying to trace business logic through nested IF-THEN statements, copybooks, and database schemas without a map.
Traditional reverse engineering relies on subject matter experts who understand both the legacy codebase and business rules. These experts are retiring. The pool of COBOL developers is shrinking fast, with estimates putting the shortfall close to 100,000 workers.
Six weeks per module times hundreds of modules equals multi-year delays. Legacy systems drain up to 80% of IT budgets on maintenance.
This is why understanding old code is the new killer app for enterprise AI—code comprehension removes the primary blocker to modernisation at scale. Your scarce experts spend months on manual analysis when they could validate AI-generated work in days.
You can’t modernise what you don’t understand. Functional specifications are the prerequisite for any migration approach. The reverse engineering bottleneck cascades into testing, validation, and cutover, multiplying costs throughout the programme.
How Did Thoughtworks Reduce COBOL Reverse Engineering from 6 Weeks to 2 Weeks?
CodeConcise treats code as data using language-specific parsers to extract structure and map relationships. Instead of feeding raw code into an LLM and hoping, the system creates a deterministic foundation first.
The ingestion pipeline parses COBOL/IDMS source into Abstract Syntax Trees. Each node represents a code element—functions, procedures, variables, control flow. Edges capture relationships between them.
These ASTs go into Neo4j, a graph database with vector search for GraphRAG retrieval. The knowledge graph lets the AI fetch only relevant code relationships for each analysis task.
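To make the ingestion step concrete, here’s a minimal sketch in Python, assuming a parser has already produced node and edge dictionaries (CodeConcise’s actual parser and graph schema aren’t public):

```python
# Minimal ingestion sketch: load parsed AST nodes and edges into Neo4j.
# Assumes a parser has already produced node/edge dictionaries;
# CodeConcise's actual parser and graph schema are not public.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def load_module(module: str, nodes: list[dict], edges: list[dict]) -> None:
    with driver.session() as session:
        for n in nodes:   # e.g. {"id": "PAY-100", "kind": "PARAGRAPH", "source": "..."}
            session.run(
                "MERGE (c:CodeElement {id: $id}) "
                "SET c.kind = $kind, c.module = $module, c.source = $source",
                id=n["id"], kind=n["kind"], module=module, source=n["source"],
            )
        for e in edges:   # e.g. {"from": "PAY-100", "to": "PAY-200", "rel": "CALLS"}
            session.run(
                "MATCH (a:CodeElement {id: $src}), (b:CodeElement {id: $dst}) "
                "MERGE (a)-[:RELATES {type: $rel}]->(b)",
                src=e["from"], dst=e["to"], rel=e["rel"],
            )
```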
Then comes comprehension. Algorithms traverse the graph, enriching it with LLM explanations. The AI walks call chains, adds implementation details, maps dependencies, infers business rules from patterns.
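A comprehension pass might then look like this sketch, continuing the illustrative schema above; `llm_explain` stands in for whichever LLM client you use:

```python
# Comprehension sketch: for each element, retrieve its graph neighbourhood,
# ask the LLM to explain it in that context, and write the explanation back
# onto the node. llm_explain() is a stand-in for any LLM client.
def fetch_neighbourhood(session, element_id):
    result = session.run(
        "MATCH (n:CodeElement {id: $id})-[:RELATES*1..2]-(m:CodeElement) "
        "RETURN DISTINCT m.id AS id, m.kind AS kind, m.source AS source",
        id=element_id,
    )
    return [record.data() for record in result]

def enrichment_pass(session, llm_explain):
    elements = list(session.run(
        "MATCH (n:CodeElement) RETURN n.id AS id, n.source AS source"))
    for record in elements:
        context = fetch_neighbourhood(session, record["id"])
        explanation = llm_explain(record["source"], context)
        session.run(
            "MATCH (n:CodeElement {id: $id}) SET n.explanation = $text",
            id=record["id"], text=explanation,
        )
```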
Martin Fowler and Birgitta Boeckeler co-authored the case study documenting how Thoughtworks extended CodeConcise for COBOL/IDMS. The system analysed 10,000-line modules—sprawling business logic that took human experts weeks to reverse engineer.
The result? Two weeks instead of six. Four weeks saved per module. Human experts shifted to validation, freeing them to oversee multiple parallel AI workstreams instead of serial manual analysis.
The economics work even when AI accuracy isn’t perfect. With proper validation, you’re still faster than manual analysis and make better use of scarce COBOL expertise.
How Do You Extract Business Rules from Undocumented Legacy Code?
Business rules are the logic governing system behaviour: credit approval thresholds, tax calculations, eligibility criteria. In legacy systems, these rules are scattered across modules, embedded in control flow, implicit in data structures. Nobody documented them because “everyone knew” back in 1995.
Multi-pass enrichment builds understanding incrementally. Pass 1 identifies functions and procedures—straightforward structural analysis. Pass 2 adds implementation details for each function. Pass 3 maps dependencies and call chains. Pass 4 infers business rules from patterns.
Each pass keeps the AI task focused. Asking an LLM to “explain this entire system” produces generic summaries and hallucinated details. Breaking the work into passes means an error in Pass 1 can be caught before it propagates into Pass 4.
Static analysis provides code structure without executing the system. AST parsing reveals control flow, data flow, module relationships. You get dependency diagrams and flowcharts—deterministic outputs that don’t hallucinate.
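COBOL-grade parsers are specialist tools, but the principle is easy to see with Python’s built-in `ast` module: parse the source, walk the tree, and the structure falls out deterministically.

```python
# Deterministic static analysis with Python's built-in ast module:
# extract function signatures and branch conditions without running anything.
import ast

source = """
def approve_credit(score, amount):
    if score < 620:
        return "MANUAL_REVIEW"
    return "APPROVED"
"""

tree = ast.parse(source)
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        args = ", ".join(a.arg for a in node.args.args)
        print(f"function {node.name}({args}) at line {node.lineno}")
    elif isinstance(node, ast.If):
        print(f"branch at line {node.lineno}: if {ast.unparse(node.test)}")
```

The same walk over a COBOL AST yields paragraphs, PERFORM chains, and copybook references instead.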
Dynamic analysis adds runtime context—logs, database change data capture showing how UI actions map to database activity, actual behaviour patterns.
Thoughtworks calls this triangulation—confirming hypotheses across multiple sources. From the COBOL alone, the AI might infer that premium customers bypass credit checks for orders under $5,000. You validate by checking the UI (is there a premium flag?), the database (do those orders skip the credit_check table?), and the stored procedures (is there explicit bypass logic?).
When sources align, confidence is high. When they don’t, you’ve caught a misinterpretation before it hits forward engineering.
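In code, triangulation can be as simple as requiring agreement from independent evidence sources before trusting a hypothesis. An illustrative sketch, not the case study’s actual tooling:

```python
# Illustrative triangulation check: trust a hypothesis only when at least
# two independent evidence sources agree. Source names are examples.
def triangulate(hypothesis: str, evidence: dict[str, bool]) -> str:
    agreeing = [src for src, ok in evidence.items() if ok]
    if len(agreeing) >= 2:
        return f"HIGH confidence: {hypothesis} ({', '.join(agreeing)})"
    return f"LOW confidence: {hypothesis}, flag for SME review"

print(triangulate(
    "premium customers bypass credit checks under $5,000",
    {"cobol_analysis": True, "ui_premium_flag": True, "db_change_capture": False},
))
```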
The output converts technical code patterns into business-readable rules: “If customer credit score is less than 620, then require manual approval.” These functional specifications feed forward engineering without months of detective work.
What Is Multi-Pass Enrichment and How Does It Reduce AI Hallucination?
AI-assisted reverse engineering risks hallucination—the AI infers business rules that sound plausible but are wrong. Undetected, these propagate into modernised systems, creating bugs that take months to discover.
Multi-pass enrichment reduces this by processing codebases in targeted passes, preventing the context overload that triggers hallucinations. Each pass adds understanding, and errors caught early don’t compound later.
The progression is structural to behavioural to semantic. Pass 1: “list all functions and parameters”—straightforward and verifiable. Pass 2: “explain what each function does”—builds on Pass 1. Pass 3: “map dependencies and call chains”—uses earlier context. Pass 4: “infer business rules from patterns”—highest abstraction, constrained by earlier passes.
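Here’s a minimal sketch of that progression, with `llm` standing in for any completion client and prompts that are illustrative rather than CodeConcise’s actual ones:

```python
# Sketch of the four-pass progression. Each pass gets a narrow task plus
# only the outputs of earlier passes; the LLM is never asked to explain
# the whole system in one shot. llm() stands in for any completion client.
PASSES = [
    ("structure",    "List every function/paragraph and its parameters."),
    ("behaviour",    "Explain what each listed function does."),
    ("dependencies", "Map the call chains and data dependencies between them."),
    ("rules",        "Infer candidate business rules, citing the code implying each."),
]

def run_passes(llm, code_context: str) -> dict[str, str]:
    results: dict[str, str] = {}
    established = "(none yet)"
    for name, task in PASSES:
        prompt = (f"{task}\n\nCode:\n{code_context}\n\n"
                  f"Facts established by earlier passes:\n{established}")
        results[name] = llm(prompt)
        # Audit trail: each pass's output becomes grounding for the next.
        established = "\n".join(f"[{k}] {v}" for k, v in results.items())
    return results
```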
Context poisoning is another risk. If you tell the LLM what you’re looking for, it colours the output toward your expectations rather than what the code actually does. The team created a clean room for the AI, providing only deterministic code structure as input.
Each pass output feeds the next, creating an audit trail. When validation catches an error in Pass 3, you trace back to see what the AI misunderstood, provide corrected context, re-run targeted analysis without discarding earlier work.
Lineage preservation links every inferred specification to code locations. When a business rule looks questionable, the expert jumps to source code that generated it to verify.
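A lineage record can be as simple as a structured link from each rule back to its source. The field names below are assumptions, not the tool’s actual schema:

```python
# Illustrative lineage record: every inferred rule carries a pointer back
# to the code that produced it, so a sceptical expert can jump straight
# to source. Field names are assumptions, not the tool's actual schema.
from dataclasses import dataclass

@dataclass
class BusinessRule:
    rule_id: str
    statement: str                  # business-readable form of the rule
    source_module: str              # e.g. "CREDAPPR.cbl"
    source_lines: tuple[int, int]   # exact span that implies the rule
    inferred_in_pass: str           # which enrichment pass produced it
    status: str = "pending"         # pending | confirmed | rejected

rule = BusinessRule(
    rule_id="BR-0042",
    statement="If customer credit score is less than 620, require manual approval.",
    source_module="CREDAPPR.cbl",
    source_lines=(310, 327),
    inferred_in_pass="rules",
)
```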
Multi-pass becomes crucial for binary archaeology—reverse engineering compiled binaries when source is unavailable. You’re working from assembly or pseudocode, so misinterpretation risk is higher. Focused passes prevent errors cascading through abstraction levels.
How Does Binary Archaeology Work When Source Code Is Unavailable?
Sometimes you don’t have source. It’s been lost, locked in proprietary formats, or exists only as binaries from the Windows XP era.
Thoughtworks faced this with compiled DLLs where source was gone. The system had 650 tables, 1,200 stored procedures, 350 user screens, 45 compiled DLLs.
Tools like Ghidra decompile binaries to assembly or pseudocode. Each DLL had thousands of functions—you need to narrow the search to the relevant ones.
The approach: identify entry points by examining constants and strings in the DLL. Error messages, database table names, UI labels provide clues. Then walk up the call tree from leaf to parent functions.
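Ghidra is scriptable, so this search can be automated. Here’s a sketch for Ghidra’s Jython scripting environment; the calls are from Ghidra’s flat API, but verify them against your Ghidra version:

```python
# Ghidra script sketch (Jython, run from Ghidra's Script Manager):
# find functions referencing strings that mention a domain keyword, then
# walk one level up the call tree. Verify API calls against your version.
KEYWORD = "PRICE"   # domain clue: error messages, table names, UI labels

listing = currentProgram.getListing()
fm = currentProgram.getFunctionManager()

candidates = set()
for data in listing.getDefinedData(True):
    if data.hasStringValue() and KEYWORD in str(data.getValue()).upper():
        for ref in getReferencesTo(data.getAddress()):
            fn = fm.getFunctionContaining(ref.getFromAddress())
            if fn is not None:
                candidates.add(fn)

for fn in candidates:
    callers = [c.getName() for c in fn.getCallingFunctions(monitor)]
    print("%s <- called by %s" % (fn.getName(), callers))
```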
The team narrowed thousands of functions to manageable subsets. Multi-pass enrichment builds understanding incrementally, and AI pattern recognition spots recurring idioms in the assembly to infer intent.
The multi-lens approach combines UI reconstruction, change data capture, and binary analysis. Browse the live application for UI elements. Trace UI actions to database activity. Build hypotheses from data modification patterns.
Cross-validation confirms you’re right. If binary analysis suggests a function handles pricing, UI should show price calculation screens, and database change data capture should show pricing table writes.
Binary archaeology produces functional understanding but may miss developer intent. Better than nothing when source is unavailable, but accuracy is lower.
How Do You Validate AI-Generated Functional Specifications?
AI accelerates analysis but can’t replace domain expertise. Even with multi-pass enrichment, human-in-the-loop validation is required.
SMEs review specifications section-by-section, marking items accurate, inaccurate, or incomplete. Your COBOL expert understands code structure, but you need someone who knows the business domain to confirm inferred rules match processes.
Sampling depends on risk. Apply 100% validation for financial calculations or regulatory compliance. Use statistical sampling for routine CRUD operations.
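A risk-tiered sampling policy is easy to encode. The categories and 20% rate below are assumptions for illustration, not figures from the case study:

```python
# Illustrative risk-tiered sampling: 100% review for high-stakes rules,
# statistical sampling for the rest. Categories and the 20% rate are
# assumptions for this sketch.
import random

CRITICAL = ("financial", "regulatory")

def select_for_review(specs: list[dict], sample_rate: float = 0.2) -> list[dict]:
    critical = [s for s in specs if s["category"] in CRITICAL]
    routine = [s for s in specs if s["category"] not in CRITICAL]
    k = min(len(routine), max(1, round(len(routine) * sample_rate)))
    sampled = random.sample(routine, k) if routine else []
    return critical + sampled   # all critical specs, ~20% of the rest
```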
Cross-check AI inferences against code, documentation, database schemas, UI behaviour. Every specification links to source code for spot-checking.
Confirm hypotheses across two sources minimum. Keep humans in the loop. Pair AI with expert validation, especially for business rules.
Expect 85-95% accuracy, varying by complexity: straightforward CRUD hits 95%, while complex logic with edge cases may reach only 80-85%. Even at 85%, the economics favour AI approaches.
Experts shift from weeks of manual analysis to days of validation. This reduces SME dependency significantly, freeing scarce COBOL experts. One expert oversees multiple parallel AI workstreams instead of serial manual analysis.
When validation catches errors, you don’t start over. Inaccurate specifications trigger targeted re-analysis. Multi-pass structure lets you rerun just the affected pass.
What Does 240 FTE-Year Savings Actually Mean for a Modernisation Programme?
The calculation: four weeks saved per module times 1,500 modules equals 6,000 weeks. That’s 115 FTE-years in reverse engineering alone. The 240 FTE-year figure includes downstream benefits.
Better specifications reduce test cycles and rework. At £100,000 per consultant FTE-year, that’s £24 million in potential savings. For the complete framework on translating time savings to ROI, see our ROI playbook.
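Here’s the arithmetic as a worked example, using the case study’s figures:

```python
# Worked savings arithmetic from the case study figures.
modules = 1_500
weeks_saved_per_module = 4          # six weeks manual -> two weeks AI-assisted
weeks_per_fte_year = 52

weeks_saved = modules * weeks_saved_per_module               # 6,000 weeks
reverse_eng_fte_years = weeks_saved / weeks_per_fte_year     # ~115 FTE-years

total_fte_years = 240               # including downstream test/rework benefits
cost_per_fte_year_gbp = 100_000
potential_saving = total_fte_years * cost_per_fte_year_gbp   # £24,000,000

print(f"{reverse_eng_fte_years:.0f} FTE-years; £{potential_saving:,} potential saving")
```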
Timeline compression matters. Programmes finishing in two to three years instead of five accelerate business value. Some reach self-funding when reverse engineering savings cover forward engineering.
Resource reallocation is another benefit. SME time shifts from months of analysis to days of validation. Three experts validating AI output across nine modules simultaneously instead of analysing three serially. Programme velocity increases without hiring.
The 240 FTE-year figure applies to large programmes with 1,000+ modules. Smaller programmes see proportional savings—150 modules means 24 FTE-years. Still significant.
Better specifications reduce modernisation failures—discovering you missed business rules breaking workflows. Clear specifications prevent expensive production issues.
When Should You Use AI-Assisted Reverse Engineering vs Traditional Methods?
Best fit scenarios: large codebases over 100,000 lines, multiple modules with similar patterns, scarce SME availability, compressed timelines.
Traditional methods have their place. Small codebases under 10,000 lines with available experts may not justify tooling investment.
Code complexity matters. Spaghetti code with deep call chains benefits most from graph-based AI traversal, while well-structured modular code maximises the quality of AI output.
Language support is a constraint. COBOL, Java, Python, C have mature AST parsers. CodeConcise supports COBOL, Python, and C. Obscure proprietary languages may lack tooling.
No documentation means AI provides high value. Extensive current documentation means traditional methods may suffice.
ROI calculation: modules times four weeks saved times consultant rate, minus tooling, setup, and validation costs. For 20+ modules, break-even is typical. For 50+ modules, ROI is compelling.
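As a sketch with assumed costs (the rates below are placeholders, not quoted prices), the break-even point is easy to test:

```python
# Break-even sketch with assumed costs; substitute your own rates.
def roi_gbp(modules, weekly_rate=2_500, weeks_saved=4,
            tooling_and_setup=150_000, validation_per_module=2_000):
    gross = modules * weeks_saved * weekly_rate
    return gross - tooling_and_setup - modules * validation_per_module

for n in (10, 20, 50):
    print(f"{n} modules -> £{roi_gbp(n):,}")
```

With these assumptions, 10 modules run at a loss, 20 roughly break even, and 50 are comfortably positive, matching the rule of thumb above.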
Pilot with 3-5 representative modules to validate accuracy and time savings before committing.
Hybrid approaches work. Use AI for initial analysis, then apply traditional methods for complex edge cases. You get acceleration while managing risk.
CodeConcise requires Thoughtworks consulting. GitHub Copilot is commercially licensed—accessible without consulting budget.
The success pattern: analysis and rewriting work on legacy systems that previously took months now completes in weeks. That acceleration changes programme economics.
FAQ Section
Can AI reverse engineering work for languages other than COBOL?
Yes. The knowledge graph approach applies to any language with AST parsers. CodeConcise supports COBOL, Python, and C. GitHub Copilot works with Java, C#, and JavaScript. The methodology—AST to knowledge graph to LLM enrichment—is language-agnostic. Only the parser differs. Adding Python support took half a day versus the typical two to four weeks. Obscure proprietary languages may lack mature parsers.
How accurate are AI-generated functional specifications?
Thoughtworks reported high confidence due to multiple levels of cross-checking. With multi-pass enrichment and human validation, expect 85-95% accuracy. Varies by complexity—straightforward CRUD hits the higher end, complex business logic sits lower. Financial calculations or regulatory compliance need 100% SME validation. Even at the lower end, time savings versus manual analysis remain substantial. Pair AI speed with human validation.
Do I still need COBOL experts if using AI reverse engineering?
Yes, but their role shifts. Experts move to validation instead of full-time reverse engineering. This leverage lets one expert oversee multiple parallel AI workstreams. Programme velocity improves, and scarce COBOL expertise gets used better. With COBOL developer shortfall approaching 100,000 workers, using AI to reduce expert time from weeks to days makes previously impractical programmes viable.
What’s the difference between CodeConcise and GitHub Copilot for reverse engineering?
CodeConcise is Thoughtworks’ internal modernisation accelerator for legacy code comprehension using knowledge graphs. Provides structured functional specifications through deterministic AST parsing. Requires consulting engagement—you work with Thoughtworks experts.
GitHub Copilot is Microsoft’s general-purpose AI coding assistant with emerging agentic capabilities. Offers interactive code exploration rather than formal specifications. Commercially licensed—accessible without consulting but without the structured methodology.
Can binary archaeology really extract business rules from compiled code?
Yes, with limitations. Thoughtworks used a multi-lens approach to extract specifications from compiled DLLs when source was unavailable. Ghidra decompiles binaries to assembly or pseudocode, then AI identifies patterns. The team narrowed thousands of functions to manageable subsets.
Accuracy is lower than source code analysis. Developer intent may be unclear, variable naming context is lost, and validation is harder. It’s best when source is genuinely unavailable. The multi-lens approach—UI reconstruction, change data capture, binary analysis—provides cross-validation that improves confidence.
How long does it take to set up AI-assisted reverse engineering for a project?
Initial setup: 2-4 weeks for tool configuration, AST parser setup, knowledge graph schema, validation workflow. Pilot phase: 4-6 weeks analysing 3-5 modules to validate accuracy and refine methodology. Full deployment: 1-2 weeks training teams. Total: 2-3 months from decision to scaled deployment.
Investment pays back after 10-15 modules—4 weeks saved per module recovers setup time quickly. For 50+ modules, setup becomes negligible versus total savings. Thoughtworks’ pilot-first approach manages risk while building confidence.
What’s the biggest risk in AI-assisted reverse engineering?
Hallucination propagation—the AI inferring incorrect business rules that sound plausible but are wrong. Undetected, these propagate into forward engineering, creating bugs that take months to discover.
Mitigation: multi-pass enrichment keeps each AI task focused. Each inference links to source for verification. Checking across sources validates findings. Human-in-the-loop review catches errors. Watch for context poisoning: passing your expectations into the LLM colours the output toward what you’re looking for rather than what the code does. Never skip SME validation for business logic.
How does this approach handle systems with no documentation at all?
Undocumented systems are the best use case. Documentation is missing or out of sync in most legacy systems. AI analyses code structure, behaviour patterns, and relationships to infer functionality—exactly what human experts do, but faster.
Combine with dynamic analysis—logs, database change data capture, UI observation—to check against actual behaviour. Thoughtworks successfully created specifications from systems with virtually no documentation. The worse your documentation, the higher your ROI. You’re not competing against good docs—you’re competing against months of detective work.
Can AI reverse engineering work for microservices or just monoliths?
Both, with different emphasis. Monoliths benefit from graph-based call chain analysis through deep nesting—knowledge graphs handle complexity overwhelming human memory. Microservices benefit from cross-service relationship mapping and API contract inference.
CodeConcise knowledge graphs work for distributed systems. The main challenge is establishing inter-service edges versus intra-service relationships—mapping how services communicate, not just what each does. Microservices’ explicit interfaces make analysis easier than monolith spaghetti, so accuracy is often higher.
What happens if the AI gets something wrong?
Every specification links to source. When validation catches errors, you trace what the AI misunderstood, provide corrected context, re-run targeted analysis. Multi-pass structure limits propagation—rerun just Pass 3 without discarding earlier work.
Financial or regulatory systems need 100% SME validation to catch errors before they have downstream impact. Deterministic AST parsing doesn’t hallucinate; probabilistic LLM inference might; human validation catches what slips through. Multiple levels of cross-checking were how Thoughtworks achieved high confidence.
Is this approach cost-effective for small companies under 100 employees?
Depends on codebase size and SME availability. Under 50,000 lines with available experts? Traditional analysis may be faster and cheaper. Over 100,000 lines with retiring experts or aggressive timeline? ROI justifies investment.
Calculate: modules times four weeks saved times consultant rate, minus tool cost and setup. For 20+ modules, break-even is typical. For 50+ modules, ROI is compelling. Small companies face the same COBOL expert scarcity as enterprises. If your experts are overloaded or leaving, AI acceleration makes sense even at smaller scale.
How does AI-assisted reverse engineering integrate with agile modernisation?
Excellently. Fast reverse engineering enables short iteration cycles: analyse one capability, modernise it, validate, repeat. AI-generated specifications become user stories, and the knowledge graph evolves as understanding grows.
Capability-driven modernisation aligns with agile’s incremental delivery. The evolutionary approach makes legacy displacement safer and more effective. AI acceleration makes evolutionary modernisation practical where traditional analysis creates waterfall bottlenecks that kill agile momentum.
For implementing reverse engineering in your modernisation programme, our 90-day implementation playbook provides step-by-step execution guidance that integrates AI-assisted reverse engineering with agile delivery cycles.
AI-assisted reverse engineering demonstrates that understanding old code is the new killer app for enterprise AI. The two-thirds timeline reduction isn’t just about speed—it’s about making previously impractical modernisation programmes viable. When you can analyse 1,500 modules in months instead of years, you unlock transformation that was economically impossible under traditional approaches.