Plan references: PLAN.md §§3.2, 3.3
Agent: PinkMountain (parser-agent)
Status: Production-ready (all tasks complete except stack-graphs integration, blocked on upstream WASM support)
The HELIOS parser extracts functions, imports, exports, and call expressions from source code using Tree-sitter (WASM). It provides the foundation for call graph construction and the subsequent analysis pipeline.
Key components:
- src/parser/parser.js - Tree-sitter manager (initialization, grammar loading, parsing)
- src/parser/queries.js - Tree-sitter query patterns (functions, imports, exports, calls)
- src/extractors/javascript.js - JavaScript/TypeScript extractor
- src/extractors/python.js - Python extractor
- src/extractors/call-graph.js - Call graph builder with enhanced heuristic resolution
- src/extractors/symbol-table.js - Symbol table manager for name resolution
Architecture:
Source Code → Tree-sitter Parser → AST → Query Matches → Extractor → Parser Payload
↓
Symbol Table Manager
↓
Call Graph Builder
Currently supported languages:
- JavaScript (.js, .jsx) - Full support
- TypeScript (.ts, .tsx) - Full support
- Python (.py) - Full support
Future languages (v0.4):
- Go (.go)
- Rust (.rs)
- Java (.java)
The parser uses a singleton parserManager instance that must be initialized before use:
import { parserManager } from './src/parser/parser.js';
// Initialize once (loads web-tree-sitter WASM)
await parserManager.initialize();
What happens during initialization:
- Loads web-tree-sitter WASM runtime
- Registers Query constructor (or falls back to deprecated API)
- Sets up grammar loading mechanism
Language is automatically detected from file extension:
const language = parserManager.detectLanguage('src/utils/math.ts');
// Returns: 'typescript'
Supported extensions:
- .js, .jsx → javascript
- .ts, .tsx → typescript
- .py → python
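The extension mapping can be illustrated with a small standalone function. This is a minimal sketch, not the actual `parserManager.detectLanguage` source: the `EXTENSION_TO_LANGUAGE` table, the case-insensitive match, and the `null` return for unknown extensions are all assumptions made for illustration.

```javascript
// Hypothetical stand-in for the LANGUAGE_MAP in src/parser/parser.js.
const EXTENSION_TO_LANGUAGE = {
  '.js': 'javascript',
  '.jsx': 'javascript',
  '.ts': 'typescript',
  '.tsx': 'typescript',
  '.py': 'python',
};

// Returns the language for a file path, or null when the extension is unknown.
// Lowercasing the extension is an assumption, not documented behavior.
function detectLanguageSketch(filePath) {
  const dot = filePath.lastIndexOf('.');
  const slash = Math.max(filePath.lastIndexOf('/'), filePath.lastIndexOf('\\'));
  // A dot located before the last path separator belongs to a directory name.
  if (dot <= slash) return null;
  const ext = filePath.slice(dot).toLowerCase();
  return EXTENSION_TO_LANGUAGE[ext] ?? null;
}

console.log(detectLanguageSketch('src/utils/math.ts')); // typescript
```

Returning `null` rather than throwing lets a caller skip unsupported files without a try/catch; the real implementation may behave differently.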
Grammars are lazily loaded from local WASM files:
const language = await parserManager.loadLanguage('javascript');
// Loads: /grammars/tree-sitter-javascript.wasm
Grammar locations:
- grammars/tree-sitter-javascript.wasm
- grammars/tree-sitter-typescript.wasm
- grammars/tree-sitter-python.wasm
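Lazy loading of this kind is commonly implemented by caching one promise per language, so that concurrent requests share a single load. The sketch below shows that pattern only; `createLazyGrammarLoader` and the injected `fetchGrammar` callback are hypothetical names, and the real `loadLanguage` may be structured differently.

```javascript
// Memoizes one in-flight or completed load per language.
// `fetchGrammar(url)` is a hypothetical loader supplied by the caller; it is
// expected to return a promise (or a plain value) for the loaded grammar.
function createLazyGrammarLoader(fetchGrammar) {
  const cache = new Map();
  return function loadLanguageSketch(language) {
    if (!cache.has(language)) {
      const url = `/grammars/tree-sitter-${language}.wasm`;
      const pending = Promise.resolve(fetchGrammar(url)).catch((err) => {
        cache.delete(language); // allow a retry after a failed load
        throw err;
      });
      cache.set(language, pending);
    }
    return cache.get(language);
  };
}
```

Caching the promise rather than the resolved grammar means two callers that request the same language at the same time trigger only one fetch.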
Parse source code to get AST:
const source = `function add(a, b) { return a + b; }`;
const tree = await parserManager.parse(source, 'javascript', 'src/math.js');
// Returns Tree-sitter Tree object
Important: Always delete the tree when done to free memory:
tree.delete();
Use language-specific extractors to extract code elements:
import { extractJavaScript } from './src/extractors/javascript.js';
const result = await extractJavaScript(source, filePath, language);
// Returns: { functions, exports, imports, calls, filePath }
Extracted elements:
- Functions - Function declarations, methods, arrow functions
- Exports - Export statements (default and named)
- Imports - Import statements (ESM, CommonJS patterns)
- Calls - Function call expressions (direct and member calls)
The parser provides function metadata for embedding generation:
// Parser output includes function metadata
{
  functions: [
    {
      id: "src/math.js::add",
      name: "add",
      fqName: "math.add",
      filePath: "src/math.js",
      startLine: 1,
      endLine: 5,
      source: "function add(a, b) { ... }",
      // ... other fields
    }
  ]
}
Integration: Embeddings agent uses functions[] array to generate embeddings.
The parser provides call edges for graph construction:
// Parser output includes call edges
{
  callEdges: [
    {
      id: "call::caller→callee",
      source: "src/caller.js::caller",
      target: "src/callee.js::callee",
      resolution: {
        status: "resolved",
        candidates: [{ id: "...", confidence: 0.9 }]
      },
      // ... other fields
    }
  ]
}
Integration: Graph agent uses callEdges[] array to build Graphology graph.
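To make the edge format concrete, the following sketch groups call edges by caller using a plain `Map`. It does not use Graphology and is not the Graph agent's code; `buildAdjacency` is a hypothetical helper, and dropping every edge whose status is not `"resolved"` is an assumption chosen for the example.

```javascript
// Groups resolved call edges by caller id: Map<sourceId, targetId[]>.
// Edges without resolution.status === 'resolved' are skipped (an assumption).
function buildAdjacency(callEdges) {
  const adjacency = new Map();
  for (const edge of callEdges) {
    if (edge.resolution?.status !== 'resolved') continue;
    if (!adjacency.has(edge.source)) adjacency.set(edge.source, []);
    adjacency.get(edge.source).push(edge.target);
  }
  return adjacency;
}

const adjacency = buildAdjacency([
  {
    source: 'src/caller.js::caller',
    target: 'src/callee.js::callee',
    resolution: { status: 'resolved', candidates: [] },
  },
]);
console.log(adjacency.get('src/caller.js::caller')); // [ 'src/callee.js::callee' ]
```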
The parser payload is stored in SQLite for persistence:
// Parser payload format matches schema
{
  functions: [...],
  callEdges: [...],
  metadata: {
    timestamp: "...",
    schemaVersion: "...",
    // ... other metadata
  }
}
Integration: Storage agent persists parser output in analysis_snapshots table.
The parser provides data for visualization:
// Viz agent consumes parser output through graph pipeline
// Nodes: functions[] mapped to graph nodes
// Links: callEdges[] mapped to graph links
Integration: Viz agent renders parser output as 3D graph visualization.
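The node and link mapping described above might look roughly like this. It is a sketch under assumptions: `toGraphData` is a hypothetical name, and the `{ nodes, links }` shape with `id`, `label`, `source`, and `target` fields is illustrative rather than the Viz agent's confirmed input format.

```javascript
// Maps a parser payload to a generic { nodes, links } structure.
// Field names on the output objects are assumptions for illustration.
function toGraphData(payload) {
  const nodes = payload.functions.map((fn) => ({ id: fn.id, label: fn.name }));
  const links = payload.callEdges.map((edge) => ({
    source: edge.source,
    target: edge.target,
  }));
  return { nodes, links };
}
```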
The parser output follows the schema defined in docs/payloads.md:
Required fields:
- functions[] - Array of function objects
- callEdges[] - Array of call edge objects (may be empty)
Optional fields:
- metadata - Parser metadata (timestamp, version, stats)
- stats - Statistics (function counts, edge counts, resolution stats)
Validation:
node tools/validate-parser-output.mjs <parser-output.json>
Sample payload:
See docs/examples/parser-output-sample.json
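For a quick in-code check of the two required fields, something like the following could be used. It is a minimal sketch and not the logic of tools/validate-parser-output.mjs; `checkRequiredFields` is a hypothetical helper that only tests for the presence and array type of `functions` and `callEdges`.

```javascript
// Returns a list of problems; an empty list means both required fields
// are present and are arrays. Optional fields are not inspected.
function checkRequiredFields(payload) {
  if (payload === null || typeof payload !== 'object') {
    return ['payload must be an object'];
  }
  const problems = [];
  for (const field of ['functions', 'callEdges']) {
    if (!Array.isArray(payload[field])) {
      problems.push(`${field} must be an array`);
    }
  }
  return problems;
}

console.log(checkRequiredFields({ functions: [], callEdges: [] })); // []
```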
The parser uses enhanced heuristic resolution (PLAN §10.2):
- Lexical scope - Prefers functions defined before calls (closures/nested scopes)
- Local matches - Functions in the same file
- Import matches - Functions imported via module imports
- Default exports - Handles default exports imported with different names
- Symbol table - Uses fully qualified names (FQN) from symbol tables
- Module similarity - Prefers functions from same directory/module
- External fallback - Other functions with same name (lower confidence)
Each call is assigned one of three resolution statuses:
- Resolved - Single high-confidence match
- Ambiguous - Multiple matches (2+ candidates)
- Unresolved - No matches found (creates virtual node)
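The three statuses can be derived from the candidate list produced by the heuristics. The sketch below is an illustration, not the call graph builder's code: `classifyResolution` is a hypothetical name, and because no confidence threshold is stated here, it treats any single candidate as resolved.

```javascript
// Derives a resolution object from a list of { id, confidence } candidates.
// Assumption: one candidate counts as "resolved" regardless of its confidence.
function classifyResolution(candidates) {
  // Sort a copy so the highest-confidence candidate comes first.
  const ranked = [...candidates].sort((a, b) => b.confidence - a.confidence);
  let status;
  if (ranked.length === 0) status = 'unresolved';
  else if (ranked.length === 1) status = 'resolved';
  else status = 'ambiguous';
  return { status, candidates: ranked };
}
```

A stricter variant could downgrade a lone low-confidence candidate to unresolved, but the cut-off for that would have to come from the actual implementation.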
Unresolved calls create virtual nodes:
{
  id: "virtual:calleeName:src/caller.js",
  name: "calleeName",
  fqName: "[unresolved] calleeName",
  isVirtual: true,
  // ... other fields
}
Purpose: Track unresolved calls for analysis and debugging.
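A helper that produces nodes of this shape could be as small as the sketch below. `makeVirtualNode` is a hypothetical name, and reading the third segment of the `id` as the caller's file path is an inference from the example above, not a documented rule.

```javascript
// Builds a virtual node for a callee that could not be resolved.
// Assumption: the id is "virtual:<callee name>:<caller file path>".
function makeVirtualNode(calleeName, callerFilePath) {
  return {
    id: `virtual:${calleeName}:${callerFilePath}`,
    name: calleeName,
    fqName: `[unresolved] ${calleeName}`,
    isVirtual: true,
  };
}

console.log(makeVirtualNode('calleeName', 'src/caller.js').id);
// virtual:calleeName:src/caller.js
```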
To add support for a new language (e.g., Go):
Download Tree-sitter grammar WASM file:
# Build or download tree-sitter-go.wasm
# Place in grammars/ directory
Update src/parser/parser.js:
const GRAMMAR_URLS = {
  // ... existing languages
  go: '/grammars/tree-sitter-go.wasm'
};
const LANGUAGE_MAP = {
  // ... existing extensions
  '.go': 'go'
};
Create src/extractors/go.js:
export const GO_QUERIES = {
functions: `
(function_declaration
name: (identifier) @name) @func
`,
imports: `
(import_declaration) @import
`,
calls: `
(call_expression
function: (identifier) @callee) @call
`
};
Create extractor function similar to extractJavaScript:
import { parserManager } from '../parser/parser.js';

export async function extractGo(source, filePath, language) {
  const tree = await parserManager.parse(source, 'go', filePath);
  try {
    // ... extract functions, imports, calls
    return { functions, imports, calls, filePath };
  } finally {
    tree.delete(); // always free the tree once extraction is done
  }
}
Update index.html or main extraction function to handle new language:
if (language === 'go') {
  return await extractGo(source, filePath, language);
}
Create tests/extractors/go.test.mjs with test cases for the new language.
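As the number of languages grows, the per-language `if` checks can be replaced with a lookup table. This is one possible design, not how the main extraction function is currently written; `createExtractorRegistry` is a hypothetical helper, and the stub extractor in the usage lines stands in for the real functions.

```javascript
// Builds a dispatcher from an object mapping language names to extractors.
// Throws for languages that have no registered extractor.
function createExtractorRegistry(extractors) {
  const registry = new Map(Object.entries(extractors));
  return function extract(source, filePath, language) {
    const extractor = registry.get(language);
    if (typeof extractor !== 'function') {
      throw new Error(`No extractor registered for language: ${language}`);
    }
    // Returns whatever the extractor returns (a promise for async extractors).
    return extractor(source, filePath, language);
  };
}

// Usage with a stub in place of a real extractor such as extractGo:
const extract = createExtractorRegistry({
  go: (source, filePath) => ({ functions: [], imports: [], calls: [], filePath }),
});
console.log(extract('package main', 'main.go', 'go').filePath); // main.go
```

With this layout, adding a language means adding one entry to the table instead of another branch.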
Parser tests:
node --test tests/parser/*.test.mjs
Extractor tests:
node --test tests/extractors/*.test.mjs
Validate parser output:
node tools/validate-parser-output.mjs <output.json>
Validate against payload schema:
node tools/validate-payload.mjs <payload.json>
Run regression tests:
node tools/regression-test.mjs
Golden repos:
- tests/golden-repos/simple-web-app/
- tests/golden-repos/mixed-language-api/
- tests/golden-repos/typescript-library/
Issue: "Grammar load failed"
- Cause: Grammar WASM file not found or invalid
- Fix: Verify grammar file exists in grammars/ directory and is valid WASM
Issue: "Query constructor not found"
- Cause: Query constructor not registered
- Fix: None required. The parser falls back to the deprecated language.query(), which works but is less efficient
Issue: "Language not detected"
- Cause: File extension not in LANGUAGE_MAP
- Fix: Add extension to LANGUAGE_MAP in src/parser/parser.js
Issue: "Parse failed"
- Cause: Invalid syntax or unsupported language features
- Fix: Tree-sitter grammars may not support all language features - check grammar documentation
Enable debug logging:
// In browser console
localStorage.setItem('debug', 'parser*');
Logs:
- [TreeSitter] - Initialization and grammar loading
- [Queries] - Query compilation
- [Extractor] - Extraction progress
Large files:
- Tree-sitter is fast but very large files (>10k LOC) may take time
- Consider chunking large files or showing progress indicators
Many files:
- Parse files in parallel using workers (future enhancement)
- Current implementation parses sequentially
Memory:
- Always call tree.delete() after parsing
- Clear parser cache if needed: parserManager.cleanup()
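One way to make the `tree.delete()` rule hard to forget is to wrap parsing in a helper that releases the tree in a `finally` block. The sketch below assumes only what is described here, namely an async `parse(source, language, filePath)` that resolves to an object with `delete()`; `withTree` itself is a hypothetical helper, not part of the parser API.

```javascript
// Parses, runs `use(tree)`, and always deletes the tree afterwards,
// even when `use` throws or rejects.
// `manager` is any object with an async parse(source, language, filePath).
async function withTree(manager, source, language, filePath, use) {
  const tree = await manager.parse(source, language, filePath);
  try {
    return await use(tree);
  } finally {
    tree.delete();
  }
}
```

A caller would pass `parserManager` as `manager` and do all tree access inside the callback, since the tree is no longer valid once the helper returns.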
Initialization:
- First init: ~100-200ms (loads WASM runtime)
- Subsequent: Cached, near-instant
Grammar loading:
- First load: ~50-100ms per grammar (downloads WASM)
- Subsequent: Cached, near-instant
Parsing:
- Small files (<100 LOC): <1ms
- Medium files (100-1000 LOC): 1-10ms
- Large files (1000-10000 LOC): 10-100ms
Extraction:
- Extraction overhead: ~0.5-2ms per file (query execution + processing)
Overall: Parsing is fast enough for real-time use on repos up to ~5000 functions.
Planned (post-MVP):
- Stack-graphs integration (blocked on upstream WASM build)
- Additional language support (Go, Rust, Java)
- Worker pool for parallel parsing
- Incremental parsing (parse only changed files)
- Type-aware resolution (TypeScript type information)
Stack-graphs status:
- Currently blocked on upstream WASM-capable build
- Enhanced heuristic resolution provides good baseline accuracy
- See PLAN.md §89 for details
- Payload schema: docs/payloads.md
- Testing guide: docs/TESTING.md
- Regression testing: docs/regression-testing.md
- Storage integration: docs/storage.md
- Architecture: PLAN.md §§3.2, 3.3
Methods:
- initialize() - Initialize Tree-sitter (loads WASM)
- detectLanguage(filePath) - Detect language from file path
- loadLanguage(language) - Load grammar WASM (lazy)
- parse(source, language, filePath) - Parse source to AST
- cleanup() - Clear caches and free resources
JavaScript/TypeScript:
- extractJavaScript(source, filePath, language) - Extract JS/TS elements
Python:
- extractPython(source, filePath, language) - Extract Python elements
Call graph:
- buildCallGraph(functions, allCalls, symbolTableManager) - Build call graph with resolution
Symbol table:
- SymbolTableManager - Manages per-file symbol tables
- SymbolTable - Per-file symbol table
See source code for detailed API documentation.