Parser Documentation

Plan references: PLAN.md §§3.2, 3.3
Agent: PinkMountain (parser-agent)
Status: Production-ready (all tasks complete except stack-graphs integration, blocked on upstream WASM support)

1. Overview

The HELIOS parser extracts functions, imports, exports, and call expressions from source code using Tree-sitter (WASM). It provides the foundation for the call graph construction and subsequent analysis pipeline.

Key components:

src/parser/parser.js - Tree-sitter manager (initialization, grammar loading, parsing)
src/parser/queries.js - Tree-sitter query patterns (functions, imports, exports, calls)
src/extractors/javascript.js - JavaScript/TypeScript extractor
src/extractors/python.js - Python extractor
src/extractors/call-graph.js - Call graph builder with enhanced heuristic resolution
src/extractors/symbol-table.js - Symbol table manager for name resolution

Architecture:

Source Code → Tree-sitter Parser → AST → Query Matches → Extractor → Parser Payload
                                                                    ↓
                                                           Symbol Table Manager
                                                                    ↓
                                                           Call Graph Builder

2. Supported Languages

Currently supported languages:

JavaScript (.js, .jsx) - Full support
TypeScript (.ts, .tsx) - Full support
Python (.py) - Full support

Future languages (v0.4):

Go (.go)
Rust (.rs)
Java (.java)

3. How It Works

3.1 Initialization

The parser uses a singleton parserManager instance that must be initialized before use:

import { parserManager } from './src/parser/parser.js';

// Initialize once (loads web-tree-sitter WASM)
await parserManager.initialize();

What happens during initialization:

Loads web-tree-sitter WASM runtime
Registers Query constructor (or falls back to deprecated API)
Sets up grammar loading mechanism

3.2 Language Detection

Language is automatically detected from file extension:

const language = parserManager.detectLanguage('src/utils/math.ts');
// Returns: 'typescript'

Supported extensions:

.js, .jsx → javascript
.ts, .tsx → typescript
.py → python

3.3 Grammar Loading

Grammars are lazily loaded from local WASM files:

const language = await parserManager.loadLanguage('javascript');
// Loads: /grammars/tree-sitter-javascript.wasm

Grammar locations:

grammars/tree-sitter-javascript.wasm
grammars/tree-sitter-typescript.wasm
grammars/tree-sitter-python.wasm

3.4 Parsing

Parse source code to get AST:

const source = `function add(a, b) { return a + b; }`;
const tree = await parserManager.parse(source, 'javascript', 'src/math.js');
// Returns Tree-sitter Tree object

Important: Always delete the tree when done to free memory:

tree.delete();

3.5 Extraction

Use language-specific extractors to extract code elements:

import { extractJavaScript } from './src/extractors/javascript.js';

const result = await extractJavaScript(source, filePath, language);
// Returns: { functions, exports, imports, calls, filePath }

Extracted elements:

Functions - Function declarations, methods, arrow functions
Exports - Export statements (default and named)
Imports - Import statements (ESM, CommonJS patterns)
Calls - Function call expressions (direct and member calls)

4. Integration Points

4.1 Parser → Embeddings Agent

The parser provides function metadata for embedding generation:

// Parser output includes function metadata
{
  functions: [
    {
      id: "src/math.js::add",
      name: "add",
      fqName: "math.add",
      filePath: "src/math.js",
      startLine: 1,
      endLine: 5,
      source: "function add(a, b) { ... }",
      // ... other fields
    }
  ]
}

Integration: Embeddings agent uses functions[] array to generate embeddings.

4.2 Parser → Graph Agent

The parser provides call edges for graph construction:

// Parser output includes call edges
{
  callEdges: [
    {
      id: "call::caller→callee",
      source: "src/caller.js::caller",
      target: "src/callee.js::callee",
      resolution: {
        status: "resolved",
        candidates: [{ id: "...", confidence: 0.9 }]
      },
      // ... other fields
    }
  ]
}

Integration: Graph agent uses callEdges[] array to build Graphology graph.

4.3 Parser → Storage Agent

The parser payload is stored in SQLite for persistence:

// Parser payload format matches schema
{
  functions: [...],
  callEdges: [...],
  metadata: {
    timestamp: "...",
    schemaVersion: "...",
    // ... other metadata
  }
}

Integration: Storage agent persists parser output in analysis_snapshots table.

4.4 Parser → Viz Agent

The parser provides data for visualization:

// Viz agent consumes parser output through graph pipeline
// Nodes: functions[] mapped to graph nodes
// Links: callEdges[] mapped to graph links

Integration: Viz agent renders parser output as 3D graph visualization.

5. Payload Format

The parser output follows the schema defined in docs/payloads.md:

Required fields:

functions[] - Array of function objects
callEdges[] - Array of call edge objects (may be empty)

Optional fields:

metadata - Parser metadata (timestamp, version, stats)
stats - Statistics (function counts, edge counts, resolution stats)

Validation:

node tools/validate-parser-output.mjs <parser-output.json>

Sample payload: See docs/examples/parser-output-sample.json

6. Name Resolution

The parser uses enhanced heuristic resolution (PLAN §10.2):

6.1 Resolution Strategies

Lexical scope - Prefers functions defined before calls (closures/nested scopes)
Local matches - Functions in the same file
Import matches - Functions imported via module imports
Default exports - Handles default exports imported with different names
Symbol table - Uses fully qualified names (FQN) from symbol tables
Module similarity - Prefers functions from same directory/module
External fallback - Other functions with same name (lower confidence)

6.2 Resolution Status

Resolved - Single high-confidence match
Ambiguous - Multiple matches (2+ candidates)
Unresolved - No matches found (creates virtual node)

6.3 Virtual Nodes

Unresolved calls create virtual nodes:

{
  id: "virtual:calleeName:src/caller.js",
  name: "calleeName",
  fqName: "[unresolved] calleeName",
  isVirtual: true,
  // ... other fields
}

Purpose: Track unresolved calls for analysis and debugging.

7. Adding New Languages

To add support for a new language (e.g., Go):

7.1 Add Grammar WASM

Download Tree-sitter grammar WASM file:

# Build or download tree-sitter-go.wasm
# Place in grammars/ directory

Update src/parser/parser.js:

const GRAMMAR_URLS = {
  // ... existing languages
  go: '/grammars/tree-sitter-go.wasm'
};

const LANGUAGE_MAP = {
  // ... existing extensions
  '.go': 'go'
};

7.2 Create Query Patterns

Create src/extractors/go.js:

export const GO_QUERIES = {
  functions: `
    (function_declaration
      name: (identifier) @name) @func
  `,
  imports: `
    (import_declaration) @import
  `,
  calls: `
    (call_expression
      function: (identifier) @callee) @call
  `
};

7.3 Create Extractor

Create extractor function similar to extractJavaScript:

export async function extractGo(source, filePath, language) {
  const tree = await parserManager.parse(source, 'go', filePath);
  // ... extract functions, imports, calls
  return { functions, imports, calls, filePath };
}

7.4 Update Main Extraction Logic

Update index.html or main extraction function to handle new language:

if (language === 'go') {
  return await extractGo(source, filePath, language);
}

7.5 Add Tests

Create tests/extractors/go.test.mjs with test cases for the new language.

8. Testing

8.1 Unit Tests

Parser tests:

node --test tests/parser/*.test.mjs

Extractor tests:

node --test tests/extractors/*.test.mjs

8.2 Validation

Validate parser output:

node tools/validate-parser-output.mjs <output.json>

Validate against payload schema:

node tools/validate-payload.mjs <payload.json>

8.3 Regression Testing

Run regression tests:

node tools/regression-test.mjs

Golden repos:

tests/golden-repos/simple-web-app/
tests/golden-repos/mixed-language-api/
tests/golden-repos/typescript-library/

9. Troubleshooting

9.1 Common Issues

Issue: "Grammar load failed"

Cause: Grammar WASM file not found or invalid
Fix: Verify grammar file exists in grammars/ directory and is valid WASM

Issue: "Query constructor not found"

Cause: Query constructor not registered
Fix: Parser falls back to deprecated language.query() - this is okay, but less efficient

Issue: "Language not detected"

Cause: File extension not in LANGUAGE_MAP
Fix: Add extension to LANGUAGE_MAP in src/parser/parser.js

Issue: "Parse failed"

Cause: Invalid syntax or unsupported language features
Fix: Tree-sitter grammars may not support all language features - check grammar documentation

9.2 Debug Mode

Enable debug logging:

// In browser console
localStorage.setItem('debug', 'parser*');

Logs:

[TreeSitter] - Initialization and grammar loading
[Queries] - Query compilation
[Extractor] - Extraction progress

9.3 Performance Issues

Large files:

Tree-sitter is fast but very large files (>10k LOC) may take time
Consider chunking large files or showing progress indicators

Many files:

Parse files in parallel using workers (future enhancement)
Current implementation parses sequentially

Memory:

Always call tree.delete() after parsing
Clear parser cache if needed: parserManager.cleanup()

10. Performance Considerations

10.1 Initialization

First init: ~100-200ms (loads WASM runtime)
Subsequent: Cached, near-instant

10.2 Grammar Loading

First load: ~50-100ms per grammar (downloads WASM)
Subsequent: Cached, near-instant

10.3 Parsing

Small files (<100 LOC): <1ms
Medium files (100-1000 LOC): 1-10ms
Large files (1000-10000 LOC): 10-100ms

10.4 Extraction

Extraction overhead: ~0.5-2ms per file (query execution + processing)

Overall: Parsing is fast enough for real-time use on repos up to ~5000 functions.

11. Future Enhancements

Planned (post-MVP):

Stack-graphs integration (blocked on upstream WASM build)
Additional language support (Go, Rust, Java)
Worker pool for parallel parsing
Incremental parsing (parse only changed files)
Type-aware resolution (TypeScript type information)

Stack-graphs status:

Currently blocked on upstream WASM-capable build
Enhanced heuristic resolution provides good baseline accuracy
See PLAN.md §89 for details

12. Related Documentation

Payload schema: docs/payloads.md
Testing guide: docs/TESTING.md
Regression testing: docs/regression-testing.md
Storage integration: docs/storage.md
Architecture: PLAN.md §§3.2, 3.3

13. API Reference

`parserManager`

Methods:

initialize() - Initialize Tree-sitter (loads WASM)
detectLanguage(filePath) - Detect language from file path
loadLanguage(language) - Load grammar WASM (lazy)
parse(source, language, filePath) - Parse source to AST
cleanup() - Clear caches and free resources

Extractors

JavaScript/TypeScript:

extractJavaScript(source, filePath, language) - Extract JS/TS elements

Python:

extractPython(source, filePath, language) - Extract Python elements

Call Graph Builder

buildCallGraph(functions, allCalls, symbolTableManager) - Build call graph with resolution

Symbol Table Manager

SymbolTableManager - Manages per-file symbol tables
SymbolTable - Per-file symbol table

See source code for detailed API documentation.

FilesExpand file tree

parser.md

Latest commit

History