
feat: Add SQL language support with dbt/Jinja awareness #147

Open
grabowskit wants to merge 2 commits into elastic:main from grabowskit:feature/sql-dbt-parser

Conversation

@grabowskit

Summary

Adds comprehensive SQL parsing with first-class dbt (data build tool) support for semantic code search across SQL codebases and modern data engineering projects.

Why This Is Needed

The Problem:
SQL is one of the most widely used languages in data engineering, analytics, and backend development. Modern data teams using dbt have thousands of SQL files with complex dependencies that are currently invisible to semantic code search.

Key Use Cases:

  • Data Engineers need to search across dbt projects to understand model dependencies
  • Analytics Engineers want to find where specific tables/sources are referenced
  • Platform Teams need to audit macro usage across projects
  • Developers working with SQL migrations want semantic search over DDL

Why dbt Awareness Matters:
dbt has become the industry standard for data transformation, used by thousands of companies. dbt SQL files contain Jinja templating ({{ ref() }}, {{ source() }}, {{ macro() }}) that standard SQL parsers can't understand. This PR extracts these as first-class dependencies.

Features

| Feature | Description |
| --- | --- |
| dbt ref() extraction | Parses {{ ref('model_name') }} to track model dependencies |
| dbt source() extraction | Parses {{ source('schema', 'table') }} for source tracking |
| Macro detection | Identifies custom macro calls (excludes builtins like config, var) |
| CTE chunking | Splits WITH ... AS (...) clauses into searchable chunks |
| CREATE statements | Detects TABLE/VIEW/FUNCTION/PROCEDURE definitions |
| Table references | Extracts FROM/JOIN table names for dependency graphs |
| Multi-dialect | Works with BigQuery, Snowflake, Databricks, Redshift, DuckDB |
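To make the ref()/source() extraction concrete, here is a minimal sketch of the idea. The function and type names below are illustrative stand-ins, not the PR's actual API, and string/comment handling is omitted:

```typescript
// Simplified sketch of dbt ref()/source() extraction via regex.
// Names here (DbtDependency, extractDbtDeps) are hypothetical.
interface DbtDependency {
  type: "ref" | "source";
  name: string;
  schema?: string;
}

function extractDbtDeps(sql: string): DbtDependency[] {
  const deps: DbtDependency[] = [];
  // {{ ref('model_name') }}
  const refPattern = /\{\{\s*ref\(\s*['"]([^'"]+)['"]\s*\)\s*\}\}/g;
  for (const m of sql.matchAll(refPattern)) {
    deps.push({ type: "ref", name: m[1] });
  }
  // {{ source('schema', 'table') }}
  const sourcePattern =
    /\{\{\s*source\(\s*['"]([^'"]+)['"]\s*,\s*['"]([^'"]+)['"]\s*\)\s*\}\}/g;
  for (const m of sql.matchAll(sourcePattern)) {
    deps.push({ type: "source", schema: m[1], name: m[2] });
  }
  return deps;
}
```

A dbt model line like `select * from {{ ref('stg_orders') }}` would yield a single `ref` dependency on `stg_orders`.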

Implementation Notes

  • Uses parser: null (custom parser) since tree-sitter-sql doesn't handle Jinja templating
  • Regex-based extraction for dbt constructs is intentional: no tree-sitter grammar handles Jinja templating embedded in SQL
  • CTE detection uses parenthesis balancing to handle nested queries
  • Built-in dbt functions (config, var, env_var, etc.) are excluded from macro detection
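The parenthesis-balancing idea mentioned above can be sketched as follows. This is a simplified illustration, not the PR's implementation; it ignores parentheses inside string literals and comments, which a real parser would need to handle:

```typescript
// Sketch: find where a CTE body ends via parenthesis balancing.
// Returns the index just past the matching ')' for the '(' at openIdx,
// or -1 if the parentheses never balance.
function findCteEnd(sql: string, openIdx: number): number {
  let depth = 0;
  for (let i = openIdx; i < sql.length; i++) {
    if (sql[i] === "(") depth++;
    else if (sql[i] === ")") {
      depth--;
      if (depth === 0) return i + 1; // matching close found
    }
  }
  return -1; // unbalanced
}
```

Because nested subqueries only increase and then decrease the depth counter, the first return to depth 0 marks the end of the outermost CTE body.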

Test Plan

  • All existing tests pass (337 tests)
  • New SQL parser tests pass (16 tests):
    • dbt model parsing (ref, source, macro extraction)
    • dbt macro file parsing
    • Pure SQL parsing (CREATE, CTEs, table refs)
    • File extension recognition
    • Directory information extraction

Adds comprehensive SQL parsing with first-class dbt (data build tool) support
for semantic code search across SQL codebases and modern data engineering projects.

Features:
- dbt ref() extraction: Parses {{ ref('model_name') }} for model dependencies
- dbt source() extraction: Parses {{ source('schema', 'table') }} for source tracking
- Macro detection: Identifies custom macro calls (excludes builtins)
- CTE chunking: Splits WITH ... AS (...) clauses into searchable chunks
- CREATE statements: Detects TABLE/VIEW/FUNCTION/PROCEDURE definitions
- Table references: Extracts FROM/JOIN table names for dependency graphs
- Multi-dialect: Works with BigQuery, Snowflake, Databricks, Redshift, DuckDB

Uses parser: null (custom parser) since tree-sitter-sql cannot handle Jinja
templating embedded in SQL files.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

cla-checker-service bot commented Feb 10, 2026

💚 CLA has been signed

@simianhacker simianhacker requested a review from Copilot February 10, 2026 02:55

Copilot AI left a comment


Pull request overview

Adds first-class SQL parsing with dbt/Jinja awareness to enable semantic code search over SQL/dbt projects, including dependency extraction and chunking.

Changes:

  • Added a custom SQL parser that extracts dbt refs/sources/macros, table references, CTE chunks, config chunks, and statement chunks.
  • Registered SQL as a supported language with .sql extension and added parser metrics type for SQL.
  • Added fixtures and Vitest unit tests covering dbt models/macros and pure SQL behavior.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/unit/parser-sql.test.ts Adds unit tests for SQL parsing, chunking, and dependency extraction behaviors
tests/fixtures/sql_pure.sql Adds pure SQL fixture for CREATE/CTE/table-reference parsing
tests/fixtures/sql_dbt_model.sql Adds dbt model SQL fixture with config/ref/source/macro usage
tests/fixtures/sql_dbt_macro.sql Adds dbt macro fixture with multiple macro definitions
src/utils/parser.ts Implements custom SQL parsing, structure extraction, and dependency extraction
src/utils/constants.ts Adds sql language and parser type constants
src/languages/sql.ts Adds SQL language configuration using custom parser
src/languages/index.ts Registers SQL configuration in supported languages


Comment on lines 1078 to 1088
const macroContent = currentMacro.content.join('\n');
const macroDeps = this.extractSqlDependencies(macroContent);

macros.push({
  content: macroContent,
  type: 'macro',
  name: currentMacro.name,
  startLine: currentMacro.startLine,
  endLine: lineNum,
  dependencies: macroDeps,
});

Copilot AI Feb 10, 2026


Dependency line numbers for macro files are incorrect. extractSqlDependencies(macroContent) numbers lines starting at 1 inside the macro chunk, but those line numbers are later emitted as SymbolInfo.line without being offset by currentMacro.startLine - 1, so symbol lines won’t match the original file line numbers. Fix by offsetting macroDeps lines by (currentMacro.startLine - 1) (or by enhancing extractSqlDependencies to accept a lineOffset parameter).
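The offset fix this comment asks for can be sketched as a small re-basing step. `Dep` is a hypothetical stand-in for the PR's dependency type:

```typescript
// Sketch: re-base dependency line numbers from chunk-relative to
// file-relative coordinates, as the review suggests.
interface Dep {
  name: string;
  line: number;
}

function offsetDeps(deps: Dep[], chunkStartLine: number): Dep[] {
  // Lines inside the chunk are numbered from 1, so shifting by
  // (chunkStartLine - 1) maps them back onto file line numbers.
  return deps.map((d) => ({ ...d, line: d.line + chunkStartLine - 1 }));
}
```

For a macro chunk starting at file line 10, a dependency found on chunk line 2 maps to file line 11.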

Comment on lines 746 to 775
// Pattern for {{ macro_name(...) }} - excluding built-in functions
const macroPattern = /\{\{\s*([a-zA-Z_][a-zA-Z0-9_]*)\s*\(/g;
const builtinMacros = new Set([
  'ref', 'source', 'config', 'this', 'var', 'env_var', 'target',
  'adapter', 'return', 'log', 'exceptions', 'run_query', 'statement',
  'set', 'do', 'if', 'for', 'elif', 'else', 'endif', 'endfor',
  'macro', 'endmacro', 'call', 'block', 'filter',
]);

Copilot AI Feb 10, 2026


The macro-call regex only captures a single identifier immediately after {{, so it will mis-handle namespaced calls like {{ dbt.date_trunc(...) }} by recording a dependency on dbt instead of date_trunc (and also incorrectly treating dbt as a custom macro). Update the pattern to support dotted identifiers and record the final segment (or explicitly treat dbt-namespaced helpers as built-ins).
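One possible shape for the dotted-identifier pattern, as a sketch of this suggestion (not the merged code; the full built-in *name* filter from the original snippet is omitted for brevity):

```typescript
// Sketch: capture an optional namespace plus the macro name, then treat
// built-in namespaces (dbt, adapter, exceptions) as non-custom calls.
const builtinNamespaces = new Set(["dbt", "adapter", "exceptions"]);
const macroPattern =
  /\{\{\s*(?:([a-zA-Z_][a-zA-Z0-9_]*)\.)?([a-zA-Z_][a-zA-Z0-9_]*)\s*\(/g;

function findCustomMacros(sql: string): string[] {
  const names: string[] = [];
  for (const m of sql.matchAll(macroPattern)) {
    const [, ns, name] = m;
    if (ns && builtinNamespaces.has(ns)) continue; // built-in helper
    names.push(name);
  }
  return names;
}
```

With this, `{{ dbt.date_trunc(...) }}` is skipped as a built-in helper while `{{ cents_to_dollars(...) }}` is still recorded as a custom macro call.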

Comment on lines 1149 to 1153
const symbols: SymbolInfo[] = params.dependencies.map((d) => ({
  name: d.name,
  kind: `sql.${d.type}`,
  line: d.line,
}));

Copilot AI Feb 10, 2026


For source and table dependencies, SymbolInfo.name drops the schema/source context (e.g., raw_orders vs ecom.raw_orders), which can cause collisions and makes symbols less informative to consumers. Consider using a fully-qualified name when schema is present (e.g., ${d.schema}.${d.name}) at least for sql.source and sql.table symbols.

Suggested change
const symbols: SymbolInfo[] = params.dependencies.map((d) => ({
  name: d.name,
  kind: `sql.${d.type}`,
  line: d.line,
}));
const symbols: SymbolInfo[] = params.dependencies.map((d) => {
  const name =
    (d.type === 'source' || d.type === 'table') && d.schema
      ? `${d.schema}.${d.name}`
      : d.name;
  return {
    name,
    kind: `sql.${d.type}`,
    line: d.line,
  };
});

Comment on lines 1062 to 1063
// Pattern for macro start: {% macro name(args) %} or {%- macro name(args) -%}
const macroStartPattern = /\{%[-\s]*macro\s+([a-zA-Z_][a-zA-Z0-9_]*)\s*\(/;

Copilot AI Feb 10, 2026


dbt/Jinja macros can be defined without parentheses (e.g., {% macro my_macro %}), but macroStartPattern requires (, so those macros won’t be detected/chunked. Adjust the regex to allow an optional argument list (match either ( or the %} token) so no-arg macros are still parsed.

Suggested change
// Pattern for macro start: {% macro name(args) %} or {%- macro name(args) -%}
const macroStartPattern = /\{%[-\s]*macro\s+([a-zA-Z_][a-zA-Z0-9_]*)\s*\(/;
// Pattern for macro start: {% macro name(args) %}, {%- macro name(args) -%},
// and macros without args: {% macro name %} or {%- macro name -%}
const macroStartPattern = /\{%[-\s]*macro\s+([a-zA-Z_][a-zA-Z0-9_]*)\s*(\(|-?%\})/;

Comment on lines +100 to +105
it('should set correct line numbers for macros', () => {
  const filePath = path.resolve(__dirname, '../fixtures/sql_dbt_macro.sql');
  const result = parser.parseFile(filePath, 'main', 'macros/cents_to_dollars.sql');

  const macroChunks = result.chunks.filter((c) => c.kind === 'macro');
  expect(macroChunks.length).toBeGreaterThan(0);

Copilot AI Feb 10, 2026


The tests validate macro chunk startLine/endLine, but don’t validate line numbers on extracted dependencies/symbols within those macro chunks. Given the current implementation re-parses macro content, add an assertion that chunk.symbols[*].line aligns to file line numbers (e.g., >= chunk.startLine and <= chunk.endLine, or exact expected lines for known dependencies) to prevent regressions.
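A standalone version of the suggested bounds check might look like this (field names are assumed from the PR description, not taken from the actual test file):

```typescript
// Sketch: symbol line numbers extracted from a macro chunk must be
// file-relative, i.e. fall within the chunk's own line boundaries.
interface Sym {
  name: string;
  line: number;
}
interface Chunk {
  startLine: number;
  endLine: number;
  symbols: Sym[];
}

function symbolsWithinBounds(chunk: Chunk): boolean {
  return chunk.symbols.every(
    (s) => s.line >= chunk.startLine && s.line <= chunk.endLine,
  );
}
```

In a Vitest test, this check would become an `expect(symbolsWithinBounds(chunk)).toBe(true)` assertion per macro chunk, which would have caught the chunk-relative line-number bug above.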

Fixes the following issues identified in code review:

1. Macro dependency line numbers - Added lineOffset parameter to
   extractSqlDependencies() so line numbers within macros are
   relative to the file, not the macro chunk

2. Namespaced macro calls - Updated regex to capture namespace.macro
   patterns (e.g., dbt.date_trunc) and skip built-in namespaces
   (dbt, adapter, exceptions)

3. Fully-qualified symbol names - Symbols with schema info now use
   ${schema}.${name} format to avoid collisions

4. Macros without arguments - Updated regex to match both
   {% macro name(args) %} and {% macro name %}

5. Added test for dependency line number validation within chunk
   boundaries

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@kapral18 kapral18 left a comment


Hey thanks for the contribution.

Overall direction looks strong (language wiring, fixtures, macro/ref/source extraction, and improved dependency line handling). The main gaps I see are a couple chunk-boundary correctness issues in the SQL state machine plus missing regression tests for those edges; addressing those should make the parser much safer to ship.

mainStatementContent = [];
}
continue;
}

Single-line statements that start with SELECT/INSERT/... and already contain ; don’t get finalized on that same line because the start branch continues and the semicolon-finalization logic only runs on later lines. That merges consecutive one-line statements into one chunk (e.g. SELECT 1; + SELECT 2;). Can we finalize immediately when the start line includes ; to keep chunk boundaries correct?
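The requested fix can be sketched as follows (illustrative only; the real state machine tracks statement kinds, WITH mode, and more state than this):

```typescript
// Sketch: finalize a statement immediately when its start line already
// contains the terminating semicolon, so consecutive one-line
// statements ("SELECT 1;" then "SELECT 2;") become separate chunks.
function splitStatements(lines: string[]): string[][] {
  const chunks: string[][] = [];
  let current: string[] = [];
  for (const line of lines) {
    current.push(line);
    if (line.includes(";")) {
      // Finalize on the same line the semicolon appears, even if it
      // is also the line the statement started on.
      chunks.push(current);
      current = [];
    }
  }
  if (current.length > 0) chunks.push(current); // trailing unterminated statement
  return chunks;
}
```

The bug described above corresponds to only checking for `;` on lines *after* the start line, which merges adjacent one-liners.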

// Skip empty lines and comments at the start
if (mainStatementStart === null && !inWith && /^\s*(--|\/\*|\s*$)/.test(line)) {
continue;
}

WITH mode begins whenever withPattern matches and !inWith, even if a statement is already in progress (mainStatementStart !== null). This drops preamble lines like CREATE VIEW v AS when the next line is WITH ... (the emitted chunks start at the WITH, so the create header is lost). Can we prevent entering WITH mode when we’re already inside a statement, or otherwise preserve the preamble as part of the emitted chunks?


// Skip empty lines and comments at the start
if (mainStatementStart === null && !inWith && /^\s*(--|\/\*|\s*$)/.test(line)) {
continue;

Related edge case: for one-line queries like WITH a AS (SELECT 1) SELECT * FROM a;, the current CTE chunking can swallow the main SELECT into the cte chunk rather than splitting cte vs main statement. Would it make sense to add a small regression test + adjust parsing to split when the main query starts on the same line as the last CTE?

]);
// Namespaces that contain built-in helpers (e.g., dbt.date_trunc)
const builtinNamespaces = new Set(['dbt', 'adapter', 'exceptions']);


Table extraction from FROM/JOIN won’t catch quoted identifiers like FROM "analytics"."orders", which is pretty common across dialects. If we want table refs to be useful broadly, should we extend the regex (or rely on grammar parsing) to support quoted/backticked identifiers?
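A pattern covering quoted, backticked, and bracketed identifiers might look like this (a sketch of the suggestion; dialect-specific escaping rules inside identifiers are ignored):

```typescript
// Sketch: FROM/JOIN table extraction that also accepts "quoted",
// `backticked`, and [bracketed] identifiers, with up to three
// dot-separated parts (db.schema.table). Illustrative only.
const tablePattern =
  /\b(?:FROM|JOIN)\s+((?:"[^"]+"|`[^`]+`|\[[^\]]+\]|[a-zA-Z_]\w*)(?:\.(?:"[^"]+"|`[^`]+`|\[[^\]]+\]|[a-zA-Z_]\w*)){0,2})/gi;

function extractTables(sql: string): string[] {
  return [...sql.matchAll(tablePattern)].map((m) => m[1]);
}
```

A downstream consumer would likely also want to strip the quoting characters so that `"analytics"."orders"` and `analytics.orders` normalize to the same dependency.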

});
});

describe('Pure SQL Parsing', () => {

We have good coverage for the dbt fixtures, but the two chunk-boundary edges (consecutive one-line statements; CREATE … AS followed by WITH …) aren’t covered. Can we add two small regression tests for those so future tweaks don’t re-break chunk boundaries?

export const sqlConfig: LanguageConfiguration = {
name: 'sql',
fileSuffixes: ['.sql'],
parser: null, // Custom parser for SQL + dbt/Jinja support

This currently goes through a custom parser (parser: null) with a hand-rolled SQL state machine plus regex extraction for dbt/Jinja. Would you be open to a hybrid approach: use an existing maintained tree-sitter SQL grammar (e.g. @derekstride/tree-sitter-sql) for pure SQL structure (statements/CTEs/CREATE bodies), and route templated files (containing {{ / {%) through the current custom dbt/Jinja-aware path for ref, source, and macro call extraction? That should reduce maintenance and edge-case chunking bugs long-term.
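The proposed routing is straightforward to sketch: files containing Jinja delimiters go through the custom dbt-aware path, everything else could go to a tree-sitter grammar. The parser labels below are hypothetical:

```typescript
// Sketch of the proposed hybrid routing. A file containing Jinja
// delimiters ({{ expressions or {% statements) needs the custom
// dbt/Jinja-aware path; pure SQL could use a tree-sitter grammar.
function isTemplated(sql: string): boolean {
  return sql.includes("{{") || sql.includes("{%");
}

function chooseParser(sql: string): "jinja-aware" | "tree-sitter" {
  return isTemplated(sql) ? "jinja-aware" : "tree-sitter";
}
```

The appeal of this split is that the hand-rolled state machine only has to stay correct for templated files, while pure-SQL chunk boundaries are delegated to a maintained grammar.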

@grabowskit
Author

Thank you so much @kapral18 for the thorough review! I really appreciate you taking the time to go through the code in such detail - it's clear I still have a lot to learn about SQL parsing edge cases. 😅

I've addressed the issues you raised:

Changes Made

1. Single-line statement finalization (consecutive one-line statements)

Fixed the issue where consecutive one-line statements like SELECT 1; + SELECT 2; were being merged into one chunk. Now when a statement starts and includes a semicolon on the same line, it's finalized immediately.

2. WITH mode entering when already inside a statement

Fixed the CREATE VIEW v AS WITH ... case where the preamble was being lost. Now the parser only enters WITH mode when not already inside a statement, so CREATE VIEW headers are preserved.

3. One-line CTE handling

Added handling for cases like WITH a AS (SELECT 1) SELECT * FROM a; where the main query starts on the same line as the last CTE closes.

4. Quoted identifier support for table extraction

Extended the table extraction regex to support:

  • Double-quoted identifiers (PostgreSQL, BigQuery): "schema"."table"
  • Backticked identifiers (MySQL, BigQuery): `schema`.`table`
  • Square bracket identifiers (SQL Server): [schema].[table]

5. Regression tests

Added 4 new regression tests covering:

  • Consecutive one-line statements not being merged
  • CREATE preamble preservation with WITH clause
  • One-line CTE with main query on same line
  • Quoted table identifier extraction

6. Tree-sitter hybrid approach acknowledgment

Added a comment in sql.ts documenting the suggestion to consider a hybrid approach using tree-sitter for pure SQL structure and the custom parser for dbt/Jinja templated files. This is a great idea for reducing long-term maintenance!

All existing tests continue to pass, and the new regression tests verify the fixes work correctly.
