feat: Add SQL language support with dbt/Jinja awareness #147
grabowskit wants to merge 2 commits into elastic:main
Conversation
Adds comprehensive SQL parsing with first-class dbt (data build tool) support
for semantic code search across SQL codebases and modern data engineering projects.
Features:
- dbt ref() extraction: Parses {{ ref('model_name') }} for model dependencies
- dbt source() extraction: Parses {{ source('schema', 'table') }} for source tracking
- Macro detection: Identifies custom macro calls (excludes builtins)
- CTE chunking: Splits WITH ... AS (...) clauses into searchable chunks
- CREATE statements: Detects TABLE/VIEW/FUNCTION/PROCEDURE definitions
- Table references: Extracts FROM/JOIN table names for dependency graphs
- Multi-dialect: Works with BigQuery, Snowflake, Databricks, Redshift, DuckDB
Uses parser: null (custom parser) since tree-sitter-sql cannot handle Jinja
templating embedded in SQL files.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
💚 CLA has been signed
Pull request overview
Adds first-class SQL parsing with dbt/Jinja awareness to enable semantic code search over SQL/dbt projects, including dependency extraction and chunking.
Changes:
- Added a custom SQL parser that extracts dbt refs/sources/macros, table references, CTE chunks, config chunks, and statement chunks.
- Registered SQL as a supported language with the `.sql` extension and added a parser metrics type for SQL.
- Added fixtures and Vitest unit tests covering dbt models/macros and pure SQL behavior.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/unit/parser-sql.test.ts | Adds unit tests for SQL parsing, chunking, and dependency extraction behaviors |
| tests/fixtures/sql_pure.sql | Adds pure SQL fixture for CREATE/CTE/table-reference parsing |
| tests/fixtures/sql_dbt_model.sql | Adds dbt model SQL fixture with config/ref/source/macro usage |
| tests/fixtures/sql_dbt_macro.sql | Adds dbt macro fixture with multiple macro definitions |
| src/utils/parser.ts | Implements custom SQL parsing, structure extraction, and dependency extraction |
| src/utils/constants.ts | Adds sql language and parser type constants |
| src/languages/sql.ts | Adds SQL language configuration using custom parser |
| src/languages/index.ts | Registers SQL configuration in supported languages |
```typescript
const macroContent = currentMacro.content.join('\n');
const macroDeps = this.extractSqlDependencies(macroContent);

macros.push({
  content: macroContent,
  type: 'macro',
  name: currentMacro.name,
  startLine: currentMacro.startLine,
  endLine: lineNum,
  dependencies: macroDeps,
});
```
Dependency line numbers for macro files are incorrect. extractSqlDependencies(macroContent) numbers lines starting at 1 inside the macro chunk, but those line numbers are later emitted as SymbolInfo.line without being offset by currentMacro.startLine - 1, so symbol lines won’t match the original file line numbers. Fix by offsetting macroDeps lines by (currentMacro.startLine - 1) (or by enhancing extractSqlDependencies to accept a lineOffset parameter).
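The suggested fix can be sketched as follows. The `SqlDependency` shape and the helper name here are illustrative assumptions, not the PR's actual code; the real types live in `src/utils/parser.ts`:

```typescript
// Hypothetical shape for illustration; the real dependency type lives in
// src/utils/parser.ts.
interface SqlDependency {
  name: string;
  type: string;
  line: number; // 1-based, relative to the chunk that was parsed
}

// Shift chunk-relative line numbers into file coordinates:
// line 1 of the chunk is line `chunkStartLine` of the file.
function offsetDependencyLines(
  deps: SqlDependency[],
  chunkStartLine: number,
): SqlDependency[] {
  return deps.map((d) => ({ ...d, line: d.line + chunkStartLine - 1 }));
}

// A macro starting at file line 10 whose second line calls another macro:
const fileDeps = offsetDependencyLines(
  [{ name: 'cents_to_dollars', type: 'macro', line: 2 }],
  10,
);
// fileDeps[0].line → 11
```

Equivalently, `extractSqlDependencies` could accept a `lineOffset` parameter and apply the same arithmetic internally, which avoids a second pass over the results.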
src/utils/parser.ts (outdated)
```typescript
// Pattern for {{ macro_name(...) }} - excluding built-in functions
const macroPattern = /\{\{\s*([a-zA-Z_][a-zA-Z0-9_]*)\s*\(/g;
const builtinMacros = new Set([
  'ref',
  'source',
  'config',
  'this',
  'var',
  'env_var',
  'target',
  'adapter',
  'return',
  'log',
  'exceptions',
  'run_query',
  'statement',
  'set',
  'do',
  'if',
  'for',
  'elif',
  'else',
  'endif',
  'endfor',
  'macro',
  'endmacro',
  'call',
  'block',
  'filter',
]);
```
The macro-call regex only captures a single identifier immediately after {{, so it will mis-handle namespaced calls like {{ dbt.date_trunc(...) }} by recording a dependency on dbt instead of date_trunc (and also incorrectly treating dbt as a custom macro). Update the pattern to support dotted identifiers and record the final segment (or explicitly treat dbt-namespaced helpers as built-ins).
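One way to address this, sketched under the assumption that recording the final segment is the desired behavior (this pattern is illustrative, not the PR's final regex):

```typescript
// Allow an optional dotted namespace before the macro name and capture both
// parts: group 1 is the namespace (if any), group 2 is the macro being called.
const namespacedMacroPattern =
  /\{\{\s*(?:([a-zA-Z_][a-zA-Z0-9_]*)\.)?([a-zA-Z_][a-zA-Z0-9_]*)\s*\(/g;

const match = namespacedMacroPattern.exec("{{ dbt.date_trunc('day', created_at) }}");
// match?.[1] → 'dbt' (namespace; can be checked against a built-in set)
// match?.[2] → 'date_trunc' (the macro actually being invoked)
```

The namespace capture (group 1) also makes it straightforward to skip built-in namespaces like `dbt` or `adapter` before recording a dependency.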
```typescript
const symbols: SymbolInfo[] = params.dependencies.map((d) => ({
  name: d.name,
  kind: `sql.${d.type}`,
  line: d.line,
}));
```
For source and table dependencies, SymbolInfo.name drops the schema/source context (e.g., raw_orders vs ecom.raw_orders), which can cause collisions and makes symbols less informative to consumers. Consider using a fully-qualified name when schema is present (e.g., ${d.schema}.${d.name}) at least for sql.source and sql.table symbols.
Suggested change:

```typescript
const symbols: SymbolInfo[] = params.dependencies.map((d) => {
  const name =
    (d.type === 'source' || d.type === 'table') && d.schema
      ? `${d.schema}.${d.name}`
      : d.name;
  return {
    name,
    kind: `sql.${d.type}`,
    line: d.line,
  };
});
```
src/utils/parser.ts (outdated)
```typescript
// Pattern for macro start: {% macro name(args) %} or {%- macro name(args) -%}
const macroStartPattern = /\{%[-\s]*macro\s+([a-zA-Z_][a-zA-Z0-9_]*)\s*\(/;
```
dbt/Jinja macros can be defined without parentheses (e.g., {% macro my_macro %}), but macroStartPattern requires (, so those macros won’t be detected/chunked. Adjust the regex to allow an optional argument list (match either ( or the %} token) so no-arg macros are still parsed.
Suggested change:

```typescript
// Pattern for macro start: {% macro name(args) %}, {%- macro name(args) -%},
// and macros without args: {% macro name %} or {%- macro name -%}
const macroStartPattern = /\{%[-\s]*macro\s+([a-zA-Z_][a-zA-Z0-9_]*)\s*(\(|-?%\})/;
```
```typescript
it('should set correct line numbers for macros', () => {
  const filePath = path.resolve(__dirname, '../fixtures/sql_dbt_macro.sql');
  const result = parser.parseFile(filePath, 'main', 'macros/cents_to_dollars.sql');

  const macroChunks = result.chunks.filter((c) => c.kind === 'macro');
  expect(macroChunks.length).toBeGreaterThan(0);
```
The tests validate macro chunk startLine/endLine, but don’t validate line numbers on extracted dependencies/symbols within those macro chunks. Given the current implementation re-parses macro content, add an assertion that chunk.symbols[*].line aligns to file line numbers (e.g., >= chunk.startLine and <= chunk.endLine, or exact expected lines for known dependencies) to prevent regressions.
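The suggested assertion amounts to checking a simple invariant. A self-contained sketch with hypothetical minimal shapes (the real `Chunk`/`SymbolInfo` types live in `src/utils/parser.ts`):

```typescript
// Minimal illustrative shapes, not the project's real types.
interface Sym { name: string; line: number }
interface Chunk { startLine: number; endLine: number; symbols: Sym[] }

// The invariant the regression test would enforce: every symbol's line number
// falls within its chunk's file-line range.
function symbolLinesWithinChunk(chunk: Chunk): boolean {
  return chunk.symbols.every(
    (s) => s.line >= chunk.startLine && s.line <= chunk.endLine,
  );
}

// Correctly offset symbol: line 5 inside a chunk spanning file lines 3-9.
const ok = symbolLinesWithinChunk({
  startLine: 3,
  endLine: 9,
  symbols: [{ name: 'cents_to_dollars', line: 5 }],
});

// Un-offset (chunk-relative) symbol: line 1 would fall before the chunk start,
// which is exactly the regression the comment describes.
const bad = symbolLinesWithinChunk({
  startLine: 3,
  endLine: 9,
  symbols: [{ name: 'cents_to_dollars', line: 1 }],
});
// ok → true, bad → false
```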
Fixes the following issues identified in code review:
1. Macro dependency line numbers - Added lineOffset parameter to
extractSqlDependencies() so line numbers within macros are
relative to the file, not the macro chunk
2. Namespaced macro calls - Updated regex to capture namespace.macro
patterns (e.g., dbt.date_trunc) and skip built-in namespaces
(dbt, adapter, exceptions)
3. Fully-qualified symbol names - Symbols with schema info now use
${schema}.${name} format to avoid collisions
4. Macros without arguments - Updated regex to match both
{% macro name(args) %} and {% macro name %}
5. Added test for dependency line number validation within chunk
boundaries
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
kapral18 left a comment:
Hey thanks for the contribution.
Overall direction looks strong (language wiring, fixtures, macro/ref/source extraction, and improved dependency line handling). The main gaps I see are a couple chunk-boundary correctness issues in the SQL state machine plus missing regression tests for those edges; addressing those should make the parser much safer to ship.
```typescript
mainStatementContent = [];
}
continue;
}
```
Single-line statements that start with SELECT/INSERT/... and already contain ; don’t get finalized on that same line because the start branch continues and the semicolon-finalization logic only runs on later lines. That merges consecutive one-line statements into one chunk (e.g. SELECT 1; + SELECT 2;). Can we finalize immediately when the start line includes ; to keep chunk boundaries correct?
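The desired behavior can be shown with a deliberately simplified sketch (hypothetical and far simpler than the PR's state machine): finalize a statement on its start line whenever that line already contains a semicolon.

```typescript
// Minimal illustrative splitter: each statement ends at the first line
// containing ';', even when that line is also the statement's start line.
function splitStatements(sql: string): string[] {
  const chunks: string[] = [];
  let current: string[] = [];
  for (const line of sql.split('\n')) {
    if (line.trim() === '') continue;
    current.push(line);
    if (line.includes(';')) {
      // Finalize immediately, so consecutive one-line statements such as
      // "SELECT 1;" and "SELECT 2;" do not merge into a single chunk.
      chunks.push(current.join('\n'));
      current = [];
    }
  }
  if (current.length > 0) chunks.push(current.join('\n'));
  return chunks;
}

// splitStatements('SELECT 1;\nSELECT 2;') → ['SELECT 1;', 'SELECT 2;']
```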
```typescript
// Skip empty lines and comments at the start
if (mainStatementStart === null && !inWith && /^\s*(--|\/\*|\s*$)/.test(line)) {
  continue;
}
```
WITH mode begins whenever withPattern matches and !inWith, even if a statement is already in progress (mainStatementStart !== null). This drops preamble lines like CREATE VIEW v AS when the next line is WITH ... (the emitted chunks start at the WITH, so the create header is lost). Can we prevent entering WITH mode when we’re already inside a statement, or otherwise preserve the preamble as part of the emitted chunks?
```typescript
// Skip empty lines and comments at the start
if (mainStatementStart === null && !inWith && /^\s*(--|\/\*|\s*$)/.test(line)) {
  continue;
```
Related edge case: for one-line queries like WITH a AS (SELECT 1) SELECT * FROM a;, the current CTE chunking can swallow the main SELECT into the cte chunk rather than splitting cte vs main statement. Would it make sense to add a small regression test + adjust parsing to split when the main query starts on the same line as the last CTE?
```typescript
]);
// Namespaces that contain built-in helpers (e.g., dbt.date_trunc)
const builtinNamespaces = new Set(['dbt', 'adapter', 'exceptions']);
```
Table extraction from FROM/JOIN won’t catch quoted identifiers like FROM "analytics"."orders", which is pretty common across dialects. If we want table refs to be useful broadly, should we extend the regex (or rely on grammar parsing) to support quoted/backticked identifiers?
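One possible extension, sketched as an assumption rather than the PR's final regex, accepts double-quoted, backticked, and bracketed identifiers (with optional dotted qualification) after `FROM`/`JOIN`:

```typescript
// One identifier segment: "quoted", `backticked`, [bracketed], or bare.
const ident = `(?:"[^"]+"|\`[^\`]+\`|\\[[^\\]]+\\]|[a-zA-Z_][a-zA-Z0-9_$]*)`;

// A possibly-qualified table reference after FROM or JOIN,
// e.g. FROM "analytics"."orders" or JOIN \`project.dataset.table\`.
const tablePattern = new RegExp(`\\b(?:FROM|JOIN)\\s+(${ident}(?:\\.${ident})*)`, 'gi');

const quoted = tablePattern.exec('SELECT * FROM "analytics"."orders"');
// quoted?.[1] → '"analytics"."orders"'
```

Dialect quoting rules differ (e.g., BigQuery backticks can wrap a whole dotted path), so a grammar-based parser would ultimately be more robust than extending the regex.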
```typescript
  });
});

describe('Pure SQL Parsing', () => {
```
We have good coverage for the dbt fixtures, but the two chunk-boundary edges (consecutive one-line statements; CREATE … AS followed by WITH …) aren’t covered. Can we add two small regression tests for those so future tweaks don’t re-break chunk boundaries?
```typescript
export const sqlConfig: LanguageConfiguration = {
  name: 'sql',
  fileSuffixes: ['.sql'],
  parser: null, // Custom parser for SQL + dbt/Jinja support
```
This currently goes through a custom parser (parser: null) with a hand-rolled SQL state machine plus regex extraction for dbt/Jinja. Would you be open to a hybrid approach: use an existing maintained tree-sitter SQL grammar (e.g. @derekstride/tree-sitter-sql) for pure SQL structure (statements/CTEs/CREATE bodies), and route templated files (containing {{ / {%) through the current custom dbt/Jinja-aware path for ref, source, and macro call extraction? That should reduce maintenance and edge-case chunking bugs long-term.
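The routing decision for such a hybrid could be as small as a delimiter check (a sketch; the function name is hypothetical): files containing Jinja delimiters keep the custom dbt-aware path, while pure SQL could be handed to a maintained grammar such as @derekstride/tree-sitter-sql.

```typescript
// Hypothetical router: templated files need the dbt/Jinja-aware path because
// {{ ... }} / {% ... %} blocks are not valid SQL for a grammar-based parser.
function isTemplatedSql(content: string): boolean {
  return content.includes('{{') || content.includes('{%');
}

// isTemplatedSql("select {{ ref('orders') }} from x") → true
// isTemplatedSql('select * from orders') → false
```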
Thank you so much @kapral18 for the thorough review! I really appreciate you taking the time to go through the code in such detail - it's clear I still have a lot to learn about SQL parsing edge cases. 😅 I've addressed the issues you raised:

Changes Made

1. Single-line statement finalization: fixed the issue where consecutive one-line statements like `SELECT 1;` + `SELECT 2;` were merged into one chunk.
2. WITH mode entering when already inside a statement: fixed so preamble lines like `CREATE VIEW v AS` are preserved when the next line starts a `WITH ...`.
3. One-line CTE handling: added handling for cases like `WITH a AS (SELECT 1) SELECT * FROM a;` so the CTE and main statement are split correctly.
4. Quoted identifier support for table extraction: extended the table extraction regex to support quoted and backticked identifiers such as `FROM "analytics"."orders"`.
5. Regression tests: added 4 new regression tests covering these edge cases.
6. Tree-sitter hybrid approach: acknowledged; added a comment noting the possible future hybrid with a maintained tree-sitter SQL grammar.

All existing tests continue to pass, and the new regression tests verify the fixes work correctly.
Summary
Adds comprehensive SQL parsing with first-class dbt (data build tool) support for semantic code search across SQL codebases and modern data engineering projects.
Why This Is Needed
The Problem:
SQL is one of the most widely used languages in data engineering, analytics, and backend development. Modern data teams using dbt have thousands of SQL files with complex dependencies that are currently invisible to semantic code search.
Key Use Cases:
Why dbt Awareness Matters:
dbt has become the industry standard for data transformation, used by thousands of companies. dbt SQL files contain Jinja templating (`{{ ref() }}`, `{{ source() }}`, `{{ macro() }}`) that standard SQL parsers can't understand. This PR extracts these as first-class dependencies.

Features
- Parses `{{ ref('model_name') }}` to track model dependencies
- Parses `{{ source('schema', 'table') }}` for source tracking
- Detects custom macro calls, excluding built-ins (e.g., `config`, `var`)
- Splits `WITH ... AS (...)` clauses into searchable chunks

Implementation Notes
- Uses `parser: null` (custom parser) since tree-sitter-sql doesn't handle Jinja templating

Test Plan