
feat: Add SQL language support with dbt/Jinja awareness #147

Open
grabowskit wants to merge 2 commits into elastic:main from grabowskit:feature/sql-dbt-parser

Conversation

@grabowskit

Summary

Adds comprehensive SQL parsing with first-class dbt (data build tool) support for semantic code search across SQL codebases and modern data engineering projects.

Why This Is Needed

The Problem:
SQL is one of the most widely used languages in data engineering, analytics, and backend development. Modern data teams using dbt have thousands of SQL files with complex dependencies that are currently invisible to semantic code search.

Key Use Cases:

  • Data Engineers need to search across dbt projects to understand model dependencies
  • Analytics Engineers want to find where specific tables/sources are referenced
  • Platform Teams need to audit macro usage across projects
  • Developers working with SQL migrations want semantic search over DDL

Why dbt Awareness Matters:
dbt has become the industry standard for data transformation, used by thousands of companies. dbt SQL files contain Jinja templating ({{ ref() }}, {{ source() }}, {{ macro() }}) that standard SQL parsers can't understand. This PR extracts these as first-class dependencies.

Features

| Feature | Description |
| --- | --- |
| dbt ref() extraction | Parses {{ ref('model_name') }} to track model dependencies |
| dbt source() extraction | Parses {{ source('schema', 'table') }} for source tracking |
| Macro detection | Identifies custom macro calls (excludes builtins like config, var) |
| CTE chunking | Splits WITH ... AS (...) clauses into searchable chunks |
| CREATE statements | Detects TABLE/VIEW/FUNCTION/PROCEDURE definitions |
| Table references | Extracts FROM/JOIN table names for dependency graphs |
| Multi-dialect | Works with BigQuery, Snowflake, Databricks, Redshift, DuckDB |
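To make the ref()/source() extraction concrete, here is a minimal sketch of the idea. The function and type names below are illustrative stand-ins, not the PR's actual API, and string/comment handling is omitted:

```typescript
// Simplified sketch of dbt ref()/source() extraction via regex.
// Names here (DbtDependency, extractDbtDeps) are hypothetical.
interface DbtDependency {
  type: "ref" | "source";
  name: string;
  schema?: string;
}

function extractDbtDeps(sql: string): DbtDependency[] {
  const deps: DbtDependency[] = [];
  // {{ ref('model_name') }}
  const refPattern = /\{\{\s*ref\(\s*['"]([^'"]+)['"]\s*\)\s*\}\}/g;
  for (const m of sql.matchAll(refPattern)) {
    deps.push({ type: "ref", name: m[1] });
  }
  // {{ source('schema', 'table') }}
  const sourcePattern =
    /\{\{\s*source\(\s*['"]([^'"]+)['"]\s*,\s*['"]([^'"]+)['"]\s*\)\s*\}\}/g;
  for (const m of sql.matchAll(sourcePattern)) {
    deps.push({ type: "source", schema: m[1], name: m[2] });
  }
  return deps;
}
```

A dbt model line like `select * from {{ ref('stg_orders') }}` would yield a single `ref` dependency on `stg_orders`.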

Implementation Notes

  • Uses parser: null (custom parser) since tree-sitter-sql doesn't handle Jinja templating
  • Regex-based extraction for dbt constructs is intentional: no tree-sitter grammar handles Jinja templating embedded in SQL
  • CTE detection uses parenthesis balancing to handle nested queries
  • Built-in dbt functions (config, var, env_var, etc.) are excluded from macro detection
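The parenthesis-balancing idea mentioned above can be sketched as follows. This is a simplified illustration, not the PR's implementation; it ignores parentheses inside string literals and comments, which a real parser would need to handle:

```typescript
// Sketch: find where a CTE body ends via parenthesis balancing.
// Returns the index just past the matching ')' for the '(' at openIdx,
// or -1 if the parentheses never balance.
function findCteEnd(sql: string, openIdx: number): number {
  let depth = 0;
  for (let i = openIdx; i < sql.length; i++) {
    if (sql[i] === "(") depth++;
    else if (sql[i] === ")") {
      depth--;
      if (depth === 0) return i + 1; // matching close found
    }
  }
  return -1; // unbalanced
}
```

Because nested subqueries only increase and then decrease the depth counter, the first return to depth 0 marks the end of the outermost CTE body.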

Test Plan

  • All existing tests pass (337 tests)
  • New SQL parser tests pass (16 tests):
    • dbt model parsing (ref, source, macro extraction)
    • dbt macro file parsing
    • Pure SQL parsing (CREATE, CTEs, table refs)
    • File extension recognition
    • Directory information extraction

Adds comprehensive SQL parsing with first-class dbt (data build tool) support
for semantic code search across SQL codebases and modern data engineering projects.

Features:
- dbt ref() extraction: Parses {{ ref('model_name') }} for model dependencies
- dbt source() extraction: Parses {{ source('schema', 'table') }} for source tracking
- Macro detection: Identifies custom macro calls (excludes builtins)
- CTE chunking: Splits WITH ... AS (...) clauses into searchable chunks
- CREATE statements: Detects TABLE/VIEW/FUNCTION/PROCEDURE definitions
- Table references: Extracts FROM/JOIN table names for dependency graphs
- Multi-dialect: Works with BigQuery, Snowflake, Databricks, Redshift, DuckDB

Uses parser: null (custom parser) since tree-sitter-sql cannot handle Jinja
templating embedded in SQL files.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

cla-checker-service bot commented Feb 10, 2026

💚 CLA has been signed

@simianhacker simianhacker requested a review from Copilot February 10, 2026 02:55

Copilot AI left a comment


Pull request overview

Adds first-class SQL parsing with dbt/Jinja awareness to enable semantic code search over SQL/dbt projects, including dependency extraction and chunking.

Changes:

  • Added a custom SQL parser that extracts dbt refs/sources/macros, table references, CTE chunks, config chunks, and statement chunks.
  • Registered SQL as a supported language with .sql extension and added parser metrics type for SQL.
  • Added fixtures and Vitest unit tests covering dbt models/macros and pure SQL behavior.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/unit/parser-sql.test.ts Adds unit tests for SQL parsing, chunking, and dependency extraction behaviors
tests/fixtures/sql_pure.sql Adds pure SQL fixture for CREATE/CTE/table-reference parsing
tests/fixtures/sql_dbt_model.sql Adds dbt model SQL fixture with config/ref/source/macro usage
tests/fixtures/sql_dbt_macro.sql Adds dbt macro fixture with multiple macro definitions
src/utils/parser.ts Implements custom SQL parsing, structure extraction, and dependency extraction
src/utils/constants.ts Adds sql language and parser type constants
src/languages/sql.ts Adds SQL language configuration using custom parser
src/languages/index.ts Registers SQL configuration in supported languages


Comment on lines 1078 to 1088
const macroContent = currentMacro.content.join('\n');
const macroDeps = this.extractSqlDependencies(macroContent);

macros.push({
  content: macroContent,
  type: 'macro',
  name: currentMacro.name,
  startLine: currentMacro.startLine,
  endLine: lineNum,
  dependencies: macroDeps,
});

Copilot AI Feb 10, 2026


Dependency line numbers for macro files are incorrect. extractSqlDependencies(macroContent) numbers lines starting at 1 inside the macro chunk, but those line numbers are later emitted as SymbolInfo.line without being offset by currentMacro.startLine - 1, so symbol lines won’t match the original file line numbers. Fix by offsetting macroDeps lines by (currentMacro.startLine - 1) (or by enhancing extractSqlDependencies to accept a lineOffset parameter).
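The offset fix this comment asks for can be sketched as a small re-basing step. `Dep` is a hypothetical stand-in for the PR's dependency type:

```typescript
// Sketch: re-base dependency line numbers from chunk-relative to
// file-relative coordinates, as the review suggests.
interface Dep {
  name: string;
  line: number;
}

function offsetDeps(deps: Dep[], chunkStartLine: number): Dep[] {
  // Lines inside the chunk are numbered from 1, so shifting by
  // (chunkStartLine - 1) maps them back onto file line numbers.
  return deps.map((d) => ({ ...d, line: d.line + chunkStartLine - 1 }));
}
```

For a macro chunk starting at file line 10, a dependency found on chunk line 2 maps to file line 11.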

Comment on lines 746 to 775
// Pattern for {{ macro_name(...) }} - excluding built-in functions
const macroPattern = /\{\{\s*([a-zA-Z_][a-zA-Z0-9_]*)\s*\(/g;
const builtinMacros = new Set([
  'ref', 'source', 'config', 'this', 'var', 'env_var', 'target',
  'adapter', 'return', 'log', 'exceptions', 'run_query', 'statement',
  'set', 'do', 'if', 'for', 'elif', 'else', 'endif', 'endfor',
  'macro', 'endmacro', 'call', 'block', 'filter',
]);

Copilot AI Feb 10, 2026


The macro-call regex only captures a single identifier immediately after {{, so it will mis-handle namespaced calls like {{ dbt.date_trunc(...) }} by recording a dependency on dbt instead of date_trunc (and also incorrectly treating dbt as a custom macro). Update the pattern to support dotted identifiers and record the final segment (or explicitly treat dbt-namespaced helpers as built-ins).
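One possible shape for the dotted-identifier pattern, as a sketch of this suggestion (not the merged code; the full built-in *name* filter from the original snippet is omitted for brevity):

```typescript
// Sketch: capture an optional namespace plus the macro name, then treat
// built-in namespaces (dbt, adapter, exceptions) as non-custom calls.
const builtinNamespaces = new Set(["dbt", "adapter", "exceptions"]);
const macroPattern =
  /\{\{\s*(?:([a-zA-Z_][a-zA-Z0-9_]*)\.)?([a-zA-Z_][a-zA-Z0-9_]*)\s*\(/g;

function findCustomMacros(sql: string): string[] {
  const names: string[] = [];
  for (const m of sql.matchAll(macroPattern)) {
    const [, ns, name] = m;
    if (ns && builtinNamespaces.has(ns)) continue; // built-in helper
    names.push(name);
  }
  return names;
}
```

With this, `{{ dbt.date_trunc(...) }}` is skipped as a built-in helper while `{{ cents_to_dollars(...) }}` is still recorded as a custom macro call.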

Comment on lines 1149 to 1153
const symbols: SymbolInfo[] = params.dependencies.map((d) => ({
  name: d.name,
  kind: `sql.${d.type}`,
  line: d.line,
}));

Copilot AI Feb 10, 2026


For source and table dependencies, SymbolInfo.name drops the schema/source context (e.g., raw_orders vs ecom.raw_orders), which can cause collisions and makes symbols less informative to consumers. Consider using a fully-qualified name when schema is present (e.g., ${d.schema}.${d.name}) at least for sql.source and sql.table symbols.

Suggested change
const symbols: SymbolInfo[] = params.dependencies.map((d) => ({
  name: d.name,
  kind: `sql.${d.type}`,
  line: d.line,
}));
const symbols: SymbolInfo[] = params.dependencies.map((d) => {
  const name =
    (d.type === 'source' || d.type === 'table') && d.schema
      ? `${d.schema}.${d.name}`
      : d.name;
  return {
    name,
    kind: `sql.${d.type}`,
    line: d.line,
  };
});

Comment on lines 1062 to 1063
// Pattern for macro start: {% macro name(args) %} or {%- macro name(args) -%}
const macroStartPattern = /\{%[-\s]*macro\s+([a-zA-Z_][a-zA-Z0-9_]*)\s*\(/;

Copilot AI Feb 10, 2026


dbt/Jinja macros can be defined without parentheses (e.g., {% macro my_macro %}), but macroStartPattern requires (, so those macros won’t be detected/chunked. Adjust the regex to allow an optional argument list (match either ( or the %} token) so no-arg macros are still parsed.

Suggested change
// Pattern for macro start: {% macro name(args) %} or {%- macro name(args) -%}
const macroStartPattern = /\{%[-\s]*macro\s+([a-zA-Z_][a-zA-Z0-9_]*)\s*\(/;
// Pattern for macro start: {% macro name(args) %}, {%- macro name(args) -%},
// and macros without args: {% macro name %} or {%- macro name -%}
const macroStartPattern = /\{%[-\s]*macro\s+([a-zA-Z_][a-zA-Z0-9_]*)\s*(\(|-?%\})/;

Comment on lines +100 to +105
it('should set correct line numbers for macros', () => {
  const filePath = path.resolve(__dirname, '../fixtures/sql_dbt_macro.sql');
  const result = parser.parseFile(filePath, 'main', 'macros/cents_to_dollars.sql');

  const macroChunks = result.chunks.filter((c) => c.kind === 'macro');
  expect(macroChunks.length).toBeGreaterThan(0);

Copilot AI Feb 10, 2026


The tests validate macro chunk startLine/endLine, but don’t validate line numbers on extracted dependencies/symbols within those macro chunks. Given the current implementation re-parses macro content, add an assertion that chunk.symbols[*].line aligns to file line numbers (e.g., >= chunk.startLine and <= chunk.endLine, or exact expected lines for known dependencies) to prevent regressions.
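A standalone version of the suggested bounds check might look like this (field names are assumed from the PR description, not taken from the actual test file):

```typescript
// Sketch: symbol line numbers extracted from a macro chunk must be
// file-relative, i.e. fall within the chunk's own line boundaries.
interface Sym {
  name: string;
  line: number;
}
interface Chunk {
  startLine: number;
  endLine: number;
  symbols: Sym[];
}

function symbolsWithinBounds(chunk: Chunk): boolean {
  return chunk.symbols.every(
    (s) => s.line >= chunk.startLine && s.line <= chunk.endLine,
  );
}
```

In a Vitest test, this check would become an `expect(symbolsWithinBounds(chunk)).toBe(true)` assertion per macro chunk, which would have caught the chunk-relative line-number bug above.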

Fixes the following issues identified in code review:

1. Macro dependency line numbers - Added lineOffset parameter to
   extractSqlDependencies() so line numbers within macros are
   relative to the file, not the macro chunk

2. Namespaced macro calls - Updated regex to capture namespace.macro
   patterns (e.g., dbt.date_trunc) and skip built-in namespaces
   (dbt, adapter, exceptions)

3. Fully-qualified symbol names - Symbols with schema info now use
   ${schema}.${name} format to avoid collisions

4. Macros without arguments - Updated regex to match both
   {% macro name(args) %} and {% macro name %}

5. Added test for dependency line number validation within chunk
   boundaries

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@kapral18 kapral18 left a comment


Hey thanks for the contribution.

Overall direction looks strong (language wiring, fixtures, macro/ref/source extraction, and improved dependency line handling). The main gaps I see are a couple chunk-boundary correctness issues in the SQL state machine plus missing regression tests for those edges; addressing those should make the parser much safer to ship.

mainStatementContent = [];
}
continue;
}

Single-line statements that start with SELECT/INSERT/... and already contain ; don’t get finalized on that same line because the start branch continues and the semicolon-finalization logic only runs on later lines. That merges consecutive one-line statements into one chunk (e.g. SELECT 1; + SELECT 2;). Can we finalize immediately when the start line includes ; to keep chunk boundaries correct?
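The requested fix can be sketched as follows (illustrative only; the real state machine tracks statement kinds, WITH mode, and more state than this):

```typescript
// Sketch: finalize a statement immediately when its start line already
// contains the terminating semicolon, so consecutive one-line
// statements ("SELECT 1;" then "SELECT 2;") become separate chunks.
function splitStatements(lines: string[]): string[][] {
  const chunks: string[][] = [];
  let current: string[] = [];
  for (const line of lines) {
    current.push(line);
    if (line.includes(";")) {
      // Finalize on the same line the semicolon appears, even if it
      // is also the line the statement started on.
      chunks.push(current);
      current = [];
    }
  }
  if (current.length > 0) chunks.push(current); // trailing unterminated statement
  return chunks;
}
```

The bug described above corresponds to only checking for `;` on lines *after* the start line, which merges adjacent one-liners.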

// Skip empty lines and comments at the start
if (mainStatementStart === null && !inWith && /^\s*(--|\/\*|\s*$)/.test(line)) {
continue;
}

WITH mode begins whenever withPattern matches and !inWith, even if a statement is already in progress (mainStatementStart !== null). This drops preamble lines like CREATE VIEW v AS when the next line is WITH ... (the emitted chunks start at the WITH, so the create header is lost). Can we prevent entering WITH mode when we’re already inside a statement, or otherwise preserve the preamble as part of the emitted chunks?


// Skip empty lines and comments at the start
if (mainStatementStart === null && !inWith && /^\s*(--|\/\*|\s*$)/.test(line)) {
continue;

Related edge case: for one-line queries like WITH a AS (SELECT 1) SELECT * FROM a;, the current CTE chunking can swallow the main SELECT into the cte chunk rather than splitting cte vs main statement. Would it make sense to add a small regression test + adjust parsing to split when the main query starts on the same line as the last CTE?

]);
// Namespaces that contain built-in helpers (e.g., dbt.date_trunc)
const builtinNamespaces = new Set(['dbt', 'adapter', 'exceptions']);


Table extraction from FROM/JOIN won’t catch quoted identifiers like FROM "analytics"."orders", which is pretty common across dialects. If we want table refs to be useful broadly, should we extend the regex (or rely on grammar parsing) to support quoted/backticked identifiers?
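A pattern covering quoted, backticked, and bracketed identifiers might look like this (a sketch of the suggestion; dialect-specific escaping rules inside identifiers are ignored):

```typescript
// Sketch: FROM/JOIN table extraction that also accepts "quoted",
// `backticked`, and [bracketed] identifiers, with up to three
// dot-separated parts (db.schema.table). Illustrative only.
const tablePattern =
  /\b(?:FROM|JOIN)\s+((?:"[^"]+"|`[^`]+`|\[[^\]]+\]|[a-zA-Z_]\w*)(?:\.(?:"[^"]+"|`[^`]+`|\[[^\]]+\]|[a-zA-Z_]\w*)){0,2})/gi;

function extractTables(sql: string): string[] {
  return [...sql.matchAll(tablePattern)].map((m) => m[1]);
}
```

A downstream consumer would likely also want to strip the quoting characters so that `"analytics"."orders"` and `analytics.orders` normalize to the same dependency.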

});
});

describe('Pure SQL Parsing', () => {

We have good coverage for the dbt fixtures, but the two chunk-boundary edges (consecutive one-line statements; CREATE … AS followed by WITH …) aren’t covered. Can we add two small regression tests for those so future tweaks don’t re-break chunk boundaries?

export const sqlConfig: LanguageConfiguration = {
name: 'sql',
fileSuffixes: ['.sql'],
parser: null, // Custom parser for SQL + dbt/Jinja support

This currently goes through a custom parser (parser: null) with a hand-rolled SQL state machine plus regex extraction for dbt/Jinja. Would you be open to a hybrid approach: use an existing maintained tree-sitter SQL grammar (e.g. @derekstride/tree-sitter-sql) for pure SQL structure (statements/CTEs/CREATE bodies), and route templated files (containing {{ / {%) through the current custom dbt/Jinja-aware path for ref, source, and macro call extraction? That should reduce maintenance and edge-case chunking bugs long-term.
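The proposed routing is straightforward to sketch: files containing Jinja delimiters go through the custom dbt-aware path, everything else could go to a tree-sitter grammar. The parser labels below are hypothetical:

```typescript
// Sketch of the proposed hybrid routing. A file containing Jinja
// delimiters ({{ expressions or {% statements) needs the custom
// dbt/Jinja-aware path; pure SQL could use a tree-sitter grammar.
function isTemplated(sql: string): boolean {
  return sql.includes("{{") || sql.includes("{%");
}

function chooseParser(sql: string): "jinja-aware" | "tree-sitter" {
  return isTemplated(sql) ? "jinja-aware" : "tree-sitter";
}
```

The appeal of this split is that the hand-rolled state machine only has to stay correct for templated files, while pure-SQL chunk boundaries are delegated to a maintained grammar.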

@grabowskit
Author

Thank you so much @kapral18 for the thorough review! I really appreciate you taking the time to go through the code in such detail - it's clear I still have a lot to learn about SQL parsing edge cases. 😅

I've addressed the issues you raised:

Changes Made

1. Single-line statement finalization (consecutive one-line statements)

Fixed the issue where consecutive one-line statements like SELECT 1; + SELECT 2; were being merged into one chunk. Now when a statement starts and includes a semicolon on the same line, it's finalized immediately.

2. WITH mode entering when already inside a statement

Fixed the CREATE VIEW v AS WITH ... case where the preamble was being lost. Now the parser only enters WITH mode when not already inside a statement, so CREATE VIEW headers are preserved.

3. One-line CTE handling

Added handling for cases like WITH a AS (SELECT 1) SELECT * FROM a; where the main query starts on the same line as the last CTE closes.

4. Quoted identifier support for table extraction

Extended the table extraction regex to support:

  • Double-quoted identifiers (PostgreSQL, BigQuery): "schema"."table"
  • Backticked identifiers (MySQL, BigQuery): `schema`.`table`
  • Square bracket identifiers (SQL Server): [schema].[table]

5. Regression tests

Added 4 new regression tests covering:

  • Consecutive one-line statements not being merged
  • CREATE preamble preservation with WITH clause
  • One-line CTE with main query on same line
  • Quoted table identifier extraction

6. Tree-sitter hybrid approach acknowledgment

Added a comment in sql.ts documenting the suggestion to consider a hybrid approach using tree-sitter for pure SQL structure and the custom parser for dbt/Jinja templated files. This is a great idea for reducing long-term maintenance!

All existing tests continue to pass, and the new regression tests verify the fixes work correctly.
