mr master by shenhuan2021 · Pull Request #94 · sqlparser/sqlflow_public

shenhuan2021 · 2026-02-23T13:10:26Z

No description provided.

get csv lineage

This commit completely revamps the `1-introduction.md` document to make it more accessible and understandable for users who are new to the concept of data lineage. Key changes include:
- **Simplified Language:** Replaced technical jargon with simpler terms and analogies, such as describing data lineage as a "family tree for your data."
- **Clearer Structure:** The document is now organized to flow from high-level concepts to specific details, starting with "What is Data Lineage?" and then diving into core components and relationship types.
- **Improved Examples:** Provided clearer, well-annotated SQL examples for direct data flow (`fdd`), indirect/impact relationships (`fdr`), and `join` conditions.
- **Added Context:** Included the practical benefits of data lineage, such as for impact analysis, troubleshooting, and data governance, to motivate the reader.

Adds a new section to the end of the data lineage introduction to give users a forward-looking preview of the next-generation (v2) data lineage model. This section:
- Briefly introduces the motivation for the new, more precise schema.
- Provides a mapping from the current concepts (`fdd`, `fdr`) to the new, more explicit relationship types (`data_flow`, `restricts`, `groups`).
- Highlights key improvements such as enhanced traceability through `observations`, detailed transformation logic with `transforms`, and robust object identification using `qualifiedName`.
- Uses a practical example to illustrate the added clarity of the new model.

This commit improves the direct dataflow documentation by adding clear effectType annotations for both v1 and v2 models. Key changes:
- Added effectType annotations to each example:
  - v1: Shows operation kind (select, function, insert)
  - v2: Shows copy/transformation strength (EXACT_COPY, WEAK_COPY)
- Added brief rationales for each effectType (e.g., "alias passthrough; no semantic change")
- Clarified how v2 handles function transformations via transforms.code
- Enhanced the "Look Ahead" section with more details on v2's effectType usage
- Maintained beginner-friendly explanations while adding technical precision
These changes help readers understand both the operation type (v1) and the semantic strength (v2) of data flows, making the documentation more precise and intuitive.

…ith examples
- Clarified the concept of indirect dataflow and the purpose of the RelationRows pseudo-column.
- Added effectType notes for v1 (operation kind) and v2 (copy/aggregation strength).
- Expanded sections with beginner-friendly explanations and diagram references.
- Introduced a detailed “A Look Ahead” v2 mapping:
  - fdr → restricts/groups
  - Aggregates via data_flow + AGGREGATION with transforms.code
  - Traceability via observations and statementKey
- Appended an analysis section evaluating RelationRows (pros/cons) and best-practice v2 replacements.
- Included side-by-side v1 vs v2 examples for:
  - WHERE filter
  - Aggregates with GROUP BY
  - Table rename (table→table lineage)
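The v1 → v2 mapping named in these notes (`fdd` → `data_flow`, `fdr` → `restricts`/`groups`) can be sketched as a small lookup. This is an illustration only: the function name `map_v1_relation` and the `clause` argument are hypothetical and are not part of SQLFlow or its schema.

```python
# Illustrative sketch; map_v1_relation and its arguments are assumptions,
# not SQLFlow API. Only the relation names come from the commit notes.
def map_v1_relation(v1_type: str, clause: str = "") -> str:
    """Return the v2 relationship type suggested by the commit notes."""
    if v1_type == "fdd":
        # Direct value movement stays a data flow in v2.
        return "data_flow"
    if v1_type == "fdr":
        # v2 splits indirect influence by the clause that caused it.
        if clause.strip().upper() == "GROUP BY":
            return "groups"
        return "restricts"  # e.g. a WHERE filter
    raise ValueError(f"unknown v1 relation type: {v1_type}")


print(map_v1_relation("fdd"))
print(map_v1_relation("fdr", "WHERE"))
print(map_v1_relation("fdr", "GROUP BY"))
```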

- Clarified indirect dataflow and the purpose/limits of RelationRows.
- Added effectType notes (v1: operation kind; v2: EXACT_COPY/WEAK_COPY/AGGREGATION/none for restricts).
- Added concrete GROUP BY example with side-by-side v1 (RelationRows, fdd/fdr) vs v2 (data_flow + AGGREGATION, groups, exact passthrough).
- Added table rename and COUNT/SUM patterns with v1 vs v2 relationships.
- Introduced a “Does v2 resolve RelationRows?” section explaining how v2 replaces RelationRows with restricts/groups/table-level data_flow, transforms, and observations; optional hidden metric if needed.

… mapping, examples, and best practices
- Standardize wording/capitalization; fix minor typos in 04-indirect-dataflow-where-group-by.md
- Embed v2 explanations immediately after each v1 sample (COUNT/SUM with/without GROUP BY)
- Add mapping table, effectType guidance, and schema references
- Annotate v2 edges with transforms and effectType: AGGREGATION
- Append expert best-practices section; deprecate RelationRows in favor of data_flow + restricts + groups
- Update Summary to align with v2 modeling and governance-focused clarity

- Rewrote 06-dataflow-chain.md with clear v1/v2 sections and simple language
- v1: explained dbobjs/relations, fdd vs fdr, and effectType: select
- v2: emphasized atomic 1→1 edges, restricts/groups/data_flow, observations/transforms
- Replaced deprecated RelationRows with v2 lineageObjects:
  - restricts: Employees.ManagerID -> temp.cteReports (condition: "ManagerID IS NULL")
  - data_flow: COUNT/SUM to temp.result columns with transforms.code
- Added effectType annotations (AGGREGATION, AMBIGUOUS) and rationale
- Clarified temporary objects and deterministic qualifiedName guidance
- Kept diagram and added links to v1/v2 schema and design docs

…message correctly, then provide a concise, standards-compliant commit message summarizing the terminology update and intro additions. [1 tool called]
Temporary result set: rename, add intro, align with v2
- Replace “intermediate result set” with “temporary result set”
- Add beginner-friendly intro explaining concept and lifecycle
- Map to v1 (<resultset> types) and v2 (lineageObjects with isTemporary)
- Adjust headings and examples; fix minor grammar (“there” → “three”)
- Clarify that UI can hide temporary result sets in large graphs

- Define transforms in simple terms and add clear examples
- Map to v2 schema: relationships.transforms, effectType, observations
- Add SQL + minimal v2 JSON examples (EXACT_COPY, WEAK_COPY, AGGREGATION, PARTIAL_COPY)
- Retain v1 XML example and explain mapping to v2
- Add practical tips: coordinates, statementKey, setting effectType

- Explain v1 joins using fdd/fdr/join relations and simple example
- Model v2 with 1→1 data_flow (values) and restricts (join influence)
- Cover INNER/LEFT/RIGHT/FULL/CROSS joins; note composite keys
- Provide minimal v2 JSON edge snippets with statementKey/sqlCoordinates
- Map v1 → v2 and add UI/governance tips for readability and traceability
Files:
- sqlflow_public/doc/basic-concepts/08-join-relation.md

- Introduce ER basics for beginners
- Show v1 type="er" FK→PK example for structural link
- Define v2: use has (table→column) for structure; optional restricts for FK influence; keep separate from data_flow
- Include minimal v2 JSON snippets, mapping tips, and UI guidance
Files:
- sqlflow_public/doc/basic-concepts/12-er-diagram.md

Clarify the explanation of the v1 data lineage model for GROUP BY aggregations in the introduction document. The previous text incorrectly implied that the v1 model only captured an indirect relationship (fdr) from the grouping key. This change corrects the explanation to show that v1 captures both a direct data flow (fdd) from the aggregated column and an indirect influence (fdr) from the grouping column. This provides a more accurate comparison with the v2 model, better highlighting its improvements in relationship specificity, such as the 'groups' type and 'effectType' attribute.
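To make the corrected GROUP BY description concrete, the sketch below lists the relationships for one query in both models. The query, table, and column names are made up for illustration, and the dictionaries are a simplified stand-in, not the actual v1 XML or v2 JSON layout; only the relationship names (`fdd`, `fdr`, `data_flow`, `groups`), `effectType: AGGREGATION`, and `transforms.code` come from the commit notes.

```python
# Hypothetical query and names; the dict layout is a simplification,
# not the real SQLFlow output format.
SQL = "SELECT SUM(sal) AS total_sal FROM emp GROUP BY deptno"

# v1: a direct flow from the aggregated column plus an indirect
# influence from the grouping column.
v1_relations = [
    {"type": "fdd", "source": "emp.sal", "target": "total_sal"},
    {"type": "fdr", "source": "emp.deptno", "target": "total_sal"},
]

# v2: the same two edges, with more specific relationship types.
v2_relationships = [
    {
        "type": "data_flow",
        "source": "emp.sal",
        "target": "total_sal",
        "effectType": "AGGREGATION",
        "transforms": [{"code": "SUM(sal)"}],
    },
    {"type": "groups", "source": "emp.deptno", "target": "total_sal"},
]

for old, new in zip(v1_relations, v2_relationships):
    print(f'{old["source"]}: {old["type"]} -> {new["type"]}')
```

Both models record two edges into `total_sal`; the difference is that v2 names the grouping influence explicitly instead of folding it into a generic `fdr`.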

Reorganize 03-indirect-dataflow-and-pseudo-column.md for easier navigation:
- Add clear intro on indirect dataflow
- Create same-level sections: WHERE Clause, GROUP BY Clause, RelationRows
- Consolidate v1/v2 explanations and examples under WHERE and GROUP BY
- Add “Aggregates Without GROUP BY” subsection for table-level aggregation
- Clarify RelationRows usage in v1 and deprecation guidance in v2
- Keep effectType notes consistent and preserve references

- Emphasize v1 treats functions as lineage objects/nodes; flows use fdd/fdr
- Emphasize v2 attaches functions to relationships via transforms.code
- Note: UDFs/TVFs get lineage objects in v2 for dependency (calls), but value movement remains on data_flow edges (not as dataflow nodes)
- Rename and expand 2.2 to cover all UDFs (scalar + TVF) with concise examples
- Update overview bullets to reflect the modeling shift and key differences

Refine the documentation for data lineage basic concepts to improve clarity, accuracy, and alignment with industry best practices. The main changes include:
- Update the basic concepts README with an introduction and an accurate list of all conceptual documents.
- Enhance the JOIN modeling document with more detailed examples, including SEMI and ANTI joins, and clarify v2 best practices.
- Append a new, comprehensive section in Chinese to the v2 design explanation document, detailing the recommended approach for modeling various JOIN types.
This provides a more robust and easier-to-understand guide for users, especially regarding the nuanced topic of JOIN lineage.

…ineage-in-bigquery-views-a-practical-guide-for-the-select-challenge/

Explain how CASE expressions are modeled in v1 and v2 data lineage. Clarify that conditional columns are direct sources even if results are constants. Include SQL examples for both versions. Co-authored-by: Cursor <cursoragent@cursor.com>
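A minimal sketch of the CASE rule described above, assuming a made-up query and table: the column tested in the `WHEN` condition is recorded as a direct source of the output column even though every branch returns a constant. The helper `direct_sources_of` and the dictionary layout are hypothetical and do not reflect SQLFlow's output format.

```python
# Hypothetical query; names are for illustration only.
SQL = """
SELECT CASE WHEN status = 'A' THEN 1 ELSE 0 END AS is_active
FROM orders
"""

# Both branches are constants, yet orders.status still determines the
# value, so it is listed as a direct source of is_active.
direct_edges = [
    {"source": "orders.status", "target": "is_active"},
]


def direct_sources_of(target: str) -> list[str]:
    """Return the source columns with a direct edge into `target`."""
    return [e["source"] for e in direct_edges if e["target"] == target]


print(direct_sources_of("is_active"))
```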

shenhuan2021 and others added 30 commits June 2, 2024 14:55

Merge pull request #87 from sqlparser/feature/shenhuan

a915fab

get csv lineage

Update getcsv.py

2de9db6

Update GenerateDataLineageDemo.py

946e2b8

Update getcsv.py

9aeb3e8

Update getcsv.py

9a85afc

Update getcsv.py

36e8902

Update GenerateDataLineageDemo.py

4018232

Update GenerateTokenDemo.py

0cf11eb

Rename GenerateTokenDemo.py to GenerateToken.py

ce96a2f

Update GenerateToken.py

f90ad0e

Update getcsv.py

2487454

Create GenerateLineageParam.py

54f7ce0

Update GenerateDataLineageDemo.py

c417163

Update getcsv.py

0bf3949

Update getcsv.py

cf2aeeb

Update GenerateToken.py

9b12156

Update GenerateDataLineageDemo.py

2a814d1

fix typo

8b1c6d3

add more doc about lineage model

0279842

add document for identifier and string literal

d0f193a

add document for identifier and string literal

5d4c869

add document for identifier and string literal

60366f7

sql server proc return record set

5c1b935

Update identifier-and-string-literal.md

187757f

Update identifier-and-string-literal.md

f834029

add python demo to illustrates how to get token and call the api

fa58dc4

Create CheckSyntax.py

bd9fe4d

Update CheckSyntax.py

00f9b2f

Update and rename CheckSyntax.py to checksyntax.py

7f355c5

Create toxml.py

0d1f187

sqlparser and others added 29 commits February 21, 2025 12:34

add instructions about how to remove relationship in where clause

5c4db13

Initial content for MkDocs

f569492

github action for mkdocs

eac6ec6

github action for mkdocs 1

ad206cb

doc for gsp, use release/docs branch to build site

d5faf93

udpate gsp and sqlflow doc

e5a0fef

rename action name

05545ef

Start to add data lineage schema v2

be0259c

init for gudu sql omni

6393f2a

add images for gudu sql omni

58a43d9

update gudu sql omni readme

e8b1489

refine license file

6a8e857

add blog image https://www.dpriver.com/blog/2025/10/tracking-column-l…

4200b01

…ineage-in-bigquery-views-a-practical-guide-for-the-select-challenge/

Add CASE expression lineage documentation

a391785

shenhuan2021 merged commit 1d21608 into feature/shenhuan Feb 23, 2026

shenhuan2021 merged 86 commits into feature/shenhuan from master
