
mr master #94

Merged
shenhuan2021 merged 86 commits into feature/shenhuan from master
Feb 23, 2026

Conversation

@shenhuan2021
Collaborator

No description provided.

sqlparser and others added 29 commits February 21, 2025 12:34
This commit completely revamps the `1-introduction.md` document to make it more accessible and understandable for users who are new to the concept of data lineage.

Key changes include:
-   **Simplified Language:** Replaced technical jargon with simpler terms and analogies, such as describing data lineage as a "family tree for your data."
-   **Clearer Structure:** The document is now organized to flow from high-level concepts to specific details, starting with "What is Data Lineage?" and then diving into core components and relationship types.
-   **Improved Examples:** Provided clearer, well-annotated SQL examples for direct data flow (`fdd`), indirect/impact relationships (`fdr`), and `join` conditions.
-   **Added Context:** Included the practical benefits of data lineage, such as for impact analysis, troubleshooting, and data governance, to motivate the reader.
Adds a new section to the end of the data lineage introduction to give users a forward-looking preview of the next-generation (v2) data lineage model.

This section:
-   Briefly introduces the motivation for the new, more precise schema.
-   Provides a mapping from the current concepts (`fdd`, `fdr`) to the new, more explicit relationship types (`data_flow`, `restricts`, `groups`).
-   Highlights key improvements such as enhanced traceability through `observations`, detailed transformation logic with `transforms`, and robust object identification using `qualifiedName`.
-   Uses a practical example to illustrate the added clarity of the new model.
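
The v1-to-v2 mapping described in this commit can be sketched roughly as follows. This is a minimal illustration, not the project's code: the function name `map_v1_relation` and its `clause` argument are hypothetical, and the rule that an `fdr` from a GROUP BY becomes `groups` while any other `fdr` becomes `restricts` is an assumption inferred from the commit messages in this PR.

```python
# Hypothetical helper: the function name and the `clause` argument are
# illustrative assumptions, not part of the documented lineage schema.
def map_v1_relation(v1_type: str, clause: str = "") -> str:
    """Return a v2 relationship type for a v1 relation type."""
    if v1_type == "fdd":
        # Direct data flow keeps its meaning in v2.
        return "data_flow"
    if v1_type == "fdr":
        # Assumed split: GROUP BY influence becomes `groups`,
        # any other indirect influence (e.g. WHERE) becomes `restricts`.
        if clause.strip().upper() == "GROUP BY":
            return "groups"
        return "restricts"
    raise ValueError(f"unknown v1 relation type: {v1_type!r}")
```

For example, `map_v1_relation("fdr", "where")` yields `restricts` under these assumptions.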
This commit improves the direct dataflow documentation by adding clear effectType annotations for both v1 and v2 models.

Key changes:
- Added effectType annotations to each example:
  - v1: Shows operation kind (select, function, insert)
  - v2: Shows copy/transformation strength (EXACT_COPY, WEAK_COPY)
- Added brief rationales for each effectType (e.g., "alias passthrough; no semantic change")
- Clarified how v2 handles function transformations via transforms.code
- Enhanced the "Look Ahead" section with more details on v2's effectType usage
- Maintained beginner-friendly explanations while adding technical precision

These changes help readers understand both the operation type (v1) and the semantic strength (v2) of data flows, making the documentation more precise and intuitive.
…ith examples

- Clarified the concept of indirect dataflow and the purpose of the RelationRows pseudo-column.
- Added effectType notes for v1 (operation kind) and v2 (copy/aggregation strength).
- Expanded sections with beginner-friendly explanations and diagram references.
- Introduced a detailed “A Look Ahead” v2 mapping:
  - fdr → restricts/groups
  - Aggregates via data_flow + AGGREGATION with transforms.code
  - Traceability via observations and statementKey
- Appended an analysis section evaluating RelationRows (pros/cons) and best-practice v2 replacements.
- Included side-by-side v1 vs v2 examples for:
  - WHERE filter
  - Aggregates with GROUP BY
  - Table rename (table→table lineage)
- Clarified indirect dataflow and the purpose/limits of RelationRows.
- Added effectType notes (v1: operation kind; v2: EXACT_COPY/WEAK_COPY/AGGREGATION/none for restricts).
- Added concrete GROUP BY example with side-by-side v1 (RelationRows, fdd/fdr) vs v2 (data_flow + AGGREGATION, groups, exact passthrough).
- Added table rename and COUNT/SUM patterns with v1 vs v2 relationships.
- Introduced a “Does v2 resolve RelationRows?” section explaining how v2 replaces RelationRows with restricts/groups/table-level data_flow, transforms, and observations; optional hidden metric if needed.
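
To make the GROUP BY pattern above concrete, here is a rough sketch of the v2 edges one might emit for `SELECT dept, SUM(salary) FROM emp GROUP BY dept`. The relationship types (`data_flow`, `groups`), the `effectType` values, and `transforms.code` come from the commit messages; the dictionary layout, the `source`/`target` keys, the generated target column name, and the helper `group_by_edges` are assumptions made for illustration only.

```python
import json

# Hypothetical builder; the edge layout is an assumed shape, not the real v2 schema.
def group_by_edges(table, group_col, agg_func, agg_col, target):
    expr = f"{agg_func}({agg_col})"
    return [
        # Aggregated value: data_flow with effectType AGGREGATION and the expression in transforms.
        {
            "type": "data_flow",
            "source": f"{table}.{agg_col}",
            "target": f"{target}.{agg_func.lower()}_{agg_col}",
            "effectType": "AGGREGATION",
            "transforms": [{"code": expr}],
        },
        # Grouping column selected as-is: exact passthrough.
        {
            "type": "data_flow",
            "source": f"{table}.{group_col}",
            "target": f"{target}.{group_col}",
            "effectType": "EXACT_COPY",
        },
        # Grouping influence on the result set, replacing the v1 fdr to RelationRows.
        {"type": "groups", "source": f"{table}.{group_col}", "target": target},
    ]

edges = group_by_edges("emp", "dept", "SUM", "salary", "result")
print(json.dumps(edges, indent=2))
```

Keeping the value edges (`data_flow`) separate from the influence edge (`groups`) mirrors the split the commit describes between what moves into a column and what only shapes the result.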
… mapping, examples, and best practices

- Standardize wording/capitalization; fix minor typos in 04-indirect-dataflow-where-group-by.md
- Embed v2 explanations immediately after each v1 sample (COUNT/SUM with/without GROUP BY)
- Add mapping table, effectType guidance, and schema references
- Annotate v2 edges with transforms and effectType: AGGREGATION
- Append expert best-practices section; deprecate RelationRows in favor of data_flow + restricts + groups
- Update Summary to align with v2 modeling and governance-focused clarity
- Rewrote 06-dataflow-chain.md with clear v1/v2 sections and simple language
- v1: explained dbobjs/relations, fdd vs fdr, and effectType: select
- v2: emphasized atomic 1→1 edges, restricts/groups/data_flow, observations/transforms
- Replaced deprecated RelationRows with v2 lineageObjects:
  - restricts: Employees.ManagerID -> temp.cteReports (condition: "ManagerID IS NULL")
  - data_flow: COUNT/SUM to temp.result columns with transforms.code
- Added effectType annotations (AGGREGATION, AMBIGUOUS) and rationale
- Clarified temporary objects and deterministic qualifiedName guidance
- Kept diagram and added links to v1/v2 schema and design docs
…message correctly, then provide a concise, standards-compliant commit message summarizing the terminology update and intro additions.

Temporary result set: rename, add intro, align with v2

- Replace “intermediate result set” with “temporary result set”
- Add beginner-friendly intro explaining concept and lifecycle
- Map to v1 (<resultset> types) and v2 (lineageObjects with isTemporary)
- Adjust headings and examples; fix minor grammar (“there” → “three”)
- Clarify that UI can hide temporary result sets in large graphs
- Define transforms in simple terms and add clear examples
- Map to v2 schema: relationships.transforms, effectType, observations
- Add SQL + minimal v2 JSON examples (EXACT_COPY, WEAK_COPY, AGGREGATION, PARTIAL_COPY)
- Retain v1 XML example and explain mapping to v2
- Add practical tips: coordinates, statementKey, setting effectType
- Explain v1 joins using fdd/fdr/join relations and simple example
- Model v2 with 1→1 data_flow (values) and restricts (join influence)
- Cover INNER/LEFT/RIGHT/FULL/CROSS joins; note composite keys
- Provide minimal v2 JSON edge snippets with statementKey/sqlCoordinates
- Map v1 → v2 and add UI/governance tips for readability and traceability

Files:
- sqlflow_public/doc/basic-concepts/08-join-relation.md
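
The join modeling summarized in this commit (1→1 `data_flow` edges for values, `restricts` edges for join influence) could look something like the sketch below. The helper `join_edges`, its parameters, and the edge dictionary shape are hypothetical; only the relationship type names are taken from the commit message. A composite join key is assumed to produce one `restricts` edge per key column.

```python
# Hypothetical helper; the edge shape is an assumption, not the documented v2 JSON.
def join_edges(select_map, join_keys, target):
    """select_map: {output column: source column}; join_keys: columns used in the ON clause."""
    # One atomic data_flow edge per selected value.
    edges = [
        {"type": "data_flow", "source": src, "target": f"{target}.{out}"}
        for out, src in select_map.items()
    ]
    # One restricts edge per join key column, pointing at the result set as a whole.
    edges += [
        {"type": "restricts", "source": key, "target": target}
        for key in join_keys
    ]
    return edges

edges = join_edges(
    {"name": "emp.name", "dept_name": "dept.name"},
    ["emp.dept_id", "dept.id"],
    "result",
)
```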
- Introduce ER basics for beginners
- Show v1 type="er" FK→PK example for structural link
- Define v2: use has (table→column) for structure; optional restricts for FK influence; keep separate from data_flow
- Include minimal v2 JSON snippets, mapping tips, and UI guidance

Files:
- sqlflow_public/doc/basic-concepts/12-er-diagram.md
Clarify the explanation of the v1 data lineage model for
GROUP BY aggregations in the introduction document.

The previous text incorrectly implied that the v1 model only
captured an indirect relationship (fdr) from the grouping key.
This change corrects the explanation to show that v1 captures both
a direct data flow (fdd) from the aggregated column and an
indirect influence (fdr) from the grouping column.

This provides a more accurate comparison with the v2 model,
better highlighting its improvements in relationship specificity,
such as the 'groups' type and 'effectType' attribute.
Reorganize 03-indirect-dataflow-and-pseudo-column.md for easier
navigation:
- Add clear intro on indirect dataflow
- Create same-level sections: WHERE Clause, GROUP BY Clause,
  RelationRows
- Consolidate v1/v2 explanations and examples under WHERE and GROUP BY
- Add “Aggregates Without GROUP BY” subsection for table-level
  aggregation
- Clarify RelationRows usage in v1 and deprecation guidance in v2
- Keep effectType notes consistent and preserve references
- Emphasize v1 treats functions as lineage objects/nodes; flows use fdd/fdr
- Emphasize v2 attaches functions to relationships via transforms.code
- Note: UDFs/TVFs get lineage objects in v2 for dependency (calls), but
  value movement remains on data_flow edges (not as dataflow nodes)
- Rename and expand 2.2 to cover all UDFs (scalar + TVF) with concise examples
- Update overview bullets to reflect the modeling shift and key differences
Refine the documentation for data lineage basic concepts to improve
clarity, accuracy, and alignment with industry best practices.

The main changes include:

- Update the basic concepts README with an introduction and an accurate
  list of all conceptual documents.
- Enhance the JOIN modeling document with more detailed examples, including
  SEMI and ANTI joins, and clarify v2 best practices.
- Append a new, comprehensive section in Chinese to the v2 design
  explanation document, detailing the recommended approach for modeling
  various JOIN types.

This provides a more robust and easier-to-understand guide for users,
especially regarding the nuanced topic of JOIN lineage.
Explain how CASE expressions are modeled in v1 and v2 data lineage.
Clarify that conditional columns are direct sources even if results are constants.
Include SQL examples for both versions.

Co-authored-by: Cursor <cursoragent@cursor.com>
@shenhuan2021 merged commit 1d21608 into feature/shenhuan on Feb 23, 2026
2 participants