This repository was archived by the owner on Dec 28, 2025. It is now read-only.
forked from apache/incubator-hugegraph-ai
Text2Gremlin Data Generation and Model Fine-Tuning System (Vertical Scenarios and General Scenarios) #52
Open
LRriver wants to merge 32 commits into hugegraph:main from LRriver:text2gremlin
Commits
- 52f2e01 feat: add configuration management module with dictionary paths and g… (LRriver)
- fadaaf7 feat: add Gremlin parsing base classes with Step, Traversal core data… (LRriver)
- b775d29 feat: add Gremlin expression processing module with predicates and co… (LRriver)
- f0588a1 feat: add graph database schema management with vertex/edge labels an… (LRriver)
- 5f3b039 feat: add Gremlin base component library with synonym replacement and… (LRriver)
- 822272f feat: add ANTLR syntax tree visitor with Gremlin query to Recipe pars… (LRriver)
- 441b32c feat: add recursive backtracking traversal generator for diverse quer… (LRriver)
- 2de2096 feat: add main corpus generator with batch processing, global dedupli… (LRriver)
- c92f09a config: add global configuration file with generation parameters and … (LRriver)
- 25ca990 data: add cypher2gremlin dataset with 3514 real query templates (LRriver)
- 25a2876 docs: add project README with quick start guide and usage instructions (LRriver)
- 541aa20 feat: add ANTLR-generated Gremlin grammar package with lexer, parser … (LRriver)
- eb7eb01 data: add schema and graph data (LRriver)
- f0579e8 feat: add template directory with schema dictionary and synonym files (LRriver)
- 9c13457 test: add gremlin statement generalization generation test module (LRriver)
- b14ffb3 test: add generator unit tests for corpus generation validation (LRriver)
- 7cd8427 Add graph2gremlin.py: Initial template-based Gremlin data generation … (LRriver)
- 4da021c Add gremlin_checker.py: Syntax checking using Antlr4 (LRriver)
- bc10fe2 Add llm_handler.py: LLM interaction model for query generalization an… (LRriver)
- 6ea48d5 Add qa_generalize.py: Seed data generalization using gremlin_checker … (LRriver)
- 78f8c2a Add instruct_convert.py: Instruction format conversion and train/test… (LRriver)
- b7f3f4a Add da_data: Schema and graph data (LRriver)
- 332b879 Add data/seed_data: Seed data directory (LRriver)
- 8a94bad Add data/vertical_training_sets: Vertical domain scenario generalized… (LRriver)
- 676d28c Add books on Gremlin syntax knowledge to process data. (LRriver)
- 90f346f Add a dataset of Gremlin QA pairs synthesized based on LLM. (LRriver)
- 4120356 Add README.md (LRriver)
- 67b523a Compatible with OpenAI format (LRriver)
- bccc147 Increase Gremlin syntax vocabulary that supports generalization, and … (LRriver)
- 44592b4 modify README.md (LRriver)
- a1d614c Add Apache-2.0 license, fix review comments (LRriver)
- 471e141 Modify the .licenserc.yaml file to ignore license checks for .interp,… (LRriver)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,183 @@ | ||
| # Gremlin Query Corpus Generator | ||
|
|
||
| Generates a large, diverse set of Gremlin queries and their Chinese descriptions from templates, for testing, training, and analysis. | ||
|
|
||
| ## Quick Start | ||
| Environment: Python 3.12.10 | ||
| ```bash | ||
| pip install -r requirements.txt | ||
| ``` | ||
|
|
||
| ```bash | ||
| # Generate the corpus | ||
| python generate_corpus.py | ||
|
|
||
| # Show statistics | ||
| python show_syntax_stats.py | ||
| ``` | ||
|
|
||
| The generated corpus is written to `output/generated_corpus_*.json`. | ||
|
|
||
| --- | ||
|
|
||
| ## Core Functionality | ||
|
|
||
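| - **Template generalization**: one template → hundreds of query variants | ||
| - **Smart control**: automatically limits combinatorial explosion so that the number of generated queries stays manageable | ||
| - **Chinese descriptions**: automatically generates a fluent description for each query | ||
| - **Syntax analysis**: reports the syntax distribution of the generated queries | ||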
|
|
||
| --- | ||
|
|
||
| ## Project Structure | ||
|
|
||
| ```text | ||
| ├── generate_corpus.py # Main program | ||
| ├── gremlin_templates.csv # Template file | ||
| ├── config.json # Configuration | ||
| ├── base/ | ||
| │   ├── generator.py # Parsing and generalization controller | ||
| │   ├── Config.py # Configuration management module | ||
| │   ├── Schema.py # Schema and data management | ||
| │   ├── GremlinParse.py # Data structure definitions | ||
| │   ├── GremlinExpr.py # Complex expression definitions (predicates, anonymous traversals, etc.) | ||
| │   ├── GremlinTransVisitor.py # AST parsing | ||
| │   ├── TraversalGenerator.py # Traversal generator | ||
| │   ├── combination_control_config.json # Combination control configuration | ||
| │   ├── GremlinBase.py # Translation engine | ||
| │   ├── gremlin/ # ANTLR-generated parser | ||
| │   └── template/ # Translation dictionaries | ||
| │       ├── schema_dict.txt # Schema term translations | ||
| │       └── syn_dict.txt # Synonym dictionary | ||
| ├── db_data/ # Data and schema | ||
| └── output/ # Output directory | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Usage | ||
|
|
||
| ### 1. Command Line | ||
|
|
||
| ```bash | ||
| python generate_corpus.py | ||
| ``` | ||
|
|
||
| ### 2. Python API | ||
|
|
||
| ```python | ||
| from base import generate_gremlin_corpus | ||
|
|
||
| result = generate_gremlin_corpus( | ||
| templates='gremlin_templates.csv', | ||
| config_path='config.json', | ||
| schema_path='db_data/schema/movie_schema.json', | ||
| data_path='db_data/' | ||
| ) | ||
|
|
||
| print(f"Generated {result['total_unique_queries']} queries") | ||
| ``` | ||
|
|
||
| ### 3. Adding Templates | ||
|
|
||
| Edit `gremlin_templates.csv` directly. | ||
|
|
||
| --- | ||
|
|
||
| ## Configuration | ||
|
|
||
| ### Template File (`gremlin_templates.csv`) | ||
|
|
||
| | Column | Description | Example | | ||
| |------|------|------| | ||
| | template | Gremlin query template | `g.V().hasLabel('person')` | | ||
| | description | Template description (optional) | Query all people | | ||
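A minimal sketch of reading a template file with the two columns described above, using only the standard library. The sample rows and the `load_templates` helper are hypothetical and are not taken from the repository; the real `gremlin_templates.csv` may contain additional columns or different content.

```python
import csv
import io

# Hypothetical sample content; the real gremlin_templates.csv may differ.
SAMPLE_CSV = """template,description
g.V().hasLabel('person'),Query all people
g.V().hasLabel('movie').out('has_genre'),
"""


def load_templates(stream):
    """Return (template, description) pairs; a blank description becomes None."""
    rows = []
    for row in csv.DictReader(stream):
        template = (row.get("template") or "").strip()
        if not template:
            continue  # skip rows that carry no query template
        description = (row.get("description") or "").strip() or None
        rows.append((template, description))
    return rows


templates = load_templates(io.StringIO(SAMPLE_CSV))
for template, description in templates:
    print(template, "|", description)
```

A template that itself contains commas, such as one using `has('title', 'Inception')`, has to be wrapped in double quotes in the CSV file so that it is parsed as a single field.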
|
|
||
| ### Combination Control (`base/combination_control_config.json`) | ||
|
|
||
| Controls how many queries are generated; see `COMBINATION_CONTROL_GUIDE.md` for details. | ||
|
|
||
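| Key parameters: | ||
| - **Chain length classes**: short (≤4 steps), medium (5-6 steps), long (7-8 steps), very long (≥9 steps) | ||
| - **Data value filling**: 1 value for intermediate steps, 2-3 values for terminal steps | ||
| - **Property generalization**: the degree of generalization is adjusted dynamically according to chain length | ||
| - **Query count limits**: medium chains ≤100, long chains ≤500, very long chains ≤50 | ||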
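The chain length classes and query caps listed above can be sketched as a small lookup. This is an illustration only: the function names and the `QUERY_LIMITS` mapping are hypothetical and do not reflect the actual layout of `combination_control_config.json`, and the list gives no cap for short chains, so none is assumed here.

```python
def classify_chain(step_count):
    """Map a traversal's step count to the chain class from the list above."""
    if step_count <= 4:
        return "short"
    if step_count <= 6:
        return "medium"
    if step_count <= 8:
        return "long"
    return "very_long"


# Caps quoted from the parameter list; no cap is stated for short chains.
QUERY_LIMITS = {"medium": 100, "long": 500, "very_long": 50}


def query_limit(step_count):
    """Return the query cap for a chain, or None when no cap is listed."""
    return QUERY_LIMITS.get(classify_chain(step_count))


for steps in (3, 5, 8, 12):
    print(steps, classify_chain(steps), query_limit(steps))
```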
|
|
||
| --- | ||
|
|
||
| ## Output Format | ||
|
|
||
| ```json | ||
| { | ||
| "metadata": { | ||
| "total_templates": 198, | ||
| "successful_templates": 198, | ||
| "total_unique_queries": 1493, | ||
| "generation_timestamp": "2025-10-29 19:07:33" | ||
| }, | ||
| "corpus": [ | ||
| { | ||
| "query": "g.V().hasLabel('person').out('acted_in')", | ||
| "description": "从图中开始查找所有顶点,过滤出'人'类型的顶点,沿'参演'边out方向遍历" | ||
| } | ||
| ] | ||
| } | ||
| ``` | ||
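As a rough illustration of consuming this format, the sketch below parses a document with the same shape and summarizes it. The embedded sample simply repeats the example above, and `summarize` is a hypothetical helper, not part of the project.

```python
import json

# Same shape as the example output above, embedded so the sketch is self-contained.
SAMPLE = """
{
  "metadata": {
    "total_templates": 198,
    "successful_templates": 198,
    "total_unique_queries": 1493,
    "generation_timestamp": "2025-10-29 19:07:33"
  },
  "corpus": [
    {
      "query": "g.V().hasLabel('person').out('acted_in')",
      "description": "从图中开始查找所有顶点,过滤出'人'类型的顶点,沿'参演'边out方向遍历"
    }
  ]
}
"""


def summarize(doc):
    """Collect a few headline numbers from a corpus document."""
    meta = doc["metadata"]
    return {
        "templates": meta["total_templates"],
        "unique_queries": meta["total_unique_queries"],
        "entries": len(doc["corpus"]),
    }


doc = json.loads(SAMPLE)
summary = summarize(doc)
print(summary)
```

For a real run, the same function could be applied to a file matching `output/generated_corpus_*.json`, loaded with `json.load` and `encoding="utf-8"` because the descriptions are in Chinese.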
|
|
||
| --- | ||
|
|
||
| ## Syntax Analysis | ||
|
|
||
| After generating the corpus, you can analyze its syntax distribution: | ||
|
|
||
| ```bash | ||
| # Analyze the syntax distribution | ||
| python analyze_syntax_distribution.py | ||
|
|
||
| # Show statistics | ||
| python show_syntax_stats.py | ||
|
|
||
| # Visualize | ||
| python visualize_syntax_distribution.py | ||
| ``` | ||
|
|
||
| Analysis output: | ||
| - `output/syntax_distribution_stats.json` - statistics | ||
| - `output/SYNTAX_ANALYSIS_SUMMARY.md` - analysis report | ||
|
|
||
| --- | ||
|
|
||
| ## Key Features | ||
|
|
||
| ### 1. Template Generalization | ||
| Multiple variants are generated from a single template: | ||
| ```text | ||
| Template: g.V().hasLabel('person').out('acted_in') | ||
|
|
||
| Variants: | ||
| → g.V().hasLabel('movie').out('acted_in') | ||
| → g.V().hasLabel('person').out('directed') | ||
| → g.V().hasLabel('genre').out('has_genre') | ||
| ... | ||
| ``` | ||
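The idea of swapping labels in a template can be sketched with plain string substitution. This is a deliberately naive stand-in: according to the project structure, the actual generator parses templates with an ANTLR-based visitor and consults the schema, whereas the `generalize` function and the label lists below are hypothetical and perform no schema validation.

```python
import re
from itertools import product

# Labels taken from the example above; a real run would read them from the schema.
VERTEX_LABELS = ["person", "movie", "genre"]
EDGE_LABELS = ["acted_in", "directed", "has_genre"]


def generalize(template, vertex_labels, edge_labels):
    """Swap the hasLabel() and out() arguments for every label combination."""
    variants = []
    for vertex, edge in product(vertex_labels, edge_labels):
        query = re.sub(r"hasLabel\('[^']*'\)", f"hasLabel('{vertex}')", template)
        query = re.sub(r"out\('[^']*'\)", f"out('{edge}')", query)
        if query != template:  # keep only genuine variants
            variants.append(query)
    return variants


variants = generalize(
    "g.V().hasLabel('person').out('acted_in')", VERTEX_LABELS, EDGE_LABELS
)
for query in variants:
    print(query)
```

Enumerating the full cross product like this is exactly what grows quickly with chain length, which is the combinatorial explosion the combination control settings are meant to limit.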
|
|
||
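| ### 2. Smart Control | ||
| - **Chain-length adaptive**: short chains are generalized more, long chains less | ||
| - **Position sensitive**: conservative at intermediate steps, thorough at terminal steps | ||
| - **Type aware**: schema properties are generalized aggressively, data values are filled conservatively | ||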
|
|
||
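| ### 3. Automatic Deduplication | ||
| - Query-level deduplication (identical queries) | ||
| - Semantic-level deduplication (equivalent queries) | ||
| - Every generated query is guaranteed to be unique | ||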
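The README does not say how equivalence between queries is decided. As an assumption-laden sketch, the code below treats two queries as duplicates when they differ only in whitespace outside string literals; the `normalize` and `deduplicate` helpers are hypothetical, and the project's semantic-level check may be considerably richer.

```python
def normalize(query):
    """Drop whitespace outside quoted strings (escaped quotes are not handled)."""
    out = []
    quote = None  # the quote character currently open, if any
    for ch in query:
        if quote:
            out.append(ch)
            if ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
            out.append(ch)
        elif not ch.isspace():
            out.append(ch)
    return "".join(out)


def deduplicate(queries):
    """Keep the first occurrence of each query, compared by normalized form."""
    seen = set()
    unique = []
    for query in queries:
        key = normalize(query)
        if key not in seen:
            seen.add(key)
            unique.append(query)
    return unique


unique = deduplicate([
    "g.V().hasLabel('person')",
    "g.V() .hasLabel( 'person' )",
    "g.V().has('name', 'Tom Hanks')",
    "g.V().has('name','TomHanks')",
])
print(unique)
```

Whitespace inside string literals is preserved, so `'Tom Hanks'` and `'TomHanks'` remain distinct values.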
|
|
||
| ### 4. Chinese Translation | ||
| Fluent Chinese descriptions are generated automatically: | ||
| ```text | ||
| g.V().hasLabel('person').out('acted_in').has('title', 'Inception') | ||
| ↓ | ||
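| 从图中开始查找所有顶点,过滤出'人'类型的顶点,沿'参演'边out方向遍历,其'标题'为'Inception' | ||
| (In English, roughly: start from all vertices in the graph, keep the vertices of type 'person', traverse along the 'acted in' edge in the out direction, where the 'title' is 'Inception') | ||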
| ``` | ||
coderabbitai[bot] marked this conversation as resolved.