Skip to content
This repository was archived by the owner on Dec 28, 2025. It is now read-only.
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
32 commits
Select commit Hold shift + click to select a range
52f2e01
feat: add configuration management module with dictionary paths and g…
LRriver Sep 30, 2025
fadaaf7
feat: add Gremlin parsing base classes with Step, Traversal core data…
LRriver Sep 30, 2025
b775d29
feat: add Gremlin expression processing module with predicates and co…
LRriver Sep 30, 2025
f0588a1
feat: add graph database schema management with vertex/edge labels an…
LRriver Sep 30, 2025
5f3b039
feat: add Gremlin base component library with synonym replacement and…
LRriver Sep 30, 2025
822272f
feat: add ANTLR syntax tree visitor with Gremlin query to Recipe pars…
LRriver Sep 30, 2025
441b32c
feat: add recursive backtracking traversal generator for diverse quer…
LRriver Sep 30, 2025
2de2096
feat: add main corpus generator with batch processing, global dedupli…
LRriver Sep 30, 2025
c92f09a
config: add global configuration file with generation parameters and …
LRriver Sep 30, 2025
25ca990
data: add cypher2gremlin dataset with 3514 real query templates
LRriver Sep 30, 2025
25a2876
docs: add project README with quick start guide and usage instructions
LRriver Sep 30, 2025
541aa20
feat: add ANTLR-generated Gremlin grammar package with lexer, parser …
LRriver Sep 30, 2025
eb7eb01
data: add schema and graph data
LRriver Sep 30, 2025
f0579e8
feat: add template directory with schema dictionary and synonym files
LRriver Sep 30, 2025
9c13457
test: add gremlin statement generalization generation test module
LRriver Sep 30, 2025
b14ffb3
test: add generator unit tests for corpus generation validation
LRriver Sep 30, 2025
7cd8427
Add graph2gremlin.py: Initial template-based Gremlin data generation …
LRriver Sep 30, 2025
4da021c
Add gremlin_checker.py: Syntax checking using Antlr4
LRriver Sep 30, 2025
bc10fe2
Add llm_handler.py: LLM interaction model for query generalization an…
LRriver Sep 30, 2025
6ea48d5
Add qa_generalize.py: Seed data generalization using gremlin_checker …
LRriver Sep 30, 2025
78f8c2a
Add instruct_convert.py: Instruction format conversion and train/test…
LRriver Sep 30, 2025
b7f3f4a
Add da_data: Schema and graph data
LRriver Sep 30, 2025
332b879
Add data/seed_data: Seed data directory
LRriver Sep 30, 2025
8a94bad
Add data/vertical_training_sets: Vertical domain scenario generalized…
LRriver Sep 30, 2025
676d28c
Add books on Gremlin syntax knowledge to process data.
LRriver Sep 30, 2025
90f346f
Add a dataset of Gremlin QA pairs synthesized based on LLM.
LRriver Sep 30, 2025
4120356
Add README.md
LRriver Sep 30, 2025
67b523a
Compatible with OpenAI format
LRriver Oct 5, 2025
bccc147
Increase Gremlin syntax vocabulary that supports generalization, and …
LRriver Oct 29, 2025
44592b4
modify README.md
LRriver Oct 29, 2025
a1d614c
Add Apache-2.0 license, fix review comments
LRriver Oct 30, 2025
471e141
Modify the .licenserc.yaml file to ignore license checks for .interp,…
LRriver Oct 30, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions .licenserc.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -82,6 +82,9 @@ header: # `header` section is configurations for source codes license header.
- '**/poetry.lock'
- '.github/**/*'
- 'docker/**/*'
- '**/*.interp'
- '**/*.tokens'
- '**/*.csv'

comment: on-failure
# on what condition license-eye will comment on the pull request, `on-failure`, `always`, `never`.
Expand Down
183 changes: 183 additions & 0 deletions text2gremlin/AST_Text2Gremlin/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,183 @@
# Gremlin 查询语料库生成器

从模板生成大量多样化的 Gremlin 查询及其中文描述,用于测试、训练和分析。

## 快速开始
环境配置:python:3.12.10
```bash
pip install -r requirements.txt
```

```bash
# 生成语料库
python generate_corpus.py

# 查看统计
python show_syntax_stats.py
```

生成结果在 `output/generated_corpus_*.json`

---

## 核心功能

- **模板泛化**: 1 个模板 → 数百个查询变体
- **智能控制**: 自动控制组合爆炸,避免生成过多查询
- **中文描述**: 自动生成流畅的查询描述
- **语法分析**: 统计生成查询的语法分布

---

## 项目结构

```text
├── generate_corpus.py # 主程序
├── gremlin_templates.csv # 模板文件
├── config.json # 配置
├── base/
│ ├── generator.py # 解析泛化控制器
│ ├── Config.py # 配置管理模块
│ ├── Schema.py # Schema和数据管理
│ ├── GremlinParse.py # 数据结构定义
│ ├── GremlinExpr.py # 复杂表达式定义(谓词、匿名遍历等)
│ ├── GremlinTransVisitor.py # AST解析
│ ├── TraversalGenerator.py # 遍历生成器
│ ├── combination_control_config.json # 组合控制配置
│ ├── GremlinBase.py # 翻译引擎
│ ├── gremlin/ # ANTLR生成的解析器
│ └── template/ # 翻译字典
│ ├── schema_dict.txt # Schema术语翻译
│ └── syn_dict.txt # 同义词字典
├── db_data/ # 数据和 Schema
└── output/ # 输出目录
```

---

## 使用方式

### 1. 命令行

```bash
python generate_corpus.py
```

### 2. Python API

```python
from base import generate_gremlin_corpus

result = generate_gremlin_corpus(
templates='gremlin_templates.csv',
config_path='config.json',
schema_path='db_data/schema/movie_schema.json',
data_path='db_data/'
)

print(f"生成了 {result['total_unique_queries']} 个查询")
```

### 3. 添加模板

直接编辑 `gremlin_templates.csv`即可

---

## 配置说明

### 模板文件 (`gremlin_templates.csv`)

| 列名 | 说明 | 示例 |
|------|------|------|
| template | Gremlin 查询模板 | `g.V().hasLabel('person')` |
| description | 模板描述(可选) | 查询所有人 |

### 组合控制 (`base/combination_control_config.json`)

控制查询生成数量,详见 `COMBINATION_CONTROL_GUIDE.md`

核心参数:
- **链长度分类**: 短链(≤4步)、中链(5-6步)、长链(7-8步)、超长链(≥9步)
- **数据值填充**: 中间步骤填1个值,终端步骤填2-3个值
- **属性泛化**: 根据链长度动态调整泛化程度
- **查询数量限制**: 中链≤100,长链≤500,超长链≤50

---

## 输出格式

```json
{
"metadata": {
"total_templates": 198,
"successful_templates": 198,
"total_unique_queries": 1493,
"generation_timestamp": "2025-10-29 19:07:33"
},
"corpus": [
{
"query": "g.V().hasLabel('person').out('acted_in')",
"description": "从图中开始查找所有顶点,过滤出'人'类型的顶点,沿'参演'边out方向遍历"
}
]
}
```

---

## 语法分析

生成语料库后,可以分析语法分布:

```bash
# 分析语法分布
python analyze_syntax_distribution.py

# 查看统计
python show_syntax_stats.py

# 可视化
python visualize_syntax_distribution.py
```

分析结果:
- `output/syntax_distribution_stats.json` - 统计数据
- `output/SYNTAX_ANALYSIS_SUMMARY.md` - 分析报告

---

## 核心特性

### 1. 模板泛化
从一个模板生成多个变体:
```text
模板: g.V().hasLabel('person').out('acted_in')

泛化:
→ g.V().hasLabel('movie').out('acted_in')
→ g.V().hasLabel('person').out('directed')
→ g.V().hasLabel('genre').out('has_genre')
...
```

### 2. 智能控制
- **链长度自适应**: 短链多泛化,长链少泛化
- **位置敏感**: 中间步骤保守,终端步骤充分
- **类型区分**: Schema 属性积极泛化,数据值保守填充

### 3. 自动去重
- 查询级去重(完全相同的查询)
- 语义级去重(等价查询)
- 保证生成的查询都是唯一的

### 4. 中文翻译
自动生成流畅的中文描述:
```text
g.V().hasLabel('person').out('acted_in').has('title', 'Inception')
从图中开始查找所有顶点,过滤出'人'类型的顶点,沿'参演'边out方向遍历,其'标题'为'Inception'
```



Loading
Loading