AI-Powered Knowledge Graph Generator

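This system takes an unstructured text document, uses a large language model (LLM) of your choice to extract knowledge as subject-predicate-object (SPO) triples, and visualizes the relationships as an interactive knowledge graph. A demo of a knowledge graph created with this project can be viewed here: Industrial Revolution Knowledge Graph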

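Features

Text chunking: automatically splits large documents into manageable chunks for processing
Knowledge extraction: uses AI to identify entities and the relationships between them
Entity standardization: keeps entity names consistent across document chunks
Relationship inference: discovers additional relationships between disconnected parts of the graph
Interactive visualization: creates an interactive graph visualization
Works with any OpenAI-compatible API endpoint: Ollama, LM Studio, OpenAI, vLLM, LiteLLM (which provides access to AWS Bedrock, Azure OpenAI, Anthropic, and many other LLM services)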
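
The text chunking feature can be pictured as a sliding window over the words of the document. The following is a minimal sketch under that assumption, not the project's own chunking code: the function name `chunk_words` is hypothetical, and the defaults simply reuse the `chunk_size` and `overlap` values from the sample configuration.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word-based chunks; neighbours share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # how far the window advances each time
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Ten placeholder words, windows of 4 words sharing 1 word with the previous one.
sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary is still seen whole by at least one LLM call, at the cost of processing some words twice.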

Requirements

Python 3.11+
Required packages (install with pip install -r requirements.txt or uv sync)

Quick start

Clone this repository
Install dependencies: pip install -r requirements.txt
Configure your settings in config.toml
Run the system:

python generate-graph.py --input your_text_file.txt --output knowledge_graph.html

Or with uv:

uv run generate-graph.py --input your_text_file.txt --output knowledge_graph.html

Or install and use it as a module:

pip install --upgrade -e .
generate-graph --input your_text_file.txt --output knowledge_graph.html

Configuration

The system is configured through the config.toml file:

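[llm]
model = "gemma3"  # Google open-weight model
api_key = "sk-1234"
base_url = "http://localhost:11434/v1/chat/completions"  # Local Ollama instance (can be any OpenAI-compatible endpoint)
max_tokens = 8192  # Maximum number of tokens
temperature = 0.2  # Sampling temperature

[chunking]
chunk_size = 200  # Number of words per chunk
overlap = 20      # Number of words shared between consecutive chunks

[standardization]
enabled = true            # Enable entity standardization
use_llm_for_entities = true  # Use the LLM for additional entity resolution

[inference]
enabled = true             # Enable relationship inference
use_llm_for_inference = true  # Use the LLM for relationship inference
apply_transitive = true    # Apply transitive inference rules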

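Command-line options

--input FILE: Input text file to process
--output FILE: Output HTML file for the visualization (default: knowledge_graph.html)
--config FILE: Path to the configuration file (default: config.toml)
--debug: Enable debug output showing raw LLM responses
--no-standardize: Disable entity standardization
--no-inference: Disable relationship inference
--test: Generate an example visualization from test data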
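
For readers who want to see how such an interface is typically wired up, here is a minimal `argparse` sketch that mirrors the documented options and defaults. It is an illustration only, not the project's actual parser: the helper names `build_parser` and `parse_args` are hypothetical, and the explicit check for `--input` is an assumed way to express "required unless --test is used".

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the options documented above."""
    parser = argparse.ArgumentParser(prog="generate-graph")
    parser.add_argument("--test", action="store_true",
                        help="generate an example visualization from test data")
    parser.add_argument("--config", default="config.toml",
                        help="path to the configuration file")
    parser.add_argument("--output", default="knowledge_graph.html",
                        help="output HTML file for the visualization")
    parser.add_argument("--input", help="input text file to process")
    parser.add_argument("--debug", action="store_true",
                        help="show raw LLM responses")
    parser.add_argument("--no-standardize", action="store_true",
                        help="disable entity standardization")
    parser.add_argument("--no-inference", action="store_true",
                        help="disable relationship inference")
    return parser


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # --input is only optional when --test is given.
    if not args.test and not args.input:
        parser.error("--input is required unless --test is used")
    return args


args = parse_args(["--input", "your_text_file.txt", "--no-inference"])
print(args.input, args.output, args.no_inference)
```

Note that `argparse` converts the dashes in `--no-inference` to underscores, so the value is read as `args.no_inference`.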

Usage help (--help)

generate-graph --help
usage: generate-graph [-h] [--test] [--config CONFIG] [--output OUTPUT] [--input INPUT] [--debug] [--no-standardize] [--no-inference]

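Knowledge Graph Generator and Visualizer

options:
  -h, --help        show this help message and exit
  --test            Generate a test visualization with sample data
  --config CONFIG   Path to configuration file
  --output OUTPUT   Output HTML file path
  --input INPUT     Path to input text file (required unless --test is used)
  --debug           Enable debug output (raw LLM responses and extracted JSON)
  --no-standardize  Disable entity standardization
  --no-inference    Disable relationship inference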

Example run

Command:

generate-graph --input data/industrial-revolution.txt --output industrial-revolution-kg.html

Console output:

Using input text from file: data/industrial-revolution.txt
==================================================
PHASE 1: INITIAL TRIPLE EXTRACTION
==================================================
Processing text in 13 chunks (size: 100 words, overlap: 20 words)
Processing chunk 1/13 (100 words)
Processing chunk 2/13 (100 words)
Processing chunk 3/13 (100 words)
Processing chunk 4/13 (100 words)
Processing chunk 5/13 (100 words)
Processing chunk 6/13 (100 words)
Processing chunk 7/13 (100 words)
Processing chunk 8/13 (100 words)
Processing chunk 9/13 (100 words)
Processing chunk 10/13 (100 words)
Processing chunk 11/13 (100 words)
Processing chunk 12/13 (86 words)
Processing chunk 13/13 (20 words)

Extracted a total of 216 triples from all chunks

==================================================
PHASE 2: ENTITY STANDARDIZATION
==================================================
Starting with 216 triples and 201 unique entities
Standardizing entity names across all triples...
Applied LLM-based entity standardization for 15 entity groups
Standardized 201 entities into 181 standard forms
After standardization: 216 triples and 160 unique entities

==================================================
PHASE 3: RELATIONSHIP INFERENCE
==================================================
Starting with 216 triples
Top 5 relationship types before inference:
  - enables: 20 occurrences
  - impacts: 15 occurrences
  - enabled: 12 occurrences
  - pioneered: 10 occurrences
  - invented: 9 occurrences
Inferring additional relationships between entities...
Identified 9 disconnected communities in the graph
Inferred 3 new relationships between communities
Inferred 3 new relationships between communities
Inferred 3 new relationships between communities
Inferred 3 new relationships between communities
Inferred 3 new relationships between communities
Inferred 3 new relationships between communities
Inferred 3 new relationships between communities
Inferred 3 new relationships between communities
Inferred 3 new relationships between communities
Inferred 3 new relationships between communities
Inferred 9 new relationships within communities
Inferred 2 new relationships within communities
Inferred 88 relationships based on lexical similarity
Added -22 inferred relationships

Top 5 relationship types after inference:
  - related to: 65 occurrences
  - advances via Artificial Intelligence: 36 occurrences
  - pioneered via computing: 26 occurrences
  - enables via computing: 24 occurrences
  - enables: 21 occurrences

Added 370 inferred relationships
Final knowledge graph: 564 triples
Saved raw knowledge graph data to /mnt/c/Users/rmcdermo/Documents/industrial-revolution-kg.json
Processing 564 triples for visualization
Found 161 unique nodes
Found 355 inferred relationships
Detected 9 communities using Louvain method
Nodes in NetworkX graph: 161
Edges in NetworkX graph: 537
Knowledge graph visualization saved to /mnt/c/Users/rmcdermo/Documents/industrial-revolution-kg.html
Graph Statistics: {
  "nodes": 161,
  "edges": 564,
  "original_edges": 209,
  "inferred_edges": 355,
  "communities": 9
}

Knowledge Graph Statistics:
Nodes: 161
Edges: 564
Communities: 9

To view the visualization, open the following file in your browser:
file:///mnt/c/Users/rmcdermo/Documents/industrial-revolution-kg.html
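
In the statistics above, the 564 edges are the sum of 209 original and 355 inferred edges. The sketch below shows one way such counts could be derived from a list of triples. The dictionary keys (`subject`, `predicate`, `object`, `inferred`) and the function name `graph_statistics` are assumptions for illustration; the layout of the exported JSON file is not documented here.

```python
def graph_statistics(triples: list[dict]) -> dict:
    """Count nodes and edges, splitting edges into original and inferred."""
    nodes = set()
    inferred = 0
    for triple in triples:
        nodes.add(triple["subject"])
        nodes.add(triple["object"])
        if triple.get("inferred", False):  # a missing flag counts as original
            inferred += 1
    return {
        "nodes": len(nodes),
        "edges": len(triples),
        "original_edges": len(triples) - inferred,
        "inferred_edges": inferred,
    }


# Hand-written sample data, not output of the tool.
sample_triples = [
    {"subject": "James Watt", "predicate": "improved", "object": "steam engine"},
    {"subject": "steam engine", "predicate": "powered", "object": "factories"},
    {"subject": "James Watt", "predicate": "related to", "object": "factories",
     "inferred": True},
]
stats = graph_statistics(sample_triples)
print(stats)
```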

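How it works

Chunking: the document is split into overlapping chunks so that each one fits within the LLM's context window
First pass - SPO extraction:
- Each chunk is processed by the LLM to extract subject-predicate-object triples
- Implemented in the process_with_llm function
- The LLM identifies the entities in each text segment and the relationships between them
- Results are collected across all chunks to form the initial knowledge graph
Second pass - entity standardization:
- Basic standardization through text normalization
- Optional LLM-assisted entity alignment (controlled by the standardization.use_llm_for_entities setting)
- When enabled, the LLM examines all unique entities in the graph and identifies groups that refer to the same concept
- This resolves cases where the same entity appears in different forms across chunks (for example, "AI", "artificial intelligence", "AI system")
- Standardization helps produce a more coherent, easier-to-navigate knowledge graph
Third pass - relationship inference:
- Automatic inference of transitive relationships
- Optional LLM-assisted inference for disconnected graph components (controlled by the inference.use_llm_for_inference setting)
- When enabled, the LLM analyzes representative entities from disconnected communities and infers plausible relationships
- This reduces graph fragmentation by adding logical connections that are not stated explicitly in the text
- Rule-based and LLM-based inference work together to build a more complete graph
Visualization: an interactive HTML visualization is generated with the PyVis library

Both the second and third passes are optional and can be disabled in the configuration, either to minimize LLM usage or to control these steps manually.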
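
To make the "basic standardization through text normalization" step of the second pass more concrete, here is a minimal sketch. The specific rules (trimming, lowercasing, collapsing whitespace, dropping a leading English article) and the function names are assumptions chosen for illustration; the project's actual normalization rules may differ.

```python
import re


def normalize_entity(name: str) -> str:
    """Reduce an entity name to a canonical form using simple text rules."""
    text = name.strip().lower()
    text = re.sub(r"\s+", " ", text)          # collapse runs of whitespace
    text = re.sub(r"^(the|a|an) ", "", text)  # drop a leading article
    return text


def standardize_triples(triples: list[tuple[str, str, str]]) -> list[tuple[str, str, str]]:
    """Normalize subject and object of each triple; predicates are left as-is."""
    return [(normalize_entity(s), p, normalize_entity(o)) for s, p, o in triples]


raw = [
    ("The Steam  Engine", "powered", "Factories"),
    ("steam engine", "enabled", "the railways"),
]
for triple in standardize_triples(raw):
    print(triple)
```

Rules like these merge surface variants such as "The Steam  Engine" and "steam engine", but they cannot tell that "AI" and "artificial intelligence" name the same concept. That gap is what the optional LLM-assisted alignment is meant to cover.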

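Visualization features

Color-coded communities: node colors represent different communities
Node size: nodes are sized by importance (degree, betweenness, and eigenvector centrality)
Relationship types: original relationships are drawn as solid lines, inferred relationships as dashed lines
Interactive controls: zoom, pan, hover for details, filtering, and physics controls
Light (default) and dark themes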
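
As a rough illustration of sizing nodes by importance, the sketch below scales node size linearly with degree. It is deliberately simplified: it uses degree only, whereas the feature list also names betweenness and eigenvector centrality, and the function name `node_sizes` and the size range of 10 to 30 are hypothetical choices.

```python
from collections import Counter


def node_sizes(edges: list[tuple[str, str]],
               min_size: float = 10.0,
               max_size: float = 30.0) -> dict[str, float]:
    """Map each node to a size between min_size and max_size based on its degree."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    if not degree:
        return {}
    low, high = min(degree.values()), max(degree.values())
    span = high - low
    sizes = {}
    for node, count in degree.items():
        # When every node has the same degree, all nodes get min_size.
        fraction = (count - low) / span if span else 0.0
        sizes[node] = min_size + fraction * (max_size - min_size)
    return sizes


edges = [
    ("steam engine", "factories"),
    ("steam engine", "railways"),
    ("steam engine", "coal mining"),
    ("factories", "railways"),
]
sizes = node_sizes(edges)
print(sizes)
```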

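Project structure

.
├── config.toml                     # Main configuration file for the system
├── generate-graph.py               # Entry point when run directly as a script
├── pyproject.toml                  # Python project metadata and build configuration
├── requirements.txt                # Python dependencies for 'pip' users
├── uv.lock                         # Python dependencies for 'uv' users
└── src/                            # Source code
    ├── generate_graph.py           # Main entry point script when run as a module
    └── knowledge_graph/            # Core package
        ├── __init__.py             # Package initialization
        ├── config.py               # Configuration loading and validation
        ├── entity_standardization.py # Entity standardization algorithms
        ├── llm.py                  # LLM interaction and response handling
        ├── main.py                 # Main program flow and orchestration
        ├── prompts.py              # Centralized collection of LLM prompts
        ├── text_utils.py           # Text processing and chunking utilities
        ├── visualization.py        # Knowledge graph visualization generator
        └── templates/              # HTML templates for the visualization
            └── graph_template.html # Base template for the interactive graph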

Program flow

This diagram illustrates the program flow.

flowchart TD
    %% Main entry point
    A[main.py - entry point] --> B{Parse arguments}
    
    %% Test mode branch
    B -->|--test flag| C[sample_data_visualization]
    C --> D[visualize_knowledge_graph]
    
    %% Normal processing branch
    B -->|normal processing| E[load_config]
    E --> F[process_text_in_chunks]
    
    %% Text processing
    F --> G[chunk_text]
    G --> H[process_with_llm]
    
    %% LLM processing
    H --> I[call_llm]
    I --> J[extract_json_from_text]
    
    %% Entity standardization phase
    F --> K{Standardization enabled?}
    K -->|yes| L[standardize_entities]
    K -->|no| M{Inference enabled?}
    L --> M
    
    %% Relationship inference phase
    M -->|yes| N[infer_relationships]
    M -->|no| O[visualize_knowledge_graph]
    N --> O
    
    %% Visualization components
    O --> P[_calculate_centrality_metrics]
    O --> Q[_detect_communities]
    O --> R[_calculate_node_sizes]
    O --> S[_add_nodes_and_edges_to_network]
    O --> T[_get_visualization_options]
    O --> U[_save_and_modify_html]
    
    %% Subprocesses
    L --> L1[_resolve_entities_with_llm]
    N --> N1[_identify_communities]
    N --> N2[_infer_relationships_with_llm]
    N --> N3[_infer_within_community_relationships]
    N --> N4[_apply_transitive_inference]
    N --> N5[_infer_relationships_by_lexical_similarity]
    N --> N6[_deduplicate_triples]
    
    %% File output
    U --> V[HTML visualization]
    F --> W[JSON data export]
    
    %% Prompt usage
    Y[prompts.py] --> H
    Y --> L1
    Y --> N2
    Y --> N3
    
    %% Module dependencies
    subgraph Modules
        main.py
        config.py
        text_utils.py
        llm.py
        entity_standardization.py
        visualization.py
        prompts.py
    end
    
    %% Phases
    subgraph Phase 1: Triple Extraction
        G
        H
        I
        J
    end
    
    subgraph Phase 2: Entity Standardization
        L
        L1
    end
    
    subgraph Phase 3: Relationship Inference
        N
        N1
        N2
        N3
        N4
        N5
        N6
    end
    
    subgraph Phase 4: Visualization
        O
        P
        Q
        R
        S
        T
        U
    end

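Program flow description

Entry point: the program starts in main.py, which parses the command-line arguments.
Mode selection:
- If the --test flag is given, an example visualization is generated
- Otherwise, the input text file is processed
Configuration: settings are loaded from config.toml using config.py
Text processing:
- Splits the text into overlapping chunks using text_utils.py
- Processes each chunk with the LLM to extract triples
- Uses prompts from prompts.py to guide the LLM's extraction
Entity standardization (optional):
- Standardizes entity names across all triples
- May use the LLM to resolve entities in ambiguous cases
- Uses dedicated prompts from prompts.py for entity resolution
Relationship inference (optional):
- Identifies communities in the graph
- Infers relationships between disconnected communities
- Applies transitive inference and lexical similarity rules
- Uses dedicated prompts from prompts.py for relationship inference
- Deduplicates the triples
Visualization:
- Computes centrality metrics and detects communities
- Determines node sizes and colors based on importance
- Creates an interactive HTML visualization with PyVis
- Customizes the HTML using a template
Output:
- Saves the knowledge graph in HTML and JSON formats
- Displays statistics about nodes, edges, and communities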
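
The transitive inference and deduplication steps of the relationship inference phase can be sketched as follows. This is a single-pass illustration under stated assumptions, not the project's `_apply_transitive_inference`: the function name `apply_transitive` is hypothetical, and treating "enables" as a transitive predicate is an assumption, since the actual rules are not described here.

```python
Triple = tuple[str, str, str]


def apply_transitive(triples: list[Triple], transitive_predicates: set[str]) -> list[Triple]:
    """One pass of transitive inference: (a, p, b) and (b, p, c) yield (a, p, c)."""
    known = set(triples)
    inferred = set()  # a set, so each new triple is recorded only once
    for a, predicate, b in triples:
        if predicate not in transitive_predicates:
            continue
        for b2, predicate2, c in triples:
            if predicate2 == predicate and b2 == b and c != a:
                candidate = (a, predicate, c)
                if candidate not in known:
                    inferred.add(candidate)
    return sorted(inferred)


triples = [
    ("coal", "enables", "steam engine"),
    ("steam engine", "enables", "railways"),
    ("railways", "enables", "trade"),
    ("james watt", "improved", "steam engine"),
]
new_triples = apply_transitive(triples, {"enables"})
for triple in new_triples:
    print(triple)

# Merge and deduplicate while keeping the original order.
merged = list(dict.fromkeys(triples + new_triples))
```

A single pass only chains two hops, so it does not derive ("coal", "enables", "trade") here. Repeating the pass on the merged list until nothing new appears would compute the full transitive closure, at the risk of adding many weak edges.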
