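MarkdownParser User Guide

🎉 New in v2.1: context-aware smart nesting. The parser recognizes container sections (Methodology, Results, Architectures, and so on) and builds the section tree from them. Papers with numbered headings, unnumbered headings, or a mix of both are supported.

Introduction

MarkdownParser parses academic papers in Markdown form. It detects the paper format automatically and extracts the structured content. The design follows the strategy pattern: each supported paper type is handled by its own strategy class.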

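✨ Key features

🤖 Automatic format detection: identifies the paper type (Chinese doctoral thesis, IEEE journal paper, AAAI conference paper, and so on)
📊 Structured parsing: recognizes the paper's hierarchy (chapters, sections, subsections)
🧠 Context-aware nesting: recognizes container sections (Methodology, Results, Architectures, and so on) and builds the tree structure around them
🎯 False-positive filtering: discards algorithm pseudocode, code fragments, and other lines that are not real section headings
🔍 Flexible queries: retrieve content by level, description, or title
📝 Metadata extraction: extracts the title, authors, abstract, and other metadata
✅ Accuracy: 100% format-detection accuracy and a correct tree structure on all 17 test papers
🔧 Extensible: new formats are added as new strategy classes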

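📚 Supported paper formats

Format	Strategy class	Distinguishing features	Test accuracy
Chinese doctoral thesis	`ThesisParsingStrategy`	第X章 (chapter headings), 摘要 (abstract), 参考文献 (references)	✅ 100%
IEEE journal paper	`JournalPaperStrategy`	Roman-numeral numbering (I., II., III.)	✅ 100% (7/7)
AAAI conference paper	`ConferencePaperStrategy`	Abstract, Introduction, Conclusion	✅ 100% (10/10)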

Installation

# Clone the repository or copy the package directly
pip install -e /path/to/Markdown_Spliter

Quick start

Basic usage (automatic format detection)

from Markdown_Spliter.markdown_parser import MarkdownParser

# Detect the paper format automatically and parse the file
parser = MarkdownParser.init_by_path("paper.md")

# Show the detected format
print(f"Detected format: {parser.get_strategy_name()}")

# Get the metadata
metadata = parser.get_metadata()
print(f"Title: {metadata.get('title')}")

# Get the parse tree
parse_tree = parser.get_parse_tree()

# Extract specific sections
abstract = parser.get_section_content(description="abstract")
introduction = parser.get_section_content(description="introduction")

Specifying the format manually

# Force the conference-paper format
parser = MarkdownParser.init_by_path("paper.md", strategy="conference")

# The full strategy class name also works
parser = MarkdownParser.init_by_path("paper.md", strategy="ConferencePaperStrategy")

Debug mode (show detection scores)

# Enable verbose mode to print the confidence score of every strategy
parser = MarkdownParser.init_by_path("paper.md", verbose=True)

# Sample output:
# ThesisParsingStrategy: 0.000
# ConferencePaperStrategy: 0.628
# JournalPaperStrategy: 0.000
# Selected: ConferencePaperStrategy (score: 0.628)

Usage examples

Example 1: Extracting the sections of an IEEE journal paper

from Markdown_Spliter.markdown_parser import MarkdownParser

# Parse an IEEE paper
parser = MarkdownParser.init_by_path("ieee_paper.md")

# Extract each section
introduction = parser.get_section_content(description="introduction")
methodology = parser.get_section_content(description="methodology")
results = parser.get_section_content(description="results")
conclusion = parser.get_section_content(description="conclusion")
references = parser.get_section_content(description="references")

# Save each extracted section to its own file
sections = {
    "introduction": introduction,
    "methodology": methodology,
    "results": results,
    "conclusion": conclusion,
    "references": references
}

for name, content in sections.items():
    if content:
        with open(f"{name}.md", "w", encoding="utf-8") as f:
            f.write(content)
        print(f"Saved: {name}.md ({len(content)} characters)")

Example 2: Batch-processing conference papers

from Markdown_Spliter.markdown_parser import MarkdownParser
import glob

# Process every AAAI paper in the directory
papers = glob.glob("aaai_papers/*/full.md")

for paper_path in papers:
    print(f"Processing: {paper_path}")

    parser = MarkdownParser.init_by_path(paper_path)

    # Extract the abstract and conclusion; either may be missing
    abstract = parser.get_section_content(description="abstract") or ""
    conclusion = parser.get_section_content(description="conclusion") or ""

    # Build the summary file; fall back to a placeholder if no title was extracted
    title = parser.get_metadata().get("title") or "Untitled"
    summary = f"# {title}\n\n"
    summary += f"## Abstract\n\n{abstract}\n\n"
    summary += f"## Conclusion\n\n{conclusion}\n"

    output_file = paper_path.replace("full.md", "summary.md")
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(summary)

Example 3: Inspecting the paper structure

from Markdown_Spliter.markdown_parser import MarkdownParser

parser = MarkdownParser.init_by_path("paper.md", verbose=True)

# Get the parse tree
parse_tree = parser.get_parse_tree()

# Print the paper structure recursively, indenting each nesting level
def print_structure(nodes, depth=0):
    for node in nodes:
        indent = "  " * depth
        print(f"{indent}[{node['description']}] {node['title']}")
        if node.get('children'):
            print_structure(node['children'], depth + 1)

print("Paper structure:")
print_structure(parse_tree)

Sample output (with smart nesting):

Paper structure:
[abstract] Abstract
[introduction] Introduction
[related_work] Related Work
  [subsection] Deep Learning Methods
  [subsection] Reinforcement Learning
[methodology] Methodology
  [subsection] Dataset Preparation
  [subsection] Model Architecture
  [subsection] Training Procedure
[results] Experiments
  [subsection] Experimental Setup
  [subsection] Main Results
  [subsection] Ablation Studies
[conclusion] Conclusion
[references] References

Example 4: Listing all subsections of a container section

from Markdown_Spliter.markdown_parser import MarkdownParser

parser = MarkdownParser.init_by_path("paper.md")
parse_tree = parser.get_parse_tree()

# Find the Methodology section (depth-first) so its subsections can be listed
def find_section(nodes, description):
    for node in nodes:
        if node['description'] == description:
            return node
        if node.get('children'):
            result = find_section(node['children'], description)
            if result:
                return result
    return None

methodology = find_section(parse_tree, 'methodology')
if methodology:
    print(f"Section: {methodology['title']}")
    print(f"Number of subsections: {len(methodology.get('children', []))}")
    print("\nSubsections:")
    for child in methodology.get('children', []):
        print(f"  - {child['title']}")

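API reference

MarkdownParser

Initialization

# Initialize from a file
MarkdownParser.init_by_path(path, strategy=None, verbose=False)

# Initialize from a string
MarkdownParser.init_by_path(content, strategy=None, verbose=False)

Parameters:

path/content: file path or content string
strategy: optional; forces a specific strategy ("thesis", "conference", "journal")
verbose: whether to print details of the detection process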
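
The strategy argument is shown in this guide with both a short name ("conference") and a full class name ("ConferencePaperStrategy"). The following is a minimal sketch of how such a lookup could work. The alias table and the function resolve_strategy_name are hypothetical illustrations, not part of the library's API.

```python
# Hypothetical alias table. The short names and class names come from this
# guide; how the library resolves them internally is an assumption.
STRATEGY_ALIASES = {
    "thesis": "ThesisParsingStrategy",
    "conference": "ConferencePaperStrategy",
    "journal": "JournalPaperStrategy",
}


def resolve_strategy_name(strategy):
    """Map a short alias or a full class name to the full class name.

    Returns None when no strategy is given, meaning "detect automatically".
    """
    if strategy is None:
        return None
    if strategy in STRATEGY_ALIASES:
        return STRATEGY_ALIASES[strategy]
    if strategy in STRATEGY_ALIASES.values():
        return strategy
    raise ValueError(f"Unknown strategy: {strategy!r}")


print(resolve_strategy_name("conference"))
print(resolve_strategy_name("JournalPaperStrategy"))
```

Raising on an unknown name, rather than silently falling back to auto-detection, makes a typo in the strategy argument visible immediately.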

Core methods

# Get the metadata
metadata = parser.get_metadata()

# Get the parse tree
parse_tree = parser.get_parse_tree()

# Extract specific sections
content = parser.get_section_content(**kwargs)

# Get the name of the strategy in use
strategy_name = parser.get_strategy_name()

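get_section_content query parameters

Parameter	Type	Description	Example
`level`	int	Heading level (1 = chapter, 2 = section, 3 = subsection)	`level=1`
`description`	str	Section type	`description="introduction"`
`title`	str	Heading text (partial match)	`title="研究背景"`
`exact_title`	str	Heading text (exact match)	`exact_title="第一章"`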
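
To make the four filters concrete, here is a small sketch that applies the same kinds of conditions to a tree of plain dictionaries. It is not the library's implementation: find_nodes and iter_nodes are hypothetical helpers, and the "level" key on each node is an assumption (the examples in this guide only show "description", "title", and "children").

```python
def iter_nodes(nodes):
    """Yield every node in the tree, depth-first."""
    for node in nodes:
        yield node
        yield from iter_nodes(node.get("children", []))


def find_nodes(nodes, level=None, description=None, title=None, exact_title=None):
    """Return the nodes that satisfy every filter that was supplied."""
    matches = []
    for node in iter_nodes(nodes):
        if level is not None and node.get("level") != level:
            continue
        if description is not None and node.get("description") != description:
            continue
        if title is not None and title not in node.get("title", ""):
            continue  # partial match: the filter must be a substring of the title
        if exact_title is not None and node.get("title") != exact_title:
            continue  # exact match: the whole title must be equal
        matches.append(node)
    return matches


# Hand-written sample tree in the shape of a Chinese thesis (illustrative data).
sample_tree = [
    {"level": 1, "description": "chapter", "title": "第一章 绪论", "children": [
        {"level": 2, "description": "section", "title": "1.1 研究背景", "children": []},
    ]},
    {"level": 1, "description": "references", "title": "参考文献", "children": []},
]

print([n["title"] for n in find_nodes(sample_tree, title="研究背景")])
print([n["title"] for n in find_nodes(sample_tree, level=1)])
```

Note the difference between the last two filters in this sketch: title="第一章" would match "第一章 绪论", while exact_title="第一章" would not.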

Supported description values

Chinese doctoral thesis:

chapter - chapter (章)
section - section (节)
clause - clause (条)
abstract_ch - Chinese abstract
abstract_en - English abstract
references - references
table_of_contents - table of contents

IEEE journal paper:

abstract
introduction
related_work
methodology
results
discussion
conclusion
references
subsection - child section

AAAI conference paper:

abstract
introduction
related_work
preliminaries
methodology
results - results or experiments
discussion
conclusion
references
acknowledgment
numbered_section - numbered major section (for example "3 Title", "4 Title")
container_section - generic container section (for example Architectures, Models, Baselines)
subsection - child section (nested automatically under a container section)
limitations - limitations and future work

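Architecture

Strategy pattern

The project follows the strategy pattern: each paper format is handled by its own strategy class.

MarkdownParser (context)
    ↓ uses
ParsingStrategy (abstract strategy)
    ↓ implemented by
├── ThesisParsingStrategy (Chinese doctoral thesis)
├── ConferencePaperStrategy (conference paper)
└── JournalPaperStrategy (journal paper)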
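
The relationship in the diagram can be illustrated with a stripped-down, self-contained sketch. The classes below (SketchStrategy, RomanNumeralStrategy, ChapterStrategy, SketchParser) are hypothetical stand-ins written for this illustration. They are not the library's ParsingStrategy or MarkdownParser classes, and the regular expressions are simplified assumptions.

```python
import re
from abc import ABC, abstractmethod


class SketchStrategy(ABC):
    """Simplified stand-in for an abstract parsing strategy."""

    @abstractmethod
    def heading_pattern(self) -> str:
        """Return a regex that matches a top-level heading for this format."""

    def headings(self, text):
        pattern = re.compile(self.heading_pattern())
        return [line.strip() for line in text.splitlines()
                if pattern.match(line.strip())]


class RomanNumeralStrategy(SketchStrategy):
    """Journal-style headings such as 'I. INTRODUCTION'."""

    def heading_pattern(self):
        return r"^[IVX]+\.\s+\S"


class ChapterStrategy(SketchStrategy):
    """Thesis-style headings such as '第一章 绪论'."""

    def heading_pattern(self):
        return r"^第.+章"


class SketchParser:
    """Context object: delegates heading extraction to the strategy it holds."""

    def __init__(self, strategy):
        self.strategy = strategy

    def headings(self, text):
        return self.strategy.headings(text)


text = "I. INTRODUCTION\nSome body text.\nII. METHOD\n第一章 绪论\n"
print(SketchParser(RomanNumeralStrategy()).headings(text))
print(SketchParser(ChapterStrategy()).headings(text))
```

The context never branches on the paper type. Swapping the strategy object is the only change needed to support a different format.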

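Format detection algorithm

Format detection uses a weighted, multi-feature score:

Score = Σ(required_patterns × weights)      # required patterns
      + Σ(optional_patterns × weights)      # optional patterns
      + Σ(structural_hints × weights)       # structural features
      - exclusion_penalty                   # exclusion patterns

The strategy with the highest score is selected, provided that score exceeds the threshold (0.3 by default).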
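
The scoring rule can be sketched as follows. This is an illustration under stated assumptions, not the library's detector: the functions detection_score and select_strategy are hypothetical, the "exclusion_patterns" key name is assumed, and dividing by the total positive weight (so that scores fall between 0 and 1) is an assumption made to stay consistent with the sample scores and the 0.3 threshold shown in this guide.

```python
import re

POSITIVE_KEYS = ("required_patterns", "optional_patterns", "structural_hints")


def detection_score(text, features):
    """Weighted score for one strategy; features map keys to (regex, weight) lists."""

    def matched_weight(key):
        return sum(weight for pattern, weight in features.get(key, [])
                   if re.search(pattern, text, re.MULTILINE))

    total = sum(weight for key in POSITIVE_KEYS
                for _, weight in features.get(key, []))
    if total == 0:
        return 0.0
    raw = sum(matched_weight(key) for key in POSITIVE_KEYS)
    raw -= matched_weight("exclusion_patterns")  # penalty for patterns that rule a format out
    return max(0.0, raw / total)  # assumed normalization to the 0..1 range


def select_strategy(text, candidates, threshold=0.3):
    """Return (best name or None, all scores); the best score must exceed the threshold."""
    scores = {name: detection_score(text, feats) for name, feats in candidates.items()}
    best = max(scores, key=scores.get)
    return (best if scores[best] > threshold else None), scores


# Illustrative feature sets, not the ones shipped with the project.
candidates = {
    "journal": {
        "required_patterns": [(r"^I\.\s+INTRODUCTION", 5.0)],
        "optional_patterns": [(r"^II\.\s", 3.0)],
        "exclusion_patterns": [(r"^第.+章", 5.0)],
    },
    "thesis": {
        "required_patterns": [(r"^第.+章", 5.0)],
        "optional_patterns": [(r"参考文献", 3.0)],
    },
}

sample = "I. INTRODUCTION\n...\nII. RELATED WORK\n"
chosen, scores = select_strategy(sample, candidates)
print(chosen, scores)
```

When no candidate exceeds the threshold, this sketch returns None for the name, leaving the fallback behavior to the caller.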

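Smart nesting

The conference-paper parser uses a context-aware nesting algorithm to build the tree structure:

Container section recognition

The following sections are treated as containers; the non-major sections that follow a container are nested under it as child nodes:

Standard containers: Related Work, Preliminaries, Methodology, Results, Experiments, Evaluation, Discussion, Acknowledgments
Generic containers: sections whose title contains one of the keywords "Architecture", "Model", "Framework", "System", "Dataset", "Baseline", "Setting"

Nesting rules

# Example 1: paper without numbered headings
[methodology] Methodology              # container section
  [subsection] Dataset Preparation     # nested automatically
  [subsection] Model Architecture      # nested automatically
  [subsection] Training Details        # nested automatically
[results] Experiments                  # new container; closes Methodology
  [subsection] Experimental Setup      # nested under Experiments

# Example 2: paper with numbered headings
[introduction] 1 Introduction          # major section
[related_work] 2 Related Work          # container section
  [subsection] Deep Learning           # nested (no number = subsection)
  [subsection] Reinforcement Learning  # nested (no number = subsection)
[numbered_section] 3 Our Approach      # numbered = major section; closes the container
  [subsection] 3.1 Model Design        # 3.1 is a child of 3
  [subsection] 3.2 Training Strategy   # 3.2 is a child of 3

# Example 3: generic container
[container_section] Architectures      # generic container
  [subsection] Mamba                   # nested automatically
  [subsection] RWKV                    # nested automatically
  [subsection] Transformer             # nested automatically
[methodology] Methodology              # major section; closes the container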
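
The nesting rules above can be approximated with a single pass over the headings. The sketch below assumes the headings have already been classified into (description, title) pairs; build_tree and the NESTABLE set are hypothetical and only mirror the containers named in this guide, so the actual parser may apply additional context.

```python
# Descriptions assumed to accept nested subsections. The set mirrors the
# containers named in this guide, plus numbered sections; the real set may differ.
NESTABLE = {
    "related_work", "preliminaries", "methodology", "results",
    "discussion", "acknowledgment", "container_section", "numbered_section",
}


def build_tree(headings):
    """Turn (description, title) pairs, in document order, into a nested tree."""
    tree = []
    open_container = None  # most recent top-level node that can take children
    for description, title in headings:
        node = {"description": description, "title": title, "children": []}
        if description == "subsection" and open_container is not None:
            open_container["children"].append(node)
            continue
        # Otherwise the heading stays at the top level and closes the previous container.
        tree.append(node)
        open_container = node if description in NESTABLE else None
    return tree


headings = [
    ("introduction", "1 Introduction"),
    ("related_work", "2 Related Work"),
    ("subsection", "Deep Learning"),
    ("subsection", "Reinforcement Learning"),
    ("numbered_section", "3 Our Approach"),
    ("subsection", "3.1 Model Design"),
    ("subsection", "3.2 Training Strategy"),
]

tree = build_tree(headings)
for node in tree:
    print(node["title"], [child["title"] for child in node["children"]])
```

The input reproduces Example 2: the two unnumbered headings end up under "2 Related Work", and "3 Our Approach" closes that container before collecting 3.1 and 3.2.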

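False-positive filtering

The following kinds of lines are filtered out because they are not real section headings:

Algorithm pseudocode: while ... do, for ... do, Algorithm 1:, Input:, Output:
Mathematical elements: Theorem 1:, Lemma 2:, Proof.
Code-like fragments: call the method, compare the results, identical attributes except for...
Single characters or very short titles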
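
A filter of this kind can be sketched with a handful of regular expressions. The function looks_like_heading, the pattern list, and the minimum length of 3 are assumptions chosen for illustration; they are not copied from the project's _is_valid_heading() method, and the code-like fragments listed above are not covered here.

```python
import re

# Assumed patterns covering the pseudocode and math examples listed above.
FALSE_POSITIVE_PATTERNS = [
    r"^(while|for)\b.*\bdo$",      # pseudocode loops: "while ... do", "for ... do"
    r"^Algorithm\s+\d+\s*:",       # "Algorithm 1:"
    r"^(Input|Output)\s*:",        # "Input:", "Output:"
    r"^(Theorem|Lemma)\s+\d+\s*:", # "Theorem 1:", "Lemma 2:"
    r"^Proof\.",                   # "Proof."
]
MIN_TITLE_LENGTH = 3  # assumed cutoff for "very short" titles


def looks_like_heading(title):
    """Return False for candidate headings that match a known false-positive shape."""
    title = title.strip()
    if len(title) < MIN_TITLE_LENGTH:
        return False
    return not any(re.search(pattern, title, re.IGNORECASE)
                   for pattern in FALSE_POSITIVE_PATTERNS)


for candidate in ["Methodology", "while i < n do", "Algorithm 1: Training", "A"]:
    print(f"{candidate!r}: {looks_like_heading(candidate)}")
```

The word boundary after "for" keeps legitimate titles such as "Forward Pass" from being rejected by the loop pattern.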

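Testing

Run the full test suite:

# Test all sample papers (17 in total)
python test_all_papers.py

# View the detailed report
cat paper_parsing_report.md

Test results:

✅ IEEE journal papers: 7/7 (100%)
✅ AAAI conference papers: 10/10 (100%)
✅ Overall accuracy: 17/17 (100%)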

Adding a new format

Support for a new paper format takes two steps:

1. Create a strategy class

# Markdown_Spliter/strategies/your_format.py
from .base import ParsingStrategy

class YourFormatStrategy(ParsingStrategy):
    def _define_keywords_config(self):
        return [
            {"keyword": r"Abstract", "description": "abstract", "level": 1},
            # ... more entries
        ]

    def get_detection_features(self):
        return {
            'required_patterns': [
                (r"pattern1", 5.0),
            ],
            'optional_patterns': [
                (r"pattern2", 3.0),
            ],
            # ...
        }

2. Register the strategy

# Markdown_Spliter/markdown_parser.py
from .strategies import YourFormatStrategy

StrategyFactory.register_strategy(YourFormatStrategy)

For a detailed walkthrough, see STRATEGY_GUIDE.md.

Project structure

Markdown_Spliter/
├── __init__.py
├── markdown_parser.py          # main parser
├── config.py                   # legacy configuration (kept for compatibility)
└── strategies/
    ├── __init__.py
    ├── base.py                 # abstract base class
    ├── factory.py              # strategy factory
    ├── thesis_strategy.py      # doctoral thesis strategy
    ├── conference_paper.py     # conference paper strategy
    └── journal_paper.py        # journal paper strategy

Example_Articles/               # sample papers
├── MinerU_markdown_*.md        # IEEE papers
└── aaai/*/full.md              # AAAI papers

test_all_papers.py             # full test script
paper_parsing_report.md        # test report
STRATEGY_GUIDE.md              # extension guide
CLAUDE.md                      # project architecture notes

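FAQ

Q: How do I find out which format a paper was detected as?

parser = MarkdownParser.init_by_path("paper.md", verbose=True)
# or
print(parser.get_strategy_name())

Q: What if the format is detected incorrectly?

Specify the format manually:

parser = MarkdownParser.init_by_path("paper.md", strategy="conference")

Q: How do I add a new section type?

See the _define_keywords_config() method in strategies/conference_paper.py or strategies/journal_paper.py.

Q: Why can't some sections be extracted?

Run with verbose=True and check the detection result
Check that the section heading matches a pattern in keywords_config
Inspect parse_tree to confirm the section was parsed correctly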
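
For the last check, it can help to see which description labels actually occur in the tree before querying for one. The helper below is a hypothetical diagnostic, not part of the library. It assumes the node shape used in the examples of this guide (dictionaries with "description", "title", and an optional "children" list) and runs here on a hand-written sample tree.

```python
from collections import Counter


def count_descriptions(nodes):
    """Count how often each description label occurs anywhere in the tree."""
    counts = Counter()
    for node in nodes:
        counts[node["description"]] += 1
        counts.update(count_descriptions(node.get("children", [])))
    return counts


# Hand-written sample; with the real parser this would be parser.get_parse_tree().
sample_tree = [
    {"description": "abstract", "title": "Abstract", "children": []},
    {"description": "methodology", "title": "Methodology", "children": [
        {"description": "subsection", "title": "Dataset Preparation", "children": []},
        {"description": "subsection", "title": "Model Architecture", "children": []},
    ]},
]

counts = count_descriptions(sample_tree)
for description, count in sorted(counts.items()):
    print(f"{description}: {count}")
```

If the label you are querying (for example "conclusion") has a count of zero, the heading was either classified under a different description or not recognized at all.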

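Q: How do I check whether a section has subsections?

# find_section is the helper defined in Example 4
node = find_section(parse_tree, 'methodology')
if node and node.get('children'):
    print(f"Has {len(node['children'])} subsections")
else:
    print("No subsections")

Q: Why were some headings not recognized as sections?

They may have been removed by the false-positive filter, which discards:

Algorithm pseudocode (while ... do, Algorithm 1:)
Code-like fragments (call the method, compare results)
Mathematical elements (Theorem 1:, Proof.)
Very short or meaningless titles

If you do need these sections, modify the _is_valid_heading() method in conference_paper.py.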

Performance

Parse time per paper: < 1 second
Memory usage: depends on paper size, typically < 100 MB
Paper size: no fixed limit (tested with papers larger than 10 MB)

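Version history

v2.1 (2025-12-31)

🧠 Context-aware smart nesting: recognizes container sections (Methodology, Results, Architectures, and so on) and builds the correct tree hierarchy
🎯 False-positive filtering: discards algorithm pseudocode, code fragments, mathematical elements, and other non-heading lines
🔧 Generic container support: recognizes generic container sections such as Architectures, Models, Baselines, and Datasets
📊 Improved handling of Related Work and Preliminaries: subsections are recognized and nested correctly
✅ Acknowledgments nesting: supports child nodes such as Disclosure of funding and Author contributions
🎯 Correct handling of numbered and unnumbered papers: distinguishes "3 Title" (a major section) from "Title" (a possible subsection)
✅ Correct tree structure: all 17 test papers reach 100% accuracy with the expected hierarchy

v2.0 (2025-12-31)

✨ New strategy-pattern architecture
✨ Support for the IEEE journal paper format
✨ Support for the AAAI conference paper format
✨ Automatic format detection
✨ 100% test accuracy (17/17 papers)
📝 Expanded documentation and examples

v1.0

Support for the Chinese doctoral thesis format
Basic parsing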

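Contributing

Bug reports and suggestions for improvement are welcome.

How to contribute

Fork the project
Create a feature branch
Commit your changes
Push to the branch
Open a Pull Request

License

MIT License

Contact

For questions or suggestions, please open an issue.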

Repository: Robbings/BUAA_Markdown_Paper_Parser