This repository was archived by the owner on Dec 28, 2025. It is now read-only.

feat(llm): BREAKING CHANGE update keyword extraction method #32

Closed

Gfreely wants to merge 48 commits into hugegraph:main from Gfreely:TextRank-fix

Conversation

@Gfreely Gfreely (Collaborator) commented Jul 10, 2025

Support TextRank, and update the UI to allow changing the keyword extraction method.

Main changes:
Added options to the RAG interface for selecting the keyword extraction method, along with several settings for TextRank, such as the number of keywords to extract, the size of the sliding window, and the processing language ('en' for English, 'zh' for Chinese).

A 'TextRank mask words' setting has also been added: users can manually enter phrases composed of letters and symbols to prevent them from being split during Chinese word segmentation.

Summary by CodeRabbit

  • New features

    • Added a multilingual TextRank extractor and a hybrid (LLM+TextRank) weighted extraction strategy, with runtime selection of extraction mode, window size, and weight configuration.
    • Enhanced the NLTK resource validation interface, supporting automatic detection and optional download.
  • Behavior changes

    • Keyword output now carries an importance score (0.00–1.00), sorted in descending order, using the KEYWORDS:keyword:score format; language is normalized at runtime.
  • Documentation

    • Updated prompt templates and configuration docs; added KEYWORD_EXTRACT_TYPE, WINDOW_SIZE, HYBRID_LLM_WEIGHTS.
  • Chores

    • Added runtime dependencies scipy and python-igraph; adjusted ignore rules to keep the examples and data directories.
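As a rough illustration of the KEYWORDS:keyword:score output format described above, a response parser might look like the sketch below (hypothetical code; parse_keyword_response and its exact rules are illustrative, not the PR's actual implementation):

```python
from typing import Dict

def parse_keyword_response(response: str) -> Dict[str, float]:
    """Parse 'KEYWORDS:graph:0.95,database:0.80' into {keyword: score}.

    Illustrative sketch only -- the PR's real parser may differ.
    """
    prefix = "KEYWORDS:"
    if not response.startswith(prefix):
        return {}
    result: Dict[str, float] = {}
    for entry in response[len(prefix):].split(","):
        # Split on the LAST colon so keywords containing symbols survive intact
        keyword, _, score = entry.strip().rpartition(":")
        if keyword:
            try:
                result[keyword] = float(score)
            except ValueError:
                continue
    # Sort by score, descending, as the prompt template requires
    return dict(sorted(result.items(), key=lambda kv: kv[1], reverse=True))
```

For example, parse_keyword_response("KEYWORDS:graph:0.95,database:0.80") yields {"graph": 0.95, "database": 0.80}.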

Gfreely added 4 commits June 27, 2025 15:58
fix apache#224, update the UI to support changing the keyword extraction method
fix the pylint check bug

coderabbitai bot commented Jul 10, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds a multilingual TextRank implementation and switches keyword extraction to three modes (llm/textrank/hybrid); extraction results change from a set to a keyword→score mapping. Adjusts the RAG pipeline signature and call sites, prompt templates, and config options; strengthens NLTK resource checks; adds scipy and python-igraph dependencies; and makes minor tweaks to several imports and .gitignore.

Changes

Cohort / File(s) Change Summary
Dependency and constraint updates
pyproject.toml, hugegraph-llm/pyproject.toml
Added scipy~=1.15.3 and python-igraph~=0.11.9 to the tool-level constraint dependencies.
Keyword extraction refactor
src/.../operators/llm_op/keyword_extract.py
KeywordExtract drops the language constructor parameter in favor of config-driven language; supports three modes (llm/textrank/hybrid); adds _extract_with_llm, _extract_with_textrank, and _extract_with_hybrid returning Dict[keyword→score]; rewrites LLM response parsing and timing/error handling.
Document-level TextRank module
src/.../operators/document_op/textrank_word_extract.py
New MultiLingualTextRank: masking, preprocessing, co-occurrence graph construction, and PageRank ranking; supports Chinese and English segmentation; outputs keyword→score (depends on igraph/jieba/nltk, etc.).
RAG pipeline interface changes
src/.../operators/graph_rag_task.py
RAGPipeline.extract_keywords signature drops language and max_keywords; internal calls and docstring updated for the new return format; imports huge_settings/prompt.
Prompt template updates
src/.../config/prompt_config.py
English/Chinese keyword-extraction prompts now require a 0.0–1.0 score per keyword, sorted in descending order; the output format becomes KEYWORDS:keyword:score,... (two decimal places), with normalized synonym and whitespace/punctuation rules.
Configuration extensions
src/.../config/llm_config.py, config.md
Added keyword_extract_type: Literal["llm","textrank","hybrid"] (default "llm"), window_size (default 3), and hybrid_llm_weights (default 0.5); documentation updated accordingly.
NLTK helper improvements
src/.../operators/common_op/nltk_helper.py
stopwords gains path registration, a download attempt, and error capture with an empty-list fallback on failure; new check_nltk_data validates and optionally downloads punkt/tagger resources, with logging.
Word extraction interface simplification
src/.../operators/document_op/word_extract.py
Removed the language constructor parameter and context language passing; language now comes from llm_settings.language, normalized to chinese/english.
Minor demo/API formatting
src/.../demo/rag_demo/app.py, src/.../demo/rag_demo/rag_block.py, src/.../api/rag_api.py, src/.../config/models/base_prompt_config.py
Import-order, blank-line, and minor formatting adjustments only; no behavior or signature changes.
Resource ignore-rule adjustments
hugegraph-llm/.gitignore
Adjusted un-ignore entries to keep resources/demo/, resources/prompt_examples/, resources/nltk_data/corpora/stopwords/, and similar paths.
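The three new config fields listed above can be pictured with the following sketch (field names and defaults are taken from the table; the class name is hypothetical — the real fields live in llm_config.py):

```python
from typing import Literal

class KeywordExtractSettings:
    """Hypothetical mirror of the new llm_config.py fields."""
    keyword_extract_type: Literal["llm", "textrank", "hybrid"] = "llm"
    window_size: int = 3             # sliding-window size for co-occurrence
    hybrid_llm_weights: float = 0.5  # LLM weight when fusing hybrid scores
```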

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User
  participant RAG as RAGPipeline.extract_keywords
  participant KE as KeywordExtract.run
  participant LLM as LLM Backend
  participant TR as MultiLingualTextRank

  User->>RAG: submit text for extraction
  RAG->>KE: call KeywordExtract.run()
  alt keyword_extract_type == "llm"
    KE->>LLM: send prompt, await response
    LLM-->>KE: return "KEYWORDS:kw:score,..." string
    KE->>KE: parse into {kw:score}
  else keyword_extract_type == "textrank"
    KE->>TR: call TR.extract_keywords(text)
    TR-->>KE: return {kw:score}
  else keyword_extract_type == "hybrid"
    par LLM path
      KE->>LLM: send prompt
      LLM-->>KE: return {kw:score}
    and TextRank path
      KE->>TR: TR.extract_keywords(text)
      TR-->>KE: return {kw:score}
    end
    KE->>KE: fuse by hybrid_llm_weights and sort
  end
  KE-->>RAG: return keyword→score mapping
  RAG-->>User: return result
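The hybrid branch above merges the two score maps using hybrid_llm_weights. A minimal sketch of that weighted fusion (illustrative; fuse_scores is a hypothetical name and the PR's actual fusion code may differ):

```python
from typing import Dict

def fuse_scores(llm: Dict[str, float], textrank: Dict[str, float],
                llm_weight: float = 0.5) -> Dict[str, float]:
    """Weighted fusion: llm_weight * llm_score + (1 - llm_weight) * tr_score.

    A keyword missing from one source contributes 0 from that source.
    """
    merged = {}
    for kw in set(llm) | set(textrank):
        merged[kw] = llm_weight * llm.get(kw, 0.0) + (1 - llm_weight) * textrank.get(kw, 0.0)
    # Sort descending, matching the pipeline's output ordering
    return dict(sorted(merged.items(), key=lambda kv: kv[1], reverse=True))
```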

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

A little code rabbit, leaping through the sea of words,
TextRank with LLM, scores all lined up in rows,
NLTK guards the road, dependencies in place,
KEYWORDS to two decimals, hybrid weights to tune,
A flick of my whiskers — extraction just got smarter 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 44.44% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The pull request title feat(llm): BREAKING CHANGE update keyword extraction method clearly points to the core of the changeset. The commit summary shows a major refactor of keyword extraction: the language and max_keywords parameters are removed, a MultiLingualTextRank class is added, three extraction methods (llm, textrank, hybrid) are supported, config fields are updated, and prompts are revised. The title accurately summarizes this main change and appropriately flags it as breaking via the BREAKING CHANGE label, fully matching the PR's stated goals.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment


@github-actions

@codecov-ai-reviewer review


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (2)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)

257-259: Consider relaxing the window-size check.

A meaningful co-occurrence graph can still be built when the number of unique words is below the window size. Suggest adjusting the condition.

-if len(unique_words) < self.window:
+if len(unique_words) < 2:
     return
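To see why two unique words already suffice, here is a minimal sketch of sliding-window co-occurrence counting (hypothetical helper, not the PR's code): even when the vocabulary is smaller than the window, pairs still form.

```python
from collections import Counter

def cooccurrence_edges(words, window=3):
    """Count co-occurring word pairs within a sliding window."""
    edges = Counter()
    for i in range(len(words)):
        for j in range(i + 1, min(i + window, len(words))):
            if words[i] != words[j]:
                edges[tuple(sorted((words[i], words[j])))] += 1
    return edges
```

With only two unique words, cooccurrence_edges(["a", "b", "a"], window=3) still yields the edge ("a", "b") with weight 2, so a graph can be built.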
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py (1)

297-302: Suggest improving the hint text of the mask_words input box.

The current hint is long; consider simplifying it and providing examples.

 mask_words_input = gr.Textbox(
     label="TextRank mask words",
-    info="""Enter any words you want to protect from being split during Chinese word segmentation(e.g., C++, website URLs). Separate each entry with a comma.""",
+    info="Protected words that will not be split during Chinese word segmentation. Comma-separated. Example: C++,GitHub,ChatGPT",
     show_copy_button=True,
     lines=7,
 )
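The mask-words idea can be sketched as a simple placeholder substitution applied before segmentation; the PR's real implementation works through jieba's dictionary, so mask_phrases below is purely hypothetical:

```python
def mask_phrases(text, mask_words):
    """Replace protected phrases with placeholder tokens so a tokenizer
    cannot split them; the returned table allows restoring them later."""
    placeholders = {}
    for i, word in enumerate(filter(None, (mask_words or "").split(","))):
        token = f"MASK{i}X"  # placeholder unlikely to occur in real text
        placeholders[token] = word.strip()
        text = text.replace(word.strip(), token)
    return text, placeholders
```

For instance, mask_phrases("I love C++ and GitHub", "C++,GitHub") returns ("I love MASK0X and MASK1X", ...); note the (mask_words or "") guard, which also covers the None input case flagged elsewhere in this review.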
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8280589 and a4180ea.

📒 Files selected for processing (4)
  • hugegraph-llm/pyproject.toml (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py (11 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (5 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.
hugegraph-llm/pyproject.toml (1)
Learnt from: cgwer
PR: hugegraph/hugegraph-ai#10
File: hugegraph-python-client/pyproject.toml:0-0
Timestamp: 2025-05-27T06:55:13.779Z
Learning: The hugegraph-python-client is a component within the hugegraph-ai project repository (apache/incubator-hugegraph-ai), not a standalone repository. When reviewing project URLs in pyproject.toml files within this project, they should point to the main hugegraph-ai repository.
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (2)
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:100-116
Timestamp: 2025-06-25T09:45:10.751Z
Learning: In hugegraph-llm BasePromptConfig class, llm_settings is a runtime property that is loaded from config through dependency injection during object initialization, not a static class attribute. Static analysis tools may flag this as missing but it's intentional design.
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
🧬 Code Graph Analysis (1)
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py (4)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (4)
  • RAGPipeline (38-267)
  • extract_keywords (68-100)
  • keywords_to_vid (106-130)
  • import_schema (102-104)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/answer_synthesize.py (1)
  • AnswerSynthesize (36-259)
hugegraph-llm/src/hugegraph_llm/utils/decorators.py (1)
  • with_task_id (109-123)
hugegraph-llm/src/hugegraph_llm/operators/gremlin_generate_task.py (1)
  • import_schema (46-55)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build (3.11)
  • GitHub Check: build (3.10)
🔇 Additional comments (4)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)

68-100: The implementation is correct!

The new TextRank-related parameters are sensibly designed, with clear documentation. Normalizing the language code from "english" to "en" is a good improvement.

hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)

281-301: The main TextRank method is solid!

The full keyword-extraction flow is implemented sensibly and includes appropriate boundary-condition checks.

hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py (2)

83-90: Parameter passing is correct!

The new TextRank-related parameters are correctly forwarded to the extract_keywords method.


431-460: The batch function is updated correctly!

The new parameters are passed consistently within the batch-processing function.

Comment on lines 73 to 85
if self._extract_method == "TextRank":
# Extract keywords with TextRank
keywords = self._extract_with_textrank()
else:
# Extract keywords with the LLM
keywords = self._extract_with_llm()
keywords = {k.replace("'", "") for k in keywords}
context["keywords"] = list(keywords)[:self._max_keywords]
log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])

# extracting keywords & expanding synonyms increase the call count by 1
context["call_count"] = context.get("call_count", 0) + 1
return context
@coderabbitai coderabbitai bot Jul 10, 2025

🛠️ Refactor suggestion

Consider adding error handling to the LLM extraction path as well.

Currently only the TextRank path has try/except handling; adding similar error handling to the LLM path would improve the system's robustness.

 if self._extract_method == "TextRank":
     # Extract keywords with TextRank
     keywords = self._extract_with_textrank()
 else:
     # Extract keywords with the LLM
-    keywords = self._extract_with_llm()
+    try:
+        keywords = self._extract_with_llm()
+    except Exception as e:
+        log.error("LLM Keyword extraction error: %s", e)
+        keywords = set()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if self._extract_method == "TextRank":
# Extract keywords with TextRank
keywords = self._extract_with_textrank()
else:
# Extract keywords with the LLM
keywords = self._extract_with_llm()
keywords = {k.replace("'", "") for k in keywords}
context["keywords"] = list(keywords)[:self._max_keywords]
log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
# extracting keywords & expanding synonyms increase the call count by 1
context["call_count"] = context.get("call_count", 0) + 1
return context
if self._extract_method == "TextRank":
# Extract keywords with TextRank
keywords = self._extract_with_textrank()
else:
# Extract keywords with the LLM
try:
keywords = self._extract_with_llm()
except Exception as e:
log.error("LLM Keyword extraction error: %s", e)
keywords = set()
keywords = {k.replace("'", "") for k in keywords}
context["keywords"] = list(keywords)[:self._max_keywords]
log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
# extracting keywords & expanding synonyms increase the call count by 1
context["call_count"] = context.get("call_count", 0) + 1
return context
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py around
lines 73 to 85, the LLM keyword extraction method lacks error handling unlike
the TextRank method. Add a try-except block around the call to
self._extract_with_llm() to catch and handle any exceptions, ensuring the system
remains robust. Log the error appropriately and decide on a fallback or empty
keyword set if an exception occurs.

Gfreely (Collaborator, Author)

The self.llm.generate method is also called elsewhere in the project and is already sufficiently robust, so it is not necessary to add error handling.

coderabbitai bot

@Gfreely, understood! Thank you for the clarification. You have better context about the project's LLM implementation and its robustness across the codebase. I'll defer to your technical judgment on this matter.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)

73-85: The LLM path still lacks error handling.

Per the earlier review, the TextRank path now has error handling, but the LLM path still lacks a corresponding mechanism.

 if self._extract_method == "TextRank":
     # Extract keywords with TextRank
     keywords = self._extract_with_textrank()
 else:
     # Extract keywords with the LLM
-    keywords = self._extract_with_llm()
+    try:
+        keywords = self._extract_with_llm()
+    except Exception as e:
+        log.error("LLM Keyword extraction error: %s", e)
+        keywords = set()
🧹 Nitpick comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)

263-264: The graph node-count check may be problematic.

Returning early when the number of unique words is below window_size may be too strict; a meaningful graph can still be built even with few words.

-if len(unique_words) < self.window:
-    return
+if len(unique_words) < 2:
+    return
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a4180ea and 98471a4.

📒 Files selected for processing (1)
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4 hunks)
🧰 Additional context used
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build (3.10)
  • GitHub Check: build (3.11)
🔇 Additional comments (4)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4)

153-164: Stopword loading now has proper error handling.

Per the earlier review, stopword-file loading is now wrapped in try/except and gracefully handles a missing file.


199-200: The regular-expression pattern is fixed.

The regex syntax error mentioned earlier is resolved; (?![a-zA-Z0-9]) is now correctly used as a negative lookahead assertion.


292-294: The post-preprocessing check is sound.

If the preprocessed word list is empty, an empty list is returned immediately, avoiding downstream errors.


300-301: The graph-state check is thorough.

The graph is verified to exist and contain nodes before running PageRank, preventing algorithm failures.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)

139-142: Constructor parameters lack default values

Per an earlier review, the constructor parameters should provide default values to improve robustness.

-    def __init__(self, keyword_num=5, window_size=5, mask_words=""):
+    def __init__(self, keyword_num=5, window_size=5, mask_words=""):

Note: this was flagged in a previous review but still persists in the code.


168: The mask_words parameter needs a null check

Per an earlier review, calling split() when mask_words is None raises an exception.

-        self.mask_words = list(filter(None, mask_words.split(',')))
+        self.mask_words = list(filter(None, (mask_words or "").split(',')))

Note: this was flagged in a previous review but still persists in the code.

🧹 Nitpick comments (3)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (3)

170-253: The text preprocessing logic is overly complex; consider refactoring

The _preprocess method mixes intricate Chinese and English text handling. Splitting it into smaller, dedicated methods would improve readability and maintainability.

Suggested structure:

def _preprocess(self, text, lang):
    if lang.startswith('zh'):
        return self._preprocess_chinese(text)
    elif lang.startswith('en'):
        return self._preprocess_english(text)
    return []

def _preprocess_chinese(self, text):
    # Chinese preprocessing logic
    pass

def _preprocess_english(self, text):
    # English preprocessing logic
    pass

217-218: Dynamically adding dictionary entries may hurt performance

Adding placeholders to the jieba dictionary on every call may be costly, especially when processing large volumes of text.

Consider adding the common placeholder patterns once in the constructor, or using jieba's temporary-dictionary facilities:

# Pre-initialize in the constructor
jieba.initialize()

263-264: The graph node-count check may be inaccurate

Returning early when the number of unique words is below the sliding-window size may not be the best strategy, since a meaningful graph can still be built from few words.

Suggested condition:

-        if len(unique_words) < self.window:
+        if len(unique_words) < 2:  # at least two words are needed to build a graph
             return
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 98471a4 and 750d338.

📒 Files selected for processing (1)
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4 hunks)
🧰 Additional context used
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build (3.10)
  • GitHub Check: build (3.11)
🔇 Additional comments (2)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)

287-307: The main keyword-extraction function is clear

The extract_keywords implementation is clear and correctly handles boundary conditions, including empty word lists and empty graphs. Error handling and control flow are sound.


98-109: The TextRank extraction method is well implemented

The method includes proper error handling, performance monitoring, and logging. The exception types cover common runtime errors, and the return type matches the LLM method.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)

48-57: The textrank_kwargs parameter still needs a null check

Although all call sites set default values, unpacking **textrank_kwargs directly in the constructor is still risky: if textrank_kwargs is None, it raises at runtime.

Suggested fix for robustness:

-        self._textrank_model = MultiLingualTextRank(**textrank_kwargs)  # TextRank parameters
+        self._textrank_model = MultiLingualTextRank(**(textrank_kwargs or {}))  # TextRank parameters
🧹 Nitpick comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)

169-252: The text preprocessing logic is complex but comprehensive

The Chinese/English preprocessing pipeline (special-word masking, regex cleanup, segmentation, POS filtering) is well designed, though the complex regex patterns may affect performance.

For large volumes of text, consider:

  1. Pre-compiling the frequently used regex patterns
  2. Using a more efficient text-processing library
  3. Adding a length limit or chunked processing for very long texts
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 750d338 and 29ddeb1.

📒 Files selected for processing (1)
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4 hunks)
🧰 Additional context used
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build (3.10)
  • GitHub Check: build (3.11)
🔇 Additional comments (6)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (6)

18-37: Imports and constants look reasonable

The new imports are all needed for the TextRank feature: jieba for Chinese segmentation, networkx for graph algorithms, nltk for English processing. The EXTRACT_STOPWORDS constant is also appropriate.


73-85: The extraction-method dispatch logic is correct

The conditional correctly routes between TextRank and LLM extraction, and error handling is in place. Keyword post-processing and context updates look sound.


98-108: The TextRank extraction method is well implemented

Error handling covers multiple exception types, the logging aids debugging, and returning an empty set as a fallback is a reasonable design.


139-168: The MultiLingualTextRank constructor is complete

The constructor now has sensible defaults, the POS filters are reasonable, stopword loading includes proper error handling, and the null check for mask_words is in place.


254-285: Graph construction and PageRank are implemented correctly

The word co-occurrence graph is built sensibly, self-loops are avoided, and edge weights are computed correctly. PageRank uses standard parameters, and the empty-graph check ensures robustness.


286-306: The main extraction flow is complete

The main function follows the standard TextRank pipeline: preprocess → build graph → PageRank → top-K selection. Boundary conditions, including empty word lists and empty graphs, are handled properly.
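The preprocess → build graph → PageRank → top-K flow described above can be sketched end-to-end in pure Python. This is a simplified illustration using power-iteration PageRank over an already-preprocessed word list; the PR itself relies on igraph/jieba/nltk, and its real scoring will differ:

```python
from collections import defaultdict

def textrank(words, window=3, damping=0.85, iters=50, top_k=5):
    """Minimal TextRank: weighted co-occurrence graph + power-iteration PageRank."""
    weight = defaultdict(float)   # symmetric edge weights
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if w != words[j]:
                weight[(w, words[j])] += 1.0
                weight[(words[j], w)] += 1.0
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    nodes = list(neighbors)
    if len(nodes) < 2:            # boundary case: nothing to rank
        return {w: 1.0 for w in nodes}
    score = {n: 1.0 / len(nodes) for n in nodes}
    out_weight = {n: sum(weight[(n, m)] for m in neighbors[n]) for n in nodes}
    for _ in range(iters):        # power iteration
        score = {
            n: (1 - damping) / len(nodes)
               + damping * sum(score[m] * weight[(m, n)] / out_weight[m]
                               for m in neighbors[n])
            for n in nodes
        }
    top = sorted(score.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    mx = top[0][1]
    return {w: round(s / mx, 2) for w, s in top}  # normalize to 0-1 like the PR's output
```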

@imbajin imbajin requested a review from Copilot July 11, 2025 10:06
Copilot AI left a comment

Pull Request Overview

This PR adds support for TextRank-based keyword extraction in the RAG pipeline, updates the pipeline operator and demo UI to expose TextRank parameters, and introduces new dependencies for graph-based ranking.

  • Added a MultiLingualTextRank class and branched keyword extraction logic between LLM and TextRank.
  • Extended RAGPipeline.extract_keywords to accept method, window size, mask words, and language settings.
  • Updated the Gradio demo (rag_block.py) to include UI controls for extraction method, language, mask words, keyword count, and sliding window.
  • Added networkx and scipy to project dependencies.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py Implemented TextRank extraction, added MultiLingualTextRank, and method branching
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py Propagated new extract_method and textrank_kwargs through pipeline
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py Added Gradio inputs for extraction method, language, mask words, max keywords, and window size
hugegraph-llm/pyproject.toml Added networkx and scipy dependencies

@imbajin

imbajin commented Jul 11, 2025

@codecov-ai-reviewer review


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)

179-262: Consider splitting the complex preprocessing method

The current _preprocess method takes on too many responsibilities: masking, cleanup, segmentation, and tagging. Splitting it into smaller methods would improve maintainability.

Consider splitting it into:

def _preprocess(self, text, lang):
    if lang.startswith('zh'):
        return self._preprocess_chinese(text)
    elif lang.startswith('en'):
        return self._preprocess_english(text)
    return []

def _preprocess_chinese(self, text):
    # Chinese preprocessing logic

def _preprocess_english(self, text):
    # English preprocessing logic
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 29ddeb1 and d2e846c.

📒 Files selected for processing (1)
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (6)
hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2)
  • BaseLLM (22-74)
  • generate (26-31)
hugegraph-llm/src/hugegraph_llm/models/llms/ollama.py (1)
  • generate (37-60)
hugegraph-llm/src/hugegraph_llm/models/llms/openai.py (1)
  • generate (57-85)
hugegraph-llm/src/hugegraph_llm/models/llms/qianfan.py (1)
  • generate (38-53)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)
  • extract_keywords (68-100)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (2)
  • NLTKHelper (30-80)
  • stopwords (36-53)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build (3.11)
  • GitHub Check: build (3.10)
🔇 Additional comments (11)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (11)

18-37: Imports and constant definitions are reasonable

The new imports support the TextRank implementation, and the constants are clearly defined.


41-57: Constructor parameter design is sound

The new extract_method and textrank_kwargs parameters support the TextRank feature, and the parameter-passing logic is correct.


73-85: Method dispatch logic is clear

The code dispatches correctly to the matching implementation based on the extraction method; keyword handling and context management stay consistent.


87-96: LLM extraction method remains backward compatible

The original LLM keyword extraction logic is preserved, with complete timing and logging.


98-114: TextRank extraction method is well implemented

It includes appropriate exception handling; timing is consistent with the LLM method, and the error-handling strategy is reasonable.


116-141: Response parsing method remains stable

The logic matches the original implementation, the formatting adjustments are reasonable, and stopword handling is correct.
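For reference, the KEYWORDS:keyword:score output format introduced by this PR can be parsed with a short helper like the one below. This is an illustrative sketch, not the PR's actual _extract_keywords_from_response code; the function name and the comma/colon separators are assumptions based on the format described in the PR summary.

```python
import re

def parse_keyword_line(response, start_token="KEYWORDS:"):
    """Parse 'KEYWORDS:kw1:0.90,kw2:0.75' into a keyword -> score dict."""
    match = re.search(re.escape(start_token) + r"([^\n]+)", response)
    if not match:
        return {}
    scores = {}
    for item in match.group(1).split(","):
        # Split on the LAST colon so keywords containing ':' survive.
        parts = item.rsplit(":", 1)
        if len(parts) != 2:
            continue
        keyword, raw_score = parts[0].strip(), parts[1].strip()
        try:
            # Clamp scores into the documented 0.00-1.00 range.
            scores[keyword] = min(1.0, max(0.0, float(raw_score)))
        except ValueError:
            continue
    return scores
```

Escaping start_token with re.escape mirrors the robustness suggestion raised later in this review.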


144-162: TextRank class constructor is well designed

Default parameters are reasonable; POS filtering for both Chinese and English is supported, the mask_words empty-value check is in place, and the lazy-loading mechanism is appropriate.


164-177: Stopword loading method is improved

Proper file-existence checks and error handling have been added; the lazy-loading pattern avoids repeated loads, and logging is thorough.


264-285: Graph construction method is implemented correctly

The co-occurrence graph construction is sound: it handles the edge case where the text is shorter than the window, avoids self-loops, and uses an appropriate sliding-window approach.


287-294: PageRank implementation is correct

The empty-graph case is checked, standard PageRank parameters are used, and weights are handled properly.


296-319: Main extraction method is well designed

It orchestrates the complete keyword extraction flow, handles empty results at each step properly, and the top-K keyword return logic is correct.
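Taken together, these comments describe the classic TextRank pipeline: slide a window over the tokens, build a weighted co-occurrence graph without self-loops, run PageRank, and return the top-K nodes. Below is a minimal pure-Python sketch of that technique, not the PR's MultiLingualTextRank implementation; the function name, defaults, and power-iteration PageRank are illustrative.

```python
import itertools
from collections import defaultdict

def textrank_keywords(words, window_size=3, top_k=5, damping=0.85, iters=50):
    """Rank candidate words by PageRank over a word co-occurrence graph."""
    # Build a weighted undirected co-occurrence graph with a sliding window.
    edges = defaultdict(float)
    for i in range(max(len(words) - window_size + 1, 1)):
        for u, v in itertools.combinations(words[i:i + window_size], 2):
            if u != v:  # skip self-loops
                edges[(u, v)] += 1.0
                edges[(v, u)] += 1.0
    nodes = {w for pair in edges for w in pair}
    if not nodes:
        return {}
    # Weighted PageRank via power iteration.
    score = {n: 1.0 / len(nodes) for n in nodes}
    out_weight = defaultdict(float)
    for (u, _), w in edges.items():
        out_weight[u] += w
    for _ in range(iters):
        nxt = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for (u, v), w in edges.items():
            nxt[v] += damping * score[u] * w / out_weight[u]
        score = nxt
    top = sorted(score.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return dict(top)
```

The production code in this PR delegates the PageRank step to a graph library instead of iterating by hand, but the shape of the computation is the same.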

@Gfreely
Collaborator Author

Gfreely commented Jul 16, 2025

@codecov-ai-reviewer review


@Gfreely Gfreely changed the title feat(llm): support TextRank feat(llm): BREAKING CHANGE update keyword extraction method Sep 12, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (5)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4)

64-74: extract_method does not support context override, and case is not normalized

This does not fully match the PR goal (switching the extraction mode from the UI). Read the override from context first and lowercase it uniformly before branching.

-        if self._extract_method == "llm":
+        method = (context.get("extract_method", self._extract_method) or "hybrid").strip().lower()
+        if method == "llm":
             # LLM method
             ranks = self._extract_with_llm()
-        elif self._extract_method == "textrank":
+        elif method == "textrank":
             # TextRank method
             ranks = self._extract_with_textrank()
-        elif self._extract_method == "hybrid":
+        elif method == "hybrid":
             # Hybrid method
             ranks = self._extract_with_hybrid()
         else:
-            raise ValueError(f"Invalid extract_method: {self._extract_method}")
+            raise ValueError(f"Invalid extract_method: {method}")

76-83: Avoid logging the raw user query at INFO level; redact and demote the log

The current INFO log contains the raw query, which is a privacy risk. Demote it to DEBUG and truncate.

-        log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
+        # Demote the sensitive log to DEBUG and truncate the query
+        safe_query = (self._query or "")[:200]
+        log.debug("Query(len=%d): %s | Keywords: %s", len(self._query or ""), safe_query, context["keywords"])

62-62: max_keywords needs type and bounds validation to avoid slicing errors

UI/HTTP parameters often arrive as strings or empty values; using them directly for slicing may raise TypeError. Coerce to int and enforce a lower bound.

-        self._max_keywords = context.get("max_keywords", self._max_keywords)
+        mk = context.get("max_keywords", self._max_keywords)
+        try:
+            mk = int(mk)
+        except (TypeError, ValueError):
+            mk = self._max_keywords
+        self._max_keywords = max(1, mk)

60-61: Language normalization has a logic error and almost always falls back to English

Currently only the value "cn" maps to Chinese; everything else becomes English, ignoring common identifiers such as "en/english/zh/chinese/zh-cn", and no override from context is supported, so Chinese inputs are easily misclassified as English. Normalize uniformly and allow the upstream caller to override.

-        # Use English by default
-        self._language = "chinese" if self._language == "cn" else "english"
+        # Normalize language: accept en/zh/cn/english/chinese, and allow context override
+        lang_raw = (context.get("language", self._language) or "").strip().lower()
+        if lang_raw in ("en", "english"):
+            self._language = "english"
+        elif lang_raw in ("zh", "cn", "chinese", "zh-cn", "zh_cn"):
+            self._language = "chinese"
+        else:
+            log.warning("Unsupported language '%s', defaulting to 'english'", lang_raw)
+            self._language = "english"
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (1)

66-103: The second-pass check has a fatal bug and does not implement the intended either-or group semantics

Problems:

  • Line 97 uses nltk.data.find('tokenizers/{package}') or nltk.data.find('taggers/{package}'): for punkt this wrongly looks up taggers/punkt and raises LookupError, so the function exits with an exception.
  • Semantically, the check should ensure that at least one tokenizer (punkt or punkt_tab) and at least one tagger (*_eng or the generic one) is present, rather than requiring all four packages to be True.

Suggestion: switch to an idempotent "group check + per-package download fallback" flow, and add a return type annotation.

-    def check_nltk_data(self):
-
-        cache_dir = self.get_cache_dir()
-        nltk_data_dir = os.environ.get("NLTK_DATA", cache_dir)
-        nltk.data.path.append(os.path.join(resource_path, "nltk_data"))
-
-        if nltk_data_dir not in nltk.data.path:
-            nltk.data.path.append(nltk_data_dir)
-
-        required_packages = {
-            'punkt': False,
-            'punkt_tab': False,
-            'averaged_perceptron_tagger': False,
-            "averaged_perceptron_tagger_eng": False}
-        for package in required_packages:
-            try:
-                if package in ['punkt', 'punkt_tab']:
-                    nltk.data.find(f'tokenizers/{package}')
-                else:
-                    nltk.data.find(f'taggers/{package}')
-                required_packages[package] = True
-            except LookupError:
-                try:
-                    log.info("Download nltk package %s", package)
-                    nltk.download(package, download_dir=nltk_data_dir)
-                except (URLError, HTTPError, PermissionError) as e:
-                    log.info("Can't download package %s as error: %s", package, e)
-
-        check_flag = all(required_packages.values())
-        if not check_flag:
-            for package in required_packages:
-                if nltk.data.find(f'tokenizers/{package}') or nltk.data.find(f'taggers/{package}'):
-                    required_packages[package] = True
-                    log.debug("Package %s is already downloaded", package)
-
-        check_flag = all(required_packages.values())
-        return check_flag
+    def check_nltk_data(self) -> bool:
+        cache_dir = self.get_cache_dir()
+        nltk_data_dir = os.environ.get("NLTK_DATA", cache_dir)
+        rp = os.path.join(resource_path, "nltk_data")
+        if rp not in nltk.data.path:
+            nltk.data.path.append(rp)
+        if nltk_data_dir not in nltk.data.path:
+            nltk.data.path.append(nltk_data_dir)
+
+        def ensure_group(group: str, names: list[str]) -> bool:
+            # 1) Pass if any one of them is already present
+            for name in names:
+                try:
+                    nltk.data.find(f"{group}/{name}")
+                    return True
+                except LookupError:
+                    continue
+            # 2) Try downloads in priority order
+            for name in names:
+                try:
+                    log.info("Download nltk package %s/%s", group, name)
+                    ok = nltk.download(name, download_dir=nltk_data_dir)
+                    if ok:
+                        try:
+                            nltk.data.find(f"{group}/{name}")
+                            return True
+                        except LookupError:
+                            continue
+                except (URLError, HTTPError, PermissionError) as e:
+                    log.info("Can't download package %s as error: %s", name, e)
+            return False
+
+        tokenizers_ok = ensure_group("tokenizers", ["punkt", "punkt_tab"])
+        taggers_ok = ensure_group("taggers", ["averaged_perceptron_tagger_eng", "averaged_perceptron_tagger"])
+        return bool(tokenizers_ok and taggers_ok)
🧹 Nitpick comments (5)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4)

96-110: TextRank exception handling: treat MemoryError separately

MemoryError usually signals resource exhaustion, and continuing execution may behave unpredictably. Handle it on its own (e.g. re-raise so an upper layer can break the circuit, or truncate the input and retry).

         try:
             ranks = self._textrank_model.extract_keywords(self._query)
         except (TypeError, ValueError) as e:
             log.error("TextRank parameter error: %s", e)
-        except MemoryError as e:
-            log.critical("TextRank memory error (text too large?): %s", e)
+        except MemoryError as e:
+            log.critical("TextRank memory error (text too large?): %s", e)
+            raise  # or return {} and let the caller degrade gracefully

115-119: Hybrid weight parsing is not robust enough (only accepts float)

It should accept int/str values as well and clamp uniformly to [0, 1].

-        if isinstance(llm_settings.hybrid_llm_weights, float):
-            llm_weights = min(1.0, max(0.0, float(llm_settings.hybrid_llm_weights)))
-        else:
-            llm_weights = 0.5
+        try:
+            llm_weights = float(llm_settings.hybrid_llm_weights)
+        except (TypeError, ValueError):
+            llm_weights = 0.5
+        llm_weights = min(1.0, max(0.0, llm_weights))

127-128: Demote logs containing keywords and scores to DEBUG

Avoid exposing user-content-derived data at INFO level.

-        log.info("LLM extract results: %s", llm_scores)
-        log.info("TextRank extract results: %s", tr_scores)
+        log.debug("LLM extract results: %s", llm_scores)
+        log.debug("TextRank extract results: %s", tr_scores)

152-152: Escape start_token when building the regex, for robustness

The current argument is the fixed string "KEYWORDS:", but escaping makes the code more general.

-        matches = re.findall(rf'{start_token}([^\n]+\n?)', response)
+        matches = re.findall(rf'{re.escape(start_token)}([^\n]+\n?)', response)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (1)

39-39: Avoid repeatedly appending to nltk.data.path (should be idempotent)

Every call appends the path, which can produce duplicate entries. Check before appending; this also makes debugging and printing easier.

A minimal change:

-        nltk.data.path.append(os.path.join(resource_path, "nltk_data"))
+        rp = os.path.join(resource_path, "nltk_data")
+        if rp not in nltk.data.path:
+            nltk.data.path.append(rp)

Apply the same treatment at Lines 70-71:

-        nltk.data.path.append(os.path.join(resource_path, "nltk_data"))
+        rp = os.path.join(resource_path, "nltk_data")
+        if rp not in nltk.data.path:
+            nltk.data.path.append(rp)

Also applies to: 70-71

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 00edd28 and 7f1ce87.

📒 Files selected for processing (3)
  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (2 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
🧰 Additional context used
🧠 Learnings (10)
📓 Common learnings
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In hugegraph-llm's KeywordExtract class, user Gfreely wants to keep the extraction-method interface consistent: all extraction methods (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], and the main run() method handles sorting and truncation uniformly, rather than each extraction method preprocessing internally. This is their preferred architecture.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final results (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization, meaning that if max_keywords grows at runtime, TextRank still only produces the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (max keywords) configuration.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely deduplicates ch_tokens (list(set(ch_tokens))) to avoid reprocessing duplicate Chinese tokens, which both fixes the potential ValueError from words.index() and improves batch-replacement efficiency.
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: NLTKHelper in hugegraph-llm uses lazy loading for stopwords and calls nltk.corpus.stopwords.words(lang) directly with the provided language parameter. It does not preload both English and Chinese stopwords - each language is loaded on first access. The lang parameter must match NLTK's expected language codes ("english", "chinese") or it will fail.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: In hugegraph-llm KeywordExtract, NLTKHelper loads both English and Chinese stopwords during initialization, but the stopwords(lang) method still requires the correct language key ("english" or "chinese") to return the appropriate stopword set for filtering.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:37:41.224Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:139-141
Timestamp: 2025-08-18T14:37:41.224Z
Learning: In hugegraph-llm textrank_word_extract.py, NLTK resources are packaged locally under src/hugegraph_llm/resources/nltk_data/ and the code properly configures nltk.data.path using get_project_root() to point to these local resources before making NLTK calls, preventing LookupError issues.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-29T13:11:08.943Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In hugegraph-llm's KeywordExtract class, user Gfreely wants to keep the extraction-method interface consistent: all extraction methods (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], and the main run() method handles sorting and truncation uniformly, rather than each extraction method preprocessing internally. This is their preferred architecture.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final results (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization, meaning that if max_keywords grows at runtime, TextRank still only produces the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-06-25T09:50:06.213Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (max keywords) configuration.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:42:31.998Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely deduplicates ch_tokens (list(set(ch_tokens))) to avoid reprocessing duplicate Chinese tokens, which both fixes the potential ValueError from words.index() and improves batch-replacement efficiency.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-21T12:34:10.407Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:187-202
Timestamp: 2025-08-21T12:34:10.407Z
Learning: In hugegraph-llm's KeywordExtract._extract_keywords_from_response method, the reuse of the sub_tokens variable is correct: in each loop iteration sub_tokens is reassigned to the current token's segmentation result, valid sub-words are appended back to the same sub_tokens list, and finally everything is added to results. This implementation is valid and correct.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
🧬 Code graph analysis (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4)
hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2)
  • BaseLLM (22-74)
  • generate (26-31)
hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (1)
  • LLMs (25-98)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2)
  • MultiLingualTextRank (30-156)
  • extract_keywords (140-156)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)
  • extract_keywords (67-85)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build (3.10)
  • GitHub Check: build (3.11)
🔇 Additional comments (4)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)

85-95: The LLM extraction path is clear and controlled

Timing and parsing are well separated, and returning Dict[str, float] matches the architectural preference. LGTM.


45-47: Confirm that TextRank's top_k/window behavior is consistent with runtime parameters

keyword_num and window_size are fixed at construction time here. If the runtime allows increasing max_keywords or changing window_size, TextRank remains capped by the construction-time limits. If, per the existing preference, parameters are only taken at initialization and never changed at runtime, this is acceptable; note it in the README/docs to avoid misunderstanding.

I can add a README note, or sync the values in run() (if you decide to support runtime overrides).

hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (2)

22-22: Introducing network-exception and logging dependencies: LGTM

Finer-grained exception classification for download failures plus a unified logging interface; no issues.

Also applies to: 29-29


37-37: Confirmed: the KeywordExtract layer normalizes the language value as NLTK requires
In run() in hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py, llm_settings.language.lower() is used and "cn" is mapped to "chinese", otherwise to "english"; document_op/textrank_word_extract.py also passes 'english'/'chinese' explicitly. So on these key call paths, upstream use of 'en'/'zh' will not make NLTKHelper.stopwords raise LookupError or return empty results.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (1)

37-41: Fix the NLTK stopwords language mapping: map aliases such as 'zh'/'cn'/'en' uniformly to 'chinese'/'english'

NLTKHelper.stopwords requires lang to be "english" or "chinese"; some places in the repo only check for "cn" and default everything else to "english", which mismaps values like "zh" to English and disables Chinese stopwords.

Affected locations (need fixing):

  • hugegraph-llm/src/hugegraph_llm/operators/document_op/word_extract.py:52
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61

Verified: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py already uses the correct 'english'/'chinese' directly (81–82).

Suggestion: do the alias mapping uniformly before the call or inside NLTKHelper.stopwords (e.g. 'zh','cn','zh-cn' -> 'chinese'; 'en','eng' -> 'english'), or replace the default logic with an explicit mapping table to avoid implicit errors.
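An explicit alias table along the lines suggested above might look like this; the helper name and the exact alias set are illustrative, not part of the PR:

```python
# Map common language aliases onto the codes NLTK's stopword corpus expects.
_LANG_ALIASES = {
    "en": "english", "eng": "english", "english": "english",
    "zh": "chinese", "cn": "chinese", "zh-cn": "chinese",
    "zh_cn": "chinese", "chinese": "chinese",
}

def normalize_language(lang, default="english"):
    """Return 'english' or 'chinese'; fall back to a default for unknown values."""
    return _LANG_ALIASES.get((lang or "").strip().lower(), default)
```

Calling this once at the boundary (e.g. in run() or inside NLTKHelper.stopwords) keeps every downstream consumer on the two values NLTK accepts.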

♻️ Duplicate comments (2)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (1)

101-105: The second-pass check has an uncaught LookupError, so the function exits with an exception on the tagger branch

nltk.data.find(f'tokenizers/{package}') or nltk.data.find(f'taggers/{package}') will, for tagger packages, first try the tokenizers path and raise LookupError, which is not caught; as written, the first call raises before the second branch can ever run. Check by resource type and catch the exception.

Suggested minimal fix:

-        for package in required_packages:
-            if nltk.data.find(f'tokenizers/{package}') or nltk.data.find(f'taggers/{package}'):
-                required_packages[package] = True
-                log.debug("Package %s is already downloaded", package)
+        for package in required_packages:
+            try:
+                if package in ("punkt", "punkt_tab"):
+                    nltk.data.find(f"tokenizers/{package}")
+                else:
+                    nltk.data.find(f"taggers/{package}")
+                required_packages[package] = True
+                log.debug("Package %s is already downloaded", package)
+            except LookupError:
+                # keep False; let the all() result decide
+                pass
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)

60-62: Language normalization bug: 'zh' is misclassified as English

Only "cn" currently maps to Chinese; everything else becomes English, which does not meet the PR goal (supporting 'en'/'zh'). Complete the mapping.

-        # Use English by default
-        self._language = "chinese" if self._language == "cn" else "english"
+        # Normalize language
+        lang = (self._language or "").strip().lower()
+        lang_map = {"en": "english", "zh": "chinese", "cn": "chinese"}
+        self._language = lang_map.get(lang, lang)
🧹 Nitpick comments (3)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (1)

39-39: Avoid repeatedly appending to nltk.data.path (minor)

Two places append the bundled resource path directly; a long-lived process may append the same path repeatedly. Check before appending.

-        nltk.data.path.append(os.path.join(resource_path, "nltk_data"))
+        rp = os.path.join(resource_path, "nltk_data")
+        if rp not in nltk.data.path:
+            nltk.data.path.append(rp)

Apply the same change at Line 75.

Also applies to: 75-75

hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)

82-86: Lower the sensitive-info log level; avoid printing the user query at INFO

To reduce privacy risk, demote the query and keyword output to DEBUG and avoid the newline.

-        log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
+        log.debug("User Query: %s | Keywords: %s", self._query, context["keywords"])

148-159: The regex does not escape start_token; use re.escape to prevent special characters from skewing matches

A comment already warns that it may contain special characters, but the implementation does not escape it. A small hardening is suggested.

-        matches = re.findall(rf'{start_token}([^\n]+\n?)', response)
+        prefix = re.escape(start_token) if start_token else ""
+        matches = re.findall(prefix + r'([^\n]+\n?)', response)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7f1ce87 and f31a500.

📒 Files selected for processing (2)
  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (2 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (3)
hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2)
  • BaseLLM (22-74)
  • generate (26-31)
hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (1)
  • LLMs (25-98)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2)
  • MultiLingualTextRank (30-156)
  • extract_keywords (140-156)
🔇 Additional comments (2)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)

69-80: Branch dispatch and context override are clear and easy to extend

The context-based extract_method override with uniform lowercasing is reasonable, and the three branches have clear responsibilities.


117-146: The hybrid weighted fusion is concise and consistent with the uniform Dict[str, float] return design

Linearly combining LLM/TR scores by weight and scoring the union of keywords is a sound strategy; logging and timing granularity are appropriate.
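The weighted-fusion strategy described here can be sketched independently of the PR code. Assuming both extractors return Dict[str, float], a hypothetical merge helper could look like:

```python
def hybrid_merge(llm_scores, tr_scores, llm_weight=0.5):
    """Linearly combine two keyword -> score maps over the union of keys."""
    # Clamp the weight to [0, 1], mirroring the robustness advice above.
    llm_weight = min(1.0, max(0.0, llm_weight))
    merged = {}
    for kw in set(llm_scores) | set(tr_scores):
        # A keyword missing from one extractor contributes a score of 0.0.
        merged[kw] = (llm_weight * llm_scores.get(kw, 0.0)
                      + (1.0 - llm_weight) * tr_scores.get(kw, 0.0))
    return merged
```

Sorting merged by value and truncating to max_keywords then happens once in run(), matching the preference that all extraction methods return unsorted score maps.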

Comment on lines +45 to +47
self._textrank_model = MultiLingualTextRank(
keyword_num=max_keywords,
window_size=llm_settings.window_size)


💡 Verification agent

🧩 Analysis chain

Confirm whether the TextRank parameter plumbing covers mask_words (aligned with the PR goal)

Only keyword_num/window_size are passed here. The PR description includes the mask_words feature and Chinese-segmentation masking rules; confirm whether it is wired up inside MultiLingualTextRank or upstream, otherwise the plumbing needs completing.


🏁 Script executed:

#!/bin/bash
# Check where mask_words is declared and how it is passed along
rg -nP --type=py 'mask_words'
# Check the MultiLingualTextRank constructor signature and its call sites
rg -nP --type=py -C2 'class\s+MultiLingualTextRank|MultiLingualTextRank\(' hugegraph-llm/src

Length of output: 1186


mask_words must be plumbed into the MultiLingualTextRank constructor and passed in from the call site

  • Verified: MultiLingualTextRank.init(keyword_num: int = 5, window_size: int = 3) in hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py only accepts keyword_num and window_size (around lines 30-32); no mask_words pass-through was found in the repository.
  • Required fix: add a mask_words parameter to the constructor (e.g. mask_words: Optional[List[str]] = None) and pass the corresponding mask_words (or the llm_settings value) at the call site hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (around lines 45-47); if the masking logic is already implemented upstream, point out where it lives in the PR and add the corresponding call.
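For reference, the mask-words idea being requested here can be sketched in a few lines (a hypothetical helper; only the `__shieldword_` placeholder pattern is taken from this thread, everything else is illustrative):

```python
def mask_phrases(text: str, mask_words: list) -> tuple:
    """Replace user-specified phrases with placeholders so that later word
    segmentation (e.g. jieba) cannot split them apart."""
    placeholder_map = {}
    for i, phrase in enumerate(mask_words):
        placeholder = f"__shieldword_{i}__"
        placeholder_map[placeholder] = phrase
        # Surround the placeholder with spaces so it tokenizes as one unit
        text = text.replace(phrase, f" {placeholder} ")
    return text, placeholder_map
```

After extraction, any placeholder found among the tokens is mapped back to its original phrase via placeholder_map.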


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (1)

22-22: Remove the duplicate import (lines 22 and 29)

Line 22 imports URLError and HTTPError, and line 29 imports stopwords again; the stopwords import on line 29 duplicates the one on line 25.

Delete the duplicate import:

 from urllib.error import URLError, HTTPError

 import nltk
 from nltk.corpus import stopwords

 from hugegraph_llm.config import resource_path
 from hugegraph_llm.utils.log import log
-from nltk.corpus import stopwords
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f31a500 and 9423bb4.

📒 Files selected for processing (1)
  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
hugegraph-llm/**/*.py

📄 CodeRabbit inference engine (hugegraph-llm/AGENTS.md)

hugegraph-llm/**/*.py: Adhere to ruff code style for Python code
Type-check Python code with mypy
Keep each Python file under 600 lines for maintainability

Files:

  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
hugegraph-llm/src/hugegraph_llm/operators/**/*.py

📄 CodeRabbit inference engine (hugegraph-llm/AGENTS.md)

Put core processing pipelines under src/hugegraph_llm/operators/

Files:

  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In hugegraph-llm's KeywordExtract class, user Gfreely wants to keep the extraction methods' interfaces consistent: all extraction methods (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], and the main run() method handles sorting and truncation uniformly instead of each method preprocessing its own results. This is their architectural preference.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final results (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is never updated after initialization, so if max_keywords grows at runtime, TextRank can still only produce the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (maximum keyword count) configuration.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely deduplicates ch_tokens (list(set(ch_tokens))) to avoid reprocessing duplicate Chinese tokens; this both avoids the potential ValueError from words.index() and makes batch replacement more efficient.
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: NLTKHelper in hugegraph-llm uses lazy loading for stopwords and calls nltk.corpus.stopwords.words(lang) directly with the provided language parameter. It does not preload both English and Chinese stopwords - each language is loaded on first access. The lang parameter must match NLTK's expected language codes ("english", "chinese") or it will fail.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: In hugegraph-llm KeywordExtract, NLTKHelper loads both English and Chinese stopwords during initialization, but the stopwords(lang) method still requires the correct language key ("english" or "chinese") to return the appropriate stopword set for filtering.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
📚 Learning: 2025-08-18T14:37:41.224Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:139-141
Timestamp: 2025-08-18T14:37:41.224Z
Learning: In hugegraph-llm textrank_word_extract.py, NLTK resources are packaged locally under src/hugegraph_llm/resources/nltk_data/ and the code properly configures nltk.data.path using get_project_root() to point to these local resources before making NLTK calls, preventing LookupError issues.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build (3.10)
  • GitHub Check: build (3.11)
🔇 Additional comments (2)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (2)

54-59: The log messages need fixing ("Start download" vs. "is already downloaded")

Line 55 logs "Start download nltk package stopwords", but after a successful download line 57 logs "is already downloaded", which is inaccurate. Adjust the messages so they clearly reflect the actual state.

Per earlier feedback, the logs should accurately reflect the download status:

 try:
-    log.info("Start download nltk package stopwords")
-    nltk.download("stopwords", download_dir=nltk_data_dir, quiet=False)
-    log.debug("NLTK package stopwords is already downloaded")
+    log.info("Downloading NLTK stopwords package to %s", nltk_data_dir)
+    nltk.download("stopwords", download_dir=nltk_data_dir, quiet=False)
+    log.debug("NLTK stopwords package downloaded")
 except (URLError, HTTPError, PermissionError) as e:
     log.warning("Can't download package stopwords as error: %s", e)

83-110: The check_nltk_data method has logic and compatibility problems

The method has several issues:

  1. In NLTK 3.8.2+, punkt has been replaced by punkt_tab, but the current code assumes all packages are available
  2. The verification logic on lines 105-106 is wrong: nltk.data.find() raises LookupError when a resource is missing instead of returning a boolean
  3. There is no fallback to handle version differences

Based on the recorded learnings and web search results, add compatibility handling:

 def check_nltk_data(self):
+    """检查并确保必需的NLTK数据包可用,返回是否全部就绪"""
     _hugegraph_source_dir = os.path.join(resource_path, "nltk_data")
     if _hugegraph_source_dir not in nltk.data.path:
         nltk.data.path.append(_hugegraph_source_dir)

     cache_dir = self.get_cache_dir()
     nltk_data_dir = os.environ.get("NLTK_DATA", cache_dir)
     if nltk_data_dir not in nltk.data.path:
         nltk.data.path.append(nltk_data_dir)

-    required_packages = {
-        'punkt': False,
-        'punkt_tab': False,
-        'averaged_perceptron_tagger': False,
-        "averaged_perceptron_tagger_eng": False}
-    for package in required_packages:
+    # Check tokenizers: either punkt or punkt_tab being available is enough
+    tokenizer_found = False
+    for tokenizer in ['punkt_tab', 'punkt']:  # check the newer version first
         try:
-            if package in ['punkt', 'punkt_tab']:
-                nltk.data.find(f'tokenizers/{package}')
-            else:
-                nltk.data.find(f'taggers/{package}')
-            required_packages[package] = True
+            nltk.data.find(f'tokenizers/{tokenizer}')
+            tokenizer_found = True
+            log.debug("Found tokenizer: %s", tokenizer)
+            break
         except LookupError:
-            try:
-                log.info("Start download nltk package %s", package)
-                nltk.download(package, download_dir=nltk_data_dir, quiet=False)
-            except (URLError, HTTPError, PermissionError) as e:
-                log.warning("Can't download package %s as error: %s", package, e)
+            continue
+    
+    if not tokenizer_found:
+        # Try to download the newer punkt_tab first, falling back to punkt on failure
+        for tokenizer in ['punkt_tab', 'punkt']:
+            try:
+                log.info("Downloading NLTK tokenizer: %s", tokenizer)
+                success = nltk.download(tokenizer, download_dir=nltk_data_dir, quiet=False)
+                if success:
+                    tokenizer_found = True
+                    break
+            except (URLError, HTTPError, PermissionError) as e:
+                log.warning("Failed to download %s: %s", tokenizer, e)
+    
+    # Check taggers: either averaged_perceptron_tagger_eng or averaged_perceptron_tagger is enough
+    tagger_found = False
+    for tagger in ['averaged_perceptron_tagger_eng', 'averaged_perceptron_tagger']:
+        try:
+            nltk.data.find(f'taggers/{tagger}')
+            tagger_found = True
+            log.debug("Found tagger: %s", tagger)
+            break
+        except LookupError:
+            continue
+    
+    if not tagger_found:
+        # Try to download
+        for tagger in ['averaged_perceptron_tagger_eng', 'averaged_perceptron_tagger']:
+            try:
+                log.info("Downloading NLTK tagger: %s", tagger)
+                success = nltk.download(tagger, download_dir=nltk_data_dir, quiet=False)
+                if success:
+                    tagger_found = True
+                    break
+            except (URLError, HTTPError, PermissionError) as e:
+                log.warning("Failed to download %s: %s", tagger, e)

-    check_flag = all(required_packages.values())
-    if not check_flag:
-        for package in required_packages:
-            if nltk.data.find(f'tokenizers/{package}') or nltk.data.find(f'taggers/{package}'):
-                required_packages[package] = True
-                log.debug("Package %s is already downloaded", package)

-    check_flag = all(required_packages.values())
-    return check_flag
+    return tokenizer_found and tagger_found

@imbajin

imbajin commented Oct 9, 2025

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

Hello, thanks for this major update to keyword extraction! The refactor introduces TextRank and a hybrid mode, greatly improving the feature's flexibility and effectiveness. The code is well structured, and the new functionality is thoughtfully designed.

I found two potential problems, mainly around the robustness of the NLTK dependency check and the accuracy of mixed-language text processing. See the comments below for details and suggested fixes.

Overall this is an excellent update; looking forward to merging once these issues are fixed!

Comment on lines 72 to 109
def check_nltk_data(self):
_hugegraph_source_dir = os.path.join(resource_path, "nltk_data")
if _hugegraph_source_dir not in nltk.data.path:
nltk.data.path.append(_hugegraph_source_dir)

cache_dir = self.get_cache_dir()
nltk_data_dir = os.environ.get("NLTK_DATA", cache_dir)
if nltk_data_dir not in nltk.data.path:
nltk.data.path.append(nltk_data_dir)

required_packages = {
'punkt': False,
'punkt_tab': False,
'averaged_perceptron_tagger': False,
"averaged_perceptron_tagger_eng": False}
for package in required_packages:
try:
if package in ['punkt', 'punkt_tab']:
nltk.data.find(f'tokenizers/{package}')
else:
nltk.data.find(f'taggers/{package}')
required_packages[package] = True
except LookupError:
try:
log.info("Start download nltk package %s", package)
nltk.download(package, download_dir=nltk_data_dir, quiet=False)
except (URLError, HTTPError, PermissionError) as e:
log.warning("Can't download package %s as error: %s", package, e)

check_flag = all(required_packages.values())
if not check_flag:
for package in required_packages:
if nltk.data.find(f'tokenizers/{package}') or nltk.data.find(f'taggers/{package}'):
required_packages[package] = True
log.debug("Package %s is already downloaded", package)

check_flag = all(required_packages.values())
return check_flag


high

The check_nltk_data implementation has serious problems and can crash the program in certain situations.

  1. Logic error in the second loop: on lines 102-108, the loop inside the if not check_flag: block is flawed. nltk.data.find() raises LookupError when a resource is missing, and the statement if nltk.data.find(...) or nltk.data.find(...) does not handle that exception. When checking a tagger package (such as 'averaged_perceptron_tagger'), nltk.data.find(f'tokenizers/{package}') runs first, raises because the path does not match, and crashes the program.

  2. Redundant and inefficient: the method uses two separate loops for one task, making the code redundant and hard to follow. The first loop never updates a package's status after attempting a download, which is what makes the second (broken) loop necessary.

Refactor the whole logic into a single, clearer, more robust loop that handles checking, downloading, and verification in order.

    def check_nltk_data(self):
        _hugegraph_source_dir = os.path.join(resource_path, "nltk_data")
        if _hugegraph_source_dir not in nltk.data.path:
            nltk.data.path.append(_hugegraph_source_dir)

        cache_dir = self.get_cache_dir()
        nltk_data_dir = os.environ.get("NLTK_DATA", cache_dir)
        if nltk_data_dir not in nltk.data.path:
            nltk.data.path.append(nltk_data_dir)

        required_packages = {
            'punkt': 'tokenizers/punkt',
            'punkt_tab': 'tokenizers/punkt_tab',
            'averaged_perceptron_tagger': 'taggers/averaged_perceptron_tagger',
            "averaged_perceptron_tagger_eng": 'taggers/averaged_perceptron_tagger_eng'
        }

        for package, path in required_packages.items():
            try:
                nltk.data.find(path)
            except LookupError:
                log.info("Start download nltk package %s", package)
                try:
                    if not nltk.download(package, download_dir=nltk_data_dir, quiet=False):
                        log.warning("NLTK download command returned False for package %s.", package)
                        return False
                    # Verify after download
                    nltk.data.find(path)
                except (URLError, HTTPError, PermissionError) as e:
                    log.warning("Can't download package %s as error: %s", package, e)
                    return False
                except LookupError:
                    log.error("Package %s not found after download. Check package name and nltk_data paths.", package)
                    return False
        return True

Comment on lines 78 to 110
def _multi_preprocess(self, text):
words = []
ch_tokens = []
en_stop_words = NLTKHelper().stopwords(lang='english')
ch_stop_words = NLTKHelper().stopwords(lang='chinese')

# Filtering special words, cleansing punctuation marks, and filtering out invalid tokens
masked_text, placeholder_map = self._word_mask(text)
pos_tags = self._get_valid_tokens(masked_text)

# English word segmentation
for word, flag in pos_tags:
if word in placeholder_map:
words.append(placeholder_map[word])
else:
if len(word) >= 1 and flag in self.pos_filter['english'] and word.lower() not in en_stop_words:
words.append(word)
if re.compile('[\u4e00-\u9fff]').search(word):
ch_tokens.append(word)

# Chinese word segmentation
if len(ch_tokens) > 0:
ch_tokens = list(set(ch_tokens))
for ch_token in ch_tokens:
idx = words.index(ch_token)
ch_words = []
jieba_tokens = pseg.cut(ch_token)
for word, flag in jieba_tokens:
if len(word) >= 1 and flag in self.pos_filter['chinese'] and word not in ch_stop_words:
ch_words.append(word)
words = words[:idx] + ch_words + words[idx+1:]

return words


high

_multi_preprocess is flawed for text containing repeated Chinese phrases, which can leave segmentation incomplete.

  1. Information loss: on line 100, ch_tokens = list(set(ch_tokens)) removes duplicate Chinese phrases, discarding their frequency and positions in the original text.
  2. Incorrect replacement: on line 102, idx = words.index(ch_token) only finds the first matching index in the words list. If a Chinese phrase occurs multiple times, only the first occurrence is segmented and replaced; the rest stay unchanged, producing inaccurate segmentation.

For example, for the text "你好世界,我爱你好世界", the words list before processing may be ['你好世界', '我', '爱', '你好世界'], while ch_tokens after deduplication contains a single '你好世界'. words.index('你好世界') returns 0, so only the first '你好世界' is replaced and the second is never processed.

Refactor the method so every token is handled correctly. A simpler, more robust approach is to walk pos_tags once, apply the Chinese or English segmentation strategy per token, and build the final word list as you go.

    def _multi_preprocess(self, text):
        words = []
        en_stop_words = NLTKHelper().stopwords(lang='english')
        ch_stop_words = NLTKHelper().stopwords(lang='chinese')

        masked_text, placeholder_map = self._word_mask(text)
        pos_tags = self._get_valid_tokens(masked_text)

        for word, flag in pos_tags:
            if word in placeholder_map:
                words.append(placeholder_map[word])
                continue

            # Check if the word contains Chinese characters
            if re.search('[一-鿿]', word):
                jieba_tokens = pseg.cut(word)
                for ch_word, ch_flag in jieba_tokens:
                    if len(ch_word) >= 1 and ch_flag in self.pos_filter['chinese'] and ch_word not in ch_stop_words:
                        words.append(ch_word)
            # Process as English word
            elif len(word) >= 1 and flag in self.pos_filter['english'] and word.lower() not in en_stop_words:
                words.append(word)

        return words

@imbajin

imbajin commented Oct 9, 2025

/gemini summary

@gemini-code-assist

Summary of Changes

This pull request significantly upgrades the keyword extraction capabilities within the RAG pipeline by introducing TextRank and a hybrid approach as alternatives to the LLM-only method. This change provides users with more granular control over keyword extraction, allowing for optimization based on performance, cost, and specific use case requirements, while also standardizing the output format to include importance scores.

Highlights

  • New Keyword Extraction Methods: Introduced TextRank and a Hybrid method alongside the existing LLM-based approach, offering more flexibility and control over keyword extraction within the RAG pipeline.
  • Configurable Settings: Added new configuration options for keyword extraction, including method selection, TextRank window size, and hybrid LLM weights, allowing users to fine-tune the process.
  • Enhanced Output Format: Keyword extraction now includes importance scores (0.00-1.00), sorted in descending order, and formatted as 'KEYWORDS:keyword:score' for clearer results.
  • Improved NLTK Resource Management: Enhanced NLTK data checking, download handling, and resource path configuration for better robustness and reliability when using NLTK functionalities.
  • New Dependencies: Added 'scipy' and 'python-igraph' to support the graph-based TextRank algorithm, expanding the project's analytical capabilities.
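The 'KEYWORDS:keyword:score' format described above could be consumed roughly like this (a hedged sketch: the comma separator between pairs and the tolerance for malformed items are assumptions, since the thread only specifies the keyword:score pattern itself):

```python
def parse_keyword_response(response: str) -> dict:
    """Parse lines such as 'KEYWORDS:graph:0.92,vector db:0.85' into a
    keyword -> score mapping, skipping malformed entries."""
    scores = {}
    for line in response.splitlines():
        if not line.startswith("KEYWORDS:"):
            continue
        for item in line[len("KEYWORDS:"):].split(","):
            keyword, _, raw_score = item.rpartition(":")
            try:
                scores[keyword.strip()] = float(raw_score)
            except ValueError:
                continue  # skip items without a numeric score
    return scores
```

The caller can then produce the documented descending order with sorted(scores.items(), key=lambda kv: kv[1], reverse=True).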
Changelog
  • hugegraph-llm/.gitignore
    • Updated ignore rules to correctly manage NLTK data and other resources.
  • hugegraph-llm/config.md
    • Documented new configuration parameters for keyword extraction type, TextRank window size, and hybrid LLM weights.
  • hugegraph-llm/pyproject.toml
    • Added 'scipy' and 'python-igraph' as core dependencies.
  • hugegraph-llm/src/hugegraph_llm/api/rag_api.py
    • Minor formatting adjustments.
  • hugegraph-llm/src/hugegraph_llm/config/llm_config.py
    • Introduced 'keyword_extract_type', 'window_size', and 'hybrid_llm_weights' to the LLM configuration.
  • hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py
    • Minor formatting adjustments.
  • hugegraph-llm/src/hugegraph_llm/config/prompt_config.py
    • Modified English and Chinese keyword extraction prompt templates to specify importance scores and the new 'keyword:score' output format.
  • hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py
    • Minor formatting adjustments.
  • hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py
    • Minor formatting adjustments.
  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
    • Refined the 'stopwords' method with better error handling for downloads.
    • Added 'check_nltk_data' for comprehensive NLTK resource validation.
  • hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
    • New file implementing 'MultiLingualTextRank' for multi-lingual keyword extraction using graph-based ranking.
  • hugegraph-llm/src/hugegraph_llm/operators/document_op/word_extract.py
    • Simplified the 'WordExtract' operator by removing the 'language' parameter from its constructor, now deriving it from global settings.
  • hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py
    • Updated 'extract_word' and 'extract_keywords' methods to align with the new keyword extraction configuration and removed redundant parameters.
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
    • Refactored the 'KeywordExtract' class to support 'llm', 'textrank', and 'hybrid' extraction methods.
    • Integrated 'MultiLingualTextRank' and handles the new score-based output format.
  • pyproject.toml
    • Added 'scipy' and 'python-igraph' to the 'constraint-dependencies'.
Activity
  • The pull request introduces a significant feature: new keyword extraction methods (TextRank, Hybrid) and configurable options.
  • Initial automated reviews from 'coderabbitai[bot]' and 'codecov-ai[bot]' provided detailed summaries, walkthroughs, and suggestions.
  • The author, Gfreely, actively engaged with the review comments, providing clarifications and addressing several suggestions, such as the SciPy version compatibility for Python 3.10, the design choice for TextRank's 'top_k' parameter, and the packaging of NLTK resources.
  • Multiple automated suggestions were made and addressed, including dependency versioning, error handling, regex issues, and prompt template refinements.
  • There were discussions around the robustness of LLM calls, the handling of 'mask_words', and the thread-safety of 'jieba' modifications, with the author often providing context on existing design decisions.
  • A sequence diagram was generated by 'coderabbitai[bot]' upon request.
  • Recent high-priority issues were identified by 'gemini-code-assist[bot]' regarding 'check_nltk_data's error handling and '_multi_preprocess's logic for handling duplicate Chinese tokens.

"numpy",
"pandas",
"pydantic",
"scipy",

Consider adding trailing comments documenting what these packages are for:

"scipy",  # required by the TextRank algorithm
"python-igraph",  # required for graph computation

Also verify that both packages are compatible across Windows/macOS/Linux.


def _word_mask(self, text):

placeholder_id_counter = 0

_word_mask has potential performance issues:

  1. nonlocal in a nested function: _create_placeholder mutates an outer variable on every match; extract the placeholder generation into a class method
  2. Regex compilation site: move the special_regex compilation into __init__ to avoid recompiling it on every call
  3. String concatenation: the spaces around f" __shieldword_{placeholder_id_counter}__ " may produce unexpected tokenization

Suggested optimization:

def __init__(self, ...):
    # ... other initialization code
    self.special_regex = regex.compile('|'.join(self.rules), regex.V1)
    
def _create_placeholder(self, original_word: str, counter: int) -> tuple:
    placeholder = f"__shieldword_{counter}__"
    return placeholder, original_word

# Chinese word segmentation
if len(ch_tokens) > 0:
ch_tokens = list(set(ch_tokens))
for ch_token in ch_tokens:

Safety issue: there is no exception handling here. If ch_token is not in the words list, words.index raises a ValueError.

Suggested change:

try:
    idx = words.index(ch_token)
except ValueError:
    log.warning(f"Chinese token {ch_token} not found in words list")
    continue

Also, instead of repeatedly slicing and recombining the same list, consider collections.deque or building a new list to avoid the cost of frequent list operations.

self._language = "chinese" if self._language == "cn" else "english"
mk = context.get("max_keywords", self._max_keywords)
try:
mk = int(mk)

Code quality issues:

  1. Variable naming: mk is not descriptive; rename it to max_keywords_value
  2. Type conversion logic: handle the conversion in a more Pythonic way:
max_keywords_raw = context.get("max_keywords", self._max_keywords)
if isinstance(max_keywords_raw, (int, float)):
    self._max_keywords = max(1, int(max_keywords_raw))
elif isinstance(max_keywords_raw, str) and max_keywords_raw.isdigit():
    self._max_keywords = max(1, int(max_keywords_raw))
else:
    self._max_keywords = self._max_keywords  # keep the default value
  3. Validation range: add an upper bound so an overly large user input cannot hurt performance

mk = self._max_keywords
self._max_keywords = max(1, mk)

method = (context.get("extract_method", self._extract_method) or "hybrid").strip().lower()

Critical safety issue: the handling of the method parameter here is risky:

  1. May be empty after strip(): if the user passes a whitespace-only string, strip() yields an empty string, leading to a ValueError
  2. Use an enum: restrict the allowed values with a Python Enum or Literal type

Suggested change:

from typing import Literal

EXTRACT_METHOD = Literal["llm", "textrank", "hybrid"]

# in the run method
method = (context.get("extract_method", self._extract_method) or self._extract_method).strip().lower()
if method not in ["llm", "textrank", "hybrid"]:
    log.warning(f"Invalid extract_method '{method}', fallback to '{self._extract_method}'")
    method = self._extract_method

node_names = self.graph.vs['name']
return dict(zip(node_names, pagerank_scores))

def extract_keywords(self, text) -> Dict[str, float]:

Code quality: the exception handling here is incomplete:

  1. If check_nltk_data() returns False, returning an empty dict is a silent failure; the user never learns why
  2. Add logging or raise a more explicit exception

Suggested change:

if not NLTKHelper().check_nltk_data():
    log.error("NLTK data check failed, cannot proceed with keyword extraction")
    raise RuntimeError("NLTK required data packages are not available. Please check network or manually install.")

Or, if graceful degradation is preferred:

if not NLTKHelper().check_nltk_data():
    log.warning("NLTK data unavailable, returning empty keywords")
    context["keywords"] = []
    context["ranks"] = {}
    return context

!/src/hugegraph_llm/resources/demo/*
!/src/hugegraph_llm/resources/nltk_data/*
!/src/hugegraph_llm/resources/prompt_examples/*
!/src/hugegraph_llm/resources/demo/

The .gitignore change carries risk:

Changing !/src/hugegraph_llm/resources/demo/* to !/src/hugegraph_llm/resources/demo/ changes the matching behavior:

  • Old rule: ignore everything under resources/ but keep all files under demo/
  • New rule: keep only the demo/ directory entry itself, not the files under it

This may cause demo files to be ignored unexpectedly. Suggestions:

  1. Keep the original /* suffix
  2. Or state clearly why the change is needed, and add the explanation to the PR description

required_packages[package] = True
except LookupError:
try:
log.info("Start download nltk package %s", package)

Download failure handling is not robust enough:

  1. The caught exceptions include PermissionError, but on a permission failure a mere warning may let later code fail for lack of data
  2. Distinguish the failure causes:
    • Network problem: retry is possible
    • Permission problem: should raise
    • Other problems: log detailed error information

Suggested improvement:

except PermissionError as e:
    log.error(f"Permission denied when downloading {package}: {e}")
    raise RuntimeError(f"Cannot write NLTK data to {nltk_data_dir}, please check permissions") from e
except (URLError, HTTPError) as e:
    log.warning(f"Network error downloading {package}: {e}, will retry with backup method")
    # could retry from a local cache or a mirror source

| `TEXT2GQL_LLM_TYPE` | Literal["openai", "litellm", "ollama/local"] | openai | Text-to-GQL LLM type |
| `EMBEDDING_TYPE` | Optional[Literal["openai", "litellm", "ollama/local"]] | openai | Embedding model type |
| `RERANKER_TYPE` | Optional[Literal["cohere", "siliconflow"]] | None | Reranker model type: cohere/siliconflow |
| `KEYWORD_EXTRACT_TYPE` | Literal["llm", "textrank", "hybrid"] | llm | Keyword extraction method: llm/textrank/hybrid |

Configuration doc suggestions:

For the new entries in the table, consider adding:

  1. WINDOW_SIZE:

    • A reasonable value range (e.g. 1-10)
    • A note on how the value affects performance and results
  2. HYBRID_LLM_WEIGHTS:

    • The value range (0.0-1.0)
    • An explanation of the parameter, e.g. "weight of the LLM results; the TextRank weight is 1 - hybrid_llm_weights"
    • A recommended value and usage scenarios

Example:

| `WINDOW_SIZE` | Optional[Integer] | 3 | TextRank sliding-window size (range: 1-10); a larger window captures longer-range word relations but increases computational cost |
| `HYBRID_LLM_WEIGHTS` | Optional[Float] | 0.5 | Weight of the LLM results in hybrid mode (range: 0.0-1.0); TextRank weight = 1 - this value. 0.5 is recommended to balance the two methods |
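To make the WINDOW_SIZE semantics concrete, the co-occurrence step at the heart of TextRank can be sketched as follows (an illustrative sketch of the standard algorithm's preprocessing, not the PR's MultiLingualTextRank code):

```python
from collections import defaultdict

def cooccurrence_edges(words: list, window_size: int = 3) -> dict:
    """Count undirected co-occurrence edges between words that appear
    inside the same sliding window of `window_size` tokens."""
    edges = defaultdict(int)
    for i, word in enumerate(words):
        # Pair the current word with the next window_size - 1 words
        for j in range(i + 1, min(i + window_size, len(words))):
            if word != words[j]:
                edges[tuple(sorted((word, words[j])))] += 1
    return dict(edges)
```

These weighted edges then feed a PageRank-style ranking (e.g. via python-igraph's Graph and pagerank, which the new dependencies suggest); a larger window adds longer-range edges at extra computational cost.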


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (1)

94-103: Redundant condition that may cause confusion.

Lines 94-95 already check the English POS filter and stopwords, and lines 96-101 then handle the Chinese-character case, but lines 102-103 repeat the same English condition. The duplicated elif branch can never execute: when the line 94 condition holds, the word was already appended on line 95.

Remove the redundant branch:

             if len(word) >= 1 and flag in self.pos_filter['english'] and word.lower() not in en_stop_words:
                 words.append(word)
                 if re.compile('[\u4e00-\u9fff]').search(word):
                     jieba_tokens = pseg.cut(word)
                     for ch_word, ch_flag in jieba_tokens:
                         if len(ch_word) >= 1 and ch_flag in self.pos_filter['chinese'] \
                                 and ch_word not in ch_stop_words:
                             words.append(ch_word)
-                elif len(word) >= 1 and flag in self.pos_filter['english'] and word.lower() not in en_stop_words:
-                    words.append(word)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)

86-86: Privacy risk: logging the user query at INFO level may leak sensitive information.

The full user query and extracted keywords are currently logged at INFO level, which can leak private data in production. Downgrade to DEBUG:

-        log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
+        log.debug("User Query: %s | Keywords: %s", self._query, context["keywords"])

Based on learnings

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3ce504d and 4083ae0.

📒 Files selected for processing (5)
  • hugegraph-llm/config.md (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (3 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2 hunks)
  • pyproject.toml (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • hugegraph-llm/config.md
  • pyproject.toml
🧰 Additional context used
📓 Path-based instructions (2)
hugegraph-llm/**/*.py

📄 CodeRabbit inference engine (hugegraph-llm/AGENTS.md)

hugegraph-llm/**/*.py: Adhere to ruff code style for Python code
Type-check Python code with mypy
Keep each Python file under 600 lines for maintainability

Files:

  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
  • hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
hugegraph-llm/src/hugegraph_llm/operators/**/*.py

📄 CodeRabbit inference engine (hugegraph-llm/AGENTS.md)

Put core processing pipelines under src/hugegraph_llm/operators/

Files:

  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
  • hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
🧠 Learnings (10)
📓 Common learnings
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In hugegraph-llm's KeywordExtract class, user Gfreely wants to keep the extraction methods' interfaces consistent: all extraction methods (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], and the main run() method handles sorting and truncation uniformly instead of each method preprocessing its own results. This is their architectural preference.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final results (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization, so if max_keywords grows at runtime, TextRank can still only produce the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: NLTKHelper in hugegraph-llm uses lazy loading for stopwords and calls nltk.corpus.stopwords.words(lang) directly with the provided language parameter. It does not preload both English and Chinese stopwords - each language is loaded on first access. The lang parameter must match NLTK's expected language codes ("english", "chinese") or it will fail.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: In hugegraph-llm KeywordExtract, NLTKHelper loads both English and Chinese stopwords during initialization, but the stopwords(lang) method still requires the correct language key ("english" or "chinese") to return the appropriate stopword set for filtering.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
  • hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:37:41.224Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:139-141
Timestamp: 2025-08-18T14:37:41.224Z
Learning: In hugegraph-llm textrank_word_extract.py, NLTK resources are packaged locally under src/hugegraph_llm/resources/nltk_data/ and the code properly configures nltk.data.path using get_project_root() to point to these local resources before making NLTK calls, preventing LookupError issues.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
  • hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
📚 Learning: 2025-08-18T14:42:31.998Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely deduplicates ch_tokens (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens, which both fixes the potential ValueError from words.index() and makes batch replacement more efficient.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-21T12:34:10.407Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:187-202
Timestamp: 2025-08-21T12:34:10.407Z
Learning: In hugegraph-llm's KeywordExtract._extract_keywords_from_response method, the reuse of the sub_tokens variable is correct: in each loop iteration sub_tokens is reassigned to the tokenization of the current token, the processed valid sub-words are appended back to the same sub_tokens list, and finally everything is added to results at once. This implementation is valid and correct.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-29T13:11:08.943Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In hugegraph-llm's KeywordExtract class, user Gfreely wants to keep the extraction methods' interface consistent: all extraction methods (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], and the main run() method handles sorting and truncation uniformly rather than each method preprocessing its own results. This is their preferred architectural design.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-06-25T09:50:06.213Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final results (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization, so if max_keywords grows at runtime, TextRank can still only produce the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (max keyword count) configuration.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
🧬 Code graph analysis (3)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (1)
hugegraph-ml/src/hugegraph_ml/models/seal.py (3)
  • info (794-795)
  • debug (791-792)
  • warning (800-801)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (1)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (3)
  • NLTKHelper (31-136)
  • stopwords (37-70)
  • check_nltk_data (72-109)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4)
hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2)
  • BaseLLM (22-74)
  • generate (26-31)
hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (1)
  • LLMs (25-98)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2)
  • MultiLingualTextRank (31-151)
  • extract_keywords (134-151)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)
  • extract_keywords (67-85)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build (3.10)
  • GitHub Check: build (3.11)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)

86-86: Avoid logging user query content at INFO level.

Logging the full user query and keywords at INFO level can leak sensitive information; this privacy risk has been flagged multiple times in earlier reviews. Recommend downgrading to DEBUG:

-        log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
+        log.debug("User Query: %s | Keywords: %s", self._query, context["keywords"])
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (1)

82-109: Resource detection should at least be relaxed to primary/backup groups instead of treating every package as required
The current required_packages treats punkt and punkt_tab (plus both taggers) as mandatory, so even when the common punkt is already present, a missing and relatively rare punkt_tab makes check_nltk_data() attempt a download and return False when it cannot be fetched. Likewise, the check fails whenever averaged_perceptron_tagger_eng alone is absent. As a result, a default installation fails the check in most environments, blocking the rest of the pipeline. Please restore primary/backup group validation: a group passes as soon as any one of its packages is found; otherwise the backups are tried in turn. Suggest grouping the checks into tokenizers and taggers, returning False only when a whole group cannot be satisfied.

-        required_packages = {
-            'punkt': 'tokenizers/punkt',
-            'punkt_tab': 'tokenizers/punkt_tab',
-            'averaged_perceptron_tagger': 'taggers/averaged_perceptron_tagger',
-            "averaged_perceptron_tagger_eng": 'taggers/averaged_perceptron_tagger_eng'
-        }
-
-        for package, path in required_packages.items():
-            try:
-                nltk.data.find(path)
-            except LookupError:
-                ...
-                    nltk.data.find(path)
-                ...
-        return True
+        resource_groups = [
+            ("tokenizers", ["punkt", "punkt_tab"]),
+            ("taggers", ["averaged_perceptron_tagger_eng", "averaged_perceptron_tagger"]),
+        ]
+
+        for category, packages in resource_groups:
+            group_ready = False
+            for package in packages:
+                path = f"{category}/{package}"
+                try:
+                    nltk.data.find(path)
+                    group_ready = True
+                    break
+                except LookupError:
+                    log.info("Start download nltk package %s", package)
+                    try:
+                        if nltk.download(package, download_dir=nltk_data_dir, quiet=False):
+                            nltk.data.find(path)
+                            group_ready = True
+                            break
+                    except PermissionError as e:
+                        log.error("Permission denied when downloading %s: %s", package, e)
+                        return False
+                    except (URLError, HTTPError) as e:
+                        log.warning("Network error downloading %s: %s, will retry with backup method", package, e)
+                        continue
+                    except LookupError:
+                        log.error("Package %s not found after download. Check package name and nltk_data paths.", package)
+                        continue
+            if not group_ready:
+                return False
+        return True
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4083ae0 and ff00016.

📒 Files selected for processing (2)
  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (3 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
hugegraph-llm/**/*.py

📄 CodeRabbit inference engine (hugegraph-llm/AGENTS.md)

hugegraph-llm/**/*.py: Adhere to ruff code style for Python code
Type-check Python code with mypy
Keep each Python file under 600 lines for maintainability

Files:

  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
hugegraph-llm/src/hugegraph_llm/operators/**/*.py

📄 CodeRabbit inference engine (hugegraph-llm/AGENTS.md)

Put core processing pipelines under src/hugegraph_llm/operators/

Files:

  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
🧠 Learnings (10)
📓 Common learnings
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In hugegraph-llm's KeywordExtract class, user Gfreely wants to keep the extraction methods' interface consistent: all extraction methods (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], and the main run() method handles sorting and truncation uniformly rather than each method preprocessing its own results. This is their preferred architectural design.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final results (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization, so if max_keywords grows at runtime, TextRank can still only produce the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: NLTKHelper in hugegraph-llm uses lazy loading for stopwords and calls nltk.corpus.stopwords.words(lang) directly with the provided language parameter. It does not preload both English and Chinese stopwords - each language is loaded on first access. The lang parameter must match NLTK's expected language codes ("english", "chinese") or it will fail.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: In hugegraph-llm KeywordExtract, NLTKHelper loads both English and Chinese stopwords during initialization, but the stopwords(lang) method still requires the correct language key ("english" or "chinese") to return the appropriate stopword set for filtering.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:37:41.224Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:139-141
Timestamp: 2025-08-18T14:37:41.224Z
Learning: In hugegraph-llm textrank_word_extract.py, NLTK resources are packaged locally under src/hugegraph_llm/resources/nltk_data/ and the code properly configures nltk.data.path using get_project_root() to point to these local resources before making NLTK calls, preventing LookupError issues.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
📚 Learning: 2025-08-29T13:11:08.943Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In hugegraph-llm's KeywordExtract class, user Gfreely wants to keep the extraction methods' interface consistent: all extraction methods (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], and the main run() method handles sorting and truncation uniformly rather than each method preprocessing its own results. This is their preferred architectural design.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-06-25T09:50:06.213Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final results (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization, so if max_keywords grows at runtime, TextRank can still only produce the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (max keyword count) configuration.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:42:31.998Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely deduplicates ch_tokens (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens, which both fixes the potential ValueError from words.index() and makes batch replacement more efficient.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-21T12:34:10.407Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:187-202
Timestamp: 2025-08-21T12:34:10.407Z
Learning: In hugegraph-llm's KeywordExtract._extract_keywords_from_response method, the reuse of the sub_tokens variable is correct: in each loop iteration sub_tokens is reassigned to the tokenization of the current token, the processed valid sub-words are appended back to the same sub_tokens list, and finally everything is added to results at once. This implementation is valid and correct.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
🧬 Code graph analysis (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (3)
hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2)
  • BaseLLM (22-74)
  • generate (26-31)
hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (1)
  • LLMs (25-98)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2)
  • MultiLingualTextRank (31-151)
  • extract_keywords (134-151)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build (3.10)
  • GitHub Check: build (3.11)
🔇 Additional comments (4)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4)

92-101: LGTM!

The LLM extraction method is cleanly implemented, correctly returns the Dict[str, float] format, and matches the architecture design. The timing logs help with performance monitoring.

Based on learnings


103-116: LGTM!

The TextRank extraction method is robustly implemented:

  • Exception handling covers the common error types
  • Memory errors are logged at critical level (reasonable)
  • An empty dict is returned as a safe fallback on error
  • Returns Dict[str, float], consistent with the architecture design

Based on learnings


118-147: LGTM!

The hybrid extraction method is implemented correctly:

  • hybrid_llm_weights is boundary-validated (clamped to [0, 1])
  • Weighted scores are computed over the union of keywords, so every keyword from both methods is covered
  • Returns Dict[str, float]; run() handles sorting and truncation uniformly, matching the preferred architecture
  • Details are logged at DEBUG level, avoiding privacy leaks

Based on learnings
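A weighted-union merge like the one described above can be sketched as follows; the function and parameter names (merge_hybrid_scores, llm_weight) are illustrative, not the actual identifiers in keyword_extract.py:

```python
from typing import Dict


def merge_hybrid_scores(
    llm_scores: Dict[str, float],
    textrank_scores: Dict[str, float],
    llm_weight: float,
) -> Dict[str, float]:
    """Weighted merge over the union of keywords; a missing score counts as 0.0."""
    # Clamp the weight to [0, 1] so a misconfigured value cannot invert the blend.
    w = max(0.0, min(1.0, llm_weight))
    merged: Dict[str, float] = {}
    for kw in set(llm_scores) | set(textrank_scores):
        merged[kw] = w * llm_scores.get(kw, 0.0) + (1.0 - w) * textrank_scores.get(kw, 0.0)
    return merged
```

With this shape, the caller (here, a run() method) can sort by value and truncate to max_keywords in one place.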


149-178: LGTM!

The response-parsing method is robust:

  • Validates the split length before indexing (prevents IndexError)
  • Wraps score parsing in try-except (prevents ValueError)
  • Skips invalid entries instead of crashing
  • Correctly returns the Dict[str, float] format
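A minimal parser for the KEYWORDS:keyword:score format with the validations listed above could look like this; it is a sketch, not the project's _extract_keywords_from_response, which additionally tokenizes and filters the results:

```python
from typing import Dict


def parse_keyword_response(response: str) -> Dict[str, float]:
    """Parse lines like 'KEYWORDS:kw1:0.90,kw2:0.75' into keyword -> score."""
    results: Dict[str, float] = {}
    for line in response.splitlines():
        line = line.strip()
        if not line.startswith("KEYWORDS:"):
            continue
        for item in line[len("KEYWORDS:"):].split(","):
            parts = item.rsplit(":", 1)  # the keyword itself may contain ':'
            if len(parts) != 2:          # validate split length before indexing
                continue
            keyword, raw_score = parts[0].strip(), parts[1].strip()
            try:
                score = float(raw_score)
            except ValueError:           # skip invalid entries instead of crashing
                continue
            if keyword:
                results[keyword] = max(0.0, min(1.0, score))  # clamp to [0, 1]
    return results
```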

self._language = context.get("language", self._language).lower()
self._max_keywords = context.get("max_keywords", self._max_keywords)
# Use English by default
self._language = "chinese" if self._language == "cn" else "english"


⚠️ Potential issue | 🟠 Major

The language mapping is incomplete and may cause stopword loading to fail.

Currently only "cn" is mapped to "chinese"; other common language codes such as "en" and "zh" are not handled. Per earlier learnings, NLTKHelper and MultiLingualTextRank depend on exact language identifiers ("english" / "chinese"), so an incomplete mapping can break stopword filtering.

Suggest completing the language-mapping logic:

-        # Use English by default
-        self._language = "chinese" if self._language == "cn" else "english"
+        # Normalize language codes to match NLTK/TextRank expectations
+        lang = self._language.strip().lower()
+        if lang in ("cn", "zh", "zh-cn", "chinese"):
+            self._language = "chinese"
+        else:
+            self._language = "english"

Based on learnings

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
self._language = "chinese" if self._language == "cn" else "english"
self._language = (language or "").strip().lower()
if self._language in ("en", "en-us", "english"):
self._language = "english"
elif self._language in ("cn", "zh", "zh-cn", "zh_cn", "chinese"):
self._language = "chinese"
else:
# Default to english for unknown languages
self._language = "english"
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py around
line 61, the current language mapping only converts "cn" to "chinese" which is
incomplete and can break stopword loading; normalize the incoming language
string (lowercase, strip), and map common codes to the expected identifiers: map
"cn", "zh", "zh-cn" -> "chinese", map "en", "eng" -> "english", leave
"english"/"chinese" unchanged, and for any unknown or empty value default to
"english" (or raise a clear error if preferred); update the assignment
accordingly so downstream NLTKHelper and MultiLingualTextRank receive the exact
expected identifiers.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (3)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)

60-61: The language-code mapping is incomplete and can break stopword filtering and tokenization.

Currently only "cn" is mapped to "chinese"; other common language codes (such as "en", "zh", "zh-cn") are not handled. According to the retrieved learnings, NLTKHelper and MultiLingualTextRank require exact language identifiers ("english" / "chinese"), so an incomplete mapping disables stopword filtering and the tokenization strategy.

Suggested fix:

-        # Use English by default
-        self._language = "chinese" if self._language == "cn" else "english"
+        # Normalize language codes to match NLTK/TextRank expectations
+        lang = self._language.strip().lower()
+        if lang in ("cn", "zh", "zh-cn", "zh_cn", "chinese"):
+            self._language = "chinese"
+        elif lang in ("en", "eng", "english"):
+            self._language = "english"
+        else:
+            log.warning("Unsupported language '%s', defaulting to 'english'", lang)
+            self._language = "english"

Based on learnings


103-116: The MemoryError handling strategy may mask serious resource problems.

Line 112 catches MemoryError and logs it at critical level but does not re-raise. A memory error usually signals a serious resource problem (e.g. the input text is too large or the system is low on resources); silently returning an empty result may lead callers to believe there are simply no keywords rather than that the system hit a serious failure.

Pick one of the following strategies based on business needs:

Strategy 1: re-raise (recommended for critical business flows)

         except MemoryError as e:
             log.critical("TextRank memory error (text too large?): %s", e)
+            raise  # Let the caller see the resource problem

Strategy 2: degrade gracefully and flag explicitly

         except MemoryError as e:
             log.critical("TextRank memory error (text too large?): %s", e)
+            context["extraction_error"] = "memory_error"
+            context["error_message"] = str(e)
         end_time = time.perf_counter()
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (1)

53-58: The log message is inaccurate and should reflect the actual download status

The "already downloaded" message on line 56 is misleading: at that point a download has just completed, so it should read "downloaded successfully" or "download completed".

Apply this change:

-                    log.debug("NLTK package stopwords is already downloaded")
+                    log.debug("NLTK package stopwords downloaded successfully")
🧹 Nitpick comments (2)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)

32-47: Suggest adding boundary validation for the constructor parameters.

Although MultiLingualTextRank validates window_size internally (range 0-10), validating max_keywords and window_size up front here improves robustness and readability and avoids passing invalid parameters to downstream components.

Suggested addition:

def __init__(
    self,
    text: Optional[str] = None,
    llm: Optional[BaseLLM] = None,
    max_keywords: int = 5,
    extract_template: Optional[str] = None,
):
    if max_keywords < 1:
        log.warning("max_keywords must be >= 1, using default 5")
        max_keywords = 5
    
    self._llm = llm
    self._query = text
    # ... rest of the initialization
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (1)

59-68: The error-handling improvement is sound, but a redundant check remains

The LookupError catch added on lines 59-63 ensures graceful degradation to an empty list even when the download fails, in line with earlier review suggestions.

However, the None check on lines 65-68 is slightly redundant: line 63 already sets self._stopwords[lang] = [] on exception, and the condition if self._stopwords.get(lang) is None on line 42 ensures this block only runs on first access, so final_stopwords can never be None.

Optional cleanup (remove the redundant check):

-        # final check
-        final_stopwords = self._stopwords[lang]
-        if final_stopwords is None:
-            return []
-
         return self._stopwords[lang]
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ff00016 and caf3156.

📒 Files selected for processing (2)
  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (3 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
hugegraph-llm/**/*.py

📄 CodeRabbit inference engine (hugegraph-llm/AGENTS.md)

hugegraph-llm/**/*.py: Adhere to ruff code style for Python code
Type-check Python code with mypy
Keep each Python file under 600 lines for maintainability

Files:

  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
hugegraph-llm/src/hugegraph_llm/operators/**/*.py

📄 CodeRabbit inference engine (hugegraph-llm/AGENTS.md)

Put core processing pipelines under src/hugegraph_llm/operators/

Files:

  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
🧠 Learnings (10)
📓 Common learnings
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (max keyword count) configuration.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In hugegraph-llm's KeywordExtract class, user Gfreely wants to keep the extraction methods' interface consistent: all extraction methods (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], and the main run() method handles sorting and truncation uniformly rather than each method preprocessing its own results. This is their preferred architectural design.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final results (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization, so if max_keywords grows at runtime, TextRank can still only produce the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: NLTKHelper in hugegraph-llm uses lazy loading for stopwords and calls nltk.corpus.stopwords.words(lang) directly with the provided language parameter. It does not preload both English and Chinese stopwords - each language is loaded on first access. The lang parameter must match NLTK's expected language codes ("english", "chinese") or it will fail.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: In hugegraph-llm KeywordExtract, NLTKHelper loads both English and Chinese stopwords during initialization, but the stopwords(lang) method still requires the correct language key ("english" or "chinese") to return the appropriate stopword set for filtering.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:37:41.224Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:139-141
Timestamp: 2025-08-18T14:37:41.224Z
Learning: In hugegraph-llm textrank_word_extract.py, NLTK resources are packaged locally under src/hugegraph_llm/resources/nltk_data/ and the code properly configures nltk.data.path using get_project_root() to point to these local resources before making NLTK calls, preventing LookupError issues.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
📚 Learning: 2025-08-29T13:11:08.943Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In hugegraph-llm's KeywordExtract class, user Gfreely wants to keep the extraction methods' interface consistent: all extraction methods (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], and the main run() method handles sorting and truncation uniformly rather than each method preprocessing its own results. This is their preferred architectural design.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-06-25T09:50:06.213Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final results (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization, so if max_keywords grows at runtime, TextRank can still only produce the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (max keyword count) configuration.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:42:31.998Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely deduplicates ch_tokens (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens, which both fixes the potential ValueError from words.index() and makes batch replacement more efficient.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-21T12:34:10.407Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:187-202
Timestamp: 2025-08-21T12:34:10.407Z
Learning: In hugegraph-llm's KeywordExtract._extract_keywords_from_response method, the reuse of the sub_tokens variable is correct: in each loop iteration sub_tokens is reassigned to the tokenization of the current token, the processed valid sub-words are appended back to the same sub_tokens list, and finally everything is added to results at once. This implementation is valid and correct.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
🧬 Code graph analysis (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4)
hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2)
  • BaseLLM (22-74)
  • generate (26-31)
hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (1)
  • LLMs (25-98)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2)
  • MultiLingualTextRank (31-151)
  • extract_keywords (134-151)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)
  • extract_keywords (67-85)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build (3.11)
  • GitHub Check: build (3.10)
🔇 Additional comments (3)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)

149-184: The response-parsing logic is well implemented.

The method's error handling is thorough:

  • Malformed items are logged with appropriate warnings (line 170)
  • Scores are range-validated and clamped (lines 176-178)
  • Parsing exceptions are caught and logged without failing the whole pipeline (lines 181-182)

Changing the return type to Dict[str, float] lets downstream consumers use the scores for weighted processing, making the interface more flexible.
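The scored Dict[str, float] output also makes a weighted hybrid merge straightforward. A minimal standalone sketch of such a merge (the function name, weight value, and sample scores here are illustrative assumptions, not the PR's actual implementation):

```python
def merge_scores(llm_scores: dict, textrank_scores: dict,
                 llm_weight: float = 0.6) -> dict:
    """Combine two keyword->score maps with a weighted average.

    A keyword found by only one extractor keeps that extractor's
    weighted contribution; shared keywords get the blended score.
    """
    merged = {}
    for kw in set(llm_scores) | set(textrank_scores):
        merged[kw] = (llm_weight * llm_scores.get(kw, 0.0)
                      + (1.0 - llm_weight) * textrank_scores.get(kw, 0.0))
    return merged

ranks = merge_scores({"graph": 0.9, "rag": 0.7}, {"graph": 0.8, "llm": 0.5})
top = sorted(ranks, key=ranks.get, reverse=True)
# top == ["graph", "rag", "llm"]
```

A hybrid extractor can then sort the merged map by score and truncate to max_keywords, exactly as the LLM-only path does.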

hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (2)

22-22: The imports are appropriate and support the enhanced error handling

The newly added URLError, HTTPError, and log imports provide the support needed for the method's network-exception handling and logging.

Also applies to: 28-28


39-41: Correctly registers the project's bundled resource path

Adding the project's bundled nltk_data path to the search list first ensures local resources are used preferentially and avoids unnecessary downloads. This is consistent with the behavior of check_nltk_data().

Comment on lines +72 to +109
def check_nltk_data(self):
_hugegraph_source_dir = os.path.join(resource_path, "nltk_data")
if _hugegraph_source_dir not in nltk.data.path:
nltk.data.path.append(_hugegraph_source_dir)

cache_dir = self.get_cache_dir()
nltk_data_dir = os.environ.get("NLTK_DATA", cache_dir)
if nltk_data_dir not in nltk.data.path:
nltk.data.path.append(nltk_data_dir)

required_packages = {
'punkt': 'tokenizers/punkt',
'punkt_tab': 'tokenizers/punkt_tab',
'averaged_perceptron_tagger': 'taggers/averaged_perceptron_tagger',
"averaged_perceptron_tagger_eng": 'taggers/averaged_perceptron_tagger_eng'
}

for package, path in required_packages.items():
try:
nltk.data.find(path)
except LookupError:
log.info("Start download nltk package %s", package)
try:
if not nltk.download(package, download_dir=nltk_data_dir, quiet=False):
log.warning("NLTK download command returned False for package %s.", package)
return False
# Verify after download
nltk.data.find(path)
except PermissionError as e:
log.error("Permission denied when downloading %s: %s", package, e)
return False
except (URLError, HTTPError) as e:
log.warning("Network error downloading %s: %s, will retry with backup method", package, e)
return False
except LookupError:
log.error("Package %s not found after download. Check package name and nltk_data paths.", package)
return False
return True

⚠️ Potential issue | 🟠 Major

The download strategy is inefficient and the log message is misleading — implement a fallback mechanism instead of downloading every alternative package

The current implementation has the following problems:

  1. Redundant downloads: the method downloads all 4 packages, but punkt and punkt_tab are interchangeable tokenizer resources, and averaged_perceptron_tagger and averaged_perceptron_tagger_eng are interchangeable POS-tagger resources. Only one available resource per group needs to be downloaded; the current strategy wastes bandwidth and time.

  2. Misleading log message: line 104 says "will retry with backup method", but the code implements no retry or backup download method — it simply returns False.

  3. Inconsistent with earlier review feedback: a previous review explicitly suggested a fallback-name strategy for the tokenizer and tagger (try the preferred resource first, then the alternative on failure) rather than blindly downloading every variant.

Suggested refactor using a fallback-check strategy:

     def check_nltk_data(self):
         _hugegraph_source_dir = os.path.join(resource_path, "nltk_data")
         if _hugegraph_source_dir not in nltk.data.path:
             nltk.data.path.append(_hugegraph_source_dir)
 
         cache_dir = self.get_cache_dir()
         nltk_data_dir = os.environ.get("NLTK_DATA", cache_dir)
         if nltk_data_dir not in nltk.data.path:
             nltk.data.path.append(nltk_data_dir)
 
-        required_packages = {
-            'punkt': 'tokenizers/punkt',
-            'punkt_tab': 'tokenizers/punkt_tab',
-            'averaged_perceptron_tagger': 'taggers/averaged_perceptron_tagger',
-            "averaged_perceptron_tagger_eng": 'taggers/averaged_perceptron_tagger_eng'
-        }
-
-        for package, path in required_packages.items():
-            try:
-                nltk.data.find(path)
-            except LookupError:
-                log.info("Start download nltk package %s", package)
-                try:
-                    if not nltk.download(package, download_dir=nltk_data_dir, quiet=False):
-                        log.warning("NLTK download command returned False for package %s.", package)
-                        return False
-                    # Verify after download
-                    nltk.data.find(path)
-                except PermissionError as e:
-                    log.error("Permission denied when downloading %s: %s", package, e)
-                    return False
-                except (URLError, HTTPError) as e:
-                    log.warning("Network error downloading %s: %s, will retry with backup method", package, e)
-                    return False
-                except LookupError:
-                    log.error("Package %s not found after download. Check package name and nltk_data paths.", package)
-                    return False
-        return True
+        # Fallback strategy: only one resource per group needs to be available
+        resource_groups = [
+            ("tokenizers", ["punkt", "punkt_tab"]),
+            ("taggers", ["averaged_perceptron_tagger_eng", "averaged_perceptron_tagger"]),
+        ]
+
+        def _ensure_resource(resource_type, candidates):
+            """Ensure at least one candidate resource is available"""
+            for candidate in candidates:
+                path = f"{resource_type}/{candidate}"
+                try:
+                    nltk.data.find(path)
+                    log.debug("Found existing NLTK resource: %s", path)
+                    return True  # found a usable resource
+                except LookupError:
+                    continue  # try the next candidate
+            
+            # No candidate exists; try downloading the preferred one
+            preferred = candidates[0]
+            log.info("Downloading NLTK package: %s", preferred)
+            try:
+                if not nltk.download(preferred, download_dir=nltk_data_dir, quiet=False):
+                    log.warning("NLTK download returned False for %s", preferred)
+                    return False
+                nltk.data.find(f"{resource_type}/{preferred}")
+                log.info("Successfully downloaded and verified %s", preferred)
+                return True
+            except PermissionError as e:
+                log.error("Permission denied downloading %s: %s", preferred, e)
+                raise RuntimeError(f"Cannot write NLTK data to {nltk_data_dir}, check permissions") from e
+            except (URLError, HTTPError) as e:
+                log.error("Network error downloading %s: %s", preferred, e)
+                return False
+            except LookupError:
+                log.error("Package %s not found after download", preferred)
+                return False
+        
+        for resource_type, candidates in resource_groups:
+            if not _ensure_resource(resource_type, candidates):
+                return False
+        
+        return True

This approach:

  • Downloads only one available variant per resource group (preferring the first candidate)
  • Avoids redundant downloads
  • Raises on PermissionError (matching the severity distinction imbajin suggested earlier)
  • Removes the misleading "retry with backup method" log
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py around
lines 72 to 109, the current implementation redundantly attempts to download
every variant and logs a misleading "will retry with backup method" without
actually implementing a fallback; refactor to use a fallback-per-group strategy
where each logical resource group (tokenizer: ['punkt', 'punkt_tab'], tagger:
['averaged_perceptron_tagger', 'averaged_perceptron_tagger_eng']) is checked in
priority order: for each group, try nltk.data.find for the preferred name, if
not found try the fallback name, and only if neither exists attempt a single
download for the first available name that succeeds (verify with nltk.data.find
afterwards); do not download both variants, remove the misleading "will retry
with backup method" log, return False only on genuine download/network failures,
and re-raise PermissionError instead of swallowing it so callers can handle
permission issues appropriately.

Comment on lines +69 to +81
method = (context.get("extract_method", self._extract_method) or "LLM").strip().lower()
if method == "llm":
# LLM method
ranks = self._extract_with_llm()
elif method == "textrank":
# TextRank method
ranks = self._extract_with_textrank()
elif method == "hybrid":
# Hybrid method
ranks = self._extract_with_hybrid()
else:
log.warning("Invalid extract_method %s", method)
raise ValueError(f"Invalid extract_method: {method}")

⚠️ Potential issue | 🟠 Major

The extract_method parameter handling has an inconsistent default value and an empty-string risk.

Problems:

  1. The default on line 69 is hardcoded to "LLM", which is inconsistent with self._extract_method, read from llm_settings.keyword_extract_type in the constructor
  2. If context.get("extract_method") returns a whitespace-only string, strip().lower() produces an empty string, causing line 80 to raise a ValueError

Suggested fix:

-        method = (context.get("extract_method", self._extract_method) or "LLM").strip().lower()
+        method = (context.get("extract_method", self._extract_method) or self._extract_method).strip().lower()
+        if not method:
+            log.warning("Empty extract_method, using default: %s", self._extract_method)
+            method = self._extract_method.lower()
+        
         if method == "llm":
             # LLM method
             ranks = self._extract_with_llm()
         elif method == "textrank":
             # TextRank method
             ranks = self._extract_with_textrank()
         elif method == "hybrid":
             # Hybrid method
             ranks = self._extract_with_hybrid()
         else:
-            log.warning("Invalid extract_method %s", method)
-            raise ValueError(f"Invalid extract_method: {method}")
+            log.warning("Invalid extract_method '%s', valid options: llm/textrank/hybrid", method)
+            raise ValueError(f"Invalid extract_method: '{method}'. Must be 'llm', 'textrank', or 'hybrid'")

keywords = [] if not ranks else sorted(ranks, key=ranks.get, reverse=True)
keywords = [k.replace("'", "") for k in keywords]
context["keywords"] = keywords[:self._max_keywords]
log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
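The sorting and truncation in the context lines above behave like this minimal standalone sketch (the sample scores and max_keywords value are illustrative, not taken from the PR):

```python
ranks = {"hugegraph": 0.92, "text'rank": 0.75, "llm": 0.88}
max_keywords = 2

# Sort keywords by descending score, strip single quotes, then truncate
keywords = [] if not ranks else sorted(ranks, key=ranks.get, reverse=True)
keywords = [k.replace("'", "") for k in keywords]
keywords = keywords[:max_keywords]
# keywords == ["hugegraph", "llm"]
```

Note that quote-stripping happens before truncation, so the top-k selection is unaffected by the cosmetic cleanup.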

⚠️ Potential issue | 🟠 Major

Logging the user query at INFO level is a privacy risk.

Line 86 logs the full user query and the extracted keywords at INFO level. User queries may contain sensitive information (e.g. personal data or business secrets); recording them at INFO level increases the risk of privacy leaks and conflicts with data-minimization security practices.

Suggested fix:

-        log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
+        log.debug("User Query: %s | Keywords: %s", self._query, context["keywords"])

Alternatively, if INFO-level logging is required, redact the query first:

query_preview = self._query[:50] + "..." if len(self._query) > 50 else self._query
log.info("Keywords extracted (query preview: %s, length: %d) | Keywords: %s", query_preview, len(self._query), context["keywords"])
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py around
line 86, remove logging of the full user query at INFO level to avoid leaking
sensitive data; instead log only non-sensitive metadata (e.g., query length) and
the extracted keywords, or log a redacted/truncated preview of the query if
strictly required, and if full query must be recorded keep it at DEBUG level;
implement one of these changes so INFO logs never contain the complete original
query.

@imbajin imbajin closed this Oct 21, 2025