feat(llm): BREAKING CHANGE update keyword extraction method#32
Gfreely wants to merge 48 commits into hugegraph:main from
Conversation
fix apache#224 problem, update the new UI to support changing the keyword extraction method
fix the pylint check bug
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough: Adds a multilingual TextRank implementation and switches keyword extraction to three modes (llm/textrank/hybrid); the extraction result changes from a set to a keyword→score mapping. Also adjusts the RAG pipeline signatures and call sites, prompt templates, and config options; strengthens the NLTK resource check; adds scipy and python-igraph dependencies; and makes minor tweaks to a few imports and .gitignore.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant User
    participant RAG as RAGPipeline.extract_keywords
    participant KE as KeywordExtract.run
    participant LLM as LLM Backend
    participant TR as MultiLingualTextRank
    User->>RAG: submit text for extraction
    RAG->>KE: call KeywordExtract.run()
    alt keyword_extract_type == "llm"
        KE->>LLM: send prompt, await response
        LLM-->>KE: return "KEYWORDS:kw:score,..." string
        KE->>KE: parse into {kw:score}
    else keyword_extract_type == "textrank"
        KE->>TR: call TR.extract_keywords(text)
        TR-->>KE: return {kw:score}
    else keyword_extract_type == "hybrid"
        par LLM path
            KE->>LLM: send prompt
            LLM-->>KE: return {kw:score}
        and TextRank path
            KE->>TR: TR.extract_keywords(text)
            TR-->>KE: return {kw:score}
        end
        KE->>KE: fuse by hybrid_llm_weights and sort
    end
    KE-->>RAG: return keyword→score mapping
    RAG-->>User: return result
```
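The hybrid branch in the diagram fuses the two score maps before sorting. A minimal sketch of such a weighted fusion (hypothetical helper name and signature, assuming both methods return keyword→score dicts and an `llm_weight` in [0, 1]; the PR's actual fusion may differ):

```python
def fuse_keyword_scores(llm_scores: dict, textrank_scores: dict,
                        llm_weight: float = 0.5) -> dict:
    """Merge two keyword->score maps with a linear weight and sort descending."""
    fused = {}
    for kw in set(llm_scores) | set(textrank_scores):
        # A keyword missing from one map contributes 0 for that source
        fused[kw] = (llm_weight * llm_scores.get(kw, 0.0)
                     + (1 - llm_weight) * textrank_scores.get(kw, 0.0))
    return dict(sorted(fused.items(), key=lambda kv: kv[1], reverse=True))

scores = fuse_keyword_scores({"graph": 0.9, "rag": 0.4},
                             {"graph": 0.5, "llm": 0.6}, llm_weight=0.5)
# "graph" gets 0.5*0.9 + 0.5*0.5 = 0.7 and ranks first
```

Keywords found by both sources are naturally boosted, which is the usual motivation for a hybrid mode.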
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Actionable comments posted: 4
🧹 Nitpick comments (2)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
257-259: Consider refining the window-size check. When the number of unique words is smaller than the window size, a meaningful co-occurrence graph can still be built, so the condition should be relaxed:

```diff
-if len(unique_words) < self.window:
+if len(unique_words) < 2:
     return
```

hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py (1)
297-302: Consider improving the hint text of the mask_words input. The current hint is long; it could be shortened and given an example:

```diff
 mask_words_input = gr.Textbox(
     label="TextRank mask words",
-    info="""Enter any words you want to protect from being split during Chinese word segmentation(e.g., C++, website URLs). Separate each entry with a comma.""",
+    info="Words protected from being split during Chinese segmentation, comma-separated. Example: C++,GitHub,ChatGPT",
     show_copy_button=True,
     lines=7,
 )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- hugegraph-llm/pyproject.toml (1 hunks)
- hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py (11 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (5 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.
hugegraph-llm/pyproject.toml (1)
Learnt from: cgwer
PR: hugegraph/hugegraph-ai#10
File: hugegraph-python-client/pyproject.toml:0-0
Timestamp: 2025-05-27T06:55:13.779Z
Learning: The hugegraph-python-client is a component within the hugegraph-ai project repository (apache/incubator-hugegraph-ai), not a standalone repository. When reviewing project URLs in pyproject.toml files within this project, they should point to the main hugegraph-ai repository.
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (2)
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:100-116
Timestamp: 2025-06-25T09:45:10.751Z
Learning: In hugegraph-llm BasePromptConfig class, llm_settings is a runtime property that is loaded from config through dependency injection during object initialization, not a static class attribute. Static analysis tools may flag this as missing but it's intentional design.
🧬 Code Graph Analysis (1)
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py (4)
- hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (4): RAGPipeline (38-267), extract_keywords (68-100), keywords_to_vid (106-130), import_schema (102-104)
- hugegraph-llm/src/hugegraph_llm/operators/llm_op/answer_synthesize.py (1): AnswerSynthesize (36-259)
- hugegraph-llm/src/hugegraph_llm/utils/decorators.py (1): with_task_id (109-123)
- hugegraph-llm/src/hugegraph_llm/operators/gremlin_generate_task.py (1): import_schema (46-55)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build (3.11)
- GitHub Check: build (3.10)
🔇 Additional comments (4)
hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)
68-100: Implementation looks correct. The new TextRank-related parameters are well designed and clearly documented. Normalizing the language code from "english" to "en" is a good improvement.

hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
281-301: The main TextRank method is well implemented. The full keyword-extraction flow is reasonable and includes appropriate boundary-condition checks.
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py (2)
83-90: Parameter passing is correct. The new TextRank-related parameters are properly forwarded to the extract_keywords method.

431-460: Batch-processing function updated correctly. The new parameters are passed consistently in the batch function.
```python
if self._extract_method == "TextRank":
    # Extract keywords with TextRank
    keywords = self._extract_with_textrank()
else:
    # Extract keywords with the LLM
    keywords = self._extract_with_llm()
keywords = {k.replace("'", "") for k in keywords}
context["keywords"] = list(keywords)[:self._max_keywords]
log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])

# extracting keywords & expanding synonyms increase the call count by 1
context["call_count"] = context.get("call_count", 0) + 1
return context
```
🛠️ Refactor suggestion
Consider adding error handling to the LLM extraction method as well.
Currently only the TextRank path has try-catch error handling; adding similar handling to the LLM path would improve the system's robustness.
```diff
 if self._extract_method == "TextRank":
     # Extract keywords with TextRank
     keywords = self._extract_with_textrank()
 else:
     # Extract keywords with the LLM
-    keywords = self._extract_with_llm()
+    try:
+        keywords = self._extract_with_llm()
+    except Exception as e:
+        log.error("LLM Keyword extraction error: %s", e)
+        keywords = set()
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if self._extract_method == "TextRank":
    # Extract keywords with TextRank
    keywords = self._extract_with_textrank()
else:
    # Extract keywords with the LLM
    try:
        keywords = self._extract_with_llm()
    except Exception as e:
        log.error("LLM Keyword extraction error: %s", e)
        keywords = set()
keywords = {k.replace("'", "") for k in keywords}
context["keywords"] = list(keywords)[:self._max_keywords]
log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
# extracting keywords & expanding synonyms increase the call count by 1
context["call_count"] = context.get("call_count", 0) + 1
return context
```
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py around
lines 73 to 85, the LLM keyword extraction method lacks error handling unlike
the TextRank method. Add a try-except block around the call to
self._extract_with_llm() to catch and handle any exceptions, ensuring the system
remains robust. Log the error appropriately and decide on a fallback or empty
keyword set if an exception occurs.
The self.llm.generate method is also called elsewhere in the project and is already sufficiently robust, so it is not necessary to add error handling.
@Gfreely, understood! Thank you for the clarification. You have better context about the project's LLM implementation and its robustness across the codebase. I'll defer to your technical judgment on this matter.
Actionable comments posted: 2
♻️ Duplicate comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
73-85: The LLM path still lacks error handling. Per the earlier review, the TextRank path now has error handling, but the LLM path still does not:

```diff
 if self._extract_method == "TextRank":
     # Extract keywords with TextRank
     keywords = self._extract_with_textrank()
 else:
     # Extract keywords with the LLM
-    keywords = self._extract_with_llm()
+    try:
+        keywords = self._extract_with_llm()
+    except Exception as e:
+        log.error("LLM Keyword extraction error: %s", e)
+        keywords = set()
```
🧹 Nitpick comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
263-264: The graph node-count check may be too strict. Returning early when the number of unique words is below the window size discards cases where a meaningful graph can still be built:

```diff
-if len(unique_words) < self.window:
-    return
+if len(unique_words) < 2:
+    return
```
📒 Files selected for processing (1)
- hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4 hunks)
🔇 Additional comments (4)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4)
153-164: Stopword file loading now has proper error handling. Per the earlier review, stopword loading is now wrapped in try-catch and gracefully handles a missing file.

199-200: The regex pattern has been fixed. The syntax error flagged in the earlier review is resolved; the pattern now correctly uses (?![a-zA-Z0-9]) as a negative lookahead assertion.

292-294: The post-preprocessing emptiness check is sound. If the token list is empty after preprocessing, an empty list is returned immediately, avoiding downstream errors.

300-301: The graph state check is thorough. Before running PageRank, the graph is verified to exist and to have nodes, preventing algorithm failures.
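The negative lookahead mentioned above is easy to verify in isolation. A small sketch (an illustrative pattern, not the exact one from the PR) showing how `(?![a-zA-Z0-9])` protects a token like `C++` without matching inside longer identifiers:

```python
import re

# (?![a-zA-Z0-9]) succeeds only when the next character is NOT alphanumeric,
# so "C++" followed by a space or end-of-string matches, but "C++11" does not.
pattern = re.compile(r"C\+\+(?![a-zA-Z0-9])")

print(bool(pattern.search("written in C++ today")))  # True
print(bool(pattern.search("needs C++11 support")))   # False
```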
Actionable comments posted: 1
♻️ Duplicate comments (2)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
139-142: Constructor parameters lack default values. Per the earlier review, the constructor parameters should carry default values to improve robustness:

```python
def __init__(self, keyword_num=5, window_size=5, mask_words=""):
```

Note: this was flagged in a previous review but is still present in the code.

168: The mask_words parameter needs a null check. Per the earlier review, calling split() raises an exception when mask_words is None:

```diff
-self.mask_words = list(filter(None, mask_words.split(',')))
+self.mask_words = list(filter(None, (mask_words or "").split(',')))
```

Note: this was flagged in a previous review but is still present in the code.
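The None-safe parsing suggested above can be reproduced standalone. A minimal sketch (a hypothetical helper, not the class method itself) that tolerates None, empty strings, and stray commas:

```python
def parse_mask_words(mask_words=None):
    """Split a comma-separated string into a clean list, tolerating None/empty input."""
    # (mask_words or "") turns None into "", so .split() never raises
    return [w.strip() for w in (mask_words or "").split(",") if w.strip()]

print(parse_mask_words("C++, GitHub,,ChatGPT"))  # ['C++', 'GitHub', 'ChatGPT']
print(parse_mask_words(None))                    # []
```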
🧹 Nitpick comments (3)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (3)
170-253: The text preprocessing logic is overly complex; consider refactoring. The _preprocess method mixes Chinese and English handling; splitting it into smaller dedicated methods would improve readability and maintainability:

```python
def _preprocess(self, text, lang):
    if lang.startswith('zh'):
        return self._preprocess_chinese(text)
    if lang.startswith('en'):
        return self._preprocess_english(text)
    return []

def _preprocess_chinese(self, text):
    # Chinese preprocessing logic
    pass

def _preprocess_english(self, text):
    # English preprocessing logic
    pass
```

217-218: Dynamically extending the dictionary may hurt performance. Adding placeholder entries to the jieba dictionary on every call can be costly, especially over large volumes of text. Consider adding common placeholder patterns once in the constructor, or using jieba's temporary dictionary facility:

```python
# Pre-initialize in the constructor
jieba.initialize()
```

263-264: The graph node-count check may be inaccurate. Returning early when the number of unique words is below the sliding-window size is not ideal, since a meaningful graph can still be built from fewer words:

```diff
-if len(unique_words) < self.window:
+if len(unique_words) < 2:  # at least two words are needed to build a graph
     return
```
📒 Files selected for processing (1)
- hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4 hunks)
🔇 Additional comments (2)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
287-307: The main keyword-extraction function is clear. The extract_keywords implementation handles boundary conditions correctly, including empty token lists and empty graphs; error handling and control flow are sound.

98-109: The TextRank extraction method is well implemented. It includes appropriate error handling, performance timing, and logging; the exception types cover common runtime errors, and the return type matches the LLM method.
Actionable comments posted: 0
♻️ Duplicate comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
48-57: The textrank_kwargs parameter still needs a null check. Even though you noted that all instantiation sites set defaults, unpacking **textrank_kwargs directly in the constructor is still risky: if textrank_kwargs is None, it fails at runtime. Add a null check for robustness:

```diff
-self._textrank_model = MultiLingualTextRank(**textrank_kwargs)  # TextRank params
+self._textrank_model = MultiLingualTextRank(**(textrank_kwargs or {}))  # TextRank params
```
🧹 Nitpick comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
169-252: The text preprocessing is complex but comprehensive. The Chinese/English preprocessing pipeline is well designed, covering special-word masking, regex cleanup, tokenization, and POS filtering, though the complex regex patterns may affect performance.

For large volumes of text, consider:
- Pre-compiling frequently used regex patterns
- Using a more efficient text-processing library
- Adding a length limit or chunked processing for very long texts
📒 Files selected for processing (1)
- hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4 hunks)
🔇 Additional comments (6)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (6)
18-37: Imports and constants look reasonable. The new imports are all required by the TextRank feature, including jieba for Chinese segmentation, networkx for graph algorithms, and nltk for English processing; the EXTRACT_STOPWORDS constant is appropriate.

73-85: The extraction-method dispatch is correct. The conditional correctly dispatches between TextRank and LLM, error handling is in place, and the keyword post-processing and context updates look sound.

98-108: The TextRank extraction method is well implemented. Error handling covers multiple exception types, logging aids debugging, and returning an empty set as a fallback is a reasonable design.

139-168: The MultiLingualTextRank constructor is complete. It now has sensible defaults, well-defined POS filters, proper error handling in stopword loading, and the mask_words null check is implemented.

254-285: Graph construction and PageRank are implemented correctly. The word co-occurrence graph avoids self-loops, edge weights are computed correctly, PageRank uses standard parameters, and the empty-graph check ensures robustness.

286-306: The main extraction flow is complete. It follows the standard TextRank pipeline (preprocess → build graph → PageRank → top-K) with proper handling of edge cases such as empty token lists and empty graphs.
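The graph-construction and PageRank steps reviewed above can be sketched with networkx. This is a simplified illustration (pre-tokenized input, no POS filtering or stopword removal; function and parameter names are illustrative, not the PR's API):

```python
import networkx as nx

def textrank_keywords(words, window=3, top_k=5):
    """Build a weighted co-occurrence graph over a sliding window, then rank with PageRank."""
    graph = nx.Graph()
    for i, w1 in enumerate(words):
        # connect each word to the words within the sliding window after it
        for w2 in words[i + 1:i + window]:
            if w1 == w2:
                continue  # skip self-loops
            if graph.has_edge(w1, w2):
                graph[w1][w2]["weight"] += 1
            else:
                graph.add_edge(w1, w2, weight=1)
    if graph.number_of_nodes() == 0:
        return {}
    ranks = nx.pagerank(graph, alpha=0.85, weight="weight")
    return dict(sorted(ranks.items(), key=lambda kv: kv[1], reverse=True)[:top_k])

words = "hugegraph stores graph data and graph queries run on graph data".split()
print(textrank_keywords(words, window=3, top_k=3))
```

Frequently co-occurring words accumulate edge weight, so they attract more PageRank mass and surface in the top-K.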
Pull Request Overview
This PR adds support for TextRank-based keyword extraction in the RAG pipeline, updates the pipeline operator and demo UI to expose TextRank parameters, and introduces new dependencies for graph-based ranking.
- Added a MultiLingualTextRank class and branched keyword extraction logic between LLM and TextRank.
- Extended RAGPipeline.extract_keywords to accept method, window size, mask words, and language settings.
- Updated the Gradio demo (rag_block.py) to include UI controls for extraction method, language, mask words, keyword count, and sliding window.
- Added networkx and scipy to project dependencies.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py | Implemented TextRank extraction, added MultiLingualTextRank, and method branching |
| hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py | Propagated new extract_method and textrank_kwargs through pipeline |
| hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py | Added Gradio inputs for extraction method, language, mask words, max keywords, and window size |
| hugegraph-llm/pyproject.toml | Added networkx and scipy dependencies |
Actionable comments posted: 0
🧹 Nitpick comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
179-262: Consider splitting the complex preprocessing method. The _preprocess method currently handles too many responsibilities, including masking, cleanup, tokenization, and tagging. Splitting it into smaller methods would improve maintainability:

```python
def _preprocess(self, text, lang):
    if lang.startswith('zh'):
        return self._preprocess_chinese(text)
    if lang.startswith('en'):
        return self._preprocess_english(text)
    return []

def _preprocess_chinese(self, text):
    # Chinese preprocessing logic
    ...

def _preprocess_english(self, text):
    # English preprocessing logic
    ...
```
📒 Files selected for processing (1)
- hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (6)
- hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2): BaseLLM (22-74), generate (26-31)
- hugegraph-llm/src/hugegraph_llm/models/llms/ollama.py (1): generate (37-60)
- hugegraph-llm/src/hugegraph_llm/models/llms/openai.py (1): generate (57-85)
- hugegraph-llm/src/hugegraph_llm/models/llms/qianfan.py (1): generate (38-53)
- hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1): extract_keywords (68-100)
- hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (2): NLTKHelper (30-80), stopwords (36-53)
🔇 Additional comments (11)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (11)
18-37: Imports and constants are reasonable. The new imports support the TextRank implementation and the constants are clearly defined.

41-57: The constructor parameters are well designed. The new extract_method and textrank_kwargs parameters support the TextRank feature and are passed through correctly.

73-85: The method dispatch is clear. Extraction is correctly routed to the appropriate implementation, and keyword handling and context management stay consistent.

87-96: The LLM extraction method remains backward compatible, preserving the original LLM keyword-extraction logic with complete timing and logging.

98-114: The TextRank extraction method is solid, with appropriate exception handling, timing consistent with the LLM method, and a sensible error-handling strategy.

116-141: The response-parsing method is stable. The logic matches the original implementation, the formatting changes are reasonable, and stopword handling is correct.

144-162: The TextRank constructor is well designed: sensible defaults, Chinese/English POS filtering, the mask_words null check, and appropriate lazy loading.

164-177: Stopword loading is improved, with a file-existence check and error handling; the lazy-loading pattern avoids repeated loads, and logging is thorough.

264-285: Graph construction is correct. The co-occurrence graph logic is sound, the insufficient-window edge case is handled, self-loops are avoided, and the sliding-window approach is appropriate.

287-294: PageRank is implemented correctly, with an empty-graph check, standard PageRank parameters, and proper weight handling.

296-319: The main extraction method is complete, orchestrating the full keyword-extraction flow with proper empty-result handling at each step and correct top-K selection.
Actionable comments posted: 1
♻️ Duplicate comments (5)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4)
64-74: extract_method cannot be overridden from the context and case is not normalized, which is inconsistent with the PR goal of switching extraction modes from the UI. Read the override from context and lower-case it before branching:

```diff
-        if self._extract_method == "llm":
+        method = (context.get("extract_method", self._extract_method) or "hybrid").strip().lower()
+        if method == "llm":
             # LLM method
             ranks = self._extract_with_llm()
-        elif self._extract_method == "textrank":
+        elif method == "textrank":
             # TextRank method
             ranks = self._extract_with_textrank()
-        elif self._extract_method == "hybrid":
+        elif method == "hybrid":
             # Hybrid method
             ranks = self._extract_with_hybrid()
         else:
-            raise ValueError(f"Invalid extract_method: {self._extract_method}")
+            raise ValueError(f"Invalid extract_method: {method}")
```
76-83: Avoid logging the raw user query at INFO level; redact and demote the log. The current INFO log contains the raw query, which is a privacy risk. Change it to DEBUG and truncate:

```diff
-log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
+# Demote sensitive logging and truncate the query
+safe_query = (self._query or "")[:200]
+log.debug("Query(len=%d): %s | Keywords: %s", len(self._query or ""), safe_query, context["keywords"])
```
62: max_keywords needs type and bounds validation to avoid slicing errors. UI/HTTP parameters often arrive as strings or empty values; using them directly for slicing can raise TypeError. Coerce to int and enforce a lower bound:

```diff
-self._max_keywords = context.get("max_keywords", self._max_keywords)
+mk = context.get("max_keywords", self._max_keywords)
+try:
+    mk = int(mk)
+except (TypeError, ValueError):
+    mk = self._max_keywords
+self._max_keywords = max(1, mk)
```
60-61: The language normalization is buggy and almost always falls back to English. The current code sets Chinese only for the exact value "cn" and English otherwise, ignoring common identifiers such as "en"/"english"/"zh"/"chinese"/"zh-cn", and does not support an override from the context, so Chinese inputs are easily misclassified as English. Normalize uniformly and allow upstream override:

```diff
-# Use English by default
-self._language = "chinese" if self._language == "cn" else "english"
+# Normalize language: accept en/zh/cn/english/chinese, allow context override
+lang_raw = (context.get("language", self._language) or "").strip().lower()
+if lang_raw in ("en", "english"):
+    self._language = "english"
+elif lang_raw in ("zh", "cn", "chinese", "zh-cn", "zh_cn"):
+    self._language = "chinese"
+else:
+    log.warning("Unsupported language '%s', defaulting to 'english'", lang_raw)
+    self._language = "english"
```

hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (1)
66-103: The second-pass check has a fatal bug and does not implement the intended "either-or" group semantics. Problems:

- Line 97 uses `nltk.data.find('tokenizers/{package}') or nltk.data.find('taggers/{package}')`: for `punkt` this wrongly looks up `taggers/punkt` and raises LookupError, so the function exits with an exception.
- Semantically the check should guarantee "at least one tokenizer hit (punkt or punkt_tab) AND at least one tagger hit (*_eng or the generic one)", rather than requiring all four packages to be True.

Suggestion: switch to an idempotent "group check + per-package download fallback" flow, and add a return type annotation.

```diff
-    def check_nltk_data(self):
-
-        cache_dir = self.get_cache_dir()
-        nltk_data_dir = os.environ.get("NLTK_DATA", cache_dir)
-        nltk.data.path.append(os.path.join(resource_path, "nltk_data"))
-
-        if nltk_data_dir not in nltk.data.path:
-            nltk.data.path.append(nltk_data_dir)
-
-        required_packages = {
-            'punkt': False,
-            'punkt_tab': False,
-            'averaged_perceptron_tagger': False,
-            "averaged_perceptron_tagger_eng": False}
-        for package in required_packages:
-            try:
-                if package in ['punkt', 'punkt_tab']:
-                    nltk.data.find(f'tokenizers/{package}')
-                else:
-                    nltk.data.find(f'taggers/{package}')
-                required_packages[package] = True
-            except LookupError:
-                try:
-                    log.info("Download nltk package %s", package)
-                    nltk.download(package, download_dir=nltk_data_dir)
-                except (URLError, HTTPError, PermissionError) as e:
-                    log.info("Can't download package %s as error: %s", package, e)
-
-        check_flag = all(required_packages.values())
-        if not check_flag:
-            for package in required_packages:
-                if nltk.data.find(f'tokenizers/{package}') or nltk.data.find(f'taggers/{package}'):
-                    required_packages[package] = True
-                    log.debug("Package %s is already downloaded", package)
-
-            check_flag = all(required_packages.values())
-        return check_flag
+    def check_nltk_data(self) -> bool:
+        cache_dir = self.get_cache_dir()
+        nltk_data_dir = os.environ.get("NLTK_DATA", cache_dir)
+        rp = os.path.join(resource_path, "nltk_data")
+        if rp not in nltk.data.path:
+            nltk.data.path.append(rp)
+        if nltk_data_dir not in nltk.data.path:
+            nltk.data.path.append(nltk_data_dir)
+
+        def ensure_group(group: str, names: list[str]) -> bool:
+            # 1) Pass if any package in the group is already present
+            for name in names:
+                try:
+                    nltk.data.find(f"{group}/{name}")
+                    return True
+                except LookupError:
+                    continue
+            # 2) Try downloading in priority order
+            for name in names:
+                try:
+                    log.info("Download nltk package %s/%s", group, name)
+                    ok = nltk.download(name, download_dir=nltk_data_dir)
+                    if ok:
+                        try:
+                            nltk.data.find(f"{group}/{name}")
+                            return True
+                        except LookupError:
+                            continue
+                except (URLError, HTTPError, PermissionError) as e:
+                    log.info("Can't download package %s as error: %s", name, e)
+            return False
+
+        tokenizers_ok = ensure_group("tokenizers", ["punkt", "punkt_tab"])
+        taggers_ok = ensure_group("taggers", ["averaged_perceptron_tagger_eng", "averaged_perceptron_tagger"])
+        return bool(tokenizers_ok and taggers_ok)
```
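The "at least one per group" semantics can be factored out and unit-tested without touching NLTK or the network; a sketch with injected find/download callables (all names here are ours, not from the PR):

```python
def ensure_group(names, find, download):
    """Return True if any resource in `names` is present, or becomes
    present after a download attempt, trying names in priority order."""
    for name in names:          # 1) already installed?
        if find(name):
            return True
    for name in names:          # 2) download fallback, then re-check
        if download(name) and find(name):
            return True
    return False

# Simulate an environment where only 'punkt_tab' can be downloaded
installed = set()
find = lambda name: name in installed
download = lambda name: (installed.add(name) or True) if name == "punkt_tab" else False

print(ensure_group(["punkt", "punkt_tab"], find, download))  # True
```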
🧹 Nitpick comments (5)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4)
96-110: TextRank exception handling: treat MemoryError separately. MemoryError usually indicates resource exhaustion, and continuing may behave unpredictably. Handle it on its own (for example re-raise so the caller can circuit-break, or truncate the input and retry).

```diff
         try:
             ranks = self._textrank_model.extract_keywords(self._query)
         except (TypeError, ValueError) as e:
             log.error("TextRank parameter error: %s", e)
-        except MemoryError as e:
-            log.critical("TextRank memory error (text too large?): %s", e)
+        except MemoryError as e:
+            log.critical("TextRank memory error (text too large?): %s", e)
+            raise  # or return {} and let the caller degrade gracefully
```
115-119: Hybrid weight parsing is not robust (it only accepts the float type). It should accept int/str values as well and clamp them to [0, 1].

```diff
-        if isinstance(llm_settings.hybrid_llm_weights, float):
-            llm_weights = min(1.0, max(0.0, float(llm_settings.hybrid_llm_weights)))
-        else:
-            llm_weights = 0.5
+        try:
+            llm_weights = float(llm_settings.hybrid_llm_weights)
+        except (TypeError, ValueError):
+            llm_weights = 0.5
+        llm_weights = min(1.0, max(0.0, llm_weights))
```
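The parse-and-clamp pattern in isolation (the helper name is ours):

```python
def parse_weight(raw, default=0.5):
    """Parse an LLM/TextRank mixing weight from config and clamp it to [0, 1]."""
    try:
        w = float(raw)
    except (TypeError, ValueError):
        return default
    return min(1.0, max(0.0, w))

print(parse_weight("0.7"))  # string config value -> 0.7
print(parse_weight(3))      # out of range, clamped -> 1.0
print(parse_weight(None))   # unset, falls back -> 0.5
```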
127-128: Downgrade logs containing keywords and scores to DEBUG, to avoid exposing data derived from user content at INFO level.

```diff
-        log.info("LLM extract results: %s", llm_scores)
-        log.info("TextRank extract results: %s", tr_scores)
+        log.debug("LLM extract results: %s", llm_scores)
+        log.debug("TextRank extract results: %s", tr_scores)
```
152-152: Escape start_token when building the regex for robustness. The value passed in today is the fixed "KEYWORDS:", but escaping makes the code more generally safe.

```diff
-        matches = re.findall(rf'{start_token}([^\n]+\n?)', response)
+        matches = re.findall(rf'{re.escape(start_token)}([^\n]+\n?)', response)
```
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (1)
39-39: Avoid repeatedly appending to nltk.data.path (should be idempotent). Both call sites append the path on every call, which can produce duplicate entries. Check before appending; this also makes debugging and path printing easier.

A minimal change:

```diff
-    nltk.data.path.append(os.path.join(resource_path, "nltk_data"))
+    rp = os.path.join(resource_path, "nltk_data")
+    if rp not in nltk.data.path:
+        nltk.data.path.append(rp)
```

Handle lines 70-71 the same way.

Also applies to: 70-71
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (2 hunks)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (1 hunks)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
🧰 Additional context used
🧠 Learnings (10)
📓 Common learnings
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In the hugegraph-llm KeywordExtract class, user Gfreely wants to keep the extraction-method interfaces consistent: all extraction methods (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], and the main run() method handles sorting and truncation centrally instead of pre-processing inside each extraction method. This is their architectural design preference.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization, which means that if max_keywords grows at runtime, TextRank can still only produce the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (max keyword count) configuration.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely used a ch_tokens deduplication strategy (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens, which both fixes a potential ValueError from words.index() and makes batch replacement more efficient.
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: NLTKHelper in hugegraph-llm uses lazy loading for stopwords and calls nltk.corpus.stopwords.words(lang) directly with the provided language parameter. It does not preload both English and Chinese stopwords - each language is loaded on first access. The lang parameter must match NLTK's expected language codes ("english", "chinese") or it will fail.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: In hugegraph-llm KeywordExtract, NLTKHelper loads both English and Chinese stopwords during initialization, but the stopwords(lang) method still requires the correct language key ("english" or "chinese") to return the appropriate stopword set for filtering.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:37:41.224Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:139-141
Timestamp: 2025-08-18T14:37:41.224Z
Learning: In hugegraph-llm textrank_word_extract.py, NLTK resources are packaged locally under src/hugegraph_llm/resources/nltk_data/ and the code properly configures nltk.data.path using get_project_root() to point to these local resources before making NLTK calls, preventing LookupError issues.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-29T13:11:08.943Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In the hugegraph-llm KeywordExtract class, user Gfreely wants to keep the extraction-method interfaces consistent: all extraction methods (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], and the main run() method handles sorting and truncation centrally instead of pre-processing inside each extraction method. This is their architectural design preference.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization, which means that if max_keywords grows at runtime, TextRank can still only produce the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-06-25T09:50:06.213Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (max keyword count) configuration.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:42:31.998Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely used a ch_tokens deduplication strategy (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens, which both fixes a potential ValueError from words.index() and makes batch replacement more efficient.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-21T12:34:10.407Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:187-202
Timestamp: 2025-08-21T12:34:10.407Z
Learning: In hugegraph-llm's KeywordExtract._extract_keywords_from_response method, the reuse of the sub_tokens variable is correct: in each loop iteration sub_tokens is reassigned to the current token's segmentation result, valid sub-words are appended back to the same sub_tokens list, and finally added to results as a whole. This implementation is valid and correct.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
🧬 Code graph analysis (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4)
hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2)
- BaseLLM (22-74)
- generate (26-31)

hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (1)
- LLMs (25-98)

hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2)
- MultiLingualTextRank (30-156)
- extract_keywords (140-156)

hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)
- extract_keywords (67-85)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
🔇 Additional comments (4)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
85-95: The LLM extraction path is clear and controllable. Timing and parsing are well separated, and returning Dict[str, float] matches the architectural preference. LGTM.

45-47: Confirm that TextRank's top_k/window behavior stays consistent with runtime parameters. keyword_num and window_size are fixed at construction time here. If max_keywords can grow or window_size can change at runtime, TextRank is still capped by the values set at construction. If this follows the existing preference (values taken only at initialization, unchanged afterwards), it is acceptable; note it in the README/docs to avoid confusion.

I can add a README note, or synchronize the values in run() (if you decide to support runtime overrides).
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (2)
22-22: Importing the network exceptions and log dependency: LGTM. This gives finer-grained exception classification for download failures and a unified logging interface; no issues.
Also applies to: 29-29
37-37: Confirmed: the KeywordExtract layer normalizes language values as NLTK requires.
In run() in hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py, llm_settings.language.lower() is used and "cn" is mapped to "chinese", otherwise to "english"; document_op/textrank_word_extract.py also passes 'english'/'chinese' explicitly. So on these key call paths, an upstream 'en'/'zh' will not cause NLTKHelper.stopwords to raise LookupError or return empty results.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (1)
37-41: Fix the NLTK stopwords language mapping: uniformly map aliases like 'zh'/'cn'/'en' to 'chinese'/'english'. NLTKHelper.stopwords requires lang to be "english" or "chinese"; one place in the repo only checks for "cn" and defaults everything else to "english", which mis-maps values like "zh" to English so Chinese stopwords never take effect.

Affected locations (need fixing):
- hugegraph-llm/src/hugegraph_llm/operators/document_op/word_extract.py:52
- hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61
Verified: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py already uses the correct 'english'/'chinese' values directly (lines 81-82).

Suggestion: do the alias mapping uniformly before the call or inside NLTKHelper.stopwords (e.g. 'zh', 'cn', 'zh-cn' -> 'chinese'; 'en', 'eng' -> 'english'), or replace the default logic with an explicit mapping table to avoid implicit errors.
♻️ Duplicate comments (2)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (1)
101-105: The second-pass check has an uncaught LookupError, so the function exits with an exception on the tagger branch.
`nltk.data.find(f'tokenizers/{package}') or nltk.data.find(f'taggers/{package}')` first tries the tokenizers path for tagger packages and raises an uncaught LookupError; as written, the first call already raises and the second branch is never reached. Validate by resource type and catch the exception. A minimal fix:

```diff
-        for package in required_packages:
-            if nltk.data.find(f'tokenizers/{package}') or nltk.data.find(f'taggers/{package}'):
-                required_packages[package] = True
-                log.debug("Package %s is already downloaded", package)
+        for package in required_packages:
+            try:
+                if package in ("punkt", "punkt_tab"):
+                    nltk.data.find(f"tokenizers/{package}")
+                else:
+                    nltk.data.find(f"taggers/{package}")
+                required_packages[package] = True
+                log.debug("Package %s is already downloaded", package)
+            except LookupError:
+                # Keep False; the all() result is returned as-is
+                pass
```
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
60-62: Language normalization bug: 'zh' is misclassified as English. Currently only "cn" maps to Chinese and everything else to English, which does not match the PR goal (supporting 'en'/'zh'). Complete the mapping.

```diff
-        # Use English by default
-        self._language = "chinese" if self._language == "cn" else "english"
+        # Normalize language
+        lang = (self._language or "").strip().lower()
+        lang_map = {"en": "english", "zh": "chinese", "cn": "chinese"}
+        self._language = lang_map.get(lang, lang)
```
🧹 Nitpick comments (3)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (1)
39-39: Avoid repeatedly appending to nltk.data.path (minor). Two places append the built-in resource path directly; a long-lived process may append the same path repeatedly. Check before appending.

```diff
-        nltk.data.path.append(os.path.join(resource_path, "nltk_data"))
+        rp = os.path.join(resource_path, "nltk_data")
+        if rp not in nltk.data.path:
+            nltk.data.path.append(rp)
```
Apply the same change at line 75.
Also applies to: 75-75
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
82-86: Lower the log level for sensitive content to avoid emitting the user query at INFO. To reduce privacy risk, downgrade the query and keywords to DEBUG and avoid newlines.

```diff
-        log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
+        log.debug("User Query: %s | Keywords: %s", self._query, context["keywords"])
```
148-159: The regex does not escape start_token; use re.escape to guard against special characters. The code comment already warns that it may contain special characters, but the implementation never escapes it. A small hardening:

```diff
-        matches = re.findall(rf'{start_token}([^\n]+\n?)', response)
+        prefix = re.escape(start_token) if start_token else ""
+        matches = re.findall(prefix + r'([^\n]+\n?)', response)
```
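Why re.escape matters can be shown with a start token containing regex metacharacters (the token value below is invented for illustration; the PR's actual token is the literal "KEYWORDS:"):

```python
import re

response = "KEYWORDS(v2):apple:0.9,banana:0.7"
start_token = "KEYWORDS(v2):"

# Unescaped: '(' and ')' act as a capture group, so the pattern matches
# the literal string "KEYWORDSv2:" and never the actual prefix.
unescaped = re.findall(rf"{start_token}([^\n]+)", response)
escaped = re.findall(re.escape(start_token) + r"([^\n]+)", response)

print(unescaped)
print(escaped)   # ['apple:0.9,banana:0.7']
```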
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (2 hunks)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (3)
hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2)
- BaseLLM (22-74)
- generate (26-31)

hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (1)
- LLMs (25-98)

hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2)
- MultiLingualTextRank (30-156)
- extract_keywords (140-156)
🔇 Additional comments (2)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
69-80: The branch dispatch and context override are clear and easy to extend. Overriding extract_method from the context with unified lowercasing is reasonable, and the three branches have clear responsibilities.

117-146: The hybrid weighted fusion is concise and consistent with the unified Dict[str, float] return design. Linearly combining LLM/TR scores by weight and scoring over the union of keywords is a sound strategy; logging and timing granularity are appropriate.
```python
        self._textrank_model = MultiLingualTextRank(
            keyword_num=max_keywords,
            window_size=llm_settings.window_size)
```
💡 Verification agent
🧩 Analysis chain
Confirm whether the TextRank parameter threading covers mask_words (aligned with the PR goal)

Only keyword_num/window_size are passed here. The PR description includes the mask_words feature and Chinese tokenization masking rules; confirm that it is already wired up inside MultiLingualTextRank or upstream, otherwise the threading needs to be completed.
🏁 Script executed:
```shell
#!/bin/bash
# Check where mask_words is declared and passed along
rg -nP --type=py 'mask_words'
# Check the MultiLingualTextRank constructor signature and its call sites
rg -nP --type=py -C2 'class\s+MultiLingualTextRank|MultiLingualTextRank\(' hugegraph-llm/src
```

Length of output: 1186
mask_words must be threaded through the MultiLingualTextRank constructor and passed from the call site

- Verified: MultiLingualTextRank.__init__(keyword_num: int = 5, window_size: int = 3) in hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py only accepts keyword_num and window_size (around lines 30-32); no mask_words passing chain was found in the repo.
- Required fix: add a mask_words parameter to the constructor (e.g. mask_words: Optional[List[str]] = None) and pass the corresponding mask_words (or the value from llm_settings) at the call site hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (around lines 45-47); if the masking logic is already implemented upstream, point to its location in the PR and add the corresponding call.
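For illustration only — a hypothetical sketch of what the threading could look like; the real MultiLingualTextRank lives in textrank_word_extract.py and its masking semantics must follow the PR's actual implementation:

```python
from typing import List, Optional

class MultiLingualTextRank:
    """Hypothetical constructor showing mask_words threaded through."""
    def __init__(self, keyword_num: int = 5, window_size: int = 3,
                 mask_words: Optional[List[str]] = None):
        self.keyword_num = keyword_num
        self.window_size = window_size
        # Words in mask_words are excluded from the candidate graph
        self.mask_words = set(mask_words or [])

    def _filter_candidates(self, words: List[str]) -> List[str]:
        return [w for w in words if w not in self.mask_words]

tr = MultiLingualTextRank(keyword_num=10, window_size=3, mask_words=["the", "of"])
print(tr._filter_candidates(["graph", "the", "database", "of"]))  # ['graph', 'database']
```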
Actionable comments posted: 0
🧹 Nitpick comments (1)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (1)
22-22: Remove the duplicate imports (lines 22 and 29). Line 22 imports `URLError` and `HTTPError`, and line 29 then re-imports `stopwords`, duplicating the import on line 25.

Delete the duplicate import:

```diff
 from urllib.error import URLError, HTTPError
 import nltk
 from nltk.corpus import stopwords
 from hugegraph_llm.config import resource_path
 from hugegraph_llm.utils.log import log
-from nltk.corpus import stopwords
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
hugegraph-llm/**/*.py
📄 CodeRabbit inference engine (hugegraph-llm/AGENTS.md)
hugegraph-llm/**/*.py: Adhere to ruff code style for Python code
Type-check Python code with mypy
Keep each Python file under 600 lines for maintainability
Files:
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
hugegraph-llm/src/hugegraph_llm/operators/**/*.py
📄 CodeRabbit inference engine (hugegraph-llm/AGENTS.md)
Put core processing pipelines under src/hugegraph_llm/operators/
Files:
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
🔇 Additional comments (2)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (2)
54-59: The log messages need fixing ("Start download" vs "is already downloaded"). The code logs "Start download nltk package stopwords" at line 55, but after a successful download line 57 logs "is already downloaded", which is inaccurate. Adjust the messages so they reflect the actual state.

Per earlier feedback, the logs should accurately reflect the download status:

```diff
         try:
-            log.info("Start download nltk package stopwords")
-            nltk.download("stopwords", download_dir=nltk_data_dir, quiet=False)
-            log.debug("NLTK package stopwords is already downloaded")
+            log.info("Downloading NLTK stopwords package to %s", nltk_data_dir)
+            nltk.download("stopwords", download_dir=nltk_data_dir, quiet=False)
+            log.debug("NLTK stopwords package downloaded")
         except (URLError, HTTPError, PermissionError) as e:
             log.warning("Can't download package stopwords as error: %s", e)
```
83-110: The check_nltk_data method has logic and compatibility problems. Several issues:

- In NLTK 3.8.2+, `punkt` has been replaced by `punkt_tab`, but the current code assumes all packages are available.
- The verification logic at lines 105-106 is wrong: `nltk.data.find()` raises `LookupError` when a resource is missing; it does not return a boolean.
- There is no fallback mechanism to handle version differences.

Based on the learnings on record and web search results, implement compatibility handling:

```diff
     def check_nltk_data(self):
+        """Check and ensure the required NLTK data packages are available; return whether all are ready"""
         _hugegraph_source_dir = os.path.join(resource_path, "nltk_data")
         if _hugegraph_source_dir not in nltk.data.path:
             nltk.data.path.append(_hugegraph_source_dir)
 
         cache_dir = self.get_cache_dir()
         nltk_data_dir = os.environ.get("NLTK_DATA", cache_dir)
         if nltk_data_dir not in nltk.data.path:
             nltk.data.path.append(nltk_data_dir)
 
-        required_packages = {
-            'punkt': False,
-            'punkt_tab': False,
-            'averaged_perceptron_tagger': False,
-            "averaged_perceptron_tagger_eng": False}
-        for package in required_packages:
+        # Check tokenizers: either punkt or punkt_tab is sufficient
+        tokenizer_found = False
+        for tokenizer in ['punkt_tab', 'punkt']:  # prefer the newer package
             try:
-                if package in ['punkt', 'punkt_tab']:
-                    nltk.data.find(f'tokenizers/{package}')
-                else:
-                    nltk.data.find(f'taggers/{package}')
-                required_packages[package] = True
+                nltk.data.find(f'tokenizers/{tokenizer}')
+                tokenizer_found = True
+                log.debug("Found tokenizer: %s", tokenizer)
+                break
             except LookupError:
-                try:
-                    log.info("Start download nltk package %s", package)
-                    nltk.download(package, download_dir=nltk_data_dir, quiet=False)
-                except (URLError, HTTPError, PermissionError) as e:
-                    log.warning("Can't download package %s as error: %s", package, e)
+                continue
+
+        if not tokenizer_found:
+            # Try downloading punkt_tab first, then fall back to punkt
+            for tokenizer in ['punkt_tab', 'punkt']:
+                try:
+                    log.info("Downloading NLTK tokenizer: %s", tokenizer)
+                    success = nltk.download(tokenizer, download_dir=nltk_data_dir, quiet=False)
+                    if success:
+                        tokenizer_found = True
+                        break
+                except (URLError, HTTPError, PermissionError) as e:
+                    log.warning("Failed to download %s: %s", tokenizer, e)
+
+        # Check taggers: either averaged_perceptron_tagger_eng or the generic tagger is sufficient
+        tagger_found = False
+        for tagger in ['averaged_perceptron_tagger_eng', 'averaged_perceptron_tagger']:
+            try:
+                nltk.data.find(f'taggers/{tagger}')
+                tagger_found = True
+                log.debug("Found tagger: %s", tagger)
+                break
+            except LookupError:
+                continue
+
+        if not tagger_found:
+            # Try downloading
+            for tagger in ['averaged_perceptron_tagger_eng', 'averaged_perceptron_tagger']:
+                try:
+                    log.info("Downloading NLTK tagger: %s", tagger)
+                    success = nltk.download(tagger, download_dir=nltk_data_dir, quiet=False)
+                    if success:
+                        tagger_found = True
+                        break
+                except (URLError, HTTPError, PermissionError) as e:
+                    log.warning("Failed to download %s: %s", tagger, e)
 
-        check_flag = all(required_packages.values())
-        if not check_flag:
-            for package in required_packages:
-                if nltk.data.find(f'tokenizers/{package}') or nltk.data.find(f'taggers/{package}'):
-                    required_packages[package] = True
-                    log.debug("Package %s is already downloaded", package)
-            check_flag = all(required_packages.values())
-        return check_flag
+        return tokenizer_found and tagger_found
```
|
/gemini review
```python
def check_nltk_data(self):
    _hugegraph_source_dir = os.path.join(resource_path, "nltk_data")
    if _hugegraph_source_dir not in nltk.data.path:
        nltk.data.path.append(_hugegraph_source_dir)

    cache_dir = self.get_cache_dir()
    nltk_data_dir = os.environ.get("NLTK_DATA", cache_dir)
    if nltk_data_dir not in nltk.data.path:
        nltk.data.path.append(nltk_data_dir)

    required_packages = {
        'punkt': False,
        'punkt_tab': False,
        'averaged_perceptron_tagger': False,
        "averaged_perceptron_tagger_eng": False}
    for package in required_packages:
        try:
            if package in ['punkt', 'punkt_tab']:
                nltk.data.find(f'tokenizers/{package}')
            else:
                nltk.data.find(f'taggers/{package}')
            required_packages[package] = True
        except LookupError:
            try:
                log.info("Start download nltk package %s", package)
                nltk.download(package, download_dir=nltk_data_dir, quiet=False)
            except (URLError, HTTPError, PermissionError) as e:
                log.warning("Can't download package %s as error: %s", package, e)

    check_flag = all(required_packages.values())
    if not check_flag:
        for package in required_packages:
            if nltk.data.find(f'tokenizers/{package}') or nltk.data.find(f'taggers/{package}'):
                required_packages[package] = True
                log.debug("Package %s is already downloaded", package)

    check_flag = all(required_packages.values())
    return check_flag
```
The implementation of the check_nltk_data method has serious problems that can crash the program in certain situations.

- Logic error in the second loop: the loop inside the `if not check_flag:` block (lines 102-108) is flawed. `nltk.data.find()` raises a `LookupError` when a resource cannot be found, and the `if nltk.data.find(...) or nltk.data.find(...)` statement does not handle that exception. When a tagger package (such as 'averaged_perceptron_tagger') is checked, `nltk.data.find(f'tokenizers/{package}')` runs first, raises because the path does not match, and the program crashes.
- Redundant and inefficient: the method uses two separate loops for a single task, which makes the code redundant and hard to follow. The first loop does not update a package's status after attempting the download, which is exactly what makes the second (flawed) loop necessary.

It is recommended to refactor the whole logic into one clearer, more robust loop that handles checking, downloading, and verification correctly.
```python
def check_nltk_data(self):
    _hugegraph_source_dir = os.path.join(resource_path, "nltk_data")
    if _hugegraph_source_dir not in nltk.data.path:
        nltk.data.path.append(_hugegraph_source_dir)
    cache_dir = self.get_cache_dir()
    nltk_data_dir = os.environ.get("NLTK_DATA", cache_dir)
    if nltk_data_dir not in nltk.data.path:
        nltk.data.path.append(nltk_data_dir)
    required_packages = {
        'punkt': 'tokenizers/punkt',
        'punkt_tab': 'tokenizers/punkt_tab',
        'averaged_perceptron_tagger': 'taggers/averaged_perceptron_tagger',
        "averaged_perceptron_tagger_eng": 'taggers/averaged_perceptron_tagger_eng'
    }
    for package, path in required_packages.items():
        try:
            nltk.data.find(path)
        except LookupError:
            log.info("Start download nltk package %s", package)
            try:
                if not nltk.download(package, download_dir=nltk_data_dir, quiet=False):
                    log.warning("NLTK download command returned False for package %s.", package)
                    return False
                # Verify after download
                nltk.data.find(path)
            except (URLError, HTTPError, PermissionError) as e:
                log.warning("Can't download package %s as error: %s", package, e)
                return False
            except LookupError:
                log.error("Package %s not found after download. Check package name and nltk_data paths.", package)
                return False
    return True
```

```python
def _multi_preprocess(self, text):
    words = []
    ch_tokens = []
    en_stop_words = NLTKHelper().stopwords(lang='english')
    ch_stop_words = NLTKHelper().stopwords(lang='chinese')

    # Filtering special words, cleansing punctuation marks, and filtering out invalid tokens
    masked_text, placeholder_map = self._word_mask(text)
    pos_tags = self._get_valid_tokens(masked_text)

    # English word segmentation
    for word, flag in pos_tags:
        if word in placeholder_map:
            words.append(placeholder_map[word])
        else:
            if len(word) >= 1 and flag in self.pos_filter['english'] and word.lower() not in en_stop_words:
                words.append(word)
            if re.compile('[\u4e00-\u9fff]').search(word):
                ch_tokens.append(word)

    # Chinese word segmentation
    if len(ch_tokens) > 0:
        ch_tokens = list(set(ch_tokens))
        for ch_token in ch_tokens:
            idx = words.index(ch_token)
            ch_words = []
            jieba_tokens = pseg.cut(ch_token)
            for word, flag in jieba_tokens:
                if len(word) >= 1 and flag in self.pos_filter['chinese'] and word not in ch_stop_words:
                    ch_words.append(word)
            words = words[:idx] + ch_words + words[idx+1:]

    return words
```
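The primary/fallback grouping suggested above can be isolated into a small sketch. This is a hedged illustration, not the PR's code: `find` stands in for `nltk.data.find` and is injected so the logic runs without NLTK installed.

```python
def _exists(find, path):
    # A resource "exists" when the injected finder does not raise LookupError,
    # mirroring how nltk.data.find signals a missing package.
    try:
        find(path)
        return True
    except LookupError:
        return False

def groups_satisfied(groups, find):
    """Return True iff every (category, packages) group has at least one hit."""
    for category, packages in groups:
        if not any(_exists(find, f"{category}/{p}") for p in packages):
            return False
    return True
```

With this shape, a default environment that ships only `punkt` and the classic tagger still passes, which is the relaxation the review asks for.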
The _multi_preprocess method is flawed when the text contains repeated Chinese phrases, which can leave segmentation incomplete.

- Information loss: at line 100, `ch_tokens = list(set(ch_tokens))` deduplicates the Chinese phrases, discarding their frequency and position in the original text.
- Wrong replacement: at line 102, `idx = words.index(ch_token)` only finds the index of the phrase's first occurrence in the `words` list. If a Chinese phrase appears several times in the text, only the first occurrence is correctly segmented and replaced; the rest are left as-is, producing inaccurate segmentation.

For example, for the text "你好世界,我爱你好世界", the `words` list before processing could be `['你好世界', '我', '爱', '你好世界']`. After deduplication, `ch_tokens` contains a single '你好世界'. `words.index('你好世界')` returns 0, so only the first '你好世界' is replaced and the second is never processed.

It is recommended to refactor this method so every token is handled correctly. A simpler, more robust approach is a single pass over `pos_tags` that applies the appropriate segmentation strategy depending on whether the token is Chinese or English, then builds the final word list.
```python
def _multi_preprocess(self, text):
    words = []
    en_stop_words = NLTKHelper().stopwords(lang='english')
    ch_stop_words = NLTKHelper().stopwords(lang='chinese')
    masked_text, placeholder_map = self._word_mask(text)
    pos_tags = self._get_valid_tokens(masked_text)
    for word, flag in pos_tags:
        if word in placeholder_map:
            words.append(placeholder_map[word])
            continue
        # Check if the word contains Chinese characters
        if re.search('[\u4e00-\u9fff]', word):
            jieba_tokens = pseg.cut(word)
            for ch_word, ch_flag in jieba_tokens:
                if len(ch_word) >= 1 and ch_flag in self.pos_filter['chinese'] and ch_word not in ch_stop_words:
                    words.append(ch_word)
        # Process as English word
        elif len(word) >= 1 and flag in self.pos_filter['english'] and word.lower() not in en_stop_words:
            words.append(word)
    return words
```
/gemini summary
Summary of Changes

This pull request significantly upgrades the keyword extraction capabilities within the RAG pipeline by introducing TextRank and a hybrid approach as alternatives to the LLM-only method. This change provides users with more granular control over keyword extraction, allowing for optimization based on performance, cost, and specific use case requirements, while also standardizing the output format to include importance scores.

Highlights
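The three extraction modes named in the summary (llm/textrank/hybrid) can be sketched as a minimal dispatcher. This is a hedged illustration: the helper callables are placeholders, not the PR's real `_extract_with_*` methods; they only need to return keyword→score dicts.

```python
def extract(text, method="hybrid", llm_weight=0.5, llm_fn=None, textrank_fn=None):
    # Placeholder extractors default to empty results
    llm_fn = llm_fn or (lambda t: {})
    textrank_fn = textrank_fn or (lambda t: {})
    if method == "llm":
        return llm_fn(text)
    if method == "textrank":
        return textrank_fn(text)
    # hybrid: weighted score over the union of both keyword sets
    llm, tr = llm_fn(text), textrank_fn(text)
    return {k: llm_weight * llm.get(k, 0.0) + (1 - llm_weight) * tr.get(k, 0.0)
            for k in set(llm) | set(tr)}
```

Sorting and truncating to `max_keywords` would then happen once in the caller, matching the design preference recorded in the review learnings.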
Changelog
Activity
```toml
"numpy",
"pandas",
"pydantic",
"scipy",
```
It is suggested to add comments at the end of the dependency list explaining what these packages are for:

```toml
"scipy",          # required by the TextRank algorithm
"python-igraph",  # required for graph-structure computation
```

It is also recommended to verify that both packages are compatible with Windows/macOS/Linux.
```python
def _word_mask(self, text):

    placeholder_id_counter = 0
```
There are potential performance issues in the _word_mask method:

- `nonlocal` in a nested function: the `_create_placeholder` function mutates an outer variable on every match; consider extracting the placeholder generation into a class method.
- Regex compilation location: move the compilation of `special_regex` into the `__init__` method to avoid recompiling it on every call.
- String concatenation: the spaces on both sides of `f" __shieldword_{placeholder_id_counter}__ "` may produce unexpected tokenization results.

Suggested optimization:

```python
def __init__(self, ...):
    # ... other initialization code
    self.special_regex = regex.compile('|'.join(self.rules), regex.V1)

def _create_placeholder(self, original_word: str, counter: int) -> tuple:
    placeholder = f"__shieldword_{counter}__"
    return placeholder, original_word
```

```python
# Chinese word segmentation
if len(ch_tokens) > 0:
    ch_tokens = list(set(ch_tokens))
    for ch_token in ch_tokens:
```
Safety concern: exception handling is missing here; if `ch_token` is not in the `words` list, a ValueError is raised.

Suggested change:

```python
try:
    idx = words.index(ch_token)
except ValueError:
    log.warning(f"Chinese token {ch_token} not found in words list")
    continue
```

Also, when slicing and reassembling the same list, consider using `collections.deque` or building a new list to avoid frequent list operations hurting performance.
```python
self._language = "chinese" if self._language == "cn" else "english"
mk = context.get("max_keywords", self._max_keywords)
try:
    mk = int(mk)
```
Code quality issues:

- Variable naming: the name `mk` is unclear; prefer `max_keywords_value`.
- Type-conversion logic: a more Pythonic way to handle the conversion:

```python
max_keywords_raw = context.get("max_keywords", self._max_keywords)
if isinstance(max_keywords_raw, (int, float)):
    self._max_keywords = max(1, int(max_keywords_raw))
elif isinstance(max_keywords_raw, str) and max_keywords_raw.isdigit():
    self._max_keywords = max(1, int(max_keywords_raw))
else:
    self._max_keywords = self._max_keywords  # keep the default value
```

- Value-range validation: consider adding an upper bound so an overly large user input cannot cause performance problems.
```python
mk = self._max_keywords
self._max_keywords = max(1, mk)

method = (context.get("extract_method", self._extract_method) or "hybrid").strip().lower()
```
Key safety issue: the handling of the `method` parameter is fragile:

- Possibly empty after `strip()`: if the user passes a whitespace-only string, `strip()` yields an empty string and a ValueError will be raised.
- Prefer an enum: the allowed values should be constrained with a Python `Enum` or `Literal` type.

Suggested change:

```python
from typing import Literal

EXTRACT_METHOD = Literal["llm", "textrank", "hybrid"]

# in the run method
method = (context.get("extract_method", self._extract_method) or self._extract_method).strip().lower()
if method not in ["llm", "textrank", "hybrid"]:
    log.warning(f"Invalid extract_method '{method}', fallback to '{self._extract_method}'")
    method = self._extract_method
```

```python
    node_names = self.graph.vs['name']
    return dict(zip(node_names, pagerank_scores))

def extract_keywords(self, text) -> Dict[str, float]:
```
Code quality: the exception handling here is incomplete:

- If `check_nltk_data()` returns False, directly returning an empty dict is a silent failure; the user has no way of knowing the cause.
- Add logging or raise a more explicit exception.

Suggested change:

```python
if not NLTKHelper().check_nltk_data():
    log.error("NLTK data check failed, cannot proceed with keyword extraction")
    raise RuntimeError("NLTK required data packages are not available. Please check network or manually install.")
```

Or, if graceful degradation is needed:

```python
if not NLTKHelper().check_nltk_data():
    log.warning("NLTK data unavailable, returning empty keywords")
    context["keywords"] = []
    context["ranks"] = {}
    return context
```

```gitignore
!/src/hugegraph_llm/resources/demo/*
!/src/hugegraph_llm/resources/nltk_data/*
!/src/hugegraph_llm/resources/prompt_examples/*
!/src/hugegraph_llm/resources/demo/
```
The .gitignore change is risky:

Changing `!/src/hugegraph_llm/resources/demo/*` to `!/src/hugegraph_llm/resources/demo/` alters the matching behavior:

- Old rule: ignore everything under `resources/` but keep all files under `demo/`.
- New rule: only the `demo/` directory itself is kept, not the files under it.

This could cause demo files to be ignored unintentionally. Suggestions:

- Keep the original `/*` suffix.
- Or state clearly why the change is needed and add an explanation to the PR description.
```python
            required_packages[package] = True
        except LookupError:
            try:
                log.info("Start download nltk package %s", package)
```
Download-failure handling is not robust enough:

- The caught exception types include `PermissionError`, but when the cause is a permission problem, a mere warning means later code will still fail for lack of data.
- Distinguish the different failure causes:
  - Network problem: a retry is possible.
  - Permission problem: an exception should be raised.
  - Anything else: log the detailed error information.

Suggested improvement:

```python
except PermissionError as e:
    log.error(f"Permission denied when downloading {package}: {e}")
    raise RuntimeError(f"Cannot write NLTK data to {nltk_data_dir}, please check permissions") from e
except (URLError, HTTPError) as e:
    log.warning(f"Network error downloading {package}: {e}, will retry with backup method")
    # a retry from a local cache or a mirror source could be attempted here
```

| `TEXT2GQL_LLM_TYPE` | Literal["openai", "litellm", "ollama/local"] | openai | Text-to-GQL LLM type |
| `EMBEDDING_TYPE` | Optional[Literal["openai", "litellm", "ollama/local"]] | openai | Embedding model type |
| `RERANKER_TYPE` | Optional[Literal["cohere", "siliconflow"]] | None | Reranker model type: cohere/siliconflow |
| `KEYWORD_EXTRACT_TYPE` | Literal["llm", "textrank", "hybrid"] | llm | Keyword extraction model type: llm/textrank/hybrid |
Configuration documentation suggestion:

For the configuration entries newly added to the table, please also document:

- `WINDOW_SIZE`:
  - Add a reasonable value range (e.g. 1-10).
  - Explain how the value affects performance and results.
- `HYBRID_LLM_WEIGHTS`:
  - State the value range (0.0-1.0).
  - Explain the parameter's meaning, e.g. "weight of the LLM results; the TextRank weight is 1 - hybrid_llm_weights".
  - Provide a recommended value and usage scenarios.

Example:

| `WINDOW_SIZE` | Optional[Integer] | 3 | TextRank sliding-window size (range: 1-10); a larger window captures longer-distance word relations but increases computational cost |
| `HYBRID_LLM_WEIGHTS` | Optional[Float] | 0.5 | Weight of the LLM results in hybrid mode (range: 0.0-1.0); TextRank weight = 1 - this value. 0.5 is recommended to balance the two methods |
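The 0.0-1.0 range documented above implies the config value should be clamped when read. A minimal sketch of that clamp (the function name is illustrative, not from the PR):

```python
def hybrid_weight(raw, default=0.5):
    """Clamp HYBRID_LLM_WEIGHTS into [0, 1]; fall back to the default on bad input."""
    try:
        w = float(raw)
    except (TypeError, ValueError):
        return default  # non-numeric config value: use the documented default
    return min(1.0, max(0.0, w))
```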
Actionable comments posted: 0
🧹 Nitpick comments (2)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (1)
94-103: 存在冗余的条件判断,可能导致混淆。Lines 94-95 已经检查了英文词性和停用词过滤,然后 Line 96-101 处理中文字符的情况,但 Line 102-103 又重复检查了相同的英文条件。这个重复的
elif分支永远不会执行,因为如果满足 Line 94 的条件,已经在 Line 95 append 过了。建议移除冗余分支:
if len(word) >= 1 and flag in self.pos_filter['english'] and word.lower() not in en_stop_words: words.append(word) if re.compile('[\u4e00-\u9fff]').search(word): jieba_tokens = pseg.cut(word) for ch_word, ch_flag in jieba_tokens: if len(ch_word) >= 1 and ch_flag in self.pos_filter['chinese'] \ and ch_word not in ch_stop_words: words.append(ch_word) - elif len(word) >= 1 and flag in self.pos_filter['english'] and word.lower() not in en_stop_words: - words.append(word)hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
86-86: Privacy risk: logging the user query at INFO level may leak sensitive information. The full user query and the extracted keywords are currently logged at INFO level, which could leak private data in production. Suggested downgrade to DEBUG level:

```diff
- log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
+ log.debug("User Query: %s | Keywords: %s", self._query, context["keywords"])
```

Based on learnings
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
hugegraph-llm/config.md(1 hunks)hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py(3 hunks)hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py(1 hunks)hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py(2 hunks)pyproject.toml(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- hugegraph-llm/config.md
- pyproject.toml
🧰 Additional context used
📓 Path-based instructions (2)
hugegraph-llm/**/*.py
📄 CodeRabbit inference engine (hugegraph-llm/AGENTS.md)
hugegraph-llm/**/*.py: Adhere to ruff code style for Python code
Type-check Python code with mypy
Keep each Python file under 600 lines for maintainability
Files:
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.pyhugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.pyhugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
hugegraph-llm/src/hugegraph_llm/operators/**/*.py
📄 CodeRabbit inference engine (hugegraph-llm/AGENTS.md)
Put core processing pipelines under src/hugegraph_llm/operators/
Files:
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.pyhugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.pyhugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
🧠 Learnings (10)
📓 Common learnings
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In the hugegraph-llm KeywordExtract class, user Gfreely wants to keep the extraction-method interfaces consistent: all extraction methods (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], and the main run() method handles sorting and truncation uniformly, instead of each extraction method pre-processing internally. This is their architectural design preference.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), TextRank's internal top_k parameter is not updated after initialization, which means that if max_keywords grows at runtime TextRank can still only generate the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their usage scenario.
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: NLTKHelper in hugegraph-llm uses lazy loading for stopwords and calls nltk.corpus.stopwords.words(lang) directly with the provided language parameter. It does not preload both English and Chinese stopwords - each language is loaded on first access. The lang parameter must match NLTK's expected language codes ("english", "chinese") or it will fail.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.pyhugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: In hugegraph-llm KeywordExtract, NLTKHelper loads both English and Chinese stopwords during initialization, but the stopwords(lang) method still requires the correct language key ("english" or "chinese") to return the appropriate stopword set for filtering.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.pyhugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.pyhugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:37:41.224Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:139-141
Timestamp: 2025-08-18T14:37:41.224Z
Learning: In hugegraph-llm textrank_word_extract.py, NLTK resources are packaged locally under src/hugegraph_llm/resources/nltk_data/ and the code properly configures nltk.data.path using get_project_root() to point to these local resources before making NLTK calls, preventing LookupError issues.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.pyhugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py
📚 Learning: 2025-08-18T14:42:31.998Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely applied a ch_tokens deduplication strategy (list(set(ch_tokens))) to avoid reprocessing duplicate Chinese tokens, which both resolves the potential ValueError from words.index() and improves the efficiency of the batch replacement.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.pyhugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-21T12:34:10.407Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:187-202
Timestamp: 2025-08-21T12:34:10.407Z
Learning: In hugegraph-llm's KeywordExtract._extract_keywords_from_response method, the reuse of the sub_tokens variable is correct: in each loop iteration sub_tokens is reassigned to the current token's segmentation result, valid sub-words are appended back into the same sub_tokens list, and finally everything is added to results at once. This implementation is valid and correct.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.pyhugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-29T13:11:08.943Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In the hugegraph-llm KeywordExtract class, user Gfreely wants to keep the extraction-method interfaces consistent: all extraction methods (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], and the main run() method handles sorting and truncation uniformly, instead of each extraction method pre-processing internally. This is their architectural design preference.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-06-25T09:50:06.213Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), TextRank's internal top_k parameter is not updated after initialization, which means that if max_keywords grows at runtime TextRank can still only generate the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their usage scenario.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing the dynamic window_size adjustment and keeping only the top_k (maximum keyword count) configuration.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
🧬 Code graph analysis (3)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (1)
hugegraph-ml/src/hugegraph_ml/models/seal.py (3)
info(794-795)debug(791-792)warning(800-801)
hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (1)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (3)
NLTKHelper(31-136)stopwords(37-70)check_nltk_data(72-109)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4)
hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2)
BaseLLM(22-74)generate(26-31)hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (1)
LLMs(25-98)hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2)
MultiLingualTextRank(31-151)extract_keywords(134-151)hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1)
extract_keywords(67-85)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (2)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
86-86: Avoid logging user query content at INFO level. Logging the full user query and keywords at INFO level may leak sensitive information; past reviews have flagged this privacy risk several times. Suggested downgrade to DEBUG level:

```diff
- log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
+ log.debug("User Query: %s | Keywords: %s", self._query, context["keywords"])
```

hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (1)
82-109: At minimum, relax the resource check to "primary/fallback" groups instead of making every package mandatory

The current `required_packages` treats punkt, punkt_tab (and both taggers) as all required, meaning that even when the common punkt is already present, a missing and rarer punkt_tab makes `check_nltk_data()` attempt a download and return False when it cannot be fetched. Likewise, missing either of averaged_perceptron_tagger and its _eng variant counts as failure. This makes the check fail in most default environments and blocks the rest of the pipeline. Please restore primary/fallback group validation: within a group, any single hit is enough; otherwise try downloading the fallbacks in turn. Suggested grouping into tokenizers and taggers, returning False only when a whole group cannot be satisfied.

```diff
-    required_packages = {
-        'punkt': 'tokenizers/punkt',
-        'punkt_tab': 'tokenizers/punkt_tab',
-        'averaged_perceptron_tagger': 'taggers/averaged_perceptron_tagger',
-        "averaged_perceptron_tagger_eng": 'taggers/averaged_perceptron_tagger_eng'
-    }
-
-    for package, path in required_packages.items():
-        try:
-            nltk.data.find(path)
-        except LookupError:
-            ...
-            nltk.data.find(path)
-            ...
-    return True
+    resource_groups = [
+        ("tokenizers", ["punkt", "punkt_tab"]),
+        ("taggers", ["averaged_perceptron_tagger_eng", "averaged_perceptron_tagger"]),
+    ]
+
+    for category, packages in resource_groups:
+        group_ready = False
+        for package in packages:
+            path = f"{category}/{package}"
+            try:
+                nltk.data.find(path)
+                group_ready = True
+                break
+            except LookupError:
+                log.info("Start download nltk package %s", package)
+                try:
+                    if nltk.download(package, download_dir=nltk_data_dir, quiet=False):
+                        nltk.data.find(path)
+                        group_ready = True
+                        break
+                except PermissionError as e:
+                    log.error("Permission denied when downloading %s: %s", package, e)
+                    return False
+                except (URLError, HTTPError) as e:
+                    log.warning("Network error downloading %s: %s, will retry with backup method", package, e)
+                    continue
+                except LookupError:
+                    log.error("Package %s not found after download. Check package name and nltk_data paths.", package)
+                    continue
+        if not group_ready:
+            return False
+    return True
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py(3 hunks)hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py(2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
hugegraph-llm/**/*.py
📄 CodeRabbit inference engine (hugegraph-llm/AGENTS.md)
hugegraph-llm/**/*.py: Adhere to ruff code style for Python code
Type-check Python code with mypy
Keep each Python file under 600 lines for maintainability
Files:
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.pyhugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
hugegraph-llm/src/hugegraph_llm/operators/**/*.py
📄 CodeRabbit inference engine (hugegraph-llm/AGENTS.md)
Put core processing pipelines under src/hugegraph_llm/operators/
Files:
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.pyhugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
🧠 Learnings (10)
📓 Common learnings
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In the hugegraph-llm KeywordExtract class, user Gfreely wants to keep the extraction-method interfaces consistent: all extraction methods (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], and the main run() method handles sorting and truncation uniformly, instead of each extraction method pre-processing internally. This is their architectural design preference.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), TextRank's internal top_k parameter is not updated after initialization, which means that if max_keywords grows at runtime TextRank can still only generate the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their usage scenario.
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: NLTKHelper in hugegraph-llm uses lazy loading for stopwords and calls nltk.corpus.stopwords.words(lang) directly with the provided language parameter. It does not preload both English and Chinese stopwords - each language is loaded on first access. The lang parameter must match NLTK's expected language codes ("english", "chinese") or it will fail.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.pyhugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: In hugegraph-llm KeywordExtract, NLTKHelper loads both English and Chinese stopwords during initialization, but the stopwords(lang) method still requires the correct language key ("english" or "chinese") to return the appropriate stopword set for filtering.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.pyhugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:37:41.224Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:139-141
Timestamp: 2025-08-18T14:37:41.224Z
Learning: In hugegraph-llm textrank_word_extract.py, NLTK resources are packaged locally under src/hugegraph_llm/resources/nltk_data/ and the code properly configures nltk.data.path using get_project_root() to point to these local resources before making NLTK calls, preventing LookupError issues.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
📚 Learning: 2025-08-29T13:11:08.943Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In the hugegraph-llm KeywordExtract class, user Gfreely wants to keep the extraction-method interfaces consistent: all extraction methods (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], and the main run() method handles sorting and truncation uniformly, instead of each extraction method pre-processing internally. This is their architectural design preference.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-06-25T09:50:06.213Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), TextRank's internal top_k parameter is not updated after initialization, which means that if max_keywords grows at runtime TextRank can still only generate the initial number of candidate keywords. User Gfreely confirmed this design is acceptable for their usage scenario.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing the dynamic window_size adjustment and keeping only the top_k (maximum keyword count) configuration.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:42:31.998Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely applied a ch_tokens deduplication strategy (list(set(ch_tokens))) to avoid reprocessing duplicate Chinese tokens, which both resolves the potential ValueError from words.index() and improves the efficiency of the batch replacement.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-21T12:34:10.407Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:187-202
Timestamp: 2025-08-21T12:34:10.407Z
Learning: In hugegraph-llm's KeywordExtract._extract_keywords_from_response method, the reuse of the sub_tokens variable is correct: in each loop iteration sub_tokens is reassigned to the current token's segmentation result, valid sub-words are appended back into the same sub_tokens list, and finally everything is added to results at once. This implementation is valid and correct.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
🧬 Code graph analysis (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (3)
hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2)
BaseLLM(22-74)generate(26-31)hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (1)
LLMs(25-98)hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2)
MultiLingualTextRank(31-151)extract_keywords(134-151)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
🔇 Additional comments (4)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4)
92-101: LGTM! The LLM extraction method is implemented cleanly and correctly returns the Dict[str, float] format, consistent with the architecture design. The timing logs help with performance monitoring. Based on learnings

103-116: LGTM! The TextRank extraction method is robust:

- Exception handling covers the common error types
- Memory errors are logged at the critical level (reasonable)
- An empty dict is returned as a safe fallback on error
- Returning Dict[str, float] is consistent with the architecture design

Based on learnings

118-147: LGTM! The hybrid extraction method is implemented correctly:

- hybrid_llm_weights is bounds-checked (clamped to [0, 1])
- Weighted scores are computed over the union of keywords, ensuring coverage of both methods' results
- It returns Dict[str, float], with run() handling sorting and truncation uniformly, matching the preferred architecture
- Details are logged at the DEBUG level, avoiding privacy leaks

Based on learnings
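The union-based weighted fusion described in the 118-147 comment can be sketched as follows; this is a minimal illustration, not the PR's actual code (the helper name `fuse_scores` and the clamping behavior are assumptions):

```python
def fuse_scores(llm_scores: dict, tr_scores: dict, llm_weight: float = 0.5) -> dict:
    """Merge two keyword->score maps over the union of keys,
    weighting LLM scores by llm_weight and TextRank by (1 - llm_weight)."""
    w = min(max(llm_weight, 0.0), 1.0)  # clamp weight to [0, 1]
    keys = set(llm_scores) | set(tr_scores)
    return {k: w * llm_scores.get(k, 0.0) + (1 - w) * tr_scores.get(k, 0.0)
            for k in keys}
```

Keywords present in only one source simply contribute a zero score from the other, so nothing from either method is dropped before the final sort.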
149-178: LGTM! The response parsing method is robust:

- The split result length is validated before access (prevents IndexError)
- try-except catches score-parsing errors (prevents ValueError)
- Invalid entries are skipped rather than crashing
- It correctly returns the Dict[str, float] format
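As a minimal sketch of the parsing behavior praised above, assuming the "KEYWORDS:kw:score,..." response format mentioned in the walkthrough (the function name and exact format handling are assumptions, not the PR's implementation):

```python
def parse_keyword_response(response: str) -> dict:
    """Parse 'KEYWORDS:kw1:0.9,kw2:0.7' into {keyword: score},
    skipping malformed entries and clamping scores to [0, 1]."""
    results = {}
    _, _, payload = response.partition("KEYWORDS:")  # payload is "" if marker absent
    for entry in payload.split(","):
        parts = entry.rsplit(":", 1)  # the keyword itself may contain ':'
        if len(parts) != 2:
            continue  # skip malformed entry instead of raising IndexError
        kw, raw_score = parts[0].strip(), parts[1].strip()
        try:
            score = float(raw_score)
        except ValueError:
            continue  # skip unparsable score instead of raising
        if kw:
            results[kw] = min(max(score, 0.0), 1.0)
    return results
```

Skipping malformed entries instead of raising mirrors the IndexError/ValueError protections noted in the review.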
```python
self._language = context.get("language", self._language).lower()
self._max_keywords = context.get("max_keywords", self._max_keywords)
# Use English by default
self._language = "chinese" if self._language == "cn" else "english"
```
The language mapping is incomplete and may cause stopword loading to fail.

Currently only "cn" is mapped to "chinese"; other common language codes such as "en" and "zh" are not handled. Per earlier learnings, NLTKHelper and MultiLingualTextRank depend on exact language identifiers ("english" or "chinese"), so an incomplete mapping can break stopword filtering.

Suggested fix for the language-mapping logic:

```diff
-# Use English by default
-self._language = "chinese" if self._language == "cn" else "english"
+# Normalize language codes to match NLTK/TextRank expectations
+lang = self._language.strip().lower()
+if lang in ("cn", "zh", "zh-cn", "chinese"):
+    self._language = "chinese"
+else:
+    self._language = "english"
```

Based on learnings
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-self._language = "chinese" if self._language == "cn" else "english"
+self._language = (language or "").strip().lower()
+if self._language in ("en", "en-us", "english"):
+    self._language = "english"
+elif self._language in ("cn", "zh", "zh-cn", "zh_cn", "chinese"):
+    self._language = "chinese"
+else:
+    # Default to english for unknown languages
+    self._language = "english"
```
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py around
line 61, the current language mapping only converts "cn" to "chinese" which is
incomplete and can break stopword loading; normalize the incoming language
string (lowercase, strip), and map common codes to the expected identifiers: map
"cn", "zh", "zh-cn" -> "chinese", map "en", "eng" -> "english", leave
"english"/"chinese" unchanged, and for any unknown or empty value default to
"english" (or raise a clear error if preferred); update the assignment
accordingly so downstream NLTKHelper and MultiLingualTextRank receive the exact
expected identifiers.
Actionable comments posted: 3
♻️ Duplicate comments (3)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2)
60-61: The language-code mapping is incomplete and may break stopwords and tokenization. Currently only "cn" is mapped to "chinese"; other common codes (such as "en", "zh", "zh-cn") are not handled. Per the retrieved learnings, NLTKHelper and MultiLingualTextRank require exact language identifiers ("english" or "chinese"), and an incomplete mapping breaks stopword filtering and the tokenization strategy. Suggested fix:

```diff
-# Use English by default
-self._language = "chinese" if self._language == "cn" else "english"
+# Normalize language codes to match NLTK/TextRank expectations
+lang = self._language.strip().lower()
+if lang in ("cn", "zh", "zh-cn", "zh_cn", "chinese"):
+    self._language = "chinese"
+elif lang in ("en", "eng", "english"):
+    self._language = "english"
+else:
+    log.warning("Unsupported language '%s', defaulting to 'english'", lang)
+    self._language = "english"
```

Based on learnings
103-116: The MemoryError handling strategy may mask serious resource problems. Line 112 catches MemoryError and logs it at the critical level but does not re-raise. Memory errors usually indicate serious resource problems (text too large, system resources exhausted, etc.); silently returning an empty result may lead callers to believe no keywords exist rather than that the system hit a serious problem.

Choose one of the following strategies based on business needs:

Strategy 1: re-raise (recommended for critical business flows)

```diff
 except MemoryError as e:
     log.critical("TextRank memory error (text too large?): %s", e)
+    raise  # let upper layers see the resource problem
```

Strategy 2: degrade gracefully with an explicit marker

```diff
 except MemoryError as e:
     log.critical("TextRank memory error (text too large?): %s", e)
+    context["extraction_error"] = "memory_error"
+    context["error_message"] = str(e)
     end_time = time.perf_counter()
```

hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (1)
53-58: The log message is inaccurate and should reflect the actual download status. The message "already downloaded" on line 56 is misleading: at that point a download has just completed, so it should say "downloaded successfully" or "download completed".

Apply this change:

```diff
-log.debug("NLTK package stopwords is already downloaded")
+log.debug("NLTK package stopwords downloaded successfully")
```
🧹 Nitpick comments (2)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
32-47: Consider adding bounds validation for constructor parameters. Although MultiLingualTextRank internally validates window_size (range 0-10), validating max_keywords and window_size here improves robustness and readability and avoids passing invalid parameters to downstream components. Suggested addition:

```python
def __init__(
    self,
    text: Optional[str] = None,
    llm: Optional[BaseLLM] = None,
    max_keywords: int = 5,
    extract_template: Optional[str] = None,
):
    if max_keywords < 1:
        log.warning("max_keywords must be >= 1, using default 5")
        max_keywords = 5
    self._llm = llm
    self._query = text
    # ... remaining initialization logic
```

hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (1)
59-68: The error-handling improvement is reasonable, but a redundant check remains. The LookupError catch added on lines 59-63 ensures graceful degradation to an empty list even if the download fails, in line with earlier review suggestions.

However, the None check on lines 65-68 is slightly redundant: line 63 already sets self._stopwords[lang] = [] on exception, and the condition if self._stopwords.get(lang) is None on line 42 ensures this block only runs on first access, so final_stopwords can never be None. Optional cleanup (remove the redundant check):

```diff
-# final check
-final_stopwords = self._stopwords[lang]
-if final_stopwords is None:
-    return []
 return self._stopwords[lang]
```
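The lazy-load-with-fallback pattern discussed in this nitpick can be sketched generically as follows (a hypothetical memoized loader for illustration, not the project's actual NLTKHelper class):

```python
class StopwordCache:
    """Lazily load and memoize per-language stopword lists,
    falling back to an empty list when a language is unavailable."""

    def __init__(self, loader):
        self._loader = loader  # callable: lang -> list[str]
        self._cache = {}

    def stopwords(self, lang: str) -> list:
        if lang not in self._cache:
            try:
                self._cache[lang] = self._loader(lang)
            except LookupError:
                self._cache[lang] = []  # graceful degradation, no re-raise
        return self._cache[lang]
```

Because the failure result is cached as `[]`, repeated calls for an unavailable language never retry the loader, which matches the "load on first access" behavior described in the learnings.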
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (3 hunks)
- hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
hugegraph-llm/**/*.py
📄 CodeRabbit inference engine (hugegraph-llm/AGENTS.md)
hugegraph-llm/**/*.py: Adhere to ruff code style for Python code
Type-check Python code with mypy
Keep each Python file under 600 lines for maintainability
Files:
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
hugegraph-llm/src/hugegraph_llm/operators/**/*.py
📄 CodeRabbit inference engine (hugegraph-llm/AGENTS.md)
Put core processing pipelines under src/hugegraph_llm/operators/
Files:
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
🧠 Learnings (10)
📓 Common learnings
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (maximum number of keywords) configuration.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In the KeywordExtract class of hugegraph-llm, user Gfreely wants the extraction methods to keep a consistent interface: all extraction methods (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], and the main run() method handles sorting and truncation uniformly instead of pre-processing inside each extraction method. This is their architectural design preference.
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization, meaning that if max_keywords is increased at runtime, TextRank still only generates the initially configured number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: NLTKHelper in hugegraph-llm uses lazy loading for stopwords and calls nltk.corpus.stopwords.words(lang) directly with the provided language parameter. It does not preload both English and Chinese stopwords - each language is loaded on first access. The lang parameter must match NLTK's expected language codes ("english", "chinese") or it will fail.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: In hugegraph-llm KeywordExtract, NLTKHelper loads both English and Chinese stopwords during initialization, but the stopwords(lang) method still requires the correct language key ("english" or "chinese") to return the appropriate stopword set for filtering.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:37:41.224Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:139-141
Timestamp: 2025-08-18T14:37:41.224Z
Learning: In hugegraph-llm textrank_word_extract.py, NLTK resources are packaged locally under src/hugegraph_llm/resources/nltk_data/ and the code properly configures nltk.data.path using get_project_root() to point to these local resources before making NLTK calls, preventing LookupError issues.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py
📚 Learning: 2025-08-29T13:11:08.943Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:113-154
Timestamp: 2025-08-29T13:11:08.943Z
Learning: In the KeywordExtract class of hugegraph-llm, user Gfreely wants the extraction methods to keep a consistent interface: all extraction methods (_extract_with_llm, _extract_with_textrank, _extract_with_hybrid) return Dict[str, float], and the main run() method handles sorting and truncation uniformly instead of pre-processing inside each extraction method. This is their architectural design preference.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-06-25T09:50:06.213Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank keyword extraction, although KeywordExtract.run() slices the final result (context["keywords"] = list(keywords)[:self._max_keywords]), the TextRank model's internal top_k parameter is not updated after initialization, meaning that if max_keywords is increased at runtime, TextRank still only generates the initially configured number of candidate keywords. User Gfreely confirmed this design is acceptable for their use case.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:45:20.756Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:39-41
Timestamp: 2025-08-18T14:45:20.756Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely chose to simplify the UI settings, removing dynamic adjustment of the window size (window_size) and keeping only the top_k (maximum number of keywords) configuration.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-18T14:42:31.998Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py:0-0
Timestamp: 2025-08-18T14:42:31.998Z
Learning: In hugegraph-llm's TextRank implementation, user Gfreely used a ch_tokens deduplication strategy (list(set(ch_tokens))) to avoid processing duplicate Chinese tokens, which both resolves the potential ValueError from words.index() and improves the efficiency of batch replacement.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
📚 Learning: 2025-08-21T12:34:10.407Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:187-202
Timestamp: 2025-08-21T12:34:10.407Z
Learning: In hugegraph-llm's KeywordExtract._extract_keywords_from_response method, the reuse of the sub_tokens variable is correct: in each loop iteration, sub_tokens is reassigned to the tokenization result of the current token, the valid sub-words produced are appended back to the same sub_tokens list, and everything is added to results at the end. This implementation is valid and correct.
Applied to files:
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py
🧬 Code graph analysis (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (4)
- hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2): BaseLLM (22-74), generate (26-31)
- hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (1): LLMs (25-98)
- hugegraph-llm/src/hugegraph_llm/operators/document_op/textrank_word_extract.py (2): MultiLingualTextRank (31-151), extract_keywords (134-151)
- hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (1): extract_keywords (67-85)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build (3.11)
- GitHub Check: build (3.10)
🔇 Additional comments (3)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py (1)
149-184: The response parsing logic is well implemented. The method's error handling is thorough:

- Malformed entries are logged with an appropriate warning (line 170)
- Scores are range-validated and clamped (lines 176-178)
- Parsing exceptions are caught and logged without failing the whole pipeline (lines 181-182)

Changing the return type to Dict[str, float] lets downstream code use the score information for weighting, making the interface more flexible.

hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py (2)
22-22: The imports are reasonable and support the enhanced error handling. The new URLError, HTTPError, and log imports provide the necessary support for the network exception handling and logging in the method.
Also applies to: 28-28
39-41: Correctly registers the project's bundled resource path. Adding the bundled nltk_data path to the search list first ensures local resources take priority and avoids unnecessary downloads, consistent with the behavior of check_nltk_data().
```python
def check_nltk_data(self):
    _hugegraph_source_dir = os.path.join(resource_path, "nltk_data")
    if _hugegraph_source_dir not in nltk.data.path:
        nltk.data.path.append(_hugegraph_source_dir)

    cache_dir = self.get_cache_dir()
    nltk_data_dir = os.environ.get("NLTK_DATA", cache_dir)
    if nltk_data_dir not in nltk.data.path:
        nltk.data.path.append(nltk_data_dir)

    required_packages = {
        'punkt': 'tokenizers/punkt',
        'punkt_tab': 'tokenizers/punkt_tab',
        'averaged_perceptron_tagger': 'taggers/averaged_perceptron_tagger',
        "averaged_perceptron_tagger_eng": 'taggers/averaged_perceptron_tagger_eng'
    }

    for package, path in required_packages.items():
        try:
            nltk.data.find(path)
        except LookupError:
            log.info("Start download nltk package %s", package)
            try:
                if not nltk.download(package, download_dir=nltk_data_dir, quiet=False):
                    log.warning("NLTK download command returned False for package %s.", package)
                    return False
                # Verify after download
                nltk.data.find(path)
            except PermissionError as e:
                log.error("Permission denied when downloading %s: %s", package, e)
                return False
            except (URLError, HTTPError) as e:
                log.warning("Network error downloading %s: %s, will retry with backup method", package, e)
                return False
            except LookupError:
                log.error("Package %s not found after download. Check package name and nltk_data paths.", package)
                return False
    return True
```
Download strategy is inefficient and the log message is misleading: implement a fallback mechanism instead of downloading every alternative package

The current implementation has the following problems:

- Redundant downloads: the method downloads all 4 packages, but punkt and punkt_tab are alternative tokenizer resources, and averaged_perceptron_tagger and averaged_perceptron_tagger_eng are alternative POS-tagger resources. Each group only needs one available resource, so the current strategy wastes bandwidth and time.
- Misleading log: line 104 says "will retry with backup method", but the code implements no retry or backup download method and simply returns False.
- Inconsistent with earlier review advice: past reviews explicitly recommended a fallback-name strategy for the tokenizer and tagger resources (try the preferred resource first, then the alternative) rather than blindly downloading every variant.

Suggested refactor to a fallback-check strategy:

```diff
 def check_nltk_data(self):
     _hugegraph_source_dir = os.path.join(resource_path, "nltk_data")
     if _hugegraph_source_dir not in nltk.data.path:
         nltk.data.path.append(_hugegraph_source_dir)
     cache_dir = self.get_cache_dir()
     nltk_data_dir = os.environ.get("NLTK_DATA", cache_dir)
     if nltk_data_dir not in nltk.data.path:
         nltk.data.path.append(nltk_data_dir)
-    required_packages = {
-        'punkt': 'tokenizers/punkt',
-        'punkt_tab': 'tokenizers/punkt_tab',
-        'averaged_perceptron_tagger': 'taggers/averaged_perceptron_tagger',
-        "averaged_perceptron_tagger_eng": 'taggers/averaged_perceptron_tagger_eng'
-    }
-
-    for package, path in required_packages.items():
-        try:
-            nltk.data.find(path)
-        except LookupError:
-            log.info("Start download nltk package %s", package)
-            try:
-                if not nltk.download(package, download_dir=nltk_data_dir, quiet=False):
-                    log.warning("NLTK download command returned False for package %s.", package)
-                    return False
-                # Verify after download
-                nltk.data.find(path)
-            except PermissionError as e:
-                log.error("Permission denied when downloading %s: %s", package, e)
-                return False
-            except (URLError, HTTPError) as e:
-                log.warning("Network error downloading %s: %s, will retry with backup method", package, e)
-                return False
-            except LookupError:
-                log.error("Package %s not found after download. Check package name and nltk_data paths.", package)
-                return False
-    return True
+    # Fallback strategy: only one available resource per group is required
+    resource_groups = [
+        ("tokenizers", ["punkt", "punkt_tab"]),
+        ("taggers", ["averaged_perceptron_tagger_eng", "averaged_perceptron_tagger"]),
+    ]
+
+    def _ensure_resource(resource_type, candidates):
+        """Ensure at least one candidate resource is available"""
+        for candidate in candidates:
+            path = f"{resource_type}/{candidate}"
+            try:
+                nltk.data.find(path)
+                log.debug("Found existing NLTK resource: %s", path)
+                return True  # found a usable resource
+            except LookupError:
+                continue  # try the next candidate
+
+        # No candidate exists; try downloading the preferred one
+        preferred = candidates[0]
+        log.info("Downloading NLTK package: %s", preferred)
+        try:
+            if not nltk.download(preferred, download_dir=nltk_data_dir, quiet=False):
+                log.warning("NLTK download returned False for %s", preferred)
+                return False
+            nltk.data.find(f"{resource_type}/{preferred}")
+            log.info("Successfully downloaded and verified %s", preferred)
+            return True
+        except PermissionError as e:
+            log.error("Permission denied downloading %s: %s", preferred, e)
+            raise RuntimeError(f"Cannot write NLTK data to {nltk_data_dir}, check permissions") from e
+        except (URLError, HTTPError) as e:
+            log.error("Network error downloading %s: %s", preferred, e)
+            return False
+        except LookupError:
+            log.error("Package %s not found after download", preferred)
+            return False
+
+    for resource_type, candidates in resource_groups:
+        if not _ensure_resource(resource_type, candidates):
+            return False
+
+    return True
```

This approach:

- Downloads only one available variant per resource group (preferring the first candidate)
- Avoids redundant downloads
- Raises on PermissionError (matching imbajin's earlier suggestion to distinguish severity)
- Removes the misleading "retry with backup method" log
🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/common_op/nltk_helper.py around
lines 72 to 109, the current implementation redundantly attempts to download
every variant and logs a misleading "will retry with backup method" without
actually implementing a fallback; refactor to use a fallback-per-group strategy
where each logical resource group (tokenizer: ['punkt', 'punkt_tab'], tagger:
['averaged_perceptron_tagger', 'averaged_perceptron_tagger_eng']) is checked in
priority order: for each group, try nltk.data.find for the preferred name, if
not found try the fallback name, and only if neither exists attempt a single
download for the first available name that succeeds (verify with nltk.data.find
afterwards); do not download both variants, remove the misleading "will retry
with backup method" log, return False only on genuine download/network failures,
and re-raise PermissionError instead of swallowing it so callers can handle
permission issues appropriately.
```python
method = (context.get("extract_method", self._extract_method) or "LLM").strip().lower()
if method == "llm":
    # LLM method
    ranks = self._extract_with_llm()
elif method == "textrank":
    # TextRank method
    ranks = self._extract_with_textrank()
elif method == "hybrid":
    # Hybrid method
    ranks = self._extract_with_hybrid()
else:
    log.warning("Invalid extract_method %s", method)
    raise ValueError(f"Invalid extract_method: {method}")
```
The extract_method parameter handling has an inconsistent default and an empty-string risk.

Problems:

- The default on line 69 is hardcoded to "LLM", inconsistent with self._extract_method, which the constructor reads from llm_settings.keyword_extract_type
- When context.get("extract_method") returns a whitespace-only string, strip().lower() yields an empty string, causing the ValueError on line 80

Suggested fix:

```diff
-method = (context.get("extract_method", self._extract_method) or "LLM").strip().lower()
+method = (context.get("extract_method", self._extract_method) or self._extract_method).strip().lower()
+if not method:
+    log.warning("Empty extract_method, using default: %s", self._extract_method)
+    method = self._extract_method.lower()
+
 if method == "llm":
     # LLM method
     ranks = self._extract_with_llm()
 elif method == "textrank":
     # TextRank method
     ranks = self._extract_with_textrank()
 elif method == "hybrid":
     # Hybrid method
     ranks = self._extract_with_hybrid()
 else:
-    log.warning("Invalid extract_method %s", method)
-    raise ValueError(f"Invalid extract_method: {method}")
+    log.warning("Invalid extract_method '%s', valid options: llm/textrank/hybrid", method)
+    raise ValueError(f"Invalid extract_method: '{method}'. Must be 'llm', 'textrank', or 'hybrid'")
```
+ raise ValueError(f"Invalid extract_method: '{method}'. Must be 'llm', 'textrank', or 'hybrid'")| keywords = [] if not ranks else sorted(ranks, key=ranks.get, reverse=True) | ||
| keywords = [k.replace("'", "") for k in keywords] | ||
| context["keywords"] = keywords[:self._max_keywords] | ||
| log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"]) |
Logging the user query at INFO level is a privacy risk.

Line 86 logs the full user query and the extracted keywords at the INFO level. User queries may contain sensitive information (personal data, business secrets, etc.); logging them at INFO increases the risk of privacy leaks and violates the data-minimization principle.

Suggested fix:

```diff
-log.info("User Query: %s\nKeywords: %s", self._query, context["keywords"])
+log.debug("User Query: %s | Keywords: %s", self._query, context["keywords"])
```

Alternatively, if INFO-level logging is required, redact the query first:

```python
query_preview = self._query[:50] + "..." if len(self._query) > 50 else self._query
log.info("Keywords extracted (query: %s) | Keywords: %s", query_preview, context["keywords"])
```
log.info("Keywords extracted (query length: %d) | Keywords: %s", len(self._query), context["keywords"])🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py around
line 86, remove logging of the full user query at INFO level to avoid leaking
sensitive data; instead log only non-sensitive metadata (e.g., query length) and
the extracted keywords, or log a redacted/truncated preview of the query if
strictly required, and if full query must be recorded keep it at DEBUG level;
implement one of these changes so INFO logs never contain the complete original
query.
Support TextRank and update the UI to allow changing the keyword extraction method.
Main changes:
Added options to the RAG interface for selecting the keyword extraction method, along with several settings for TextRank, such as the number of keywords to extract, the size of the sliding window, and the processing language ('en' for English, 'zh' for Chinese).
A 'TextRank mask words' setting has also been added: users can manually enter specific phrases composed of letters and symbols to prevent them from being split during Chinese word segmentation.
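For context on what these settings control, here is a heavily simplified sketch of the TextRank mechanics (a sliding co-occurrence window feeding a plain PageRank iteration); it omits the PR's POS filtering, multilingual tokenization, mask words, and igraph backend, and the function name is hypothetical:

```python
def textrank_keywords(words, window_size=3, top_k=5, damping=0.85, iters=50):
    """Score words by PageRank over a co-occurrence graph built
    from a sliding window, then return the top_k words by score."""
    # Build undirected co-occurrence edges within the window
    neighbors = {w: set() for w in words}
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window_size, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    # Plain PageRank iteration over the co-occurrence graph
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        new_scores = {}
        for w in neighbors:
            rank = sum(scores[n] / len(neighbors[n]) for n in neighbors[w] if neighbors[n])
            new_scores[w] = (1 - damping) + damping * rank
        scores = new_scores
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k])
```

In hybrid mode, scores like these would then be fused with the LLM scores before run() sorts and truncates to max_keywords.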
Summary by CodeRabbit
New Features
Behavior Changes
Documentation
Chores