This repository was archived by the owner on Dec 28, 2025. It is now read-only.

Add mypy & ruff configs, fix type issues, add assertions, and initial vector DB CRUD with tests #4

Open
mikumifa wants to merge 16 commits into hugegraph:main from mikumifa:main

Conversation


@mikumifa mikumifa commented May 22, 2025

This PR includes several development improvements and code updates (forked from apache#246):

  • Added mypy and ruff configuration files to support type checking and code style enforcement. These can be removed if not needed.

  • Fixed type annotation issues, such as graph_url: Optional[str] = "127.0.0.1:8080", where Optional is misleading because the value will never be None.

  • Added assert statements to guard against None where necessary for better runtime safety.

  • Implemented initial CRUD operations for several vector databases and added corresponding test cases. Integration is not yet done to keep this PR smaller and easier to review—full integration will follow in a separate PR.

  • Step 5 is the settings configuration (screenshot omitted).

  • Updated the way configuration values are loaded in the UI, using a lambda function for dynamic value loading. This ensures that updated settings are reflected after a refresh.

gr.Textbox(value=llm_settings.reranker_api_key, label="api_key", type="password") # old
gr.Textbox(value=lambda: llm_settings.reranker_api_key, label="api_key", type="password") #new


  • Added index_settings; values are saved into the .env file for persistence.
  • Manually tested rag and text2gremlin_block. Both components function correctly after the changes.
  • Added a now_vector_index field to the get_vector_index_info() method, which displays the currently selected vector index database.

Summary by CodeRabbit

  • New Features

    • Support for multiple vector databases (Faiss, Milvus, Qdrant), selectable and dynamically switchable in the UI.
    • New management of vector database connection settings, with environment variable support.
    • Unified vector store interface supporting batch vector operations, property management, and index info queries.
    • Embedding model configuration supports setting the dimension, covering OpenAI, Ollama, Qianfan Wenxin, LiteLLM, and other embedding models.
    • Vector index features and UI fully upgraded, with backend selection and dynamic parameter loading.
  • Improvements & Refactoring

    • Unified and abstracted the vector index interface, improving extensibility and maintainability.
    • Optimized code structure, decoupling vector index implementations from call sites to ease adding new backends.
    • More explicit type annotations and default values for configuration items, strengthening type safety.
    • UI configuration blocks dynamically load the current settings, improving the user experience.
  • Bug Fixes

    • Fixed several type-checking and argument-passing issues, improving stability.
  • Tests

    • Added unit tests for the Milvus and Qdrant vector indices, covering add, search, save, load, and delete operations.
  • Other

    • Updated project dependencies and static-analysis tool configuration to match the new features.

@coderabbitai
Copy link

coderabbitai bot commented May 22, 2025

Walkthrough

This update introduces a pluggable architecture for vector index backends, supporting three mainstream vector databases (Faiss, Milvus, and Qdrant) with dynamic selection via configuration. The related configuration, UI, operators, utility functions, and tests have all been adapted. The embedding model's dimension parameter is now managed explicitly, and the embedding interface supports batch processing. New Milvus and Qdrant vector index implementations are added with complete test cases.

Changes

File/Path group — Change summary
hugegraph-llm/pyproject.toml
hugegraph-llm/requirements.txt
Add the pymilvus and qdrant-client dependencies, upgrade ollama, and add type-checking and lint configuration.
hugegraph-llm/src/hugegraph_llm/config/* Add the index_config module; configuration supports multiple vector databases; config item type annotations are more precise; the embedding dimension is managed explicitly.
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/* The UI supports selecting the vector database, the embedding dimension is parameterized, and related calls are adapted for multiple backends.
hugegraph-llm/src/hugegraph_llm/indices/vector_index/base.py Add the VectorStoreBase abstract base class, unifying the vector index interface.
hugegraph-llm/src/hugegraph_llm/indices/vector_index/faiss_vector_store.py Rename VectorIndex to FaissVectorIndex, standardize the interface, and support dynamic naming and multiple backends.
hugegraph-llm/src/hugegraph_llm/indices/vector_index/milvus_vector_store.py Add the MilvusVectorIndex implementation, supporting the Milvus vector database.
hugegraph-llm/src/hugegraph_llm/indices/vector_index/qdrant_vector_store.py Add the QdrantVectorIndex implementation, supporting the Qdrant vector database.
hugegraph-llm/src/hugegraph_llm/models/embeddings/* Each embedding implementation supports an explicit embedding_dimension parameter and batch text embedding; interfaces are standardized.
hugegraph-llm/src/hugegraph_llm/operators/* Index-building and query operators accept injected vector index types, decoupling implementations and adapting to multiple backends.
hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py
utils/graph_index_utils.py
Add the get_vector_index_class method; all vector index operations support dynamic backend selection.
hugegraph-llm/src/tests/indices/test_faiss_vector_index.py Adapt the test cases to the FaissVectorIndex rename.
hugegraph-llm/src/tests/indices/test_milvus_vector_index.py Add functional test cases for MilvusVectorIndex.
hugegraph-llm/src/tests/indices/test_qdrant_vector_index.py Add functional test cases for QdrantVectorIndex.
hugegraph-llm/src/hugegraph_llm/indices/graph_index.py
middleware/middleware.py
...
Minor formatting, type annotation, and log-format improvements; no functional changes.

Sequence Diagram(s)

sequenceDiagram
    participant UI
    participant Config
    participant Embedding
    participant VectorIndex
    participant Milvus
    participant Qdrant
    participant Faiss

    UI->>Config: read index_settings.now_vector_index
    Config->>UI: return current vector DB type
    UI->>Embedding: get embedding and its dimension
    UI->>VectorIndex: get_vector_index_class(type)
    alt Faiss selected
        VectorIndex->>Faiss: FaissVectorIndex.from_name(...)
    else Milvus selected
        VectorIndex->>Milvus: MilvusVectorIndex.from_name(...)
    else Qdrant selected
        VectorIndex->>Qdrant: QdrantVectorIndex.from_name(...)
    end
    Embedding->>VectorIndex: batch add/search/delete
    VectorIndex->>UI: return results

Poem

(=^·ᴥ·^=)

Three rabbits framed, a vector choice,
Faiss, Milvus, Qdrant all rejoice.
Config switching, quick and spry,
Dimension params no longer shy.
Batch embeddings speed the way,
New backends tested, hip hooray!

🥕 Code leaps ahead, the rabbits cheer!



📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f42fa9b and 3d356e0.

📒 Files selected for processing (5)
  • hugegraph-llm/pyproject.toml (2 hunks)
  • hugegraph-llm/requirements.txt (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/config/models/base_config.py (2 hunks)
  • hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/demo/rag_demo/configs_block.py (10 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py
  • hugegraph-llm/src/hugegraph_llm/config/models/base_config.py
  • hugegraph-llm/requirements.txt
  • hugegraph-llm/pyproject.toml
  • hugegraph-llm/src/hugegraph_llm/demo/rag_demo/configs_block.py
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: build (3.10)
  • GitHub Check: build (3.11)


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 29

🔭 Outside diff range comments (3)
hugegraph-llm/src/hugegraph_llm/indices/vector_index/faiss_vector_store.py (2)

47-54: 🛠️ Refactor suggestion

A dimension mismatch in add should raise explicitly instead of silently rebuilding

The index is currently rebuilt only when ntotal == 0 and the dimensions differ;
if the index already holds data and vectors of a different dimension are passed in, wrongly sized vectors are silently added, triggering a C++ assertion.

Suggested:

if self.index.ntotal != 0 and len(vectors[0]) != self.index.d:
    raise ValueError("Input vector dimension does not match the index dimension")

55-67: ⚠️ Potential issue

Faiss IndexFlatL2 does not support remove_ids and raises immediately

IndexFlatL2 is a flat index without IDs; calling remove_ids triggers faiss::Index::remove_ids not implemented.
If deletion really is needed, switch to IndexIDMap or rebuild the index.

-        self.index.remove_ids(np.array(indices))
+        # IndexFlatL2 cannot remove IDs; rebuild instead
+        if indices:
+            keep_mask = np.ones(self.index.ntotal, dtype=bool)
+            keep_mask[indices] = False
+            kept_vectors = self.index.reconstruct_n(0, self.index.ntotal)[keep_mask]
+            self.index.reset()
+            if kept_vectors.size:
+                self.index.add(kept_vectors)
hugegraph-llm/src/tests/indices/test_milvus_vector_index.py (1)

94-101: 🛠️ Refactor suggestion

The assertion may not verify what it intends

What structure do the elements of after_results have? If it is a list of dicts like {"text": ..., "score": ...}, then assertNotIn(str, list) is trivially true.
Compare the item["text"] field or use a custom comparator to make sure the deletion logic actually takes effect; a sketch follows.
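
A hedged sketch of a more robust assertion, assuming each result is a dict shaped like {"text": ..., "score": ...} (the shape is speculated above, not confirmed):

import unittest

class MilvusDeletionAssertions(unittest.TestCase):
    def assert_text_removed(self, after_results, removed_text):
        # Compare the "text" field rather than the whole dict; a bare
        # assertNotIn(str, list_of_dicts) is trivially true.
        texts = [item["text"] for item in after_results]
        self.assertNotIn(removed_text, texts)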

🧹 Nitpick comments (32)
hugegraph-llm/src/hugegraph_llm/indices/vector_index/faiss_vector_store.py (3)

39-46: Saving the files lacks atomicity and exception handling

If an exception is raised during faiss.write_index or pkl.dump, a half-written file can leave the index corrupted.
Write to a temporary file first and swap it in atomically with os.replace, wrapping the operation in try/except with logging. A sketch follows.
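
A minimal sketch of an atomic save, assuming the index is written with faiss.write_index and the properties with pickle.dump; the function name and file paths here are illustrative, not the project's exact code:

import os
import pickle
import tempfile

import faiss

def save_atomically(index, properties, index_file: str, properties_file: str) -> None:
    """Write both files to temp paths, then atomically swap them into place."""
    target_dir = os.path.dirname(index_file) or "."
    fd, tmp_index = tempfile.mkstemp(dir=target_dir)
    os.close(fd)
    fd, tmp_props = tempfile.mkstemp(dir=target_dir)
    os.close(fd)
    try:
        faiss.write_index(index, tmp_index)
        with open(tmp_props, "wb") as f:
            pickle.dump(properties, f)
        os.replace(tmp_index, index_file)      # atomic on POSIX
        os.replace(tmp_props, properties_file)
    except Exception:
        # Leave any existing files untouched; clean up the temp files.
        for tmp in (tmp_index, tmp_props):
            if os.path.exists(tmp):
                os.unlink(tmp)
        raise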


90-103: Vector statistics field names may mislead

The chunk_vector_num and graph_vid_vector_num fields are both taken from ntotal, so their meanings are duplicated;
if they are actually different, compute them separately; otherwise merge them into a single total_vectors to avoid confusion.


105-113: clean deletes only the files, not the empty directory

After the index is removed the directory remains, which can lead to stale-data scans.
After deleting the files, try os.rmdir on the now-empty directory (ignoring errors).

hugegraph-llm/requirements.txt (1)

20-21: Fix the dependency declaration formatting
pymilvus ~=2.5.9 and qdrant-client ~=1.14.2 contain stray spaces; standardize on the space-free form:

-pymilvus ~=2.5.9
-qdrant-client ~=1.14.2
+pymilvus~=2.5.9
+qdrant-client~=1.14.2
hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py (3)

59-61: Variable name inconsistent with the config key
The "related" in the variable indented_custom_related_information does not match the "rerank" in the attribute custom_rerank_info, which invites confusion; suggested rename:

indented_custom_rerank_info = "\n".join(
    [f"    {line}" for line in self.custom_rerank_info.splitlines()]
)

63-65: Unify the variable naming style
indented_keywords_extract_template is out of sync with the attribute keywords_extract_prompt; suggested adjustment:

indented_keywords_extract_prompt = "\n".join(
    [f"    {line}" for line in self.keywords_extract_prompt.splitlines()]
)

53-67: Extract the shared indentation logic
"\n".join([...]) is used in several places for indenting; it can be abstracted as:

def indent(text: str, prefix: str) -> str:
    return "\n".join(f"{prefix}{line}" for line in text.splitlines())

Then replace the existing logic with it to reduce duplication.

hugegraph-llm/src/hugegraph_llm/indices/graph_index.py (1)

28-38: Format the constructor parameters and client initialization
Split the constructor parameters and the PyHugeClient call across aligned lines for readability; no logic change.

hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py (1)

160-161: The added assertion improves robustness

Adding assert admin_settings.enable_login ensures the configuration the authentication logic depends on exists and is truthy, which is good defensive programming. The change matches the PR goal of adding assertions to guard against None at runtime.

Note: lower() is still called after this assertion; if enable_login is not a string, it will raise. Consider a type check or a safer way to handle the config value.

 assert admin_settings.enable_login
-auth_enabled = admin_settings.enable_login.lower() == "true"
+auth_enabled = str(admin_settings.enable_login).lower() == "true"
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/vector_graph_block.py (1)

178-178: Type-ignore comment added to avoid a type-check error

A # type:ignore comment was added at the check_graph_db_connection call so tools like mypy do not report an error. This is a stopgap; the underlying type issue should be fixed in a future version.

Consider fixing the underlying type definitions in a future version rather than relying on type-ignore comments.

hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py (2)

362-363: Type-ignore comment added

A # type:ignore comment was added at a DataFrame operation to silence mypy and similar tools. This is a stopgap; the pandas/type-system integration should be improved later.

Consider using pandas' own typing support or dedicated pandas type stubs in a future version instead of relying on type-ignore comments.


366-370: Type-ignore comments added to DataFrame operations

# type:ignore comments were added at several DataFrame operations to avoid type-check errors while keeping the original data-processing logic intact.

Consider the improved typing support in pandas 2.0+, or a library such as pandera for runtime DataFrame validation, to reduce reliance on # type:ignore.

hugegraph-llm/src/hugegraph_llm/config/index_config.py (2)

24-26: The class docstring does not match the class's actual purpose

The current docstring says "LLM settings", but this class actually provides configuration for the vector index services; use a more accurate description such as "Vector index settings".

-    """LLM settings"""
+    """Vector index settings for Qdrant and Milvus"""

29-29: Inconsistent environment variable access

qdrant_api_key is fetched differently from the other settings: it uses an extra conditional, while the others use os.environ.get() with a default value. Unify the access pattern to keep the code consistent.

-    qdrant_api_key: Optional[str] = os.environ.get("QDRANT_API_KEY") if os.environ.get("QDRANT_API_KEY") else None
+    qdrant_api_key: Optional[str] = os.environ.get("QDRANT_API_KEY", None)
hugegraph-llm/src/hugegraph_llm/operators/kg_construction_task.py (2)

78-79: The assert statement lacks a meaningful error message

Adding an assertion on example_prompt is warranted, but without a meaningful message it makes debugging harder.

-            assert example_prompt
+            assert example_prompt, "example_prompt is required to extract a property graph"

109-109: The assertion in run may be too strict

Asserting that context is not None may be too strict, since returning None can be valid behavior in some cases. If the assertion is kept, provide a meaningful message.

-        assert context is not None
+        assert context is not None, "context must not be None after run completes"
hugegraph-llm/src/hugegraph_llm/operators/index_op/build_gremlin_example_index.py (1)

25-25: The FIXME comment needs resolving

The FIXME here notes that this logic should stay consistent with build_semantic_index.py; consider extracting the shared logic into a common utility class or method to avoid duplication.

hugegraph-llm/src/tests/indices/test_qdrant_vector_index.py (2)

33-53: Basic vector index operation tests

The tests cover the basic vector index operations, including adding vectors and searching for similar ones. They use Chinese text data and a suitable embedding model to verify QdrantVectorIndex's core functionality.

Note: the hardcoded threshold dis_threshold=100 may need to account for the characteristics of different embedding models.

Consider making the threshold a constant or computing it dynamically from the vector dimension to suit different embedding models.


32-32: Consider adding more assertions

tearDown calls QdrantVectorIndex.clean(self.name) but never verifies the cleanup succeeded. Add a test to make sure clean works as expected.

Consider a dedicated test case for clean, or add verification to an existing test, for example:

def test_clean_index(self):
    # Create and populate the index
    index = QdrantVectorIndex.from_name(1024, self.name)
    # Add some data...

    # Clean the index
    QdrantVectorIndex.clean(self.name)

    # Verify the index was cleaned, e.g. by reloading it
    # and checking that it is empty
    cleaned_index = QdrantVectorIndex.from_name(1024, self.name)
    self.assertEqual(len(cleaned_index.search(some_vector, 10)), 0)
hugegraph-llm/src/hugegraph_llm/indices/vector_index/base.py (3)

30-45: Vector add and property retrieval method definitions

The add and get_all_properties methods define the standard interface for adding vectors and retrieving properties. Note that the docstring of get_all_properties contains an unfinished TODO marker; complete it with the method's purpose, parameters, and return value.

-        """
-        #TODO: finish comment
-        """
+        """
+        Get all properties stored in the vector index.
+
+        Returns:
+            list[str]: List of all properties associated with vectors in the store.
+        """

72-84: Index save and info methods lack complete documentation

The docstrings of save_index_by_name and get_vector_index_info are unfinished; add complete descriptions.

-        """
-        #TODO: finish comment
-        """
+        """
+        Save the vector index to disk using the provided name components.
+
+        Args:
+            *name (str): Components of the path/name used to identify and store the index.
+        """
-        """
-        #TODO: finish comment
-        """
+        """
+        Get information about the vector index.
+
+        Returns:
+            Dict: Dictionary containing metadata about the vector index,
+                  such as dimension, number of vectors, etc.
+        """

86-105: Static method interfaces lack complete documentation

The docstrings of the static methods from_name, exist, and clean are unfinished and should be completed. These methods are central to the vector index lifecycle and need thorough documentation.

-        """
-        #TODO: finish comment
-        """
+        """
+        Create or load a vector index from the specified name and embedding dimension.
+
+        Args:
+            embed_dim (int): Dimension of the embedding vectors.
+            *name (str): Components of the path/name used to identify the index.
+
+        Returns:
+            VectorStoreBase: An instance of the vector store.
+        """
-        """
-        #TODO: finish comment
-        """
+        """
+        Check if a vector index exists at the specified name.
+
+        Args:
+            *name (str): Components of the path/name used to identify the index.
+
+        Returns:
+            bool: True if the index exists, False otherwise.
+        """
-        """
-        #TODO: finish comment
-        """
+        """
+        Clean up (remove) a vector index at the specified name.
+
+        Args:
+            *name (str): Components of the path/name used to identify the index.
+
+        Returns:
+            bool: True if the cleaning was successful, False otherwise.
+        """
docker/Dockerfile.nk (1)

11-15: A simple optimization to shrink the image and speed up builds
Both the pip install and apt commands leave caches behind, and nothing cleans them up explicitly, so the final image contains useless files. Append --no-cache-dir or pip cache purge in the same layer and remove the apt cache at the end to shrink the image significantly.

-    python -m pip install nuitka && \
-    pip install ./hugegraph-python-client && \
-    pip install -r ./hugegraph-llm/requirements.txt && \
+    python -m pip install --no-cache-dir nuitka && \
+    pip install --no-cache-dir ./hugegraph-python-client && \
+    pip install --no-cache-dir -r ./hugegraph-llm/requirements.txt && \
+    pip cache purge && \
hugegraph-llm/src/hugegraph_llm/utils/graph_index_utils.py (2)

39-45: Safe handling of cross-backend fields

Keys such as vector_index_info['vector_info']['chunk_vector_num'] and graph_properties_vector_num assume every backend returns the same structure. If a backend names a field differently or omits it, a KeyError will break the UI outright.

chunk_num = vector_index_info.get("vector_info", {}).get("chunk_vector_num", 0)

Returning a default value (or a warning) in the JSON response is more robust.


110-114: Confirm the Builder API matches the type

Does build_vertex_id_semantic_index(vector_index) expect a class or an instance?
A class (not an instance) is passed here. If the implementation later changes to expect an instance, this will fail at runtime. Double-check the signature of KgBuilder.build_vertex_id_semantic_index and add type hints to avoid future refactoring pitfalls.

hugegraph-llm/src/hugegraph_llm/operators/index_op/gremlin_example_index_query.py (2)

33-45: The exist() / from_name() call flow is hard to follow

  • The code first checks vector_index.exist("gremlin_examples"), then calls from_name(...) regardless of the result.
    If exist() returns True, the two from_name calls should yield the same result; merging them would save one RPC/IO round trip.

  • Suggested:

self.vector_index = vector_index.from_name(
    self.embedding.get_embedding_dim(), "gremlin_examples"
)
if not self.vector_index.exist():
    ...

Having the instance itself decide whether it has been persisted is more intuitive.


55-68: Duplicated batch-embedding logic; reuse a shared utility

The TODO in the file already flags this. The current code hand-rolls ThreadPoolExecutor + tqdm, but other modules (e.g. build_semantic_index.py) already implement batch embedding. Extract it into something like utils.embedding_batch() so thread pools are not scattered across modules and hard to maintain; a sketch of such a helper follows.
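
A minimal sketch of such a shared helper, assuming an embedding object that exposes get_text_embedding(text) as elsewhere in this PR; the name embedding_batch comes from the suggestion above and the implementation is illustrative:

from concurrent.futures import ThreadPoolExecutor
from typing import List

from tqdm import tqdm

def embedding_batch(embedding, texts: List[str], max_workers: int = 8) -> List[List[float]]:
    """Embed texts concurrently, preserving the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves order; tqdm tracks progress as results stream in.
        return list(
            tqdm(executor.map(embedding.get_text_embedding, texts),
                 total=len(texts), desc="Embedding texts")
        )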

hugegraph-llm/src/hugegraph_llm/demo/rag_demo/configs_block.py (1)

650-659: The selected vector backend is not persisted to .env

The select event only runs setattr(index_settings, "now_vector_index", engine), so the choice is lost after a page refresh.
Call index_settings.update_env() (if it exists) or update .env automatically inside the setter, consistent with the other settings; see the sketch below.
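
A hedged sketch of the handler, assuming index_settings exposes an update_env() method like the other config objects referenced in this PR:

def on_vector_index_select(engine: str):
    setattr(index_settings, "now_vector_index", engine)
    index_settings.update_env()  # persist the choice to .env immediately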

hugegraph-llm/src/hugegraph_llm/demo/rag_demo/text2gremlin_block.py (1)

121-129: Avoid overusing # type: ignore; complete the type annotations

# type: ignore bypasses the type checker here, but mini_schema can be given a precise type, which helps later maintenance.

-    mini_schema = {}  # type: ignore
+    mini_schema: Dict[str, Any] = {}

If mypy still cannot verify the type afterwards, refine the field types step by step instead of ignoring the whole statement.

hugegraph-llm/src/hugegraph_llm/config/llm_config.py (2)

31-32: embedding_type has a default value; Optional can be removed

Declaring the field as Optional[...] permits None, but the default is "openai" and downstream code universally assumes a string. Making it non-optional reduces branching:

-    embedding_type: Optional[Literal["openai", "litellm", "ollama/local", "qianfan_wenxin"]] = "openai"
+    embedding_type: Literal["openai", "litellm", "ollama/local", "qianfan_wenxin"] = "openai"

67-70: More readable environment-variable parsing

The long ternary expression is hard to read, and int() may raise. Split it into a helper or validate up front:

ollama_embedding_model_dim_env = os.getenv("OLLAMA_EMBEDDING_MODEL_DIM")
ollama_embedding_model_dim: Optional[int] = int(ollama_embedding_model_dim_env) if ollama_embedding_model_dim_env else None

This makes unit testing easier and lets conversion failures be caught with a helpful message.

hugegraph-llm/src/hugegraph_llm/indices/vector_index/milvus_vector_store.py (1)

98-111: Serialization/deserialization method names are swapped and confusing

_deserialize_property actually serializes to a string, while _serialize_property does the deserializing. Rename them or swap the implementations so the names match the semantics.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e1d3e44 and f42fa9b.

📒 Files selected for processing (44)
  • docker/Dockerfile.nk (1 hunks)
  • hugegraph-llm/pyproject.toml (3 hunks)
  • hugegraph-llm/requirements.txt (2 hunks)
  • hugegraph-llm/src/hugegraph_llm/config/__init__.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/config/generate.py (2 hunks)
  • hugegraph-llm/src/hugegraph_llm/config/hugegraph_config.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/config/index_config.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/config/llm_config.py (3 hunks)
  • hugegraph-llm/src/hugegraph_llm/config/models/base_config.py (3 hunks)
  • hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py (2 hunks)
  • hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py (3 hunks)
  • hugegraph-llm/src/hugegraph_llm/demo/rag_demo/configs_block.py (9 hunks)
  • hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py (8 hunks)
  • hugegraph-llm/src/hugegraph_llm/demo/rag_demo/text2gremlin_block.py (7 hunks)
  • hugegraph-llm/src/hugegraph_llm/demo/rag_demo/vector_graph_block.py (6 hunks)
  • hugegraph-llm/src/hugegraph_llm/indices/graph_index.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/indices/vector_index/base.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/indices/vector_index/faiss_vector_store.py (2 hunks)
  • hugegraph-llm/src/hugegraph_llm/indices/vector_index/milvus_vector_store.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/indices/vector_index/qdrant_vector_store.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/middleware/middleware.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/models/embeddings/base.py (2 hunks)
  • hugegraph-llm/src/hugegraph_llm/models/embeddings/init_embedding.py (2 hunks)
  • hugegraph-llm/src/hugegraph_llm/models/embeddings/litellm.py (2 hunks)
  • hugegraph-llm/src/hugegraph_llm/models/embeddings/ollama.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/models/embeddings/openai.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/models/embeddings/qianfan.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (8 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/gremlin_generate_task.py (4 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/schema_manager.py (2 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/index_op/build_gremlin_example_index.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py (2 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/index_op/gremlin_example_index_query.py (2 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/index_op/semantic_id_query.py (3 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/index_op/vector_index_query.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/kg_construction_task.py (4 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/gremlin_generate.py (3 hunks)
  • hugegraph-llm/src/hugegraph_llm/utils/decorators.py (3 hunks)
  • hugegraph-llm/src/hugegraph_llm/utils/graph_index_utils.py (3 hunks)
  • hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py (2 hunks)
  • hugegraph-llm/src/tests/indices/test_faiss_vector_index.py (1 hunks)
  • hugegraph-llm/src/tests/indices/test_milvus_vector_index.py (1 hunks)
  • hugegraph-llm/src/tests/indices/test_qdrant_vector_index.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (11)
hugegraph-llm/src/hugegraph_llm/config/generate.py (2)
hugegraph-llm/src/hugegraph_llm/config/prompt_config.py (1)
  • PromptConfig (23-389)
hugegraph-llm/src/hugegraph_llm/config/models/base_config.py (1)
  • generate_env (38-54)
hugegraph-llm/src/hugegraph_llm/indices/graph_index.py (1)
hugegraph-python-client/src/pyhugegraph/client.py (2)
  • PyHugeClient (48-101)
  • graph (69-70)
hugegraph-llm/src/hugegraph_llm/models/embeddings/litellm.py (4)
hugegraph-llm/src/hugegraph_llm/models/embeddings/base.py (1)
  • get_embedding_dim (61-64)
hugegraph-llm/src/hugegraph_llm/models/embeddings/ollama.py (1)
  • get_embedding_dim (39-42)
hugegraph-llm/src/hugegraph_llm/models/embeddings/openai.py (1)
  • get_embedding_dim (38-41)
hugegraph-llm/src/hugegraph_llm/models/embeddings/qianfan.py (1)
  • get_embedding_dim (48-51)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/gremlin_generate.py (1)
hugegraph-llm/src/hugegraph_llm/models/llms/base.py (1)
  • BaseLLM (22-74)
hugegraph-llm/src/hugegraph_llm/models/embeddings/init_embedding.py (4)
hugegraph-llm/src/hugegraph_llm/models/embeddings/litellm.py (1)
  • LiteLLMEmbedding (32-100)
hugegraph-llm/src/hugegraph_llm/models/embeddings/ollama.py (1)
  • OllamaEmbedding (25-72)
hugegraph-llm/src/hugegraph_llm/models/embeddings/openai.py (1)
  • OpenAIEmbedding (24-72)
hugegraph-llm/src/hugegraph_llm/models/embeddings/qianfan.py (1)
  • QianFanEmbedding (34-66)
hugegraph-llm/src/hugegraph_llm/models/embeddings/base.py (4)
hugegraph-llm/src/hugegraph_llm/models/embeddings/litellm.py (4)
  • get_text_embedding (57-70)
  • get_embedding_dim (47-50)
  • get_texts_embeddings (72-85)
  • async_get_text_embedding (87-100)
hugegraph-llm/src/hugegraph_llm/models/embeddings/ollama.py (4)
  • get_text_embedding (44-46)
  • get_embedding_dim (39-42)
  • get_texts_embeddings (48-67)
  • async_get_text_embedding (69-72)
hugegraph-llm/src/hugegraph_llm/models/embeddings/openai.py (4)
  • get_text_embedding (43-46)
  • get_embedding_dim (38-41)
  • get_texts_embeddings (48-67)
  • async_get_text_embedding (69-72)
hugegraph-llm/src/hugegraph_llm/models/embeddings/qianfan.py (4)
  • get_text_embedding (53-56)
  • get_embedding_dim (48-51)
  • get_texts_embeddings (58-61)
  • async_get_text_embedding (63-66)
hugegraph-llm/src/hugegraph_llm/config/index_config.py (1)
hugegraph-llm/src/hugegraph_llm/config/models/base_config.py (1)
  • BaseConfig (31-139)
hugegraph-llm/src/hugegraph_llm/indices/vector_index/base.py (2)
hugegraph-llm/src/hugegraph_llm/indices/vector_index/faiss_vector_store.py (9)
  • add (47-53)
  • get_all_properties (90-91)
  • remove (55-67)
  • search (69-88)
  • save_index_by_name (39-45)
  • get_vector_index_info (93-103)
  • from_name (115-132)
  • exist (135-138)
  • clean (106-112)
hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py (1)
  • get_vector_index_info (62-72)
hugegraph-llm/src/hugegraph_llm/config/hugegraph_config.py (1)
hugegraph-llm/src/hugegraph_llm/config/models/base_config.py (1)
  • BaseConfig (31-139)
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/configs_block.py (2)
hugegraph-llm/src/hugegraph_llm/models/embeddings/litellm.py (1)
  • LiteLLMEmbedding (32-100)
hugegraph-llm/src/hugegraph_llm/config/models/base_config.py (1)
  • update_env (56-70)
hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py (7)
hugegraph-llm/src/hugegraph_llm/indices/vector_index/base.py (4)
  • VectorStoreBase (23-105)
  • get_vector_index_info (79-84)
  • from_name (88-91)
  • clean (102-105)
hugegraph-llm/src/hugegraph_llm/indices/vector_index/faiss_vector_store.py (4)
  • FaissVectorIndex (34-138)
  • get_vector_index_info (93-103)
  • from_name (115-132)
  • clean (106-112)
hugegraph-llm/src/hugegraph_llm/indices/vector_index/milvus_vector_store.py (4)
  • MilvusVectorIndex (37-259)
  • get_vector_index_info (201-223)
  • from_name (238-248)
  • clean (226-235)
hugegraph-llm/src/hugegraph_llm/indices/vector_index/qdrant_vector_store.py (4)
  • QdrantVectorIndex (30-213)
  • get_vector_index_info (165-178)
  • from_name (193-202)
  • clean (181-190)
hugegraph-llm/src/hugegraph_llm/models/embeddings/init_embedding.py (2)
  • Embeddings (26-62)
  • get_embedding (30-62)
hugegraph-llm/src/hugegraph_llm/operators/kg_construction_task.py (4)
  • build_vector_index (95-98)
  • KgBuilder (40-114)
  • chunk_split (63-70)
  • run (106-110)
hugegraph-llm/src/hugegraph_llm/utils/hugegraph_utils.py (1)
  • get_hg_client (40-47)
🔇 Additional comments (92)
hugegraph-llm/src/hugegraph_llm/models/embeddings/openai.py (1)

38-42: The get_embedding_dim implementation is clean and simple

This helper completes the abstract base class contract; it returns an instance attribute directly, with no performance or readability concerns.

hugegraph-llm/requirements.txt (1)

2-2:


❓ Verification inconclusive

Verify ollama version availability
Make sure ollama~=0.4.8 is published on PyPI, meets the project's needs, and has no known security vulnerabilities.


🏁 Script executed:

#!/bin/bash
# Check the latest ollama version on PyPI
curl -s https://pypi.org/pypi/ollama/json | jq '.info.version'

Length of output: 70


Confirmed the ollama version is published

  • File: hugegraph-llm/requirements.txt, line 2
  • Version 0.4.8 verified on PyPI; it exactly matches the declared constraint ollama~=0.4.8.
  • Run a security audit of this version to confirm there are no known vulnerabilities and that it meets the project's needs.
hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py (1)

19-21: Blank line added for readability
A blank line was added between import os and import yaml, in line with PEP 8 import grouping.

hugegraph-llm/src/hugegraph_llm/config/models/base_config.py (2)

100-105: Multi-line alignment of log call arguments
The argument list of the log.info call was split across multiple lines, making the log content clearer and more maintainable; no logic change.


114-114: Log formatting call optimized
This line merges the log output for the newly added config items into a single call, improving readability; no functional difference.

hugegraph-llm/src/hugegraph_llm/config/generate.py (2)

21-21: index_settings added to the import list
index_settings is now imported in the config generation script so the vector-index environment variables are handled correctly.


31-31: Call index_settings.generate_env()
index_settings.generate_env() was added to the --update flow to generate the .env configuration the vector index module needs. Please confirm the ordering of this step satisfies the dependencies between the settings objects.

hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/schema_manager.py (2)

36-36: Adding a type-ignore comment is the right call

The # type: ignore comment suppresses mypy's warning on an empty dict that is later populated with a specific structure, avoiding an unnecessary type error.


49-51: The refactored dict comprehension improves readability

Reformatting the dict comprehension from one line into several makes the code easier to read and maintain while keeping the same behavior.

hugegraph-llm/src/tests/indices/test_faiss_vector_index.py (3)

22-22: Correctly import the specific FaissVectorIndex class

Switching the import from the generic VectorIndex to the specific FaissVectorIndex reflects the refactor toward multiple vector database backends, matching the PR's goal of multi-vector-DB CRUD.


29-33: Test data reformatting improves readability

Splitting the test data from one line into several makes the test cases clearer and easier to read.


35-35: Correct FaissVectorIndex instantiation

Changing VectorIndex(1024) to FaissVectorIndex(1024) is consistent with the import change and correctly reflects the architectural change in the vector index implementation.

hugegraph-llm/src/hugegraph_llm/config/__init__.py (3)

19-19: index_settings correctly added to the export list

Adding index_settings to __all__ makes it importable as part of the module's public API, a necessary step for the multi-vector-DB feature.


23-23: Import IndexConfig to support vector index configuration

Importing the IndexConfig class from the new module lays the groundwork for multiple vector database backends, matching the PR's vector DB CRUD goals.


36-36: Initialize the index_settings instance

Instantiating IndexConfig as index_settings gives the system centralized vector-index configuration management, a core piece of the multi-backend support.

hugegraph-llm/src/hugegraph_llm/demo/rag_demo/app.py (1)

111-111: Type-ignore comment correctly added

Adding # type: ignore[misc] to suppress the type-check warning on the direct huge_settings.__init__() call is appropriate: reloading the configuration requires calling the initializer explicitly, in line with the PR's type-fix goals.

hugegraph-llm/src/hugegraph_llm/models/embeddings/litellm.py (4)

20-20: Import order improved, exception types first

Moving the exception imports (APIConnectionError, APIError, RateLimitError) before the embedding-related imports is a sensible reordering that follows Python code-organization best practices.


37-38: embedding_dimension parameter correctly added

The constructor now takes an embedding_dimension parameter, matching the BaseEmbedding interface and the pattern used by the other embedding classes (OpenAIEmbedding, OllamaEmbedding, etc.).


45-45: Embedding dimension correctly stored

Saving the embedding_dimension argument as an instance attribute backs the subsequent get_embedding_dim method.


47-50: Required get_embedding_dim method implemented

This implements the method defined on the BaseEmbedding abstract base class, returning the embedding dimension. It ensures LiteLLMEmbedding integrates correctly with the vector index system and provides the dimension information the multi-backend support needs.

hugegraph-llm/src/hugegraph_llm/demo/rag_demo/vector_graph_block.py (2)

24-36: Import statements reorganized for readability

Grouping the imports under hugegraph_llm.utils by function makes the structure clearer. The new vector_index_utils import matches the multi-backend vector index support introduced in this PR.


182-185: Log formatting adjusted for readability

The multi-line log message was reformatted using string concatenation instead of one long line, improving maintainability.

hugegraph-llm/src/hugegraph_llm/operators/llm_op/gremlin_generate.py (3)

21-21: Import order improved

Grouping the typing imports together improves readability.


32-32: Modernized type annotation

Changing BaseLLM = None to BaseLLM | None = None adopts the Python 3.10+ union syntax (PEP 604), expressing more clearly that the parameter may be a BaseLLM or None.


54-54: Simplified string formatting

Collapsing the multi-line string concatenation into a single f-string keeps the same output while making the code more concise and readable, in line with modern Python style.

hugegraph-llm/src/hugegraph_llm/demo/rag_demo/rag_block.py (8)

21-21: Type import cleanup

Reorganizing the type imports, in particular adding AsyncGenerator and Literal, clarifies the structure and is essential for the function signatures and type checking.


27-27: index_settings added to config imports

The import now includes index_settings, the configuration module that drives dynamic selection among the vector database backends.


77-77: Vector index argument correctly added

The query_vector_index call now passes index_settings.now_vector_index explicitly, ensuring the configured vector index type is used and improving flexibility and configurability.


80-81: Vector index argument added to keywords_to_vid

The keywords_to_vid call now takes a vector_index_str argument, so the keyword-to-vertex-ID mapping uses the correct vector index implementation, consistent with the rest of the system.


191-192: Vector index argument added in the async path

The query_vector_index call in the async streaming function also passes vector_index_str, keeping the async flow consistent with the sync flow.


193-195: Async keyword-to-VID mapping also takes the vector index argument

The keywords_to_vid call in the async streaming function passes vector_index_str as well, ensuring the configured vector index is used in both execution modes.


301-301: Simplified UI event-handler code

The multi-line graph_vector_radio.change call was collapsed to one line with a pylint-disable comment, keeping the code concise and avoiding an unnecessary lint warning.


330-332: Queue parameters set explicitly

btn.click now sets queue=True and concurrency_limit=5 explicitly, so the UI keeps working under load and concurrent requests do not cause performance problems or crashes.

hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py (3)

19-24: Imports restructured with the required modules

The imports now use explicit type declarations and bring in the VectorStoreBase interface, laying the groundwork for multi-backend support.


29-36: Constructor refactored to support multiple vector index backends

The constructor now accepts a vector_index type parameter instead of hardcoding a specific implementation, creating instances through the from_name factory method. This dependency-injection approach improves flexibility and testability.

The dimension obtained from the embedding model is used to initialize the vector index, letting the system adapt to embedding models of different dimensions.


47-48: Unified index-saving approach

Index saving now uses save_index_by_name, consistent with the new vector index interface: the index is saved by graph name and the "chunks" identifier rather than a file path.

hugegraph-llm/src/hugegraph_llm/operators/index_op/vector_index_query.py (3)

19-24: Imports updated for the new vector index abstraction

The imports now use type annotations and the new VectorStoreBase interface, improving type safety.


27-31: Constructor refactored for pluggable vector index implementations

The constructor now accepts a vector_index type parameter, allowing different vector index implementations (Faiss, Milvus, or Qdrant) to be injected at runtime.
The dimension obtained from the embedding model initializes the vector index, mirroring the pattern in BuildVectorIndex.


34-35: Strengthened input validation

An assertion now ensures the query is not None, a good defensive-programming practice that catches errors early.

hugegraph-llm/pyproject.toml (4)

41-43: Dependencies added for the new vector database support

Client libraries for Milvus and Qdrant were added and the Ollama version updated, supporting the multi-backend vector index.


91-95: mypy configuration added

The mypy static type-checker configuration disables the import-untyped error and enables checks on untyped function definitions, matching the PR's goal of enabling type checking.


97-106: ruff style configuration added

The ruff configuration sets the line length, indent width, and formatting preferences, matching the PR's goal of enforcing code style.


64-64:


❓ Verification inconclusive

APScheduler dependency added

The APScheduler library dependency was added, but the PR description does not explain why. Please confirm whether this dependency relates to the new vector database implementations or to other functionality.


🌐 Web query:

How APScheduler relates to vector database integration and scheduled tasks

💡 Result:

APScheduler is a powerful Python task-scheduling library that offers flexible ways to schedule and run timed tasks. Regarding APScheduler's integration with databases and its relationship to task scheduling, here is how it works and where it is used.

APScheduler and database integration

APScheduler can integrate with many database systems to persist and manage jobs. This matters most when jobs must keep running after a program restart.

How persistence works

APScheduler can store job information in a variety of databases, including relational databases such as SQLite, MySQL, and PostgreSQL, as well as non-relational stores such as Redis [1][3]. When a job is added to the scheduler, APScheduler serializes the job information and saves it to the configured database [5].

The persistence workflow is:

  1. A job is added to the scheduler
  2. The system computes the next run time from the specified schedule
  3. The job information is serialized and stored in the database
  4. On restart, the scheduler loads and deserializes the job information from the database and resumes execution [5][6]

Integration examples

Using SQLAlchemy to integrate a relational database:

from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.jobstores.sqlalchemy import SQLAlchemyJobStore

# Create a persistent job store (SQLite)
jobstores = {
    'default': SQLAlchemyJobStore(url='sqlite:///jobs.sqlite')
}

scheduler = BackgroundScheduler(jobstores=jobstores)

def job():
    print("Job running...")

# Add a recurring job
scheduler.add_job(job, 'interval', seconds=10)
scheduler.start()

In this example the job is stored in the jobs.sqlite database; even after a restart, it can be recovered from the database and continue running [1].

Using Redis as the job store:

from apscheduler.schedulers.tornado import TornadoScheduler
from apscheduler.jobstores.redis import RedisJobStore

jobstores = {
    'default': RedisJobStore(jobs_key='cron.jobs', run_times_key='cron.run_times',
                             host='localhost', port=6379)
}

scheduler = TornadoScheduler(jobstores=jobstores)
scheduler.start()

With Redis, jobs are stored as hashes, which makes them easy to manage and recover [3].

APScheduler's scheduling mechanism

Scheduler workflow

APScheduler's scheduling follows this flow:

  1. When a job is added, the system computes the next run time
  2. The scheduler stores the job information in a job store (memory or database)
  3. In its main loop, the scheduler keeps checking for due jobs
  4. It finds the next job to run, computes the wait time, and sleeps
  5. When the time arrives, the scheduler wakes up and runs the job [6][9]

This mechanism keeps the scheduler efficient and avoids wasting system resources on constant polling.

Job management and execution

APScheduler offers full job lifecycle management:

  • Add jobs: via the add_job method
  • Modify jobs: trigger times and arguments can be updated dynamically
  • Pause/resume jobs: execution can be paused temporarily and resumed
  • Remove jobs: jobs no longer needed can be removed from the scheduler

When a job is added or modified, the scheduler wakes up, checks whether the job should run immediately, and updates the job information in the database [9].

Practical use cases

APScheduler is widely used wherever timed tasks are needed:

  1. Periodic data processing: regular database cleanup, report generation, or backups

    # Back up the database every Sunday at 10 PM
    scheduler.add_job(backup_database, 'cron', day_of_week='sun', hour=22, id='backup_job')
  2. Scheduled notifications and reminders: sending emails, push messages, or reminder tasks

    # Send a reminder email every morning at 8 AM
    scheduler.add_job(send_email, 'cron', hour=8, minute=0, id='email_job')
  3. Task coordination in distributed systems: in multi-server environments, database integration prevents duplicate execution [1][3]

  4. Background tasks in web applications: integrating with Flask, Tornado, and other web frameworks to handle periodic API calls or data updates [10]

Although APScheduler is not designed specifically for vector databases, its flexibility lets it integrate with many database systems, including those that might store vector data. With suitable configuration, APScheduler can schedule vector-related tasks such as periodically refreshing vector indexes, running similarity searches, or optimizing vector storage layouts.

Citations:


Please document the intended use of the APScheduler dependency

The project adds to hugegraph-llm/pyproject.toml (line 64):

apscheduler = "~3.10.4"

but the PR description does not explain its purpose. APScheduler is a general-purpose Python scheduling library that can persist jobs to many databases (SQLite, MySQL, PostgreSQL, Redis, etc.). Please confirm and document:

  • Is it used for vector-database-related scheduled tasks (e.g. index refreshes, scheduled similarity searches)?
  • Or for other scheduling scenarios (database backups, data cleanup, notifications, etc.)?
  • If jobs need persistent storage, specify the JobStore type (SQLAlchemyJobStore, RedisJobStore, etc.) and how it will be configured.

Thanks!

hugegraph-llm/src/hugegraph_llm/models/embeddings/init_embedding.py (4)

19-23: Imports reordered with a more consistent style

The import statements were reordered, improving readability and consistency.


31-38: Dimension validation and parameter added for the OpenAI embedding model

An assertion on openai_embedding_model_dim ensures the required embedding dimension is present.
The dimension is passed explicitly to the OpenAIEmbedding constructor, consistent with the OpenAIEmbedding class updates in the related snippets.


39-46: Dimension validation and parameter added for the Ollama embedding model

An assertion on ollama_embedding_model_dim ensures the required embedding dimension is present.
The dimension is passed explicitly to the OllamaEmbedding constructor, consistent with the OllamaEmbedding class updates in the related snippets.


54-60: Dimension parameter added for the LiteLLM embedding model

The dimension is passed explicitly to the LiteLLMEmbedding constructor, consistent with the LiteLLMEmbedding class updates in the related snippets.

hugegraph-llm/src/hugegraph_llm/models/embeddings/qianfan.py (2)

48-51: The new embedding-dimension getter is well implemented

The get_embedding_dim implementation is simple and clear and conforms to the interface.


58-61: Batch text embedding is a welcome optimization

get_texts_embeddings batches multiple texts, which can greatly improve efficiency by reducing API calls. The implementation is clear and leverages the Qianfan API's native batching.

hugegraph-llm/src/hugegraph_llm/operators/gremlin_generate_task.py (4)

17-20: Type annotation imports cleaned up

Importing List and Optional from typing is reasonable and improves type safety and readability.


42-44: Vector index parameter type hint may confuse callers

The type hint type[VectorStoreBase] indicates vector_index should be a class rather than an instance, which may confuse callers about what to pass. If a class really is expected, document that clearly, or consider a factory pattern to create instances.

Please verify whether vector_index should be a class (as the hint says) or an instance. If a class, add clearer documentation; if an instance, fix the type hint.


72-72: Decorator changed from record_qps to record_rpm

Switching the decorator from record_qps to record_rpm presumably reflects a change in metric (queries per second to requests per minute). The change should be applied consistently wherever these decorators are used.


57-59:


❌ Incorrect review comment

Reordered parameters may break compatibility

The parameter order of GremlinExampleIndexQuery changed: vector_index is now the first argument. This could affect code that relied on the old order. Make sure every call site is updated.

Verify that all uses of GremlinExampleIndexQuery have been updated to the new parameter order. The following command searches the codebase:


🏁 Script executed:

#!/bin/bash
# Find all uses of GremlinExampleIndexQuery
grep -r "GremlinExampleIndexQuery" --include="*.py" .

Length of output: 505


🏁 Script executed:

#!/bin/bash
# Inspect the GremlinExampleIndexQuery class definition and constructor signature
sed -n '1,50p' hugegraph-llm/src/hugegraph_llm/operators/index_op/gremlin_example_index_query.py

Length of output: 2181


No compatibility issue confirmed

  • The GremlinExampleIndexQuery constructor signature is
    __init__(self, vector_index, embedding=None, num_examples=1)
  • example_index_query already passes arguments in that order: (vector_index, self.embedding, num_examples)
  • This is the only call site in the repository; there are no other stragglers

No further changes needed.

Likely an incorrect or invalid review comment.

hugegraph-llm/src/hugegraph_llm/operators/kg_construction_task.py (1)

42-42: Explicit type annotation is good practice

Adding an explicit type annotation (List[Any]) to the operators attribute improves readability and type safety.

hugegraph-llm/src/hugegraph_llm/operators/index_op/build_gremlin_example_index.py (3)

19-21: Imports improved and the vector store base class introduced

The imports now include the fuller typing set (Any, Dict, List) and the new VectorStoreBase abstract base class, following type-hinting best practices.


27-31: A good application of dependency injection

Injecting the vector_index type as a parameter instead of hardcoding a specific implementation is good dependency-injection practice, improving flexibility and testability. Changing vector_index_name to a fixed name rather than a directory path also simplifies the interface.


39-41: Vector index creation and saving improved

Using from_name and save_index_by_name instead of the previous implementation unifies the vector index operations, consistent with the vector index abstraction refactor across the codebase.

hugegraph-llm/src/hugegraph_llm/utils/decorators.py (4)

19-19: Necessary import added

Importing the time module explicitly rather than indirectly through other modules improves readability and maintainability.


77-78: Formatting consistency

Blank lines between decorator functions are preserved, keeping the formatting consistent.


81-106: Timing decorator renamed and enhanced

record_qps was renamed to record_rpm with several notable improvements (see the sketch below):

  1. Supports both sync and async functions, detected via iscoroutinefunction
  2. Measures requests per minute (RPM) instead of queries per second (QPS)
  3. Clearer log format
  4. Filters out entries below a 1 RPM threshold, reducing low-frequency log noise

These changes make the decorator more flexible and consistent with the async execution patterns introduced elsewhere.
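
A minimal sketch of a decorator with those four properties; this is an illustrative reconstruction that assumes RPM is derived from a single call's duration, not the project's exact record_rpm from utils/decorators.py:

import functools
import logging
import time
from inspect import iscoroutinefunction

log = logging.getLogger(__name__)

def record_rpm(func):
    """Log an approximate requests-per-minute rate for sync or async callables."""

    def _log_rpm(elapsed: float) -> None:
        rpm = 60.0 / elapsed if elapsed > 0 else float("inf")
        if rpm > 1:  # skip very low-frequency calls
            log.info("%s: %.1f requests/min", func.__name__, rpm)

    if iscoroutinefunction(func):
        @functools.wraps(func)
        async def async_wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = await func(*args, **kwargs)
            _log_rpm(time.perf_counter() - start)
            return result
        return async_wrapper

    @functools.wraps(func)
    def sync_wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        _log_rpm(time.perf_counter() - start)
        return result
    return sync_wrapper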


122-123: Formatting consistency

Blank lines between decorator functions are preserved, keeping the formatting consistent.

hugegraph-llm/src/hugegraph_llm/operators/graph_rag_task.py (6)

19-21: Imports updated and streamlined

The imports now include Literal, huge_settings, and prompt, making type hints more precise and config access more direct.


35-36: Decorator import updated and vector index utility added

The renamed record_rpm decorator and the new get_vector_index_class utility are imported, consistent with the other changes in the codebase.


99-126: Dynamic selection of the vector index implementation

keywords_to_vid now accepts a vector_index_str parameter and uses get_vector_index_class to resolve the appropriate vector index class. This improvement supports multiple vector database backends and makes the system more flexible.


163-178: Dynamic selection added to vector index queries

query_vector_index now accepts a vector_index_str parameter and resolves the class via get_vector_index_class, a consistent improvement that lets the whole system switch vector database backends flexibly.


244-244: Decorator updated

record_qps was updated to record_rpm, matching the utility rename.


253-253: Improved default pipeline setup

The default method chaining was streamlined, making the code cleaner and clearer.

hugegraph-llm/src/tests/indices/test_qdrant_vector_index.py (5)

1-17: Proper license header added

The new file carries the Apache license header, meeting the project's open-source standards.


19-24: Required modules imported

The test imports what it needs: unittest, pprint, and the QdrantVectorIndex and OllamaEmbedding classes.


26-32: Test scaffolding set up

setUp and tearDown are set up sensibly, giving every test a clean environment and cleaning up resources afterwards, which is testing best practice.


54-76: Index persistence test

Thoroughly tests index save and load, verifying that a persisted index still works. The test has a clear setup/act/assert structure.


78-103: Entry deletion test

Tests deleting entries from the index, including verifying the deletion count and that deleted entries no longer appear in search results. The design covers the necessary verification points.

hugegraph-llm/src/hugegraph_llm/operators/index_op/semantic_id_query.py (5)

20-26: Import restructuring and type annotation improvements

Separating the PyHugeClient and VectorStoreBase imports clarifies the structure, and the more detailed type annotations help static type checking.


33-41: Constructor parameter type improved for multi-backend support

The vector_index parameter changed from a file path string to type[VectorStoreBase], improving flexibility and supporting multiple vector index backends (Faiss, Milvus, Qdrant). This design follows the dependency-injection principle nicely, making the component easier to test and extend.


42-45: Vector index initialization improved

vector_index.from_name() replaces the old VectorIndex.from_index_file() and passes the embedding dimension explicitly, guaranteeing the embedding vectors and index dimensions agree. This is more flexible and prevents dimension mismatches.


83-85: Method call formatting improved

The argument formatting improves readability, especially for multi-argument calls, and converting dis_threshold explicitly to float strengthens type safety.


97-97: Result slicing format made consistent

The result slicing in run now matches the style of _fuzzy_match_vids, improving consistency.

hugegraph-llm/src/hugegraph_llm/models/embeddings/base.py (4)

56-65: Abstract methods reordered and a dimension getter added

Adding get_embedding_dim to the base class is a good improvement: the embedding dimension is known explicitly when creating a vector index, preventing mismatches and making the system more type safe.


66-84: Batch text-embedding method added, improving performance

The new get_texts_embeddings abstract method supports batching, a significant performance gain when processing large volumes of text. The docstring is thorough, clearly stating the purpose, parameters, and return value so implementers follow the contract correctly.


86-88: Async embedding method repositioned

Moving async_get_text_embedding after the other methods organizes the interface logically: synchronous methods first, then the async variants.


91-95: Static method signature reformatted for consistency

The similarity static method signature formatting now matches the rest of the code style, improving readability.

hugegraph-llm/src/hugegraph_llm/config/hugegraph_config.py (4)

25-31: Unnecessary Optional types removed for accuracy

Removing the unnecessary Optional annotations from these config attributes makes the types accurate: they all have defaults and are never None in practice. Note that graph_space correctly keeps its Optional[str] type, since it genuinely can be None.


33-37: Graph query config attribute types tightened

The graph query configuration types were simplified and clarified, removing redundant Optional annotations, consistent with the project-wide push for type safety.


39-41: Vector config types tightened

The stricter, clearer vector-config types help catch type-related errors during development.


43-44: Rerank config types tightened

The rerank configuration types were tightened in line with the other config attributes, improving overall type consistency.

hugegraph-llm/src/hugegraph_llm/indices/vector_index/base.py (2)

19-28: Vector store base class created, providing a unified interface

The VectorStoreBase abstract base class gives the different vector index implementations (Faiss, Milvus, Qdrant) a unified interface, letting the system swap vector database backends flexibly. The design follows programming-to-an-interface principles, improving extensibility and maintainability.


46-71: Remove and search method definitions are complete

The remove and search interfaces are fully defined with detailed docstrings explaining the parameters and return values. In particular, search takes a similarity threshold into account, so the different implementations handle similarity search consistently.

docker/Dockerfile.nk (1)

34-36: Copy only the files the runtime needs to slim the runtime layer further
Copying the full source tree into the runtime layer drags in redundant files (tests, build caches, etc.). If the runtime only depends on resource files, copy just the app.dist and resources directories instead of packaging the whole src.

hugegraph-llm/src/hugegraph_llm/models/embeddings/ollama.py (1)

48-67: The batch embedding method carries the same API and field-name risks
The availability of the batched self.client.embed(..., input=texts) call needs confirming, and the response field name is still "embeddings". If the SDK only supports one prompt per call, this implementation will fail at runtime. Suggested:

  1. First confirm whether the SDK supports batching;
  2. If not, loop over the texts or split the work into concurrent calls;
  3. Handle the response field names consistently with the single-text interface.
hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py (1)

77-83: Nested asyncio.run calls risk event-loop conflicts
BuildSemanticIndex.run() is a synchronous function, yet it calls asyncio.run(...) internally. If the caller is itself inside an event loop (e.g. FastAPI or Gradio), this raises RuntimeError: Cannot run the event loop while another loop is running. Suggested:

  • Make run() an async function and let the caller decide how to schedule it; or
  • Use nest_asyncio or similar, with caution.
    Check the calling environment before deciding how to handle the event loop; a sketch of the first option follows.
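
A hedged sketch of the first option, with the actual work factored into a coroutine; the class and method names here are hypothetical, not the project's exact code:

import asyncio

class SemanticIndexRunner:
    async def run_async(self, context):
        # The actual embedding/indexing work would be awaited here.
        return context

    def run(self, context):
        try:
            asyncio.get_running_loop()
        except RuntimeError:
            # No loop in this thread: safe to drive the coroutine ourselves.
            return asyncio.run(self.run_async(context))
        # A loop is already running (e.g. FastAPI/Gradio): make the caller await it.
        raise RuntimeError("run() called inside an event loop; await run_async() instead")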
hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py (1)

63-66: Repeatedly constructing Embedding objects adds overhead

Embeddings().get_embedding() creates a new embedding model instance every time (some models may even connect to remote services). Only the dimension is needed here, so cache the embedding:

-    vector_index_entity = vector_index.from_name(
-        Embeddings().get_embedding().get_embedding_dim(), huge_settings.graph_name, "chunks"
-    )
+    embed = Embeddings().get_embedding()
+    vector_index_entity = vector_index.from_name(
+        embed.get_embedding_dim(), huge_settings.graph_name, "chunks"
+    )

This also improves readability. Confirm that get_embedding_dim() exists on every embedding implementation.

hugegraph-llm/src/hugegraph_llm/indices/vector_index/qdrant_vector_store.py (1)

116-135: The distance-threshold computation may conflict with the COSINE metric; please confirm

Under Qdrant's COSINE metric, the score is already a similarity (higher means more similar). The code computes distance = 1 - score and compares that against the threshold, which may filter incorrectly. Verify against the documentation or by experiment; a sketch of the direct comparison follows.
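
A hedged sketch of the direct comparison, assuming hits come from qdrant-client under models.Distance.COSINE and the payload stores the property under "property" as in this PR:

from typing import Any, List

def filter_cosine_hits(hits, sim_threshold: float = 0.8) -> List[Any]:
    """Under COSINE, hit.score is already a similarity (higher = more similar),
    so compare it directly instead of computing 1 - score as a distance."""
    return [hit.payload["property"] for hit in hits if hit.score >= sim_threshold]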


imbajin commented May 23, 2025

@codecov-ai-reviewer review


codecov-ai bot commented May 23, 2025

On it! We are reviewing the PR and will provide feedback shortly.

@imbajin imbajin requested a review from Copilot May 23, 2025 07:54

Copilot AI left a comment


Pull Request Overview

This PR adds type checking and linting configurations (mypy and ruff), fixes type annotation issues, and introduces initial CRUD implementations for multiple vector databases along with associated tests. Key changes include configuration updates in the UI (using lambda functions for dynamic value loading) and the addition of new vector index implementations for Qdrant, Milvus, and Faiss.

Reviewed Changes

Copilot reviewed 42 out of 42 changed files in this pull request and generated no comments.

Show a summary per file
File Description
middleware/middleware.py Adds a type-ignore comment to the client host call.
indices/vector_index/qdrant_vector_store.py Implements Qdrant vector store with collection creation and dimension verification.
indices/vector_index/milvus_vector_store.py Implements Milvus vector store with collection recreation on dimension mismatch.
indices/vector_index/faiss_vector_store.py Refactors Faiss vector index implementation and file handling for persistence.
indices/base.py Introduces abstract base class for vector stores.
demo/rag_demo/* Updates Gradio UI components and refactors API calls to support dynamic configuration and vector index selection.
config/* Updates configuration models and initialization with new fields and type annotations.
requirements.txt & pyproject.toml Adds dependencies and configurations for pymilvus, qdrant-client, mypy, and ruff.
Comments suppressed due to low confidence (2)

hugegraph-llm/src/hugegraph_llm/indices/vector_index/milvus_vector_store.py:240

  • The error message incorrectly refers to 'Qdrant' instead of 'Milvus'. Please update it to 'Milvus host is not configured'.
assert index_settings.milvus_host, "Qdrant host is not configured"

hugegraph-llm/src/hugegraph_llm/demo/rag_demo/text2gremlin_block.py:50

  • [nitpick] Consider rephrasing the assertion message to improve clarity, for example: 'Invalid vector index configuration'.
assert vector_index, 'vector db name is error'


codecov-ai bot commented May 23, 2025

PR Description

This pull request introduces support for multiple vector database backends (Faiss, Milvus, and Qdrant) to enhance the flexibility and scalability of the knowledge graph retrieval process. It allows users to choose the most suitable vector database based on their specific requirements and infrastructure.

Click to see more

Key Technical Changes

  1. Introduced an abstract VectorStoreBase class defining a common interface for vector database operations (add, remove, search, save, load, clean).
  2. Implemented concrete classes (FaissVectorIndex, MilvusVectorIndex, QdrantVectorIndex) for each supported vector database, adhering to the VectorStoreBase interface.
  3. Added a new index_config.py to manage vector database connection settings and selection.
  4. Modified the RAG pipeline and related components to dynamically select and utilize the configured vector database.
  5. Updated the UI configuration to allow users to specify vector database connection parameters and choose the active vector database.

Architecture Decisions

The design employs an abstract base class and concrete implementations to achieve loose coupling and extensibility. This allows for easy addition of new vector database backends in the future without modifying core components. The selection of the active vector database is determined at runtime based on the index_settings configuration.

Dependencies and Interactions

This pull request introduces new dependencies on pymilvus and qdrant-client. It interacts with the existing HugeGraph client for schema management and graph data retrieval. The changes impact the RAG pipeline, text2gremlin generation, and UI configuration components.

Risk Considerations

  1. Inconsistent distance metrics between vector stores (cosine for Qdrant, L2 for Milvus) could lead to unexpected search results. This should be addressed by standardizing the metric or making it configurable.
  2. The Milvus and Qdrant implementations lack explicit connection management, which could lead to resource leaks or connection limit issues. Connection pooling or context managers should be implemented.
  3. The default values for Milvus and Qdrant configuration settings are set to None, which means the application will fail if these services are not configured but selected as the vector index. Validation or fallback logic should be added (see the sketch after this list).
  4. Ensure proper testing and validation of each vector database implementation to guarantee data consistency and search accuracy.
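
A hedged sketch of the kind of validation point 3 suggests, using index_settings field names referenced elsewhere in this PR (treat them as assumptions):

def validate_vector_backend(settings) -> None:
    # Fail fast with a clear message instead of crashing later at connect time.
    if settings.now_vector_index == "Milvus" and not settings.milvus_host:
        raise ValueError("Milvus is selected but milvus_host is not configured")
    if settings.now_vector_index == "Qdrant" and not settings.qdrant_host:
        raise ValueError("Qdrant is selected but qdrant_host is not configured")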

Notable Implementation Details

The get_vector_index_class function dynamically returns the appropriate vector index class based on the configuration. The embedding dimension is now a required parameter for embedding classes. The UI configuration has been updated to allow users to configure the vector database settings.

Comment on lines +42 to +104
"""
#TODO: finish comment
"""

@abstractmethod
def remove(self, props: Union[Set[Any], List[Any]]) -> int:
"""
Remove vectors based on their associated properties.

Args:
props (Union[Set[Any], List[Any]]): Properties of vectors to remove.

Returns:
int: Number of vectors removed.
"""

@abstractmethod
def search(self, query_vector: List[float], top_k: int, dis_threshold: float = 0.9) -> List[Any]:
"""
Search for the top_k most similar vectors to the query vector.

Args:
query_vector (List[float]): The vector to query against the index.
top_k (int): Number of top results to return.
dis_threshold (float): Distance threshold below which results are considered relevant.

Returns:
List[Any]: List of properties of the matched vectors.
"""

@abstractmethod
def save_index_by_name(self, *name: str):
"""
#TODO: finish comment
"""

@abstractmethod
def get_vector_index_info(
self,
) -> Dict:
"""
#TODO: finish comment
"""

@staticmethod
@abstractmethod
def from_name(embed_dim: int, *name: str) -> "VectorStoreBase":
"""
#TODO: finish comment
"""

@staticmethod
@abstractmethod
def exist(*name: str) -> bool:
"""
#TODO: finish comment
"""

@staticmethod
@abstractmethod
def clean(*name: str) -> bool:
"""
#TODO: finish comment

The abstract base class VectorStoreBase is missing docstrings for several methods including get_all_properties(), save_index_by_name(), get_vector_index_info(), from_name(), exist(), and clean(). Complete docstrings would improve code maintainability and ensure implementers understand the contract they need to fulfill.

Comment on lines +85 to +105

    @staticmethod
    @abstractmethod
    def from_name(embed_dim: int, *name: str) -> "VectorStoreBase":
        """
        #TODO: finish comment
        """

    @staticmethod
    @abstractmethod
    def exist(*name: str) -> bool:
        """
        #TODO: finish comment
        """

    @staticmethod
    @abstractmethod
    def clean(*name: str) -> bool:
        """
        #TODO: finish comment
        """

The method signatures for static methods from_name(), exist() and clean() use variable-length arguments (*name) which makes the API less clear. Consider using a more explicit parameter like collection_name or providing clear documentation on what these parameters represent.

Comment on lines +88 to +92
index_params = {
    "metric_type": "L2",
    "index_type": "IVF_FLAT",
    "params": {"nlist": 128},
}

In the _create_collection() method, the index parameters are hardcoded with IVF_FLAT and nlist: 128. Consider making these configurable through the index_settings to allow for different indexing strategies based on the collection size and performance requirements.

Suggested change
-index_params = {
-    "metric_type": "L2",
-    "index_type": "IVF_FLAT",
-    "params": {"nlist": 128},
-}
+index_params = {
+    "metric_type": "L2",
+    "index_type": index_settings.milvus_index_type or "IVF_FLAT",
+    "params": {"nlist": index_settings.milvus_nlist or 128},
+}

Comment on lines +56 to +58
self.client.create_collection(
    collection_name=self.name,
    vectors_config=models.VectorParams(size=self.embed_dim, distance=models.Distance.COSINE),

The Qdrant implementation uses COSINE distance, but the Milvus implementation uses L2 distance. This inconsistency in distance metrics between vector stores could lead to different search behaviors. Consider standardizing the distance metric or exposing it as a configurable parameter.

Comment on lines +35 to +36

now_vector_index: str = 'Faiss'

The now_vector_index variable defaults to 'Faiss' as a string literal. Consider using an enum for vector index types to prevent typos and enable better type checking.

Suggested change
-now_vector_index: str = 'Faiss'
+from enum import Enum
+
+class VectorIndexType(str, Enum):
+    FAISS = 'Faiss'
+    MILVUS = 'Milvus'
+    QDRANT = 'Qdrant'
+
+now_vector_index: VectorIndexType = VectorIndexType.FAISS

Comment on lines +90 to +98

def get_vector_index_class(vector_index_str: str) -> Type[VectorStoreBase]:
    mapping = {
        "Faiss": FaissVectorIndex,
        "Milvus": MilvusVectorIndex,
        "Qdrant": QdrantVectorIndex,
    }
    ret = mapping.get(vector_index_str)
    assert ret

The get_vector_index_class() function contains an assertion that will raise an AssertionError if an invalid vector_index_str is provided. This is not user-friendly. Consider raising a more descriptive exception with a helpful error message instead.

Suggested change
-def get_vector_index_class(vector_index_str: str) -> Type[VectorStoreBase]:
-    mapping = {
-        "Faiss": FaissVectorIndex,
-        "Milvus": MilvusVectorIndex,
-        "Qdrant": QdrantVectorIndex,
-    }
-    ret = mapping.get(vector_index_str)
-    assert ret
+def get_vector_index_class(vector_index_str: str) -> Type[VectorStoreBase]:
+    mapping = {
+        "Faiss": FaissVectorIndex,
+        "Milvus": MilvusVectorIndex,
+        "Qdrant": QdrantVectorIndex,
+    }
+    ret = mapping.get(vector_index_str)
+    if ret is None:
+        raise ValueError(
+            f"Unsupported vector index type: {vector_index_str}. "
+            f"Supported types are: {', '.join(mapping.keys())}"
+        )
+    return ret

Comment on lines +79 to 80
if dist < dis_threshold:
    results.append(deepcopy(self.properties[i]))

The search method in FaissVectorIndex returns deepcopies of the properties, which could be inefficient for large objects. Consider returning references unless there's a specific reason for deep copying.

Suggested change
if dist < dis_threshold:
    results.append(deepcopy(self.properties[i]))
if dist < dis_threshold:
    results.append(self.properties[i])

Comment on lines +51 to +66
connections.connect(host=host, port=port, user=user, password=password)

if not utility.has_collection(self.name):
    self._create_collection()
else:
    # dim is different, recreate
    existing_collection = Collection(self.name)
    existing_schema = existing_collection.schema
    for field in existing_schema.fields:
        if field.name == "embedding" and field.params.get("dim"):
            existing_dim = int(field.params["dim"])
            if existing_dim != self.embed_dim:
                log.debug(
                    "Milvus collection '%s' dimension mismatch: %d != %d. Recreating.",
                    self.name,
                    existing_dim,

The connection to Milvus is established in the constructor but never explicitly closed. Consider implementing a __del__ method or a context manager pattern to ensure connections are properly released.

Suggested change
connections.connect(host=host, port=port, user=user, password=password)

if not utility.has_collection(self.name):
    self._create_collection()
else:
    # dim is different, recreate
    existing_collection = Collection(self.name)
    existing_schema = existing_collection.schema
    for field in existing_schema.fields:
        if field.name == "embedding" and field.params.get("dim"):
            existing_dim = int(field.params["dim"])
            if existing_dim != self.embed_dim:
                log.debug(
                    "Milvus collection '%s' dimension mismatch: %d != %d. Recreating.",
                    self.name,
                    existing_dim,

def __del__(self):
    try:
        connections.disconnect(alias=f"{self.host}:{self.port}")
    except Exception:
        pass

Comment on lines +145 to +178
    self.collection.flush()
    return remove_num
finally:
    self.collection.release()

def search(self, query_vector: List[float], top_k: int, dis_threshold: float = 0.9) -> List[Any]:
    try:
        if self.collection.num_entities == 0:
            return []

        self.collection.load()
        search_params = {"metric_type": "L2", "params": {"nprobe": 10}}
        results = self.collection.search(
            data=[query_vector],
            anns_field="embedding",
            param=search_params,
            limit=top_k,
            output_fields=["property"],
        )

        ret = []
        for hits in results:
            for hit in hits:
                if hit.distance < dis_threshold:
                    prop_str = hit.entity.get("property")
                    prop = self._serialize_property(prop_str)
                    ret.append(prop)
                    log.debug("[✓] Add valid distance %s to results.", hit.distance)
                else:
                    log.debug(
                        "[x] Distance %s >= threshold %s, ignore this result.",
                        hit.distance,
                        dis_threshold,
                    )

In the search method, the try/finally block releases the collection but lets exceptions propagate without being logged or handled, which makes failures hard to diagnose. Consider catching and logging exceptions before the finally block runs.

Suggested change
    self.collection.flush()
    return remove_num
finally:
    self.collection.release()

def search(self, query_vector: List[float], top_k: int, dis_threshold: float = 0.9) -> List[Any]:
    try:
        if self.collection.num_entities == 0:
            return []
        self.collection.load()
        search_params = {"metric_type": "L2", "params": {"nprobe": 10}}
        results = self.collection.search(
            data=[query_vector],
            anns_field="embedding",
            param=search_params,
            limit=top_k,
            output_fields=["property"],
        )
        ret = []
        for hits in results:
            for hit in hits:
                if hit.distance < dis_threshold:
                    prop_str = hit.entity.get("property")
                    prop = self._serialize_property(prop_str)
                    ret.append(prop)
                    log.debug("[✓] Add valid distance %s to results.", hit.distance)
                else:
                    log.debug(
                        "[x] Distance %s >= threshold %s, ignore this result.",
                        hit.distance,
                        dis_threshold,
                    )
try:
    if self.collection.num_entities == 0:
        return []

    self.collection.load()
    search_params = {"metric_type": "L2", "params": {"nprobe": 10}}
    results = self.collection.search(
        data=[query_vector],
        anns_field="embedding",
        param=search_params,
        limit=top_k,
        output_fields=["property"],
    )

    ret = []
    for hits in results:
        for hit in hits:
            if hit.distance < dis_threshold:
                prop_str = hit.entity.get("property")
                prop = self._serialize_property(prop_str)
                ret.append(prop)
                log.debug("[✓] Add valid distance %s to results.", hit.distance)
            else:
                log.debug(
                    "[x] Distance %s >= threshold %s, ignore this result.",
                    hit.distance,
                    dis_threshold,
                )

    return ret
except Exception as e:
    log.error("Error during Milvus search: %s", e)
    raise
finally:
    self.collection.release()

Comment on lines +178 to +213
    }

    @staticmethod
    def clean(*name: str):
        name_str = '_'.join(name)
        client = QdrantClient(
            host=index_settings.qdrant_host, port=index_settings.qdrant_port, api_key=index_settings.qdrant_api_key
        )
        collections = client.get_collections().collections
        collection_names = [collection.name for collection in collections]
        name_str = COLLECTION_NAME_PREFIX + name_str
        if name_str in collection_names:
            client.delete_collection(collection_name=name_str)

    @staticmethod
    def from_name(embed_dim: int, *name: str) -> "QdrantVectorIndex":
        assert index_settings.qdrant_host, "Qdrant host is not configured"
        name_str = '_'.join(name)
        return QdrantVectorIndex(
            name=name_str,
            host=index_settings.qdrant_host,
            port=index_settings.qdrant_port,
            embed_dim=embed_dim,
            api_key=index_settings.qdrant_api_key,
        )

    @staticmethod
    def exist(*name: str) -> bool:
        name_str = '_'.join(name)
        client = QdrantClient(
            host=index_settings.qdrant_host, port=index_settings.qdrant_port, api_key=index_settings.qdrant_api_key
        )
        collections = client.get_collections().collections
        collection_names = [collection.name for collection in collections]
        name_str = COLLECTION_NAME_PREFIX + name_str
        return name_str in collection_names

The connection to Qdrant host is established for each operation in the static methods (clean, from_name, exist). This is inefficient and could lead to connection limit issues. Consider reusing connections or implementing a connection pool.
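A minimal sketch of one way to reuse a single client across the static helpers (the module-level _get_client helper is hypothetical, not part of this PR):

from functools import lru_cache

from qdrant_client import QdrantClient

@lru_cache(maxsize=1)
def _get_client() -> QdrantClient:
    # Built once and shared by clean()/from_name()/exist(); settings are read
    # at first use, so a settings change requires _get_client.cache_clear().
    return QdrantClient(
        host=index_settings.qdrant_host,
        port=index_settings.qdrant_port,
        api_key=index_settings.qdrant_api_key,
    )

# e.g. exist() then becomes:
@staticmethod
def exist(*name: str) -> bool:
    name_str = COLLECTION_NAME_PREFIX + '_'.join(name)
    collections = _get_client().get_collections().collections
    return name_str in [collection.name for collection in collections]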

Comment on lines +90 to +97
def build_vertex_id_semantic_index(self, vector_index: type[VectorStoreBase]):
    assert self.embedding
    self.operators.append(BuildSemanticIndex(self.embedding, vector_index))
    return self

def build_vector_index(self):
    self.operators.append(BuildVectorIndex(self.embedding))
def build_vector_index(self, vector_index: type[VectorStoreBase]):
    assert self.embedding
    self.operators.append(BuildVectorIndex(self.embedding, vector_index))

In the build_vertex_id_semantic_index and build_vector_index methods, you're asserting that self.embedding is not None, but this isn't checked in other methods that use self.embedding. Consider making embedding a required parameter in the constructor or checking it consistently across all methods.
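For example, taking the embedding at construction time removes the per-method asserts entirely; a sketch under that assumption (the PipelineBuilder name stands in for the PR's actual builder class):

class PipelineBuilder:  # illustrative name, not the class in this PR
    def __init__(self, embedding: BaseEmbedding):
        if embedding is None:
            raise ValueError("embedding is required")
        self.embedding = embedding
        self.operators: list = []

    def build_vector_index(self, vector_index: type[VectorStoreBase]) -> "PipelineBuilder":
        # No assert needed: __init__ guarantees self.embedding is set
        self.operators.append(BuildVectorIndex(self.embedding, vector_index))
        return self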

@imbajin force-pushed the main branch 4 times, most recently from 6a76d6c to fe94881 (July 28, 2025 07:34)
@fantasy-lotus self-assigned this on Aug 11, 2025