perf: Optimize word segmentation retrieval #2767
Conversation
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
```python
    for word in word_list:
        jieba.del_word(word)
    extract_tags = jieba.lcut(text)
    result = " ".join(extract_tags)
    return result
```
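For context, the changed lines rely on two jieba calls: `jieba.del_word`, which removes an entry from jieba's in-memory dictionary, and `jieba.lcut`, which returns the full segmentation as a list. Below is a minimal standalone sketch of that pattern; the word list and sample text are invented, not taken from the PR.

```python
import jieba

# Hypothetical words to drop from the dictionary (not the PR's actual list).
word_list = ["的", "了"]
for word in word_list:
    # del_word removes the entry so it is no longer segmented as a whole word
    jieba.del_word(word)

text = "优化分词检索的性能"
# lcut returns every token in order, ready to be joined into a search string
print(" ".join(jieba.lcut(text)))
```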
Code Review and Suggestions

- **Comments:** The comments in the code are mostly clear but could benefit from more specific detail on what each function does.
- **Imports:** Import `analyse` directly instead of using `jieba.analyse`. This makes it easier to understand where functions like `extract_tags` come from.
- **Functionality:** In `to_ts_vector`, replace the keyword-extraction and filtering steps with a direct call to `jieba.lcut`. In `to_query`, make the same change: call `jieba.lcut` directly without unnecessary processing.
- **Optimization:** Replace the entire `replace_word` function call inside both `to_ts_vector` and `to_query`. If this function is necessary for specific purposes, consider refactoring its implementation, or remove it entirely if it is no longer needed.
- **Character Handling:** Ensure that `remove_chars` is properly defined and referenced if it is used elsewhere in the code.
- **Consistency:** Keep text preprocessing (e.g., punctuation stripping) consistent across components.
Here's an updated version of the cleaned-up code:
```python
import jieba

# Constants and caching
jieba_word_list_cache = [chr(item) for item in range(38, 84)]


def get_key_by_word_dict(key, word_dict):
    raise NotImplementedError("Not implemented yet")


def to_ts_vector(text: str):
    # Clean up text and split into words
    words = jieba.lcut(text)
    return " ".join(words)


def to_query(text: str):
    # Split text into words
    words = jieba.lcut(text)
    return " ".join(words)
```
This version removes unneeded logic and focuses on leveraging Jieba’s built-in functionality efficiently.
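To make the contrast concrete, here is a small hypothetical comparison of the two jieba calls discussed above; the sample sentence is invented, and the comments describe jieba's documented default behavior.

```python
import jieba
from jieba import analyse

text = "使用 PostgreSQL 的全文检索查询知识库文档"

# lcut returns every token in reading order, which is what a ts_vector /
# ts_query string needs
print(jieba.lcut(text))

# extract_tags only keeps the top-K keywords ranked by TF-IDF and drops the rest
print(analyse.extract_tags(text, topK=5))
```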
```diff
@@ -68,7 +70,8 @@ def _batch_save(self, text_list: List[Dict], embedding: Embeddings, is_the_task_
                   source_id=text_list[index].get('source_id'),
                   source_type=text_list[index].get('source_type'),
                   embedding=embeddings[index],
-                  search_vector=to_ts_vector(text_list[index]['text'])) for index in
+                  search_vector=SearchVector(Value(to_ts_vector(text_list[index]['text'])))) for
+                  index in
                   range(0, len(texts))]
         if not is_the_task_interrupted():
             QuerySet(Embedding).bulk_create(embedding_list) if len(embedding_list) > 0 else None
```
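For readers unfamiliar with the expression on the new line, here is a minimal, self-contained Django sketch of how `SearchVector(Value(...))` can populate and query a `SearchVectorField`. The `Note` model and helper functions are hypothetical and not part of this PR, which instead attaches the vector directly to the `Embedding` rows it bulk-creates.

```python
from django.contrib.postgres.search import SearchQuery, SearchVector, SearchVectorField
from django.db import models
from django.db.models import Value


class Note(models.Model):
    text = models.TextField()
    search_vector = SearchVectorField(null=True)


def index_note(note: Note, segmented_text: str):
    # SearchVector(Value(...)) lets PostgreSQL build the tsvector from a
    # precomputed, space-separated segmentation instead of from a column.
    Note.objects.filter(pk=note.pk).update(
        search_vector=SearchVector(Value(segmented_text)))


def search_notes(segmented_query: str):
    # With a stored SearchVectorField, filtering against a SearchQuery uses
    # PostgreSQL full-text search.
    return Note.objects.filter(search_vector=SearchQuery(segmented_query))
```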
There are no obvious irregularities in the provided code snippet. However, there are a few suggestions for improvement:

- **Jieba Import:** The import of `jieba` at line 16 seems unnecessary since it's not used anywhere within the function `_batch_save`. You can remove this line to simplify the code.
- **Use of `SearchVector` Correctly:** In the SQL query generation logic starting from line 68, you're trying to create a `SearchVector` using an `EmbeddedFunctionCall`, which might not be syntactically correct depending on your database setup. Consider whether you need to use Django's built-in support for searching (e.g., PostgreSQL's full-text search capabilities).
- **Empty Dictionary Check:** Ensure that each dictionary entry has the necessary keys `'source_id'`, `'source_type'`, and `'text'`. Using a conditional check like `index < len(text_list)` instead of assuming `len(text_list) == len(texts)` is more robust, especially if other parameters might change in future versions.
- **Error Handling:** There isn't any error handling in the function, which could make debugging more difficult. Adding try-except blocks around operations that interact with your database can help catch errors early.
Here is the revised version based on these points:
```python
from typing import Dict, List

from django.contrib.postgres.search import SearchVector
from django.db.models import QuerySet, Value


def _batch_save(self, texts: List[Dict], embeddings: List[List[float]], is_the_task_interrupted):
    # Embedding and to_ts_vector come from the surrounding module.
    # Keep only entries that carry the keys the model needs, and never read
    # past the shorter of the two input lists.
    valid_indexes = [index for index in range(0, min(len(texts), len(embeddings)))
                     if texts[index].get('source_id')
                     and texts[index].get('source_type')
                     and texts[index].get('text')]
    embedding_list = [Embedding(source_id=texts[index]['source_id'],
                                source_type=texts[index]['source_type'],
                                embedding=embeddings[index],
                                search_vector=SearchVector(Value(to_ts_vector(texts[index]['text']))))
                      for index in valid_indexes]
    if not is_the_task_interrupted() and len(embedding_list) > 0:
        QuerySet(Embedding).bulk_create(embedding_list)


# Example: filter out entries with missing or empty values before saving
texts_with_invalid_entries = [{'id': 1, 'source_id': '', 'text': 'A good example'}]
embeddings_with_valid_entries = ['this is a test', 'another good one']
_text_and_embedding_pairs = zip(texts_with_invalid_entries, embeddings_with_valid_entries)
filtered_pairs = [(t, e) for t, e in _text_and_embedding_pairs
                  if all(v not in ('', None) for v in t.values())]
print(filtered_pairs)
```
This should help ensure reliability and maintainability of your code.
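Picking up the error-handling suggestion, here is a small sketch of what a guarded bulk insert could look like. `save_embeddings` is an invented helper name, and the model class is passed in rather than imported, so treat this as an illustration rather than the project's actual code.

```python
import logging

from django.db import DatabaseError
from django.db.models import QuerySet

logger = logging.getLogger(__name__)


def save_embeddings(model_cls, embedding_list, is_the_task_interrupted):
    """Bulk-create prepared model rows, surfacing database failures."""
    if is_the_task_interrupted() or not embedding_list:
        return
    try:
        QuerySet(model_cls).bulk_create(embedding_list)
    except DatabaseError:
        # Log enough context to debug the failed batch, then re-raise so the
        # caller can decide whether to retry or abort the indexing task.
        logger.exception("bulk_create of %d rows failed", len(embedding_list))
        raise
```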