perf: Optimize word segmentation retrieval #2767
@@ -12,7 +12,9 @@
 from abc import ABC, abstractmethod
 from typing import Dict, List
 
-from django.db.models import QuerySet
+import jieba
+from django.contrib.postgres.search import SearchVector
+from django.db.models import QuerySet, Value
 from langchain_core.embeddings import Embeddings
 
 from common.db.search import generate_sql_by_query_dict
@@ -68,7 +70,8 @@ def _batch_save(self, text_list: List[Dict], embedding: Embeddings, is_the_task_
                                     source_id=text_list[index].get('source_id'),
                                     source_type=text_list[index].get('source_type'),
                                     embedding=embeddings[index],
-                                    search_vector=to_ts_vector(text_list[index]['text'])) for index in
+                                    search_vector=SearchVector(Value(to_ts_vector(text_list[index]['text'])))) for
+                          index in
                           range(0, len(texts))]
         if not is_the_task_interrupted():
             QuerySet(Embedding).bulk_create(embedding_list) if len(embedding_list) > 0 else None
There are no obvious irregularities in the provided code snippet. However, there are a few suggestions for improvement:

Here is the revised version based on these points:

from abc import ABC, abstractmethod
from typing import Dict, List

from django.db.models import QuerySet
from langchain_core.embeddings import Embeddings
import jieba
from django.contrib.postgres.search import SearchVector
from django.db.models import Q, Value


def _batch_save(self, texts: List[Dict], embeddings: List[Embeddings], is_the_task_interrupted):
    embedding_list = [(Q(source_id=text.get('source_id'), source_type=text.get('source_type')),
                       text['text']) for text in texts]

    # Filter out empty entries
    filtered_embeddings = [emb for emb in embedding_list if all(val is not None for val in emb)]

    if not is_the_task_interrupted():
        embedding_objects = list(Embedding.from_query(Q(**{' OR '.join(f"{key}={value}" for key, value in q.as_q.items()))
                                 for q, _ in filtered_embeddings))

        batch_create_data = []
        for q, text in filtered_embeddings:
            obj = embedding_objects.pop(0)
            batch_create_data.append((Value(text), obj.id,))

        if batch_create_data:
            EmbeddedObject.objects.bulk_update(batch_create_data, fields=['embedding', 'search_vector'])


# Example usage of filter out invalid entries
texts_with_invalid_entries = [{'id': 1, 'source_id': '', 'text': 'A good example'}]
embeddings_with_valid_entries = ['this is a test', 'another good one']

_text_and_embedding_pairs = zip(texts_with_invalid_entries, embeddings_with_valid_entries)
filtered_pairs = [(t, e) for t, e in _text_and_embedding_pairs if all([str(v) if v != '' else None for v in t.values()])]

print(filtered_pairs)

This should help ensure reliability and maintainability of your code.
Code Review and Suggestions

Comments:

Imports: Import analyse directly instead of using jieba.analyse. This makes it easier to understand where functions like extract_tags come from.

Functionality: In to_ts_vector, replace the steps related to keyword extraction and filtering with a direct call to jieba.lcut. Similar changes can be made in to_query: call jieba.lcut directly, without unnecessary processing.

Optimization: Reconsider the replace_word function call inside both to_ts_vector and to_query. If this function is necessary for specific purposes, consider refactoring its implementation; remove it entirely if it is no longer needed.

Character Handling: Ensure remove_chars is properly defined and referenced if it is used elsewhere in the code.

Consistency:

Here's an updated version of the cleaned-up code:

This version removes unneeded logic and focuses on leveraging Jieba's built-in functionality efficiently.
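As a rough illustration of the jieba.lcut suggestion in this review, here is a minimal sketch. It is not the reviewer's or the project's actual code. The helper `segment`, the whitespace fallback used when jieba is not installed, and the choice to join tokens with single spaces are all assumptions made for this example; only `jieba.lcut` is a real library call, and `to_ts_vector` and `to_query` are reused only as names from the PR.

```python
from typing import List

try:
    import jieba  # third-party segmenter used by the PR

    def _cut(text: str) -> List[str]:
        # jieba.lcut returns the segmented tokens as a list
        return jieba.lcut(text)
except ImportError:
    # Assumption for this sketch only: fall back to whitespace
    # splitting so the example still runs without jieba.
    def _cut(text: str) -> List[str]:
        return text.split()


def segment(text: str) -> List[str]:
    """Hypothetical helper: segment text and drop blank tokens."""
    return [token for token in _cut(text) if token.strip()]


def to_ts_vector(text: str) -> str:
    """Space-joined tokens for the stored search vector (assumed format)."""
    return " ".join(segment(text))


def to_query(text: str) -> str:
    """Space-joined tokens for the search query (assumed format)."""
    return " ".join(segment(text))
```

For Chinese input the exact tokens depend on jieba's dictionary, so no expected output is shown here. Sharing one `segment` helper keeps indexing and querying on the same tokenization, which is the main point of the suggestion.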