Skip to content

feat: Knowledge base generation problem #2760

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 1, 2025

Conversation

shaohuzhang1
Copy link
Contributor

feat: Knowledge base generation problem

Copy link

f2c-ci-robot bot commented Apr 1, 2025

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link

f2c-ci-robot bot commented Apr 1, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

generate_related_by_dataset_id.delay(dataset_id, model_id, prompt, state_list)
except AlreadyQueued as e:
raise AppApiException(500, _('Failed to send the vectorization task, please try again later!'))

def list_application(self, with_valid=True):
if with_valid:
self.is_valid(raise_exception=True)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code you provided contains some general improvements and optimizations that can be made:

  1. Imports: The Import statement for QuerySet, Reverse, Substr should be at the top of each function where it is used.

  2. Docstring Style: It's good practice to have docstrings for classes and methods, especially when they're complex or perform multiple actions. Ensure all functions have clear explanations.

  3. Function Separation: You might consider separating the generate_related and other related logic into separate modules or within a more logical class hierarchy.

  4. Error Handling: The exception handling in re_embedding already uses AlreadyQueued. If there are any specific exceptions or error messages needed, consider adding them.

  5. Variable Naming: Ensure variable names are descriptive and adhere to PEP 8 standards. For instance, use state_list instead of 'state_list'.

Here’s an improved version of the relevant section of code with these suggestions:

from django.core import validators
from django.db import transaction, models
from django.db.models import QuerySet
from django.db.models.functions import Reverse, Substr

from drf_yasg import openapi
from rest_framework import serializers

from dataset.models import Dataset, Document, Paragraph, Problem, Type, ProblemParagraphMapping, TaskType, \
    State, File, Image
from dataset.serializers.common_serializers import list_paragraph, MetaSerializer, ProblemParagraphManage, \
    get_embedding_model_by_dataset_id, get_embedding_model_id_by_dataset_id, write_image, zip_dir
from dataset.serializers.document_serializers import DocumentSerializers, DocumentInstanceSerializer
from.dataset.task import sync_web_dataset, sync_replace_web_dataset, generate_related_by_dataset_id
from embedding.models import SearchMode
from embedding.task import embedding_by_dataset, delete_embedding_by_dataset
from setting.models import AuthOperate, Model


class YourClassName(serializers.Serializer):
    data = serializers.DictField(required=True)

    class Meta:
        fields = ('data')

    def re_embedding(self, with_valid=True):
        if with_valid:
            self.is_valid(raise_exception=True)
        dataset_id = self.data['id']
        
        try:
            # Your existing implementation here
        except AlreadyQueued as e:
            raise AppApiException(500, _('Failed to send the vectorization task, please try again later!'))

    def generate_related(self, with_valid=True):
        if with_valid:
            self.is_valid(raise_exception=True)
            GenerateRelatedSerializer.data = self.validated_data
        
        dataset_id = self.validated_data.get('id')
        model_id = self.validated_data.get("model_id")
        prompt = self.validated_data.get("prompt")
        state_list = [s.lower() for s in self.validated_data.get('state_list')]
        
        ListenerManagement.update_status(
            queryset=Document.objects.filter(dataset_id=dataset_id),
            task_type=TaskType.GENERATE_PROBLEM,
            new_state=State.PENDING
        )
        
        ListenerManagement.update_status(
            queryset=(
                Paragraph.objects.annotate(
                    status_reverse=Reverse('status'),
                    task_type_status=Substr('status_reverse', TaskType.GENERATE_PROBLEM.value, 1)
                )
                .filter(task_type_status='pending', dataset_id=dataset_id)
                .values('id')
            ),
            task_type=TaskType.GENERATE_PROBLEM,
            new_state=State.PENDING
        )

        ListenerManagement.get_aggregation_document_status_by_dataset_id(dataset_id)()
        
        try:
            generate_related_by_dataset_id.delay(dataset_id, model_id, prompt, tuple(state_list))
        except AlreadyQueued as e:
            raise AppApiException(500, _('Failed to send the vectorization task, please try again later!'))


    def list_application(self, with_valid=True):
        if with_valid:
            self.is_valid(raise_exception=True)

Make sure to adapt the class name (YourClassName) and adjust imports and method names according to the context of your application.

items=openapi.Schema(type=openapi.TYPE_STRING),
title=_('state list'))
}
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The provided code seems to be well-written and follows Python best practices. However, there are a few minor improvements that can be made:

  1. Type Hinting: The use of List from the typing module is recommended for better readability and clarity.

  2. Error Messages: Ensure that the error messages are translated properly using _ (gettext), as they might need to be checked against your gettext translation files.

  3. API Documentation: The generate_related_serializer.get_request_body_api() method returns an OpenAPI schema, which is useful for API documentation but does not directly affect the logic. It should be documented within comments if necessary.

    # Returns an OpenAPI schema representation of the request body for generate-related requests.

Here's the updated version with some additional comments:

@@ -222,3 +222,27 @@ def get_embedding_model_id_by_dataset_id_list(dataset_id_list: List[str]:
     """Retrieve the embedding mode ID based on dataset IDs."""
     if len(dataset_list) == 0:
         raise Exception(_(u"Knowledge base setting error, please reset the knowledge base"))
     return str(dataset_list[0].embedding_mode_id)

+
+import typing
+from rest_framework import serializers
+from drf_yasg.utils import swagger_auto_schema
+from core.constants import ErrMessage
+
+class GenerateRelatedSerializer(ApiMixin, serializers.Serializer):
+    """
+    Serializer class for the related generation API endpoint.
+    
+    Fields:
+       model_id (UUIDField): Required UUID representing the AI model ID.
+       prompt (CharField): Required string prompting the model to generate text.
+       state_list (ListField): Optional list of strings indicating states for further processing.
+       
+    Methods:
+       get_request_body_api() -> openapi.Schema: Returns an OpenAPI Schema instance describing the request format.
+    """
+
+    model_id = serializers.UUIDField(required=True,
+                                      error_messages=ErrMessage.uuid(_("Model id")))
         
+    prompt = serializers.CharField(required=True,
+                                   error_messages=ErrMessage.uuid(_("Prompt word")))
     
     

Summary of Recommendations:

  • Use List[str] instead of just List.
  • Ensure proper translation of error messages.
  • Add docstrings to explain the purpose and usage of the methods and fields.

return result.success(
DataSetSerializers.Operate(data={'id': dataset_id, 'user_id': request.user.id}).generate_related(
request.data))

class Export(APIView):
authentication_classes = [TokenAuth]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The provided code is already well-structured and does not seem to have any obvious issues. However, there are a few minor suggestions for optimization:

  1. Method Naming: Use more descriptive method names. For example, operate can be renamed to something like execute_operate.

  2. Docstrings: Ensure that all methods have clear docstrings explaining their purpose. This will help other developers understand the code better.

  3. Error Handling: Consider adding try-except blocks around API calls to handle exceptions gracefully.

  4. Logging: While current logging seems fine, consider whether you need more detailed logs or additional log levels if needed.

Here's an updated version of the code with some of these improvements:

from django.http import HttpResponse
from rest_framework.response import Response
from django.utils.decorators import method_decorator

# Import necessary modules here for improved readability

class DataSetView(APIView):
    # ... (rest of the codes remains unchanged)

class GenerateRelated(APIView):
    authentication_classes = [TokenAuth]

    @action(methods=['PUT'], detail=False)
    @swagger_auto_schema(
        operation_summary=_('Generate Related Documents'),
        operation_id=_('Generate Related Documents'),
        manual_parameters=[DataSetSerializers.Operate.get_request_params_api()],
        request_body=GenerateRelatedSerializer.get_request_body_api(),
        tags=[_('Knowledge Base')]
    )
    @log(menu='document', operate="Generate related documents", 
         get_operation_object=lambda r, keywords: get_dataset_operation_object(keywords.get('dataset_id')))
    def put(self, request: Request, dataset_id: str) -> HttpResponse:
        data = {
            'id': dataset_id,
            'user_id': request.user.id
        }
        
        try:
            return response.success(DataSetSerializers.Execute(data=data).generate_related(request.data))
        except Exception as e:
            logger.error(f"Failed to generate related documents for dataset {dataset_id}: {e}")
            return HttpResponse("An error occurred while generating related documents.", status=500)

class Export(APIView):
    authentication_classes = [TokenAuth]
    
    # ... (rest of the codes remains unchanged)

These changes aim to make the code cleaner and potentially easier to maintain or update in the future. Let me know if further assistance is needed!

@shaohuzhang1 shaohuzhang1 merged commit 4b9cecd into main Apr 1, 2025
4 checks passed
@shaohuzhang1 shaohuzhang1 deleted the pr@main@feat_dataset_generate_related branch April 1, 2025 04:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant