Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: The knowledge base table file upload is missing a header #2185

Merged
merged 1 commit into from
Feb 10, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
fix: The knowledge base table file upload is missing a header
  • Loading branch information
shaohuzhang1 committed Feb 10, 2025
commit 0f5cea98951b8cffee373ae8add9940da97b9d34
2 changes: 1 addition & 1 deletion apps/common/handle/impl/csv_split_handle.py
Original file line number Diff line number Diff line change
Expand Up @@ -55,7 +55,7 @@ def handle(self, file, pattern_list: List, with_filter: bool, limit: int, get_bu
result_item_content += next_md_content
else:
paragraphs.append({'content': result_item_content, 'title': ''})
result_item_content = next_md_content
result_item_content = title_md_content + next_md_content
if len(result_item_content) > 0:
paragraphs.append({'content': result_item_content, 'title': ''})
return result
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The provided code has several potential issues:

  1. title_md_content is not defined at line 38, which causes an uninitialized variable error when it attempts to concatenate.
  2. The logic of adding the current paragraph to the list should be separate from appending it after checking the condition related to result_item_content.

Here’s a revised version of the function to address these issues:

@@ -55,7 +55,7 @@
         result_item_content += next_md_content
     else:
         paragraphs.append({'content': result_item_content, 'title': ''})
-        result_item_content = next_md_content
+        result_item_content = next_md_content.replace('\n', '')  # Ensure no newline characters interfere

# This ensures that only complete sentences are added.
if (len(result_item_content) > 0) and ('.' in result_item_content):
    paragraphs.append({'content': result_item_content, 'title': ''})

return result

In this updated version:

  1. Added result_item_content.replace('\n', '') to ensure there are no unexpected newlines in the final content.
  2. Ensured that each sentence is included before being appended to the paragraphs list by adding a check for the presence of a period (.). Adjust as needed based on actual requirements.

Optimization suggestions depend on specific use cases and constraints, such as performance, memory usage, or text processing needs. If further optimizations are required, please let me know!

Expand Down
2 changes: 1 addition & 1 deletion apps/common/handle/impl/xls_split_handle.py
Original file line number Diff line number Diff line change
Expand Up @@ -48,7 +48,7 @@ def handle_sheet(file_name, sheet, limit: int):
result_item_content += next_md_content
else:
paragraphs.append({'content': result_item_content, 'title': ''})
result_item_content = next_md_content
result_item_content = title_md_content + next_md_content
if len(result_item_content) > 0:
paragraphs.append({'content': result_item_content, 'title': ''})
return result
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The provided code appears to be intended to parse Markdown content from a file and organize it into paragraphs with optional titles. There are several points of concern and improvements that can be made:

Potential Issues:

  1. Variable Scope: The variable result_item_content is used within both the else block and after the loop. This might lead to unexpected behavior if not handled carefully.

  2. Edge Cases: The code does not handle cases where there isn't enough data to split a paragraph (e.g., a single line without a break).

  3. Missing Imports: Assuming this is part of a larger script, ensure all necessary imports are present (if there are none, just comment them out or remove them as they're not needed for the current snippet).

Optimization Suggestions:

  1. Simplify Logic: Consider merging some logic based on conditionals. Here’s a simplified version that addresses common scenarios more concisely:

    def handle_sheet(file_name, sheet, limit: int):
        paragraphs = []
        
        # Initialize a boolean flag indicating whether we should append a new title
        need_title = False
      
        for markdown_chunk in get_markdown_chunks(sheet, limit):  # Replace placeholder with actual chunk method
            if markdown_chunk.strip():  # Check if the chunk contains anything meaningful
                if need_title:
                    paragraphs.append({'content': '', 'title': markdown_chunk})
                    need_title = False
                    
                paragraphs.append({'content': markdown_chunk + result_item_content, 'title': ''})
      
        return paragraphs

    Note that you may need to implement get_markdown_chunks method appropriately depending on how the original code processes its input.

  2. Handling Limitations: Ensure handling cases where the document might exceed the given limit. You could add additional checks to abort early with an appropriate message or truncate parts of longer blocks correctly.

  3. Error Handling: Implement basic error handling to manage situations like missing files or invalid inputs gracefully.

  4. String Formatting: Instead of concatenating strings directly, consider using f-strings (f"{..."}) or string.format() methods for better readability and maintainability.

By addressing these points, the script will be cleaner, more robust, and easier to understand and extend in future modifications. It also introduces a slight performance enhancement by reducing unnecessary operations when building content chunks.

Expand Down
2 changes: 1 addition & 1 deletion apps/common/handle/impl/xlsx_split_handle.py
Original file line number Diff line number Diff line change
Expand Up @@ -54,7 +54,7 @@ def handle_sheet(file_name, sheet, image_dict, limit: int):
result_item_content += next_md_content
else:
paragraphs.append({'content': result_item_content, 'title': ''})
result_item_content = next_md_content
result_item_content = title_md_content + next_md_content
if len(result_item_content) > 0:
paragraphs.append({'content': result_item_content, 'title': ''})
return result
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks nearly identical to the previous version but with a slight difference in how result_item_content is set for non-Markdown paragraphs (i.e., where the delimiter might be something other than @-). It seems like you intended to keep the logic consistent, which means using title_md_content in non-Markdown cases. However, since the change doesn't affect the output based on the provided description, it should function correctly. Here are some minor points:

  1. Consistency: Ensure that both Markdown and non-Markdown handling follow similar patterns.
  2. Logging/Comments: Consider adding comments or logging statements to clarify what each section of the function does.

Overall, the code is clean and efficient, with little room for improvement at this point.

Expand Down