
fix: XLS, XLSX, CSV file upload lost data #2150

Merged
merged 1 commit into main from pr@main@fix_file_upload on Feb 7, 2025

Conversation

shaohuzhang1
Contributor

fix: XLS, XLSX, CSV file upload lost data


f2c-ci-robot bot commented Feb 7, 2025

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


f2c-ci-robot bot commented Feb 7, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -53,7 +53,7 @@ def handle(self, file, pattern_list: List, with_filter: bool, limit: int, get_bu
                 result_item_content += next_md_content
             else:
                 paragraphs.append({'content': result_item_content, 'title': ''})
-                result_item_content = ''
+                result_item_content = next_md_content
         if len(result_item_content) > 0:
             paragraphs.append({'content': result_item_content, 'title': ''})
         return result
Contributor Author


This hunk fixes the actual data loss: in the old code, when a line opened a new section, the accumulated paragraph was appended and result_item_content was reset to '', which silently discarded the current line (next_md_content). Every row that started a new paragraph was therefore missing from the parsed output — the "lost data" in the PR title.

With the fix, the boundary line is carried over as the start of the next paragraph:

  • Before: append the finished paragraph, then result_item_content = '' — the boundary line is dropped.

  • After: append the finished paragraph, then result_item_content = next_md_content — the boundary line opens the next paragraph.

The trailing if len(result_item_content) > 0: check is still needed: it flushes the final paragraph, which never reaches another boundary inside the loop.
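To make the before/after behavior concrete, here is a minimal, self-contained sketch of the accumulate-and-split loop. split_paragraphs is a hypothetical stand-in, not MaxKB's API; the real handle method takes the file and pattern arguments shown in the hunk above, and the fixed flag below only toggles between the pre-fix and post-fix reset behavior.

```python
def split_paragraphs(md_lines, fixed=True):
    """Accumulate lines into paragraphs; a '#'-prefixed line starts a new one.

    Hypothetical stand-in for the loop inside handle(). With fixed=False it
    reproduces the pre-fix bug: the boundary line is thrown away.
    """
    paragraphs = []
    result_item_content = ''
    for next_md_content in md_lines:
        if not next_md_content.startswith('#'):
            result_item_content += next_md_content + '\n'
        else:
            if result_item_content:
                paragraphs.append({'content': result_item_content, 'title': ''})
            # Pre-fix: result_item_content = '' discarded next_md_content here.
            # Post-fix: the boundary line opens the next paragraph.
            result_item_content = (next_md_content + '\n') if fixed else ''
    if len(result_item_content) > 0:
        paragraphs.append({'content': result_item_content, 'title': ''})
    return paragraphs


rows = ['a,1', 'b,2', '# Sheet 2', 'c,3']
print(split_paragraphs(rows, fixed=False))  # '# Sheet 2' is missing from every paragraph
print(split_paragraphs(rows, fixed=True))   # '# Sheet 2' starts the second paragraph
```

Running both variants on the same rows shows exactly which content the old reset dropped.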

@shaohuzhang1 shaohuzhang1 merged commit a3d6083 into main Feb 7, 2025
4 checks passed
@shaohuzhang1 shaohuzhang1 deleted the pr@main@fix_file_upload branch February 7, 2025 07:13
@@ -48,7 +48,7 @@ def handle_sheet(file_name, sheet, limit: int):
                 result_item_content += next_md_content
             else:
                 paragraphs.append({'content': result_item_content, 'title': ''})
-                result_item_content = ''
+                result_item_content = next_md_content
         if len(result_item_content) > 0:
             paragraphs.append({'content': result_item_content, 'title': ''})
         return result
Contributor Author


Same fix applied to handle_sheet in the XLS splitter: the old reset to '' dropped next_md_content at each paragraph boundary, so the first row of every section after the first was lost from the parsed output. Assigning result_item_content = next_md_content preserves that row, and the final if len(result_item_content) > 0: flush for the last paragraph is unchanged.

@@ -54,7 +54,7 @@ def handle_sheet(file_name, sheet, image_dict, limit: int):
                 result_item_content += next_md_content
             else:
                 paragraphs.append({'content': result_item_content, 'title': ''})
-                result_item_content = ''
+                result_item_content = next_md_content
         if len(result_item_content) > 0:
             paragraphs.append({'content': result_item_content, 'title': ''})
         return result
Contributor Author


And the same fix once more for the image-aware XLSX handle_sheet (the variant that takes image_dict): resetting result_item_content to '' after appending a paragraph discarded the boundary row, so content was lost at every section break. Seeding the next paragraph with next_md_content keeps it. All three handlers — handle and both handle_sheet variants — now share the same accumulate-and-carry-over pattern, so no input row is dropped at a section boundary.
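A quick way to guard against a regression of this bug is to assert the invariant the fix restores: every input row appears in some output paragraph. The sketch below uses split_paragraphs as a simplified, hypothetical stand-in for the loop shared by all three handlers (the names and the '#' boundary convention are illustrative, not MaxKB's actual API):

```python
def split_paragraphs(md_lines):
    """Post-fix split loop: a '#'-prefixed line starts a new paragraph."""
    paragraphs = []
    result_item_content = ''
    for next_md_content in md_lines:
        if not next_md_content.startswith('#'):
            result_item_content += next_md_content + '\n'
        else:
            if result_item_content:
                paragraphs.append({'content': result_item_content, 'title': ''})
            # The fix: the boundary row seeds the next paragraph instead of
            # being discarded by a reset to ''.
            result_item_content = next_md_content + '\n'
    if len(result_item_content) > 0:
        paragraphs.append({'content': result_item_content, 'title': ''})
    return paragraphs


rows = ['# Sheet1', 'name,qty', 'apples,3', '# Sheet2', 'pears,5']
paragraphs = split_paragraphs(rows)
joined = ''.join(p['content'] for p in paragraphs)
assert all(row in joined for row in rows)  # no row lost at a section boundary
```

With the pre-fix reset, this assertion fails because the '# Sheet2' row never reaches any paragraph; with the fix it passes.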
