
fix: XLS, XLSX, CSV file upload lost data #2150

Merged
merged 1 commit into main from pr@main@fix_file_upload on Feb 7, 2025

Conversation

shaohuzhang1
Contributor

fix: XLS, XLSX, CSV file upload lost data


f2c-ci-robot bot commented Feb 7, 2025

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


f2c-ci-robot bot commented Feb 7, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -53,7 +53,7 @@ def handle(self, file, pattern_list: List, with_filter: bool, limit: int, get_bu
                 result_item_content += next_md_content
             else:
                 paragraphs.append({'content': result_item_content, 'title': ''})
-                result_item_content = ''
+                result_item_content = next_md_content
         if len(result_item_content) > 0:
             paragraphs.append({'content': result_item_content, 'title': ''})
         return result
Contributor Author


This hunk fixes the actual data loss: in the old code, when a line opened a new section, the accumulated paragraph was appended and result_item_content was reset to '', which silently discarded the current line (next_md_content). Every row that started a new paragraph was therefore missing from the parsed output — the "lost data" in the PR title.

With the fix, the boundary line is carried over as the start of the next paragraph:

  • Before: append the finished paragraph, then result_item_content = '' — the boundary line is dropped.

  • After: append the finished paragraph, then result_item_content = next_md_content — the boundary line opens the next paragraph.

The trailing if len(result_item_content) > 0: check is still needed: it flushes the final paragraph, which never reaches another boundary inside the loop.
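To make the before/after behavior concrete, here is a minimal, self-contained sketch of the accumulate-and-split loop. split_paragraphs is a hypothetical stand-in, not MaxKB's API; the real handle method takes the file and pattern arguments shown in the hunk above, and the fixed flag below only toggles between the pre-fix and post-fix reset behavior.

```python
def split_paragraphs(md_lines, fixed=True):
    """Accumulate lines into paragraphs; a '#'-prefixed line starts a new one.

    Hypothetical stand-in for the loop inside handle(). With fixed=False it
    reproduces the pre-fix bug: the boundary line is thrown away.
    """
    paragraphs = []
    result_item_content = ''
    for next_md_content in md_lines:
        if not next_md_content.startswith('#'):
            result_item_content += next_md_content + '\n'
        else:
            if result_item_content:
                paragraphs.append({'content': result_item_content, 'title': ''})
            # Pre-fix: result_item_content = '' discarded next_md_content here.
            # Post-fix: the boundary line opens the next paragraph.
            result_item_content = (next_md_content + '\n') if fixed else ''
    if len(result_item_content) > 0:
        paragraphs.append({'content': result_item_content, 'title': ''})
    return paragraphs


rows = ['a,1', 'b,2', '# Sheet 2', 'c,3']
print(split_paragraphs(rows, fixed=False))  # '# Sheet 2' is missing from every paragraph
print(split_paragraphs(rows, fixed=True))   # '# Sheet 2' starts the second paragraph
```

Running both variants on the same rows shows exactly which content the old reset dropped.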

@shaohuzhang1 shaohuzhang1 merged commit a3d6083 into main Feb 7, 2025
4 checks passed
@shaohuzhang1 shaohuzhang1 deleted the pr@main@fix_file_upload branch February 7, 2025 07:13
@@ -48,7 +48,7 @@ def handle_sheet(file_name, sheet, limit: int):
                 result_item_content += next_md_content
             else:
                 paragraphs.append({'content': result_item_content, 'title': ''})
-                result_item_content = ''
+                result_item_content = next_md_content
         if len(result_item_content) > 0:
             paragraphs.append({'content': result_item_content, 'title': ''})
         return result
Contributor Author


Same fix applied to handle_sheet in the XLS splitter: the old reset to '' dropped next_md_content at each paragraph boundary, so the first row of every section after the first was lost from the parsed output. Assigning result_item_content = next_md_content preserves that row, and the final if len(result_item_content) > 0: flush for the last paragraph is unchanged.

@@ -54,7 +54,7 @@ def handle_sheet(file_name, sheet, image_dict, limit: int):
                 result_item_content += next_md_content
             else:
                 paragraphs.append({'content': result_item_content, 'title': ''})
-                result_item_content = ''
+                result_item_content = next_md_content
         if len(result_item_content) > 0:
             paragraphs.append({'content': result_item_content, 'title': ''})
         return result
Contributor Author


And the same fix once more for the image-aware XLSX handle_sheet (the variant that takes image_dict): resetting result_item_content to '' after appending a paragraph discarded the boundary row, so content was lost at every section break. Seeding the next paragraph with next_md_content keeps it. All three handlers — handle and both handle_sheet variants — now share the same accumulate-and-carry-over pattern, so no input row is dropped at a section boundary.
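A quick way to guard against a regression of this bug is to assert the invariant the fix restores: every input row appears in some output paragraph. The sketch below uses split_paragraphs as a simplified, hypothetical stand-in for the loop shared by all three handlers (the names and the '#' boundary convention are illustrative, not MaxKB's actual API):

```python
def split_paragraphs(md_lines):
    """Post-fix split loop: a '#'-prefixed line starts a new paragraph."""
    paragraphs = []
    result_item_content = ''
    for next_md_content in md_lines:
        if not next_md_content.startswith('#'):
            result_item_content += next_md_content + '\n'
        else:
            if result_item_content:
                paragraphs.append({'content': result_item_content, 'title': ''})
            # The fix: the boundary row seeds the next paragraph instead of
            # being discarded by a reset to ''.
            result_item_content = next_md_content + '\n'
    if len(result_item_content) > 0:
        paragraphs.append({'content': result_item_content, 'title': ''})
    return paragraphs


rows = ['# Sheet1', 'name,qty', 'apples,3', '# Sheet2', 'pears,5']
paragraphs = split_paragraphs(rows)
joined = ''.join(p['content'] for p in paragraphs)
assert all(row in joined for row in rows)  # no row lost at a section boundary
```

With the pre-fix reset, this assertion fails because the '# Sheet2' row never reaches any paragraph; with the fix it passes.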
