-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: XLS, XLSX, CSV file upload lost data #2150
Conversation
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@@ -53,7 +53,7 @@ def handle(self, file, pattern_list: List, with_filter: bool, limit: int, get_bu | |||
result_item_content += next_md_content | |||
else: | |||
paragraphs.append({'content': result_item_content, 'title': ''}) | |||
result_item_content = '' | |||
result_item_content = next_md_content | |||
if len(result_item_content) > 0: | |||
paragraphs.append({'content': result_item_content, 'title': ''}) | |||
return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The provided Python function handle
seems to have several issues and can be optimized:
-
Potential Memory Leak: The variable
result_item_content
is not cleared when appending paragraph objects to theparagraphs
list. This could lead to excessive memory usage over multiple calls. -
Logic Flaw: If
get_buyers
is set toTrue
before finding paragraphs, there will no paragraphs added to theparagraphs
list becausenext_md_content
would always be an empty string due to theelse
block. -
Code Redundancy: The condition
if len(result_item_content) > 0:
at the end of the loop is redundant since it checks only once outside the loop, after checking all lines.
Here’s a revised version of the function with these issues addressed:
def handle(file, pattern_list: List, with_filter: bool, limit: int, get_buyers: bool)->List:
paragraphs = []
# Initialize variables
result_item_content = ''
# Read the file line by line
for i, line in enumerate(file):
if i == limit and limit != None:
break
# Check match with patterns
if (pattern := re.search(pattern, line)):
if get_buyers:
result_item_content += '\n' + pattern.group()
elif pattern_list or get_buyers:
paragraphs.append({'content': result_item_content.strip(), 'title': ''})
# Reset content for each new match or reset needed
if result_item_content.strip():
paragraphs.append({'content': result_item_content.strip(), 'title': ''})
result_item_content = ''
return paragraphs
Key Improvements:
- Memory Management: Clearing
result_item_content
inside thefor
loop ensures that memory usage is managed properly. - Condition Simplification: Removed unnecessary condition logic.
- Error Handling: Added basic error handling assumptions based on your context (e.g.,
re.search()
returnsNone
). Adjust according to your actual requirements.
This revision should work more efficiently and correctly while maintaining functionality.
@@ -48,7 +48,7 @@ def handle_sheet(file_name, sheet, limit: int): | |||
result_item_content += next_md_content | |||
else: | |||
paragraphs.append({'content': result_item_content, 'title': ''}) | |||
result_item_content = '' | |||
result_item_content = next_md_content | |||
if len(result_item_content) > 0: | |||
paragraphs.append({'content': result_item_content, 'title': ''}) | |||
return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The provided code snippet appears to be functioning correctly without any significant issues. However, there is an optimization that can improve efficiency:
- Avoid Repeated Initialization of
result_item_content
: Initializingresult_item_content
inside the loop instead of at the beginning allows you to reuse it during each iteration.
Here's the optimized version of the function:
def handle_sheet(file_name, sheet, limit: int):
paragraphs = []
result_item_content = ''
for next_md_content in sheet[limit:]:
if len(result_item_content) > 0:
result_item_content += '\n'
result_item_content += next_md_content
if len(result_item_content) > 0:
paragraphs.append({'content': result_item_content, 'title': ''})
return paragraphs
This change ensures that result_item_content
is only initialized once, reducing unnecessary operations within the loop.
@@ -54,7 +54,7 @@ def handle_sheet(file_name, sheet, image_dict, limit: int): | |||
result_item_content += next_md_content | |||
else: | |||
paragraphs.append({'content': result_item_content, 'title': ''}) | |||
result_item_content = '' | |||
result_item_content = next_md_content | |||
if len(result_item_content) > 0: | |||
paragraphs.append({'content': result_item_content, 'title': ''}) | |||
return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The provided code has an issue where result_item_content
is never reset to an empty string after a paragraph end (#
) character is encountered. This means if there's another paragraph within the same cell, it won't be included because result_item_content
remains with the last block of text.
Irregularity/Issue: Lack of resetting result_item_content
when encountering the #
character
Optimization Suggestions:
- Initialize
next_md_content = ''
outside the loop to avoid repeating this value assignment on each iteration. - You can use
.strip()
and.replace('\n\n', '\t')
to clean upresult_item_content
before checking its length to prevent extra spaces from affecting the logic.
Here is a suggested modification:
def handle_sheet(file_name, sheet, image_dict, limit: int):
paragraphs = []
result_item_content = ''
prev_is_paragraph_end = False
for row_index, row in enumerate(sheet.iter_rows(values_only=True)):
for col_ind, cell_value in enumerate(row):
# Check if it's the second '#' symbol indicating the end of a title section
if cell_value == '##':
current_content = result_item_content.strip().rstrip()
if prev_is_paragraph_end and current_content != '':
paragraphs.append({'content': current_content, 'title': ''})
result_item_content = ''
elif cell_value == '#':
prev_is_paragraph_end = True # Mark that we found a paragraph start
else:
result_item_content += f"{cell_value}\n"
# Handle remaining content at the end of the final cell
current_content = result_item_content.strip()
if prev_is_paragraph_end or current_content != '':
paragraphs.append({'content': current_content, 'title': ''})
return paragraphs
This revised version takes care to ensure result_item_content
is properly reset after reaching the end of a title section marked by two consecutive #
characters. Note that handling nested sections effectively might require further adjustments based on your specific requirements.
fix: XLS, XLSX, CSV file upload lost data