fix: garbled zip import file names #2747
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected. Please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@@ -121,6 +131,8 @@ def handle(self, file, pattern_list: List, with_filter: bool, limit: int, get_bu
    with zip_ref.open(file) as f:
        # Process the file contents
        try:
            # Normalize the file name
            f.name = get_file_name(f.name)
            value = file_to_paragraph(f, pattern_list, with_filter, limit)
            if isinstance(value, list):
                result = [*result, *value]
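To make the reason for this change concrete: when a ZIP entry does not carry the UTF-8 flag, Python's `zipfile` decodes the member name as CP437, so a name stored in another encoding comes out garbled. The sketch below reproduces that at the string level and reverses it. The helper names `garble` and `repair` are hypothetical, and GBK is an assumed source encoding chosen for illustration; the PR itself detects the encoding rather than hard-coding it.

```python
def garble(name: str, original_encoding: str) -> str:
    """Simulate what zipfile reports: the raw name bytes decoded as CP437."""
    return name.encode(original_encoding).decode("cp437")


def repair(garbled: str, original_encoding: str) -> str:
    """Recover the raw bytes via CP437, then decode them with the real encoding."""
    return garbled.encode("cp437").decode(original_encoding)


original = "测试文档.txt"               # illustrative file name
garbled = garble(original, "gbk")       # assumed encoding: GBK
recovered = repair(garbled, "gbk")

# The garbled form differs from the original; the round trip restores it.
print(garbled != original, recovered == original)
```

The round trip is lossless because Python's CP437 codec maps every byte value to a distinct character, so encoding the garbled string back to CP437 yields exactly the bytes stored in the archive.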
Here are some suggestions for improving the quality of your code:
- Use the `with` Statement: Use context managers (like those provided by Python's built-in libraries) instead of closing files manually.
- Avoid Redundant Imports: Remove unnecessary imports, such as `urljoin`, which is only used once at line 4.
- Refactor Error Handling: Simplify error handling and logging in the `get_file_name()` function so that the program recovers gracefully when exceptions occur.
- Consistent Naming Conventions: Use consistent naming conventions across your codebase. For example, consider renaming the variable `file_name` to something more descriptive.
- Docstrings: Add docstrings to functions to explain what each one does.
- Code Optimization: Where the code checks whether a name already exists in a collection, use an actual `set`, which provides near constant-time (`O(1)`) lookups.
- String Encoding Errors: Handle string encoding errors explicitly. Raising them is preferable where the caller can respond; otherwise, make sure they cannot crash the program unexpectedly.
- Security Considerations: Be aware of security issues related to file names and input parsing. Using UTF-8 encoding instead of CP-437 might be safer, depending on your requirements.
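The encoding point in the last item can be narrowed down. The ZIP format marks UTF-8 member names with general-purpose flag bit 11 (`0x800`), and `zipfile` exposes the flags as `ZipInfo.flag_bits`, so a repair step only needs to run when that bit is unset. The following is a minimal sketch under stated assumptions, not the PR's implementation: `repaired_name` and `make_legacy_zip` are hypothetical helpers, GBK is an assumed legacy encoding, and the byte-patching in `make_legacy_zip` exists only to build a test archive.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: member name is stored as UTF-8


def repaired_name(info: zipfile.ZipInfo, legacy_encoding: str = "gbk") -> str:
    """Return the member name, re-decoding it only when the UTF-8 flag is unset."""
    if info.flag_bits & UTF8_FLAG:
        return info.filename  # zipfile already decoded this name as UTF-8
    try:
        # zipfile decoded the raw bytes as CP437; undo that and decode again.
        return info.filename.encode("cp437").decode(legacy_encoding)
    except UnicodeError:
        return info.filename  # keep the name exactly as zipfile reported it


def make_legacy_zip(name: str, encoding: str) -> bytes:
    """Build a test archive whose member name is raw `encoding` bytes, flag unset."""
    raw = name.encode(encoding)
    placeholder = "A" * len(raw)  # ASCII name with the same byte length
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(placeholder, b"hello")
    # Patch the name in both the local header and the central directory.
    return buf.getvalue().replace(placeholder.encode("ascii"), raw)


legacy = zipfile.ZipFile(io.BytesIO(make_legacy_zip("测试.txt", "gbk")))
legacy_info = legacy.infolist()[0]
print(repaired_name(legacy_info) == "测试.txt")
```

Checking the flag first avoids re-decoding names that are already correct, which is the main risk of applying a CP437 round trip to every entry.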
Below is an improved version of your code based on these suggestions:
import os
import zipfile
from typing import List

from charset_normalizer import detect
from django.db.models import QuerySet

from common.handle.base_split_handle import BaseSplitHandle


def get_image_list(result_list: list, zip_files: List[str]) -> List[str]:
    """Retrieve a list of image filenames from ZIP files."""
    return [f.filename for f in result_list]


def get_file_name(file_name):
    """Attempt to repair a file name by re-encoding it as CP437 and decoding it with the detected charset."""
    try:
        # zipfile decodes names as CP437 when the UTF-8 flag is unset, so this recovers the raw bytes
        file_name_bytes = file_name.encode('cp437')
        detected_charset = detect(file_name_bytes)['encoding']
        if not detected_charset:
            # Detection failed; keep the name unchanged
            return file_name
        return file_name_bytes.decode(detected_charset)
    except (UnicodeError, LookupError) as e:
        # Log an error or raise as needed
        print(f"Failed to decode {file_name}: {e}")
        return file_name


def filter_image_file(result_list: list, image_list: list) -> list:
    """
    Filter out images whose source file names match any item in result_list.

    Returns a new list with filtered objects.
    """
    # A set gives near constant-time membership checks
    unique_names = {r['name'] for r in result_list}
    return [img for img in image_list if img.get('name') not in unique_names]


def handle(self, file, pattern_list: List[str], with_filter: bool, limit: int, get_base_url: str = ''):
    """
    Process a file according to specified patterns and optional filtering.

    Args:
        file (object): The file object to process.
        pattern_list (list): A list of regex patterns to apply for paragraph extraction.
        with_filter (bool): Whether to apply additional filtering to paragraphs.
        limit (int): Maximum number of paragraphs to extract.
        get_base_url (str): Optional base URL for constructing full paths.
    """
    with zipfile.ZipFile(file, mode='r') as z:
        # Open each archive member, not the archive object itself
        for member in z.namelist():
            if member.endswith('/'):
                continue  # skip directory entries
            with z.open(member) as zfile:
                # Repair the member name before parsing (the change shown in the diff)
                zfile.name = get_file_name(zfile.name)
                # file_to_paragraph comes from the module under review
                value = file_to_paragraph(zfile, pattern_list, with_filter, limit)
            # Check if the parsed content should be added
            if isinstance(value, list):
                # self.logger and self.add_item are assumed to exist on the handler
                self.logger.debug("Appending items")
                self.add_item(*value)
Key Changes Made:
- Added Type Hinting: Explicit type hints improve readability and maintainability.
- Removed Unnecessary Imports: Removed unused `urllib.parse.urljoin`.
- Simplified Logic: Use `set` in the `filter_image_file` function to optimize lookup times.
- Improved Error Handling: Better encapsulation and exception logging within the `get_file_name()` function.
- Updated Function Documentation: Added docstrings explaining each function's purpose and parameters.
- Used String Methods Safely: Used `.encode()` and `.decode()` without unnecessary operations.
- Refactored Code Structure: Cleaned up indentation and whitespace for better readability.
fix: garbled zip import file names