
fix: garbled zip import file names #2747


Merged
merged 1 commit into from
Mar 31, 2025

shaohuzhang1
Contributor

fix: garbled zip import file names


f2c-ci-robot bot commented Mar 31, 2025

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


f2c-ci-robot bot commented Mar 31, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -121,6 +131,8 @@ def handle(self, file, pattern_list: List, with_filter: bool, limit: int, get_bu
    with zip_ref.open(file) as f:
        # Process the file content
        try:
            # Normalize the file name
            f.name = get_file_name(f.name)
            value = file_to_paragraph(f, pattern_list, with_filter, limit)
            if isinstance(value, list):
                result = [*result, *value]
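For context, here is a minimal sketch of why member names come out garbled and how a round-trip can repair them. Python's zipfile decodes a member name as CP437 when the archive does not flag the name as UTF-8, so a name written in another encoding reads back as mojibake. GBK is assumed here purely for illustration, and recover_zip_name is a hypothetical helper, not the get_file_name from this PR (its body is not shown in the hunk).

```python
def recover_zip_name(name: str, real_encoding: str) -> str:
    """Undo the default CP437 decoding, then decode with the real encoding."""
    # CP437 maps every byte value, so encoding back restores the raw bytes.
    return name.encode("cp437").decode(real_encoding)


original = "文档.txt"

# Simulate what a reader sees when GBK bytes are decoded as CP437.
garbled = original.encode("gbk").decode("cp437")

# Reverse the wrong decoding and apply the right one.
recovered = recover_zip_name(garbled, "gbk")
```

The round-trip only works when the real encoding is known or guessed correctly, which is why the PR pairs it with detection.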
Contributor Author


Here are some suggestions for improving the quality of your code:

  1. Use with Statements: Use context managers (such as those provided by Python's built-in libraries) instead of closing files manually.

  2. Avoid Redundant Imports: Remove unnecessary imports, such as urljoin, which appears only once, at line 4.

  3. Refactor Error Handling: Simplify error handling and logging in the get_file_name() function so that the program recovers gracefully when exceptions occur.

  4. Consistent Naming Conventions: Keep naming consistent across the codebase. For example, consider renaming the variable file_name to something more descriptive.

  5. Docstrings: Add docstrings to functions to explain what each one does.

  6. Code Optimization: When checking whether a name already exists, use a set; membership tests on a set take near-constant time (O(1) on average).

  7. String Encoding Errors: Handle string encoding errors explicitly. Raising them is ideal, but at minimum make sure they cannot crash the import unexpectedly.

  8. Security Considerations: Be aware of security issues related to file names and input parsing. Preferring UTF-8 over CP-437 might be safer depending on your requirements; note that the ZIP format treats member names as CP-437 unless the archive sets its UTF-8 flag.
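Regarding point 8, one way to decide which encoding to trust is to inspect a member's general-purpose flag bits: bit 11 (0x800) marks a UTF-8 name, and zipfile exposes the flags as ZipInfo.flag_bits. The sketch below is illustrative only; name_is_flagged_utf8 is a hypothetical helper and the archive is built in memory.

```python
import io
import zipfile

UTF8_NAME_FLAG = 0x800  # general-purpose bit 11: member name is UTF-8


def name_is_flagged_utf8(info: zipfile.ZipInfo) -> bool:
    """Return True when the archive marks this member's name as UTF-8."""
    return bool(info.flag_bits & UTF8_NAME_FLAG)


# Build a small archive in memory; zipfile sets the flag for non-ASCII names.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("plain.txt", b"ascii name")
    archive.writestr("文档.txt", b"non-ascii name")

# Read it back and record which names carry the UTF-8 flag.
with zipfile.ZipFile(buffer) as archive:
    flags = {info.filename: name_is_flagged_utf8(info) for info in archive.infolist()}
```

Names without the flag are the ones that may need re-decoding; flagged names can be left as they are.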

Below is an improved version of your code based on these suggestions:

import os
import zipfile
from typing import List

from charset_normalizer import detect

from django.db.models import QuerySet

from common.handle.base_split_handle import BaseSplitHandle

def get_image_list(result_list: list, zip_files: List[str]) -> List[str]:
    """Retrieve a list of image filenames from ZIP files."""
    return [f.filename for f in result_list]


def get_file_name(file_name):
    """Re-encode a name as CP437, then decode it with the detected charset."""
    try:
        # zipfile decodes unflagged member names as CP437; recover the raw bytes.
        file_name_bytes = file_name.encode('cp437')
        # detect() may report None for the encoding; falling back to UTF-8 is an assumption.
        detected_charset = detect(file_name_bytes)['encoding'] or 'utf-8'
        return file_name_bytes.decode(detected_charset)
    except Exception as e:
        # Log an error or raise as needed
        print(f"Failed to decode {file_name}: {e}")
        return file_name


def filter_image_file(result_list: list, image_list: list) -> list:
    """
    Filter out images whose source file names match any item in result_list.
    
    Returns a new list with filtered objects.
    """
    unique_names = {r['name'] for r in result_list}
    return [img for img in image_list if img.get('name') not in unique_names]


def handle(self, file, pattern_list: List[str], with_filter: bool, limit: int, get_base_url: str = ''):
    """
    Process a file according to specified patterns and optional filtering.
    
    Args:
    - file (object): The file object to process.
    - pattern_list (list): A list of regex patterns to apply for paragraph extraction.
    - with_filter (bool): Whether to apply additional filtering to paragraphs.
    - limit (int): Maximum number of paragraphs to extract.
    - get_base_url (str): Optional base URL for constructing full paths.
    """
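    with zipfile.ZipFile(file, mode='r') as z:
        # Open each member by name; z.open() expects a member, not the archive itself.
        for name in z.namelist():
            if name.endswith('/'):
                continue  # skip directory entries
            with z.open(name) as zfile:
                zfile.name = get_file_name(zfile.name)
                value = file_to_paragraph(zfile, pattern_list, with_filter, limit)

                # Check if the parsed content should be added
                if isinstance(value, list):
                    self.logger.debug("Appending items")
                    self.add_item(*value)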
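The member loop can be tried in isolation. The following is a rough, self-contained sketch: read_members and its fix_name parameter are hypothetical names, and str.upper merely stands in for a real normalizer such as get_file_name so that the effect is visible.

```python
import io
import zipfile
from typing import Callable, Dict


def read_members(zip_bytes: bytes, fix_name: Callable[[str], str]) -> Dict[str, bytes]:
    """Return {normalized member name: content} for every file in the archive."""
    contents = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            with archive.open(info) as member:
                # Same move as the PR: overwrite the name before processing.
                member.name = fix_name(member.name)
                contents[member.name] = member.read()
    return contents


# Build a tiny archive in memory to exercise the loop.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("docs/a.txt", b"alpha")
    archive.writestr("b.txt", b"beta")

members = read_members(buffer.getvalue(), fix_name=str.upper)
```

Iterating over infolist() rather than opening the archive object itself is the part the loop depends on; the normalizer can be swapped freely.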

Key Changes Made:

  1. Added Type Hinting: Explicit type hints improve readability and maintainability.
  2. Removed Unnecessary Imports: Removed unused urllib.parse.urljoin.
  3. Simplified Logic: Use set for filter_image_file function to optimize lookup times.
  4. Improved Error Handling: Better encapsulation and exception logging within get_file_name() function.
  5. Updated Function Documentation: Added docstrings explaining each function’s purpose and parameters.
  6. Used String Methods Safely: Used .encode() and .decode() to round-trip the file name through CP437, without unnecessary operations.
  7. Refactored Code Structure: Cleaned up indentation and whitespace for better readability.
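As a dependency-free alternative to charset detection (a swapped-in technique, not what the PR or the review code uses), the error-handling idea can be sketched by trying a short list of candidate encodings in order and falling back to the original name. decode_zip_name and the candidate list ("utf-8", "gbk") are assumptions chosen for illustration.

```python
from typing import Sequence


def decode_zip_name(name: str, candidates: Sequence[str] = ("utf-8", "gbk")) -> str:
    """Try to repair a CP437-decoded member name; return it unchanged on failure."""
    try:
        raw = name.encode("cp437")
    except UnicodeEncodeError:
        # The name holds characters outside CP437, so it was not mis-decoded that way.
        return name
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate
    return name


# Simulate names that a reader decoded as CP437 by default.
gbk_garbled = "文档.txt".encode("gbk").decode("cp437")
utf8_garbled = "文档.txt".encode("utf-8").decode("cp437")

fixed_from_gbk = decode_zip_name(gbk_garbled)
fixed_from_utf8 = decode_zip_name(utf8_garbled)
```

The order of candidates matters, and a genuine CP437 name with accented characters could in principle be misread by a permissive candidate, so this is a heuristic rather than a guarantee.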

shaohuzhang1 merged commit 9750c6d into main on Mar 31, 2025
4 checks passed
shaohuzhang1 deleted the pr@main@fix_import_zip branch on March 31, 2025 08:22