fix: garbled zip import file names #2747

Merged 1 commit on Mar 31, 2025
12 changes: 12 additions & 0 deletions apps/common/handle/impl/zip_split_handle.py
@@ -14,6 +14,7 @@
from typing import List
from urllib.parse import urljoin

from charset_normalizer import detect
from django.db.models import QuerySet

from common.handle.base_split_handle import BaseSplitHandle
@@ -100,6 +101,15 @@ def get_image_list(result_list: list, zip_files: List[str]):
    return image_file_list


def get_file_name(file_name):
    try:
        file_name_code = file_name.encode('cp437')
        charset = detect(file_name_code)['encoding']
        return file_name_code.decode(charset)
    except Exception as e:
        return file_name


def filter_image_file(result_list: list, image_list):
    image_source_file_list = [image.get('source_file') for image in image_list]
    return [r for r in result_list if not image_source_file_list.__contains__(r.get('name', ''))]
@@ -121,6 +131,8 @@ def handle(self, file, pattern_list: List, with_filter: bool, limit: int, get_bu
with zip_ref.open(file) as f:
    # Process the file contents
    try:
        # Normalize the file name
        f.name = get_file_name(f.name)
        value = file_to_paragraph(f, pattern_list, with_filter, limit)
        if isinstance(value, list):
            result = [*result, *value]
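
Why the cp437 round trip in get_file_name recovers the name can be seen in a small stdlib-only sketch. It assumes an archive whose entry name was stored as GBK bytes without the UTF-8 flag; the helper build_zip_with_raw_name and the sample name are hypothetical, and the byte patching is only a way to fabricate such an archive for the demonstration. Charset detection is left out, so the encoding is passed explicitly.

```python
import io
import zipfile


def build_zip_with_raw_name(raw_name: bytes, payload: bytes) -> bytes:
    """Build an in-memory zip whose single entry name is stored as raw_name.

    Hypothetical helper: it writes an ASCII placeholder of the same length
    and then patches the name bytes, so the UTF-8 flag stays unset.
    """
    placeholder = "A" * len(raw_name)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr(placeholder, payload)
    # The name appears in the local header and in the central directory.
    return buffer.getvalue().replace(placeholder.encode("ascii"), raw_name)


original = "中文.txt"
data = build_zip_with_raw_name(original.encode("gbk"), b"hello")

with zipfile.ZipFile(io.BytesIO(data)) as zf:
    garbled = zf.namelist()[0]  # zipfile decoded the raw bytes as CP-437
    with zf.open(garbled) as f:
        # Re-encoding with cp437 gives back the bytes stored in the archive.
        recovered = f.name.encode("cp437").decode("gbk")
        f.name = recovered  # name is a plain attribute and can be reassigned
        fixed_name = f.name
        content = f.read()
```

Python's cp437 codec maps every byte value to a character, so the encode step is lossless for names that zipfile decoded this way.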

Here are some suggestions for improving the quality of your code:

  1. Use with Statement: Make sure to use context managers (like those provided by Python's built-in libraries) instead of manually closing files.

  2. Avoid Redundant Imports: Remove imports that are no longer used. Check whether urljoin is still referenced anywhere in this file before dropping it.

  3. Refactor Error Handling: Simplify error handling and logging in the get_file_name() function to ensure that the program can recover gracefully even when exceptions occur.

  4. Consistent Naming Conventions: Ensure consistent naming conventions across your codebase. For example, consider renaming the variable file_name to something more descriptive.

  5. Docstrings: Add docstrings to functions as appropriate to explain what each does.

  6. Code Optimization: In filter_image_file(), build a set of source file names once and test membership with "in". A set gives average O(1) lookups, whereas calling __contains__ on a list scans it on every check.

  7. String Encoding Errors: Handle encoding errors explicitly. Catching the specific exceptions (for example UnicodeError) is preferable to a blanket except Exception, as long as a name that cannot be decoded never aborts the import.

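  8. Security Considerations: Treat file names read from an archive as untrusted input. Note that the CP-437 step is deliberate: Python's zipfile decodes an entry name as CP-437 when the entry's UTF-8 flag is not set, so re-encoding with cp437 recovers the stored bytes for charset detection.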
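
Suggestion 6 can be sketched as follows. This is a minimal illustration, assuming the result and image dictionaries carry the name and source_file keys seen in the diff; the function name filter_image_file_with_set and the sample data are hypothetical.

```python
from typing import Dict, List


def filter_image_file_with_set(result_list: List[Dict], image_list: List[Dict]) -> List[Dict]:
    """Drop results whose name equals the source file of any image."""
    # Build the lookup set once; membership tests are O(1) on average.
    image_source_files = {image.get("source_file") for image in image_list}
    return [r for r in result_list if r.get("name", "") not in image_source_files]


# Hypothetical sample data for illustration only.
results = [{"name": "doc.md"}, {"name": "img/a.png"}, {"name": "notes.txt"}]
images = [{"source_file": "img/a.png"}]
kept = filter_image_file_with_set(results, images)
```

The behavior matches the list-based version in the diff; only the lookup structure changes.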

Below is an improved version of your code based on these suggestions:

import zipfile
from typing import List

from charset_normalizer import detect
from django.db.models import QuerySet

from common.handle.base_split_handle import BaseSplitHandle

def get_image_list(result_list: list, zip_files: List[str]) -> List[str]:
    """Collect the image files referenced by the parsed results."""
    # The original body is not shown in this diff and should be kept unchanged.
    ...


def get_file_name(file_name):
    """Recover a file name that zipfile decoded as CP-437 by re-decoding its raw bytes."""
    try:
        file_name_bytes = file_name.encode('cp437')
        detected_charset = detect(file_name_bytes)['encoding']
        if detected_charset is None:
            # Detection failed; keep the name as given
            return file_name
        return file_name_bytes.decode(detected_charset)
    except (UnicodeError, LookupError) as e:
        # Log an error or raise as needed
        print(f"Failed to decode {file_name}: {e}")
        return file_name


def filter_image_file(result_list: list, image_list: list) -> list:
    """
    Filter out results whose name matches the source file of any image.

    Returns a new list with the remaining results.
    """
    image_source_files = {image.get('source_file') for image in image_list}
    return [r for r in result_list if r.get('name', '') not in image_source_files]


def handle(self, file, pattern_list: List[str], with_filter: bool, limit: int, get_base_url: str = ''):
    """
    Process a file according to specified patterns and optional filtering.
    
    Args:
    - file (object): The file object to process.
    - pattern_list (list): A list of regex patterns to apply for paragraph extraction.
    - with_filter (bool): Whether to apply additional filtering to paragraphs.
    - limit (int): Maximum number of paragraphs to extract.
    - get_base_url (str): Optional base URL for constructing full paths.
    """
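    result = []
    with zipfile.ZipFile(file, mode='r') as z:
        for member in z.namelist():
            if member.endswith('/'):
                continue  # skip directory entries
            with z.open(member) as zfile:
                # Normalize the member name before parsing
                zfile.name = get_file_name(zfile.name)
                value = file_to_paragraph(zfile, pattern_list, with_filter, limit)

                # Collect the parsed paragraphs
                if isinstance(value, list):
                    result = [*result, *value]
    return result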

Key Changes Made:

  1. Added Type Hinting: Explicit type hints improve readability and maintainability.
  2. Removed Unnecessary Imports: Removed unused urllib.parse.urljoin.
  3. Simplified Logic: Use set for filter_image_file function to optimize lookup times.
  4. Improved Error Handling: Better encapsulation and exception logging within get_file_name() function.
  5. Updated Function Documentation: Added docstrings explaining each function’s purpose and parameters.
  6. Used String Methods Safely: Only .encode() and .decode() are needed in get_file_name(); an .rstrip() call with the charset name as its argument would raise a TypeError on bytes.
  7. Refactored Code Structure: Cleaned up indentation and whitespace for better readability.
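
The error-handling points above can be made concrete with a variant that does not depend on charset detection. This is a sketch under stated assumptions, not the merged implementation: the helper name recover_zip_name and the candidate list ("utf-8", "gbk") are hypothetical and should be adapted to the encodings expected in uploaded archives.

```python
from typing import Sequence


def recover_zip_name(file_name: str, candidates: Sequence[str] = ("utf-8", "gbk")) -> str:
    """Try to undo zipfile's CP-437 decoding using a fixed list of encodings.

    Hypothetical helper; returns the input unchanged when nothing fits.
    """
    if file_name.isascii():
        return file_name  # plain ASCII names are never garbled
    try:
        raw = file_name.encode("cp437")
    except UnicodeEncodeError:
        # Not representable in CP-437, so it was not decoded that way.
        return file_name
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue
    return file_name


# Simulate what zipfile returns for names stored without the UTF-8 flag.
garbled_gbk = "中文.txt".encode("gbk").decode("cp437")
garbled_utf8 = "é.txt".encode("utf-8").decode("cp437")
```

Trying UTF-8 first is a design choice: strict UTF-8 decoding rejects most byte sequences produced by other encodings, so a false match is unlikely. A name that was really written in CP-437 can still be misread if its bytes happen to be valid in a candidate encoding, which is the same limitation any heuristic of this kind has.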
