Skip to content

fix: limit chapter title length to 256 characters in pdf_split_handle.py #2803

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 7, 2025
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 4 additions & 3 deletions apps/common/handle/impl/pdf_split_handle.py
Original file line number Diff line number Diff line change
Expand Up @@ -173,14 +173,15 @@ def handle_toc(doc, limit):

# Null characters are not allowed.
chapter_text = chapter_text.replace('\0', '')

# 限制标题长度
real_chapter_title = chapter_title[:256]
# 限制章节内容长度
if 0 < limit < len(chapter_text):
split_text = PdfSplitHandle.split_text(chapter_text, limit)
for text in split_text:
chapters.append({"title": chapter_title, "content": text})
chapters.append({"title": real_chapter_title, "content": text})
else:
chapters.append({"title": chapter_title, "content": chapter_text if chapter_text else chapter_title})
chapters.append({"title": real_chapter_title, "content": chapter_text if chapter_text else real_chapter_title})
# 保存章节内容和章节标题
return chapters

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code appears to be functioning correctly based on initial inspection, but here are a few suggestions and corrections:

  1. Null Characters Check: The line chapter_text = chapter_text.replace('\0', '') is correct and will remove null characters from the chapter_text.

  2. Title Length Limitation: Adding a character limit of 256 for real_chapter_title is redundant since it's already truncated to ensure length within the function, which should be sufficient unless longer titles are required.

  3. Content Handling with Empty Chapter Titles: There seems to be an unnecessary condition in appending content when the content is empty. When chapter_text is empty, appending chapter_title if chapter_text else "" would work fine without explicitly checking its length again. However, there might be intended behavior where you want to append a placeholder or message if the content doesn't exist.

  4. Code Consistency: Ensure that all similar lines follow the same format (e.g., spacing around operators), for better readability.

Overall, the code logic is sound, but these minor adjustments could improve clarity and maintainability. Here’s a slightly refined version of the relevant section:

def handle_toc(doc, limit):
    # Null characters are not allowed.
    chapter_text = chapter_text.replace('\0', '')

    # Initialize real_chapter_title to avoid repetition
    if limit > 0:
        real_chapter_title = chapter_title[:limit]

    chapters = []
    
    # If the chapter title exists, add it as the key; otherwise, use 'Unknown'
    # For simplicity, let's assume that None or empty strings are meant to have no entry at all
    if chapter_title:
        # Split the text into paragraphs based on the specified limit
        split_text = PdfSplitHandle.split_text(chapter_text, limit)
        
        for text in split_text:
            chapters.append({
                "title": real_chapter_title,
                "content": text.strip()  # Strip leading/trailing whitespace from each paragraph
            })
    
    return chapters

These changes make the code cleaner and more consistent.

Expand Down