Skip to content

fix: limit chapter title length to 256 characters in pdf_split_handle.py #2803

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 7, 2025

Conversation

shaohuzhang1
Copy link
Contributor

fix: limit chapter title length to 256 characters in pdf_split_handle.py --bug=1054363 --user=刘瑞斌 【知识库】导入PDF文档,分段标题长度超长时,没有自动截断 https://www.tapd.cn/57709429/s/1681044

--bug=1054363 --user=刘瑞斌 【知识库】导入PDF文档,分段标题长度超长时,没有自动截断 https://www.tapd.cn/57709429/s/1681044
Copy link

f2c-ci-robot bot commented Apr 7, 2025

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link

f2c-ci-robot bot commented Apr 7, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

else:
chapters.append({"title": chapter_title, "content": chapter_text if chapter_text else chapter_title})
chapters.append({"title": real_chapter_title, "content": chapter_text if chapter_text else real_chapter_title})
# 保存章节内容和章节标题
return chapters

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code appears to be functioning correctly based on initial inspection, but here are a few suggestions and corrections:

  1. Null Characters Check: The line chapter_text = chapter_text.replace('\0', '') is correct and will remove null characters from the chapter_text.

  2. Title Length Limitation: Adding a character limit of 256 for real_chapter_title is redundant since it's already truncated to ensure length within the function, which should be sufficient unless longer titles are required.

  3. Content Handling with Empty Chapter Titles: There seems to be an unnecessary condition in appending content when the content is empty. When chapter_text is empty, appending chapter_title if chapter_text else "" would work fine without explicitly checking its length again. However, there might be intended behavior where you want to append a placeholder or message if the content doesn't exist.

  4. Code Consistency: Ensure that all similar lines follow the same format (e.g., spacing around operators), for better readability.

Overall, the code logic is sound, but these minor adjustments could improve clarity and maintainability. Here’s a slightly refined version of the relevant section:

def handle_toc(doc, limit):
    # Null characters are not allowed.
    chapter_text = chapter_text.replace('\0', '')

    # Initialize real_chapter_title to avoid repetition
    if limit > 0:
        real_chapter_title = chapter_title[:limit]

    chapters = []
    
    # If the chapter title exists, add it as the key; otherwise, use 'Unknown'
    # For simplicity, let's assume that None or empty strings are meant to have no entry at all
    if chapter_title:
        # Split the text into paragraphs based on the specified limit
        split_text = PdfSplitHandle.split_text(chapter_text, limit)
        
        for text in split_text:
            chapters.append({
                "title": real_chapter_title,
                "content": text.strip()  # Strip leading/trailing whitespace from each paragraph
            })
    
    return chapters

These changes make the code cleaner and more consistent.

@liuruibin liuruibin merged commit 560890f into main Apr 7, 2025
4 of 5 checks passed
@liuruibin liuruibin deleted the pr@main@fix_limit_title branch April 7, 2025 02:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants