Selecting top-level sections #71

@ghyman-oreilly

Hi Danny! This is more of an idea than an issue. Currently, it looks like get_top_level_sections returns sections at every level of the document when there is no article wrapper with a role of main:

    if len(section_wrappers) == 0:
        sections = soup.find_all('section')

In a book I've been working on, this meant that every chapter met the criteria that trigger the warning in get_main_section, even though all but two chapters appeared to be processed correctly and in full, because each of them had a single outermost section. That obscured the problem with the two chapters where content was actually dropped.
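To illustrate the behavior described above, here is a minimal sketch. The HTML fragment and its ids are made up for illustration (they are not from the book in question), and it assumes BeautifulSoup with the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical chapter fragment with no <article role="main"> wrapper.
html = """
<section id="chapter">
  <h1>Chapter</h1>
  <section id="part-a"><h2>A</h2>
    <section id="part-a-1"><h3>A.1</h3></section>
  </section>
  <section id="part-b"><h2>B</h2></section>
</section>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all walks the whole tree, so nested sections are returned too.
all_sections = soup.find_all("section")
section_ids = [s.get("id") for s in all_sections]
print(section_ids)
```

Here the chapter has a single outermost section, yet four sections come back, so a check based on the number of sections returned would flag it.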

I was wondering whether, when no article meets the wrapper criteria, we could instead look for sections that aren't nested inside another section:

def get_top_level_sections(soup):
    """
    Helper utility to grab top-level sections in main <article>. Returns
    all but bibliography sections
    """
    section_wrappers = soup.find_all("article", attrs={"role": "main"})
    top_level_sections = []

    # test case for partial files, not expected in production
    if len(section_wrappers) == 0:
        sections = soup.find_all('section')

        for section in sections:
            if section.find_parent('section') is None:
                top_level_sections.append(section)
    elif len(section_wrappers) != 1:
        article = soup.find('article', attrs={"role": "main"})
        try:
            main_title = article.find('h1').get_text()
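        except AttributeError:
            # fall back to the first <h1> anywhere; use its text, not the tag
            fallback = soup.find("h1")
            main_title = fallback.get_text() if fallback else None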
        print("Warning: " +
              f"The chapter with title '{main_title}' is malformed.")
        return None, None
    else:
        main = section_wrappers[0]

        for element in main.children:
            if (
                    element.name == "section" and
                    element.get('id') != "bibliography"
               ):
                top_level_sections.append(element)

    return top_level_sections

This way, we wouldn't need to know where the outermost sections sit in the document, or at what level: we'd just be looking for sections that have no section above them. I tested this locally and it seemed to work pretty well with this particular book, but I haven't tested it with other books.
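As a rough, standalone sketch of just the filtering step (outermost_sections is a hypothetical helper name, and the two HTML fragments are invented for illustration):

```python
from bs4 import BeautifulSoup


def outermost_sections(soup):
    """Return <section> elements that have no <section> ancestor."""
    return [
        section
        for section in soup.find_all("section")
        # find_parent searches all ancestors, not only the direct parent
        if section.find_parent("section") is None
    ]


# One outermost section, nested inside a non-section wrapper.
single = BeautifulSoup(
    '<div><section id="ch"><section id="s1"></section></section></div>',
    "html.parser",
)

# Two sibling outermost sections, one of which has a child section.
multiple = BeautifulSoup(
    '<section id="a"><section id="a1"></section></section>'
    '<section id="b"></section>',
    "html.parser",
)

single_ids = [s["id"] for s in outermost_sections(single)]
multiple_ids = [s["id"] for s in outermost_sections(multiple)]
print(single_ids, multiple_ids)
```

With this filter, a chapter with one outermost section yields exactly one result regardless of how deeply it is wrapped, while a chapter with several sibling sections still yields more than one. Note that the sketch does not skip a section with id bibliography; whether the fallback branch should do so is left open.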

If you think something like this would be a helpful change, I'd be happy to open a PR. Looking forward to your thoughts!
