Selecting top-level sections

Hi Danny! This is more of an idea than an issue. Currently, it looks like `get_top_level_sections` returns sections at every level of the document when there is no `article` wrapper with a `role` of `main`:

```
    if len(section_wrappers) == 0:
        sections = soup.find_all('section')
```

In a book I've been working on, this meant that every chapter met the criteria that trigger the warning in `get_main_section`, even though all but two chapters appeared to be processed correctly and in full, because they had a single outermost section. That obscured the problem with the two chapters where content was actually dropped.
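To make the behavior concrete, here is a small sketch using a made-up chapter fragment (the markup and `id` values are hypothetical, not taken from the book). It only assumes BeautifulSoup's `find_all`, which searches recursively by default:

```python
from bs4 import BeautifulSoup

# Hypothetical chapter fragment with no <article role="main"> wrapper.
html = """
<section id="chapter">
  <h1>Chapter</h1>
  <section id="part-a"><h2>A</h2></section>
  <section id="part-b"><h2>B</h2>
    <section id="part-b-1"><h3>B.1</h3></section>
  </section>
</section>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() descends into nested elements, so every <section> is returned,
# not only the single outermost one.
all_ids = [section.get("id") for section in soup.find_all("section")]
print(all_ids)
```

With a fragment like this, one outermost section still yields four results, which is why a well-formed chapter could look like it has several top-level sections.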

I was wondering whether, when no article meets the wrapper criteria, we could instead look for sections that don't have a section parent:

```
def get_top_level_sections(soup):
    """
    Helper utility to grab top-level sections in main <article>. Returns
    all but bibliography sections
    """
    section_wrappers = soup.find_all("article", attrs={"role": "main"})
    top_level_sections = []

    # test case for partial files, not expected in production
    if len(section_wrappers) == 0:
        sections = soup.find_all('section')

        for section in sections:
            if section.find_parent('section') is None:
                top_level_sections.append(section)
    elif len(section_wrappers) != 1:
        article = soup.find('article', attrs={"role": "main"})
        try:
            main_title = article.find('h1').get_text()
        except AttributeError:
            main_title = soup.find("h1")
        print("Warning: " +
              f"The chapter with title '{main_title}' is malformed.")
        return None, None
    else:
        main = section_wrappers[0]

        for element in main.children:
            if (
                    element.name == "section" and
                    element.get('id') != "bibliography"
               ):
                top_level_sections.append(element)

    return top_level_sections
```

This way, we wouldn't need to know where the outermost sections are in the document, or at what level: we're just looking for the topmost sections. I tested this locally and it seemed to work well with this particular book, but I haven't tested it with other books.
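As a rough check of the idea in isolation, the parent test can be tried on a made-up fragment where the outermost sections sit at different depths. The helper name `sections_without_section_parent` and the markup are hypothetical and only for illustration:

```python
from bs4 import BeautifulSoup


def sections_without_section_parent(soup):
    """Return <section> elements that are not nested inside another <section>."""
    return [
        section
        for section in soup.find_all("section")
        # find_parent() walks up the ancestors and returns None if no match.
        if section.find_parent("section") is None
    ]


# Hypothetical fragment: two outermost sections at different nesting depths.
html = """
<div class="wrapper">
  <section id="intro"><h1>Intro</h1>
    <section id="intro-sub"><h2>Sub</h2></section>
  </section>
  <div>
    <section id="body"><h1>Body</h1></section>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
top_ids = [s.get("id") for s in sections_without_section_parent(soup)]
print(top_ids)
```

Here `intro-sub` is filtered out because it has a `section` ancestor, while `intro` and `body` are both kept even though they sit under different wrappers.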

If you think something like this would be a helpful change, I'd be happy to open a PR. Looking forward to your thoughts!
