Hi Danny! This is more of an idea than an issue. Currently, when there is no article wrapper with a role of main, get_top_level_sections appears to return sections at every level of the document:
if len(section_wrappers) == 0:
    sections = soup.find_all('section')
In a book I've been working on, this meant every chapter met the criteria that trigger the warning in get_main_section, even though all but two chapters appeared to be processed correctly and in full because they had a single outermost section. This somewhat obscured the problem with the two chapters where content was actually dropped.
I was wondering whether, when no article meets the wrapper criteria, we could instead look for sections that don't have a section parent:
def get_top_level_sections(soup):
    """
    Helper utility to grab top-level sections in main <article>. Returns
    all but bibliography sections
    """
    section_wrappers = soup.find_all("article", attrs={"role": "main"})
    top_level_sections = []
    # test case for partial files, not expected in production
    if len(section_wrappers) == 0:
        sections = soup.find_all('section')
        for section in sections:
            if section.find_parent('section') is None:
                top_level_sections.append(section)
    elif len(section_wrappers) != 1:
        article = soup.find('article', attrs={"role": "main"})
        try:
            main_title = article.find('h1').get_text()
        except AttributeError:
            main_title = soup.find("h1")
        print("Warning: " +
              f"The chapter with title '{main_title}' is malformed.")
        return None, None
    else:
        main = section_wrappers[0]
        for element in main.children:
            if (
                element.name == "section" and
                element.get('id') != "bibliography"
            ):
                top_level_sections.append(element)
    return top_level_sections
This way, we wouldn't need to know where the outermost section or sections sit in the document, or at what level: we're just looking for sections that have no section above them. I tested this locally and it seemed to work pretty well with this particular book, but I haven't tested it with other books.
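To illustrate the difference, here is a small standalone sketch. The HTML fragment and the helper name sections_without_section_parent are made up for this example and are not part of the repo; it only assumes BeautifulSoup's find_all and find_parent behave as documented.

```python
from bs4 import BeautifulSoup


def sections_without_section_parent(soup):
    """Return <section> elements that have no <section> ancestor.

    Hypothetical helper: it mirrors only the fallback branch proposed above.
    """
    return [
        section
        for section in soup.find_all("section")
        if section.find_parent("section") is None
    ]


# Made-up chapter fragment with no <article role="main"> wrapper.
html = """
<div>
  <section id="intro">
    <h1>Intro</h1>
    <section id="intro-a"><h2>A</h2></section>
  </section>
  <section id="stray"><h1>Stray</h1></section>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Current fallback: every section, including the nested one.
all_ids = [s.get("id") for s in soup.find_all("section")]

# Proposed fallback: only sections with no section parent.
top_ids = [s.get("id") for s in sections_without_section_parent(soup)]

print(all_ids)
print(top_ids)
```

With this fragment, the nested intro-a section shows up in the first list but not the second, so a chapter with a single outermost section would no longer look like it has several top-level sections.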
If you think something like this would be a helpful change, I'd be happy to open a PR. Looking forward to your thoughts!