Skip to content

Conversation

@Antoinelfr
Copy link
Collaborator

@Antoinelfr Antoinelfr commented May 22, 2025

Description

This PR includes:

  • Title text cleaning from new lines and weird formatting.
  • Fix the author formatting in the reference.
  • Change manually \xa0 to space.
  • Fix a bug where the section title was empty due to the title tag being outside the section tag.
  • Fix the keywords section to exclude the title from the text, remove non-English keywords, take multiple lists as opposed to the first one before, and exclude the abbreviation list.
  • Remove tables and figures from the front and back tags.
  • Add IOA IDs for all passages and fixed the "document part" allocation.
  • Fix a bug where chunks of text were not identified because they were not in section tags.
  • Improve the filtering of text artefacts from unwanted sections.

Fixes #190

Type of change

  • Documentation (non-breaking change that adds or improves the documentation)
  • New feature (non-breaking change which adds functionality)
  • Optimization (non-breaking, back-end change that speeds up the code)
  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (whatever its nature)

Key checklist

  • All tests pass (eg. pytest)
  • The documentation builds and looks OK (eg. mkdocs)
  • Pre-commit hooks run successfully (eg. pre-commit run --all-files)

Further checks

  • Code is commented, particularly in hard-to-understand areas
  • Tests added or an issue has been opened to tackle that in the future. (Indicate issue here: # (issue))

@codecov
Copy link

codecov bot commented May 22, 2025

Codecov Report

Attention: Patch coverage is 33.33333% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
autocorpus/file_processing.py 33.33% 2 Missing ⚠️
Files with missing lines Coverage Δ
autocorpus/parse_xml.py 37.10% <ø> (ø)
autocorpus/file_processing.py 50.90% <33.33%> (ø)

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Antoinelfr
Copy link
Collaborator Author

Before merging, let me look at the Windows test failing. I will try to implement the IOA from the main AC as well to avoid redundancy. I also noticed a weird character encoding in the AC paper parsing as opposed to the HTML I will take a look at this one as well.

Copy link
Collaborator

@alexdewar alexdewar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice cleanup and having a test for the XML processing is definitely an improvement. Good work!

The reason the tests are failing on the Windows runners currently is because there are a few places where you're calling open() without setting encoding="utf-8" (the default text encoding used by Python on Windows isn't UTF8 for reasons that are obscure and bad).

I've taken this as an opportunity to give this code a bit of an audit. I've made some small suggestions in comments (in case you haven't done this before, there is a button to commit suggestions directly). Commit the ones you like and ignore the ones you don't, then re-request a review from me.

I've also got some comments about the general structure of the code, though I'm not saying that you should fix this now -- just something to think about as you go forward.
Firstly, I'd strongly suggest breaking apart the convert_xml_to_json function into smaller, more manageable chunks. It's currently doing way too much. There's also quite a lot of code duplicated between if and else branches that could be extracted into separate functions. Lots of copy-pasted code is a bit of a maintenance headache because a) it makes the code harder to read and reason about and b) it's easy to fix a bug in one part of the code and forget to update all the other copy-pasted versions of it.

year_xml = ""
else:
# If 'accepted' date is missing, assign an empty string
year_xml = ""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd split all these little checks and bits for cleaning the text into separate functions for readability + testability. For example, this one could be something like this:

def get_year(soup: BeautifulSoup) -> str:
    # Check if the 'accepted' date is found within 'date', and if it contains a 'year' tag
    ### no check for unicode or hexacode or XML tags
    if date := soup.find("date", {"date-type": "accepted"}):
        if year := date.find("year"):
            # Extract the text content of the 'year' tag if found
            return year.text

    # If 'accepted' date or 'year' is missing, return empty string
    return ""

(I'm not saying you have to do that on this PR -- just food for thought)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, you are right, this will be part of the next change in the next PR

continue
# Find the title (if it exists)
title_tag = kwd.find("title")
if title_tag is None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's some duplicated code between the if and else branches here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, I will look into that

Comment on lines +1780 to +1792
# Check for the presence of <etal> (et al.)
etal_tag = ref.find("etal")
if etal_tag is not None:
etal = "Et al." # Add "Et al." if the tag is present
else:
etal = ""

# If 'etal' is found, append it to the final authors list
### ERROR authors could be an empty list, need to figure out if the above tag is absent what to do
if etal != "":
final_authors = f"{', '.join(authors)} {etal}"
else:
final_authors = f"{', '.join(authors)}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about:

Suggested change
# Check for the presence of <etal> (et al.)
etal_tag = ref.find("etal")
if etal_tag is not None:
etal = "Et al." # Add "Et al." if the tag is present
else:
etal = ""
# If 'etal' is found, append it to the final authors list
### ERROR authors could be an empty list, need to figure out if the above tag is absent what to do
if etal != "":
final_authors = f"{', '.join(authors)} {etal}"
else:
final_authors = f"{', '.join(authors)}"
# If 'etal' is found, append it to the final authors list
### ERROR authors could be an empty list, need to figure out if the above tag is absent what to do
final_authors = ", ".join(authors)
if ref.find("etal"):
final_authors += " Et al."

If you want to handle the empty authors case you could do:

if final_authors and re.find("etal"):
    # ...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will need to look into it

.replace("&amp;", "&")
.replace("&apos;", "'")
.replace("&quot;", '"')
.replace("\xa0", " ")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this extra replace is present in the other places where you're doing the same processsing... Might you need to replace \xa0s in those places too?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will look into it. I think at the moment I focused on the text that is used by the NLP

Comment on lines 2261 to 2267
try:
dir_path = sys.argv[1]
# dir_output = sys.argv[2]
#### ANTOINE wants to take the output here are well and also transform the below as a function
#### Request an error if no input parameters
except IndexError:
dir_path = "./xml_hackathon"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should give an error if the user doesn't provide an input path here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree

@alexdewar
Copy link
Collaborator

alexdewar commented May 23, 2025

Btw mypy is now happy with parse_xml.py, but we're still ignoring it in .pre-commit-config.yaml. Maybe we should enable mypy for it?

@AdrianDAlessandro

@AdrianDAlessandro AdrianDAlessandro added enhancement New feature or request and removed bug Something isn't working labels Jun 2, 2025
@AdrianDAlessandro
Copy link
Collaborator

@Antoinelfr I've brought this up to date with main and added one or two small changes. I'd still recommend addressing all of @alexdewar 's suggestions before merging.

@Thomas-Rowlands I'll leave this with you now to decide when it's ready to merge

Copy link
Collaborator

@Thomas-Rowlands Thomas-Rowlands left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, puts my documentation to shame!

@Antoinelfr
Copy link
Collaborator Author

I have made some modifications as suggested by @alexdewar. I also created a summary of the next steps here #294

@Antoinelfr Antoinelfr merged commit 399b80c into main Jun 10, 2025
16 checks passed
@Antoinelfr Antoinelfr deleted the xml_update branch June 10, 2025 16:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Import IAO code from AC into parse_xml

5 participants