Update XML script + add regression test for XML module #263

Antoinelfr · 2025-05-22T15:30:48Z

Description

This PR includes:

Clean new lines and odd formatting from the title text.
Fix the author formatting in the reference.
Manually replace \xa0 with a space.
Fix a bug where the section title was empty because the title tag was outside the section tag.
Fix the keywords section: exclude the title from the text, remove non-English keywords, take multiple lists rather than only the first one, and exclude the abbreviation list.
Remove tables and figures from the front and back tags.
Add IAO IDs for all passages and fix the "document part" allocation.
Fix a bug where chunks of text were not identified because they were not in section tags.
Improve the filtering of text artefacts from unwanted sections.

Fixes #190

Type of change

Documentation (non-breaking change that adds or improves the documentation)
New feature (non-breaking change which adds functionality)
Optimization (non-breaking, back-end change that speeds up the code)
Bug fix (non-breaking change which fixes an issue)
Breaking change (whatever its nature)

Key checklist

All tests pass (eg. pytest)
The documentation builds and looks OK (eg. mkdocs)
Pre-commit hooks run successfully (eg. pre-commit run --all-files)

Further checks

Code is commented, particularly in hard-to-understand areas
Tests added or an issue has been opened to tackle that in the future. (Indicate issue here: # (issue))

codecov · 2025-05-22T15:36:29Z

Codecov Report

Attention: Patch coverage is 33.33333% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| autocorpus/file_processing.py | 33.33% | 2 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| autocorpus/parse_xml.py | `37.10% <ø> (ø)` |
| autocorpus/file_processing.py | `50.90% <33.33%> (ø)` |

... and 1 file with indirect coverage changes

Antoinelfr · 2025-05-23T07:04:28Z

Before merging, let me look at the failing Windows test. I will also try to use the IAO implementation from the main AC to avoid redundancy. I also noticed a weird character encoding in the AC paper parsing, as opposed to the HTML; I will take a look at this one as well.

alexdewar

This is a nice cleanup and having a test for the XML processing is definitely an improvement. Good work!

The tests are currently failing on the Windows runners because there are a few places where you're calling open() without setting encoding="utf-8" (the default text encoding used by Python on Windows isn't UTF-8 for reasons that are obscure and bad).
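As a minimal sketch of the fix described above (the helper name `round_trip_utf8` and the file name are hypothetical, not taken from the PR), passing `encoding="utf-8"` to every `open()` call makes reading and writing independent of the platform's default encoding:

```python
import tempfile
from pathlib import Path


def round_trip_utf8(text: str) -> str:
    """Write text to a temporary file and read it back, always as UTF-8."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "example.xml"  # hypothetical file name
        # Explicit encoding: without it, Python may fall back to the
        # locale's default encoding, which is often not UTF-8 on Windows.
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        with open(path, encoding="utf-8") as f:
            return f.read()


print(round_trip_utf8("café α β"))
```

The same argument applies to `Path.read_text()` and `Path.write_text()`, which also accept `encoding`.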

I've taken this as an opportunity to give this code a bit of an audit. I've made some small suggestions in comments (in case you haven't done this before, there is a button to commit suggestions directly). Commit the ones you like and ignore the ones you don't, then re-request a review from me.

I've also got some comments about the general structure of the code, though I'm not saying that you should fix this now -- just something to think about as you go forward.
Firstly, I'd strongly suggest breaking apart the convert_xml_to_json function into smaller, more manageable chunks. It's currently doing way too much. There's also quite a lot of code duplicated between if and else branches that could be extracted into separate functions. Lots of copy-pasted code is a bit of a maintenance headache because a) it makes the code harder to read and reason about and b) it's easy to fix a bug in one part of the code and forget to update all the other copy-pasted versions of it.

autocorpus/parse_xml.py

alexdewar · 2025-05-23T13:12:51Z

autocorpus/parse_xml.py

+            year_xml = ""
+    else:
+        # If 'accepted' date is missing, assign an empty string
+        year_xml = ""


I'd split all these little checks and bits for cleaning the text into separate functions for readability + testability. For example, this one could be something like this:

    def get_year(soup: BeautifulSoup) -> str:
        # Check if the 'accepted' date is found within 'date', and if it contains a 'year' tag
        ### no check for unicode or hexacode or XML tags
        if date := soup.find("date", {"date-type": "accepted"}):
            if year := date.find("year"):
                # Extract the text content of the 'year' tag if found
                return year.text
        # If 'accepted' date or 'year' is missing, return empty string
        return ""

(I'm not saying you have to do that on this PR -- just food for thought)

Yes, you are right. This will be part of the next PR.

alexdewar · 2025-05-23T13:17:14Z

autocorpus/parse_xml.py

+            continue
+        # Find the title (if it exists)
+        title_tag = kwd.find("title")
+        if title_tag is None:


There's some duplicated code between the if and else branches here

You are right, I will look into that

autocorpus/parse_xml.py

alexdewar · 2025-05-23T13:57:36Z

autocorpus/parse_xml.py

+            # Check for the presence of <etal> (et al.)
+            etal_tag = ref.find("etal")
+            if etal_tag is not None:
+                etal = "Et al."  # Add "Et al." if the tag is present
+            else:
+                etal = ""
+
+            # If 'etal' is found, append it to the final authors list
+            ### ERROR authors could be an empty list, need to figure out if the above tag is absent what to do
+            if etal != "":
+                final_authors = f"{', '.join(authors)} {etal}"
+            else:
+                final_authors = f"{', '.join(authors)}"


How about:

Suggested change

-            # Check for the presence of <etal> (et al.)
-            etal_tag = ref.find("etal")
-            if etal_tag is not None:
-                etal = "Et al."  # Add "Et al." if the tag is present
-            else:
-                etal = ""
-
-            # If 'etal' is found, append it to the final authors list
-            ### ERROR authors could be an empty list, need to figure out if the above tag is absent what to do
-            if etal != "":
-                final_authors = f"{', '.join(authors)} {etal}"
-            else:
-                final_authors = f"{', '.join(authors)}"
+            # If 'etal' is found, append it to the final authors list
+            ### ERROR authors could be an empty list, need to figure out if the above tag is absent what to do
+            final_authors = ", ".join(authors)
+            if ref.find("etal"):
+                final_authors += " Et al."

If you want to handle the empty authors case you could do:

    if final_authors and ref.find("etal"):
        # ...
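A minimal, self-contained sketch of the empty-authors handling discussed here. The function name `format_authors` and the `has_etal` flag are hypothetical. In the real code the flag would presumably come from something like `ref.find("etal") is not None`:

```python
from typing import List


def format_authors(authors: List[str], has_etal: bool) -> str:
    """Join author names; append "Et al." only if there is at least one name."""
    final_authors = ", ".join(authors)
    # An empty author list yields "", so "Et al." is never left dangling.
    if final_authors and has_etal:
        final_authors += " Et al."
    return final_authors


print(format_authors(["Smith J", "Doe A"], has_etal=True))
print(repr(format_authors([], has_etal=True)))
```

Pulling this into a small pure function also makes the edge case easy to cover in a unit test without building an XML document.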

I will need to look into it.

alexdewar · 2025-05-23T13:59:52Z

autocorpus/parse_xml.py

+                        .replace("&amp;", "&")
+                        .replace("&apos;", "'")
+                        .replace("&quot;", '"')
+                        .replace("\xa0", " ")


I don't think this extra replace is present in the other places where you're doing the same processing... Might you need to replace \xa0s in those places too?
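One way to keep the replacements consistent everywhere would be a single shared helper. This is only a sketch: the name `clean_text` is hypothetical, and the table lists only the four replacements visible in the hunk above (the real chain may contain more):

```python
# Replacement pairs visible in the diff hunk; applied in the same order.
REPLACEMENTS = (
    ("&amp;", "&"),
    ("&apos;", "'"),
    ("&quot;", '"'),
    ("\xa0", " "),  # non-breaking space -> regular space
)


def clean_text(text: str) -> str:
    """Apply every replacement in order, so all call sites behave the same."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text


print(clean_text("Smith &amp; Jones\xa0said &quot;hi&quot;"))
```

With one helper, adding a new replacement such as `\xa0` updates every place that cleans text, rather than only the call site that was edited.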

I will look into it. For the moment I have focused on the text that is used by the NLP.

autocorpus/parse_xml.py

alexdewar · 2025-05-23T14:02:32Z

autocorpus/parse_xml.py

+    try:
+        dir_path = sys.argv[1]
+        # dir_output = sys.argv[2]
+        #### ANTOINE wants to take the output here are well and also transform the below as a function
+        #### Request an error if no input parameters
+    except IndexError:
+        dir_path = "./xml_hackathon"


I think we should give an error if the user doesn't provide an input path here.
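A sketch of how that could look using the standard library's `argparse`, which exits with a usage error when a required positional argument is missing. The function name `parse_args` and the help strings are assumptions. Only the `dir_path` name comes from the hunk above:

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the command line; a missing input path is an error, not a default."""
    parser = argparse.ArgumentParser(
        description="Convert the XML files in a directory."
    )
    # Required positional argument: argparse prints usage and exits
    # with status 2 if it is not supplied.
    parser.add_argument("dir_path", help="directory containing the input XML files")
    return parser.parse_args(argv)


# Explicit argument list for illustration; pass None to read sys.argv.
args = parse_args(["./xml_hackathon"])
print(args.dir_path)
```

An output directory (the commented-out `dir_output` in the hunk) could be added later as a second argument in the same way.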

alexdewar · 2025-05-23T14:33:07Z

Btw mypy is now happy with parse_xml.py, but we're still ignoring it in .pre-commit-config.yaml. Maybe we should enable mypy for it?

AdrianDAlessandro · 2025-06-03T18:08:04Z

@Antoinelfr I've brought this up to date with main and added one or two small changes. I'd still recommend addressing all of @alexdewar's suggestions before merging.

@Thomas-Rowlands I'll leave this with you now to decide when it's ready to merge

Thomas-Rowlands

LGTM, puts my documentation to shame!

Antoinelfr · 2025-06-10T15:59:44Z

I have made some modifications as suggested by @alexdewar. I also created a summary of the next steps here #294

Antoinelfr added 2 commits May 22, 2025 14:57

fix bugs regarding the title, abstract, missing text, formating, unicode

618c913

add test, reformat XML main function, add expected output

39ebb40

Antoinelfr requested review from AdrianDAlessandro, Thomas-Rowlands and alexdewar May 22, 2025 15:30

Antoinelfr added the bug Something isn't working label May 22, 2025

Antoinelfr assigned Antoinelfr and AdrianDAlessandro May 22, 2025

alexdewar requested changes May 23, 2025

View reviewed changes

Antoinelfr and others added 3 commits May 27, 2025 17:06

Merge branch 'main' into xml_update

241b4b2

implement changes suggested, fix some characters, better indexing of …

07acfd1

…title, import function from main function

No longer ignore parse_xml with mypy

6cf15c3

AdrianDAlessandro mentioned this pull request May 29, 2025

Split convert_xml_to_json into smaller functions #274

Open

AdrianDAlessandro added enhancement New feature or request and removed bug Something isn't working labels Jun 2, 2025

Merge branch 'main' into xml_update

c60efd2

This was referenced Jun 3, 2025

Functions for checks and cleaning in parse_xml #285

Open

Reduce duplicated code between the if and else branches in parse_xml #286

Open

Create try_get_text helper function in parse_xml #287

Open

Use convert_xml_to_json in process_file

898493c

Merge branch 'main' into xml_update

0190fdf

Thomas-Rowlands approved these changes Jun 4, 2025

View reviewed changes

update IAO files to include title

4554e5a

Merge branch 'main' into xml_update

d7cf934

Antoinelfr merged commit 399b80c into main Jun 10, 2025
16 checks passed

Antoinelfr deleted the xml_update branch June 10, 2025 16:24

5 participants
