-
Notifications
You must be signed in to change notification settings - Fork 11
Update XML script + add regression test for XML module #263
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov ReportAttention: Patch coverage is
... and 1 file with indirect coverage changes 🚀 New features to boost your workflow:
|
|
Before merging, let me look at the Windows test failing. I will try to implement the IOA from the main AC as well to avoid redundancy. I also noticed a weird character encoding in the AC paper parsing as opposed to the HTML I will take a look at this one as well. |
alexdewar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nice cleanup and having a test for the XML processing is definitely an improvement. Good work!
The reason the tests are failing on the Windows runners currently is because there are a few places where you're calling open() without setting encoding="utf-8" (the default text encoding used by Python on Windows isn't UTF8 for reasons that are obscure and bad).
I've taken this as an opportunity to give this code a bit of an audit. I've made some small suggestions in comments (in case you haven't done this before, there is a button to commit suggestions directly). Commit the ones you like and ignore the ones you don't, then re-request a review from me.
I've also got some comments about the general structure of the code, though I'm not saying that you should fix this now -- just something to think about as you go forward.
Firstly, I'd strongly suggest breaking apart the convert_xml_to_json function into smaller, more manageable chunks. It's currently doing way too much. There's also quite a lot of code duplicated between if and else branches that could be extracted into separate functions. Lots of copy-pasted code is a bit of a maintenance headache because a) it makes the code harder to read and reason about and b) it's easy to fix a bug in one part of the code and forget to update all the other copy-pasted versions of it.
| year_xml = "" | ||
| else: | ||
| # If 'accepted' date is missing, assign an empty string | ||
| year_xml = "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd split all these little checks and bits for cleaning the text into separate functions for readability + testability. For example, this one could be something like this:
def get_year(soup: BeautifulSoup) -> str:
# Check if the 'accepted' date is found within 'date', and if it contains a 'year' tag
### no check for unicode or hexacode or XML tags
if date := soup.find("date", {"date-type": "accepted"}):
if year := date.find("year"):
# Extract the text content of the 'year' tag if found
return year.text
# If 'accepted' date or 'year' is missing, return empty string
return ""(I'm not saying you have to do that on this PR -- just food for thought)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, you are right, this will be part of the next change in the next PR
| continue | ||
| # Find the title (if it exists) | ||
| title_tag = kwd.find("title") | ||
| if title_tag is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's some duplicated code between the if and else branches here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, I will look into that
| # Check for the presence of <etal> (et al.) | ||
| etal_tag = ref.find("etal") | ||
| if etal_tag is not None: | ||
| etal = "Et al." # Add "Et al." if the tag is present | ||
| else: | ||
| etal = "" | ||
|
|
||
| # If 'etal' is found, append it to the final authors list | ||
| ### ERROR authors could be an empty list, need to figure out if the above tag is absent what to do | ||
| if etal != "": | ||
| final_authors = f"{', '.join(authors)} {etal}" | ||
| else: | ||
| final_authors = f"{', '.join(authors)}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about:
| # Check for the presence of <etal> (et al.) | |
| etal_tag = ref.find("etal") | |
| if etal_tag is not None: | |
| etal = "Et al." # Add "Et al." if the tag is present | |
| else: | |
| etal = "" | |
| # If 'etal' is found, append it to the final authors list | |
| ### ERROR authors could be an empty list, need to figure out if the above tag is absent what to do | |
| if etal != "": | |
| final_authors = f"{', '.join(authors)} {etal}" | |
| else: | |
| final_authors = f"{', '.join(authors)}" | |
| # If 'etal' is found, append it to the final authors list | |
| ### ERROR authors could be an empty list, need to figure out if the above tag is absent what to do | |
| final_authors = ", ".join(authors) | |
| if ref.find("etal"): | |
| final_authors += " Et al." |
If you want to handle the empty authors case you could do:
if final_authors and re.find("etal"):
# ...There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will need to look into it
| .replace("&", "&") | ||
| .replace("'", "'") | ||
| .replace(""", '"') | ||
| .replace("\xa0", " ") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this extra replace is present in the other places where you're doing the same processsing... Might you need to replace \xa0s in those places too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will look into it. I think at the moment I focused on the text that is used by the NLP
autocorpus/parse_xml.py
Outdated
| try: | ||
| dir_path = sys.argv[1] | ||
| # dir_output = sys.argv[2] | ||
| #### ANTOINE wants to take the output here are well and also transform the below as a function | ||
| #### Request an error if no input parameters | ||
| except IndexError: | ||
| dir_path = "./xml_hackathon" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should give an error if the user doesn't provide an input path here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree
|
Btw |
…title, import function from main function
|
@Antoinelfr I've brought this up to date with main and added one or two small changes. I'd still recommend addressing all of @alexdewar 's suggestions before merging. @Thomas-Rowlands I'll leave this with you now to decide when it's ready to merge |
Thomas-Rowlands
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, puts my documentation to shame!
|
I have made some modifications as suggested by @alexdewar. I also created a summary of the next steps here #294 |
Description
This PR includes:
Fixes #190
Type of change
Key checklist
pytest)mkdocs)pre-commit run --all-files)Further checks