-
Notifications
You must be signed in to change notification settings - Fork 0
Timx 283 oaidc field method refactor #177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Timx 283 oaidc field method refactor #177
Conversation
tests/sources/xml/test_oai_dc.py
Outdated
| xml = next(transformer_instance.source_records) | ||
| assert transformer_instance.get_dates("test_source_record_id", xml) == [ | ||
| timdex.Date(kind="Unknown", note=None, range=None, value="2008-06-19T17:55:27") | ||
| assert OaiDc("cool-repo", source_records).get_content_type() == ["cool-repo"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the content_type is derived from the source instance variable, OaiDc must be instantiated for the test.
tests/sources/xml/test_oai_dc.py
Outdated
| def test_get_identifiers_transforms_correctly_if_fields_blank( | ||
| oai_dc_record_optional_fields_blank, | ||
| ): | ||
| oai_dc_record_optional_fields_blank.header.identifier.clear() | ||
| assert OaiDc.get_identifiers(oai_dc_record_optional_fields_blank) == [] | ||
|
|
||
|
|
||
| def test_get_identifiers_transforms_correctly_if_fields_missing( | ||
| oai_dc_record_optional_fields_missing, | ||
| ): | ||
| oai_dc_record_optional_fields_missing.header.identifier.decompose() | ||
| assert OaiDc.get_identifiers(oai_dc_record_optional_fields_missing) == [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious what you think of this setup. 🤔 The identifiers are derived from a child of the header element, so using the stub (as it is written) or the oai_dc_record_* fixtures directly was not possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that works
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, good call out. This is probably one of the inherent issues with the stubs.
Though I agree this works, another option could be modifying the function that creates the stub to one that allows both header overrides and actual content?
Maybe that's a strong argument for keeping these stubs in the test suite for that source itself vs the higher level abstraction proposed in this ticket.
Example:
def create_oaidc_source_record_stub(
header_identifier:str = "oai:libguides.com:guides/123456", #<------- new
xml_insert: str = ""
) -> BeautifulSoup:
xml_str = f"""
<records>
<record>
<header>
<identifier>{header_identifier}</identifier> #<---------- templated here
<datestamp>2023-05-31T19:49:21Z</datestamp>
<setSpec>guides</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
{xml_insert}
</oai_dc:dc>
</metadata>
</record>
<records
"""
return BeautifulSoup(xml_str, "xml")Just a thought exercise though, as I agree the method used above works well too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update from @ghukill : Whole header element should be editable via param to stub method.
cd5569a to
dd69c7d
Compare
ehanson8
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, some very minor suggestions
tests/sources/xml/test_oai_dc.py
Outdated
| def test_get_identifiers_transforms_correctly_if_fields_blank( | ||
| oai_dc_record_optional_fields_blank, | ||
| ): | ||
| oai_dc_record_optional_fields_blank.header.identifier.clear() | ||
| assert OaiDc.get_identifiers(oai_dc_record_optional_fields_blank) == [] | ||
|
|
||
|
|
||
| def test_get_identifiers_transforms_correctly_if_fields_missing( | ||
| oai_dc_record_optional_fields_missing, | ||
| ): | ||
| oai_dc_record_optional_fields_missing.header.identifier.decompose() | ||
| assert OaiDc.get_identifiers(oai_dc_record_optional_fields_missing) == [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that works
|
And let's resolve the |
a3ee3cf to
45a75bc
Compare
| return str(xml.find("dc:identifier").string) | ||
| if source_link := source_record.find("dc:identifier", string=True): | ||
| return str(source_link.string) | ||
| return "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, the method would throw an attribute error if dc:identifier did not appear in the XML:
AttributeError: 'NoneType' object has no attribute 'string'
Given that source_link is a required field, raising an error seems to be the right approach, but is there a cleaner way to handle this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your instinct to use SkippedRecordEvent here is correct, @ghukill your thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. If we can't pull a dc:identifier value, then we don't have a source link to send people to, and it feels like it's not worth keeping the record (punting the awareness of this to later observability/metrics discussions).
So yes, raising a SkippedRecordEvent feels like the right move!
|
@ghukill @ehanson8 The PR has now includes required updates to the Note(s):
|
45a75bc to
a05fc00
Compare
tests/sources/xml/test_oai_dc.py
Outdated
|
|
||
|
|
||
| def create_oai_dc_source_record_stub(xml_insert: str = "") -> BeautifulSoup: | ||
| def create_oaidc_source_record_stub(xml_insert: str = "") -> BeautifulSoup: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the default, I'll update my PR with that. Actually, let's just use the default with the new global stub function that will be created
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had agreed with the comment until the strikethrough, lolz!
See this comment about how maybe keeping the function closer to the tests is handy, given we can modify the signature to set values as needed in the stub. Maybe good fodder for our discussion today.
To me, feels like either approach is still kind of on the table.
| ) | ||
|
|
||
|
|
||
| def test_get_dates_transforms_correctly_and_logs_error_if_date_invalid( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect example of an edge case test!
| @@ -51,16 +52,16 @@ def get_dates(self, source_record_id: str, xml: Tag) -> list[timdex.Date]: | |||
| ) | |||
| return dates | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed in the other PR, this return can be updated with or None
| return str(xml.find("dc:identifier").string) | ||
| if source_link := source_record.find("dc:identifier", string=True): | ||
| return str(source_link.string) | ||
| return "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your instinct to use SkippedRecordEvent here is correct, @ghukill your thoughts?
ghukill
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Submitting some comments now for our discussion, with the intent of completing a further review later.
| return str(xml.find("dc:identifier").string) | ||
| if source_link := source_record.find("dc:identifier", string=True): | ||
| return str(source_link.string) | ||
| return "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. If we can't pull a dc:identifier value, then we don't have a source link to send people to, and it feels like it's not worth keeping the record (punting the awareness of this to later observability/metrics discussions).
So yes, raising a SkippedRecordEvent feels like the right move!
tests/sources/xml/test_oai_dc.py
Outdated
|
|
||
|
|
||
| def create_oai_dc_source_record_stub(xml_insert: str = "") -> BeautifulSoup: | ||
| def create_oaidc_source_record_stub(xml_insert: str = "") -> BeautifulSoup: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had agreed with the comment until the strikethrough, lolz!
See this comment about how maybe keeping the function closer to the tests is handy, given we can modify the signature to set values as needed in the stub. Maybe good fodder for our discussion today.
To me, feels like either approach is still kind of on the table.
transmogrifier/sources/xml/oaidc.py
Outdated
|
|
||
| # links | ||
| fields["links"] = self.get_links(source_record_id, xml) or None | ||
| fields["links"] = self.get_links(source_record, source_record_id) or None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need to pass source_record_id, if the source_record contains it?
And/or, could the method get_links() just call the other method get_source_record_id() again if not?
Feeling like maybe this is touched on in other comments. Full disclaimer: I wrote these transforms fairly early on, and recall they had requirements that deviated from previous Transmog sources (as I think you've commented on as well @jonavellecuerdo, like the identifier vs source link). So it feels possible that now is a good time to revisit some of those awkward edges. Silver lining: these are not yet live, so I think we're fairly free to modify them somewhat heavily if needed.
a05fc00 to
8b946ff
Compare
|
@ghukill @ehanson8 Please see the latest three (3) commits for updates based on our sync yesterday! See Summary of changes via Field Method Refactor |
ehanson8
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking great, just a few more suggestions!
| """ | ||
| <dc:creator>Ye Li</dc:creator> | ||
| """ | ||
| metadata_insert=( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call on named args!
tests/sources/xml/test_oai_dc.py
Outdated
| ): | ||
| oai_dc_record_all_fields.header.identifier.clear() | ||
| assert OaiDc.get_source_record_id(oai_dc_record_all_fields) == "" | ||
| def test_get_source_record_id_transforms_properly_if_fields_blank(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename to indicate it's raising the exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to test_get_source_record_id_raises_skipped_record_event_if_fields_blank().
tests/sources/xml/test_oai_dc.py
Outdated
| ): | ||
| oai_dc_record_all_fields.header.identifier.decompose() | ||
| assert OaiDc.get_source_record_id(oai_dc_record_all_fields) == "" | ||
| def test_get_source_record_id_transforms_properly_if_fields_missing(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to test_get_source_record_id_raises_skipped_record_event_if_fields_missing.
| SpringshareOaiDc.get_source_link("", "", source_record=source_record) | ||
|
|
||
|
|
||
| def test_get_source_link_transforms_correctly_if_required_fields_missing(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another rename to describe exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to test_get_source_link_raises_skipped_record_event_if_required_fields_blank and test_get_source_link_raises_skipped_record_event_if_required_fields_missing.
tests/sources/xml/test_oai_dc.py
Outdated
| """ | ||
| ), | ||
| ) | ||
| assert OaiDc.get_dates(source_record=source_record) == [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, I think named args for a single param method might be overkill 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My rule of thumb -- like anything, open to exceptions -- is use positional when the function signature is positional, and use named when the function signature is named. Keeps it simple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ghukill I'll follow this convention moving forward. I updated the function calls in tests to use positional arguments to match function signatures.
ghukill
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, thinking it's going well! Appreciating that OaiDc and Springshare are kind of finicky, given there is heavy inheritance implications.
Left a couple of comments / questions.
| message = ( | ||
| "Record skipped because 'source_record_id' could not be derived. " | ||
| "The 'identifier' was either missing from the header element or blank." | ||
| ) | ||
| raise SkippedRecordEvent(message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whether or not we end up layering in more granular exceptions, I think this is a great exception message. Ignoring when or how this is logged, this does explain why this record would be skipped.
One thing we might want to consider is how to control these messages so we could group them, even across sources. I think that's where more granular Exception types could come in. Kind of a balance/art between the Exception type, and the message. Will be easy to group and log and store metrics by exception type, but the string itself becomes most helpful likely for deep debugging a single record.
| ) | ||
| ) | ||
|
|
||
| # [TODO]: Message is logged even if the condition is met. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jonavellecuerdo - was this intended to be left in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes. I think this is one of the logging messages that this issue refers to. I'll add a reference to this line of code as part of the issue and remove the comment to avoid linting errors.
| return dates or None | ||
|
|
||
| def get_links(self, source_record_id: str, xml: Tag) -> list[timdex.Link] | None: | ||
| def get_links(self, source_record: Tag) -> list[timdex.Link] | None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be an instance method because SpringshareOaiDc.get_links uses the instance attribute source_name in the derivation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jonavellecuerdo - thanks for the explanation! It's altogether possible that I introduced that when I first worked on these. I feel like there is a bit of code smell there, and I'm probably to blame, but feels like maybe it's worth revisiting during the 2nd orchestration phase of this work.
| header_insert=( | ||
| """ | ||
| <identifier>oai:libguides.com:guides/175846</identifier> | ||
| """ | ||
| ), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem harmful here, but I'm wondering if the header_insert is needed? Wouldn't that only be needed for tests that are looking to pull the identifier from it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is required by the logging that uses source_record_id; OaiDc.get_source_record_id() derives a value that pulls from the 'identifier' subelement of the 'header' element.
32ce150 to
dde85aa
Compare
dde85aa to
445448f
Compare
ghukill
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thanks for the changes!
ehanson8
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved, but remove references to source_record_id from the docstrings
| source_record_id: Source record id | ||
| xml: A BeautifulSoup Tag representing a single OAI DC XML record. | ||
| source_record: A BeautifulSoup Tag representing a single OAI DC record in XML. | ||
| source_record_id: Source record ID. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No longer needed in the docstring since the param was removed
| source_record_id: Source record id | ||
| xml: A BeautifulSoup Tag representing a single OAI DC XML record. | ||
| source_record: A BeautifulSoup Tag representing a single OAI DC record in XML. | ||
| source_record_id: Source record ID. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No longer needed in the docstring since the param was removed
transmogrifier/sources/xml/oaidc.py
Outdated
| source_record_id: Source record id | ||
| xml: A BeautifulSoup Tag representing a single OAI DC XML record. | ||
| source_record: A BeautifulSoup Tag representing a single OAI DC record in XML. | ||
| source_record_id: Source record ID. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No longer needed in the docstring since the param was removed
445448f to
3534a01
Compare
Why these changes are being introduced: * These updates are required to implement the architecture described in the following ADR: https://github.com/MITLibraries/transmogrifier/blob/main/docs/adrs/0005-field-methods.md How this addresses that need: * Refactor base transform class OaiDc to contain get_* methods for optional fields * Make 'source_record' the first argument in OaiDc field methods * Move 'or None' to the return statements of field methods * Remove 'source_record_id' as param, replace with call inside field method instead * Raise SkippedRecordEvent * OaiDc.get_source_record_id * SpringshareOaiDc.get_source_link * Update unit tests * Remove OaiDc and SpringshareOaiDc record_* fixtures Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TIMX-283
3534a01 to
537f6ec
Compare

Purpose and background context
Field method refactor for base transform class
OaiDc.Note: Linters are failing for
transmogrifiers.sources.xml.springshare, which will be addressed in a later PR!How can a reviewer manually see the effects of these changes?
make testand verify all tests are passing.Includes new or updated dependencies?
NO
Changes expectations for external applications?
NO
What are the relevant tickets?
Developer
Code Reviewer(s)