-
Notifications
You must be signed in to change notification settings - Fork 69
[SHARE-832][feature]Add Symbiota harvester (Additional dev.) #704
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SHARE-832][feature]Add Symbiota harvester (Additional dev.) #704
Conversation
* Update setup * Add harvester * Add transformer * Add source * Add tests for harvester/transformer
… into feature/SHARE-832 * Update transformer test
* Fix contact info (name) * Update source
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple things that could be cleaner, otherwise 👍
| earliest_date: null | ||
| harvester: org.swbiodiversity | ||
| harvester_kwargs: | ||
| harvester_kwargs: &h_kw |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this?
| rate_limit_period: 2 | ||
| transformer: org.swbiodiversity | ||
| transformer_kwargs: {} | ||
| transformer_kwargs: *h_kw |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And this?
| contact_dict = {} | ||
| contact = entry.replace('Contact:', '').strip() | ||
| contact_email = contact[contact.find("(") + 1:contact.find(")")] | ||
| if re.match(r"[^@]+@[^@]+\.[^@]+", contact_email): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this shouldn't capture whitespace. Maybe [^@\s] instead of [^@]?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, single quotes, please
| 'Mycologist and Director', ','] | ||
| for item in remove_list: | ||
| if item.lower() in contact_name.lower(): | ||
| contact_name = contact_name.lower().replace(item.lower(), '').strip() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means contact_name will be lower case if anything is replaced. Could you use a case-insensitive regex instead?
| collection_statistics = list(map(self.extract_text, collection_statistics)) | ||
| data['collection-statistics'] = self.process_collection_stat(collection_statistics) | ||
|
|
||
| link = record.find_next('div', style="margin:3px;") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of all this (and instead of identifiers = ... above), you could yield the full URL as the identifier from the harvester (yield collection_page.url, str(record)). It will automatically be added as an identifier.
|
@aaxelb, thanks for your feedback. YAML has a feature called 'anchors', which let you easily duplicate content across your yaml document. Both of transformer_kwargs and harvester_kwargs will have the same url value. I used this to avoid duplication and to avoid updating url in two places if need to be updated. Since we are using the yielded identifier from the harvester, it is not needed now and removed. |
|
@sheriefvt Oh neat, I didn't know about yaml anchors. That's handy. |
… into develop * 'develop' of https://github.com/CenterForOpenScience/SHARE: (26 commits) [Fix] Add a script to correct datacite issues [Fix] Add try/catches in fetchers Drop fetcher chunk size [Fix] Handle deleted documents Trying to fix indexing again Switch the default fetcher to the old one [Fix] Disable dual indexing [Fix] Catch exceptions from _get_messages Better defaults and only count up Re-vamp search indexing to support multiple indexes [SHARE-832][feature]Add Symbiota harvester (Additional dev.) (CenterForOpenScience#704) [Fix] Index all disambiguated works (CenterForOpenScience#706) [Fix] Disable throughsubjects triggers [Fix] Handle subjects with /s in them [Fix] aliases -> synonyms [Fix] Subject URIs unique only within taxonomy [Fix] Typo [Fix] Handle a queue full of non-indexable works [Fix] Migration to correct subjects issue [Fix] Dont raise on error ...
This is a follow-up work for a previous merged PR #671