Skip to content

Conversation

@sheriefvt
Copy link
Contributor

This is a follow-up work for a previous merged PR #671

  • Fix contact info (name) for most of the cases.
  • Add external identifier (URL)
  • Update source .yaml

Copy link
Contributor

@aaxelb aaxelb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple things that could be cleaner, otherwise 👍

earliest_date: null
harvester: org.swbiodiversity
harvester_kwargs:
harvester_kwargs: &h_kw
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this?

rate_limit_period: 2
transformer: org.swbiodiversity
transformer_kwargs: {}
transformer_kwargs: *h_kw
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this?

contact_dict = {}
contact = entry.replace('Contact:', '').strip()
contact_email = contact[contact.find("(") + 1:contact.find(")")]
if re.match(r"[^@]+@[^@]+\.[^@]+", contact_email):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this shouldn't capture whitespace. Maybe [^@\s] instead of [^@]?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, single quotes, please

'Mycologist and Director', ',']
for item in remove_list:
if item.lower() in contact_name.lower():
contact_name = contact_name.lower().replace(item.lower(), '').strip()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means contact_name will be lower case if anything is replaced. Could you use a case-insensitive regex instead?

collection_statistics = list(map(self.extract_text, collection_statistics))
data['collection-statistics'] = self.process_collection_stat(collection_statistics)

link = record.find_next('div', style="margin:3px;")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of all this (and instead of identifiers = ... above), you could yield the full URL as the identifier from the harvester (yield collection_page.url, str(record)). It will automatically be added as an identifier.

@sheriefvt
Copy link
Contributor Author

@aaxelb, thanks for your feedback. YAML has a feature called 'anchors', which let you easily duplicate content across your yaml document. Both of transformer_kwargs and harvester_kwargs will have the same url value. I used this to avoid duplication and to avoid updating url in two places if need to be updated. Since we are using the yielded identifier from the harvester, it is not needed now and removed.

@aaxelb
Copy link
Contributor

aaxelb commented Jul 31, 2017

@sheriefvt Oh neat, I didn't know about yaml anchors. That's handy.

@aaxelb aaxelb merged commit 20f3f49 into CenterForOpenScience:develop Aug 2, 2017
cardene added a commit to cardene/SHARE that referenced this pull request Aug 8, 2017
… into develop

* 'develop' of https://github.com/CenterForOpenScience/SHARE: (26 commits)
  [Fix] Add a script to correct datacite issues
  [Fix] Add try/catches in fetchers
  Drop fetcher chunk size
  [Fix] Handle deleted documents
  Trying to fix indexing again
  Switch the default fetcher to the old one
  [Fix] Disable dual indexing
  [Fix] Catch exceptions from _get_messages
  Better defaults and only count up
  Re-vamp search indexing to support multiple indexes
  [SHARE-832][feature]Add Symbiota harvester (Additional dev.) (CenterForOpenScience#704)
  [Fix] Index all disambiguated works (CenterForOpenScience#706)
  [Fix] Disable throughsubjects triggers
  [Fix] Handle subjects with /s in them
  [Fix] aliases -> synonyms
  [Fix] Subject URIs unique only within taxonomy
  [Fix] Typo
  [Fix] Handle a queue full of non-indexable works
  [Fix] Migration to correct subjects issue
  [Fix] Dont raise on error
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants