[SHARE-832][feature]Add Symbiota harvester #671
Conversation
* Update setup
* Add harvester
* Add transformer
* Add source
* Add tests for harvester/transformer
```python
logging.info('Found %d results from swbiodiversity', total)
count = 0
while count < total:
```
Use a for loop and enumerate here.
```python
for count, record in enumerate(record_list):
    ...
```

```python
identifier = record_list[count]

data['identifier'] = identifier
html = self.requests.get(self.kwargs['list_url'] + '?collid=' + identifier)
```
Use the furl library to build URLs. You can look at any other harvester for examples 👍
```python
identifier = record_list[count]

data['identifier'] = identifier
html = self.requests.get(self.kwargs['list_url'] + '?collid=' + identifier)
```
Use a more descriptive name here, like `resp` or `response`; `html` is misleading.
```python
count += 1
yield identifier, data

def process_text(self, text):
```
This method is unnecessary; you can just call `.get_text()` on the element.
Don't parse HTML with regex.
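For instance (a standalone sketch with made-up markup, not the PR's actual HTML), BeautifulSoup can extract and normalize text without any regex:

```python
from bs4 import BeautifulSoup

html = '<div id="innertext"><h1>  Sample\n  Collection </h1></div>'  # invented markup
soup = BeautifulSoup(html, 'html.parser')
# get_text() flattens nested tags; split/join collapses the whitespace
title = ' '.join(soup.h1.get_text().split())
```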
```python
def process_collection_stat(self, list):
    stat = {}
    for item in list:
```
Avoid using `list` as a variable name; it shadows the built-in type `list`.
Here's a list of builtins/names to avoid.
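A hypothetical rewrite of the method above, with the parameter renamed; the parsing logic here is invented purely for illustration:

```python
def process_collection_stat(items):
    # 'items' avoids shadowing the builtin `list`; the key/value
    # parsing below is a guess at the intent, not the PR's code
    stats = {}
    for item in items:
        key, _, value = item.partition(':')
        stats[key.strip()] = value.strip()
    return stats
```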
```python
# Peel out script tags and css things to minimize size of HTML
for el in itertools.chain(
    soup('img'),
```
Looks like this might be double indented?
```python
el.extract()

record = soup.find(id='innertext')
title = self.process_text(record.h1)
```
Most, if not all, of this processing should happen in the transformer. You'll want to preserve as much of the original data as possible. If a bug pops up in the parsing, we can just reprocess the data we already have.
If you still prefer to give this data to the transformer, you can massage the data beforehand by overriding `transform`.
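As a rough illustration of that pattern only — with a stand-in base class, not SHARE's real `SoupXMLTransformer` API, whose signature may differ:

```python
class BaseTransformer:
    # stand-in for the real transformer chain
    def transform(self, datum):
        return datum.upper()

class SymbiotaTransformer(BaseTransformer):
    def transform(self, datum):
        # massage the raw data before handing it to the normal pipeline
        datum = datum.strip()
        return super().transform(datum)
```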
* … into feature/SHARE-832
* Update transformer test
```python
from bs4 import BeautifulSoup, Comment
from furl import furl
import itertools
```
itertools is part of the stdlib, so it should be imported before the third-party packages.
```python
from share.transform.chain.parsers import Parser
from share.transform.chain.soup import SoupXMLTransformer
from bs4 import BeautifulSoup
import re
```
These should be ordered as:

```python
import re
from bs4 import BeautifulSoup
from share.transform.chain import ctx
from share.transform.chain import links as tools
from share.transform.chain.parsers import Parser
from share.transform.chain.soup import SoupXMLTransformer
```
```python
from datetime import timedelta

import requests_mock
```
Could you switch this to use httpretty? @laurenbarker and I thought the interface for it was a bit nicer. It would be good to standardize on a single library.
```python
end_date = end.date()
start_date = start.date()
logger.info('Harvesting swbiodiversity %s - %s', start_date, end_date)
return self.fetch_records()
```
This appears to disregard date ranges. Did we discuss this at some point?
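One way to honor the window, sketched with an invented record shape where each record carries its own date (the function name and tuple layout are assumptions, not the PR's code):

```python
from datetime import date

def filter_by_range(records, start_date, end_date):
    # drop records whose date falls outside the requested harvest window
    for identifier, data, record_date in records:
        if start_date <= record_date <= end_date:
            yield identifier, data
```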
Ticket
https://openscience.atlassian.net/browse/SHARE-832

Problem
Adding a harvester for Symbiota, harvesting Collections.

Solution: