Conversation

@sheriefvt (Contributor) commented May 30, 2017

Ticket

https://openscience.atlassian.net/browse/SHARE-832

Problem

Adding a harvester for Symbiota, harvesting Collections.

Solution

  • Update setup
  • Add the harvester (Scraping HTML from each record page)
  • Add the transformer
  • Add new source
  • Add tests for both harvester/transformer
  • Add requests-mock package to requirements

sheriefvt added 5 commits May 26, 2017 10:01
* Update setup
* Add harvester
* Add transformer
* Add source
* Add tests for harvester/transformer

logging.info('Found %d results from swbiodiversity', total)
count = 0
while count < total:
Member

Use a for loop and enumerate here:

for count, record in enumerate(record_list):
    ...
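As a runnable sketch of the suggested pattern (record_list is a stand-in for the harvester's list of record identifiers):

```python
record_list = ['coll-1', 'coll-2', 'coll-3']  # placeholder identifiers

# enumerate supplies the index, so no manual count/increment is needed
for count, identifier in enumerate(record_list):
    print(count, identifier)
```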

identifier = record_list[count]

data['identifier'] = identifier
html = self.requests.get(self.kwargs['list_url'] + '?collid=' + identifier)

Use the furl library to build URLs. You can look at any other harvester for examples 👍

identifier = record_list[count]

data['identifier'] = identifier
html = self.requests.get(self.kwargs['list_url'] + '?collid=' + identifier)

Use a more descriptive name here like resp or response. html is misleading.

count += 1
yield identifier, data

def process_text(self, text):

This method is unnecessary, you can just call .get_text().
Don't parse HTML with regex.
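For illustration, BeautifulSoup already extracts clean text without regex (the markup below is invented):

```python
from bs4 import BeautifulSoup

html = '<div id="innertext"><h1>  Sample Collection  </h1></div>'  # invented markup
soup = BeautifulSoup(html, 'html.parser')

# get_text(strip=True) also trims the surrounding whitespace
title = soup.find(id='innertext').h1.get_text(strip=True)
print(title)
```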


def process_collection_stat(self, list):
stat = {}
for item in list:

Avoid using list as a variable name; it shadows the built-in type list.
Here's a list of builtins/names to avoid.
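A quick illustration of the shadowing problem (function names adapted from the snippet above):

```python
def process_collection_stat(list):  # parameter shadows the builtin
    # calling list('ab') in here would raise TypeError: 'list' object is not callable
    return len(list)

def process_collection_stat_fixed(items):  # descriptive name, builtin intact
    return list(enumerate(items))  # the builtin list() works as expected
```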


# Peel out script tags and css things to minimize size of HTML
for el in itertools.chain(
soup('img'),

Looks like this might be double-indented?

el.extract()

record = soup.find(id='innertext')
title = self.process_text(record.h1)

Most, if not all, of this processing should happen in the transformer. You'll want to preserve as much of the original data as possible. If a bug pops up in the parsing, we can just reprocess the data we already have.


If you still prefer to give this data to the transformer, you can massage the data before-hand by overriding transform


from bs4 import BeautifulSoup, Comment
from furl import furl
import itertools

itertools is part of the stdlib, so it should be grouped with the standard-library imports, before the third-party ones.

from share.transform.chain.parsers import Parser
from share.transform.chain.soup import SoupXMLTransformer
from bs4 import BeautifulSoup
import re

These should be ordered as

import re

from bs4 import BeautifulSoup

from share.transform.chain import ctx
from share.transform.chain import links as tools
from share.transform.chain.parsers import Parser
from share.transform.chain.soup import SoupXMLTransformer

@@ -0,0 +1,148 @@
from datetime import timedelta

import requests_mock

Could you switch this to use httpretty? @laurenbarker and I thought the interface for it was a bit nicer. It would be good to standardize on a single library.

end_date = end.date()
start_date = start.date()
logger.info('Harvesting swbiodiversity %s - %s', start_date, end_date)
return self.fetch_records()

This appears to disregard date ranges. Did we discuss this at some point?

@chrisseto chrisseto merged commit 5d5f904 into CenterForOpenScience:develop Jul 6, 2017