Add s3 to DESY spider instead of FTP #284
Conversation
hepcrawl/spiders/desy_spider.py (outdated diff)
        self.logger.error("Cannot connect to s3.", exc)
        raise
    return connections
    def get_s3_url_for_key(self, key, bucket=None, expire=86400):
missing newline
👍
Actually, it looks like this method is not used at all. Could you check?
Yeah, you are right. It's a leftover from the additional-files processing I did at first.
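For reference, a helper with that signature would typically wrap boto3's `generate_presigned_url` to hand out time-limited download links. A minimal standalone sketch, under the assumption that a plain S3 client is passed in (the function body is illustrative, not the actual hepcrawl code):

```python
def get_s3_url_for_key(s3_client, key, bucket, expire=86400):
    """Return a presigned GET URL for `key`, valid for `expire` seconds
    (24 hours by default), so the object can be fetched without credentials."""
    return s3_client.generate_presigned_url(
        'get_object',
        Params={'Bucket': bucket, 'Key': key},
        ExpiresIn=expire,
    )
```

Since the thread concludes the method is unused, removing it (as done here) is the right call; the sketch only documents what it would have done.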
    )

    def crawl_s3_bucket(self):
        input_bucket = self.s3_resource.Bucket(self.s3_input_bucket)
        existing_files = os.listdir(self.destination_folder)
I would forget about the local files handling; it's not needed if you move the source files to a different bucket.
I would prefer to leave the local part of the spider intact, as it's not part of this task.
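For context, iterating a bucket with boto3's resource API looks roughly like this — a sketch under the assumption that the spider compares bucket keys against filenames already downloaded locally (names are illustrative, not the actual hepcrawl code):

```python
def list_new_keys(s3_resource, bucket_name, existing_files):
    """Yield keys from the bucket that are not already present locally."""
    bucket = s3_resource.Bucket(bucket_name)
    for obj in bucket.objects.all():  # lazy, paginated ObjectSummary iterator
        if obj.key not in existing_files:
            yield obj.key
```

`objects.all()` pages through the bucket lazily, so this works for buckets with more than 1000 objects without manual pagination.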
hepcrawl/spiders/desy_spider.py (outdated diff)
            url=response.url
        )
        self.logger.info('Got %d hep records' % len(parsed_items))

        if self.s3_enabled:
            s3_file = response.meta['s3_file']
            self.logger.info("Moving {file} to {processed} bucket and deleting {file} from {incoming} bucket.".format(
Suggested change:
- self.logger.info("Moving {file} to {processed} bucket and deleting {file} from {incoming} bucket.".format(
+ self.logger.info("Moving {file} from {incoming} bucket to {processed} bucket.".format(
👍
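Since S3 has no atomic move, the step being logged above is really a copy into the processed bucket followed by a delete from the incoming bucket. A minimal sketch of that pattern with boto3's resource API (function and bucket names are illustrative, not the actual spider code):

```python
def move_s3_file(s3_resource, key, incoming_bucket, processed_bucket):
    """'Move' a key between buckets: S3 only offers copy + delete."""
    copy_source = {'Bucket': incoming_bucket, 'Key': key}
    # Server-side copy; the object bytes never pass through the crawler host.
    s3_resource.Object(processed_bucket, key).copy_from(CopySource=copy_source)
    s3_resource.Object(incoming_bucket, key).delete()
```

Note that copy-then-delete is not atomic: if the delete fails, the file briefly exists in both buckets, which is harmless as long as the crawler deduplicates on key.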
Force-pushed from dc669ea to adaaed8.
Force-pushed from adaaed8 to 6c78cd2.
Currently all additional files for a record are intentionally ignored (after discussion with Micha).
ref:inspirehep/inspirehep#1245