Add s3 to DESY spider instead of FTP #284
Conversation
hepcrawl/spiders/desy_spider.py (outdated diff)
        self.logger.error("Cannot connect to s3.", exc)
        raise
    return connections
    def get_s3_url_for_key(self, key, bucket=None, expire=86400):
missing newline
👍
Actually, it looks like this method is not used at all. Could you check?
Yeah, you are right. It's a leftover from the additional-files processing I did at first.
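For reference, a helper with that signature would typically wrap boto3's `generate_presigned_url` to hand out time-limited download links. A minimal standalone sketch, under the assumption that a plain S3 client is passed in (the function body is illustrative, not the actual hepcrawl code):

```python
def get_s3_url_for_key(s3_client, key, bucket, expire=86400):
    """Return a presigned GET URL for `key`, valid for `expire` seconds
    (24 hours by default), so the object can be fetched without credentials."""
    return s3_client.generate_presigned_url(
        'get_object',
        Params={'Bucket': bucket, 'Key': key},
        ExpiresIn=expire,
    )
```

Since the thread concludes the method is unused, removing it (as done here) is the right call; the sketch only documents what it would have done.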
    )

    def crawl_s3_bucket(self):
        input_bucket = self.s3_resource.Bucket(self.s3_input_bucket)
        existing_files = os.listdir(self.destination_folder)
I would forget about the local files handling; it's not needed if you move the source files to a different bucket.
I would prefer to leave the local part of the spider intact, as it's not part of this task.
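For context, iterating a bucket with boto3's resource API looks roughly like this — a sketch under the assumption that the spider compares bucket keys against filenames already downloaded locally (names are illustrative, not the actual hepcrawl code):

```python
def list_new_keys(s3_resource, bucket_name, existing_files):
    """Yield keys from the bucket that are not already present locally."""
    bucket = s3_resource.Bucket(bucket_name)
    for obj in bucket.objects.all():  # lazy, paginated ObjectSummary iterator
        if obj.key not in existing_files:
            yield obj.key
```

`objects.all()` pages through the bucket lazily, so this works for buckets with more than 1000 objects without manual pagination.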
hepcrawl/spiders/desy_spider.py (outdated diff)
            url=response.url
        )
        self.logger.info('Got %d hep records' % len(parsed_items))

        if self.s3_enabled:
            s3_file = response.meta['s3_file']
            self.logger.info("Moving {file} to {processed} bucket and deleting {file} from {incoming} bucket.".format(
Suggested change:
- self.logger.info("Moving {file} to {processed} bucket and deleting {file} from {incoming} bucket.".format(
+ self.logger.info("Moving {file} from {incoming} bucket to {processed} bucket.".format(
👍
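Since S3 has no atomic move, the step being logged above is really a copy into the processed bucket followed by a delete from the incoming bucket. A minimal sketch of that pattern with boto3's resource API (function and bucket names are illustrative, not the actual spider code):

```python
def move_s3_file(s3_resource, key, incoming_bucket, processed_bucket):
    """'Move' a key between buckets: S3 only offers copy + delete."""
    copy_source = {'Bucket': incoming_bucket, 'Key': key}
    # Server-side copy; the object bytes never pass through the crawler host.
    s3_resource.Object(processed_bucket, key).copy_from(CopySource=copy_source)
    s3_resource.Object(incoming_bucket, key).delete()
```

Note that copy-then-delete is not atomic: if the delete fails, the file briefly exists in both buckets, which is harmless as long as the crawler deduplicates on key.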
Force-pushed from dc669ea to adaaed8.
Force-pushed from adaaed8 to 6c78cd2.
Currently all additional files for a record are intentionally ignored (after discussion with Micha).
ref:inspirehep/inspirehep#1245