
DataSink S3 support #1316


Merged
merged 55 commits into from
Feb 3, 2016

Conversation

pintohutch
Contributor

  1. Added S3 support to the standard DataSink class: the user can prepend 's3://' to the DataSink's base directory, supply credentials, and optionally enable server-side encryption to send data directly to an S3 bucket (see the sketch after this list). The class also gains a local_copy attribute, which can be used to write to a local directory in addition to S3 if desired. This code uses boto3 rather than the older boto package; the reasons for choosing boto3 are outlined here: https://aws.amazon.com/blogs/aws/now-available-aws-sdk-for-python-3-boto3/

1a) Added unit tests for the S3 DataSink code. Note: these were written to fit the nipype test suite as it stands now. In the future, however, we would like to set up unit tests for added code using the standard Python unittest TestCase format (e.g. https://github.com/FCP-INDI/C-PAC/blob/0.4.0_development/test/unit/nipype/datasink_test.py)
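
For illustration, here is a minimal usage sketch of the S3-enabled DataSink described above. The input names (creds_path, encrypt_bucket_keys, local_copy) and all paths/bucket names are taken from this PR's description and are assumptions for illustration, not a definitive example of the final API.

from nipype.interfaces.io import DataSink

ds = DataSink()
# Prepending 's3://' to the base directory routes output to the bucket.
ds.inputs.base_directory = 's3://my-bucket/outputs'
# Local AWS credentials file (hypothetical path).
ds.inputs.creds_path = '/home/user/aws_creds.csv'
# Optional server-side encryption of the uploaded keys.
ds.inputs.encrypt_bucket_keys = True
# Also keep a copy of the outputs on local disk.
ds.inputs.local_copy = '/home/user/local_outputs'
# Dynamic output field, as with the regular DataSink.
ds.inputs.structural = 'structural.nii'
ds.run()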

carolFrohlich and others added 30 commits September 29, 2015 14:15
… handle uploading data to S3 by including "s3://bucket_name/.." in the base_directory, passes all unittests in https://github.com/FCP-INDI/C-PAC/blob/test_dev/test/unit/nipype/datasink_test.py
Added md5 checking before uploading to S3
Merge fcp-indi nipype to mine
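
One of the commits above adds MD5 checking before uploading to S3. As a rough sketch of that idea (not necessarily this PR's exact implementation), boto3 can compare a local file's MD5 against the existing object's ETag and skip the upload when they match; note that the ETag equals the MD5 only for single-part uploads.

import hashlib

import boto3
from botocore.exceptions import ClientError

def upload_if_changed(local_path, bucket, key):
    """Upload local_path to s3://bucket/key only if its MD5 differs
    from the existing object's ETag (single-part uploads only)."""
    s3 = boto3.client('s3')
    with open(local_path, 'rb') as f:
        local_md5 = hashlib.md5(f.read()).hexdigest()
    try:
        remote_etag = s3.head_object(Bucket=bucket, Key=key)['ETag'].strip('"')
    except ClientError:
        remote_etag = None  # object does not exist yet
    if local_md5 != remote_etag:
        s3.upload_file(local_path, bucket, key)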
@chrisgorgo
Member

@shariqiqbal2810 and @jason-wg could you have a look at this PR?

I'm especially interested to know if this PR has the same functionality as S3DataSink. It would be nice to have only one way of storing files on S3 in Nipype. Maybe we can merge the two?

@shariqiqbal2810
Contributor

@chrisfilo Just had a look. It seems to be essentially the same, but with the benefit of using boto3 and being more fleshed out and better tested. One thing I would change is to add the ability to use AWS credentials stored as environment variables instead of defaulting to an anonymous connection when a credentials file isn't specified. Also, for the sake of consistency, the S3DataGrabber should probably be merged into the regular DataGrabber and made to use boto3.

@pintohutch
Contributor Author

I'm trying to unit test the code, but I've noticed I'm now missing some new dependencies introduced into the master branch of nipype (e.g. futures, prov). Can you provide a list of the new Python packages I need to install for the latest nipype to work?

@chrisgorgo
Member

@pintohutch changed the title from "DataSink S3 support and ResourceMultiProc plugin" to "DataSink S3 support" on Jan 14, 2016
@jason-wg
Contributor

@chrisfilo
I just tried a few test runs using the updated io.py from this PR; the workflows I ran used its S3DataSink and S3DataGrabber.
Regarding S3DataGrabber, there’s one problem that I ran into.
Here’s the code of interest:

from nipype.interfaces import io

dg = io.S3DataGrabber()
dg.inputs.bucket = 'bucketName'
dg.inputs.sort_filelist = False
dg.inputs.local_directory = './testfolder/'
dg.inputs.bucket_path = 'bucketPath/'
dg.inputs.template = 'data.nii.gz'
result = dg.run()
myData = result.outputs.outfiles

When I use the code from this PR, myData is set to:
bucketPath/data.nii.gz
However, I expect myData to be set to:
./testfolder/data.nii.gz
(Running the same workflow with the code from nipype master produces the expected result.)

One workflow that only used S3DataSink worked well, so the S3DataSink functionality seems good.

@chrisgorgo
Member

Thanks for looking into this @jason-wg. This behavior is indeed surprising, because in principle this PR does not change S3DataGrabber; it modifies DataGrabber (the parent class of S3DataGrabber) to add support for S3. Maybe something that got inherited is causing this behavior.

In terms of moving forward I would propose the following:

  • Add support for AWS credentials stored as environment variables to DataSink.
  • Make sure DataSink passes all of the S3DataSink tests.
  • Remove S3DataSink (we never released a version with it, so we don't need to deprecate it).

What do you think? (It would be good to hear @satra's opinion as well.) As for doing the analogous thing for the DataGrabber, I would leave that to another PR.

@jason-wg
Contributor

@chrisfilo
That sounds good to me.

I also think it would be helpful to provide documentation on how to use the new features after all this is updated. Sample code that runs a workflow/uses a feature is really useful.
Additionally, a guide to setting up AWS credentials (either using a .boto file, or using environment variables) would be beneficial to users who are new to AWS.

@chrisgorgo
Member

@jason-wg Of course! Now that we are clearer on the feature set, we can add the much-needed documentation. A dedicated page on using Nipype on AWS might be useful. S3-related notes could go there for now, and it can grow in the future.

@pintohutch
Contributor Author

We typically don't use the DataGrabber or S3DataGrabber in our workflows, so we just updated and merged the DataSink class to support S3. It looks like all the tests are passing now.

@chrisgorgo
Member

Agreed - let's leave the DataGrabber to a separate PR. Maybe @shariqiqbal2810 or @jason-wg could work on this - it should not be hard to integrate S3DataGrabber into DataGrabber.

We still need documentation for this PR so new users will be able to pick up the new functionality. Also, did you copy the support for credentials stored in env vars from S3DataGrabber?

@pintohutch
Contributor Author

We have enabled using env vars for credentials here: https://github.com/FCP-INDI/nipype/blob/s3_datasink/nipype/interfaces/io.py#L479-L480
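
From the user's side, that looks roughly like the sketch below. AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are the standard AWS variable names that boto3's default credential chain reads; whether the DataSink checks exactly these names is defined by the linked lines, not by this sketch.

import os

# Standard AWS credential environment variables; boto3 picks these up
# automatically when no explicit credentials are supplied.
os.environ['AWS_ACCESS_KEY_ID'] = '<access-key-id>'
os.environ['AWS_SECRET_ACCESS_KEY'] = '<secret-access-key>'

# ...then build and run the workflow with the S3 DataSink as usual,
# without setting creds_path.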

Documentation for AWS credentials and S3 compatibility in the new DataSink is a good idea. Where should this be written?

@chrisgorgo
Member

For the docs I would propose a new page: "Using Nipype with Amazon Web Services (AWS)". Maybe we can rapidly put some content together using Google Docs? https://docs.google.com/document/d/18ZzTCfhyuZxUGYrSfKaflrvuyTQSnhU19jV4Ycnlo3g/edit?usp=sharing

@@ -233,9 +298,12 @@ class DataSink(IOBase):
>>> ds.run() # doctest: +SKIP
Member


It would be good to add an S3 example here (even if it would be just with +SKIP).
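
Something along these lines could work, mirroring the style of the existing docstring examples; the bucket name is purely illustrative and the snippet assumes the S3 inputs added by this PR:

>>> ds = DataSink()
>>> ds.inputs.base_directory = 's3://my-bucket/outputs'
>>> ds.inputs.structural = 'structural.nii'
>>> ds.run()  # doctest: +SKIP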

@chrisgorgo
Member

I have added some text to the document, but since I have not used this feature it might be easier for you to fill it in.

@pintohutch
Contributor Author

Hey Chris, thanks for setting this up! I have added my description of the new DataSink as it pertains to S3. Let me know your thoughts.

@chrisgorgo
Member

Looks great! Thank you so much for adding this. Could you also briefly describe how to use credentials stored as env variables?


@pintohutch
Contributor Author

Oh, I forgot! I've just added the env vars section and also gave examples of how the creds files should be formatted (i.e. as they are by default when the user first downloads them from AWS); a rough sketch of that format is below.
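
For context, here is a sketch of what reading such a file might look like. The column headers in the comment are an assumption about the CSV the AWS console provided at the time, so check an actual downloaded file before relying on them.

import csv

# Assumed layout of the credentials CSV downloaded from the AWS console
# (header wording may differ between console versions):
#   User Name,Access Key Id,Secret Access Key
#   "my-user",AKIA...,wJalr...
with open('/home/user/aws_creds.csv') as creds_file:
    row = next(csv.DictReader(creds_file))
access_key_id = row['Access Key Id']
secret_access_key = row['Secret Access Key']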

@chrisgorgo
Member

I made some small stylistic changes. Unless @jason-wg or @shariqiqbal2810 have anything to add, you should convert it to .rst and include it in this PR. It should then be ready to merge. Thanks for putting this together!

@pintohutch
Contributor Author

Sure! Where should the rst go? Should it be inside another rst or its own file? If it gets its own file, what should the filename be? I really have no preference on any of these, so I'd rather get your input so as not to throw off the documentation organization/hierarchy.

@chrisgorgo
Member

I would make it its own thing. Call it aws.rst and link it from the table of contents. You can model it after the page describing MapNodes and iterables. This is a great help!


…it a local variable; pickle is not able to pickle the Bucket object. Functionally, the DataSink is the same
@chrisgorgo
Member

chrisgorgo added a commit that referenced this pull request Feb 3, 2016
@chrisgorgo chrisgorgo merged commit 1f49391 into nipy:master Feb 3, 2016
@jason-wg jason-wg mentioned this pull request Apr 4, 2016