
DataSink S3 support #1316


Merged
merged 55 commits into from
Feb 3, 2016

Conversation

pintohutch
Contributor

  1. Added S3 support to the standard DataSink class: the user can prepend 's3://' to the DataSink's base directory, supply credentials, and optionally enable server-side encryption to send data directly to an S3 bucket (see the sketch after this list). The class also gains a local_copy attribute, which can be used to write to a local directory in addition to S3 if desired. This code uses boto3 rather than the older boto package; the reasons for choosing boto3 are outlined here: https://aws.amazon.com/blogs/aws/now-available-aws-sdk-for-python-3-boto3/

1a) Added unit tests for the S3 DataSink code. Note: these were written to fit the nipype test suite as it stands now. In the future, however, we would like to set up unit tests for added code using the standard Python unittest TestCase format (e.g. https://github.com/FCP-INDI/C-PAC/blob/0.4.0_development/test/unit/nipype/datasink_test.py)
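
For illustration, here is a minimal usage sketch of the S3-enabled DataSink described above. The input names (creds_path, encrypt_bucket_keys, local_copy) and all paths/bucket names are taken from this PR's description and are assumptions for illustration, not a definitive example of the final API.

from nipype.interfaces.io import DataSink

ds = DataSink()
# Prepending 's3://' to the base directory routes output to the bucket.
ds.inputs.base_directory = 's3://my-bucket/outputs'
# Local AWS credentials file (hypothetical path).
ds.inputs.creds_path = '/home/user/aws_creds.csv'
# Optional server-side encryption of the uploaded keys.
ds.inputs.encrypt_bucket_keys = True
# Also keep a copy of the outputs on local disk.
ds.inputs.local_copy = '/home/user/local_outputs'
# Dynamic output field, as with the regular DataSink.
ds.inputs.structural = 'structural.nii'
ds.run()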

carolFrohlich and others added 30 commits September 29, 2015 14:15
… handle uploading data to S3 by including "s3://bucket_name/.." in the base_directory, passes all unittests in https://github.com/FCP-INDI/C-PAC/blob/test_dev/test/unit/nipype/datasink_test.py
Added md5 checking before uploading to S3
Merge fcp-indi nipype to mine
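
One of the commits above adds MD5 checking before uploading to S3. As a rough sketch of that idea (not necessarily this PR's exact implementation), boto3 can compare a local file's MD5 against the existing object's ETag and skip the upload when they match; note that the ETag equals the MD5 only for single-part uploads.

import hashlib

import boto3
from botocore.exceptions import ClientError

def upload_if_changed(local_path, bucket, key):
    """Upload local_path to s3://bucket/key only if its MD5 differs
    from the existing object's ETag (single-part uploads only)."""
    s3 = boto3.client('s3')
    with open(local_path, 'rb') as f:
        local_md5 = hashlib.md5(f.read()).hexdigest()
    try:
        remote_etag = s3.head_object(Bucket=bucket, Key=key)['ETag'].strip('"')
    except ClientError:
        remote_etag = None  # object does not exist yet
    if local_md5 != remote_etag:
        s3.upload_file(local_path, bucket, key)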
@chrisgorgo
Member

@shariqiqbal2810 and @jason-wg could you have a look at this PR?

I'm especially interested to know if this PR has the same functionality as S3DataSink. It would be nice to have only one way of storing files on S3 in Nipype. Maybe we can merge the two?

@shariqiqbal2810
Contributor

@chrisfilo Just had a look. It seems to be essentially the same, but with the benefit of using boto3 and being more fleshed out and better tested. One thing I would change is to add the ability to use AWS credentials stored as environment variables instead of defaulting to an anonymous connection when a credentials file isn't specified. Also, for the sake of consistency, the S3DataGrabber should probably be merged into the regular DataGrabber and made to use boto3.

@pintohutch
Contributor Author

I'm trying to unit test the code, but I've noticed I'm now missing some new dependencies introduced into the master branch of nipype (e.g. futures, prov). Can you provide a list of the new Python packages I need to install for the latest nipype to work?

@chrisgorgo
Member

@pintohutch changed the title from "DataSink S3 support and ResourceMultiProc plugin" to "DataSink S3 support" on Jan 14, 2016
@jason-wg
Contributor

@chrisfilo
I just tried a few test runs using the updated io.py from this PR; the workflows I ran used its S3DataSink and S3DataGrabber.
Regarding S3DataGrabber, there’s one problem that I ran into.
Here’s the code of interest:

from nipype.interfaces import io

dg = io.S3DataGrabber()
dg.inputs.bucket = 'bucketName'
dg.inputs.sort_filelist = False
dg.inputs.local_directory = './testfolder/'
dg.inputs.bucket_path = 'bucketPath/'
dg.inputs.template = 'data.nii.gz'
result = dg.run()
myData = result.outputs.outfiles

When I use the code from this PR, myData is set to:
bucketPath/data.nii.gz
However, I expect myData to be set to:
./testfolder/data.nii.gz
(Running the same workflow with the code from nipype master produces the expected result.)

One workflow that only used S3DataSink worked well, so the S3DataSink functionality seems good.

@chrisgorgo
Member

Thanks for looking into this @jason-wg. This behavior is indeed surprising, because in principle this PR does not change S3DataGrabber; it modifies DataGrabber (the parent class of S3DataGrabber) to add support for S3. Maybe something that got inherited is causing this behavior.

In terms of moving forward I would propose the following:

  • Add support for AWS credentials stored as environment variables to DataSink.
  • Make sure DataSink passes all of the S3DataSink tests.
  • Remove S3DataSink (we never released a version with it, so we don't need to deprecate it).

What do you think? (It would be good to hear @satra's opinion as well.) As for doing the analogous thing for the DataGrabber, I would leave that to another PR.

@jason-wg
Contributor

@chrisfilo
That sounds good to me.

I also think it would be helpful to provide documentation on how to use the new features after all this is updated. Sample code that runs a workflow/uses a feature is really useful.
Additionally, a guide to setting up AWS credentials (either using a .boto file, or using environment variables) would be beneficial to users who are new to AWS.

@chrisgorgo
Member

@jason-wg Of course! Now that we are clearer on the feature set, we can add the much-needed documentation. A dedicated page on using Nipype on AWS might be useful. S3-related notes could go there for now, and it can grow in the future.

@pintohutch
Contributor Author

We typically don't use the DataGrabber or S3DataGrabber in our workflows, so we just updated and merged the DataSink class to support S3. It looks like all the tests are passing now.

@chrisgorgo
Member

Agreed - let's leave the DataGrabber to a separate PR. Maybe @shariqiqbal2810 or @jason-wg could work on this - it should not be hard to integrate S3DataGrabber into DataGrabber.

We still need documentation for this PR so new users will be able to pick up the new functionality. Also, did you copy the support for credentials stored in env vars from S3DataGrabber?

@pintohutch
Contributor Author

We have enabled using env vars for credentials here: https://github.com/FCP-INDI/nipype/blob/s3_datasink/nipype/interfaces/io.py#L479-L480
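
From the user's side, that looks roughly like the sketch below. AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are the standard AWS variable names that boto3's default credential chain reads; whether the DataSink checks exactly these names is defined by the linked lines, not by this sketch.

import os

# Standard AWS credential environment variables; boto3 picks these up
# automatically when no explicit credentials are supplied.
os.environ['AWS_ACCESS_KEY_ID'] = '<access-key-id>'
os.environ['AWS_SECRET_ACCESS_KEY'] = '<secret-access-key>'

# ...then build and run the workflow with the S3 DataSink as usual,
# without setting creds_path.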

Documentation for AWS credentials and S3 compatibility in the new DataSink is a good idea. Where should this be written?

@chrisgorgo
Member

For the docs I would propose a new page: "Using Nipype with Amazon Web Services (AWS)". Maybe we can rapidly put some content together using Google Docs? https://docs.google.com/document/d/18ZzTCfhyuZxUGYrSfKaflrvuyTQSnhU19jV4Ycnlo3g/edit?usp=sharing

@@ -233,9 +298,12 @@ class DataSink(IOBase):
>>> ds.run() # doctest: +SKIP
Member


It would be good to add an S3 example here (even if it would be just with +SKIP).
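
Something along these lines could work, mirroring the style of the existing docstring examples; the bucket name is purely illustrative and the snippet assumes the S3 inputs added by this PR:

>>> ds = DataSink()
>>> ds.inputs.base_directory = 's3://my-bucket/outputs'
>>> ds.inputs.structural = 'structural.nii'
>>> ds.run()  # doctest: +SKIP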

@chrisgorgo
Member

I have added some text to the document, but since I have not used this feature it might be easier for you to fill it in.

@pintohutch
Contributor Author

Hey Chris, thanks for setting this up! I have added my description of the new DataSink as it pertains to S3. Let me know your thoughts.

@chrisgorgo
Member

Looks great! Thank you so much for adding this. Could you also briefly describe how to use credentials stored as env variables?


@pintohutch
Contributor Author

Oh, I forgot! I've just added the env vars section and also gave examples of how the creds files should be formatted (i.e. as they are by default when the user first downloads them from AWS); a rough sketch of that format is below.
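
For context, here is a sketch of what reading such a file might look like. The column headers in the comment are an assumption about the CSV the AWS console provided at the time, so check an actual downloaded file before relying on them.

import csv

# Assumed layout of the credentials CSV downloaded from the AWS console
# (header wording may differ between console versions):
#   User Name,Access Key Id,Secret Access Key
#   "my-user",AKIA...,wJalr...
with open('/home/user/aws_creds.csv') as creds_file:
    row = next(csv.DictReader(creds_file))
access_key_id = row['Access Key Id']
secret_access_key = row['Secret Access Key']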

@chrisgorgo
Member

I made some small stylistic changes. Unless @jason-wg or @shariqiqbal2810 have anything to add, you should convert it to .rst and include it in this PR. It should then be ready to merge. Thanks for putting this together!

@pintohutch
Contributor Author

Sure! Where should the rst go? Should it be inside another rst or its own file? If it gets its own file, what should the filename be? I really have no preference on any of these, so I'd rather get your input so as not to throw off the documentation organization/hierarchy.

@chrisgorgo
Member

I would make it its own thing. Call it aws.rst and link it from the table of contents. You can model it after the page describing MapNodes and iterables. This is a great help!


…it a local variable; pickle is not able to pickle the Bucket object. Functionally, the DataSink is the same
@chrisgorgo
Member

chrisgorgo added a commit that referenced this pull request Feb 3, 2016
@chrisgorgo chrisgorgo merged commit 1f49391 into nipy:master Feb 3, 2016
@jason-wg jason-wg mentioned this pull request Apr 4, 2016