DataSink S3 support #1316
Conversation
… handle uploading data to S3 by including "s3://bucket_name/.." in the base_directory, passes all unittests in https://github.com/FCP-INDI/C-PAC/blob/test_dev/test/unit/nipype/datasink_test.py
add ResourceMultiprocPlugin
Pull in multiproc
Added md5 checking before uploading to S3 (a sketch of this follows the commit list below)
Update io.py
Merge fcp-indi nipype to mine
Fix divide by 0 bug
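For illustration, the md5 check mentioned in the commit list might look something like this sketch. It is not the PR's actual implementation, `upload_if_changed` is a hypothetical name, and it assumes single-part uploads (where the S3 ETag equals the MD5 of the object body):

```python
import hashlib

import boto3
from botocore.exceptions import ClientError

def upload_if_changed(bucket_name, key, local_path):  # hypothetical helper
    """Upload local_path to s3://bucket_name/key unless the content is unchanged."""
    s3 = boto3.client('s3')
    with open(local_path, 'rb') as f:
        local_md5 = hashlib.md5(f.read()).hexdigest()
    try:
        # For single-part uploads the ETag is the MD5 of the object body
        remote_etag = s3.head_object(Bucket=bucket_name, Key=key)['ETag'].strip('"')
    except ClientError:
        remote_etag = None  # object does not exist yet
    if remote_etag == local_md5:
        return False  # identical content already on S3; skip the upload
    s3.upload_file(local_path, bucket_name, key)
    return True
```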
@shariqiqbal2810 and @jason-wg, could you have a look at this PR? I'm especially interested to know whether it has the same functionality as S3DataSink. It would be nice to have only one way of storing files on S3 in Nipype. Maybe we can merge the two?
@chrisfilo Just had a look. It seems to be essentially the same, but with the benefit of using boto3 and being more fleshed out and better tested. One thing I would change is to add the ability to use AWS credentials stored as environment variables instead of defaulting to an anonymous connection when a credentials file isn't specified. Also, for the sake of consistency, the S3DataGrabber should probably be merged into the regular DataGrabber and made to use boto3.
I'm trying to unit test the code, but I've noticed that I'm now missing some new dependencies introduced to the master branch of nipype (e.g. futures, prov). Can you provide a list of the new Python packages I need to install for the latest nipype to work?
@chrisfilo When I use code from this PR, `dg = io.S3DataGrabber()` no longer behaves as expected. One workflow that only used S3DataSink worked well, though, so the S3DataSink functionality seems good.
Thanks for looking into this @jason-wg. This behavior is indeed surprising, because in principle this PR does not change S3DataGrabber; it modifies DataGrabber (a parent class of S3DataGrabber) to add support for S3. Maybe something that got inherited is causing this behaviour. In terms of moving forward I would propose the following:
What do you think? (It would be good to hear @satra's opinion as well.) As for doing the analogous thing to DataSink, I would leave that to another PR.
@chrisfilo I also think it would be helpful to provide documentation on how to use the new features once all this is updated. Sample code that runs a workflow or uses a feature is really useful.
@jason-wg Of course! Now that we are clearer on the feature set, we can add the much-needed documentation. A dedicated page on using Nipype on AWS might be useful; S3-related notes could go there for now, and it can grow in the future.
We typically don't use the DataGrabber or S3DataGrabber in our workflows, so we just updated the DataSink class to support S3 and merged that in. It looks like all the tests are passing now.
Agreed - let's leave the DataGrabber to a separate PR. Maybe @shariqiqbal2810 or @jason-wg could work on this - it should not be hard to integrate S3DataGrabber into DataGrabber. We still need documentation for this PR so new users will be able to pick up the new functionality. Also, did you carry over the support for credentials stored in environment variables from S3DataGrabber?
We have enabled using env vars for credentials here: https://github.com/FCP-INDI/nipype/blob/s3_datasink/nipype/interfaces/io.py#L479-L480. Documentation for AWS credentials and S3 compatibility in the new DataSink is a good idea. Where should this be written?
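For reference, a minimal sketch of that fallback logic (not the PR's exact code; `make_s3_resource` is a hypothetical helper). boto3 picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment automatically, and an unsigned config gives the anonymous connection:

```python
import os

import boto3
from botocore import UNSIGNED
from botocore.client import Config

def make_s3_resource():  # hypothetical helper name
    """Prefer credentials from environment variables; otherwise go anonymous."""
    if os.environ.get('AWS_ACCESS_KEY_ID') and os.environ.get('AWS_SECRET_ACCESS_KEY'):
        # boto3 reads these env vars automatically
        return boto3.resource('s3')
    # No credentials anywhere: fall back to an unsigned (anonymous) connection
    return boto3.resource('s3', config=Config(signature_version=UNSIGNED))
```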
For the docs I would propose a new page: "Using Nipype with Amazon Web Services (AWS)". Maybe we can rapidly put some content together using Google Docs? https://docs.google.com/document/d/18ZzTCfhyuZxUGYrSfKaflrvuyTQSnhU19jV4Ycnlo3g/edit?usp=sharing
@@ -233,9 +298,12 @@ class DataSink(IOBase):
>>> ds.run() # doctest: +SKIP
It would be good to add an S3 example here (even if it would be just with +SKIP).
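A minimal sketch of what such a docstring example could look like (the bucket name and paths are hypothetical, and the call is skipped so doctests don't hit the network):

```python
>>> from nipype.interfaces.io import DataSink
>>> ds = DataSink()
>>> ds.inputs.base_directory = 's3://mybucket/derivatives'  # hypothetical bucket
>>> ds.inputs.container = 'subject_1'
>>> ds.run()  # doctest: +SKIP
```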
I have added some text to the document, but since I have not used this feature it might be easier for you to fill it in.
Hey Chris, thanks for setting this up! I have added my description of the new DataSink as it pertains to S3. Let me know your thoughts.
Looks great! Thank you so much for adding this. Could you also briefly cover the environment variables option and how the creds files should be formatted?
Oh, I forgot! Just added the env vars section and also gave examples of how the creds files should be formatted (i.e. as they are by default when the user first downloads them from AWS).
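As an illustration only, parsing such a file might look like the sketch below. The two-line key=value layout (AWSAccessKeyId=... / AWSSecretKey=...) is an assumption about the download format at the time, and `parse_creds_file` is a hypothetical name:

```python
def parse_creds_file(path):  # hypothetical helper; layout is an assumption
    """Parse a downloaded AWS credentials file of the form
    AWSAccessKeyId=... / AWSSecretKey=... (one entry per line)."""
    creds = {}
    with open(path) as f:
        for line in f:
            if '=' in line:
                key, _, value = line.strip().partition('=')
                creds[key] = value
    return creds.get('AWSAccessKeyId'), creds.get('AWSSecretKey')
```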
I made some small stylistic changes. Unless @jason-wg or @shariqiqbal2810 have anything to add, you should convert it to .rst and include it in this PR. It should be ready to merge then. Thanks for putting this together!
Sure! Where should the rst go? Should it be inside another rst or its own file? If it's its own file, what should the filename be? I really have no preference on any of these, so I'd rather get your input so as not to throw off the documentation organization/hierarchy.
I would make it its own thing. Call it aws.rst and link it from the table of contents.
…it a local variable; pickle is not able to pickle the Bucket object. Functionally, the DataSink is the same
Well done: https://circle-artifacts.com/gh/nipy/nipype/674/artifacts/0/home/ubuntu/nipype/doc/_build/html/users/aws.html Thanks once again. |
1a) Added unit tests for S3 datasink code.
* Note: these were added to fit the nipype code form as it stands now. However, in the future we would like to set up the unit tests for added code per the Python standard unittest TestCase format (e.g. https://github.com/FCP-INDI/C-PAC/blob/0.4.0_development/test/unit/nipype/datasink_test.py); a sketch of that style follows below.
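A minimal sketch of that TestCase style applied to the S3 DataSink (the assertion is illustrative, not taken from the PR's actual tests, and the bucket name is hypothetical):

```python
import unittest

from nipype.interfaces.io import DataSink

class TestDataSinkS3(unittest.TestCase):
    """Illustrative unittest.TestCase-style test for the S3-aware DataSink."""

    def test_s3_base_directory_accepted(self):
        # An s3:// base_directory should be accepted and preserved as-is
        ds = DataSink()
        ds.inputs.base_directory = 's3://mybucket/outputs'  # hypothetical bucket
        self.assertTrue(ds.inputs.base_directory.startswith('s3://'))

if __name__ == '__main__':
    unittest.main()
```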