Commit 1f49391

Merge pull request #1316 from FCP-INDI/s3_datasink

DataSink S3 support

2 parents: 356b028 + c0d148a

File tree: 5 files changed, +715 -274 lines

doc/users/aws.rst

Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@
.. _aws:


============================================
Using Nipype with Amazon Web Services (AWS)
============================================

Several groups have been successfully using Nipype on AWS. This procedure
involves setting up a temporary cluster using StarCluster and potentially
transferring files to/from S3. The latter is supported by Nipype through
DataSink and S3DataGrabber.

Using DataSink with S3
======================

The DataSink class now supports sending output data directly to an AWS S3
bucket. It does this through the introduction of several input attributes to
the DataSink interface and by parsing the ``base_directory`` attribute. This
class uses the `boto3 <https://boto3.readthedocs.org/en/latest/>`_ and
`botocore <https://botocore.readthedocs.org/en/latest/>`_ Python packages to
interact with AWS. To configure the DataSink to write data to S3, the user
must set the ``base_directory`` property to an S3-style filepath. For example:

::

    import nipype.interfaces.io as nio

    ds = nio.DataSink()
    ds.inputs.base_directory = 's3://mybucket/path/to/output/dir'

With the "s3://" prefix in the path, the DataSink knows that the output
29+
directory to send files is on S3 in the bucket "mybucket". "path/to/output/dir"
30+
is the relative directory path within the bucket "mybucket" where output data
31+
will be uploaded to (NOTE: if the relative path specified contains folders that
32+
don’t exist in the bucket, the DataSink will create them). The DataSink treats
33+
the S3 base directory exactly as it would a local directory, maintaining support
34+
for containers, substitutions, subfolders, "." notation, etc to route output
35+
data appropriately.
36+
37+
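For example, routing options such as ``container`` and ``substitutions`` work
with an S3 ``base_directory`` just as they do with a local one. The snippet
below is a minimal sketch; the subject label and the substitution pair are
hypothetical values chosen purely for illustration:

::

    # Outputs will be routed under s3://mybucket/path/to/output/dir/sub001/...
    ds.inputs.container = 'sub001'

    # Rename an awkward iterable-generated folder before upload
    # (the pattern being replaced here is hypothetical)
    ds.inputs.substitutions = [('_subject_id_sub001', 'sub001')]
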
There are four new attributes introduced with S3 compatibility: ``creds_path``,
``encrypt_bucket_keys``, ``local_copy``, and ``bucket``.

::

    ds.inputs.creds_path = '/home/user/aws_creds/credentials.csv'
    ds.inputs.encrypt_bucket_keys = True
    ds.inputs.local_copy = '/home/user/workflow_outputs/local_backup'

``creds_path`` is a file path where the user's AWS credentials file (typically
a csv) is stored. This credentials file should contain the AWS access key id
and secret access key and should be formatted as one of the following (these
formats are how Amazon provides the credentials file by default when first
downloaded).

Root-account user:

::

    AWSAccessKeyID=ABCDEFGHIJKLMNOP
    AWSSecretKey=zyx123wvu456/ABC890+gHiJk

IAM user:

::

    User Name,Access Key Id,Secret Access Key
    "username",ABCDEFGHIJKLMNOP,zyx123wvu456/ABC890+gHiJk

The ``creds_path`` is necessary when writing files to a bucket that has
restricted access (almost no buckets are publicly writable). If ``creds_path``
is not specified, the DataSink will check the ``AWS_ACCESS_KEY_ID`` and
``AWS_SECRET_ACCESS_KEY`` environment variables and use those values for
bucket access.

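As a minimal sketch of the environment-variable route (the key values below
are placeholders copied from the formats above, not real credentials), the
variables can be set before the workflow runs instead of pointing
``creds_path`` at a file:

::

    import os

    # Placeholder credentials for illustration only
    os.environ['AWS_ACCESS_KEY_ID'] = 'ABCDEFGHIJKLMNOP'
    os.environ['AWS_SECRET_ACCESS_KEY'] = 'zyx123wvu456/ABC890+gHiJk'
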
``encrypt_bucket_keys`` is a boolean flag that indicates whether to encrypt
the output data on S3 using server-side AES-256 encryption. This is useful if
the data being output is sensitive and one desires an extra layer of security.
By default, this is turned off.

``local_copy`` is a string giving a filepath where local copies of the output
data are stored in addition to those sent to S3. This is useful if one wants
to keep a backup version of the data on their local computer. By default, this
is turned off.

``bucket`` is a boto3 Bucket object that the user can use to override the
bucket specified in their ``base_directory``. This can be useful if one has to
manually create a bucket instance using special credentials (or a mock server
like `fakes3 <https://github.com/jubos/fake-s3>`_). It is typically used by
developers unit-testing the DataSink class; most users do not need this
attribute for actual workflows. It is an optional argument.

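As a rough sketch of how such a Bucket object might be created and handed to
the DataSink (the fake-s3 endpoint, port, bucket name, and dummy credentials
below are assumptions for illustration, not values required by Nipype):

::

    import boto3

    # Point boto3 at a local fake-s3 server instead of the real AWS endpoint
    s3_resource = boto3.resource(
        's3',
        endpoint_url='http://localhost:4567',
        aws_access_key_id='fake_key',
        aws_secret_access_key='fake_secret',
    )
    bucket = s3_resource.Bucket('mybucket')

    # Override whatever bucket name appears in base_directory
    ds.inputs.bucket = bucket
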
Finally, the user needs only to specify the input attributes for any incoming
data to the node, and the outputs will be written to their S3 bucket.

::

    workflow.connect(inputnode, 'subject_id', ds, 'container')
    workflow.connect(realigner, 'realigned_files', ds, 'motion')

So, for example, the outputs for sub001’s realigned_file1.nii.gz will be in
``s3://mybucket/path/to/output/dir/sub001/motion/realigned_file1.nii.gz``.

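Putting the pieces together, an end-to-end sketch might look like the
following. The workflow name, node names, and the SPM Realign node are
assumptions chosen for illustration; any interface that produces output files
could take the realigner's place:

::

    import nipype.pipeline.engine as pe
    import nipype.interfaces.io as nio
    import nipype.interfaces.spm as spm
    import nipype.interfaces.utility as niu

    wf = pe.Workflow(name='s3_datasink_example')

    # Feeds the subject label and functional images into the pipeline
    inputnode = pe.Node(niu.IdentityInterface(fields=['subject_id', 'func']),
                        name='inputnode')

    realigner = pe.Node(spm.Realign(), name='realigner')

    ds = pe.Node(nio.DataSink(), name='datasink')
    ds.inputs.base_directory = 's3://mybucket/path/to/output/dir'
    ds.inputs.creds_path = '/home/user/aws_creds/credentials.csv'
    ds.inputs.encrypt_bucket_keys = True

    wf.connect(inputnode, 'func', realigner, 'in_files')
    wf.connect(inputnode, 'subject_id', ds, 'container')
    wf.connect(realigner, 'realigned_files', ds, 'motion')
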

Using S3DataGrabber
===================

Coming soon...

doc/users/index.rst

Lines changed: 1 addition & 0 deletions
@@ -38,6 +38,7 @@
    spmmcr
    mipav
    nipypecmd
+   aws