[SPARK-787] Add S3 configuration parameters to the EC2 deploy scripts #1120
Conversation
Can one of the admins verify this patch?
This seems like a good idea, sorry this hasn't been looked at in a while. We can try to review it after the 1.1 code freeze.
BTW, this resolves SPARK-787.
@danosipov Could you edit the title of this PR to include "[SPARK-787]" at the beginning? That will bring it into alignment with the style we are using for other PRs. This is a neat idea. Do we need to put those AWS configs in the appropriate config file? I gather the goal of this PR is to have EC2 clusters set up so that you can run against s3n:// URLs out of the box?
@nchammas Sure thing, I'll repush. That's correct, this gives native support out of the box for s3n:// URLs on EC2 clusters (started with the spark_ec2 script), making for one less thing one has to set up manually.
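Concretely, out-of-the-box s3n:// support presumably means the deploy scripts end up templating the credentials into the cluster's Hadoop configuration. The property names below are Hadoop's standard s3n credential keys; the values are placeholders, and the exact file layout on the cluster is an assumption:

```xml
<!-- Sketch of core-site.xml entries on a launched cluster (values are placeholders) -->
<configuration>
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>AKIAEXAMPLE</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>examplesecret</value>
  </property>
</configuration>
```

With these properties in place, `sc.textFile("s3n://bucket/path")` works without passing credentials in the URL.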
OK that's sweet!
Oh, I think you can edit the PR title in-place here, no?
Oh, I thought you meant the commit message… I can repush that one for better history.
I think @asfgit squashes all the commits in a PR when it gets merged and the PR text is what goes in the main commit message, so there's probably no need to repush unless you really want to. Here's an example.
Hmm, so I applied this patch and launched a new EC2 cluster with my patched spark_ec2 script. I logged in to the launched EC2 cluster and opened up the PySpark shell. I tried a simple read of an s3n:// file, and it didn't appear to work. Did I misunderstand what this patch is supposed to do? How can I confirm it is working as expected?
Ah, maybe it's related to this comment? Though I didn't know we had a dependency on some stuff in the Mesos project...
Yes - for a quick test you can change this line to pull from https://github.com/danosipov/spark-ec2
@nchammas I think the name of the …
OK, tried this out and it works! I created an EC2 cluster and was able to access files on S3 without any additional setup. PR looks good to me.
Can one of the admins verify this patch?
@JoshRosen or @pwendell - Could one of you ask Jenkins to test this now that he's back?
Jenkins, this is ok to test.
Jenkins, retest this please.
Jenkins, retest this please.
QA tests have started for PR 1120 at commit …
QA tests have finished for PR 1120 at commit …
Jenkins, retest this please.
QA tests have started for PR 1120 at commit …
QA tests have finished for PR 1120 at commit …
Since this has been tested (both by hand and by Jenkins), I think the only thing that remains is perhaps to add a note in ec2-scripts.md about this feature, as well as an additional note about how to override it with different credentials if so desired. After that, I think this can be merged in. @JoshRosen Does that sound good to you?
Sounds good to me. Were you thinking of adding these instructions under the Accessing Data in S3 section?
That's what I was thinking at first, but then I noticed under "Before You Start" there already appear to be instructions about this:
Perhaps we just need to clarify here how these environment variables will be used. I'm guessing the original intention was to have them work as implemented in this PR. |
I think that … Also, just to be paranoid, is there any security concern with implicitly copying the launch credentials to the cluster? Users who read from S3 still end up embedding credentials into their programs / environment variables / launch commands on the server, so I don't know whether this is a big reduction in security (except now users who don't read from S3 will end up with these credentials on the server).
@JoshRosen Yes, the credentials could be supplied in boto-specific config files, but if they're not there the spark_ec2 script will look in the env variables and error out with a message - see https://github.com/apache/spark/blob/master/ec2/spark_ec2.py#L174 As far as security, EMR uses essentially the same method for transferring credentials to the cluster nodes, so I don't think there are any additional concerns.
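The env-var fallback described here can be sketched in plain Python. This is a simplified, hypothetical stand-in for the check in spark_ec2.py, not the actual code; the function name is made up for illustration:

```python
import os
import sys


def get_aws_credentials(env=None):
    """Return (access_key, secret_key) from the environment, mirroring
    the env-var lookup described above. Hypothetical helper, not the
    real spark_ec2.py implementation."""
    env = os.environ if env is None else env
    try:
        return env["AWS_ACCESS_KEY_ID"], env["AWS_SECRET_ACCESS_KEY"]
    except KeyError as e:
        # spark_ec2 similarly errors out with a message when a variable is unset
        sys.exit("ERROR: The environment variable %s must be set" % e.args[0])
```

Passing a dict instead of reading `os.environ` makes the lookup easy to exercise in isolation.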
Actually it looks like it can work for all cases by getting the credentials out of boto:
Let me make the change. |
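A sketch of what "getting the credentials out of boto" can look like: a boto EC2 connection exposes the credentials it resolved (from env vars, a ~/.boto config file, etc.) as attributes. Here a stub class stands in for a real boto connection so the idea is self-contained; the attribute names match boto's connection objects, everything else is hypothetical:

```python
class FakeBotoConnection:
    """Stub for a boto EC2 connection; boto resolves these attributes from
    whichever credential source it finds (env vars, ~/.boto, and so on)."""
    aws_access_key_id = "AKIAEXAMPLE"
    aws_secret_access_key = "examplesecret"


def credential_template_vars(conn):
    # Read the resolved credentials off the connection instead of
    # requiring the environment variables directly.
    return {
        "aws_access_key_id": conn.aws_access_key_id,
        "aws_secret_access_key": conn.aws_secret_access_key,
    }
```

This covers users whose credentials live only in boto config files and never appear as environment variables.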
Looks like I screwed up a rebase. Should I open a new PR?
QA tests have started for PR 1120 at commit …
You could open a new PR, but this one is probably salvageable through some Git magic. One thing you could try is to undo your rebase using git-reflog, make a new branch off of this one, switch back to this branch and hard-reset it to master, then cherry-pick the commits from your old copy of the branch into this one.
Force-pushed from 07c599c to 7e0da26.
Thanks! Salvaged...
There is probably a use case for wanting separate keys for accessing S3 from the ones you use to launch EC2 instances, but perhaps it's fine for starters to have them be the same and open a separate PR in the future to make it possible for them to be different.
Probably not directly since, as Dan pointed out, I think EMR does the same. But we do have an existing security problem reported in SPARK-2528, which is that …
QA tests have finished for PR 1120 at commit …
QA tests have started for PR 1120 at commit …
QA tests have finished for PR 1120 at commit …
Failed tests appear unrelated.
From a security perspective I think one concern is the following. Say a company has a workflow where user A launches clusters (e.g. an ops person) and users B, C, and D use the Spark clusters with supplied login credentials. With this patch, we'll now copy user A's AWS credentials into a file on the cluster that users B, C, D, etc. can read. And this change will basically be silently introduced. Am I correct in this understanding? That's IMO too big of a change to introduce silently. But maybe we can guard this with a flag. I do agree this is a super useful feature to have. Jenkins, test this please.
@pwendell I've never worked in an organization that had such a setup for AWS operations - every user had their own credentials with IAM authorizations. But I do see your point. This functionality can be enabled by a spark_ec2 parameter. Let me know if that's how you'd like to proceed.
Hey @danosipov - after speaking with @JoshRosen a bit offline we felt that credential copying was too big of a change to introduce silently. Could you add a flag so that it is opt-in?
@pwendell Done, take a look. This should probably be documented on this page https://github.com/apache/spark/blob/master/docs/ec2-scripts.md#accessing-data-in-s3
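An opt-in flag like the one requested could look like this with optparse, which spark_ec2.py uses for its options; the flag name and help text here are illustrative:

```python
from optparse import OptionParser

parser = OptionParser()
# Off by default, so launch credentials are only copied to the
# cluster when the user explicitly opts in.
parser.add_option(
    "--copy-aws-credentials", action="store_true", default=False,
    help="Add AWS credentials to the cluster's Hadoop configuration "
         "(allows S3 access without further setup)")

opts, args = parser.parse_args(["--copy-aws-credentials"])
```

The `default=False` is what makes the behavior change non-silent: nothing is copied unless the flag is passed.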
QA tests have started for PR 1120 at commit …
QA tests have finished for PR 1120 at commit …
@danosipov Yeah, do you want to just add the docs in this patch? Our docs are versioned in the spark repo under docs/. LGTM pending docs.
Added …
QA tests have started for PR 1120 at commit …
QA tests have finished for PR 1120 at commit …
Okay thanks, I'm merging this.
@pwendell Can you also merge mesos/spark-ec2#58; otherwise this feature will not take effect.
When deploying to AWS, additional configuration is required to read S3 files. EMR creates it automatically; there is no reason the Spark EC2 script shouldn't as well.
This PR requires a corresponding PR to the mesos/spark-ec2 to be merged, as it gets cloned in the process of setting up machines: mesos/spark-ec2#58