
[SPARK-787] Add S3 configuration parameters to the EC2 deploy scripts #1120


Closed
wants to merge 4 commits

Conversation

danosipov
Contributor

When deploying to AWS, additional configuration is required to read files from S3. EMR creates this configuration automatically, and there is no reason the Spark EC2 script shouldn't do the same.

This PR requires a corresponding PR to mesos/spark-ec2 to be merged, since that repository gets cloned while the machines are being set up: mesos/spark-ec2#58

@AmplabJenkins

Can one of the admins verify this patch?

@pwendell
Contributor

This seems like a good idea; sorry it hasn't been looked at in a while. We can try to review it after the 1.1 code freeze.

@danosipov
Contributor Author

BTW, this resolves SPARK-787

@nchammas
Contributor

nchammas commented Sep 3, 2014

@danosipov Could you edit the title of this PR to include "[SPARK-787]" at the beginning? That will bring it into alignment with the style we are using for other PRs.

This is a neat idea. Do we need to put those AWS configs in the appropriate core-site.xml file for Spark to pick them up?

I gather the goal of this PR is to have EC2 clusters set up so that you can run sc.textFile("s3n://...") without providing AWS credentials inline, right?
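For context, the inline-credentials route this PR is meant to make unnecessary usually looks something like the following in a PySpark shell. This is a sketch: the key values are placeholders, and setting the fs.s3n.* properties through sc._jsc.hadoopConfiguration() is just one common way of supplying them.

```python
# Manual setup today: hand the s3n filesystem its credentials before reading.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", "<your-access-key-id>")
hadoop_conf.set("fs.s3n.awsSecretAccessKey", "<your-secret-access-key>")
```

A core-site.xml generated at cluster-launch time would presumably carry the same two properties, which is what makes the manual step unnecessary.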

@danosipov
Contributor Author

@nchammas Sure thing, I'll repush.

That's correct; this gives out-of-the-box support for s3n:// URLs on EC2 clusters started with the spark_ec2 script, making for one less thing you have to set up manually.
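Concretely, the intent is that on a cluster launched with the patched script a plain read just works, with no credential plumbing in the shell (the bucket path below is a placeholder):

```python
# On the cluster's PySpark shell, with credentials already templated into the
# Hadoop configuration by the deploy scripts:
sc.textFile("s3n://some-bucket/some/path/*.gz").count()
```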

@nchammas
Contributor

nchammas commented Sep 3, 2014

OK that's sweet!

I'll repush.

Oh, I think you can edit the PR title in-place here, no?

@danosipov changed the title from "Add S3 configuration parameters to the EC2 deploy scripts" to "[SPARK-787] Add S3 configuration parameters to the EC2 deploy scripts" Sep 3, 2014
@danosipov
Contributor Author

Oh, I thought you meant the commit message… I can repush that one for better history.

@nchammas
Contributor

nchammas commented Sep 3, 2014

I think @asfgit squashes all the commits in a PR when it gets merged and the PR text is what goes in the main commit message, so there's probably no need to repush unless you really want to. Here's an example.

@nchammas
Contributor

nchammas commented Sep 3, 2014

Hmm, so I applied this patch and launched a new EC2 cluster with my patched spark_ec2.py. My AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID environment variables are set on my local machine.

I logged in to the launched EC2 cluster and opened up the PySpark shell. I tried a simple sc.textFile("s3n://...").count() but got an error about needing to provide my AWS credentials.

Did I misunderstand what this patch is supposed to do? How can I confirm it is working as expected?

@nchammas
Contributor

nchammas commented Sep 3, 2014

This PR requires a corresponding PR to mesos/spark-ec2 to be merged, since that repository gets cloned while the machines are being set up: mesos/spark-ec2#58

Ah, maybe it's related to this comment? Though I didn't know we had a dependency on some stuff in the Mesos project...

@danosipov
Contributor Author

Ah, maybe it's related to this comment

Yes - for a quick test you can change this line to pull from https://github.com/danosipov/spark-ec2
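A hypothetical, self-contained illustration of the swap being suggested: the setup step clones the spark-ec2 repository onto the master, so pointing that clone at the fork picks up the companion change. The command below is only a stand-in for the actual clone line in spark_ec2.py; the repository URLs come from the discussion above.

```python
import subprocess

SPARK_EC2_REPO = "https://github.com/mesos/spark-ec2.git"        # default upstream repo
# SPARK_EC2_REPO = "https://github.com/danosipov/spark-ec2.git"  # fork with the companion change

# Stand-in for the clone that spark_ec2.py performs while setting up the cluster.
subprocess.check_call(["git", "clone", SPARK_EC2_REPO, "spark-ec2"])
```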

@JoshRosen
Contributor

@nchammas I think the name of the mesos/spark-ec2 repo is a holdover from when Spark was hosted in the mesos/spark repo.

@nchammas
Contributor

nchammas commented Sep 4, 2014

Yes - for a quick test you can change this line to pull from https://github.com/danosipov/spark-ec2

OK, tried this out and it works! I created an EC2 cluster and was able to access files on S3 without any additional setup.

PR looks good to me.

@SparkQA

SparkQA commented Sep 5, 2014

Can one of the admins verify this patch?

@nchammas
Contributor

nchammas commented Sep 6, 2014

@JoshRosen or @pwendell - Could one of you ask Jenkins to test this now that he's back?

@JoshRosen
Contributor

Jenkins, this is ok to test.

@nchammas
Contributor

nchammas commented Sep 6, 2014

Jenkins, retest this please.

@nchammas
Contributor

nchammas commented Sep 6, 2014

Jenkins, retest this please.

@SparkQA

SparkQA commented Sep 6, 2014

QA tests have started for PR 1120 at commit 41fd938.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 6, 2014

QA tests have finished for PR 1120 at commit 41fd938.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

Jenkins, retest this please.

@SparkQA

SparkQA commented Sep 6, 2014

QA tests have started for PR 1120 at commit 41fd938.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 6, 2014

QA tests have finished for PR 1120 at commit 41fd938.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nchammas
Contributor

nchammas commented Sep 7, 2014

Since this has been tested (both by hand and by Jenkins), I think the only thing that remains is perhaps to add a note in ec2-scripts.md about this feature, as well as an additional note about how to override it with different credentials if so desired. After that, I think this can be merged in.

@JoshRosen Does that sound good to you?

@JoshRosen
Contributor

Sounds good to me. Were you thinking of adding these instructions under the Accessing Data in S3 section?

@nchammas
Contributor

nchammas commented Sep 8, 2014

That's what I was thinking at first, but then I noticed under "Before You Start" there already appear to be instructions about this:

Whenever you want to use the spark-ec2 script, set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to your Amazon EC2 access key ID and secret access key. These can be obtained from the AWS homepage by clicking Account > Security Credentials > Access Credentials.

Perhaps we just need to clarify here how these environment variables will be used. I'm guessing the original intention was to have them work as implemented in this PR.

@JoshRosen
Contributor

I think that boto will read its credentials from those environment variables or from configuration files, so the environment variables aren't necessarily the only way to set these keys. What will happen if I use the config files instead? Do we want to allow users to use separate launch vs. S3 credentials, or should we always copy the AWS credentials used to launch the cluster, irrespective of whether they were supplied via environment variables, command-line flags, or boto configs?

Also, just to be paranoid, is there any security concern with implicitly copying the launch credentials to the cluster? Users who read from S3 still end up embedding credentials into their programs / environment variables / launch commands on the server, so I don't know whether this is a big reduction in security (except that now users who don't read from S3 will also end up with these credentials on the server).

@danosipov
Contributor Author

@JoshRosen Yes, the credentials could be supplied in boto-specific config files, but if they're not there, the spark_ec2 script will look in the environment variables and error out with a message; see https://github.com/apache/spark/blob/master/ec2/spark_ec2.py#L174
So this PR will not work for those who keep their credentials in boto configs, but at least it doesn't break anything.
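For reference, the check being described behaves roughly as sketched below. This is a paraphrase of the behavior, not the code at the linked line; the config-file paths and message text are illustrative.

```python
import os
import sys

# If no boto config file is present, insist on the environment variables and
# bail out with a clear message when they are missing.
has_boto_config = any(os.path.isfile(p)
                      for p in (os.path.expanduser("~/.boto"), "/etc/boto.cfg"))
if not has_boto_config:
    for var in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
        if os.getenv(var) is None:
            sys.stderr.write("ERROR: The environment variable %s must be set\n" % var)
            sys.exit(1)
```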

As far as security goes, EMR uses essentially the same method for transferring credentials to the cluster nodes, so I don't think there are any additional concerns.

@danosipov
Contributor Author

Actually, it looks like it can work for all cases by getting the credentials out of boto:

conn.aws_access_key_id
conn.aws_secret_access_key

Let me make the change.
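A minimal sketch of that idea, assuming boto 2's EC2 API; the attribute names come from the comment above, while the region and the print calls are only for illustration:

```python
from boto import ec2

# However the credentials were supplied (environment variables, boto config
# files, ...), boto exposes the resolved values on the live connection object.
conn = ec2.connect_to_region("us-east-1")
print(conn.aws_access_key_id)
print(conn.aws_secret_access_key)
```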

@danosipov
Contributor Author

Looks like I screwed up a rebase. Should I open a new PR?

@SparkQA

SparkQA commented Sep 8, 2014

QA tests have started for PR 1120 at commit 07c599c.

  • This patch merges cleanly.

@JoshRosen
Contributor

You could open a new PR, but this one is probably salvageable through some Git magic. One thing you could try is to undo your rebase using git-reflog, make a new branch off of this one, switch back to this branch and hard-reset it to master, then cherry-pick the commits from your old copy of the branch into this one.

@danosipov
Contributor Author

Thanks! Salvaged...

@nchammas
Contributor

nchammas commented Sep 8, 2014

There is probably a use case for wanting different keys for accessing S3 than the ones you use to launch EC2 instances, but perhaps it's fine for starters to have them be the same and open a separate PR in the future to make it possible for them to differ.

Also, just to be paranoid, is there any security concern with implicitly copying the launch credentials to the cluster?

Probably not directly since, as Dan pointed out, I think EMR does the same. But we do have an existing security problem reported in SPARK-2528, which is that spark-ec2 opens access to SSH from 0.0.0.0/0 by default.

@SparkQA

SparkQA commented Sep 8, 2014

QA tests have finished for PR 1120 at commit 07c599c.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • logDebug("isMulticlass = " + metadata.isMulticlass)
    • logDebug("isMulticlass = " + metadata.isMulticlass)

@SparkQA

SparkQA commented Sep 8, 2014

QA tests have started for PR 1120 at commit 7e0da26.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 8, 2014

QA tests have finished for PR 1120 at commit 7e0da26.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@danosipov
Contributor Author

Failed tests appear unrelated.

@pwendell
Contributor

From a security perspective, I think one concern is the following. Say a company has a workflow where user A (e.g. an ops person) launches clusters, and users B, C, and D use the Spark clusters with supplied login credentials. With this patch, we'll now copy user A's AWS credentials into a file on the cluster that users B, C, D, etc. can read. And this change will basically be introduced silently. Am I correct in this understanding?

That's IMO too big of a change to introduce silently. But maybe we can guard this with a flag. I do agree this is a super useful feature to have.

Jenkins, test this please.

@danosipov
Contributor Author

@pwendell I've never worked in an organization that had such a setup for AWS operations; every user had their own credentials with IAM authorizations. But I do see your point. This functionality can be gated behind a spark_ec2 parameter. Let me know if that's how you'd like to proceed.

@pwendell
Contributor

Hey @danosipov - after speaking with @JoshRosen a bit offline we felt that credential copying was too big of a change to introduce silently. Could you add a --copy-aws-credentials flag to the script to enable this? Also it would be good to make sure it's handled gracefully if the flag is disabled.
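A rough sketch of what such a flag could look like. The option name comes from the comment above; the OptionParser style mirrors how spark_ec2.py defines its other options, but the template-variable names and the empty-string fallback are illustrative, not the merged diff.

```python
from optparse import OptionParser
from boto import ec2

parser = OptionParser()
parser.add_option(
    "--copy-aws-credentials", action="store_true", default=False,
    help="Add AWS credentials to the cluster's Hadoop configuration so Spark can read S3")
(opts, args) = parser.parse_args()

# The same boto connection used to launch the cluster carries the resolved credentials.
conn = ec2.connect_to_region("us-east-1")

template_vars = {}
if opts.copy_aws_credentials:
    template_vars["aws_access_key_id"] = conn.aws_access_key_id
    template_vars["aws_secret_access_key"] = conn.aws_secret_access_key
else:
    # Graceful fallback when the flag is not passed: substitute empty strings
    # so the deploy templates still render.
    template_vars["aws_access_key_id"] = ""
    template_vars["aws_secret_access_key"] = ""
```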

@danosipov
Contributor Author

@pwendell Done, take a look. This should probably be documented on this page https://github.com/apache/spark/blob/master/docs/ec2-scripts.md#accessing-data-in-s3

@SparkQA

SparkQA commented Sep 15, 2014

QA tests have started for PR 1120 at commit 71fab14.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 15, 2014

QA tests have finished for PR 1120 at commit 71fab14.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@pwendell
Contributor

@danosipov Yeah, do you want to just add the docs in this patch? Our docs are versioned in the spark repo under docs/. I'd just add one sentence that says you can pass the flag to automatically transfer credentials.

LGTM pending docs.

@danosipov
Contributor Author

Added

@SparkQA

SparkQA commented Sep 16, 2014

QA tests have started for PR 1120 at commit 758da8b.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 16, 2014

QA tests have finished for PR 1120 at commit 758da8b.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ArrayConstructor extends net.razorvine.pickle.objects.ArrayConstructor
    • class NonASCIICharacterChecker extends ScalariformChecker
    • class SCCallSiteSync(object):

@pwendell
Contributor

Okay thanks, I'm merging this.

@asfgit closed this in b201712 Sep 16, 2014
@danosipov
Contributor Author

@pwendell Can you also merge mesos/spark-ec2#58? Otherwise this feature will not take effect.
