[SPARK-787] Add S3 configuration parameters to the EC2 deploy scripts #1120
Conversation
Can one of the admins verify this patch?
This seems like a good idea, sorry this hasn't been looked at in a while. We can try to review it after the 1.1 code freeze.
BTW, this resolves SPARK-787.
@danosipov Could you edit the title of this PR to include "[SPARK-787]" at the beginning? That will bring it into alignment with the style we are using for other PRs. This is a neat idea. Do we need to put those AWS configs in the appropriate config file? I gather the goal of this PR is to have EC2 clusters set up so that you can run against s3n:// URLs out of the box?
@nchammas Sure thing, I'll repush. That's correct, this gives native support out of the box for s3n:// URLs on EC2 clusters (started with the spark_ec2 script), making for one less thing one has to set up manually.
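Concretely, out-of-the-box s3n:// support presumably means the deploy scripts end up templating the credentials into the cluster's Hadoop configuration. The property names below are Hadoop's standard s3n credential keys; the values are placeholders, and the exact file layout on the cluster is an assumption:

```xml
<!-- Sketch of core-site.xml entries on a launched cluster (values are placeholders) -->
<configuration>
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>AKIAEXAMPLE</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>examplesecret</value>
  </property>
</configuration>
```

With these properties in place, `sc.textFile("s3n://bucket/path")` works without passing credentials in the URL.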
OK that's sweet!
Oh, I think you can edit the PR title in-place here, no?
Oh, I thought you meant the commit message… I can repush that one for better history.
I think @asfgit squashes all the commits in a PR when it gets merged and the PR text is what goes in the main commit message, so there's probably no need to repush unless you really want to. Here's an example.
Hmm, so I applied this patch and launched a new EC2 cluster with my patched spark_ec2 script. I logged in to the launched EC2 cluster and opened up the PySpark shell. I tried a simple read of an s3n:// file, and it didn't appear to work. Did I misunderstand what this patch is supposed to do? How can I confirm it is working as expected?
Ah, maybe it's related to this comment? Though I didn't know we had a dependency on some stuff in the Mesos project...
Yes - for a quick test you can change this line to pull from https://github.com/danosipov/spark-ec2
@nchammas I think the name of the …
OK, tried this out and it works! I created an EC2 cluster and was able to access files on S3 without any additional setup. PR looks good to me.
Can one of the admins verify this patch?
@JoshRosen or @pwendell - Could one of you ask Jenkins to test this now that he's back?
Jenkins, this is ok to test.
Jenkins, retest this please.
Jenkins, retest this please.
QA tests have started for PR 1120 at commit …
QA tests have finished for PR 1120 at commit …
Jenkins, retest this please.
QA tests have started for PR 1120 at commit …
QA tests have finished for PR 1120 at commit …
Since this has been tested (both by hand and by Jenkins), I think the only thing that remains is perhaps to add a note in ec2-scripts.md about this feature, as well as an additional note about how to override it with different credentials if so desired. After that, I think this can be merged in. @JoshRosen Does that sound good to you?
Sounds good to me. Were you thinking of adding these instructions under the Accessing Data in S3 section?
That's what I was thinking at first, but then I noticed under "Before You Start" there already appear to be instructions about this:
Perhaps we just need to clarify here how these environment variables will be used. I'm guessing the original intention was to have them work as implemented in this PR. |
I think that … Also, just to be paranoid, is there any security concern with implicitly copying the launch credentials to the cluster? Users who read from S3 still end up embedding credentials into their programs / environment variables / launch commands on the server, so I don't know whether this is a big reduction in security (except now users who don't read from S3 will end up with these credentials on the server).
@JoshRosen Yes, the credentials could be supplied in boto-specific config files, but if they're not there the spark_ec2 script will look in the env variables and error out with a message - see https://github.com/apache/spark/blob/master/ec2/spark_ec2.py#L174 As far as security, EMR uses essentially the same method for transferring credentials to the cluster nodes, so I don't think there are any additional concerns.
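The env-var fallback described here can be sketched in plain Python. This is a simplified, hypothetical stand-in for the check in spark_ec2.py, not the actual code; the function name is made up for illustration:

```python
import os
import sys


def get_aws_credentials(env=None):
    """Return (access_key, secret_key) from the environment, mirroring
    the env-var lookup described above. Hypothetical helper, not the
    real spark_ec2.py implementation."""
    env = os.environ if env is None else env
    try:
        return env["AWS_ACCESS_KEY_ID"], env["AWS_SECRET_ACCESS_KEY"]
    except KeyError as e:
        # spark_ec2 similarly errors out with a message when a variable is unset
        sys.exit("ERROR: The environment variable %s must be set" % e.args[0])
```

Passing a dict instead of reading `os.environ` makes the lookup easy to exercise in isolation.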
Actually it looks like it can work for all cases by getting the credentials out of boto:
Let me make the change. |
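A sketch of what "getting the credentials out of boto" can look like: a boto EC2 connection exposes the credentials it resolved (from env vars, a ~/.boto config file, etc.) as attributes. Here a stub class stands in for a real boto connection so the idea is self-contained; the attribute names match boto's connection objects, everything else is hypothetical:

```python
class FakeBotoConnection:
    """Stub for a boto EC2 connection; boto resolves these attributes from
    whichever credential source it finds (env vars, ~/.boto, and so on)."""
    aws_access_key_id = "AKIAEXAMPLE"
    aws_secret_access_key = "examplesecret"


def credential_template_vars(conn):
    # Read the resolved credentials off the connection instead of
    # requiring the environment variables directly.
    return {
        "aws_access_key_id": conn.aws_access_key_id,
        "aws_secret_access_key": conn.aws_secret_access_key,
    }
```

This covers users whose credentials live only in boto config files and never appear as environment variables.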
Looks like I screwed up a rebase. Should I open a new PR?
QA tests have started for PR 1120 at commit …
You could open a new PR, but this one is probably salvageable through some Git magic. One thing you could try is to undo your rebase using git-reflog, make a new branch off of this one, switch back to this branch and hard-reset it to master, then cherry-pick the commits from your old copy of the branch into this one.
Force-pushed from 07c599c to 7e0da26.
Thanks! Salvaged...
There is probably a use case for wanting separate keys for accessing S3 from the ones you use to launch EC2 instances, but perhaps it's fine for starters to have them be the same and open a separate PR in the future to make it possible for them to be different.
Probably not directly since, as Dan pointed out, I think EMR does the same. But we do have an existing security problem reported in SPARK-2528, which is that …
QA tests have finished for PR 1120 at commit …
QA tests have started for PR 1120 at commit …
QA tests have finished for PR 1120 at commit …
Failed tests appear unrelated.
From a security perspective I think one concern is the following. Say a company has a workflow where user A launches clusters (e.g. an ops person) and users B, C, and D use the Spark clusters with supplied login credentials. With this patch, we'll now copy user A's AWS credentials into a file on the cluster that users B, C, D, etc. can read. And this change will basically be silently introduced. Am I correct in this understanding? That's IMO too big of a change to introduce silently. But maybe we can guard this with a flag. I do agree this is a super useful feature to have. Jenkins, test this please.
@pwendell I've never worked in an organization that had such a setup for AWS operations - every user had their own credentials with IAM authorizations. But I do see your point. This functionality can be enabled by a spark_ec2 parameter. Let me know if that's how you'd like to proceed.
Hey @danosipov - after speaking with @JoshRosen a bit offline we felt that credential copying was too big of a change to introduce silently. Could you add a flag so that it is opt-in?
@pwendell Done, take a look. This should probably be documented on this page https://github.com/apache/spark/blob/master/docs/ec2-scripts.md#accessing-data-in-s3
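An opt-in flag like the one requested could look like this with optparse, which spark_ec2.py uses for its options; the flag name and help text here are illustrative:

```python
from optparse import OptionParser

parser = OptionParser()
# Off by default, so launch credentials are only copied to the
# cluster when the user explicitly opts in.
parser.add_option(
    "--copy-aws-credentials", action="store_true", default=False,
    help="Add AWS credentials to the cluster's Hadoop configuration "
         "(allows S3 access without further setup)")

opts, args = parser.parse_args(["--copy-aws-credentials"])
```

The `default=False` is what makes the behavior change non-silent: nothing is copied unless the flag is passed.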
QA tests have started for PR 1120 at commit …
QA tests have finished for PR 1120 at commit …
@danosipov Yeah, do you want to just add the docs in this patch? Our docs are versioned in the spark repo under docs/. LGTM pending docs.
Added …
QA tests have started for PR 1120 at commit …
QA tests have finished for PR 1120 at commit …
Okay thanks, I'm merging this.
@pwendell Can you also merge mesos/spark-ec2#58; otherwise this feature will not take effect.
When deploying to AWS, additional configuration is required to read S3 files. EMR creates it automatically; there is no reason the Spark EC2 script shouldn't as well.
This PR requires a corresponding PR to the mesos/spark-ec2 to be merged, as it gets cloned in the process of setting up machines: mesos/spark-ec2#58