Add S3 configuration #58

Merged: 3 commits merged into mesos:v3 on Oct 7, 2014

Conversation

danosipov

When deploying to AWS, it's helpful to add the S3 configuration so that Spark can access S3 transparently. If the access keys are not available in the environment, the configuration will be left empty.
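For illustration, here is a minimal sketch (not the exact code added by this PR) of how a deploy script could render the S3 properties in an ephemeral-hdfs core-site.xml template from the environment; the placeholder names and the empty-string fallback are assumptions:

```python
import os

# Hypothetical fragment in the style of spark-ec2's core-site.xml templates.
# The exact property and placeholder names used by this PR may differ.
CORE_SITE_S3_TEMPLATE = """
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>{{aws_access_key_id}}</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>{{aws_secret_access_key}}</value>
</property>
"""

# Pull the keys from the environment. If they are absent, substitute empty
# strings so the rendered configuration is simply empty rather than broken.
template_vars = {
    "aws_access_key_id": os.environ.get("AWS_ACCESS_KEY_ID", ""),
    "aws_secret_access_key": os.environ.get("AWS_SECRET_ACCESS_KEY", ""),
}

rendered = CORE_SITE_S3_TEMPLATE
for name, value in template_vars.items():
    rendered = rendered.replace("{{%s}}" % name, value)

print(rendered)
```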

asfgit pushed a commit to apache/spark that referenced this pull request Sep 16, 2014
When deploying to AWS, there is additional configuration required to read S3 files. EMR creates it automatically; there is no reason the Spark EC2 script shouldn't do the same.

This PR requires a corresponding PR to the mesos/spark-ec2 repository to be merged, since that repo is cloned while the machines are being set up: mesos/spark-ec2#58

Author: Dan Osipov <daniil.osipov@shazam.com>

Closes #1120 from danosipov/s3_credentials and squashes the following commits:

758da8b [Dan Osipov] Modify documentation to include the new parameter
71fab14 [Dan Osipov] Use a parameter --copy-aws-credentials to enable S3 credential deployment
7e0da26 [Dan Osipov] Get AWS credentials out of boto connection instance
39bdf30 [Dan Osipov] Add S3 configuration parameters to the EC2 deploy scripts
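As a rough illustration of what the commits above describe, the client-side change could look something like the following sketch: reuse the credentials boto already resolved for the EC2 connection when --copy-aws-credentials is passed, and fall back to empty strings otherwise. Function and variable names here are illustrative, not necessarily those used in apache/spark's spark_ec2.py:

```python
import boto.ec2

def aws_credential_variables(conn, copy_aws_credentials):
    """Return the template variables used to fill the S3 credential config."""
    if copy_aws_credentials:
        # boto 2 connection objects expose the credentials they were built with.
        return {
            "aws_access_key_id": conn.aws_access_key_id,
            "aws_secret_access_key": conn.aws_secret_access_key,
        }
    # Without the flag, pass empty strings so the server-side templates
    # still render a valid (but empty) configuration.
    return {"aws_access_key_id": "", "aws_secret_access_key": ""}

if __name__ == "__main__":
    conn = boto.ec2.connect_to_region("us-east-1")
    print(aws_credential_variables(conn, copy_aws_credentials=True))
```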
@danosipov
Author

@pwendell @JoshRosen @nchammas Reminder that this needs to be merged for SPARK-787

@shivaram

Do you want to add this for persistent-hdfs as well?

@danosipov
Author

@shivaram What would be the advantage? Adding it to ephemeral-hdfs seemed to enable the functionality I was looking for (reading S3 data into RDDs).

@nchammas

Yeah, I never figured out whether configs like these belonged in ephemeral-hdfs/conf, spark/conf, or elsewhere. In the past, I've also added S3 keys to ephemeral-hdfs/conf and gotten the desired results for short-lived Spark EC2 clusters.

@shivaram

So we have two HDFS / MapReduce setups on our EC2 clusters: one stores data on ephemeral disks (which is faster) but doesn't retain data across cluster stop/starts; the other stores data on EBS volumes.

For most use cases the ephemeral HDFS is preferred and sufficient. It's just that we usually keep the configs synchronized between the two, so I was wondering if we should update persistent-hdfs as well.

@JoshRosen
Member

/bump @pwendell @danosipov so we don't forget to update / merge this

@shivaram

shivaram commented Oct 7, 2014

LGTM. Merging this. We can worry about persistent-hdfs later

shivaram added a commit that referenced this pull request Oct 7, 2014
@shivaram merged commit de9ed51 into mesos:v3 on Oct 7, 2014
@plaanv

plaanv commented Oct 8, 2014

Hi everyone,
I think this pull request is related to some problems seen on the Spark user mailing list today. People are reporting problems launching ephemeral HDFS, as well as other connectivity issues that may (or may not) stem from this change:

http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-HDFS-doesn-t-start-on-AWS-EC2-cluster-tc15921.html

@mattinbits

Yes, this commit appears to have broken my spark-ec2 process, though I can't identify exactly why.

@nchammas

nchammas commented Oct 8, 2014

@plaanv What leads you to believe that? Is it the timing of the commit, or something in the errors reported on the mailing list? Looks like the issue was identified in #74.

@shivaram Are released versions of spark-ec2 (e.g. 1.1.0) exposed to updates to the v3 branch here? If so, should we be pointing them instead to release-specific versions (e.g. 1.1.0)?

@JoshRosen
Member

I'm just going to revert this commit for now, since that should be guaranteed to fix things. After that, I'll try to put together a postmortem to figure out why this broke things and how we can guard against this sort of problem in the future.

@nchammas

nchammas commented Oct 8, 2014

Yeah, that seems like a dangerous thing to do.

@shivaram @JoshRosen Does it make sense to change that line to point to a release-specific tag?

@danosipov
Author

Agree that a release specific branch would make sense. Let me know how you'd like me to repush.

@mattinbits

Looks like it needs this repo to define a release tag (I don't see any currently), and then a PR to the Spark project to change that python file?

@shivaram

shivaram commented Oct 8, 2014

Before we get to adding release tags I'd like to see why this can't be backwards compatible. Let's wait for Josh's email.

Shivaram

On Oct 8, 2014 8:54 AM, "mattinbits" notifications@github.com wrote:

> Looks like it needs this repo to define a release tag (I don't see any currently), and then a PR to the Spark project to change that python file?



@mattinbits

OK, but even if this particular change is made backward compatible, having a piece of software pull from a dynamically changing repo (or branch of a repo) seems like a bad idea.

I was seriously confused about how my use of the spark-ec2 launch script had broken when I hadn't updated anything myself.

@nchammas

nchammas commented Oct 8, 2014

Yeah, a post-mortem would be good. In this case it might be a simple matter of handling empty template variables correctly.

@JoshRosen
Member

> OK, but even if this particular change is made backward compatible, having a piece of software pull from a dynamically changing repo (or branch of a repo) seems like a bad idea.

Agreed. This behavior surprised me, too.

I think that the current design might have been motivated by a "chicken-and-egg" problem when preparing Spark releases, since the setup scripts won't be able to download the Spark binaries until they're released and the Spark release can't contain the latest version of the EC2 scripts. Hence, the scripts fetch a branch instead of a tag.

One way to break this cyclic dependency would be to move the client-side spark_ec2.py scripts outside of the main Spark repository so that their release cycle / versioning isn't coupled to Spark's. I'm in favor of this, since I think it would make the EC2 scripts much easier to test / develop. I don't think that it necessarily makes sense for users who only want to run Spark on EC2 to have to download the entire Spark source / binaries to their machine.

> In this case it might be a simple matter of handling empty template variables correctly.

True, although I think we'd only need to do that when these server-side scripts are invoked by an earlier version of the client-side scripts that doesn't support setting the AWS credentials: the most recent spark_ec2.py script in apache/spark actually sets the environment variables to either the keys or empty strings, so this empty-variable handling isn't necessary there.
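For concreteness, a minimal sketch of the kind of defensive server-side handling being discussed: substitute {{...}} placeholders from the environment and treat unset variables as empty strings. This is an illustration of the idea, not the actual deploy_templates.py logic, and the uppercase-environment-variable convention is an assumption:

```python
import os
import re

def fill_template(text):
    """Replace {{name}} placeholders, defaulting missing variables to ""."""
    def lookup(match):
        name = match.group(1)
        # Assumed convention: template names map to uppercase env vars.
        return os.environ.get(name.upper(), "")
    return re.sub(r"\{\{(\w+)\}\}", lookup, text)

template = "<value>{{aws_access_key_id}}</value>"
print(fill_template(template))  # prints "<value></value>" if the key is unset
```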

This is a great example of how supporting multiple versions of the client-side script from the same branch can make compatibility difficult to reason about.

@nchammas

nchammas commented Oct 8, 2014

> One way to break this cyclic dependency would be to move the client-side spark_ec2.py scripts outside of the main Spark repository so that their release cycle / versioning isn't coupled to Spark's. I'm in favor of this, since I think it would make the EC2 scripts much easier to test / develop.

Sounds good to me too. It always seemed strange that the spark-ec2 script in the Spark repo depended on things here in a different repo.

Would you want to move the whole EC2 machinery to live here in mesos/spark-ec2? Or perhaps even a more permanent databricks/spark-ec2 or apache/spark-ec2?

@danosipov
Author

@nchammas Would this make sense to look at in the context of https://issues.apache.org/jira/browse/SPARK-3821?

I'd say a short-term solution in this case would be to handle empty template variables, but more generally, pulling a release tag from this repository in spark_ec2 would help when developing breaking changes.

@nchammas

nchammas commented Oct 8, 2014

> @nchammas Would this make sense to look at in the context of https://issues.apache.org/jira/browse/SPARK-3821?

Hmm, yeah I'm still figuring out how the two are related. At the very least, we'll want to hold off on doing anything major for SPARK-3821 until the EC2 scripts have gone through the initial reorganization we're talking about here.

@shivaram

shivaram commented Oct 8, 2014

Thanks for the summary, Josh. Some replies inline; we can also move this to the dev mailing list.

On Wed, Oct 8, 2014 at 9:30 AM, Josh Rosen notifications@github.com wrote:

> OK, but even if this particular change is made backward compatible, having a piece of software pull from a dynamically changing repo (or branch of a repo) seems like a bad idea.
>
> Agreed. This behavior surprised me, too.
>
> I think that the current design might have been motivated by a "chicken-and-egg" problem when preparing Spark releases, since the setup scripts won't be able to download the Spark binaries until they're released and the Spark release can't contain the latest version of the EC2 scripts. Hence, the scripts fetch a branch instead of a tag.
>
> One way to break this cyclic dependency would be to move the client-side spark_ec2.py scripts outside of the main Spark repository so that their release cycle / versioning isn't coupled to Spark's. I'm in favor of this, since I think it would make the EC2 scripts much easier to test / develop. I don't think that it necessarily makes sense for users who only want to run Spark on EC2 to have to download the entire Spark source / binaries to their machine.

Actually this isn't directly related to the cyclic release dependency problem. At a high level the spark-ec2 repository is intentionally maintained separately so that we can make changes to EC2 setups without requiring a new Spark release. EC2 script changes (say, installing Ganglia or a system tool like dstat, or tweaking HDFS configs) are not directly related to Spark, and hence we don't actually want to "freeze" the EC2 scripts with a release.

However, there are times when the Spark configuration changes significantly enough that we can't keep things backwards compatible. In such cases we have created new spark-ec2 branches (master, v2, v3) to preserve backwards compatibility. FWIW such changes have been very rare and I don't see us requiring one per release.

> In this case it might be a simple matter of handling empty template variables correctly.
>
> True, although I think we'd only need to do that when these server-side scripts are invoked by an earlier version of the client-side scripts that doesn't support setting the AWS credentials: the most recent spark_ec2.py script in apache/spark actually sets the environment variables to either the keys or empty strings, so this empty-variable handling isn't necessary there.
>
> This is a great example of how supporting multiple versions of the client-side script from the same branch can make compatibility difficult to reason about.

Agree that there is some mental overhead in reasoning about whether a change is backwards compatible, and we didn't catch it this time. However, maintaining separate branches also involves backporting changes and making sure that changes can safely apply to older branches (which involves similar reasoning).



@JoshRosen
Member

> I'd say a short-term solution in this case would be to handle empty template variables, but more generally, pulling a release tag from this repository in spark_ec2 would help when developing breaking changes.

This is more or less the fix that we're going with for now: create a new branch so that users running released versions of Spark aren't affected by unexpected changes, and merge @danosipov's PR #74 to help guard against backwards incompatibilities for users who run spark-ec2 off of Spark's master branch.

I've created a new v4 branch in this repository and updated spark-ec2 to pull from it when running Spark's master branch. When cutting Spark releases, we plan to freeze the old development branches in this repository and continue development on new branches.
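For readers unfamiliar with the coupling being discussed: the client-side spark_ec2.py hard-codes which spark-ec2 repository and branch it clones onto the cluster. A rough sketch of that mechanism, with illustrative constant names (the actual identifiers and default values in apache/spark may differ), looks like:

```python
# Illustrative constants; the real spark_ec2.py hard-codes equivalents of these.
SPARK_EC2_GIT_REPO = "https://github.com/mesos/spark-ec2"
SPARK_EC2_BRANCH = "v4"  # pointing this at a release-specific tag instead of a
                         # moving branch is the alternative raised in this thread

# Command run on the newly launched master to fetch the setup scripts.
clone_command = "rm -rf spark-ec2 && git clone {repo} -b {branch} spark-ec2".format(
    repo=SPARK_EC2_GIT_REPO, branch=SPARK_EC2_BRANCH)
print(clone_command)
```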

@nchammas

nchammas commented Oct 9, 2014

So we're going to manage a Spark version to spark-ec2 version mapping? For example, 1.1.0 works with v3, 1.1.1 works with v4, etc.

Why wouldn't we instead go with matching version tags on spark-ec2? So that Spark 1.1.0 goes with spark-ec2 1.1.0.

@JoshRosen
Member

Good point; I was just trying to match the existing style here while putting together a quick hotfix.
