Add S3 configuration #58

Merged: 3 commits merged into mesos:v3 on Oct 7, 2014

Conversation

danosipov

When deploying to AWS, it's helpful to add the S3 configuration so that Spark can access S3 transparently. If the access keys are not available in the environment, the configuration will be left empty.
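For illustration, here is a minimal sketch (not the exact code added by this PR) of how a deploy script could render the S3 properties in an ephemeral-hdfs core-site.xml template from the environment; the placeholder names and the empty-string fallback are assumptions:

```python
import os

# Hypothetical fragment in the style of spark-ec2's core-site.xml templates.
# The exact property and placeholder names used by this PR may differ.
CORE_SITE_S3_TEMPLATE = """
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>{{aws_access_key_id}}</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>{{aws_secret_access_key}}</value>
</property>
"""

# Pull the keys from the environment. If they are absent, substitute empty
# strings so the rendered configuration is simply empty rather than broken.
template_vars = {
    "aws_access_key_id": os.environ.get("AWS_ACCESS_KEY_ID", ""),
    "aws_secret_access_key": os.environ.get("AWS_SECRET_ACCESS_KEY", ""),
}

rendered = CORE_SITE_S3_TEMPLATE
for name, value in template_vars.items():
    rendered = rendered.replace("{{%s}}" % name, value)

print(rendered)
```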

asfgit pushed a commit to apache/spark that referenced this pull request Sep 16, 2014
When deploying to AWS, there is additional configuration required to read S3 files. EMR creates it automatically; there is no reason the Spark EC2 script shouldn't do the same.

This PR requires a corresponding PR to the mesos/spark-ec2 repository to be merged, since that repo is cloned while the machines are being set up: mesos/spark-ec2#58

Author: Dan Osipov <daniil.osipov@shazam.com>

Closes #1120 from danosipov/s3_credentials and squashes the following commits:

758da8b [Dan Osipov] Modify documentation to include the new parameter
71fab14 [Dan Osipov] Use a parameter --copy-aws-credentials to enable S3 credential deployment
7e0da26 [Dan Osipov] Get AWS credentials out of boto connection instance
39bdf30 [Dan Osipov] Add S3 configuration parameters to the EC2 deploy scripts
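As a rough illustration of what the commits above describe, the client-side change could look something like the following sketch: reuse the credentials boto already resolved for the EC2 connection when --copy-aws-credentials is passed, and fall back to empty strings otherwise. Function and variable names here are illustrative, not necessarily those used in apache/spark's spark_ec2.py:

```python
import boto.ec2

def aws_credential_variables(conn, copy_aws_credentials):
    """Return the template variables used to fill the S3 credential config."""
    if copy_aws_credentials:
        # boto 2 connection objects expose the credentials they were built with.
        return {
            "aws_access_key_id": conn.aws_access_key_id,
            "aws_secret_access_key": conn.aws_secret_access_key,
        }
    # Without the flag, pass empty strings so the server-side templates
    # still render a valid (but empty) configuration.
    return {"aws_access_key_id": "", "aws_secret_access_key": ""}

if __name__ == "__main__":
    conn = boto.ec2.connect_to_region("us-east-1")
    print(aws_credential_variables(conn, copy_aws_credentials=True))
```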
@danosipov
Author

@pwendell @JoshRosen @nchammas Reminder that this needs to be merged for SPARK-787

@shivaram

Do you want to add this for persistent-hdfs as well?

@danosipov
Author

@shivaram What would be the advantage? Adding it to ephemeral-hdfs seemed to enable the functionality I was looking for (reading S3 data into RDDs).

@nchammas

Yeah, I never figured out whether configs like these belonged in ephemeral-hdfs/conf, spark/conf, or elsewhere. In the past, I've also added S3 keys to ephemeral-hdfs/conf and gotten the desired results for short-lived Spark EC2 clusters.

@shivaram

So we have two HDFS / MapReduce setups on our EC2 clusters: one stores data on ephemeral disks (which is faster) but doesn't retain data across cluster stop/starts; the other stores data on EBS volumes.

For most use cases the ephemeral HDFS is preferred and sufficient. It's just that we usually keep the configs synchronized between the two, so I was wondering if we should update persistent-hdfs as well.

@JoshRosen
Member

/bump @pwendell @danosipov so we don't forget to update / merge this

@shivaram

shivaram commented Oct 7, 2014

LGTM. Merging this. We can worry about persistent-hdfs later

shivaram added a commit that referenced this pull request Oct 7, 2014
@shivaram merged commit de9ed51 into mesos:v3 on Oct 7, 2014
@plaanv

plaanv commented Oct 8, 2014

Hi everyone,
I think this pull request is related to some problems seen on the Spark user mailing list today. People are reporting problems launching ephemeral HDFS, as well as other connectivity issues that may (or may not) stem from this change:

http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-HDFS-doesn-t-start-on-AWS-EC2-cluster-tc15921.html

@mattinbits

Yes, this commit appears to have broken my spark-ec2 process, though I can't identify exactly why.

@nchammas

nchammas commented Oct 8, 2014

@plaanv What leads you to believe that? Is it the timing of the commit, or something in the errors reported on the mailing list? Looks like the issue was identified in #74.

@shivaram Are released versions of spark-ec2 (e.g. 1.1.0) exposed to updates to the v3 branch here? If so, should we be pointing them instead to release-specific versions (e.g. 1.1.0)?

@JoshRosen
Member

I'm just going to revert this commit for now, since that should be guaranteed to fix things. After that, I'll try to put together a postmortem to figure out why this broke things and how we can guard against this sort of problem in the future.

@nchammas

nchammas commented Oct 8, 2014

Yeah, that seems like a dangerous thing to do.

@shivaram @JoshRosen Does it make sense to change that line to point to a release-specific tag?

@danosipov
Author

Agree that a release specific branch would make sense. Let me know how you'd like me to repush.

@mattinbits

Looks like it needs this repo to define a release tag (I don't see any currently), and then a PR to the Spark project to change that python file?

@shivaram

shivaram commented Oct 8, 2014

Before we get to adding release tags I'd like to see why this can't be backwards compatible. Let's wait for Josh's email.

Shivaram

On Oct 8, 2014 8:54 AM, "mattinbits" notifications@github.com wrote:

> Looks like it needs this repo to define a release tag (I don't see any currently), and then a PR to the Spark project to change that python file?



@mattinbits

OK, but even if this particular change is made backward compatible, having a piece of software pull from a dynamically changing repo (or branch of a repo) seems like a bad idea.

I was seriously confused about how my use of the spark-ec2 launch script had broken when I hadn't updated anything myself.

@nchammas

nchammas commented Oct 8, 2014

Yeah, a post-mortem would be good. In this case it might be a simple matter of handling empty template variables correctly.

@JoshRosen
Member

> OK, but even if this particular change is made backward compatible, having a piece of software pull from a dynamically changing repo (or branch of a repo) seems like a bad idea.

Agreed. This behavior surprised me, too.

I think that the current design might have been motivated by a "chicken-and-egg" problem when preparing Spark releases, since the setup scripts won't be able to download the Spark binaries until they're released and the Spark release can't contain the latest version of the EC2 scripts. Hence, the scripts fetch a branch instead of a tag.

One way to break this cyclic dependency would be to move the client-side spark_ec2.py scripts outside of the main Spark repository so that their release cycle / versioning isn't coupled to Spark's. I'm in favor of this, since I think it would make the EC2 scripts much easier to test / develop. I don't think that it necessarily makes sense for users who only want to run Spark on EC2 to have to download the entire Spark source / binaries to their machine.

> In this case it might be a simple matter of handling empty template variables correctly.

True, although I think we'd only need to do that when these server-side scripts are invoked by an earlier version of the client-side scripts that doesn't support setting the AWS credentials: the most recent spark_ec2.py script in apache/spark actually sets the environment variables to either the keys or empty strings, so this empty-variable handling isn't necessary there.
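For concreteness, a minimal sketch of the kind of defensive server-side handling being discussed: substitute {{...}} placeholders from the environment and treat unset variables as empty strings. This is an illustration of the idea, not the actual deploy_templates.py logic, and the uppercase-environment-variable convention is an assumption:

```python
import os
import re

def fill_template(text):
    """Replace {{name}} placeholders, defaulting missing variables to ""."""
    def lookup(match):
        name = match.group(1)
        # Assumed convention: template names map to uppercase env vars.
        return os.environ.get(name.upper(), "")
    return re.sub(r"\{\{(\w+)\}\}", lookup, text)

template = "<value>{{aws_access_key_id}}</value>"
print(fill_template(template))  # prints "<value></value>" if the key is unset
```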

This is a great example of how supporting multiple versions of the client-side script from the same branch can make compatibility difficult to reason about.

@nchammas

nchammas commented Oct 8, 2014

> One way to break this cyclic dependency would be to move the client-side spark_ec2.py scripts outside of the main Spark repository so that their release cycle / versioning isn't coupled to Spark's. I'm in favor of this, since I think it would make the EC2 scripts much easier to test / develop.

Sounds good to me too. It always seemed strange that the spark-ec2 script in the Spark repo depended on things here in a different repo.

Would you want to move the whole EC2 machinery to live here in mesos/spark-ec2? Or perhaps even a more permanent databricks/spark-ec2 or apache/spark-ec2?

@danosipov
Author

@nchammas Would this make sense to look at in the context of https://issues.apache.org/jira/browse/SPARK-3821?

I'd say a short-term solution in this case would be to handle empty template variables, but more generally, pulling a release tag from this repository in spark_ec2 would help when developing breaking changes.

@nchammas

nchammas commented Oct 8, 2014

> @nchammas Would this make sense to look at in the context of https://issues.apache.org/jira/browse/SPARK-3821?

Hmm, yeah I'm still figuring out how the two are related. At the very least, we'll want to hold off on doing anything major for SPARK-3821 until the EC2 scripts have gone through the initial reorganization we're talking about here.

@shivaram

shivaram commented Oct 8, 2014

Thanks for the summary, Josh. Some replies inline; we can also move this to the dev mailing list.

On Wed, Oct 8, 2014 at 9:30 AM, Josh Rosen notifications@github.com wrote:

> OK, but even if this particular change is made backward compatible, having a piece of software pull from a dynamically changing repo (or branch of a repo) seems like a bad idea.
>
> Agreed. This behavior surprised me, too.
>
> I think that the current design might have been motivated by a "chicken-and-egg" problem when preparing Spark releases, since the setup scripts won't be able to download the Spark binaries until they're released and the Spark release can't contain the latest version of the EC2 scripts. Hence, the scripts fetch a branch instead of a tag.
>
> One way to break this cyclic dependency would be to move the client-side spark_ec2.py scripts outside of the main Spark repository so that their release cycle / versioning isn't coupled to Spark's. I'm in favor of this, since I think it would make the EC2 scripts much easier to test / develop. I don't think that it necessarily makes sense for users who only want to run Spark on EC2 to have to download the entire Spark source / binaries to their machine.

Actually this isn't directly related to the cyclic release dependency problem. At a high level the spark-ec2 repository is intentionally maintained separately so that we can make changes to EC2 setups without requiring a new Spark release. EC2 script changes (say, installing Ganglia or a system tool like dstat, or tweaking HDFS configs) are not directly related to Spark, and hence we don't actually want to "freeze" the EC2 scripts with a release.

However, there are times when the Spark configuration changes significantly enough that we can't keep things backwards compatible. In such cases we have created new spark-ec2 branches (master, v2, v3) to preserve backwards compatibility. FWIW such changes have been very rare and I don't see us requiring one per release.

> In this case it might be a simple matter of handling empty template variables correctly.
>
> True, although I think we'd only need to do that when these server-side scripts are invoked by an earlier version of the client-side scripts that doesn't support setting the AWS credentials: the most recent spark_ec2.py script in apache/spark actually sets the environment variables to either the keys or empty strings, so this empty-variable handling isn't necessary there.
>
> This is a great example of how supporting multiple versions of the client-side script from the same branch can make compatibility difficult to reason about.

Agree that there is some mental overhead in reasoning about whether a change is backwards compatible, and we didn't catch it this time. However, maintaining separate branches also involves backporting changes and making sure that changes can safely apply to older branches (which involves similar reasoning).



@JoshRosen
Member

> I'd say a short-term solution in this case would be to handle empty template variables, but more generally, pulling a release tag from this repository in spark_ec2 would help when developing breaking changes.

This is more or less the fix that we're going with for now: create a new branch so that users running released versions of Spark aren't affected by unexpected changes, and merge @danosipov's PR #74 to help guard against backwards incompatibilities for users who run spark-ec2 off of Spark's master branch.

I've created a new v4 branch in this repository and updated spark-ec2 to pull from it when running Spark's master branch. When cutting Spark releases, we plan to freeze the old development branches in this repository and continue development on new branches.
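For readers unfamiliar with the coupling being discussed: the client-side spark_ec2.py hard-codes which spark-ec2 repository and branch it clones onto the cluster. A rough sketch of that mechanism, with illustrative constant names (the actual identifiers and default values in apache/spark may differ), looks like:

```python
# Illustrative constants; the real spark_ec2.py hard-codes equivalents of these.
SPARK_EC2_GIT_REPO = "https://github.com/mesos/spark-ec2"
SPARK_EC2_BRANCH = "v4"  # pointing this at a release-specific tag instead of a
                         # moving branch is the alternative raised in this thread

# Command run on the newly launched master to fetch the setup scripts.
clone_command = "rm -rf spark-ec2 && git clone {repo} -b {branch} spark-ec2".format(
    repo=SPARK_EC2_GIT_REPO, branch=SPARK_EC2_BRANCH)
print(clone_command)
```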

@nchammas

nchammas commented Oct 9, 2014

So we're going to manage a Spark version to spark-ec2 version mapping? For example, 1.1.0 works with v3, 1.1.1 works with v4, etc.

Why wouldn't we instead go with matching version tags on spark-ec2? So that Spark 1.1.0 goes with spark-ec2 1.1.0.

@JoshRosen
Member

Good point; I was just trying to match the existing style here while putting together a quick hotfix.
