Add S3 configuration #58

Conversation
When deploying to AWS, there is additional configuration required to read S3 files. EMR creates it automatically; there is no reason the Spark EC2 script shouldn't. This PR requires a corresponding PR to the mesos/spark-ec2 repo to be merged, as that repo gets cloned in the process of setting up machines: mesos/spark-ec2#58

Author: Dan Osipov <daniil.osipov@shazam.com>

Closes #1120 from danosipov/s3_credentials and squashes the following commits:

758da8b [Dan Osipov] Modify documentation to include the new parameter
71fab14 [Dan Osipov] Use a parameter --copy-aws-credentials to enable S3 credential deployment
7e0da26 [Dan Osipov] Get AWS credentials out of boto connection instance
39bdf30 [Dan Osipov] Add S3 configuration parameters to the EC2 deploy scripts
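For context, a minimal sketch of how the client side of this change could work, assuming boto is used for the EC2 connection; the variable names here are illustrative, not the actual spark_ec2.py code:

```python
# Illustrative sketch only, not the actual spark_ec2.py code.
# boto resolves credentials from the environment or its config files
# and exposes them on the connection object, so they can be forwarded
# to the cluster as template variables.
import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")

template_vars = {
    # Empty strings keep the rendered config valid when no keys are set.
    "aws_access_key_id": conn.aws_access_key_id or "",
    "aws_secret_access_key": conn.aws_secret_access_key or "",
}
```

Per the commits above, this deployment is opt-in via the `--copy-aws-credentials` flag on the launch script.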
@pwendell @JoshRosen @nchammas Reminder that this needs to be merged for SPARK-787 |
Do you want to add this for persistent-hdfs as well? |
@shivaram What would be the advantage? Adding it to ephemeral-hdfs seemed to enable the functionality I was looking for (reading S3 data into RDDs). |
Yeah, I never figured out whether configs like these belonged in ephemeral-hdfs, persistent-hdfs, or both. |
So we have 2 HDFS / MapReduce setups on our EC2 clusters: one which stores data on ephemeral disks (which is faster) but doesn't retain data across cluster stop/starts, and the other which stores data on EBS volumes. For most use cases the ephemeral-hdfs is preferred and sufficient. It's just that we usually keep the configs synchronized between them, so I was wondering if we should update persistent-hdfs as well. |
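As a rough illustration of what keeping the two setups in sync would mean, here is a hedged sketch; the paths follow the usual spark-ec2 layout on the master, the helper name is hypothetical, and the property names are the standard Hadoop S3 keys:

```python
# Hypothetical sketch: mirror the S3 credential properties into the
# core-site.xml of both HDFS setups so their configs stay in sync.
S3_PROPS = """  <property><name>fs.s3.awsAccessKeyId</name><value>{key_id}</value></property>
  <property><name>fs.s3.awsSecretAccessKey</name><value>{secret}</value></property>
"""

def append_s3_props(core_site_path, key_id, secret):
    # Naive insertion before the closing tag; real template handling differs.
    with open(core_site_path) as f:
        xml = f.read()
    xml = xml.replace("</configuration>",
                      S3_PROPS.format(key_id=key_id, secret=secret) + "</configuration>")
    with open(core_site_path, "w") as f:
        f.write(xml)

for conf_dir in ("/root/ephemeral-hdfs/conf", "/root/persistent-hdfs/conf"):
    # "..." are illustrative placeholders for the deployed credentials.
    append_s3_props(conf_dir + "/core-site.xml", key_id="...", secret="...")
```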
/bump @pwendell @danosipov so we don't forget to update / merge this |
LGTM. Merging this. We can worry about persistent-hdfs later |
Hi everyone, |
Yes, this commit appears to have broken my spark-ec2 process; I can't identify exactly why, though. |
@shivaram Are released versions of Spark affected by this too, since the launch script pulls from a live branch of this repo? |
I'm just going to revert this commit for now, since that should be guaranteed to fix things. After that, I'll try to put together a postmortem to figure out why this broke things and how we can guard against this sort of problem in the future. |
Yeah, that seems like a dangerous thing to do. @shivaram @JoshRosen Does it make sense to change that line to point to a release-specific tag? |
Agree that a release-specific branch would make sense. Let me know how you'd like me to repush. |
Looks like it needs this repo to define a release tag (I don't see any currently), and then a PR to the Spark project to change that Python file? |
Before we get to adding release tags, I'd like to see why this can't be made backward compatible. Shivaram |
OK, but even if this particular change is made backward compatible, having a piece of software pull from a dynamically changing repo (or branch of a repo) seems like a bad idea. I was seriously confused about how my use of the spark-ec2 launch script had broken when I hadn't updated anything myself. |
Yeah, a post-mortem would be good. In this case it might be a simple matter of handling empty template variables correctly. |
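To make the "handle empty template variables" idea concrete, a minimal sketch; the `{{var}}` placeholder syntax and the function name are assumptions about the deploy templates, not the actual code:

```python
# Hedged sketch: substitute each {{variable}} placeholder, defaulting to
# an empty string when no value is available, so a client that predates
# --copy-aws-credentials still produces a well-formed core-site.xml.
import os
import re

def fill_template(text):
    def lookup(match):
        # Unset variables render as "" instead of a raw {{placeholder}}.
        return os.environ.get(match.group(1).upper(), "")
    return re.sub(r"\{\{(\w+)\}\}", lookup, text)

rendered = fill_template("<value>{{aws_access_key_id}}</value>")
```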
Agreed. This behavior surprised me, too. I think that the current design might have been motivated by a "chicken-and-egg" problem when preparing Spark releases: the setup scripts won't be able to download the Spark binaries until they're released, and the Spark release can't contain the latest version of the EC2 scripts. Hence, the scripts fetch a branch instead of a tag. One way to break this cyclic dependency would be to move the client-side launch script into this repository, so that both halves could be versioned and released together. |
True, although I think we'd only need to do that when these server-side scripts are invoked by an earlier version of the client-side script that doesn't support setting the AWS credentials; the most recent client-side script always sets these template variables. This is a great example of how supporting multiple versions of the client-side script from the same branch can make compatibility difficult to reason about. |
Sounds good to me too. It always seemed strange that the client- and server-side scripts lived in separate repositories. Would you want to move the whole EC2 machinery to live here in spark-ec2? |
@nchammas Would this make sense to look at in the context of https://issues.apache.org/jira/browse/SPARK-3821? I'd say a short-term solution in this case would be to handle empty template variables, but more generally, having spark_ec2 pull a release tag from this repository would make it safer to develop breaking changes. |
Hmm, yeah I'm still figuring out how the two are related. At the very least, we'll want to hold off on doing anything major for SPARK-3821 until the EC2 scripts have gone through the initial reorganization we're talking about here. |
Thanks for the summary Josh. Some replies inline; we can also move this to the JIRA. On Wed, Oct 8, 2014 at 9:30 AM, Josh Rosen notifications@github.com wrote: However, there are times when the Spark configuration changes significantly enough that these scripts need to change along with it. |
This is more-or-less the fix that we're going with for now: create a new branch so that users running released versions of Spark aren't affected by unexpected changes, and merge @danosipov's PR, #74, to help guard against backwards-incompatibilities for users who run the scripts from the development branch. I've created a new branch in this repository for this purpose. |
So we're going to manage a mapping of Spark versions to spark-ec2 branches? Why wouldn't we instead go with matching version tags on this repository? |
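For illustration, a pinned mapping along the lines being discussed could look something like this; the dict, function, and ref names are all hypothetical:

```python
# Hypothetical sketch of pinning each Spark release to a fixed
# spark-ec2 ref instead of a moving branch; names are made up.
SPARK_EC2_REF_FOR_VERSION = {
    "1.1.0": "refs/tags/v1.1.0",
    "1.0.2": "refs/tags/v1.0.2",
}

def spark_ec2_ref(spark_version, default_branch="master"):
    # Unknown versions fall back to the development branch.
    return SPARK_EC2_REF_FOR_VERSION.get(spark_version, default_branch)
```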
Good point; I was just trying to match the existing style here while putting together a quick hotfix. |
When deploying to AWS, it's helpful to add the S3 configuration so that Spark can access S3 transparently. If the access keys are not available in the environment, the configuration will be empty.
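A minimal sketch of that fallback, assuming the standard AWS environment variable names:

```python
# If the standard AWS environment variables are unset, the S3
# configuration values simply render as empty strings.
import os

aws_access_key_id = os.environ.get("AWS_ACCESS_KEY_ID", "")
aws_secret_access_key = os.environ.get("AWS_SECRET_ACCESS_KEY", "")
```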