Use Caching for FlowDefExecution #1581

Merged: merged 1 commit into develop on Jul 21, 2016
Conversation

johnynek
Collaborator

When @jnievelt added Trampolines to Execution, this one was left off.

@jnievelt
Contributor

Looks like this isn't a trampolining issue but a caching issue, in that we could have multiple runs of a re-used FlowDefExecution?

@johnynek
Collaborator Author

Right, bad title. We cache everywhere else. Not sure why this one should not be cached.

@johnynek johnynek changed the title Use Trampoline in FlowDefExecution Use Caching for FlowDefExecution Jul 19, 2016
@johnynek
Collaborator Author

The basic idea is that Hadoop jobs are so expensive that we should definitely cache them.

- protected def runStats(conf: Config, mode: Mode, cache: EvalCache)(implicit cec: ConcurrentExecutionContext) =
-   Trampoline(
+ protected def runStats(conf: Config, mode: Mode, cache: EvalCache)(implicit cec: ConcurrentExecutionContext) = {
+   lazy val future =
Contributor

Should this go outside the method? Or does this live across method calls?

Collaborator Author

It depends on conf and mode, so I don't think it can go outside the method. Being lazy just means we don't actually fire this future unless the cache lacks a run of this Execution with the same Config.

Contributor

Ahh, right. I missed the getOrElse part below.
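
For readers following this thread, here is a minimal, self-contained sketch of the lazy-future-plus-cache pattern being discussed. This is not Scalding's actual EvalCache or runStats; the names SimpleEvalCache, getOrElseInsert, and runStatsLike are invented for illustration.

import scala.collection.concurrent.TrieMap
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical stand-in for an evaluation cache: at most one Future per key.
class SimpleEvalCache[K, V] {
  private val map = TrieMap.empty[K, Future[V]]

  // Return the cached Future if present; otherwise evaluate `compute`, store it,
  // and return it. Because `compute` is by-name, a cache hit never starts the work.
  def getOrElseInsert(key: K)(compute: => Future[V]): Future[V] =
    map.getOrElseUpdate(key, compute)
}

object CachedRunSketch {
  // Shaped like the change above: the lazy val means the Future (and the expensive
  // job it would launch) is only created when the cache misses for this conf.
  def runStatsLike(conf: String, cache: SimpleEvalCache[String, Long])(implicit ec: ExecutionContext): Future[Long] = {
    lazy val future: Future[Long] = Future {
      println(s"launching expensive job for conf=$conf")
      42L
    }
    cache.getOrElseInsert(conf)(future)
  }
}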

@rubanm
Contributor

rubanm commented Jul 21, 2016

👍 So without this fix, it looks like we can have undesirable results if the Execution is not idempotent?

@johnynek
Collaborator Author

@rubanm exactly. There is also the chance that a job could get run twice (and if our past experience is any indication, that happens a lot with Execution without the cache). Note that the cache was added only after bug reports that using Execution resulted in many duplicated jobs.

The main case for duplication is the use of zip or sequence when both branches have a common ancestor. In that case, each branch recomputes without the cache, which is not what you want.
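
A minimal sketch of that duplication case, under the assumption of no caching; ToyExecution and ZipDuplicationSketch are invented for illustration and are not Scalding's Execution.

import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Toy execution type: running it builds a fresh Future every time, i.e. no caching.
final case class ToyExecution[T](run: () => Future[T]) {
  def map[U](f: T => U)(implicit ec: ExecutionContext): ToyExecution[U] =
    ToyExecution(() => run().map(f))

  def zip[U](that: ToyExecution[U])(implicit ec: ExecutionContext): ToyExecution[(T, U)] =
    ToyExecution { () =>
      // Each branch calls run() independently, so a shared ancestor is recomputed per branch.
      val ft = run()
      val fu = that.run()
      ft.flatMap(t => fu.map(u => (t, u)))
    }
}

object ZipDuplicationSketch extends App {
  import ExecutionContext.Implicits.global

  val launches = new AtomicInteger(0)

  // The "common ancestor": an expensive job both branches depend on.
  val ancestor = ToyExecution(() => Future { launches.incrementAndGet(); 10 })

  val left  = ancestor.map(_ + 1)
  val right = ancestor.map(_ * 2)

  // Without a cache keyed on the execution and its Config, zipping the two
  // branches launches the ancestor once per branch.
  Await.result(left.zip(right).run(), 10.seconds)
  println(s"ancestor launched ${launches.get} time(s)") // prints 2 without caching
}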

@johnynek johnynek merged commit 868abf5 into develop Jul 21, 2016
@johnynek johnynek deleted the oscar/minor-fixes branch July 21, 2016 17:12
@jnievelt
Contributor

👍
