
Parallel task evaluation #480

Merged · 88 commits · Apr 14, 2020

Conversation

lefou (Member) commented Nov 5, 2018

This branch experiments with parallel task execution in mill.

This PR is an implementation of #32. Please check it out and give feedback! If you need a more recent mill version, just ping me (@lefou) and I'll re-sync the branch.

I try to keep this description updated to reflect the current state. For details, have a look at the commit messages. Everything is open for discussion, either here or on Gitter; just ping me.

How to test

Check out and build locally:

$ git clone -b par https://github.com/lefou/mill.git
Cloning into 'mill'...
$ cd mill
mill $ ./mill -i all __.publishLocal assembly
mill $ cp out/assembly/dest/mill ~/mill-par

The last line copies the mill executable to ~/mill-par, but you can freely choose a different location.

By default, mill uses as many build threads as there are logical CPU cores. You can change this with the --jobs or -j option. Use -j 1 to disable parallel execution entirely.
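
For example, to run all compile targets with four build threads (using the ~/mill-par executable built above):

$ ~/mill-par -i -j 4 __.compile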

Please note that the all target enforces serial processing, to ensure that calls like mill -i all clean __.compile still work. In most cases (e.g. when you don't include a clean) you might want to use the new par target, which runs the given targets in parallel. For example: mill -i par __.compile __.testCached.

If you find issues, please report them!
If it works for you, please report that, too!

How to compare compile speeds

After a change of the mill version, various files need to be re-compiled (the Ammonite environment, build.sc, and possibly more). So, before you compare build times across different mill versions, you should run mill -i version, which ensures these things are done and won't skew your timings.

Also, if you want to compare times side by side, you should first clean your out directory with rm -r out.

For starters, if you just want to compare compile times, compare the outputs of:

rm -r out
time ~/mill-par -i __.compile

and

rm -r out
time ~/mill-par -i -j 1 __.compile

Don't expect wonders! The more cores you have, the better the speedup. Many laptops throttle their CPU frequency when they run too hot; higher CPU usage means more heat, so build times may even increase when running in parallel. Look at the ratio between real and user time.
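
To make that ratio concrete (illustrative numbers, not a measurement): if time reports real 1m00s and user 3m30s, roughly 3.5 cores were busy on average, i.e. about 3.5 minutes of single-threaded work were packed into one minute of wall-clock time. If real and user are nearly equal, the build effectively ran serially.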

Reporting bugs

If you think you found a bug, here is some information that helps me understand and fix it:

  • Output of mill-par --version
  • If possible, a link to the public repository and steps to reproduce the error
  • The build output, especially the error message; out/evaluator.log is also interesting (or run mill with -d)
  • Does the build succeed with mill-par -j 1?
  • Does the build succeed with mainline mill?

Pull Request Details

Task Evaluation

In this attempt, I use a Java ExecutorService for the job management, to avoid additional dependencies. It also allows easy reasoning about job scheduling, as all control work and state handling is managed from a single thread. Only task execution happens in a separate thread pool.

It uses a fixed thread pool with a configurable thread count. The thread pool is managed inside the Evaluator and is created and shut down for each evaluation phase (each invocation of the evaluate def). That also means that when running tasks which themselves use an Evaluator instance, we end up with another thread pool. As an example, mill all clean __.compile will run the all task in an outer Evaluator instance, which will block until an inner Evaluator instance completes the other tasks. Thus, I plan to redesign it to manage parallelism outside of Evaluator and to increase the pool size for each new Evaluator instance, to gracefully adapt to the threads that block while waiting for an inner Evaluator instance to finish.
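
To illustrate the idea only (a minimal sketch with hypothetical names like TerminalGroup and evaluate, not mill's actual Evaluator code; dependency ordering between groups is omitted):

import java.util.concurrent.{Callable, ExecutorCompletionService, Executors}

// Hypothetical stand-in for a terminal group: a name plus the work to run
case class TerminalGroup(name: String, run: () => Unit)

def evaluate(groups: Seq[TerminalGroup], jobs: Int): Unit = {
  // fixed pool, created per call to evaluate and shut down afterwards
  val pool = Executors.newFixedThreadPool(jobs)
  val completion = new ExecutorCompletionService[String](pool)
  try {
    // only the group bodies run on the pool; scheduling state stays on this thread
    groups.foreach { g =>
      completion.submit(new Callable[String] {
        def call(): String = { g.run(); g.name }
      })
    }
    // the single coordinating thread waits for completions, one at a time
    var remaining = groups.size
    while (remaining > 0) {
      val finished = completion.take().get() // rethrows failures from the pool
      println(s"[$finished] finished")
      remaining -= 1
    }
  } finally pool.shutdown()
}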

Currently, we only parallelize whole terminal groups (which most probably consist of multiple tasks). The elements of a terminal group very likely depend on each other and are also very fine-grained, so parallelizing them would not provide much gain; on the contrary, it could produce more contention and overhead. But this is only an unverified assumption.

Logging and output

Except for ticker messages, I try to prefix all output that goes through the Logger with the current task name.
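
As a rough sketch of that prefixing (assuming a simplified, hypothetical Logger trait; mill's real Logger API differs):

// Simplified, hypothetical logger interface
trait Logger {
  def info(msg: String): Unit
  def error(msg: String): Unit
  def ticker(msg: String): Unit
}

// Wraps an underlying logger and prefixes regular output with the task name,
// while letting ticker messages pass through unchanged
class PrefixLogger(underlying: Logger, taskName: String) extends Logger {
  def info(msg: String): Unit   = underlying.info(s"[$taskName] $msg")
  def error(msg: String): Unit  = underlying.error(s"[$taskName] $msg")
  def ticker(msg: String): Unit = underlying.ticker(msg)
}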

Caveats

Tasks that get an Evaluator instance are potentially sensitive to parallel execution. For example, the built-in clean task will most probably not work as expected when given as an argument to the built-in all task. This will be fixed once #502 is addressed.

lihaoyi (Member) commented Nov 5, 2018

Thanks! Will take a look.

Some initial comments:

> Currently, I use a hardcoded threadpool with 4 threads, but it is planned to make it configurable.

Sounds good; just a flag you pass similar to -w/--watch seems fine

> That also means that when running tasks which themselves use an Evaluator instance, we end up with another thread pool. As an example, mill all clean __.compile will run the all task in an outer Evaluator instance, which will block until an inner Evaluator instance completes the other tasks.

This is fine for now. I think long term the way forward is to remove the outer Evaluator from the top-level entry points like all that currently take an Evaluator as a parameter, so for now having an extra thread pool lying around is probably fine.

> Currently, we only parallelize whole terminal groups (which most probably consist of multiple tasks). The elements of a terminal group very likely depend on each other and are also very fine-grained, so parallelizing them would not provide much gain; on the contrary, it could produce more contention and overhead. But this is only an unverified assumption.

Probably fine for now. Finer-grained parallelism may or may not be better than terminal-group-level parallelism, but let's leave that for future follow-up investigation.

Terminal-group-level parallelism also has the benefit that it is simpler for a user to understand, since most of the time they are interacting with the terminal NamedTasks at the head of a terminal group, and not with individual tasks themselves.

lefou (Member, Author) commented Nov 8, 2018

I suspect tasks which get an Evaluator instance are tricky to evaluate in parallel. As they can impact the whole tree (e.g. the clean task can be destructive), the remaining work to do must be re-evaluated after those tasks complete.

lihaoyi (Member) commented Nov 8, 2018

I'd like to eventually get to a state where tasks that currently take an evaluator instance are not tasks at all, but something else, and cannot be part of the task graph, take part in caching, the dependency-based evaluation ordering, or any of that stuff.

Not sure how that applies to your PR but that's where I hope to get to.

lefou added 17 commits December 4, 2018 23:18
This initial proof of concept code uses a hardcoded threadpool with 4
threads. It is based on the Java ExecutorService and CompletionService
to gain the following features:
* control work and state from a single thread
* easy waiting for completion
* easy cancellation of all running jobs
* no dependency on an external library

We only parallelize whole terminal groups (which most probably consist
of multiple tasks). The elements of a terminal group very likely depend
on each other and are also very fine-grained, so parallelizing them
would not provide much gain; on the contrary, it could produce more
contention and overhead. But this is only an unverified assumption.

There is certainly a more elegant solution to calculate the task tree
(which is a terminal group tree), but I just wanted to concentrate on
the parallel processing for now.

See com-lihaoyi#32
This avoids concurrent cursor movements resulting in the ticker (over)drawing multiple lines
@lefou lefou changed the title from "WIP: parallel task evaluation (POC)" to "Parallel task evaluation (POC) - Test / Feedback appreciated" Dec 12, 2018
@lefou lefou mentioned this pull request Feb 18, 2019
lihaoyi-databricks (Contributor) commented

@lefou thanks, will find time to take another look

lefou (Member, Author) commented Mar 18, 2020

I've been using mill-par exclusively for a while now. No issues at all for me. Nice speedup, especially when running tests.

@lefou lefou mentioned this pull request Mar 20, 2020
lefou (Member, Author) commented Mar 27, 2020

Travis CI is now green, yay!

Baccata (Contributor) commented Apr 11, 2020

So out of curiosity, what's the plan regarding this branch?

Ammonite-Bot (Collaborator) commented Apr 11, 2020 via email

lefou (Member, Author) commented Apr 11, 2020

@Baccata, glad you asked. The branch itself is pretty stable for me. I'm using it almost on a daily basis. I'm eager to merge it, but asked @lihaoyi for a second review a month ago.

He already asked for a change to Scala Futures (initially, I chose Java Futures because they are cancellable, but that isn't really required and is less readable). But we agreed to defer that to another PR after the merge, mostly because of lack of time and because this PR is already pretty big.

What I ask myself: should we make parallel evaluation the default (as it is now in the branch), or should we default to --jobs/-j 1? If we do the latter (which means no change in mill's behavior by default), then it is pretty safe to merge as-is, I think. We can mark the feature "experimental" for a release round or two and ask for feedback. There was not much feedback IMHO, and I can't tell whether that's because nobody tried it or because all who tried it are happy.

lefou (Member, Author) commented Apr 11, 2020

Oh, it looks like Haoyi already answered (concurrently) and we came to different conclusions regarding the Futures change.

lihaoyi (Member) commented Apr 11, 2020 via email

Ammonite-Bot (Collaborator) commented Apr 11, 2020 via email

lefou (Member, Author) commented Apr 14, 2020

The thing is, we won't get rid of the hand-rolled scheduling just by changing from Java Futures to Scala Futures. As we plug the multithreading in at a high level, we need to handle some concurrency constraints.

When creating the futures, we have no idea whether a target is cached before we actually start the future. And we need to make sure that only one target with the same name runs at a time. This is not only the case when the user requests the same target twice (e.g. __.test, which resolves to modulex.test.test and to modulex.test with its default command modulex.test.test); each compile task also depends on different versions of ZincModule.worker, and we can't run these concurrently before we know they are cached properly. That's something that wasn't obvious to me before, and which I only learned by battle-testing and adapting the scheduling.
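
To illustrate that constraint (hypothetical names, not mill's internals): before a terminal group is submitted to the pool, the scheduler has to check whether a group with the same terminal name is already in flight, and defer it if so:

import scala.collection.mutable

// Hypothetical guard keeping track of in-flight terminal names
final class InFlightGuard {
  private val running = mutable.Set.empty[String]

  // true if the named terminal may start now; if false, the caller must
  // re-queue the group and retry once something else has finished
  def tryStart(terminalName: String): Boolean = synchronized {
    if (running.contains(terminalName)) false
    else { running += terminalName; true }
  }

  def finish(terminalName: String): Unit = synchronized {
    running -= terminalName
  }
}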

I think getting rid of the hand-rolled scheduling would mean we also need to refactor the evaluateGroup* methods, split them up, and be more specific about their inputs and outputs. Currently, these do not hand over concrete results to each other, but simply fill and use some mutable state. Also, we might want to investigate why there are multiple instances of the (to my naive understanding) same workers. Of course we can improve those things, and I'm sure mill would greatly benefit from these changes, and we should do that sooner or later, but it's still additional non-trivial work.

And that's a bitter pill for me, especially as I've already invested a great amount of time in making the current branch work properly with all the projects I know.

So my request is to reconsider merging the current branch as an experimental feature right now. It's disabled by default and easy to remove or improve later. But adding it now allows me and many others to profit from the huge speedups when running compiles and tests right now. Also, the additional merge overhead with each change in master was not fun in the past and will not be in the future; it's the reason why I was no longer able to rebase this branch, but needed to merge with master now and then.

As an alternative, we could prepare the Evaluator to have an injectable scheduling strategy and some mechanism to let the user configure it.
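
A rough sketch of what such an injectable strategy could look like (the names are hypothetical and not part of mill's API):

// Hypothetical strategy interface
trait SchedulingStrategy {
  // Run all groups, where exec evaluates a single terminal group by name;
  // implementations decide how much of this happens concurrently.
  def run(groups: Seq[String], exec: String => Unit): Unit
}

// Serial strategy: equivalent to running with -j 1
object SerialStrategy extends SchedulingStrategy {
  def run(groups: Seq[String], exec: String => Unit): Unit =
    groups.foreach(exec) // groups assumed to arrive in topological order
}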

lihaoyi (Member) commented Apr 14, 2020

What you say makes sense. Let's merge this as is then! If I want to try a different evaluation strategy, or if there remain issues, we can do that in a later PR.

Looks good to me!

lefou (Member, Author) commented Apr 14, 2020

Thank you! I will polish a bit more and merge. I think tagging mill afterwards as 0.7.0 is a good idea.

lihaoyi (Member) commented Apr 14, 2020

@lefou give me a few days before tagging 0.7.0, I also want to try and squeeze in a Scala 2.13 upgrade now that sbt/zinc#754 has landed :)

lefou (Member, Author) commented Apr 14, 2020

@lihaoyi Sure.

@lefou lefou changed the title from "Parallel task evaluation - Please test" to "Parallel task evaluation" Apr 14, 2020
@lefou lefou merged commit 8d6b236 into com-lihaoyi:master Apr 14, 2020
@lefou lefou added this to the after 0.6.1 milestone Apr 14, 2020
@lefou lefou linked an issue Apr 14, 2020 that may be closed by this pull request
@lefou lefou deleted the par branch April 16, 2020 08:25
@lefou lefou removed the "feedback wanted" (Additional feedback or testing is appreciated) label Apr 16, 2020