
[SPARK-17455][MLlib] Improve PAVA implementation in IsotonicRegression #15018


Closed
neggert wants to merge 15 commits into apache:master from neggert:SPARK-17455-isoreg-algo

Conversation

neggert
Contributor

@neggert neggert commented Sep 8, 2016

What changes were proposed in this pull request?

New implementation of the Pool Adjacent Violators Algorithm (PAVA) in mllib.IsotonicRegression, which is used under the hood by ml.regression.IsotonicRegression. The previous implementation could have factorial complexity in the worst case. This implementation, which closely follows those in scikit-learn and the R `iso` package, runs in quadratic time in the worst case.
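(For context: given points sorted by feature, isotonic regression finds the non-decreasing labels yhat_1 <= ... <= yhat_n that minimize the weighted squared error sum_i w_i * (yhat_i - y_i)^2. PAVA computes this by repeatedly pooling adjacent order-violating points into blocks that take their weighted-average label.)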

How was this patch tested?

Existing unit tests in both `mllib` and `ml` passed before and after this patch. Scaling properties were tested by running the `poolAdjacentViolators` method in [scala-benchmarking-template](https://github.com/sirthias/scala-benchmarking-template) with the input generated by

``` scala
val x = (1 to length).toArray.map(_.toDouble)
val y = x.reverse.zipWithIndex.map { case (yi, i) => if (i % 2 == 1) yi - 1.5 else yi }
val w = Array.fill(length)(1d)

val input: Array[(Double, Double, Double)] = (y zip x zip w) map { case ((y, x), w) => (y, x, w) }
```

Before this patch:

| Input Length | Time (us) |
| --: | --: |
| 100 | 1.35 |
| 200 | 3.14 |
| 400 | 116.10 |
| 800 | 2134225.90 |

After this patch:

| Input Length | Time (us) |
| --: | --: |
| 100 | 1.25 |
| 200 | 2.53 |
| 400 | 5.86 |
| 800 | 10.55 |

Benchmarking was also performed with randomly-generated y values, with similar results.
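(A random-input variant of the harness might look like the sketch below; the seed and distribution here are assumptions for illustration, not the exact benchmark that was run.)

``` scala
val rng = new scala.util.Random(42)                    // assumed seed
val y = Array.fill(length)(rng.nextDouble() * length)  // assumed distribution
```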

@SparkQA

SparkQA commented Sep 8, 2016

Test build #65117 has finished for PR 15018 at commit 5608ae1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@neggert
Contributor Author

neggert commented Oct 4, 2016

Could someone please take a look at this? @mengxr maybe? Looks like you merged the initial implementation.

@neggert
Contributor Author

neggert commented Oct 20, 2016

Anyone? @srowen?

Member

@srowen srowen left a comment

More than anything it would be great to add a couple of simple unit tests of this method to prove it still works in key cases.
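(A sketch of the kind of key-case tests meant here; it assumes test access to poolAdjacentViolators, which is private in the PR, and the (label, feature, weight) tuple layout used throughout:)

``` scala
// Already-isotonic input should pass through unchanged.
val increasing = Array((1.0, 1.0, 1.0), (2.0, 2.0, 1.0), (3.0, 3.0, 1.0))
assert(poolAdjacentViolators(increasing).sameElements(increasing))

// A violating run should be pooled to its weighted mean: (3 + 1 + 2) / 3 = 2.0.
val decreasing = Array((3.0, 1.0, 1.0), (1.0, 2.0, 1.0), (2.0, 3.0, 1.0))
assert(poolAdjacentViolators(decreasing).map(_._1).sameElements(Array(2.0, 2.0, 2.0)))
```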


// Find next monotonicity violating sequence, if any.
while (j < n && input(j)._1 >= input(j + 1)._1) {
j = j + 1
Member

Nit: j += 1 and likewise elsewhere.

var j = i

// Find next monotonicity violating sequence, if any.
while (j < n && input(j)._1 >= input(j + 1)._1) {
Member

The condition changed to >= here on purpose, right? You're now looking for a region that's non-increasing, not just strictly decreasing. Just checking.

Member

That should be correct. Lemma 1 from "A Fast Scaling Algorithm for Minimizing Separable Convex Functions Subject to Chain Constraints"

Member

I think the original one i.e., input(j)._1 > input(j + 1)._1 is correct. Here it is going to select out-of-order blocks.

Quoted from the paper:

We refer to two blocks [p, q] and [q + 1, r] as consecutive. We refer to two consecutive blocks [p, q] and [q + 1, r] as in-order if theta_{p,q} <= theta_{q+1,r} and out-of-order otherwise.

Lemma 1 points out how a merged block is also a single-valued block.

Contributor Author

I hadn't actually seen that paper. I just worked from the scikit-learn implementation (which, incidentally, has changed since I submitted this PR).

I believe that either way will get you correct results. Doing >= should be faster in some cases, because it's greedier about pooling, which should lead to fewer iterations of the outer loop.
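(A tiny worked example of the difference: for labels (2, 2, 1) with unit weights, the >= scan takes the whole non-increasing run in one pass and pools it to 5/3, whereas > first pools only (2, 1) to 1.5, creating a new violation against the leading 2 that costs an extra pass before reaching the same 5/3.)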

// and check if pooling caused monotonicity violation in previously processed points.
while (i >= 0 && input(i)._1 > input(i + 1)._1) {
// Pool the violating sequence with the data point preceding it
if (input(i)._1 != input(j)._1) {
Member

This condition is "if there was any decrease, not just == from i to j" right? And pool makes everything in [i, j] non-violating.

Comments from the previous code note that this could introduce a violation -- is this still possible? It wasn't obvious why it can't happen now.

Member

I believe the logic is:

  • If equality holds here, then either (a) i = j or (b) the block [i,j] has equal elements.
  • A violation could be introduced but will be caught on the next iteration of the outer loop.

Contributor Author

Correct. Any violations that are introduced will be fixed in the next iteration.
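(Worked example: for labels (3, 4, 1) with unit weights, pooling the violating pair (4, 1) to 2.5 introduces a fresh violation against the leading 3; the next iteration pools all three points to 8/3, after which the sequence is monotone.)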

@jkbradley
Member

Thanks for pinging about this---it looks important!

One request: Do you know if there is a resource with a clearer statement of this precise algorithm for the L2 norm? That'd be nice to have. The best I found so far was "A Fast Scaling Algorithm for Minimizing Separable Convex Functions Subject to Chain Constraints"

// Otherwise pool the violating sequence
// and check if pooling caused monotonicity violation in previously processed points.
while (i >= 0 && input(i)._1 > input(i + 1)._1) {
// Pool the violating sequence with the data point preceding it
Member

Is this comment correct? Looks like you pool the violating sequence [i, j] only. The preceding data points are checked for pooling in the next outer loop.

@neggert neggert force-pushed the SPARK-17455-isoreg-algo branch from 5608ae1 to 8d16043 on December 14, 2016 15:49
@neggert
Contributor Author

neggert commented Dec 14, 2016

Given that the scikit-learn implementation I based this on has changed (scikit-learn/scikit-learn#7444) since I submitted the PR, and @jkbradley has pointed out a reference I hadn't seen, I'd like to take a little bit of time to see if there's an even better algorithm.

This one is an improvement over what we had, but as long as we're changing it, we may as well get it right.

@SparkQA

SparkQA commented Dec 14, 2016

Test build #70134 has finished for PR 15018 at commit 8d16043.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 14, 2016

Test build #70136 has finished for PR 15018 at commit 9cfa5ed.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@neggert
Contributor Author

neggert commented Dec 14, 2016

Looks like the new scikit-learn implementation suffers from a similar problem to the one that the original Spark implementation had. I left them a note pointing it out.

Right now I'm working on implementing what's in "Minimizing Separable Convex Functions Subject to Simple Chain Constraints", by Best, Chakravarti, and Ubhaya. Running down the reference trail, it looks like the original algorithm is in "Projections onto Order Simplexes" by Grotzinger and Witzgall, and this particular formulation of it is originally in "Active set algorithms for isotonic regression; A unifying framework" by Best and Chakravarti.

@srowen
Member

srowen commented Dec 15, 2016

If it's correct and faster, I tend to favor merging this. The code looks OK to my understanding and passes existing tests.

@neggert
Contributor Author

neggert commented Dec 15, 2016

Found another input that triggers non-polynomial time with the code in this PR. I'm again borrowing from scikit-learn. I think this is the case they found that led them to re-write their implementation.

``` scala
val y = ((0 until length) ++ (-(length - 1) until length) ++ (-(length - 1) to 0)).toArray.map(_.toDouble)
val x = (1 to y.length).toArray.map(_.toDouble)
```

| Input Length | Time (ns) |
| --: | --: |
| 40 | 2059 |
| 80 | 4604 |
| 160 | 1974269 |
| 320 | 3246603433 |

I'm now working on implementing what's described in the Best papers. This should give O(n), even in the worst case.

Should I close this and open a new PR with the new algorithm, or just add it here and you can squash when you merge?

@srowen
Member

srowen commented Dec 15, 2016

Just add commits here.

The isotonic regression problem doesn't even have a unique solution
when there are 0 weights.
This version closely follows the algorithm in referenced literature
and should guarantee O(n) complexity, even in the worst case.
@SparkQA

SparkQA commented Dec 15, 2016

Test build #70214 has finished for PR 15018 at commit 61d383f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@neggert
Contributor Author

neggert commented Dec 15, 2016

Alright, the PAV algorithm has been completely re-written to follow what's outlined in "Minimizing Separable Convex Functions Subject to Simple Chain Constraints". I've tested it with a bunch of different inputs that caused previous versions of this algorithm to go non-polynomial. It stays linear for all of them. I will note that it's slightly slower on very small datasets (< 5000 points or so), but those still finish in less than a millisecond on my laptop, so I'm not too concerned.

The only caveat is that there was one test that I couldn't get passing, so I removed it. It involved passing input data with 0 weights. I'd argue that the isotonic regression problem doesn't even have a unique solution in that case, so we shouldn't support it. Still, it is a slight change in behavior.

Looks like Jenkins is pointing out some style issues, so I'll get to work on those.
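(For readers following the review below, here is a condensed sketch of the block-based algorithm from the Best, Chakravarti, and Ubhaya paper. The names mirror the fragments under review (weights, blockBounds, average, merge, nextBlock, prevBlock), but the body is a reconstruction for illustration, not the exact code in this PR.)

``` scala
// Sketch of block-based PAVA; a reconstruction for illustration, not the exact PR code.
def poolAdjacentViolators(
    input: Array[(Double, Double, Double)]): Array[(Double, Double, Double)] = {
  if (input.isEmpty) return Array.empty[(Double, Double, Double)]

  // Per-block (weight, weight * y); only entries at block-start indices are meaningful.
  val weights: Array[(Double, Double)] = input.map { case (y, _, w) => (w, w * y) }
  // blockBounds(start) gives the block's end index; blockBounds(end) gives its start.
  val blockBounds: Array[Int] = Array.range(0, input.length)

  def blockEnd(start: Int): Int = blockBounds(start)
  def blockStart(end: Int): Int = blockBounds(end)
  def nextBlock(start: Int): Int = blockEnd(start) + 1
  def prevBlock(start: Int): Int = blockStart(start - 1)
  // A zero total weight would give NaN here, hence the zero-weight handling discussed below.
  def average(start: Int): Double = weights(start)._2 / weights(start)._1

  // Merge two consecutive blocks and return the merged block's start index.
  def merge(block1: Int, block2: Int): Int = {
    assert(blockEnd(block1) + 1 == block2,
      s"Attempting to merge non-consecutive blocks starting at $block1 and $block2")
    val end2 = blockEnd(block2)
    val (w1, wy1) = weights(block1)
    val (w2, wy2) = weights(block2)
    weights(block1) = (w1 + w2, wy1 + wy2)
    blockBounds(block1) = end2
    blockBounds(end2) = block1
    block1
  }

  var i = 0
  while (nextBlock(i) < input.length) {
    if (average(i) >= average(nextBlock(i))) {
      merge(i, nextBlock(i))
      // Pooling can violate monotonicity with earlier blocks, so merge backwards too.
      while (i > 0 && average(prevBlock(i)) >= average(i)) {
        i = merge(prevBlock(i), i)
      }
    } else {
      i = nextBlock(i)
    }
  }

  // Expand each block back into per-point (pooled label, feature, weight) tuples.
  val output = input.clone()
  var start = 0
  while (start < input.length) {
    val avg = average(start)
    (start to blockEnd(start)).foreach { k =>
      output(k) = (avg, input(k)._2, input(k)._3)
    }
    start = nextBlock(start)
  }
  output
}
```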

@SparkQA

SparkQA commented Dec 15, 2016

Test build #70216 has finished for PR 15018 at commit e420fb3.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@srowen srowen left a comment

Modulo a few minor comments, the code at least looks clean. If the tests pass, seems OK.

gives the values for the block. Entries that are not at the start of a block
are meaningless.
*/
val weights: Array[(Double, Double)] = input.map(x => (x._3, x._3 * x._1)) // (weight, weight * y)
Member

Up to you; might be clearer as input.map { case (y, _, weight) => (weight, weight * y) }

}
/*
Keeps track of the start and end indices of the blocks. blockBounds(start) gives the
index of the end of the block and blockBounds(end) gives the index of the start of the
Member

blockBounds(start) is the end of a block? This is reflected in your helper functions below, but it seems reversed. Can you double-check and elaborate a bit?
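(An illustration of the intended invariant, using hypothetical values: after merging the singleton blocks at indices 3, 4, and 5 into one block,)

``` scala
val blockBounds = Array(0, 1, 2, 5, 3, 3)  // blocks: {0}, {1}, {2}, {3, 4, 5}
blockBounds(3)  // 5: the merged block's start maps to its end
blockBounds(5)  // 3: its end maps back to its start
// blockBounds(4) is stale; interior entries are never consulted.
```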

while (nextBlock(i) < input.length) {
if (average(i) >= average(nextBlock(i))) {
merge(i, nextBlock(i))
while((i > 0) && (average(prevBlock(i)) >= average(i))) {
Member

Super nit: space before while here and below. You can probably nix the extra parens around each of the two terms, but no big deal

@SparkQA

SparkQA commented Dec 19, 2016

Test build #70363 has finished for PR 15018 at commit 8762531.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@neggert neggert force-pushed the SPARK-17455-isoreg-algo branch from 8762531 to d19647b on December 19, 2016 16:17
@SparkQA

SparkQA commented Dec 19, 2016

Test build #70364 has finished for PR 15018 at commit d19647b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@neggert
Contributor Author

neggert commented Dec 19, 2016

@srowen Addressed your comments and fixed some style issues.

Some updated timings:

Alternating decreasing input

``` scala
val x = (1 to length).toArray.map(_.toDouble)
val y = x.reverse.zipWithIndex.map { case (yi, i) => if (i % 2 == 1) yi - 1.5 else yi }
```

Before:

| Input Length | Time (us) |
| --: | --: |
| 100 | 1.4 |
| 200 | 3.0 |
| 400 | 101.8 |
| 800 | 2,010,511.7 |

After:

| Input Length | Time (us) |
| --: | --: |
| 100 | 2.5 |
| 200 | 5.8 |
| 400 | 10.1 |
| 800 | 20.7 |

Pathological input from scikit-learn benchmarks:

``` scala
val y = ((0 until length) ++ (-(length - 1) until length) ++ (-(length - 1) to 0)).toArray.map(_.toDouble)
val x = (1 to y.length).toArray.map(_.toDouble)
```

Before:

| Input Length | Time (us) |
| --: | --: |
| 40 | 2.2 |
| 80 | 4.7 |
| 160 | 2140.0 |
| 320 | 3,076,508.1 |

After:

| Input Length | Time (us) |
| --: | --: |
| 40 | 5.3 |
| 80 | 10.4 |
| 160 | 21.3 |
| 320 | 41.9 |

@viirya
Member

viirya commented Dec 20, 2016

For the zero-weight values, can we do something similar to scikit-learn and remove zero-weight values, like amueller/scikit-learn@2415100?

@SparkQA

SparkQA commented Jan 4, 2017

Test build #70886 has finished for PR 15018 at commit a119423.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// gives the values for the block. Entries that are not at the start of a block
// are meaningless.
val weights: Array[(Double, Double)] = input.map { case (y, _, weight) =>
require(weight != 0.0)
Member

Are negative weights OK? (you don't need a type on weight, but hardly matters)

Contributor Author

It's a weird thing to do, but I don't think there's any reason we can't support it. The problem still has a unique solution.

Member

@srowen srowen left a comment

OK, seems reasonable to me. Any other thoughts @viirya ?

@viirya
Member

viirya commented Jan 6, 2017

SGTM. See if @jkbradley has more comments?

Member

@jkbradley jkbradley left a comment

Thanks for pinging me. The updated implementation looks nice, but I do have comments about weights.

I'd like to disallow negative weights. It's quite safe to assume that weights are non-negative in ML, and I've never seen negative weights used in isotonic regression.

  • Q: I saw mention of negative weights above, but the Best and Chakravarti (1990) paper referenced explicitly requires strictly positive weights. Does one of the other papers talk about negative weights?
  • Btw, with negative weights, it would be easy to have undefined solutions. Imagine the first element in the input having the smallest value and a negative weight. A "correct" output would be infinite for that value and 0 elsewhere, giving a minimum objective value of NegativeInfinity (spelled out below).
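(To spell that out: the weighted objective is sum_i w_i * (yhat_i - y_i)^2 subject to yhat_1 <= ... <= yhat_n. With w_1 < 0, sending yhat_1 toward negative infinity keeps the constraints satisfied while driving the objective to negative infinity, so no finite minimizer exists.)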

I'd also prefer to allow zero weights, both to avoid breaking the API and because the model is fairly well-defined even with zero weights, except for edge cases at extreme feature values. Removing the zero-weight rows should provide reasonable solutions, though we'd need to assert that the whole dataset has non-zero weight.

How does this sound?

// Merge two adjacent blocks, updating blockBounds and weights to reflect the merge
// Return the start index of the merged block
def merge(block1: Int, block2: Int): Int = {
assert(blockEnd(block1) + 1 == block2, "attempting to merge non-consecutive blocks")
Member

Please make the error message more informative:

  • Make it clear that this indicates an internal bug within IsotonicRegression
  • State the invalid values

// gives the values for the block. Entries that are not at the start of a block
// are meaningless.
val weights: Array[(Double, Double)] = input.map { case (y, _, weight) =>
require(weight != 0.0)
Member

Please add an informative error message.

@neggert
Contributor Author

neggert commented Jan 9, 2017

Your changes sound fine. I don't have strong feelings one way or another, although I think we should at least throw a warning if we're discarding points. I do want to point out that disallowing negative weights also breaks the API, since they were previously supported (there's even a test for them).

@SparkQA

SparkQA commented Jan 9, 2017

Test build #71087 has finished for PR 15018 at commit e33a308.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@neggert neggert force-pushed the SPARK-17455-isoreg-algo branch from e33a308 to 83480e7 on January 9, 2017 21:22
@SparkQA

SparkQA commented Jan 9, 2017

Test build #71092 has finished for PR 15018 at commit 83480e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

*
* [3] Best, Michael J., Nilotpal Chakravarti, and Vasant A. Ubhaya. "Minimizing separable convex
* functions subject to simple chain constraints." SIAM Journal on Optimization 10.3 (2000):
* 658-672.
*
* @param input Input data of tuples (label, feature, weight).
Member

Can you update the param doc to indicate that negative weights are now prohibited?

@viirya
Member

viirya commented Jan 12, 2017

LGTM

@SparkQA

SparkQA commented Jan 12, 2017

Test build #71271 has finished for PR 15018 at commit 132d6a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

@neggert I'll ask @mengxr about the negative weights since he oversaw the original work here.

Member

@jkbradley jkbradley left a comment

Just a few comments, other than checking with @mengxr about negative weights.

val cleanInput = input.flatMap{ case (y, x, weight) =>
require(weight >= 0.0, "weights must be non-negative")
if (weight == 0.0) {
logWarning(s"Dropping zero-weight point ($y, $x, $weight)")
Member

I disagree about logging for zero-weight points. If an instance has zero weight, then it should be ignored, and that's not a problem.

If we really want to log this, then let's do it once per dataset, not once per row, which could make logs explode. I'd also say we should lower the logging level to logInfo.

if (input.isEmpty) {
return Array.empty
val cleanInput = input.flatMap{ case (y, x, weight) =>
require(weight >= 0.0, "weights must be non-negative")
Member

State invalid y,x,weight.

* @return Result tuples (label, feature, weight) where labels were updated
* to form a monotone sequence as per isotonic regression definition.
*/
private def poolAdjacentViolators(
input: Array[(Double, Double, Double)]): Array[(Double, Double, Double)] = {

if (input.isEmpty) {
return Array.empty
val cleanInput = input.flatMap{ case (y, x, weight) =>
Member

filter would be simpler than flatMap
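(Combining this with the earlier logging comment, a sketch of what the cleanup could look like; it assumes the enclosing class mixes in Logging, as the existing logWarning call suggests.)

``` scala
// Validate weights and drop zero-weight points, logging once per dataset.
val cleanInput = input.filter { case (y, x, weight) =>
  require(weight >= 0.0,
    s"Negative weight at point ($y, $x, $weight). Weights must be non-negative.")
  weight != 0.0
}
if (cleanInput.length != input.length) {
  logInfo(s"Dropped ${input.length - cleanInput.length} zero-weight points from the input.")
}
```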

assert(
blockEnd(block1) + 1 == block2,
s"Attempting to merge non-consecutive blocks [${block1}, ${blockEnd(block1)}]" +
s" and [${block2}, ${blockEnd(block2)}. This is likely a bug in the isotonic regression" +
Member

missing "]"

@jkbradley
Member

@zapletal-martin Pinging since you wrote the original PR: There's discussion here about whether IsotonicRegression should support negative weights. Is there a good reason to? I haven't seen papers which allow negative weights, though I've only looked at a few. Thanks!

@neggert
Contributor Author

neggert commented Jan 13, 2017

My thought was that there's nothing in the computation that prevents it (unlike 0 weights, which cause NaNs). Why not err on the side of maximum flexibility to the end user? Just because there's no use case for it now doesn't mean someone won't find one in the future.
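(Concretely, since blocks are stored as (weight, weight * y), a zero-weight point yields a block average of 0.0 / 0.0:)

``` scala
val (w, wy) = (0.0, 0.0)  // a zero-weight point's block entry
val avg = wy / w          // Double.NaN
avg >= 1.0 || avg < 1.0   // false: NaN poisons every pooling comparison
```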

Just my 2 cents. I don't feel strongly about it and will defer to committers.

@SparkQA

SparkQA commented Jan 13, 2017

Test build #71329 has started for PR 15018 at commit 0aeca09.

@neggert
Contributor Author

neggert commented Jan 16, 2017

Looks like the build completed successfully. Somehow Jenkins didn't post the message.

@jkbradley
Member

Thanks for the updates! This LGTM, except for deciding about negative weights.

Responding to your comment above, negative weights are just as problematic as 0 weights. See my comment above "Btw, with negative weights, ..."

I spoke with @mengxr offline, and he does not recall why the original work supported negative weights. I'll wait a few days before merging this to see if @zapletal-martin can check out the issue.

@srowen
Member

srowen commented Jan 23, 2017

@jkbradley I'll merge if there are no other thoughts here. Seems OK to me too

@jkbradley
Member

Sounds good. I'll run fresh tests before merging to be safe though.

@SparkQA

SparkQA commented Jan 23, 2017

Test build #3547 has finished for PR 15018 at commit 0aeca09.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

Merging with master
Thanks!

@asfgit asfgit closed this in c8aea74 Jan 23, 2017
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017