
[SPARK-18060][ML] Avoid unnecessary computation for MLOR #15593

Closed

Conversation

@sethah (Contributor) commented Oct 21, 2016

What changes were proposed in this pull request?

Before this patch, the gradient updates for multinomial logistic regression were computed by an outer loop over the number of classes and an inner loop over the number of features. Inside the inner loop, we standardized the feature value (`value / featuresStd(index)`), which means we performed the computation `numFeatures * numClasses` times. We only need to perform that computation `numFeatures` times, however. If we re-order the inner and outer loop, we can avoid this, but then we lose sequential memory access. In this patch, we instead lay out the coefficients in column major order while we train, so that we can avoid the extra computation and retain sequential memory access. We convert back to row-major order when we create the model.
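To illustrate the loop reordering, here is a minimal sketch (not the actual Spark source): the names `localGradientArray`, `multipliers`, and `featuresStd` mirror the PR's vocabulary, intercepts are omitted for simplicity, and the data point is assumed to be given as parallel `indices`/`values` arrays.

    // Row-major layout: outer loop over classes, inner loop over features; the
    // standardization (value / featuresStd) is recomputed numClasses * numFeatures times.
    def rowMajorUpdate(indices: Array[Int], values: Array[Double], weight: Double,
        multipliers: Array[Double], localGradientArray: Array[Double],
        featuresStd: Array[Double], numClasses: Int, numFeatures: Int): Unit = {
      var c = 0
      while (c < numClasses) {
        var k = 0
        while (k < indices.length) {
          val index = indices(k)
          localGradientArray(c * numFeatures + index) +=
            weight * multipliers(c) * values(k) / featuresStd(index)
          k += 1
        }
        c += 1
      }
    }

    // Column-major layout: outer loop over features, inner loop over classes; the
    // standardization happens once per feature, and the gradient entries for one
    // feature are contiguous, so memory access stays sequential.
    def colMajorUpdate(indices: Array[Int], values: Array[Double], weight: Double,
        multipliers: Array[Double], localGradientArray: Array[Double],
        featuresStd: Array[Double], numClasses: Int): Unit = {
      var k = 0
      while (k < indices.length) {
        val index = indices(k)
        val stdValue = values(k) / featuresStd(index)
        var c = 0
        while (c < numClasses) {
          localGradientArray(index * numClasses + c) += weight * multipliers(c) * stdValue
          c += 1
        }
        k += 1
      }
    }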

How was this patch tested?

This is an implementation detail only, so the original behavior should be maintained. All tests pass. I ran some performance tests to verify speedups. The results are below, and show significant speedups.

Performance Tests

Setup

3 node bare-metal cluster
120 cores total
384 GB RAM total

Results

NOTE: The `currentMasterTime` and `thisPatchTime` are times in seconds for a single iteration of L-BFGS or OWL-QN.

|    |   numPoints |   numFeatures |   numClasses |   regParam |   elasticNetParam |   currentMasterTime (sec) |   thisPatchTime (sec) |   pctSpeedup |
|----|-------------|---------------|--------------|------------|-------------------|---------------------------|-----------------------|--------------|
|  0 |       1e+07 |           100 |          500 |       0.5  |                 0 |                        90 |                    18 |           80 |
|  1 |       1e+08 |           100 |           50 |       0.5  |                 0 |                        90 |                    19 |           78 |
|  2 |       1e+08 |           100 |           50 |       0.05 |                 1 |                        72 |                    19 |           73 |
|  3 |       1e+06 |           100 |         5000 |       0.5  |                 0 |                        93 |                    53 |           43 |
|  4 |       1e+07 |           100 |         5000 |       0.5  |                 0 |                       900 |                   390 |           56 |
|  5 |       1e+08 |           100 |          500 |       0.5  |                 0 |                       840 |                   174 |           79 |
|  6 |       1e+08 |           100 |          200 |       0.5  |                 0 |                       360 |                    72 |           80 |
|  7 |       1e+08 |          1000 |            5 |       0.5  |                 0 |                         9 |                     3 |           66 |

@sethah (Contributor Author) commented Oct 21, 2016

cc @dbtsai

This may help with the problems you mentioned you were having on SPARK-17134 :)

@SparkQA commented Oct 22, 2016

Test build #67362 has finished for PR 15593 at commit 07fd150.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah (Contributor Author) commented Nov 1, 2016

Ping @MLnick @jkbradley: this should be a nice performance boost for MLOR in ML; hopefully we can get it in for 2.1. If you get some time to review or run tests, I'd really appreciate it.

@dbtsai (Member) commented Nov 1, 2016

@sethah I've been busy with company work recently. I'll start working on open source code reviews later this week.

val sum = {
var temp = 0.0
if (maxMargin > 0) {
for (i <- 0 until numClasses) {
var i = 0
Contributor:

Can this not be:

val sum = {
  var temp = 0.0
  var i = 0
  while (i < numClasses) { 
    if (maxMargin > 0) margins(i) -= maxMargin
    val exp = math.exp(margins(i))
    temp += exp
    multipliers(i) = exp
    i += 1
  }
  temp
}

Contributor Author:

Yep, thanks!

@@ -489,13 +485,14 @@ class LogisticRegression @Since("1.2.0") (
val initialCoefWithInterceptArray = initialCoefficientsWithIntercept.toArray
val providedCoef = optInitialModel.get.coefficientMatrix
providedCoef.foreachActive { (row, col, value) =>
  val flatIndex = row * numFeaturesPlusIntercept + col
  // convert matrix to column major for training
Contributor:

Is this code path ever tested, since the current initialModel is just a dummy placeholder that is always None?

Contributor Author:

There are unit tests for it. Also, this path is used by the old mllib code, which exposed an initial model and now wraps ML.

margin += localCoefficients(i * numFeaturesPlusIntercept + index) *
  value / localFeaturesStd(index)
}
val margins = new Array[Double](numClasses)
Contributor:

It may be good to just add a general comment (perhaps in LogisticAggregator) noting that training is done using col-major order and that this is converted back to row-major once training is done. And perhaps a little detail on why it is (was) necessary.

Contributor Author:

Ok, I added a comment inside of LogisticRegression.train above. Do you think it needs to be moved or just duplicated inside LogisticAggregator?

Contributor:

I think a slightly more detailed comment would be good for the aggregator, something like the following (please feel free to make it more clear):

In order to avoid unnecessary computation during calculation of the gradient updates,
we lay out the coefficients in column major order during training. This allows us to 
perform feature standardization once, while still retaining sequential memory access 
for speed. We convert back to row-major order when we create the model, 
since this form is optimal for the matrix operations used for prediction.

Here we can amend slightly to say:

Additionally, since the coefficients were laid out in column major order during training to
avoid extra computation, we convert them back to row major before passing them to the model.

@SparkQA commented Nov 3, 2016

Test build #68084 has finished for PR 15593 at commit 210e1d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah (Contributor Author) commented Nov 8, 2016

@MLnick I updated it with your suggested wording for the comments.

@SparkQA commented Nov 8, 2016

Test build #68308 has finished for PR 15593 at commit e205670.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -489,13 +485,14 @@ class LogisticRegression @Since("1.2.0") (
val initialCoefWithInterceptArray = initialCoefficientsWithIntercept.toArray
Contributor:

Can you document on line 465 the layout of the data, to help future developers (all the coefficients in column major order, maybe followed by a last column that contains the intercepts)?

Contributor Author:

I added a comment, not sure if it's exactly what you had in mind. What do you think?

Member:

It's harder to reason about using column major order in MLOR, and it's not consistent with the rest of the linear models. Do you think we could still use row major as much as possible, so we don't need to touch the code here, do the column major conversion only internally in LogisticAggregator, and convert back from column major to row major when LogisticAggregator's gradient is called?

Contributor Author:

Are there other linear models that use a matrix of coefficients? I agree that thinking about indexing when flattening a matrix into an array is a pain, but I don't really see how column major is more difficult than row major. The only place this would make much difference is that we wouldn't have to modify the L2 reg update logic and the initial coefficients, but we still need to swap between layouts in every iteration. I guess I don't see that this is obviously simpler.

And for the review of this PR, I think we can be confident of the correctness due to the robustness of the LogisticRegression test suite - which has extensive correctness tests.

Member:

When we loop over the coefficients of LiR or BLOR as a flattened array, people think of it as looping in the direction of the features. In this PR that logic changes, which can lead to maintenance issues when developers work on this code later. For example, I had difficulty understanding the code at first glance. Alternatively, for the L1/L2 update logic and the initial coefficients, I think we could put them into a col-major matrix and use foreachActive with explicit row and col indices, so readers can easily follow the logic.

Contributor:

By the way, for LiR or binary LoR, row- or col-major will both loop "in the direction of features", since whether the coefficients form a "row vector" or a "column vector" doesn't matter, no?

@SparkQA commented Nov 9, 2016

Test build #68420 has finished for PR 15593 at commit c003ee9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

var i = 0
while (i < numClasses) {
  localGradientArray(numFeatures * numClasses + i) += weight * multipliers(i)
  i += 1
}
}

Member:

You could make def gradient: Vector return a Matrix instead; for MLOR, the implementation can be a col-major matrix, so when we use foreachActive we don't need to worry about the underlying layout.

Contributor:

I'm not sure I fully get where you intend to use foreachActive over the gradient matrix? Maybe it's the location of this comment that is confusing me ...

... but here in multinomialUpdateInPlace, we are iterating over features using foreachActive, then for each feature iterating over numClasses. If we iterate over the gradient using foreachActive how will that work? Won't it be super inefficient? Perhaps I am missing something about what you intend, could you clarify with an example?

@MLnick (Contributor) commented Nov 11, 2016:

My (perhaps incorrect) understanding of what you are saying:

"The aggregator should internally convert to col-major during training & in-place update and from gradient will return a Matrix that can be used during L2 update (cycle through using foreachActive) in L1678-L1716."

But to me that would require swapping coeffs.foreachActive for gradient.foreachActive in L1684. Since coeff is itself a vector (which we can't really change due to BreezeDiffFunction, and we laid it out col-major for training) we would need to index back into that using the same col-major indexing anyway, no? So I don't see that we gain anything for L2 reg. Also, if we change back and forth every iteration that seems like it would be inefficient.

For the L1 reg function, that is purely index-based so val isIntercept = $(fitIntercept) && ((index + 1) % numFeaturesPlusIntercept == 0) doesn't seem any more or less complex than val isIntercept = $(fitIntercept) && index >= numFeatures * numCoefficientSets. So I don't think we gain anything during L1 reg function either.

Finally, during initial coefficients setup, we already loop through them if they are provided. If not, the intercept adjustments are the only things that occur (the rest are zeros) so that is just indexing into the intercepts, which whether row- or col-major doesn't seem to me to be a major issue.

I tend to agree that the col-major indexing is no "more complex" than row-major. It's also commented fairly clearly why and where we lay it out col-major (with example of the form), and where we convert back to row-major. We need to convert back to row-major outside of the aggregator because the coefficients are returned from optimizer.states.x which is Breeze-specific, not LogisticAggregator or even LogisticCostFun.
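
For reference, a minimal standalone sketch of the two intercept checks compared above (in the actual code these are inline expressions; the parameter names are assumed here):

    // Row-major layout with an intercept appended to each class's block of
    // numFeaturesPlusIntercept entries: the intercept is the last slot of each block.
    def isInterceptRowMajor(index: Int, numFeaturesPlusIntercept: Int): Boolean =
      (index + 1) % numFeaturesPlusIntercept == 0

    // Column-major layout used during training: all intercepts sit in one final block,
    // after the numFeatures * numCoefficientSets feature coefficients.
    def isInterceptColMajor(index: Int, numFeatures: Int, numCoefficientSets: Int): Boolean =
      index >= numFeatures * numCoefficientSets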

@MLnick (Contributor) commented Nov 11, 2016

@dbtsai I understand wanting to make things clear & easy to understand for current and future developers. Of course I would like the same.

However, it's currently not clear to me how (or if) your proposal above would work to achieve that goal. Concrete example code would help a lot here (I am probably missing something in my interpretation).

To be honest, even the current code is difficult to understand at first (or even third!) glance for a developer unfamiliar with the implementation (perhaps we can work to improve that, perhaps it can't be improved that much given we have to work with arrays, low-level performance and the Breeze optimizers). I don't think the col-major layout during training changes this situation significantly over row-major. As long as it is clearly and correctly documented I think it is ok (if the doc comments added here need improvement we can make changes).

(By the way, the row- or col-major layout shouldn't make a difference for BLoR, and indeed the BLoR code is not really touched. The indexing may look different but is in practice the same).

I've worked through the impl in the PR and to me the changes look good. The tests are pretty comprehensive and are a good check on things so we can be comfortable. This is an important performance fix that I think must be in 2.1.

If you can show a way to achieve the same but have more clarity & simplicity, then I'm all for that, but I think that can be done in a separate JIRA (even after 2.1 if need be).

Let me know if you disagree.

@dbtsai (Member) commented Nov 11, 2016

@MLnick I'm hoping that we could abstract away the use of the column major format as much as possible, so that in the future new developers can understand the code without thinking hard. I agree that we have good comments and documentation in the code now, but having said that, with some effort we can hide the detail behind a column major matrix.

For example, if we have coeffs and gradient as column major matrices, we can write the L2 part of the gradient as follows, without computing isIntercept and featureIndex through complicated index arithmetic.

      // Sketch: `coeffValues` and `gradientValues` stand in for the flat backing arrays
      // (length numClasses * numFeaturesPlusIntercept, column major); when fitIntercept is
      // true, the intercepts occupy the last column.
      val coeffMatrix =
        new DenseMatrix(numClasses, numFeaturesPlusIntercept, coeffValues, false)
      val totalGradientMatrix =
        new DenseMatrix(numClasses, numFeaturesPlusIntercept, gradientValues, false)

      var sum = 0.0
      coeffMatrix.foreachActive { (classIndex, featureIndex, value) =>
        // We do not apply regularization to the intercepts
        val isIntercept = fitIntercept && featureIndex == numFeatures
        if (!isIntercept) {
          // The following code computes the loss of the regularization and also
          // the gradient of the regularization, adding it back to totalGradientMatrix.
          sum += {
            if (standardization) {
              totalGradientMatrix.update(classIndex, featureIndex,
                totalGradientMatrix(classIndex, featureIndex) + regParamL2 * value)
              value * value
            } else {
              if (featuresStd(featureIndex) != 0.0) {
                // If `standardization` is false, we still standardize the data
                // to improve the rate of convergence; as a result, we have to
                // perform this reverse standardization by penalizing each component
                // differently to get effectively the same objective function when
                // the training dataset is not standardized.
                val temp = value / (featuresStd(featureIndex) * featuresStd(featureIndex))
                totalGradientMatrix.update(classIndex, featureIndex,
                  totalGradientMatrix(classIndex, featureIndex) + regParamL2 * temp)
                value * temp
              } else {
                0.0
              }
            }
          }
        }
      }
      // The L2 regularization term added to the loss:
      0.5 * regParamL2 * sum

For the L1 part, I'm thinking of doing the following:

          new BreezeOWLQN[(Int, Int), BDM[Double]]($(maxIter), 10, regParamL1Fun, $(tol))

Then we can implement the L1 function as:

          def regParamL1Fun = (classIndex: Int, featureIndex: Int) => {
            // Remove the L1 penalization on the intercept
            val isIntercept = fitIntercept && featureIndex == numFeatures
            if (isIntercept) {
              0.0
            } else {
              if (standardizationParam) {
                regParamL1
              } else {
                if (featuresStd(featureIndex) != 0.0) {
                  regParamL1 / featuresStd(featureIndex)
                } else {
                  0.0
                }
              }
            }
          }

By having coeffs and gradientSum as matrices, we also open up the opportunity to use a sparse representation for both of them. In high dimensional problems, both can often be very sparse.

I'm also aware that we're cutting the 2.1 release now and this is an important performance improvement, so I'm open to dealing with the cleanup work in a separate PR, so that at least we're able to merge this.

Thanks.

@sethah (Contributor Author) commented Nov 11, 2016

Thanks for the detailed explanation @dbtsai. +1 for doing this in a separate PR, since I'd imagine we want to run all the performance tests again as well.

@asfgit closed this in 46b2550 on Nov 12, 2016
@dbtsai (Member) commented Nov 12, 2016

Thanks all for working on this PR. I merged this into master and the 2.1 branch, and I'll create a follow-up task and PR to handle the abstraction, together with handling the smoothing in the initialization of coefficients.

asfgit pushed a commit that referenced this pull request Nov 12, 2016
Author: sethah <seth.hendrickson16@gmail.com>

Closes #15593 from sethah/MLOR_PERF_COL_MAJOR_COEF.

(cherry picked from commit 46b2550)
Signed-off-by: DB Tsai <dbtsai@dbtsai.com>
@sethah (Contributor Author) commented Nov 12, 2016

Thanks @dbtsai!

@MLnick (Contributor) commented Nov 12, 2016

Thanks @dbtsai for your reply - yes, that makes things clear. I didn't realise we could use a Breeze matrix in the optimizers. That will definitely help. I agree we should try to do that and simplify and clarify the code. Hopefully you or @sethah has time to do that for 2.1, but if not it'll be fine for 2.1.1.

asfgit pushed a commit that referenced this pull request Nov 20, 2016
…n LogisticRegression training

## What changes were proposed in this pull request?

This is a follow up to some of the discussion [here](#15593). During LogisticRegression training, we store the coefficients combined with intercepts as a flat vector, but a more natural abstraction is a matrix. Here, we refactor the code to use matrix where possible, which makes the code more readable and greatly simplifies the indexing.

Note: We do not use a Breeze matrix for the cost function as was mentioned in the linked PR. This is because LBFGS/OWLQN require an implicit `MutableInnerProductModule[DenseMatrix[Double], Double]` which is not natively defined in Breeze. We would need to extend Breeze in Spark to define it ourselves. Also, we do not modify the `regParamL1Fun` because OWLQN in Breeze requires a `MutableEnumeratedCoordinateField[(Int, Int), DenseVector[Double]]` (since we still use a dense vector for coefficients). Here again we would have to extend Breeze inside Spark.
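
A minimal sketch (assumed, not taken from the follow-up PR) of what "use matrix where possible" can look like with the column-major layout; the sizes and array names below are hypothetical, and this assumes code living inside Spark's ml package, since foreachActive on ml.linalg matrices is not public.

    import org.apache.spark.ml.linalg.DenseMatrix

    val numCoefficientSets = 3            // e.g. numClasses for MLOR (hypothetical size)
    val numFeaturesPlusIntercept = 5      // numFeatures + 1 when fitIntercept is true
    val flatCoefficients = new Array[Double](numCoefficientSets * numFeaturesPlusIntercept)

    // Spark's DenseMatrix stores values in column-major order by default, so wrapping the
    // flat training vector this way makes column j hold feature j's coefficients for all
    // coefficient sets, matching the layout used during training.
    val coefMatrix =
      new DenseMatrix(numCoefficientSets, numFeaturesPlusIntercept, flatCoefficients)

    var l2Sum = 0.0
    coefMatrix.foreachActive { (classIndex, featureIndex, value) =>
      // Explicit (classIndex, featureIndex) indexing replaces manual flat-index arithmetic;
      // the intercept column (the last one) is excluded from regularization.
      val isIntercept = featureIndex == numFeaturesPlusIntercept - 1
      if (!isIntercept) l2Sum += value * value
    }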

## How was this patch tested?

This is internal code refactoring - the current unit tests passing show us that the change did not break anything. No added functionality in this patch.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #15893 from sethah/logreg_refactor.

(cherry picked from commit 856e004)
Signed-off-by: DB Tsai <dbtsai@dbtsai.com>
ghost pushed a commit to dbtsai/spark that referenced this pull request Nov 20, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017