[SPARK-16831] [Python] Fixed bug in CrossValidator.avgMetrics #14456


Closed
wants to merge 3 commits

Conversation

Contributor

@pkch pkch commented Aug 2, 2016

What changes were proposed in this pull request?

avgMetrics was summed, not averaged, across folds

@@ -234,7 +234,7 @@ def _fit(self, dataset):
                 model = est.fit(train, epm[j])
                 # TODO: duplicate evaluator to take extra params from input
                 metric = eva.evaluate(model.transform(validation, epm[j]))
-                metrics[j] += metric
+                metrics[j] += metric/nFolds

Member

This may fail style checks but we'll see

Member

OK, looks fine actually. Would it be possible to add a little bit to the test above this to demonstrate that the result is correct now? Just testing the value of the first element of metrics, for example.

Contributor Author

@pkch pkch Aug 3, 2016

Done (2 commits because I made a typo.)

@srowen
Member

srowen commented Aug 2, 2016

Jenkins add to whitelist

@srowen
Member

srowen commented Aug 2, 2016

Jenkins test this please

@SparkQA

SparkQA commented Aug 2, 2016

Test build #63122 has finished for PR 14456 at commit e253be5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 3, 2016

Test build #63157 has finished for PR 14456 at commit 0a4b0a9.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 3, 2016

Test build #63158 has finished for PR 14456 at commit c426a98.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -166,6 +166,8 @@ class CrossValidator(Estimator, ValidatorParams):
     >>> evaluator = BinaryClassificationEvaluator()
     >>> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
     >>> cvModel = cv.fit(dataset)
+    >>> cvModel.avgMetrics[0]
+    0.5

Member

Nice, I assume this would have printed 1.5 before
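
To spell out the arithmetic, here is a minimal pure-Python sketch using the doctest's 0.5 per-fold value with the default 3 folds (an illustration only, not Spark code):

nFolds = 3
fold_metrics = [0.5, 0.5, 0.5]   # hypothetical evaluator results, one per held-out fold

# Old behaviour: metrics[j] += metric accumulates the plain sum across folds.
summed = 0.0
for metric in fold_metrics:
    summed += metric             # ends at 1.5

# Fixed behaviour: metrics[j] += metric / nFolds accumulates the mean instead.
averaged = 0.0
for metric in fold_metrics:
    averaged += metric / nFolds  # ends at 0.5

print(summed, averaged)          # 1.5 0.5 (up to floating-point rounding)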

asfgit pushed a commit that referenced this pull request Aug 3, 2016
## What changes were proposed in this pull request?

avgMetrics was summed, not averaged, across folds

Author: =^_^= <maxmoroz@gmail.com>

Closes #14456 from pkch/pkch-patch-1.

(cherry picked from commit 639df04)
Signed-off-by: Sean Owen <sowen@cloudera.com>
@srowen
Member

srowen commented Aug 3, 2016

Merge to master/2.0/1.6

asfgit pushed a commit that referenced this pull request Aug 3, 2016
avgMetrics was summed, not averaged, across folds

Author: =^_^= <maxmoroz@gmail.com>

Closes #14456 from pkch/pkch-patch-1.

(cherry picked from commit 639df04)
Signed-off-by: Sean Owen <sowen@cloudera.com>
@asfgit asfgit closed this in 639df04 Aug 3, 2016
zzcclp pushed a commit to zzcclp/spark that referenced this pull request Aug 4, 2016
avgMetrics was summed, not averaged, across folds

Author: =^_^= <maxmoroz@gmail.com>

Closes apache#14456 from pkch/pkch-patch-1.

(cherry picked from commit 639df04)
Signed-off-by: Sean Owen <sowen@cloudera.com>
(cherry picked from commit 92ee6fb)
@yhuai
Contributor

yhuai commented Aug 11, 2016

Sorry, I think this PR breaks the 1.6 build.

**********************************************************************
File "/home/jenkins/workspace/NewSparkPullRequestBuilder/python/pyspark/ml/tuning.py", line 111, in __main__.CrossValidator
Failed example:
    cvModel.avgMetrics[0]
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib64/python2.6/doctest.py", line 1253, in __run
        compileflags, 1) in test.globs
      File "<doctest __main__.CrossValidator[9]>", line 1, in <module>
        cvModel.avgMetrics[0]
    AttributeError: 'CrossValidatorModel' object has no attribute 'avgMetrics'
**********************************************************************

@srowen
Member

srowen commented Aug 11, 2016

Oh dear. I give up on back-porting things to 1.6. It just breaks too much at this stage! I will revert.

@yhuai
Contributor

yhuai commented Aug 11, 2016

Thanks! It seems #12464 introduced avgMetrics to the CrossValidator model.

@gtolomei

Hi everyone,

I know this thread is closed and the bug in how avgMetrics was previously computed has been fixed. Still, I'm experiencing an odd, related issue, which I'll try to explain below.

Basically, I have set up a CrossValidator with a linear regression pipeline and a grid of hyperparameters to select from. More specifically, I run 5-fold cross-validation over 9 settings resulting from the combinations of two hyperparameters (each taking 3 values), and I keep all 45 resulting models by setting the collectSubModels flag to True:

# Imports assumed for this snippet; the feature stages (indexers, encoder, assembler)
# are defined in code elided here.
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

...

lr = LinearRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=indexers + [encoder] + [assembler] + [lr])

param_grid = ParamGridBuilder()\
        .addGrid(lr.regParam, [0.0, 0.05, 0.1]) \
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
        .build()

cross_val = CrossValidator(estimator=pipeline, 
                           estimatorParamMaps=param_grid,
                           evaluator=RegressionEvaluator(metricName="rmse"),
                           numFolds=5,
                           collectSubModels=True
                           )

# Run cross-validation, and choose the best set of parameters
cv_model = cross_val.fit(train)

return cv_model

Everything seems to run smoothly, except that when I print out the performance (i.e., RMSE) of each model (9 models per fold) and try to "manually" compute the average across folds, the resulting 9 average values do not match the values I get from the avgMetrics property of the CrossValidator model.
Just to give an example, here are the 5 RMSE values I obtained for the first combination of the two hyperparameters (i.e., both set to 0):

*************** Fold #1 ***************
--- Model #1 out of 9 ---
	Parameters: lambda=[0.000]; alpha=[0.000] 
	RMSE: 149354.656

*************** Fold #2 ***************
--- Model #1 out of 9 ---
	Parameters: lambda=[0.000]; alpha=[0.000] 
	RMSE: 146038.521

*************** Fold #3 ***************
--- Model #1 out of 9 ---
	Parameters: lambda=[0.000]; alpha=[0.000] 
	RMSE: 148739.919

*************** Fold #4 ***************
--- Model #1 out of 9 ---
	Parameters: lambda=[0.000]; alpha=[0.000] 
	RMSE: 146816.473

*************** Fold #5 ***************
--- Model #1 out of 9 ---
	Parameters: lambda=[0.000]; alpha=[0.000] 
	RMSE: 149868.621

As you can see, all the values of RMSE are below 150,000.
My expectation was that the average of those values would equal the first element of the avgMetrics list (which supposedly contains, for each hyperparameter combination, the cross-validation metric averaged across the folds).
Instead, running cv_model.avgMetrics gives me:

[150091.7372030353, 150091.7372030353, 150091.7372030353, 150091.7345116686, 150093.66131828527, 150090.52769066638, 150091.7338301999, 150090.52716106002, 150091.59829053417]

There are 9 elements as expected, but none of them looks correct! All of them are above 150,000, even though none of my 45 models (not just the 5 listed above) reaches those figures.

It looks like the way in which avgMetrics is populated is wrong.

I have also inspected the current implementation of CrossValidator's _fit method and, although I haven't spent too much time on it, everything apparently looks fine:

for i in range(nFolds):
    validateLB = i * h
    validateUB = (i + 1) * h
    condition = (df[randCol] >= validateLB) & (df[randCol] < validateUB)
    validation = df.filter(condition).cache()
    train = df.filter(~condition).cache()

    tasks = _parallelFitTasks(est, train, eva, validation, epm, collectSubModelsParam)
    for j, metric, subModel in pool.imap_unordered(lambda f: f(), tasks):
        metrics[j] += (metric / nFolds)
        if collectSubModelsParam:
            subModels[i][j] = subModel

Interestingly enough, the same happens even if I run k-fold cross validation on LogisticRegression using areaUnderROC as the evaluation metric on each fold (for a different task).

Has anyone else experienced the same issue?

Many thanks, any help will be much appreciated!
G.

NOTE: I have blindly assumed the problem (if any) is in the avgMetrics property; however, it might be that those averages are actually correct, and that the individual metrics I printed above by calling .summary.rootMeanSquaredError on each submodel are computed wrongly. Either way, there is a clear inconsistency between the two.
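
For reference, a small sketch (assuming the cross_val and cv_model objects from the snippet above and the standard PySpark 2.4 CrossValidator API) of how each avgMetrics entry lines up with its hyperparameter combination; this is an illustration, not code from the thread:

# Pair each avgMetrics entry with the ParamMap it was computed for.
# The two lists are index-aligned: avgMetrics[k] is the cross-validation average
# for getEstimatorParamMaps()[k].
for params, avg_rmse in zip(cross_val.getEstimatorParamMaps(), cv_model.avgMetrics):
    reg = [v for p, v in params.items() if p.name == "regParam"][0]
    enet = [v for p, v in params.items() if p.name == "elasticNetParam"][0]
    print("regParam={:.2f}, elasticNetParam={:.2f} -> avg RMSE {:.3f}".format(reg, enet, avg_rmse))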

@srowen
Member

srowen commented Apr 23, 2020

That's weird, I also don't see how it can happen. How did you print the metrics above, just by adding logging statements to Spark? Just wondering if it's at all possible they're from something else.

I tried this locally with a simple setup like yours (different data) and the avgMetrics reports the average of the model across folds exactly.

@gtolomei

Thanks for your reply Sean!

My setting is as follows: I'm running PySpark 2.4.5 remotely over Google Colab.
As for how I print out the results, I'm using the following function:

def summarize_all_models(cv_models):
    for k, models in enumerate(cv_models):
        print("*************** Fold #{:d} ***************\n".format(k+1))
        for i, m in enumerate(models):
            print("--- Model #{:d} out of {:d} ---".format(i+1, len(models)))
            print("\tParameters: lambda=[{:.3f}]; alpha=[{:.3f}] ".format(m.stages[-1]._java_obj.getRegParam(), m.stages[-1]._java_obj.getElasticNetParam()))
            print("\tRMSE: {:.3f}\n".format(m.stages[-1].summary.rootMeanSquaredError))
        print("***************************************\n")

I call the above function as follows:

summarize_all_models(cv_model.subModels)

where cv_model is the result returned by the k-fold cross-validation.

Thanks again for your help!

@srowen
Member

srowen commented Apr 23, 2020

Oh, you're printing from the training summary. That's RMSE on the training set. The eval metric is the (average of) RMSE on the held-out folds. The result makes sense.
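
To make the distinction concrete, here is a small sketch; 'model' (one fitted pipeline, e.g. a sub-model) and 'test_df' (a held-out DataFrame with "features" and "label" columns) are hypothetical names that do not appear in the thread:

from pyspark.ml.evaluation import RegressionEvaluator

# Training-set RMSE: what summarize_all_models prints via the training summary.
train_rmse = model.stages[-1].summary.rootMeanSquaredError

# Held-out RMSE: the kind of number CrossValidator's evaluator produces on a
# validation fold, and what avgMetrics averages across folds.
evaluator = RegressionEvaluator(metricName="rmse", labelCol="label", predictionCol="prediction")
heldout_rmse = evaluator.evaluate(model.transform(test_df))

print(train_rmse, heldout_rmse)  # the two can legitimately differ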

@gtolomei

Ohhh... I got it, thanks!
Do you happen to know how to access each individual RMSE computed on the held-out portion of every cross-validation run?

Thanks a lot!

@srowen
Member

srowen commented Apr 23, 2020

I don't see that it is recorded anywhere. I verified the average was right just by hacking the code to print them.
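
For anyone who wants to reproduce that, a hedged sketch of what such a hack could look like: a local, debug-only edit to the fold loop quoted earlier from pyspark/ml/tuning.py, not something the public API offers:

for i in range(nFolds):
    validateLB = i * h
    validateUB = (i + 1) * h
    condition = (df[randCol] >= validateLB) & (df[randCol] < validateUB)
    validation = df.filter(condition).cache()
    train = df.filter(~condition).cache()

    tasks = _parallelFitTasks(est, train, eva, validation, epm, collectSubModelsParam)
    for j, metric, subModel in pool.imap_unordered(lambda f: f(), tasks):
        # Debug-only addition: print the raw per-fold, per-param-map metric
        # before it is folded into the running average.
        print("fold %d, param map %d: metric = %f" % (i, j, metric))
        metrics[j] += (metric / nFolds)
        if collectSubModelsParam:
            subModels[i][j] = subModel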

@gtolomei

Gotcha!
It's a pity there is no such possibility, although I understand that the only thing that ultimately matters is the average computed across the k folds.

Thanks again for your help!
