Skip to content

Multivariate r2 #1310

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Mar 20, 2020
Merged

Conversation

harahu
Copy link
Contributor

@harahu harahu commented Mar 14, 2020

I noticed that the RSquared metric does not behave in the same way as the scikit-learn implementation of the same metric for multivariate regression (in scikit-learn known as "multioutput"). First I thought I should mend it to align with the scikit-learn implementation that calculates separate scores for each label element and aggregates the scores. Then I started asking myself whether there might be other ways of interpreting R2 in higher dimensions. Looking at the definition presented in https://en.wikipedia.org/wiki/Coefficient_of_determination, it does look like it might make sense to extend it to higher rank data types, but this can be done in different ways, and I am unsure what would make sense mathematically. The scikit-learn approach at least makes some sense to me as it just treats the model as a model for n separate dependent variables at the same time.

I would really like a discussion on how R2 should be understood in higher dimensions, but if we are to go with the scikit-learn interpretation, this PR is one way to go about it. It is slightly annoying that you have to specify the shape of y when initializing the metric object. This could be solved by dynamically setting the shape like it is done in the MeanTensor metric. Although more user friendly, I don't think the code is as clean, so there is a trade-off.

I modified a test to show how this acts for higher dimensional data, currently matching the multioutput="raw_values" flavour of the scikit-learn implementation.

@bot-of-gabrieldemarmiesse

@SSaishruthi

You are owner of some files modified in this pull request.
Would you kindly review the changes whenever you have the time to?
Thank you very much.

@harahu
Copy link
Contributor Author

harahu commented Mar 14, 2020

This PR is incomplete, though. It is really just an idea proposal, so please don't merge before we work out the details.

@gabrieldemarmiesse
Copy link
Member

Thanks a lot @harahu for this pull request. Mutiple things come to mind:

  • I'll add WIP to the title of your pull request to prevent merging
  • I'm no expert in RSquared, but if we decide to go against what was decided for scikit-learn, we need a very strong reason. Often consistency if better than a small design improvement (priciple of least astonishment), especially concerning numerical libraries, where behavior changes go unoticed.
  • Concerning specifying the shape of y, I am faily certain that Metric, like Layer, will support a build method at some point. There is no technical reason why it wouldn't be possible, it's a better UX, and it's been shown to be needed. The question is what do we do in the meantime. We have two possibilities as you stated, either, a better UX, but using a private TF API, either a worse UX, but we don't use a private API. I'm in favor of your current implementation. It's a worse UX for a small number of users, but in the end it's temporary. On our side, it makes our life a lot easier. Using private APIs means that the behavior can change silently, which is really bad and can break the CI or worse, user code without us knowing about it. In the end, we can always deprecate the y_shape when the build method is officially supported, with a warning and everything, so changing the API is not an issue.

@gabrieldemarmiesse gabrieldemarmiesse changed the title Proposal for multivariate r2 [WIP] Proposal for multivariate r2 Mar 14, 2020
@harahu
Copy link
Contributor Author

harahu commented Mar 14, 2020

Thank you for your thoughts on this @gabrieldemarmiesse! I definitely agree that there should be a high threshold for breaking with the 'scikit-learn' behaviour.

I did ask a question regarding this on Cross Validated, hoping to get some feedback. I really am no expert myself.

@harahu
Copy link
Contributor Author

harahu commented Mar 18, 2020

Did not get much response on Cross Validated, but I did discuss this with my colleague @ttvand . The only generalization of R2 we felt comfortable claiming made sense was where you calculate individual scores per label element. This corresponds with the multioutput="raw_values" flavour in in scikit-learn. Although I would like that as the default setting in a utopia, I did implement the other supported options from scikit-learn, and used the same default "uniform_average" to maintain as similar behaviour as possible. Also tested that behaviour remains in line with scikit-learn for these different options.

Any further feedback on this, @gabrieldemarmiesse ?

@harahu harahu changed the title [WIP] Proposal for multivariate r2 Multivariate r2 Mar 18, 2020
@gabrieldemarmiesse
Copy link
Member

I believe you made the right call here. We don't need to think about it too much if we just copy what scikit learn does and make tests to ensure the results are the same for the three values. Users will pick whatever they want from what is available. I think that's perfect and it also has the nice property that we (I) don't need to check the math to merge since scikit-learn does it for us (the review is going to be much faster). Give me some time to review on the technical side of things.

Thanks again for all your work!

Copy link
Member

@gabrieldemarmiesse gabrieldemarmiesse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything is good. Just some style remarks to adress to try to improve readability and we're good to merge this!

Copy link
Member

@gabrieldemarmiesse gabrieldemarmiesse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, thanks a lot!

@gabrieldemarmiesse gabrieldemarmiesse merged commit 3e562ad into tensorflow:master Mar 20, 2020
jrruijli pushed a commit to jrruijli/addons that referenced this pull request Dec 23, 2020
* Proposal for multivariate r2

* More flavours of multioutput

* Improve an unfinished exception message

* Simplifications to increase readability

* Pass some arguments that should be passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants