Scoring: variants and multi-globals #256
Comments
I agree this isn't great, and it also causes a bit of inflation in general for multi-global tests on wpt.fyi. There I've often thought that we should use the manifest and group these tests under the filename somehow, perhaps going so far as to call the filename "the test" and treat all of the variants as subtests. But that'd be a lot of work. For the problem at hand, we could just not label the worker variants and reduce the size of the problem, but that isn't a very reusable approach...
I guess the challenge is that our current implementation of the scoring is JS, versus the rest of the WPT infra (including all the manifest stuff) being Python… hmm. must… not… rewrite… this… while… on… holiday…
I'm not aware of any hurdles we'll run into fetching and using the manifest from JS. The main issue is that it would be very slow. Storing all manifests in a tree-deduplicating setup more like https://github.com/web-platform-tests/results-analysis-cache would make it faster.
This came up again in #281. We have cases (URL) where we want to include some variants but not others, so that rules out the "clean" approach of labeling file names and using the manifest to figure out which test names to include, while treating it as one test, scored 0-1. The more complex solution, then, is: label test names, but use the manifest to figure out which tests are defined in the same file. Treat those as a group and score them 0-1.
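The grouping described above can be sketched roughly as follows. This is not the actual wpt.fyi or results-analysis code; `manifest`, `testScores`, and `scoreByFile` are all hypothetical names, and the manifest here is simplified to a flat map from test ID to the source file that defines it.

```javascript
// Hypothetical manifest: test ID (variant / multi-global scope) -> source file.
const manifest = new Map([
  ['/webcodecs/videoDecoder-codec-specific.https.any.html?av1', 'videoDecoder-codec-specific.https.any.js'],
  ['/webcodecs/videoDecoder-codec-specific.https.any.worker.html?av1', 'videoDecoder-codec-specific.https.any.js'],
  ['/webcodecs/videoFrame.any.html', 'videoFrame.any.js'],
]);

// Per-test pass fractions (passing subtests / total subtests), each 0-1.
const testScores = new Map([
  ['/webcodecs/videoDecoder-codec-specific.https.any.html?av1', 1.0],
  ['/webcodecs/videoDecoder-codec-specific.https.any.worker.html?av1', 0.5],
  ['/webcodecs/videoFrame.any.html', 1.0],
]);

// Group labeled tests by the file that defines them, then score each file
// 0-1 as the mean of its tests' scores, so a file with many variants
// counts once overall rather than once per variant.
function scoreByFile(testScores, manifest) {
  const groups = new Map();
  for (const [test, score] of testScores) {
    const file = manifest.get(test);
    if (!groups.has(file)) groups.set(file, []);
    groups.get(file).push(score);
  }
  const fileScores = new Map();
  for (const [file, scores] of groups) {
    fileScores.set(file, scores.reduce((a, b) => a + b, 0) / scores.length);
  }
  return fileScores;
}

const fileScores = scoreByFile(testScores, manifest);
// videoDecoder-codec-specific.https.any.js -> (1.0 + 0.5) / 2 = 0.75
```

The key property is that the two labeled variants of `videoDecoder-codec-specific.https.any.js` contribute one group score (0.75) instead of two independent test scores.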
So, the previous situation was that each variant is its own top-level test for the purposes of scoring, and the proposal is that we define things based on the file rather than on the test id? FWIW I don't feel especially strongly either way; I think "the score just doesn't quite match reality" is an inevitable feature of the setup, and it's also possible to have a case where one file containing many subtests exercises a lot of the feature whereas a few tests that were moved to separate top-level files only cover edge cases, but end up dominating the scoring. But if people feel that de facto today it's a better tradeoff to treat variants as a single test, I think it's reasonable to change.
Summary from the notes:
So yes, it would be based on the file, but importantly we need to handle the case where we've only labeled some of the variants or multi-global tests. To be robust we need to use the manifest, so this isn't trivial to implement. I also think it would be very good if we could do the same grouping on wpt.fyi, otherwise we can't make the interop score view match. |
We've discussed this in a meeting. We have a pretty good idea of what we'd change to address this, but nobody assigned to do the work. |
One suggestion for webcodecs was to use all the tests which match the search `video`: https://wpt.fyi/results/webcodecs?label=master&label=experimental&aligned&view=subtest&q=video

However, due to extensive use of multi-global tests and variants this ends up not working particularly well. Most obviously, webcodecs/videoDecoder-codec-specific.https.any.js ends up contributing 20% to the overall score, as the one file contributing the following ten tests (out of a total of 48 tests):
This isn't the first time we've had problems like this with our scoring, but I think this is a much more extreme case than we've had otherwise.
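A quick calculation makes the inflation concrete. The per-file figure below assumes, purely for illustration, that the remaining 38 tests each come from distinct files:

```javascript
// When every variant counts as its own top-level test, the ten variants
// from one file carry 10/48 of the total weight.
const totalTests = 48;
const variantsFromOneFile = 10;
const perVariantWeighting = variantsFromOneFile / totalTests; // ≈ 0.208, i.e. ~20%

// Under per-file grouping the same file would be one unit out of
// 48 - 10 + 1 = 39 (assuming the other 38 tests are in distinct files).
const groupedWeighting = 1 / (totalTests - variantsFromOneFile + 1); // ≈ 0.026
```

So grouping by file would take this one file from roughly a fifth of the score down to under 3%.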