Clarify / clean-up Interop 2021 labeling #46

Closed
jensimmons opened this issue Jan 18, 2022 · 4 comments

Labels: meta Process and/or repo issues

jensimmons commented Jan 18, 2022

Interop 2021 now has labels for the tests:

@foolip pulled data (published here: https://gist.github.com/foolip/25c9ed482a0dd802f9bf2eea4544ccac)

| feature | chrome | firefox | safari |
| --- | --- | --- | --- |
| interop-2021-aspect-ratio | 993 | 970 | 964 |
| interop-2021-flexbox | 978 | 989 | 942 |
| interop-2021-grid | 978 | 912 | 963 |
| interop-2021-position-sticky | 1000 | 892 | 1000 |
| interop-2021-transforms | 974 | 961 | 847 |

(Where 993 = 99.3% pass rate, based on weighted calculations.)
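
To illustrate the weighting, here is a minimal sketch of a normalized feature score, assuming each test counts equally regardless of how many subtests it has (the interpretation foolip confirms later in this thread); the input data is made up:

```js
// Minimal sketch of a normalized (weighted) feature score. Each test
// contributes its subtest pass fraction, so a test with many subtests
// counts the same as a single-subtest test. The input data is made up.
const tests = [
  { passes: 10, total: 10 }, // all 10 subtests pass
  { passes: 3,  total: 4  }, // 3 of 4 subtests pass
  { passes: 1,  total: 1  }, // single-subtest test, passing
];
const fractions = tests.map(t => t.passes / t.total);
const mean = fractions.reduce((acc, x) => acc + x, 0) / fractions.length;
// Scale to 0-1000; truncation is assumed here (rounding would differ by
// at most one point).
console.log(Math.floor(mean * 1000)); // 916
```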

Meanwhile, the Compat 2021 dashboard shows the following scores:

| feature | chrome | firefox | safari |
| --- | --- | --- | --- |
| aspect-ratio | 19 | 19 | 19 |
| flexbox | 19 | 19 | 18 |
| grid | 19 | 18 | 19 |
| position-sticky | 20 | 17 | 20 |
| transforms | 19 | 19 | 16 |
| total | 96 | 92 | 92 |

What I don't understand is how the labeled tests, given the pass rates quoted above, translate into the points given. There seems to be another layer of computation that's happening.

For example, let's look at Safari's transforms score. We are scoring 847 out of 1000, which is 84.7%. Yet we have a dashboard score of 16 out of 20, which is the equivalent of 80%. Applying this across all the areas, it seems both Safari and Firefox are being underscored.
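
To spell out the arithmetic behind that gap (assuming each area is worth 20 of the 100 total points, which the later comments confirm):

```js
// Safari's transforms discrepancy, spelled out. Assumes each area is
// worth 20 of the 100 total points, as the later comments confirm.
const passRate = 847 / 1000;  // 84.7% weighted pass rate
const impliedRate = 16 / 20;  // the dashboard's 16 points imply 80%
console.log((passRate - impliedRate).toFixed(3)); // 0.047, i.e. ~5 points lost
```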

Philip, you mentioned in the meeting that perhaps this is happening because the wrong tests are labeled?


foolip commented Jan 19, 2022

@jensimmons thanks for filing the issue. I have taken a look at the number of tests used in each scoring script, and it turns out my suspicion was wrong. The exact same number of tests goes into both scripts:

| feature | tests |
| --- | --- |
| aspect-ratio | 159 |
| css-flexbox | 1051 |
| css-grid | 902 |
| css-transforms | 755 |
| position-sticky | 42 |

I didn't check that the test names are the same, but I now doubt that a test mismatch is the right explanation here. I'll self-assign this and look into whether the scores end up being different even for the exact same runs, which I haven't confirmed yet.

foolip self-assigned this Jan 19, 2022

foolip commented Feb 3, 2022

I have taken a close look at this. The 4 data files that get loaded by https://wpt.fyi/compat2021 are:

The summary files have the same data as the last entry in the unified scores files.

Here's the data from summary-experimental.csv:

| feature | chrome | firefox | safari |
| --- | --- | --- | --- |
| aspect-ratio | 0.9935572940635067 | 0.9702306079664571 | 0.9650135108027422 |
| css-flexbox | 0.9784480375042816 | 0.990451223387353 | 0.9485491920429956 |
| css-grid | 0.9805697246403723 | 0.9127763987836662 | 0.9631824371311336 |
| css-transforms | 0.9811258278145696 | 0.9614199965275193 | 0.8872781623473189 |
| position-sticky | 1 | 0.8928571428571429 | 1 |

That data was from 2022-01-31 (sha 29ce70d915) so that's what I've compared.

Here are the summary numbers I get from the new script, from the same commit:

| feature | chrome | firefox | safari |
| --- | --- | --- | --- |
| interop-2021-aspect-ratio | 993 | 970 | 965 |
| interop-2021-flexbox | 978 | 990 | 948 |
| interop-2021-grid | 980 | 912 | 963 |
| interop-2021-position-sticky | 1000 | 892 | 1000 |
| interop-2021-transforms | 981 | 961 | 887 |
| interop-2022-cascade | 965 | 837 | 777 |
| interop-2022-color | 467 | 521 | 912 |
| interop-2022-contain | 953 | 842 | 885 |
| interop-2022-dialog | 984 | 892 | 902 |
| interop-2022-forms | 767 | 732 | 547 |
| interop-2022-scrolling | 920 | 708 | 790 |
| interop-2022-subgrid | 100 | 953 | 100 |
| interop-2022-text | 677 | 965 | 783 |
| interop-2022-viewport | 166 | 166 | 1000 |
| interop-2022-webcompat | 260 | 957 | 495 |

All the numbers seem to match.
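
To spell out the comparison: the CSV fractions line up with the new script's 0-1000 integers if one assumes the new script floors fraction × 1000; taking the Safari column as an example:

```js
// Safari fractions from summary-experimental.csv mapped to the new
// script's 0-1000 integers, assuming Math.floor(fraction * 1000).
const safari = {
  'aspect-ratio':    0.9650135108027422,
  'css-flexbox':     0.9485491920429956,
  'css-grid':        0.9631824371311336,
  'css-transforms':  0.8872781623473189,
  'position-sticky': 1,
};
for (const [feature, fraction] of Object.entries(safari)) {
  console.log(feature, Math.floor(fraction * 1000));
}
// aspect-ratio 965, css-flexbox 948, css-grid 963,
// css-transforms 887, position-sticky 1000 -- matching the table above
```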

Next I'll look at the total score, explain how it was computed for Compat 2021, and check whether it'll match for Interop 2022.


foolip commented Feb 3, 2022

Here's how the summary score was computed for Compat 2021:

https://github.com/web-platform-tests/wpt.fyi/blob/39242385b97882de45af97f7713bc8e67eff7564/webapp/components/interop-2022.js#L121-L133

I can get the same scores as are shown on the Compat 2021 dashboard like this:

```js
let chrome =  [0.9935572941, 0.9784480375, 0.9805697246, 0.9811258278, 1];
let firefox = [0.970230608,  0.9904512234, 0.9127763988, 0.9614199965, 0.8928571429];
let safari =  [0.9650135108, 0.948549192,  0.9631824371, 0.8872781623, 1];
function sum(list) { return list.reduce((acc, x) => acc + x, 0); }
sum(chrome.map(score => Math.floor(score * 20))); // 96%
sum(firefox.map(score => Math.floor(score * 20))); // 92%
sum(safari.map(score => Math.floor(score * 20))); // 93%
```

So the way this worked: because each area was first scored 0-20, the only way to increase the overall score was to cross a 5% threshold in an individual area. This is something I'm proposing to change.
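
To make the threshold effect concrete (a small illustration, not from the actual dashboard code):

```js
// Under the old 0-20 per-area scoring, improvements within a 5% band
// are invisible: 85.0% and 89.9% both floor to 17 points.
const toAreaScore = fraction => Math.floor(fraction * 20);
console.log(toAreaScore(0.850)); // 17
console.log(toAreaScore(0.899)); // 17
console.log(toAreaScore(0.900)); // 18 -- only crossing 90% moves the score
```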

If we were to score just these 5 areas using the method I'm suggesting, it would be:

```js
let chrome =  [993, 978, 980, 981, 1000];
let firefox = [970, 990, 912, 961, 892];
let safari =  [965, 948, 963, 887, 1000];
function sum(list) { return list.reduce((acc, x) => acc + x, 0); }
Math.floor(sum(chrome) / 5); // 986 i.e. 98.6%
Math.floor(sum(firefox) / 5); // 945 i.e. 94.5%
Math.floor(sum(safari) / 5); // 952 i.e. 95.2%
```

I believe this is better since smaller improvements in the individual area scores get reflected in the overall score. Note that this isn't just due to adding a decimal point (which we can debate) but also because we get rid of the truncation to a 0-20 score for each area.
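
For contrast, the same kind of small improvement is visible under the proposed method. For example, if Safari's transforms score rose from 887 to 895 (a hypothetical delta, chosen to stay below the old 90% threshold):

```js
// Under the proposed 0-1000 averaging, a small per-area gain moves the
// total; the 887 -> 895 transforms delta here is hypothetical.
const sum = list => list.reduce((acc, x) => acc + x, 0);
const before = [965, 948, 963, 887, 1000];
const after  = [965, 948, 963, 895, 1000];
console.log(Math.floor(sum(before) / 5)); // 952
console.log(Math.floor(sum(after) / 5));  // 954
// The old method would show no change: Math.floor(0.895 * 20) is still 17.
```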


foolip commented Feb 3, 2022

One thing remains to be explained here:

> For example, let's look at Safari's transforms score. We are scoring 847 out of 1000, which is 84.7%. Yet we have a dashboard score of 16 out of 20, which is the equivalent of 80%. Applying this across all the areas, it seems both Safari and Firefox are being underscored.

Taking this Safari run as the example, since it's what I used in the previous comments: there are 755 tests. The Safari scores in our metrics are:

- 0.8872781623 as computed for Compat 2021
- 17 for the overall Compat 2021 metric, from Math.floor(0.8872781623 * 20), which is 85%
- 88.7% in the way I'm suggesting we compute metrics for Interop 2022

So this is mainly explained by that truncation to a 0-20 score per area.

Unfortunately, it's hard to read that 88.7% from wpt.fyi, because wpt.fyi can't show normalized scores, where each test counts the same regardless of number of subtests. This is a feature request in web-platform-tests/wpt.fyi#2290 and I think we might get to it this year, but not before the launch of Interop 2022. It is possible to verify the score with careful counting of tests and subtests, however.
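
For anyone attempting that verification, a sketch of the counting, assuming per-test subtest results have been collected somehow (the input shape is made up, not a wpt.fyi API):

```js
// Sketch of verifying a normalized score by counting tests and subtests.
// Each test contributes its subtest pass fraction, so every test counts
// the same regardless of its number of subtests. The input shape is made
// up, not an actual wpt.fyi API.
function normalizedScore(tests) {
  const fractions = tests.map(t => t.subtestsPassed / t.subtestsTotal);
  return fractions.reduce((acc, x) => acc + x, 0) / fractions.length;
}
// Fed the 755 css-transforms tests from the Safari run above, this should
// come out at ~0.8872781623.
```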

foolip closed this as completed Feb 10, 2022
foolip removed the proposal label Apr 1, 2022
gsnedders added the meta Process and/or repo issues label Sep 16, 2022