
Add @EXPECTED_RESULTS@ tag. #4


Closed
wants to merge 2 commits

Conversation

RagnarGrootKoerkamp
Collaborator

As discussed.

Suggestions on wording are welcome. I think the directory name should be one of:

  • mixed_results
  • mixed_verdicts
  • multiple_results (although only one would still be OK)
  • multiple_verdicts

@simonlindholm
Member

Could be worth coming up with a syntax that naturally extends to multiple test groups. We have an unofficial tool, https://github.com/nordicolympiad/testdata_tools/pull/13/files, that parses @EXPECTED_GRADES@ AC AC TLE TLE as "expect groups 1 and 2 to pass, 3 and 4 to TLE"; having similar syntax with different semantics is a bit confusing. (Though it probably won't lead to errors in practice, since they are used in different contexts.)

@ALLOWED_VERDICTS@ is another possible bikeshed color, not sure if a good one.
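For illustration, here is a minimal sketch of how a tool might parse such an @EXPECTED_GRADES@ line into per-group expectations. This is not the actual testdata_tools code, and the exact tag regex is an assumption:

import re

# Hypothetical parser: maps 1-based test group numbers to expected verdicts.
# Assumes the tag appears on one line, followed by space-separated verdicts.
def parse_expected_grades(source):
    match = re.search(r'@EXPECTED_GRADES@\s+([A-Z ]+)', source)
    if match is None:
        return None
    return dict(enumerate(match.group(1).split(), start=1))

# parse_expected_grades('// @EXPECTED_GRADES@ AC AC TLE TLE')
# -> {1: 'AC', 2: 'AC', 3: 'TLE', 4: 'TLE'}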

@eldering
Collaborator

> @ALLOWED_VERDICTS@ is another possible bikeshed color, not sure if a good one.

Unless there's a good reason to change, I'd prefer to stick to this name as it is already in use in DOMjudge and also by some problem setters.

@RagnarGrootKoerkamp
Collaborator Author

We should use the shorthand AC/WA/RTE/TLE.
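For reference, these abbreviations correspond to the submission subdirectories the format already defines; a plain lookup table, shown here only for clarity:

# Short verdict names and the submissions/ subdirectories they abbreviate.
VERDICT_DIRS = {
    'AC': 'accepted',
    'WA': 'wrong_answer',
    'RTE': 'run_time_error',
    'TLE': 'time_limit_exceeded',
}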

@RagnarGrootKoerkamp
Collaborator Author

drop mixed result

This tag implies that the submission may get any of the listed verdicts as the final verdict.

If `@EXPECTED_RESULTS@: ` is found in a submission in any of the other
@eldering
Collaborator

This doesn't match the @EXPECTED_VERDICTS@ above.
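For concreteness, here is a minimal sketch of the semantics the commit message describes, assuming the DOMjudge-style comma-separated tag format; the function names are made up:

import re

# Extract the comma-separated verdict list following "@EXPECTED_RESULTS@:",
# or return None if the submission does not carry the tag.
def expected_results(source):
    match = re.search(r'@EXPECTED_RESULTS@:\s*(.+)', source)
    if match is None:
        return None
    return [verdict.strip() for verdict in match.group(1).split(',')]

# The submission may get any of the listed verdicts as the final verdict.
def final_verdict_allowed(final_verdict, source):
    expected = expected_results(source)
    return expected is not None and final_verdict in expected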

@jsannemo
Contributor

jsannemo commented Sep 7, 2022

@RagnarGrootKoerkamp I think we agreed on calling the mixed-bag-directory rejected, but I would still like to rescue the EXPECTED_RESULTS for those submissions!

Mind bumping this PR (and fixing @eldering's comment), and perhaps @niemela can take a look?

@RagnarGrootKoerkamp
Collaborator Author

So the semantics would be:

  • @EXPECTED_RESULTS@ is only allowed in rejected/. (Or anywhere, for backwards compatibility?)
  • rejected submissions optionally include @EXPECTED_RESULTS@. If the tag is not present, any non-AC verdict is OK.

What about submissions that are WA or AC based on randomness? Should they be in accepted or rejected?
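A small hypothetical sketch of that rule (directory plus optional tag; the names are made up):

# Verdict check for a submission in rejected/, given the parsed
# @EXPECTED_RESULTS@ list (or None if the tag is absent).
def rejected_ok(final_verdict, expected_results):
    if expected_results is None:
        # No tag: any non-AC verdict is OK.
        return final_verdict != 'AC'
    return final_verdict in expected_results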

@niemela
Member

niemela commented Sep 7, 2022

> What about submissions that are WA or AC based on randomness?

When are they useful?

> Should they be in accepted or rejected?

Feels like that should be `is_submission` or `ignore` or something?

@jsannemo
Contributor

jsannemo commented Sep 7, 2022

I would just dump such a submission in submissions/. I don't immediately see that it provides a statement about the problem that a problem verifier needs to check? If it's intended as a non-deterministic AC solution that you want to compute time limits with, fix a seed and make it accepted? :)

@jsannemo
Contributor

jsannemo commented Sep 7, 2022

Replying to your original question: I agree that those are the semantics we want. I'd say to only allow it in rejected/. Submissions allowing multiple results that sit in one of the other subdirectories (except time_limit_exceeded, which also allows WA) already break the spec, so it's not really breaking backwards compatibility to break them, except if they were added as a tautology (and fixing them is easy, since the verifier will say what's wrong).

@RagnarGrootKoerkamp
Collaborator Author

RagnarGrootKoerkamp commented Sep 7, 2022

Regarding having ACCEPTED as part of the EXPECTED_RESULTS: I have 23 submissions in the last 3 years of BAPC that do this, spread over 12 problems.

Reasons for doing this:

  • We have problems with guaranteed random input. Sometimes you have a solution that happens to pass all 100 test cases with, say, 50% probability (and gets WA on the other half). This is unfortunate, but it does not prevent us from wanting to write and test such a solution.
  • We have submissions that RTE/TLE/AC depending on whether PyPy or CPython is used, because of recursion-limit issues.
  • We have unintended 'bruteforce' submissions that are only included to test the correctness of the answers, and that may sometimes happen to pass the tests anyway. These should not be used to set the time limit (e.g. using 2× the slowest submission), so they are moved to another directory.

It would be nice to officially support this. That's as easy as allowing ACCEPTED as a possible verdict in the rejected directory.

@thorehusfeldt
Contributor

thorehusfeldt commented May 6, 2023

I am late to this conversation, and much less experienced than many of you.

(I have, however, given this topic some thought, and used my own script at analyzetestgroups.py quite a bit. It focuses on @EXPECTED_GRADES@ for problems with test groups, and I've found it incredibly useful, in particular when co-developing with less experienced setters. Still, there are many annoying issues with this.)

Here’s an idea for a very different approach. It

  1. avoids cluttering source code with what is ultimately problem development information
  2. allows rich specification of expected behaviour
  3. is easily accessed and parsed by tools
  4. allows changes to test group structure and expected submission behaviour during development (such as test group 2 being removed)
  5. is decoupled from other specifications

The idea is that the `submissions` directory can contain an `expected_verdicts.yaml` that enumerates (some) submissions and specifies behaviour consistent with the layout of `data/*`, much like `generators.yaml`.

I want to be able to specify the allowed verdicts per submission and per named test group. (The allowed verdicts can be given as a list, such as TLE, RTE.) The tree structure of data allows inheriting verdicts downwards, so I can quickly specify "everything should get AC", but I could also specify "should get AC, except for the huge instances in data/huge, where it should TLE or RTE, depending on the recursion limit".

If done right, this allows arbitrarily fine-grained specification (say, specifying that this submission is guaranteed to fail on sample 3, but gets AC for the rest of the sample group).

This would work (and be useful) independently of whether "we use test groups internally to organise test cases logically during development for pass/fail problems" or "we use test groups for graded problems with scoring and expose them to the solver."

Here is an example, using fictional (and not completely thought-through) syntax:

submissions:
  accepted/th.py:
    sample: AC  # redundant, because AC is the default
    secret: AC  # redundant
  partially_accepted/quadratic_time.py:
    secret:
      huge: [TLE, RTE]  # times out on huge test cases in data/secret/huge
      '*': AC           # may be redundant as well
  partially_accepted/simple_graph.py:
    sample:
      '3': WA  # whereas '1' and '2' are implicitly AC
    secret:
      non-simple: WA
      with-loops: WA
      '*': AC  # redundant
  partially_accepted/greedy.py:  # typical submission in a 4-testgroup scoring/grading problem
    secret:
      group1: AC
      group2: AC
      group3: WA
      group4:
        - TLE
        - RTE
For scoring problems with grade `AC <integer>`, one could specify the expected range of scores, also on a per-test-group basis.
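To make the inheritance concrete, here is a rough sketch (hypothetical, matching only the fictional syntax above) of how a tool could resolve the verdicts allowed for one test case by walking down the data tree, with '*' as the fallback at each level and AC as the overall default:

import yaml  # PyYAML, assumed available

def allowed_verdicts(spec, path):
    # spec: one submission's entry from expected_verdicts.yaml
    # path: the test case's location under data/, e.g. ['secret', 'huge', '042']
    node, result = spec, ['AC']  # default: everything should get AC
    for part in path:
        if not isinstance(node, dict):
            break
        child = node.get(part, node.get('*'))
        if child is None:
            break
        node = child
        if not isinstance(node, dict):
            result = node if isinstance(node, list) else [node]
    return result

# spec = yaml.safe_load(open('submissions/expected_verdicts.yaml'))['submissions']
# allowed_verdicts(spec['partially_accepted/quadratic_time.py'], ['secret', 'huge', '042'])
# -> ['TLE', 'RTE']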

@niemela
Member

niemela commented Jul 22, 2023

Will close this. @thorehusfeldt is working on an updated suggestion.
