trec_eval options #57

isoboroff · 2024-09-26T20:46:30Z

Consider adding some options from trec_eval. The one that drove me to post this issue was -M, so we could specify the maximum depth of a run. (Without this, runs can return >1000 docs and get better recall.)
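To make the recall concern concrete, here is a minimal sketch with made-up data. The `recall` helper and the document IDs are hypothetical and are not part of ir_measures or trec_eval. It shows how a run that returns 2000 documents can score higher recall than the same run truncated to 1000.

```python
from typing import List, Optional, Set


def recall(ranked_doc_ids: List[str], relevant: Set[str], depth: Optional[int] = None) -> float:
    """Fraction of relevant docs found in the top `depth` results (all results if None)."""
    retrieved = set(ranked_doc_ids[:depth])
    return len(retrieved & relevant) / len(relevant)


# A toy run of 2000 documents, already in rank order: d0, d1, ..., d1999.
ranked = [f"d{i}" for i in range(2000)]
# Two relevant documents, one of them ranked below position 1000.
relevant = {"d1", "d1500"}

recall_capped = recall(ranked, relevant, depth=1000)  # only d1 is within the cap
recall_uncapped = recall(ranked, relevant)            # both d1 and d1500 are found
```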

A case might be made for -c and -J, but I think those are better implemented in specific measures.

No patch yet. The plan is to have read_trec_run take the full args object, and then we can count docs in the generator and know when to stop. With this implementation we are agnostic to providers.

cmacdonald · 2024-09-26T20:54:05Z

Three thoughts:

1. Add an optional kwarg, depth: Optional[int] = None, to read_trec_run() rather than passing an args object around. It's cleaner to keep the public API free of args objects that no client code can possibly understand.
2. Classical trec_eval performed the cutoff after sorting, right?
3. But should Recall always be defined by an @ cutoff?

isoboroff · 2024-09-26T21:13:45Z

You're right, trec_eval only applies -M when making res_rels. And yes, if people always give a cutoff for recall then the problem there vanishes. But they won't and I don't think trec_eval assumes a cutoff aside from the standard measures.

But what I'm trying to get rid of is specifying the cutoff separately for every measure.
ir_measures -q \
  ${QRELS} ${RUNS}/$runtag/$runtag \
  'nDCG(cutoff=1000)@20' \
  'P(cutoff=1000)@5' \
  'AP(cutoff=1000)' \
  'P(rel=3,cutoff=1000)@5' \
  'AP(rel=3,cutoff=1000)'

isoboroff · 2024-09-26T21:15:01Z

btw the docs don't show how to send multiple options to a measure, and the above doesn't work. About to code dive.

cmacdonald · 2024-09-26T21:21:19Z

> btw the docs don't show how to send multiple options to a measure, and the above doesn't work. About to code dive.

It's clear that measures do support multiple options:
https://github.com/terrierteam/ir_measures/blob/main/ir_measures/measures/accuracy.py#L14-L16

I think they are indeed just kwargs to functions, so it should work.

> But what I'm trying to get rid of is specifying the cutoff separately for every measure.

Agreed, I support an -M option, but this is mainly @seanmacavaney's shindig.

> then we can count docs in the generator

We'd need to sort the lists before applying the cutoff.

There might be a neat impl using https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.consecutive_groups or
https://more-itertools.readthedocs.io/en/stable/api.html#more_itertools.map_reduce
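The point about sorting before cutting can be sketched in plain Python. This is only an illustration under stated assumptions: `RunEntry`, `parse_trec_run`, and `truncate_run` are hypothetical names, not the actual ir_measures `read_trec_run` implementation. Ties are left in input order, which may not match trec_eval's tie-breaking.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, NamedTuple, Optional


class RunEntry(NamedTuple):
    """Hypothetical record for one line of a TREC run file."""
    query_id: str
    doc_id: str
    score: float


def parse_trec_run(lines: Iterable[str]) -> Iterator[RunEntry]:
    """Parse 'qid Q0 docno rank score tag' lines; the rank column is ignored."""
    for line in lines:
        if not line.strip():
            continue
        query_id, _q0, doc_id, _rank, score, _tag = line.split()
        yield RunEntry(query_id, doc_id, float(score))


def truncate_run(run: Iterable[RunEntry], depth: Optional[int] = None) -> Iterator[RunEntry]:
    """Yield at most `depth` top-scoring entries per query (all entries if depth is None)."""
    by_query: Dict[str, List[RunEntry]] = defaultdict(list)
    for entry in run:
        by_query[entry.query_id].append(entry)
    for entries in by_query.values():
        # Sort by descending score first, so the cutoff does not depend on file order.
        entries.sort(key=lambda e: e.score, reverse=True)
        yield from entries[:depth]


lines = [
    "q1 Q0 d1 1 0.5 demo",
    "q1 Q0 d2 2 0.9 demo",
    "q1 Q0 d3 3 0.7 demo",
    "q2 Q0 d4 1 0.1 demo",
]
top2 = list(truncate_run(parse_trec_run(lines), depth=2))
```

Note that this buffers each query's entries in memory. Simply counting lines in the generator, without sorting, would only be equivalent if the run file were already sorted by score within each query.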

isoboroff · 2024-09-26T21:22:06Z

Yup I was wrong about multiple measure args sorry.

isoboroff · 2024-09-26T21:22:27Z

Isn't it like quitting time in Scotland or something?

seanmacavaney · 2024-09-27T09:30:20Z

> But what I'm trying to get rid of is specifying the cutoff separately for every measure.

I feel your pain here @isoboroff! It's indeed tedious to specify the same cutoff setting across all measures.

One of the project's goals is to ensure that all settings that affect a measure's calculation are defined by the measure specification and the underlying provider. (Barring things like floating-point behavior that can differ across CPU architectures, etc.) This goal ensures that the measure specification and the provider alone are enough to give the same results for the same input. It also gets away from needing to provide instructions like "run the evaluation tool twice, once with these settings, once with these other ones" in some cases.

OTOH, I see how it's pretty annoying to repeat this same default cutoff across a variety of measures. I'd like to think this through a bit more, but I'm potentially open to an -M-like option that sets a default cutoff for all measures where a cutoff is supported and not already provided. Note that this won't function exactly the same as trec_eval's -M, since it won't cut off at the data input level, won't affect measures without cutoff settings (like the set measures), etc. When presenting the measures, it would unambiguously show the full specification (satisfying the project goal), while reducing the burden when calling the command. For example:

$ ir_measures -q ${QRELS} ${RUNS}/$runtag/$runtag -M 1000 'nDCG' 'nDCG@20' 'P@5' 'AP' 'P(rel=3)' 'AP(rel=3)'
nDCG@1000 #.####
nDCG@20 #.####
P@5 #.####
AP@1000 #.####
P(rel=3)@1000 #.####
AP(rel=3)@1000 #.####

Would this be helpful?
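The defaulting rule described above could look roughly like the sketch below. It works on measure strings only and is an illustration, not how ir_measures parses or represents measures. `CUTOFF_MEASURES` and `with_default_cutoff` are hypothetical names, and the set of cutoff-capable measures is an assumption.

```python
from typing import FrozenSet, List

# Assumption: names of measures that accept a cutoff. Set measures are left out on purpose.
CUTOFF_MEASURES: FrozenSet[str] = frozenset({"nDCG", "P", "AP", "R", "RR"})


def with_default_cutoff(spec: str, default: int) -> str:
    """Append '@default' when the measure supports a cutoff and none was given."""
    if "@" in spec or "cutoff=" in spec:
        return spec  # an explicit cutoff always wins over the default
    name = spec.split("(", 1)[0]
    if name not in CUTOFF_MEASURES:
        return spec  # e.g. set measures have no cutoff to fill in
    return f"{spec}@{default}"


specs: List[str] = ["nDCG", "nDCG@20", "P@5", "AP", "P(rel=3)", "AP(rel=3)", "SetP"]
resolved = [with_default_cutoff(s, 1000) for s in specs]
```

Printing the resolved specifications next to each score would keep the output unambiguous even though the command line was shortened.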

> A case might be made for -c

We've made -c the default behavior across implementations. Our reasoning is here: https://ir-measur.es/en/latest/getting-started.html#empty-set-behaviour

> and -J

Already handled with the judged_only=True measure setting :) #44
