Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unitxt evaluator #156

Open
wants to merge 14 commits into
base: main
Choose a base branch
from
Open

Conversation

Roni-Friedman
Copy link

@Roni-Friedman Roni-Friedman commented Oct 22, 2024

Adding unitxt evaluator.

To be complemented by adding unitxt as a benchmark in instructlab repo

@mergify mergify bot added the testing Relates to testing label Oct 22, 2024
Copy link
Contributor

mergify bot commented Oct 22, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @Roni-Friedman please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@nathan-weinberg
Copy link
Member

@Mergifyio refresh

Copy link
Contributor

mergify bot commented Oct 22, 2024

refresh

✅ Pull request refreshed

@mergify mergify bot removed the ci-failure label Oct 27, 2024
@Roni-Friedman Roni-Friedman force-pushed the unitxt_eval branch 2 times, most recently from a6d43e7 to c22397b Compare October 27, 2024 09:44
@mergify mergify bot added ci-failure and removed needs-rebase labels Oct 27, 2024
@mergify mergify bot added ci-failure and removed ci-failure labels Oct 28, 2024
@mergify mergify bot removed the ci-failure label Oct 28, 2024
@Roni-Friedman Roni-Friedman marked this pull request as ready for review October 28, 2024 11:36
@Roni-Friedman Roni-Friedman changed the title Unitxt eval Unitxt evaluator Oct 28, 2024
tests/test_unitxt.py Outdated Show resolved Hide resolved
Copy link
Member

@danmcp danmcp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

tests/test_unitxt.py Outdated Show resolved Hide resolved
print("===> Executing 'test_unitxt'...")
try:
model_path = "instructlab/granite-7b-lab"
unitxt_recipe = "card=cards.wnli,template=templates.classification.multi_class.relation.default,max_train_instances=5,loader_limit=20,num_demos=3,demos_pool_size=10"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like too much detail to ask for from the cli in 1 long string. Are all the values things we want users to specify or could some of them be implementation details? For the ones that are, I think we need to break them down into individual params.

Copy link
Author

@Roni-Friedman Roni-Friedman Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users should have the flexibility to specify all details. I can suggest the following, tell me which you prefer:
1.a - have the recipe written down in a file and then in the cli just provide a path
1.b - have the recipe written down in a file, but in a json format (e.g. {card: ..., template: ...}, which is more friendly
2 - add some prefixed parameters such as card and template, but also a freetext parameter, as there are many customization a unitxt user may want to make.
3 - keep it as is :)

src/instructlab/eval/unitxt.py Outdated Show resolved Hide resolved
src/instructlab/eval/unitxt.py Outdated Show resolved Hide resolved
src/instructlab/eval/unitxt.py Show resolved Hide resolved
src/instructlab/eval/unitxt.py Outdated Show resolved Hide resolved
src/instructlab/eval/unitxt.py Outdated Show resolved Hide resolved
src/instructlab/eval/unitxt.py Outdated Show resolved Hide resolved
unitxt_recipe: str,
):
unitxt_task = self.assign_task_name()
tasks_dir = self.assign_tasks_dir(unitxt_task)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this using a local directory? If so, it needs to be built off a param like output_dir with mt_bench.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a temporary directory, deleted at the end of the evaluation process. Would you prefer the user specified an output dir? It does not contain anything of use for the user, just the files required for lm eval to run unitxt.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make sense to use the output_dir so it doesn't confuse the user in a local directory. Also, since you do want to remove at the end, the create/remove logic should probably be in a try/finally block.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so a user would specify an output dir but will find it doesn't exist at the end of the run?
If the user specifies it, I guess I will not delete it, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so a user would specify an output dir but will find it doesn't exist at the end of the run?

The current output dir is a working dir for mt_bench.

If the user specifies it, I guess I will not delete it, right?

I was expecting you would create your directory inside the output_dir and then delete it when you are done. Or you could delete the directory before you start if there is some value in leaving it around.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to put all this into a memory filesystem? In general, it is best to avoid unnecessary disk writes, especially for something that's likely to run in a cloud service where it may or may not have write permissions on some sort of disk.

Copy link
Author

@Roni-Friedman Roni-Friedman Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jwm4 memory filesystem does not seem to work well, as directory is later accessed also by lm-eval inside the mmlu class and I don't want to start passing this filesystem around (unless owners support such an overall change)

Copy link
Author

@Roni-Friedman Roni-Friedman Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make sense to use the output_dir so it doesn't confuse the user in a local directory.

@danmcp So I'm not entirely sure what you mean here. I see mt_bench has output_dir: str = "eval_output",, but this is created only if one calls for mt_bench and does not enter a different output dir. Doing the following, although not sure it makes a lot of sense:

def assign_tasks_dir(self, task_name):
return os.path.join( "eval_output" ,f"{TEMP_DIR_PREFIX}_{task_name}")

Copy link
Member

@danmcp danmcp Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies if my request wasn't clear, my suggestion was like mt_bench:

  • Accept the root dir to use for output as a var
  • Default it to the same root dir as mt_bench

@alimaredia
Copy link
Contributor

@Roni-Friedman Could you explain in the description what benefit the Unitxt evaluator would have? Why would a user run the unitxt evaluator over just using the MMLUBranchEvaluator?

tests/test_unitxt.py Outdated Show resolved Hide resolved
@mergify mergify bot added the ci-failure label Nov 7, 2024
Signed-off-by: Roni Friedman-Melamed <Roni.friedman-melamed@il.ibm.com>
Signed-off-by: Roni Friedman-Melamed <Roni.friedman-melamed@il.ibm.com>
Signed-off-by: Roni Friedman-Melamed <Roni.friedman-melamed@il.ibm.com>
Signed-off-by: Roni Friedman-Melamed <Roni.friedman-melamed@il.ibm.com>
Signed-off-by: Roni Friedman-Melamed <Roni.friedman-melamed@il.ibm.com>
Signed-off-by: Roni Friedman-Melamed <Roni.friedman-melamed@il.ibm.com>
Signed-off-by: Roni Friedman-Melamed <Roni.friedman-melamed@il.ibm.com>
Signed-off-by: Roni Friedman-Melamed <Roni.friedman-melamed@il.ibm.com>
Signed-off-by: Roni Friedman-Melamed <Roni.friedman-melamed@il.ibm.com>
Signed-off-by: Roni Friedman-Melamed <Roni.friedman-melamed@il.ibm.com>
Signed-off-by: Roni Friedman-Melamed <Roni.friedman-melamed@il.ibm.com>
Signed-off-by: Roni Friedman-Melamed <Roni.friedman-melamed@il.ibm.com>
Signed-off-by: Roni Friedman-Melamed <Roni.friedman-melamed@il.ibm.com>
Signed-off-by: Roni Friedman-Melamed <Roni.friedman-melamed@il.ibm.com>
@Roni-Friedman
Copy link
Author

@Roni-Friedman Could you explain in the description what benefit the Unitxt evaluator would have? Why would a user run the unitxt evaluator over just using the MMLUBranchEvaluator?

Let's discuss this in our meeting as well. I initially wrote a generic evaluator that can use all of unitxt features, but now my understanding is that there are two clear use cases and PR should be adjusted accordingly:
1 - run evaluation on user data
2 - run bluebench

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
testing Relates to testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Evaluation of user data using Unitxt
6 participants