SODA Dataset for Training #35

Merged
merged 17 commits into from
Jul 8, 2024

farzadab
Contributor

No description provided.

ultravox/tools/infer_tool.py (outdated, resolved)
ultravox/tools/infer_api.py (outdated, resolved)
ultravox/evaluation/eval_types.py (outdated, resolved)
ultravox/evaluation/gpt_eval.py (outdated, resolved)
farzadab force-pushed the farzad-soda-train branch 2 times, most recently from 6248f64 to 5ba015c, on June 25, 2024 21:23
farzadab marked this pull request as ready for review on June 25, 2024 21:25
@farzadab
Contributor Author

The PR is ready.

ultravox/evaluation/gpt_eval_boolq.py (outdated, resolved)
roles = ["user", "assistant"] if len(turns) % 2 == 0 else ["assistant", "user"]

num_prompts = min(self._args.num_prompts, len(self.SYS_PROMPTS))
sys_prompt = self.SYS_PROMPTS[idx % num_prompts]
Contributor

Did we end up using an RNG for this sort of thing rather than the index?

Contributor

(I forget where but we discussed adding a private RNG to datasets to allow them to simply pull a value from the RNG rather than using the index counter and various moduli)

Contributor Author

Yes, the idea was that we do that in the next PR.

Contributor Author

Might as well just do it now I guess since I have the code.

Contributor Author

Done.
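
For readers following this thread, here is a minimal sketch of the two approaches being compared: picking a system prompt from the index counter versus drawing it from an RNG owned by the dataset. This is not the code that was merged. The class name `PromptSampler`, the `seed` parameter, the method names, and the example prompts are all hypothetical; only `SYS_PROMPTS`, `num_prompts`, and the `idx % num_prompts` expression come from the snippet under review. The sketch assumes `num_prompts` is at least 1.

```python
import random


class PromptSampler:
    """Hypothetical stand-in for a dataset that owns a private RNG."""

    # Placeholder prompts for illustration only.
    SYS_PROMPTS = [
        "You are a helpful assistant.",
        "Answer briefly.",
        "Respond in a friendly tone.",
    ]

    def __init__(self, num_prompts: int, seed: int = 42) -> None:
        # Cap the pool the same way the reviewed snippet does.
        self._num_prompts = min(num_prompts, len(self.SYS_PROMPTS))
        # A private random.Random instance keeps draws reproducible and
        # independent of the global random state.
        self._rng = random.Random(seed)

    def by_index(self, idx: int) -> str:
        # Index-based selection: deterministic, but tied to the sample counter.
        return self.SYS_PROMPTS[idx % self._num_prompts]

    def by_rng(self) -> str:
        # RNG-based selection: no index bookkeeping or moduli needed.
        return self.SYS_PROMPTS[self._rng.randrange(self._num_prompts)]
```

Two samplers built with the same seed produce the same sequence of prompts, which keeps runs reproducible without threading the sample index through every choice.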

Comment on lines +265 to +268
for column_name in self.BASE_AUDIO_COLUMNS:
    dataset = dataset.cast_column(
        column_name, datasets.Audio(sampling_rate=SAMPLE_RATE)
    )
Contributor Author

Bugfix for datasets whose audio column is not named "audio".

Contributor Author

This was not an issue for SODA, since it was constructed at 16 kHz, but it was sloppy of me.
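
A rough sketch of the pattern behind this fix, for anyone reading along: loop over every configured audio column and cast each one, rather than assuming a single column named "audio". The helper `cast_audio_columns` and the `FakeDataset` stand-in are hypothetical and exist only so the sketch runs without extra dependencies. With Hugging Face `datasets`, the feature passed in would be `datasets.Audio(sampling_rate=SAMPLE_RATE)`, as in the diff above. The value of `SAMPLE_RATE` is assumed from the 16 kHz mentioned in this thread.

```python
SAMPLE_RATE = 16_000  # assumed target rate, based on the 16 kHz mentioned for SODA


class FakeDataset:
    """Stand-in that records cast_column calls; not the real datasets.Dataset."""

    def __init__(self, casts=None):
        self.casts = dict(casts or {})

    def cast_column(self, column_name, feature):
        # Like datasets.Dataset.cast_column, return a new object
        # instead of mutating the existing one.
        return FakeDataset({**self.casts, column_name: feature})


def cast_audio_columns(dataset, audio_columns, audio_feature):
    """Cast every listed audio column, not only one literally named 'audio'."""
    for column_name in audio_columns:
        dataset = dataset.cast_column(column_name, audio_feature)
    return dataset
```

Because `cast_column` returns a new dataset, the loop has to reassign `dataset` on each iteration, which is what the diff does.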

farzadab merged commit e607220 into main on Jul 8, 2024
1 check passed
farzadab deleted the farzad-soda-train branch on July 8, 2024 18:16