Add Diar-az #39
Conversation
I think this should fall into "Other software" instead of "Diarization dataset". This is not a new dataset; it's just a format conversion tool, is that correct?
It's a tool specifically for the RÚV-DI dataset.
If so, we should add ruv as a dataset, and this repo as "Other Software".
The dataset was never published, only the resulting models. Also, yes, that dataset should be added, but it was lost in a cybersecurity attack on Reykjavik University's servers in January 2024. If you want, you could put a placeholder text for the RÚV-DI dataset here in this repo, and we could try to recreate the dataset. We have a license that lists all the shows and episodes contained within the dataset, so we could recreate it from that.
Other software works in my opinion.
Yes, I think Other software works and may be a better fit, as it's not really a dataset; rather, it was a tool to support the RÚV-DI dataset. To correct this, should this pull request just be updated, or should a new one be created?
I'm OK either way.
Fixed, added to Other software.
@afk0901 I believe you also need to put the placeholder text for the dataset for this PR to be properly closed. In terms of recreating the dataset, I believe it's actually best if @wq2012 recreates the dataset with Daan and Pet of Google, and @afk0901 finishes our writeup of this dataset creation. When we are both done, we compare notes on arXiv and write the dataset paper together for Interspeech, ICASSP, SAND 2025, or WAND in October.
For continuity and clarity, I believe it's best if my second paragraph is dealt with separately, not in this PR. Thus I have created a new issue for it within this repo.
I didn't see the change. |
Force-pushed from f32caa9 to 8d1a453
README.md
@@ -295,6 +296,7 @@ Team in the Inaugural DIHARD Challenge](https://www.isca-speech.org/archive/pdfs
| [VoxConverse](https://github.com/joonson/voxconverse) | TBD | TBD | Free | VoxConverse is an audio-visual diarisation dataset consisting of over 50 hours of multispeaker clips of human speech, extracted from YouTube videos |
| [MiniVox Benchmark](https://github.com/doerlbh/MiniVox) | [MiniVox Benchmark](https://github.com/doerlbh/MiniVox) | en | Free | MiniVox is an automatic framework to transform any speaker-labelled dataset into continuous speech datastream with episodically revealed label feedbacks. |
| [The AliMeeting Corpus](https://github.com/yufan-aslp/AliMeeting) | Together with audios | zh | Free | |
| RÚV-DI dataset | TBD | is | TBD | |
Please remove this.
Removed.
Add Diar-az
Diar-az creates files for a diarization corpus from Gecko output, and provides organization, cleaning, and correction of data for round-tripping between Kaldi/corpus and Gecko formats and back.
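To illustrate the kind of conversion involved, here is a minimal sketch of turning Kaldi-style RTTM `SPEAKER` records into a simplified Gecko-like JSON structure. This is not Diar-az's actual code or API; the JSON field names (`monologues`, `speaker`, `start`, `end`) are a simplified approximation of Gecko's export format, assumed for illustration.

```python
import json


def rttm_to_gecko(rttm_lines):
    """Convert RTTM SPEAKER lines into a simplified, Gecko-like JSON string.

    RTTM SPEAKER records have the form:
      SPEAKER <file> <channel> <onset> <duration> <NA> <NA> <speaker> <NA> <NA>
    """
    monologues = []
    for line in rttm_lines:
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue  # skip non-speaker records
        onset, duration = float(fields[3]), float(fields[4])
        monologues.append({
            "speaker": {"id": fields[7]},
            "start": onset,
            "end": onset + duration,  # RTTM stores duration, Gecko-style uses end time
        })
    return json.dumps({"monologues": monologues}, indent=2)
```

The reverse direction (Gecko JSON back to Kaldi `segments`/`utt2spk`/RTTM) would walk the `monologues` list and emit one record per segment.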