Duplicate example in train and test set #6

Open
BramVanroy opened this issue Jul 2, 2022 · 0 comments
Hi

I was doing some sanity checks and found an example that appears in both the train and test sets:

  • DBRD/train/neg/2074_2.txt
  • DBRD/test/neg/20602_2.txt

Content-wise they are identical; the only difference is that the file in the train set contains more newlines. These newlines are filtered out anyway when training our models (at least I do that, replacing them with single spaces).
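
For anyone who wants to repeat this kind of check, here is a minimal sketch in Python. It is not the script used to find the pair above. The function names `normalize` and `find_cross_split_duplicates` are hypothetical, and it assumes the files are UTF-8 encoded `.txt` files under `train/` and `test/` subfolders of the dataset root. Whitespace is collapsed to single spaces before comparing, so differences in newlines alone do not hide a duplicate.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def normalize(text: str) -> str:
    """Collapse every run of whitespace (including newlines) into one space."""
    return " ".join(text.split())


def find_cross_split_duplicates(root: Path) -> list[tuple[str, str]]:
    """Return (train_file, test_file) pairs whose normalized text is identical.

    Assumes `root` contains `train/` and `test/` folders with UTF-8 .txt files.
    """
    # Map: content hash -> split name -> list of relative file paths.
    by_hash: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
    for split in ("train", "test"):
        for path in sorted((root / split).rglob("*.txt")):
            text = normalize(path.read_text(encoding="utf-8"))
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            by_hash[digest][split].append(path.relative_to(root).as_posix())

    # Keep only hashes that occur in both splits and pair the files up.
    pairs = []
    for splits in by_hash.values():
        for train_file in splits.get("train", []):
            for test_file in splits.get("test", []):
                pairs.append((train_file, test_file))
    return sorted(pairs)
```

Calling `find_cross_split_duplicates(Path("DBRD"))` would list every train/test pair that is identical after whitespace normalization. Exact matching only catches verbatim duplicates; near-duplicates with small edits would need a fuzzier comparison.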

This seems important enough to warrant a revised version 3.1 in which the duplicate is removed, since it impacts model training. Together with language filtering (#2), it might even warrant a v4. Alternatively, I can make a fork and rework the whole thing, of course with acknowledgments to this repo.
