Add SpeechCommands train/valid/test split #966
Conversation
Force-pushed from f2087d9 to 00dc9f9.
Merging this before release would simplify the tutorial.
Force-pushed from 68206f2 to eb03a1c.
@@ -48,13 +48,24 @@ class SPEECHCOMMANDS(Dataset):
            The top-level directory of the dataset. (default: ``"SpeechCommands"``)
        download (bool, optional):
            Whether to download the dataset if it is not found at root path. (default: ``False``).
        subset (Optional[str]):
            Select a subset of the dataset [None, "training", "validation", "testing"].
            (default: ``None``)
Can you add a description of what kinds of files you would get for each option? I am looking at the dataset and I see testing_list.txt and validation_list.txt, but I do not see training_list.txt.
As you point out, training_list.txt doesn't exist; it's defined to be what's not in validation or testing. The other two are defined by the dataset. Is that what you meant?
> it's defined to be what's not in validation or testing. The other two are defined by the dataset. Is that what you meant?

Yes. And also for None too.
Force-pushed from 71ab85e to 6efedf5.
Definitely better than the previous code. Made some comments to push it to the next level.
@@ -20,6 +20,8 @@
    "https://storage.googleapis.com/download.tensorflow.org/data/speech_commands_v0.02.tar.gz":
    "6b74f3901214cb2c2934e98196829835",
}
VALIDATION_LIST = "validation_list.txt"
TESTING_LIST = "testing_list.txt"
Please do not use global variables for values that are only used in one function (the constructor). This makes them look like public attributes. There is no good reason for these to be global.
elif subset == "training":
    excludes = load_list(VALIDATION_LIST) + load_list(TESTING_LIST)
    walker = walk_files(self._path, suffix=".wav", prefix=True)
    self._walker = [w for w in walker if HASH_DIVIDER in w and EXCEPT_FOLDER not in w and w not in excludes]
Looking way better than the previous version. One more push. Rather than parsing all the files present in the file system first and then filtering down to the ones of interest, it's better to parse only the needed ones. Also, repeating the combination of walk_files and HASH_DIVIDER filtering in two clauses looks overwhelming. You can combine them into much shorter code like the below:
elif subset == "testing":
...
else:
paths = [str(p) for p in Path(self._path).glob('*/*_no_hash_*.wav')]
# no need to check EXCEPT_FOLDER because
# none of the wav file in the directory match with `_no_hash_`
if subset == "training":
excludes = set(_load_lists("validation_list.txt", "testing_list.txt"))
paths = [p for p in paths if p not in excludes]
self._walker = paths
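The suggested glob snippet above can be exercised end to end with a tiny mock directory tree; the layout, file names, and exclude list below are made up for illustration and only mimic the dataset's `label/<id>_no_hash_<n>.wav` naming scheme:

```python
import tempfile
from pathlib import Path

# Build a tiny mock dataset layout: label folders containing *_no_hash_*.wav files.
root = Path(tempfile.mkdtemp())
for name in ["bed/a_no_hash_0.wav", "bed/b_no_hash_0.wav", "cat/c_no_hash_0.wav"]:
    p = root / name
    p.parent.mkdir(parents=True, exist_ok=True)
    p.touch()

# A mock exclude set, standing in for validation_list.txt / testing_list.txt.
excludes = {str(root / "bed" / "b_no_hash_0.wav")}

# The suggested glob: only files matching the naming scheme are ever parsed,
# so there is no need to filter out the background-noise folder afterwards.
paths = sorted(str(p) for p in root.glob("*/*_no_hash_*.wav"))

# "training" subset = everything not listed in validation/testing.
training = [p for p in paths if p not in excludes]
print(len(paths), len(training))  # prints "3 2"
```

Sorting the globbed paths (as the review later suggests) keeps the ordering deterministic across file systems.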
I encourage you to take this opportunity to get rid of walk_files, because if you try to stick with walk_files, I do not see any pros at a glance. You would have to do more work to settle for a sub-optimal solution that is neither as compact nor as efficient.
Also, #950 is about supporting Path objects, so I do not see the logical connection between it and walk_files (which returns a list of strings).
Regarding the order it returns: we have a unit test that verifies the proper ordering of the files returned, so you do not need to worry about aligning the order with the walk_files function; instead, you can just focus on writing correct code. However, this is a valid point, and you will probably need to insert a sort call in the snippet above.
@vincentqb Now that #950 is closed, can you follow up on replacing walk_files with glob, with something like the following?
> I agree that parsing and then filtering is not optimal. Changing os_walk is out of scope for this PR though, and could change the order of the files from #814. I'm OK with replacing os_walk with glob in a next PR due to #950.
elif subset == "testing":
    ...
else:
    paths = [str(p) for p in Path(self._path).glob('*/*_no_hash_*.wav')]
    # no need to check EXCEPT_FOLDER because
    # none of the wav file in the directory match with `_no_hash_`
    if subset == "training":
        excludes = set(_load_lists("validation_list.txt", "testing_list.txt"))
        paths = [p for p in paths if p not in excludes]
    self._walker = paths
Good idea: I've opened an issue to request help from the community to remove walk_files: #1051.
def load_list(filename):
    filepath = os.path.join(self._path, filename)
    with open(filepath) as fileobj:
        return [os.path.join(self._path, line.strip()) for line in fileobj]
Please prefer a plain function defined at module level, especially when there are no local variables that need to be captured. Also, since you are defining a helper function for your own use, you can adopt a signature like def _load_lists(*filenames), which turns excludes = load_list(VALIDATION_LIST) + load_list(TESTING_LIST) into the simpler excludes = _load_lists("validation_list.txt", "testing_list.txt").
            valid.write(f'{label_filename}\n')
            cls.valid_samples.append(sample)
        elif j < 6:
            label_filename = os.path.join(label, filename)
The contents of "testing_list.txt" and "validation_list.txt" are in Linux format:
$ head speech_commands_v0.01/testing_list.txt
bed/0c40e715_nohash_0.wav
bed/0ea0e2f4_nohash_0.wav
bed/0ea0e2f4_nohash_1.wav
Here, using os.path.join makes label_filename a valid Windows path on a Windows system (something like bed\0c40e715_nohash_0.wav), which does not match the list contents, so this is not correct. Something like the following is the correct way to mimic the data:

- label_filename = os.path.join(label, filename)
+ label_filename = f'{label}/{filename}'
The same goes for the case where j < 4.
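The platform difference described above can be reproduced on any OS using the ntpath and posixpath modules (os.path is an alias for one of them depending on the platform); the label and filename values are taken from the list-file example above:

```python
import ntpath
import posixpath

label, filename = "bed", "0c40e715_nohash_0.wav"

# On Windows, os.path.join uses backslashes...
print(ntpath.join(label, filename))     # bed\0c40e715_nohash_0.wav
# ...while the dataset's list files always use forward slashes.
print(posixpath.join(label, filename))  # bed/0c40e715_nohash_0.wav

# An f-string with an explicit '/' reproduces the list format on any platform.
print(f"{label}/{filename}")            # bed/0c40e715_nohash_0.wav
```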
good catch
I think it's more straightforward to use List[pathlib.Path] objects for self._walker, rather than converting them to List[str].
Yeah, it will be more convenient to do that in the future.
for filename in filenames:
    filepath = os.path.join(self._path, filename)
    with open(filepath) as fileobj:
        output += [os.path.join(self._path, line.strip()) for line in fileobj]
I'm surprised our Windows CI job does not raise a FileNotFoundError with this implementation. If my understanding is correct, as previously noted, os.path.join(self._path, line.strip()) would produce strings that look like speech_commands\speech_commands_v0.01/testing_list.txt, which is an invalid path on Windows.
Yeah, that's what I see too. It seems like os.path.normpath would be useful here.
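A quick illustration of what os.path.normpath does to such a mixed-separator string, again using ntpath to emulate the Windows behavior on any platform (the path here is illustrative):

```python
import ntpath

mixed = "speech_commands\\speech_commands_v0.01/testing_list.txt"
# On Windows, normpath converts forward slashes to backslashes,
# yielding a consistently separated path.
print(ntpath.normpath(mixed))
# speech_commands\speech_commands_v0.01\testing_list.txt
```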
Force-pushed from 52a8ce8 to c3b1911.
self._walker = [
    w for w in walker
    if HASH_DIVIDER in w
    and EXCEPT_FOLDER not in w
Checking EXCEPT_FOLDER not in w in addition to HASH_DIVIDER in w is redundant, because files in the background-noise folder do not have _nohash_ in their filenames.
I'll avoid changing the logic here so as not to break any existing code, e.g. if a user has moved a file into the background_noise folder. This can be looked at in a later PR.
    self._walker = _load_list("validation_list.txt")
elif subset == "testing":
    self._walker = _load_list("testing_list.txt")
elif subset == "training":
I see a behavior inconsistency between "training"/None and "validation"/"testing". If certain valid files are removed from the dataset, then this dataset implementation will keep working for the "training"/None case with fewer files, but for "validation"/"testing" it will raise an exception. What's your take on that?
"testing", "validation", "training" are defined by the dataset by the training/validation files. It is technically undefined by the dataset outside of that. Given how the three are defined by the dataset in those files, as a user, I'd expect changes to those file to propagate. I'd say it'd be fair to add a quick note in the docstring.
Added.
> It is technically undefined by the dataset outside of that.

I agree with that; however, IIRC, one of the goals of torchaudio's dataset implementation is to make it easy to modify the dataset, along the lines of points 3 and 4 of #852 (comment). With this rule, we have to take the extra step of thinking through what kinds of modification are valid/invalid and what the expected behavior is, and put that in the implementation. I am trying to raise awareness of it.
And of course, if we drop the goal of easy dataset modification, we do not need to think about any modifications to the dataset at all, and we can leave it as undefined behavior.
Actually, SpeechCommands does explain how those two files are generated, and how to generalize the approach; see the README.
for filename in filenames:
    filepath = os.path.join(root, filename)
    with open(filepath) as fileobj:
        output += [os.path.normpath(os.path.join(root, line.strip())) for line in fileobj]
Can you add a comment on why normpath is necessary here? Even though normpath's official documentation says "On Windows, it converts forward slashes to backward slashes. To normalize case, use normcase().", for many of us who are Linux users it is not well known. (I learned this only today.)
Merged based on offline discussion.
Updates for 1.5 release.
Add SpeechCommands train/valid/test split. This follows the convention established in gtzan.