
Add SpeechCommands train/valid/test split #966


Merged (11 commits) on Oct 27, 2020

Conversation


@vincentqb vincentqb commented Oct 16, 2020

Add SpeechCommands train/valid/test split. This follows the convention established in gtzan.

@vincentqb vincentqb marked this pull request as ready for review October 23, 2020 21:32

vincentqb commented Oct 23, 2020

Merging this before release would simplify the tutorial.

@@ -48,13 +48,24 @@ class SPEECHCOMMANDS(Dataset):
            The top-level directory of the dataset. (default: ``"SpeechCommands"``)
        download (bool, optional):
            Whether to download the dataset if it is not found at root path. (default: ``False``).
        subset (Optional[str]):
            Select a subset of the dataset [None, "training", "validation", "testing"].
            (default: ``None``)
Contributor

Can you add the description on what kinds of files you would get for each option? I am looking at the dataset and I see testing_list.txt and validation_list.txt, but I do not see training_list.txt.

Contributor Author

As you point out, training_list.txt doesn't exist, it's defined to be what's not in validation or testing. The other two are defined by the dataset. Is that what you meant?

Contributor

it's defined to be what's not in validation or testing. The other two are defined by the dataset. Is that what you meant?

Yes. And also for None too.

Contributor

@mthrok mthrok left a comment

Definitely better than the previous code. Made some comments to push it to the next level.


@@ -20,6 +20,8 @@
"https://storage.googleapis.com/download.tensorflow.org/data/speech_commands_v0.02.tar.gz":
"6b74f3901214cb2c2934e98196829835",
}
VALIDATION_LIST = "validation_list.txt"
TESTING_LIST = "testing_list.txt"
Contributor

Please do not use global variables for values that are only used in one function (the constructor). Written this way, they look like public attributes; there is no good reason for them to be global.

elif subset == "training":
    excludes = load_list(VALIDATION_LIST) + load_list(TESTING_LIST)
    walker = walk_files(self._path, suffix=".wav", prefix=True)
    self._walker = [w for w in walker if HASH_DIVIDER in w and EXCEPT_FOLDER not in w and w not in excludes]
Contributor

Looking way better than the previous version. One more push.

Rather than walking all the files present in the file system first and then filtering down to the ones of interest, it's better to parse only the ones needed.

Also, repeating the combination of walk_files and the HASH_DIVIDER filtering in two clauses is overwhelming. You can combine them into much shorter code like the below.

elif subset == "testing":
    ...
else:
    # No need to check EXCEPT_FOLDER because none of the wav files
    # in that directory match `_nohash_`.
    paths = [str(p) for p in Path(self._path).glob('*/*_nohash_*.wav')]
    if subset == "training":
        excludes = set(_load_lists("validation_list.txt", "testing_list.txt"))
        paths = [p for p in paths if p not in excludes]
    self._walker = paths

Contributor Author

@vincentqb vincentqb Oct 26, 2020

I agree that parsing and then filtering is not optimal. Changing os_walk is out of scope for this PR though, and could change the order of the files from #814. I'm ok with replacing os_walk with glob in a later PR due to #950.

Contributor

I encourage you to take this opportunity to get rid of walk_files; if you try to stick with it, I do not see any upside at a glance. You would be doing more work to settle for a sub-optimal solution that is both less compact and inefficient.

Also, #950 is about support for Path objects, so I do not see the logical connection between it and walk_files (which returns a list of strings).

Regarding the order of the returned files: we have a unit test that verifies the ordering, so you do not need to worry about matching the order produced by walk_files. Instead, you can just focus on writing correct code. However, this is a valid point, and you will probably need to insert a sort into the snippet above.
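To make the ordering concern concrete, here is a minimal sketch (against a throwaway directory, not the real dataset) of globbing for `_nohash_` files and sorting the result so the order is deterministic; the folder and file names are purely illustrative:

```python
import tempfile
from pathlib import Path

# Build a tiny throwaway layout mimicking the SpeechCommands structure.
root = tempfile.mkdtemp()
for name in ["yes/b_nohash_0.wav", "no/a_nohash_0.wav"]:
    p = Path(root) / name
    p.parent.mkdir(parents=True, exist_ok=True)
    p.touch()

# Path.glob makes no ordering guarantee, so sort explicitly.
paths = sorted(str(p) for p in Path(root).glob("*/*_nohash_*.wav"))
assert [Path(p).name for p in paths] == ["a_nohash_0.wav", "b_nohash_0.wav"]
```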

Contributor

@vincentqb Now that #950 is closed, can you follow up by replacing walk_files with glob, with something like the following?

I agree that parsing and then filtering is not optimal. Changing os_walk is out of scope for this PR though, and could change the order of the files from #814. I'm ok with replacing os_walk with glob in a next PR due to #950.

elif subset == "testing":
    ...
else:
    # No need to check EXCEPT_FOLDER because none of the wav files
    # in that directory match `_nohash_`.
    paths = [str(p) for p in Path(self._path).glob('*/*_nohash_*.wav')]
    if subset == "training":
        excludes = set(_load_lists("validation_list.txt", "testing_list.txt"))
        paths = [p for p in paths if p not in excludes]
    self._walker = paths

Contributor Author

(About #940: #950 discussed using the pathlib library, which arrived in Python 3.4. Now that we use it, I'm ok with using other functions from it.)

Contributor Author

Good idea: I've opened an issue to request help from the community to remove walk_files: #1051.

def load_list(filename):
    filepath = os.path.join(self._path, filename)
    with open(filepath) as fileobj:
        return [os.path.join(self._path, line.strip()) for line in fileobj]
Contributor

@mthrok mthrok Oct 26, 2020

Please prefer a plain function defined at module level, especially when there is no local variable that needs to be captured.

Also, since you are defining a helper function for your own use, you can adopt a signature like

def _load_lists(*filenames)

and it will turn excludes = load_list(VALIDATION_LIST) + load_list(TESTING_LIST) into something as simple as excludes = _load_lists("validation_list.txt", "testing_list.txt").

        valid.write(f'{label_filename}\n')
        cls.valid_samples.append(sample)
    elif j < 6:
        label_filename = os.path.join(label, filename)
Contributor

@mthrok mthrok Oct 26, 2020

The contents of "testing_list.txt" and "validation_list.txt" are in Linux format:

$ head speech_commands_v0.01/testing_list.txt
bed/0c40e715_nohash_0.wav
bed/0ea0e2f4_nohash_0.wav
bed/0ea0e2f4_nohash_1.wav

Here, using os.path.join produces the platform-native path on a Windows system (something like bed\0c40e715_nohash_0.wav), which does not match those lists, so this is not correct. Something like the following is the correct way to mimic the data.

Suggested change
-label_filename = os.path.join(label, filename)
+label_filename = f'{label}/{filename}'

The same goes for the j < 4 case.
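The point can be checked on any OS with the `ntpath` and `posixpath` modules, which expose the Windows and POSIX flavors of `os.path` directly:

```python
import ntpath
import posixpath

label, filename = "bed", "0c40e715_nohash_0.wav"

# os.path.join on Windows (ntpath) uses a backslash, which will never
# match the forward-slash entries in testing_list.txt / validation_list.txt.
assert ntpath.join(label, filename) == "bed\\0c40e715_nohash_0.wav"

# An explicit f-string (or posixpath.join) keeps the Linux-format key on any OS.
assert f"{label}/{filename}" == posixpath.join(label, filename) == "bed/0c40e715_nohash_0.wav"
```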

Contributor Author

Good catch.

Contributor

I think it's more straightforward to use List[pathlib.Path] objects for self._walker, rather than converting them to List[str].

Contributor Author

Yeah, it will be more convenient to do that in the future.

for filename in filenames:
    filepath = os.path.join(self._path, filename)
    with open(filepath) as fileobj:
        output += [os.path.join(self._path, line.strip()) for line in fileobj]
Contributor

I'm surprised our Windows CI job does not raise a FileNotFoundError with this implementation. If my understanding is correct, as previously noted, os.path.join(self._path, line.strip()) would produce strings that look like speech_commands\speech_commands_v0.01/testing_list.txt, which is an invalid path on Windows.

Contributor Author

Yeah, that's what I see too. It seems like os.path.normpath would be useful here.
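For illustration, the two `os.path` flavors show what `os.path.normpath` would do with such a string: under Windows semantics it collapses the mixed separators, while under POSIX semantics the backslash is just an ordinary filename character:

```python
import ntpath
import posixpath

mixed = "speech_commands\\speech_commands_v0.01/testing_list.txt"

# ntpath.normpath (what os.path.normpath does on Windows) converts the
# forward slash to a backslash, yielding a uniform Windows path.
assert ntpath.normpath(mixed) == "speech_commands\\speech_commands_v0.01\\testing_list.txt"

# posixpath.normpath leaves the backslash alone; nothing to normalize here.
assert posixpath.normpath(mixed) == mixed
```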

self._walker = [
    w for w in walker
    if HASH_DIVIDER in w
    and EXCEPT_FOLDER not in w
Contributor

The check if HASH_DIVIDER in w and EXCEPT_FOLDER not in w is redundant, because files in the background-noise folder do not have _nohash_ in their filenames.

Contributor Author

@vincentqb vincentqb Oct 27, 2020

I'll avoid changing the logic here so as not to break any code, say, if a user moved a file into the background_noise folder, etc. This can be looked at in a later PR.

    self._walker = _load_list("validation_list.txt")
elif subset == "testing":
    self._walker = _load_list("testing_list.txt")
elif subset == "training":
Contributor

I see a behavior inconsistency between "training"/None and "validation"/"testing".

If certain valid files are removed from the dataset, this implementation will keep working for the "training"/None cases with fewer files, but for "validation"/"testing" it will raise an exception.

What's your take on that?

Contributor Author

"testing", "validation", and "training" are defined by the dataset through the testing/validation list files; behavior is technically undefined outside of that. Given how the three subsets are defined by those files, as a user I'd expect changes to those files to propagate. It'd be fair to add a quick note about this in the docstring.

Contributor Author

Added.

Contributor

It is technically undefined by the dataset outside of that.

I agree with that; however, IIRC, one of the goals of torchaudio's dataset implementations is to make it easy to modify the dataset, along the lines of points 3 & 4 of #852 (comment). With this rule, we have to take the extra step of thinking through which kinds of modification are valid or invalid and what the expected behavior is, and put that into the implementation.

I am trying to raise awareness of it.

Contributor

@mthrok mthrok Oct 27, 2020

And of course, if we give up on easy modification of the dataset, we do not need to think about any modifications to the dataset at all, and we can leave it as undefined behavior.

Contributor Author

Actually, SpeechCommands does explain how those two files are generated, and how to generalize the approach; see the README.

for filename in filenames:
    filepath = os.path.join(root, filename)
    with open(filepath) as fileobj:
        output += [os.path.normpath(os.path.join(root, line.strip())) for line in fileobj]
Contributor

@mthrok mthrok Oct 27, 2020

Can you add a comment on why normpath is necessary here?
Even though normpath's official documentation says "On Windows, it converts forward slashes to backward slashes. To normalize case, use normcase().", this is not well known to many of us who are Linux users. (I learned it only today.)

@vincentqb vincentqb merged commit b34bc7d into pytorch:master Oct 27, 2020

vincentqb commented Oct 27, 2020

Merged based on offline discussion.

mthrok pushed a commit to mthrok/audio that referenced this pull request Feb 26, 2021