
Change to parallel reader #635


Merged: 14 commits into PaddlePaddle:develop on Feb 6, 2018

Conversation

@pkuyym (Contributor) commented Feb 5, 2018

Resolves #630

  1. Change to parallel reader
  2. Adapt the model based on the new interface of data reader.

pkuyym requested review from kuke and zhxfl, February 5, 2018 03:49
@kuke (Collaborator) left a comment

Almost LGTM. @zhxfl please help to verify the correctness of the whole process.

""" struct for one block :
contain label, label desc, feature, feature_desc
class SampleInfo(object):
"""SampleInfo holds the necessary information to load an example from disk.
@kuke (Collaborator) Feb 5, 2018:

an example -> a sample

@pkuyym (Author):

Thanks, done.
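
For context, a minimal sketch of the kind of record such a class might hold; the field names below are illustrative placeholders, not taken from the actual patch:

class SampleInfo(object):
    """SampleInfo holds the necessary information to load a sample from disk."""

    def __init__(self, feature_bin_path, feature_start, feature_size,
                 label_bin_path, label_start, label_size):
        # Location and byte range of this sample's feature data.
        self.feature_bin_path = feature_bin_path
        self.feature_start = feature_start
        self.feature_size = feature_size
        # The same bookkeeping for the label data.
        self.label_bin_path = label_bin_path
        self.label_start = label_start
        self.label_size = label_size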


class DataReader(object):
"""DataReader provides basic audio sample preprocessing pipeline including
I/O and augmentation transforming.
Collaborator:

I/O and augmentation transforming -> data loading and data augmentation

@pkuyym (Author):

Done.


Args:
feature_file_list (str): File containing feature data related files.
label_file_list (str): File containing label data related files.
Collaborator:

-> feature_file_list (str): File that lists the paths of all the feature data and their descriptions.
-> label_file_list (str): File that lists the paths of all the label data and their descriptions.

@pkuyym (Author):

Done.

label_file_list (str): File containing label data related files.
frame_dim (int): The final feature dimension of one frame after all
augmentation applied.
drop_frame_len (int): Lower threshold bound to filter samples having
Collaborator:

-> drop_frame_len (int): The sequence length threshold above which the samples will be dropped

@pkuyym (Author):

Done.
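
Folding the three suggestions above into the docstring would give roughly the following (a sketch of the revised wording only; the merged patch may phrase it differently):

class DataReader(object):
    """DataReader provides basic audio sample preprocessing pipeline including
    data loading and data augmentation.

    Args:
        feature_file_list (str): File that lists the paths of all the feature
                                 data and their descriptions.
        label_file_list (str): File that lists the paths of all the label
                               data and their descriptions.
        frame_dim (int): The final feature dimension of one frame after all
                         augmentation applied.
        drop_frame_len (int): The sequence length threshold above which the
                              samples will be dropped.
    """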

batch_samples = []
lod = [0]

if len(batch_samples) >= minimum_batch_size:
Collaborator:

Can this if statement be merged with the one in the for loop above? They resemble each other very much.
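
One way to reduce the duplication, sketched here as a standalone generator: factor the emit-and-reset step into a helper so that the in-loop check and the trailing check share one code path and differ only in the threshold. Apart from batch_samples, lod and minimum_batch_size, which appear in the snippet above, every name (including the lod bookkeeping) is a placeholder:

def _iterate_batches(sample_generator, batch_size, minimum_batch_size):
    batch_samples = []
    lod = [0]

    def flush():
        # Single place where a batch is emitted and the buffers are reset.
        batch = (list(batch_samples), list(lod))
        del batch_samples[:]
        lod[:] = [0]
        return batch

    for sample in sample_generator():
        batch_samples.append(sample)
        lod.append(lod[-1] + len(sample))  # illustrative lod bookkeeping
        if len(batch_samples) == batch_size:
            yield flush()
    # The leftover samples reuse the same flush helper; only the size
    # threshold differs.
    if len(batch_samples) >= minimum_batch_size:
        yield flush()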

parser.add_argument(
'--minimum_batch_size',
type=int,
default=32,
Collaborator:

What is the proper default value for this argument?


class SampleInfoBucket(object):
"""SampleInfoBucket contains paths of several description files. Feature
description file contains necessary information to access samples' feature
Collaborator:

Add a simple explanation of the necessary information; you can use "( )" or "such as", etc.

@pkuyym (Author):

Thanks, done.

"""SampleInfoBucket contains paths of several description files. Feature
description file contains necessary information to access samples' feature
data and label description file contains necessary information to
access samples' label data. SampleInfoBucket is the minimum unit to do
Collaborator:

and label description file contains necessary information to access samples' label data -> , the same with the label description file

@pkuyym (Author):

Thanks, done.
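
With both suggestions applied, the docstring would read roughly as follows; the parenthesized example of the "necessary information" is an illustrative guess, not the exact wording of the patch:

class SampleInfoBucket(object):
    """SampleInfoBucket contains paths of several description files. A feature
    description file contains the necessary information (e.g. the path of the
    binary feature file and the offset and size of each sample in it) to
    access samples' feature data, the same with the label description file.
    """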

self._batch_buffer_size = batch_buffer_size
self._process_num = process_num

def generate_bucket_list(self, is_shuffle):
Member:

For a private function, use the name "_generate_bucket_list" to distinguish it.

@pkuyym (Author):

If generate_bucket_list is exposed, we can shuffle the block for each epoch. Otherwise, the shuffling can only be done once.
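
A sketch of the driver loop this design enables; generate_bucket_list and is_shuffle come from the snippet above, while batch_iterator, train_one_batch and the other names are hypothetical and used only for illustration:

def run_training(data_reader, num_epochs, batch_size, minimum_batch_size,
                 train_one_batch):
    # Hypothetical training driver: re-shuffling the buckets (blocks) at the
    # start of every epoch is the reason generate_bucket_list stays public.
    for epoch in range(num_epochs):
        data_reader.generate_bucket_list(is_shuffle=True)
        for batch in data_reader.batch_iterator(batch_size, minimum_batch_size):
            train_one_batch(batch)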

@kuke (Collaborator) commented Feb 6, 2018:

We can merge this PR first.

pkuyym merged commit b1c3796 into PaddlePaddle:develop on Feb 6, 2018
3 participants