-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove non-base datasets. #1275
Conversation
looks ok but @bpopeters can you please review. |
if hasattr(ex, "tgt"): | ||
return len(ex.src[0]), len(ex.tgt[0]) | ||
return len(ex.src[0]) | ||
def text_sort_key(ex): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The style of defining sort_key
as a static method of a dataset class was to match torchtext, where I think the general idea is people would define a different dataset class for each dataset (classes with names like IWSLT
and WMT14
, for example). But for the way we're doing it, with just the single Dataset
class, I agree that this makes sense.
Is this waiting on anything? |
I was about to merge it but Travis is weird. |
@flauted this breaks pre-existing pre-processed files, not a big deal but in case this could be fixed. |
You bumped the major version so I'll let it be. But for what it's worth fixing it would I think require leaving stub |
yeah that's what I figured out. |
* Remove non-base datasets. * Update Dataset documentation. * Move helper methods out of Dataset and document them. * Remove , replace w explicit constructor calls.
The various [Datatype]Dataset's are just shells that define a
sort_key
static method. Sincetorchtext.data.Dataset
only uses thesort_key
attribute attached to the instanceself
(despite defining it at the class level) , we can just make thesort_key
an instance attribute of theDatasetBase
. ConsequentlyDatasetBase
is now a generalDataset
. After this PR, the file dataset_base.py should be renamed to dataset.py.