alternative (easier?) way to define datasets #11
base: master
Conversation
        return records


    class ADK_Equilibrium(Dataset):
        NAME = "adk_equilibrium"
        DESCRIPTION = "adk_equilibrium.rst"
One downside I can see is that we've lost the short description that the function generating this dataset had.
Don't want to lose the description and don't want to lose the docs...
So currently it looks like:

    >>> print(ds.__doc__)
    AdK 1us equilibrium trajectory (without water)

    Attributes
    ----------
    topology : filename
        Filename of the topology file
    trajectory : filename
        Filename of the trajectory file
    DESCR : string
        Description of the trajectory.

So it's mostly still there.
        super().__init__(**contents)


    def fetch(dataset, data_home=None, download_if_missing=True):
This allows MDADATA.fetch('adk_equilibrium'), rather than a separate function for each dataset.
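For concreteness, a minimal sketch of what such a generic entry point could look like (the DATASETS registry name and the constructor behaviour are illustrative assumptions, not necessarily this PR's actual code):

    # hypothetical sketch: dispatch on a registry of Dataset classes
    DATASETS = {}  # assumed to be filled as Dataset subclasses are defined

    def fetch(dataset, data_home=None, download_if_missing=True):
        try:
            cls = DATASETS[dataset]
        except KeyError:
            raise ValueError("unknown dataset: {!r}".format(dataset))
        # instantiating the Dataset is assumed to download or locate the files
        return cls(data_home=data_home,
                   download_if_missing=download_if_missing)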
There's a reason for explicit functions: tab completion and introspection. (sklearn does it and it works really well – much better than having to know the name of the dataset.)
I'd like to keep explicit functions – both for ease of use and for the same "look and feel" as sklearn.datasets (as well as getting docs!)
We can have a generic mechanism and generate the fetch_* functions.
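A minimal sketch of how those fetch_* functions could be generated (the helper name _make_fetcher and the DATASETS registry are hypothetical):

    def _make_fetcher(name, description):
        # closure over the generic fetch(); keeps a per-dataset docstring
        def fetcher(data_home=None, download_if_missing=True):
            return fetch(name, data_home=data_home,
                         download_if_missing=download_if_missing)
        fetcher.__name__ = "fetch_" + name
        fetcher.__doc__ = description
        return fetcher

    # at import time, expose one explicit function per registered dataset,
    # so that tab completion, introspection, and docs keep working:
    for _name, _cls in DATASETS.items():
        globals()["fetch_" + _name] = _make_fetcher(_name, _cls.DESCRIPTION)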
Yes, that's nice for this case. But did you look at some of the other accessors like fetch_adk_transitions_DIMS, where we get a tar file and unpack? We might be able to reduce our requirements to these two types of cases.
Could put the compression/other info into the RemoteFileMetadata object.
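That could look like the following sketch; sklearn's RemoteFileMetadata is a plain namedtuple of (filename, url, checksum), and the extra compression field as well as all values here are hypothetical:

    from collections import namedtuple

    # extend sklearn's three-field namedtuple with unpacking information
    RemoteFileMetadata = namedtuple(
        "RemoteFileMetadata",
        ["filename", "url", "checksum", "compression"])

    TRANSITIONS = RemoteFileMetadata(
        filename="adk_transitions.tar.bz2",                # hypothetical values
        url="https://example.org/adk_transitions.tar.bz2",
        checksum="<sha256 hash>",
        compression="tar.bz2")                             # tells fetch() to unpack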
Ok yeah, the namespace is nice; we could implement the static functions as:

    def fetch_adk():
        return base.fetch('adk')
RemoteFileMetadata is verbatim from sklearn. Might be useful to keep it that way and really keep it simple.
If anything, we should build data structures that contain RemoteFileMetadata instances that map remote and local. Have a look at the transitions dataset to see what else we have.
Finally, have a look at sklearn.datasets (and the outstanding docs) to see the variance. I think one reason for copy&paste code is that ultimately each dataset in the wild might have slightly different requirements. Still, that's not to say that we can't try to get a bit of order in ;-).
        Returns
        -------
        dataset : dict-like object with the following attributes:
And another thing lost is the clear description of what the returned Bunch will have...
I don't want to lose this...
        # cannot check its checksum. Instead we download individual files
        # separately. The keys of this dict are also going to be the keys in the
        # Bunch that is returned.
        ARCHIVE = {
If we normalize all of this then we might be able to just put all these data into JSON or YAML files.
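As a hypothetical sketch of that normalization (the file name and JSON layout are invented for illustration), the hard-coded dict could then be built from a JSON file:

    import json

    with open("adk_equilibrium.json") as fh:
        meta = json.load(fh)

    # the keys of ARCHIVE again become the keys of the returned Bunch
    ARCHIVE = {key: RemoteFileMetadata(**entry)
               for key, entry in meta["files"].items()}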
Sure, this is essentially py-son at this point.
It's a good idea to have a template for the most common types of access.
I still want to keep explicit accessor functions because of ease of use, docs, and keeping similarity to sklearn.datasets. We still need to be able to allow other code that does not fit into the general templates.
Might be worthwhile reviving this PR... if someone wants to look into it.
@orbeckst what do you think about this? I was looking at how to write tests for this package, and with every single dataset implementing how it should be downloaded, there was a lot of room for potential errors. Instead, if we define a template for how a dataset should look, we can have the logic of how it is retrieved in a centralised location.
It takes ideas from how we auto-register Readers in MDAnalysis, so Datasets are added to a list of available datasets once defined.
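A minimal sketch of that auto-registration idea, using Python 3's __init_subclass__ (the actual implementation might use a metaclass instead, as MDAnalysis does for Readers; all names here are illustrative):

    DATASETS = {}

    class Dataset:
        NAME = None

        def __init_subclass__(cls, **kwargs):
            super().__init_subclass__(**kwargs)
            if cls.NAME is not None:
                DATASETS[cls.NAME] = cls  # register on class definition

    class ADK_Equilibrium(Dataset):
        NAME = "adk_equilibrium"

    # DATASETS is now {"adk_equilibrium": ADK_Equilibrium}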