
alternative (easier?) way to define datasets #11

Open · richardjgowers wants to merge 6 commits into master
Conversation

richardjgowers (Member):

@orbeckst what do you think about this? I was looking at how to write tests for this package, and with every single dataset implementing how it should be downloaded, there was a lot of room for potential errors. Instead, if we define a template for how a dataset should look, we can keep the retrieval logic in a centralised location.

It takes ideas from how we auto-register Readers in MDAnalysis, so Datasets are added to a list of available datasets once defined.
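A minimal sketch of the registration idea (the DATASETS name and the __init_subclass__ hook are illustrative, not necessarily what the PR does; MDAnalysis itself uses a metaclass for Readers):

DATASETS = {}  # hypothetical module-level registry: NAME -> Dataset class

class Dataset:
    NAME = None  # concrete subclasses set this

    def __init_subclass__(cls, **kwargs):
        # auto-register every named subclass, analogous to how
        # MDAnalysis auto-registers its Readers
        super().__init_subclass__(**kwargs)
        if cls.NAME is not None:
            DATASETS[cls.NAME] = cls

class ADK_Equilibrium(Dataset):
    NAME = "adk_equilibrium"
    DESCRIPTION = "adk_equilibrium.rst"

Once the class body executes, DATASETS["adk_equilibrium"] resolves to the class without any explicit registration call.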

    return records

class ADK_Equilibrium(Dataset):
    NAME = "adk_equilibrium"
    DESCRIPTION = "adk_equilibrium.rst"
richardjgowers (Member Author):

One downside I can see is that we've lost the short description that the generating function had.

orbeckst (Member):

Don't want to lose the description and don't want to lose the docs...

richardjgowers (Member Author):

So currently it looks like:

>>> print(ds.__doc__)
AdK 1us equilibrium trajectory (without water)

    Attributes
    ----------
    topology : filename
         Filename of the topology file
    trajectory : filename
         Filename of the trajectory file
    DESCR : string
         Description of the trajectory.

So it's mostly still there.

        super().__init__(**contents)


def fetch(dataset, data_home=None, download_if_missing=True):
richardjgowers (Member Author):

This allows MDADATA.fetch('adk_equilibrium') rather than a separate function for each dataset.

orbeckst (Member):

There's a reason for explicit functions: tab completion and introspection. (sklearn does it and it works really well – much better than having to know the name of the dataset.)

I'd like to keep explicit functions – both for ease of use and for the same "look and feel" as sklearn.datasets (as well as getting docs!).

We can have a generic mechanism and generate the fetch_* functions.
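One way such generation could look (a sketch assuming the generic fetch() and a DATASETS registry as above; not the PR's actual code):

def _make_fetcher(name):
    # hypothetical factory: build an explicit fetch_<name> wrapper around
    # the generic fetch() so tab completion and introspection still work
    def fetcher(data_home=None, download_if_missing=True):
        return fetch(name, data_home=data_home,
                     download_if_missing=download_if_missing)
    fetcher.__name__ = "fetch_" + name
    fetcher.__qualname__ = fetcher.__name__
    fetcher.__doc__ = "Load the '{}' dataset via the generic fetch().".format(name)
    return fetcher

# generate one module-level fetch_* function per registered dataset
for _name in DATASETS:
    globals()["fetch_" + _name] = _make_fetcher(_name)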

orbeckst (Member):

Yes, that's nice for this case. But did you look at some of the other accessors like fetch_adk_transitions_DIMS, where we get a tar file and unpack it? We might be able to reduce our requirements to these two types of cases.
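For the archive case, the centralised logic might look roughly like this (a sketch; _fetch_and_unpack is a hypothetical helper, not code from this PR):

import hashlib
import os
import tarfile
from urllib.request import urlretrieve

def _fetch_and_unpack(remote, data_home):
    # 'remote' is assumed to carry .filename, .url and a SHA256 .checksum,
    # like sklearn's RemoteFileMetadata
    archive_path = os.path.join(data_home, remote.filename)
    urlretrieve(remote.url, archive_path)
    with open(archive_path, "rb") as fh:
        if hashlib.sha256(fh.read()).hexdigest() != remote.checksum:
            raise IOError("checksum mismatch for {}".format(remote.filename))
    with tarfile.open(archive_path) as tar:
        tar.extractall(path=data_home)
    os.remove(archive_path)  # keep only the unpacked files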

richardjgowers (Member Author):

We could put the compression (and other) info into the RemoteFileMetadata object.
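For example (an illustrative variant, not the sklearn definition):

from collections import namedtuple

# hypothetical extension: a 'compression' field tells the generic fetch()
# whether and how to unpack (None for plain files, e.g. "tar.bz2" otherwise)
RemoteFileMetadata = namedtuple(
    "RemoteFileMetadata", ["filename", "url", "checksum", "compression"])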

richardjgowers (Member Author):

OK, yeah, the namespace is nice; we could implement the static functions as:

def fetch_adk():
    return base.fetch('adk')

orbeckst (Member):

RemoteFileMetadata is verbatim from sklearn. Might be useful to keep it that way and really keep it simple.

If anything, we should build data structures that contain RemoteFileMetadata instances that map remote and local. Have a look at the transitions dataset to see what else we have.
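Keeping RemoteFileMetadata verbatim and layering a container on top could look like this (a sketch; DatasetFile is a hypothetical name):

from collections import namedtuple

# as in sklearn.datasets: a plain namedtuple, kept verbatim
RemoteFileMetadata = namedtuple(
    "RemoteFileMetadata", ["filename", "url", "checksum"])

# one possible wrapper: map a key in the returned Bunch to the
# remote file plus any local handling such as unpacking
DatasetFile = namedtuple(
    "DatasetFile", ["remote",      # a RemoteFileMetadata instance
                    "local_name",  # filename under data_home
                    "unpack"])     # e.g. None or "tar.bz2"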

Finally, have a look at sklearn.datasets (and the outstanding docs) to see the variance. I think one reason for copy&paste code is that ultimately each dataset in the wild might have slightly different requirements. Still, that's not to say that we can't try to get a bit of order in ;-).


Returns
-------
dataset : dict-like object with the following attributes:
richardjgowers (Member Author):

Another thing lost is the clear description of what the returned Bunch will contain...

orbeckst (Member):

I don't want to lose this...

# cannot check its checksum. Instead we download individual files
# separately. The keys of this dict are also going to be the keys in the
# Bunch that is returned.
ARCHIVE = {
orbeckst (Member):

If we normalize all of this then we might be able to just put all these data into JSON or YAML files.
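For instance, a dataset definition could be externalised along these lines (schema, filenames, and URLs purely illustrative):

import json

# hypothetical normalized description, e.g. stored as adk_equilibrium.json;
# field names mirror the class attributes above
definition = json.loads("""
{
  "name": "adk_equilibrium",
  "description": "adk_equilibrium.rst",
  "files": {
    "topology":   {"filename": "adk_equilibrium.psf",
                   "url": "https://example.org/adk_equilibrium.psf",
                   "checksum": "<sha256>"},
    "trajectory": {"filename": "adk_equilibrium.dcd",
                   "url": "https://example.org/adk_equilibrium.dcd",
                   "checksum": "<sha256>"}
  }
}
""")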

richardjgowers (Member Author):

Sure, this is essentially py-son at this point.

orbeckst (Member):

It's a good idea to have a template for the most common types of access:

  • separate files
  • archives that need to be unpacked

I still want to keep explicit accessor functions because of ease of use, docs, and keeping similarity to sklearn.datasets.

We still also need to allow for code that does not fit the general templates.
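One way the escape hatch could look (hypothetical names, not PR code): the base class implements the common cases, and anything unusual simply overrides fetch().

class Dataset:
    # template: common download logic (separate files, archives)
    def fetch(self, data_home=None, download_if_missing=True):
        ...

class UnusualDataset(Dataset):
    NAME = "unusual_example"  # hypothetical

    # datasets that don't fit the templates supply their own
    # retrieval code by overriding fetch()
    def fetch(self, data_home=None, download_if_missing=True):
        ...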

orbeckst (Member):

Might be worthwhile reviving this PR... if someone wants to look into it.
