stop caching data in a relative directory #196

Open
@davidegraff

Description

Describe the problem
TDC caches downloaded data to disk for future use, but by default it writes that data to the relative directory ./data. If I then use TDC from a different working directory on the same machine without passing the previous location, it downloads the data again and wastes disk space on duplicate copies.

Describe the solution you'd like
Use a "global" cache directory that is absolute and specific to the user. Most applications cache downloaded data in a hidden directory such as $HOME/.cache/PACKAGE by default (e.g., wandb, pip, huggingface, black). A user can still override this location at runtime if desired, or change the default through an environment variable (see: huggingface).

I currently have this manually implemented in my TDC client code like so:

import os
from pathlib import Path
from tdc.single_pred import ADME

# Use TDC_DATASETS_CACHE if it is set, otherwise fall back to ~/.cache/TDC
TDC_CACHE = os.getenv("TDC_DATASETS_CACHE", Path.home() / ".cache" / "TDC")
data = ADME(name="Caco2_Wang", path=TDC_CACHE)

but this is cumbersome to do everywhere. It would be nice for TDC to do this by default.

This could be done by changing the type of the path parameter from str to Optional[str] with a default value of None. A value of None would mean "use TDC_DATASETS_CACHE from the environment" (falling back to $HOME/.cache/TDC, as in the snippet above), which lets a user (1) configure the default location of TDC downloads globally through the environment and (2) avoid re-downloading datasets every time they change directories.
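For illustration, here is a minimal sketch of how that lookup might work on the library side. The helper name resolve_cache_dir is hypothetical and is not part of TDC; the environment variable name and the $HOME/.cache/TDC fallback are taken from my client code above.

```python
import os
from pathlib import Path
from typing import Optional, Union

# Environment variable name taken from the client snippet in this issue.
CACHE_ENV_VAR = "TDC_DATASETS_CACHE"


def resolve_cache_dir(path: Optional[Union[str, os.PathLike]] = None) -> Path:
    """Hypothetical helper: choose the directory used to cache datasets.

    Precedence: explicit `path` argument, then the environment variable,
    then the per-user default ~/.cache/TDC.
    """
    if path is None:
        # An unset or empty variable falls through to the per-user default.
        path = os.environ.get(CACHE_ENV_VAR) or Path.home() / ".cache" / "TDC"
    # expanduser() handles values such as "~/datasets"; resolve() makes the
    # result absolute so it no longer depends on the working directory.
    return Path(path).expanduser().resolve()
```

A loader could then call resolve_cache_dir(path) once in its constructor and create the directory if needed. An explicit path argument still wins, so existing code that passes path keeps working, although code that relies on the old ./data default would need to pass path="./data" explicitly.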

Labels

enhancement (New feature or request)
