Describe the problem
TDC caches downloaded data to disk for future use, but by default it caches this data to a relative local directory, ./data. If I then use TDC from a different directory on the same machine without specifying the previous location, it downloads the data again, unnecessarily polluting disk space.
Describe the solution you'd like
Use a "global" cache directory that is absolute for a user. It's standard practice for most applications to cache downloaded data to a hidden directory like $HOME/.cache/PACKAGE (cf. wandb, pip, huggingface, black, etc.) by default. At runtime, a user can change this if desired and configure this default location using an environment variable (see: huggingface).
I currently have this manually implemented in my TDC client code like so:
import os
from pathlib import Path
from tdc.single_pred import ADME
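# Fall back to a per-user cache directory when TDC_DATASETS_CACHE is unset.
# str() keeps the value a plain string whether or not the variable is set.
TDC_CACHE = os.getenv("TDC_DATASETS_CACHE", str(Path.home() / ".cache" / "TDC"))
data = ADME(name="Caco2_Wang", path=TDC_CACHE)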
but this is cumbersome to do everywhere. It would be nice for TDC to do this by default.
You can do this by changing the path parameter type from str to Optional[str] with a default value of None. A value of None indicates that TDC should read the location from TDC_DATASETS_CACHE in the environment, allowing a user to (1) globally configure the default location of TDC downloads from the environment, and (2) avoid redownloading datasets every time they change directories.
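For illustration, here is a rough sketch of how that resolution could look. The helper name resolve_cache_dir and the DEFAULT_CACHE constant are hypothetical and not part of TDC; the fallback to ~/.cache/TDC simply mirrors my client-side workaround above.

```python
import os
from pathlib import Path
from typing import Optional

# Hypothetical default, mirroring the client-side workaround in this issue.
DEFAULT_CACHE = Path.home() / ".cache" / "TDC"


def resolve_cache_dir(path: Optional[str] = None) -> str:
    """Hypothetical helper: choose the directory used for dataset downloads."""
    if path is None:
        # No explicit path: consult the environment, then the per-user default.
        path = os.getenv("TDC_DATASETS_CACHE") or str(DEFAULT_CACHE)
    # Expand "~" and make the path absolute so it no longer depends on the cwd.
    return str(Path(path).expanduser().resolve())
```

Each data loader could then call something like resolve_cache_dir(path) before downloading, so an explicit path argument keeps working exactly as it does today while None picks up the shared location.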