A simple Pachyderm pipeline template that downloads a Hugging Face dataset or model into a repo.
- `name` - Pipeline name
- `secretName` - Kubernetes secret with the Hugging Face token
- `type` - The type of download: `model` or `dataset`
- `hf_name` - Name of the Hugging Face dataset or model to download
- `revision` - (Optional) a specific revision of the dataset you want to download
- `disable_progress` - (Optional) set to `true` to disable progress logging
- `allow_patterns` - (Optional) comma-separated allow patterns for the download, e.g. `"data/*,*.json"`
- `ignore_patterns` - (Optional) comma-separated ignore patterns for the download, e.g. `"data/*,*.json"`
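To make the comma-separated pattern parameters concrete, here is a minimal sketch of how such lists are commonly interpreted. It assumes the template forwards `allow_patterns` and `ignore_patterns` to the Hugging Face Hub download call, where patterns are fnmatch-style globs and `*` also matches `/`; that forwarding is an assumption, not something this README confirms. `matches_any` and `should_download` are hypothetical helper names used only for illustration.

```shell
#!/bin/sh
# Sketch only: mimics allow/ignore glob filtering in plain POSIX sh.
# Assumption: a file is kept when it matches at least one allow pattern
# (if any are given) and matches no ignore pattern.

# Return 0 if the path ($1) matches any glob in the comma-separated list ($2).
matches_any() {
  _path=$1
  _old_ifs=$IFS
  set -f              # disable filename expansion while splitting the list
  IFS=,
  for _pattern in $2; do
    case $_path in
      $_pattern) IFS=$_old_ifs; set +f; return 0 ;;
    esac
  done
  IFS=$_old_ifs
  set +f
  return 1
}

# Return 0 if the path ($1) passes the allow list ($2) and ignore list ($3).
# An empty list means "no restriction".
should_download() {
  if [ -n "${2:-}" ] && ! matches_any "$1" "$2"; then
    return 1
  fi
  if [ -n "${3:-}" ] && matches_any "$1" "$3"; then
    return 1
  fi
  return 0
}

# Illustrative file names, not taken from any real repository.
for f in data/train.parquet config.json README.md; do
  if should_download "$f" "data/*,*.json" "*.md"; then
    echo "keep $f"
  else
    echo "skip $f"
  fi
done
```

With the lists shown, the first two files are kept and `README.md` is skipped, because it matches neither allow pattern. The sketch does not trim spaces around commas, so `"data/*, *.json"` would behave differently from `"data/*,*.json"`.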
- Create a secret in your Pachyderm Kubernetes namespace containing a read-only Hugging Face token:
kubectl create secret generic hugging-face-token --from-literal HF_HOME=<token>
- Create the pipeline from the template in this repo with `pachctl`:
pachctl create pipeline \
  --jsonnet https://raw.githubusercontent.com/tybritten/hf-dataset-downloader/main/dataset-downloader.jsonnet \
  --arg name="hf-downloader-CodeAlpaca_20k" \
  --arg hf_name="HuggingFaceH4/CodeAlpaca_20K" \
  --arg secretName=hugging-face-token
- This creates a cron pipeline with a spec of never, so it will not run on its own. To trigger a download, run:
pachctl run cron hf-downloader-CodeAlpaca_20k
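The create step can also carry the optional parameters from the list at the top. As a hedged sketch, the helper below only prints a `pachctl create pipeline` command for review and does not run anything. `print_create_cmd` is a hypothetical name, and the pipeline name, model name, revision, and patterns in the example call are illustrative values, not defaults of the template.

```shell
#!/bin/sh
# Dry-run sketch: print a create command with one --arg per template parameter.
TEMPLATE_URL="https://raw.githubusercontent.com/tybritten/hf-dataset-downloader/main/dataset-downloader.jsonnet"

# Print the command for the given key=value pairs.
# Values are wrapped in single quotes so globs such as *.json are not expanded
# when the printed command is pasted into a shell. Values that themselves
# contain single quotes are not handled.
print_create_cmd() {
  printf 'pachctl create pipeline \\\n'
  printf '  --jsonnet %s' "$TEMPLATE_URL"
  for _kv in "$@"; do
    printf " \\\\\n  --arg '%s'" "$_kv"
  done
  printf '\n'
}

# Example: a model download limited to a few file types (illustrative values).
print_create_cmd \
  "name=hf-downloader-example-model" \
  "secretName=hugging-face-token" \
  "type=model" \
  "hf_name=distilbert/distilbert-base-uncased" \
  "revision=main" \
  "disable_progress=true" \
  "allow_patterns=*.json,*.safetensors"
```

Printing first makes it easy to check the quoting of the pattern lists before the pipeline is actually created.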