Best practices for testing changes to data sources

There is no question that caching data is the right pattern in CI: it improves efficiency and avoids re-acquiring the data on every run.

However, another best practice (I'd argue) is that data access be done programmatically in the tutorial itself. Combining these two patterns means that changes to the *code* that accesses/acquires data may go untested: when CI restores the data from the cache, the download code may never actually run.

I think this case should be addressed in the "how-to"/"faq" section of this site. Off the top of my head, the pattern that makes the most sense is to have a scheduled CI job that *doesn't* incorporate data caching. This job should run on a `cron` schedule rather than on every push or pull request, with an option for manual triggering as well, for cases when reviewers identify that data access has changed[^1].
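
To make the suggestion concrete, here is a rough sketch of what such a job could look like, assuming the project uses GitHub Actions. The file name, schedule, Python version, `requirements.txt`, and the `make html` build command are all placeholders I made up for illustration, not something taken from an existing repo:

```yaml
# Hypothetical file: .github/workflows/test-data-access.yml
name: Test data access (no cache)

on:
  schedule:
    # Mondays at 06:00 UTC; the schedule itself is an arbitrary choice.
    - cron: "0 6 * * 1"
  # Lets reviewers start the job by hand when data access has changed.
  workflow_dispatch:

jobs:
  build-without-cache:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"  # assumed version
      # Deliberately no actions/cache step: every run re-acquires the data,
      # so the data-access code in the tutorials is actually exercised.
      - name: Install dependencies
        run: python -m pip install -r requirements.txt  # assumed file
      - name: Execute tutorials
        run: make html  # placeholder for the project's real build command
```

Because there is no `push` or `pull_request` trigger, the regular cached workflow stays fast and this one only runs on the schedule or on demand.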

xref numpy/numpy-tutorials#255

[^1]: This could of course be extended to be made automatic, e.g. with notebook metadata, but IMO that's too involved for a high-level recommendation, at least at this stage!


Best practices for testing changes to data sources #67
