Best practices for testing changes to data sources #67

Open
@rossbar

There is no question that caching data is the right pattern in CI: it improves efficiency and avoids re-acquiring the data on every run.

However, another best practice (I'd argue) is to do the data access programmatically in the tutorial itself. Combined, these two patterns lead to cases where changes to the code that accesses/acquires data go untested, because CI serves the data from the cache and never runs the acquisition step.

I think this case should be addressed in the "how-to"/"faq" section of this site. Off the top of my head, the pattern that makes the most sense is to have a scheduled CI job that doesn't incorporate data caching. This job should run on a cron schedule rather than on every push or PR, and should also offer manual triggering for cases when reviewers identify that data access has changed [1].
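To make the idea concrete, here is a rough Python sketch of how a tutorial's data-access helper could cooperate with such a job. Everything in it is an assumption for illustration only: the `TUTORIAL_NO_CACHE` environment variable, the `get_data` and `cache_disabled` names, and the injected `download` callable are hypothetical and not taken from any existing tutorial or CI setup. The scheduled (or manually triggered) job would set the flag so that the acquisition code actually runs; regular CI runs would leave it unset and read from the cache.

```python
import os
from pathlib import Path
from typing import Callable, Mapping, Optional


def cache_disabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the (hypothetical) TUTORIAL_NO_CACHE flag is set."""
    env = os.environ if env is None else env
    return env.get("TUTORIAL_NO_CACHE", "").lower() in {"1", "true", "yes"}


def get_data(
    name: str,
    cache_dir: Path,
    download: Callable[[str], bytes],
    env: Optional[Mapping[str, str]] = None,
) -> bytes:
    """Return the dataset, re-running `download` when caching is disabled."""
    cached = Path(cache_dir) / name
    if cached.exists() and not cache_disabled(env):
        # Regular CI run: the acquisition code is skipped entirely.
        return cached.read_bytes()
    # Uncached (scheduled/manual) run: the acquisition code is exercised.
    data = download(name)
    cached.parent.mkdir(parents=True, exist_ok=True)
    cached.write_bytes(data)
    return data
```

With something like this, a change that breaks `download` would go unnoticed in cached runs but would fail in the uncached job, which is the gap the scheduled job is meant to close.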

xref numpy/numpy-tutorials#255

Footnotes

  1. This could of course be extended to be made automatic, e.g. with notebook metadata, but IMO that's too involved for a high-level recommendation, at least at this stage!
