Document usage of Kedro + DVC #2691

Open
astrojuanlu opened this issue Jun 15, 2023 · 17 comments
Labels
Component: Documentation 📄 Issue/PR for markdown and API documentation

Comments

@astrojuanlu
Member

Description

It would be nice if we had a page on our docs that described how DVC and Kedro can be used together.

Context

Kedro users have been asking for DVC for some time. For example:

The scope of this issue is to document how such a thing can be done, but there's a chance that the best way of documenting it is to contribute a plugin.

@astrojuanlu added the "Component: Documentation 📄" and "Issue: Feature Request" labels on Jun 15, 2023
@stichbury
Contributor

I asked ChatGPT; what's the quality like for this response? Just wondering if we can use it to start off a page...

Kedro and DVC are both tools that can be used for data science and machine learning projects. While they serve different purposes, they can be used together to enhance the reproducibility and version control aspects of your projects. Here's how you can integrate Kedro and DVC:

Initialize a Kedro project: Start by creating a new Kedro project using the kedro new command. This will set up the basic project structure and configuration files.

Set up DVC: Install DVC and initialize it within your Kedro project directory using the dvc init command. This will create the necessary DVC files and directories.

Define your Kedro pipelines: Use Kedro to define your data processing pipelines. This involves creating modular pipeline nodes that represent individual steps in your data science workflow.

Version your data with DVC: Use DVC to version your data files and track their changes over time. By using the dvc add command, you can tell DVC to track specific data files in your Kedro project. This ensures that your data remains under version control and can be easily reproduced.

Integrate DVC with Kedro pipelines: Modify your Kedro pipeline definitions to include DVC commands. For example, you can use the dvc run command within a Kedro pipeline node to run a data processing step and have DVC track the inputs and outputs of that step.

Use DVC for data versioning and sharing: DVC provides features for managing large datasets and collaborating with others. You can use DVC's remote storage options (such as AWS S3 or Google Cloud Storage) to store your data and share it with your team or collaborators.

By combining Kedro and DVC, you can leverage Kedro's pipeline management capabilities with DVC's data versioning and reproducibility features. This allows you to track and reproduce your data science experiments effectively, ensuring the integrity and consistency of your results.

@noklam
Contributor

noklam commented Jun 16, 2023

@stichbury brilliant idea!

@JaynouOliver

Sure, please assign me, I want to contribute and learn on the go

@astrojuanlu
Member Author

Hi @JaynouOliver, go ahead! No need to assign the issue, start working on a new documentation page and open a pull request when it's ready for a first review.

@JaynouOliver

Sure!

@astrojuanlu
Member Author

Interesting perspective from a DVC user: https://fosstodon.org/@blakeNaccarato/111256190959866234

I appreciate the separation of concerns that working with DVC facilitates. Stages as shell commands make non-Python stages trivial. It's good for general processing outside research pipelines too, e.g. document processing.

Stage caching is enabled by hash comparison of deps/outs on disk and avoids costly recompute.

But this design forces disk access between each stage and lots of intermediate files. An abstraction enabling all-in-memory stages could help at the expense of caching.
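To make the caching point above concrete, here is a minimal sketch of hash-based stage skipping. It is not DVC's implementation (DVC keeps its hashes in `dvc.lock` and manages a separate cache); the function names and the JSON lock file are hypothetical and only illustrate the idea of comparing deps/outs on disk before recomputing.

```python
import hashlib
import json
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def run_stage_if_changed(deps, outs, action, lock_file):
    """Run `action` only when a dep or out differs from the recorded hashes.

    Returns True if the stage ran, False if it was skipped.
    All names here are illustrative, not part of DVC or Kedro.
    """
    deps = [Path(p) for p in deps]
    outs = [Path(p) for p in outs]
    lock_file = Path(lock_file)
    recorded = json.loads(lock_file.read_text()) if lock_file.exists() else None

    def snapshot():
        # Hash every dep and out that currently exists on disk.
        return {str(p): file_md5(p) for p in deps + outs if p.exists()}

    all_outs_present = all(p.exists() for p in outs)
    if recorded is not None and all_outs_present and snapshot() == recorded:
        return False  # nothing changed on disk: skip the recompute

    action()
    # Record the post-run state so the next call can compare against it.
    lock_file.write_text(json.dumps(snapshot(), sort_keys=True))
    return True
```

Note that the comparison only works because every stage reads from and writes to disk, which is exactly the trade-off the quote describes.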

@astrojuanlu
Member Author

Today @datajoely mentioned this in our Slack. I hadn't realized that our dataset versioning sort of overlaps with DVC: https://linen-slack.kedro.org/t/16014653/hello-very-much-new-to-the-ml-world-i-m-trying-to-setup-a-fr#e111a9d2-188c-4cb3-8a64-37f938ad21ff

DVC and Kedro don’t gel super nicely together. It can be done, but our support for native DataSet versioning and Delta (Spark and non-Spark) also works in this space.

@stichbury
Contributor

Hi @JaynouOliver -- how are you? Today is the last day of October so please do slip any PRs into our queue if you have them for Hacktoberfest.

@JaynouOliver

Hi. I was not doing it for hacktoberfest. Mind if I submit it by tomorrow?

@stichbury
Contributor

Then that's grand, yes please, that would work for us. Thank you.

@astrojuanlu
Member Author

For the record, yesterday two users asked me how to combine Kedro and DVC.

@stichbury
Contributor

For the record, yesterday two users asked me how to combine Kedro and DVC.

Did you tell them? Did you write it down? If not, is the above generated content any use? Shall we publish?

I have many questions.

@astrojuanlu
Member Author

It was an in-person chat after my talk. I told them to try https://github.com/FactFiber/kedro-dvc/ but also warned them that Kedro versioning is not easily configurable, so it might be hard (see #2355). I think this has to be an engineering spike before it becomes a documentation issue.

@astrojuanlu changed the title from "Document integration with DVC" to "[spike] Investigate possible integration with DVC" on Jan 25, 2024
@astrojuanlu removed the "Component: Documentation 📄" label on Jan 25, 2024
@stichbury
Contributor

Perfect, thanks for the background and also for the change in the ticket, makes sense to me.

@merelcht
Member

We're looking at this in the context of broader versioning and dataset research. If you have thoughts on this please comment on #3997.

@astrojuanlu
Member Author

A useful resource I found on DVC https://www.python4data.science/en/latest/productive/dvc/index.html

@astrojuanlu
Member Author

astrojuanlu commented Sep 20, 2024

Did a bit of unprompted investigation into DVC. I don't think it's actually that hard to use DVC and Kedro together.

  • Level 0: Data files are tracked by DVC, and the Kedro catalog contains pointers to local filepaths. kedro run assumes the files are already in the workspace (dvc checkout). If they aren't, Kedro will fail with "no such file". Basically, Kedro doesn't know anything about DVC and vice versa.
  • Level 1: Data files are tracked by DVC, and the Kedro catalog contains pointers to dvc:// filepaths thanks to the fsspec-compatible DVCFileSystem (see footnote 1). Kedro would then never fail to read the files, since DVC would be in charge of doing an automatic checkout on read. Not much help from Kedro for outputs: those would still be written to a local filepath (if tracked by DVC) or to some other remote storage.
  • Level 2: Cooperative data tracking and versioning. Doesn't exist, and it's unclear what that might look like.
  • Level 3: Cooperative data tracking and pipeline definition. Probably difficult or not possible, too much overlap.

I think Level 0 and 1 are possible today without any changes in Kedro. The only problem is that it would probably work badly with versioned: true datasets.
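As a rough illustration of Level 0, the sketch below checks catalog filepaths before `kedro run` and reports files that DVC tracks but that are not in the workspace yet. It assumes each file was tracked with `dvc add <file>`, which writes a `<file>.dvc` pointer next to it; files produced as stage outputs in `dvc.yaml` would not be detected. The function name is hypothetical and nothing here is a Kedro or DVC API.

```python
from pathlib import Path


def find_unpulled(filepaths):
    """Return catalog filepaths that DVC tracks but that are missing locally.

    Assumption: a tracked file `data.csv` has a sibling pointer `data.csv.dvc`,
    as created by `dvc add data.csv`.
    """
    missing = []
    for filepath in filepaths:
        data_file = Path(filepath)
        pointer = data_file.with_name(data_file.name + ".dvc")
        # Pointer present but data absent: the file has not been checked out.
        if pointer.exists() and not data_file.exists():
            missing.append(str(data_file))
    return missing
```

If the returned list is non-empty, running `dvc pull` (or `dvc checkout` when the data is already in the local cache) before `kedro run` should avoid the "no such file" failure.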

I'm all for at least documenting what's possible today.

Footnotes

  1. It has existed in its current form for about 2 years.

@astrojuanlu added the "Component: Documentation 📄" label and removed the "Issue: Feature Request" label on Sep 20, 2024
@astrojuanlu changed the title from "[spike] Investigate possible integration with DVC" to "Document usage of Kedro + DVC" on Sep 20, 2024
5 participants