Skip to content

Conversation

@sadhana-r
Copy link
Contributor

No description provided.

@sadhana-r
Copy link
Contributor Author

@ebrahimebrahim before checking out this branch, you need to initialize dvc in the main branch. So you can run:
dvc init
dvc checkout dvc_setup -f

Then you can try run dvc pull to check that you are able to download the database.

@sadhana-r sadhana-r marked this pull request as ready for review September 27, 2024 16:11
Copy link
Collaborator

@ebrahimebrahim ebrahimebrahim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is super cool, tysm for putting it together!

A small thing:

  • Can you add "dvc[gdrive]" to the dev requirements list here in pyproject.toml? This way when someone does pip install of openlifu with the [dev] option it includes dvc as a developer's tool, with the needed google drive support.

Bigger thing, possibly my issue: I couldn't get it to work,after dvc pull I get a google drive login in a browser and then when I select my google acct (my work acct which should have access to our shared data folder) I get this in the browser:

image

Did you run into this by any chance or do you know what we might be doing differently? My dvc version is 3.55.2

@ebrahimebrahim ebrahimebrahim linked an issue Sep 27, 2024 that may be closed by this pull request
@sadhana-r
Copy link
Contributor Author

Bigger thing, possibly my issue: I couldn't get it to work,after dvc pull I get a google drive login in a browser and then when I select my google acct (my work acct which should have access to our shared data folder) I get this in the browser:

image

Did you run into this by any chance or do you know what we might be doing differently? My dvc version is 3.55.2

Interesting, seems like this is a recent issue due to Google blocking the dvc app till further verifications are complete. See here

I followed the solution in that link and created a custom app and authorization credentials following the instructions here. Could you test it out again and see if I have set it up correctly? I will share the credentials file with you.

sadhana-r added a commit that referenced this pull request Sep 29, 2024
pyproject.toml Outdated
dev = [
"pytest >=6",
"pytest-cov >=3",
dvc[gdrive]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this just needs to be in quotes. Also good to have a trailing comma.

If you want to test that the pyproject.toml works, you can create a fresh python environment, and then pip install from the directory:

pip install /path/to/openlifu-python-repo

Adding [dev] will install with these options:

pip install /path/to/openlifu-python-repo[dev]

@ebrahimebrahim
Copy link
Collaborator

ebrahimebrahim commented Sep 30, 2024

One thing that will help you to avoid going back and forth to fix linting errors is to install the pre-commit hooks. They will auto-reformat to fix most of trivial linting issues every time you commit. To install pre-commit hooks:

pip install pre-commit # installs the pre-commit program to your python environment
pre-commit install # uses pre-commit program to install the local pre-commit configuration in this repo as a hook that will be run whenever you commit something

@ebrahimebrahim
Copy link
Collaborator

ebrahimebrahim commented Sep 30, 2024

Hmm it does seem that google has broken this way of using DVC with google drive 😐

Okay so in order to get this to work I had to do this after dvc init:

dvc remote modify shared_gdrive gdrive_client_id "<the client ID from the json file that you shared with me>"
dvc remote modify shared_grdive gdrive_client_secret "<the client secret from the json file that you shared with me>"

Then dvc pull worked.


A few points and unanswered questions:

  • I see you have a commit that added the client ID and secret and then another commit that removes them. I imagine the client ID is okay to include, but what is the nature of the client secret? If we expose that then what does that allow someone to do? Is that secret just linked to that one google cloud project? So it would just enable someone to download the data files as many times as they want? If that's the case then the worst that can happen is that they run out the limit set by google cloud. That might be okay -- if we ever hit that limit we can change backends or something. Basically I want to answer the question: what is the appropriate way to share out the access keys so that developers on this repo can do dvc pull?
    • (If our intention is to ultimately not publicize the gdrive client secret then we should squash those two commits, and perhaps generate a new secret and retire the old one since it is exposed here in the git history)
  • Once we have an agreed upon way of doing things, let's include instructions in the README or somewhere for how to use DVC in this repo.
  • When I dvc pulled, the data I saw didn't look like it is exactly the contents of db-extended-example-v04.zip

@ebrahimebrahim
Copy link
Collaborator

Discussed this with @jcfr and here are a few points to consider moving forward:

  • Don't commit something that calls itself "secret". Even if it makes sense in terms of our access requirements as is the case here, it looks bad and attracts exploitation by bots that seek out this kind of thing.
  • Think about how external contribution would work, if we want the data to be openly accessible and if we want to track data and and code changes in parallel. If an external contributor forks the repo to make a PR intended for merging upstream, then what happens when they try to dvc push? Where does the data go and what gives them access to be able to push there?
  • Consider the other dvc remote storage options besides google drive. One suggestion was to create a small flask app that handles all the authentication in whatever custom way we want, and use dvc with an http storage backend. (I am not sure I have the skill/time to do this now but it's an appealing idea)
  • This might also be a bit much for the current project right now, but it's really neat: In order to address the problem of managin Slicer test data, JC and Slicer devs have cobbled together a custom content-addressable storage framework using the github release functionality and some scripts. Check it out here: https://github.com/Slicer/SlicerTestingData
  • For now if we just want a quick solution, I think I am okay with giving in and making the data private on google drive, not sharing the key. It's okay to commit the client ID, but don't commit the secret and instead just share it among developers. And of course generate a new secret and retire the one that was compromised in this PR

@sadhana-r
Copy link
Contributor Author

@ebrahimebrahim I can't figure out why the README isn't rendering correcly. I haven't used RST before and don't have the correct tooling setup for proper local rendering. Could you please check that during your review and let me know where it could be going wrong. I will share the updated gdrive_client_secret with you.

@ebrahimebrahim ebrahimebrahim self-requested a review October 8, 2024 14:30
Copy link
Collaborator

@ebrahimebrahim ebrahimebrahim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't get it to work. Here is exactly what I did, with a fresh clone of the repo and a fresh python environment just to be sure:

python3.10 -m venv .venv
. .venv/bin/activate
git clone git@github.com:OpenwaterHealth/OpenLIFU-python.git
cd OpenLIFU-python/
git checkout dvc_setup
pip install .[dev]
dvc remote modify shared_gdrive gdrive_client_secret <the_secret_that_you_sent_me>
dvc pull

It opened a google login in the browser and had me authenticate with google 3 or 4 times, then it stopped and said in the terminal:

ERROR: failed to connect to gdrive (1rGAzqdikeTUzUQADjdP7-fTPdfF15eaO/files/md5) - Failed to authenticate GDrive: OAuth2 code exchange failed: invalid_clientUnauthorized
Fetching
ERROR: failed to pull data from the cloud - 1 files failed to download

Other comments:

  • I'm not a fan of RST either, but I kept it initially because there was already some auto-documentation thing set up to use it. We should definitely switch to markdown at some point. This online tool I just found is how I debugged your readme
  • Some of the commits reference #121 (the issue) while others reference #122 (this PR). Make them all reference #121. (One example of how to do this is: update your local main branch, check out this dvc_setup branch, do git rebase -i main, then in the editor you can mark "reword" on the commits for which you want to reword the message)
  • There is still a secret key in the commit 938f2f0 -- we should not have this for the reasons discussed above, even if the secret is no longer valid. This can be eliminated from the history also with an interactive rebase as mentioned above, squashing the commit that adds the key and the commit that removes it.

README.rst Outdated
for user access authentication.

DVC usage
^^^^^^^^^
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apparently the hierarchy of heading markers is inferred from the document itself, and it doesn't like when you skip a heading level
https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#sections

Suggested change
^^^^^^^^^
~~~~~~~~~

README.rst Outdated
pip install -e .[dev]
Version control of database using DVC (Data Version Control)
------------
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the underline heading marker has to go all the way to the end of the text.

(This heading is kind of long though.)

Suggested change
------------
------------------------------------------------------------

Comment on lines 74 to 76
git pull
dvc pull # Requires access to remote storage
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't there also be a command to add the client secret? Something like

dvc remote modify shared_gdrive gdrive_client_secret <client_secret_here>

(and we tell the reader that they need to contact the developers to get the client secret).

Really the readme should be up front about data access currently being restricted and requiring an access key to be shared by the developers

README.rst Outdated
------------

Data Version Control (DVC) is a data management tool that is meant to be run alongside Git.
In this project DVC is used to link changes in the code to specific versions of the database. DVC can be
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to the reader what is "the database".
Maybe we can call it "database folder of example data" or something like that.

or be more generic and just call it "project data such as an example database" so that it holds true no matter what kind of project-associated data we end up trackign with dvc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And thanks for sharing the RST tool!

@sadhana-r
Copy link
Contributor Author

I couldn't get it to work. Here is exactly what I did, with a fresh clone of the repo and a fresh python environment just to be sure:

I had previously just update the client_secret but seems like I should have created a whole new id/secret pair. I have now created a new client_id as well and updated the config. I tested it out with a clean install and it seems to work. Hopefully it works for you now as well! Will share the new secret.

I have also now cleaned up the commit history.

README.rst Outdated
.. code:: sh
git pull
dvc remote modify shared_gdrive gdrive_client_secret <client_secret_here> # Contact developers for grive_client_secret
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking for a way to prevent the secret from affecting a file that is tracked by git, because it seemed to easy for a developer to accidentally commit it in the future (the secret would end up stored inside .dvc/config which would show up as a change to potentially be committed). I found that if you add --local like this the key will instead go into .dvc/config.local which is already in .dvc/.gitignore so the danger goes away

Suggested change
dvc remote modify shared_gdrive gdrive_client_secret <client_secret_here> # Contact developers for grive_client_secret
dvc remote modify --local shared_gdrive gdrive_client_secret <client_secret_here> # Contact developers for grive_client_secret

Copy link
Collaborator

@ebrahimebrahim ebrahimebrahim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything is working for me, yay!

Thanks so much for this it's such an improvement

@ebrahimebrahim ebrahimebrahim merged commit 03d5ca3 into main Oct 9, 2024
ebrahimebrahim added a commit that referenced this pull request Feb 28, 2025
I made a mistake while merging where I overwrote some previous DVC
changes.

Issue #122
ebrahimebrahim added a commit that referenced this pull request Feb 28, 2025
I made a mistake while merging where I overwrote some previous DVC
changes.

Issue #122
ebrahimebrahim added a commit that referenced this pull request Feb 28, 2025
I made a mistake while merging where I overwrote some previous DVC
changes.

Issue #122
@ebrahimebrahim ebrahimebrahim deleted the dvc_setup branch March 12, 2025 01:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Setup dvc for database versioning

3 participants