ajithvcoder/dvc-gdrive-workflow-setup

Using Service Account Method with DVC in GitHub CI/CD Pipeline

Content

  1. Setup Service account and Google drive folder
  2. Local Setup
  3. GitHub Repo setup
  4. GitHub Actions

Setup Service account and Google drive folder

In the Google Cloud console, click APIs & Services, then click Enable APIs and Services.

Enable the Drive Labels API, the Google Drive API, and the Google Drive Activity API.

Set up the service account and get the JSON key

With this method, we store data in Google Drive and fetch it using service account authentication.

To create a service account, navigate to IAM & Admin in the left sidebar, and select Service Accounts.

Click + CREATE SERVICE ACCOUNT and enter a service account name (e.g., "My DVC Project"). If you are new and unsure which permissions to choose, the simplest option is to grant the Owner role, although it is very broad.

Add every user account that needs access.

Then click CREATE AND CONTINUE. Click DONE, and you will be returned to the overview page.

Your service account now appears in the list; click it and open the Keys tab.

Under Add Key, select Create New Key, choose JSON, and click CREATE.

Download the generated projectname-xxxxxx.json key file to a safe location.

Important: Store the key file in a local folder as credentials.json, but do not commit it to GitHub. If you do, GitHub will raise a warning and notify Google, which will revoke the credentials.
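One way to keep the key out of Git is to add its file name to .gitignore before the first commit. The following is a minimal sketch, assuming the key was saved as credentials.json in the repository root; ignore_key_file is a hypothetical helper name, not part of Git or DVC.

```shell
# ignore_key_file (hypothetical helper): add a file name to .gitignore
# unless an identical line is already there.
#   $1: file name to ignore, e.g. credentials.json
#   $2: ignore file to edit (defaults to .gitignore in the current folder)
ignore_key_file() {
  local name="$1"
  local ignore_file="${2:-.gitignore}"
  touch "$ignore_file"
  # Nothing to do if the exact line is already present.
  if grep -qxF -- "$name" "$ignore_file"; then
    return 0
  fi
  # Make sure the file ends with a newline before appending.
  if [ -n "$(tail -c1 "$ignore_file")" ]; then
    printf '\n' >> "$ignore_file"
  fi
  printf '%s\n' "$name" >> "$ignore_file"
}

# Usage (run from the repository root):
#   ignore_key_file credentials.json
#   git check-ignore -v credentials.json   # confirms Git ignores the file
```

Running the helper twice is harmless, since the second call finds the existing line and leaves the file unchanged.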

Google Drive folder

Create a folder in your Google Drive. I created one named "dvc-storage-test".

Important: Share the folder with your service account's email address and give it Editor access. For example, my service account email is "ajithvcodernew@devcmanager.iam.gserviceaccount.com". I also set the folder to "Anyone with the link" with Editor access, but that alone is not guaranteed to work: the folder should be shared with specific users (or groups) so they can use it with DVC.

Now get the ID of the folder. For example, my folder URL is https://drive.google.com/drive/folders/1b4v577_NGcEuUZK3WP6vZe5G9dBTGcsG, so my Google Drive folder ID is 1b4v577_NGcEuUZK3WP6vZe5G9dBTGcsG.

The URL to add to the DVC config later is gdrive://1b4v577_NGcEuUZK3WP6vZe5G9dBTGcsG, i.e. gdrive://<your_gdrive_folder_id>.
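The folder ID is simply the last path segment of the folder URL, so it can be extracted with shell parameter expansion. A small sketch, assuming the URL has the /drive/folders/<id> shape shown above; gdrive_remote_url is a hypothetical helper name, not a DVC command.

```shell
# gdrive_remote_url (hypothetical helper): turn a Google Drive folder link
# into the gdrive:// URL that DVC expects.
gdrive_remote_url() {
  local url="$1"
  local id="${url%%\?*}"   # drop any query string such as ?usp=sharing
  id="${id%/}"             # drop a trailing slash, if present
  id="${id##*/}"           # keep the last path segment, the folder ID
  printf 'gdrive://%s\n' "$id"
}

gdrive_remote_url "https://drive.google.com/drive/folders/1b4v577_NGcEuUZK3WP6vZe5G9dBTGcsG"
# prints gdrive://1b4v577_NGcEuUZK3WP6vZe5G9dBTGcsG
```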

Local Setup

From here on, the local setup needs two things: the JSON key file you downloaded (dvcmanager-38xxxxxxx.json in my case, or whatever projectname-xxx.json you received) and the Google Drive URL (gdrive://<your_gdrive_folder_id>).

Tree example

|- data
|----dataset
|--------Beagle
|--------Boxer
|-------- etc folders
|-------- etc folders

  • Install dvc and dvc-gdrive

pip install dvc dvc-gdrive

  • Run git init (if you are not in a git folder already)

  • Run dvc init

  • Run dvc remote add -d myremote gdrive://<your_gdrive_folder_id>.

For example: dvc remote add -d myremote gdrive://1b4v577_NGcEuUZK3WP6vZe5G9dBTGcsG

You will see that "myremote" has been added to the .dvc/config file.

  • Run dvc remote modify myremote gdrive_use_service_account true

  • Run dvc remote modify myremote gdrive_acknowledge_abuse true

  • Run dvc remote modify --local myremote gdrive_service_account_json_file_path path/to/file.json. For example: dvc remote modify --local myremote gdrive_service_account_json_file_path devcmanager-385390fe7f4f.json

Your .dvc/config file should now contain a similar configuration. Because of --local, the key file path is written to .dvc/config.local, which Git does not track.

  • Run dvc add data, i.e. dvc add <data_folder_name>

  • Run dvc config core.autostage true (optional)

  • Run dvc push -r myremote -v

  • Pushing around 800 MB of data takes about 10 minutes locally; in GitHub Actions, allow about 15 minutes.

Now check your Google Drive folder; you should see a folder named "files" inside it.
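The local steps above can be collected into one function. This is only a sketch of the same command sequence, assuming git init has already been run and dvc and dvc-gdrive are installed; run and setup_dvc_gdrive are hypothetical helper names, and DRY_RUN is an illustrative switch that prints the commands instead of executing them.

```shell
# run (hypothetical helper): execute a command, or only print it
# when DRY_RUN=1 is set.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# setup_dvc_gdrive (hypothetical helper): the local setup steps in order.
#   $1: Google Drive folder ID
#   $2: path to the service account JSON key
#   $3: data folder to track (defaults to "data")
setup_dvc_gdrive() {
  local folder_id="$1" key_file="$2" data_dir="${3:-data}"
  run dvc init &&
  run dvc remote add -d myremote "gdrive://${folder_id}" &&
  run dvc remote modify myremote gdrive_use_service_account true &&
  run dvc remote modify myremote gdrive_acknowledge_abuse true &&
  run dvc remote modify --local myremote gdrive_service_account_json_file_path "$key_file" &&
  run dvc add "$data_dir" &&
  run dvc push -r myremote -v
}

# Usage:
#   DRY_RUN=1 setup_dvc_gdrive <your_gdrive_folder_id> credentials.json   # preview
#   setup_dvc_gdrive <your_gdrive_folder_id> credentials.json             # run for real
```

Chaining the steps with && stops the sequence at the first failing command, so a wrong folder ID or key path does not lead to a misleading push attempt.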

GitHub Repo setup

  • Now push this to your GitHub repository. Note that inside the .dvc folder, by default, only the "config" and ".gitignore" files can be pushed. Don't change this; leave it as is.

  • Important: Never push the project-xxx.json file. If you do, Google will identify it and revoke the token; you'll need to set the key again.

  • Add only .dvc/config, .gitignore, and data.dvc files.

  • After pushing, open your repository's Settings on GitHub, click "Secrets and variables" -> "Actions" -> "New repository secret", and create a secret named "GDRIVE_CREDENTIALS_DATA". Paste the content of your project-xxx.json file (the credentials.json file) into the value field.

GitHub Actions

  • Create a .github/workflows folder locally for setting up your GitHub Actions workflow.

  • You can refer to the dvc-pipeline.yml file for complete content.

Below is the code used to set up authentication and pull data from Google Drive inside GitHub CI/CD:

      # Note: you can also expose "GDRIVE_CREDENTIALS_DATA" directly as an env variable and pull with it
      - name: Create credentials.json
        env:
          GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}
        run: |
          # Quote the variable so the JSON is written unchanged (no word splitting or globbing)
          printf '%s' "$GDRIVE_CREDENTIALS_DATA" > credentials_1.json

      - name: Modify DVC Remote
        run: |
          uv run dvc remote modify --local myremote gdrive_service_account_json_file_path credentials_1.json

      - name: DVC Pull Data
        run: |
          uv run dvc pull -v
  • Now you can trigger the workflow by clicking "Run workflow" in GitHub Actions.

Note: I used uv in the GitHub workflow to create a virtual environment because dvc-gdrive was causing issues on the GitHub-hosted runner. You can also run the dvc commands without the uv run prefix.

  • You might see an error in the log at first, but it is not a problem; wait a while, as the files are still being downloaded in the background.

  • After about 5 minutes (depending on the data size), you should see a successful run.
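The "Create credentials.json" step can be tried locally before relying on it in CI. The sketch below writes the secret with printf and a quoted variable, which avoids the word splitting and glob expansion that an unquoted echo is subject to. write_credentials is a hypothetical helper name, and the restrictive umask is an added precaution, not something the workflow above does.

```shell
# write_credentials (hypothetical helper): write the content of
# GDRIVE_CREDENTIALS_DATA to a file, byte for byte.
#   $1: output path, e.g. credentials_1.json
write_credentials() {
  local out_file="$1"
  # Abort with a message if the variable is unset or empty.
  : "${GDRIVE_CREDENTIALS_DATA:?GDRIVE_CREDENTIALS_DATA is not set}"
  # The subshell keeps the umask change local; the file is readable
  # only by the current user.
  ( umask 077; printf '%s' "$GDRIVE_CREDENTIALS_DATA" > "$out_file" )
}

# Usage:
#   export GDRIVE_CREDENTIALS_DATA="$(cat credentials.json)"
#   write_credentials credentials_1.json
#   dvc remote modify --local myremote gdrive_service_account_json_file_path credentials_1.json
```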
