Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve running kedro as part of an automated workflow (CI/CD) #1640

Closed
2 of 3 tasks
merelcht opened this issue Jun 21, 2022 · 18 comments
Closed
2 of 3 tasks

Improve running kedro as part of an automated workflow (CI/CD) #1640

merelcht opened this issue Jun 21, 2022 · 18 comments
Labels
Component: CLI Issue/PR that addresses the CLI for Kedro Component: Documentation 📄 Issue/PR for markdown and API documentation

Comments

@merelcht
Copy link
Member

merelcht commented Jun 21, 2022

Description

When any kedro command is executed for the first time or in a clean environment (which is often the case in CI/CD) the telemetry prompt gets run. The user then has to answer Yes or No to running telemetry. If no user is involved, e.g. in an automated CI/CD workflow a hack needs to be put in place to programatically add a .telemetry file.

Summary of action (Edited on 2023-07-31)

Possible Implementation

The above hack works, but isn't well known. An alternative way of solving this issue is by running the command in a form like: yes | kedro new. We need to document how people can accept/deny the telemetry tracking automatically.

@deepyaman
Copy link
Member

An alternative way of solving this issue is by running the command in a form like: yes | kedro new.

There are probably interesting repercussions to recommending a workflow like this. Anybody who followed such a workflow would contribute to the telemetry stats, adding a lot of automated runs (no different to adding a .telemetry file turning it on programmatically). If you don't want to include stats from CI runs, maybe should prefer yes no | kedro new. 😂

I think it would be good to take this opportunity to distinguish CI runs in telemetry. Maybe can read an environment variable; most CI systems (GitHub Actions, CircleCI, Travis CI, etc.) follow the convention of setting CI=true, so I would recommend detecting presence of the environment variable, solving the telemetry prompt automatically, and recording commands as belonging to a CI run--and ideally not requiring any additional configuration or hacks to avoid this telemetry prompt.

I think the one question this leaves is, what do we do if users want to opt out of telemetry in the CI. Is it OK to have this be documented, but not prompted by default, or does this introduce some risk?

@antonymilne
Copy link
Contributor

antonymilne commented Jun 22, 2022

Totally agree with everything @deepyaman said here. If we want to recommend in documentation a command that programmatically disables telemetry in CI then let's go for the clearer echo "consent: false" > .telemetry rather than yes no 😀

But using the CI variable is is a much better solution here. I don't think anyone on the team was aware of its existence before (we've just been kind of guessing which telemetry data come from CI jobs) but that is perfect 👍 👍 👍 As Deepyaman says, the only question then is whether CI=true should disable or enable telemetry by default. In the past I think we've been regarding CI runs as contaminating our telemetry data, so probably it's easiest and safest from the risk perspective to just disable telemetry in this case. But that does mean that in future if we do actually want to look at CI telemetry data we wouldn't have anything available on it... @yetudada?

@antonymilne antonymilne added the Component: CLI Issue/PR that addresses the CLI for Kedro label Jun 22, 2022
@deepyaman
Copy link
Member

In the past I think we've been regarding CI runs as contaminating our telemetry data, so probably it's easiest and safest from the risk perspective to just disable telemetry in this case. But that does mean that in future if we do actually want to look at CI telemetry data we wouldn't have anything available on it...

Another thought I had this morning was to enable telemetry by default in CI environments, but to be even more conservative in what's collected by kedro-telemetry in those cases. Not sure that's necessary, though; I feel like it's quite conservative by default (but I'm also not that familiar with it).

I would advocate as capturing the CI variable in telemetry regardless of whether the default is to have it on or off.

@yetudada
Copy link
Contributor

From a data perspective I would still like to see to understand how many Kedro projects are being run with CI, because it helps give us a sense of projects that have moved into production systems. It would be great to figure out how to identify a "user" that is CI but also still observe the telemetry consent that the user set.

@merelcht merelcht added the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Aug 16, 2022
@merelcht
Copy link
Member Author

merelcht commented Aug 17, 2022

We discussed this issue in a Technical Design session and decided on the following:

  • We do want to capture telemetry for projects run on CI. This gives us insights into how many Kedro projects are run as part of an automated CI workflow, which in turn will help us understand how many Kedro projects are part of production systems.
  • Use the CI environment variable to indicate that data comes from a CI environment
  • The user still needs to be allowed to decide whether to run telemetry on their CI Kedro projects or not, so we will not implement a system to automatically track the data if it's part of a CI run.
  • We need to document how a user can automatically give/deny consent to telemetry, by describing how to programatically create a .telemetry file
  • We want to implement some logic in the kedro new command that detects if it's being run in CI and if so, whether consent has been given/not and if not then it will advice the user on how to do this.

To do:

  • Update kedro-telemetry to use the CI environment variable to indicate that data comes from a CI environment
  • Document how a user can automatically give/deny consent to telemetry, by describing how to programatically create a .telemetry file
  • Implement logic in the kedro new command that detects if it's being run in CI and if so, whether consent has been given and if that's not the case then it will advice the user on how to do this. Add an option for kedro new and kedro run to skip telemetry (e.g. for CI/CD environment) #2867

@merelcht merelcht removed the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Aug 17, 2022
@noklam noklam changed the title Improve running kedro as part of an automated workflow Improve running kedro as part of an automated workflow (CI/CD) Jun 14, 2023
@astrojuanlu
Copy link
Member

astrojuanlu commented Jun 22, 2023

Today there was a question about this and I just wanted to confirm that the "Do you opt into usage analytics? [y/N]:` prompt does not block non-interactive terminals. Example:

FROM python:3.10

RUN pip install kedro

WORKDIR /src

RUN echo "project_name: Spaceflights" > kedro.yml
RUN kedro new --config=kedro.yml --starter=spaceflights

WORKDIR /src/spaceflights

RUN pip install -r src/requirements.txt

RUN kedro run
$ docker build . --platform linux/amd64 -t kedro-test:dev
DEPRECATED: The legacy builder is deprecated and will be removed in a future release.
            Install the buildx component to build images with BuildKit:
            https://docs.docker.com/go/buildx/

Sending build context to Docker daemon  796.3MB
Step 1/8 : FROM python:3.10
 ---> 700e6de1daef
Step 2/8 : RUN pip install kedro
...
Step 8/8 : RUN kedro run
 ---> Running in e6b62eb0a4cf
As an open-source project, we collect usage analytics. 
We cannot see nor store information contained in a Kedro project. 
You can find out more by reading our privacy notice: 
https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#privacy-notice 
Do you opt into usage analytics?  [y/N]: [06/22/23 07:05:15] WARNING  Failed to confirm consent. No data    plugin.py:265
                             was sent to Heap. Exception:                       
[06/22/23 07:05:15] INFO     Kedro project spaceflights           session.py:359
As an open-source project, we collect usage analytics. 
We cannot see nor store information contained in a Kedro project. 
You can find out more by reading our privacy notice: 
https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#privacy-notice 
Do you opt into usage analytics?  [y/N]:                     WARNING  Failed to confirm consent. No data    plugin.py:265
                             was sent to Heap. Exception:                       
                    INFO     Loading data from 'companies'   data_catalog.py:345
                             (CSVDataSet)...                                    
                    INFO     Running node:                           node.py:331
...

@juanchodpg2
Copy link

Today there was a question about this and I just wanted to confirm that the "Do you opt into usage analytics? [y/N]:` prompt does not block non-interactive terminals. Example:

FROM python:3.10

RUN pip install kedro

WORKDIR /src

RUN echo "project_name: Spaceflights" > kedro.yml
RUN kedro new --config=kedro.yml --starter=spaceflights

WORKDIR /src/spaceflights

RUN pip install -r src/requirements.txt

RUN kedro run
$ docker build . --platform linux/amd64 -t kedro-test:dev
DEPRECATED: The legacy builder is deprecated and will be removed in a future release.
            Install the buildx component to build images with BuildKit:
            https://docs.docker.com/go/buildx/

Sending build context to Docker daemon  796.3MB
Step 1/8 : FROM python:3.10
 ---> 700e6de1daef
Step 2/8 : RUN pip install kedro
...
Step 8/8 : RUN kedro run
 ---> Running in e6b62eb0a4cf
As an open-source project, we collect usage analytics. 
We cannot see nor store information contained in a Kedro project. 
You can find out more by reading our privacy notice: 
https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#privacy-notice 
Do you opt into usage analytics?  [y/N]: [06/22/23 07:05:15] WARNING  Failed to confirm consent. No data    plugin.py:265
                             was sent to Heap. Exception:                       
[06/22/23 07:05:15] INFO     Kedro project spaceflights           session.py:359
As an open-source project, we collect usage analytics. 
We cannot see nor store information contained in a Kedro project. 
You can find out more by reading our privacy notice: 
https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#privacy-notice 
Do you opt into usage analytics?  [y/N]:                     WARNING  Failed to confirm consent. No data    plugin.py:265
                             was sent to Heap. Exception:                       
                    INFO     Loading data from 'companies'   data_catalog.py:345
                             (CSVDataSet)...                                    
                    INFO     Running node:                           node.py:331
...

For me it does keep hanging in a Databricks job, with Kedro==0.18.7 installed.

@astrojuanlu
Copy link
Member

Thanks @juanchodpg2, good to know. You can create a .telemetry file with the appropriate content to opt in or out of telemetry and avoid the blocking https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#how-do-i-consent-to-the-use-of-kedro-telemetry

@astrojuanlu
Copy link
Member

@juanchodpg2 reports that creating a .telemetry file is finicky and doesn't always work, probably because of having to place it at the right directory (still unclear) https://linen-slack.kedro.org/t/16140465/hi-all-i-am-having-issues-trying-to-run-a-databricks-job-usi#71319d4a-493b-46fc-9f02-6fdf0cad3898

We might need to bump the priority of this. Getting a solid consent process is a blocker for kedro-org/kedro-plugins#375

@astrojuanlu
Copy link
Member

Let's continue the conversation about kedro-telemetry being blocked on Databricks and other places in kedro-org/kedro-plugins#484

@datajoely
Copy link
Contributor

Also most CI platforms include the CI=True env var - we should detect that

@astrojuanlu
Copy link
Member

My original example on #1640 (comment) had a mistake, it didn't install kedro-telemetry. Updated version:

FROM python:3.10

RUN pip install kedro kedro-telemetry

WORKDIR /src

RUN kedro new --name=spaceflights --starter=spaceflights-pandas

WORKDIR /src/spaceflights

RUN pip install -r requirements.txt

RUN kedro run

and I confirm the docker build doesn't block:

$ docker build . --platform linux/amd64 -t kedro-test:dev
...
---> [Warning] The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
 ---> Running in ffb042e40704
<jemalloc>: MADV_DONTNEED does not work (memset will be used instead)
<jemalloc>: (This is the expected behaviour if you are running under QEMU)
As an open-source project, we collect usage analytics. 
We cannot see nor store information contained in a Kedro project. 
You can find out more by reading our privacy notice: 
https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#privacy-notice 
Do you opt into usage analytics?  [y/N]: [02/19/24 13:33:36] WARNING  Failed to confirm consent. No data    plugin.py:276
                             was sent to Heap. Exception:                       
                    INFO     Kedro project spaceflights           session.py:321
As an open-source project, we collect usage analytics. 
We cannot see nor store information contained in a Kedro project. 
You can find out more by reading our privacy notice: 
https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#privacy-notice 
Do you opt into usage analytics?  [y/N]: [02/19/24 13:33:37] WARNING  Failed to confirm consent. No data    plugin.py:276
                             was sent to Heap. Exception:                       
[02/19/24 13:33:39] INFO     Using synchronous mode for  sequential_runner.py:64
...

Each platform needs to be understood separately.

@m-gris
Copy link

m-gris commented May 8, 2024

Any update on this one ? :)

@astrojuanlu
Copy link
Member

Hi @m-gris! To clarify, are you running Kedro on an unattended setting and being blocked by this? How does your setup look like?

The best workaround for now is to remove kedro-telemetry from the requirements.

@m-gris
Copy link

m-gris commented May 9, 2024

Hi @astrojuanlu
It's just that I had the first node/task in an airflow pipeline that was stuck... 😅
I simply did a manual kedro run from within the container... (Could have created .telemetry manually as well)
But your suggestion definitely makes for a better workaround.
Thanks a lot,
M.

@astrojuanlu
Copy link
Member

To set expectations clear, #2867 is fine for a workaround but the default kedro run experience should be seamless. We are still working on making that a reality.

@astrojuanlu
Copy link
Member

astrojuanlu commented May 24, 2024

Getting back to this after discussing it a bit more. We decided to not go with #2867.

In the future we want that telemetry prompt to disappear. This means that, whenever the user runs a Kedro command, we will assume a default answer, and we will also add more mechanisms for users to override that answer on top of the current .telemetry file (for example https://consoledonottrack.com/).

However, we are still in the process of minimising the amount of data we collect to make this more palatable to users, so we are not quite there yet.

We will give more details in due time.

@astrojuanlu
Copy link
Member

Progress towards opt-out consent is being tracked in other issues. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: CLI Issue/PR that addresses the CLI for Kedro Component: Documentation 📄 Issue/PR for markdown and API documentation
Projects
Archived in project
Development

No branches or pull requests

8 participants