Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(telemetry): Single project identifier #701

Merged
merged 25 commits into from
Jun 4, 2024

Conversation

ElenaKhaustova
Copy link
Contributor

@ElenaKhaustova ElenaKhaustova commented May 28, 2024

Description

FIX #507

Related docs update kedro-org/kedro#3897

Development notes

Manually tested the following scenario:

  1. Created a project with consent
  • project id does not exist - it is generated and saved in pyproject.toml
  • project id exists - it is loaded from pyproject.toml, joined with the package name, hashed and used
  • package name changes, project id exists - it is loaded from pyproject.toml, joined with the new package name, hashed and used
  1. Created a project without consent
  • project id is not created
  1. Packaging project
  • package kedro project with modified pyproject.toml - works without errors
  • run packaged project with consent=True and existing pyproject.toml - project id is generated and saved in pyproject.toml
  • run packaged project with consent=True without pyproject.toml - project id is None

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
@ElenaKhaustova ElenaKhaustova marked this pull request as ready for review May 28, 2024 17:34
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
@ElenaKhaustova ElenaKhaustova marked this pull request as draft May 29, 2024 13:48
@ElenaKhaustova ElenaKhaustova self-assigned this May 29, 2024
@ElenaKhaustova ElenaKhaustova marked this pull request as ready for review May 29, 2024 23:57
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Copy link
Member

@astrojuanlu astrojuanlu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't do a full code review but I think this is mostly in line with what's expected for #507.

However, I'm still observing a weird behavior for packaged projects, and since this PR introduces a change that can modify pyproject.toml files unrelated to Kedro, I opened kedro-org/kedro#3902 in case it's something we want to look at more broadly. I'm fearing that this might lead to false positives (like: the user runs a packaged Kedro project, which does not have a project_uuid, but since there happens to be a pyproject.toml file there, we get a false UUID). Would love to get opinions on that one before merging this PR.

One last thought: I'm assuming that we will introduce a change in kedro new so that projects have this project_uuid on pyproject.toml right after creation. Any objections/thoughts?

@ElenaKhaustova
Copy link
Contributor Author

I didn't do a full code review but I think this is mostly in line with what's expected for #507.

However, I'm still observing a weird behavior for packaged projects, and since this PR introduces a change that can modify pyproject.toml files unrelated to Kedro, I opened kedro-org/kedro#3902 in case it's something we want to look at more broadly. I'm fearing that this might lead to false positives (like: the user runs a packaged Kedro project, which does not have a project_uuid, but since there happens to be a pyproject.toml file there, we get a false UUID). Would love to get opinions on that one before merging this PR.

One last thought: I'm assuming that we will introduce a change in kedro new so that projects have this project_uuid on pyproject.toml right after creation. Any objections/thoughts?

I understand your arguments on modifying pyproject.toml files unrelated to Kedro. But I don't find it weird to add the project_id since we're asking a consent once again for packaged projects, meaning we're going to collect the metadata. The alternative is to store it out of pyproject.toml but previously we agreed we don't want to generate extra files and our goal is to collect metadata at the development stage rather than production, so we do not worry about propagating project_id generated at the development stage to the production. However, I don't mind differentiating pyproject.toml files unrelated to Kedro and do not modify them, in other words having project_id=None for them.

Theoretically we don't need project_id before sending the telemetry but generating it with kedro new can simplify it a bit for us. For example, we would be able to add it to the [tool.kedro] without rewriting the file (breaking the formatting) and then do not write to the pyproject.toml when collecting telemetry (which is better, IMO). But we would need to think about cases when a project is created without kedro new though I don't think they're common.

@merelcht
Copy link
Member

I didn't immediately understand what @astrojuanlu meant with the pyproject.toml being changed with a packaged project but I managed to replicate that state and I agree with him that it's strange and probably not expected behaviour. However, one could argue that running a Kedro project inside a directory that contains another project and thus a pyproject.toml file is also unusual.

That being said, I'd prefer it if we can ensure that the pyproject.toml file we're editing has a [tool.kedro] section. In fact, I'd argue that the project_id should be added to [tool.kedro] and not in an entirely new section for kedro-telemetry.

@ElenaKhaustova
Copy link
Contributor Author

@merelcht, @astrojuanlu, thank you for sharing your thoughts. Based on them, I'll add logic to differentiate Kedro-related pyproject.toml and modify only it.

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a minor comment, but other than that this now looks good to me and works as I expected 🙂

return project_id
except KeyError:
logging.debug(
f"Failed to retrieve project id or save project id: {str(pyproject_path)} does not relate to Kedro"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can make this a bit more specific by mentioning that the file doesn't contain a tool.kedro section?

Copy link
Member

@astrojuanlu astrojuanlu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved with 1 comment ⭐

Comment on lines +96 to +102
project_id = pyproject_data["tool"]["kedro_telemetry"]["project_id"]
except KeyError:
project_id = uuid.uuid4().hex
toml_string = (
f'\n[tool.kedro_telemetry]\nproject_id = "{project_id}"\n'
)
file.write(toml_string)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with @merelcht above

In fact, I'd argue that the project_id should be added to [tool.kedro] and not in an entirely new section for kedro-telemetry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I 100% agree, there are two reasons why I added it to the separate section for now.

The first is because toml.dump changes formatting like so:

[project.optional-dependencies]
docs = [
    "docutils<0.18.0",
    "sphinx~=3.4.3",
    "sphinx_rtd_theme==0.5.1",
    "nbsphinx==0.8.1",
    "sphinx-autodoc-typehints==1.11.1",
    "sphinx_copybutton==0.3.1",
    "ipykernel>=5.3, <7.0",
    "Jinja2<3.1.0",
    "myst-parser~=0.17.2",
]

->

[project.optional-dependencies]
docs = [ "docutils<0.18.0", "sphinx~=3.4.3", "sphinx_rtd_theme==0.5.1", "nbsphinx==0.8.1", "sphinx-autodoc-typehints==1.11.1", "sphinx_copybutton==0.3.1", "ipykernel>=5.3, <7.0", "Jinja2<3.1.0", "myst-parser~=0.17.2",]

Which we do not like but I can replace it with custom file parsing to keep it proper.

  1. The main one is because we validate expected keys in the [tool.kedro] section and raise an error in case of mismatch: https://github.com/kedro-org/kedro/blob/27f5405cefd6701ffac4c6243030486fb7d3c942/kedro/framework/startup.py#L107

So, to make it properly, we need changes on the framework side: possibly updating kedro new, as we discussed above. The question is whether we want to merge it like this for now or wait until we apply updates on the side of the framework.

@merelcht, @astrojuanlu

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh that "found unexpected keys" indeed would make any extra key in [tool.kedro] unfeasible for older Kedro versions. I don't have an opinion right now on whether it makes sense to allow more keys going forward, but it would not be a solution for our case due to lots of old Kedro projects out there in the wild, I think.

About style preservation, I'm wondering if tomlkit is a better choice here, but again this would introduce an extra dependency for very little gain.

Waiting for @merelcht's signoff but I'm okay moving forward with this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ElenaKhaustova thanks for explaining! That's unfortunate, but in that case I'd also be in favour to just merge it like this. It's more of a "nice to have" than a must anyway.

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
@ElenaKhaustova ElenaKhaustova merged commit f157e1a into main Jun 4, 2024
9 checks passed
@ElenaKhaustova ElenaKhaustova deleted the feature/507-project-id branch June 4, 2024 10:15
tgoelles pushed a commit to tgoelles/kedro-plugins that referenced this pull request Jun 6, 2024
* Moved pyproject config name to constant

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Implemented _get_or_create_project_uuid

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Refactored _get_project_properties

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Fixed tests from hanging

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Fixed _is_known_ci_env

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Fixed test_before_command_run

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Fixed TestKedroTelemetryCLIHooks

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Fixed TestKedroTelemetryProjectHooks

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Fixed writing to pyproject.toml

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Fixed write mock

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Removed debug output

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Refactored _add_tool_properties

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Added debug message when pyproject_path does not exist

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Updated release notes

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Return None as project UUID in case of not generated

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Fixed pre-commit errors

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Updated the way project UUID is stored

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Renamed project_uuid -> project_id

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Check if pyproject file relates to kedro

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Changed debug message as suggested

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Added OSError handling

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

* Fixed unit test

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>

---------

Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: tgoelles <thomas.goelles@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Have a single way to identify a project in our telemetry
4 participants