Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

User identity for telemetry events is not unique #333

Closed
astrojuanlu opened this issue Sep 8, 2023 · 19 comments · Fixed by #596
Closed

User identity for telemetry events is not unique #333

astrojuanlu opened this issue Sep 8, 2023 · 19 comments · Fixed by #596
Assignees

Comments

@astrojuanlu
Copy link
Member

astrojuanlu commented Sep 8, 2023

Description

kedro-telemetry uses the hashed username for the identity of the Heap event:

hashed_username = _get_hashed_username()
project_properties = _get_project_properties(hashed_username)
cli_properties = _format_user_cli_data(
project_properties, masked_command_args
)
_send_heap_event(
event_name=f"Command run: {main_command}",
identity=hashed_username,
properties=cli_properties,
)

def _get_hashed_username():
try:
username = getpass.getuser()
return _hash(username)

However, this is not enough to disambiguate users. An extreme example is that most people launching Kedro in a container will have the username root, hence all their events might be collected under the same identity!

❯ docker run -it --rm --platform linux/amd64 python:3.11-slim bash
root@6ac694d81117:/# pip install kedro-telemetry
...
>>> from kedro_telemetry.plugin import _get_hashed_username
>>> _get_hashed_username()
'99adc231b045331e514a516b4b7680f588e3823213abe901738bc3ad67b2f6fcb3c64efb93d18002588d3ccc1a49efbae1ce20cb43df36b38651f11fa75678e8'
>>> 
root@6ac694d81117:/# whoami
root

Context

If we are interested in number of users, we might be incurring in a dramatic under-reporting. The drawbacks of this are described in https://developers.heap.io/docs/using-identify#using-a-non-unique-identifier

This was done in #58 after discussion in #50. Before that, kedro-telemetry relied on hostname, which again is not necessarily unique.

identity=hashed_computer_name.hexdigest(),

Possible solution

For example, Orchest used to generate a TELEMETRY_UUID and store it in a centralized location:

https://github.com/orchest/orchest/blob/560103a5c7dbdf615a2e0df435076f0ee78b6a1d/services/orchest-api/app/app/utils.py#L431-L432

@astrojuanlu
Copy link
Member Author

I really like Pants anonymous telemetry collection system https://www.pantsbuild.org/docs/anonymous-telemetry

@astrojuanlu
Copy link
Member Author

One possible way to tackle this:

Try to read a user-wide id from a predefined, user-writable location (for example under ~/.config/kedro/ and equivalent on Windows, possibly using https://pypi.org/project/appdirs/) and, if it the file does not exist, write a new UUID4 https://docs.python.org/3/library/uuid.html#uuid.uuid4. If the file does exist but it's empty, or if the id is not a valid UUID4, or if the file cannot be written, or if anything else fails, then the user id would be None (double check if this is possible with Heap).

The config could have the form

$ cat ~/.config/kedro/settings.toml
telemetry_uuid = "2b030a71-4139-4816-9df4-222588f9027b"

(TOML choice and settings.toml filename is entirely arbitrary)

Thoughts?

@astrojuanlu
Copy link
Member Author

From #510 (comment), we see that GE, DVC, Evidently, Reflex do track unique users, so we can draw some inspiration from it.

Final file naming still TBC. Since this would be a kedro-telemetry specific config (and not kedro), we could go for something like ~/.config/kedro-telemetry.toml or similar.

@astrojuanlu
Copy link
Member Author

Let's make this

~/.config/kedro/telemetry.conf

equivalent home path using https://pypi.org/project/appdirs/. Format using ConfigParser https://docs.python.org/3/library/configparser.html

Mechanics as follows:

Try to read a user-wide id from ~/.config/kedro/telemetry.conf and, if it the file does not exist, write a new UUID4 https://docs.python.org/3/library/uuid.html#uuid.uuid4. If the file does exist but it's empty, or if the id is not a valid UUID4, or if the file cannot be written, or if anything else fails, then the identity in _send_heap_event must not be set (None or empty string).

@astrojuanlu astrojuanlu assigned DimedS and lrcouto and unassigned astrojuanlu Mar 4, 2024
@DimedS
Copy link
Contributor

DimedS commented Mar 4, 2024

Thank you for the thorough investigation, @astrojuanlu! I have a couple of questions regarding the approach:

  1. Does the implementation imply that we'll replace all existing usernames with new ones, including those that haven't posed any issues? Perhaps we should consider a strategy to preserve historical data.
  2. Will every container execution generate a new username, is it ok for us?

@astrojuanlu
Copy link
Member Author

Good questions!

  1. Historical data will remain unchanged. The burden is on whoever is analyzing that data to reconcile that information.
  2. Indeed, every new container execution will generate a new username. I think that's better than having it attributed to root, at least we know it comes from potentially different containers.

@DimedS
Copy link
Contributor

DimedS commented Mar 4, 2024

  1. Historical data will remain unchanged. The burden is on whoever is analyzing that data to reconcile that information.

It appears virtually impossible to reconcile data if we simply replace old usernames with new ones. Introducing a new variable for the new username and preserving both could be effective, making data merging much simpler, what do you think?

  1. Indeed, every new container execution will generate a new username. I think that's better than having it attributed to root, at least we know it comes from potentially different containers.

I agree with that, additionally, we might consider modifications to the Dockerfile in kedro-docker that could enable us to consistently save the same identifier for every run.

@astrojuanlu
Copy link
Member Author

astrojuanlu commented Mar 4, 2024

  1. Notice that the kedro-telemetry version is attached. So one can select the appropriate approach depending on that.
  2. Something we could do for kedro-docker, let's treat that separately

@DimedS
Copy link
Contributor

DimedS commented Mar 4, 2024

  1. Notice that the kedro-telemetry version is attached. So one can select the appropriate approach depending on that.

I mean, do we foresee a need in the future to identify that username XYZ2 was previously known as XYZ1, especially when analysing the historical interactions of users?

  1. Something we could do for kedro-docker, let's treat that separately

Ok!

@astrojuanlu
Copy link
Member Author

Clarified asynchronously: we're not retroactively rewriting existing data, not aiming for continuity or trying to link UUIDs with current user "ids" 👍

@DimedS
Copy link
Contributor

DimedS commented Mar 5, 2024

Try to read a user-wide id from ~/.config/kedro/telemetry.conf and, if it the file does not exist, write a new UUID4 https://docs.python.org/3/library/uuid.html#uuid.uuid4. If the file does exist but it's empty, or if the id is not a valid UUID4, or if the file cannot be written, or if anything else fails, then the identity in _send_heap_event must not be set (None or empty string).

@astrojuanlu , are you certain that when the file exists but is empty, or when it contains an invalid UUID4, we should proceed by sending an empty value instead of attempting to rectify the situation with a new ID? Perhaps it would be more effective to attempt generating and saving a new ID once, and only resort to sending an empty value and maintaining the status quo if that attempt fails?

@astrojuanlu
Copy link
Member Author

For full disclosure (since @idanov asked about this), I landed on .conf rather than .toml because currently the Python standard library doesn't have a way to write TOML, so it could be a way of saving 1 dependency. But I really don't mind about that and we could settle on TOML or YAML for everything.

@idanov @merelcht @DimedS do you have any thoughts?

@DimedS
Copy link
Contributor

DimedS commented Mar 6, 2024

For full disclosure (since @idanov asked about this), I landed on .conf rather than .toml because currently the Python standard library doesn't have a way to write TOML, so it could be a way of saving 1 dependency. But I really don't mind about that and we could settle on TOML or YAML for everything.

@idanov @merelcht @DimedS do you have any thoughts?

Looks like that kedro already has a dependency on toml, so incorporating .toml for config file instead to .conf, would not introduce any additional dependencies. If that's true, I agree with @idanov that using .toml would enhance consistency across the project.

But as I understand it, tomllib is now included in the Python standard library, serving solely for reading TOML files. For writing, an additional library, tomli_w, is required. Given this evolution in TOML library support within Python, do we have a strategy to address these changes in kedro? Should we consider transitioning away from the toml library?

@astrojuanlu
Copy link
Member Author

Should we consider transitioning away from the toml library?

IMHO, yes :) kedro-org/kedro#2569 (comment) we never opened an issue about it

@merelcht
Copy link
Member

merelcht commented Mar 7, 2024

Should we consider transitioning away from the toml library?

IMHO, yes :) kedro-org/kedro#2569 (comment) we never opened an issue about it

This might be a stupid question, but if we completely transition away from toml, how do we handle the pyproject.toml file? From the link you mention @astrojuanlu it sounds more like we should use just one standard toml library instead of a random third party one?

@astrojuanlu
Copy link
Member Author

Sorry, to clarify - I'm not advocating for moving away from the TOML format, but from using the toml library instead of stdlib https://docs.python.org/3/library/tomllib.html (and its backports for Python < 3.11)

@astrojuanlu
Copy link
Member Author

TOML discussion kedro-org/kedro#3690

@DimedS
Copy link
Contributor

DimedS commented Mar 8, 2024

In yesterday's stand-up, we agreed on the following actions:

  1. For the current PR, we'll transition from using .conf files to .toml, using the old toml library as previously utilized in kedro.
  2. Subsequently, we should approve and implement a new approach for all TOML files in a kedro, including code for unique users from that PR.

@astrojuanlu, do you agree with this plan or you prefer to finish that discussion first?

@astrojuanlu
Copy link
Member Author

Totally agreed! Thanks for the update @DimedS 👍🏽

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
4 participants