Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update eager: make sure client secret can be specified as env var #2720

Merged
merged 14 commits into from
Sep 5, 2024

Conversation

cosmicBboy
Copy link
Contributor

@cosmicBboy cosmicBboy commented Aug 29, 2024

Why are the changes needed?

Currently, the @eager workflows are too strict with its handling of the client secret values. Both secret group and key need to be specified for the secret request to be passed down to the underlying task. There is also an issue with the async event loop not being available for the downstream data persistence step in the task execution.

What changes were proposed in this pull request?

This PR loosens this requirement to be compatible with Flyte backends that do not require secret groups. This PR also:

  • Allows the user to bind the client secret to an environment variable in cases where the FlyteRemote object can authenticate with just the client secret
  • Makes sure that an async loop is available for downstream data persistence step in the task execution

How was this patch tested?

The following workflow was tested on flyte sandbox

from flytekit import task, ImageSpec
from flytekit.configuration import Config
from flytekit.experimental import eager
from flytekit.remote import FlyteRemote

flytekit = "git+https://github.com/flyteorg/flytekit@54c4471a56311ff8201ba76d9bda726bc5b4689b"

image = ImageSpec(
    name="polars",
    packages=["flytekit", flytekit],
    registry="localhost:30000",
    apt_packages=["git"],
)


@task
def add_one(x: int) -> int:
    return x + 1


@task
def double(x: int) -> int:
    return x * 2


@eager(
    container_image=image,
    remote=FlyteRemote(
        config=Config.for_sandbox(),
        default_project="flytesnacks",
        default_domain="development",
    ),
)
async def simple_eager_workflow(x: int) -> int:
    out = await add_one(x=x)
    if out < 0:
        return -1
    return await double(x=out)
  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
Copy link

codecov bot commented Aug 29, 2024

Codecov Report

Attention: Patch coverage is 30.76923% with 9 lines in your changes missing coverage. Please review.

Project coverage is 71.82%. Comparing base (65bf9e7) to head (f0b5d67).
Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
flytekit/experimental/eager_function.py 30.76% 9 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #2720       +/-   ##
===========================================
+ Coverage   45.47%   71.82%   +26.34%     
===========================================
  Files         193      193               
  Lines       19516    19560       +44     
  Branches     2818     3845     +1027     
===========================================
+ Hits         8875    14049     +5174     
+ Misses      10208     4832     -5376     
- Partials      433      679      +246     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -177,6 +177,13 @@ def _dispatch_execute(
for k, v in output_file_dict.items():
utils.write_proto_to_file(v.to_flyte_idl(), os.path.join(ctx.execution_state.engine_dir, k))

# make sure an event loop exists for data persistence step
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this issue comes up not just with eager. Frameworks like Langchain that have async functionality will cause the event loop to end before the flyte task can run the async data persistence operation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use a separate event loop just for flytekit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds right, I think someone more well-versed in async can help with that

Copy link
Member

@Future-Outlier Future-Outlier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for this!
can we have

  1. register the task to remote cluster (sandbox or dogfood gcp) to prove it works?
  2. add workflow in unit tests.

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
@cosmicBboy
Copy link
Contributor Author

cosmicBboy commented Aug 30, 2024

thank you for this! can we have

  1. register the task to remote cluster (sandbox or dogfood gcp) to prove it works?
  2. add workflow in unit tests.

Tests for this exist here and here. The changes made to the PR (adding the new client_secret_env_var argument) and loosening the strictness of secrets only really applies to Union serverless.

The code snippet in the PR description should work on flyte sandbox
image
image

Future-Outlier
Future-Outlier previously approved these changes Aug 30, 2024
Copy link
Member

@Future-Outlier Future-Outlier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you

Comment on lines 634 to 635
remote_cls = type(remote)
return remote_cls()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumes that the other FlyteRemote object can be created without a Config and uses an environment variable. Is this okay for OSS?

Also, should this carry over the default_domain and default_project from the original remote obj? Or should the other remote object figure out the domain and project from the execution environment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ooo yeah it should probably carry over those args.

admittedly this code path is basically: "tell me you only run on serverless without saying you run on serverless" 🙃

Copy link
Contributor Author

@cosmicBboy cosmicBboy Aug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically client_secret_env_var only really applies to serverless (this will fail with Flyte), unless at some point down the road Flyte provides an single API key UX.

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
Copy link
Contributor

@wild-endeavor wild-endeavor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some questions, but this event loop thing... can you explain why it's needed in entrypoint.py but there doesn't seem to be another one for local execution? Won't both need it?

secret_requests.append(Secret(group=client_secret_group, key=client_secret_key))
except ValueError:
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if there's nothing in the client_secret_env_var environment variable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

client_secret_env_var doesn't figure into this try except block, or am I missing something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no yeah i know, i mean, what happens if all three (client_secret_env_var, client_secret_group and client_secret_key) are None?

@@ -396,6 +398,8 @@ def eager(
:param local_entrypoint: If True, the eager workflow will can be executed locally but use the provided
:py:func:`~flytekit.remote.FlyteRemote` object to create task/workflow executions. This is useful for local
testing against a remote Flyte cluster.
:param client_secret_env_var: if specified, binds the client secret to the specified environment variable for
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay with this for now, but I think we should see if we can't move this to some higher level in the future. the secret here we're talking about is the secret for hitting admin itself right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be implemented in Secret but we haven't moved on that for a while: #1726


if client_secret_env_var is not None:
# this creates a remote client where the env var client secret is sufficient for authentication
os.environ[client_secret_env_var] = client_secret
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

confused what's happening here. if we're removing the asserts, then that means the group/key might be None. If they're none, won't the secrets_manager.get fail? So then what is the client_secret? Or is there some default in the secrets manager that somehow gets returned?
I don't get why we're setting the environment variable here to the value of client secret.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is to make it easy to authenticate with UnionRemote on serverless (via UNION_SERVERLESS_API_KEY env var). Secrets on serverless only needs a secret key (no group).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just wondering, in the following scenario:

  • client_secret_key is None
  • client_secret_group is None
    (and this scenario is now valid because the asserts are now removed)
    after line 626 client_secret = secrets_manager.get(client_secret_group, client_secret_key)
    what is the value of client_secret?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

going through the secretsmanager, it looks like it raises an IndexError if you call get(group=None, key=None), so I don't think it ever gets past line 626. Could we move the assertion that at least one of those has be be non-None into the get function?

Copy link
Contributor

@wild-endeavor wild-endeavor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a few more comments and went over this with @eapolinario. Could we

  • Add an assert to secretsmanager.get that at least one of group/key should be non-None
  • Could we move the async loop stuff to another PR? I think it'd be helpful to discuss it separately. Also do you remember how you triggered the issue? I wanna repro locally to understand more what langchain is doing. Ideally we handle in such a way that it's completely independent of any other library.

Signed-off-by: Niels Bantilan <niels.bantilan@gmail.com>
@cosmicBboy
Copy link
Contributor Author

Had a few more comments and went over this with @eapolinario. Could we

  • Add an assert to secretsmanager.get that at least one of group/key should be non-None

Updated PR with client secret key/group assertion.

  • Could we move the async loop stuff to another PR? I think it'd be helpful to discuss it separately. Also do you remember how you triggered the issue? I wanna repro locally to understand more what langchain is doing. Ideally we handle in such a way that it's completely independent of any other library.

Deleted this. For some reason I'm unable to repro the async issue I was seeing earlier on eager or with langchain async document loaders https://github.com/unionai-oss/union-rag/blob/main/union_rag/langchain.py#L138-L141

@cosmicBboy cosmicBboy merged commit 9f9f197 into master Sep 5, 2024
101 checks passed
@cosmicBboy cosmicBboy deleted the update-eager branch September 9, 2024 14:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants