[pipeline-ui] Retrieve pod logs from argo archive #2081
Conversation
Hi @eterna2. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from 1b7fb57 to 01340b6
frontend/server/server.ts (Outdated)
```typescript
/** helper function to retrieve pod logs from argo artifactory. */
const getPodLogsFromArtifactory = _as_bool(ARGO_ARCHIVE_LOGS) ? k8sHelper.getPodLogsFromArtifactoryHelper(
  ARGO_ARCHIVE_ARTIFACTORY === 'minio' ? minioClient : s3Client,
  ARGO_ARCHIVE_BUCKETNAME,
```
Can the actual log location URL be taken from the Workflow status object?
Good idea.
Just a question on which client to use.
The workflow status does provide the output as well as the artifactory configs.
Do you think it is better to try to infer which pre-created client to use, or to retrieve the secrets and create a new client each time?
In the latter case, should we do it for the artifacts retrieval also? i.e. skip the configuration for minio outright.
> Just a question on which client to use.

AFAIK, in the Frontend the s3Client is actually an instance of minioClient. There is some weird logic where s3 reads data uncompressed while minio expects tar.gz.
Generally, I feel that the UI handling of artifacts could use some improvement. If you see ways to streamline and improve that part, I hope @jingzhang36 and other people working on UX will be glad to see it.
Sure.
My other issue is IAM role support for minio-js. Minio-js does not support IAM roles (unlike minio-go). I made a PR to add this capability to minio-js, but they are asking for a relatively big rewrite (which I don't think I have time for) instead of the monkey patch I provided. Wondering if you are open to me adding a short routine outside of minio-js to support this (retrieving and updating AWS IAM temporary credentials) here. Because of this limitation, our current deployment relies on minio-gateway as a proxy to our s3 buckets. The crux of the issue is that in some places in kf we use s3://, while for the UI specifically we have to use minio://. This is quite confusing for our data scientists.
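For context, the kind of routine being proposed boils down to two calls against the EC2 instance metadata endpoint. Here is a minimal sketch, assuming the standard IMDSv1 contract; the function names are illustrative and not the PR's actual aws-helper:

```typescript
import * as http from 'http';

// Shape of the temporary credentials returned by the metadata service.
interface TempCredentials {
  AccessKeyId: string;
  SecretAccessKey: string;
  Token: string;
  Expiration: string; // ISO timestamp; re-fetch before this passes
}

const METADATA_URL =
  'http://169.254.169.254/latest/meta-data/iam/security-credentials/';

function httpGet(url: string): Promise<string> {
  return new Promise((resolve, reject) => {
    http.get(url, res => {
      let data = '';
      res.on('data', (chunk: Buffer) => (data += chunk));
      res.on('end', () => resolve(data));
    }).on('error', reject);
  });
}

async function getInstanceProfileCredentials(): Promise<TempCredentials> {
  // First call returns the role name attached to the instance (or pod, via kube2iam).
  const role = (await httpGet(METADATA_URL)).trim();
  // Second call returns the temporary credentials for that role.
  return JSON.parse(await httpGet(METADATA_URL + role)) as TempCredentials;
}
```

Since the credentials carry an Expiration timestamp, the "updating" half of the routine is a matter of re-fetching them before they lapse.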
/ok-to-test
/retest
@Ark-kun The workflow status does provide the archive information, but only if the job completes without errors. So we might still need the user-provided env variables as an optional fallback. So the logic now is: try the k8s API for the pod logs first, then the archive location from the argo workflow status, and finally the artifactory configured via the env variables.
AWS instance profile
A wrapper over minio-js to mimic the behavior of minio-go, so that the pipeline ui can retrieve objects from s3 without an access key/secret, using the ec2 instance profile (or kube2iam credentials) instead. This means the IAM role for the pod/ec2 instance running the pipeline-ui can be used to interact with s3 instead of an access key/secret.

k8s role/service account
Updated the manifest so that the ui role is able to get secrets as well as workflows.
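The wrapper idea reduces to building a minio-js client from those fetched credentials. A hypothetical sketch (not the PR's actual code), reusing getInstanceProfileCredentials() from above and assuming minio-js accepts a sessionToken option in its Client constructor:

```typescript
import * as Minio from 'minio';

// Build an s3 client from instance-profile credentials instead of a static
// access key/secret. A real wrapper would also recreate the client (or swap
// its credentials) shortly before the Expiration timestamp.
async function createS3ClientFromInstanceProfile(): Promise<Minio.Client> {
  const creds = await getInstanceProfileCredentials();
  return new Minio.Client({
    endPoint: 's3.amazonaws.com',
    useSSL: true,
    accessKey: creds.AccessKeyId,
    secretKey: creds.SecretAccessKey,
    sessionToken: creds.Token,
  });
}
```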
```typescript
  });
case 'minio':
  try {
    res.send(await getTarObjectAsString({bucket, key, client: minioClient}));
```
In the next PR, can you please make a function that auto-detects whether the data is gzipped and unpacks if needed?
I am thinking whether to change from node-tar to tar-stream + maybe-gzip.
But it probably needs more testing to ensure it works.
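The auto-detection asked for above is cheap, since gzip streams always start with the magic bytes 0x1f 0x8b. A minimal sketch (the function name is illustrative):

```typescript
import * as zlib from 'zlib';

// Peek at the first two bytes and only gunzip when they match the gzip
// magic number; otherwise return the buffer untouched. A tar archive may
// still need to be unpacked afterwards.
function maybeGunzip(buf: Buffer): Buffer {
  const isGzipped = buf.length >= 2 && buf[0] === 0x1f && buf[1] === 0x8b;
  return isGzipped ? zlib.gunzipSync(buf) : buf;
}
```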
I think this is a good PR that also improves the handling of data access by the Frontend.
@eterna2 Thanks for the contribution! This is a great PR. I briefly went through the change, and there are no obvious problems. I think you can start on some testing. Our integration test infra doesn't run tests on AWS, so you cannot rely on that. I recommend unit tests for the node server that cover the end-to-end flow of your use case. Please keep in mind we can only rely on your tests to make sure we don't break your feature, so prefer tests that cover the whole flow rather than implementation details. I need some time to gather enough context about minio, argo workflow and the exact approach you took. I will add more comments when I have time for a more detailed review.
@Bobgy Let's try to prioritize this PR a bit since there are some UX improvements that might depend on some of the refactorings introduced here.
I'm still concerned with 0 test coverage. @Ark-kun what's your opinion? Do you think follow-up tests in a separate PR, or no test coverage for now, is okay?
I'd prefer having at least tests that cover the happy path for new features.
I need some time to write the tests. I don't see anything specific currently in this repo though. I do see some tests further down the line, but those are more end-to-end tests. How do you want me to include the tests? Would standard jest tests that mock some of the AWS and K8s requests to test the various fallbacks work? @Ark-kun
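A test in the spirit described might look like the sketch below: mock the k8s pod-log call to fail so the handler has to fall back to the archive path. All module paths and helper names here are hypothetical, not the repo's actual layout:

```typescript
import * as k8sHelper from './k8s-helper';
import * as workflowHelper from './workflow-helper';
import { getPodLogs } from './handlers';

// Auto-mock both helper modules so their functions become jest mocks.
jest.mock('./k8s-helper');
jest.mock('./workflow-helper');

test('falls back to the argo archive when the pod is gone', async () => {
  // Simulate a GC-ed pod: the direct k8s API lookup fails...
  (k8sHelper.getPodLogs as jest.Mock).mockRejectedValue(new Error('pod not found'));
  // ...and the archive path succeeds.
  (workflowHelper.getPodLogsFromArchive as jest.Mock).mockResolvedValue('archived logs');

  await expect(getPodLogs('some-pod')).resolves.toBe('archived logs');
});
```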
…ead workflow status for argo archive location for pod logs.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: Bobgy, neuromage. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.
@eterna2 I have a general question - how does the argo pod log get persisted in the artifactory in the first place? Where did you specify env vars such as ARGO_ARCHIVE_LOGS? Did you set them in the workflow controller?
Reading closer, it seems the log persistence is set via argo's output artifact repository API. Does our DSL expose this for now? @Ark-kun
It is not done by kfp. It is currently set by the cluster admin (i.e. via the Argo manifest). Argo allows you to configure a default artifactory, and there is an archive-logs flag you can set. When unable to retrieve the logs from the pod directly, my PR handles this by reading the archive location from the workflow status and, failing that, falling back to the artifactory configured via the env variables.
The Argo artifact configmap is an implementation detail that we do not intend to expose, either to the cluster user or the admin. @Ark-kun thoughts?
I think it's important for cluster administrators to control the artifact storage location. P.S. I consider archiving the logs to be beneficial. We should enable it by default. Otherwise the logs are lost in non-GCP environments.
/hold cancel
Can anyone give a bit more details on how to enable this? From a master branch installation of Kubeflow, if I set the below env vars on
...and
Is the above setup supposed to work? And is there a way of enabling this for e.g. KF 0.7?
Hmmm. It works on my cluster though - but I am on AWS EKS with my custom manifest rather than kf's manifest. Can you check what version of pipeline is deployed? And what is the returned error msg? You might need to update the pipeline-ui service account to have access to k8s secrets and the Argo workflow crd. Under the hood, it will first try the k8s API to get the pod logs, then fall back to the Argo workflow crd status (which will tell you where the artifacts are stored; it will load the secrets and try to retrieve the logs). And if that fails, it will just try to read the logs based on the env vars you set. I have set it to return a different error msg depending on where it failed.
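In code, the fallback order described here is just a chain of try/catch blocks. A sketch under that description, with hypothetical helper names standing in for the PR's actual modules:

```typescript
// Hypothetical helpers standing in for the PR's k8s-helper / workflow-helper
// modules and the env-var-configured getPodLogsFromArtifactory shown earlier.
declare function getPodLogsFromK8s(podName: string): Promise<string>;
declare function getPodLogsFromWorkflowStatus(podName: string): Promise<string>;
declare function getPodLogsFromArtifactory(podName: string): Promise<string>;

async function getPodLogsWithFallback(podName: string): Promise<string> {
  try {
    // 1. Ask the k8s API directly (works while the pod still exists).
    return await getPodLogsFromK8s(podName);
  } catch (errK8s) {
    try {
      // 2. Read the archive location from the Argo workflow status,
      //    loading any secrets it references.
      return await getPodLogsFromWorkflowStatus(podName);
    } catch (errWorkflow) {
      // 3. Last resort: the artifactory configured via env vars.
      return await getPodLogsFromArtifactory(podName);
    }
  }
}
```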
I just checked. KF 0.7 uses 0.1.31. This PR is only merged in for versions > 0.1.34. You can try using 0.1.34 or 0.1.35. You should also update the pipeline-ui service account to have access to k8s secrets and the Argo workflow crd. You can look at the manifest changes in this PR. This is not updated for the main kubeflow manifest repo.
Thanks @eterna2! 0.1.34 doesn't have the change but 0.1.35 works. I used your kubeflow-aws as a basis and created a non-AWS-specific version (since I'm using Minio not S3): https://github.com/mattiasarro/kubeflow-0-7-argo-minio-logs |
@eterna2 Is this feature supposed to work for non-AWS installations (e.g. GCP) as well? There are some reports that it might not be the case.
Yes, it should work with both minio and s3. But there are a few configurations that need to be set; by default it is off. What it does under the hood is just get the workflow status and retrieve the log artifacts referenced in the status. There are a few ways it can fail. When the workflow errors out, sometimes the status will be incomplete - no artifact info. In this case, it will not work. And if a secret is required to retrieve the artifacts, the UI service account must have permission to retrieve these secrets. I don't think I have updated the main kubeflow manifest for this feature.
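Concretely, "retrieving the log artifacts from the status" means looking up the pod's node in the workflow status and reading the s3 location of its main-logs output artifact (the name Argo gives archived container logs). A sketch, with types trimmed to just what the lookup needs:

```typescript
// Trimmed view of an Argo workflow status node; in Argo v2 the pod name
// doubles as the node ID in status.nodes.
interface WorkflowStatusNode {
  outputs?: {
    artifacts?: Array<{
      name: string;
      s3?: { bucket: string; key: string; endpoint: string };
    }>;
  };
}

function getLogArchiveLocation(
  nodes: { [podName: string]: WorkflowStatusNode },
  podName: string,
): { bucket: string; key: string; endpoint: string } | undefined {
  const artifacts = nodes[podName]?.outputs?.artifacts || [];
  // Argo names the archived log artifact "main-logs".
  return artifacts.find(a => a.name === 'main-logs')?.s3;
}
```

If the workflow failed before the node finished, this lookup returns undefined, which is exactly the incomplete-status case mentioned above.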
Is there a good place for us to document some of this configuration?
A possible workaround for #1803
Currently, I persist pod logs to an s3 bucket with argo's archiveLogs config.
This PR allows the UI to fall back to retrieving GC-ed pod logs from the argo archive artifactory - this can be either a minio store or an s3 bucket.
This feature can be enabled by setting the following env variables: ARGO_ARCHIVE_LOGS, ARGO_ARCHIVE_ARTIFACTORY, and ARGO_ARCHIVE_BUCKETNAME.
Updates 20 sep 2019:
Investigated the feasibility of getting the pod logs archive location from the argo workflow status:
Changes
- Updated @types/minio to match the minio package (useSSL instead of insecure <- old)
- Created a few additional helper modules:
  - aws-helper provides utils to query and handle AWS instance profile session credentials (aka kube2iam or ec2 profiles can be used instead of providing an access key and secret to the minio client)
  - workflow-helper provides utils to retrieve pod log archive info from the argo workflow status
  - minio-helper provides utils to retrieve objects from minio or s3 backends, including the capability to use AWS ec2 credentials to access s3
- Replaced the existing handlers for s3 and minio objects with utils from minio-helper
- Updated the get pod logs handler to try getting the pod logs in the following order: the k8s API first, then the archive location from the argo workflow status, then the env-var-configured artifactory
I have built the image and tested it on my own cluster in AWS. However, I have only tested the pod logs; I have not really tested the ui-metadata yet.
TODO