
Enable completion time-to-live to be set on all jobs #407

Merged: 4 commits into jaegertracing:master, Jun 6, 2019
Conversation

objectiser (Contributor) commented on May 10, 2019:

Fixes #406

@jkandasa Could you give this a test to make sure it fixes the issue? Note it requires Kubernetes 1.12 or higher.

Signed-off-by: Gary Brown gary@brownuk.com
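
For context, the Kubernetes TTL-after-finished feature this relies on (alpha in 1.12, behind the TTLAfterFinished feature gate) is a field on the Job spec. A minimal Go sketch of setting it on a CronJob's job template; the function name, schedule, and TTL value below are illustrative placeholders rather than the operator's actual code:

package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	batchv1beta1 "k8s.io/api/batch/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newCronJobWithTTL is purely illustrative: name, schedule and TTL are placeholders.
func newCronJobWithTTL(name, schedule string, ttlSeconds int32) *batchv1beta1.CronJob {
	return &batchv1beta1.CronJob{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: batchv1beta1.CronJobSpec{
			Schedule: schedule,
			JobTemplate: batchv1beta1.JobTemplateSpec{
				Spec: batchv1.JobSpec{
					// Kubernetes deletes the finished Job (and its pods) this many
					// seconds after it completes or fails. Requires the
					// TTLAfterFinished feature (alpha in Kubernetes 1.12).
					TTLSecondsAfterFinished: &ttlSeconds,
					// Pod template omitted for brevity.
				},
			},
		},
	}
}

func main() {
	cj := newCronJobWithTTL("example-es-index-cleaner", "55 23 * * *", 600)
	fmt.Println(*cj.Spec.JobTemplate.Spec.TTLSecondsAfterFinished)
}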

objectiser (Contributor Author):

The Code Climate issues are related to the autogenerated code.

codecov bot commented on May 10, 2019:

Codecov Report

Merging #407 into master will increase coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #407      +/-   ##
==========================================
+ Coverage   91.59%   91.62%   +0.02%     
==========================================
  Files          64       64              
  Lines        3142     3153      +11     
==========================================
+ Hits         2878     2889      +11     
  Misses        184      184              
  Partials       80       80
Impacted Files                            | Coverage Δ
pkg/apis/jaegertracing/v1/jaeger_types.go | 100% <ø> (ø) ⬆️
pkg/strategy/controller.go                | 97.69% <100%> (+0.19%) ⬆️
pkg/cronjob/spark_dependencies.go         | 97.8% <100%> (+0.02%) ⬆️
pkg/cronjob/es_rollover.go                | 95.53% <100%> (-0.08%) ⬇️
pkg/storage/cassandra_dependencies.go     | 100% <100%> (ø) ⬆️
pkg/cronjob/es_index_cleaner.go           | 100% <100%> (ø) ⬆️

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2f67fd4...174230e.

jkandasa (Member):

@objectiser I do not have Kubernetes 1.12; I have:

OpenShift Master: v3.11.98
Kubernetes Master: v1.11.0+d4cacc0
OpenShift Web Console: v3.11.98

Test status:

  • works well for the es-index-cleaner job (job status is success all the time)
  • spark-dependencies is inconsistent (job status is error all the time; the spark-dependencies job is failing, see #405)
    • old spark-dependencies jobs are deleted at random intervals (could not predict when)

CR:

  storage:
    type: elasticsearch
    esIndexCleaner:
      enabled: true
      schedule: "*/1 * * * *"
      completedTTL: 180
    dependencies:
      enabled: true
      schedule: "*/1 * * * *"
      completedTTL: 300
    elasticsearch:
      nodeCount: 3
      resources:

objectiser (Contributor Author):

@jkandasa Ok thanks.

@jpkrohling @pavolloffay At present the 'completed time-to-live' for all of the jobs defaults to 10 minutes. I'm wondering whether each job type (e.g. rollover, index cleaner, etc.) should have its own default that relates more closely to its schedule (when one is used): if a schedule only runs a job once a day, then perhaps the TTL should be two days so any failures are around for a while?
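
A rough sketch of what such per-job-type defaulting could look like; the map keys, values, and function are hypothetical and only illustrate the "tie the default to how often the job runs" idea discussed here, not what this PR implements:

package main

import "fmt"

// Hypothetical per-job-type defaults, in seconds; the values are made up to
// illustrate keeping failed jobs around for roughly two schedule intervals.
var defaultTTLSeconds = map[string]int32{
	"es-index-cleaner":   2 * 24 * 60 * 60, // daily schedule -> keep for two days
	"spark-dependencies": 2 * 24 * 60 * 60, // daily schedule -> keep for two days
	"es-rollover":        2 * 60 * 60,      // hourly schedule -> keep for two hours
}

// effectiveTTL returns the user-supplied value when set, otherwise the
// per-job-type default.
func effectiveTTL(jobType string, userValue *int32) int32 {
	if userValue != nil {
		return *userValue
	}
	return defaultTTLSeconds[jobType]
}

func main() {
	fmt.Println(effectiveTTL("es-index-cleaner", nil)) // 172800
}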

pavolloffay (Member):

Maybe we could always make it double the schedule interval?

objectiser (Contributor Author):

@pavolloffay On further thought, the time a job should remain shouldn't really be tied to the schedule: if failures/retries occur, that could leave a large number of jobs lying around for a long time.

On the other hand, we need to give someone time to detect the failure and capture any relevant information, so I have updated the default to 1 hour.

If we want to come up with a more complex scheme, we could do that in a separate PR?

pavolloffay (Member):

> On the other hand, we need to give someone time to detect the failure and capture any relevant information, so I have updated the default to 1 hour.

The jobs we have here run once per day at midnight. Not sure if one hour is a good default in this case.

objectiser (Contributor Author):

If there is a logging framework, any failures would be captured centrally, so having the job hanging around would not be as relevant.

pavolloffay (Member):

The question is whether those logs would get attention and whether they would indicate a problem.

objectiser (Contributor Author):

@jpkrohling Any thoughts on this?

jpkrohling (Contributor):

If we run each job on a daily basis, it's safe to assume that they shouldn't take more than one day to complete.

objectiser (Contributor Author):

@jpkrohling Could you take a look at this? The Code Climate errors are not relevant, and the Travis job has finished but hasn't been updated here for some reason.

jpkrohling (Contributor):

This is all green now.

objectiser (Contributor Author):

@jpkrohling Are the changes ok to merge?

jpkrohling (Contributor):

LGTM, but I would prefer TTLSecondsAfterFinished as the name for the new property, to match the one from the batch objects, which is where it's used after all.

objectiser (Contributor Author):

@jpkrohling No problem, can change.
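
For illustration, a sketch of how the CRD type could expose the renamed property; the struct name, neighbouring fields, and JSON tags below are assumptions rather than a copy of pkg/apis/jaegertracing/v1/jaeger_types.go:

// Sketch only: how a per-job spec in the Jaeger CRD could carry the renamed
// property; the surrounding fields are assumed, not taken from the PR.
package v1

type JaegerEsIndexCleanerSpec struct {
	Enabled  *bool  `json:"enabled,omitempty"`
	Schedule string `json:"schedule,omitempty"`
	// TTLSecondsAfterFinished mirrors batch/v1 JobSpec.TTLSecondsAfterFinished
	// and would be copied onto the jobs the operator creates.
	TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty"`
}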

…and change default TTL to 1 day

Signed-off-by: Gary Brown <gary@brownuk.com>
@@ -43,6 +43,7 @@ github.com/bradfitz/go-smtpd v0.0.0-20170404230938-deb6d6237625/go.mod h1:HYsPBT
github.com/census-instrumentation/opencensus-proto v0.2.0 h1:LzQXZOgg4CQfE6bFvXGM30YZL1WW/M337pXml+GrcZ4=
github.com/census-instrumentation/opencensus-proto v0.2.0/go.mod h1:f6KPmirojxKA12rnyqOA5BBL4O983OfeGPqjHWSTneU=
github.com/client9/misspell v0.3.4/go.mod h1:qj6jICC3Q7zFZvVWo7KLAzC3yx5G7kyvSDkc90ppPyw=
github.com/codahale/hdrhistogram v0.0.0-20161010025455-3a0bb77429bd h1:qMd81Ts1T2OTKmB4acZcyKaMtRnY5Y44NuXGX2GFJ1w=

objectiser (Contributor Author) commented on Jun 4, 2019:

@jpkrohling Would you expect these changes to the go.sum file? They seem similar to the existing entries, except the next line ends in /go.mod.

Contributor:

Not really, but the version seems strange. Perhaps it's pointing to master and they had a new commit recently?

objectiser (Contributor Author):

I would have expected it to just update the existing line rather than add a new one.

Contributor:

Oh, good point. I need to understand Go modules better in order to answer that. I have seen duplicated entries in different sections of the go.mod, but I'm not sure I would expect the same for go.sum...

jpkrohling merged commit eaa4d52 into jaegertracing:master on Jun 6, 2019
Successfully merging this pull request may close these issues.

Add a mechanism to clean completed/failed jobs triggered by the jaeger-operator
4 participants