
Commit 03169ff
docx: complete tuning
Signed-off-by: Kevin Bimonte <kbimonte@gmail.com>
1 parent a8134ec

1 file changed: +109 −1 lines changed
docs/docs/operation/tuning.md

Lines changed: 109 additions & 1 deletion
@@ -2,40 +2,148 @@
title: Performance Tuning
---

By default, Concourse is configured to feel very snappy. This is good for when you are first trying out Concourse or using it on a small team with a few dozen pipelines.

Fires can start breaking out when you begin trying to scale Concourse. This section goes over some configuration values you can change to make scaling easier.
## The Big Caveat

Track [Metrics](metrics.md)! Everything you read next could be all for nothing if you don't have metrics to track where the bottlenecks are in your Concourse system. We highly suggest tracking metrics so you have a clear before-and-after picture for any changes you make and can clearly see whether you're moving things in the right direction.
## Build Logs

Is the size of your database growing dramatically? Can't keep up with the storage costs? Then you should probably configure some default log retention settings.

By default, Concourse will not delete any of your logs from your pipelines. You have to opt in to having Concourse automatically delete build logs for you. You can set a time-based retention policy and/or a policy based on the number of logs a job generates.
### `CONCOURSE_DEFAULT_BUILD_LOGS_TO_RETAIN`

Determines how many build logs to retain per job by default. If you set this to `10` then any jobs in your pipelines that have more than ten builds will have the logs for the extra builds deleted.

Users can override this value in their pipelines.
### `CONCOURSE_MAX_BUILD_LOGS_TO_RETAIN`

Determines the maximum number of build logs that can be retained per job. Users cannot override this setting.
### `CONCOURSE_DEFAULT_DAYS_TO_RETAIN_BUILD_LOGS`

Determines how old build logs have to be before they are deleted. Setting this to a value like `10` will result in any build logs older than 10 days being deleted.

Users can override this value in their pipelines.
### `CONCOURSE_MAX_DAYS_TO_RETAIN_BUILD_LOGS`

Determines the maximum number of days build logs can be retained before they are deleted. Users cannot override this setting in their pipelines.
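A retention policy combining these flags might look like the following on the web node. The specific numbers here are illustrative, not recommendations; pick values that fit your storage budget:

```shell
# Keep the last 50 builds per job by default; users may raise this up to 500.
export CONCOURSE_DEFAULT_BUILD_LOGS_TO_RETAIN=50
export CONCOURSE_MAX_BUILD_LOGS_TO_RETAIN=500

# Delete logs older than 30 days by default; never keep logs past 90 days.
export CONCOURSE_DEFAULT_DAYS_TO_RETAIN_BUILD_LOGS=30
export CONCOURSE_MAX_DAYS_TO_RETAIN_BUILD_LOGS=90
```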
## Resource Checking

By default, Concourse checks any given resource every ~1min. This makes Concourse feel snappy when you first start using it. Once you start trying to scale, though, the number of checks can begin to feel aggressive. The following settings can help you reduce the load caused by resource checking.
### `CONCOURSE_RESOURCE_CHECKING_INTERVAL`

This is where the default value for 1min checks comes from. Changing this value changes the default checking interval for all resources. Users can override this value when defining a resource with the [`resource.check_every`](../resources/index.md#resource-schema) field.
### `CONCOURSE_RESOURCE_WITH_WEBHOOK_CHECKING_INTERVAL`

Same as the previous var, but only applies to resources with webhooks. You could use this to effectively disable periodic checking of resources that use webhooks by setting it to a large value like `99h`.
### `CONCOURSE_MAX_CHECKS_PER_SECOND`

Maximum number of checks that can be started per second. Without a cap, the check rate works out to roughly `(# of resources)/(resource checking interval)`. If you're finding that too many resource checks are running at once and consuming a lot of resources on your workers, you can use this var to reduce the overall load.

A value of `-1` removes the limit on checks per second.
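As a rough worked example (the resource count here is hypothetical), the steady-state check rate and two ways to lower it might look like:

```shell
# Hypothetical instance: 600 resources on the default 1-minute interval.
# Steady-state check rate = resources / interval in seconds.
echo $(( 600 / 60 ))   # 10 checks started per second

# Stretching the default interval to 5 minutes cuts that rate to 2/second:
export CONCOURSE_RESOURCE_CHECKING_INTERVAL=5m

# Or cap the rate directly, regardless of how many resources exist:
export CONCOURSE_MAX_CHECKS_PER_SECOND=5
```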
## Pipeline Management

Here are some flags you can set on the web node to help manage the amount of resources pipelines consume. These flags are mostly about ensuring pipelines don't run forever without good reason.
### `CONCOURSE_PAUSE_PIPELINES_AFTER`

This flag takes a number representing the number of days since a pipeline last ran before it's automatically paused. Specifying `90` means any pipeline that last ran more than 90 days ago will be automatically paused.

For large instances it can be common for users to set a pipeline and then forget about it. The pipeline may never run another job again and be forgotten forever. Even if the jobs in the pipeline never run, Concourse will still be running resource checks for that pipeline if any resources are defined. By setting this flag you can ensure that any pipelines that meet this criterion are automatically paused and don't consume resources long-term. For some large instances this can mean up to 50% of pipelines eventually being paused.
### `CONCOURSE_DEFAULT_TASK_{CPU/MEMORY}_LIMIT`

Global defaults for the CPU and memory limits of tasks. They only apply to tasks, not resource containers (`check`/`get`/`put` steps). You can read more about how to set these limits on the [`task` step `container_limits`](../steps/task.md) page.

Users can override these values in their pipelines.
### `CONCOURSE_DEFAULT_{GET/PUT/TASK}_TIMEOUT`

Global defaults for how long each of these step types is allowed to execute. Useful if you're finding your users write pipelines with tasks that get stuck or never end. Ensures that every build eventually finishes.

Users can override these values in their pipelines.
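Putting the pipeline-management flags together, a web node configuration might look like the sketch below. The values and the units assumed here (CPU shares, memory in bytes, Go-style durations) are illustrative; check the flag descriptions in `concourse web --help` for your version before copying them:

```shell
# Pause any pipeline that hasn't run in 90 days.
export CONCOURSE_PAUSE_PIPELINES_AFTER=90

# Default task limits: 1024 CPU shares and 1 GiB of memory (assumed to be bytes).
export CONCOURSE_DEFAULT_TASK_CPU_LIMIT=1024
export CONCOURSE_DEFAULT_TASK_MEMORY_LIMIT=1073741824

# Ensure no single step can run forever, so every build eventually finishes.
export CONCOURSE_DEFAULT_GET_TIMEOUT=30m
export CONCOURSE_DEFAULT_PUT_TIMEOUT=30m
export CONCOURSE_DEFAULT_TASK_TIMEOUT=2h
```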
## Container Placement

If you find that workers keep crashing due to high CPU and/or memory usage, you could try specifying a custom container placement strategy or strategy chain. The [Container Placement](container-placement.md) page has some examples of container placement strategy chains you can use.
## Garbage Collection

When jobs fail or error out in Concourse, their resources are not immediately cleaned up. The container and storage space remain on a worker for some period of time before they get garbage collected. If you want to make the garbage collector more aggressive, you can change the following settings on your web node:
### `CONCOURSE_GC_FAILED_GRACE_PERIOD`

This env var only applies to containers where the job failed, and it has the longest grace period of all the GC grace periods. It has a default value of `120h` (five days).

The reason the default value is so long is so users don't feel rushed to investigate their failed jobs. A job can fail over a weekend and users can investigate the failed job's containers when they come back on Monday.

Failed containers get GC'd as soon as a new build of the job is kicked off, so you don't have to worry about failed containers always hanging around for five days. They'll only hang around for that long if they belong to the most recent build of a job.

If you notice a lot of containers and volumes hanging around that are tied to failed jobs, you can try reducing this setting to fewer days or even a few hours.
### Other GC Grace Periods

Depending on what a container was used for and its exit condition, there are various flags you can adjust to make Concourse GC these resources faster or slower. The following env vars cover the cases where you probably don't need the container hanging around for very long. They have a default value of `5m`.

* `CONCOURSE_GC_ONE_OFF_GRACE_PERIOD` - Period after which one-off build containers will be garbage-collected
* `CONCOURSE_GC_MISSING_GRACE_PERIOD` - Period after which containers and volumes that were created but went missing from the worker will be garbage-collected
* `CONCOURSE_GC_HIJACK_GRACE_PERIOD` - Period after which hijacked containers will be garbage-collected
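A more aggressive garbage collector might be configured like this. The durations are illustrative starting points, not recommendations; watch your container and volume metrics after changing them:

```shell
# Reclaim failed-build containers after one day instead of the 120h default.
export CONCOURSE_GC_FAILED_GRACE_PERIOD=24h

# Tighten the shorter grace periods from their 5m default.
export CONCOURSE_GC_ONE_OFF_GRACE_PERIOD=2m
export CONCOURSE_GC_MISSING_GRACE_PERIOD=2m
export CONCOURSE_GC_HIJACK_GRACE_PERIOD=2m
```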
## Web To Worker Ratio

This is anecdotal, and you should adjust based on the metrics of your web nodes. A starting web-to-worker ratio is 1:6; one web instance for every six workers.

The core Concourse team runs two web nodes and 16 workers, a 1:8 ratio. We can get away with this lower web-to-worker ratio because we don't have that many users actively interacting with the web UI on a daily basis; fewer than 10 active users. Since we're only one team using the instance, we have fewer pipelines than an instance supporting multiple teams would.
