---
title: Performance Tuning
---

By default, Concourse is configured to feel very snappy. This works well when you are first trying out Concourse or
using it on a small team with a few dozen pipelines.

Fires can start breaking out once you begin trying to scale Concourse. This section goes over some configuration
values in Concourse that you can change to make scaling easier.

## The Big Caveat

Track [Metrics](metrics.md)! Everything you read next could be all for nothing if you don't have metrics to track where
the bottlenecks are in your Concourse system. We highly suggest tracking metrics so you have a clear before-and-after
picture for any changes you make and can clearly see whether you're moving things in the right direction.

## Build Logs

Is the size of your database growing dramatically? Can't keep up with the storage costs? Then you should probably
configure some default log retention settings.

By default, Concourse does not delete any build logs from your pipelines. You have to opt in to having Concourse
automatically delete build logs for you. You can set a time-based retention policy and/or a policy based on the number
of build logs retained per job.

### `CONCOURSE_DEFAULT_BUILD_LOGS_TO_RETAIN`

Determines how many build logs to retain per job by default. If you set this to `10` then any job in your pipelines
that has more than ten builds will have the logs for the older builds deleted.

Users can override this value in their pipelines.

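For instance, a job-level override might look like the sketch below (the job, resource, and file names are made up);
the `build_log_retention` config on a job takes precedence over the web node default:

```yaml
jobs:
- name: unit-tests
  build_log_retention:
    builds: 20   # keep logs for the 20 most recent builds of this job
  plan:
  - get: repo
  - task: unit
    file: repo/ci/unit.yml
```
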
### `CONCOURSE_MAX_BUILD_LOGS_TO_RETAIN`

Sets the maximum number of build logs that can be retained per job, capping whatever users configure in their
pipelines. Users cannot override this setting.

### `CONCOURSE_DEFAULT_DAYS_TO_RETAIN_BUILD_LOGS`

Determines how old build logs can get before they are deleted. Setting this to a value like `10` will result in any
build logs older than ten days being deleted.

Users can override this value in their pipelines.

### `CONCOURSE_MAX_DAYS_TO_RETAIN_BUILD_LOGS`

Sets the maximum number of days build logs can be retained for, capping whatever users configure in their pipelines.
Users cannot override this setting.

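Putting the four retention variables together on the web node, a fragment like the following can serve as a starting
point. It assumes a docker-compose style deployment, and the numbers are illustrative rather than recommendations:

```yaml
services:
  web:
    image: concourse/concourse
    command: web
    environment:
      CONCOURSE_DEFAULT_BUILD_LOGS_TO_RETAIN: "50"
      CONCOURSE_MAX_BUILD_LOGS_TO_RETAIN: "500"
      CONCOURSE_DEFAULT_DAYS_TO_RETAIN_BUILD_LOGS: "30"
      CONCOURSE_MAX_DAYS_TO_RETAIN_BUILD_LOGS: "180"
```
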
## Resource Checking

By default, Concourse checks any given resource roughly every minute. This makes Concourse feel snappy when you first
start using it. Once you start trying to scale though, the volume of checks can begin to feel aggressive. The following
settings can help you reduce the load caused by resource checking.

### `CONCOURSE_RESOURCE_CHECKING_INTERVAL`

This is where the default one-minute checking interval comes from. Changing this value changes the default checking
interval for all resources. Users can override this value when defining a resource with the
[`resource.check_every`](../resources/index.md#resource-schema) field.

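As a sketch (the resource name and URI are placeholders), a single noisy resource can be slowed down without touching
the global default:

```yaml
resources:
- name: huge-monorepo
  type: git
  source:
    uri: https://github.com/example/monorepo.git
  check_every: 30m   # checked every 30 minutes instead of the global default
```
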
### `CONCOURSE_RESOURCE_WITH_WEBHOOK_CHECKING_INTERVAL`

Same as the previous variable, but it only applies to resources that have webhooks configured. You could use this to
effectively disable periodic checking of webhook-backed resources by setting it to a large value like `99h`.

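For reference, a resource only falls under this interval once it has a `webhook_token` configured; a rough sketch with
placeholder names and token:

```yaml
resources:
- name: repo
  type: git
  source:
    uri: https://github.com/example/repo.git
  webhook_token: some-random-token   # the external system calls Concourse's check webhook with this token
```
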
### `CONCOURSE_MAX_CHECKS_PER_SECOND`

Maximum number of checks that can be started per second. If not specified, this is calculated as
`(number of resources) / (resource checking interval)`. If you're finding that too many resource checks are running at
once and consuming a lot of resources on your workers, you can use this variable to reduce the overall load.

A value of `-1` removes the limit on checks per second.

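As a rough illustration (the numbers are made up): an instance with 6,000 resources on the default one-minute interval
would start about 100 checks per second. Assuming a docker-compose style deployment, you could relax the interval and
cap the rate explicitly:

```yaml
services:
  web:
    environment:
      CONCOURSE_RESOURCE_CHECKING_INTERVAL: 2m
      CONCOURSE_RESOURCE_WITH_WEBHOOK_CHECKING_INTERVAL: 99h
      CONCOURSE_MAX_CHECKS_PER_SECOND: "25"
```
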
## Pipeline Management

Here are some flags you can set on the web node to help manage the amount of resources pipelines consume. These flags
are mostly about ensuring pipelines don't run forever without good reason.

### `CONCOURSE_PAUSE_PIPELINES_AFTER`

This flag takes the number of days a pipeline can go without running before it is automatically paused. Specifying
`90` means any pipeline whose most recent build ran more than 90 days ago will be automatically paused.

For large instances it can be common for users to set a pipeline and then forget about it. The pipeline may never run
another job again and be forgotten forever. Even if the jobs in the pipeline never run, Concourse will still be running
resource checks for that pipeline if any resources are defined. By setting this flag you can ensure that any pipelines
meeting this criterion are automatically paused and do not consume resources long-term. For some large instances this
can mean up to 50% of pipelines eventually being paused.

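A minimal fragment, again assuming a docker-compose style deployment; a paused pipeline can always be unpaused again
when someone needs it:

```yaml
services:
  web:
    environment:
      CONCOURSE_PAUSE_PIPELINES_AFTER: "90"   # pause pipelines with no builds in the last 90 days
```
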
### `CONCOURSE_DEFAULT_TASK_{CPU/MEMORY}_LIMIT`

Global defaults for CPU and memory limits that you can set. They only apply to task containers, not resource containers
(`check`/`get`/`put` steps). You can read more about how to set these limits on the
[`task` step `container_limits`](../steps/task.md) page.

Users can override these values in their pipelines.

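A job-level override might look like the sketch below (job, task, and file names are hypothetical); `cpu` is measured
in CPU shares and `memory` in bytes:

```yaml
jobs:
- name: unit-tests
  plan:
  - get: repo
  - task: unit
    file: repo/ci/unit.yml
    container_limits:
      cpu: 512            # CPU shares, not cores
      memory: 1073741824  # bytes (1 GiB)
```
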
### `CONCOURSE_DEFAULT_{GET/PUT/TASK}_TIMEOUT`

Global defaults for how long the named step type is allowed to run. Useful if you're finding that your users write
pipelines with tasks that get stuck or never end. Setting these ensures that every build eventually finishes.

Users can override these values in their pipelines.

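A per-step override uses the `timeout` step modifier; a sketch with made-up names:

```yaml
jobs:
- name: integration
  plan:
  - get: repo
    timeout: 10m   # fail this get if it runs longer than 10 minutes
  - task: slow-suite
    file: repo/ci/integration.yml
    timeout: 1h    # overrides the web node's default task timeout for this step only
```
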
## Container Placement

If you find that workers keep crashing due to high CPU and/or memory usage, then you could try specifying a custom
container placement strategy or strategy chain. The [Container Placement](container-placement.md) page has some
examples of container placement strategy chains you can use.

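As one possible starting point (assuming a recent Concourse version where strategies can be chained, and a
docker-compose style deployment; see the Container Placement page for the full list of strategies and their tuning
flags):

```yaml
services:
  web:
    environment:
      # strategies are applied left to right: filter out busy workers, then prefer the least loaded one
      CONCOURSE_CONTAINER_PLACEMENT_STRATEGY: limit-active-tasks,fewest-build-containers
      CONCOURSE_MAX_ACTIVE_TASKS_PER_WORKER: "5"   # used by the limit-active-tasks strategy
```
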
## Garbage Collection

When jobs fail or error out in Concourse, their resources are not immediately cleaned up. The container and storage
space remain on a worker for some period of time before they get garbage collected. If you want to make the garbage
collector more aggressive you can change the following settings on your web node:

### `CONCOURSE_GC_FAILED_GRACE_PERIOD`

This env var only applies to containers from failed builds and has the longest grace period of all the GC grace
periods. It has a default value of `120h` (five days).

The default is so long so that users don't feel rushed to investigate a failed job. A job can fail over a weekend and
users can investigate the failed job's containers when they come back on Monday.

Failed containers are garbage collected as soon as a new build of the job is kicked off, so you don't have to worry
about failed containers always hanging around for five days. They'll only hang around that long if they belong to the
most recent build of a job.

If you notice a lot of containers and volumes hanging around that are tied to failed jobs, you can try reducing this
setting to fewer days or even a few hours.

### Other GC Grace Periods

Depending on what a container was used for and its exit condition, there are various flags you can adjust to make
Concourse GC these resources faster or slower. The following env vars cover the cases where you probably don't need the
container hanging around for very long. They have a default value of `5m`; a combined example follows the list.

* `CONCOURSE_GC_ONE_OFF_GRACE_PERIOD` - Period after which one-off build containers will be garbage-collected
* `CONCOURSE_GC_MISSING_GRACE_PERIOD` - Period after which containers and volumes that were created but went missing
  from the worker will be garbage-collected
* `CONCOURSE_GC_HIJACK_GRACE_PERIOD` - Period after which hijacked containers will be garbage-collected

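As one combined sketch (again assuming a docker-compose style deployment; the durations are illustrative, not
recommendations):

```yaml
services:
  web:
    environment:
      CONCOURSE_GC_FAILED_GRACE_PERIOD: 24h   # down from the 120h default
      CONCOURSE_GC_HIJACK_GRACE_PERIOD: 1h    # up from 5m, for longer debugging sessions
      CONCOURSE_GC_ONE_OFF_GRACE_PERIOD: 2m   # one-off build containers rarely need to linger
```
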
## Web To Worker Ratio

This is anecdotal, and you should adjust based on the metrics of your web nodes. A starting ratio of web to worker
nodes is 1:6; one web instance for every six workers.

The core Concourse team runs two web nodes and 16 workers, a 1:8 ratio. We can get away with this lower web-to-worker
ratio because we don't have that many users actively interacting with the web UI on a daily basis; fewer than 10 active
users. Since we're only one team using the instance, we have fewer pipelines than an instance supporting multiple teams
would.