
Commit 89116fc

cstyan authored and tomwilkie committed
Add documentation for HA tracker flags. (#1465)
* Add documentation for HA tracker args. Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Fix misspelling of Prometheus in HA replica/cluster flags. Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Rename accept-ha-samples to enable-ha-tracker. Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Update args docs based on review and addition of etcd. Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Add note in architecture.md about HA tracking. Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Specify that certain flags can/should be prefixed with ring/ha-tracker. Signed-off-by: Callum Styan <callumstyan@gmail.com>
* Move -ha-tracker.* to -distributor.ha-tracker.* Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
* Make the flags part of ha-tracker and explicit. Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
1 parent 900ca62 commit 89116fc

7 files changed: +80 -17 lines changed

docs/architecture.md

Lines changed: 3 additions & 1 deletion
@@ -24,6 +24,8 @@ The **distributor** service is responsible for handling samples written by Prome
 
 Distributors communicate with ingesters via [gRPC](https://grpc.io). They are stateless and can be scaled up and down as needed.
 
+If the HA Tracker is enabled, the Distributor will deduplicate incoming samples that contain both a cluster and replica label. It talks to a KVStore to store state about which replica per cluster it's accepting samples from for a given user ID. Samples with one or neither of these labels will be accepted by default.
+
 #### Hashing
 
 Distributors use consistent hashing, in conjunction with the (configurable) replication factor, to determine *which* instances of the ingester service receive each sample.
@@ -147,4 +149,4 @@ The interface works somewhat differently across the supported databases:
 
 A set of schemas are used to map the matchers and label sets used on reads and writes to the chunk store into appropriate operations on the index. Schemas have been added as Cortex has evolved, mainly in an attempt to better load balance writes and improve query performance.
 
-> The current schema recommendation is the **v10 schema**.
+> The current schema recommendation is the **v10 schema**.
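
For illustration only, here is a minimal Go sketch of the deduplication decision described in the note added above. It assumes the default `cluster`/`__replica__` label names, and the in-memory map stands in for the KV store; all names here are hypothetical, not the actual Cortex implementation.

```go
package main

import "fmt"

// electedReplica maps userID/cluster to the replica currently being accepted.
// In Cortex this state lives in the KV store (Consul or etcd); a plain map is
// used here purely to illustrate the decision.
var electedReplica = map[string]string{}

// acceptSample sketches the dedup rule: samples carrying both labels are only
// accepted from the elected replica for that cluster, while samples missing
// one or both labels bypass the check entirely.
func acceptSample(userID string, labels map[string]string) bool {
	cluster, hasCluster := labels["cluster"]
	replica, hasReplica := labels["__replica__"]
	if !hasCluster || !hasReplica {
		return true // one or neither label present: accepted by default
	}
	key := userID + "/" + cluster
	elected, ok := electedReplica[key]
	if !ok {
		electedReplica[key] = replica // first replica seen becomes the elected one
		return true
	}
	return elected == replica
}

func main() {
	a := map[string]string{"cluster": "prod", "__replica__": "replica-1"}
	b := map[string]string{"cluster": "prod", "__replica__": "replica-2"}
	fmt.Println(acceptSample("user-1", a)) // true: replica-1 is now elected for prod
	fmt.Println(acceptSample("user-1", b)) // false: replica-2's samples are dropped
}
```

In the real tracker this state is shared through Consul or etcd so every distributor agrees on the elected replica, and the election only changes after the failover timeout documented in docs/arguments.md below.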

docs/arguments.md

Lines changed: 62 additions & 1 deletion
@@ -92,6 +92,67 @@ The ingester query API was improved over time, but defaults to the old behaviour
 - `-distributor.extra-query-delay`
   This is used by a component with an embedded distributor (Querier and Ruler) to control how long to wait until sending more than the minimum amount of queries needed for a successful response.
 
+- `distributor.ha-tracker.enable-for-all-users`
+  Flag to enable, for all users, handling of samples with external labels identifying replicas in an HA Prometheus setup. This defaults to false, and is technically defined in the Distributor limits.
+
+- `distributor.ha-tracker.enable`
+  Enable the distributor's HA tracker so that it can accept samples from Prometheus HA replicas gracefully (requires labels). This is global (per distributor) and ensures that the necessary internal data structures for HA handling are created. The option `enable-for-all-users` is still needed to enable ingestion of HA samples for all users.
+
+### Ring/HA Tracker Store
+
+The KVStore client is used by both the Ring and HA Tracker.
+- `{ring,distributor.ha-tracker}.prefix`
+  The prefix for the keys in the store. Should end with a /. For example, with a prefix of foo/, the key bar would be stored under foo/bar.
+- `{ring,distributor.ha-tracker}.store`
+  Backend storage to use for the ring (consul, etcd, inmemory).
+
+#### Consul
+
+By default these flags configure the Consul instance used for the ring. To configure Consul for the HA tracker,
+prefix these flags with `distributor.ha-tracker.`
+
+- `consul.hostname`
+  Hostname and port of Consul.
+- `consul.acltoken`
+  ACL token used to interact with Consul.
+- `consul.client-timeout`
+  HTTP timeout when talking to Consul.
+- `consul.consistent-reads`
+  Enable consistent reads to Consul.
+
+#### etcd
+
+By default these flags configure the etcd instance used for the ring. To configure etcd for the HA tracker,
+prefix these flags with `distributor.ha-tracker.`
+
+- `etcd.endpoints`
+  The etcd endpoints to connect to.
+- `etcd.dial-timeout`
+  The timeout for the etcd connection.
+- `etcd.max-retries`
+  The maximum number of retries to do for failed ops.
+
+### HA Tracker
+
+HA tracking has two flags of its own:
+- `distributor.ha-tracker.cluster`
+  Prometheus label to look for in samples to identify a Prometheus HA cluster. (default "cluster")
+- `distributor.ha-tracker.replica`
+  Prometheus label to look for in samples to identify a Prometheus HA replica. (default "__replica__")
+
+It's reasonable to assume people probably already have a `cluster` label, or something similar. If not, they should add one along with `__replica__`
+via external labels in their Prometheus config.
+
+HA Tracking looks for the two labels (which can be overridden per user).
+
+The HA tracker also talks to a KVStore and has its own copies of the flags the Distributor uses to connect to the KV store for the ring.
+- `distributor.ha-tracker.failover-timeout`
+  If we don't receive any samples from the accepted replica for a cluster in this amount of time we will failover to the next replica we receive a sample from. This value must be greater than the update timeout. (default 30s)
+- `distributor.ha-tracker.store`
+  Backend storage to use for the ring (consul, etcd, inmemory). (default "consul")
+- `distributor.ha-tracker.update-timeout`
+  Update the timestamp in the KV store for a given cluster/replica only after this amount of time has passed since the current stored timestamp. (default 15s)
+
 ## Ingester
 
 - `-ingester.normalise-tokens`
@@ -185,4 +246,4 @@ Valid fields are (with their corresponding flags for default values):
 
 - `s3.force-path-style`
 
-  Set this to `true` to force the request to use path-style addressing (`http://s3.amazonaws.com/BUCKET/KEY`). By default, the S3 client will use virtual hosted bucket addressing when possible (`http://BUCKET.s3.amazonaws.com/KEY`).
+  Set this to `true` to force the request to use path-style addressing (`http://s3.amazonaws.com/BUCKET/KEY`). By default, the S3 client will use virtual hosted bucket addressing when possible (`http://BUCKET.s3.amazonaws.com/KEY`).
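
As an aside, the interplay between `distributor.ha-tracker.update-timeout` and `distributor.ha-tracker.failover-timeout` documented above can be sketched as below. The type and function are hypothetical stand-ins for the tracker's KV-store record rather than the actual Cortex code; the point is only that the stored timestamp is rewritten at most once per update timeout, and a different replica is only elected after the failover timeout has elapsed.

```go
package main

import (
	"fmt"
	"time"
)

// replicaDesc mirrors, in spirit, what the tracker stores per user/cluster in
// the KV store: the elected replica and when its timestamp was last updated.
// Field names are illustrative, not the actual Cortex types.
type replicaDesc struct {
	replica   string
	updatedAt time.Time
}

// shouldAccept shows how the two timeouts interact: for the elected replica the
// stored timestamp is only refreshed after updateTimeout has passed; a sample
// from a different replica only wins the election after failoverTimeout.
func shouldAccept(desc *replicaDesc, replica string, now time.Time, updateTimeout, failoverTimeout time.Duration) bool {
	if desc.replica == replica {
		if now.Sub(desc.updatedAt) > updateTimeout {
			desc.updatedAt = now // refresh the KV-store timestamp
		}
		return true
	}
	if now.Sub(desc.updatedAt) > failoverTimeout {
		desc.replica = replica // failover to the replica we just heard from
		desc.updatedAt = now
		return true
	}
	return false
}

func main() {
	start := time.Now()
	desc := &replicaDesc{replica: "replica-1", updatedAt: start}
	// Within the 30s failover timeout, replica-2 is still rejected.
	fmt.Println(shouldAccept(desc, "replica-2", start.Add(10*time.Second), 15*time.Second, 30*time.Second)) // false
	// After 40s without samples from replica-1, replica-2 takes over.
	fmt.Println(shouldAccept(desc, "replica-2", start.Add(40*time.Second), 15*time.Second, 30*time.Second)) // true
}
```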

pkg/distributor/distributor.go

Lines changed: 6 additions & 6 deletions
@@ -109,8 +109,8 @@ type Config struct {
 	BillingConfig billing.Config `yaml:"billing,omitempty"`
 	PoolConfig ingester_client.PoolConfig `yaml:"pool,omitempty"`
 
-	EnableHAReplicas bool `yaml:"enable_ha_pairs,omitempty"`
-	HATrackerConfig HATrackerConfig `yaml:"ha_tracker,omitempty"`
+	EnableHATracker bool `yaml:"enable_ha_tracker,omitempty"`
+	HATrackerConfig HATrackerConfig `yaml:"ha_tracker,omitempty"`
 
 	RemoteTimeout time.Duration `yaml:"remote_timeout,omitempty"`
 	ExtraQueryDelay time.Duration `yaml:"extra_queue_delay,omitempty"`
@@ -129,7 +129,7 @@ func (cfg *Config) RegisterFlags(f *flag.FlagSet) {
 	cfg.HATrackerConfig.RegisterFlags(f)
 
 	f.BoolVar(&cfg.EnableBilling, "distributor.enable-billing", false, "Report number of ingested samples to billing system.")
-	f.BoolVar(&cfg.EnableHAReplicas, "distributor.accept-ha-labels", false, "Accept samples from Prometheus HA replicas gracefully (requires labels).")
+	f.BoolVar(&cfg.EnableHATracker, "distributor.ha-tracker.enable", false, "Enable the distributors HA tracker so that it can accept samples from Prometheus HA replicas gracefully (requires labels).")
 	f.DurationVar(&cfg.RemoteTimeout, "distributor.remote-timeout", 2*time.Second, "Timeout for downstream ingesters.")
 	f.DurationVar(&cfg.ExtraQueryDelay, "distributor.extra-query-delay", 0, "Time to wait before sending more than the minimum successful query requests.")
 	f.DurationVar(&cfg.LimiterReloadPeriod, "distributor.limiter-reload-period", 5*time.Minute, "Period at which to reload user ingestion limits.")
@@ -166,7 +166,7 @@ func New(cfg Config, clientConfig ingester_client.Config, limits *validation.Ove
 		quit: make(chan struct{}),
 	}
 
-	if cfg.EnableHAReplicas {
+	if cfg.EnableHATracker {
 		replicas, err := newClusterTracker(cfg.HATrackerConfig)
 		if err != nil {
 			return nil, err
@@ -204,7 +204,7 @@ func (d *Distributor) loop() {
 func (d *Distributor) Stop() {
 	close(d.quit)
 	d.ingesterPool.Stop()
-	if d.cfg.EnableHAReplicas {
+	if d.cfg.EnableHATracker {
 		d.replicas.stop()
 	}
 }
@@ -290,7 +290,7 @@ func (d *Distributor) Push(ctx context.Context, req *client.WriteRequest) (*clie
 	var lastPartialErr error
 	removeReplica := false
 
-	if d.cfg.EnableHAReplicas && d.limits.AcceptHASamples(userID) && len(req.Timeseries) > 0 {
+	if d.cfg.EnableHATracker && d.limits.AcceptHASamples(userID) && len(req.Timeseries) > 0 {
 		removeReplica, err = d.checkSample(ctx, userID, req.Timeseries[0])
 		if err != nil {
 			return nil, err
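
The `removeReplica` flag in the hunk above suggests that, once a sample from the elected replica is accepted, the replica label is stripped before the series is forwarded, so both HA replicas map onto the same stored series. A rough sketch of such a label-filtering step; the helper and label names are illustrative, not the distributor's actual code:

```go
package main

import "fmt"

// stripReplicaLabel returns a copy of the label set without the HA replica
// label, so accepted samples from either replica collapse onto the same series.
func stripReplicaLabel(labels map[string]string, replicaLabel string) map[string]string {
	out := make(map[string]string, len(labels))
	for name, value := range labels {
		if name == replicaLabel {
			continue // drop __replica__ (or the configured replica label)
		}
		out[name] = value
	}
	return out
}

func main() {
	series := map[string]string{"__name__": "up", "cluster": "prod", "__replica__": "replica-1"}
	fmt.Println(stripReplicaLabel(series, "__replica__")) // map[__name__:up cluster:prod]
}
```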

pkg/distributor/distributor_test.go

Lines changed: 1 addition & 1 deletion
@@ -139,7 +139,7 @@ func TestDistributorPushHAInstances(t *testing.T) {
 	for _, shardByAllLabels := range []bool{true, false} {
 		t.Run(fmt.Sprintf("[%d](shardByAllLabels=%v)", i, shardByAllLabels), func(t *testing.T) {
 			d := prepare(t, 1, 1, 0, shardByAllLabels)
-			d.cfg.EnableHAReplicas = true
+			d.cfg.EnableHATracker = true
 			d.limits.Defaults.AcceptHASamples = true
 			codec := codec.Proto{Factory: ProtoReplicaDescFactory}
 			mock := kv.PrefixClient(consul.NewInMemoryClient(codec), "prefix")

pkg/distributor/ha_tracker.go

Lines changed: 3 additions & 3 deletions
@@ -68,15 +68,15 @@ type HATrackerConfig struct {
 // RegisterFlags adds the flags required to config this to the given FlagSet
 func (cfg *HATrackerConfig) RegisterFlags(f *flag.FlagSet) {
 	f.DurationVar(&cfg.UpdateTimeout,
-		"ha-tracker.update-timeout",
+		"distributor.ha-tracker.update-timeout",
 		15*time.Second,
 		"Update the timestamp in the KV store for a given cluster/replica only after this amount of time has passed since the current stored timestamp.")
 	f.DurationVar(&cfg.FailoverTimeout,
-		"ha-tracker.failover-timeout",
+		"distributor.ha-tracker.failover-timeout",
 		30*time.Second,
 		"If we don't receive any samples from the accepted replica for a cluster in this amount of time we will failover to the next replica we receive a sample from. This value must be greater than the update timeout")
 	// We want the ability to use different Consul instances for the ring and for HA cluster tracking.
-	cfg.KVStore.RegisterFlagsWithPrefix("ha-tracker.", f)
+	cfg.KVStore.RegisterFlagsWithPrefix("distributor.ha-tracker.", f)
 }
 
 // NewClusterTracker returns a new HA cluster tracker using either Consul
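
The `RegisterFlagsWithPrefix` call above is what lets the ring and the HA tracker point at different Consul or etcd instances: the same flag-registration code runs once with no prefix (ring) and once with the `distributor.ha-tracker.` prefix. A self-contained sketch of that pattern, using a stand-in config type rather than the real Cortex `kv` package:

```go
package main

import (
	"flag"
	"fmt"
)

// consulConfig is a stand-in for the kv client's Config type; the point is only
// to show how one RegisterFlags implementation yields both -consul.hostname
// (ring) and -distributor.ha-tracker.consul.hostname (HA tracker).
type consulConfig struct {
	Host string
}

// RegisterFlags registers this config's flags under the given prefix.
func (c *consulConfig) RegisterFlags(f *flag.FlagSet, prefix string) {
	f.StringVar(&c.Host, prefix+"consul.hostname", "localhost:8500", "Hostname and port of Consul.")
}

func main() {
	fs := flag.NewFlagSet("example", flag.ExitOnError)

	var ring, haTracker consulConfig
	ring.RegisterFlags(fs, "")                             // registers -consul.hostname
	haTracker.RegisterFlags(fs, "distributor.ha-tracker.") // registers -distributor.ha-tracker.consul.hostname

	// Only the HA tracker's Consul is overridden; the ring keeps the default.
	fs.Parse([]string{"-distributor.ha-tracker.consul.hostname", "consul-ha.example:8500"})
	fmt.Println(ring.Host, haTracker.Host) // localhost:8500 consul-ha.example:8500
}
```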

pkg/ring/kv/consul/client.go

Lines changed: 2 additions & 2 deletions
@@ -61,8 +61,8 @@ type Client struct {
 func (cfg *Config) RegisterFlags(f *flag.FlagSet, prefix string) {
 	f.StringVar(&cfg.Host, prefix+"consul.hostname", "localhost:8500", "Hostname and port of Consul.")
 	f.StringVar(&cfg.ACLToken, prefix+"consul.acltoken", "", "ACL Token used to interact with Consul.")
-	f.DurationVar(&cfg.HTTPClientTimeout, prefix+"consul.client-timeout", 2*longPollDuration, "HTTP timeout when talking to consul")
-	f.BoolVar(&cfg.ConsistentReads, prefix+"consul.consistent-reads", true, "Enable consistent reads to consul.")
+	f.DurationVar(&cfg.HTTPClientTimeout, prefix+"consul.client-timeout", 2*longPollDuration, "HTTP timeout when talking to Consul")
+	f.BoolVar(&cfg.ConsistentReads, prefix+"consul.consistent-reads", true, "Enable consistent reads to Consul.")
 }
 
 // NewClient returns a new Client.

pkg/util/validation/limits.go

Lines changed: 3 additions & 3 deletions
@@ -43,9 +43,9 @@ type Limits struct {
 func (l *Limits) RegisterFlags(f *flag.FlagSet) {
 	f.Float64Var(&l.IngestionRate, "distributor.ingestion-rate-limit", 25000, "Per-user ingestion rate limit in samples per second.")
 	f.IntVar(&l.IngestionBurstSize, "distributor.ingestion-burst-size", 50000, "Per-user allowed ingestion burst size (in number of samples). Warning, very high limits will be reset every -distributor.limiter-reload-period.")
-	f.BoolVar(&l.AcceptHASamples, "distributor.accept-ha-samples", false, "Per-user flag to enable handling of samples with external labels for identifying replicas in an HA Prometheus setup.")
-	f.StringVar(&l.HAReplicaLabel, "ha-tracker.replica", "__replica__", "Prometheus label to look for in samples to identify a Proemtheus HA replica.")
-	f.StringVar(&l.HAClusterLabel, "ha-tracker.cluster", "cluster", "Prometheus label to look for in samples to identify a Poemtheus HA cluster.")
+	f.BoolVar(&l.AcceptHASamples, "distributor.ha-tracker.enable-for-all-users", false, "Flag to enable, for all users, handling of samples with external labels identifying replicas in an HA Prometheus setup.")
+	f.StringVar(&l.HAReplicaLabel, "distributor.ha-tracker.replica", "__replica__", "Prometheus label to look for in samples to identify a Prometheus HA replica.")
+	f.StringVar(&l.HAClusterLabel, "distributor.ha-tracker.cluster", "cluster", "Prometheus label to look for in samples to identify a Prometheus HA cluster.")
 	f.IntVar(&l.MaxLabelNameLength, "validation.max-length-label-name", 1024, "Maximum length accepted for label names")
 	f.IntVar(&l.MaxLabelValueLength, "validation.max-length-label-value", 2048, "Maximum length accepted for label value. This setting also applies to the metric name")
 	f.IntVar(&l.MaxLabelNamesPerSeries, "validation.max-label-names-per-series", 30, "Maximum number of label names per series.")
