Skip to content

Enabling availability zone awareness in metric R/W with ingesters. #2317

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Mar 29, 2020

Conversation

khaines
Copy link
Contributor

@khaines khaines commented Mar 23, 2020

What this PR does: This PR adds support for defining an availability zone for ingesters so that metric writes & reads will be assigned across zone boundaries. This is added as a CLI flag or config file field, as this feature is agnostic to how that particular is set/discovered; as it varies greatly based on the given environment configuration.

Which issue(s) this PR fixes:
Fixes #612

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

khaines added 5 commits March 23, 2020 11:57
…ded distinct zone check in ring Get ingesters function

Signed-off-by: Ken Haines <khaines@microsoft.com>
Signed-off-by: Ken Haines <khaines@microsoft.com>
Signed-off-by: Ken Haines <khaines@microsoft.com>
Signed-off-by: Ken Haines <khaines@microsoft.com>
Signed-off-by: Ken Haines <khaines@microsoft.com>
khaines added 2 commits March 23, 2020 12:14
Signed-off-by: Ken Haines <khaines@microsoft.com>
Signed-off-by: Ken Haines <khaines@microsoft.com>
@khaines khaines force-pushed the khaines/zone-aware-replication branch from 7c7ae7f to a978fd4 Compare March 23, 2020 19:15
Signed-off-by: Ken Haines <khaines@microsoft.com>
@khaines
Copy link
Contributor Author

khaines commented Mar 23, 2020

The lint failure is because of the doc check. The default for the zone setting at the moment is the hostname, which makes the document verification expectations dynamic. I'm working on a solution to this.

…generated in a consistent/verifiable way but still have a decent default value

Signed-off-by: Ken Haines <khaines@microsoft.com>
Copy link
Contributor

@tomwilkie tomwilkie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice one Ken! I suggest adding a little bit of documentation about this - something about the pitfalls of having less zones than RF, for instance.

@khaines
Copy link
Contributor Author

khaines commented Mar 28, 2020

A great suggestion @tomwilkie. I'll work on that right away and add it here.

Signed-off-by: Ken Haines <khaines@microsoft.com>
@khaines khaines merged commit b3112b3 into cortexproject:master Mar 29, 2020
@bboreham
Copy link
Contributor

Looks like this broke my staging system:

time="2020-03-29T22:43:33Z" level=warning msg="POST /api/prom/push (500) 2.254634ms Response: \"at least 2 live replicas required, could only find 1\\n\" ws: false; Content-Encoding: snappy; Content-Length: 16461; Content-Type: application/x-protobuf; User-Agent: Prometheus/2.12.0; X-Prometheus-Remote-Write-Version: 0.1.0; " traceID=77042d0abcee28fd

Comment on lines +2 to +5
title: "Ingester Hand-over"
linkTitle: "Ingester Hand-over"
weight: 5
slug: ingester-handover
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@khaines This comes from a bad copy-paste. May you submit a PR to fix it, please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it did... looked at it a few times and this didn't catch my eye.


## Configuration

Cortex can be configured to consider an availability zone value in its replication system. Doing so mitigates risks associated with losing multiple nodes with in the same availability zone. The availability zone for an ingester can be defined on the command line of the ingester using the `ingester.availability-zone` flag or using the yaml configuration:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@khaines with in > within

@@ -103,6 +104,7 @@ func (cfg *LifecyclerConfig) RegisterFlagsWithPrefix(prefix string, f *flag.Flag
f.StringVar(&cfg.Addr, prefix+"lifecycler.addr", "", "IP address to advertise in consul.")
f.IntVar(&cfg.Port, prefix+"lifecycler.port", 0, "port to advertise in consul (defaults to server.grpc-listen-port).")
f.StringVar(&cfg.ID, prefix+"lifecycler.ID", hostname, "ID to register into consul.")
f.StringVar(&cfg.Zone, prefix+"availability-zone", "", "The availability zone of the host, this instance is running on. Default is the lifecycler ID.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@khaines Default is the lifecycler ID. This looks more an internal implementation detail than something we should let the final user know. For the final user the default is an empty string, which means the zone-awareness is disabled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made an edit to this with #2384

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make distributors/ingesters aware of availability zones
4 participants