cortexproject · pstibrany · Aug 11, 2020 · Jul 30, 2020 · Jul 30, 2020 · Aug 3, 2020
diff --git a/docs/blocks-storage/migrate-from-chunks-to-blocks.md b/docs/blocks-storage/migrate-from-chunks-to-blocks.md
@@ -0,0 +1,275 @@
+---
+title: "Migrate Cortex cluster from chunks to blocks"
+linkTitle: "Migrate Cortex cluster from chunks to blocks"
+weight: 5
+slug: migrate-cortex-cluster-from-chunks-to-blocks
+---
+
+This article describes how to migrate existing Cortex cluster from chunks storage to blocks storage,
+and highlight possible issues you may encounter in the process.
+
+_This document replaces the [Cortex proposal](https://cortexmetrics.io/docs/proposals/ingesters-migration/),
+which was written before support for migration was in place._
+
+## Introduction
+
+This article **assumes** that:
+
+- Cortex cluster is managed by Kubernetes
+- Cortex is using chunks storage
+- Ingesters are using WAL
+
+_If your ingesters are not using WAL, the documented procedure will still apply, but the presented migration script will not work properly without changes, as it assumes that ingesters are managed via StatefulSet._
+
+The migration procedure is composed by 3 steps:
+
+1. [Preparation](#step-1-preparation)
+1. [Ingesters migration](#step-2-ingesters-migration)
+1. [Cleanup](#step-3-cleanup)
+
+_In case of any issue during or after the migration, this document also outlines a [Rollback](#rollback) strategy._
+
+## Step 1: Preparation
+
+Before starting the migration of ingesters, we need to prepare other services.
+
+### Querier and Ruler
+
+_Everything discussed for querier applies to ruler as well, since it shares querier configuration – options prefix is `querier` even when used by ruler._
+
+Querier and ruler need to be reconfigured as follow:
+
+- `-querier.second-store-engine=blocks`
+- `-querier.query-store-after=0`
+- `-querier.ingester-streaming=false`
+
+#### `-querier.second-store-engine=blocks`
+
+Querier (and ruler) needs to be reconfigured to query both chunks storage and blocks storage at the same time.
+This is achieved by using `-querier.second-store-engine=blocks` option, and providing querier with full blocks configuration, but keeping "primary" store set to `-store.engine=chunks`.
+
+#### `-querier.query-store-after=0`
+
+Querier (and ruler) has an option `-querier.query-store-after` to query store only if query hits data older than some period of time.
+For example, if ingesters keep 12h of data in memory, there is no need to hit the store for queries that only need last 1h of data.
+During the migration, this flag needs to be set to 0, to make queriers always consult the store when handling queries.
+The reason is that after chunks ingesters shutdown, they can no longer return chunks from memory.
+
+#### `-querier.ingester-streaming=false`
+
+Querier (and ruler) has a [bug](https://github.com/cortexproject/cortex/issues/2935) and doesn't properly
+merge streamed results from chunks and blocks-based ingesters. Instead it only returns data from blocks- instesters.
+To avoid this problem, we need to temporarily disable this feature by setting `-querier.ingester-streaming=false`.
+After migration is complete (i.e. all ingesters are running blocks only), this can be turned back to true, which is the default value.
+
+### Query-frontend
+
+Query-frontend needs to be reconfigured as follow:
+
+- `-querier.parallelise-shardable-queries=false`
+
+#### `-querier.parallelise-shardable-queries=false`
+
+Query frontend has an option `-querier.parallelise-shardable-queries` to split some incoming queries into multiple queries based on sharding factor used in v11 schema of chunk storage.
+As the description implies, it only works when using chunks storage.
+During and after the migration to blocks (and also after possible rollback), this option needs to be disabled otherwise query-frontend will generate queries that cannot be satisfied by blocks.
+
+### Compactor and Store-gateway
+
+[Compactor](./compactor.md) and [store-gateway](./store-gateway.md) services should be deployed and successfully up and running before migrating ingesters.
+
+## Step 2: Ingesters migration
+
+We have developed a script available in Cortex [`tools/migrate-ingester-statefulsets.sh`](https://github.com/cortexproject/cortex/blob/master/tools/migrate-ingester-statefulsets.sh) to migrate ingesters between two StatefulSets, shutting down ingesters one by one.
+
+It can be used like this:
+
+```
+$ tools/migrate-ingester-statefulsets.sh <namespace> <ingester-old> <ingester-new> <num-instances>
+```
+
+Where parameters are:
+- `<namespace>`: Kubernetes namespace where the Cortex cluster is running
+- `<ingester-old>`: name of the ingesters StatefulSet to scale down (running chunks storage)
+- `<ingester-new>`: name of the ingesters StatefulSet to scale up (running blocks storage)
+- `<num-instances>`: number of instances to scale down (in `ingester` statefulset) and scale up (in `ingester-blocks`), or "all" – which will scale down all remaining instances in `ingester` statefulset
+
+After starting new pod in `ingester-new` statefulset, script then triggers `/shutdown` endpoint on the old ingester. When the flushing on the old ingester is complete, scale down of statefulset continues, and process repeats.
+
+_The script supports both migration from chunks to blocks, and viceversa (eg. rollback)._
+
+### Known issues
+
+There are few known issues with the script:
+
+- If expected messages don't appear in the log, but pod keeps on running, the script will never finish.
+- Script doesn't verify that flush finished without any error.
+
+## Step 3: Cleanup
+
+When the ingesters migration finishes, there are still two StatefulSets, with original StatefulSet (running the chunks storage) having 0 instances now.
+
+At this point, we can delete the old StatefulSet and its persistent volumes and recreate it with final blocks storage configuration (eg. changing PVs), and use the script again to move pods from `ingester-blocks` to `ingester`.
+
+Querier (and ruler) can be reconfigured to use `blocks` as "primary" store to search, and `chunks` as secondary:
+
+- `-store.engine=blocks`
+- `-querier.second-store-engine=chunks`
+- `-querier.use-second-store-before-time=<timestamp after ingesters migration has completed>`
+- `-querier.ingester-streaming=true`
+
+#### `-querier.use-second-store-before-time`
+
+The CLI flag `-querier.use-second-store-before-time` (or its respective YAML config option) is only available for secondary store.
+This flag can be set to a timestamp when migration has finished, and it avoids querying secondary store (chunks) for data when running queries that don't need data before given time.
+
+#### `-querier.ingester-streaming=true`
+
+Querier can be configured to make use of streamed responses from ingester at this point (`-querier.ingester-streaming=true`).
+
+## Rollback
+
+If rollback to chunks is needed for any reason, it is possible to use the same migration script with reversed arguments:
+
+- Scale down ingesters StatefulSet running blocks storage
+- Scale up ingesters StatefulSet running chunks storage
+
+_Blocks ingesters support the same `/shutdown` endpoint for flushing data._
+
+During the rollback, queriers and rulers need to use the same configuration changes as during migration. You should also make sure the following settings are applied:
+
+- `-store.engine=chunks`
+- `-querier.second-store-engine=blocks`
+- `-querier.use-second-store-before-time` should not be set
+- `-querier.ingester-streaming=false`
+
+Once the rollback is complete, some configuration changes need to stay in place, because some data has already been stored to blocks:
+
+- The query sharding in the query-frontend must be kept disabled, otherwise querying blocks will not work correctly
+- `store-gateway` needs to keep running, otherwise querying blocks will fail
+- `compactor` may be shutdown, after it has no more compaction work to do
+
+Kubernetes resources related to the ingesters running the blocks storage may be deleted.
+
+### Known issues
+
+After rollback, chunk-ingesters replayed their old Write-Ahead-Log, thus loading old chunks into memory.
+WAL doesn't remember whether these old chunks were already flushed or not, so they are flushed again.
+Until that flush happens, Cortex reports those chunks as unflushed, which may trigger some alerts (via `cortex_oldest_unflushed_chunk_timestamp_seconds` metric).
+
+## Appendix
+
+### Jsonnet config
+
+This section shows how to use [cortex-jsonnet](https://github.com/grafana/cortex-jsonnet) to configure additional services.
+
+We will assume that `main.jsonnet` is main configuration for the cluster, that also imports `temp.jsonnet` – with our temporary configuration for migration.
+
+In `main.jsonnet` we have something like this:
+
+```jsonnet
+local cortex = import 'cortex/cortex.libsonnet';
+local wal = import 'cortex/wal.libsonnet';
+local temp = import 'temp.jsonnet';
+
+// Note that 'tsdb' is not imported here.
+cortex + wal + temp {
+  _images+:: (import 'images.libsonnet'),
+
+  _config+:: {
+    cluster: 'k8s-cluster',
+    namespace: 'k8s-namespace',
+
+...
+```
+
+To configure querier to use secondary store for querying, we need to add:
+
+```
+    querier_second_storage_engine: 'blocks',
+    storage_tsdb_bucket_name: 'bucket-for-storing-blocks',
+```
+
+to the `_config` object in main.jsonnet.
+
+Let's generate blocks configuration now in `temp.jsonnet`.
+There are comments inside that should give you an idea about what's happening.
+Most important thing is generating resources with blocks configuration, and exposing some of them.
+
+
+```jsonnet
+{
+  local cortex = import 'cortex/cortex.libsonnet',
+  local tsdb = import 'cortex/tsdb.libsonnet',
+  local rootConfig = self._config,
+  local statefulSet = $.apps.v1beta1.statefulSet,
+
+  // Prepare TSDB resources, but hide them. Cherry-picked resources will be exposed later.
+  tsdb_config:: cortex + tsdb + {
+    _config+:: {
+      cluster: rootConfig.cluster,
+      namespace: rootConfig.namespace,
+      external_url: rootConfig.external_url,
+
+      // This Cortex cluster is using the blocks storage.
+      storage_tsdb_bucket_name: rootConfig.storage_tsdb_bucket_name,
+      cortex_store_gateway_data_disk_size: '100Gi',
+      cortex_compactor_data_disk_class: 'fast',
+    },
+
+    // We create another statefulset for ingesters here, with different name.
+    ingester_blocks_statefulset: self.newIngesterStatefulSet('ingester-blocks', self.ingester_container) +
+                                 statefulSet.mixin.spec.withReplicas(0),
+
+    ingester_blocks_pdb: self.newIngesterPdb('ingester-blocks-pdb', 'ingester-blocks'),
+    ingester_blocks_service: $.util.serviceFor(self.ingester_blocks_statefulset, self.ingester_service_ignored_labels),
+  },
+
+  _config+: {
+    queryFrontend+: {
+      // Disabled because querying blocks-data breaks if query is rewritten for sharding.
+      sharded_queries_enabled: false,
+    },
+  },
+
+  // Expose some services from TSDB configuration, needed for running Querier with Chunks as primary and TSDB as secondary store.
+  tsdb_store_gateway_pdb: self.tsdb_config.store_gateway_pdb,
+  tsdb_store_gateway_service: self.tsdb_config.store_gateway_service,
+  tsdb_store_gateway_statefulset: self.tsdb_config.store_gateway_statefulset,
+
+  tsdb_memcached_metadata: self.tsdb_config.memcached_metadata,
+
+  tsdb_ingester_statefulset: self.tsdb_config.ingester_blocks_statefulset,
+  tsdb_ingester_pdb: self.tsdb_config.ingester_blocks_pdb,
+  tsdb_ingester_service: self.tsdb_config.ingester_blocks_service,
+
+  tsdb_compactor_statefulset: self.tsdb_config.compactor_statefulset,
+
+  // Querier and ruler configuration used during migration, and after.
+  query_config_during_migration:: {
+    // Disable streaming, as it is broken when querying both chunks and blocks ingesters at the same time.
+    'querier.ingester-streaming': 'false',
+
+    // query-store-after is required during migration, since new ingesters running on blocks will not load any chunks from chunks-WAL.
+    // All such chunks are however flushed to the store.
+    'querier.query-store-after': '0',
+  },
+
+  query_config_after_migration:: {
+    'querier.ingester-streaming': 'true',
+    'querier.query-ingesters-within': '13h',  // TSDB ingesters have data for up to 4d.
+    'querier.query-store-after': '12h',  // Can be enabled once blocks ingesters are running for 12h.
+
+    // Switch TSDB and chunks. TSDB is "primary" now so that we can skip querying chunks for old queries.
+    // We can do this, because querier/ruler have both configurations.
+    'store.engine': 'blocks',
+    'querier.second-store-engine': 'chunks',
+
+    'querier.use-second-store-before-time': '2020-07-28T17:00:00Z',  // If migration from chunks finished around 18:10 CEST, no need to query chunk store for queries before this time.
+  },
+
+  querier_args+:: self.tsdb_config.blocks_metadata_caching_config + self.query_config_during_migration,  // + self.query_config_after_migration,
+  ruler_args+:: self.tsdb_config.blocks_metadata_caching_config + self.query_config_during_migration,  // + self.query_config_after_migration,
+}
+```
+
diff --git a/tools/migrate-ingester-statefulsets.sh b/tools/migrate-ingester-statefulsets.sh
@@ -0,0 +1,72 @@
+#!/bin/bash
+
+# Exit on any problems.
+set -e
+
+if [[ $# -lt 4 ]]; then
+    echo "Usage: $0 <namespace> <from_statefulset> <to_statefulset> <instances_to_downscale | 'all'>"
+    exit 1
+fi
+
+NAMESPACE=$1
+DOWNSCALE_STATEFULSET=$2
+UPSCALE_STATEFULSET=$3
+INSTANCES_TO_DOWNSCALE=$4
+
+DOWNSCALE_REPLICA_SIZE=$(kubectl get statefulset "$DOWNSCALE_STATEFULSET" -o 'jsonpath={.spec.replicas}' --namespace="$NAMESPACE")
+UPSCALE_REPLICA_SIZE=$(kubectl get statefulset "$UPSCALE_STATEFULSET" -o 'jsonpath={.spec.replicas}' --namespace="$NAMESPACE")
+
+if [[ "$INSTANCES_TO_DOWNSCALE" = "all" ]]; then
+  INSTANCES_TO_DOWNSCALE=$DOWNSCALE_REPLICA_SIZE
+fi
+
+echo "Going to downscale $NAMESPACE/$DOWNSCALE_STATEFULSET and upscale $NAMESPACE/$UPSCALE_STATEFULSET by $INSTANCES_TO_DOWNSCALE instances"
+
+while [[ $INSTANCES_TO_DOWNSCALE -gt 0 ]]; do
+  echo "----------------------------------------"
+  echo "$(date): Scaling UP $UPSCALE_STATEFULSET to $((UPSCALE_REPLICA_SIZE + 1))"
+  # Scale up
+  kubectl scale statefulset "$UPSCALE_STATEFULSET" --namespace="$NAMESPACE" --current-replicas="$UPSCALE_REPLICA_SIZE" --replicas=$((UPSCALE_REPLICA_SIZE + 1))
+  kubectl rollout status statefulset "$UPSCALE_STATEFULSET" --namespace="$NAMESPACE" --timeout=30m
+  UPSCALE_REPLICA_SIZE=$((UPSCALE_REPLICA_SIZE + 1))
+
+  # Call /shutdown on the pod manually, so that it has enough time to flush chunks. By doing standard termination, pod may not have enough time.
+  # Wget is special BusyBox version. -T allows it to wait for 30m for shutdown to complete.
+  POD_TO_SHUTDOWN=$DOWNSCALE_STATEFULSET-$((DOWNSCALE_REPLICA_SIZE - 1))
+
+  echo "$(date): Triggering flush on $POD_TO_SHUTDOWN"
+
+  # wget (BusyBox version) will fail, but we don't care ... important thing is that it has triggered shutdown.
+  # -T causes wget to wait only 5 seconds, otherwise /shutdown takes a long time.
+  # Preferably we would wait for /shutdown to return, but unfortunately that doesn't work (even with big timeout), wget complains with weird error.
+  kubectl exec "$POD_TO_SHUTDOWN" --namespace="$NAMESPACE" -- wget -T 5 http://localhost:80/shutdown >/dev/null 2>/dev/null || true
+
+  # While request to /shutdown completes only after flushing has finished, it unfortunately returns 204 status code,
+  # which confuses wget. That is the reason why instead of waiting for /shutdown to complete, this script waits for
+  # specific log messages to appear in the log file that signal start/end of data flushing.
+  if kubectl logs -f "$POD_TO_SHUTDOWN" --namespace="$NAMESPACE" | grep -E -q "starting to flush all the chunks|starting to flush and ship TSDB blocks"; then
+    echo "$(date): Flushing started"
+  else
+    echo "$(date): Flushing not started? Check logs for pod $POD_TO_SHUTDOWN"
+    exit 1
+  fi
+
+  if kubectl logs -f "$POD_TO_SHUTDOWN" --namespace="$NAMESPACE" | grep -E -q "flushing of chunks complete|finished flushing and shipping TSDB blocks"; then
+    echo "$(date): Flushing complete"
+  else
+    echo "$(date): Failed to flush? Check logs for pod $POD_TO_SHUTDOWN"
+    exit 1
+  fi
+
+  echo
+
+  echo "$(date): Scaling DOWN $DOWNSCALE_STATEFULSET to $((DOWNSCALE_REPLICA_SIZE - 1))"
+  kubectl scale statefulset "$DOWNSCALE_STATEFULSET" --namespace="$NAMESPACE" --current-replicas="$DOWNSCALE_REPLICA_SIZE" --replicas=$((DOWNSCALE_REPLICA_SIZE - 1))
+  kubectl rollout status statefulset "$DOWNSCALE_STATEFULSET" --namespace="$NAMESPACE" --timeout=30m
+  DOWNSCALE_REPLICA_SIZE=$((DOWNSCALE_REPLICA_SIZE - 1))
+
+  INSTANCES_TO_DOWNSCALE=$((INSTANCES_TO_DOWNSCALE - 1))
+
+  echo "----------------------------------------"
+  echo
+done