We deployed a new schema on a periodic table config, but not on the periodic table rotation day. When the new table was created, it was immediately deleted, leaving nowhere to write index entries and causing all the ingesters in the cluster to OOM and die.
I took a brief look at what happened, and based on the logs it definitely looks related to the grace period config for table creation. Our environment had a 3-hour grace period configured, and here are the logs:
level=info ts=2019-12-10T20:39:20.402311965Z caller=table_manager.go:220 msg="synching tables" expected_tables=10
level=info ts=2019-12-10T20:49:20.402248128Z caller=table_manager.go:220 msg="synching tables" expected_tables=10
level=info ts=2019-12-10T20:59:20.402287276Z caller=table_manager.go:220 msg="synching tables" expected_tables=10
level=info ts=2019-12-10T21:09:20.402451685Z caller=table_manager.go:220 msg="synching tables" expected_tables=10
level=info ts=2019-12-10T21:10:42.619662501Z caller=table_manager.go:374 msg="table has exceeded the retention period" table=loki_prod_index_2605
level=info ts=2019-12-10T21:10:42.61972737Z caller=table_manager.go:379 msg="deleting table" table=loki_prod_index_2605
level=info ts=2019-12-10T21:10:42.862172893Z caller=table_manager.go:363 msg="creating table" table=loki_prod_index_2595
level=info ts=2019-12-10T21:19:20.402302368Z caller=table_manager.go:220 msg="synching tables" expected_tables=10
level=info ts=2019-12-10T21:29:20.402288783Z caller=table_manager.go:220 msg="synching tables" expected_tables=10
level=info ts=2019-12-10T21:39:20.402273464Z caller=table_manager.go:220 msg="synching tables" expected_tables=10
level=info ts=2019-12-10T21:49:20.402482189Z caller=table_manager.go:220 msg="synching tables" expected_tables=10
level=info ts=2019-12-10T21:50:53.380202176Z caller=signals.go:54 msg="=== received SIGINT/SIGTERM ===\n*** exiting"
level=info ts=2019-12-10T21:50:53.380301092Z caller=loki.go:175 msg="notifying module about stopping" module=table-manager
level=info ts=2019-12-10T21:50:53.380324053Z caller=loki.go:175 msg="notifying module about stopping" module=server
level=info ts=2019-12-10T21:50:53.380333581Z caller=loki.go:157 msg=stopping module=table-manager
level=info ts=2019-12-10T21:50:53.3803468Z caller=loki.go:157 msg=stopping module=server
level=info ts=2019-12-10T21:50:56.60766565Z caller=loki.go:125 msg=initialising module=server
level=info ts=2019-12-10T21:50:56.607991288Z caller=server.go:121 http=[::]:80 grpc=[::]:9095 msg="server listening on addresses"
level=info ts=2019-12-10T21:50:56.617433538Z caller=loki.go:125 msg=initialising module=table-manager
level=info ts=2019-12-10T21:50:56.7346636Z caller=main.go:70 msg="Starting Loki" version="(version=v1.2.0, branch=HEAD, revision=ccef3da2)"
level=info ts=2019-12-10T21:51:12.690268924Z caller=table_manager.go:220 msg="synching tables" expected_tables=10
level=info ts=2019-12-10T21:57:10.355836814Z caller=signals.go:54 msg="=== received SIGINT/SIGTERM ===\n*** exiting"
level=info ts=2019-12-10T21:57:10.355920319Z caller=loki.go:175 msg="notifying module about stopping" module=table-manager
level=info ts=2019-12-10T21:57:10.355938636Z caller=loki.go:175 msg="notifying module about stopping" module=server
level=info ts=2019-12-10T21:57:10.355944517Z caller=loki.go:157 msg=stopping module=table-manager
level=info ts=2019-12-10T21:57:10.355959739Z caller=loki.go:157 msg=stopping module=server
level=info ts=2019-12-10T21:57:11.846009055Z caller=loki.go:125 msg=initialising module=server
level=info ts=2019-12-10T21:57:11.846266472Z caller=server.go:121 http=[::]:80 grpc=[::]:9095 msg="server listening on addresses"
level=info ts=2019-12-10T21:57:11.846550814Z caller=loki.go:125 msg=initialising module=table-manager
level=info ts=2019-12-10T21:57:11.929963123Z caller=main.go:70 msg="Starting Loki" version="(version=v1.2.0, branch=HEAD, revision=ccef3da2)"
level=info ts=2019-12-10T22:05:46.019484239Z caller=table_manager.go:220 msg="synching tables" expected_tables=10
level=info ts=2019-12-10T22:07:08.451749363Z caller=table_manager.go:374 msg="table has exceeded the retention period" table=loki_prod_index_2605
level=info ts=2019-12-10T22:15:46.019608647Z caller=table_manager.go:220 msg="synching tables" expected_tables=10
level=info ts=2019-12-10T22:15:46.078944287Z caller=table_manager.go:374 msg="table has exceeded the retention period" table=loki_prod_index_2605
level=info ts=2019-12-10T22:25:46.019555476Z caller=table_manager.go:220 msg="synching tables" expected_tables=10
level=info ts=2019-12-10T22:25:46.100326433Z caller=table_manager.go:374 msg="table has exceeded the retention period" table=loki_prod_index_2605
level=info ts=2019-12-10T22:35:46.019693276Z caller=table_manager.go:220 msg="synching tables" expected_tables=10
level=info ts=2019-12-10T22:35:46.07260004Z caller=table_manager.go:374 msg="table has exceeded the retention period" table=loki_prod_index_2605
level=info ts=2019-12-10T22:45:46.019580828Z caller=table_manager.go:220 msg="synching tables" expected_tables=10
level=info ts=2019-12-10T22:47:08.438978579Z caller=table_manager.go:374 msg="table has exceeded the retention period" table=loki_prod_index_2605
level=info ts=2019-12-10T22:55:46.019665922Z caller=table_manager.go:220 msg="synching tables" expected_tables=10
level=info ts=2019-12-10T22:55:46.123571474Z caller=table_manager.go:374 msg="table has exceeded the retention period" table=loki_prod_index_2605
level=info ts=2019-12-10T23:05:46.019622201Z caller=table_manager.go:220 msg="synching tables" expected_tables=10
level=info ts=2019-12-10T23:05:46.068127351Z caller=table_manager.go:374 msg="table has exceeded the retention period" table=loki_prod_index_2605
level=info ts=2019-12-10T23:15:46.019566173Z caller=table_manager.go:220 msg="synching tables" expected_tables=10
level=info ts=2019-12-10T23:15:46.091349825Z caller=table_manager.go:374 msg="table has exceeded the retention period" table=loki_prod_index_2605
level=info ts=2019-12-10T23:25:46.019582005Z caller=table_manager.go:220 msg="synching tables" expected_tables=10
level=info ts=2019-12-10T23:27:08.442378888Z caller=table_manager.go:374 msg="table has exceeded the retention period" table=loki_prod_index_2605
level=info ts=2019-12-10T23:35:46.019752522Z caller=table_manager.go:220 msg="synching tables" expected_tables=10
level=info ts=2019-12-10T23:35:46.094657878Z caller=table_manager.go:374 msg="table has exceeded the retention period" table=loki_prod_index_2605
level=info ts=2019-12-10T23:45:46.019572476Z caller=table_manager.go:220 msg="synching tables" expected_tables=10
level=info ts=2019-12-10T23:45:46.094836283Z caller=table_manager.go:374 msg="table has exceeded the retention period" table=loki_prod_index_2605
level=info ts=2019-12-10T23:55:46.020098629Z caller=table_manager.go:220 msg="synching tables" expected_tables=10
level=info ts=2019-12-10T23:57:08.444209386Z caller=table_manager.go:374 msg="table has exceeded the retention period" table=loki_prod_index_2605
level=info ts=2019-12-11T00:05:46.019544203Z caller=table_manager.go:220 msg="synching tables" expected_tables=11
level=info ts=2019-12-11T00:15:46.019689019Z caller=table_manager.go:220 msg="synching tables" expected_tables=11
level=info ts=2019-12-11T00:25:46.019563912Z caller=table_manager.go:220 msg="synching tables" expected_tables=11
There were a couple of restarts after the initial incident; those were us diagnosing the problem and then disabling deletes on tables, because we couldn't stop it from deleting the table.
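
To illustrate the suspicion above, here is a minimal, hypothetical sketch (not the actual table_manager.go logic; the names, period/retention values, and arithmetic are all assumptions) of how a creation grace period and a retention cutoff could disagree when a schema's start date falls mid-period:

```go
// Hypothetical sketch only, not Loki's actual code. It assumes the table
// manager derives the set of expected tables from "now", using a creation
// grace period on one end and the retention period on the other.
package main

import (
	"fmt"
	"time"
)

const (
	tablePeriod   = 168 * time.Hour  // weekly tables (assumed)
	retention     = 10 * tablePeriod // retention spanning ~10 tables (assumed)
	creationGrace = 3 * time.Hour    // the 3h grace period from our config
)

// tableNumber maps a point in time to its periodic table index.
func tableNumber(t time.Time) int64 {
	return t.Unix() / int64(tablePeriod/time.Second)
}

func main() {
	now := time.Now()

	// Tables are created ahead of time: anything whose period starts before
	// now + creationGrace is expected to exist.
	newest := tableNumber(now.Add(creationGrace))

	// Retention is applied from the other end: anything whose period ended
	// before now - retention is considered past retention and deleted.
	oldest := tableNumber(now.Add(-retention))

	fmt.Printf("expected tables: %d..%d\n", oldest, newest)

	// If a new schema's "from" date is not aligned to a rotation boundary,
	// the table derived from that date can land outside [oldest, newest],
	// so it would be created and then immediately judged past retention.
	// That may be why loki_prod_index_2605 was deleted for exceeding
	// retention in the same sync that created loki_prod_index_2595.
}
```

This is only a guess at the interaction; the real check is in table_manager.go around the call sites shown in the logs above (lines 363, 374, and 379).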