Skip to content

Crashing with "panic: failed to migrate ds" #4588

Open
@shashank42

Description

@shashank42

Describe the bug

Deployed using https://github.com/rudderlabs/rudderstack-helm
rudderstack backend keeps crashing and restarting with the error

Screenshot 2024-04-18 at 12 19 59 AM
Defaulted container "rudderstack-backend" out of: rudderstack-backend, rudderstack-telegraf-sidecar
Executing docker entrypoint script
Computed db host to namaste-rudderstack-rudderstack-postgresql-0.namaste-rudderstack-rudderstack-postgresql-headless
2024/04/18 06:57:30 maxprocs: Updating GOMAXPROCS=1: determined from CPU quota
INFO: No .env file found.
2024-04-18T06:57:30.671Z	INFO	rudder-server/main.go:27	Setting memory limit to	{"limit": 1073741824}
2024-04-18T06:57:30.673Z	INFO	stats	stats/statsd.go:44	StatsD client setup succeeded.
2024-04-18T06:57:30.673Z	INFO	stats	stats/statsd.go:89	Stats started successfully in mode "StatsD" with address "localhost:8125"
2024-04-18T06:57:30.675Z	INFO	app	app/app.go:55	Open source version of rudder-server
2024-04-18T06:57:30.681Z	INFO	backend-config	backend-config/single_workspace.go:60	Setup backend config complete	{"workspaceId": ""}
2024-04-18T06:57:33.131Z	INFO	runner	runner/runner.go:344	warehousemode embedded
2024-04-18T06:57:33.133Z	INFO	db.recovery	db/embeddedRecovery.go:42	Starting in normal mode
2024-04-18T06:57:33.479Z	INFO	backend-config	backend-config/backend-config.go:215	Workspace Config changed{"workspaces": 1, "sources": 6}
2024-04-18T06:57:36.430Z	INFO	runner.warehouse.notifier	notifier/notifier.go:165	Initializing Notifier...
2024-04-18T06:57:37.141Z	INFO	runner	runner/runner.go:344	warehousemode embedded
2024-04-18T06:57:37.142Z	INFO	admin	admin/admin.go:167	Serving on admin interface @ /data/rudderstack/rudder-server.sock
2024-04-18T06:57:37.142Z	INFO	apphandlers.EMBEDDED	apphandlers/embeddedAppHandler.go:87	Embedded mode: Starting Rudder Core
2024-04-18T06:57:37.143Z	INFO	apphandlers.EMBEDDED	apphandlers/embeddedAppHandler.go:95	Configured deployment type: "DEDICATED"
2024-04-18T06:57:37.143Z	INFO	apphandlers.EMBEDDED	apphandlers/embeddedAppHandler.go:105	Clearing DB false
2024-04-18T06:57:37.143Z	INFO	debugger.transformation	cache/cache.go:29	Using in-memory cache
2024-04-18T06:57:37.144Z	INFO	debugger.destination	cache/cache.go:29	Using in-memory cache
2024-04-18T06:57:37.145Z	INFO	debugger.source	cache/cache.go:29	Using in-memory cache
2024-04-18T06:57:37.146Z	INFO	runner.warehouse	warehouse/app.go:310	Starting Warehouse service...
2024-04-18T06:57:37.146Z	INFO	runner.warehouse	warehouse/app.go:375	Starting warehouse slave...
2024-04-18T06:57:37.146Z	INFO	runner.warehouse	warehouse/app.go:391	[WH]: Starting warehouse master...
2024-04-18T06:57:37.147Z	INFO	runner.warehouse.api	api/http.go:157	waiting for BackendConfig before starting on 8082
2024-04-18T06:57:37.147Z	INFO	runner.warehouse.api	api/http.go:142	Starting warehouse master service on8082
2024-04-18T06:57:37.148Z	INFO	runner.warehouse.notifier	notifier/notifier.go:312	Deleting all jobs for workspace: default::f597437212cb10a56115df53cd0f6881
2024-04-18T06:57:37.150Z	INFO	runner.warehouse	warehouse/app.go:475	Starting a new Warehouse Destination Router: RS
2024-04-18T06:57:37.150Z	INFO	runner.warehouse.router.RS	router/router.go:147	WH: Warehouse Router started: RS
2024-04-18T06:57:38.035Z	INFO	runner.warehouse.source	source/source.go:143	starting source jobs processing
2024-04-18T06:57:38.240Z	INFO	runner.warehouse.router.RS	router/router.go:254	Received updated workspace config
2024-04-18T06:57:38.635Z	INFO	transformer-features	transformer/features_impl.go:62	Fetching transformer features from http://namaste-rudderstack-transformer:9090
2024-04-18T06:57:39.030Z	INFO	jobsdb.gw	jobsdb/jobsdb.go:840	Connected to gw DB
2024-04-18T06:57:50.231Z	INFO	jobsdb.gw	jobsdb/jobsdb.go:840	Connected to gw DB
2024-04-18T06:57:56.737Z	INFO	jobsdb.rt	jobsdb/jobsdb.go:840	Connected to rt DB
2024-04-18T06:57:59.233Z	INFO	jobsdb.batch_rt	jobsdb/jobsdb.go:840	Connected to batch_rt DB
2024-04-18T06:58:54.035Z	INFO	jobsdb.proc_error	jobsdb/jobsdb.go:840	Connected to proc_error DB
2024-04-18T06:58:55.035Z	INFO	jobsdb.proc_error	jobsdb/jobsdb.go:840	Connected to proc_error DB
2024-04-18T06:58:57.340Z	INFO	jobsdb.esch	jobsdb/jobsdb.go:840	Connected to esch DB
2024-04-18T06:58:58.930Z	INFO	jobsdb.arc	jobsdb/jobsdb.go:840	Connected to arc DB
2024-04-18T06:59:00.036Z	INFO	apphandlers.EMBEDDED	apphandlers/setup.go:114	using Static Cluster Manager
2024-04-18T06:59:00.237Z	INFO	enterprise.suppress-user	suppress-user/factory.go:48	Suppress User feature is enterprise only
2024-04-18T06:59:00.238Z	WARN	rsources	rsources/handler.go:770	shared database is not configured, skipping logical replication monitoring
2024-04-18T06:59:00.239Z	INFO	gateway	gateway/handle_lifecycle.go:363	WebHandler waiting for BackendConfig before starting on 8080
2024-04-18T06:59:00.239Z	INFO	cluster	cluster/dynamic.go:100	Got trigger to change the mode, new mode: NORMAL, old mode: DEGRADED
2024-04-18T06:59:00.239Z	INFO	cluster	cluster/dynamic.go:218	Transiting the server from DegradedMode to NormalMode
2024-04-18T06:59:00.239Z	INFO	cluster	cluster/dynamic.go:123	Starting the server
2024-04-18T06:59:00.242Z	INFO	gateway	gateway/handle_lifecycle.go:195	BackendConfig initialised	{"sources": 6}
2024-04-18T06:59:00.242Z	INFO	gateway	gateway/handle_lifecycle.go:365	WebHandler Starting on 8080
2024-04-18T06:59:00.530Z	INFO	jobsdb.proc_error	jobsdb/backup.go:36	backupDSLoop is running
2024-04-18T06:59:01.737Z	INFO	jobsdb.gw	jobsdb/backup.go:36	backupDSLoop is running
2024-04-18T06:59:02.237Z	INFO	jobsdb.esch	jobsdb/backup.go:36	backupDSLoop is running
2024-04-18T06:59:03.536Z	INFO	jobsdb.arc	jobsdb/backup.go:36	backupDSLoop is running
2024-04-18T06:59:04.738Z	INFO	jobsdb.rt	jobsdb/backup.go:36	backupDSLoop is running
2024-04-18T06:59:13.138Z	INFO	jobsdb.batch_rt	jobsdb/backup.go:36	backupDSLoop is running
2024-04-18T07:08:49.633Z	INFO	archiver	archiver/archiver.go:99	Starting archiver
2024-04-18T07:08:49.635Z	INFO	processor	processor/processor.go:628	Starting processor in isolation mode: source
2024-04-18T07:08:49.637Z	INFO	processor	processor/processor.go:659	Starting pinger loop
2024-04-18T07:08:49.637Z	INFO	processor	processor/processor.go:661	Backend config received
2024-04-18T07:08:49.637Z	INFO	processor	processor/processor.go:664	Waiting for async init group
2024-04-18T07:08:49.637Z	INFO	processor	processor/processor.go:671	Async init group done
2024-04-18T07:08:49.637Z	INFO	processor	processor/processor.go:674	Waiting for transformer features
2024-04-18T07:08:49.637Z	INFO	processor	processor/processor.go:681	Transformer features received
2024-04-18T07:08:49.830Z	INFO	processor.stash	stash/stash.go:296	Processor errors stash loop started
2024-04-18T07:09:43.545Z	ERROR	utils.misc	misc/misc.go:926	Panic detected. Application will crash.{"stack": "goroutine 5537 [running]:\nruntime/debug.Stack()\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x5e\ngithub.com/rudderlabs/rudder-server/utils/misc.BugsnagNotify.func1.1()\n\t/rudder-server/utils/misc/misc.go:927 +0x37f\nsync.(*Once).doSlow(0x4778d2?, 0xc002607d78?)\n\t/usr/local/go/src/sync/once.go:74 +0xc2\nsync.(*Once).Do(...)\n\t/usr/local/go/src/sync/once.go:65\ngithub.com/rudderlabs/rudder-server/utils/misc.BugsnagNotify.func1()\n\t/rudder-server/utils/misc/misc.go:916 +0xbb\npanic({0x34155c0?, 0xc006a39540?})\n\t/usr/local/go/src/runtime/panic.go:770 +0x132\ngithub.com/rudderlabs/rudder-server/jobsdb.(*Handle).migrateDSLoop(0xc000cdd188, {0x4377970, 0xc0025f6ff0})\n\t/rudder-server/jobsdb/migration.go:54 +0x1ff\ngithub.com/rudderlabs/rudder-server/jobsdb.(*Handle).startMigrateDSLoop.func1()\n\t/rudder-server/jobsdb/migration.go:28 +0x1f\ngithub.com/rudderlabs/rudder-server/jobsdb.(*Handle).startMigrateDSLoop.WithBugsnag.func2()\n\t/rudder-server/utils/misc/misc.go:945 +0x5c\ngolang.org/x/sync/errgroup.(*Group).Go.func1()\n\t/go/pkg/mod/golang.org/x/sync@v0.6.0/errgroup/errgroup.go:78 +0x56\ncreated by golang.org/x/sync/errgroup.(*Group).Go in goroutine 2162\n\t/go/pkg/mod/golang.org/x/sync@v0.6.0/errgroup/errgroup.go:75 +0x96\n", "panic": "failed to migrate ds: pq: canceling statement due to user request; driver: bad connection", "team": "Core", "goRoutines": 685, "version": "1.23.0", "releaseStage": "development"}
2024-04-18T07:09:43.545Z	ERROR	utils.misc	misc/misc.go:926	goroutine 5537 [running]:
github.com/rudderlabs/rudder-go-kit/logger.(*logger).Fatalw(0xc00128f9e0, {0x3a85163?, 0x4c6?}, {0xc0011ae600?, 0x1?, 0x34155c0?})
	/go/pkg/mod/github.com/rudderlabs/rudder-go-kit@v0.23.3/logger/logger.go:332 +0xa5
github.com/rudderlabs/rudder-server/utils/misc.BugsnagNotify.func1.1()
	/rudder-server/utils/misc/misc.go:926 +0x5b3
sync.(*Once).doSlow(0x4778d2?, 0xc002607d78?)
	/usr/local/go/src/sync/once.go:74 +0xc2
sync.(*Once).Do(...)
	/usr/local/go/src/sync/once.go:65
github.com/rudderlabs/rudder-server/utils/misc.BugsnagNotify.func1()
	/rudder-server/utils/misc/misc.go:916 +0xbb
panic({0x34155c0?, 0xc006a39540?})
	/usr/local/go/src/runtime/panic.go:770 +0x132
github.com/rudderlabs/rudder-server/jobsdb.(*Handle).migrateDSLoop(0xc000cdd188, {0x4377970, 0xc0025f6ff0})
	/rudder-server/jobsdb/migration.go:54 +0x1ff
github.com/rudderlabs/rudder-server/jobsdb.(*Handle).startMigrateDSLoop.func1()
	/rudder-server/jobsdb/migration.go:28 +0x1f
github.com/rudderlabs/rudder-server/jobsdb.(*Handle).startMigrateDSLoop.WithBugsnag.func2()
	/rudder-server/utils/misc/misc.go:945 +0x5c
golang.org/x/sync/errgroup.(*Group).Go.func1()
	/go/pkg/mod/golang.org/x/sync@v0.6.0/errgroup/errgroup.go:78 +0x56
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 2162
	/go/pkg/mod/golang.org/x/sync@v0.6.0/errgroup/errgroup.go:75 +0x96

2024/04/18 07:09:43 notifying bugsnag: failed to migrate ds: pq: canceling statement due to user request; driver: bad connection
panic: failed to migrate ds: pq: canceling statement due to user request; driver: bad connection [recovered]
	panic: failed to migrate ds: pq: canceling statement due to user request; driver: bad connection [recovered]
	panic: failed to migrate ds: pq: canceling statement due to user request; driver: bad connection

goroutine 5537 [running]:
github.com/bugsnag/bugsnag-go/v2.AutoNotify({0xc00740dac0, 0x3, 0xc001969500?})
	/go/pkg/mod/github.com/bugsnag/bugsnag-go/v2@v2.3.1/bugsnag.go:114 +0x3ef
panic({0x34155c0?, 0xc006a39540?})
	/usr/local/go/src/runtime/panic.go:770 +0x132
github.com/rudderlabs/rudder-server/utils/misc.BugsnagNotify.func1.1()
	/rudder-server/utils/misc/misc.go:935 +0x5d4
sync.(*Once).doSlow(0x4778d2?, 0xc002607d78?)
	/usr/local/go/src/sync/once.go:74 +0xc2
sync.(*Once).Do(...)
	/usr/local/go/src/sync/once.go:65
github.com/rudderlabs/rudder-server/utils/misc.BugsnagNotify.func1()
	/rudder-server/utils/misc/misc.go:916 +0xbb
panic({0x34155c0?, 0xc006a39540?})
	/usr/local/go/src/runtime/panic.go:770 +0x132
github.com/rudderlabs/rudder-server/jobsdb.(*Handle).migrateDSLoop(0xc000cdd188, {0x4377970, 0xc0025f6ff0})
	/rudder-server/jobsdb/migration.go:54 +0x1ff
github.com/rudderlabs/rudder-server/jobsdb.(*Handle).startMigrateDSLoop.func1()
	/rudder-server/jobsdb/migration.go:28 +0x1f
github.com/rudderlabs/rudder-server/jobsdb.(*Handle).startMigrateDSLoop.WithBugsnag.func2()
	/rudder-server/utils/misc/misc.go:945 +0x5c
golang.org/x/sync/errgroup.(*Group).Go.func1()
	/go/pkg/mod/golang.org/x/sync@v0.6.0/errgroup/errgroup.go:78 +0x56
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 2162
	/go/pkg/mod/golang.org/x/sync@v0.6.0/errgroup/errgroup.go:75 +0x96

While at the same time postgress machine logs the following error

2024-04-18 07:57:09.839 UTC [16135] STATEMENT:  DELETE FROM "batch_rt_job_status_435_2"
							WHERE NOT id = ANY(
								SELECT DISTINCT ON (job_id) id from "batch_rt_job_status_435_2" ORDER BY job_id ASC, id DESC
							)
2024-04-18 07:57:10.730 UTC [16111] LOG:  unexpected EOF on client connection with an open transaction
2024-04-18 07:57:11.328 UTC [18493] LOG:  could not receive data from client: Connection reset by peer
2024-04-18 07:57:11.347 UTC [18494] LOG:  could not receive data from client: Connection reset by peer
2024-04-18 07:57:11.429 UTC [18490] LOG:  could not receive data from client: Connection reset by peer
2024-04-18 07:57:11.441 UTC [18488] LOG:  could not send data to client: Broken pipe
2024-04-18 07:57:11.453 UTC [18488] FATAL:  connection to client lost
2024-04-18 07:57:11.538 UTC [18491] LOG:  could not receive data from client: Connection reset by peer
2024-04-18 07:57:11.543 UTC [18435] LOG:  could not send data to client: Connection reset by peer
2024-04-18 07:57:11.543 UTC [18435] STATEMENT:  SELECT * FROM (SELECT
										jobs.job_id, jobs.uuid, jobs.user_id, jobs.parameters, jobs.custom_val, jobs.event_payload, jobs.event_count,
										jobs.created_at, jobs.expire_at, jobs.workspace_id,
										pg_column_size(jobs.event_payload) as payload_size,
										sum(jobs.event_count) over (order by jobs.job_id asc) as running_event_counts,
										sum(pg_column_size(jobs.event_payload)) over (order by jobs.job_id) as running_payload_size,
										job_latest_state.job_state, job_latest_state.attempt,
										job_latest_state.exec_time, job_latest_state.retry_time,
										job_latest_state.error_code, job_latest_state.error_response, job_latest_state.parameters
									FROM
										"gw_jobs_545" AS jobs
										LEFT JOIN "gw_job_status_545" job_latest_state ON jobs.job_id=job_latest_state.job_id
									    WHERE ((job_latest_state.job_id IS NULL)) AND ((jobs.custom_val='GW')) AND (jobs.parameters->>'source_id'='2dPZSbNyzCt11TNn88HyTi3hF7N')
										ORDER BY jobs.job_id  LIMIT 772 ) t WHERE running_event_counts - t.event_count <= $1 AND running_payload_size - t.payload_size <= $2
2024-04-18 07:57:11.543 UTC [18435] FATAL:  connection to client lost
2024-04-18 07:57:11.543 UTC [18435] STATEMENT:  SELECT * FROM (SELECT
										jobs.job_id, jobs.uuid, jobs.user_id, jobs.parameters, jobs.custom_val, jobs.event_payload, jobs.event_count,
										jobs.created_at, jobs.expire_at, jobs.workspace_id,
										pg_column_size(jobs.event_payload) as payload_size,
										sum(jobs.event_count) over (order by jobs.job_id asc) as running_event_counts,
										sum(pg_column_size(jobs.event_payload)) over (order by jobs.job_id) as running_payload_size,
										job_latest_state.job_state, job_latest_state.attempt,
										job_latest_state.exec_time, job_latest_state.retry_time,
										job_latest_state.error_code, job_latest_state.error_response, job_latest_state.parameters
									FROM
										"gw_jobs_545" AS jobs
										LEFT JOIN "gw_job_status_545" job_latest_state ON jobs.job_id=job_latest_state.job_id
									    WHERE ((job_latest_state.job_id IS NULL)) AND ((jobs.custom_val='GW')) AND (jobs.parameters->>'source_id'='2dPZSbNyzCt11TNn88HyTi3hF7N')
										ORDER BY jobs.job_id  LIMIT 772 ) t WHERE running_event_counts - t.event_count <= $1 AND running_payload_size - t.payload_size <= $2
2024-04-18 07:57:11.932 UTC [18470] LOG:  unexpected EOF on client connection with an open transaction
2024-04-18 07:57:12.031 UTC [18499] LOG:  incomplete startup packet
2024-04-18 07:57:17.140 UTC [18514] ERROR:  relation "rsources_failed_keys_v2" already exists
2024-04-18 07:57:17.140 UTC [18514] STATEMENT:  create table "rsources_failed_keys_v2" (
			id VARCHAR(27) COLLATE "C",
			db_name text not null default 'namaste-rudderstack-rudderstack-postgresql-0.namaste-rudderstack-rudderstack-postgresql-headless',
			job_run_id text not null,
			task_run_id text not null,
			source_id text not null,
			destination_id text not null,
			primary key (id),
			unique (job_run_id, task_run_id, source_id, destination_id, db_name)
		)
2024-04-18 07:57:17.143 UTC [18514] ERROR:  relation "rsources_failed_keys_v2_records" already exists
2024-04-18 07:57:17.143 UTC [18514] STATEMENT:  create table "rsources_failed_keys_v2_records" (
			id VARCHAR(27) COLLATE "C",
			record_id text not null,
	    	ts timestamp not null default NOW(),
			primary key (id, record_id)
		)
2024-04-18 07:57:17.338 UTC [18514] ERROR:  relation "rsources_stats" already exists
2024-04-18 07:57:17.338 UTC [18514] STATEMENT:  create table "rsources_stats" (
			db_name text not null default 'namaste-rudderstack-rudderstack-postgresql-0.namaste-rudderstack-rudderstack-postgresql-headless',
			job_run_id text not null,
			task_run_id text not null,
			source_id text not null,
			destination_id text not null,
			in_count integer not null default 0,
			out_count integer not null default 0,
			failed_count integer not null default 0,
			ts timestamp not null default NOW(),
			primary key (db_name, job_run_id, task_run_id, source_id, destination_id)
		)
2024-04-18 07:57:20.533 UTC [18425] ERROR:  canceling autovacuum task
2024-04-18 07:57:20.533 UTC [18425] CONTEXT:  automatic vacuum of table "jobsdb.public.gw_job_status_511_1"
2024-04-18 07:57:21.128 UTC [18148] LOG:  unexpected EOF on client connection with an open transaction

While there are no errors on transformer

➜  rudderstack-helm git:(master) ✗ kubectl logs --tail=10000 -f namaste-rudderstack-transformer-f45745c8f-8psss

> rudder-transformer@1.62.1 start
> cd dist;node ./src/index.js;cd ..

setting up statsd client
Stack Depth set: 64
Interval Bytes set: 524288
Transformer: Swagger route loading
Transformer: Swagger route loaded
{"extraData":{},"level":"info","message":"Using new routes","timestamp":"2024-04-18 06:48:30"}
Master (pid: 19) has started
Worker (pid: 26) is online
setting up statsd client
Stack Depth set: 64
Interval Bytes set: 524288
Transformer: Swagger route loading
Transformer: Swagger route loaded
{"extraData":{},"level":"info","message":"Using new routes","timestamp":"2024-04-18 06:48:33"}
Worker (pid: 26) has started
App started. Listening on port: 9090

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions