Worker fails to recover table with exactly_once guarantee #47

jkgenser · 2020-11-29T17:23:31Z

Checklist

[] I have included information about relevant versions
[] I have verified that the issue persists when using the master branch of Faust.

Steps to reproduce

1. Configure the application with processing_guarantee="exactly_once".
2. Publish 5 messages to a topic, keyed by id.
3. Repartition the topic using group_by(new_id).
4. Increment a count in a table keyed by new_id.

Initially the worker is up, processes the messages, and stores the correct data in the changelog topic.
I then send SIGTERM to stop the worker.
On restart, the worker gets stuck in recovery, as shown in the logs below.
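
For illustration, the data flow in the steps above can be modeled in plain Python. This is not Faust code: the message shape, the workflow-N key values, and the helper names process and recover are hypothetical. It only shows what the table and changelog should contain after the five messages, and that replaying the changelog is expected to rebuild the same table on restart.

```python
from collections import defaultdict

# Five messages keyed by id; new_id is the (hypothetical) repartition key.
messages = [{"id": i, "new_id": f"workflow-{i % 2}"} for i in range(5)]


def process(msgs):
    """Count messages per new_id and record every table update in a changelog."""
    table = defaultdict(int)
    changelog = []
    for msg in msgs:
        key = msg["new_id"]                  # stands in for group_by(new_id)
        table[key] += 1                      # increment the count for that key
        changelog.append((key, table[key]))  # one changelog entry per update
    return dict(table), changelog


def recover(changelog):
    """Rebuild the table by replaying the changelog; the last value per key wins."""
    table = {}
    for key, value in changelog:
        table[key] = value
    return table


table, changelog = process(messages)
recovered = recover(changelog)
```

In this model, recovery finishes as soon as every changelog entry has been replayed, which is the behavior the worker is expected to show after a restart.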

Tell us what you did to cause something to happen.

Possibly an issue with the transactional producer: a transaction that gets aborted may leave the worker unable to recover.
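
One way a hang like this could arise, offered as an assumption rather than a confirmed diagnosis: Kafka writes a commit or abort marker into the partition for each transaction, the marker occupies an offset, and it is never delivered to the application. A recovery loop that waits for a delivered message at the last offset would then wait forever. The sketch below models this in plain Python; Record, consume, and both completion checks are hypothetical names, not Faust or Kafka client APIs.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Record:
    offset: int
    value: Optional[str]      # None for a transaction control marker
    is_control: bool = False  # True for commit/abort markers


def consume(log: List[Record], position: int) -> Tuple[List[Record], int]:
    """Deliver data records only; the position still advances past markers."""
    delivered = [r for r in log[position:] if not r.is_control]
    return delivered, len(log)


# Two changelog updates followed by the transaction's marker at offset 2.
log = [Record(0, "a"), Record(1, "b"), Record(2, None, is_control=True)]
end_offset = len(log)  # next offset to be written

delivered, position = consume(log, 0)
last_delivered_offset = delivered[-1].offset if delivered else -1

# Hypothetical check 1: wait until a delivered message reaches the last offset.
# The marker at offset 2 is never delivered, so this never becomes true.
naive_done = last_delivered_offset >= end_offset - 1

# Hypothetical check 2: compare the consumer position with the end offset.
position_done = position >= end_offset
```

Under these assumptions the first check stalls with all data already read, which would match a worker that keeps reporting no new events on the changelog partition.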

Expected behavior

Tell us what you expected to happen.

Worker recovers and is able to process events.

Actual behavior

Tell us what happened instead.

Worker hangs on recovery.

Full traceback

Log showing this behavior.

[2020-11-29 15:55:45,969] [114] [WARNING] [^---Recovery]: No event received for active tp TP(topic='meteor-submission-count-by-workflow-changelog', partition=0) in the last 30.0 seconds (last event received 1.04 minute ago) 
[2020-11-29 15:55:50,974] [114] [WARNING] [^---Recovery]: No event received for active tp TP(topic='meteor-submission-count-by-workflow-changelog', partition=0) in the last 30.0 seconds (last event received 1.13 minute ago) 
[2020-11-29 15:55:55,970] [114] [WARNING] [^---Recovery]: No event received for active tp TP(topic='meteor-submission-count-by-workflow-changelog', partition=0) in the last 30.0 seconds (last event received 1.21 minute ago) 
[2020-11-29 15:56:00,976] [114] [WARNING] [^---Recovery]: No event received for active tp TP(topic='meteor-submission-count-by-workflow-changelog', partition=0) in the last 30.0 seconds (last event received 1.29 minute ago) 
[2020-11-29 15:56:05,972] [114] [WARNING] [^---Recovery]: No event received for active tp TP(topic='meteor-submission-count-by-workflow-changelog', partition=0) in the last 30.0 seconds (last event received 1.38 minute ago) 
[2020-11-29 15:56:10,977] [114] [WARNING] [^---Recovery]: No event received for active tp TP(topic='meteor-submission-count-by-workflow-changelog', partition=0) in the last 30.0 seconds (last event received 1.46 minute ago) 
[2020-11-29 15:56:15,974] [114] [WARNING] [^---Recovery]: No event received for active tp TP(topic='meteor-submission-count-by-workflow-changelog', partition=0) in the last 30.0 seconds (last event received 1.54 minute ago) 
[2020-11-29 15:56:20,979] [114] [WARNING] [^---Recovery]: No event received for active tp TP(topic='meteor-submission-count-by-workflow-changelog', partition=0) in the last 30.0 seconds (last event received 1.63 minute ago) 
[2020-11-29 15:56:25,975] [114] [WARNING] [^---Recovery]: No event received for active tp TP(topic='meteor-submission-count-by-workflow-changelog', partition=0) in the last 30.0 seconds (last event received 1.71 minute ago) 
[2020-11-29 15:56:30,980] [114] [WARNING] [^---Recovery]: No event received for active tp TP(topic='meteor-submission-count-by-workflow-changelog', partition=0) in the last 30.0 seconds (last event received 1.79 minute ago) 
[2020-11-29 15:56:35,977] [114] [WARNING] [^---Recovery]: No event received for active tp TP(topic='meteor-submission-count-by-workflow-changelog', partition=0) in the last 30.0 seconds (last event received 1.88 minute ago) 
[2020-11-29 15:56:40,982] [114] [WARNING] [^---Recovery]: Recovery has not flushed buffers in the last 120.0 seconds (last flush was 2.00 minutes ago). Current total buffer size: 5

Versions

Python version: 3.7
Faust version: 0.3.0
Operating system: Ubuntu 18.04
Kafka version: 2.6.0
RocksDB version (if applicable)


patkivikram added a commit that referenced this issue Nov 30, 2020

Fixing issues #47 and #48

f125677

patkivikram mentioned this issue Nov 30, 2020

Fix recovery issue in transaction and reprocessing message in consumer #49

Merged

patkivikram added a commit that referenced this issue Nov 30, 2020

Fix recovery issue in transaction and reprocessing message in consumer (#49) * Fixing issues #47 and #48 * fix linting

be1e6db

patkivikram closed this as completed Dec 7, 2020
