Worker fails to recover table with exactly_once guarantee #47

Closed
jkgenser opened this issue Nov 29, 2020 · 0 comments


Checklist

  • [] I have included information about relevant versions
  • [] I have verified that the issue persists when using the master branch of Faust.

Steps to reproduce

  • The application is configured with processing_guarantee="exactly_once".

  • Publish 5 messages to a topic, keyed by id.

  • Repartition the topic using group_by(new_id).

  • Increment a count in a table whose keys are new_id.

  • Initially the worker is up, processes the messages, and stores the correct data in the changelog topic.

  • Send SIGTERM to stop the worker.

  • On restart, the worker gets stuck in recovery, as shown in the logs below.
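
To pin down what "correct data in the changelog topic" means, here is a minimal pure-Python model of the steps above. It is not Faust code. The message payloads, the `wf-1`/`wf-2` key values, and the helper names `group_by`, `run_pipeline`, and `recover` are all made up for illustration; the report only states that there are 5 messages keyed by id and re-keyed by new_id.

```python
from collections import Counter

# Hypothetical payloads: only the count (5) and the two field names
# (id, new_id) come from the report. The values are invented.
messages = [
    {"id": "a1", "new_id": "wf-1"},
    {"id": "a2", "new_id": "wf-1"},
    {"id": "a3", "new_id": "wf-2"},
    {"id": "a4", "new_id": "wf-2"},
    {"id": "a5", "new_id": "wf-2"},
]


def group_by(records, key_field):
    """Re-key each record by key_field, standing in for the repartition step."""
    for record in records:
        yield record[key_field], record


def run_pipeline(records):
    """Count records per new key and log every table update to a changelog."""
    table = Counter()
    changelog = []  # one (key, new_value) entry per table update
    for key, _record in group_by(records, "new_id"):
        table[key] += 1
        changelog.append((key, table[key]))
    return table, changelog


def recover(changelog):
    """Rebuild the table by replaying the changelog; the last write per key wins."""
    table = Counter()
    for key, value in changelog:
        table[key] = value
    return table


table, changelog = run_pipeline(messages)
recovered = recover(changelog)
```

In this model the changelog holds 5 entries and replaying it reproduces the table exactly, which is the behavior expected from the worker after a restart.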

Possibly an issue with the transactional producer: a transaction that gets aborted may leave the worker unable to recover.
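
One way a transactional changelog could stall a recovery loop is sketched below. This is only a toy model of a Kafka log, not Faust's recovery code, and nothing in this report confirms it is the actual cause. It relies on one known property of Kafka transactions: commit and abort markers occupy offsets in the partition but are never delivered to consumers as records. The names `LogEntry`, `build_log`, and `consume`, and the split of the 5 records into two transactions, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    offset: int
    kind: str  # "data", or "marker" for a commit/abort control record
    value: object = None


def build_log(transactions):
    """Append each transaction's records, followed by one control marker."""
    log = []
    for records in transactions:
        for value in records:
            log.append(LogEntry(len(log), "data", value))
        log.append(LogEntry(len(log), "marker"))
    return log


def consume(log):
    """Return the data records a consumer would see, plus its final position.

    Markers are skipped silently, but the position (next offset to fetch)
    still advances past them.
    """
    records, position = [], 0
    for entry in log:
        if entry.kind == "data":
            records.append(entry)
        position = entry.offset + 1
    return records, position


# Two hypothetical transactions carrying 5 changelog records in total.
log = build_log([["a", "b"], ["c", "d", "e"]])
highwater = len(log)  # next offset to be written
seen, position = consume(log)
last_seen_offset = seen[-1].offset

# A naive check that waits for a record at offset highwater - 1 never passes,
# because that offset holds a marker rather than a data record.
naive_done = last_seen_offset >= highwater - 1

# A check based on the consumer position does pass.
position_done = position >= highwater
```

Here 5 records are delivered, the last at offset 5, while the high watermark is 7. A completion check tied to the last delivered record would wait forever; one tied to the consumer position would finish.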

Expected behavior

Worker recovers and is able to process events.

Actual behavior

Worker hangs on recovery.

Full traceback

[2020-11-29 15:55:45,969] [114] [WARNING] [^---Recovery]: No event received for active tp TP(topic='meteor-submission-count-by-workflow-changelog', partition=0) in the last 30.0 seconds (last event received 1.04 minute ago) 
[2020-11-29 15:55:50,974] [114] [WARNING] [^---Recovery]: No event received for active tp TP(topic='meteor-submission-count-by-workflow-changelog', partition=0) in the last 30.0 seconds (last event received 1.13 minute ago) 
[2020-11-29 15:55:55,970] [114] [WARNING] [^---Recovery]: No event received for active tp TP(topic='meteor-submission-count-by-workflow-changelog', partition=0) in the last 30.0 seconds (last event received 1.21 minute ago) 
[2020-11-29 15:56:00,976] [114] [WARNING] [^---Recovery]: No event received for active tp TP(topic='meteor-submission-count-by-workflow-changelog', partition=0) in the last 30.0 seconds (last event received 1.29 minute ago) 
[2020-11-29 15:56:05,972] [114] [WARNING] [^---Recovery]: No event received for active tp TP(topic='meteor-submission-count-by-workflow-changelog', partition=0) in the last 30.0 seconds (last event received 1.38 minute ago) 
[2020-11-29 15:56:10,977] [114] [WARNING] [^---Recovery]: No event received for active tp TP(topic='meteor-submission-count-by-workflow-changelog', partition=0) in the last 30.0 seconds (last event received 1.46 minute ago) 
[2020-11-29 15:56:15,974] [114] [WARNING] [^---Recovery]: No event received for active tp TP(topic='meteor-submission-count-by-workflow-changelog', partition=0) in the last 30.0 seconds (last event received 1.54 minute ago) 
[2020-11-29 15:56:20,979] [114] [WARNING] [^---Recovery]: No event received for active tp TP(topic='meteor-submission-count-by-workflow-changelog', partition=0) in the last 30.0 seconds (last event received 1.63 minute ago) 
[2020-11-29 15:56:25,975] [114] [WARNING] [^---Recovery]: No event received for active tp TP(topic='meteor-submission-count-by-workflow-changelog', partition=0) in the last 30.0 seconds (last event received 1.71 minute ago) 
[2020-11-29 15:56:30,980] [114] [WARNING] [^---Recovery]: No event received for active tp TP(topic='meteor-submission-count-by-workflow-changelog', partition=0) in the last 30.0 seconds (last event received 1.79 minute ago) 
[2020-11-29 15:56:35,977] [114] [WARNING] [^---Recovery]: No event received for active tp TP(topic='meteor-submission-count-by-workflow-changelog', partition=0) in the last 30.0 seconds (last event received 1.88 minute ago) 
[2020-11-29 15:56:40,982] [114] [WARNING] [^---Recovery]: Recovery has not flushed buffers in the last 120.0 seconds (last flush was 2.00 minutes ago). Current total buffer size: 5 
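
For anyone triaging a similar hang, the warnings above can be summarized with a small standard-library parser. This is a sketch that assumes the exact line format shown in this log; the names `LINE_RE`, `parse_stall`, and `SAMPLE` are hypothetical, and other Faust versions may word the message differently.

```python
import re
from datetime import datetime

# Matches the "No event received for active tp" warnings shown above.
LINE_RE = re.compile(
    r"\[(?P<ts>[\d-]+ [\d:,]+)\] \[\d+\] \[WARNING\] \[\^---Recovery\]: "
    r"No event received for active tp TP\(topic='(?P<topic>[^']+)', "
    r"partition=(?P<partition>\d+)\) in the last (?P<window>[\d.]+) seconds "
    r"\(last event received (?P<idle>[\d.]+) minutes? ago\)"
)


def parse_stall(line):
    """Return the stalled partition and idle time, or None for other lines."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "timestamp": datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S,%f"),
        "topic": m["topic"],
        "partition": int(m["partition"]),
        "idle_minutes": float(m["idle"]),
    }


# Three lines copied from the log above.
SAMPLE = """\
[2020-11-29 15:55:45,969] [114] [WARNING] [^---Recovery]: No event received for active tp TP(topic='meteor-submission-count-by-workflow-changelog', partition=0) in the last 30.0 seconds (last event received 1.04 minute ago)
[2020-11-29 15:56:35,977] [114] [WARNING] [^---Recovery]: No event received for active tp TP(topic='meteor-submission-count-by-workflow-changelog', partition=0) in the last 30.0 seconds (last event received 1.88 minute ago)
[2020-11-29 15:56:40,982] [114] [WARNING] [^---Recovery]: Recovery has not flushed buffers in the last 120.0 seconds (last flush was 2.00 minutes ago). Current total buffer size: 5
"""

stalls = [s for s in map(parse_stall, SAMPLE.splitlines()) if s is not None]

# True when the idle time keeps growing, i.e. no new events are arriving.
idle_growing = all(
    a["idle_minutes"] < b["idle_minutes"] for a, b in zip(stalls, stalls[1:])
)
```

A steadily growing idle time on a single partition, as in this log, indicates that recovery keeps waiting on that changelog partition without receiving anything new.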

Versions

  • Python version: 3.7
  • Faust version: 0.3.0
  • Operating system: Ubuntu 18.04
  • Kafka version: 2.6.0
  • RocksDB version (if applicable)