
Some events are sent twice #67

@ePaul


Symptoms

Testing a full snapshot creation in one of our systems, I found 405555 log lines about events having been submitted to Nakadi, while only 347889 snapshot events should have been created, i.e. about 57666 submissions were duplicates. (The whole snapshot creation took 91 minutes.)

Looking into the logs, I find that some events are sent twice:

2018-01-30 12:43:49,732 INFO defaultTaskScheduler-1 org.zalando.nakadiproducer.transmission.impl.EventTransmissionService.sendEvent(EventTransmissionService.java:51): Event 762966 locked by 67c108fb-e689-4595-a28b-e78ab4030726 was sucessfully transmitted to nakadi
...
2018-01-30 12:44:25,618 INFO pool-6-thread-1 org.zalando.nakadiproducer.transmission.impl.EventTransmissionService.sendEvent(EventTransmissionService.java:51): Event 762966 locked by 2f9e651b-49ef-4f7c-a1ba-2d5d976abed2 was sucessfully transmitted to nakadi

Analysis

The event sending (from the event log) works like this:

  • update all DB entries that have no lockTime, or whose lockTime lies before now, setting the lockTime to 10 minutes in the future (and storing the task's lock identifier)
  • fetch all entries which are locked by my identifier
  • for each entry:
    • send the event for that entry
    • delete the entry from the DB.

The whole thing runs as a task scheduled every second.
I think the problem occurs if that loop takes longer than 10 minutes: in that case the locks expire, and the next task can lock the same entries again (and start processing them) while the first one is still working through them.
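
For illustration, here is a minimal sketch of that cycle. The class, repository and sender types are hypothetical stand-ins, not the actual nakadi-producer classes:

import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.UUID;

// Hypothetical sketch of the cycle described above, not the real classes.
class EventTransmissionSketch {

    interface EventLogRepository {
        // roughly: UPDATE event_log SET locked_by = ?, locked_until = ?
        //          WHERE locked_until IS NULL OR locked_until < now()
        void lockAllUnlocked(String lockId, Instant lockedUntil);
        List<Long> findIdsLockedBy(String lockId);
        void delete(long eventId);
    }

    interface EventSender {
        void send(long eventId); // one POST to Nakadi per event
    }

    private static final Duration LOCK_DURATION = Duration.ofMinutes(10);

    private final EventLogRepository repository;
    private final EventSender sender;

    EventTransmissionSketch(EventLogRepository repository, EventSender sender) {
        this.repository = repository;
        this.sender = sender;
    }

    // Runs as a scheduled task, once per second.
    void sendEvents() {
        String lockId = UUID.randomUUID().toString();

        // Step 1: lock *all* due entries for the next 10 minutes.
        repository.lockAllUnlocked(lockId, Instant.now().plus(LOCK_DURATION));

        // Step 2: fetch everything we just locked.
        List<Long> lockedIds = repository.findIdsLockedBy(lockId);

        // Step 3: send and delete, one entry at a time. If this loop takes
        // longer than LOCK_DURATION, the locks expire while we are still in
        // here, and the next scheduled run re-locks and re-sends the
        // remaining entries, producing the duplicates seen above.
        for (long eventId : lockedIds) {
            sender.send(eventId);
            repository.delete(eventId);
        }
    }
}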

Proposed solution

The first step (which locks the entries) should get a limit, so that it only locks as many entries as one thread can comfortably send in a time well below 10 minutes (to make sure that even in extraordinary cases they are all processed within the 10-minute window).
In our example, we had a submission rate of 70-100 events/s (from our Berlin data center to a Nakadi instance in AWS Ireland, using single-event POSTs with an average event size of 3 KB), so a limit of 200 should be more than enough, and might even speed things up by allowing more parallel processing. With batch submission (#48), higher rates can likely be reached, so the limit should then be increased accordingly.
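
A possible shape for the limited lock step, assuming a Postgres-style database; the table and column names are made up, and nakadi-producer's actual schema may differ:

import java.time.Instant;
import org.springframework.jdbc.core.JdbcTemplate;

// Hypothetical limited lock step; table and column names are assumptions.
class LimitedLocker {

    private final JdbcTemplate jdbcTemplate;

    LimitedLocker(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Locks at most 'limit' due entries and returns how many were locked.
    int lockSomeUnlocked(String lockId, Instant lockedUntil, int limit) {
        // A plain UPDATE has no LIMIT, so select the ids in a subquery
        // (Postgres syntax; other databases may need something else).
        return jdbcTemplate.update(
            "UPDATE event_log SET locked_by = ?, locked_until = ? "
                + "WHERE id IN (SELECT id FROM event_log "
                + "             WHERE locked_until IS NULL OR locked_until < now() "
                + "             ORDER BY id LIMIT ?)",
            lockId, java.sql.Timestamp.from(lockedUntil), limit);
    }
}

At the observed 70-100 events/s, a batch of 200 is done in two to three seconds, far inside the 10-minute lock window.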

An alternative would be to regularly check the current time in the processing loop, and simply stop when it comes close to the lock expiry time, letting the next scheduled task take over.
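
As a variant of step 3 in the sketch above (same hypothetical class and fields), the processing loop would then check a deadline derived from the lock expiry:

    // Variant of step 3: stop well before the lock expires, and let the
    // next scheduled task pick up whatever is left.
    void sendEvents(List<Long> lockedIds, Instant lockedUntil) {
        // Safety margin so that an in-flight POST cannot overrun the lock.
        Instant deadline = lockedUntil.minus(Duration.ofMinutes(1));

        for (long eventId : lockedIds) {
            if (Instant.now().isAfter(deadline)) {
                // The remaining entries stay locked until expiry; the next
                // scheduled run then locks and processes them.
                return;
            }
            sender.send(eventId);
            repository.delete(eventId);
        }
    }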
