
Update Elasticsearch Output Plugin to retry bulk #2


Closed
talevy wants to merge 1 commit into master from fix/1631-retry

Conversation

@talevy (Contributor) commented Oct 31, 2014

Some actions may fail within the ES client bulk call.
Now, some messages (specifically those failing with 429 and 503
errors) will be retried up to 3 times. If some actions are still
not indexed successfully after that, Stud Buffer will continue its
current behavior (retrying indefinitely).

Stud Buffer will replay all events it first attempted to flush
to Elasticsearch. This means duplicate events may find themselves
in Elasticsearch.

(migrated from elastic/logstash#1997 with a few modifications to make
the tests run once the plugin is installed)

Stud::try(3.times) do
  bulk_response = @current_client.bulk(actions)
  if bulk_response["errors"]
    failed_actions = bulk_response['statuses'].map.with_index { |x, i| actions[i] unless [200, 201].include?(x) }.compact
Member
A reject might be simpler:

failed_actions = actions.reject.with_index {|_,i| [200, 201].include? bulk_response['statuses'][i] }
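For illustration (an editor's sketch with made-up data, not part of the PR), reject.with_index drops the actions whose corresponding bulk status is a success code:

# Hypothetical example data, for illustration only.
actions  = ["index_a", "index_b", "index_c"]
statuses = [201, 429, 503]

# reject without a block returns an Enumerator, so with_index supplies the
# position needed to look up the matching bulk status.
failed_actions = actions.reject.with_index { |_, i| [200, 201].include?(statuses[i]) }
# => ["index_b", "index_c"]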

Contributor Author

I didn't know about reject. I like it!

@jordansissel (Contributor)

Let's only retry the documents which failed :)

@jordansissel (Contributor)

If we have to improve Stud::Buffer, we can do that, but we should only retry docs that failed.

@talevy (Contributor, Author) commented Nov 4, 2014

@jordansissel do you see a way of accomplishing that without updating Stud::Buffer to
handle the partials here? https://github.com/jordansissel/ruby-stud/blob/master/lib/stud/buffer.rb#L217-L227

first way I see:

  • update flush interface to return status
  • use that status to delete from @buffer_state[:outgoing_items]/@buffer_state[:pending_items] and
    @buffer_state[:outgoing_count]/@buffer_state[:pending_count] accordingly.

what do you think?
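For illustration only, a hypothetical sketch of that first idea (not the approach that was ultimately merged): a flush that reports failures back, so a modified Stud::Buffer could keep only the failed items in its pending/outgoing state instead of replaying the whole batch:

# Hypothetical sketch, for discussion purposes only.
def flush(actions)
  bulk_response = @current_client.bulk(actions)
  return [] unless bulk_response["errors"]

  # keep only the actions whose bulk status is a retryable failure
  actions.select.with_index do |_, i|
    [429, 503].include?(bulk_response['statuses'][i])
  end
end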

@talevy (Contributor, Author) commented Nov 12, 2014

Update from discussions:

We will take over the whole retry management so as to prevent
Stud::Buffer from retrying all queued-up events (including successful ones)

@talevy force-pushed the fix/1631-retry branch 4 times, most recently from bd290de to 3684a81 on November 17, 2014, 21:43
@colinsurprenant (Contributor)

I will make a few inline comments, but I'd like to discuss the submit + retry_flush design where retry_flush can call submit... this sequence is hard to reason about: submit -> retry_flush -> submit -> ...

@@ -208,10 +215,26 @@ def wildcard_substitute(name)
    name.gsub(/%\{[^}]+\}/, "*")
  end

  def retry_flush
    unless @retry_queue.empty?
      buffer = @retry_queue.size.times.collect { @retry_queue.pop }
Contributor

personal preference: I prefer the use of map instead of collect simply because map originates from functional programming languages and is widely used by most programming languages.

Contributor

I think there is a potential race condition here. retry_flush is called both by the flush thread at a periodic interval and by the submit method. The statement

buffer = @retry_queue.size.times.collect { @retry_queue.pop }

grabs the @retry_queue size and then iterates that many times to pop elements. We can get into a situation where retry_flush is executed concurrently: both threads read the queue size, but each thread also pops elements, so both threads can eventually block on @retry_queue.pop because the other thread has already popped the elements they counted on.
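For illustration, one way to avoid that blocking (an editor's sketch, not necessarily what the PR ended up doing) is to drain with non-blocking pops instead of trusting a size snapshot:

# Sketch only: Queue#pop(true) raises ThreadError when the queue is empty,
# so a concurrent drainer can never make this thread block.
def drain_retry_queue
  buffer = []
  begin
    loop { buffer << @retry_queue.pop(true) }
  rescue ThreadError
    # queue is empty (possibly drained by the other thread), stop here
  end
  buffer
end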

@colinsurprenant (Contributor)

In the receive method, maybe rewrite

    if @index_type
      type = event.sprintf(@index_type)
    else
      type = event["type"] || "logs"
    end

into

  type = @index_type ? event.sprintf(@index_type) : (event["type"] || "logs")

failed_actions = actions.select.with_index { |_, i| [429, 503].include?(bulk_response['statuses'][i]) }
unless failed_actions.empty?
  @logger.debug "#{failed_actions.size}/#{actions.size} events were unsuccessful in sending"
  for failed_action in failed_actions
Contributor

also, style-wise, I'd avoid for loops and use iterators like each
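For illustration, the loop in the excerpt above could read as follows (a sketch of the suggested style; the loop body shown is hypothetical, since the original body is not part of the excerpt):

# each keeps the block variable scoped to the block, unlike `for`.
failed_actions.each do |failed_action|
  @retry_queue << failed_action
end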

@talevy force-pushed the fix/1631-retry branch 2 times, most recently from c6b793a to 505bd37 on December 1, 2014, 15:26
end

[next_action, next_doc, next_event]
}.compact
Contributor

did you mean to have return nil return from the method, or to have nil as the map block result? I assume the latter, since you call compact at the end. But as written, return nil will return from the enclosing method. I'd suggest rewriting this as:

buffer = @retry_queue.size.times.map do
  next_action, next_doc, next_event = @retry_queue.pop
  next_event['@metadata']['retry_count'] += 1

  if next_event['@metadata']['retry_count'] > @max_retries
    @logger.error("too many attempts at sending event. dropping: #{next_event}")
    next
  end

  [next_action, next_doc, next_event]
end.compact

next returns from the block. you can specify a return value to next, but a bare next is equivalent to next nil.
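A minimal, self-contained illustration of the difference (an editor's example, not from the PR):

# next ends only the current block iteration; compact removes the nils.
def keep_evens_with_next(numbers)
  numbers.map { |n| next if n.odd?; n }.compact
end

# return inside the block exits the whole method at the first odd number.
def keep_evens_with_return(numbers)
  numbers.map { |n| return nil if n.odd?; n }
end

keep_evens_with_next([1, 2, 3, 4])    # => [2, 4]
keep_evens_with_return([1, 2, 3, 4])  # => nil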

@colinsurprenant (Contributor)

noticed the use of or as a boolean logical operator at https://github.com/logstash-plugins/logstash-output-elasticsearch/pull/2/files#diff-e1eda60fc5950f2a16e1209545609fdfL286; as discussed, we should use || instead
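For context, a small illustration of why the two differ (an editor's example, not from the PR): or binds more loosely than assignment, while || binds more tightly, so the two can yield different results:

# `or` has lower precedence than `=`, `||` has higher precedence.
a = false or true    # parsed as (a = false) or true   => a is false
b = false || true    # parsed as  b = (false || true)  => b is true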

@colinsurprenant (Contributor)

Also, a general comment about method visibility: I personally prefer to group methods and write the visibility modifier once; all method definitions that follow the modifier inherit it. I think it is easier to read and understand: all public methods first, then the protected ones, and finally the private methods. No need to specify the visibility on each...
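For illustration, the grouping style being suggested (an editor's sketch with hypothetical method names, not code from the PR):

class ExampleOutput
  # all public methods first...
  def register; end
  def receive(event); end

  private

  # ...then a single `private` covers every definition that follows it.
  def internal_helper; end
  def another_internal_helper; end
end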

@talevy force-pushed the fix/1631-retry branch 2 times, most recently from 2fb3e58 to 4c0ea0b on December 4, 2014, 00:03
@talevy force-pushed the fix/1631-retry branch 2 times, most recently from 9bc3267 to 93b63f1 on January 12, 2015, 23:08

public
def submit(actions)
  es_actions = actions.map { |a, doc, event| [a, doc, event.to_hash] }
Contributor

suggested comment:

# synchronize the @current_client.bulk call to avoid concurrency/thread safety issues with the
# client libraries which might not be thread safe. the submit method can be called from both the
# Stud::Buffer flush thread and from our own retry thread.
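For illustration, applying that suggestion might look like this (an editor's sketch; the @submit_mutex name is an assumption, not necessarily what was merged):

# Hypothetical sketch: guard the bulk call with a mutex because submit can be
# called from both the Stud::Buffer flush thread and the retry thread, and the
# client library might not be thread safe.
def register
  @submit_mutex = Mutex.new
  # ... rest of the plugin setup ...
end

def submit(actions)
  es_actions = actions.map { |a, doc, event| [a, doc, event.to_hash] }
  bulk_response = @submit_mutex.synchronize { @current_client.bulk(es_actions) }
  # ... inspect bulk_response["errors"] and re-queue retryable failures ...
end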

Contributor Author

good call

buffer_receive([event.sprintf(@action), { :_id => document_id, :_index => index, :_type => type }, event.to_hash])
end # def receive
if next_event['@metadata']['retry_count'] > @max_retries
@logger.error "too many attempts at sending event. dropping: #{next_event}"
Contributor

until we have the DLQ, wondering if we should log more upon dropping the event?

#{next_event.to_hash_with_metadata.inspect}

??

Any thoughts @jordansissel ?
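For illustration, the fuller log line being suggested might look like this (an editor's sketch, not the merged code):

# Sketch: include the full event (with @metadata) when dropping it, so the
# data is at least recoverable from the logs until a dead-letter queue exists.
@logger.error("too many attempts at sending event. dropping: " \
              "#{next_event.to_hash_with_metadata.inspect}")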

@talevy (Contributor, Author) commented Jan 13, 2015

@colinsurprenant updated again! with extra test for the max_retries logic (which was not being tested, as the return nil went by unnoticed)

@colinsurprenant (Contributor)

Looks pretty good to me. We will need to follow up with a code reorg/cleanup PR.
I did not review the tests, and I did not stress test or do an integration test on this. I think we should. For HTTP we could simply create a dummy HTTP server to mock an ES instance.
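For illustration, a minimal mock along those lines (an editor's sketch; the port, endpoint, and canned response are assumptions, and the errors/statuses shape mirrors what the plugin code in this PR consumes rather than the items array a real Elasticsearch _bulk response returns):

require 'webrick'
require 'json'

# Throwaway HTTP server that answers _bulk requests with a canned response
# containing a retryable status, for integration testing the retry path.
server = WEBrick::HTTPServer.new(
  :Port      => 9200,
  :AccessLog => [],
  :Logger    => WEBrick::Log.new(File::NULL)
)

server.mount_proc('/_bulk') do |req, res|
  res['Content-Type'] = 'application/json'
  # pretend the second action was rejected with a 429 so the plugin retries it
  res.body = { 'errors' => true, 'statuses' => [201, 429] }.to_json
end

trap('INT') { server.shutdown }
server.start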

Some actions may fail within the ES client bulk call.
Now, some messages (specifically those failing with 429 and 503
errors) will be retried up to a configurable number of times
(default 3). If there are still actions that were not indexed
successfully, those messages are added to a separate retry queue
that has its own thread managing its flushes. Back-pressure is
applied to prevent too many failed events from accumulating.
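For illustration, the back-pressure idea can be sketched with a bounded queue (an editor's sketch; the names, size limit, and interval are assumptions, not the merged implementation):

require 'thread'

# A SizedQueue blocks the producer when it is full, which is what applies
# back-pressure to the pipeline when too many failed events pile up.
RETRY_QUEUE_LIMIT = 1000
@retry_queue = SizedQueue.new(RETRY_QUEUE_LIMIT)

# producer side: called for each retryable failure; blocks when the queue is full
def enqueue_for_retry(action, doc, event)
  @retry_queue << [action, doc, event]
end

# consumer side: a dedicated thread periodically drains the queue for re-submission
@retry_thread = Thread.new do
  loop do
    sleep(5)   # assumed flush interval, for illustration
    batch = []
    batch << @retry_queue.pop until @retry_queue.empty?
    # a real implementation would re-submit `batch` to Elasticsearch here
  end
end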
@talevy (Contributor, Author) commented Jan 13, 2015

thanks @colin for the thorough review

@elasticsearch-bot

Merged successfully into master!

@jsvd (Member)

jsvd commented Jan 14, 2015

WOOHOOO 🎆 so glad to see this merged! 🎉

@ralphm commented Feb 23, 2015

I just ran into the issue of dropped events because I mistakenly tried to index a document with a type starting with an underscore. This yields a 400 error, but the drop happens silently and I only found out what happened by tcpdump. Although I see elastic/logstash#1631 is closed because of this ticket, I couldn't find the follow-up PR that @colinsurprenant hinted at?

@talevy (Contributor, Author) commented Feb 23, 2015

@ralphm I will add more explicit logging to inform of this better. Currently we only retry 429 and 503 errors. Others will be logged and dropped until we have a persistent way of keeping track of these without halting the pipeline. Let me know if this would help, or maybe another behavior you would expect.

edit: created PR and will address this today: #62 @ralphm let me know if that is helpful enough. I can update the message to be whatever you think would be useful and easy to parse for you! Thanks!

@talevy (Contributor, Author) commented Mar 2, 2015

@ralphm the logging changes have been merged into master. hope that makes things a little better until we add dead-letter queuing support!
