Drastically speed up imports when bulk_size is specified #606
If `bulk_size` is set in an index's import options, two additional steps are taken during an index reset:

1. Each item in the import is serialized ahead of time, so its size is known.
2. The serialized items are split into bulk requests no larger than `bulk_size`.
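For context, here's a hedged sketch of how `bulk_size` can be passed to an import (the index name is illustrative):

```ruby
# bulk_size caps the payload of each bulk request, in bytes; requests
# are split so no single request body exceeds this size.
PlacesIndex.import(bulk_size: 10 * 1024 * 1024) # split requests at ~10MB
```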
I found two performance issues related to those steps.

First, the serializer used by Chewy was much slower than the one used by elasticsearch-ruby. If `bulk_size` isn't set, this isn't an issue, since Chewy passes everything to elasticsearch-ruby to be serialized. If `bulk_size` is set, though, Chewy needs to know the size of each item in order to complete step two, so it serializes everything with whatever `.to_json` uses (in my case, ActiveSupport/Module::JSON). This commit fixes that by changing Chewy to use whatever serializer elasticsearch-ruby is configured to use (in my case, MultiJson/Yajl).
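In sketch form (not Chewy's exact code), the change amounts to serializing with the serializer elasticsearch-ruby exposes instead of calling `.to_json`:

```ruby
require 'json'
require 'elasticsearch'

entry = { index: { _id: 1, data: { title: 'Hello' } } }

# Before: routes through ActiveSupport's encoder when Rails is loaded.
slow = entry.to_json

# After: uses the serializer elasticsearch-ruby is configured with
# (MultiJson by default, backed by Yajl or Oj when installed).
fast = Elasticsearch::API.serializer.dump(entry)
```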
Second, the method that splits bulk requests if they're too big was allocating more strings than it had to. This method works by appending index commands to the last request body in an array of requests until that body reaches `bulk_size`; then it adds a new request body to the array, starts filling it with index commands, and so on. The slowdown was because this method was using `.join` to add data to request bodies, which creates a copy of the request body each time. The solution was to change `.join` to `<<`, which appends to the string in place.
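A hedged sketch of that splitting loop (the names are illustrative, not Chewy's actual internals):

```ruby
require 'elasticsearch'

bulk_size  = 10 * 1024 * 1024 # 10MB cap per request, for example
serializer = Elasticsearch::API.serializer
documents  = [{ index: { _id: 1, data: { title: 'Hello' } } }]

bodies = documents.each_with_object([+'']) do |doc, result|
  entry = serializer.dump(doc) << "\n"
  # Start a new request body once the current one would exceed bulk_size.
  result << +'' if result.last.bytesize + entry.bytesize > bulk_size
  # In-place append; `[result.last, entry].join` would copy the whole body.
  result.last << entry
end
```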
Here are the runtimes I was seeing when reindexing 40,000 documents. A `bulk_size` of 1.5KB forces each update into its own request, and 1GB allows them all to go in one big request.
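If you want to reproduce the comparison, one minimal way to time a reset (hypothetical index name):

```ruby
require 'benchmark'

elapsed = Benchmark.realtime { PlacesIndex.reset! }
puts "reset took #{elapsed.round(1)}s"
```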