Drastically speed up imports when bulk_size is specified #606

Merged
3 commits merged into toptal:master on Mar 10, 2018

Conversation

yahooguntu

If bulk_size is set in an index's import options, two additional steps are taken during an index reset:

  • Items to be indexed are serialized to JSON so their size is known
  • Items are split into multiple bulk requests if they collectively exceed bulk_size
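For context, setting bulk_size in an index's import options looks roughly like the sketch below (the model, index, and field names are placeholders, and the `default_import_options` usage is assumed from Chewy's README rather than taken from this PR):

```ruby
# Hypothetical index definition; Product and the fields are placeholders.
class ProductsIndex < Chewy::Index
  define_type Product do
    # Assumed option: caps each bulk request body at ~10 MB, which is what
    # triggers the serialize-and-split steps described above.
    default_import_options bulk_size: 10 * 1024 * 1024

    field :title
    field :price, type: 'float'
  end
end
```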

I found two performance issues related to those steps. First, the serializer used by Chewy was much slower than the one used by elasticsearch-ruby. If bulk_size isn't set this isn't an issue, since Chewy passes everything to elasticsearch-ruby to be serialized. If bulk_size is set, though, Chewy needs to know the size of each item in order to complete step two, so it serializes everything with whatever .to_json uses (in my case, ActiveSupport/Module::JSON). This commit fixes that by changing Chewy to use whatever serializer elasticsearch-ruby is configured to use (in my case, MultiJson/Yajl).
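A minimal sketch of the difference (simplified, not the exact Chewy internals; only the `::Elasticsearch::API.serializer.dump` call is taken from the diff further down):

```ruby
require 'json'
require 'elasticsearch'

item = { title: 'Foo', price: 9.99 }

# Before: size measured via Object#to_json, which routes through
# ActiveSupport's encoder in a Rails app and was the slow path here.
payload = item.to_json
payload.bytesize

# After: reuse the serializer elasticsearch-ruby is configured with
# (MultiJson, typically backed by Yajl or Oj), the same one that will
# serialize the bulk request anyway.
payload = ::Elasticsearch::API.serializer.dump(item)
payload.bytesize
```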

Second, the method that splits bulk requests when they get too big was allocating more strings than it had to. This method works by appending index commands to the last request body in an array of requests until that request body reaches bulk_size; then it adds a new request body to the array and starts filling it with index commands, and so on. The slowdown came from this method using .join to add data to request bodies, which creates a copy of the request body on every append. The fix was to change .join to <<, which appends to the string in place.
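Roughly, the splitting step behaves like the sketch below (a hypothetical method name and simplified logic, not the exact Chewy code):

```ruby
# Hypothetical sketch of splitting serialized index commands into request
# bodies no larger than bulk_size.
def split_into_bodies(serialized_commands, bulk_size)
  bodies = ['']
  serialized_commands.each do |command|
    line = command + "\n"
    # Start a new request body once the current one would exceed bulk_size.
    bodies << '' if !bodies.last.empty? && bodies.last.bytesize + line.bytesize > bulk_size
    # `<<` appends in place; building the body with `join`/`+` instead
    # allocates a fresh copy of the accumulated string on every append.
    bodies.last << line
  end
  bodies
end

split_into_bodies(['{"index":{}}', '{"title":"Foo"}'], 1_536) # one body until 1.5KB is exceeded
```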

Here are the runtimes I was seeing when reindexing 40,000 documents:

| bulk_size | Revision                    | Time    |
|-----------|-----------------------------|---------|
| not set   | master                      | 12.76s  |
| 1.5KB     | master                      | 113.12s |
| 1.5KB     | Split optimized             | 113.05s |
| 1.5KB     | Split & serialize optimized | 108.01s |
| 1GB       | master                      | 50.78s  |
| 1GB       | Split optimized             | 21.39s  |
| 1GB       | Split & serialize optimized | 13.21s  |

* 1.5KB forces each update into its own request, and 1GB allows them all to go in one big request.

@yahooguntu yahooguntu changed the title Drastically speeds up imports when bulk_size is specified Drastically speed up imports when bulk_size is specified Nov 9, 2017
@yahooguntu
Author

@pyromaniac Conflicts have been resolved.

@@ -31,7 +31,7 @@ def entries(action, objects)
       index_name: @type.index.derivable_name,
       type_name: @type.type_name,
       action: action,
-      references: identify(objects).map(&:to_json).map(&Base64.method(:encode64)),
+      references: identify(objects).map { |item| Base64.encode64(::Elasticsearch::API.serializer.dump(item)) },
Contributor

Didn't get how this one is related

Author

That's the serialization optimization mentioned above. This uses the serializer elasticsearch-ruby is configured to use, instead of the slow default one.

@pyromaniac pyromaniac merged commit f9c02dd into toptal:master Mar 10, 2018
@pyromaniac
Contributor

Awesome, thanks for the contribution!
