Drastically speed up imports when bulk_size is specified #606
If `bulk_size` is set in an index's import options, two additional steps are taken during an index reset:

1. Each item in the import is serialized ahead of time, so its size is known.
2. The serialized items are split into bulk requests no larger than `bulk_size`.
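For context, here's a hedged sketch of how `bulk_size` can be passed to an import (the index name is illustrative):

```ruby
# bulk_size caps the payload of each bulk request, in bytes; requests
# are split so no single request body exceeds this size.
PlacesIndex.import(bulk_size: 10 * 1024 * 1024) # split requests at ~10MB
```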
I found two performance issues related to those steps.

First, the serializer used by Chewy was much slower than the one used by elasticsearch-ruby. If `bulk_size` isn't set, this isn't an issue, since Chewy passes everything to elasticsearch-ruby to be serialized. If `bulk_size` is set, though, Chewy needs to know the size of each item in order to complete step two, so it serializes everything with whatever `.to_json` uses (in my case, ActiveSupport/Module::JSON). This commit fixes that by changing Chewy to use whatever serializer elasticsearch-ruby is configured to use (in my case, MultiJson/Yajl).
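In sketch form (not Chewy's exact code), the change amounts to serializing with the serializer elasticsearch-ruby exposes instead of calling `.to_json`:

```ruby
require 'json'
require 'elasticsearch'

entry = { index: { _id: 1, data: { title: 'Hello' } } }

# Before: routes through ActiveSupport's encoder when Rails is loaded.
slow = entry.to_json

# After: uses the serializer elasticsearch-ruby is configured with
# (MultiJson by default, backed by Yajl or Oj when installed).
fast = Elasticsearch::API.serializer.dump(entry)
```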
Second, the method that splits bulk requests if they're too big was allocating more strings than it had to. This method works by appending index commands to the last request body in an array of requests until that body reaches `bulk_size`; then it adds a new request body to the array, starts filling it with index commands, and so on. The slowdown was because this method was using `.join` to add data to request bodies, which creates a copy of the request body each time. The solution was to change `.join` to `<<`, which appends to the string in place.
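A hedged sketch of that splitting loop (the names are illustrative, not Chewy's actual internals):

```ruby
require 'elasticsearch'

bulk_size  = 10 * 1024 * 1024 # 10MB cap per request, for example
serializer = Elasticsearch::API.serializer
documents  = [{ index: { _id: 1, data: { title: 'Hello' } } }]

bodies = documents.each_with_object([+'']) do |doc, result|
  entry = serializer.dump(doc) << "\n"
  # Start a new request body once the current one would exceed bulk_size.
  result << +'' if result.last.bytesize + entry.bytesize > bulk_size
  # In-place append; `[result.last, entry].join` would copy the whole body.
  result.last << entry
end
```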
Here are the runtimes I was seeing when reindexing 40,000 documents. A `bulk_size` of 1.5KB forces each update into its own request, and 1GB allows them all to go in one big request.
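If you want to reproduce the comparison, one minimal way to time a reset (hypothetical index name):

```ruby
require 'benchmark'

elapsed = Benchmark.realtime { PlacesIndex.reset! }
puts "reset took #{elapsed.round(1)}s"
```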