Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bulk Indexing Optimization and Refresh on Async Strategies #467

Merged
merged 24 commits into from
Mar 29, 2017

Conversation

ericproulx
Copy link
Contributor

Hello Pyro,

First of all, have a great year 👏👏

After some reading on ElasticSearch documentation, it seems that dynamically disabling refresh_interval and decreasing number_of_replicas to 0 should enhance bulk indexing on indices. If thought it could be a nice configuration setting when zero downtime index resetting. I called it use_enhance_index_settings_while_resetting

Also, I added another configuration setting to disable refresh on async strategies like sidekiq, resque and active_jobs. If you choose an async strategy, probably that you don't need a near-real time searchable document. Well in our case, we don't so I thought it would be nice addition.

Thanks

Copy link
Contributor

@pyromaniac pyromaniac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, great idea! Did you try it on production?

Another optimization option is to start the index without any replicas, and only later adding them, but that really depends on the use case

Does it make sense to split this 2 options (the refresh interval and the replicas number) to be able to enable them separately?

result =
if Chewy.use_enhance_index_settings_while_resetting
name = build_index_name(suffix: suffix)
client.indices.put_settings index: name, body: { index: { number_of_replicas: 0, refresh_interval: -1 } }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we refactor this? Make number_of_replicas a separate method with the block or so. I mean, this method became way huge, right?

Copy link
Contributor Author

@ericproulx ericproulx Jan 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll refactor.


def extract_index_settings(*args)
return {} unless settings_hash.key?(:settings) || settings_hash[:settings].key?(:index)
settings_hash[:settings][:index].select { |k, _| args.include?(k) }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

settings_hash[:settings][:index].slice(*args) is slightly simpler.

@ericproulx
Copy link
Contributor Author

rubocop is now 0.47.0 but I locked it to 0.46.0 since new offenses were flagged

@ericproulx
Copy link
Contributor Author

ericproulx commented Jan 17, 2017

Well, i fixed 0.47.0 offenses. Is that ok?

@pyromaniac
Copy link
Contributor

Sure it is ok. Did you take a look on my comment about separate options? What do you think?

.rubocop.yml Outdated
@@ -32,3 +32,8 @@ Style/MultilineMethodCallIndentation:

Style/MultilineOperationIndentation:
EnforcedStyle: indented

Metrics/BlockLength:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Now we need this in every project :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@ericproulx
Copy link
Contributor Author

Honestly, I don't know about separating options. When would put back the original number_of_replicas if it was separated ? I would say the doc is confusing about that.

What do you think about the force_merge after ?

And, a force merge should be called:

curl -XPOST 'http://localhost:9200/test/_forcemerge?max_num_segments=5'

@pyromaniac
Copy link
Contributor

I mean, setting refresh interval should be separate from setting the number of replicas. Somebody can decide to use either one, but not both.

@ericproulx
Copy link
Contributor Author

ericproulx commented Jan 17, 2017

I would change the config settings to reflect user's choice

 :only_refresh
 :only_replicas
 :both
 :none

@pyromaniac
Copy link
Contributor

Hm, what about 2 separate boolean options instead?

@ericproulx
Copy link
Contributor Author

No prob

@ericproulx
Copy link
Contributor Author

ericproulx commented Jan 22, 2017

Well, looks like delete_by_query has been updated in elastic-ruby 5.0.1

elastic/elasticsearch-ruby@cbfd6eb

It was removed in 2.0 (became a plugin). 5.0, it's back again but with a new definition

:reset_no_replicas,

# Refresh or not when import async (sidekiq, resque, activejob)
:disable_refresh_async
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, why did you add this?
I'm thinking about some options passed to the block: Chewy.strategy(:sidekiq, refresh: false) { do_stuff } Instead of the global option. However it would not affect anything, so I'm ok to accept it if you don't want to change your mind.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think both ways could work. We could set the strategy globally for the async strategies or we could simply pass some options to the block to have a different behavior for some cases. What do you think ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this makes sense. I'm just thinking of it being slightly excess. But yeah, we can keep both ways. Tell me when you think this PR is ready, I'll review it again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ready to review sir 👍

@pyromaniac pyromaniac merged commit 540462b into toptal:master Mar 29, 2017
@pyromaniac
Copy link
Contributor

Ok, let's try it out. Thanks for this functionality!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants