Explore use of LogByteSizeMergePolicy for time series data use cases #9241
Comments
There is one more merge policy that got introduced: the interleaved ShuffledMergePolicy. I will be curious to see the numbers here, but definitely consider comparing against it.
Thanks @gashutos for taking a look. I'm currently working towards getting numbers out for comparison.
This optimization wouldn't change anything within BKD itself, but the min and max timestamp ranges of the BKDs will be significantly narrower after this change and the points will be more dense. Also, searching across segments would be more ordered.
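As a rough illustration of the segment skipping being discussed, here is a minimal Lucene sketch (my own, not code from this change; the class and method names are hypothetical) that counts how many segments a time-range query could skip outright by reading each segment's BKD min/max for `@timestamp`:

```java
import java.io.IOException;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.PointValues;

// Sketch: compare a query window against each segment's BKD min/max for
// @timestamp and count the segments that can be skipped entirely.
public final class SegmentSkipCheck {
    public static int countSkippableSegments(IndexReader reader, long queryMin, long queryMax)
            throws IOException {
        int skippable = 0;
        for (LeafReaderContext leaf : reader.leaves()) {
            PointValues points = leaf.reader().getPointValues("@timestamp");
            if (points == null) {
                continue; // segment has no points for this field
            }
            long segMin = LongPoint.decodeDimension(points.getMinPackedValue(), 0);
            long segMax = LongPoint.decodeDimension(points.getMaxPackedValue(), 0);
            // No overlap with the query window: the whole segment is skippable.
            if (segMax < queryMin || segMin > queryMax) {
                skippable++;
            }
        }
        return skippable;
    }
}
```

The narrower the per-segment min/max ranges are (as with adjacent-segment merging), the more segments this check can rule out for a given time window.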
I'm not seeing any improvement using LogByteSizeMergePolicy against the http_logs workload.

Concern 1: http_logs doesn't generate a workload that matches real-world use of data streams, where multiple clients ingest in a time-ordered way.

Concern 2: OpenSearch Benchmark currently completes ingestion and then runs queries for the http_logs workload, per the test procedure defined here. This amounts to running queries only on read-only indices, which is usually not the case in a production environment for time-based data like logs. The LogByteSize merge policy would be more effective for indices with active writes. I'm currently setting up such a data stream locally to evaluate the improvements, if any. Meanwhile, I'm also working with opensearch-benchmark to add this capability - opensearch-project/opensearch-benchmark#365

I may sound like I'm tailoring the use cases to align with the strengths of the LogByteSize merge policy, but these seem like valid use cases that should be incorporated into opensearch-benchmark. Tagging @nknize

Posting numbers with the current logic - LogByteSizeMergePolicy, using the following defaults:

[benchmark tables comparing LogByteSizeMergePolicy latency (ms) and TieredMergePolicy latency (ms) not captured]
Thank you for the benchmark!! There is some regression as well.
If you take a segment and look at the values, they are very much sorted by timestamp (nearly sorted). Probably for testing, you can try with ...
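For reference, one way to check this locally could be the sketch below (my own illustration; the class name and the assumption that `@timestamp` is indexed as numeric doc values are mine): walk a segment in docID order and count how often the timestamp decreases.

```java
import java.io.IOException;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.DocIdSetIterator;

// Sketch: a rough measure of how close to timestamp-sorted a segment is.
public final class SortednessCheck {
    public static long countOutOfOrderDocs(LeafReader leaf) throws IOException {
        NumericDocValues timestamps = DocValues.getNumeric(leaf, "@timestamp");
        long outOfOrder = 0;
        long previous = Long.MIN_VALUE;
        while (timestamps.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
            long value = timestamps.longValue();
            if (value < previous) {
                outOfOrder++; // this doc's timestamp is earlier than the previous doc's
            }
            previous = value;
        }
        return outOfOrder;
    }
}
```

A count near zero would confirm the "nearly sorted" observation for that segment.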
I do see some performance gain when running locally on a system with the following configuration - [configuration not captured]

I had to customize the http_logs workload: I'm using just one log file, an index with one shard, and just a few operations -

With the tiered merge policy: [results not captured]

With the log merge policy: [results not captured]
If you look at the timestamp difference between the 1st and the 8th client at a given time, it's quite significant - opensearch-project/opensearch-benchmark#365 (comment)
Below is a snapshot of the segments created by both merge policies. The X axis denotes the segments, sorted by document count, and the Y axis represents the document count.

[segment snapshot charts not captured]
@rishabhmaurya If time is really a concern, why not make the merge policy time aware?
@itiyama I'm open to such experiments; however, I think the gain may not be significant for the time series use case. Assuming the agents sending data from different machines don't have a significant time lag, merging adjacent segments would ensure the overlap of timestamps across segments isn't large, whereas the tiered merge policy could worsen the overlaps. A timestamp-based approach might be more helpful when there is a significant lag between ingesting clients, but that's usually not the case, and we shouldn't optimize for such anomalies. Do you have better ideas on how you want to make use of timestamps?
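To make the overlap argument concrete, here is a hedged sketch (hypothetical helper types, not part of any OpenSearch API) that quantifies pairwise timestamp overlap across segments; the per-segment ranges could be read from the BKD as in the earlier snippet, and lower totals mean fewer segments match any given time window:

```java
import java.util.List;

// Sketch: total pairwise overlap (in timestamp units) across segment ranges,
// usable to compare merge policies on real indices. Illustrative types only.
public final class OverlapStats {
    public record SegmentRange(long min, long max) {}

    public static long totalPairwiseOverlap(List<SegmentRange> ranges) {
        long total = 0;
        for (int i = 0; i < ranges.size(); i++) {
            for (int j = i + 1; j < ranges.size(); j++) {
                long lo = Math.max(ranges.get(i).min(), ranges.get(j).min());
                long hi = Math.min(ranges.get(i).max(), ranges.get(j).max());
                if (hi > lo) {
                    total += hi - lo; // overlapping span between segments i and j
                }
            }
        }
        return total;
    }
}
```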
Some of the outstanding items before making LogByteSizeMergePolicy the default for timestamp-based indices - [list not captured]
Is your feature request related to a problem? Please describe.
LogByteSizeMergePolicy always merges adjacent segments together, which can be helpful for time series data, where documents are sorted by timestamp and segments usually don't have much overlap in timestamps. At query time, it's better if the time range is contained in a smaller number of segments, so that the other segments can be skipped by checking the min/max values of the timestamp field. When adjacent segments are merged, the likelihood of this increases significantly.
TieredMergePolicy, the successor of LogByteSizeMergePolicy and the current default, merges segments more intelligently and can merge non-adjacent segments too, which could be inefficient for time series data.
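For context, switching policies at the Lucene level looks roughly like the sketch below (a minimal illustration; the class name and tuning values are placeholders, not defaults proposed here):

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.store.FSDirectory;

// Sketch: open an IndexWriter with LogByteSizeMergePolicy instead of the
// default TieredMergePolicy. Tuning values below are illustrative only.
public final class MergePolicySetup {
    public static IndexWriter open(String path) throws Exception {
        LogByteSizeMergePolicy mergePolicy = new LogByteSizeMergePolicy();
        mergePolicy.setMergeFactor(10);      // how many adjacent segments merge at once
        mergePolicy.setMaxMergeMB(5 * 1024); // skip merging segments larger than ~5 GB
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer())
                .setMergePolicy(mergePolicy);
        return new IndexWriter(FSDirectory.open(Paths.get(path)), config);
    }
}
```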
Describe the solution you'd like
Explore usage of LogByteSizeMergePolicy for data stream use cases where @timestamp is a mandatory field.
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
Add any other context or screenshots about the feature request here.