
[Feature Request] [Parallel downloads] Investigate the use of virtual threads as the new I/O mechanism for segment downloads #11708

Open
kotwanikunal opened this issue Jan 2, 2024 · 5 comments

@kotwanikunal
Member

Is your feature request related to a problem? Please describe

Coming in from #11461 - we need to investigate utilizing virtual threads as the default I/O mechanism for segment downloads.

Describe the solution you'd like

#10786 talks about using virtual threads as the I/O mechanism over the current code base.
As a part of parallel downloads -

  • We can adapt reactive streams from the Java SDK as the standard async I/O mechanism
  • Adapt virtual threads as the I/O mechanism

A major benefit of the virtual thread mechanism is that the existing code constructs and APIs do not need to change - the benefits are gained at a logically lower level.
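For illustration, a minimal sketch of what this looks like on a JDK 21 runtime (the class and downloadSegmentFile method are hypothetical stand-ins, not OpenSearch code): the blocking download routine stays as-is and is simply submitted to a virtual-thread-per-task executor.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SegmentDownloadSketch {

    // Hypothetical stand-in for the existing blocking download routine;
    // its body would not need to change.
    static void downloadSegmentFile(String fileName) {
        // ... existing blocking I/O (InputStream reads, file writes, etc.) ...
    }

    public static void main(String[] args) throws Exception {
        List<String> segmentFiles = List.of("_0.cfs", "_0.cfe", "_0.si");

        // One virtual thread per file: blocking calls park the virtual thread
        // instead of tying up an OS thread for the duration of the download.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Callable<Void>> tasks = segmentFiles.stream()
                .<Callable<Void>>map(file -> () -> {
                    downloadSegmentFile(file);
                    return null;
                })
                .toList();
            executor.invokeAll(tasks); // waits for all downloads to complete
        }
    }
}
```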

The aim of this issue is to compare the overall effort and performance benefits of the two mechanisms, as well as to document the suggested path forward.

Related component

Search:Remote Search

Describe alternatives you've considered

  • Reactive Streams
  • Virtual threads

Additional context

@kotwanikunal kotwanikunal added enhancement Enhancement or improvement to existing feature or request untriaged labels Jan 2, 2024
@kartg kartg self-assigned this Jan 17, 2024
@kotwanikunal kotwanikunal assigned mch2 and unassigned kartg Jan 29, 2024
@kotwanikunal kotwanikunal self-assigned this Feb 26, 2024
@kotwanikunal
Member Author

Experiments performed

All of the experiments below were performed on a self-managed cluster with 3 r7g.4xl EC2 nodes, a load balancer, and a c5.2xl EC2 node running opensearch-benchmark as the load generator.

Recovery

  • Created an index nyc_taxis with 6 primary shards and 0 replicas initially, with a primary shard size of around 17 GB
    • The overall index size was around 120 GB.
  • Performed shard recovery across the 3 use cases by adding 2 replica shards to the index.
  • Bumped up the EBS volume throughput to 1000 MB/s and IOPS to 6000; the JVM heap was configured to 100 GB.
  • The number of concurrent recoveries was bumped up to 100 to facilitate parallel recovery of shards, and the speed limit (indices.recovery.max_bytes_per_sec) was removed.
| Approach | Shard Size (GB) | Recovery Time (seconds) | Throughput (MB/s) | Improvement |
| --- | --- | --- | --- | --- |
| File-based Blocking Threads | 17.8 | 94 | 189.9 | Baseline for recovery |
| File-based Virtual Threads | N/A | N/A | N/A | N/A |
| Part-based Blocking Threads | 17.8 | 85.5 | 208.23 | 10% over file-based |
| Part-based Virtual Threads | 17.8 | 70 | 255.17 | 34% over file-based, 22% over blocking multipart |

The virtual-threaded multipart approach is roughly 34% faster than the file-based approach and ~22% faster than the blocking multipart approach.

Replication

  • The replication scenario was tested with the above-described nyc_taxis index (6 primary shards of roughly 18 GB each and 0 replicas initially) and a new so index with 3 primary shards and 1 replica spread across 3 nodes.
  • The test involved increasing the replicas for nyc_taxis to 2, kicking off recovery, and simultaneously indexing documents on so to check for overall replication lag
    • A separate process polled _cat/segment_replication every 5 seconds while the recovery was in progress and dumped the results to a file
  • The number of concurrent recoveries was limited to 2 to test for system fairness, and the recovery speed (indices.recovery.max_bytes_per_sec) was limited to 125 MB/s

Case 1: 1s Refresh Interval for so

| Approach | Average Lag (seconds) | Average Lag (GB) | Max Lag (seconds) | Max Lag (GB) |
| --- | --- | --- | --- | --- |
| Baseline | 1.9 | 0.27 | 42 | 4.6 |
| File-based Blocking Threads | 45 | 1.4 | 132 | 4.8 |
| File-based Virtual Threads | 55 | 1.7 | 150 | 5.8 |
| Part-based Blocking Threads | 15 | 0.9 | 108 | 5.2 |
| Part-based Virtual Threads | 16 | 0.5 | 44 | 1.4 |

Case 2: 10s Refresh Interval for so

| Approach | Average Lag (seconds) | Average Lag (GB) | Max Lag (seconds) | Max Lag (GB) |
| --- | --- | --- | --- | --- |
| File-based Blocking Threads | 61.84873 | 2.0458 | 198 | 6.9 |
| File-based Virtual Threads | 71.4141 | 2.11445 | 246 | 6.3 |
| Part-based Blocking Threads | 51.16281 | 1.14736 | 168 | 3.1 |
| Part-based Virtual Threads | 41.63 | 1.44 | 126 | 4.8 |

Compared to the file-based approach, the part-based virtual-threaded approach gives better performance. The inherent fairness added by the part-based approach helps reduce the overall replication lag and improves replication performance.

The baseline case has been added for comparison purposes; the part-based virtual-threaded approach is the fairest of the approaches tested.

Security (JSM)

  • Virtual threads and the Security Manager are incompatible, as called out in JEP 444
  • Virtual threads run on top of platform (carrier) threads and need access to other threads and thread groups to park and unpark virtual threads when blocking code is encountered
  • Virtual threads can work with the OpenSearch codebase and the Security Manager with certain tweaks (a sketch of one such tweak follows below this list)
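For illustration, a minimal sketch of the kind of tweak mentioned above, assuming the usual OpenSearch pattern of wrapping permission-sensitive calls in AccessController.doPrivileged (class and method names here are hypothetical; whether this alone is sufficient under JSM is exactly what the branches linked below explore):

```java
import java.security.AccessController;
import java.security.PrivilegedAction;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

// Sketch only: creates the virtual-thread machinery inside a privileged block,
// the usual OpenSearch pattern for code that needs permissions the caller
// (e.g. a plugin) may not have.
public final class VirtualThreadExecutors {

    private VirtualThreadExecutors() {}

    @SuppressWarnings("removal") // AccessController is deprecated for removal on recent JDKs
    public static ExecutorService newPrivilegedVirtualThreadExecutor(String namePrefix) {
        return AccessController.doPrivileged((PrivilegedAction<ExecutorService>) () -> {
            // One named virtual thread per submitted task.
            ThreadFactory factory = Thread.ofVirtual().name(namePrefix + "-", 0).factory();
            return Executors.newThreadPerTaskExecutor(factory);
        });
    }
}
```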

Path forward

The next steps for multipart download support would be as follows -

  1. Add support for platform-threaded multipart downloads as the default download mechanism for remote store operations
    1. Virtual thread support will be added for JDK 21+ runtimes and is described in step 2
    2. This will be wrapped behind a feature release flag, with the default set to file-based downloads pre-launch
  2. Add multi-release JAR support for JDK 21+ and pre-JDK 21 runtimes (a sketch follows below this list)
    1. This will ensure usage of virtual threads on JDK 21+, where support exists
    2. This has been previously implemented here:
  3. For JSM, there are 2 possible approaches -
    1. Enable the Security Manager with a passthrough just for virtual threads, as highlighted here
    2. Enable virtual threads on JDK 21+ only if the Security Manager is disabled
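For illustration, a minimal sketch of how the multi-release JAR in step 2 could be laid out (package, class, and source-set paths are hypothetical; the actual OpenSearch build wiring may differ). The same fully-qualified class is compiled twice, the JDK 21 copy is packaged under META-INF/versions/21/, and the manifest sets Multi-Release: true, so JDK 21+ runtimes pick up the virtual-thread variant while older runtimes fall back to platform threads.

```java
// File 1: src/main/java/org/opensearch/sketch/DownloadExecutorFactory.java (base, pre-JDK 21)
package org.opensearch.sketch;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public final class DownloadExecutorFactory {
    private DownloadExecutorFactory() {}

    public static ExecutorService create(int maxThreads) {
        // Blocking/platform threads: a bounded pool.
        return Executors.newFixedThreadPool(maxThreads);
    }
}
```

```java
// File 2: same class, compiled with JDK 21 and packaged under META-INF/versions/21/
package org.opensearch.sketch;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public final class DownloadExecutorFactory {
    private DownloadExecutorFactory() {}

    public static ExecutorService create(int maxThreads) {
        // Virtual threads: one per task; concurrency is bounded elsewhere (e.g. a semaphore).
        return Executors.newVirtualThreadPerTaskExecutor();
    }
}
```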

TLDR:

  • For pre-JDK 21 runtimes - we will use multipart downloads behind a feature flag, with a fallback to file-based downloads until GA
  • For JDK 21+ runtimes - we will use multipart downloads with virtual threads (documenting the behavior with the Security Manager) and a fallback to platform/blocking-thread-based multipart downloads

Code

With Security Manager: main...kotwanikunal:OpenSearch:virtual-thread-sm
Without Security Manager: main...kotwanikunal:OpenSearch:virtual-thread

cc: @Bukhtawar @andrross @mch2

@reta
Collaborator

reta commented Mar 8, 2024

@kotwanikunal thanks a lot for exploring this area. There have been discussions related to virtual threads for a while now (both in the OpenSearch and Apache Lucene communities); at the moment there are a few cautionary cases that we may want to look into more closely:

  • file system I/O (basically accessing files on disk) is not virtual-thread friendly yet, at least on some operating systems; https://openjdk.org/jeps/444 warns about that explicitly:

    > The vast majority of blocking operations in the JDK will unmount the virtual thread, freeing its carrier and the underlying OS thread to take on new work. However, some blocking operations in the JDK do not unmount the virtual thread, and thus block both its carrier and the underlying OS thread. This is because of limitations at either the OS level (e.g., many filesystem operations) or the JDK level ...

  • as you mentioned, the Security Manager is a bummer ([RFC] Replace Java Security Manager (JSM) #1687); the approaches to pass through or disable it are (in my opinion) overly simplistic - they completely defeat the security boundaries. Since the issue is not easy to solve at large, we may consider "sealing" the usage of virtual threads in core to some "trusted" code flows, making sure there is no way to hijack it from "non-trusted" ones (aka plugins)

  • as a side note, we would very likely need to reconsider a large part of the core related to resource tracking, hot threads reporting, pool configurations, .... I am not 100% sure, but I believe we may have surprises here.

@kotwanikunal
Member Author

> @kotwanikunal thanks a lot for exploring this area. There have been discussions related to virtual threads for a while now (both in the OpenSearch and Apache Lucene communities); at the moment there are a few cautionary cases that we may want to look into more closely:
>
>   • file system I/O (basically accessing files on disk) is not virtual-thread friendly yet, at least on some operating systems; https://openjdk.org/jeps/444 warns about that explicitly

Nice to hear from you @reta! :)
Given that we see gains even with the platform-threaded approach, and that virtual threads will improve over time, would it make more sense to head in this direction versus the reactive/async programming model? Virtual threads will still be opt-in, possibly behind a feature flag, to allow for some bake time.

@kotwanikunal
Member Author

>   • as a side note, we would very likely need to reconsider a large part of the core related to resource tracking, hot threads reporting, pool configurations, .... I am not 100% sure, but I believe we may have surprises here.

For the approach we are looking at, the queue is not unbounded, and we will limit job submissions at the recovery/replication event level. We can build guardrails and a framework around it for this set of operations.
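For illustration, a minimal sketch of such a guardrail (names are hypothetical): a semaphore bounds the number of in-flight part downloads for a single recovery/replication event before tasks are handed to the virtual-thread executor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;

// Sketch: bound the number of concurrently downloading parts for a single
// recovery/replication event, so submissions to the virtual-thread executor
// are never effectively unbounded.
public class BoundedPartDownloader {

    private final ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor();
    private final Semaphore inFlightParts;

    public BoundedPartDownloader(int maxInFlightParts) {
        this.inFlightParts = new Semaphore(maxInFlightParts);
    }

    // Each Runnable is a stand-in for the real blocking part-download call.
    public List<Future<?>> downloadAllParts(List<Runnable> partDownloads) throws InterruptedException {
        List<Future<?>> futures = new ArrayList<>();
        for (Runnable downloadPart : partDownloads) {
            inFlightParts.acquire();            // back-pressure: wait for a free slot
            futures.add(executor.submit(() -> {
                try {
                    downloadPart.run();         // blocking I/O parks the virtual thread
                } finally {
                    inFlightParts.release();    // free the slot for the next part
                }
            }));
        }
        return futures;
    }
}
```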

@reta
Collaborator

reta commented Mar 9, 2024

> Nice to hear from you @reta! :)

❤️ Same, @kotwanikunal !

> Given that we see gains even with the platform-threaded approach, and that virtual threads will improve over time, would it make more sense to head in this direction versus the reactive/async programming model?

I think for downloading segments (network I/O) virtual threads are a perfect fit; for the filesystem we may need to weigh the tradeoffs. As far as I understand, the experimental implementation does not indicate where exactly the gains come from, just that in general there are some. If we could collect precise measurements here, it would help.

> For the approach we are looking at, the queue is not unbounded, and we will limit job submissions at the recovery/replication event level. We can build guardrails and a framework around it for this set of operations.

My apologies - guardrails are certainly needed, but I meant a different subject here: we need to have visibility into virtual threads regardless of where they are being used. Can we track the CPU share used by a virtual thread (as we do for regular ones)? Can we capture stacks to collect hot threads? And things like that ...
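For illustration, one partial visibility signal that exists today is the set of JFR events the JDK emits for virtual threads; a minimal sketch follows (it does not answer the CPU-share or hot-threads questions, it only shows what the runtime currently exposes):

```java
import jdk.jfr.consumer.RecordingStream;

// Sketch: stream the JDK-defined JFR events for virtual threads. A server
// process would run this on a background thread rather than blocking main.
public class VirtualThreadVisibilitySketch {
    public static void main(String[] args) {
        try (RecordingStream rs = new RecordingStream()) {
            // Emitted when a virtual thread parks while pinned to its carrier
            // (e.g. inside a synchronized block); includes a stack trace.
            rs.enable("jdk.VirtualThreadPinned").withStackTrace();
            // Emitted when the scheduler fails to submit a task for a virtual thread.
            rs.enable("jdk.VirtualThreadSubmitFailed");
            rs.onEvent("jdk.VirtualThreadPinned",
                e -> System.out.println("pinned for " + e.getDuration() + ": " + e.getStackTrace()));
            rs.onEvent("jdk.VirtualThreadSubmitFailed",
                e -> System.out.println("submit failed: " + e));
            rs.start(); // blocks while streaming events
        }
    }
}
```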
