New feature
This proposal enhances Nextflow's task polling mechanism by adding jitter to the polling interval, which should improve the robustness and efficiency of workflows, especially when they interact with external services.
Use case
The main use case is to prevent the "thundering herd" problem when Nextflow tasks poll external services (e.g., cloud provider APIs like GCP, AWS, Azure, or other web services) for status updates. When numerous tasks poll at synchronized, fixed intervals, their requests arrive in sudden, high-volume bursts. Such bursts can overwhelm external APIs, resulting in rate limiting, increased latency, or temporary service unavailability. Adding jitter would distribute these requests more evenly over time, making Nextflow a more resilient and "API-friendly" client.
Suggested implementation
The core polling logic resides in TaskPollingMonitor.groovy and ParallelPollingMonitor.groovy.
- Identify Polling Loop: The pollLoop() method in TaskPollingMonitor.groovy (around line 483) is the primary location where the polling interval is enforced via the await(time) call.
- Introduce Jitter Calculation:
  - Modify the await(long time) method (around line 578) or pollLoop() to add a random delay (jitter) to the fixed pollIntervalMillis.
  - The jitter should be calculated such that there is an inverse relationship between the base polling interval and the jitter factor. For example:
    - For a shorter pollIntervalMillis, a larger percentage of that interval could be used for random jitter.
    - For a longer pollIntervalMillis, a smaller percentage of that interval would be used for jitter.
  - A possible formula could be jitter = random_factor * (max_jitter_percentage / pollIntervalMillis_normalized).
    - random_factor would be a random number between 0 and 1.
    - max_jitter_percentage would be a configurable value (e.g., 25% of the pollIntervalMillis).
    - pollIntervalMillis_normalized could be pollIntervalMillis divided by a base unit (e.g., 1000 ms for 1 second) to ensure the inverse relationship scales appropriately.
  - The final delay would be pollIntervalMillis + jitter.
- Configuration: Consider adding a new configuration parameter (e.g., executor.<name>.pollJitterFactor) to ExecutorConfig to allow users to control the maximum jitter percentage, or to enable/disable the jitter.
- Impact: This change would be inherited by all executors that use TaskPollingMonitor and ParallelPollingMonitor, including cloud-specific plugins like nf-google.
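
The jitter calculation described above could be sketched roughly as follows. This is a standalone illustration, not Nextflow code: the class PollJitter and its method names are hypothetical, and it is written in plain Java (rather than Groovy) only to keep it self-contained. Two assumptions go beyond the proposal. First, the formula is read as yielding a fraction of the interval, which is then multiplied by pollIntervalMillis to obtain a delay in milliseconds. Second, that fraction is capped at 1.0 so the jitter never exceeds the interval itself.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for illustration only; not part of Nextflow.
final class PollJitter {

    private PollJitter() {}

    // Largest jitter (in ms) allowed for the given polling interval.
    // maxJitterFraction is the configurable maximum, e.g. 0.25 for 25%.
    static long maxJitterMillis(long pollIntervalMillis, double maxJitterFraction, long baseUnitMillis) {
        if (pollIntervalMillis <= 0 || baseUnitMillis <= 0 || maxJitterFraction <= 0) {
            return 0L; // jitter disabled, or nothing meaningful to scale
        }
        // Normalize the interval against the base unit (e.g. 1000 ms).
        double normalized = (double) pollIntervalMillis / baseUnitMillis;
        // Inverse relationship: longer intervals get a smaller fraction.
        double fraction = maxJitterFraction / normalized;
        // Assumption: never allow more jitter than the interval itself.
        fraction = Math.min(fraction, 1.0);
        // Assumption: scale the fraction by the interval to get milliseconds.
        return Math.round(fraction * pollIntervalMillis);
    }

    // Deterministic variant: the caller supplies random_factor in [0, 1].
    static long jitteredDelayMillis(long pollIntervalMillis, double maxJitterFraction,
                                    long baseUnitMillis, double randomFactor) {
        if (randomFactor < 0.0 || randomFactor > 1.0) {
            throw new IllegalArgumentException("randomFactor must be in [0, 1]");
        }
        long maxJitter = maxJitterMillis(pollIntervalMillis, maxJitterFraction, baseUnitMillis);
        // Final delay = fixed interval + random share of the allowed jitter.
        return pollIntervalMillis + (long) (randomFactor * maxJitter);
    }

    // Convenience variant that draws random_factor itself.
    static long jitteredDelayMillis(long pollIntervalMillis, double maxJitterFraction, long baseUnitMillis) {
        double randomFactor = ThreadLocalRandom.current().nextDouble(); // in [0, 1)
        return jitteredDelayMillis(pollIntervalMillis, maxJitterFraction, baseUnitMillis, randomFactor);
    }

    public static void main(String[] args) {
        long baseUnitMillis = 1000L;
        double maxJitterFraction = 0.25;
        for (long interval : new long[] {500L, 1000L, 5000L, 30000L}) {
            System.out.printf("interval=%d ms, max jitter=%d ms, sample delay=%d ms%n",
                    interval,
                    maxJitterMillis(interval, maxJitterFraction, baseUnitMillis),
                    jitteredDelayMillis(interval, maxJitterFraction, baseUnitMillis));
        }
    }
}
```

Under this reading, the maximum jitter in milliseconds works out to max_jitter_percentage times the base unit (250 ms with the example values of 25% and 1000 ms) for any interval at least that long. The jitter's share of the interval therefore shrinks as the interval grows, which matches the inverse relationship described above. Passing a fraction of 0 disables the jitter, which is one way a setting such as pollJitterFactor could double as an on/off switch.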