Provide a better way to declare nodes/clusters/cluster formation during the build #30904

Closed
@alpar-t

Description

Todo:

  • implement Version in java so we can use it in cluster-formation
  • rename to testClusters and TestClustersPlugin ditching ClusterFormation
  • proof of concept plugin to check the integration points with Gradle and write integration test
  • implement support for setting up a single node cluster and actually starting and using it
  • restrict the type of tasks that can use the plugin by default (only configure task extensions on specific tasks)
  • start using the new cluster-formation for rest integration tests ( modules, plugins )
  • start using the new cluster-formation for rest integration tests on x-pack
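
As a rough illustration of the first todo item, a minimal `Version` in Java could look like the sketch below. The class shape, the `fromString` and `onOrAfter` names, and the strict `major.minor.revision` format are assumptions made for this example, not the implementation that ended up in the build; qualifiers such as `-SNAPSHOT` are deliberately not handled.

```java
import java.util.Objects;

/** Hypothetical sketch: a comparable major.minor.revision version. */
final class Version implements Comparable<Version> {
    private final int major;
    private final int minor;
    private final int revision;

    private Version(int major, int minor, int revision) {
        this.major = major;
        this.minor = minor;
        this.revision = revision;
    }

    /** Parses strings such as "6.3.0"; anything else is rejected. */
    static Version fromString(String text) {
        String[] parts = text.split("\\.");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected major.minor.revision but got: " + text);
        }
        return new Version(
            Integer.parseInt(parts[0]),
            Integer.parseInt(parts[1]),
            Integer.parseInt(parts[2])
        );
    }

    /** True when this version is the same as or newer than the other one. */
    boolean onOrAfter(Version other) {
        return compareTo(other) >= 0;
    }

    @Override
    public int compareTo(Version other) {
        int byMajor = Integer.compare(major, other.major);
        if (byMajor != 0) {
            return byMajor;
        }
        int byMinor = Integer.compare(minor, other.minor);
        if (byMinor != 0) {
            return byMinor;
        }
        return Integer.compare(revision, other.revision);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Version)) {
            return false;
        }
        Version v = (Version) o;
        return major == v.major && minor == v.minor && revision == v.revision;
    }

    @Override
    public int hashCode() {
        return Objects.hash(major, minor, revision);
    }

    @Override
    public String toString() {
        return major + "." + minor + "." + revision;
    }
}
```

Comparing numerically rather than as strings matters here: `6.10.0` has to sort after `6.3.0`, which a plain string comparison would get wrong.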

DSL Glimpse

plugins {
    id 'elasticsearch.clusterformation'
}

testClusters {
    myTestCluster {
        distribution = 'ZIP'
        version = '6.3.0'
    }
}

task user1 {
    useCluster testClusters.myTestCluster
    doLast {
        println "Cluster running @ ${elasticsearchNodes.myTestCluster.httpSocketURI}"
    }
}

task user2 {
    useCluster testClusters.myTestCluster
    doLast {
        println "Cluster running @ ${elasticsearchNodes.myTestCluster.httpSocketURI}"
    }
}

Produces this output:

> Task :syncClusterFormationArtifacts UP-TO-DATE

> Task :user1
Starting `myTestCluster`
Cluster running @ [::1]:37347
Not stopping `myTestCluster`, since node still has 1 claim(s)

> Task :user2
Cluster running @ [::1]:37347
Stopping `myTestCluster`, number of claims is 0

BUILD SUCCESSFUL in 10s
3 actionable tasks: 2 executed, 1 up-to-date
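
The output above suggests the cluster lifetime is driven by a claim count rather than by dedicated stop tasks. A minimal sketch of that idea in plain Java follows; the class and method names (`ClaimedCluster`, `claim`, `start`, `release`) are hypothetical and only the log messages are taken from the output. It assumes each task that uses the cluster registers a claim up front, and that a release follows every task execution.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: a cluster that stops once its last claim is released. */
final class ClaimedCluster {
    private final String name;
    private final List<String> log = new ArrayList<>();
    private int claims = 0;
    private boolean running = false;

    ClaimedCluster(String name) {
        this.name = name;
    }

    /** Registered once per task that declares it uses the cluster. */
    synchronized void claim() {
        claims++;
    }

    /** Called before a claiming task runs; only the first call starts the cluster. */
    synchronized void start() {
        if (running) {
            return;
        }
        running = true;
        log.add("Starting `" + name + "`");
    }

    /** Called after a claiming task finishes; the last release stops the cluster. */
    synchronized void release() {
        if (claims == 0) {
            throw new IllegalStateException("No outstanding claims on " + name);
        }
        claims--;
        if (claims > 0) {
            log.add("Not stopping `" + name + "`, since node still has " + claims + " claim(s)");
            return;
        }
        running = false;
        log.add("Stopping `" + name + "`, number of claims is 0");
    }

    synchronized boolean isRunning() {
        return running;
    }

    synchronized List<String> log() {
        return new ArrayList<>(log);
    }
}
```

With two claims registered, running `start()` and `release()` once per task reproduces the three cluster messages shown for `user1` and `user2`. Because the stop decision depends only on the counter, it does not rely on task ordering.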

Initial Description

The current cluster formation has the following limitations:

  • no straightforward way to create additional clusters or define relationships between them
  • does not currently work with --parallel, and as such has no support for parallelism (note that test.jvm doesn't help here; these tests always run in sequence)
  • complex tests like rolling upgrade are not readable at all, as they rely on relations between Gradle tasks that are really hard to follow

The main reason --parallel does not work is that Gradle's finalizedBy does not offer any guarantees about when the task will be run. We use this for stopping clusters, but when running in parallel Gradle puts that off, so one can end up running 40+ es nodes (512mb * 40 ~ 20GB) before running out of memory, at which point the build starts to fail. There is no easy fix for this, other than setting up a bunch of mustRunAfter rules for the different tasks. Some tests run across clusters, upgrade and restart nodes, etc., so we can't make any assumptions about when the stop task is safe to run. That means we can't really enforce a "stop after test runner for this cluster completed" rule, as the test runners of other clusters might still need this cluster.

Even after doing some hacks to bring down the nodes sooner and not run out of memory, --parallel uncovered some missing ordering relations between tasks that were causing failures.

From some limited testing, I estimate build time could be reduced by at least 30% by being able to run integ tests in parallel (based on running :qa:check on my CPU with 6 physical cores and 32GB RAM).

From what I can see, this is the only thing preventing us from simply running builds with clean check --parallel without having to pick and choose what works in parallel and what doesn't.

I think we should create a cluster formation DSL that does not rely on Gradle tasks to perform its operations. We would still use Gradle to fetch and set up distributions, but everything else would be externalized. The DSL would provide configuration for the cluster and expose methods to alter its state (start/stop the cluster or individual nodes, change configuration, etc.).
There would be methods for high-level operations like starting and stopping the cluster and running tests, as well as lower-level operations that work at the node level.

No operation would be carried out by default; a task would have to be set up that calls these operations from the task action (or as doLast). We can also provide a task, with an option to control whether it's created, to cover the common case of setting up a cluster, running tests and terminating.
Of course we would need to have a way to run tests outside of Gradle, but since we don't use its infrastructure to do it anyway, it shouldn't be that hard.
The custom DSL can make use of Gradle's NamedDomainObjectCollection so plugins can change defaults for different sections of the build when a new cluster is defined.
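
To make the "plugins can change defaults" point concrete, here is a plain-Java stand-in for the behaviour being relied on: a named collection where default actions apply to every existing and future element. This is not the Gradle API; `ClusterSpec`, `ClusterRegistry`, and their methods are hypothetical names for this sketch, and the only fields modelled are the `distribution` and `version` settings from the DSL glimpse.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical cluster configuration, mirroring the fields in the DSL glimpse. */
final class ClusterSpec {
    final String name;
    String distribution;
    String version;

    ClusterSpec(String name) {
        this.name = name;
    }
}

/** Hypothetical named registry; default actions reach existing and future clusters. */
final class ClusterRegistry {
    private final Map<String, ClusterSpec> clusters = new LinkedHashMap<>();
    private final List<Consumer<ClusterSpec>> defaults = new ArrayList<>();

    /** Registers a default and applies it to clusters that already exist. */
    void all(Consumer<ClusterSpec> action) {
        defaults.add(action);
        clusters.values().forEach(action);
    }

    /** Creates a cluster: defaults run first, then the caller's own configuration. */
    ClusterSpec create(String name, Consumer<ClusterSpec> configure) {
        if (clusters.containsKey(name)) {
            throw new IllegalArgumentException("Cluster already defined: " + name);
        }
        ClusterSpec spec = new ClusterSpec(name);
        for (Consumer<ClusterSpec> action : defaults) {
            action.accept(spec);
        }
        configure.accept(spec);
        clusters.put(name, spec);
        return spec;
    }

    ClusterSpec getByName(String name) {
        ClusterSpec spec = clusters.get(name);
        if (spec == null) {
            throw new IllegalArgumentException("Unknown cluster: " + name);
        }
        return spec;
    }
}
```

In this sketch a plugin would call `all(...)` once to set, say, the distribution for a section of the build, and each cluster definition would override only what it needs.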

Related: #30874, #30903
