Description
The Vision
Testing Polkadot changes and releases before they go live is a multi-faceted challenge, which we address with a range of automated and manual approaches. The vision is to test in an environment that is as close as possible to the Kusama/Polkadot main nets. We have built, and are continuously improving, core building blocks such as ZombieNet as a means to get there. This milestone is the first step in shifting our integration testing in that direction, albeit with reduced scope and scale. It will make it possible to run long-duration integration tests based on existing test cases, while also including closer-to-real-world parametrisation of the environment:
- 100 nodes in total (validators + collators)
- A mix of node versions as observed by Telemetry, with nodes clustered in geographic regions (from a latency and packet loss perspective)
- Nodes popping in and out of the network
- Whole network upgrade of clients and runtime upgrades
The Plan
Current status of integration testing
We currently cover functionality and small-scale testing (up to 10 nodes) with a Zombienet test suite. While we continuously add more tests to it, they all run under lab networking conditions, meaning close to zero latency, packet reordering and packet loss. This is good from a basic functionality testing perspective, but it does not cover the edge cases or race conditions that are usually hit in real-world scenarios on the Kusama/Polkadot networks.
Moving towards real production environment testing
Ideally we should be able to make Versi behave similarly to Kusama in terms of latency and behaviour, but this would interfere with other types of testing we do as part of development on a regular basis. As an incremental improvement and reasonable compromise, we should experiment with Zombienet-based chaos testing. This touches a bit on negative testing and needs to tackle the following scenarios:
- A mix of node versions as observed by Telemetry (1KV)
- Nodes clustered in geographic regions as observed from a latency and packet loss perspective
- Nodes popping in and out of the network
- Whole network upgrade of clients rolled out over a period of time
- Dispute load testing
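As a rough illustration of how the version mix and regional clustering above could be parametrised, here is a minimal Python sketch. It is not Zombienet's configuration format: the region names, latency and packet loss figures, version labels and helper names are all hypothetical placeholders, and real values would have to come from Telemetry data.

```python
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class LinkProfile:
    latency_ms: int
    loss_pct: float


# Hypothetical per-region-pair network conditions (placeholders, not measurements).
REGION_LINKS = {
    ("eu", "eu"): LinkProfile(10, 0.0),
    ("eu", "us"): LinkProfile(90, 0.5),
    ("eu", "asia"): LinkProfile(180, 1.0),
    ("us", "us"): LinkProfile(15, 0.0),
    ("us", "asia"): LinkProfile(150, 1.0),
    ("asia", "asia"): LinkProfile(20, 0.1),
}


def link_between(region_a: str, region_b: str) -> LinkProfile:
    """Look up the link profile for a pair of regions, in either order."""
    key = (region_a, region_b)
    if key not in REGION_LINKS:
        key = (region_b, region_a)
    return REGION_LINKS[key]


def assign_nodes(node_count: int, region_weights: dict, version_weights: dict, seed: int = 0) -> list:
    """Assign each node a region and a client version, sampled from the given weights.

    A fixed seed keeps the generated topology reproducible between runs.
    """
    rng = random.Random(seed)
    regions = rng.choices(list(region_weights), weights=list(region_weights.values()), k=node_count)
    versions = rng.choices(list(version_weights), weights=list(version_weights.values()), k=node_count)
    return [
        {"name": f"node-{i}", "region": region, "version": version}
        for i, (region, version) in enumerate(zip(regions, versions))
    ]


if __name__ == "__main__":
    nodes = assign_nodes(
        100,
        region_weights={"eu": 0.5, "us": 0.3, "asia": 0.2},
        version_weights={"version-a": 0.7, "version-b": 0.3},
    )
    first, second = nodes[0], nodes[1]
    print(first, second, link_between(first["region"], second["region"]))
```

The seeded sampling is a deliberate choice here: a chaos test that fails should be replayable with the same topology.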
CI pipelines
Running this on every PR is not a good idea, as we want the PR pipeline to stay short and smooth, allowing for fast turnaround times during development. Instead, we need to create a separate pipeline that takes a predefined mix of node versions and a network topology, runs more or less the same test suite as the PR pipeline, and measures the KPI metrics we currently follow as part of our monitoring and alerting.
Some key indicators of the network health:
- relay chain block times
- finality lag
- dispute resolution times
- errors/warnings generated by nodes
- parachain block times
- XCM throughput
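To make the idea of measuring these indicators concrete, here is a minimal Python sketch that checks two of them, block times and finality lag, against thresholds. The function names and the threshold values are hypothetical; the actual limits would come from the existing monitoring and alerting rules.

```python
from statistics import mean


def block_times(timestamps_s: list) -> list:
    """Intervals between consecutive block timestamps, in seconds."""
    return [later - earlier for earlier, later in zip(timestamps_s, timestamps_s[1:])]


def finality_lag(best_block: int, finalized_block: int) -> int:
    """Number of blocks by which finality trails the best block."""
    return best_block - finalized_block


def check_kpis(timestamps_s, best_block, finalized_block,
               max_mean_block_time_s, max_finality_lag):
    """Return a list of human-readable KPI violations (empty if all checks pass)."""
    failures = []
    times = block_times(timestamps_s)
    if times and mean(times) > max_mean_block_time_s:
        failures.append(
            f"mean block time {mean(times):.1f}s exceeds {max_mean_block_time_s}s"
        )
    lag = finality_lag(best_block, finalized_block)
    if lag > max_finality_lag:
        failures.append(f"finality lag {lag} blocks exceeds {max_finality_lag}")
    return failures


if __name__ == "__main__":
    # Hypothetical thresholds and samples, for illustration only.
    print(check_kpis([0, 6, 12, 18], best_block=100, finalized_block=97,
                     max_mean_block_time_s=6.5, max_finality_lag=5))
```

Returning a list of violations rather than failing on the first one lets a long-running pipeline report every degraded indicator in a single run.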
Addressing the Zombienet scalability limits
We know that the current architecture of Zombienet supports, out of the box, around 100 nodes in total (validators + collators) for a single network spawn and test run in k8s. With the end goal of 1kV and 40+ parachains in mind, we can implement some cheap changes that allow multiple Zombienet test instances to spawn nodes that join the same network (via shared bootnodes).
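The multi-instance idea can be sketched as a simple planning step: split the desired node count into chunks that each fit within one instance's limit, and hand every chunk the same bootnode list. The Python below is only an illustration under that assumption; `plan_instances` and the bootnode address are hypothetical and do not correspond to an existing Zombienet API.

```python
def plan_instances(total_nodes: int, bootnodes: list, max_per_instance: int = 100) -> list:
    """Split `total_nodes` across instances of at most `max_per_instance` nodes each.

    Every instance receives the same bootnode list so that all spawned nodes
    join a single network.
    """
    if total_nodes < 0 or max_per_instance <= 0:
        raise ValueError("total_nodes must be >= 0 and max_per_instance > 0")
    plan = []
    start = 0
    while start < total_nodes:
        size = min(max_per_instance, total_nodes - start)
        plan.append({
            "instance": len(plan),
            "node_indices": list(range(start, start + size)),
            "bootnodes": list(bootnodes),
        })
        start += size
    return plan


if __name__ == "__main__":
    # Placeholder multiaddr; a real one would carry the bootnode's actual peer id.
    shared = ["/dns/bootnode-0/tcp/30333/p2p/<peer-id>"]
    for instance in plan_instances(250, shared):
        print(instance["instance"], len(instance["node_indices"]), instance["bootnodes"])
```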
For assertions, we need to change the way Zombienet checks metrics/logs so that it no longer scrapes the nodes individually, but instead calls the Prometheus/Loki APIs.
With these changes we should be able to scale to hundreds of nodes and tens of parachains.
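A metric assertion that goes through Prometheus instead of scraping each node might look roughly like the Python sketch below. It uses the Prometheus HTTP API's instant query endpoint (`/api/v1/query`) and its documented JSON response shape. The Prometheus URL, the metric name `substrate_block_height{status="finalized"}`, the `instance` label and the helper names are assumptions made for illustration, not the planned Zombienet implementation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload: dict) -> dict:
    """Map each series' `instance` label to its sample value."""
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise RuntimeError(f"expected an instant vector, got {data['resultType']}")
    # Each sample is [unix_timestamp, "value-as-string"].
    return {
        series["metric"].get("instance", ""): float(series["value"][1])
        for series in data["result"]
    }


def nodes_below(values: dict, minimum: float) -> list:
    """Names of nodes whose metric value is below `minimum`."""
    return sorted(name for name, value in values.items() if value < minimum)


def query(base_url: str, promql: str, timeout_s: float = 10.0) -> dict:
    """Run one instant query for all nodes at once (performs a network call)."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout_s) as response:
        return parse_instant_vector(json.load(response))


if __name__ == "__main__":
    # Assumed Prometheus address and metric name; adjust to the real deployment.
    heights = query("http://prometheus:9090", 'substrate_block_height{status="finalized"}')
    print("nodes lagging behind block 10:", nodes_below(heights, 10))
```

A single query returns one sample per node, so the cost of an assertion no longer grows with the number of HTTP connections the test runner has to open.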
TODO: cut and link granular issues for Zombienet changes, infrastructure and the actual tests.
Open Questions
The discussion on the tracking issue is still in its initial phase, but some open questions that stand out are:
- How do we gather the required parametrisation data from the main networks?
- Should we use test collators or actual collators from parachain teams in the ecosystem?
Project tracking board