Description
Affected version
0.7.0-nightly
Current and expected behavior
Currently, we have a format-namenode init container with a script that formats a namenode as either active or standby.
With a single role-group and podManagementPolicy: "OrderedReady",
we make sure that the namenodes (and in fact the datanodes and journalnodes as well) spin up one after another.
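For context, the StatefulSet generated for a role-group carries this policy roughly as follows (a minimal sketch with hypothetical names, not the operator's exact output):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hdfs-namenode-default   # hypothetical name
spec:
  podManagementPolicy: OrderedReady  # pods are created one at a time, in ordinal order
  serviceName: hdfs-namenode-default
  replicas: 2
  selector:
    matchLabels:
      app: hdfs-namenode
  template:
    metadata:
      labels:
        app: hdfs-namenode
    spec:
      containers:
        - name: namenode
          image: hadoop   # placeholder image
```

Note that the policy only orders pods within one StatefulSet; it says nothing about ordering across StatefulSets.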
With two role-groups like:

```yaml
nameNodes:
  roleGroups:
    default:
      replicas: 1
    other_default:
      replicas: 1
```
we get two StatefulSets, each of which respects the "OrderedReady" policy on its own but spins up its Pods in parallel with the other.
This may lead to a cluster startup failure. Sometimes (flakily), the namenodes of the different role-groups both format themselves as active, with different cluster IDs etc., which causes the "slower" namenode to fail to start up and join the cluster:
```text
Failed to start namenode.
java.io.FileNotFoundException: No valid image files found
    at org.apache.hadoop.hdfs.server.namenode.FSImageTransactionalStorageInspector.getLatestImages(FSImageTransactionalStorageInspector.java:158)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:688)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:339)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1201)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:779)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:681)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:768)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:1020)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:995)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1769)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1834)
```
Possible solution
We have to improve the format-namenode
init container script to take into account that namenodes (and their formatting) may start in parallel. Currently, it only checks whether there is already an active namenode and, depending on that, formats as active or standby, which leads to the race condition of two nodes both being formatted as active with different cluster IDs.
- Take ZooKeeper into account?
- Let the operator determine which role-group should format as active?
- Introduce "wait" times for the different role-groups to make sure they do not spin up in parallel (not very deterministic)
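For the second idea, a low-tech sketch (all names hypothetical, not the operator's actual API): derive a single "format leader" deterministically from the sorted list of namenode pod names, so exactly one pod across all role-groups formats as active while the rest bootstrap as standby.

```shell
#!/bin/sh
# Hypothetical sketch: pick one "format leader" deterministically across all
# role-groups by sorting the full list of namenode pod names. Only the leader
# formats as active; every other pod bootstraps as standby once the leader is up.
# In a real init container, the pod's own name would come from the downward API.

format_role() {
    # $1: this pod's name; remaining args: all namenode pod names
    this="$1"; shift
    leader=$(printf '%s\n' "$@" | sort | head -n 1)
    if [ "$this" = "$leader" ]; then
        echo active
    else
        echo standby
    fi
}

format_role hdfs-namenode-default-0 \
    hdfs-namenode-default-0 hdfs-namenode-other-default-0
# prints "active": hdfs-namenode-default-0 sorts first
```

Because every pod computes the same sorted list, they all agree on the leader without any runtime coordination; the hard part in practice is giving each init container a consistent view of all namenode pod names across role-groups.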
@lfrancke @soenkeliebau @Jimvin any ideas?
Additional context
This came up when implementing logging for the HDFS operator (and the integration tests using multiple role-groups per role for custom and automatic log testing).
We should try to get rid of the "OrderedReady" policy anyway (see #261) to speed up cluster creation.
Environment
Failed on GKE 1.23, AWS 1.22, Azure 1.23 (and probably any other provider)
Would you like to work on fixing this bug?
None