The hierarchical model needs special data preparation: the input is split into a fixed number of chunks of a fixed length each, so the maximum sequence length is the product of those two numbers. The chunk length is constrained only by the base encoder (say ~512 tokens), and the number of chunks isn't baked into the network, because attention averages over them. So the data doesn't strictly have to be chunked the same way at inference as during training, and we don't even put those parameters in the model config. Without them in the config it's harder to suggest good defaults, but do we really want to stay flexible enough to let them change? Eh, maybe not.
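
As a rough illustration, here is a minimal sketch of that chunking step. The names (`chunk_tokens`, `chunk_len`, `num_chunks`, `pad_id`) are placeholders, not the actual preprocessing code, and it assumes the input is already a flat list of token ids:

```python
import torch

def chunk_tokens(token_ids, chunk_len=512, num_chunks=4, pad_id=0):
    """Split a flat token id list into a (num_chunks, chunk_len) tensor.

    Truncates anything past chunk_len * num_chunks and pads the tail chunk.
    """
    max_len = chunk_len * num_chunks
    ids = list(token_ids[:max_len])
    ids += [pad_id] * (max_len - len(ids))  # pad up to a whole number of chunks
    return torch.tensor(ids).view(num_chunks, chunk_len)

# Training might use num_chunks=4; nothing stops inference from using 8,
# since the base encoder only ever sees chunk_len tokens at a time and the
# chunk embeddings are averaged afterwards, so num_chunks is never a learned
# dimension.
chunks = chunk_tokens(list(range(3000)), chunk_len=512, num_chunks=8)
print(chunks.shape)  # torch.Size([8, 512])

# Downstream (sketch): the base encoder maps each chunk to an embedding of
# shape (num_chunks, hidden_dim), and the document embedding is e.g. the mean
# over chunks: doc_embed = chunk_embeds.mean(dim=0)
```

This is only a sketch of why the two numbers can differ between training and inference; the real preparation pipeline may pad, truncate, or pool differently.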