Hi there, thank you for this great release!
I'm wondering if it would be possible to use the quality filter to filter out documents under a certain length. For example, I'm looking to assemble a dataset where each sequence is between 64k and 128k tokens of context.
- Is this easily configurable in the quality filter?
- Would this filter be applied before or after tokenization?
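To clarify what I'm after, here is a rough sketch of the kind of predicate I have in mind (the function and parameter names are hypothetical, not from this project's API, and the whitespace tokenizer and small thresholds are just for illustration):

```python
def token_length_filter(doc, tokenizer, min_tokens=64_000, max_tokens=128_000):
    """Keep a document only if its tokenized length is in [min_tokens, max_tokens]."""
    n_tokens = len(tokenizer(doc))
    return min_tokens <= n_tokens <= max_tokens

# Toy demonstration: whitespace "tokenizer" and tiny thresholds.
docs = ["a b c", "a b c d e f", "a"]
kept = [d for d in docs if token_length_filter(d, str.split, min_tokens=2, max_tokens=5)]
# kept == ["a b c"]
```

If the quality filter runs on raw text (pre-tokenization), I assume I'd need to approximate token counts instead, which is why I'm asking about the ordering.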
Thank you for your help.