* Perplexity filtering: use a (small) pre-trained model to identify "noisy" text * language identification to ensure monolinguistics