Description
Hi everyone,
As everyone has seen, the random seed can have a significant effect on the prediction scores. This is because most of us are using algorithms with a random component (e.g., random forest, extra trees...).
The effect is probably amplified by the fact that the dataset we are working on is small and non-stationary.
Matt has been dealing with the problem by testing a series of random seeds and keeping the best one. This avoids discarding a model just because of a "bad" random seed. However, it might favor the most unstable models: a very stable model will yield scores in a narrow range across several random seeds, while an unstable model will yield a wide range of scores. So, given enough random seeds, an unstable model is likely to hit a very high score at some point, but that does not mean it will be good at predicting new test data.
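To see why keeping the best seed rewards instability, here is a toy simulation (the score distributions are completely made up, purely for illustration): two models with the same central skill, one stable and one unstable. The best-of-seeds score inflates for the unstable model, while the median barely moves.

```python
import numpy as np

rng = np.random.default_rng(0)
n_seeds = 10

# Made-up score distributions: same central skill,
# very different seed-to-seed spread.
stable   = rng.normal(loc=0.80, scale=0.005, size=n_seeds)
unstable = rng.normal(loc=0.80, scale=0.050, size=n_seeds)

# Keeping the best seed rewards spread...
print(f"best of {n_seeds} seeds: stable {stable.max():.3f} | unstable {unstable.max():.3f}")
# ...while the median is nearly unaffected by it.
print(f"median of seeds:   stable {np.median(stable):.3f} | unstable {np.median(unstable):.3f}")
```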
A possible solution would be to test 10 (or another number of) random seeds and take the median score as the prediction score. To avoid creating extra work for Matt, we could build this directly into our scripts: make 10 predictions using 10 different random seeds and export them in a single CSV file.
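Something like this minimal sketch, assuming scikit-learn; the synthetic data, the 70/30 split, the random forest, and the output file name are placeholders for whatever each of us actually uses:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for our dataset; swap in the real loading code.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, _ = train_test_split(X, y, test_size=0.3, random_state=0)

scores = []
predictions = {}
for seed in range(10):
    model = RandomForestClassifier(random_state=seed)
    # The CV folds are deterministic here, so only the model's randomness varies.
    scores.append(cross_val_score(model, X_train, y_train, cv=5).mean())
    model.fit(X_train, y_train)
    predictions[f"seed_{seed}"] = model.predict(X_test)

# Report the median over seeds instead of the best seed.
print(f"median CV score over {len(scores)} seeds: {np.median(scores):.4f}")

# One column per seed, all in a single CSV so Matt gets everything at once.
pd.DataFrame(predictions).to_csv("predictions_all_seeds.csv", index=False)
```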
What do you guys (and especially Matt) think about that?