Description
Hi, I apologise if this is a stupid question. I am using CRFsuite for IOB labelling, and when I run the exact same experiment in 3 trials, the results are sometimes (not always) very different per run. In some instances, the standard deviation of the f1-scores across the three runs is over 5%.
For each run, I am using the exact same training and test set (which are completely separate). I do use cross-validation for hyperparameter optimisation, but I set the random_state there to avoid changes between runs. So basically, I do the following with identical data 3 times:
```python
from sklearn.model_selection import GridSearchCV, KFold
from sklearn_crfsuite import metrics

# crf, hyperparam_search_space, scorer, nr_folds and the data splits are defined elsewhere
grid_search = GridSearchCV(crf, hyperparam_search_space, scoring=scorer, verbose=True, cv=KFold(nr_folds, random_state=42))
grid_search.fit(x_train, y_train)
# evaluate the best estimator from the grid search on the held-out test set
optimised_crf = grid_search.best_estimator_
y_pred = optimised_crf.predict(x_test)
final_score = metrics.flat_f1_score(y_test, y_pred, average='macro', labels=["I", "O", "B"])
```
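For completeness, a "trial" is literally just repeating that snippet; roughly this (a sketch only, `run_experiment` is just a name I am using here to wrap the block above):

```python
import statistics
from sklearn.model_selection import GridSearchCV, KFold
from sklearn_crfsuite import metrics

def run_experiment():
    # exactly the snippet above: grid search on the training set, then
    # macro f1 of the best estimator on the fixed, held-out test set
    grid_search = GridSearchCV(crf, hyperparam_search_space, scoring=scorer,
                               verbose=True, cv=KFold(nr_folds, random_state=42))
    grid_search.fit(x_train, y_train)
    y_pred = grid_search.best_estimator_.predict(x_test)
    return metrics.flat_f1_score(y_test, y_pred, average='macro', labels=["I", "O", "B"])

# same data, same code, three times
scores = [run_experiment() for _ in range(3)]
print(scores, statistics.stdev(scores))  # the stdev is sometimes above 0.05
```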
To illustrate, these are the results from 3 identical runs on identical data:
Example 1:
f1 (micro): 83.2%, 81.6%, 66.2%
f1 (macro): 71.8%, 71.6%, 57.5%
Example 2:
f1 (micro): 81.1%, 77.6%, 66.7%
f1 (macro): 53.5%, 57.3%, 47.1%
The differences are not always this large (and when they are, it is usually because one of the runs has a much lower score). Micro f1 scores are also more stable than macro f1 scores (the data is imbalanced; sometimes only around 10% of the labels are I, for instance).
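To show what I mean about macro being more sensitive, here is a toy illustration with made-up, heavily skewed labels (not my data):

```python
from sklearn_crfsuite import metrics

# one toy sequence, heavily skewed towards "O" (made-up labels)
y_true   = [["O", "O", "O", "O", "O", "O", "O", "O", "B", "I"]]
y_pred_a = [["O", "O", "O", "O", "O", "O", "O", "O", "B", "I"]]  # rare labels correct
y_pred_b = [["O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]]  # rare labels missed

for y_pred in (y_pred_a, y_pred_b):
    micro = metrics.flat_f1_score(y_true, y_pred, average='micro', labels=["I", "O", "B"])
    macro = metrics.flat_f1_score(y_true, y_pred, average='macro', labels=["I", "O", "B"])
    print(round(micro, 2), round(macro, 2))
```

Missing the two rare labels drops micro f1 from 1.0 to 0.8, but macro f1 from 1.0 to roughly 0.3, so a run where the model handles the rare classes differently moves the macro score much more than the micro score.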
So my questions are:
- why are the differences sometimes this large, when the exact same data is used, with the same shuffle for hyperparameter optimisation?
- which random seeds need to be set to stabilise these results? (see the sketch after this list for what I have in mind)
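For the second question, this is the kind of seeding I could add; it is only a sketch, and the extra seeds below are my guesses rather than something the docs told me to set (I also have not checked whether the crfsuite trainer itself is stochastic):

```python
import random
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold

# candidate seeds (guesses, not verified)
random.seed(42)     # Python's own RNG, in case anything in the pipeline uses it
np.random.seed(42)  # NumPy's global RNG, used by parts of scikit-learn

# KFold only uses random_state when shuffle=True; with the default
# shuffle=False the folds are contiguous and the seed is ignored
cv = KFold(n_splits=nr_folds, shuffle=True, random_state=42)
grid_search = GridSearchCV(crf, hyperparam_search_space, scoring=scorer,
                           verbose=True, cv=cv)
```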
Thank you!