Description
Hi, I apologise if this is a stupid question. I am using CRFsuite for IOB labelling, and when I run the exact same experiment in 3 trials, the results are sometimes (not always) very different per run. In some instances, the standard deviation of the f1-scores across the three runs is over 5%.
For each run, I am using the exact same training and test set (which are completely separate). I do use cross-validation for hyperparameter optimisation, but I set the random_state there to avoid changes between runs. So basically, I do the following with identical data 3 times:
```python
from sklearn.model_selection import GridSearchCV, KFold
from sklearn_crfsuite import metrics

# crf, hyperparam_search_space, scorer, nr_folds and the data splits are defined elsewhere
grid_search = GridSearchCV(crf, hyperparam_search_space, scoring=scorer, verbose=True, cv=KFold(nr_folds, random_state=42))
grid_search.fit(x_train, y_train)
# evaluate the best estimator from the grid search on the held-out test set
optimised_crf = grid_search.best_estimator_
y_pred = optimised_crf.predict(x_test)
final_score = metrics.flat_f1_score(y_test, y_pred, average='macro', labels=["I", "O", "B"])
```
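For completeness, a "trial" is literally just repeating that snippet; roughly this (a sketch only, `run_experiment` is just a name I am using here to wrap the block above):

```python
import statistics
from sklearn.model_selection import GridSearchCV, KFold
from sklearn_crfsuite import metrics

def run_experiment():
    # exactly the snippet above: grid search on the training set, then
    # macro f1 of the best estimator on the fixed, held-out test set
    grid_search = GridSearchCV(crf, hyperparam_search_space, scoring=scorer,
                               verbose=True, cv=KFold(nr_folds, random_state=42))
    grid_search.fit(x_train, y_train)
    y_pred = grid_search.best_estimator_.predict(x_test)
    return metrics.flat_f1_score(y_test, y_pred, average='macro', labels=["I", "O", "B"])

# same data, same code, three times
scores = [run_experiment() for _ in range(3)]
print(scores, statistics.stdev(scores))  # the stdev is sometimes above 0.05
```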
To illustrate, these are the results from 3 identical runs on identical data:
Example 1:
f1 (micro): 83.2%, 81.6%, 66.2%
f1 (macro): 71.8%, 71.6%, 57.5%
Example 2:
f1 (micro): 81.1%, 77.6%, 66.7%
f1 (macro): 53.5%, 57.3%, 47.1%
The differences are not always this large (and when they are, it is usually because one of the runs has a much lower score). Micro f1 scores are also more stable than macro f1 scores (the data is imbalanced; sometimes only around 10% of the labels are I, for instance).
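To show what I mean about macro being more sensitive, here is a toy illustration with made-up, heavily skewed labels (not my data):

```python
from sklearn_crfsuite import metrics

# one toy sequence, heavily skewed towards "O" (made-up labels)
y_true   = [["O", "O", "O", "O", "O", "O", "O", "O", "B", "I"]]
y_pred_a = [["O", "O", "O", "O", "O", "O", "O", "O", "B", "I"]]  # rare labels correct
y_pred_b = [["O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]]  # rare labels missed

for y_pred in (y_pred_a, y_pred_b):
    micro = metrics.flat_f1_score(y_true, y_pred, average='micro', labels=["I", "O", "B"])
    macro = metrics.flat_f1_score(y_true, y_pred, average='macro', labels=["I", "O", "B"])
    print(round(micro, 2), round(macro, 2))
```

Missing the two rare labels drops micro f1 from 1.0 to 0.8, but macro f1 from 1.0 to roughly 0.3, so a run where the model handles the rare classes differently moves the macro score much more than the micro score.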
So my questions are:
- why are the differences sometimes this large, when the exact same data is used, with the same shuffle for hyperparameter optimisation?
- which random seeds need to be set to stabilise these results? (see the sketch after this list for what I have in mind)
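For the second question, this is the kind of seeding I could add; it is only a sketch, and the extra seeds below are my guesses rather than something the docs told me to set (I also have not checked whether the crfsuite trainer itself is stochastic):

```python
import random
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold

# candidate seeds (guesses, not verified)
random.seed(42)     # Python's own RNG, in case anything in the pipeline uses it
np.random.seed(42)  # NumPy's global RNG, used by parts of scikit-learn

# KFold only uses random_state when shuffle=True; with the default
# shuffle=False the folds are contiguous and the seed is ignored
cv = KFold(n_splits=nr_folds, shuffle=True, random_state=42)
grid_search = GridSearchCV(crf, hyperparam_search_space, scoring=scorer,
                           verbose=True, cv=cv)
```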
Thank you!