
Why are my results so different on identical runs? #118

Open
@AylaRT

Description

Hi, I apologise if this is a stupid question, but I am using CRFsuite for IOB labelling, and when I run the exact same experiment three times, the results are (sometimes, not always) very different per run. In some instances, the standard deviation of the f1-scores across the three runs is over 5%.

For each run, I am using the exact same training and test set (which are completely separate). I do use cross-validation for hyperparameter optimisation, but I set the random_state there to avoid changes between runs. So basically, I do the following with identical data three times:

from sklearn.model_selection import GridSearchCV, KFold
from sklearn_crfsuite import metrics

# Note: KFold's random_state only takes effect together with shuffle=True.
grid_search = GridSearchCV(crf, hyperparam_search_space, scoring=scorer, verbose=True,
                           cv=KFold(n_splits=nr_folds, shuffle=True, random_state=42))
grid_search.fit(x_train, y_train)
optimised_crf = grid_search.best_estimator_
y_pred = optimised_crf.predict(x_test)
final_score = metrics.flat_f1_score(y_test, y_pred, average='macro', labels=["I", "O", "B"])

To illustrate, these are results from three identical runs on identical data:
Example 1:
f1 (micro): 83.2%, 81.6%, 66.2%
f1 (macro): 71.8%, 71.6%, 57.5%

Example 2:
f1 (micro): 81.1%, 77.6%, 66.7%
f1 (macro): 53.5%, 57.3%, 47.1%

The differences are not always this large (and when they are, it is often because one of the runs has a much lower score). The micro f1 scores are also more stable than the macro f1 scores (the data is imbalanced; for instance, I labels sometimes make up only around 10% of the data).
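
One thing I have not checked yet, which might narrow this down, is whether the three runs even select the same hyperparameters. A minimal sketch of what I have in mind, using the standard GridSearchCV attributes:

# After grid_search.fit(...) in each run, log what the search actually selected,
# to see whether the search itself or the final refit on the full training set varies.
print(grid_search.best_params_)
print(grid_search.best_score_)
print(grid_search.cv_results_['mean_test_score'])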

So my questions are:

  • Why are the differences sometimes this large, when the exact same data is used, with the same shuffle for hyperparameter optimisation?
  • Which random seeds need to be set to stabilise these results? (A sketch of what I am setting, or could set, is below.)
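
For context, the KFold shuffle is the only randomness I am explicitly seeding at the moment; the global Python and NumPy seeds below are just an assumption on my part, in case anything upstream of the CRF (e.g. my feature extraction) draws from them:

import random
import numpy as np
from sklearn.model_selection import KFold

random.seed(42)     # assumption: pin the stdlib RNG, in case anything upstream uses it
np.random.seed(42)  # assumption: same for NumPy's global RNG

# The only seed I set explicitly so far; it only has an effect together with shuffle=True.
cv = KFold(n_splits=nr_folds, shuffle=True, random_state=42)  # nr_folds as in the snippet above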

thank you!
