Fix error to load data at the correct position when resuming from a checkpoint #2520

Open
wants to merge 1 commit into
base: master
from


PC91
Contributor

@PC91 PC91 commented Nov 19, 2023

This PR adds a mechanism to resume training from the saved position in each corpus. The idea is to keep a cursor for each corpus and save its current text line number (the batch variable cid_line_number) in the checkpoint file.

The following features are implemented:

  • Add a new parameter resume_from_corpora: when True, training tries to resume from the last saved text line of each corpus; otherwise, training resumes from the beginning of all corpora.
  • Update the calculation of cid_line_number to get the text line number directly from the exfile_open function.
  • Define the conditions for resuming training from the saved text lines:
    • The last text lines of all corpora must be saved in the checkpoint (checkpoints from existing versions do not have them, so this keeps backward compatibility).
    • All corpus names in the config and in the saved checkpoint must match.
    • Quick sanity check: for each corpus in the config, the saved text line cannot exceed the corpus's total number of lines.
  • Add communication between the trainer and the model saver to handle corpus cursors.
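
The resume conditions above can be sketched as a single check. This is only an illustration, not the code in this PR: the function name resolve_start_lines and its arguments (saved_lines, corpus_sizes) are hypothetical, and 0 is assumed to mean "start from the beginning of the corpus".

```python
from typing import Dict, Optional


def resolve_start_lines(
    resume_from_corpora: bool,
    saved_lines: Optional[Dict[str, int]],
    corpus_sizes: Dict[str, int],
) -> Dict[str, int]:
    """Return the line at which each corpus should start reading.

    Hypothetical helper: it only mirrors the conditions listed in the
    PR description and falls back to line 0 whenever one of them fails.
    """
    from_beginning = {name: 0 for name in corpus_sizes}

    # Option disabled: always restart from the beginning.
    if not resume_from_corpora:
        return from_beginning
    # Older checkpoints carry no saved lines (backward compatibility).
    if not saved_lines:
        return from_beginning
    # Corpus names in the config and in the checkpoint must match.
    if set(saved_lines) != set(corpus_sizes):
        return from_beginning
    # Sanity check: a saved line cannot exceed the corpus length.
    if any(saved_lines[name] > corpus_sizes[name] for name in corpus_sizes):
        return from_beginning
    return dict(saved_lines)
```

With this shape, any failed condition falls back to the beginning of all corpora rather than resuming only some of them, which matches the all-or-nothing behaviour described in the list.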

The following scenarios are tested:

  • Backward compatibility test: resume from the beginning when using a checkpoint from an existing version (with no saved text lines).
  • Resume from a checkpoint with saved text lines:
    • When resume_from_corpora=True
      • Some corpora in the config do not match (resume from the beginning).
      • Some saved text lines exceed the total number of text lines (resume from the beginning).
      • All checks pass (resume from the saved text lines).
    • When resume_from_corpora=False (resume from the beginning).

@PC91 PC91 force-pushed the datagen-from-checkpoint branch 2 times, most recently from cd4d637 to 3c4b7b7 Compare November 19, 2023 20:09
@vince62s
Member

This is doing the same thing as what is described here: #2006 (comment)
The issue is that if the checkpoint is at 250,000 steps and you want to continue, it takes far too long to iterate over those batches. This is why memorizing the index of each dataset and setting the cursor at that index is more efficient.
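
A minimal sketch of the cursor idea, assuming a plain UTF-8 text corpus and 0-based line numbers. The helper read_from_line is hypothetical and is not the exfile_open function used in the code base. It skips raw lines up to the saved index and yields each line together with its number, so the number can be stored in the next checkpoint.

```python
import itertools
from typing import Iterator, Tuple


def read_from_line(path: str, start_line: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, text) pairs, starting at start_line (0-based)."""
    with open(path, encoding="utf-8") as f:
        # Only raw lines are skipped here; no batches are rebuilt
        # for the part of the corpus that was already consumed.
        remaining = itertools.islice(f, start_line, None)
        for line_number, line in enumerate(remaining, start=start_line):
            yield line_number, line.rstrip("\n")
```

The file is still read sequentially up to start_line, but skipping text lines is much cheaper than regenerating every batch up to the checkpoint step.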

@PC91 PC91 marked this pull request as draft January 7, 2024 02:27
@PC91 PC91 force-pushed the datagen-from-checkpoint branch from 3c4b7b7 to 0a06542 Compare January 7, 2024 02:28
@PC91 PC91 force-pushed the datagen-from-checkpoint branch from 0a06542 to 874efcc Compare March 14, 2024 21:20
@PC91 PC91 force-pushed the datagen-from-checkpoint branch 18 times, most recently from d191392 to e2093e2 Compare March 31, 2024 20:25
@PC91 PC91 force-pushed the datagen-from-checkpoint branch from e2093e2 to b422cfc Compare March 31, 2024 20:34
@PC91 PC91 marked this pull request as ready for review March 31, 2024 20:44
@PC91
Contributor Author

PC91 commented Mar 31, 2024

> This is doing the same thing as what is described here: #2006 (comment) The issue is that if the checkpoint is at 250,000 steps and you want to continue, it takes far too long to iterate over those batches. This is why memorizing the index of each dataset and setting the cursor at that index is more efficient.

Thanks @vince62s! The code is updated. Could you have a look and merge it into the main code base?
