When training with a random validation subsample, indices of chunks to be used for training and validation are set up at the start of training.
When creating training batches, a random_start_position in the range [0, batch_size) is applied to a random selection from the shuffled list of training chunks. For each chunk, the records between coordinates (chunk_start + random offset) and (chunk_end + random offset) are therefore loaded.
Because of this offset, if a validation chunk immediately follows a training chunk, the validation data can be loaded and used in training, so the validation data are not truly held out as an independent sample.
Example:
Input file chunks
|---0---|---1---|---2---|---3---|---4---|---5---|---6---|---7---|---8---|---9---|...
Random validation chunks: [2, 5, 6]
                |---2---|               |---5---|---6---|
Training chunks:
|---0---|---1---|       |---3---|---4---|               |---7---|---8---|---9---|...
Example of a training epoch
Random offset: |..OFFSET..|
Randomly selected training chunks for batch: [0, 4, 8, 9,...]
Training records:
|---0---|---1---|---2---|---3---|---4---|---5---|---6---|---7---|---8---|---9---|...
|..OFFSET..|-------|            |..OFFSET..|-------|            |..OFFSET..|-----...
Here, data from validation chunks 2, 5, and 6 are included in training. The records loaded for training chunk 4 are in fact entirely validation data.
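To make the failure mode concrete, here is a minimal sketch of the same arithmetic, assuming illustrative values for chunk_size and batch_size and hypothetical helper names (records_loaded, owning_chunk); it is not the project's actual loader code.

```python
import random

# Assumed illustrative sizes; the real values come from the training configuration.
chunk_size = 8            # records per chunk
batch_size = 32           # records per batch; random_start_position is drawn from [0, batch_size)
n_chunks = 10
validation_chunks = {2, 5, 6}
training_chunks = [c for c in range(n_chunks) if c not in validation_chunks]

def records_loaded(chunk, offset):
    """Record coordinates read for `chunk` once the start offset is applied."""
    start = chunk * chunk_size + offset
    return range(start, start + chunk_size)

def owning_chunk(record):
    """Chunk that a record coordinate actually belongs to."""
    return record // chunk_size

offset = random.randrange(batch_size)    # random_start_position
for chunk in training_chunks:
    touched = {owning_chunk(r) for r in records_loaded(chunk, offset)}
    leaked = touched & validation_chunks
    if leaked:
        print(f"training chunk {chunk} with offset {offset} reads validation chunk(s) {sorted(leaked)}")
```

With chunk_size = 8 and an offset of, say, 12, training chunk 4 reads records 44-51, which lie entirely inside validation chunks 5 and 6, matching the diagram above.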
Possible solutions:
- Don't use a random_start_position offset.
  - Chunk order within a batch is already randomised, so this may be enough randomness for training patterns not to be too much of an issue.
  - Records could also be shuffled within a batch after the chunks have been assembled (though this might have a performance impact).
- Prune chunks that immediately precede validation chunks out of the training list (see the sketch after this list).
  - This probably requires reducing the range of random_start_position from batch_size to chunk_size; otherwise a huge proportion of the training data would be excluded.
  - Even then, while this prevents the run-on issue, it means removing roughly a validation-fraction's worth of additional training chunks (on top of the chunks used for validation), and it complicates the calculation of the training/validation fractional split.
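As a rough illustration of the pruning option above (hypothetical names, and assuming the offset range has already been reduced to chunk_size so an offset spans at most one chunk):

```python
def prune_training_chunks(training_chunks, validation_chunks):
    """Drop training chunks whose immediate successor is a validation chunk,
    so a start offset of at most one chunk cannot run into held-out data."""
    return [c for c in training_chunks if (c + 1) not in validation_chunks]

validation_chunks = {2, 5, 6}
training_chunks = [c for c in range(10) if c not in validation_chunks]
print(prune_training_chunks(training_chunks, validation_chunks))
# [0, 3, 7, 8, 9] -- chunks 1 and 4 are dropped because chunks 2 and 5 are validation
```

As noted above, this removes roughly one extra training chunk per validation chunk and has to be accounted for when computing the effective training/validation split.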
Issue confirmed. We propose splitting all chunks into contiguous [training chunks, buffer chunks, validation chunks] and applying random_start_position only to the training chunks. The buffer is larger than the maximum possible random_start_position, so no validation chunk can be involved in training. Will fix after merging #56.
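A minimal sketch of that split, with hypothetical parameter names, sized so the buffer always covers the largest possible random_start_position:

```python
import math

def split_chunks(n_chunks, validation_fraction, chunk_size, max_offset):
    """Partition chunk indices into contiguous training / buffer / validation regions.
    The buffer is at least ceil(max_offset / chunk_size) chunks wide, so an offset
    applied to any training chunk can never reach a validation chunk."""
    n_validation = int(round(n_chunks * validation_fraction))
    n_buffer = math.ceil(max_offset / chunk_size)
    n_training = n_chunks - n_buffer - n_validation
    training = list(range(n_training))
    buffer = list(range(n_training, n_training + n_buffer))
    validation = list(range(n_training + n_buffer, n_chunks))
    return training, buffer, validation

train, buf, val = split_chunks(n_chunks=100, validation_fraction=0.1,
                               chunk_size=8, max_offset=31)
print(len(train), len(buf), len(val))   # 86 4 10
```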