When training with a random validation subsample, indices of chunks to be used for training and validation are set up at the start of training.
When creating training batches, a random_start_position in the range [0, batch_size) is applied to a random selection from the shuffled list of training chunks. For each chunk, the records between coordinates (chunk_start + random offset) and (chunk_end + random offset) are therefore loaded.
Because of this offset, if a validation chunk immediately follows a training chunk, the validation data can be loaded and used in training, so the validation data are not truly held out as an independent sample.
Example:
Input file chunks
|---0---|---1---|---2---|---3---|---4---|---5---|---6---|---7---|---8---|---9---|...
Random validation chunks: [2, 5, 6]
                |---2---|               |---5---|---6---|
Training chunks:
|---0---|---1---|       |---3---|---4---|               |---7---|---8---|---9---|...
Example of a training epoch
Random offset: |..OFFSET..|
Randomly selected training chunks for batch: [0, 4, 8, 9,...]
Training records:
|---0---|---1---|---2---|---3---|---4---|---5---|---6---|---7---|---8---|---9---|...
|..OFFSET..|-------|            |..OFFSET..|-------|            |..OFFSET..|-----...
Here, data from validation chunks 2, 5, and 6 are included in training. The records loaded for training chunk 4 are in fact entirely validation data.
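To make the failure mode concrete, here is a minimal sketch of the same arithmetic, assuming illustrative values for chunk_size and batch_size and hypothetical helper names (records_loaded, owning_chunk); it is not the project's actual loader code.

```python
import random

# Assumed illustrative sizes; the real values come from the training configuration.
chunk_size = 8            # records per chunk
batch_size = 32           # records per batch; random_start_position is drawn from [0, batch_size)
n_chunks = 10
validation_chunks = {2, 5, 6}
training_chunks = [c for c in range(n_chunks) if c not in validation_chunks]

def records_loaded(chunk, offset):
    """Record coordinates read for `chunk` once the start offset is applied."""
    start = chunk * chunk_size + offset
    return range(start, start + chunk_size)

def owning_chunk(record):
    """Chunk that a record coordinate actually belongs to."""
    return record // chunk_size

offset = random.randrange(batch_size)    # random_start_position
for chunk in training_chunks:
    touched = {owning_chunk(r) for r in records_loaded(chunk, offset)}
    leaked = touched & validation_chunks
    if leaked:
        print(f"training chunk {chunk} with offset {offset} reads validation chunk(s) {sorted(leaked)}")
```

With chunk_size = 8 and an offset of, say, 12, training chunk 4 reads records 44-51, which lie entirely inside validation chunks 5 and 6, matching the diagram above.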
Possible solutions:
- Don't use a random_start_position offset.
  - Chunk order within a batch is already randomised, so this may be enough randomness for training patterns not to be too much of an issue.
  - Records could also be shuffled within a batch after the chunks have been assembled (though this might have a performance impact).
- Prune chunks that immediately precede validation chunks out of the training list (see the sketch after this list).
  - This probably requires reducing the range of random_start_position from batch_size to chunk_size; otherwise a huge proportion of the training data would be excluded.
  - Even then, while this prevents the run-on issue, it means removing roughly a validation-fraction's worth of additional training chunks (on top of the chunks used for validation), and it complicates the calculation of the training/validation fractional split.
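As a rough illustration of the pruning option above (hypothetical names, and assuming the offset range has already been reduced to chunk_size so an offset spans at most one chunk):

```python
def prune_training_chunks(training_chunks, validation_chunks):
    """Drop training chunks whose immediate successor is a validation chunk,
    so a start offset of at most one chunk cannot run into held-out data."""
    return [c for c in training_chunks if (c + 1) not in validation_chunks]

validation_chunks = {2, 5, 6}
training_chunks = [c for c in range(10) if c not in validation_chunks]
print(prune_training_chunks(training_chunks, validation_chunks))
# [0, 3, 7, 8, 9] -- chunks 1 and 4 are dropped because chunks 2 and 5 are validation
```

As noted above, this removes roughly one extra training chunk per validation chunk and has to be accounted for when computing the effective training/validation split.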
Issue confirmed. We propose splitting all chunks into contiguous [training chunks, buffer chunks, validation chunks] and applying random_start_position only to the training chunks. The buffer is larger than the maximum possible random_start_position, so no validation chunk can be involved in training. Will fix after merging #56.
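A minimal sketch of that split, with hypothetical parameter names, sized so the buffer always covers the largest possible random_start_position:

```python
import math

def split_chunks(n_chunks, validation_fraction, chunk_size, max_offset):
    """Partition chunk indices into contiguous training / buffer / validation regions.
    The buffer is at least ceil(max_offset / chunk_size) chunks wide, so an offset
    applied to any training chunk can never reach a validation chunk."""
    n_validation = int(round(n_chunks * validation_fraction))
    n_buffer = math.ceil(max_offset / chunk_size)
    n_training = n_chunks - n_buffer - n_validation
    training = list(range(n_training))
    buffer = list(range(n_training, n_training + n_buffer))
    validation = list(range(n_training + n_buffer, n_chunks))
    return training, buffer, validation

train, buf, val = split_chunks(n_chunks=100, validation_fraction=0.1,
                               chunk_size=8, max_offset=31)
print(len(train), len(buf), len(val))   # 86 4 10
```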