Rasa splitting algorithm does not give a precise number of training samples. #6582

**Description of Problem**: Some examples may be dropped by the Rasa splitting algorithm. The issue is described in this [forum thread](https://forum.rasa.com/t/rasa-split-data-nlu-fails-which-algorithm-is-implemented/29259/5).


**Overview of the Solution**: `rasa data split` does not produce the exact number of training samples implied by `training-fraction`.
Say we have X samples overall (x1 samples of label l1, x2 samples of label l2, …) and `training-fraction` is 0.8.
(Note: x1 + x2 + … = X.)
In the [code of Rasa](https://github.com/RasaHQ/rasa/blob/1.10.0/rasa/nlu/training_data/training_data.py#L457), the number of training samples is computed per label: A = int(0.8 * x1) + int(0.8 * x2) + …
Because each term is truncated separately, A ≤ int(0.8 * X).
So the number of missing samples is int(0.8 * X) - A.
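
To make the arithmetic concrete, here is a minimal sketch in plain Python. It does not call Rasa; the helper names and the label sizes are hypothetical and chosen only to illustrate the two formulas above.

```python
def per_label_train_count(label_counts, train_fraction):
    """A = int(f * x1) + int(f * x2) + ... (each label truncated separately)."""
    return sum(int(train_fraction * count) for count in label_counts)


def overall_train_count(label_counts, train_fraction):
    """int(f * X), where X is the total number of samples."""
    return int(train_fraction * sum(label_counts))


# Hypothetical label sizes x1, x2, x3 (assumption, not taken from a real dataset).
label_counts = [3, 3, 4]
train_fraction = 0.8

a = per_label_train_count(label_counts, train_fraction)
expected = overall_train_count(label_counts, train_fraction)
missing = expected - a

print(f"A = {a}, int(0.8 * X) = {expected}, missing = {missing}")
```

With these sizes, the per-label sum gives 7 training samples while int(0.8 * 10) is 8, so one sample is missing from the training set.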
