After merging PR #74, we observed the following abnormal learning curve:

The figure plots the training cost. Notice that the tail of the curve contains many spikes, each located exactly at the first batch of an epoch.
In addition, the phenomenon is hard to reproduce on a small dataset.
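
One way to check that the spikes really line up with epoch boundaries is to flag outlier batches in the logged cost and test their position within the epoch. The sketch below is only an illustration on synthetic data: the helper names `spike_batches` and `fraction_at_epoch_start`, the window size, the threshold ratio, and the batches-per-epoch value are all assumptions, not part of the training code.

```python
from statistics import median

def spike_batches(costs, window=20, ratio=1.5):
    """Return batch indices whose cost exceeds `ratio` times the
    median of the preceding `window` costs (hypothetical heuristic)."""
    spikes = []
    for i in range(window, len(costs)):
        baseline = median(costs[i - window:i])
        if costs[i] > ratio * baseline:
            spikes.append(i)
    return spikes

def fraction_at_epoch_start(spikes, batches_per_epoch):
    """Fraction of spikes that fall on the first batch of an epoch."""
    if not spikes:
        return 0.0
    at_start = sum(1 for i in spikes if i % batches_per_epoch == 0)
    return at_start / len(spikes)

# Synthetic cost log (assumed values): flat cost with a spike
# at the first batch of epochs 1, 2 and 3.
batches_per_epoch = 50
costs = [1.0] * 200
for i in range(batches_per_epoch, 200, batches_per_epoch):
    costs[i] = 3.0

spikes = spike_batches(costs)
print(spikes)
print(fraction_at_epoch_start(spikes, batches_per_epoch))
```

On a real cost log, a fraction close to 1.0 would support the observation that the spikes occur at the first batch of each epoch.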