[Speed] IO recordio format support

Currently, we preprocess data as Python ndarrays and copy the contents into LoDTensor, which is inefficient for training. While moving to the recordio format, we ran into a problem: the Go implementation of recordio pulls in the Go runtime, which takes over our thread pool with its goroutines. We think this is too heavy, since we only use recordio for reading.
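
To illustrate how small a read-only path can be without the Go runtime, here is a minimal C++ sketch of a sequential record reader. It assumes a simplified, hypothetical layout (a 4-byte little-endian length followed by the payload), which is not the actual recordio chunk format; the names `WriteRecord`, `ReadRecord`, and `ReadAll` are made up for this example.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical simplified layout, NOT the real recordio chunk format:
//   [uint32 payload length, little-endian][payload bytes]

// Appends one record to the stream.
void WriteRecord(std::ostream& out, const std::string& payload) {
  const uint32_t n = static_cast<uint32_t>(payload.size());
  const char header[4] = {
      static_cast<char>(n & 0xFF), static_cast<char>((n >> 8) & 0xFF),
      static_cast<char>((n >> 16) & 0xFF), static_cast<char>((n >> 24) & 0xFF)};
  out.write(header, 4);
  out.write(payload.data(), static_cast<std::streamsize>(payload.size()));
}

// Reads the next record into *payload.
// Returns false on end of stream or if the record is truncated.
bool ReadRecord(std::istream& in, std::string* payload) {
  unsigned char header[4];
  if (!in.read(reinterpret_cast<char*>(header), 4)) return false;
  const uint32_t n = static_cast<uint32_t>(header[0]) |
                     (static_cast<uint32_t>(header[1]) << 8) |
                     (static_cast<uint32_t>(header[2]) << 16) |
                     (static_cast<uint32_t>(header[3]) << 24);
  payload->resize(n);
  if (n > 0 && !in.read(&(*payload)[0], static_cast<std::streamsize>(n))) {
    return false;
  }
  return true;
}

// Reads every complete record from the stream, in order.
std::vector<std::string> ReadAll(std::istream& in) {
  std::vector<std::string> records;
  std::string payload;
  while (ReadRecord(in, &payload)) records.push_back(payload);
  return records;
}
```

A reader of this kind runs on whatever thread calls it, so it can be scheduled by the existing thread pool. A real implementation would also need to handle the chunk headers, checksums, and compression that the actual format defines.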