TensorFlow Model Training
Train TensorFlow models for image/video/feature classification or other tasks. Currently the repository is set to train on image classification by default.
Install TensorFlow and the related cuDNN libraries following the official TensorFlow documentation if the cuDNN libraries are not already set up.
Create a .env file with `cp .env.example .env` and fill in the following contents, making sure the CUDA install path is correct:
XLA_FLAGS="--xla_gpu_cuda_data_dir=/usr/local/cuda"
TF_XLA_FLAGS="--tf_xla_enable_xla_devices --tf_xla_auto_jit=2 --tf_xla_cpu_global_jit"
TF_CPP_MIN_LOG_LEVEL='3'
TF_FORCE_GPU_ALLOW_GROWTH="true"
OMP_NUM_THREADS="15"
KMP_BLOCKTIME="0"
KMP_SETTINGS="1"
KMP_AFFINITY="granularity=fine,verbose,compact,1,0"
CUDA_DEVICE_ORDER="PCI_BUS_ID"
CUDA_VISIBLE_DEVICES="0"
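To verify that these variables are picked up and that TensorFlow can see the GPU, a quick check along the following lines can be run. This is a minimal sketch, not part of the repository; it assumes the python-dotenv package is available to load .env, which may not be among the project's dependencies.

```python
# check_gpu.py - quick sanity check that TensorFlow sees the GPU
# Assumes `pip install python-dotenv` (hypothetical helper, not a project requirement).
from dotenv import load_dotenv

load_dotenv()  # export the variables from .env into the process environment

import tensorflow as tf  # import after loading .env so TF reads TF_* variables

print("TensorFlow version:", tf.__version__)
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
print("Built with CUDA:", tf.test.is_built_with_cuda())
```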
Set up docker to run with the NVIDIA Container Toolkit first. Create a checkpoints directory in the current project directory, then build and run the image:
bash scripts/build_docker.sh
bash scripts/run_docker.sh -p TF_BOARD_PORT
To set up a local environment without docker, install with Poetry:
poetry install --all-groups
Or use venv and pip:
# export pyproject.toml requirements to requirements.txt
python scripts/poetry_to_pip_requirements.py
python -m venv venv; source venv/bin/activate
pip install -r requirements.txt
Or use Conda:
conda create --name tf_gpu tensorflow-gpu python=3.12 -y
conda activate tf_gpu
while read requirement; do conda install --yes $requirement; done < requirements.txt
Note: Conda sets up cuda, cudnn and cudatoolkit automatically, downloading the non-python dependencies as well.
The data directory must be organized according to the following structure, with sub-directories named after the classes and containing the images. The CIFAR-10 dataset in JPG format can be acquired from https://github.com/YoongiKim/CIFAR-10-images for a sample train and test set.
i.e.
data
|_ src_dataset
|_ class_1
|_ img1
|_ img2
|_ ....
|_ class_2
|_ img1
|_ img2
|_ ....
...
Note: ImageNet-style ordering of data is also supported, i.e. images ordered under subdirectories inside the class directories.
i.e.
data
|_ src_dataset
|_ class_1
|_ 00d
|_ img1
|_ img2
|_ 01
|_ img1
|_ img2
|_ ...
|_ ...
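Before training, the layout and the per-class image counts can be sanity-checked with a small script like the one below. This is a sketch, not part of the repository; it handles both the flat and the ImageNet-style orderings shown above.

```python
# count_images_per_class.py - sketch for verifying the expected data layout
import os
import sys
from collections import Counter

IMG_EXTS = {".jpg", ".jpeg", ".png", ".bmp"}

def count_per_class(src_dataset: str) -> Counter:
    counts = Counter()
    for class_name in sorted(os.listdir(src_dataset)):
        class_dir = os.path.join(src_dataset, class_name)
        if not os.path.isdir(class_dir):
            continue
        # os.walk covers images directly under the class dir as well as
        # images nested one level deeper (ImageNet-style ordering)
        for _, _, files in os.walk(class_dir):
            counts[class_name] += sum(
                1 for f in files if os.path.splitext(f)[1].lower() in IMG_EXTS
            )
    return counts

if __name__ == "__main__":
    for cls, n in count_per_class(sys.argv[1]).items():
        print(f"{cls}: {n}")
```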
If the classes do not all have an equal number of training samples, the data can be balanced by duplication:
python data_preparation/duplicate_data.py --sd data/src_dataset --td data/duplicated_dataset -n NUM_TO_DUPLICATE
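Conceptually, duplication just copies existing images within a class until the class sizes are balanced. The following is a hypothetical sketch of the idea, not the repository script; the exact meaning of NUM_TO_DUPLICATE in duplicate_data.py may differ.

```python
# duplicate_sketch.py - naive class balancing by copying existing images
import os
import shutil

def duplicate_class(src_class_dir: str, dst_class_dir: str, num_to_duplicate: int) -> None:
    """Copy `num_to_duplicate` extra images into dst_class_dir, cycling over the originals."""
    os.makedirs(dst_class_dir, exist_ok=True)
    images = sorted(os.listdir(src_class_dir))
    for i in range(num_to_duplicate):
        name = images[i % len(images)]
        shutil.copy2(os.path.join(src_class_dir, name),
                     os.path.join(dst_class_dir, f"dup_{i}_{name}"))
```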
# find corrupt images (i.e. images that cannot be opened with tf.io.decode_image)
python data_preparation/find_corrupt_imgs.py --rd data/src_dataset
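The check is based on tf.io.decode_image; a stand-alone sketch of the same idea is shown below (hypothetical script, not the repository's implementation).

```python
# find_corrupt_images_sketch.py - flag images that tf.io.decode_image cannot open
import os
import sys

import tensorflow as tf

IMG_EXTS = {".jpg", ".jpeg", ".png", ".bmp"}

root_dir = sys.argv[1]  # e.g. data/src_dataset
for dirpath, _, files in os.walk(root_dir):
    for fname in files:
        if os.path.splitext(fname)[1].lower() not in IMG_EXTS:
            continue
        path = os.path.join(dirpath, fname)
        try:
            tf.io.decode_image(tf.io.read_file(path))
        except Exception:  # decode failures surface as tf.errors.InvalidArgumentError
            print("corrupt:", path)
```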
Set the validation and test splits as fractions (e.g. 0.1). Both splits are optional.
python data_preparation/create_train_val_test_split.py --sd data/duplicated_dataset --td data/split_dataset [--vs VAL_SPLIT] [--ts TEST_SPLIT]
# to check the number of images in train, val and test dirs
bash scripts/count_files_per_subdir.sh data/split_dataset
Note: The test split should not be converted into tfrecords and the original data->class_sub_directory format should be used.
# convert train files into train tfrecord, choose NUM_SAMPLES_PER_SHARDS so that each shard has a size of 100 MB+
python data_preparation/convert_imgs_to_tfrecord.py --sd data/split_dataset/train --td data/tfrecord_dataset/train [--cp CLASS_MAP_TXT_SAVEPATH] [--ns NUM_SAMPLES_PER_SHARDS]
# convert val files into val tfrecord, choose NUM_SAMPLES_PER_SHARDS so that each shard has a size of 100 MB+
python data_preparation/convert_imgs_to_tfrecord.py --sd data/split_dataset/val --td data/tfrecord_dataset/val [--cp CLASS_MAP_TXT_SAVEPATH] [--ns NUM_SAMPLES_PER_SHARDS]
# to use multiprocessing use the --mt flag
Note: the test dataset is not converted to tfrecord since fast loading is not a priority; we only run through the test data once.
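For reference, the core of an image-to-TFRecord conversion looks roughly like this. This is a simplified sketch of the general pattern, not the repository script; the feature keys, paths and sharding are illustrative.

```python
# tfrecord_sketch.py - write (image_bytes, label) pairs into a TFRecord shard
import tensorflow as tf

def serialize_example(image_bytes: bytes, label: int) -> bytes:
    feature = {
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

# write one shard; the real script splits data so each shard file is ~100 MB+
samples = [("data/split_dataset/train/class_1/img1.jpg", 0)]  # illustrative list
with tf.io.TFRecordWriter("data/tfrecord_dataset/train/shard-00000.tfrecord") as writer:
    for path, label in samples:
        writer.write(serialize_example(tf.io.read_file(path).numpy(), label))
```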
To extract frames from videos into npy.npz files, install opencv and pyav, then run:
python data_preparation/extract_frames_from_video_dataset.py --sd SOURCE_DATA_DIR
# use -h for help
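A bare-bones version of frame extraction with OpenCV and NumPy is sketched below, under the assumption that frames are stacked and saved with np.savez_compressed; the actual script and its file naming may differ.

```python
# extract_frames_sketch.py - dump the frames of one video into a .npz file
import cv2
import numpy as np

def video_to_npz(video_path: str, out_path: str) -> None:
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # OpenCV reads BGR
    cap.release()
    if not frames:
        raise ValueError(f"no frames decoded from {video_path}")
    np.savez_compressed(out_path, frames=np.stack(frames))

# illustrative paths only
video_to_npz("data/videos/class_1/clip1.mp4", "data/frames/class_1/clip1.npy.npz")
```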
Configure all values in the YAML files inside the config dir. A sample config file is provided for training on the src_dataset directory in config/train_image_clsf.yaml.
The model information repository is located at tensorflow_training/model/models_info.py. New models can be added or model parameters can be modified through this file.
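The exact schema of models_info.py is specific to this repository. Purely as a hypothetical illustration, a registry entry for a Keras application model could look something like the following; every name and field here is an assumption, not the actual file contents.

```python
# models_info.py (hypothetical excerpt) - map a model name to its constructor and defaults
import tensorflow as tf

MODELS_INFO = {
    "efficientnetb0": {
        "model_fn": tf.keras.applications.EfficientNetB0,          # backbone constructor
        "input_shape": (224, 224, 3),                               # expected input size
        "preprocess_fn": tf.keras.applications.efficientnet.preprocess_input,
    },
}
```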
Set the number of GPUs to use, TensorFlow flags, and other system environment variables in .env.
python train.py --cfg CONFIG_YAML_PATH [-r RESUME_CHECKPOINT_PATH]
Notes:
- Using the `-r` option while training will override the `resume_checkpoint` param in the config yaml if this param is not null.
- To add tensorflow logs to train/test logs, set the `"disable_existing_loggers"` parameter to `true` in `tensorflow_training/logging/logger_config.json`.
- Out of Memory errors during training could be caused by large batch sizes, model size, or the `dataset.cache()` call in train preprocessing in `tensorflow_training/pipelines/data_pipeline.py`.
- When using mixed_float16 precision, the data types of the final dense and activation layers must be set to `float32` (see the sketch after this list).
- An error like `ValueError: Unexpected result of train_function (Empty logs)` could be caused by incorrect paths to the train and validation directories in the config yaml files.
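As a minimal illustration of the mixed precision note above (a sketch with illustrative layers, not the repository's model code), the final dense and activation layers can be kept in float32 like this:

```python
# mixed_precision_sketch.py - keep the final layers in float32 under mixed_float16
import tensorflow as tf

tf.keras.mixed_precision.set_global_policy("mixed_float16")

inputs = tf.keras.Input(shape=(224, 224, 3))
x = tf.keras.layers.Conv2D(32, 3, activation="relu")(inputs)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
# the logits and the softmax run in float32 to avoid numerical issues
x = tf.keras.layers.Dense(10, dtype="float32")(x)
outputs = tf.keras.layers.Activation("softmax", dtype="float32")(x)
model = tf.keras.Model(inputs, outputs)
```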
tensorboard --logdir=checkpoints/tf_logs/ --port=PORT_NUM
Make sure to set the correct test_data_dir under data and the class_map_txt_path under tester in the yaml config file.
The class_map_txt_path file is generated by the convert_imgs_to_tfrecord.py script when converting images to tfrecord format.
python test.py --cfg CONFIG_YAML_PATH -r TEST_CHECKPOINT_PATH
We can use a dockerized uvicorn and fastapi webserver with triton-server to serve the model through an HTTPS API endpoint. Instructions are at tensorflow_training/server/README.md.