Efficient Image Data Loading and Preprocessing for Deep Learning
Loading and preprocessing image data efficiently is critical for training performant deep learning models. This project demonstrates how to load, preprocess, batch, and visualize image datasets using TensorFlow/Keras utilities, ensuring the pipeline is optimized for GPU training and scalable datasets.
The notebook is organized into key steps:
- Directory-based dataset loading – Use `tf.keras.utils.image_dataset_from_directory` to automatically label and split datasets.
- Exploring dataset properties – View shapes, class names, and sample counts.
- Data preprocessing – Resize, normalize, and prepare images for model ingestion.
- Performance optimization – Apply `cache()`, `shuffle()`, and `prefetch()` for efficient training throughput.
- Visualization – Display batches of images with their labels for inspection.
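The steps above can be sketched end to end. The snippet below writes a few dummy JPEGs to a temporary directory so it runs anywhere; the directory layout and the two class names (`cat`, `dog`) are illustrative stand-ins for a real dataset:

```python
import pathlib
import tempfile
import tensorflow as tf

# Build a tiny throwaway dataset: two classes, four images each.
root = pathlib.Path(tempfile.mkdtemp()) / "dataset"
for cls in ("cat", "dog"):
    (root / cls).mkdir(parents=True)
    for i in range(4):
        img = tf.image.encode_jpeg(
            tf.cast(tf.random.uniform((64, 64, 3), maxval=256), tf.uint8))
        tf.io.write_file(str(root / cls / f"img{i}.jpg"), img)

# Steps 1-2: load, label from subfolder names, resize on the fly.
ds = tf.keras.utils.image_dataset_from_directory(
    root, image_size=(180, 180), batch_size=4, shuffle=True, seed=42)
class_names = ds.class_names  # inferred from subfolder names
print(class_names)

# Step 3: normalize pixel values to [0, 1].
ds = ds.map(lambda x, y: (x / 255.0, y))

# Step 4: optimize throughput.
AUTOTUNE = tf.data.AUTOTUNE
ds = ds.cache().shuffle(100).prefetch(buffer_size=AUTOTUNE)

for images, labels in ds.take(1):
    print(images.shape, labels.shape)  # (4, 180, 180, 3) (4,)
```

Note that `class_names` must be read before chaining `map()` or other transformations, since the attribute lives on the dataset object returned by `image_dataset_from_directory`, not on derived datasets.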
Libraries used (from the code):
- TensorFlow / Keras – Dataset loading, preprocessing, and pipeline optimization.
- Matplotlib – Visualizing sample images and labels.
- NumPy – Basic numerical handling (if used).
Dataset: not provided explicitly – the notebook expects image data in a local directory structure, where subfolder names correspond to class labels.
Requirements:
pip install tensorflow matplotlib numpy
Run the notebook:
jupyter notebook image_data_loader.ipynb
or in JupyterLab:
jupyter lab image_data_loader.ipynb
Ensure the dataset is organized in a directory with subfolders for each class:
dataset/
├── class1/
│   ├── image1.jpg
│   └── image2.jpg
└── class2/
    ├── image3.jpg
    └── image4.jpg
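One detail worth knowing about this layout: `image_dataset_from_directory` assigns integer labels by sorting the subfolder names alphabetically, not by creation or listing order. A small self-contained check (folder names here are illustrative):

```python
import pathlib
import tempfile
import tensorflow as tf

# Create classes deliberately out of alphabetical order.
root = pathlib.Path(tempfile.mkdtemp()) / "dataset"
for cls in ("dogs", "cats", "birds"):
    (root / cls).mkdir(parents=True)
    img = tf.image.encode_jpeg(tf.zeros((8, 8, 3), tf.uint8))
    tf.io.write_file(str(root / cls / "img0.jpg"), img)

ds = tf.keras.utils.image_dataset_from_directory(
    root, image_size=(8, 8), batch_size=1)

# Labels follow sorted folder names: birds -> 0, cats -> 1, dogs -> 2.
print(ds.class_names)  # ['birds', 'cats', 'dogs']
```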
- Successfully loaded and labeled images directly from directory structure.
- Normalized and resized images to a consistent shape for model compatibility.
- Optimized pipeline with caching, shuffling, and prefetching to reduce training bottlenecks.
- Verified correct label assignment through visualization.
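Normalization can be done in two common ways: dividing inside the `tf.data` pipeline with `map()`, or using a `tf.keras.layers.Rescaling` layer so the scaling ships with the model and is applied automatically at inference. A minimal sketch with a random stand-in batch:

```python
import tensorflow as tf

# Stand-in batch of uint8 pixels in [0, 255].
images = tf.cast(
    tf.random.uniform((4, 180, 180, 3), maxval=256, dtype=tf.int32),
    tf.uint8)

# 1) Inside the tf.data pipeline via map():
scaled_a = tf.cast(images, tf.float32) / 255.0

# 2) As a Keras layer, typically the first layer of the model:
rescale = tf.keras.layers.Rescaling(1.0 / 255)
scaled_b = rescale(images)

print(float(tf.reduce_max(scaled_a)), float(tf.reduce_max(scaled_b)))
```

The layer-based approach is often preferable because the preprocessing cannot be accidentally omitted when the model is served.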
Example dataset info:
Image shape: (180, 180, 3)
Number of classes: 5
Class names: ['cat', 'dog', 'bird', 'fish', 'horse']
Visualization sample:
[Image of class 'cat'] [Image of class 'dog'] [Image of class 'bird'] ...
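A visualization grid like the one above can be produced with Matplotlib. This sketch substitutes a random batch for `dataset.take(1)` so it runs standalone; the class names are the ones listed above:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import tensorflow as tf

class_names = ["cat", "dog", "bird", "fish", "horse"]
# Stand-in batch; in the notebook, images/labels come from dataset.take(1).
images = tf.random.uniform((9, 180, 180, 3))
labels = tf.random.uniform((9,), maxval=len(class_names), dtype=tf.int32)

fig = plt.figure(figsize=(6, 6))
for i in range(9):
    ax = plt.subplot(3, 3, i + 1)
    ax.imshow(images[i].numpy())        # expects floats in [0, 1]
    ax.set_title(class_names[int(labels[i])])
    ax.axis("off")
fig.savefig("sample_batch.png")
```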
Prefetch optimization:
AUTOTUNE = tf.data.AUTOTUNE
dataset = (dataset
           .cache()                          # keep decoded images in memory after the first epoch
           .shuffle(1000)                    # shuffle within a 1000-element buffer
           .prefetch(buffer_size=AUTOTUNE))  # overlap preprocessing with training
- `image_dataset_from_directory` simplifies loading while handling labeling automatically.
- Proper caching and prefetching significantly improve GPU utilization.
- Visual inspection ensures the dataset is loaded correctly before training.
- A well-prepared data pipeline prevents downstream model performance issues.
💡 Some interactive outputs (e.g., plots, widgets) may not display correctly on GitHub. If so, please view this notebook via nbviewer.org for full rendering.
Mehran Asgari Email: imehranasgari@gmail.com GitHub: https://github.com/imehranasgari
This project is licensed under the Apache 2.0 License – see the LICENSE file for details.