If you would like a detailed explanation of this project, please refer to the Medium article below.
A complete end-to-end pipeline for building a clean, reliable deep-learning classifier.
This project implements a full deep-learning workflow for classifying animal images using TensorFlow + InceptionV3, with a major focus on dataset validation and cleaning. Before training the model, I built a comprehensive system to detect corrupted images, duplicates, brightness/contrast issues, mislabeled samples, and resolution outliers.
This repository contains the full pipeline—from dataset extraction to evaluation and model saving.
The project includes automated checks for:
- Corrupted or unreadable images
- Hash-based duplicate detection
- Duplicate filenames
- Misplaced or incorrectly labeled images
- File naming inconsistencies
- Extremely dark/bright images
- Very low-contrast (blank) images
- Outlier resolutions

Preprocessing:
- Resize to 256×256
- Normalization
- Light augmentation (probabilistic)
- Efficient `tf.data` pipeline with caching, shuffling, and prefetching

Model:
- Pretrained ImageNet weights
- Frozen base model
- Custom classification head (GAP → Dense → Dropout → Softmax)
- EarlyStopping + ModelCheckpoint + ReduceLROnPlateau callbacks

Data split:
- 80% training
- 10% validation
- 10% test

Results:
- High stability due to dataset cleaning

The dataset is stored as a ZIP file on Google Drive. After mounting the drive, it is extracted and indexed into a Pandas DataFrame:

    import zipfile
    from google.colab import drive  # available only inside Google Colab

    drive.mount('/content/drive')
    zip_path = '/content/drive/MyDrive/Animals.zip'
    extract_to = '/content/my_data'
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_to)

Each image entry records:
- Class
- Filename
- Full path
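The indexing step can be sketched as below. This is a minimal illustration, not the project's actual code: it assumes a `root/<class>/<file>` folder layout, and the `index_images` helper, the extension list, and the `class` and `filename` keys are hypothetical names (only `full_path` appears in the project's snippets).

```python
import os

# Assumption: which extensions count as images. Adjust to the dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".gif"}


def index_images(root):
    """Return one record per image found under root/<class>/<file>."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if os.path.splitext(name)[1].lower() not in IMAGE_EXTENSIONS:
                continue  # skip non-image files such as notes or metadata
            records.append({
                "class": os.path.basename(dirpath),  # parent folder = label
                "filename": name,
                "full_path": os.path.join(dirpath, name),
            })
    return records
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(...)` to obtain one row per image.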
Before any training, I analyzed:
- Class distribution
- Image dimensions
- Grayscale vs RGB
- Unique sizes
- Folder structures
Example class-count visualization:

    plt.figure(figsize=(32, 16))
    class_count.plot(kind='bar')

This revealed class imbalance and inconsistent image sizes early.
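Besides the bar chart, imbalance can be summarized as a single number. The sketch below is an illustration only: `class_imbalance` is a hypothetical helper, and the largest-to-smallest ratio is one possible measure, not necessarily the one used in this project.

```python
from collections import Counter


def class_imbalance(labels):
    """Return per-class counts and the ratio of largest to smallest class."""
    counts = Counter(labels)
    largest = max(counts.values())
    smallest = min(counts.values())
    return counts, largest / smallest
```

A ratio close to 1 means the classes are roughly balanced. A large ratio suggests looking at class weights or resampling before training.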
Random images were displayed with their brightness, contrast, and shape to manually confirm dataset quality.
This step prevents hidden issues—especially in community-created or scraped datasets.
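The brightness and contrast figures shown next to each sample can be computed with Pillow's `ImageStat`. This is a sketch under assumptions: brightness is taken as the grayscale mean, contrast as the grayscale standard deviation, and the three thresholds are illustrative values rather than the ones used in this project.

```python
from PIL import Image, ImageStat

# Assumed thresholds on a 0-255 grayscale scale; tune them per dataset.
DARK_MAX = 30
BRIGHT_MIN = 225
LOW_CONTRAST_MAX = 10


def brightness_contrast(img):
    """Return (mean, standard deviation) of the grayscale image."""
    stat = ImageStat.Stat(img.convert("L"))
    return stat.mean[0], stat.stddev[0]


def quality_flags(img):
    """Return a list of quality problems detected in a PIL image."""
    mean, std = brightness_contrast(img)
    flags = []
    if mean < DARK_MAX:
        flags.append("too_dark")
    if mean > BRIGHT_MIN:
        flags.append("too_bright")
    if std < LOW_CONTRAST_MAX:
        flags.append("low_contrast")
    return flags
```

Calling `quality_flags(Image.open(path))` on each file gives a quick list of candidates to inspect by hand.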
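The system checks for the following.

Duplicate images, detected by hashing file contents:

    def get_hash(path):
        with open(path, 'rb') as f:
            return hashlib.md5(f.read()).hexdigest()

    df['file_hash'] = df['full_path'].apply(get_hash)
    duplicate_hashes = df[df.duplicated('file_hash', keep=False)]

Corrupted or unreadable images:

    # Run for each file_path in the dataset
    try:
        with Image.open(file_path) as img:
            img.verify()
    except Exception:
        corrupted_files.append(file_path)

Brightness and contrast issues, using PIL’s `ImageStat` to detect very dark or very bright samples.

Misplaced images, by reading the class folder from each path:

    folder = os.path.basename(os.path.dirname(row["full_path"]))

This catches mislabeled entries where the folder name ≠ the recorded class.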
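The duplicate and folder checks can also be written without Pandas. The sketch below is a hedged variant, not the project's code: it reads files in chunks so large images are not loaded whole, and the names `file_md5`, `duplicate_groups`, and `folder_mismatches`, as well as the `class` record key, are hypothetical.

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=1 << 20):
    """MD5 of a file's contents, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def duplicate_groups(paths):
    """Group paths whose contents are byte-for-byte identical."""
    by_hash = defaultdict(list)
    for path in paths:
        by_hash[file_md5(path)].append(path)
    return [sorted(group) for group in by_hash.values() if len(group) > 1]


def folder_mismatches(records):
    """Return records whose parent folder name differs from their class."""
    return [
        r for r in records
        if os.path.basename(os.path.dirname(r["full_path"])) != r["class"]
    ]
```

Hashing file contents only finds exact copies. Resized or re-encoded versions of the same picture produce different hashes and need a perceptual comparison instead.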
Custom preprocessing:
- Resize → Normalize
- Optional augmentation
- Efficient `tf.data` batching

Core of the preprocessing function:

    def preprocess_image(path, target_size=(256, 256), augment=True):
        img = tf.image.decode_image(...)
        img = tf.image.resize(img, target_size)
        img = img / 255.0

Split structure:
| Split | Percent |
|---|---|
| Train | 80% |
| Validation | 10% |
| Test | 10% |
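One way to produce the 80/10/10 split in the table is to shuffle once with a fixed seed and slice. This is a minimal sketch: `split_80_10_10` and its `seed` parameter are hypothetical, and the split is a plain random one, not stratified by class, which may differ from what the project does.

```python
import random


def split_80_10_10(items, seed=42):
    """Shuffle items reproducibly and slice them into train/val/test."""
    items = list(items)
    random.Random(seed).shuffle(items)  # local RNG, global state untouched
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test split
    return train, val, test
```

Splitting after duplicate removal matters: an exact copy that lands in both train and test would inflate the test accuracy.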

Base model:

    from tensorflow.keras.applications import InceptionV3

    base_model = InceptionV3(weights='imagenet', include_top=False)
    for layer in base_model.layers:
        layer.trainable = False

Classification head:
- GlobalAveragePooling2D
- Dense(512, ReLU)
- Dropout(0.5)
- Softmax
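The frozen base and the head listed above can be assembled with the Keras functional API. This is a sketch, not the project's exact code: `add_classification_head` is a hypothetical helper, and the number of classes is left as a parameter because the README does not state it.

```python
import tensorflow as tf


def add_classification_head(base_model, num_classes,
                            dense_units=512, dropout_rate=0.5):
    """Freeze base_model and stack GAP -> Dense -> Dropout -> Softmax on it."""
    # Freeze the pretrained layers so only the new head is trained.
    for layer in base_model.layers:
        layer.trainable = False
    x = tf.keras.layers.GlobalAveragePooling2D()(base_model.output)
    x = tf.keras.layers.Dense(dense_units, activation="relu")(x)
    x = tf.keras.layers.Dropout(dropout_rate)(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs=base_model.input, outputs=outputs)
```

With the `base_model` from the snippet above, `model = add_classification_head(base_model, num_classes)` returns a model ready for `model.compile(...)`, where `num_classes` is the number of animal classes in the dataset.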

Compile:

    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )

Train:

    history = model.fit(
        train_ds,
        validation_data=val_ds,
        epochs=10,
        callbacks=[...]
    )

Callbacks used:
- EarlyStopping
- ModelCheckpoint
- ReduceLROnPlateau
The model converged quickly thanks to the cleaned dataset.
Evaluate on the test split and save the model:

    model.evaluate(test_ds)
    model.save("Simple_CNN_Classification.h5")

The biggest lesson from this project:
A strong deep-learning model starts with a clean dataset.
Cleaning the data took more time than training the model—but it directly improved accuracy, stability, and model reliability.
If you're building your own image classification project, always verify:
- Dataset quality
- Brightness/contrast issues
- Duplicate removal
- Class consistency
- Resolution outliers
Clean data makes everything else easier.
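Of the checks listed above, resolution outliers are the easiest to automate. The README does not say how they were detected, so the sketch below is an assumption: it compares pixel areas against Tukey's IQR fences, and `resolution_outliers` and the factor `k` are hypothetical names.

```python
import statistics


def resolution_outliers(sizes, k=1.5):
    """Return indices of (width, height) pairs whose pixel area is unusual.

    An area counts as an outlier when it lies more than k * IQR below the
    first quartile or above the third quartile of all areas.
    """
    areas = [w * h for w, h in sizes]
    q1, _median, q3 = statistics.quantiles(areas, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, area in enumerate(areas) if area < low or area > high]
```

When most images share one resolution the IQR collapses to zero, so every image of a different size is flagged. That is usually the desired behaviour for a dataset that is supposed to be uniform.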

Planned next steps:
- Advanced augmentation (CutMix, MixUp)
- Fine-tuning InceptionV3’s deeper layers on another dataset
- Converting the model to TensorFlow Lite
- Deploying the model on my own website