Course: CSE445 – Machine Learning | Section: 06 | Group: 04
Instructor: Professor M. Shifat-E-Rabbi (MSRb)
Institution: North South University, Dhaka, Bangladesh
This project implements a complete supervised machine learning pipeline for single-image super-resolution (SISR). Low-resolution (LR) images are generated from high-resolution (HR) originals using bicubic downsampling and Gaussian blur. An ESPCN (Efficient Sub-Pixel Convolutional Neural Network) model is trained to reconstruct the HR image from each LR input at a ×4 upscale factor, and its results are compared against bicubic interpolation using PSNR and visual analysis.
- Overview
- Architecture
- Dataset
- Training
- Results
- Project Structure
- Requirements
- Usage
- Limitations & Future Work
- Group Members
- References
Traditional upscaling methods like bicubic interpolation are fast but cannot recover lost high-frequency details — they apply a fixed mathematical rule with no understanding of image content. ESPCN addresses this by learning filters directly from paired HR-LR image data.
The key design principle of ESPCN is that all convolution operations happen in LR space, making the model computationally efficient. Upscaling occurs only at the very end through a sub-pixel convolution (PixelShuffle) layer that rearranges feature channels into a larger spatial output:
LR Input (64×64)
→ Conv(5×5) + ReLU
→ Conv(3×3) + ReLU
→ Conv(3×3) + ReLU
→ Conv(3×3) [48 channels for RGB ×4]
→ PixelShuffle(×4)
→ SR Output (256×256)
The ESPCN model has 74,128 trainable parameters across 4 convolutional layers. Weights are initialized using Kaiming Normal initialization.
| Layer | In Channels | Out Channels | Kernel | Activation |
|---|---|---|---|---|
| Conv1 | 3 | 64 | 5×5 | ReLU |
| Conv2 | 64 | 64 | 3×3 | ReLU |
| Conv3 | 64 | 32 | 3×3 | ReLU |
| Conv4 | 32 | 3 × (4²) = 48 | 3×3 | — |
| PixelShuffle (×4) | 48 | 3 (×4 spatial size) | — | — |
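The parameter count can be checked by hand from the layer table above. The sketch below is only arithmetic; it assumes every convolution carries a bias term, which is the setting under which the table's layers add up to the stated 74,128.

```python
# (in_channels, out_channels, kernel_size) for each conv layer in the table
LAYERS = [
    (3, 64, 5),   # Conv1
    (64, 64, 3),  # Conv2
    (64, 32, 3),  # Conv3
    (32, 48, 3),  # Conv4: 3 * 4**2 = 48 output channels
]


def conv_params(c_in, c_out, k):
    """Weights (c_in * c_out * k * k) plus one bias per output channel (assumed)."""
    return c_in * c_out * k * k + c_out


per_layer = [conv_params(*layer) for layer in LAYERS]
total = sum(per_layer)
print(per_layer)  # [4864, 36928, 18464, 13872]
print(total)      # 74128
```

Conv2 alone accounts for about half of the parameters, and PixelShuffle adds none because it only rearranges values.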
For RGB ×4 upscaling, the final layer must produce exactly 48 channels. PixelShuffle then rearranges them into the final 3-channel HR image. Output pixels are clamped to [0, 1].
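A minimal PyTorch sketch of the architecture described above may help make the channel arithmetic concrete. It is not the notebook's code: the class layout, the "same" padding values, and the zero bias initialization are assumptions chosen so that a 64×64 input yields a 256×256 output.

```python
import torch
import torch.nn as nn


class ESPCN(nn.Module):
    """Sketch of the 4-layer ESPCN from the table (padding values are assumed)."""

    def __init__(self, scale=4, channels=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            # 3 * 4**2 = 48 channels, all still at LR resolution
            nn.Conv2d(32, channels * scale ** 2, kernel_size=3, padding=1),
        )
        self.shuffle = nn.PixelShuffle(scale)
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
                nn.init.zeros_(m.bias)  # assumption: biases start at zero

    def forward(self, x):
        # Rearrange 48 LR channels into 3 HR channels, then clamp to [0, 1]
        return self.shuffle(self.body(x)).clamp(0.0, 1.0)


model = ESPCN()
lr = torch.rand(1, 3, 64, 64)
with torch.no_grad():
    sr = model(lr)
print(tuple(sr.shape))  # (1, 3, 256, 256)
```

Because every convolution runs on the 64×64 grid, the cost per layer is 16 times lower than it would be at 256×256.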
The dataset was built from a collection of internet images containing diverse objects, textures, colors, and structures. Through standardization, augmentation, and patch-based extraction, the final training pipeline used 149,490 paired HR-LR samples.
| Split | Samples | Percentage |
|---|---|---|
| Train | 119,592 | 80% |
| Validation | 14,949 | 10% |
| Test | 14,949 | 10% |
| Total | 149,490 | 100% |
The split uses a fixed random seed (seed=42) for full reproducibility.
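One way such a seeded 80/10/10 split could be produced is sketched below. The helper name `split_indices` and the use of Python's `random` module are assumptions for illustration; the notebook may use a different mechanism (for example a PyTorch utility), in which case the index order would differ while the split sizes stay the same.

```python
import random


def split_indices(n_samples, seed=42):
    """Shuffle sample indices with a fixed seed and cut them 80/10/10."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = n_samples * 8 // 10   # integer arithmetic avoids float rounding
    n_val = n_samples // 10
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


train_idx, val_idx, test_idx = split_indices(149_490)
print(len(train_idx), len(val_idx), len(test_idx))  # 119592 14949 14949
```

Calling the function again with the same seed returns the same three index lists, which is what makes the split repeatable.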
Each HR image (256×256) is degraded into an LR version (64×64) by:
- Bicubic downsampling by ×4 (LANCZOS for the HR resize, BICUBIC for the LR resize)
- Gaussian blur (radius = 0.5) to simulate realistic sensor degradation
The following augmentations are then applied to each HR-LR pair:
- Random horizontal flip (p = 0.5)
- Random vertical flip (p = 0.5)
- Random 90° rotations (k ∈ {0, 1, 2, 3})
- Random patch cropping: a 32×32 LR patch paired with the matching 128×128 HR patch
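The flips, rotations, and patch cropping listed above only work if the same transform is applied to both images of a pair, with crop coordinates scaled by 4 on the HR side. The NumPy sketch below illustrates that idea. The function name `paired_augment` and the order of operations are assumptions, and the block-average LR image is a toy stand-in used only to check alignment, not the project's bicubic degradation.

```python
import numpy as np


def paired_augment(lr, hr, rng, lr_patch=32, scale=4):
    """Apply one random crop/flip/rotation to an aligned LR-HR pair.

    lr has shape (h, w, C); hr has shape (h * scale, w * scale, C).
    """
    h, w = lr.shape[:2]
    # Pick the crop in LR coordinates, then scale it for the HR image
    y = int(rng.integers(0, h - lr_patch + 1))
    x = int(rng.integers(0, w - lr_patch + 1))
    lr = lr[y:y + lr_patch, x:x + lr_patch]
    hr = hr[y * scale:(y + lr_patch) * scale, x * scale:(x + lr_patch) * scale]
    if rng.random() < 0.5:               # horizontal flip
        lr, hr = lr[:, ::-1], hr[:, ::-1]
    if rng.random() < 0.5:               # vertical flip
        lr, hr = lr[::-1], hr[::-1]
    k = int(rng.integers(0, 4))          # number of 90-degree rotations
    lr, hr = np.rot90(lr, k), np.rot90(hr, k)
    return np.ascontiguousarray(lr), np.ascontiguousarray(hr)


rng = np.random.default_rng(0)
hr = rng.random((256, 256, 3))
# Toy LR image: 4x4 block average, used here only to verify pair alignment
lr = hr.reshape(64, 4, 64, 4, 3).mean(axis=(1, 3))
lr_p, hr_p = paired_augment(lr, hr, rng)
print(lr_p.shape, hr_p.shape)  # (32, 32, 3) (128, 128, 3)
```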
Patch extraction was critical because the original image collection was small compared to professional SR datasets. It allows the CNN to learn local structures such as grass blades, fabric lines, petal edges, and building corners.
| Parameter | Value |
|---|---|
| Epochs | 100 |
| Batch size | 16 |
| Loss function | L1 Loss (MAE) |
| Optimizer | Adam |
| Initial learning rate | 1e-3 |
| LR scheduler | CosineAnnealingLR (T_max=100, η_min=1e-5) |
| Mixed precision | torch.amp.autocast + GradScaler |
| GPU | NVIDIA GeForce RTX 5060 Ti (17.1 GB VRAM) |
| Time per epoch | ~92 seconds |
| Total training time | ~2.6 hours |
L1 Loss was chosen over MSE because it typically preserves sharper image details. CosineAnnealingLR gradually reduces the learning rate from 1e-3 to 1e-5 over the 100 epochs, enabling large updates early and fine refinements near the end. Mixed precision (AMP) reduces VRAM usage, and GradScaler scales the loss to keep half-precision gradients from underflowing, which helps preserve training stability.
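For reference, the cosine schedule with the settings from the table follows the closed form η_t = η_min + ½(η_max − η_min)(1 + cos(π·t / T_max)). The sketch below evaluates it in plain Python; the function name `cosine_lr` is hypothetical and the code does not call PyTorch's scheduler.

```python
import math


def cosine_lr(epoch, lr_max=1e-3, lr_min=1e-5, t_max=100):
    """Closed-form cosine annealing value after `epoch` scheduler steps."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / t_max))


for epoch in (0, 25, 50, 75, 100):
    print(epoch, f"{cosine_lr(epoch):.2e}")
```

The rate starts at 1e-3, passes the midpoint 5.05e-4 at epoch 50, and reaches 1e-5 at epoch 100.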
The best model checkpoint is saved automatically to best_espcn_x4.pth whenever validation PSNR improves.
1. Load paired LR and HR images
2. Apply augmentation and patch extraction
3. Split data 80% / 10% / 10% (seed=42)
4. Initialize ESPCN with Kaiming Normal weights
5. For each epoch:
a. Pass LR batches through the model
b. Compute L1 loss against HR targets
c. Backpropagate with AMP + Adam update
d. Apply CosineAnnealingLR step
e. Compute validation PSNR
f. Save checkpoint if val PSNR improves
6. Generate final comparison images on test samples
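Step 5 of the procedure above can be sketched as a short loop. Several things here are stand-ins and not the project's code: the one-layer model, the synthetic batch that replaces the DataLoaders, validation on that same batch, and the 5-epoch run. AMP is left out so the sketch also runs on CPU.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in model (hypothetical, far smaller than ESPCN) so the sketch runs quickly
model = nn.Sequential(
    nn.Conv2d(3, 48, kernel_size=3, padding=1),
    nn.PixelShuffle(4),
).to(device)

# Synthetic paired batch in place of the real LR/HR loaders
hr_batch = torch.rand(16, 3, 128, 128, device=device)
lr_batch = nn.functional.avg_pool2d(hr_batch, 4)

epochs = 5  # the project trains for 100
criterion = nn.L1Loss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs, eta_min=1e-5
)


def psnr(pred, target):
    """PSNR in dB for tensors scaled to [0, 1]."""
    mse = torch.mean((pred - target) ** 2)
    return (10 * torch.log10(1.0 / mse)).item()


best_psnr = float("-inf")
best_state = None
history = []
for epoch in range(1, epochs + 1):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(lr_batch), hr_batch)   # steps a and b
    loss.backward()                               # step c (without AMP)
    optimizer.step()
    scheduler.step()                              # step d

    model.eval()
    with torch.no_grad():                         # step e
        val_psnr = psnr(model(lr_batch).clamp(0, 1), hr_batch)
    history.append((loss.item(), val_psnr))

    if val_psnr > best_psnr:                      # step f
        best_psnr = val_psnr
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
        # torch.save(best_state, "best_espcn_x4.pth")
```

Keeping only the best-scoring weights means a late dip in validation PSNR does not overwrite the checkpoint.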
The plot below shows L1 training loss (left) and validation PSNR (right) across all 100 epochs. Loss drops sharply in the first 20 epochs then continues to decrease steadily. Validation PSNR climbs from ~33 dB to a peak of 37.74 dB at epoch 99.
| Epoch | Val PSNR | Note |
|---|---|---|
| 1 | 33.07 dB | |
| 10 | 36.04 dB | |
| 25 | 36.95 dB | |
| 50 | 37.21 dB | |
| 75 | 37.57 dB | |
| 77 | 37.62 dB | ✅ |
| 96 | 37.70 dB | ✅ |
| 99 | 37.74 dB | ✅ Best |
| 100 | 37.71 dB | |
Each panel shows: LR Input (upscaled for view) | Bicubic | ESPCN | HR Ground Truth
ESPCN recovers the circular band structures of the tower and sharpens the cherry blossom branches that are completely blurred in both the LR input and bicubic output.
ESPCN reconstructs individual fruit boundaries and leaf edges from the heavily blurred LR input, closely matching the HR ground truth.
The strongest PSNR gains appear on this grass sample. ESPCN recovers individual blade structures and the fine dark gaps between them, demonstrating that CNN filters excel at learning repeated local textures.
The model was also tested on custom images that were not part of training or validation.
ESPCN produces visually sharper petal edges and more defined petal separation than bicubic. However, PSNR is lower because the model generates high-frequency edge detail that doesn't align exactly at pixel level with the HR reference. The over-sharpened petal outlines are a known artifact of PSNR-trained CNNs.
Text regions and precise architectural lines are the hardest case for ESPCN. While structural textures of the brick facade are somewhat improved, exact character recovery in the signage is limited because those fine pixel patterns are largely lost in the 64×64 LR input. This highlights the known limitation of super-resolution on text and high-precision geometric content.
| Image | Type | Bicubic PSNR | ESPCN PSNR | Gain |
|---|---|---|---|---|
| 137207 | Grass texture | 26.96 dB | 32.67 dB | +5.71 dB |
| 117675 | Tower + blossoms | 24.20 dB | 29.95 dB | +5.74 dB |
| 134722 | Fruit texture | 21.74 dB | 26.66 dB | +4.92 dB |
| flower-729510_1280 | Daisy flower | 30.55 dB | 25.91 dB | −4.64 dB |
| NSU Building | Text + architecture | 30.23 dB | 22.26 dB | −7.97 dB |
ESPCN outperforms bicubic on the texture-rich samples with repeated local patterns (grass, fruit, tower rings), with gains of about 4.9 to 5.7 dB. On the two custom images (the daisy, and the NSU building with text and thin architectural lines), PSNR drops below bicubic. This does not mean total failure: the model still sharpens structural textures, and the PSNR gap reflects pixel-level misalignment rather than visual degradation. Both PSNR and side-by-side visual inspection are needed for a fair evaluation.
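Since the comparison above rests on PSNR, a minimal sketch of the metric may be useful. It assumes images stored as float arrays in [0, 1], so the peak value is 1.0; the notebook's own helper may differ in detail.

```python
import numpy as np


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10 * np.log10(max_val ** 2 / mse))


target = np.zeros((8, 8, 3))
pred = np.full((8, 8, 3), 0.1)       # constant error of 0.1, so MSE = 0.01
print(round(psnr(pred, target), 2))  # 20.0
```

Because the scale is logarithmic, a gain of +5.71 dB corresponds to a mean squared error roughly 3.7 times lower (10^0.571 ≈ 3.72). PSNR compares pixels at identical positions, so a sharp edge displaced by a single pixel can be penalized more than a blurred one.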
CSE445_Sec6_Machine_Learning_Project/
├── data/ # Subfolder containing datasets and images
│ ├── HR_256/ # High-resolution images (256×256)
│ ├── LR_x4/ # Low-resolution images (64×64)
│ └── Selected Image for project/ # Custom test images and results
├── others/ # Subfolder for reports, presentations, and video
│ ├── CSE445_Final_Report_Group04.pdf # Final report PDF
│ ├── presentation.pdf # Final presentation PPTX/PDF
│ ├── update_report.pdf # Update report PDF
│ ├── update_presentation.pptx # Update presentation PPTX
│ └── project_demo.mp4 # One-minute video file showing demo run
├── support/ # Subfolder containing other code/support files
│ ├── output_Image/ # All output figures used in README
│ │ ├── training_history.png # L1 loss & validation PSNR curves
│ │ ├── 117675_COMPARE.png # Tower comparison (train sample)
│ │ ├── 134722_COMPARE.png # Fruit texture comparison (train sample)
│ │ ├── 137207_COMPARE.png # Grass texture comparison (train sample)
│ │ ├── flower-729510_1280_COMPARE.png # Daisy flower comparison (test)
│ │ └── 6529fb046efd17541e77b547_COMPARE.png # NSU Building comparison (test)
│ ├── Other test Model/ # Other test Model(SRCNN etc)
│ ├── Standardize_images.ipynb # Image preprocessing & extreme augmentation pipeline
│ └── best_espcn_x4.pth # Best model checkpoint (saved by val PSNR)
├── main.ipynb # Main training & inference notebook
├── README.md # Project explanation
└── requirements.txt # List of required tools and libraries
Python 3.12.3
torch
torchvision
Pillow
numpy
matplotlib
jupyter
Install dependencies:
pip install torch torchvision pillow numpy matplotlib jupyter
Run Standardize_images.ipynb to standardize your HR images to 256×256 and generate the paired LR dataset in LR_x4/.
Open main.ipynb and run all cells sequentially:
Cell 1 — Imports & device setup (detects CUDA automatically)
Cell 2 — Auto-detect dataset size & build 80/10/10 split
Cell 3 — Dataset classes (SRDataset + SRImageDataset)
Cell 4 — DataLoaders (batch=16 train, batch=1 val/test)
Cell 5 — ESPCN model definition (74,128 params)
Cell 6 — Helper functions (PSNR, bicubic baseline)
Cell 7 — Training loop (100 epochs, L1, Adam, AMP)
The best checkpoint is saved automatically to best_espcn_x4.pth.
Place any .png / .jpg images into ~/sr_project/Test/, then run the final inference cell in main.ipynb. For each image the pipeline will:
- Resize to a multiple of 4 (LANCZOS)
- Downsample ×4 with bicubic + Gaussian blur (radius 0.5) → LR
- Run ESPCN on the LR image
- Compute PSNR for both ESPCN and bicubic vs HR reference
- Save a 4-panel comparison: LR Input | Bicubic | ESPCN | HR Reference
Example output:
Original size : 640×480
LR size : 160×120 → saved as image_LR.png
ESPCN output : 640×480 → saved as image_ESPCN.png
Bicubic PSNR : 28.41 dB
ESPCN PSNR : 33.12 dB (gain: +4.71 dB)
Comparison : saved as image_COMPARE.png
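The preprocessing part of the inference pipeline above (resize to a multiple of 4, downsample, blur, bicubic baseline) could look roughly like the Pillow sketch below. The function name `make_hr_lr_pair` is hypothetical. Rounding the size down and blurring after the downsample are assumptions, since the steps above do not fix either detail.

```python
from PIL import Image, ImageFilter


def make_hr_lr_pair(img, scale=4, blur_radius=0.5):
    """Return (HR reference, degraded LR input, bicubic baseline) for one image."""
    w, h = img.size
    # Assumption: round each side down to the nearest multiple of `scale`
    hr_w, hr_h = (w // scale) * scale, (h // scale) * scale
    hr = img.convert("RGB").resize((hr_w, hr_h), Image.LANCZOS)
    # Assumption: bicubic downsample first, then a light Gaussian blur
    lr = hr.resize((hr_w // scale, hr_h // scale), Image.BICUBIC)
    lr = lr.filter(ImageFilter.GaussianBlur(radius=blur_radius))
    # Baseline: upscale the LR image back with plain bicubic interpolation
    bicubic = lr.resize((hr_w, hr_h), Image.BICUBIC)
    return hr, lr, bicubic


img = Image.new("RGB", (642, 481), color=(120, 60, 200))
hr, lr, bicubic = make_hr_lr_pair(img)
print(hr.size, lr.size, bicubic.size)  # (640, 480) (160, 120) (640, 480)
```

Making the HR size divisible by 4 guarantees that the ×4 model output has exactly the same dimensions as the HR reference, which PSNR requires.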
Current limitations:
- Original image diversity is limited compared to large professional SR datasets (e.g., DIV2K, VOC2012)
- PSNR does not always reflect perceived visual quality — sharper ESPCN outputs can score lower than bicubic if reconstructed edges don't align exactly at pixel level
- The model may introduce over-sharpened artifacts in regions where high-frequency detail is not recoverable from the LR input
- Designed specifically for ×4 upscaling; other scale factors require retraining
Future improvements:
- Train on a larger and more diverse dataset
- Add SSIM as an additional perceptual evaluation metric
- Compare with SRCNN, FSRCNN, EDSR, and ESRGAN
- Experiment with perceptual / adversarial loss for more natural textures
- Build a simple web UI for uploading LR images and receiving SR output automatically
| Student ID | Name | Email |
|---|---|---|
| 2211950642 | MD. Rokib Hasan Oli | rokib.oli@northsouth.edu |
| 1831906642 | Kazi Eraj Al Minahi Turjo | kazi.turjo@northsouth.edu |
| 1620018042 | Md. Sifur Rahman | sifur.rahman@northsouth.edu |
Department of ECE, North South University, Dhaka, Bangladesh





