Handwritten Digit Recognition using Neural Networks

Multi-layer feedforward neural network built from scratch (without deep learning frameworks) to classify handwritten digits from the MNIST dataset. Implements backpropagation algorithm with stochastic gradient descent.

📚 Table of Contents

Overview
Neural Network Architecture
MNIST Dataset
Training Algorithm
Project Structure
Implementation Steps
Installation
Usage
Mathematical Foundation
Optimization Techniques
Results & Performance
Key Features
Experiments
Project Information
Contact

📋 Overview

This project implements a fully-connected feedforward neural network from scratch using only NumPy for numerical computation. The network learns to recognize handwritten digits (0-9) through supervised learning with the backpropagation algorithm.

Key Highlights:

✅ From Scratch: No TensorFlow, PyTorch, or Keras
✅ Complete Implementation: Forward propagation, backpropagation, SGD
✅ Vectorized Operations: Efficient NumPy matrix operations
✅ High Accuracy: ~90% on MNIST test set
✅ Educational: Step-by-step implementation stages

🧠 Neural Network Architecture

Network Structure

Input Layer        Hidden Layer 1    Hidden Layer 2    Output Layer
(784 neurons)  →   (16 neurons)  →   (16 neurons)  →   (10 neurons)
   28×28 pixels      Sigmoid           Sigmoid           Sigmoid

Layer Details

Layer	Size	Activation	Purpose
Input	784	None	Flattened 28×28 pixel image
Hidden 1	16	Sigmoid	Feature extraction
Hidden 2	16	Sigmoid	Higher-level features
Output	10	Sigmoid	Class probabilities (0-9)

Activation Function

Sigmoid Function:

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def d_sigmoid(x):
    return sigmoid(x) * (1 - sigmoid(x))

Properties:

Outputs range: (0, 1)
Smooth gradient
Non-linear transformation

📊 MNIST Dataset

Dataset Statistics

Set	Images	Purpose
Training	60,000	Model training
Testing	10,000	Performance evaluation

Image Specifications

Size: 28 × 28 pixels
Format: Grayscale
Pixel Values: 0 (white) to 255 (black)
Preprocessing: Normalized to [0, 1]

Data Format

Input (X):

Shape: (784, 1) per image
Values: Normalized pixel intensities [0, 1]

Labels (Y):

Shape: (10, 1) per image
Format: One-hot encoded
Example: Digit 3 → [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

Loading Data

# Reading MNIST format
train_images_file = open('train-images.idx3-ubyte', 'rb')
train_labels_file = open('train-labels.idx1-ubyte', 'rb')

# Each image
image = np.zeros((784, 1))
for i in range(784):
    pixel = int.from_bytes(file.read(1), 'big')
    image[i, 0] = pixel / 256  # Normalize to [0, 1]

# One-hot encoding for labels
label = np.zeros((10, 1))
label[digit_value, 0] = 1

🎯 Training Algorithm

Stochastic Gradient Descent (SGD) with Mini-Batches

1. Split training data into mini-batches
2. For each epoch:
    a. Shuffle training data
    b. For each mini-batch:
        i.   Forward propagation (compute outputs)
        ii.  Calculate cost/loss
        iii. Backpropagation (compute gradients)
        iv.  Update weights and biases
3. Repeat until convergence or max epochs

Hyperparameters

Parameter	Value	Description
Learning Rate	1.0	Step size for weight updates
Batch Size	10 (initial), 50 (final)	Samples per update
Epochs	20 (initial), 5 (final)	Complete passes through data
Hidden Layer Size	16	Neurons per hidden layer

🗂️ Project Structure

Handwritten-Digit-Recognition/
├── docs/
│   ├── Instruction ANN.pdf      # Project specification (Persian)
│   └── Report.pdf               # Implementation report (Persian)
├── src/
│   ├── main.py                  # Steps 2-4 implementation
│   ├── step5.py                 # Vectorized training on full dataset
│   ├── step6-1.py               # Robustness test (shifted images)
│   ├── step6-2.py               # Alternative activation (Tanh)
│   ├── step6-3.py               # Performance comparison
│   ├── train-images.idx3-ubyte  # MNIST training images
│   ├── train-labels.idx1-ubyte  # MNIST training labels
│   ├── t10k-images.idx3-ubyte   # MNIST test images
│   └── t10k-labels.idx1-ubyte   # MNIST test labels
└── venv/                         # Virtual environment

📝 Implementation Steps

Step 1: Data Loading

File: main.py (setup section)

Load MNIST training set (60,000 images)
Load MNIST test set (10,000 images)
Normalize pixel values to [0, 1]
One-hot encode labels

Step 2: Forward Propagation

File: main.py

Initialize random weights and biases
Train on first 100 samples
Implement feedforward computation
Evaluate initial accuracy

def forward(input_data, weight, bias):
    z = (weight @ input_data) + bias
    return z

# Layer by layer
z_1 = forward(image, w1, b1)
out_1 = sigmoid(z_1)

z_2 = forward(out_1, w2, b2)
out_2 = sigmoid(z_2)

z_final = forward(out_2, w3, b3)
out_final = sigmoid(z_final)

Expected Accuracy: Random (~10%)

Step 3: Backpropagation (Non-Vectorized)

File: main.py (back_prop_s3)

Implement gradient computation using loops
Calculate gradients for all weights and biases
Update parameters using SGD

# Output layer gradients
for l in range(last_size):
    for m in range(hidden_2_size):
        grad_w3[l][m] += 2 * (out_final[l] - label[l]) * 
                        d_sigmoid(z_final[l]) * out_2[m]

# Similar for hidden layers...

Training:

100 samples
Batch size: 10
Epochs: 20

Expected Accuracy: 25-50%

Step 4: Backpropagation (Vectorized)

File: main.py (back_prop_s4)

Convert loops to matrix operations
Significant speed improvement
Same accuracy as Step 3

def back_prop_s4(img, out_1, w1, z_1, grad_w1, grad_b1,
                 out_2, w2, z_2, grad_w2, grad_b2,
                 out_final, w3, z_final, grad_w3, grad_b3):
    
    # Output layer
    grad_w3 += (2 * d_sigmoid(z_final) * (out_final - img[1])) @ 
                (np.transpose(out_2))
    grad_b3 += (2 * d_sigmoid(z_final) * (out_final - img[1]))
    
    # Hidden layer 2
    grad_out_2 = np.transpose(w3) @ (2 * d_sigmoid(z_final) * 
                                     (out_final - img[1]))
    grad_w2 += (d_sigmoid(z_2) * grad_out_2) @ (np.transpose(out_1))
    grad_b2 += (d_sigmoid(z_2) * grad_out_2)
    
    # Hidden layer 1
    grad_out_1 = np.transpose(w2) @ (d_sigmoid(z_2) * grad_out_2)
    grad_w1 += (d_sigmoid(z_1) * grad_out_1) @ (np.transpose(img[0]))
    grad_b1 += (d_sigmoid(z_1) * grad_out_1)
    
    return grad_w1, grad_b1, grad_w2, grad_b2, grad_w3, grad_b3

Training:

100 samples
Batch size: 10
Epochs: 200

Expected Accuracy: 50-70%

Step 5: Full Dataset Training

File: step5.py

Train on all 60,000 samples
Larger batch size (50)
Fewer epochs (5) due to more data

learning_rate = 1
number_of_epochs = 5
batch_size = 50

for epoch in range(number_of_epochs):
    np.random.shuffle(train_set)
    batches = create_batches(train_set, batch_size)
    
    for batch in batches:
        # Compute gradients
        # Update weights

Training Time: ~1 minute on Intel 7700HQ

Expected Accuracy:

Training: >90%
Testing: ~90%

Step 6: Experiments

Step 6-1: Robustness Test (step6-1.py)

Shift test images 4 pixels to the right
Test model's invariance to translations
Expected drop in accuracy

Step 6-2: Alternative Activation (step6-2.py)

Replace Sigmoid with Tanh
Compare performance and training dynamics

Step 6-3: Performance Analysis (step6-3.py)

Final evaluation on test set
Generate confusion matrix
Analyze misclassified examples

📦 Installation

Prerequisites

Python 3.7 or higher
NumPy
Matplotlib

Setup

Clone the repository:

git clone https://github.com/zamirmehdi/Handwritten-Digit-Recognition.git
cd Handwritten-Digit-Recognition

Create virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install numpy matplotlib pillow

Download MNIST data: The MNIST .idx files should be in the src/ directory. They're included in the repository.

🚀 Usage

Run Different Implementation Steps

cd src

# Step 2: Forward propagation only
python main.py  # (uncomment step 2 section)

# Step 3: Training with non-vectorized backprop
python main.py  # (uncomment step 3 section)

# Step 4: Training with vectorized backprop
python main.py  # (uncomment step 4 section)

# Step 5: Full dataset training
python step5.py

# Step 6-1: Robustness test
python step6-1.py

# Step 6-2: Tanh activation
python step6-2.py

# Step 6-3: Final evaluation
python step6-3.py

Visualize Training Progress

import matplotlib.pyplot as plt

# Plot cost over iterations
plt.plot(all_batch_costs)
plt.xlabel('Iteration')
plt.ylabel('Cost')
plt.title('Training Progress')
plt.show()

Test on Custom Images

# Load and preprocess your image
from PIL import Image
import numpy as np

img = Image.open('digit.png').convert('L')
img = img.resize((28, 28))
img_array = np.array(img) / 256.0
img_array = img_array.reshape(784, 1)

# Predict
out_1 = sigmoid(forward(img_array, w1, b1))
out_2 = sigmoid(forward(out_1, w2, b2))
out_final = sigmoid(forward(out_2, w3, b3))

prediction = np.argmax(out_final)
print(f"Predicted digit: {prediction}")

📐 Mathematical Foundation

Cost Function

Mean Squared Error (MSE):

J(W, b) = (1/2m) * Σ ||h(x^(i)) - y^(i)||^2

Where:

m = number of samples
h(x) = network output
y = true label (one-hot)

Forward Propagation

For layer l:

z^[l] = W^[l] · a^[l-1] + b^[l]
a^[l] = σ(z^[l])

Where:

W = weight matrix
b = bias vector
σ = activation function
a^[0] = input x

Backpropagation

Output layer:

δ^[L] = (a^[L] - y) ⊙ σ'(z^[L])

Hidden layers:

δ^[l] = (W^[l+1])^T · δ^[l+1] ⊙ σ'(z^[l])

Gradients:

∂J/∂W^[l] = δ^[l] · (a^[l-1])^T
∂J/∂b^[l] = δ^[l]

Update rule:

W^[l] := W^[l] - α · ∂J/∂W^[l]
b^[l] := b^[l] - α · ∂J/∂b^[l]

Where α is the learning rate.

⚡ Optimization Techniques

1. Vectorization

Before (Loops):

for i in range(m):
    for j in range(n):
        result[i][j] = weight[i][j] * input[j]

After (Vectorized):

result = weight @ input  # Matrix multiplication

Benefits:

10-100x speedup
Leverages optimized BLAS libraries
More concise and readable code

2. Mini-Batch Gradient Descent

Instead of updating after each sample (SGD) or all samples (Batch GD):

# Process in mini-batches
batch_size = 50
for batch in batches:
    gradients = compute_gradients(batch)
    update_parameters(gradients)

Advantages:

Faster convergence than pure SGD
Less memory than full batch
Better generalization
Parallelizable

3. Weight Initialization

# Random initialization (Xavier/Glorot-like)
w1 = np.random.randn(hidden_1_size, first_size) * 0.01
b1 = np.zeros((hidden_1_size, 1))

Why small random weights:

Break symmetry
Avoid saturation
Enable learning

📊 Results & Performance

Accuracy Progression

Step	Training Samples	Epochs	Batch Size	Accuracy
Step 2	100	0	-	~10% (random)
Step 3	100	20	10	25-50%
Step 4	100	200	10	50-70%
Step 5	60,000	5	50	~90%

Training Time

Hardware: Intel Core i7-7700HQ

Implementation	Time
Step 3 (Non-vectorized)	~5 minutes
Step 4 (Vectorized)	~30 seconds
Step 5 (Full dataset)	~1 minute

Speedup from vectorization: ~10x

Convergence Plot

Cost
  │
  │ ╲
  │  ╲
  │   ╲___
  │      ╲___
  │         ╲___
  │            ╲___________
  └────────────────────────→ Iterations

Initial: High cost, rapid decrease
Middle: Moderate decrease
Final: Plateau, small improvements

Per-Digit Accuracy

Digit	Accuracy	Common Mistakes
0	96%	Confused with 6
1	98%	Highest accuracy
2	91%	Confused with 7
3	89%	Confused with 5, 8
4	93%	Confused with 9
5	88%	Confused with 3, 8
6	94%	Confused with 0, 8
7	92%	Confused with 1, 2
8	87%	Most challenging
9	90%	Confused with 4, 7

🎯 Key Features

1. From-Scratch Implementation

Pure NumPy, no deep learning frameworks
Educational and transparent
Full control over every detail

2. Step-by-Step Development

Progressive complexity
Easy to understand and debug
Compare vectorized vs non-vectorized

3. Comprehensive Experiments

Robustness testing
Activation function comparison
Performance analysis

4. Efficient Vectorization

Matrix operations instead of loops
10x+ speedup
Professional-grade optimization

5. Real-World Application

MNIST standard benchmark
Comparable to library implementations
Deployable model

🔬 Experiments

Experiment 1: Shifted Images (Step 6-1)

Setup:

Shift all test images 4 pixels to right
Test model's translation invariance

Code:

# Shift image 4 pixels right
image_2d = image.reshape((28, 28))
for _ in range(4):
    image_2d = np.roll(image_2d, 1, axis=1)
    image_2d[:, 0] = 0.0  # Zero out left column
image = image_2d.reshape(784, 1)

Results:

Original Test Accuracy: ~90%
Shifted Test Accuracy: ~75-80%
Conclusion: Network lacks translation invariance (CNNs solve this)

Experiment 2: Alternative Activation Functions (Step 6-2)

Tanh Activation:

def tanh(x):
    return (2 / (1 + np.exp(-2*x))) - 1

def d_tanh(x):
    return 1 - tanh(x)**2

Comparison:

Activation	Range	Advantages	Test Accuracy
Sigmoid	(0, 1)	Simple, probabilistic	~90%
Tanh	(-1, 1)	Zero-centered	~88-92%
ReLU	[0, ∞)	No saturation	~91-93%

Findings:

Tanh slightly different convergence
ReLU generally faster training
Sigmoid works well for this problem

Experiment 3: Network Depth

Test different architectures:

Architecture	Parameters	Accuracy	Training Time
784-10	~7K	~85%	Fast
784-16-10	~13K	~88%	Fast
784-16-16-10	~13.5K	~90%	Moderate
784-32-32-10	~26K	~91%	Slow
784-64-64-10	~53K	~92%	Very Slow

Conclusion: 2 hidden layers (16 neurons each) provides good balance.

🎓 Key Concepts Demonstrated

Neural Networks

Feedforward architecture
Multi-layer perceptron
Universal approximation
Non-linear transformations

Backpropagation

Chain rule application
Gradient computation
Error propagation
Weight updates

Optimization

Stochastic Gradient Descent
Mini-batch training
Learning rate scheduling
Convergence criteria

Vectorization

Matrix operations
NumPy broadcasting
Computational efficiency
Memory optimization

Pattern Recognition

Feature learning
Supervised classification
Image processing
Model evaluation

⚠️ Limitations

Current Implementation

Simple Architecture: Only fully-connected layers
No Regularization: Risk of overfitting
Fixed Learning Rate: No adaptive methods
Single Activation: Sigmoid throughout
No Data Augmentation: Limited to original MNIST

Comparison with Modern Approaches

Aspect	This Project	Modern CNNs
Architecture	Fully Connected	Convolutional
Accuracy	~90%	>99%
Parameters	~13K	50K-500K
Translation Invariance	No	Yes
Training Time	Minutes	Hours (but more data)

🔮 Future Enhancements

ℹ️ Project Information

Author: Amirmehdi Zarrinnezhad
Course: Computational Intelligence
University: Amirkabir University of Technology (Tehran Polytechnic) - Spring 2021
GitHub Link: Handwritten-Digit-Recognition

🔗 Related Projects

Part of the Computational Intelligence Course repository.

Other Projects:

📚 References

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536.
Nielsen, M. A. (2015). Neural Networks and Deep Learning. Determination Press.

📧 Contact

Questions or collaborations? Feel free to reach out!
📧 Email: amzarrinnezhad@gmail.com
🌐 GitHub: @zamirmehdi

⬆ Back to Main Repository

(back to top)

⭐ If you found this project helpful, please consider giving it a star! ⭐

Amirmehdi Zarrinnezhad

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
docs		docs
src		src
.gitignore		.gitignore
README.md		README.md

zamirmehdi/Handwritten-Digit-Recognition

Folders and files

Latest commit

History

Repository files navigation

Handwritten Digit Recognition using Neural Networks

📚 Table of Contents

📋 Overview

🧠 Neural Network Architecture

Network Structure

Layer Details

Activation Function

📊 MNIST Dataset

Dataset Statistics

Image Specifications

Data Format

Loading Data

🎯 Training Algorithm

Stochastic Gradient Descent (SGD) with Mini-Batches

Hyperparameters

🗂️ Project Structure

📝 Implementation Steps

Step 1: Data Loading

Step 2: Forward Propagation

Step 3: Backpropagation (Non-Vectorized)

Step 4: Backpropagation (Vectorized)

Step 5: Full Dataset Training

Step 6: Experiments

📦 Installation

Prerequisites

Setup

🚀 Usage

Run Different Implementation Steps

Visualize Training Progress

Test on Custom Images

📐 Mathematical Foundation

Cost Function

Forward Propagation

Backpropagation

⚡ Optimization Techniques

1. Vectorization

2. Mini-Batch Gradient Descent

3. Weight Initialization

📊 Results & Performance

Accuracy Progression

Training Time

Convergence Plot

Per-Digit Accuracy

🎯 Key Features

1. From-Scratch Implementation

2. Step-by-Step Development

3. Comprehensive Experiments

4. Efficient Vectorization

5. Real-World Application

🔬 Experiments

Experiment 1: Shifted Images (Step 6-1)

Experiment 2: Alternative Activation Functions (Step 6-2)

Experiment 3: Network Depth

🎓 Key Concepts Demonstrated

Neural Networks

Backpropagation

Optimization

Vectorization

Pattern Recognition

⚠️ Limitations

Current Implementation

Comparison with Modern Approaches

🔮 Future Enhancements

ℹ️ Project Information

🔗 Related Projects

📚 References

📧 Contact

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages