Multi-layer feedforward neural network built from scratch (without deep learning frameworks) to classify handwritten digits from the MNIST dataset. Implements backpropagation algorithm with stochastic gradient descent.
This project implements a fully-connected feedforward neural network from scratch using only NumPy for numerical computation. The network learns to recognize handwritten digits (0-9) through supervised learning with the backpropagation algorithm.
Key Highlights:
- โ From Scratch: No TensorFlow, PyTorch, or Keras
- โ Complete Implementation: Forward propagation, backpropagation, SGD
- โ Vectorized Operations: Efficient NumPy matrix operations
- โ High Accuracy: ~90% on MNIST test set
- โ Educational: Step-by-step implementation stages
Input Layer Hidden Layer 1 Hidden Layer 2 Output Layer
(784 neurons) โ (16 neurons) โ (16 neurons) โ (10 neurons)
28ร28 pixels Sigmoid Sigmoid Sigmoid
| Layer | Size | Activation | Purpose |
|---|---|---|---|
| Input | 784 | None | Flattened 28ร28 pixel image |
| Hidden 1 | 16 | Sigmoid | Feature extraction |
| Hidden 2 | 16 | Sigmoid | Higher-level features |
| Output | 10 | Sigmoid | Class probabilities (0-9) |
Sigmoid Function:
def sigmoid(x):
return 1 / (1 + np.exp(-x))
def d_sigmoid(x):
return sigmoid(x) * (1 - sigmoid(x))Properties:
- Outputs range: (0, 1)
- Smooth gradient
- Non-linear transformation
| Set | Images | Purpose |
|---|---|---|
| Training | 60,000 | Model training |
| Testing | 10,000 | Performance evaluation |
- Size: 28 ร 28 pixels
- Format: Grayscale
- Pixel Values: 0 (white) to 255 (black)
- Preprocessing: Normalized to [0, 1]
Input (X):
- Shape: (784, 1) per image
- Values: Normalized pixel intensities [0, 1]
Labels (Y):
- Shape: (10, 1) per image
- Format: One-hot encoded
- Example: Digit 3 โ [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
# Reading MNIST format
train_images_file = open('train-images.idx3-ubyte', 'rb')
train_labels_file = open('train-labels.idx1-ubyte', 'rb')
# Each image
image = np.zeros((784, 1))
for i in range(784):
pixel = int.from_bytes(file.read(1), 'big')
image[i, 0] = pixel / 256 # Normalize to [0, 1]
# One-hot encoding for labels
label = np.zeros((10, 1))
label[digit_value, 0] = 11. Split training data into mini-batches
2. For each epoch:
a. Shuffle training data
b. For each mini-batch:
i. Forward propagation (compute outputs)
ii. Calculate cost/loss
iii. Backpropagation (compute gradients)
iv. Update weights and biases
3. Repeat until convergence or max epochs
| Parameter | Value | Description |
|---|---|---|
| Learning Rate | 1.0 | Step size for weight updates |
| Batch Size | 10 (initial), 50 (final) | Samples per update |
| Epochs | 20 (initial), 5 (final) | Complete passes through data |
| Hidden Layer Size | 16 | Neurons per hidden layer |
Handwritten-Digit-Recognition/
โโโ docs/
โ โโโ Instruction ANN.pdf # Project specification (Persian)
โ โโโ Report.pdf # Implementation report (Persian)
โโโ src/
โ โโโ main.py # Steps 2-4 implementation
โ โโโ step5.py # Vectorized training on full dataset
โ โโโ step6-1.py # Robustness test (shifted images)
โ โโโ step6-2.py # Alternative activation (Tanh)
โ โโโ step6-3.py # Performance comparison
โ โโโ train-images.idx3-ubyte # MNIST training images
โ โโโ train-labels.idx1-ubyte # MNIST training labels
โ โโโ t10k-images.idx3-ubyte # MNIST test images
โ โโโ t10k-labels.idx1-ubyte # MNIST test labels
โโโ venv/ # Virtual environment
File: main.py (setup section)
- Load MNIST training set (60,000 images)
- Load MNIST test set (10,000 images)
- Normalize pixel values to [0, 1]
- One-hot encode labels
File: main.py
- Initialize random weights and biases
- Train on first 100 samples
- Implement feedforward computation
- Evaluate initial accuracy
def forward(input_data, weight, bias):
z = (weight @ input_data) + bias
return z
# Layer by layer
z_1 = forward(image, w1, b1)
out_1 = sigmoid(z_1)
z_2 = forward(out_1, w2, b2)
out_2 = sigmoid(z_2)
z_final = forward(out_2, w3, b3)
out_final = sigmoid(z_final)Expected Accuracy: Random (~10%)
File: main.py (back_prop_s3)
- Implement gradient computation using loops
- Calculate gradients for all weights and biases
- Update parameters using SGD
# Output layer gradients
for l in range(last_size):
for m in range(hidden_2_size):
grad_w3[l][m] += 2 * (out_final[l] - label[l]) *
d_sigmoid(z_final[l]) * out_2[m]
# Similar for hidden layers...Training:
- 100 samples
- Batch size: 10
- Epochs: 20
Expected Accuracy: 25-50%
File: main.py (back_prop_s4)
- Convert loops to matrix operations
- Significant speed improvement
- Same accuracy as Step 3
def back_prop_s4(img, out_1, w1, z_1, grad_w1, grad_b1,
out_2, w2, z_2, grad_w2, grad_b2,
out_final, w3, z_final, grad_w3, grad_b3):
# Output layer
grad_w3 += (2 * d_sigmoid(z_final) * (out_final - img[1])) @
(np.transpose(out_2))
grad_b3 += (2 * d_sigmoid(z_final) * (out_final - img[1]))
# Hidden layer 2
grad_out_2 = np.transpose(w3) @ (2 * d_sigmoid(z_final) *
(out_final - img[1]))
grad_w2 += (d_sigmoid(z_2) * grad_out_2) @ (np.transpose(out_1))
grad_b2 += (d_sigmoid(z_2) * grad_out_2)
# Hidden layer 1
grad_out_1 = np.transpose(w2) @ (d_sigmoid(z_2) * grad_out_2)
grad_w1 += (d_sigmoid(z_1) * grad_out_1) @ (np.transpose(img[0]))
grad_b1 += (d_sigmoid(z_1) * grad_out_1)
return grad_w1, grad_b1, grad_w2, grad_b2, grad_w3, grad_b3Training:
- 100 samples
- Batch size: 10
- Epochs: 200
Expected Accuracy: 50-70%
File: step5.py
- Train on all 60,000 samples
- Larger batch size (50)
- Fewer epochs (5) due to more data
learning_rate = 1
number_of_epochs = 5
batch_size = 50
for epoch in range(number_of_epochs):
np.random.shuffle(train_set)
batches = create_batches(train_set, batch_size)
for batch in batches:
# Compute gradients
# Update weightsTraining Time: ~1 minute on Intel 7700HQ
Expected Accuracy:
- Training: >90%
- Testing: ~90%
Step 6-1: Robustness Test (step6-1.py)
- Shift test images 4 pixels to the right
- Test model's invariance to translations
- Expected drop in accuracy
Step 6-2: Alternative Activation (step6-2.py)
- Replace Sigmoid with Tanh
- Compare performance and training dynamics
Step 6-3: Performance Analysis (step6-3.py)
- Final evaluation on test set
- Generate confusion matrix
- Analyze misclassified examples
- Python 3.7 or higher
- NumPy
- Matplotlib
- Clone the repository:
git clone https://github.com/zamirmehdi/Handwritten-Digit-Recognition.git
cd Handwritten-Digit-Recognition- Create virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate- Install dependencies:
pip install numpy matplotlib pillow- Download MNIST data:
The MNIST
.idxfiles should be in thesrc/directory. They're included in the repository.
cd src
# Step 2: Forward propagation only
python main.py # (uncomment step 2 section)
# Step 3: Training with non-vectorized backprop
python main.py # (uncomment step 3 section)
# Step 4: Training with vectorized backprop
python main.py # (uncomment step 4 section)
# Step 5: Full dataset training
python step5.py
# Step 6-1: Robustness test
python step6-1.py
# Step 6-2: Tanh activation
python step6-2.py
# Step 6-3: Final evaluation
python step6-3.pyimport matplotlib.pyplot as plt
# Plot cost over iterations
plt.plot(all_batch_costs)
plt.xlabel('Iteration')
plt.ylabel('Cost')
plt.title('Training Progress')
plt.show()# Load and preprocess your image
from PIL import Image
import numpy as np
img = Image.open('digit.png').convert('L')
img = img.resize((28, 28))
img_array = np.array(img) / 256.0
img_array = img_array.reshape(784, 1)
# Predict
out_1 = sigmoid(forward(img_array, w1, b1))
out_2 = sigmoid(forward(out_1, w2, b2))
out_final = sigmoid(forward(out_2, w3, b3))
prediction = np.argmax(out_final)
print(f"Predicted digit: {prediction}")Mean Squared Error (MSE):
J(W, b) = (1/2m) * ฮฃ ||h(x^(i)) - y^(i)||^2
Where:
m= number of samplesh(x)= network outputy= true label (one-hot)
For layer l:
z^[l] = W^[l] ยท a^[l-1] + b^[l]
a^[l] = ฯ(z^[l])
Where:
W= weight matrixb= bias vectorฯ= activation functiona^[0]= input x
Output layer:
ฮด^[L] = (a^[L] - y) โ ฯ'(z^[L])
Hidden layers:
ฮด^[l] = (W^[l+1])^T ยท ฮด^[l+1] โ ฯ'(z^[l])
Gradients:
โJ/โW^[l] = ฮด^[l] ยท (a^[l-1])^T
โJ/โb^[l] = ฮด^[l]
Update rule:
W^[l] := W^[l] - ฮฑ ยท โJ/โW^[l]
b^[l] := b^[l] - ฮฑ ยท โJ/โb^[l]
Where ฮฑ is the learning rate.
Before (Loops):
for i in range(m):
for j in range(n):
result[i][j] = weight[i][j] * input[j]After (Vectorized):
result = weight @ input # Matrix multiplicationBenefits:
- 10-100x speedup
- Leverages optimized BLAS libraries
- More concise and readable code
Instead of updating after each sample (SGD) or all samples (Batch GD):
# Process in mini-batches
batch_size = 50
for batch in batches:
gradients = compute_gradients(batch)
update_parameters(gradients)Advantages:
- Faster convergence than pure SGD
- Less memory than full batch
- Better generalization
- Parallelizable
# Random initialization (Xavier/Glorot-like)
w1 = np.random.randn(hidden_1_size, first_size) * 0.01
b1 = np.zeros((hidden_1_size, 1))Why small random weights:
- Break symmetry
- Avoid saturation
- Enable learning
| Step | Training Samples | Epochs | Batch Size | Accuracy |
|---|---|---|---|---|
| Step 2 | 100 | 0 | - | ~10% (random) |
| Step 3 | 100 | 20 | 10 | 25-50% |
| Step 4 | 100 | 200 | 10 | 50-70% |
| Step 5 | 60,000 | 5 | 50 | ~90% |
Hardware: Intel Core i7-7700HQ
| Implementation | Time |
|---|---|
| Step 3 (Non-vectorized) | ~5 minutes |
| Step 4 (Vectorized) | ~30 seconds |
| Step 5 (Full dataset) | ~1 minute |
Speedup from vectorization: ~10x
Cost
โ
โ โฒ
โ โฒ
โ โฒ___
โ โฒ___
โ โฒ___
โ โฒ___________
โโโโโโโโโโโโโโโโโโโโโโโโโโ Iterations
- Initial: High cost, rapid decrease
- Middle: Moderate decrease
- Final: Plateau, small improvements
| Digit | Accuracy | Common Mistakes |
|---|---|---|
| 0 | 96% | Confused with 6 |
| 1 | 98% | Highest accuracy |
| 2 | 91% | Confused with 7 |
| 3 | 89% | Confused with 5, 8 |
| 4 | 93% | Confused with 9 |
| 5 | 88% | Confused with 3, 8 |
| 6 | 94% | Confused with 0, 8 |
| 7 | 92% | Confused with 1, 2 |
| 8 | 87% | Most challenging |
| 9 | 90% | Confused with 4, 7 |
- Pure NumPy, no deep learning frameworks
- Educational and transparent
- Full control over every detail
- Progressive complexity
- Easy to understand and debug
- Compare vectorized vs non-vectorized
- Robustness testing
- Activation function comparison
- Performance analysis
- Matrix operations instead of loops
- 10x+ speedup
- Professional-grade optimization
- MNIST standard benchmark
- Comparable to library implementations
- Deployable model
Setup:
- Shift all test images 4 pixels to right
- Test model's translation invariance
Code:
# Shift image 4 pixels right
image_2d = image.reshape((28, 28))
for _ in range(4):
image_2d = np.roll(image_2d, 1, axis=1)
image_2d[:, 0] = 0.0 # Zero out left column
image = image_2d.reshape(784, 1)Results:
- Original Test Accuracy: ~90%
- Shifted Test Accuracy: ~75-80%
- Conclusion: Network lacks translation invariance (CNNs solve this)
Tanh Activation:
def tanh(x):
return (2 / (1 + np.exp(-2*x))) - 1
def d_tanh(x):
return 1 - tanh(x)**2Comparison:
| Activation | Range | Advantages | Test Accuracy |
|---|---|---|---|
| Sigmoid | (0, 1) | Simple, probabilistic | ~90% |
| Tanh | (-1, 1) | Zero-centered | ~88-92% |
| ReLU | [0, โ) | No saturation | ~91-93% |
Findings:
- Tanh slightly different convergence
- ReLU generally faster training
- Sigmoid works well for this problem
Test different architectures:
| Architecture | Parameters | Accuracy | Training Time |
|---|---|---|---|
| 784-10 | ~7K | ~85% | Fast |
| 784-16-10 | ~13K | ~88% | Fast |
| 784-16-16-10 | ~13.5K | ~90% | Moderate |
| 784-32-32-10 | ~26K | ~91% | Slow |
| 784-64-64-10 | ~53K | ~92% | Very Slow |
Conclusion: 2 hidden layers (16 neurons each) provides good balance.
- Feedforward architecture
- Multi-layer perceptron
- Universal approximation
- Non-linear transformations
- Chain rule application
- Gradient computation
- Error propagation
- Weight updates
- Stochastic Gradient Descent
- Mini-batch training
- Learning rate scheduling
- Convergence criteria
- Matrix operations
- NumPy broadcasting
- Computational efficiency
- Memory optimization
- Feature learning
- Supervised classification
- Image processing
- Model evaluation
- Simple Architecture: Only fully-connected layers
- No Regularization: Risk of overfitting
- Fixed Learning Rate: No adaptive methods
- Single Activation: Sigmoid throughout
- No Data Augmentation: Limited to original MNIST
| Aspect | This Project | Modern CNNs |
|---|---|---|
| Architecture | Fully Connected | Convolutional |
| Accuracy | ~90% | >99% |
| Parameters | ~13K | 50K-500K |
| Translation Invariance | No | Yes |
| Training Time | Minutes | Hours (but more data) |
- Implement Convolutional Neural Networks
- Add dropout regularization
- Use Adam optimizer
- Implement batch normalization
- Add data augmentation
- Support for other datasets (Fashion-MNIST, CIFAR)
- Save/load trained models
- Web interface for digit drawing and prediction
- GPU acceleration with CuPy
- Cross-validation
Author: Amirmehdi Zarrinnezhad
Course: Computational Intelligence
University: Amirkabir University of Technology (Tehran Polytechnic) - Spring 2021
GitHub Link: Handwritten-Digit-Recognition
Part of the Computational Intelligence Course repository.
Other Projects:
- LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
- Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536.
- Nielsen, M. A. (2015). Neural Networks and Deep Learning. Determination Press.
Questions or collaborations? Feel free to reach out!
๐ง Email: amzarrinnezhad@gmail.com
๐ GitHub: @zamirmehdi
โญ If you found this project helpful, please consider giving it a star! โญ
Amirmehdi Zarrinnezhad