A low-level neural network implementation in C, parallelized with POSIX threads.
This project implements from scratch:
- Single Layer Perceptron (SLP)
- Multi-Layer Perceptron (MLP)
- Explicit backpropagation
- Custom BLAS-like linear algebra core
- Parallelization using POSIX threads (pthreads)
The goal is not to build a framework but to understand how neural networks operate at a low level: memory layout, numerical computation, and parallel execution.
- Row-major matrix representation
- Manual memory control
- Vector operations
- Linear Algebra core
- Unit testing & Valgrind validation
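A minimal sketch of the row-major layout idea. The names `Matrix`, `mat_at`, and `mat_alloc` here are illustrative, not the project's actual `matrix.h` API:

```c
#include <stdlib.h>

/* Illustrative row-major matrix: element (i, j) lives at data[i * cols + j],
 * so walking a row touches contiguous memory (cache friendly). */
typedef struct {
    size_t rows, cols;
    double *data;   /* one contiguous block of rows * cols doubles */
} Matrix;

static Matrix mat_alloc(size_t rows, size_t cols) {
    Matrix m = { rows, cols, malloc(rows * cols * sizeof(double)) };
    return m;
}

/* Row-major indexing: offset = i * cols + j. */
static inline double *mat_at(Matrix *m, size_t i, size_t j) {
    return &m->data[i * m->cols + j];
}

static void mat_free(Matrix *m) { free(m->data); m->data = NULL; }
```

Keeping the whole matrix in one allocation (rather than an array of row pointers) simplifies manual memory management and makes Valgrind checks straightforward.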
- Row partitioning strategy using pthreads
- Parallel matrix-vector multiplication & matrix-matrix multiplication
- Parallel batch operations
- Persistent Thread Pool (Minimalist BLAS-style)
- Speedup benchmarking tool (sequential vs. parallel)
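The row-partitioning idea can be sketched as follows, here with the naive create/join approach that the persistent thread pool later replaces. All names (`MatvecTask`, `matvec_parallel`) are hypothetical, not the project's API:

```c
#include <pthread.h>
#include <stddef.h>

/* Each worker computes y[i] for its own contiguous slice of rows,
 * so no two threads ever write the same output element (no locking needed). */
typedef struct {
    const double *A, *x;
    double *y;
    size_t cols, row_start, row_end;   /* half-open range [row_start, row_end) */
} MatvecTask;

static void *matvec_worker(void *arg) {
    MatvecTask *t = arg;
    for (size_t i = t->row_start; i < t->row_end; i++) {
        double acc = 0.0;
        for (size_t j = 0; j < t->cols; j++)
            acc += t->A[i * t->cols + j] * t->x[j];   /* row-major access */
        t->y[i] = acc;
    }
    return NULL;
}

/* y = A * x with rows split evenly across nthreads workers. */
static void matvec_parallel(const double *A, const double *x, double *y,
                            size_t rows, size_t cols, size_t nthreads) {
    pthread_t tids[nthreads];
    MatvecTask tasks[nthreads];
    size_t chunk = (rows + nthreads - 1) / nthreads;   /* ceil division */
    for (size_t k = 0; k < nthreads; k++) {
        size_t start = k * chunk;
        size_t end = (start + chunk < rows) ? start + chunk : rows;
        tasks[k] = (MatvecTask){ A, x, y, cols, start, end };
        pthread_create(&tids[k], NULL, matvec_worker, &tasks[k]);
    }
    for (size_t k = 0; k < nthreads; k++)
        pthread_join(tids[k], NULL);
}
```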
- Activation functions & Derivatives
- Loss functions
- Weight Initialization (Xavier/He)
- 4.1 Forward: y = activation(Wx + b)
- 4.2 Explicit Backward: dW = dL/dy * x^T, db = dL/dy
- 4.3 Training Loop: Forward -> Loss -> Backward -> Update
- 4.4 Validation: convergence on linearly separable data
- 5.1 Structure: Multi-layer model representation
- 5.2 Layer Caching: Storing Z and A states for backprop
- 5.3 General Backpropagation: Iterative chain rule implementation
- 5.4 Numerical Gradient Checking: Comparative validation
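Numerical gradient checking (5.4) compares the analytic gradient from backprop against a finite-difference estimate. A minimal one-parameter sketch with illustrative names:

```c
#include <math.h>

/* Central-difference estimate: (f(w + eps) - f(w - eps)) / (2 * eps).
 * Its error is O(eps^2), so for smooth f it should agree with the
 * analytic gradient to many digits when eps is around 1e-5. */
static double numeric_grad(double (*f)(double), double w, double eps) {
    return (f(w + eps) - f(w - eps)) / (2.0 * eps);
}

/* Toy objective f(w) = w^2 with known analytic gradient 2w. */
static double square(double w) { return w * w; }
```

In the real check the same comparison is run per weight of the network, with a relative-error tolerance rather than an absolute one.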
- 6.1 Vectorized input (Batch x Features)
- 6.2 Batched Matmul optimization
- 6.3 Aggregated gradient calculation
- 7.1 Thread-level batch partitioning
- 7.2 Gradient reduction buffers and synchronization
- Persistent Thread Pool implementation (eliminates pthread overhead)
- Worker synchronization via Condition Variables
- 9.1 Cache-aware optimization (Loop Tiling / Blocking)
- 9.2 Memory alignment & Branch avoidance
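Loop tiling (9.1) can be sketched like this for a square row-major matmul; `matmul_tiled` and the tile size are illustrative choices, not the project's tuned values:

```c
#include <stddef.h>
#include <string.h>

#define TILE 32  /* block edge; tune so three TILE x TILE tiles fit in L1 */

/* C = A * B for n x n row-major matrices, blocked so that each
 * TILE x TILE working set of A, B, and C stays hot in cache instead of
 * being evicted between passes over the inner dimension. */
static void matmul_tiled(const double *A, const double *B, double *C, size_t n) {
    memset(C, 0, n * n * sizeof(double));
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                for (size_t i = ii; i < ii + TILE && i < n; i++)
                    for (size_t k = kk; k < kk + TILE && k < n; k++) {
                        double a = A[i * n + k];   /* hoisted scalar */
                        for (size_t j = jj; j < jj + TILE && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

The i-k-j inner ordering also keeps the innermost loop streaming over contiguous rows of B and C, which pairs well with the row-major layout above.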
- 10.1 Data loader (CSV/Binary)
- 10.2 Integration tests (MLP training)
- 10.3 CLI for hyperparameter tuning
- 10.4 Shape assertions & Debug mode
- Advanced Optimizers (Adam, Momentum)
- Regularization (L2, Dropout)
- Model serialization (Save/Load binaries)
- C
- Row-major memory layout
- Manual memory management
- Explicit parallelism (pthreads)
- No hidden abstractions
This project prioritizes clarity of execution over abstraction.
Neural-Networks-in-C
├── assets/
├── include/
│ ├── matrix.h
│ ├── parallel.h
│ ├── linalg.h
│ ├── runtime.h
│ ├── thread_pool.h
│ └── nn_infra.h
├── src/
│ ├── parallel/
│ ├── matrix.c
│ ├── linalg.c
│ ├── runtime.c
│ ├── thread_pool.c
│ └── nn_infra.c
├── tests/
│ └── test_*.c
├── build/ # compiled binaries
├── main.c # soon
├── run_valgrind.sh
└── Makefile

The project uses a persistent Thread Pool (BLAS-style) to manage parallel tasks. Unlike a naive approach where threads are created and joined on every operation, this implementation spawns workers once at startup (runtime_init) and signals work using condition variables. This eliminates the significant overhead of repeated pthread_create/pthread_join calls, making small-to-medium matrix operations much more efficient.
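A minimal sketch of this persistent-pool pattern. All names (`Pool`, `pool_run`, `NWORKERS`, ...) are illustrative, not the project's actual `runtime.h`/`thread_pool.h` API:

```c
#include <pthread.h>
#include <stdbool.h>

#define NWORKERS 4

typedef struct Pool Pool;

typedef struct { Pool *pool; int id; } WorkerArg;

struct Pool {
    pthread_mutex_t lock;
    pthread_cond_t work_ready;   /* main -> workers: a job was posted     */
    pthread_cond_t work_done;    /* workers -> main: all workers finished */
    unsigned generation;         /* bumped once per posted job            */
    int pending;                 /* workers still running the current job */
    bool shutdown;
    void (*job)(int id, void *ctx);
    void *ctx;
    pthread_t tids[NWORKERS];
    WorkerArg args[NWORKERS];
};

static void *pool_worker(void *arg) {
    WorkerArg *wa = arg;
    Pool *p = wa->pool;
    unsigned seen = 0;
    pthread_mutex_lock(&p->lock);
    for (;;) {
        /* Sleep until a new generation is posted or shutdown is requested. */
        while (!p->shutdown && p->generation == seen)
            pthread_cond_wait(&p->work_ready, &p->lock);
        if (p->shutdown)
            break;
        seen = p->generation;
        void (*job)(int, void *) = p->job;   /* copy under the lock */
        void *ctx = p->ctx;
        pthread_mutex_unlock(&p->lock);
        job(wa->id, ctx);                    /* run the job outside the lock */
        pthread_mutex_lock(&p->lock);
        if (--p->pending == 0)
            pthread_cond_signal(&p->work_done);
    }
    pthread_mutex_unlock(&p->lock);
    return NULL;
}

/* Spawn workers exactly once (the runtime_init idea). */
static void pool_init(Pool *p) {
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->work_ready, NULL);
    pthread_cond_init(&p->work_done, NULL);
    p->generation = 0;
    p->pending = 0;
    p->shutdown = false;
    for (int i = 0; i < NWORKERS; i++) {
        p->args[i] = (WorkerArg){ p, i };
        pthread_create(&p->tids[i], NULL, pool_worker, &p->args[i]);
    }
}

/* Post one job to every worker and block until all of them have finished.
 * No threads are created or joined here -- just condition-variable signaling. */
static void pool_run(Pool *p, void (*job)(int, void *), void *ctx) {
    pthread_mutex_lock(&p->lock);
    p->job = job;
    p->ctx = ctx;
    p->pending = NWORKERS;
    p->generation++;
    pthread_cond_broadcast(&p->work_ready);
    while (p->pending > 0)
        pthread_cond_wait(&p->work_done, &p->lock);
    pthread_mutex_unlock(&p->lock);
}

static void pool_destroy(Pool *p) {
    pthread_mutex_lock(&p->lock);
    p->shutdown = true;
    pthread_cond_broadcast(&p->work_ready);
    pthread_mutex_unlock(&p->lock);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(p->tids[i], NULL);
}

/* Example job: each worker records its id in a shared results array. */
static void mark_done(int id, void *ctx) { ((int *)ctx)[id] = id + 1; }
```

The generation counter guards against spurious wakeups and lets workers distinguish "a new job was posted" from "I already ran this one."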
```
make
```

Compile all tests:

```
make test
```

Run:

```
./build/test_matmul
./build/test_matvec
./build/test_performance
```

To use this feature, you need to install Valgrind:

```
sudo apt-get install valgrind
```

Give the script permission to run:

```
chmod +x run_valgrind.sh
```

Then run:

```
./run_valgrind.sh test_runtime
```