Implemented the following:
- Matrix Addition (memcpy operations and an optimized number of thread blocks)
- Array Reduction (coalesced memory accesses)
- Matrix Multiplication (shared memory tiling)
- Histogram Analysis (atomic operations, shared memory)
- Port of an existing C-based machine learning project to CUDA
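
As an illustration of the histogram item above, here is a minimal sketch of how atomic operations and shared memory are commonly combined. It is not the project's actual code: the kernel name `histogramShared`, the 256-bin layout (one bin per byte value), the synthetic input, and the `<<<64, 256>>>` launch configuration are all assumptions made for the example. Each block builds a private histogram in shared memory with `atomicAdd`, then merges it into the global histogram, which reduces contention on global-memory atomics.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int NUM_BINS = 256;  // assumption: one bin per possible byte value

// Hypothetical kernel: per-block private histogram in shared memory,
// merged into the global histogram at the end.
__global__ void histogramShared(const unsigned char *data, size_t n,
                                unsigned int *bins)
{
    __shared__ unsigned int local[NUM_BINS];

    // Zero the block-private bins cooperatively.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        local[i] = 0;
    __syncthreads();

    // Grid-stride loop: each thread counts its elements into shared memory.
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&local[data[i]], 1u);
    __syncthreads();

    // Merge the block's partial counts into the global histogram.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        if (local[i] != 0)
            atomicAdd(&bins[i], local[i]);
}

int main()
{
    const size_t n = 1 << 20;  // assumed input size for the demo
    std::vector<unsigned char> host(n);
    std::vector<unsigned int> expected(NUM_BINS, 0), result(NUM_BINS, 0);

    // Synthetic input plus a CPU reference histogram for checking.
    for (size_t i = 0; i < n; ++i) {
        host[i] = (unsigned char)((i * 31 + 7) % NUM_BINS);
        ++expected[host[i]];
    }

    unsigned char *dData = nullptr;
    unsigned int *dBins = nullptr;
    cudaMalloc(&dData, n);
    cudaMalloc(&dBins, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(dData, host.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(dBins, 0, NUM_BINS * sizeof(unsigned int));

    histogramShared<<<64, 256>>>(dData, n, dBins);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(result.data(), dBins, NUM_BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dBins);

    // Compare the GPU histogram against the CPU reference.
    bool ok = (result == expected);
    printf("%s\n", ok ? "histogram matches CPU reference" : "MISMATCH");
    return ok ? 0 : 1;
}
```

Assuming the file is saved as `histogram.cu`, it can be built with `nvcc histogram.cu -o histogram`. The shared-memory stage matters most when many threads hit the same bin, since those collisions are resolved on-chip rather than in global memory.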