This repo holds a course project for Cyberinfrastructure at RIT. This project implements a GPU-accelerated version of the Levenshtein distance algorithm using PyCUDA. It's designed to efficiently compute the edit distance between two large strings, making it particularly useful for applications in bioinformatics, natural language processing, and data analysis. Please check the project report in this repo for more information.
The Levenshtein distance, also known as edit distance, is a string metric for measuring the difference between two sequences. It represents the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into another. This algorithm has various applications, including:
- Spell checking
- DNA sequence alignment
- Plagiarism detection
- Speech recognition
PyCUDA is a Python wrapper for NVIDIA's CUDA parallel computation API. It allows Python programmers to easily write GPU-accelerated algorithms by leveraging NVIDIA's CUDA architecture. PyCUDA provides:
- Access to NVIDIA's CUDA API from Python
- Automatic code generation and compilation
- Easy integration with NumPy
- GPU-accelerated Levenshtein distance calculation
- Ability to process large strings
- Integration with MATLAB file formats for input data
- Performance measurement for execution time
- Python 3.x
- NVIDIA GPU with CUDA support
- PyCUDA
- NumPy
- SciPy
- Ensure you have an NVIDIA GPU with CUDA support.
- Install the CUDA Toolkit and compatible GPU drivers.
- Install the required Python packages:
pip install pycuda numpy scipy
- Place your input MATLAB (.mat) files in the appropriate directory.
- Update the file paths in the script to point to your input files.
- Run the script:
python levenGPU_demo.py
- The script loads two strings from MATLAB files.
- It allocates managed memory on the GPU for input strings and computation matrix.
- A CUDA kernel is defined to perform the Levenshtein distance calculation.
- The computation is split into multiple passes to handle large strings.
- The final Levenshtein distance is extracted from the result matrix.
This GPU-accelerated implementation can significantly outperform CPU-based versions, especially for large strings. The exact performance gain depends on the GPU hardware and the size of the input strings.
- Requires NVIDIA GPU with CUDA support.
This project was inspired by the need for fast edit distance calculations in large-scale data processing tasks. Special thanks to the PyCUDA development team for providing the tools to make GPU acceleration accessible in Python.