CodeScribe AI is a powerful proof-of-concept tool that bridges the physical and digital worlds of programming. It transforms handwritten pseudocode or algorithms from an image into clean, executable Python code. Leveraging a sophisticated pipeline of computer vision and a Large Language Model (LLM), it not only generates the code but also allows you to interactively ask questions about its logic, complexity, and functionality.
Turn your whiteboard sketches and notebook scribbles into a reality!
In the world of software development, ideas often begin on a whiteboard or in a notebook. Translating these handwritten thoughts into functional code is a manual, time-consuming, and error-prone process. This project automates that translation.
A typical input is a photo of handwritten code, which may be messy, have inconsistent formatting, and contain non-code elements.
Input
The script outputs clean, commented, and functional Python code, followed by an interactive Q&A session to explain it.
Output
The magic happens in a four-stage pipeline. Each stage prepares the data for the next, culminating in an intelligent, interactive experience.
```mermaid
graph TD
    A[📷 Image Input] -->|image.png| B(Stage 1: Image Pre-processing);
    B -->|Grayscale & Binary Image| C(Stage 2: OCR with Tesseract);
    C -->|Messy Extracted Text| D(Stage 3: Code Generation with Gemini);
    D -->|Clean Python Code| E(Stage 4: Interactive Q&A with Gemini);

    subgraph "Computer Vision"
        B
        C
    end

    subgraph "Generative AI"
        D
        E
    end

    style A fill:#f9f,stroke:#333,stroke-width:2px
    style E fill:#ccf,stroke:#333,stroke-width:2px
```
1. Image Pre-processing (with OpenCV): The input image is first processed to be machine-readable. This involves:
   - Reading the Image: loading the image file into memory.
   - Grayscaling: converting the image from color to grayscale, since color information is not needed for OCR.
   - Binarization: applying Otsu's method with an inverse binary threshold. This crucial step converts the image to pure black and white, making the characters stand out sharply from the background, which significantly improves OCR accuracy.
2. Text Extraction (with Tesseract OCR): The pre-processed image is fed into the Tesseract engine.
   - Tesseract scans the image and performs Optical Character Recognition (OCR), converting the pixels representing characters into a raw text string.
   - As seen in the sample output, this raw text is often imperfect and contains recognition errors (e.g., `rl att ee sean!ora{j-1) > afj)) ¢`).
3. Code Generation (with Google Gemini): This is where the LLM's power shines.
   - The messy, OCR-extracted text is sent to the Gemini API with a carefully crafted prompt.
   - The prompt instructs the model to act as a code interpreter, recognize that the input is OCR'd text with errors, and translate the underlying logic into a clean, single Python method.
   - Gemini doesn't just transcribe; it infers the original intent, corrects syntax errors, formats the code correctly, and even adds comments explaining how it interpreted the garbled sections.
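A minimal sketch of this stage, assuming the `google-generativeai` SDK. The prompt wording and function names are illustrative, not the project's exact code:

```python
def build_codegen_prompt(ocr_text: str) -> str:
    """Wrap the raw OCR output in instructions for the model."""
    return (
        "You are a code interpreter. The text below was extracted by OCR "
        "from a photo of handwritten code and may contain recognition errors.\n"
        "Infer the intended logic and rewrite it as a single, clean, "
        "commented Python method. Add comments explaining any sections "
        "you had to reinterpret.\n\n"
        f"OCR text:\n{ocr_text}"
    )

def generate_code(ocr_text: str) -> str:
    """Send the OCR text to Gemini and return the generated Python code."""
    # Imported here so the sketch can be read without the SDK installed
    import google.generativeai as genai
    model = genai.GenerativeModel("gemini-2.5-flash")
    return model.generate_content(build_codegen_prompt(ocr_text)).text
```

Telling the model explicitly that the input is OCR output with errors is what licenses it to correct gibberish rather than transcribe it verbatim.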
4. Interactive Q&A (with Google Gemini): After generating the code, the script enters a conversational loop.
   - The user can ask questions in natural language about the generated code.
   - For each question, a new prompt is sent to Gemini with the generated code as context, allowing the model to give accurate, context-aware answers about the algorithm's logic, time complexity, or specific lines of code.
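The conversational loop can be sketched as follows; each turn re-sends the generated code as context (function names are illustrative):

```python
def build_qa_prompt(code: str, question: str) -> str:
    """Give the model the generated code as context for one question."""
    return (
        "You are explaining the following Python code to a developer.\n\n"
        f"```python\n{code}\n```\n\n"
        f"Question: {question}\n"
        "Answer concisely, referring to specific lines where helpful."
    )

def qa_loop(code: str, model) -> None:
    """Interactive Q&A session over the generated code."""
    while True:
        question = input("Ask about the code (or 'quit' to exit): ").strip()
        if question.lower() in {"quit", "exit", ""}:
            break
        # Each question is answered with the full code as fresh context
        print(model.generate_content(build_qa_prompt(code, question)).text)
```

Re-sending the code on every turn keeps each request self-contained, at the cost of a few extra tokens per question.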
Architecture
This project stands on the shoulders of several powerful open-source libraries and APIs.
- OpenCV (`cv2`): The cornerstone of the image pre-processing stage. It provides the tools necessary to filter and transform the image to make it optimal for OCR.
- Pytesseract: A Python wrapper for Google's Tesseract-OCR Engine. It serves as the bridge between our pre-processed image and the raw, extracted text.
- Pillow (`PIL`): Used to handle image data and pass it between OpenCV and Pytesseract seamlessly.
- Google Gemini (`google-generativeai`): The intelligent core of the application. The `gemini-2.5-flash` model is used for its speed and powerful reasoning capabilities. It excels at:
  - Error Correction: fixing gibberish from the OCR process.
  - Contextual Understanding: inferring programming logic (like loops and swaps) from incomplete pseudocode.
  - Code Generation: producing syntactically correct and idiomatic Python.
  - Natural Language Processing: powering the final interactive Q&A session.
Follow these steps to get the project running on your local machine.
- Python 3.8+
- Tesseract-OCR Engine: This is a system dependency, not a Python package. You must install it separately.
  - Windows: Download and run the installer from the Tesseract at UB Mannheim page. Make sure to note the installation path.
  - macOS: `brew install tesseract`
  - Linux (Debian/Ubuntu): `sudo apt-get install tesseract-ocr`
```sh
git clone <your-repository-url>
cd <repository-directory>
```

Create a `requirements.txt` file with the following content:
```
opencv-python
numpy
pytesseract
Pillow
google-generativeai
```
Then, install them using pip:
```sh
pip install -r requirements.txt
```

- Tesseract Path (Windows Only): If you are on Windows, you may need to specify the path to your Tesseract installation within your Python script:

```python
# Example for a Python script
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
```
- Google Gemini API Key: The script requires a Gemini API key. It is recommended to set it as an environment variable (`API_KEY`). You can get a free API key from Google AI Studio.
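One common pattern for wiring up the key, using the `API_KEY` variable name from the note above (the helper function is illustrative):

```python
import os

def load_api_key() -> str:
    """Read the Gemini API key from the environment, failing loudly if absent."""
    key = os.environ.get("API_KEY")
    if not key:
        raise RuntimeError(
            "Set the API_KEY environment variable (get a key from Google AI Studio)."
        )
    return key

# Typical usage with the SDK:
# import google.generativeai as genai
# genai.configure(api_key=load_api_key())
```

Keeping the key in an environment variable avoids hard-coding secrets into the script or committing them to version control.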
- Place an image of handwritten code in the project's root directory (or use the web interface to upload it).
- Run the application.
- Upload the image and click "Generate Code".
- Observe the output and interact with the Q&A bot!
Vaibhav Shikhar Singh

