This repository contains a collection of Python scripts demonstrating how to run various AI tasks locally using models from the Hugging Face Hub and the `transformers` library (along with related libraries like `datasets`, `sentence-transformers`, etc.).
These examples cover a range of modalities including Text, Vision, Audio, and Multimodal combinations, showcasing different models and pipelines available within the Hugging Face ecosystem. Each script aims to be runnable with minimal modification (often just providing an input file path or configuring text/labels/data within the script).
The scripts are categorized by the primary data modalities they handle:
- Sentiment Analysis (`run_sentiment.py`)
  - Task: Text Classification (Positive/Negative)
  - Model: `distilbert-base-uncased-finetuned-sst-2-english` (Pipeline Default)
- Text Generation (`run_generation.py`)
  - Task: Generating text following a prompt.
  - Model: `gpt2`
- Zero-Shot Text Classification (`run_zero_shot.py`)
  - Task: Classifying text using arbitrary labels without specific fine-tuning.
  - Model: `facebook/bart-large-mnli` (Pipeline Default)
- Named Entity Recognition (NER) (`run_ner.py`)
  - Task: Identifying named entities (Person, Location, Org).
  - Model: `dbmdz/bert-large-cased-finetuned-conll03-english`
- Summarization (`run_summarization.py`)
  - Task: Creating a shorter summary of a longer text.
  - Model: `facebook/bart-large-cnn`
- Translation (EN->FR) (`run_translation.py`)
  - Task: Translating text from English to French.
  - Model: `Helsinki-NLP/opus-mt-en-fr`
- Question Answering (Extractive Text) (`run_qa.py`)
  - Task: Finding the answer span within a context paragraph given a question.
  - Model: `distilbert-base-cased-distilled-squad`
- Fill-Mask (`run_fill_mask.py`)
  - Task: Predicting masked words in a sentence (Masked Language Modeling).
  - Model: `roberta-base`
- Sentence Embeddings & Similarity (`run_embeddings.py`, `run_similarity_search.py`)
  - Task: Generating semantic vector representations and finding similar sentences.
  - Model: `sentence-transformers/all-MiniLM-L6-v2` (via the `sentence-transformers` library)
- Emotion Classification (`run_emotion.py`)
  - Task: Text Classification (detecting emotions like joy, anger, sadness).
  - Model: `j-hartmann/emotion-english-distilroberta-base`
- Table Question Answering (`run_table_qa.py`)
  - Task: Answering questions based on tabular data (requires `pandas`, `torch-scatter`).
  - Model: `google/tapas-base-finetuned-wtq`
- Dialogue Simulation (`run_dialogue_generation.py`)
  - Task: Simulating multi-turn conversation via the text generation pipeline.
  - Model: `microsoft/DialoGPT-medium`
- Part-of-Speech (POS) Tagging (`run_pos_tagging.py`)
  - Task: Identifying grammatical parts of speech for each word.
  - Model: `vblagoje/bert-english-uncased-finetuned-pos`
- Image Classification (`run_image_classification.py`)
  - Task: Classifying the main subject of an image.
  - Model: `google/vit-base-patch16-224`
- Object Detection (`run_object_detection_annotated.py`)
  - Task: Identifying multiple objects in an image with bounding boxes and labels (plus annotation).
  - Model: `facebook/detr-resnet-50`
- Depth Estimation (`run_depth_estimation.py`)
  - Task: Estimating depth from a single image and saving a depth map.
  - Model: `Intel/dpt-large`
- Image Segmentation (`run_segmentation.py`)
  - Task: Assigning category labels (e.g., road, sky, car) to each pixel (requires `matplotlib`, `numpy`).
  - Model: `nvidia/segformer-b0-finetuned-ade-512-512`
- Image Super-Resolution (`run_super_resolution.py`)
  - Task: Upscaling an image (x2) to enhance resolution.
  - Model: `caidas/swin2SR-classical-sr-x2-64`
- Audio Classification (`run_audio_classification.py`)
  - Task: Classifying the type of sound in an audio file (e.g., Speech, Music). Requires `torchaudio`.
  - Model: `MIT/ast-finetuned-audioset-10-10-0.4593`
- Image Captioning (`run_image_captioning.py`)
  - Task: Generating a text description for an image.
  - Model: `nlpconnect/vit-gpt2-image-captioning`
- Visual Question Answering (VQA) (`run_vqa.py`)
  - Task: Answering questions based on image content.
  - Model: `dandelin/vilt-b32-finetuned-vqa`
- Zero-Shot Image Classification (`run_zero_shot_image.py`)
  - Task: Classifying images against arbitrary text labels (requires `ftfy`, `regex`).
  - Model: `openai/clip-vit-base-patch32`
- Document Question Answering (DocVQA) (`run_docvqa.py`)
  - Task: Answering questions based on document image content (requires `sentencepiece`).
  - Model: `naver-clova-ix/donut-base-finetuned-docvqa`
- Automatic Speech Recognition (ASR) (`run_asr_flexible.py`)
  - Task: Transcribing speech from an audio file to text.
  - Model: `openai/whisper-base`
- Zero-Shot Audio Classification (`run_zero_shot_audio.py`)
  - Task: Classifying sounds against arbitrary text labels.
  - Model: `laion/clap-htsat-unfused`
- Text-to-Speech (TTS) (`run_tts.py`)
  - Task: Generating speech audio from text (requires `SpeechRecognition`, `protobuf`).
  - Model: `microsoft/speecht5_tts` + `microsoft/speecht5_hifigan`
(Refer to comments within each script for more specific details on models and implementation.)
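Most of these scripts boil down to a few lines built around the `transformers` pipeline API. As a rough, self-contained sketch (not the exact contents of any script in this repository), two of the text tasks look roughly like this:

```python
from transformers import pipeline

# Sentiment analysis with the pipeline's default model
# (distilbert-base-uncased-finetuned-sst-2-english).
classifier = pipeline("sentiment-analysis")
print(classifier("Running these models locally is surprisingly straightforward."))

# Zero-shot classification with facebook/bart-large-mnli.
zero_shot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
print(zero_shot(
    "The new GPU drivers finally fixed the rendering glitches.",
    candidate_labels=["technology", "sports", "cooking"],
))
```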
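The embeddings and similarity scripts use the `sentence-transformers` library rather than a `transformers` pipeline. Again as a sketch rather than the exact script contents:

```python
from sentence_transformers import SentenceTransformer, util

# Encode a few sentences into dense vectors and compare them with cosine similarity.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentences = [
    "A man is playing a guitar.",
    "Someone is strumming an acoustic guitar.",
    "The stock market fell sharply today.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)
scores = util.cos_sim(embeddings[0], embeddings[1:])
print(scores)  # the first pair should score much higher than the second
```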
Before running these scripts, ensure you have the following:
- Python: Python 3.8 or later is recommended.
- System Dependencies (Ubuntu/Debian): Some scripts (especially audio-related) require system libraries. Install common ones using:
  ```bash
  # libsndfile1 is for reading/writing audio files
  # ffmpeg is often needed by libraries for handling various audio/video formats
  sudo apt update && sudo apt install libsndfile1 ffmpeg
  ```
  (`tesseract-ocr` is not required for these examples. Other operating systems may require different commands.)
- Python Libraries: It's highly recommended to use a Python virtual environment. You can install all common dependencies used across the examples with a single command:
  ```bash
  pip install "transformers[audio,sentencepiece]" torch datasets soundfile librosa sentence-transformers Pillow torchvision timm requests pandas torch-scatter ftfy regex numpy torchaudio matplotlib SpeechRecognition protobuf
  ```
  - Note: `pytesseract` is not required. The `"transformers[audio,sentencepiece]"` extra pulls in the common audio dependencies and `sentencepiece`. Not every script requires all of these libraries, but installing them all ensures you can run most examples. Refer to the comments within each script for its minimal requirements. (A quick verification snippet follows below.)
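Once everything is installed, a quick sanity check (not part of the repository) can confirm that the core libraries import correctly and whether a CUDA-capable GPU is visible:

```python
# Minimal environment check: verify the core libraries and GPU visibility.
import torch
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```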
- Clone the Repository:
  ```bash
  git clone <repository-url>
  cd <repository-directory>
  ```
- Create Virtual Environment (Recommended):
  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  ```
  (Use `.\.venv\Scripts\activate` on Windows.)
- Install System Dependencies: Follow the instructions in the Prerequisites section if applicable for your OS (especially `libsndfile1` and `ffmpeg` on Ubuntu/Debian).
- Install Python Libraries: Run the combined pip command from the Prerequisites section within your activated virtual environment.
- Configure Script Inputs (IMPORTANT):
  - Many scripts require you to provide input inside the script, such as a path to a local image or audio file, specific text/questions, candidate labels, or table data.
  - Open the specific `.py` script you want to run in a text editor before executing it.
  - Look for comments indicating `USER ACTION REQUIRED` or variables like `user_image_path`, `user_audio_path`, `user_doc_image_path`, `question`, `candidate_labels`, `data` (for tables), `text_to_speak`, etc. (A hypothetical illustration follows these setup steps.)
  - Modify these variables according to the script's needs (e.g., provide a valid file path, change the question text, update labels, define table data). Some scripts include logic to download a sample file if a local one isn't found; read the script comments for details.
- Run the Script:
  - Execute the desired script using Python from your terminal (ensure your virtual environment is active):
    ```bash
    python <script_name>.py
    ```
    (e.g., `python run_sentiment.py`, `python run_docvqa.py`)
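As an illustration only (the actual variable names and defaults differ from script to script), the editable block you are looking for near the top of a script such as `run_vqa.py` might look something like this:

```python
# --- USER ACTION REQUIRED ---------------------------------------------------
# Hypothetical example of the editable block near the top of a script;
# the real variable names and defaults vary from script to script.
user_image_path = "/path/to/your/photo.jpg"      # point this at a local image
question = "How many people are in the picture?"  # the question to ask about it
# -----------------------------------------------------------------------------
```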
The first time you run a script using a specific Hugging Face model, the necessary model weights, configuration, and tokenizer/processor files will be automatically downloaded from the Hugging Face Hub and cached locally (usually in `~/.cache/huggingface/` or `C:\Users\<User>\.cache\huggingface\`). Subsequent runs using the same model will load directly from the cache, making them much faster and enabling offline use (provided all necessary files are cached).
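If you want to confirm that a later run really does work offline, one option (assuming the model is already cached from a previous online run) is to set the `TRANSFORMERS_OFFLINE` environment variable before loading anything:

```python
import os

# Tell transformers to use only locally cached files (no network access).
# This only succeeds if the model was downloaded on a previous, online run.
os.environ["TRANSFORMERS_OFFLINE"] = "1"

from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # resolved entirely from the local cache
print(classifier("This run used only cached files."))
```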
- CPU: Most scripts will run on a CPU, but performance (especially for larger models or complex tasks like vision, audio, generation) might be slow.
- GPU: An NVIDIA GPU with CUDA configured correctly and a compatible version of `torch` installed is highly recommended for significantly faster inference. The scripts include basic logic to attempt using the GPU if available (a short sketch follows this list).
- RAM: Models vary greatly in size, so ensure you have sufficient RAM. Smaller models might need 4-8 GB, while larger ones (such as `large` variants and vision/audio/document models) might require 16 GB or more.
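The exact device-selection code varies by script, but a common pattern with the pipeline API looks roughly like this:

```python
import torch
from transformers import pipeline

# Use the first CUDA GPU if one is available, otherwise fall back to the CPU.
device = 0 if torch.cuda.is_available() else -1

summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device=device)
```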
- The Python scripts in this repository are provided as examples, likely under the MIT License (or specify your chosen license).
- The Hugging Face libraries (`transformers`, `datasets`, etc.) are typically licensed under Apache 2.0.
- Individual models downloaded from the Hugging Face Hub have their own licenses. Please refer to the model card on the Hub for the specific terms of use for each model (note that some models, such as Donut or specific fine-tunes, may have non-commercial or other restrictions).