This repository contains a collection of Python scripts demonstrating how to run various AI tasks locally using models from the Hugging Face Hub and the `transformers` library (along with related libraries like `datasets`, `sentence-transformers`, etc.).
These examples cover a range of modalities including Text, Vision, Audio, and Multimodal combinations, showcasing different models and pipelines available within the Hugging Face ecosystem. Each script aims to be runnable with minimal modification (often just providing an input file path or configuring text/labels/data within the script).
The scripts are categorized by the primary data modalities they handle:
- Sentiment Analysis (`run_sentiment.py`)
  - Task: Text Classification (Positive/Negative)
  - Model: `distilbert-base-uncased-finetuned-sst-2-english` (Pipeline Default)
- Text Generation (`run_generation.py`)
  - Task: Generating text following a prompt.
  - Model: `gpt2`
- Zero-Shot Text Classification (`run_zero_shot.py`)
  - Task: Classifying text using arbitrary labels without specific fine-tuning.
  - Model: `facebook/bart-large-mnli` (Pipeline Default)
- Named Entity Recognition (NER) (`run_ner.py`)
  - Task: Identifying named entities (Person, Location, Org).
  - Model: `dbmdz/bert-large-cased-finetuned-conll03-english`
- Summarization (`run_summarization.py`)
  - Task: Creating a shorter summary of a longer text.
  - Model: `facebook/bart-large-cnn`
- Translation (EN->FR) (`run_translation.py`)
  - Task: Translating text from English to French.
  - Model: `Helsinki-NLP/opus-mt-en-fr`
- Question Answering (Extractive Text) (`run_qa.py`)
  - Task: Finding the answer span within a context paragraph given a question.
  - Model: `distilbert-base-cased-distilled-squad`
- Fill-Mask (`run_fill_mask.py`)
  - Task: Predicting masked words in a sentence (Masked Language Modeling).
  - Model: `roberta-base`
- Sentence Embeddings & Similarity (`run_embeddings.py`, `run_similarity_search.py`)
  - Task: Generating semantic vector representations and finding similar sentences.
  - Model: `sentence-transformers/all-MiniLM-L6-v2` (via the `sentence-transformers` library)
- Emotion Classification (`run_emotion.py`)
  - Task: Text Classification (detecting emotions like joy, anger, sadness).
  - Model: `j-hartmann/emotion-english-distilroberta-base`
- Table Question Answering (`run_table_qa.py`)
  - Task: Answering questions based on tabular data (requires `pandas`, `torch-scatter`).
  - Model: `google/tapas-base-finetuned-wtq`
- Dialogue Simulation (`run_dialogue_generation.py`)
  - Task: Simulating multi-turn conversation via the text generation pipeline.
  - Model: `microsoft/DialoGPT-medium`
- Part-of-Speech (POS) Tagging (`run_pos_tagging.py`)
  - Task: Identifying grammatical parts of speech for each word.
  - Model: `vblagoje/bert-english-uncased-finetuned-pos`
- Image Classification (`run_image_classification.py`)
  - Task: Classifying the main subject of an image.
  - Model: `google/vit-base-patch16-224`
- Object Detection (`run_object_detection_annotated.py`)
  - Task: Identifying multiple objects in an image with bounding boxes and labels (plus annotation).
  - Model: `facebook/detr-resnet-50`
- Depth Estimation (`run_depth_estimation.py`)
  - Task: Estimating depth from a single image and saving a depth map.
  - Model: `Intel/dpt-large`
- Image Segmentation (`run_segmentation.py`)
  - Task: Assigning category labels (e.g., road, sky, car) to each pixel (requires `matplotlib`, `numpy`).
  - Model: `nvidia/segformer-b0-finetuned-ade-512-512`
- Image Super-Resolution (`run_super_resolution.py`)
  - Task: Upscaling an image (x2) to enhance resolution.
  - Model: `caidas/swin2SR-classical-sr-x2-64`
- Audio Classification (`run_audio_classification.py`)
  - Task: Classifying the type of sound in an audio file (e.g., Speech, Music). Requires `torchaudio`.
  - Model: `MIT/ast-finetuned-audioset-10-10-0.4593`
- Image Captioning (`run_image_captioning.py`)
  - Task: Generating a text description for an image.
  - Model: `nlpconnect/vit-gpt2-image-captioning`
- Visual Question Answering (VQA) (`run_vqa.py`)
  - Task: Answering questions based on image content.
  - Model: `dandelin/vilt-b32-finetuned-vqa`
- Zero-Shot Image Classification (`run_zero_shot_image.py`)
  - Task: Classifying images against arbitrary text labels (requires `ftfy`, `regex`).
  - Model: `openai/clip-vit-base-patch32`
- Document Question Answering (DocVQA) (`run_docvqa.py`)
  - Task: Answering questions based on document image content (requires `sentencepiece`).
  - Model: `naver-clova-ix/donut-base-finetuned-docvqa`
- Automatic Speech Recognition (ASR) (`run_asr_flexible.py`)
  - Task: Transcribing speech from an audio file to text.
  - Model: `openai/whisper-base`
- Zero-Shot Audio Classification (`run_zero_shot_audio.py`)
  - Task: Classifying sounds against arbitrary text labels.
  - Model: `laion/clap-htsat-unfused`
- Text-to-Speech (TTS) (`run_tts.py`)
  - Task: Generating speech audio from text (requires `SpeechRecognition`, `protobuf`).
  - Model: `microsoft/speecht5_tts` + `microsoft/speecht5_hifigan`
(Refer to comments within each script for more specific details on models and implementation.)
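Most of these scripts boil down to a few lines built around the `transformers` pipeline API. As a rough, self-contained sketch (not the exact contents of any script in this repository), two of the text tasks look roughly like this:

```python
from transformers import pipeline

# Sentiment analysis with the pipeline's default model
# (distilbert-base-uncased-finetuned-sst-2-english).
classifier = pipeline("sentiment-analysis")
print(classifier("Running these models locally is surprisingly straightforward."))

# Zero-shot classification with facebook/bart-large-mnli.
zero_shot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
print(zero_shot(
    "The new GPU drivers finally fixed the rendering glitches.",
    candidate_labels=["technology", "sports", "cooking"],
))
```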
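The embeddings and similarity scripts use the `sentence-transformers` library rather than a `transformers` pipeline. Again as a sketch rather than the exact script contents:

```python
from sentence_transformers import SentenceTransformer, util

# Encode a few sentences into dense vectors and compare them with cosine similarity.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentences = [
    "A man is playing a guitar.",
    "Someone is strumming an acoustic guitar.",
    "The stock market fell sharply today.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)
scores = util.cos_sim(embeddings[0], embeddings[1:])
print(scores)  # the first pair should score much higher than the second
```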
Before running these scripts, ensure you have the following:
- Python: Python 3.8 or later is recommended.
- System Dependencies (Ubuntu/Debian): Some scripts (especially audio-related) require system libraries. Install common ones using:
  ```bash
  # libsndfile1 is for reading/writing audio files
  # ffmpeg is often needed by libraries for handling various audio/video formats
  sudo apt update && sudo apt install libsndfile1 ffmpeg
  ```
  (`tesseract-ocr` is not required for these examples. Other operating systems may require different commands.)
- Python Libraries: It's highly recommended to use a Python virtual environment. You can install all common dependencies used across the examples with a single command:
  ```bash
  pip install "transformers[audio,sentencepiece]" torch datasets soundfile librosa sentence-transformers Pillow torchvision timm requests pandas torch-scatter ftfy regex numpy torchaudio matplotlib SpeechRecognition protobuf
  ```
  - Note: `pytesseract` is not required. The `"transformers[audio,sentencepiece]"` extra pulls in the common audio dependencies and `sentencepiece`. Not every script requires all of these libraries, but installing them all ensures you can run most examples. Refer to the comments within each script for its minimal requirements. (A quick verification snippet follows below.)
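Once everything is installed, a quick sanity check (not part of the repository) can confirm that the core libraries import correctly and whether a CUDA-capable GPU is visible:

```python
# Minimal environment check: verify the core libraries and GPU visibility.
import torch
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```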
- Clone the Repository:
  ```bash
  git clone <repository-url>
  cd <repository-directory>
  ```
- Create Virtual Environment (Recommended):
  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  ```
  (Use `.\.venv\Scripts\activate` on Windows.)
- Install System Dependencies: Follow the instructions in the Prerequisites section if applicable for your OS (especially `libsndfile1` and `ffmpeg` on Ubuntu/Debian).
- Install Python Libraries: Run the combined pip command from the Prerequisites section within your activated virtual environment.
- Configure Script Inputs (IMPORTANT):
  - Many scripts require you to provide input inside the script, such as a path to a local image or audio file, specific text/questions, candidate labels, or table data.
  - Open the specific `.py` script you want to run in a text editor before executing it.
  - Look for comments indicating `USER ACTION REQUIRED` or variables like `user_image_path`, `user_audio_path`, `user_doc_image_path`, `question`, `candidate_labels`, `data` (for tables), `text_to_speak`, etc. (A hypothetical illustration follows these setup steps.)
  - Modify these variables according to the script's needs (e.g., provide a valid file path, change the question text, update labels, define table data). Some scripts include logic to download a sample file if a local one isn't found; read the script comments for details.
- Run the Script:
  - Execute the desired script using Python from your terminal (ensure your virtual environment is active):
    ```bash
    python <script_name>.py
    ```
    (e.g., `python run_sentiment.py`, `python run_docvqa.py`)
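As an illustration only (the actual variable names and defaults differ from script to script), the editable block you are looking for near the top of a script such as `run_vqa.py` might look something like this:

```python
# --- USER ACTION REQUIRED ---------------------------------------------------
# Hypothetical example of the editable block near the top of a script;
# the real variable names and defaults vary from script to script.
user_image_path = "/path/to/your/photo.jpg"      # point this at a local image
question = "How many people are in the picture?"  # the question to ask about it
# -----------------------------------------------------------------------------
```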
The first time you run a script using a specific Hugging Face model, the necessary model weights, configuration, and tokenizer/processor files will be automatically downloaded from the Hugging Face Hub and cached locally (usually in `~/.cache/huggingface/` or `C:\Users\<User>\.cache\huggingface\`). Subsequent runs using the same model will load directly from the cache, making them much faster and enabling offline use (provided all necessary files are cached).
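If you want to confirm that a later run really does work offline, one option (assuming the model is already cached from a previous online run) is to set the `TRANSFORMERS_OFFLINE` environment variable before loading anything:

```python
import os

# Tell transformers to use only locally cached files (no network access).
# This only succeeds if the model was downloaded on a previous, online run.
os.environ["TRANSFORMERS_OFFLINE"] = "1"

from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # resolved entirely from the local cache
print(classifier("This run used only cached files."))
```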
- CPU: Most scripts will run on a CPU, but performance (especially for larger models or complex tasks like vision, audio, generation) might be slow.
- GPU: An NVIDIA GPU with CUDA configured correctly and a compatible version of `torch` installed is highly recommended for significantly faster inference. The scripts include basic logic to attempt using the GPU if available (a short sketch follows this list).
- RAM: Models vary greatly in size, so ensure you have sufficient RAM. Smaller models might need 4-8 GB, while larger ones (such as `large` variants and vision/audio/document models) might require 16 GB or more.
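The exact device-selection code varies by script, but a common pattern with the pipeline API looks roughly like this:

```python
import torch
from transformers import pipeline

# Use the first CUDA GPU if one is available, otherwise fall back to the CPU.
device = 0 if torch.cuda.is_available() else -1

summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device=device)
```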
- The Python scripts in this repository are provided as examples, likely under the MIT License (or specify your chosen license).
- The Hugging Face libraries (`transformers`, `datasets`, etc.) are typically licensed under Apache 2.0.
- Individual models downloaded from the Hugging Face Hub have their own licenses. Please refer to the model card on the Hub for the specific terms of use for each model (note that some models, such as Donut or specific fine-tunes, may have non-commercial or other restrictions).