This Python project processes audio transcriptions and classifies them by aggression and hate speech category using language models from several providers.
- Binary Classification: Determines if text is aggressive (0 = not aggressive, 1 = aggressive).
- Multiclass Classification: Categorizes text into one of five classes:
  - 0: Racism (attacks based on race, nationality, or religion)
  - 1: Sexism (attacks directed at women, gender roles, objectification)
  - 2: Hate Speech (general hate speech not fitting other categories)
  - 3: Vulgarism (vulgar language without targeted attacks)
  - 4: Neutral (non-aggressive language)
- Supports multiple model providers: Local (Ollama), OpenAI, and Google Gemini.
- Processes input files from the `input/` directory and writes results to `output/`.
- Handles errors gracefully (e.g., unreachable models, invalid responses).
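The label scheme and the handling of invalid responses described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the names `Label`, `parse_binary`, `parse_multiclass`, and the `INVALID_RESPONSE` sentinel are assumptions, and the real scripts may use different error codes.

```python
import re
from enum import IntEnum


class Label(IntEnum):
    """Multiclass labels as listed in this README."""
    RACISM = 0
    SEXISM = 1
    HATE_SPEECH = 2
    VULGARISM = 3
    NEUTRAL = 4


# Hypothetical sentinel for unusable replies; the project's real
# error codes are not documented here.
INVALID_RESPONSE = -1


def _first_standalone_digit(reply: str, allowed: str) -> int:
    """Return the first digit from `allowed` that is not part of a longer number."""
    match = re.search(rf"(?<!\d)[{allowed}](?!\d)", reply)
    return int(match.group()) if match else INVALID_RESPONSE


def parse_binary(reply: str) -> int:
    """Map a model reply to 0 or 1, or INVALID_RESPONSE."""
    return _first_standalone_digit(reply, "01")


def parse_multiclass(reply: str) -> int:
    """Map a model reply to a label 0-4, or INVALID_RESPONSE."""
    return _first_standalone_digit(reply, "0-4")
```

Returning a sentinel instead of raising keeps one bad model reply from aborting a whole input file, which is one plausible reading of "handles errors gracefully".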
- Python 3.7+
- Required packages (install via `pip install -r requirements.txt`):
  - `litellm`
  - `requests`
  - `openai` (for OpenAI provider)
  - `google-genai` (for Gemini provider)
- For local models: Ollama server running with supported models (e.g., llama3.3, mistral-large).
- API keys for OpenAI and Gemini if using those providers.
- Clone the repository: run `git clone <repository-url>`, then `cd Classification`.
- Install dependencies: `pip install -r requirements.txt`
- Set up API keys (if using OpenAI or Gemini):
  - Edit the script files to set `OPENAI_API_KEY` and `GEMINI_API_KEY`, or set environment variables.
- For local models, ensure Ollama is installed and running:
  - Install Ollama from ollama.ai.
  - Pull required models: `ollama pull llama3.3`, etc.
  - Set `SELF_HOSTED_MODELS_URL` to your Ollama API base (e.g., `http://localhost:11434`).
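For the environment-variable route, the three settings named above could be collected in one place. The sketch below is an assumption about how that might look; `load_settings` and `DEFAULT_OLLAMA_URL` are hypothetical names, and the scripts themselves may simply define these values as module-level constants.

```python
import os

# Default taken from the example URL in this README.
DEFAULT_OLLAMA_URL = "http://localhost:11434"


def load_settings(env=None):
    """Read provider settings from a mapping (defaults to os.environ).

    Missing API keys come back as None so the caller can skip that provider.
    """
    env = os.environ if env is None else env
    return {
        "openai_api_key": env.get("OPENAI_API_KEY"),
        "gemini_api_key": env.get("GEMINI_API_KEY"),
        # Strip a trailing slash so paths can be appended consistently.
        "self_hosted_models_url": env.get(
            "SELF_HOSTED_MODELS_URL", DEFAULT_OLLAMA_URL
        ).rstrip("/"),
    }
```

Reading keys from the environment avoids committing secrets to the repository, which editing the script files directly would risk.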
- Place input files in the `input/` directory. Each file should contain lines in the format `filename;text` (semicolon-separated).
- Run the classification script:
  - For one-shot classification: `python classification-oneshot.py`
  - For zero-shot classification: `python classification-zeroshot.py`
- Results are saved in the `output/` directory as CSV files, one per model and input file.
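Reading the `filename;text` input format can be sketched as below. The function name `parse_input_lines` is hypothetical, and two behaviours are assumptions rather than documented facts: each line is split at the first semicolon only (so the text itself may contain semicolons), and blank or malformed lines are skipped.

```python
from typing import Iterable, Iterator, Tuple


def parse_input_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (filename, text) pairs from lines of the form 'filename;text'."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        filename, sep, text = line.partition(";")
        if not sep:
            continue  # skip lines without a semicolon (assumed malformed)
        yield filename.strip(), text.strip()
```

A file in `input/` could then be read with `with open(path, encoding="utf-8") as f: pairs = list(parse_input_lines(f))`.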
Output files are CSV with semicolon-separated values:
- `filename`: Original filename
- `binary_aggressive`: 0 or 1 (or error codes)
- `multiclass_label`: 0-4 (or error codes)
- `text`: Sanitized input text
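Writing rows in this layout might look like the following sketch, which uses the standard `csv` module with a semicolon delimiter. The helper `write_results` is hypothetical, and whether the real output files include a header row is an assumption.

```python
import csv

# Column order as listed in this README.
FIELDS = ["filename", "binary_aggressive", "multiclass_label", "text"]


def write_results(rows, stream, header=True):
    """Write dict rows to `stream` as semicolon-separated CSV."""
    writer = csv.writer(stream, delimiter=";", lineterminator="\n")
    if header:  # assumption: the real files may or may not carry a header
        writer.writerow(FIELDS)
    for row in rows:
        writer.writerow([row[field] for field in FIELDS])
```

With the default quoting, `csv.writer` wraps any field containing a semicolon in double quotes, so the column count stays stable even if sanitization leaves one in the text.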
- Local: llama3.3, mistral-large, SpeakLeash/bielik-11b-v2.3-instruct:Q8_0
- OpenAI: gpt-5 (adjust as needed)
- Gemini: gemini-2.5-flash
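If the models are called through `litellm`, the provider is usually encoded as a prefix on the model name. The sketch below builds those prefixed names for the models listed above. It is an assumption that the scripts route every provider through `litellm`; they may instead call the `openai` and `google-genai` SDKs directly. `MODELS` and `litellm_model_names` are hypothetical names.

```python
# Provider prefix -> model names, taken from the list in this README.
MODELS = {
    "ollama": [
        "llama3.3",
        "mistral-large",
        "SpeakLeash/bielik-11b-v2.3-instruct:Q8_0",
    ],
    "openai": ["gpt-5"],
    "gemini": ["gemini-2.5-flash"],
}


def litellm_model_names(models=MODELS):
    """Return 'provider/model' strings in the form litellm expects."""
    return [
        f"{provider}/{name}"
        for provider, names in models.items()
        for name in names
    ]
```

Each string could then be passed as the `model` argument of `litellm.completion`, with `api_base` set to `SELF_HOSTED_MODELS_URL` for the Ollama entries.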