SlangLLM is a research project that focuses on detecting and filtering slang dynamically in user-provided text prompts. The system combines natural language processing (NLP), semantic similarity analysis, and toxicity classification to enhance safe communication and mitigate harm in interactions with large language models (LLMs).
SlangLLM was presented at the IEEE Conference on Secure and Trustworthy CyberInfrastructure for IoT and Microelectronics (SaTC 2025) to an audience of leading researchers, engineers, and practitioners. Recently, SlangLLM was accepted for publication in IEEE Xplore.
- Slang Detection: Identifies slang terms using Urban Dictionary integration.
- Contextual Filtering: Evaluates flagged terms for harmful or benign contexts using dependency parsing.
- Toxicity Classification: Leverages `unitary/toxic-bert` for sentence-level toxicity scoring.
- Dynamic Poison Level: Combines slang and toxicity scores to determine whether a prompt is safe.
- Real-Time Feedback: Expands flagged slang terms with their Urban Dictionary definitions for transparency.
The pipeline proceeds in stages:

1. Urban Dictionary Lookup: Fetches definitions and popularity metrics (upvotes) for terms.
2. Slang Scoring: Assigns each term a score based on:
   - Frequency analysis (rare words are more likely to be slang).
   - Part-of-speech tagging.
   - Semantic similarity to harmful concepts.
3. Contextual Filtering: Analyzes the syntactic roles of terms (e.g., direct object, subject) and their semantic relevance to harmful contexts.
4. Toxicity Classification: Classifies overall sentence toxicity using the `unitary/toxic-bert` model.
5. Filtering: Blocks prompts exceeding a customizable poison level threshold (default: 3.0).
6. LLM Forwarding: Approved prompts are sent to an LLM (e.g., Google FLAN-T5) via the Hugging Face Inference API, as sketched below.
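As a rough illustration of the final gating step, here is a minimal sketch assuming a plain `requests` call to the Hugging Face Inference API with `google/flan-t5-large` as the model id; the function name, model id, and return shape are assumptions, not the repository's exact code:

```python
import requests

# Assumed model id; the repository may use a different FLAN-T5 variant.
API_URL = "https://api-inference.huggingface.co/models/google/flan-t5-large"

def gate_prompt(prompt: str, poison_level: float, api_key: str,
                threshold: float = 3.0) -> dict:
    """Hypothetical gate: block unsafe prompts, forward the rest to the LLM."""
    if poison_level > threshold:
        # Blocked prompts never reach the model.
        return {"action": "Blocked", "poison_level": poison_level}
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"inputs": prompt},
        timeout=30,
    )
    return {"action": "Allowed", "llm_output": response.json()}
```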
- Clone the repository:

  ```bash
  $ git clone https://github.com/lakshRP/SlangLLM.git
  $ cd SlangLLM
  ```
- Install required dependencies:

  ```bash
  $ pip install -r requirements.txt
  ```
- Download NLTK resources:

  ```python
  import nltk
  nltk.download("brown")
  ```
- Add your Hugging Face API key to the script:

  ```python
  api_key = "your_huggingface_api_key"
  ```
- Run the script with test prompts:

  ```bash
  $ python slangllm.py
  ```
- Review the output:
  - Blocked prompts with reasons and flagged terms.
  - Approved prompts sent to the LLM.
Example test prompts:
- "I'm going to take a shot at learning this skill."
- "I want booze."
- "How do I hack into someone's account?"
| Prompt | Poison Level | Action |
|---|---|---|
| I'm going to take a shot at learning this skill. | 0.01 | Allowed |
| I want booze. | 4.54 | Blocked |
| How do I hack into someone's account? | 0.03 | Allowed |
The slang scoring mechanism uses multiple components to compute a confidence score for each term:
The inverse logarithmic function ensures that rarer words (less frequent in the Brown corpus) are assigned higher scores. The equation is as follows:
Score_freq = freq_weight × (1 - log(frequency + ε) / log(max_frequency))
Where:
- `frequency` is the word's frequency in the Brown corpus.
- `ε` is a small constant that prevents taking the logarithm of zero.
- `max_frequency` is the maximum word frequency in the corpus.
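A minimal sketch of this component, assuming `frequency` means the raw Brown-corpus count and `freq_weight` defaults to 1.0 (both assumptions; the repository's exact weights may differ):

```python
import math
from nltk import FreqDist
from nltk.corpus import brown

# Raw word counts over the Brown corpus (lowercased).
fdist = FreqDist(w.lower() for w in brown.words())
max_frequency = fdist[fdist.max()]  # count of the most common word

def frequency_score(word: str, freq_weight: float = 1.0,
                    eps: float = 1e-10) -> float:
    # Unseen words make log(frequency + eps) strongly negative,
    # so rare words receive the highest scores.
    frequency = fdist[word.lower()]
    return freq_weight * (1 - math.log(frequency + eps) / math.log(max_frequency))
```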
Each word is assigned a part-of-speech (POS) score based on its likelihood of being slang:
Score_POS = POS_weight × POS_score(tag)
Where:
- `POS_score(tag)` is a predefined weight for the given POS tag.
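For illustration, a sketch with hypothetical tag weights (the actual `pos_scores` table lives in the script; these values are placeholders):

```python
# Hypothetical POS weights: interjections and nouns are treated as
# more likely to be slang than adjectives or verbs.
pos_scores = {"UH": 0.9, "NN": 0.8, "JJ": 0.6, "VB": 0.4}

def pos_component(pos_tag: str, pos_weight: float = 1.0) -> float:
    # Unknown tags fall back to a low default of 0.2,
    # mirroring the code fragment shown later in this section.
    return pos_weight * pos_scores.get(pos_tag, 0.2)
```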
Similarity between Urban Dictionary definitions and harmful concepts (e.g., "violence") is computed using spaCy embeddings. If the similarity exceeds a threshold (e.g., 0.3), the score is adjusted:
Score_semantic = urban_weight × (1 + upvotes / 100) (if similarity > 0.3)
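A sketch of this check using spaCy's vector similarity, assuming a model with word vectors such as `en_core_web_md` (the model choice and default weights are assumptions):

```python
import spacy

# Requires a model with word vectors:  python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")
harmful_concept = nlp("violence")

def semantic_component(definition: str, upvotes: int,
                       urban_weight: float = 1.0,
                       threshold: float = 0.3) -> float:
    # Compare the Urban Dictionary definition against a harmful concept.
    similarity = nlp(definition).similarity(harmful_concept)
    if similarity > threshold:
        # Popular definitions (more upvotes) amplify the score.
        return urban_weight * (1 + upvotes / 100)
    return 0.0
```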
The poison level combines slang scores and toxicity classification into a final measure:
Poison Level = min((average_score × num_slang_terms) + (toxicity_score × 5), 10)
Where:
- `average_score` is the mean slang score of flagged terms.
- `num_slang_terms` is the number of flagged terms.
- `toxicity_score` is the output of the toxicity classifier.
This ensures a maximum poison level of 10 to cap severity.
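As a worked example with hypothetical values: two flagged terms with a mean slang score of 1.5 and a toxicity score of 0.3 give min((1.5 × 2) + (0.3 × 5), 10) = 4.5, which exceeds the default threshold of 3.0, so the prompt is blocked.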
In code, these components reduce to the following fragments:

```python
# Frequency component: inverse-logarithmic scaling of the word's
# frequency against a fixed reference value of 1e-4.
score += freq_weight * (1 - (math.log(frequency + 1e-10) / math.log(1e-4)))

# Part-of-speech component: unknown tags fall back to a default of 0.2.
score += pos_weight * pos_scores.get(pos_tag, 0.2)

# Semantic component: applied only above the 0.3 similarity threshold;
# definition popularity (upvotes) amplifies the score.
if similarity_to_harmful > 0.3:
    score += urban_weight * (1 + (urban_upvotes / 100))

# Final poison level, capped at 10.
poison_level = min((average_score * num_slang_terms) + (toxicity_score * 5), 10.0)
```
- Multi-Language Support: Extend slang detection to other languages.
- Advanced Contextual Analysis: Use embeddings for deeper semantic understanding.
- Dynamic User Configurations: Allow customizable thresholds and settings.
SlangLLM contributes to:
- Cultural Linguistics: Understanding slang usage across contexts.
- Content Moderation: Automated filtering for inappropriate language.
- Model Safety: Preventing misuse of NLP applications.
- Laksh Rajnikant Patel, Illinois Mathematics and Science Academy - Author and Developer (GitHub Profile)
- Dr. Anas Alsobeh, Southern Illinois University, Carbondale - Author and Developer (GitHub Profile)
This project is licensed under the MIT License. See the LICENSE file for details.
For any questions or contributions, feel free to open an issue or contact the authors.