This project explores a novel approach to generating sentence embeddings by combining N-gram features with contrastive learning. It aims to improve semantic representation across multilingual datasets, especially for low-resource languages.
- Sentence embeddings using character and word N-grams
- Contrastive learning framework to align similar sentences
- Works across multilingual and domain-diverse datasets
- Evaluation using cosine similarity and clustering visualization (see the sketch below)
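As a rough illustration of the first and last bullets, here is a minimal sketch that builds sentence vectors from character and word N-grams and compares them with cosine similarity. It assumes only the scikit-learn and NumPy dependencies from the setup step; the N-gram ranges and vectorizer settings are illustrative and may differ from the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The boy who lived.",
    "El niño que vivió.",
    "Completely unrelated text.",
]

# Character N-grams (bounded at word edges) transfer well across languages
# that share a script; word N-grams capture short phrases.
char_vec = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))
word_vec = CountVectorizer(analyzer="word", ngram_range=(1, 2))

# Concatenate both feature spaces into a single sentence vector.
X = np.hstack([
    char_vec.fit_transform(sentences).toarray(),
    word_vec.fit_transform(sentences).toarray(),
]).astype(float)

# Pairwise cosine similarity, as in the evaluation step.
print(cosine_similarity(X))
```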
Datasets:
- Harry Potter and The Half-Blood Prince.txt – Fictional narrative
- bible.txt – Multilingual structured text
- Dataset25.txt – Custom multilingual dataset
- Clone the repository
git clone https://github.com/MS134340/Sentence-Embeddings-Using-N-Gram-Features-and-Contrastive-Learning-for-Multilingual-Datasets.git
cd Sentence-Embeddings-Using-N-Gram-Features-and-Contrastive-Learning-for-Multilingual-Datasets
- Install required libraries
pip install numpy pandas scikit-learn matplotlib seaborn
- Run the notebook: open and execute the .ipynb file in Jupyter or Google Colab.
- Outperformed simple TF-IDF and averaging methods.
- Embeddings better captured sentence-level semantic similarity.
- Contrastive loss improved clustering of similar meanings (visualized in the sketch below).
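The clustering result above is typically inspected visually. A sketch of one common approach, using a PCA projection plus the seaborn/matplotlib dependencies from the setup step (the notebook may use a different projection or plot style; the data here is a random stand-in):

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(40, 16))   # stand-in sentence vectors
labels = np.repeat(np.arange(4), 10)     # stand-in semantic groups

# Project embeddings to 2D and color points by group; tight, well-separated
# clusters indicate that similar meanings map to nearby vectors.
coords = PCA(n_components=2).fit_transform(embeddings)
sns.scatterplot(x=coords[:, 0], y=coords[:, 1], hue=labels, palette="deep")
plt.title("Sentence embeddings projected to 2D")
plt.show()
```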
Techniques Used
- N-gram Feature Engineering
- Sentence Vector Aggregation
- Supervised Contrastive Loss (sketched below)
- Cosine Similarity Evaluation
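For reference, a minimal NumPy sketch of a supervised contrastive loss in the style of Khosla et al. (2020). The temperature value and the toy batch are illustrative assumptions, not the notebook's exact settings:

```python
import numpy as np

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Mean supervised contrastive loss over a batch of sentence vectors."""
    # L2-normalize so dot products are cosine similarities.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    # Exclude each anchor from its own softmax denominator.
    sim = np.where(self_mask, -np.inf, sim)
    # Log-softmax over all other samples for each anchor.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    # Positives share the anchor's label (the anchor itself is excluded).
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    per_anchor = -np.where(pos, log_prob, 0.0).sum(axis=1) / np.maximum(pos.sum(axis=1), 1)
    return per_anchor.mean()

rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 1])               # two toy semantic classes
print(supervised_contrastive_loss(rng.normal(size=(4, 16)), labels))
```

Minimizing this loss pulls same-label sentence vectors together and pushes different-label vectors apart, which is what drives the clustering improvement reported above.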