
Sentence Embeddings Using N-Gram Features and Contrastive Learning for Multilingual Datasets

This project explores a novel approach to generating sentence embeddings by leveraging N-gram features combined with contrastive learning. It aims to improve semantic representation across multilingual datasets, especially for low-resource languages.


πŸš€ Features

  • Sentence embeddings using character and word N-grams
  • Contrastive learning framework to align similar sentences
  • Works across multilingual and domain-diverse datasets
  • Evaluation using cosine similarity and clustering visualization
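The first two bullets can be sketched with scikit-learn, which is already in the dependency list. The exact N-gram ranges, weighting scheme, and similarity pipeline used in the notebook are assumptions here; this is a minimal illustration, not the project's implementation:

```python
# Sketch: character N-gram sentence vectors plus cosine-similarity evaluation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "A fast brown fox leaps over a sleepy dog.",
    "Stock prices fell sharply on Monday.",
]

# 'char_wb' builds character N-grams within word boundaries, which works
# across scripts and is robust for low-resource languages.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
embeddings = vectorizer.fit_transform(sentences)

# Paraphrase pair (rows 0 and 1) should score higher than unrelated pair.
sims = cosine_similarity(embeddings)
print(sims.round(2))
```

Character N-grams sidestep tokenization differences between languages; word N-grams can be added by stacking a second vectorizer with `analyzer="word"`.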

πŸ“ Datasets Used

  • Harry Potter and The Half-Blood Prince.txt – Fictional narrative
  • bible.txt – Multilingual structured text
  • Dataset25.txt – Custom multilingual dataset

πŸ› οΈ Setup & Usage

  1. Clone the repository
    git clone https://github.com/MS134340/Sentence-Embeddings-Using-N-Gram-Features-and-Contrastive-Learning-for-Multilingual-Datasets.git
    cd Sentence-Embeddings-Using-N-Gram-Features-and-Contrastive-Learning-for-Multilingual-Datasets
    
  2. Install required libraries
    pip install numpy pandas scikit-learn matplotlib seaborn
    
  3. Run the notebook: open and execute the .ipynb file in Jupyter or Google Colab.

πŸ“Š Results Summary

  • Outperformed simple TF-IDF and averaging methods.
  • Embeddings better captured sentence-level semantic similarity.
  • Contrastive loss improved clustering of similar meanings.
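The clustering claim above can be checked visually along the lines below. The data here is synthetic (two Gaussian blobs standing in for embeddings of two meaning groups), and the projection/clustering choices (PCA, KMeans) are assumptions, since the notebook's exact visualization code is not shown:

```python
# Sketch: cluster sentence embeddings and visualize them in 2D.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Two synthetic "meaning" clusters standing in for real sentence embeddings.
embeddings = np.vstack([
    rng.normal(0.0, 0.3, size=(20, 16)),
    rng.normal(2.0, 0.3, size=(20, 16)),
])

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
points = PCA(n_components=2).fit_transform(embeddings)  # project to 2D for plotting

plt.scatter(points[:, 0], points[:, 1], c=clusters)
plt.title("Sentence embedding clusters (PCA projection)")
plt.savefig("clusters.png")
```

With a good embedding space, sentences with similar meanings should land in the same cluster, which is what the plot makes visible.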

🧠 Techniques Used

  • N-gram Feature Engineering
  • Sentence Vector Aggregation
  • Supervised Contrastive Loss
  • Cosine Similarity Evaluation
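The supervised contrastive loss listed above is typically the formulation of Khosla et al. (2020): embeddings with the same label are pulled together, all others pushed apart. A NumPy sketch follows; the temperature value and the assumption of label supervision are illustrative, not taken from the notebook:

```python
import numpy as np

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over L2-normalized embeddings.

    For each anchor, positives are all other samples sharing its label;
    the loss is the negated mean log-softmax probability of the positives.
    """
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature                        # scaled pairwise similarities
    n = len(labels)
    mask = ~np.eye(n, dtype=bool)                      # exclude self-pairs
    exp_sim = np.exp(sim) * mask                       # denominator terms, no self
    log_prob = sim - np.log(exp_sim.sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]) & mask  # positive-pair mask
    per_anchor = -(log_prob * pos).sum(axis=1) / np.maximum(pos.sum(axis=1), 1)
    return per_anchor.mean()

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 8))
labels = np.array([0, 0, 0, 1, 1, 1])
loss_value = supervised_contrastive_loss(emb, labels)
print(loss_value)
```

In training, this loss would be minimized by gradient descent over the embedding parameters; here it is shown only as a forward computation.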
