Understanding and effectively using sentence embeddings, a cornerstone of modern Natural Language Processing (NLP), is often challenging due to their "black-box" nature and the limitations of traditional aggregate evaluation metrics. This project addresses these challenges by providing a novel framework and Visual Analytics (VA) tool designed to give researchers deeper, interactive insight into how different embedding models, composition functions, and similarity metrics influence textual representations.
This work's core contribution is the VA tool itself, which integrates comprehensive visualizations with interactive filtering and detailed drill-down capabilities. To enable this granular analysis, we developed an experimental pipeline that systematically processes and normalizes the outputs of diverse embedding models, ensuring the data fed to the VA application is consistent and comparable. This framework is a valuable artifact in its own right: it supports reproducing the results and extending the tool's dataset, and it motivated the subsequent development of the VA tool.
The VA tool enhances embedding model interpretability by allowing visual exploration of embedding behavior across different configurations and layers. It facilitates systematic comparison across models, even those with disparate architectures, within a unified analytical environment. Crucially, it moves beyond aggregate performance by focusing on the error gap between predicted and actual similarity scores. Through detailed examples, interactive error gap heatmaps, and an Alternative Functions Heatmap for specific challenging instances, the tool enables fine-grained evaluation, revealing nuanced model strengths and limitations often obscured by summary statistics. This work provides an intuitive platform for diagnosing model failures, understanding representational biases, and fostering more informed decisions in the development and application of sentence embeddings.
- Interactive Visual Analytics: Explore sentence embedding behavior with rich, interactive visualizations.
- Error Gap Analysis: Focus on the discrepancies between predicted and actual similarity scores for in-depth model diagnosis.
- Comparative Analysis: Systematically compare diverse embedding models and composition functions within a unified environment.
- Drill-down Capabilities: Investigate specific challenging instances with detailed error gap heatmaps and alternative function analyses.
- Reproducible Experimental Framework: A robust pipeline for processing and normalizing embedding model outputs, ensuring consistent and comparable data.
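To make the error gap idea concrete, here is a minimal sketch (illustrative only, not the tool's actual code; the function names and the 0–5 STS-B gold scale are assumptions): the gap is the absolute difference between a model's predicted similarity and the gold similarity score, once both are on the same scale.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def error_gap(predicted, gold, gold_scale=5.0):
    """Absolute gap between a predicted similarity in [0, 1] and a
    gold STS-B score in [0, gold_scale], after rescaling the gold score."""
    return abs(predicted - gold / gold_scale)

# Toy sentence pair: high gold similarity, mediocre predicted similarity
pred = cosine_similarity([0.2, 0.9, 0.1], [0.3, 0.4, 0.8])
print(error_gap(pred, gold=4.6))
```

Aggregating these per-pair gaps into heatmaps (rather than a single correlation number) is what lets the tool surface specific instances where a model fails.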
This project is built with Python and Plotly Dash.
To set up the project, you will need a typical Python environment.
- Clone the repository:
git clone https://github.com/david-xander/visual-analytics-tool-sentence-embeddings
cd visual-analytics-tool-sentence-embeddings
- Install dependencies:
pip install -r requirements.txt
The experimental framework processes and normalizes embedding model outputs.
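As one common normalization approach, min-max rescaling maps raw similarity scores from different models onto a shared [0, 1] scale; the sketch below is a hedged illustration of the idea, not necessarily the exact method used by the pipeline.

```python
def min_max_normalize(scores):
    """Rescale a list of raw similarity scores to [0, 1] so that
    outputs of models with different score ranges become comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:  # degenerate case: all scores identical
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

print(min_max_normalize([0.12, 0.85, 0.43, 0.99]))
```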
To run the framework:
python run_experiment.py

The VA tool is a Plotly Dash application that runs as a web server.
To run the server:
python run_dashboard.py

Once the server is running, you can access the VA tool through your web browser, typically at http://127.0.0.1:8050/.
The datasets generated by the experimental framework, particularly the extended STS-B dataset with computed similarities and correlation results, are included within this GitHub repository to facilitate direct use with the VA tool.
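The correlation results mentioned above are typically Pearson (or Spearman) correlations between a model's predicted similarities and the gold STS-B scores. A minimal Pearson sketch, for illustration only (the framework's actual implementation may differ):

```python
import math

def pearson(x, y):
    """Pearson correlation between predicted and gold similarity scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

pred = [0.9, 0.2, 0.7, 0.4]   # toy predicted similarities
gold = [4.5, 1.0, 3.8, 2.2]   # toy gold STS-B scores
print(round(pearson(pred, gold), 3))
```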
Note: Intermediate embedding files (".pt" files mentioned in Chapter 3 of the thesis) are not included due to GitHub's file size limits. The current version of the VA tool does not include embeddings visualization through a 2D or 3D scatterplot view, making these files unnecessary for its functionality.
The composition functions used in this project are derived from the AllSpark project. I would like to acknowledge their valuable contribution:
- GitHub Repository: https://github.com/adriangh-ai/AllSpark
- Publication: Ghajari, Adrián, Victor Fresno, and Enrique Amigo. “Platform for exploring Semantic Composition from pre-trained Language Models and static embeddings.” In Proceedings of the Annual Conference of the Spanish Association for Natural Language Processing: Projects and Demonstrations (SEPLN-PD 2022), Vol-3224:52–56. A Coruña, Spain: CEUR-WS, 2022. http://ceur-ws.org/Vol-3224/paper13.pdf.