A visual analytics tool and framework for exploring compositionality in sentence embeddings. Gain interactive insights into how embedding models, composition functions, and similarity metrics influence textual representations, focusing on error gap analysis for enhanced model interpretability.

david-xander/visual-analytics-tool-sentence-embeddings

A Framework and Visual Analytics Tool for Exploring Compositionality in Sentence Embeddings

Sentence embeddings are a cornerstone of modern Natural Language Processing (NLP), yet they are hard to understand and use effectively because of their "black-box" nature and the limitations of traditional aggregate evaluation metrics. This project addresses these challenges with a Framework and Visual Analytics (VA) Tool that gives researchers deeper, interactive insight into how embedding models, composition functions, and similarity metrics influence textual representations.


Project Overview

This work's core contribution is the VA tool itself, which integrates visualizations with interactive filtering and detailed drill-down capabilities. To enable this granular analysis, we developed an experimental framework (a pipeline) that systematically processes and normalizes the outputs of diverse embedding models, so that the VA application receives consistent and comparable data. The framework is a useful artifact in its own right: it makes the results reproducible and allows the tool's dataset to be extended.

The VA tool enhances embedding model interpretability by allowing visual exploration of embedding behavior across different configurations and layers. It facilitates systematic comparison across models, even those with disparate architectures, within a unified analytical environment. Crucially, it moves beyond aggregate performance by focusing on the error gap between predicted and actual similarity scores. Through detailed examples, interactive error gap heatmaps, and an Alternative Functions Heatmap for specific challenging instances, the tool enables fine-grained evaluation, revealing nuanced model strengths and limitations often obscured by summary statistics. This work provides an intuitive platform for diagnosing model failures, understanding representational biases, and fostering more informed decisions in the development and application of sentence embeddings.
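As a rough illustration of the error gap idea, the sketch below computes a signed gap per sentence pair and ranks the pairs by how far the prediction is from the gold score. It is not taken from the repository: the function names `error_gaps` and `worst_pairs`, the example values, and the choice to normalize gold scores from the STS-B 0-5 scale to [0, 1] are all assumptions made for this example.

```python
# Hypothetical sketch of a per-pair error gap; not the repository's code.
# Assumption: gold scores use the STS-B 0-5 scale and predictions lie in [0, 1].

def error_gaps(predicted, gold, gold_max=5.0):
    """Return signed gaps (predicted - normalized gold) for each sentence pair."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    return [p - g / gold_max for p, g in zip(predicted, gold)]


def worst_pairs(gaps, k=3):
    """Return the indices of the k pairs with the largest absolute gap."""
    order = sorted(range(len(gaps)), key=lambda i: abs(gaps[i]), reverse=True)
    return order[:k]


# Illustrative values only (not results from the tool).
predicted = [0.92, 0.40, 0.85, 0.10]  # model similarity per sentence pair
gold = [4.8, 3.5, 1.0, 0.5]           # human-annotated similarity per pair

gaps = error_gaps(predicted, gold)
print(worst_pairs(gaps, k=2))  # indices of the two hardest pairs
```

A positive gap means the model overestimates similarity for that pair and a negative gap means it underestimates it; a drill-down view would typically start from the pairs returned by `worst_pairs`.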


Features

  • Interactive Visual Analytics: Explore sentence embedding behavior with rich, interactive visualizations.
  • Error Gap Analysis: Focus on the discrepancies between predicted and actual similarity scores for in-depth model diagnosis.
  • Comparative Analysis: Systematically compare diverse embedding models and composition functions within a unified environment.
  • Drill-down Capabilities: Investigate specific challenging instances with detailed error gap heatmaps and alternative function analyses.
  • Reproducible Experimental Framework: A robust pipeline for processing and normalizing embedding model outputs, ensuring consistent and comparable data.
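
The comparison of composition functions and similarity metrics listed above can be pictured as a small grid of error gaps for a single sentence pair, which is roughly the kind of data a heatmap over alternative functions would display. The sketch below is an assumption-laden illustration: `compose_sum`, `compose_avg`, `euclidean_sim`, and `gap_grid` are hypothetical names, the two composition functions are simple stand-ins rather than the AllSpark implementations, and the token vectors are toy values.

```python
import math

# Hypothetical composition functions: combine token vectors into one sentence vector.
def compose_sum(tokens):
    return [sum(dim) for dim in zip(*tokens)]


def compose_avg(tokens):
    return [sum(dim) / len(tokens) for dim in zip(*tokens)]


# Hypothetical similarity metrics.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def euclidean_sim(u, v):
    """Map Euclidean distance into (0, 1]; 1.0 means identical vectors."""
    return 1.0 / (1.0 + math.dist(u, v))


COMPOSITIONS = {"sum": compose_sum, "avg": compose_avg}
METRICS = {"cosine": cosine, "euclidean": euclidean_sim}


def gap_grid(tokens_a, tokens_b, gold, gold_max=5.0):
    """Return {(composition, metric): predicted - normalized gold} for one pair."""
    target = gold / gold_max  # assumption: gold is on the STS-B 0-5 scale
    grid = {}
    for cname, compose in COMPOSITIONS.items():
        vec_a, vec_b = compose(tokens_a), compose(tokens_b)
        for mname, metric in METRICS.items():
            grid[(cname, mname)] = metric(vec_a, vec_b) - target
    return grid


# Toy token embeddings for two sentences of different length.
tokens_a = [[1.0, 0.0], [0.0, 1.0]]
tokens_b = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

grid = gap_grid(tokens_a, tokens_b, gold=4.0)
for key, gap in sorted(grid.items()):
    print(key, round(gap, 3))
```

In this toy case the cosine gap is the same for sum and average, because cosine ignores vector length, while the Euclidean-based gap changes with the composition function. Differences of this kind are what a per-instance grid makes visible.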

Getting Started

This project is built with Python and Plotly Dash.

Prerequisites

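To set up the project, you will need a working Python 3 installation with pip. Installing into a virtual environment is recommended so that the dependencies stay isolated from other projects.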

Installation

  1. Clone the repository:
    git clone https://github.com/david-xander/visual-analytics-tool-sentence-embeddings
    cd visual-analytics-tool-sentence-embeddings
  2. Install dependencies:
    pip install -r requirements.txt

Running the Experimental Framework

The experimental framework processes and normalizes embedding model outputs.

To run the framework:

python run_experiment.py

Running the Visual Analytics Tool

The VA tool is a Plotly Dash application that runs as a local web server.

To run the server:

python run_dashboard.py

Once the server is running, you can access the VA tool through your web browser, typically at http://127.0.0.1:8050/.


Data

The datasets generated by the experimental framework, particularly the extended STS-B dataset with computed similarities and correlation results, are included within this GitHub repository to facilitate direct use with the VA tool.
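
To make the relationship between correlation results and per-pair similarities concrete, the following sketch computes a Pearson correlation by hand and compares it with the individual gaps. It assumes Pearson correlation and a 0-5 gold scale purely for illustration; the repository may report a different coefficient (for example Spearman), and the `pearson` helper and the data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (std_x * std_y)


# Illustrative values only: predictions are linear in gold but compressed upward.
gold = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [0.50, 0.60, 0.70, 0.80, 0.90, 1.00]

r = pearson(predicted, gold)
gaps = [p - g / 5.0 for p, g in zip(predicted, gold)]

print(round(r, 3))                     # aggregate score
print([round(g, 2) for g in gaps])     # per-pair error gaps
```

Here the correlation is essentially perfect even though the least similar pairs are overestimated by up to half of the score range, which is the kind of detail an aggregate score hides and a per-pair view exposes.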

Note: Intermediate embedding files (the ".pt" files mentioned in Chapter 3 of the thesis) are not included because of GitHub's file size limits. The current version of the VA tool does not visualize the embeddings themselves (for example in a 2D or 3D scatterplot view), so these files are not needed to run it.


Composition Functions

The composition functions used in this project are derived from the AllSpark project, whose authors' contribution is gratefully acknowledged:

  • GitHub Repository: https://github.com/adriangh-ai/AllSpark
  • Publication: Ghajari, Adrián, Victor Fresno, and Enrique Amigo. “Platform for exploring Semantic Composition from pre-trained Language Models and static embeddings.” In Proceedings of the Annual Conference of the Spanish Association for Natural Language Processing: Projects and Demonstrations (SEPLN-PD 2022), Vol-3224:52–56. A Coruña, Spain: CEUR-WS, 2022. http://ceur-ws.org/Vol-3224/paper13.pdf.
