Add node embeddings visualization using t-SNE
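
This commit reduces the Fast Random Projection node embeddings to two dimensions with scikit-learn's `TSNE` so they can be drawn as a scatter plot. Below is a minimal sketch of that reduction step. It assumes randomly generated vectors in place of the embeddings streamed from Neo4j; the names `sample_embeddings` and `coordinates`, the sample size and the `perplexity` value are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.manifold import TSNE

# Assumption: 50 random vectors with 64 floats each stand in for the
# "embedding" column that the notebook reads from the graph database.
random_generator = np.random.default_rng(0)
sample_embeddings = [random_generator.normal(size=64).tolist() for _ in range(50)]

# Convert the list of lists to a 2D numpy array before calling fit_transform,
# as the notebook does.
embeddings_as_numpy_array = np.array(sample_embeddings)

# Reduce the 64 dimensions to 2. The perplexity has to be smaller than the
# number of samples; 10 is an arbitrary choice for this small example.
t_sne = TSNE(n_components=2, perplexity=10, random_state=50)
coordinates = t_sne.fit_transform(embeddings_as_numpy_array)

print(embeddings_as_numpy_array.shape)
print(coordinates.shape)
```

Each row of `coordinates` holds the x and y position of one node. The notebook combines these positions with the community id (color) and centrality (size) for plotting.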

JohT · JohT · Oct 3, 2023 · Sep 27, 2023 · Oct 2, 2023 · Oct 3, 2023
commit cf3fec7536dca6fa796e8d95e468df341f0eeaae
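
The notebook text notes that random projection reduces dimensionality while preserving pairwise distances. The following is a rough sketch of that idea with a dense Gaussian projection in NumPy. It is not the Fast RP algorithm from Neo4j Graph Data Science, which works on the graph structure itself. The function names, sizes and seeds are assumptions chosen for illustration.

```python
import numpy as np


def random_projection(features: np.ndarray, target_dimension: int, seed: int = 0) -> np.ndarray:
    """Project the rows of `features` to `target_dimension` with a random Gaussian matrix."""
    random_generator = np.random.default_rng(seed)
    # Scaling by 1/sqrt(target_dimension) keeps the expected squared length unchanged.
    projection = random_generator.normal(size=(features.shape[1], target_dimension))
    return features @ (projection / np.sqrt(target_dimension))


def pairwise_distances(points: np.ndarray) -> np.ndarray:
    """Return the Euclidean distances of all distinct pairs of rows."""
    differences = points[:, None, :] - points[None, :, :]
    distances = np.linalg.norm(differences, axis=-1)
    return distances[np.triu_indices(len(points), k=1)]


# Hypothetical input: 20 points with 500 features each.
original = np.random.default_rng(42).normal(size=(20, 500))
projected = random_projection(original, target_dimension=64)

# Ratios close to 1 mean the distance between a pair is roughly preserved.
distance_ratios = pairwise_distances(projected) / pairwise_distances(original)
print(projected.shape)
print(distance_ratios.min(), distance_ratios.max())
```

The ratios scatter around 1 rather than equal it exactly, which is why the preservation of distances is only approximate.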
diff --git a/jupyter/NodeEmbeddings.ipynb b/jupyter/NodeEmbeddings.ipynb
@@ -0,0 +1,311 @@
+{
+ "cells": [
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "2f0eabc4",
+   "metadata": {},
+   "source": [
+    "# Node Embeddings\n",
+    "\n",
+    "Here we will have a look at node embeddings and how to further reduce their dimensionality to be able to visualize them in a 2D plot. \n",
+    "\n",
+    "### Note about data dependencies\n",
+    "\n",
+    "PageRank centrality and Leiden community are also fetched from the Graph and need to be calculated first.\n",
+    "This makes it easier to see in the visualization if the embeddings approximate the structural information of the graph.\n",
+    "If these properties are missing you will only see black dots all of the same size without community coloring.\n",
+    "In the future, it might make sense to also run a community detection algorithm within this notebook so that it does not depend on the order of execution.\n",
+    "\n",
+    "<br>  \n",
+    "\n",
+    "### References\n",
+    "- [jqassistant](https://jqassistant.org)\n",
+    "- [Neo4j Python Driver](https://neo4j.com/docs/api/python-driver/current)\n",
+    "- [Tutorial: Applied Graph Embeddings](https://neo4j.com/developer/graph-data-science/applied-graph-embeddings)\n",
+    "- [Visualizing the embeddings in 2D](https://github.com/openai/openai-cookbook/blob/main/examples/Visualizing_embeddings_in_2D.ipynb)\n",
+    "- [Fast Random Projection](https://neo4j.com/docs/graph-data-science/current/machine-learning/node-embeddings/fastrp)\n",
+    "- [scikit-learn TSNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html#sklearn.manifold.TSNE)\n",
+    "- [AttributeError: 'list' object has no attribute 'shape'](https://bobbyhadz.com/blog/python-attributeerror-list-object-has-no-attribute-shape)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4191f259",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import pandas as pd\n",
+    "import matplotlib.pyplot as plot\n",
+    "import typing as typ\n",
+    "import numpy as np\n",
+    "from sklearn.manifold import TSNE\n",
+    "from neo4j import GraphDatabase"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f8ef41ff",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sklearn\n",
+    "print('The scikit-learn version is {}.'.format(sklearn.__version__))\n",
+    "print('The pandas version is {}.'.format(pd.__version__))\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1c5dab37",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Please set the environment variable \"NEO4J_INITIAL_PASSWORD\" in your shell \n",
+    "# before starting jupyter notebook to provide the password for the user \"neo4j\". \n",
+    "# It is not recommended to hardcode the password into jupyter notebook for security reasons.\n",
+    "\n",
+    "driver = GraphDatabase.driver(uri=\"bolt://localhost:7687\", auth=(\"neo4j\", os.environ.get(\"NEO4J_INITIAL_PASSWORD\")))\n",
+    "driver.verify_connectivity()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c1db254b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def get_cypher_query_from_file(filename):\n",
+    "    with open(filename) as file:\n",
+    "        return ' '.join(file.readlines())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "59310f6f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def query_cypher_to_data_frame(filename, parameters_: typ.Optional[typ.Dict[str, typ.Any]] = None):\n",
+    "    records, summary, keys = driver.execute_query(get_cypher_query_from_file(filename), parameters_=parameters_)\n",
+    "    return pd.DataFrame([r.values() for r in records], columns=keys)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "da9e8edb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# The following cell uses the built-in %%html \"magic\" to override the CSS style for tables to a much smaller size.\n",
+    "# This is especially needed for PDF export of tables with multiple columns."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9deaabce",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%html\n",
+    "<style>\n",
+    "/* CSS style for smaller dataframe tables. */\n",
+    ".dataframe th {\n",
+    "    font-size: 8px;\n",
+    "}\n",
+    ".dataframe td {\n",
+    "    font-size: 8px;\n",
+    "}\n",
+    "</style>"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c2496caf",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Main Colormap\n",
+    "main_color_map = 'nipy_spectral'"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0c68aa20",
+   "metadata": {},
+   "source": [
+    "## Preparation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fcec9b7d",
+   "metadata": {},
+   "source": [
+    "### Create Graph Projection\n",
+    "\n",
+    "Create an in-memory undirected graph projection containing Package nodes (vertices) and their dependencies (edges)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "20190661",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "package_embeddings_parameters={\n",
+    "    \"dependencies_projection\": \"package-embeddings-notebook\",\n",
+    "    \"dependencies_projection_node\": \"Package\",\n",
+    "    \"dependencies_projection_weight_property\": \"weight25PercentInterfaces\",\n",
+    "    \"dependencies_projection_write_property\": \"nodeEmbeddingsFastRandomProjection\",\n",
+    "    \"dependencies_projection_embedding_dimension\":\"64\" \n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "82e99db2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "query_cypher_to_data_frame(\"../cypher/Dependencies_Projection/Dependencies_1_Delete_Projection.cypher\", package_embeddings_parameters)\n",
+    "query_cypher_to_data_frame(\"../cypher/Dependencies_Projection/Dependencies_2_Delete_Subgraph.cypher\", package_embeddings_parameters)\n",
+    "query_cypher_to_data_frame(\"../cypher/Dependencies_Projection/Dependencies_4_Create_Undirected_Projection.cypher\", package_embeddings_parameters)\n",
+    "query_cypher_to_data_frame(\"../cypher/Dependencies_Projection/Dependencies_5_Create_Subgraph.cypher\", package_embeddings_parameters)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "145dca19",
+   "metadata": {},
+   "source": [
+    "### Generate Node Embeddings using Fast Random Projection (Fast RP)\n",
+    "\n",
+    "[Fast Random Projection](https://neo4j.com/docs/graph-data-science/current/machine-learning/node-embeddings/fastrp) calculates an array of floats (length = embedding dimension) for every node in the graph. These numbers approximate the relationship and similarity information of each node and are called node embeddings. Random projection reduces the dimensionality of the node feature space while approximately preserving pairwise distances.\n",
+    "\n",
+    "The result can be used in machine learning as features approximating the graph structure. It can also be used to further reduce the dimensionality to visualize the graph in a 2D plot, as we will be doing here."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8efca2cf",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "\n",
+    "fast_random_projection = query_cypher_to_data_frame(\"../cypher/Node_Embeddings/Node_Embeddings_1d_Fast_Random_Projection_Stream.cypher\", package_embeddings_parameters)\n",
+    "fast_random_projection.head() # Look at the first entries of the table \n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "76d8bca1",
+   "metadata": {},
+   "source": [
+    "### Dimensionality reduction with t-distributed stochastic neighbor embedding (t-SNE)\n",
+    "\n",
+    "This step takes the original node embeddings with a higher dimensionality (here a list of 64 floats) and\n",
+    "reduces them to a two-dimensional array for visualization. \n",
+    "\n",
+    "> It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data.\n",
+    "\n",
+    "(see https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html#sklearn.manifold.TSNE)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b2de000f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Calling the fit_transform method just with a list doesn't seem to work (anymore?). \n",
+    "# It leads to an error with the following message: 'list' object has no attribute 'shape'\n",
+    "# This can be solved by converting the list to a numpy array using np.array(..).\n",
+    "# See https://bobbyhadz.com/blog/python-attributeerror-list-object-has-no-attribute-shape\n",
+    "embeddings_as_numpy_array = np.array(fast_random_projection.embedding.to_list())\n",
+    "\n",
+    "# Use TSNE to reduce the dimensionality of the previously calculated node embeddings to 2 dimensions for visualization\n",
+    "t_distributed_stochastic_neighbor_embedding = TSNE(n_components=2, verbose=1, random_state=50)\n",
+    "two_dimension_node_embeddings = t_distributed_stochastic_neighbor_embedding.fit_transform(embeddings_as_numpy_array)\n",
+    "two_dimension_node_embeddings.shape"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8ce7ea41",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create a new DataFrame with the results of the two-dimensional node embeddings\n",
+    "# and the code unit and artifact name of the query above as preparation for the plot\n",
+    "node_embeddings_for_visualization = pd.DataFrame(data = {\n",
+    "    \"codeUnit\": fast_random_projection.codeUnitName,\n",
+    "    \"artifact\": fast_random_projection.artifactName,\n",
+    "    \"communityId\": fast_random_projection.communityId,\n",
+    "    \"centrality\": fast_random_projection.centrality,\n",
+    "    \"x\": [value[0] for value in two_dimension_node_embeddings],\n",
+    "    \"y\": [value[1] for value in two_dimension_node_embeddings]\n",
+    "})\n",
+    "node_embeddings_for_visualization.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "459a819c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plot.scatter(\n",
+    "    x=node_embeddings_for_visualization.x,\n",
+    "    y=node_embeddings_for_visualization.y,\n",
+    "    s=node_embeddings_for_visualization.centrality * 200,\n",
+    "    c=node_embeddings_for_visualization.communityId,\n",
+    "    cmap=main_color_map,\n",
+    ")\n",
+    "plot.title(\"Package nodes positioned by their dependency relationships using t-SNE\")\n",
+    "plot.show()"
+   ]
+  }
+ ],
+ "metadata": {
+  "authors": [
+   {
+    "name": "JohT"
+   }
+  ],
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.4"
+  },
+  "title": "Object Oriented Design Quality Metrics for Java with Neo4j"
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/jupyter/environment.yml b/jupyter/environment.yml
@@ -10,6 +10,7 @@ dependencies:
   - numpy=1.23.*
   - pandas=1.5.*
   - pip=22.3.*
+  - scikit-learn=1.3.* # NodeEmbeddings.ipynb uses sklearn.manifold.TSNE 
   - pip:
       - monotonic==1.*
       - wordcloud==1.9.*

diff --git a/scripts/reports/NodeEmbeddingsJupyter.sh b/scripts/reports/NodeEmbeddingsJupyter.sh
@@ -0,0 +1,34 @@
+#!/usr/bin/env bash
+
+# Creates the "node-embeddings" report (ipynb, md, pdf) based on the Jupyter Notebook "NodeEmbeddings.ipynb".
+# It shows how to create node embeddings for package dependencies using "Fast Random Projection" and
+# how these embeddings can be further reduced in their dimensionality down to two dimensions for visualization.
+# The plot also shows the community as color and the PageRank as size to give visual feedback on how well the nodes are clustered.
+
+# Requires executeJupyterNotebook.sh
+
+# Overrideable Constants (defaults also defined in sub scripts)
+REPORTS_DIRECTORY=${REPORTS_DIRECTORY:-"reports"}
+
+## Get this "scripts/reports" directory if not already set
+# Even though $BASH_SOURCE is made for Bourne-like shells, it is also supported by others and is therefore the preferred solution here. 
+# CDPATH reduces the scope of the cd command to potentially prevent unintended directory changes.
+# This way non-standard tools like readlink aren't needed.
+REPORTS_SCRIPT_DIR=${REPORTS_SCRIPT_DIR:-$( CDPATH=. cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd -P )}
+echo "NodeEmbeddingsJupyter: REPORTS_SCRIPT_DIR=${REPORTS_SCRIPT_DIR}"
+
+# Get the "scripts" directory by taking the path of this script and going one directory up.
+SCRIPTS_DIR=${SCRIPTS_DIR:-"${REPORTS_SCRIPT_DIR}/.."} # Repository directory containing the shell scripts
+echo "NodeEmbeddingsJupyter: SCRIPTS_DIR=${SCRIPTS_DIR}"
+
+# Get the "jupyter" directory by taking the path of this script and going two directories up and then to "jupyter".
+JUPYTER_NOTEBOOK_DIRECTORY=${JUPYTER_NOTEBOOK_DIRECTORY:-"${SCRIPTS_DIR}/../jupyter"} # Repository directory containing the Jupyter Notebooks
+echo "NodeEmbeddingsJupyter: JUPYTER_NOTEBOOK_DIRECTORY=$JUPYTER_NOTEBOOK_DIRECTORY"
+
+# Create report directory
+REPORT_NAME="node-embeddings"
+FULL_REPORT_DIRECTORY="${REPORTS_DIRECTORY}/${REPORT_NAME}"
+mkdir -p "${FULL_REPORT_DIRECTORY}"
+
+# Execute and convert the Jupyter Notebook "NodeEmbeddings.ipynb" within the given reports directory
+(cd "${FULL_REPORT_DIRECTORY}" && exec "${SCRIPTS_DIR}/executeJupyterNotebook.sh" "${JUPYTER_NOTEBOOK_DIRECTORY}/NodeEmbeddings.ipynb") || exit 1