aspitarl
diff --git a/energy_storage_nlp/generate_df_info.py b/energy_storage_nlp/generate_df_info.py
Lines changed: 34 additions & 0 deletions
diff --git a/energy_storage_nlp/index.html b/energy_storage_nlp/index.html
Lines changed: 105 additions & 0 deletions
diff --git a/energy_storage_nlp/lda_model.html b/energy_storage_nlp/lda_model.html
Lines changed: 41 additions & 0 deletions
diff --git a/energy_storage_nlp/wedgeplot.html b/energy_storage_nlp/wedgeplot.html
Lines changed: 85 additions & 0 deletions
diff --git a/index.html b/index.html
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,34 @@
+#%% 
+import pandas as pd
+import sqlite3
+import os
+import sys
+
+data_folder = r'C:\Users\aspit\Git\MHDLab-Projects\Energy Storage\data'
+
+con = sqlite3.connect(os.path.join(data_folder, 'nlp.db'))
+cursor = con.cursor()
+
+df = pd.read_sql_query("SELECT * FROM texts", con, index_col='ID')
+
+df = df.dropna(subset=['processed_text'])
+df = df[df['language'] == 'en']
+
+
+# %%
+# Count the papers that remain after filtering
+num_papers = str(len(df))
+
+print('Number of papers: ' + num_papers)
+
+# %%
+search_terms = ", ".join(sorted(set(df['searchterm'])))  # sorted for a reproducible listing
+
+
+print('Search Terms: ' + search_terms)
+
+
+
+
+
+# %%
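
The filtering in generate_df_info.py (drop rows without processed text, keep English rows, then count papers and list search terms) could also be done on the SQL side. The following is a minimal sketch under stated assumptions, not the script's actual method: the helper name `summarize_texts` and the in-memory sample rows are hypothetical, and only the `texts` table and its `processed_text`, `language`, and `searchterm` columns come from the script above.

```python
import sqlite3


def summarize_texts(con):
    """Return (paper_count, sorted distinct search terms) for English rows
    that have non-null processed text."""
    where = "WHERE processed_text IS NOT NULL AND language = 'en'"
    (count,) = con.execute(f"SELECT COUNT(*) FROM texts {where}").fetchone()
    terms = [
        row[0]
        for row in con.execute(
            f"SELECT DISTINCT searchterm FROM texts {where} ORDER BY searchterm"
        )
    ]
    return count, terms


# Hypothetical in-memory stand-in for nlp.db, for illustration only.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE texts ("
    "ID INTEGER PRIMARY KEY, processed_text TEXT, language TEXT, searchterm TEXT)"
)
con.executemany(
    "INSERT INTO texts VALUES (?, ?, ?, ?)",
    [
        (1, "thermal storage tank", "en", "Thermal Energy Storage"),
        (2, None, "en", "Energy Storage"),  # dropped: no processed text
        (3, "batterie", "fr", "Flow Battery Energy Storage"),  # dropped: not English
        (4, "lithium cell", "en", "Li-ion Energy Storage"),
    ],
)
count, terms = summarize_texts(con)
con.close()

print("Number of papers: " + str(count))
print("Search Terms: " + ", ".join(terms))
```

Counting in SQL avoids loading every abstract into memory just to report two summary values, and `ORDER BY` keeps the term listing stable between runs.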
@@ -0,0 +1,105 @@
+<!DOCTYPE html>
+<html>
+<head>
+  <meta content="text/html;charset=utf-8" http-equiv="Content-Type">
+  <meta content="utf-8" http-equiv="encoding">
+  <title>Energy Storage Abstract Clustering</title>
+  <style>
+
+  .column {
+    float: left;
+  }
+
+  /* Clear floats after the columns */
+  .row:after {
+    content: "";
+    display: table;
+    clear: both;
+
+  }
+  </style>
+      
+</head>
+
+<body>
+
+<h1>
+  <center>Clustering of abstracts related to energy storage</center>
+</h1>
+
+<p>
+  Below are interactive plots that visualize topic modeling of a collection of energy storage article abstracts pulled from Microsoft Academic.
+</p>
+
+<h2>
+  Obtaining the abstracts
+</h2>
+<p>
+  The abstracts were obtained with the search terms below, returning the top 1000 results. Duplicate papers were removed (identified by DOI) and only articles in English were retained, resulting in 7857 abstracts.
+</p>
+<p>
+  <u>Search Terms:</u> High Temperature Energy Storage, Energy Storage, Fossil Energy Storage, Superconducting Magnetic Energy Storage, Thermal Energy Storage, Flow Battery Energy Storage, Electrochemical Energy Storage, Advanced Adiabatic Compressed Air Energy Storage, Liquid Air Energy Storage, Thermochemical Energy Storage, Mechanical Energy Storage, Sensible Thermal Energy Storage, Methanol Energy Storage, Hydrogen Energy Storage, Li-ion Energy Storage, Lead Acid Energy Storage, Latent Thermal Energy Storage, Ammonia Energy Storage
+
+
+</p>
+
+<h2>
+  Topic Modeling
+</h2>
+
+<p>
+  Topic modeling was performed using Latent Dirichlet Allocation (LDA) with gensim. LDA is an unsupervised machine learning technique that determines a set of topics that can represent the modeled collection of texts (corpus).
+
+  Each document is given a probability of belonging to each topic, where topics are probability distributions over words. This is a 'soft' clustering technique, in contrast to K-means (used previously), which assigns each document to just one cluster. That hard assignment loses the nuance of papers that lie at the intersection between fields.
+</p>
+
+<h2> Topic Visualization with t-SNE </h2>
+
+<p>
+  Below is a visualization of the topic modeling of the corpus. First, the texts are represented as points on a 2D surface using t-Distributed Stochastic Neighbor Embedding (t-SNE). 
+
+  The topic distribution for each paper is visualized by representing each paper as a pie chart. Each slice represents a topic, and the fractional size (angle) of each slice represents the probability of that topic. Only the top 3 topics for each paper are included (resulting in an incomplete pie chart) to reduce the rendering load.
+
+  The top words for each topic are indicated in the legend (see next visualization to explore the topic words in more detail). The topics in the legend are sorted by the number of papers that have that topic as their most probable topic.
+
+  <br><br>
+  To use the plot, mouse over each item to get information about the paper. Papers can be clicked to open the article's web page. Use the tools on the right to move around, and note the 'refresh' button to reset the graph. Topics can be hidden by clicking on the topic color in the legend.
+</p>
+
+
+<div class="row">
+
+  <embed type="text/html" src="wedgeplot.html" style="width:100%" height=650> 
+
+
+</div>
+
+<h2> Topic Visualization with pyLDAvis </h2>
+
+<p>
+
+  Below is the visualization of the LDA model using pyLDAvis. The graph on the left uses Principal Component Analysis to visualize the topics in 2D, similar to t-SNE. The dashboard on the right is useful for exploring the words associated with each topic. Slide the relevance metric to about 0.5 to get words more specific to each topic.
+
+</p>
+
+
+<div class="row">
+<embed type="text/html" src="lda_model.html" style="width:100%" height=900> 
+</div>
+
+
+<h1> Model Parameters</h1>
+
+<p>
+  Bigram modeling parameters: {'min_count': 5, 'threshold': 100}
+  <br>
+  Num Bigrams: 950, Total Words: 25076, Bigram Fraction: 0.038
+  <br>
+  LDA modeling parameters: {'alpha': 0.2, 'eta': 0.2, 'num_topics': 20, 'passes': 5}
+
+</p>
+
+
+</body>
+</html>
+  
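
The page describes each paper as an incomplete pie chart built from its top 3 topics, with slice angle proportional to topic probability. A minimal sketch of that mapping follows, assuming a plain list of per-topic probabilities. The function name `top_topic_wedges` and the sample distribution are hypothetical, and this is not taken from the code that generated wedgeplot.html.

```python
import math


def top_topic_wedges(topic_probs, top_n=3):
    """Map a topic distribution to (topic_id, start_angle, end_angle) wedges.

    Angles are in radians. Only the top_n most probable topics are kept, so
    the wedges cover less than a full circle whenever the remaining topics
    carry nonzero probability.
    """
    ranked = sorted(enumerate(topic_probs), key=lambda pair: pair[1], reverse=True)
    wedges = []
    start = 0.0
    for topic_id, prob in ranked[:top_n]:
        end = start + 2 * math.pi * prob  # slice angle proportional to probability
        wedges.append((topic_id, start, end))
        start = end
    return wedges


# Hypothetical distribution over four topics for a single paper.
probs = [0.5, 0.1, 0.25, 0.15]
wedges = top_topic_wedges(probs)
for topic_id, start, end in wedges:
    print(f"topic {topic_id}: {math.degrees(start):.0f} to {math.degrees(end):.0f} degrees")
```

Here the three retained topics account for 0.9 of the probability mass, so the pie is left with a 36 degree gap for the dropped topic.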
@@ -0,0 +1,14 @@
+<!DOCTYPE html>
+<html>
+<meta content="text/html;charset=utf-8" http-equiv="Content-Type">
+<meta content="utf-8" http-equiv="encoding">
+<body>
+<center>
+
+<h1>Welcome!</h1>
+
+<h2><a href="energy_storage_nlp/index.html">Energy Storage Literature Clustering</a> </h2>
+
+</center>
+</body>
+</html>