Merge pull request #1 from adapt-sjtu/dev_badge

Add badges and citations

blmoistawinde authored Dec 20, 2020
2 parents 4aafe32 + 75028f0 commit 528722b
Showing 10 changed files with 713 additions and 96 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -1 +1,4 @@
commonsense-papers.bib
README.html
test_scripts/
.vscode/
12 changes: 10 additions & 2 deletions CONTRIBUTING.md
@@ -4,6 +4,8 @@ Welcome to the `commonsense-papers` project. We aim to select the most representative

Since our paper list cannot be complete, we welcome your contributions.

`README_edit` is the file for humans to edit, and `index.html` should be automatically generated by `gen_badge.py`.

## Paper

All the papers included here should follow this format:
@@ -28,6 +30,12 @@ If a paper does not fit into a current (sub)class, you can add a new one.

If a paper provides a resource as well as a method, we usually put it into the resource class.

## Statistics
## Statistics and Badges

After editing the paper list in `README.md` with the format stated above, simply run `gen_badge.py`; the statistics in `README.md` and the badge-enabled webpage `index.html` will update automatically.

After editing the paper list with the format stated above, simply run `gen_stats.py` and the statistics in `README.md` will automatically update.
You may need to install the dependencies first:

```
pip install -r requirements.txt
```
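
With the dependencies in place, a typical refresh is then just (assuming `python` resolves to a Python 3 interpreter):

```
python gen_badge.py
```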
47 changes: 25 additions & 22 deletions README.md
@@ -3,6 +3,10 @@ Must-read papers on commonsense knowledge and other resources and tutorials

We aim to select the most representative and innovative papers in the research field of **commonsense knowledge**, and provide taxonomy/classification as well as statistics of these papers to give a quick overview of the field and help focused reading.

We've also added (influential) citation numbers according to [Semantic Scholar](https://www.semanticscholar.org/), the [AltMetric Badge](http://api.altmetric.com/embeds.html) (paper influence in social media), and the [Dimensions Badge](https://badge.dimensions.ai/) (paper citations) for **papers that can be linked to an arXiv id/DOI**. Highly influential papers should now be easier to identify, though we still encourage readers to read other papers that might have been overlooked. Due to rendering limitations, the badges are only visible on our [website](https://adapt-sjtu.github.io/commonsense-papers/).

![badges](images/badges.jpg)

Contributed by [ADAPTers](https://adapt.seiee.sjtu.edu.cn/) (major efforts by Zhiling Zhang ([@blmoistawinde](https://github.com/blmoistawinde)), Siyu Ren, Hongru Huang, Zelin Zhou, Yanzhu Guo)

Our list may not be complete. We will keep adding papers and improving it. [Contributions](CONTRIBUTING.md) are welcomed!
@@ -55,37 +59,36 @@ Non-stopwords in titles, indicating the hot topics in this field.
<td>5</td>
</tr>
<tr>
<th>challenge</th>
<th>question</th>
<td>5</td>
</tr>
<tr>
<th>common</th>
<th>model</th>
<td>5</td>
</tr>
<tr>
<th>question</th>
<th>challenge</th>
<td>5</td>
</tr>
<tr>
<th>model</th>
<th>common</th>
<td>5</td>
</tr>
<tr>
<th>answering</th>
<th>pre</th>
<td>4</td>
</tr>
<tr>
<th>story</th>
<td>4</td>
</tr>
<tr>
<th>pre</th>
<th>answering</th>
<td>4</td>
</tr>
</tbody>
</table>
</anchor>

<br/>

**Researchers**
@@ -102,43 +105,43 @@ Most active researchers in this field
</thead>
<tbody>
<tr>
<th>Yejin Choi</th>
<th><a href="https://www.semanticscholar.org/author/1699545">Yejin Choi</a></th>
<td>14</td>
</tr>
<tr>
<th>Antoine Bosselut</th>
<th><a href="https://www.semanticscholar.org/author/2691021">Antoine Bosselut</a></th>
<td>7</td>
</tr>
<tr>
<th>Chandra Bhagavatula</th>
<th><a href="https://www.semanticscholar.org/author/1857797">Chandra Bhagavatula</a></th>
<td>7</td>
</tr>
<tr>
<th>Bill Yuchen Lin</th>
<th><a href="https://www.semanticscholar.org/author/51583409">Bill Yuchen Lin</a></th>
<td>6</td>
</tr>
<tr>
<th>Ronan Le Bras</th>
<th><a href="https://www.semanticscholar.org/author/39227408">Ronan Le Bras</a></th>
<td>5</td>
</tr>
<tr>
<th>Xiang Ren</th>
<th><a href="https://www.semanticscholar.org/author/1384550891">Xiang Ren</a></th>
<td>4</td>
</tr>
<tr>
<th>Hannah Rashkin</th>
<th><a href="https://www.semanticscholar.org/author/2516777">Hannah Rashkin</a></th>
<td>4</td>
</tr>
<tr>
<th>Dan Roth</th>
<th><a href="https://www.semanticscholar.org/author/144590225">Dan Roth</a></th>
<td>4</td>
</tr>
<tr>
<th>Maarten Sap</th>
<th><a href="https://www.semanticscholar.org/author/2729164">Maarten Sap</a></th>
<td>4</td>
</tr>
<tr>
<th>Hongming Zhang</th>
<th><a href="https://www.semanticscholar.org/author/48212577">Hongming Zhang</a></th>
<td>3</td>
</tr>
</tbody>
@@ -190,7 +193,7 @@ Just an estimation. May not be precise as arxiv papers may appear in other venues

*Shane Storks, Qiaozi Gao, Joyce Y. Chai*

**T6: Commonsense Reasoning for Natural Language Processing.** ACL 2020. [slides and video](https://slideslive.com/38931667/t6-commonsense-reasoning-for-natural-language-processing)

*Antoine Bosselut, Dan Roth, Maarten Sap, Vered Shwartz, Yejin Choi*

@@ -246,11 +249,11 @@ Just an estimation. May not be precise as arxiv papers may appear in other venues
### Related Knowledge Bases
<br/>

**WordNet: A Lexical Database for English** Communications of the ACM Vol. 38, No. 11: 39-41. 1995. [homepage] (https://wordnet.princeton.edu/)
**WordNet: A Lexical Database for English** Communications of the ACM Vol. 38, No. 11: 39-41. 1995. [homepage](https://wordnet.princeton.edu/)

*George A. Miller*

**Toward an Architecture for Never-Ending Language Learning** (NELL). AAAI 2010 [paper](http://rtw.ml.cmu.edu/papers/carlson-aaai10.pdf) [homepage](http://rtw.ml.cmu.edu/rtw/)

*Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell*

@@ -406,7 +409,7 @@ Just an estimation. May not be precise as arxiv papers may appear in other venues
<br/>

**Enhancing Pre-Trained Language Representations with Rich Knowledge for Machine Reading Comprehension** ACL 2019 [paper](https://www.aclweb.org/anthology/P19-1226.pdf) [code](https://github.com/PaddlePaddle/Research/tree/master/NLP/ACL2019-KTNET)
- resource: WordNet, NELL

*An Yang, Quan Wang, Jing Liu, Kai Liu, Yajuan Lyu, Hua Wu, Qiaoqiao She, Sujian Li*

@@ -457,4 +460,4 @@ Just an estimation. May not be precise as arxiv papers may appear in other venues

*Tan Wang, Jianqiang Huang, Hanwang Zhang, Qianru Sun*

[back to table of contents](#toc)
171 changes: 171 additions & 0 deletions gen_badge.py
@@ -0,0 +1,171 @@
import re
import numpy as np
import pandas as pd
from collections import defaultdict, Counter
import spacy
# the keyword argument is `disable` (not `disabled`); ner/parser are not needed for lemmatization
nlp = spacy.load("en", disable=["ner", "parser"])
import semanticscholar as sch
import markdown2
import jinja2
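
# Overall flow (summary): scan README.md line by line, collecting keyword/author/venue
# counts plus per-line badge HTML keyed by line index; refresh the <anchor id="..."> sections
# in README.md; then render index.html (badges included) through jinja2.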

def str_sim(str1, str2):
    # currently use this simple heuristic: character-set overlap
    chars1, chars2 = set(str1), set(str2)
    return len(chars1 & chars2) / max(min(len(chars1), len(chars2)), 1)

def most_sim_author(str1, author_list):
    best_sim, best_match = -1, None
    for str2 in author_list:
        sim0 = str_sim(str1, str2)
        if sim0 > best_sim:
            best_sim, best_match = sim0, str2
    return best_match
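
# Illustration (hypothetical mentions): str_sim compares character sets, so an abbreviated
# mention still scores high, e.g. most_sim_author("Y. Choi", ["Yejin Choi", "Dan Roth"])
# returns "Yejin Choi".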

allow_search = True

paper_cnt = 0
author_cnt = Counter()
kwds_cnt = Counter()
venue_cnt = defaultdict(int)
my_stopwords = {"commonsense", "knowledge", "natural", "language"}
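# State collected while scanning README.md:
# lid2badges / lid2citation_nums map a README line index to badge HTML / citation text;
# mention2aid / aid2url map author-name mentions to Semantic Scholar ids and profile URLs;
# curr_authors_info carries author metadata from a title line to the author line below it.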
lid2badges = defaultdict(str)
lid2citation_nums = defaultdict(str)
mention2aid = {}
aid2url = {}
curr_authors_info = None
readme_lines = []
with open("README.md", encoding="utf-8") as f:
for lid, line in enumerate(f):
line = line.strip()
readme_lines.append(line)
if re.search(r"^\*\*([^\*]+)\*\*", line): # title
title = re.search(r"^\*\*([^\*]+)\*\*", line).group(1)
keywords = {x.lemma_.lower() for x in nlp(title) if not (x.is_stop or x.is_punct or x.lemma_.lower() in my_stopwords)}
kwds_cnt.update(keywords)
paper_cnt += 1
# find arxiv id, and get details from semantic scholar API
if re.search(r"https://arxiv.org/(pdf|abs)/(\d+\.\d+)", line):
arxiv_id = re.search(r"https://arxiv.org/(pdf|abs)/(\d+\.\d+)", line).group(2)
alt_badge = f' <div data-badge-popover="right" data-badge-type="2" data-hide-no-mentions="true" class="altmetric-embed" data-arxiv-id="{arxiv_id}" style="float:left"></div> '
lid2badges[lid] = alt_badge
if allow_search:
paper_info = sch.paper(f'arxiv:{arxiv_id}', timeout=2)
citations = len(paper_info["citations"])
inf_citations = paper_info["influentialCitationCount"]
if inf_citations > 0:
lid2citation_nums[lid] = f" (Citations: {citations}, {inf_citations} influential) "
else:
lid2citation_nums[lid] = f" (Citations: {citations}) "
curr_authors_info = paper_info["authors"]
doi = paper_info.get('doi', None)
print("DOI", doi)
if doi is not None:
# use doi to link to Dimensions Badge
dim_badge = f' <span class="__dimensions_badge_embed__" data-doi="{doi}" data-style="small_rectangle" style="float:left"></span> '
lid2badges[lid] = lid2badges[lid] + dim_badge
else:
lid2citation_nums[lid] = " (Citations: ?) "
else:
lid2badges[lid] = ""

try:
beg = line.rfind(r"** ") + 3
end = line.find("[") - 1
venue_text = line[beg:end]
if ")" in venue_text:
venue_text = venue_text[venue_text.find(")")+1:]
venue = venue_text.strip().split()[0]
venue_cnt[venue] += 1
except Exception as e:
pass
continue
if re.search(r"^\*.+\*$", line): # author
authors = [x.strip() for x in line[1:-1].split(", ")]
author_cnt.update(authors)
if curr_authors_info != None:
# match author mention with semantic scholar std name
# first and last author should be accurately matched
mention2aid[authors[0]] = curr_authors_info[0]['authorId']
mention2aid[authors[-1]] = curr_authors_info[-1]['authorId']
tmp_author2id = {}
for author_info in curr_authors_info:
aid2url[author_info['authorId']] = author_info['url']
tmp_author2id[author_info['name']] = author_info['authorId']
if len(authors) > 2:
for mention0 in authors[1:-1]:
if len(tmp_author2id) > 0:
matched_aname = most_sim_author(mention0, tmp_author2id)
mention2aid[mention0] = tmp_author2id[matched_aname]
del tmp_author2id[matched_aname]

curr_authors_info = None

kwds_cnt = pd.DataFrame(pd.Series(kwds_cnt).sort_values(ascending=False), columns=["count"])
author_cnt = pd.DataFrame(pd.Series(author_cnt).sort_values(ascending=False), columns=["count"])
venue_cnt = pd.DataFrame(pd.Series(venue_cnt).sort_values(ascending=False), columns=["count"])
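
# README.md marks the regions to refresh with <anchor id="cnt">, <anchor id="keyword">,
# <anchor id="researcher"> and <anchor id="venue">; the regexes below rewrite only the
# content between those tags, leaving the rest of the file untouched.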

readme_to_md = "\n".join(readme_lines)
readme_to_md = re.sub(r'<anchor id="cnt">(.*?)</anchor>', f'<anchor id="cnt">{paper_cnt}</anchor>', readme_to_md)
html0 = kwds_cnt.head(10).to_html()
readme_to_md = re.sub(r'<anchor id="keyword">\n(.*?)\n</anchor>', f'<anchor id="keyword">\n{html0}\n</anchor>', readme_to_md, flags=re.DOTALL)
html0 = author_cnt.head(10).to_html()
for mention0 in author_cnt.index:
    if mention0 in mention2aid:
        url0 = aid2url[mention2aid[mention0]]
        html0 = html0.replace(mention0, f'<a href="{url0}">{mention0}</a>')
readme_to_md = re.sub(r'<anchor id="researcher">\n(.*?)\n</anchor>', f'<anchor id="researcher">\n{html0}\n</anchor>', readme_to_md, flags=re.DOTALL)
html0 = venue_cnt.head(5).to_html()
readme_to_md = re.sub(r'<anchor id="venue">\n(.*?)\n</anchor>', f'<anchor id="venue">\n{html0}\n</anchor>', readme_to_md, flags=re.DOTALL)
with open("README.md", "w", encoding="utf-8") as f:
f.write(readme_to_md)

# write to website
for lid, cite_str in lid2citation_nums.items():
    if cite_str != "":
        readme_lines[lid] += cite_str
for lid, badge_str in lid2badges.items():
    if badge_str != "":
        readme_lines[lid] += (badge_str + "<br/>")
readme_to_html = "\n".join(readme_lines)
readme_to_html = re.sub(r'<anchor id="cnt">(.*?)</anchor>', f'<anchor id="cnt">{paper_cnt}</anchor>', readme_to_html)
html0 = kwds_cnt.head(10).to_html()
readme_to_html = re.sub(r'<anchor id="keyword">\n(.*?)\n</anchor>', f'<anchor id="keyword">\n{html0}\n</anchor>', readme_to_html, flags=re.DOTALL)
html0 = author_cnt.head(10).to_html()
for mention0 in author_cnt.index:
    if mention0 in mention2aid:
        url0 = aid2url[mention2aid[mention0]]
        html0 = html0.replace(mention0, f'<a href="{url0}">{mention0}</a>')
readme_to_html = re.sub(r'<anchor id="researcher">\n(.*?)\n</anchor>', f'<anchor id="researcher">\n{html0}\n</anchor>', readme_to_html, flags=re.DOTALL)
html0 = venue_cnt.head(5).to_html()
readme_to_html = re.sub(r'<anchor id="venue">\n(.*?)\n</anchor>', f'<anchor id="venue">\n{html0}\n</anchor>', readme_to_html, flags=re.DOTALL)
html = markdown2.markdown(readme_to_html)

template = """
<head>
<script type='text/javascript' charset="utf-8">{{dimensions_badge}}</script>
<script type='text/javascript' charset="utf-8">{{altmetrics_badge}}</script>
</head>
{{main_body}}
"""

template = jinja2.Template(template)

template_vars = {
    "main_body": html,
    "dimensions_badge": open("static/badge.js", encoding="utf-8").read(),
    "altmetrics_badge": open("static/embed.js", encoding="utf-8").read()
}

html_out = template.render(template_vars)
with open("index.html", "w", encoding="utf-8") as f:
f.write(html_out)

print("Results (for human read)")
print("\n--Keyword--\n")
print(kwds_cnt.head(10))
print("\n--Author--\n")
print(author_cnt.head(10))
print("\n--Venue--\n")
print(venue_cnt.head(5))
print(f"Paper count: {paper_cnt}")