Merge pull request #1 from adapt-sjtu/dev_badge

Add badges and citations

blmoistawinde authored Dec 20, 2020
2 parents 4aafe32 + 75028f0 commit 528722b
Showing 10 changed files with 713 additions and 96 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -1 +1,4 @@
commonsense-papers.bib
README.html
test_scripts/
.vscode/
12 changes: 10 additions & 2 deletions CONTRIBUTING.md
@@ -4,6 +4,8 @@ Welcome to the `commonsense-papers` project. We aim to select the most representative

Since our paper list cannot be complete, we welcome your contributions.

`README_edit` is the file for humans to edit, and `index.html` should be automatically generated by `gen_badge.py`.

## Paper

All the papers included here should follow this format:
@@ -28,6 +30,12 @@ If a paper does not fit into a current (sub)class, you can add a new one.

If a paper provides a resource as well as a method, we usually put it into the resource class.

## Statistics
## Statistics and Badges

After editing the paper list in `README.md` with the format stated above, simply run `gen_badge.py`; the statistics in `README.md` and the badge-enabled webpage `index.html` will update automatically.

After editing the paper list with the format stated above, simply run `gen_stats.py` and the statistics in `README.md` will automatically update.
You may need to install the dependencies first:

```
pip install -r requirements.txt
```
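
With the dependencies in place, a typical refresh is then just (assuming `python` resolves to a Python 3 interpreter):

```
python gen_badge.py
```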
47 changes: 25 additions & 22 deletions README.md
@@ -3,6 +3,10 @@ Must-read papers on commonsense knowledge and other resources and tutorials

We aim to select the most representative and innovative papers in the research field of **commonsense knowledge**, and provide taxonomy/classification as well as statistics of these papers to give a quick overview of the field and help focused reading.

We've also added (influential) citation numbers according to [Semantic Scholar](https://www.semanticscholar.org/), the [AltMetric Badge](http://api.altmetric.com/embeds.html) (paper influence in social media), and the [Dimensions Badge](https://badge.dimensions.ai/) (paper citations) for **papers that can be linked to an arXiv id/DOI**. Highly influential papers should now be easier to identify, though we still encourage readers to read other papers that might have been overlooked. Due to rendering limitations, the badges are only visible on our [website](https://adapt-sjtu.github.io/commonsense-papers/).

![badges](images/badges.jpg)

Contributed by [ADAPTers](https://adapt.seiee.sjtu.edu.cn/) (major efforts by Zhiling Zhang ([@blmoistawinde](https://github.com/blmoistawinde)), Siyu Ren, Hongru Huang, Zelin Zhou, Yanzhu Guo)

Our list may not be complete. We will keep adding papers and improving it. [Contributions](CONTRIBUTING.md) are welcomed!
@@ -55,37 +59,36 @@ Non-stopwords in titles, indicating the hot topics in this field.
<td>5</td>
</tr>
<tr>
<th>challenge</th>
<th>question</th>
<td>5</td>
</tr>
<tr>
<th>common</th>
<th>model</th>
<td>5</td>
</tr>
<tr>
<th>question</th>
<th>challenge</th>
<td>5</td>
</tr>
<tr>
<th>model</th>
<th>common</th>
<td>5</td>
</tr>
<tr>
<th>answering</th>
<th>pre</th>
<td>4</td>
</tr>
<tr>
<th>story</th>
<td>4</td>
</tr>
<tr>
<th>pre</th>
<th>answering</th>
<td>4</td>
</tr>
</tbody>
</table>
</anchor>

<br/>

**Researchers**
@@ -102,43 +105,43 @@ Most active researchers in this field
</thead>
<tbody>
<tr>
<th>Yejin Choi</th>
<th><a href="https://www.semanticscholar.org/author/1699545">Yejin Choi</a></th>
<td>14</td>
</tr>
<tr>
<th>Antoine Bosselut</th>
<th><a href="https://www.semanticscholar.org/author/2691021">Antoine Bosselut</a></th>
<td>7</td>
</tr>
<tr>
<th>Chandra Bhagavatula</th>
<th><a href="https://www.semanticscholar.org/author/1857797">Chandra Bhagavatula</a></th>
<td>7</td>
</tr>
<tr>
<th>Bill Yuchen Lin</th>
<th><a href="https://www.semanticscholar.org/author/51583409">Bill Yuchen Lin</a></th>
<td>6</td>
</tr>
<tr>
<th>Ronan Le Bras</th>
<th><a href="https://www.semanticscholar.org/author/39227408">Ronan Le Bras</a></th>
<td>5</td>
</tr>
<tr>
<th>Xiang Ren</th>
<th><a href="https://www.semanticscholar.org/author/1384550891">Xiang Ren</a></th>
<td>4</td>
</tr>
<tr>
<th>Hannah Rashkin</th>
<th><a href="https://www.semanticscholar.org/author/2516777">Hannah Rashkin</a></th>
<td>4</td>
</tr>
<tr>
<th>Dan Roth</th>
<th><a href="https://www.semanticscholar.org/author/144590225">Dan Roth</a></th>
<td>4</td>
</tr>
<tr>
<th>Maarten Sap</th>
<th><a href="https://www.semanticscholar.org/author/2729164">Maarten Sap</a></th>
<td>4</td>
</tr>
<tr>
<th>Hongming Zhang</th>
<th><a href="https://www.semanticscholar.org/author/48212577">Hongming Zhang</a></th>
<td>3</td>
</tr>
</tbody>
@@ -190,7 +193,7 @@ Just an estimation. May not be precise as arxiv papers may appear in other venues

*Shane Storks, Qiaozi Gao, Joyce Y. Chai*

**T6: Commonsense Reasoning for Natural Language Processing.** ACL 2020. [slides and video](https://slideslive.com/38931667/t6-commonsense-reasoning-for-natural-language-processing)

*Antoine Bosselut, Dan Roth, Maarten Sap, Vered Shwartz, Yejin Choi*

@@ -246,11 +249,11 @@ Just an estimation. May not be precise as arxiv papers may appear in other venues
### Related Knowledge Bases
<br/>

**WordNet: A Lexical Database for English** Communications of the ACM Vol. 38, No. 11: 39-41. 1995. [homepage] (https://wordnet.princeton.edu/)
**WordNet: A Lexical Database for English** Communications of the ACM Vol. 38, No. 11: 39-41. 1995. [homepage](https://wordnet.princeton.edu/)

*George A. Miller*

**Toward an Architecture for Never-Ending Language Learning** (NELL). AAAI 2010 [paper](http://rtw.ml.cmu.edu/papers/carlson-aaai10.pdf) [homepage](http://rtw.ml.cmu.edu/rtw/)

*Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell*

@@ -406,7 +409,7 @@ Just an estimation. May not be precise as arxiv papers may appear in other venues
<br/>

**Enhancing Pre-Trained Language Representations with Rich Knowledge for Machine Reading Comprehension** ACL 2019 [paper](https://www.aclweb.org/anthology/P19-1226.pdf) [code](https://github.com/PaddlePaddle/Research/tree/master/NLP/ACL2019-KTNET)
- resource: WordNet, NELL

*An Yang, Quan Wang, Jing Liu, Kai Liu, Yajuan Lyu, Hua Wu, Qiaoqiao She, Sujian Li*

@@ -457,4 +460,4 @@ Just an estimation. May not be precise as arxiv papers may appear in other venues

*Tan Wang, Jianqiang Huang, Hanwang Zhang, Qianru Sun*

[back to table of contents](#toc)
171 changes: 171 additions & 0 deletions gen_badge.py
@@ -0,0 +1,171 @@
import re
import numpy as np
import pandas as pd
from collections import defaultdict, Counter
import spacy
# the keyword argument is `disable` (not `disabled`); ner/parser are not needed for lemmatization
nlp = spacy.load("en", disable=["ner", "parser"])
import semanticscholar as sch
import markdown2
import jinja2
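
# Overall flow (summary): scan README.md line by line, collecting keyword/author/venue
# counts plus per-line badge HTML keyed by line index; refresh the <anchor id="..."> sections
# in README.md; then render index.html (badges included) through jinja2.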

def str_sim(str1, str2):
    # currently use this simple heuristic: character-set overlap
    chars1, chars2 = set(str1), set(str2)
    return len(chars1 & chars2) / max(min(len(chars1), len(chars2)), 1)

def most_sim_author(str1, author_list):
    best_sim, best_match = -1, None
    for str2 in author_list:
        sim0 = str_sim(str1, str2)
        if sim0 > best_sim:
            best_sim, best_match = sim0, str2
    return best_match
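
# Illustration (hypothetical mentions): str_sim compares character sets, so an abbreviated
# mention still scores high, e.g. most_sim_author("Y. Choi", ["Yejin Choi", "Dan Roth"])
# returns "Yejin Choi".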

allow_search = True

paper_cnt = 0
author_cnt = Counter()
kwds_cnt = Counter()
venue_cnt = defaultdict(int)
my_stopwords = {"commonsense", "knowledge", "natural", "language"}
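# State collected while scanning README.md:
# lid2badges / lid2citation_nums map a README line index to badge HTML / citation text;
# mention2aid / aid2url map author-name mentions to Semantic Scholar ids and profile URLs;
# curr_authors_info carries author metadata from a title line to the author line below it.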
lid2badges = defaultdict(str)
lid2citation_nums = defaultdict(str)
mention2aid = {}
aid2url = {}
curr_authors_info = None
readme_lines = []
with open("README.md", encoding="utf-8") as f:
for lid, line in enumerate(f):
line = line.strip()
readme_lines.append(line)
if re.search(r"^\*\*([^\*]+)\*\*", line): # title
title = re.search(r"^\*\*([^\*]+)\*\*", line).group(1)
keywords = {x.lemma_.lower() for x in nlp(title) if not (x.is_stop or x.is_punct or x.lemma_.lower() in my_stopwords)}
kwds_cnt.update(keywords)
paper_cnt += 1
# find arxiv id, and get details from semantic scholar API
if re.search(r"https://arxiv.org/(pdf|abs)/(\d+\.\d+)", line):
arxiv_id = re.search(r"https://arxiv.org/(pdf|abs)/(\d+\.\d+)", line).group(2)
alt_badge = f' <div data-badge-popover="right" data-badge-type="2" data-hide-no-mentions="true" class="altmetric-embed" data-arxiv-id="{arxiv_id}" style="float:left"></div> '
lid2badges[lid] = alt_badge
if allow_search:
paper_info = sch.paper(f'arxiv:{arxiv_id}', timeout=2)
citations = len(paper_info["citations"])
inf_citations = paper_info["influentialCitationCount"]
if inf_citations > 0:
lid2citation_nums[lid] = f" (Citations: {citations}, {inf_citations} influential) "
else:
lid2citation_nums[lid] = f" (Citations: {citations}) "
curr_authors_info = paper_info["authors"]
doi = paper_info.get('doi', None)
print("DOI", doi)
if doi is not None:
# use doi to link to Dimensions Badge
dim_badge = f' <span class="__dimensions_badge_embed__" data-doi="{doi}" data-style="small_rectangle" style="float:left"></span> '
lid2badges[lid] = lid2badges[lid] + dim_badge
else:
lid2citation_nums[lid] = " (Citations: ?) "
else:
lid2badges[lid] = ""

try:
beg = line.rfind(r"** ") + 3
end = line.find("[") - 1
venue_text = line[beg:end]
if ")" in venue_text:
venue_text = venue_text[venue_text.find(")")+1:]
venue = venue_text.strip().split()[0]
venue_cnt[venue] += 1
except Exception as e:
pass
continue
if re.search(r"^\*.+\*$", line): # author
authors = [x.strip() for x in line[1:-1].split(", ")]
author_cnt.update(authors)
if curr_authors_info != None:
# match author mention with semantic scholar std name
# first and last author should be accurately matched
mention2aid[authors[0]] = curr_authors_info[0]['authorId']
mention2aid[authors[-1]] = curr_authors_info[-1]['authorId']
tmp_author2id = {}
for author_info in curr_authors_info:
aid2url[author_info['authorId']] = author_info['url']
tmp_author2id[author_info['name']] = author_info['authorId']
if len(authors) > 2:
for mention0 in authors[1:-1]:
if len(tmp_author2id) > 0:
matched_aname = most_sim_author(mention0, tmp_author2id)
mention2aid[mention0] = tmp_author2id[matched_aname]
del tmp_author2id[matched_aname]

curr_authors_info = None

kwds_cnt = pd.DataFrame(pd.Series(kwds_cnt).sort_values(ascending=False), columns=["count"])
author_cnt = pd.DataFrame(pd.Series(author_cnt).sort_values(ascending=False), columns=["count"])
venue_cnt = pd.DataFrame(pd.Series(venue_cnt).sort_values(ascending=False), columns=["count"])
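
# README.md marks the regions to refresh with <anchor id="cnt">, <anchor id="keyword">,
# <anchor id="researcher"> and <anchor id="venue">; the regexes below rewrite only the
# content between those tags, leaving the rest of the file untouched.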

readme_to_md = "\n".join(readme_lines)
readme_to_md = re.sub(r'<anchor id="cnt">(.*?)</anchor>', f'<anchor id="cnt">{paper_cnt}</anchor>', readme_to_md)
html0 = kwds_cnt.head(10).to_html()
readme_to_md = re.sub(r'<anchor id="keyword">\n(.*?)\n</anchor>', f'<anchor id="keyword">\n{html0}\n</anchor>', readme_to_md, flags=re.DOTALL)
html0 = author_cnt.head(10).to_html()
for mention0 in author_cnt.index:
    if mention0 in mention2aid:
        url0 = aid2url[mention2aid[mention0]]
        html0 = html0.replace(mention0, f'<a href="{url0}">{mention0}</a>')
readme_to_md = re.sub(r'<anchor id="researcher">\n(.*?)\n</anchor>', f'<anchor id="researcher">\n{html0}\n</anchor>', readme_to_md, flags=re.DOTALL)
html0 = venue_cnt.head(5).to_html()
readme_to_md = re.sub(r'<anchor id="venue">\n(.*?)\n</anchor>', f'<anchor id="venue">\n{html0}\n</anchor>', readme_to_md, flags=re.DOTALL)
with open("README.md", "w", encoding="utf-8") as f:
f.write(readme_to_md)

# write to website
for lid, cite_str in lid2citation_nums.items():
    if cite_str != "":
        readme_lines[lid] += cite_str
for lid, badge_str in lid2badges.items():
    if badge_str != "":
        readme_lines[lid] += (badge_str + "<br/>")
readme_to_html = "\n".join(readme_lines)
readme_to_html = re.sub(r'<anchor id="cnt">(.*?)</anchor>', f'<anchor id="cnt">{paper_cnt}</anchor>', readme_to_html)
html0 = kwds_cnt.head(10).to_html()
readme_to_html = re.sub(r'<anchor id="keyword">\n(.*?)\n</anchor>', f'<anchor id="keyword">\n{html0}\n</anchor>', readme_to_html, flags=re.DOTALL)
html0 = author_cnt.head(10).to_html()
for mention0 in author_cnt.index:
    if mention0 in mention2aid:
        url0 = aid2url[mention2aid[mention0]]
        html0 = html0.replace(mention0, f'<a href="{url0}">{mention0}</a>')
readme_to_html = re.sub(r'<anchor id="researcher">\n(.*?)\n</anchor>', f'<anchor id="researcher">\n{html0}\n</anchor>', readme_to_html, flags=re.DOTALL)
html0 = venue_cnt.head(5).to_html()
readme_to_html = re.sub(r'<anchor id="venue">\n(.*?)\n</anchor>', f'<anchor id="venue">\n{html0}\n</anchor>', readme_to_html, flags=re.DOTALL)
html = markdown2.markdown(readme_to_html)

template = """
<head>
<script type='text/javascript' charset="utf-8">{{dimensions_badge}}</script>
<script type='text/javascript' charset="utf-8">{{altmetrics_badge}}</script>
</head>
{{main_body}}
"""

template = jinja2.Template(template)

template_vars = {
    "main_body": html,
    "dimensions_badge": open("static/badge.js", encoding="utf-8").read(),
    "altmetrics_badge": open("static/embed.js", encoding="utf-8").read()
}

html_out = template.render(template_vars)
with open("index.html", "w", encoding="utf-8") as f:
f.write(html_out)

print("Results (for human read)")
print("\n--Keyword--\n")
print(kwds_cnt.head(10))
print("\n--Author--\n")
print(author_cnt.head(10))
print("\n--Venue--\n")
print(venue_cnt.head(5))
print(f"Paper count: {paper_cnt}")