Build ContentMine-based workflow for "main subject" of papers in Wikidata #51
As a starting point, it would make sense to go for papers that are already on Wikidata and have a P932 (PMCID) statement. The query for that is
Without the LIMIT clause, this took just 6 s and gave 334628 results, which sounds like a good maximum size for a test set.
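The query text itself is not preserved in this thread. As a rough sketch only, assuming the public Wikidata Query Service endpoint at https://query.wikidata.org/sparql and its predefined `wdt:` prefix, a query for items with a P932 statement could be built and run roughly like this. The helper names `build_pmcid_query` and `run_query` are hypothetical and not part of any ContentMine tool.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Wikidata Query Service endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_pmcid_query(limit=None):
    """Return a SPARQL query for items that have a P932 (PMCID) statement."""
    query = "SELECT ?item ?pmcid WHERE { ?item wdt:P932 ?pmcid . }"
    if limit is not None:
        # LIMIT keeps test runs small; omit it to fetch the full set.
        query += f" LIMIT {int(limit)}"
    return query


def run_query(query, endpoint=WDQS_ENDPOINT):
    """Send the query over HTTP and return the list of result bindings."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        endpoint + "?" + params,
        # Hypothetical User-Agent string; replace with your own contact details.
        headers={"User-Agent": "main-subject-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]


# Usage (needs network access):
# for row in run_query(build_pmcid_query(limit=5)):
#     print(row["item"]["value"], row["pmcid"]["value"])
```

The resulting PMCIDs could then serve as the input list for fetching the corresponding full texts.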
Daniel-Mietchen and I discussed this with the possible outcomes of:
High-level strategy: collect a corpus of Open articles and carry out supervised term analysis of the content, supported by #wikidata-enhanced dictionaries. Articles with a "main topic" which maps onto #Wikidata items (Q\d+) are likely to have many mentions of the main topic. For example, the article http://europepmc.org/articles/PMC2491585 mentions
and the most common terms (Bag of words) are:
We can infer that the main topic of the article is Dengue Virus and antigenicity. This is consistent with the title:
The term "vaccine" occurs 16 times in the main text, whereas "HLA" and "peptide" (the mechanism of vaccination) are emphasised. Corpus of articles:
OK, I've added these to https://www.wikidata.org/wiki/Q24288762#P921 . How can we scale that up? Can you provide a list of the following kind?
Reopening this, as we are still working on it.
A side project could be to identify the main subject(s) for journals; currently, ca. 40k instances of scientific journal do not have any main subject set in Wikidata.
As per https://www.wikidata.org/wiki/Property_talk:P921 , P921 is for the "primary topic of a work", which should have a Wikidata entry.
It doesn't have to be just one; for instance, the article [Causal or not: applying the Bradford Hill aspects of evidence to the association between Zika virus and microcephaly (Q24261170)](https://www.wikidata.org/wiki/Q24261170) currently [has](https://www.wikidata.org/wiki/Q24261170#P921) Zika virus, microcephaly and Bradford Hill criteria.
On Sun, Mar 19, 2017 at 4:41 PM, Stefan Kasberger wrote:
> What do you exactly mean with the term "main subject"?
ContentMine can analyze papers in various ways, including as to what the most salient terms are, e.g. via https://en.wikipedia.org/wiki/Tf%E2%80%93idf .
It would be nice to harvest that to annotate Wikidata items about papers with the property P921 "main subject".
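As a minimal illustration of the tf-idf idea, and not ContentMine's actual implementation, the sketch below ranks the terms of each document in a small toy corpus. The tokenizer, the function names `tokenize` and `tfidf_rank`, and the example sentences are all assumptions made for this example; it uses raw term counts and the plain log(N/df) inverse document frequency.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_rank(docs):
    """Return, for each document, its (term, score) pairs by descending tf-idf."""
    token_lists = [tokenize(doc) for doc in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))

    ranked = []
    for tokens in token_lists:
        tf = Counter(tokens)  # raw term counts within this document
        scores = {term: tf[term] * math.log(n_docs / df[term]) for term in tf}
        # Sort by score (highest first), breaking ties alphabetically.
        ranked.append(sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])))
    return ranked


# Toy corpus, invented for illustration only.
docs = [
    "dengue virus peptide binds HLA and the dengue peptide is antigenic",
    "zika virus and microcephaly",
    "influenza virus vaccine trial",
]
ranking = tfidf_rank(docs)
print(ranking[0][:3])
```

In this toy corpus "virus" appears in every document and therefore scores zero, while "dengue" and "peptide" rank highest for the first document. Mapping such top-ranked terms to Wikidata items would be a separate step before any P921 statement is added.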