
Training data for original model #26

Open
hasan-sayeed opened this issue Aug 19, 2021 · 4 comments

@hasan-sayeed

I was wondering, is the training data used for the original model available anywhere, since the Matscholar API is currently not available?

@jdagdelen
Contributor

jdagdelen commented Aug 19, 2021 via email

@hasan-sayeed
Author

Could you please share the code you used to query the APIs and filter the abstracts, as described in the Methods section of the paper?

@jdagdelen
Contributor

jdagdelen commented Aug 20, 2021

This is still an active area of development for us, so we aren't able to release that yet. In any case, our code is very specific to our infrastructure at LBL (database config, etc.) and probably wouldn't be that useful to you. However, I'm happy to give you an overview of how we went about it and point you to some libraries/resources that can help.

Our code uses the pybliometrics library to connect to the ScienceDirect API. We constructed a list of the journals we were interested in, along with the years they were in service, from the spreadsheet published by Elsevier every year (here is the list for journals on ScienceDirect). We then split this list into journal-year pairs and queried the ScienceDirect API for those parameters. After that, we processed the entries to make sure everything had the same metadata (DOI, authors, etc.). We hand-labeled a number of abstracts for relevance (I think it was 1000) and then trained a relevance classifier, which I believe used a bag-of-words featurization.
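
A rough, illustrative sketch of the journal-year querying step (not our actual code): it assumes an Elsevier API key is already configured for pybliometrics, the journal names and years are placeholders, and `ScopusSearch` is used here as a stand-in for the ScienceDirect search client, whose class names vary between pybliometrics versions.

```python
# Hypothetical sketch: query abstracts per (journal, year) pair with pybliometrics.
# Assumes an Elsevier API key has been configured (pybliometrics prompts on first use).
from pybliometrics.scopus import ScopusSearch

# Placeholder journal-year pairs; the real list came from Elsevier's yearly
# spreadsheet of journals and their years of service.
journal_years = [
    ("Journal of Alloys and Compounds", 2015),
    ("Acta Materialia", 2016),
]

records = []
for journal, year in journal_years:
    query = f'SRCTITLE("{journal}") AND PUBYEAR IS {year}'
    search = ScopusSearch(query)
    for r in search.results or []:
        # Normalize every entry to the same metadata schema (DOI, title, authors, abstract).
        records.append({
            "doi": r.doi,
            "title": r.title,
            "first_author": r.creator,
            "journal": journal,
            "year": year,
            "abstract": r.description,
        })

print(f"Collected {len(records)} entries")
```

And a minimal sketch of the relevance-classification step, assuming the hand labels are simple (abstract, 0/1) pairs; the logistic-regression choice is just an example on top of the bag-of-words featurization, and the abstracts below are placeholders.

```python
# Hypothetical sketch: bag-of-words relevance classifier over hand-labeled abstracts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Placeholder hand labels; in practice this would be the ~1000 labeled abstracts.
labeled_abstracts = [
    "We synthesized a novel perovskite oxide and measured its ionic conductivity.",
    "Thermal stability of a Ni-based superalloy was studied by calorimetry.",
    "This editorial summarizes the conference program and award winners.",
    "Corrigendum to a previously published article.",
]
labels = [1, 1, 0, 0]  # 1 = relevant materials abstract, 0 = not relevant

clf = Pipeline([
    ("bow", CountVectorizer(stop_words="english")),  # bag-of-words features
    ("model", LogisticRegression(max_iter=1000)),    # simple linear classifier
])
clf.fit(labeled_abstracts, labels)  # in practice, hold out a validation split

# Predict relevance for new, unlabeled abstracts.
print(clf.predict(["Band gap engineering in doped zinc oxide thin films."]))
```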

@michaeljtobias

Regarding the hand labels, I see there is a dois.txt file and also a relevant_dois.json. Is it correct to assume that dois.txt is the complete set and relevant_dois.json contains the entries predicted as relevant by the classifier trained on the 1000 hand labels? In that case, would it be possible to provide a table of those 1000 hand labels so I could attempt to recreate the same classifier?
