
How ReCiter works



Introduction

This article describes, from a functional perspective as opposed to a technical one, how ReCiter retrieves, processes, and suggests articles.

Here's how ReCiter works, in a nutshell:

  1. Create a search in PubMed (e.g., Albert P[au]) based on data stored in the Identity table.
  2. Retrieve candidate records from PubMed and store them in the PubMed article table. If there are too many results, employ different strategies to limit the number of results.
  3. If we have received feedback that a given person has written certain articles in PubMed as stored in the GoldStandard table, look those up as well.
  4. Optional: look up any complementary record in Scopus. Store the record in the ScopusArticle table.
  5. Identify the target author, i.e., determine which of the authors listed on a given article is most likely our person of interest.
  6. Identify similarities between candidate articles to create clusters or groups of articles.
  7. Use a variety of evidence about our person of interest to score each article across different types of evidence.
  8. Create a raw total score, which speaks to the overall confidence that our target person wrote an article. We store this output in the Analysis table unless it's too large, in which case it's stored in S3, AWS's file hosting service.
  9. Map to a standardized score.
  10. When using the API, administrators can select which types of articles to return, e.g., only accepted articles, only articles above a certain score, etc.

Key prerequisites include:

  • Installing the application
  • Populating the Identity table with information about your target author
  • Populating the GoldStandard table with publications, as designated by PMIDs, that you know your target author did or did not author
  • Configuring application.properties

Now it's time to compute suggestions!

Options for the feature generator API

Start with the feature generator API. The feature generator API /reciter/feature-generator/by/uid coordinates the entire retrieval and scoring process. The feature generator API negotiates with other APIs including the ones for identity retrieval, PubMed retrieval, Scopus retrieval (if you so designate), and Gold Standard retrieval. While you can interact with these APIs individually, it is a best practice, at least in a production environment, to allow the Feature Generator API to do this for you.

Think of a feature as a characteristic. With the feature generator API, we can see not only the recommended articles and all their associated metadata (journal, pages, DOI, etc.), but also how strong the evidence or features are for a given suggestion.
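
Here is a minimal sketch of calling the feature generator API from Python. The endpoint path is the one named above; the host, port, query parameter name, and any response field names not quoted elsewhere in this article are assumptions, so check the Using the APIs article for the real contract.

import requests

BASE_URL = "http://localhost:5000"   # assumed local deployment
UID = "paa2013"                      # hypothetical person identifier (uid)

# Call the feature generator API, which coordinates retrieval and scoring.
response = requests.get(
    f"{BASE_URL}/reciter/feature-generator/by/uid",
    params={"uid": UID},             # assumed parameter name
    timeout=300,                     # retrieval plus scoring can take a while
)
response.raise_for_status()

# Print one line per suggested article; these field names are assumptions.
for article in response.json().get("reCiterArticleFeatures", []):
    print(article.get("pmid"), article.get("totalArticleScoreNonStandardized"))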

For more, see the Using the APIs article

Retrieving candidate records from PubMed

Here's how ReCiter retrieves articles:

  1. Retrieve the list of PMIDs, including accepted and rejected records, in the GoldStandard table (goldStandardRetrievalStrategy). Lists of PMIDs for this and other retrieval strategies are stored in the ESearchResult table. Complete versions of articles (independent of any data associated with our targetAuthor) are stored in PubMedArticle. PMIDs serve as the primary key for all article-level objects.

  2. Search PubMed for articles by emails explicitly asserted in the Identity table (emailRetrievalStrategy).

  3. Retrieve all unique forms of identity.name.lastName and identity.name.firstInitial from Identity table for targetAuthor.

  4. Derive additional names in the case of compound surnames. For example: K. Garcia Marquez --> K. Garcia OR K. Marquez OR K. Garcia Marquez. These will be looked up in strict mode (described below).

  5. Sanitize names by converting non-English characters to English, removing suffixes, etc.

  6. Retrieve count of records from PubMed. (lastNameFirstInitialRetrievalStrategy)

  7. Count the number of candidate records returned. Does this count exceed the value set in searchStrategy-lenient-threshold?

  • If yes, we're going into "strict mode", in which we limit our search by additional qualifiers.
  • If no, we use "lenient mode", in which no such qualifiers are used.

For more, see Issue 259

Lenient mode

We do a simple lastName1 firstInitial1 OR lastName2 firstInitial2... search.
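
A minimal sketch of assembling such a lenient-mode term from the name variants, assuming each variant is a (lastName, firstInitial) pair; the exact syntax and field tags (e.g., [au]) ReCiter appends may differ.

def lenient_query(name_variants):
    """Build a lenient-mode search term from (lastName, firstInitial) pairs."""
    return " OR ".join(f"{last} {initial}" for last, initial in name_variants)

print(lenient_query([("Garcia", "K"), ("Marquez", "K"), ("Garcia Marquez", "K")]))
# Garcia K OR Marquez K OR Garcia Marquez K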

Strict mode

Yi Wang is a radiologist at Weill Cornell. At last count, more than 121,000 articles, or about 0.3% of all articles in PubMed, were written by a "Y. Wang." Without modification, searching for this name alone can lead to high computation times and very low precision.

In strict mode, we have our lastName firstInitial search as well as one of several additional qualifiers. For example, for grants, this would be the search: Wang Y[au] AND (HS-17029 OR TR-457)

Here are the additional limiters:

  1. strictRetrievalStrategy-knownRelationships - firstInitial and lastName of knownRelationship
  2. strictRetrievalStrategy-fullName - full name of target author
  3. strictRetrievalStrategy-grants - grant identifiers from identity.grants
  4. strictRetrievalStrategy-institutions - keywords from homeInstitution-keywords as defined in strategy.authorAffiliationScoringStrategy.homeInstitution-keywords in application.properties
  5. strictRetrievalStrategy-departments - departments from identity.departments
  6. strictRetrievalStrategy-secondInitial - first two capital letters in the user's first name or middle name

If the number of results returned for any one of these additional searches exceeds searchStrategy-strict-threshold, as controlled in application.properties, the records are not discarded. We've seen this occur when a person has a common name and their departmental affiliation is Medicine.
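
A sketch of how a strict-mode search term could be assembled, reproducing the grant example above; one such query is built per additional qualifier type, and the exact syntax ReCiter emits may differ.

def strict_query(last_name, first_initial, qualifiers):
    """Combine the lastName firstInitial clause with one set of extra qualifiers."""
    return f"{last_name} {first_initial}[au] AND ({' OR '.join(qualifiers)})"

print(strict_query("Wang", "Y", ["HS-17029", "TR-457"]))
# Wang Y[au] AND (HS-17029 OR TR-457)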

Limit by date

If ONLY_NEWLY_ADDED_PUBLICATIONS was invoked, we limit the search to publications that were added to PubMed since the last time a search was run.

Get complementary data from Scopus (optional)

If Scopus is set up, we store Scopus records in the ScopusArticle table. As with articles stored in PubMedArticle, PMID is the primary key, and these records exist independent of any person-specific data. For this reason, if PersonA and PersonB co-author a lot of papers and you have already looked up PersonA's candidate papers, looking up PersonB should not require creating additional records in ScopusArticle or PubMedArticle for the candidate articles they share.

The advantage of using Scopus is two-fold:

  1. disambiguated organizational affiliations
  2. more likely to have complete names of authors

We have found that using Scopus can improve accuracy by a couple percentage points.

Scopus does have its flaws, including duplicate records, but because we retrieve only some complementary data, we avoid that problem. Also, Scopus sometimes mints new affiliation identifiers when it shouldn't or assigns affiliations at too high a level (the university level rather than the college level). Still, it is better than nothing.

In truth, we haven't fully explored using Scopus rather than Web of Science. One possibility for additional development is to set up ReCiter so that it works with Web of Science. We may be wrong, but it is not clear that Web of Science's organizational disambiguation is on par with Scopus's.

For more, see Scopus configuration

Store data

Retrieved articles are stored in DynamoDB. ReCiter will replace existing articles even if they were retrieved as recently as a minute ago.

Identify target author

We will now try to designate one author per article as our putative targetAuthor.

  1. If Scopus has been configured and we successfully mapped our PubMed article to a Scopus article, we first see if Scopus has indexed a more complete name than the one in PubMed (e.g., Mark Jones vs. M. Jones). This is often the case with older articles. If the Scopus name is more complete, we use it.
  2. Retrieve all the names from identity.primaryName and identity.alternateName.
  3. Of these, discard names that are less complete versions of other names, e.g. Mark T. Jones vs. Mark Jones.
  4. Do a series of checks between the names in the Identity table and those in the Article metadata in the hopes of finding one and only one matching author per article. If you have three distinct names in Identity, you're doing three different sets of checks for each candidate article. The checks go from more to less rigorous.

In cases where a targetAuthor is identified, targetAuthor = TRUE. Else, it is set to FALSE.

For more, see Issue 185

Create clusters

Clustering is useful for accuracy. It allows us to more strongly suggest ArticleB when ArticleA shares features with ArticleB but only ArticleA scores well on an individual basis.

At this point, retrieval is complete. There are one or more strategies in the ESearchResult table.

ReCiter will now retrieve the PMIDs associated with each of the retrieval strategies. If useGoldStandard is set to FOR_TESTING_ONLY, ReCiter will not retrieve PMIDs returned with the goldStandardRetrievalStrategy.

Next, ReCiter assigns each individual article to its own cluster.

When we cluster, we're using the transitive property until each article is in one and only one cluster. Consider this example of clusters:

{A,B} {B,C} {C,D} {D,E} {E,F}

These would all be combined into a single cluster:

{A,B,C,D,E,F}
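
A short sketch of that transitive merge, assuming each cluster is a set of PMIDs; this illustrates the idea rather than ReCiter's implementation.

def merge_clusters(clusters):
    """Keep unioning clusters that share any article until none overlap."""
    clusters = [set(c) for c in clusters]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if clusters[i] & clusters[j]:        # any shared article
                    clusters[i] |= clusters.pop(j)   # union and drop the second cluster
                    merged = True
                    break
            if merged:
                break
    return clusters

print(merge_clusters([{"A", "B"}, {"B", "C"}, {"C", "D"}, {"D", "E"}, {"E", "F"}]))
# [{'A', 'B', 'C', 'D', 'E', 'F'}]  (set ordering may vary)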

For more, see Issue 217.

Tepid clustering

With "tepid clustering", we're looking for features, which are somewhat uncommon (frequency of around 100,000 records or fewer out of a corpus of 30+ million). By themselves, these features are not positively conclusive. But if a certain proportion of features between any two articles share these features, we conclude they were written by the same person, and they are clustered.

The features used in tepid clustering are:

  • journal name
  • co-author names, excluding the targetAuthor and very common co-author names as defined in namesIgnoredCoauthors in application.properties
  • MeSH major terms where the count in the MeSHTerm table is < 100,000; by using MeSH major, we exclude a lot of common terms (e.g., humans, female, neoplasms); one individual at Weill Cornell has published a corpus of papers in which he accounts for a whopping 0.6% of all of PubMed for a given MeSH major topic (e.g., Ronald Crystal and adenoviridae)!
  • Scopus Affiliation ID for targetAuthor

For each cluster, create an array of its features. For example:

{
  clusterId: 1
  journals: Cell;
  MeSH major: Thalassemia, Sunlight, Tamoxifen, Tryptophan, Brain;
  coauthors: Marshall T, Michaels A;
  targetAuthorScopusAffiliationID: 60007997;
}

Compare the overlap between different clusters using some basic algebra described in issue 217. If the score exceeds the score defined in cluster.similarity.threshold.score in application.properties, these articles are combined into a single cluster. Then, we keep combining clusters until there is no overlap.
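
As a stand-in illustration (the exact overlap calculation is the one described in issue 217, and the threshold value below is invented), here is how two clusters' feature sets might be compared using a Jaccard-style ratio:

def feature_overlap(cluster_a, cluster_b):
    """Return the share of features the two clusters have in common (Jaccard)."""
    features_a = cluster_a["journals"] | cluster_a["mesh_major"] | cluster_a["coauthors"]
    features_b = cluster_b["journals"] | cluster_b["mesh_major"] | cluster_b["coauthors"]
    if not features_a or not features_b:
        return 0.0
    return len(features_a & features_b) / len(features_a | features_b)

a = {"journals": {"Cell"}, "mesh_major": {"Thalassemia", "Sunlight"}, "coauthors": {"Marshall T"}}
b = {"journals": {"Cell"}, "mesh_major": {"Tamoxifen", "Sunlight"}, "coauthors": {"Michaels A"}}
SIMILARITY_THRESHOLD = 0.3   # assumed value; the real one is cluster.similarity.threshold.score
print(feature_overlap(a, b) >= SIMILARITY_THRESHOLD)   # True -> merge into one cluster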

Definitive clustering

With definitive clustering, we're looking to combine clusters where features generally occur a few thousand times or fewer in a corpus of 30 million records. Because they occur so infrequently, we merge clusters whenever there is a single piece of shared evidence. No proportion of overlap is required.

Any article that shares any of these features with another article should be in the same cluster as that other article:

  • email
  • grant identifiers excluding papers that mention more grants than listed in clusteringGrants-threshold as defined in application.properties
  • cites or cited by
  • MeSH major where global raw count in MeSH table < 4,000

Scoring the evidence

We will now score each individual article based on the information we know about our target person as stored in the identity table.

Name evidence

The goal of this scoring strategy, as defined in issue 219, is to have a reliable score for how closely any of the names in the Identity table match the targetAuthor's indexed name in the article.

There's a lot of logic here, but here's how it works at a summary level:

  1. Decide whether to use Scopus. We use Scopus when all of the following are true:
  • use.scopus.articles=true as set in application.properties
  • the number of authors in Scopus equals the number of authors in PubMed
  • the length of the given name in Scopus for targetAuthor is greater than that of the forename in PubMed
  2. Preprocess names by removing special characters, suffixes, etc.
  3. Score the last name.
  4. Determine if identity.middleName is available to match against.
  5. Score the first name in cases where identity.middleName is null.
  6. Otherwise, score the first and middle name.

Similar to the way the targetAuthor is identified, name scoring uses both the primary name and any available alternate names; we begin by attempting a match in as rigorous a manner as possible. For example, a full and complete match of identity.primaryName or identity.alternateNames against article.forename scores highly, whereas a match on only the first initial scores relatively poorly. Check out the "ScoreByNameStrategy Score" stanza in application.properties to see all the possible scores used for matching names.

Sample output:

"authorNameEvidence": {
  "institutionalAuthorName": {
    "firstName": "Jochen",
    "firstInitial": "J",
    "lastName": "Buck"
  },
  "articleAuthorName": {
    "firstName": "Jochen",
    "firstInitial": "J",
    "lastName": "Buck"
  },
  "nameScoreTotal": 7.2,
  "nameMatchFirstType": "full-exact",
  "nameMatchFirstScore": 4.2,
  "nameMatchMiddleType": "identityNull-MatchNotAttempted",
  "nameMatchMiddleScore": 1,
  "nameMatchLastType": "full-exact",
  "nameMatchLastScore": 2,
  "nameMatchModifierScore": 0
},

Organizational unit evidence

Departmental, center, divisional, program, and other organizational affiliations are strong signals that a given user authored a candidate publication.

Here's how the departmentScoringStrategy works.

  1. For each targetAuthor, retrieve article.affiliation.

  2. Retrieve org units from identity.departments.

  3. Sanitize by removing commas and dashes. Create aliases if possible by substituting certain other terms like "and".

  4. Add organizational unit aliases in the form of organizational unit synonyms. (Names vary for a variety of reasons, including name changes, misspellings, donor names, abbreviations, truncations, and poor indexing. It may also be that a department at one institution has an analogous department with a different name at another institution.) We do this by looking in strategy.orgUnitScoringStrategy.organizationalUnitSynonym in application.properties. For example, Weill Cornell uses the following synonyms for Otolaryngology:

  • Otolaryngology - Head and Neck Surgery
  • Otolaryngology Head and Neck Surgery
  • Ear Nose & Throat (Otolaryngology)
  • Otolaryngology-Head and Neck Surgery
  • Otolaryngology
  • Otorhinolaryngology
  • Ear Nose and Throat (Otolaryngology)
  • ENT

Such synonyms are separated by a "|" delimiter, while distinct groups of synonyms are separated by the "," delimiter.
  5. We attempt to match between the article affiliation and identity.departments. If an org unit name is long (e.g., Center for Integrative Medicine), then we look for that string (or any of its synonyms or aliases) as a substring of the article affiliation. If it is not long, then it needs to be preceded by "Department of", "Division of", etc., or succeeded by "Department" or "Division". If an org unit is defined as a program, we also look for it to be preceded by "Program of".

  6. Finally, we notice that a lot of authors have an affiliation with the Department of Medicine, so a match on Medicine is modified with a slight decrease in the output score.
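
A simplified sketch of the matching steps above. The synonym expansion and the long-name versus "Department of"/"Division of" rule follow the description; the length cutoff and the specific patterns are assumptions.

def org_unit_matches(org_unit, affiliation, synonyms=()):
    """Check whether an identity org unit (or one of its synonyms) appears in an affiliation."""
    text = affiliation.lower().replace(",", " ").replace("-", " ")
    for name in (org_unit, *synonyms):
        name = name.lower().replace(",", " ").replace("-", " ")
        if len(name) > 20:                       # assumed cutoff for a "long" org unit name
            if name in text:
                return True
        else:                                    # short names need Department/Division context
            patterns = (f"department of {name}", f"division of {name}",
                        f"{name} department", f"{name} division")
            if any(p in text for p in patterns):
                return True
    return False

aff = "Information Technologies & Services Department, Weill Cornell Medicine, New York, NY, USA"
print(org_unit_matches("Information Technologies and Services", aff,
                       synonyms=("Information Technologies & Services",)))   # True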

Sample output where an author matches on three separate known org. units:

"organizationalUnitEvidence": [
  {
    "identityOrganizationalUnit": "Information Technologies and Services",
    "articleAffiliation": "Information Technologies & Services Department, Weill Cornell Medicine, New York, NY, USA; Department of Medicine, Weill Cornell Medicine, New York, NY, USA; Department of Healthcare Policy & Research, Weill Cornell Medicine, New York, NY, USA.",
    "organizationalUnitMatchingScore": 2,
    "organizationalUnitModifierScore": 0
  },
  {
    "identityOrganizationalUnit": "Medicine",
    "articleAffiliation": "Information Technologies & Services Department, Weill Cornell Medicine, New York, NY, USA; Department of Medicine, Weill Cornell Medicine, New York, NY, USA; Department of Healthcare Policy & Research, Weill Cornell Medicine, New York, NY, USA.",
    "organizationalUnitMatchingScore": 2,
    "organizationalUnitModifier": "Medicine",
    "organizationalUnitModifierScore": -1
  },
  {
    "identityOrganizationalUnit": "Healthcare Policy and Research",
    "articleAffiliation": "Information Technologies & Services Department, Weill Cornell Medicine, New York, NY, USA; Department of Medicine, Weill Cornell Medicine, New York, NY, USA; Department of Healthcare Policy & Research, Weill Cornell Medicine, New York, NY, USA.",
    "organizationalUnitMatchingScore": 2,
    "organizationalUnitModifierScore": 0
  }
],

For more, see Issue 229

Email evidence

Here's how the email scoring strategy works:

  1. Get any emails from article.email for targetAuthor. One author may have several emails in article.email.
  2. Get any emails from identity.email. One person may have several emails in identity.email. Some 13% of records in PubMed contain a Gmail email address. Knowing someone's personal emails can be quite useful.
  3. Get domain aliases from application.properties. For Weill Cornell, these are:
@nyp.org
@weill.cornell.edu
@med.cornell.edu
@mail.med.cornell.edu

  4. Look to see if there are any cases where any of the emails from identity are contained within article.affiliation.

  5. Otherwise, look to see if there are any cases where the identifier (uid) + a domain alias is contained within article.affiliation. For example, paa2013@nyp.org does not actually exist, but if an article contained it, that would be considered a positive match. If there is a match, output a score.
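
A sketch of that check, using the Weill Cornell domain aliases listed above; the uid and affiliation string are made up for illustration.

DOMAIN_ALIASES = ["@nyp.org", "@weill.cornell.edu", "@med.cornell.edu", "@mail.med.cornell.edu"]

def email_match(identity_emails, uid, article_affiliation):
    """Return the matching email, trying known emails first, then uid + domain aliases."""
    text = article_affiliation.lower()
    for email in identity_emails:
        if email.lower() in text:
            return email                      # direct match on a known email
    for domain in DOMAIN_ALIASES:
        candidate = f"{uid}{domain}".lower()
        if candidate in text:
            return candidate                  # match on uid + institutional domain alias
    return None

print(email_match(["jobuck@med.cornell.edu"], "jobuck",
                  "Department of Pharmacology, Weill Cornell Medical College, jobuck@med.cornell.edu."))
# jobuck@med.cornell.edu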

Sample output:

"emailEvidence": {
  "emailMatch": "jobuck@med.cornell.edu",
  "emailMatchScore": 40
},

For more, see Issue 223.

Journal category evidence

Authors in certain organizational units are far more likely to publish in journals associated with one or two specialties compared to others. Authors may switch up journals, but there are certain fields where people publish again and again in the same specialty. For example, Weill Cornell faculty in Orthopaedic Surgery are 200 times more likely to author papers in journals assigned to the "Orthopedics" subfield/category (as assigned by ScienceMetrix) compared to random chance. The odds ratio for librarians to publish in Medical Library journals is on the order of 2,000.

We have found that this approach is more powerful than keyword / bag of words matches.

Here's the logic:

  1. Get all ISSNs from article.
  2. Get identity.organizationalUnits for targetAuthor. Include all types including programs.
  3. Look up any department synonyms in departmentSynonym in application.properties.
  4. Does ISSN exist in the ScienceMetrix table?
  • If yes, get journalSubfieldLabel and journalSubfieldID. Go to 5.
  • If no, output a null match.
  5. Go to the ScienceMetrixDepartmentCategory table. Look up any organizational units associated with the category of the journal.
  6. Does any organizationalUnit exist for the subfield in question?
  • If yes, go to 7.
  • If no, that means that there was no match. Output journalSubfieldScore as stored in application.properties.
  7. Output the matching journal category, department, and logOddsRatio multiplied by journalSubfieldFactorScore (as stored in application.properties).
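
A sketch of that lookup. The table contents, the ISSN, the log odds ratio, and the journalSubfieldFactorScore are placeholders (chosen so the example reproduces the 4.64 in the sample output below); the real data live in the ScienceMetrix and ScienceMetrixDepartmentCategory tables and in application.properties.

SCIENCE_METRIX = {"1386-5056": ("Medical Informatics", 36)}             # ISSN -> (subfield label, id)
SUBFIELD_DEPARTMENTS = {36: {"Quality and Medical Informatics": 9.28}}  # subfield id -> {org unit: log odds ratio}
JOURNAL_SUBFIELD_FACTOR = 0.5                                           # assumed journalSubfieldFactorScore
NO_MATCH_SCORE = -1.0                                                   # assumed journalSubfieldScore for no match

def journal_category_score(issns, org_units):
    """Return (subfield label, matching org unit, score) for the first matching ISSN."""
    for issn in issns:
        if issn not in SCIENCE_METRIX:
            continue
        label, subfield_id = SCIENCE_METRIX[issn]
        for unit, log_odds in SUBFIELD_DEPARTMENTS.get(subfield_id, {}).items():
            if unit in org_units:
                return label, unit, log_odds * JOURNAL_SUBFIELD_FACTOR
    return None, None, NO_MATCH_SCORE

print(journal_category_score(["1386-5056"], {"Quality and Medical Informatics"}))
# ('Medical Informatics', 'Quality and Medical Informatics', 4.64)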

Sample output:

"journalCategoryEvidence": {
  "journalSubfieldScienceMetrixLabel": "Medical Informatics",
  "journalSubfieldScienceMetrixID": 36,
  "journalSubfieldDepartment": "Quality and Medical Informatics",
  "journalSubfieldScore": 4.64
},

For more, see Issue 262

Affiliation evidence

With this scoring strategy, we're trying to account for the extent to which affiliation of all authors affects the likelihood a given targetAuthor authored an article. To do this, we need to ask and answer several questions.

  1. Which affiliation(s) are we considering?
  • targetAuthor
  • non-targetAuthor
  2. What type of match is this?
  • explicitly defined for the individual, e.g., Dr. X got an undergraduate degree from Georgetown University, did her residency at Montefiore, etc.
  • explicitly defined for the institution, e.g., Weill Cornell faculty frequently co-author papers with individuals from Hospital for Special Surgery
  • match was not attempted because there was no available affiliation data
  • match was attempted but failed
  3. Which sources are we using to make the match?
  • Scopus (optional) - does institutional disambiguation; provides affiliations as numeric codes (e.g., 60007997)
  • PubMed - affiliations are just strings

The DynamoDB-based InstitutionAfid table contains Scopus institution identifiers matched to strings. These strings correspond to names used in Weill Cornell identity systems. Weill Cornell has 276,666 institutional affiliations in the Identity table, representing 3,861 unique institutions. We've looked up the Scopus institution IDs for the 1,786 institutions that are most often cited as a faculty member's current or historical affiliation. These collectively represent 273,006 affiliations. In other words, ~99% of the time we can predict what the Scopus institution ID should be. Note that a given institution such as Weill Cornell might have multiple institution IDs.

Here's a summary of how this strategy works:

  1. Decide which source to use for targetAuthor. In order to use Scopus, the following must be true:
  • use.scopus.articles=true in application.properties.
  • article has a Scopus affiliation for targetAuthor
  If these conditions are not met, we use the PubMed affiliation. If there is no affiliation at all, we return null.
  2. If Scopus can be used for targetAuthor, here's how we use it to score institutional affiliation(s):
  • Get list of institutions (these are strings) from identity.Institution for target person.
  • Get Scopus institution IDs from homeInstitution-scopusInstitutionIDs and collaboratingInstitutions-scopusInstitutionIDs in application.properties.
  • Use values from identity.Institution to lookup Scopus institutional identifiers in InstitutionAfid table.
  • Get any scopusInstitutionIDs (e.g., 60007997) from article.affiliation for targetAuthor. One targetAuthor may have multiple affiliation IDs.
  • Attempt match between article and identity.
  • Each type of match - individual affiliation, home institution, common collaborator of institution - confers a separate score.
  • While there can be multiple matches, the maximum score returned for this type of match should be 1.
  3. If PubMed can be used for targetAuthor, here's how we use it to score institutional affiliation:
  • Get list of institutions (these are strings) from identity.institutions for person under consideration.
  • Get article.affiliation for targetAuthor.
  • Preprocess. Remove stopwords from institution-Stopwords field in application.properties as well as commas and dashes from article.affiliation and identity.institutions.
  • Attempt match from article.affiliation and identity.institutions.
  • If no match, we try to match against homeInstitution-keywords from application.properties. Here we're looking for cases where the homeInstitution keywords are present in the affiliation string in any order. For example, in Weill Cornell's homeInstitution-keywords field, one of the groups of keywords is "weill|cornell". In order for this to be a match, both terms must be present, in any order and with any case. These are matches: "Cornell Weill Medical College", "The Weill Medical School of Cornell University." These are not matches: "Cornell University", "Cornell Med". (See the sketch after this list.)
  • If there's no match, we attempt a match using keywords associated with collaborating institutions. A common collaborator of Weill Cornell is Rockefeller University, noted as "rockefeller|university" defined at collaboratingInstitutions-keywords. Any affiliation which contains these two keywords in any order is considered a match.
  4. Use Scopus to score the affiliations of the non-targetAuthors. Here are the steps for scoring this:
  • Retrieve all scopusInstitutionIDs (e.g., 60007997) from article.affiliation for all nonTargetAuthors.
  • Retrieve Scopus home institution IDs from homeInstitution-scopusInstitutionIDs and identity.institution for targetAuthor.
  • Also retrieve Scopus collaborating institution IDs from collaboratingInstitutions-scopusInstitutionIDs in application.properties.
  • Now we look for overlap between affiliation IDs associated with the article and those associated with our targetAuthor or the home institution or the collaborating institution.
  • The resulting score is a proportion of stated institutional affiliations that we predicted from the above retrievals, with individual and home institutional affiliations scoring higher than common collaborating institutions, and non-matching institutions scoring worst of all.

Note that we don't currently score the affiliations of non-targetAuthors in PubMed.
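
Here is the sketch referenced in step 3: a keyword group such as "weill|cornell" matches an affiliation when every term appears somewhere in the string, in any order and any case. The same check applies to collaboratingInstitutions-keywords.

def keyword_group_matches(keyword_group, affiliation):
    """Return True when every '|'-separated term appears in the affiliation string."""
    text = affiliation.lower()
    return all(term in text for term in keyword_group.split("|"))

for aff in ["Cornell Weill Medical College",
            "The Weill Medical School of Cornell University",
            "Cornell University",
            "Cornell Med"]:
    print(aff, "->", keyword_group_matches("weill|cornell", aff))
# The first two print True; the last two print False.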

Sample output:

"affiliationEvidence": {
  "scopusTargetAuthorAffiliation": [
    {
      "targetAuthorInstitutionalAffiliationSource": "SCOPUS",
      "targetAuthorInstitutionalAffiliationIdentity": "Weill Cornell Medical College",
      "targetAuthorInstitutionalAffiliationArticleScopusLabel": "Weill Cornell Medical College",
      "targetAuthorInstitutionalAffiliationArticleScopusAffiliationId": 60007997,
      "targetAuthorInstitutionalAffiliationMatchType": "POSITIVE_MATCH_INDIVIDUAL",
      "targetAuthorInstitutionalAffiliationMatchTypeScore": 3
    }
  ],
  "pubmedTargetAuthorAffiliation": {
    "targetAuthorInstitutionalAffiliationArticlePubmedLabel": "Department of Pharmacology, Weill Cornell Medical College, New York, New York, USA.",
    "targetAuthorInstitutionalAffiliationMatchTypeScore": 0
  },
  "scopusNonTargetAuthorAffiliation": {
    "nonTargetAuthorInstitutionalAffiliationSource": "SCOPUS",
    "nonTargetAuthorInstitutionalAffiliationMatchKnownInstitution": [
      "Weill Cornell Medical College, 60007997, 7"
    ],
    "nonTargetAuthorInstitutionalAffiliationMatchCollaboratingInstitution": [
      "Rockefeller University, 60026827, 3"
    ],
    "nonTargetAuthorInstitutionalAffiliationScore": 1.59
  }
},

For more, see Issue 47.

Relationship evidence

Institutions generally have a wealth of data that speaks to the different kinds of working relationships target authors have with their colleagues. When a candidate article's co-author has the same name as one of these individuals, particularly if the name is verbose, that immediately increases the likelihood that our target person wrote the candidate article.

At Weill Cornell Medicine, we have ready access to the following types of relationships (the value in parentheses is the enumerated value used in the JSON output):

  • Shared organizational unit (HR): two people work in the same organizational unit; Weill Cornell chooses to load these data only for org units that have fewer than 200 individuals
  • Manager and managee (MANAGER or REPORT): one person supervises another; if a subordinate's manager appears as the senior author, the odds of authorship further increase
  • Mentor and mentee (MENTOR or MENTEE): one person serves as another's mentor; if a mentee's mentor appears as the senior author, the odds of authorship further increase
  • Co-investigator (CO_INVESTIGATOR): two individuals are listed together on the same grant; we only load grants where there are fewer than 100 co-investigators

Here's how ReCiter uses this evidence:

  1. Get all authors and their respective ranks from article where targetAuthor=FALSE.
  2. Get all instances of identity.knownRelationships.
  3. First try to do a verbose first name match against these authors (e.g., Jones, James).
  4. If a verbose name is not available, try a truncated match (e.g., Jones, J).
  5. If the matching relationship is a mentor or manager and in the senior author position, we add a relationshipMatchModifier-Mentor score to the relationship score.
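
A simplified sketch of that matching, assuming co-authors and known relationships are plain dictionaries; the scores themselves come from application.properties and are omitted here.

def match_relationships(coauthors, known_relationships):
    """Match identity.knownRelationships against the article's non-target authors."""
    matches = []
    for rel in known_relationships:
        for author in coauthors:                          # authors where targetAuthor=FALSE
            if rel["lastName"].lower() != author["lastName"].lower():
                continue
            if author.get("firstName", "").lower() == rel["firstName"].lower():
                matches.append((rel["lastName"], rel["firstName"], "verbose"))
            elif author.get("firstName", "")[:1].lower() == rel["firstName"][:1].lower():
                matches.append((rel["lastName"], rel["firstName"], "firstInitial"))
    return matches

coauthors = [{"firstName": "Giovanni", "lastName": "Manfredi"}]
known = [{"firstName": "Giovanni", "lastName": "Manfredi"}]
print(match_relationships(coauthors, known))
# [('Manfredi', 'Giovanni', 'verbose')]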

Sample output:

"relationshipEvidence": [
  {
    "relationshipName": {
      "firstName": "Giovanni",
      "firstInitial": "G",
      "lastName": "Manfredi"
    },
    "relationshipType": [
      "CO_INVESTIGATOR"
    ],
    "relationshipMatchType": "verbose",
    "relationshipMatchingScore": 2.2,
    "relationshipVerboseMatchModifierScore": 0.6,
    "relationshipMatchModifierMentor": 0,
    "relationshipMatchModifierMentorSeniorAuthor": 0,
    "relationshipMatchModifierManager": 0,
    "relationshipMatchModifierManagerSeniorAuthor": 0
  },
  {
    "relationshipName": {
      "firstName": "Hannes",
      "firstInitial": "H",
      "lastName": "Buck"
    },
    "relationshipType": [
      "HR"
    ],
    "relationshipMatchType": "verbose",
    "relationshipMatchingScore": 2.2,
    "relationshipVerboseMatchModifierScore": 0.6,
    "relationshipMatchModifierMentor": 0,
    "relationshipMatchModifierMentorSeniorAuthor": 0,
    "relationshipMatchModifierManager": 0,
    "relationshipMatchModifierManagerSeniorAuthor": 0
  },
  {
    "relationshipName": {
      "firstName": "Lonny",
      "firstInitial": "L",
      "middleName": "R.",
      "middleInitial": "R",
      "lastName": "Levin"
    },
    "relationshipType": [
      "CO_INVESTIGATOR",
      "HR"
    ],
    "relationshipMatchType": "verbose",
    "relationshipMatchingScore": 2.2,
    "relationshipVerboseMatchModifierScore": 0.6,
    "relationshipMatchModifierMentor": 0,
    "relationshipMatchModifierMentorSeniorAuthor": 0,
    "relationshipMatchModifierManager": 0,
    "relationshipMatchModifierManagerSeniorAuthor": 0
  }
],

For more, see issue 226.

Grant evidence

When known NIH grants for a given person are indexed in candidate articles, the likelihood that the target person authored the paper increases. Here's how this strategy works:

  1. Get all grant identifiers from identity.grants and all grants, in post-processed form, from article.grants. A given article may have N grants.
  2. Look for cases where there is overlap between article.grants and identity.grants.
  3. In cases where there is overlap, we output the grant from article.grants and identity.grants.
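
A sketch of the overlap check. The normalization shown, reducing each grant string to (institute code, serial number) pairs, is an assumption that happens to cover the sample output below; ReCiter's post-processing may differ.

import re

def normalize(grant):
    """Return the set of (letters, number) pairs found in a grant string."""
    return {(m.group(1).upper(), int(m.group(2)))
            for m in re.finditer(r"([A-Za-z]{2,})[-\s]?0*(\d+)", grant)}

def grant_matches(identity_grants, article_grants):
    """Return (identity grant, article grant) pairs that share a normalized identifier."""
    return [(ig, ag) for ig in identity_grants for ag in article_grants
            if normalize(ig) & normalize(ag)]

print(grant_matches(["GM-62328", "NS-55255"], ["R01 GM062328", "R01 NS055255"]))
# [('GM-62328', 'R01 GM062328'), ('NS-55255', 'R01 NS055255')]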

Sample output:

"grantEvidence": {
  "grants": [
    {
      "institutionGrant": "GM-62328",
      "articleGrant": "R01 GM062328",
      "grantMatchScore": 3
    },
    {
      "institutionGrant": "GM-107442",
      "articleGrant": "R01 GM107442",
      "grantMatchScore": 3
    },
    {
      "institutionGrant": "NS-55255",
      "articleGrant": "R01 NS055255",
      "grantMatchScore": 3
    }
  ]
},

For more, see issue 225.

Education year evidence

Scholars generally don't author papers more than a certain number of years prior to receiving their bachelor's degree, or a different number of years prior to their first doctoral degree. We have noticed that the pattern around doctoral year discrepancy has changed in the last couple of decades. Starting around 1998, authors with shorter careers, as benchmarked against the year of their doctoral degree, are more likely to author papers earlier in their careers. ReCiter accommodates all of the above.

This strategy is especially useful for current students; irrespective of other evidence, the date of publication alone rules out roughly 90% of the articles in PubMed for a current student.

Here, we grab years of degree from identity.degreeyear.bacheloryear and identity.degreeyear.doctoralyear, and compare them to the year of publication. In cases where an article is published too early relative to either degree year, the article receives one or two penalties: discrepancyDegreeYear-BachelorScore or discrepancyDegreeYear-DoctoralScore.
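
A simplified sketch of the penalty, assuming a straightforward "earlier than the degree year" check; ReCiter allows a certain number of years before each degree, and the penalty values below are assumptions standing in for discrepancyDegreeYear-BachelorScore and discrepancyDegreeYear-DoctoralScore.

BACHELOR_PENALTY = -8   # assumed value of discrepancyDegreeYear-BachelorScore
DOCTORAL_PENALTY = -8   # assumed value of discrepancyDegreeYear-DoctoralScore

def education_year_score(article_year, bachelor_year=None, doctoral_year=None):
    """Penalize articles published before the known degree years (simplified)."""
    score = 0
    if bachelor_year and article_year < bachelor_year:
        score += BACHELOR_PENALTY
    if doctoral_year and article_year < doctoral_year:
        score += DOCTORAL_PENALTY
    return score

print(education_year_score(1983, bachelor_year=2008, doctoral_year=2014))   # -16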

Sample output:

"educationYearEvidence": {
  "identityDoctoralYear": 2014,
  "identityBachelorYear": 2008,
  "articleYear": 1983,
  "discrepancyDegreeYearBachelor": -25,
  "discrepancyDegreeYearBachelorScore": -8,
  "discrepancyDegreeYearDoctoral": -31,
  "discrepancyDegreeYearDoctoralScore": -8
},

When we load degree year data into Identity for students, we load a predicted year for when they will receive their doctoral degree.

For more, see issue 224.

Person type evidence

Users with some person types (e.g., full-time faculty) are more likely to author a paper; users with other person types (e.g. MD students) are less likely. This scoring strategy provides a flexible method to upweight or downweight certain person types. There are probably certain patterns for tenure track faculty vs. instructors that can be explored.

We get person types for our target individual from identity.personTypes.

Then we look to see if any person types and their respective weights are listed in application.properties. If they are, we output a person type score for each article for each matching person type.

Sample output:

"personTypeEvidence": {
  "personType": "academic-faculty-weillfulltime",
  "personTypeScore": 2
},

For more, see issue 227.

Article count evidence

With article count evidence, we reward candidate articles when few articles were retrieved and penalize cases in which a lot of articles were retrieved. This is consistent with Bayesian insights about probability.

To do this for each article, we need three values:

  • countArticlesRetrieved - the count of articles actually retrieved; note that if we're using the strict lookup strategy, this count reflects only what was retrieved, so we need not say, for example, that yiwang has 120,000 publications
  • articleCountThresholdScore, as stored in application.properties, e.g., 800
  • articleCountWeight, as stored in application.properties, e.g., 200

articleCountScore for each article is equal to: (countArticlesRetrieved - articleCountThresholdScore) / articleCountWeight
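
The same formula as a function; the default threshold and weight below are just the example values mentioned above, and the real ones come from application.properties.

def article_count_score(count_articles_retrieved, threshold=800, weight=200):
    """Score an article based on how many candidate articles were retrieved overall."""
    return (count_articles_retrieved - threshold) / weight

print(article_count_score(1000))   # 1.0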

Sample output:

"articleCountEvidence": {
  "countArticlesRetrieved": 681,
  "articleCountScore": 0.2975
},

See more in Issue 228.

Average clustering evidence

The last step before computing a final raw score is to compute the averageClusteringScore.

With averageClusteringScore, we take advantage of the fact that articles in the same cluster are more likely to be written by the same person. Issue 232 describes the functionality in detail, but here's how it works:

  1. We take an average raw score of each cluster.

  2. If useGoldStandard = AS_EVIDENCE and an article is accepted or rejected in the gold standard, we add acceptedArticleScore or rejectedArticleScore to the individual articles. This is done only for the purpose of computing the cluster score and is not reflected in totalArticleScoreNonStandardized.

  3. If the cluster score (clusterScoreAverage) is higher than that of the target article (totalArticleScoreWithoutClustering), then the target article's raw score is increased. If it's lower, the target article's raw score is decreased. clusterScore-Factor located in application.properties controls the extent to which this is the case.

  4. Clustering is generally reliable; in testing, it seems to work > 90% of the time. When it's not working as intended, it can make matters worse. For this reason, we also compute a "cluster reliability score." With a cluster reliability score, we're looking to gauge the extent to which a target author's verbose first name where it exists is consistent within a given cluster. For example, look at the first names of the targetAuthor for articles in this cluster. How valid is this cluster?

firstName=[RaeKwon]
firstName=[RaeKwon]
firstName=[RaeKwon]
firstName=[RaeKwon]
firstName=[RaeKwon]
firstName=[RK]
firstName=[RockBum]
firstName=[RockBum]
firstName=[RockBum]
firstName=[RockBum]
firstName=[RockBum]
firstName=[RulBin]
firstName=[RyeoJin]
firstName=[RyeoJin]
firstName=[RyoonHo]

Not valid at all. For this sub-strategy, we look only at names that contain lowercase letters (i.e., verbose first names rather than bare initials). Where first names are inconsistent, the cluster reliability score quickly decays. The above cluster has 14 such records, and only 5 of them share the same first name. As far as clusters go, that's a dud. The cluster reliability score, out of a maximum of 1, would be 0.0455, a reduction of more than 95%! The cluster reliability score is then multiplied by the clusterScore-Factor to control the overall extent to which an article is affected by the scores of other members of its cluster.

As you can imagine, the feedback authors provide helps the system more appropriately score articles in the same cluster. This is why feedback on one article can affect the score of another article, and why it's important not only to select the correct articles but also to score them appropriately. Score an article too high, and every other member of that cluster gets too high a score. Score it too low, and every other member of that cluster gets too low a score.

Sample output:

"averageClusteringEvidence": {
  "totalArticleScoreWithoutClustering": 60.3,
  "clusterScoreAverage": 27.95,
  "clusterReliabilityScore": 0.97,
  "clusterScoreModificationOfTotalScore": -20.63
}

Map article to a standardized score

We now have a final raw score for each candidate article, which ReCiter stores as totalArticleScoreNonStandardized. This value is not user-friendly and may change both as we tweak the constants in application.properties and as we identify additional identity data to populate in DynamoDB.

To insulate users from changing raw scores and their non-intuitive range ("How good is 12.1?"), we map the raw score to a 1-10 scale. Articles with scores between the 1st and 2nd terms in standardizedScoreMapping (as stored in application.properties) receive a score of 1. Articles with scores between the 2nd and 3rd terms receive a score of 2. And so on.
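
A sketch of the mapping, assuming standardizedScoreMapping is an ordered list of boundary values; the boundaries below are invented for illustration.

import bisect

SCORE_BOUNDARIES = [-10, -5, 0, 2, 4, 6, 8, 10, 15, 25, 1000]   # assumed standardizedScoreMapping terms

def standardized_score(raw_score):
    """Map a raw score onto the 1-10 scale using the boundary list."""
    index = bisect.bisect_right(SCORE_BOUNDARIES, raw_score)
    return max(1, min(10, index))

print(standardized_score(12.1))   # 12.1 falls between the 8th and 9th terms -> 8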

Some people who use ReCiter have asked to see a confidence percentage. Yes, that would be nice, but it's not valid. Absent more analysis, we simply can't conclude a certain score corresponds to a certain accuracy percentage.

For more, see Issue 233.

Compute accuracy

There are two types of assertions a user can make: ACCEPTED and REJECTED. These are recorded in the GoldStandard table. There's also the possibility of a NULL assertion. Note that we have discovered that a number of false positives are cases where the article has no feedback in the GoldStandard table simply because it was recently published.

The Feature Generator API will return all suggestions for the target person where the suggested articles have a score meeting or exceeding the totalStandardizedArticleScore threshold. If filterByFeedback=ALL, ReCiter also includes rejected articles. Including rejected articles is useful for display in the user interface, especially when we can't count on the only person giving feedback being the target person themself.

The common practice for testing diagnostic/recommendation systems is to divide the recommendations into one of four categories:

  • True positive - ReCiter suggests an article; the article has been accepted in the Gold Standard
  • False positive - ReCiter suggests an article; the article is either explicitly rejected or receives no feedback in the Gold Standard
  • True negative - ReCiter fails to suggest an article; the article is either explicitly rejected or receives no feedback in the Gold Standard
  • False negative - ReCiter fails to suggest an article; the article has been accepted in the Gold Standard

Now we can calculate recall, precision, and overall accuracy.

  • Recall = (True Positives) / (True Positives + False Negatives)
  • Precision = (True Positives) / (True Positives + False Positives)
  • Overall accuracy = (Recall + Precision) / 2
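
For clarity, here are the three measures computed from the four counts; the example numbers are made up.

def accuracy_measures(tp, fp, tn, fn):
    """Return (recall, precision, overall accuracy) from the four counts."""
    # tn is not used by recall or precision, but it is one of the four counts.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision, (recall + precision) / 2

recall, precision, overall = accuracy_measures(tp=90, fp=10, tn=400, fn=15)
print(round(recall, 3), round(precision, 3), round(overall, 3))   # 0.857 0.9 0.879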

People sometimes ask what ReCiter's accuracy is. It depends. If we measure ReCiter against longstanding faculty at Weill Cornell, where we have a rich set of identity data and a lot of publications with common features to analyze, it does relatively well. For new users or those with short careers, it does worse.

Additionally, take someone like Yi Wang, a faculty member at Weill Cornell. Wang Y returns 120,000 publications. Suppose we correctly identify only two publications and erroneously recommend 40 publications for this researcher. Do we count some 119,900 publications among the true negatives? Doing so would push the accuracy figures above 99.9% and paint a misleading picture. In this case, we only count publications from articles ReCiter actually considered, unless they are false negatives.

Output results

ReCiter will now output the following data:

  • Precision, recall, and overall accuracy
  • Article metadata including: PMID, publication date, article title, journal title, ISSN, DOI, and list of authors with the putative targetAuthor marked as true
  • Scores for each article and each evidence type
  • A raw score for each article as well as a standardized score

...Are you serious? Did you really read the whole article?! The first person who mentions this message will get a crisp new dollar bill from Paul.
