This clustering strategy should exist in parallel with the existing strategy so that we can assess its performance and potentially improve it later.
1. Retrieve candidate articles from esearchresult
Instructions
If mode=evidence, retrieve all articles from esearchresult
If mode=testing, retrieve all articles from esearchresult except those returned by goldStandardRetrievalStategy. Only use those.
2. Assign each individual article to its own cluster.
3. "Tepid Clustering" - merge articles into the same clusters in cases where they share a certain proportion of common features
Theory: these features generally occur fewer than 100,000 times in a corpus of 30 million records. We will use these to merge articles into a single cluster, but only when they are shared a certain proportion of the time.
General instructions: compare clusters to each other using the below features. If their similarity exceeds some threshold, combine the clusters.
Rationale: Because these features may occur more often by chance, we do not automatically combine the clusters if they share the feature. Instead the cluster-cluster comparison needs to meet or exceed a scoring threshold (which is described below).
3a. Determine if name match is plausible. (ON HOLD)
When doing ANY clustering (including "definitive clustering" below), we should first check whether it's plausible that the same author wrote both papers. We're trying to make a simple determination: is a pair of articles eligible to be combined during clustering? The following logic should apply to both tepid and definitive clustering.
1. Do both articles have targetAuthor=TRUE assigned for at least one of their authors?
   - If yes, go to step 2.
   - If no, the pair of articles is eligible to cluster.
2. Is length(article1.forename) = 2 characters, and length(article2.forename) = 2 characters?
   - If yes, go to step 3.
   - If no, go to step 4.
3. Does article1.forename = article2.forename?
   - If yes, the pair of articles is eligible to cluster.
   - If no, the pair of articles is NOT eligible to cluster.
4. Is length(article1.forename) > 3 characters, and length(article2.forename) > 3 characters?
   - If yes, go to step 5.
   - If no, the pair of articles is eligible to cluster.
5. Check whether any of the following conditions is true. (One will suffice.)
   - forename1 = forename2
   - 4 consecutive characters of forename1 overlap with forename2
   - Levenshtein distance of < 2 between forename1 and forename2 (e.g., AhRum vs. AhReum)
   Is one of the above conditions true?
   - If yes, the pair of articles is eligible to cluster.
   - If no, the pair of articles is NOT eligible to cluster.
Test case: mcr2004, 27741972 (M. Carrington) should not be clustered with 27631718 (Christopher M).
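The eligibility steps above could be sketched roughly as follows. This is only an illustration, not existing code: the function names are hypothetical, a forename of `None` is assumed to mean "no author with targetAuthor=TRUE on this article", and the case-insensitive comparison is an assumption the steps do not state.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def shares_run(a: str, b: str, n: int = 4) -> bool:
    """True if any n consecutive characters of a also appear in b."""
    return any(a[i:i + n] in b for i in range(len(a) - n + 1))


def eligible_to_cluster(forename1: Optional[str], forename2: Optional[str]) -> bool:
    # Step 1: without a target author on both articles, nothing to compare.
    if forename1 is None or forename2 is None:
        return True
    # Assumption: compare case-insensitively.
    f1, f2 = forename1.lower(), forename2.lower()
    # Steps 2 and 3: two-character forenames must match exactly.
    if len(f1) == 2 and len(f2) == 2:
        return f1 == f2
    # Step 4: if either forename is 3 characters or shorter, allow clustering.
    if not (len(f1) > 3 and len(f2) > 3):
        return True
    # Step 5: any one of the three conditions suffices.
    return f1 == f2 or shares_run(f1, f2) or levenshtein(f1, f2) < 2
```

Under these assumptions, `eligible_to_cluster("AhRum", "AhReum")` is eligible via the Levenshtein condition, while two long, unrelated forenames are not.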
3b. Identify features of each cluster.
Feature: journal name
Identify journal title of all articles in a given cluster
Feature: co-author name
Identify the lastName, firstInitial of all authors in a given cluster where targetAuthor=FALSE
Exclude the following common author names (lastName, firstInitial):
Y. Wang
J. Wang
J. Smith
S. Kim
S. Lee
J. Lee
Feature: MeSH major term
Identify all MeSH major terms in a given cluster where count of that MeSH term is < 100,000 in MeSH table.
Feature: Scopus Affiliation ID for targetAuthor
3c. Create arrays for each cluster
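A minimal sketch of how the per-cluster feature arrays might be built from the journal, co-author, and MeSH major rules in 3b. The article dictionary layout, the `mesh_counts` lookup, and the function name are all hypothetical; terms missing from the lookup are skipped, which is an assumption.

```python
# Common co-author names to exclude, as (lastName, firstInitial).
COMMON_NAMES = {
    ("Wang", "Y"), ("Wang", "J"), ("Smith", "J"),
    ("Kim", "S"), ("Lee", "S"), ("Lee", "J"),
}
MESH_MAX_COUNT = 100_000  # tepid-clustering cutoff for MeSH major terms


def cluster_features(articles, mesh_counts):
    """Collect features, by type, for all articles in one cluster."""
    features = {"journal": set(), "coauthor": set(), "mesh_major": set()}
    for article in articles:
        features["journal"].add(article["journal"])
        for author in article["authors"]:
            if author["targetAuthor"]:
                continue  # only co-authors count as a feature
            name = (author["lastName"], author["firstInitial"])
            if name not in COMMON_NAMES:
                features["coauthor"].add(name)
        for term in article["mesh_major"]:
            # Assumption: terms absent from the count table are ignored.
            if term in mesh_counts and mesh_counts[term] < MESH_MAX_COUNT:
                features["mesh_major"].add(term)
    return features
```

Keeping the features in a dictionary keyed by type makes it straightforward to compare like with like later, so a journal named Brain never matches a MeSH major named Brain.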
3d. Calculate the clusterClusterSimilarityScore between ALL pairs of different clusters
So, if you had 4 clusters: A, B, C, D, you would need to calculate the following cluster-cluster similarity scores. (As you'll see, the order of the cluster comparisons, A-B vs. B-A, doesn't matter.):
A-B
A-C
A-D
B-C
B-D
C-D
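As a quick illustration, the unordered pairs listed above can be enumerated with `itertools.combinations` from the Python standard library. The helper name is hypothetical.

```python
from itertools import combinations


def cluster_pairs(cluster_ids):
    """Return every unordered pair of distinct clusters exactly once."""
    return list(combinations(cluster_ids, 2))


pairs = cluster_pairs(["A", "B", "C", "D"])
print(pairs)  # six pairs: A-B, A-C, A-D, B-C, B-D, C-D
```

For n clusters this yields n * (n - 1) / 2 comparisons, so the cost grows quadratically with the number of clusters.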
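There are three variables for each clusterCluster comparison: totalItemsCluster1, totalItemsCluster2, and overlapCluster1Cluster2.
Let's do a sample calculation. Compute those variables: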
totalItemsCluster1 = 1 + 5 + 2 + 1 = 9
totalItemsCluster2 = 1 + 2 + 3 + 1 = 7
overlapCluster1Cluster2 = 6
/* Notes:
- Overlap is done only between types, i.e., a journal (e.g., Brain) can't match with a MeSH major (e.g., Brain). If one of the two articles doesn't have a feature (e.g., MeSH major), neither article's features are included in the matching.
- Overlap between institutions counts a maximum of one point even if one target author has multiple affiliations and another has one affiliation.
*/
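Compute the clusterClusterSimilarityScore as per this formula:
clusterClusterSimilarityScore = (overlapCluster1Cluster2^2) / (totalItemsCluster1 * totalItemsCluster2)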
Let's figure out clusterClusterSimilarityScore in this example:
(6^2) / (9 * 7) = 0.57
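A rough sketch of the score calculation, assuming each cluster's features are stored as sets keyed by feature type. The type names, the `capped_types` parameter, and the sample values are hypothetical; the sample is constructed only to reproduce the totals 9, 7, and 6 used above. Feature types missing from either cluster are skipped, and overlap for capped types (affiliation here) counts at most one point, following the notes.

```python
def cluster_similarity(c1, c2, capped_types=("affiliation",)):
    """Score = overlap^2 / (totalItemsCluster1 * totalItemsCluster2)."""
    overlap = total1 = total2 = 0
    for ftype in c1.keys() & c2.keys():
        a, b = c1[ftype], c2[ftype]
        if not a or not b:
            continue  # a type missing on either side is left out entirely
        total1 += len(a)
        total2 += len(b)
        shared = len(a & b)  # matches only happen within the same type
        overlap += min(shared, 1) if ftype in capped_types else shared
    if total1 == 0 or total2 == 0:
        return 0.0
    return overlap ** 2 / (total1 * total2)


# Hypothetical clusters with 9 and 7 items and 6 items in common.
cluster1 = {
    "journal": {"Brain"},
    "coauthor": {"Roe B", "Poe C", "Moe D", "Noe E", "Zoe F"},
    "mesh_major": {"Term 1", "Term 2"},
    "affiliation": {"aff-1"},
}
cluster2 = {
    "journal": {"Brain"},
    "coauthor": {"Roe B", "Poe C"},
    "mesh_major": {"Term 1", "Term 2", "Term 3"},
    "affiliation": {"aff-1"},
}
score = cluster_similarity(cluster1, cluster2)
print(round(score, 2))
```

Whether repeated values within a cluster (for example, three articles in the same journal) should count once or several times is not settled here; using sets counts them once.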
3e. Compare against threshold
Set clusterClusterSimilarityScoreThreshold in application.properties to be 0.2. (We can change this if it's too aggressive. It's actually a bit high perhaps.)
If clusterClusterSimilarityScore > clusterClusterSimilarityScoreThreshold, merge clusters.
Example
Suppose there are five clusters. We want to measure similarity between all of these:
- Cluster 1 vs. Cluster 2 - score = 0.5
- Cluster 1 vs. Cluster 3 - score = 0.1
- Cluster 1 vs. Cluster 4 - score = 0.1
- Cluster 1 vs. Cluster 5 - score = 0.0
- Cluster 2 vs. Cluster 3 - score = 0.1
- Cluster 2 vs. Cluster 4 - score = 0.6
- Cluster 2 vs. Cluster 5 - score = 0.0
- Cluster 3 vs. Cluster 4 - score = 0.1
- Cluster 3 vs. Cluster 5 - score = 0.5
- Cluster 4 vs. Cluster 5 - score = 0.0
Identify clusterClusterSimilarityScores that exceed our threshold:
{1,2}
{2,4}
{3,5}
Combine clusters until there is no overlap.
{1,2,4}
{3,5}
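The "combine until there is no overlap" step amounts to finding connected groups among the pairs that exceed the threshold. Below is a minimal union-find sketch under that reading; the function name and input shape are hypothetical, and scores are not recomputed after each merge.

```python
def merge_clusters(scores, threshold=0.2):
    """scores maps (clusterA, clusterB) -> clusterClusterSimilarityScore."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        root_a, root_b = find(a), find(b)
        if score > threshold and root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted(sorted(g) for g in groups.values())


example_scores = {
    (1, 2): 0.5, (1, 3): 0.1, (1, 4): 0.1, (1, 5): 0.0, (2, 3): 0.1,
    (2, 4): 0.6, (2, 5): 0.0, (3, 4): 0.1, (3, 5): 0.5, (4, 5): 0.0,
}
print(merge_clusters(example_scores))  # [[1, 2, 4], [3, 5]]
```

Note that clusters 1 and 4 end up together even though their direct score is only 0.1, because both are linked to cluster 2.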
4. Definitive Clustering - merge articles into the same clusters in cases where they share certain features.
Theory: these features generally occur thousands or fewer times in a corpus of 30 million records. Because they occur so infrequently, we will use these to merge clusters whenever they occur.
Instructions: any article that shares any of these features with another article should be in the same cluster as that other article.
Feature: email
Parse email addresses of all authors including cases where targetAuthor=FALSE and targetAuthor=TRUE
Feature: grant identifiers
Parse NIH grant identifiers into a standard format. Logic:
Find the first two consecutive letters. Track the letters.
There may be a space or a dash or no additional characters.
Now identify the first 4-6 consecutive numbers afterward.
Stop looking for additional numbers when:
a dash or space interrupts the numbers
or the value ends
or, we're exceeding 6 numbers
Track the numbers.
This gives you a normalized version of a grant ID - "DA-01457"
Note that we’re ignoring British grants - G0902173 22927437, MOP2390941 25692343. Also, if there are multiple grants in a single grant ID, we're only selecting the second one.
Ignore cases where article indexes more grants than clusteringGrants-threshold (see below) as recorded in application.properties. (e.g., amc2056, 22966490)
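The normalization logic above might be approximated with a regular expression, as in the sketch below. The function name is hypothetical, and two choices are assumptions rather than part of the stated logic: the two letters must not be preceded by another letter (which leaves G0902173 and MOP2390941 unmatched), and the letters are upper-cased. Handling of several grants inside one grant ID is not modelled; the sketch returns only the first match it finds.

```python
import re
from typing import Optional

# Two letters not preceded by a letter, an optional space or dash,
# then 4 to 6 digits (anything past the sixth digit is dropped).
GRANT_PATTERN = re.compile(r"(?<![A-Za-z])([A-Za-z]{2})[ -]?(\d{4,6})")


def normalize_grant(raw: str) -> Optional[str]:
    """Return a normalized grant ID such as 'DA-01457', or None."""
    match = GRANT_PATTERN.search(raw)
    if match is None:
        return None
    letters, digits = match.groups()
    return f"{letters.upper()}-{digits}"


for raw in ["R01 DA01457-03", "K23 DA-01457", "G0902173", "MOP2390941"]:
    print(raw, "->", normalize_grant(raw))
```

A caller would still need to apply the clusteringGrants-threshold check separately, skipping articles that index too many grants before comparing normalized IDs.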
Feature: cites or cited by
Identify cases where an article from one cluster cites an article from another cluster, or vice versa.
This code already exists.
Theoretically, we could also do this with data from Scopus, which contains 3x as much citation coverage.
Feature: MeSH major where global raw count in MeSH table < 4,000
Identify cases where an article from one cluster shares the same MeSH major as an article from another cluster, and that MeSH major has a global count of < 4,000.
Part of this code already exists.
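Definitive clustering as a whole could be sketched like this: any two articles that share a definitive feature value, or that are linked by a citation, end up in the same cluster. The input shapes and names are hypothetical; features are assumed to be pre-filtered (for example, only MeSH majors with a global count < 4,000) and supplied as (type, value) tuples.

```python
from collections import defaultdict


def definitive_clusters(article_features, citations=()):
    """article_features: article ID -> set of (feature_type, value).
    citations: iterable of (citing ID, cited ID) pairs."""
    parent = {article: article for article in article_features}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # feature -> first article that carried it
    for article, features in article_features.items():
        for feature in features:
            if feature in first_seen:
                union(first_seen[feature], article)
            else:
                first_seen[feature] = article

    for citing, cited in citations:
        if citing in parent and cited in parent:
            union(citing, cited)

    groups = defaultdict(set)
    for article in parent:
        groups[find(article)].add(article)
    return sorted(sorted(g) for g in groups.values())
```

For instance, if article 101 shares an email with 102, and 102 shares a grant with 103, all three land in one cluster even though 101 and 103 share nothing directly.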