FinnishCancerRegistry/gleason_extraction_py
<!-- readme.md generated by make_readme.R; do not modify by hand! -->

# Gleason score extraction at the Finnish Cancer Registry

Gleason_extraction_py is a Python tool for extracting Gleason scores from pathology texts. It is a Python implementation of the original research project, written in R, for the peer-reviewed study "Accurate pattern-based extraction of complex Gleason score expressions from pathology reports" (https://doi.org/10.1016/j.jbi.2021.103850).

## Setup

```sh
# with this project cloned as a sub-dir of your project
cd gleason_extraction_py

# optional virtual environment
python -m venv .venv
# on Windows:
./.venv/Scripts/activate
# on Linux / macOS:
# source .venv/bin/activate

# install deps
pip install -r requirements.txt
```

## Usage

```python
import gleason_extraction_py as ge

ge.extract_gleason_scores_from_texts(
	texts=["gleason 4 + 3 something something gleason 4 + 4"],
	patterns=["gleason (?P<A>[3-5])[ +]+(?P<B>[3-5])"],
	match_types=["a + b"]
)
```

```
   text_id  obs_id  a  b     t     c  start  stop match_type warning
0        0       0  4  3  <NA>  <NA>      0    13      a + b    None
1        0       1  4  4  <NA>  <NA>     34    47      a + b    None
```

```python
import gleason_extraction_py as ge

ge.extract_gleason_scores_from_text(
	text="gleason 4 + 3 something something gleason 4 + 4",
	patterns=["gleason (?P<A>[3-5])[ +]+(?P<B>[3-5])"],
	match_types=["a + b"]
)
```

```
   obs_id  a  b     t     c  start  stop match_type warning
0       0  4  3  <NA>  <NA>      0    13      a + b    None
1       1  4  4  <NA>  <NA>     34    47      a + b    None
```

## Interpretation of results

extract_gleason_scores_from_texts simply calls extract_gleason_scores_from_text for each element of texts.

extract_gleason_scores_from_text performs the following steps:

- Create a dict of lists, out. These lists are populated during extraction, and each becomes a column in the resulting pd.DataFrame.
- If pd.isna(text), return a pd.DataFrame with zero rows (but with the same columns as always).
- Compile patterns by passing each element to regex.compile. At least with regex 2.5.161 this also works for pre-compiled regexes, so patterns can contain either str or regex.Pattern elements.
- If prepare_text = True, run prepare_text on text.
  - prepare_text removes some known false positives (e.g. "Is bad (Gleason score 9-10): no") and replaces runs of whitespace with a single space ("gleason  score 9" -> "gleason score 9"). These actions can make the output string shorter than the input string.
- For each match of each compiled pattern:
  - Collect named capture groups with regex.capturesdict into object cd. E.g. "gleason 3 + 4" -> cd = {"A": ["3"], "B": ["4"]} and "gleason 3 + 4 / 4 + 4" -> cd = {"A": ["3", "4"], "B": ["4", "4"]}.
  - If the named group A_and_B was extracted, replace cd = {"A": cd["A_and_B"], "B": cd["A_and_B"]}. This makes it possible to correctly collect e.g. "sample was entirely gleason grade 3" as cd = {"A": ["3"], "B": ["3"]}.
  - Append each collected value of each element type to the correct list in out, e.g. out["A"].append(int("3")).
  - Pad the other columns with None values so all lists in out are again the same length, except that start and stop get the start and stop positions of the match, and match_type gets the match_types element corresponding to the current pattern.
- After looping through all matches of all compiled patterns, turn out into a pd.DataFrame with the correct column types and sort its rows by column start.
- Populate column warning by calling make_column_warning.
  - make_column_warning loops over its inputs and calls make_warning at each step. A warning text is created if the match_type does not correspond with what was extracted, e.g. match_type = "a + b = c" but c is missing (was not extracted). Additionally, a warning is added if all of a, b, and c were extracted but a + b != c. If nothing was wrong, the warning is None.
- If more than one row in out is an "orphan" (A, B, T, or C alone, with the other value columns missing), determine_element_combinations is called on those orphan rows and out is re-cast into a form where combined orphan values appear on the same rows. E.g. A = 3 and B = 4 on separate rows may now appear on the same row.
  - determine_element_combinations uses a hard-coded list of allowed Gleason element combinations to search for (["c", "a", "b", "t"], ["c", "a", "b"], ["c", "b", "a"], ...) in its input data. It simply goes through each combination, repeats each element from 1 to n_max_each (default 5) times, and checks what fits the so-far unmatched elements in the input data. Note that the repeating of elements means that e.g. ["c", "c", "a", "a", "b", "b"] can also be matched; repetition of this sort is not uncommon in tables within the text. Values matched by a combination get a common combination identifier (e.g. ["c", "c", "a", "a", "b", "b"] -> [0, 1, 0, 1, 0, 1]).
- Once more sort the rows in out by start. Populate obs_id with a running number starting from zero. Return out.
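The per-match collection step can be sketched roughly as follows. This is a hypothetical simplification (collect_matches is not part of the package's API): the real code uses the third-party regex package, whose Match.capturesdict() returns every capture of a repeated group, whereas stdlib re's groupdict() keeps only one capture per group, which is enough for this illustration.

```python
import re

def collect_matches(text, patterns, match_types):
    """Simplified sketch of the per-match collection step (A/B only)."""
    out = {"a": [], "b": [], "start": [], "stop": [], "match_type": []}
    for pattern, match_type in zip(patterns, match_types):
        compiled = re.compile(pattern)
        for m in compiled.finditer(text):
            gd = m.groupdict()  # e.g. {"A": "3", "B": "4"}
            # the A_and_B substitution: one grade stands for both A and B
            if gd.get("A_and_B") is not None:
                gd["A"] = gd["B"] = gd["A_and_B"]
            out["a"].append(int(gd["A"]) if gd.get("A") else None)
            out["b"].append(int(gd["B"]) if gd.get("B") else None)
            out["start"].append(m.start())
            out["stop"].append(m.end())
            out["match_type"].append(match_type)
    return out
```

Run against the Usage example above, this reproduces the same a, b, start, and stop values as the package output.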
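The warning rule for the "a + b = c" match type could look something like this; make_warning is real, but this body is an assumption based only on the description above:

```python
def make_warning(a, b, c, match_type):
    """Hypothetical sketch of the warning rule for match_type 'a + b = c'."""
    if match_type == "a + b = c":
        if c is None:
            # match_type promises a total but none was extracted
            return "match_type was 'a + b = c' but c was not extracted"
        if a is not None and b is not None and a + b != c:
            # all three extracted but they are inconsistent
            return f"a + b != c ({a} + {b} != {c})"
    # nothing was wrong
    return None
```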
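The orphan-combination search can be sketched like this. It is a hypothetical simplification of determine_element_combinations (the real function also tracks text positions, and the real combination list is longer than the subset assumed here):

```python
# Assumed subset of the allowed combinations; the real hard-coded list differs.
ALLOWED_COMBINATIONS = [["c", "a", "b", "t"], ["c", "a", "b"], ["a", "b"]]

def assign_combination_ids(elements, combinations=None, n_max_each=5):
    """Return one combination id per element (None if unmatched)."""
    if combinations is None:
        combinations = ALLOWED_COMBINATIONS
    ids = [None] * len(elements)
    next_id = 0
    for combo in combinations:
        for n in range(n_max_each, 0, -1):
            # e.g. combo ["a", "b"] with n = 2 -> candidate ["a", "a", "b", "b"]
            candidate = [e for e in combo for _ in range(n)]
            # the k-th repeat of each element belongs to the k-th combined
            # observation, e.g. [0, 1, 0, 1] for the candidate above
            slot_ids = [k for _ in combo for k in range(n)]
            free = [i for i, x in enumerate(ids) if x is None]
            take = free[:len(candidate)]
            if len(take) == len(candidate) and [elements[i] for i in take] == candidate:
                for pos, k in zip(take, slot_ids):
                    ids[pos] = next_id + k
                next_id += n
    return ids
```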

## Regular expressions written for the study

The regular expressions used in the study were built from "lego blocks": simple regular expressions and functions that combine regular expressions into more complex ones. These two kinds of objects were used to form the rather long regular expressions that ultimately extracted Gleason scores from our data.

While a programme based on regular expressions is always specific to the dataset for which it was developed, the same is true of statistical models: the model is general but the fit is specific to the dataset. Our regexes can be straightforward to adapt to other datasets because the "lego blocks" are often lists of mandatory words that must appear before and/or after a Gleason score match. For instance, the regular expression kw_a (keyword-and-primary) requires both the word "gleason" (in some form, including common typos) and the word "primary" (with possible conjugates and many synonyms). Both must appear, but in either order, so both "gleason primary 3" and "primary gleason 3" are matched.
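The lego-block idea can be illustrated with a small sketch. The word lists and the either_order helper below are assumptions for illustration, not the study's actual blocks (which cover more typos, conjugates, and synonyms):

```python
import re

# Hypothetical "lego blocks": alternations of mandatory words.
GLEASON = r"(?:gleason|gleeson)"        # keyword with one assumed common typo
PRIMARY = r"(?:primary|most\s+common)"  # assumed synonym list

def either_order(x, y):
    """Require both x and y, in either order, separated by whitespace."""
    return rf"(?:{x}\s+{y}|{y}\s+{x})"

# keyword-and-primary followed by a single grade
KW_A = re.compile(either_order(GLEASON, PRIMARY) + r"\s+(?P<A>[3-5])")
```

Because KW_A demands both mandatory words, "primary gleason 3" and "gleason primary 4" match, while a bare "3 + 3 = 6" does not.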

Some mandatory words are required in all the regular expressions: even "3 + 3 = 6" would not be matched without a preceding "gleason". We chose this approach because we considered it far worse to collect false alarms than to miss some extractions. Indeed, in our study fewer than 1 % of collected values were false alarms.