[](readme.md generated by make_readme.R; do not modify by hand!)
Gleason_extraction_py is a Python tool to extract Gleason scores from pathology texts. It is a Python implementation of the original R research project written for the peer-reviewed study "Accurate pattern-based extraction of complex Gleason score expressions from pathology reports" (https://doi.org/10.1016/j.jbi.2021.103850).
```sh
# with this project cloned as a sub-dir of your project
cd gleason_extraction_py
# optional virtual environment
python -m venv .venv
./.venv/Scripts/activate
# install deps
pip install -r requirements.txt
```
```python
import gleason_extraction_py as ge
ge.extract_gleason_scores_from_texts(
    texts=["gleason 4 + 3 something something gleason 4 + 4"],
    patterns=["gleason (?P<A>[3-5])[ +]+(?P<B>[3-5])"],
    match_types=["a + b"]
)
```

```
   text_id  obs_id  a  b     t     c  start  stop match_type warning
0        0       0  4  3  <NA>  <NA>      0    13      a + b    None
1        0       1  4  4  <NA>  <NA>     34    47      a + b    None
```
```python
import gleason_extraction_py as ge
ge.extract_gleason_scores_from_text(
    text="gleason 4 + 3 something something gleason 4 + 4",
    patterns=["gleason (?P<A>[3-5])[ +]+(?P<B>[3-5])"],
    match_types=["a + b"]
)
```

```
   obs_id  a  b     t     c  start  stop match_type warning
0       0  4  3  <NA>  <NA>      0    13      a + b    None
1       1  4  4  <NA>  <NA>     34    47      a + b    None
```
`extract_gleason_scores_from_texts` simply calls `extract_gleason_scores_from_text` for each `texts` element, as sketched below.
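That per-text dispatch could look roughly like this (a hypothetical re-implementation, not the actual source; the real function may build the `text_id` column differently):

```python
import pandas as pd
import gleason_extraction_py as ge

def extract_from_texts_sketch(texts, patterns, match_types):
    # hypothetical sketch: run the single-text extractor on each element
    # and label each result with its position in `texts` as text_id
    parts = [
        ge.extract_gleason_scores_from_text(
            text=t, patterns=patterns, match_types=match_types
        ).assign(text_id=i)
        for i, t in enumerate(texts)
    ]
    return pd.concat(parts, ignore_index=True)
```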
`extract_gleason_scores_from_text` performs the following steps:

- Create a `dict` of `list` objects called `out`. These `list` objects will be populated during extraction, and each `list` is a column in the resulting `pd.DataFrame` (the column-wise layout is sketched after this list).
- If `pd.isna(text)`, return a `pd.DataFrame` with zero rows (but the same columns as always).
- Compile `patterns` by passing each element to `regex.compile`. At least with `regex` 2.5.161 this also works for pre-compiled regexes, so `patterns` can contain either `str` or `regex.Pattern` type elements.
- If `prepare_text = True`, run `prepare_text` on `text` (sketched below). `prepare_text` removes some known false positives (e.g. "Is bad (Gleason score 9-10): no") and replaces repeated whitespace with a single whitespace ("gleason   score 9" -> "gleason score 9"). These actions cause the output string to be shorter than the input string.
- For each match of each compiled pattern:
  - Collect named capture groups with `regex.capturesdict` into object `cd` (sketched below). E.g. `"gleason 3 + 4"` -> `cd = {"A": ["3"], "B": ["4"]}` and `"gleason 3 + 4 / 4 + 4"` -> `cd = {"A": ["3", "4"], "B": ["4", "4"]}`.
  - If named group `A_and_B` was extracted, replace `cd = {"A": cd["A_and_B"], "B": cd["A_and_B"]}`. This enables correctly collecting e.g. "sample was entirely gleason grade 3" as `cd = {"A": ["3"], "B": ["3"]}`.
  - Append each collected value of each element type into the correct `list` in `out`, e.g. `out["A"].append(int("3"))`.
  - Pad the other columns with `None` values so that all `list` objects in `out` are again the same length --- except that `start` and `stop` get the start and stop positions of the match, and `match_type` gets the `match_types` element corresponding to the current pattern.
- After looping through all matches of all the compiled patterns, turn `out` into a `pd.DataFrame` with the correct column types and sort its rows by column `start`.
- Populate column `warning` by calling `make_column_warning`. `make_column_warning` loops over its inputs and calls `make_warning` at each step (sketched below). A warning text is created if the `match_type` does not correspond with what was extracted --- e.g. `match_type = "a + b = c"` but `c` is missing (was not extracted). Additionally, a warning is added if all of `a`, `b`, and `c` were extracted but `a + b != c`. If nothing was wrong, the warning is `None`.
- If more than one row in `out` is an "orphan" --- A, B, T, or C alone, with the other value columns missing --- then `determine_element_combinations` is called on those "orphan" rows, and `out` is re-cast into a form where combined orphan values now appear on the same rows; e.g. `A = 3` and `B = 4` on separate rows may now appear on the same row. `determine_element_combinations` uses a hard-coded list of allowed gleason element combinations to search for (`["c", "a", "b", "t"]`, `["c", "a", "b"]`, `["c", "b", "a"]`, ...) in its input data (sketched below). It simply goes through each combination, repeats each element from 1 to `n_max_each` (default 5) times, and sees what fits the so-far unmatched elements in the input data. Note that the repeating of elements means that e.g. `["c", "c", "a", "a", "b", "b"]` can also be matched; repetition of this sort is not uncommon in tables within the text. Values matched with a combination get a common combination identifier (e.g. `["c", "c", "a", "a", "b", "b"]` -> `[0, 1, 0, 1, 0, 1]`).
- Once more sort the rows in `out` by `start`. Populate `obs_id` with a running number starting from zero. Return `out`.
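The text preparation step can be pictured roughly as follows (a hedged sketch; the actual rules and patterns of `prepare_text` are defined in the source):

```python
import regex

def prepare_text_sketch(text):
    # hypothetical sketch: drop one known false-positive phrase and
    # collapse runs of whitespace into single spaces
    text = regex.sub(r"is bad \(gleason score 9-10\): no", "", text,
                     flags=regex.IGNORECASE)
    return regex.sub(r"\s+", " ", text)
```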
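The capture-group collection relies on the `regex` package's `capturesdict`, which returns every capture of each named group, not just the last one. The pattern below is illustrative, not one of the study's:

```python
import regex

# a repeated group captures every (A, B) pair within one match
p = regex.compile(r"gleason (?:(?P<A>[3-5]) ?\+ ?(?P<B>[3-5])(?: / )?)+")

p.search("gleason 3 + 4").capturesdict()
# -> {'A': ['3'], 'B': ['4']}
p.search("gleason 3 + 4 / 4 + 4").capturesdict()
# -> {'A': ['3', '4'], 'B': ['4', '4']}
```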
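The column-wise accumulation into `out` and its conversion into a `pd.DataFrame` might be sketched as follows (hypothetical; the real column set is wider, and pandas' nullable `Int64` dtype is what produces the `<NA>` values seen in the example output above):

```python
import pandas as pd

out = {"a": [4, 4], "b": [3, 4], "c": [None, None],
       "start": [34, 0], "stop": [47, 13]}
df = (
    pd.DataFrame(out)
    .astype({"a": "Int64", "b": "Int64", "c": "Int64"})  # nullable ints -> <NA>
    .sort_values("start", ignore_index=True)
)
```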
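The consistency check behind `make_warning` amounts to logic along these lines (a sketch of the two documented checks, not the actual implementation):

```python
def make_warning_sketch(match_type, a, b, c):
    # check 1: the match_type promises a c but none was extracted
    if match_type == "a + b = c" and c is None:
        return "match_type 'a + b = c' but no c was extracted"
    # check 2: all three were extracted but the arithmetic does not hold
    if None not in (a, b, c) and a + b != c:
        return f"a + b != c: {a} + {b} != {c}"
    return None  # nothing wrong
```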
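Finally, the combination search in `determine_element_combinations` can be illustrated with this sketch (hypothetical code; the allowed-combination list is truncated here, and the real function works on rows of a `pd.DataFrame` rather than a plain list):

```python
ALLOWED_COMBINATIONS = [["c", "a", "b", "t"], ["c", "a", "b"], ["c", "b", "a"]]

def combination_ids_sketch(types, n_max_each=5):
    # try each allowed combination with each element repeated k times;
    # values matched together share a combination identifier
    for combo in ALLOWED_COMBINATIONS:
        for k in range(n_max_each, 0, -1):
            if types == [e for e in combo for _ in range(k)]:
                return [j for _ in combo for j in range(k)]
    return None

combination_ids_sketch(["c", "c", "a", "a", "b", "b"])  # -> [0, 1, 0, 1, 0, 1]
```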
The regular expressions used in the study were built from "lego blocks": simple regular expressions, plus functions that process regular expressions into more complex ones. These two kinds of objects were combined into the rather long regular expressions ultimately used to extract Gleason scores in our data.
While a programme based on regular expressions is always specific to the dataset for which it was developed, the same is true of statistical models: the model is general but the fit is specific to the dataset. Our regexes can be straightforward to adapt to other datasets because the "lego blocks" are often lists of mandatory words that must appear before and/or after a Gleason score match. For instance the regular expression for `kw_a`, keyword-and-primary, requires both the word "gleason" (in some form, with common typos) and the word "primary" (with possible conjugations and many synonyms). Both must appear, but in either order, so e.g. both "gleason primary 3" and "primary gleason 3" are matched.
Some mandatory words are required in all the regular expressions; even "3 + 3 = 6" would not be matched without a preceding "gleason". We chose this approach because we considered it far worse to collect false alarms than to miss some extractions. Indeed, in our study less than 1 % of collected values were false alarms.