Final capture replace filtering #36

Merged 10 commits on Aug 10, 2022


asishallab
Contributor

@asishallab asishallab commented Aug 6, 2022

Added an option to polish the human readable descriptions (HRDs) that prot-scriber's scoring mechanism assigns to queries (proteins or sequence families). Polishing uses capture-replace pairs, i.e. regular expressions (fancy-regex syntax) paired with replace instructions, which are applied iteratively to each HRD as the last step before the output table is written. Currently this is used to delete trailing non-informative words such as [...] and, [...] or, [...] the, etc. A new command-line option either suppresses this final polishing when given -d none or accepts custom polishing capture-replace pairs. An example file containing the default pairs has also been added to ./misc.
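To illustrate the idea, here is a minimal sketch of such a polishing step. It is written with Python's standard `re` module rather than fancy-regex, and the pattern, the word list, and the function name `polish` are assumptions made for illustration, not the defaults shipped in ./misc. An empty pair list stands in for `-d none`.

```python
import re

# Hypothetical capture-replace pairs: (regular expression, replacement).
# These are NOT prot-scriber's defaults, only an example of the shape.
EXAMPLE_PAIRS = [
    # Delete one or more trailing non-informative words.
    (r"(?:\s+(?:and|or|the))+\s*$", ""),
]


def polish(hrd, pairs=EXAMPLE_PAIRS):
    """Apply each capture-replace pair to the description, in order."""
    for pattern, replacement in pairs:
        hrd = re.sub(pattern, replacement, hrd, flags=re.IGNORECASE)
    return hrd


print(polish("protein kinase and the"))
print(polish("protein kinase and the", pairs=[]))  # polishing suppressed
```

Anchoring the pattern with `$` means only trailing words are touched, so a description such as "theta subunit" passes through unchanged.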

Closes #26 #33 #34

coeit and others added 10 commits August 3, 2022 15:21
Add a command-line option to exclude "unknown protein" or "unknown sequence family" from results
Currently, phrases, i.e. sub-sets of candidate descriptions, that only consist of non-informative words are still scored and might be assigned as the final annotation. This should be changed.

If the set of informative words is empty, classify the protein or sequence family as "unknown".
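
The rule in this commit message could be sketched as follows. This is an illustration in Python, not the project's code; the word list `NON_INFORMATIVE` and the function `describe` are hypothetical.

```python
# Hypothetical list of non-informative words, for illustration only.
NON_INFORMATIVE = {"and", "or", "the", "of", "protein", "putative"}


def describe(phrase, kind="protein"):
    """Return the phrase, or 'unknown <kind>' if it has no informative word."""
    informative = [w for w in phrase.lower().split() if w not in NON_INFORMATIVE]
    if not informative:
        return f"unknown {kind}"
    return phrase


print(describe("putative protein"))
print(describe("the and", kind="sequence family"))
print(describe("protein kinase"))
```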
- Replaced standard Rust regex with fancy-regex in capture-replace-pairs, thus
  allowing for (named) backreferences.
- Using named backreferences, multiple occurrences of the same word are replaced
  with the first occurrence, i.e. any subsequent occurrence is deleted.
- Polishing iteratively applies capture-replace-pairs (fancy-regex, replace instruction)
  to the human readable descriptions (HRDs) assigned to the queries (families or proteins).
- This is used, among other things, to remove terminal non-informative words like '[...] and',
  '[...] the', or '[...] or'.
- The polishing step can be suppressed (skipped) by providing the new command line option
  --polish-capture-replace-pairs (-d) with "none". Use the same option to provide
  custom capture-replace-pairs.
- problem was mutability and iteration over mutable references...
Example polish-capture-replace-pairs file in misc.
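
The named-backreference de-duplication mentioned in the commit messages above can be sketched like this. The sketch uses Python's `re` module, where a named backreference is written `(?P=word)`; the fancy-regex spelling may differ. The pattern and the function name `dedupe_words` are assumptions, not the pair used by prot-scriber.

```python
import re

# A word, any text after it, and a later repetition of the same word.
# The replacement keeps the first occurrence and the text in between.
REPEATED_WORD = re.compile(r"\b(?P<word>\w+)\b(?P<between>.*?)\s+(?P=word)\b")


def dedupe_words(hrd):
    """Delete later occurrences of a word until nothing changes."""
    previous = None
    while previous != hrd:
        previous = hrd
        hrd = REPEATED_WORD.sub(r"\g<word>\g<between>", hrd)
    return hrd


print(dedupe_words("kinase domain kinase"))
```

The loop repeats the substitution because a single pass replaces only non-overlapping matches, so interleaved repeats such as "a b a b" need more than one pass.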
@asishallab asishallab requested a review from coeit August 6, 2022 09:17
Successfully merging this pull request may close these issues.

When generating the human readable description only add phrases that have not already been scored.
2 participants