This repository was archived by the owner on Jul 28, 2025. It is now read-only.

@mart-r
Collaborator

@mart-r mart-r commented Feb 11, 2025

Since transformers==4.47.0, some of the tokenizer attributes are handled differently (see huggingface/transformers#34461). Our saved models predate this change and keep the data in the old format. So when the DeID model is run, the new mechanism is used to read the data, but there is nothing there to be read.

We haven't really seen anyone complain about this, but that's mostly because installations are static. If someone installed before the 17th of December, their installation would have transformers<4.47, and unless they've updated since, their setup will still work just fine.
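
A quick way to tell whether a given environment falls on the affected side of that version boundary is to compare the installed version against 4.47. The snippet below is only a minimal sketch: the helper names `parse_major_minor` and `uses_new_tokenizer_attributes` are hypothetical and not part of MedCAT or transformers, and it looks only at the major and minor version numbers.

```python
import re
from importlib import metadata


def parse_major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '4.47.0.dev0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def uses_new_tokenizer_attributes(version: str) -> bool:
    """True if this transformers version is 4.47 or newer (hypothetical helper)."""
    return parse_major_minor(version) >= (4, 47)


def installed_transformers_is_affected() -> bool:
    """Check the locally installed transformers, if there is one."""
    try:
        return uses_new_tokenizer_attributes(metadata.version("transformers"))
    except metadata.PackageNotFoundError:
        # transformers is not installed, so there is nothing to be affected
        return False
```

Tuple comparison is used rather than string comparison so that, for example, 4.5 is correctly treated as older than 4.47.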

This PR fixes the issue. It moves the relevant data to the correct location upon creation of the pipeline.
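
To illustrate the general idea of moving old-format data to where the new lookup expects it, here is a minimal sketch. It does not reproduce the code in this PR or the transformers internals: `ToyTokenizer` is a stand-in class, and the `_special_tokens_map` attribute name, the list of token names, and `migrate_legacy_special_tokens` are all assumptions made for the example. The only detail taken from the traceback below is that the old data lives under underscore-prefixed attributes such as `_pad_token`.

```python
# Assumed list of special token names, for illustration only
SPECIAL_TOKEN_NAMES = (
    "bos_token", "eos_token", "unk_token", "sep_token",
    "pad_token", "cls_token", "mask_token",
)


class ToyTokenizer:
    """Stand-in tokenizer that resolves special tokens through a map."""

    def __init__(self):
        self._special_tokens_map = {}

    def __getattr__(self, key):
        # Only reached when normal attribute lookup has already failed
        tokens = self.__dict__.get("_special_tokens_map", {})
        if key in tokens:
            return tokens[key]
        raise AttributeError(f"{type(self).__name__} has no attribute {key}")


def migrate_legacy_special_tokens(tokenizer):
    """Move old-style `_pad_token`-like attributes into the map.

    Returns the list of token names that were moved.
    """
    moved = []
    token_map = tokenizer.__dict__.setdefault("_special_tokens_map", {})
    for name in SPECIAL_TOKEN_NAMES:
        legacy_key = "_" + name
        if legacy_key in tokenizer.__dict__ and token_map.get(name) is None:
            token_map[name] = tokenizer.__dict__.pop(legacy_key)
            moved.append(name)
    return moved


def make_legacy_tokenizer():
    """Simulate a tokenizer restored from a model pack saved in the old format."""
    tok = ToyTokenizer()
    tok.__dict__["_pad_token"] = "<pad>"
    return tok
```

Before migration, reading `pad_token` on the object from `make_legacy_tokenizer()` raises `AttributeError`, much like the failure shown in the verification output; after calling `migrate_legacy_special_tokens`, the same attribute resolves to `"<pad>"`. Running the migration a second time moves nothing, so it is safe to call on every pipeline creation.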

Verification that this fixes the issue

Ran on master branch:
% python -c "from medcat.utils.ner.deid import DeIdModel;model = DeIdModel.load_model_pack(\"temp/mct_1_15_beta_2025_02_11/medcat_deid_model_691c3f6a6e5400e7.zip\");print(model.deid_text('Patient: Mr James Doe'))"
/Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/.venv3.10.13/lib/python3.10/site-packages/transformers/training_args.py:1575: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Device set to use mps:0
/Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/.venv3.10.13/lib/python3.10/site-packages/spacy/util.py:910: UserWarning: [W095] Model 'en_core_web_md' (3.1.0) was trained with spaCy v3.1.0 and may not be 100% compatible with the current version (3.7.5). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/medcat/utils/ner/deid.py", line 85, in deid_text
    entities = self.cat.get_entities(text)['entities']
  File "/Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/medcat/cat.py", line 1094, in get_entities
    doc = self(text)
  File "/Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/medcat/cat.py", line 526, in __call__
    return self.pipe(text)  # type: ignore
  File "/Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/medcat/pipe.py", line 278, in __call__
    return self._nlp(text) if len(text) > 0 else None  # type: ignore
  File "/Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/.venv3.10.13/lib/python3.10/site-packages/spacy/language.py", line 1054, in __call__
    error_handler(name, proc, [doc], e)
  File "/Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/.venv3.10.13/lib/python3.10/site-packages/spacy/util.py", line 1722, in raise_error
    raise e
  File "/Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/.venv3.10.13/lib/python3.10/site-packages/spacy/language.py", line 1049, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))  # type: ignore[call-arg]
  File "/Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/medcat/ner/transformers_ner.py", line 442, in __call__
    doc = next(self.pipe(iter([doc])))
  File "/Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/medcat/ner/transformers_ner.py", line 389, in pipe
    yield from self._process(stream, batch_size_chars)  # type: ignore
  File "/Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/medcat/ner/transformers_ner.py", line 399, in _process
    res = self.ner_pipe(doc.text, aggregation_strategy=self.config.general['ner_aggregation_strategy'])
  File "/Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/.venv3.10.13/lib/python3.10/site-packages/transformers/pipelines/token_classification.py", line 250, in __call__
    return super().__call__(inputs, **kwargs)
  File "/Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/.venv3.10.13/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1354, in __call__
    return next(
  File "/Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/.venv3.10.13/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
  File "/Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/.venv3.10.13/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 269, in __next__
    processed = self.infer(next(self.iterator), **self.params)
  File "/Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/.venv3.10.13/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/.venv3.10.13/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 675, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/.venv3.10.13/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
  File "/Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/.venv3.10.13/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 186, in __next__
    processed = next(self.subiterator)
  File "/Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/.venv3.10.13/lib/python3.10/site-packages/transformers/pipelines/token_classification.py", line 255, in preprocess
    inputs = self.tokenizer(
  File "/Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/.venv3.10.13/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2868, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/.venv3.10.13/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2978, in _call_one
    return self.encode_plus(
  File "/Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/.venv3.10.13/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3045, in encode_plus
    padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
  File "/Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/.venv3.10.13/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2769, in _get_padding_truncation_strategies
    if padding_strategy != PaddingStrategy.DO_NOT_PAD and (self.pad_token is None or self.pad_token_id < 0):
  File "/Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/.venv3.10.13/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1108, in __getattr__
    raise AttributeError(f"{self.__class__.__name__} has no attribute {key}")
AttributeError: RobertaTokenizerFast has no attribute pad_token. Did you mean: '_pad_token'?

And after the changes:

% python -c "from medcat.utils.ner.deid import DeIdModel;model = DeIdModel.load_model_pack(\"temp/mct_1_15_beta_2025_02_11/medcat_deid_model_691c3f6a6e5400e7.zip\");print(model.deid_text('Patient: Mr James Doe'))"
/Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/.venv3.10.13/lib/python3.10/site-packages/transformers/training_args.py:1575: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Device set to use mps:0
/Users/martratas/Documents/CogStack/.MedCAT.nosync/MedCAT/.venv3.10.13/lib/python3.10/site-packages/spacy/util.py:910: UserWarning: [W095] Model 'en_core_web_md' (3.1.0) was trained with spaCy v3.1.0 and may not be 100% compatible with the current version (3.7.5). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)
Patient: Mr [Name]

@tomolopolis
Member

Task linked: CU-8697x7y9x Fix DeID transformers issue

@tomolopolis tomolopolis left a comment

lgtm

@mart-r mart-r merged commit b8692fe into master Feb 12, 2025
8 checks passed
mart-r added a commit that referenced this pull request Feb 12, 2025
* CU-8697x7y9x: Fix issue with transformers 4.47+ affecting DeID

* CU-8697x7y9x: Add type-ignore to module unrelated to current change
mart-r added a commit that referenced this pull request Apr 25, 2025
* CU-8693bc9kc: Add python 3.12 support (#511)

* CU-8693bc9kc: Add python 3.12 support

* CU-8693bc9kc: Amend dependencies so as to be compatible with python 3.12

* Bump default spacy model version (to 3.8)

* CU-8693bc9kc: Fix some typing issues due to numpy2

* CU-8693bc9kc: Fix some typing issues due to numpy2 (try 2)

* CU-8693bc9kc: Change spacy models to 3.7.2

* CU-8693bc9kc: Pin numpy to v1

* CU-8693bc9kc: Fix numpy requirement comment

* CU-8693bc9kc: Fix usage of old/deprecated assert methods in tests

* CU-8693bc9kc: Update some requirement comments

* CU-8697c86rf: Update docs build requirements (#514)

* CU-8697c86rf: Update docs build requirements

* CU-8697c86rf: Fix docs build requirements (hopefully)

* CU-8697c86rf: Fix docs build requirements (hopefully) x2

* CU-8697x7y9x: Fix issue with transformers 4.47+ affecting DeID (#517)

* CU-8697x7y9x: Fix issue with transformers 4.47+ affecting DeID

* CU-8697x7y9x: Add type-ignore to module unrelated to current change

* Updates for MetaCAT (#515)

* Pushing update for MetaCAT

- Addressing the multiple zero-division-error warnings per epoch while training
- Accommodating the variations in category name and class name across NHS sites

* Adding comments

* Pushing requested changes

* Pushing type fix

* Pushing updates to metacat config

* Support expansion of transformers ner models to include new concepts (#519)

* CU-8697v6qr2 support expansion of transformers ner models to include new concepts
* CU-8697v6qr2 add logging suggested by the review

* CU-869805t7e alt names fixes (#520)

* CU-869805t7e: Move getting of applicable category name to the config

* CU-869805t7e: Use alternative category names in eval method

* CU-869805t7e: Reduce indentation

* CU-869805t7e: Reduce indentation (again)

* CU-869805t7e: Some comment fixing due to rearrangements before

* CU-869805t7e: Fix usage of matched class name when encoding category values

* CU-869805t7e: Avoid duplicating exception message

* CU-8697qfvzz train metacat on sup train (#516)

* CU-8697qfvzz: Add new optional keyword argumnet to allow training MetaCAT models during supervised training

* CU-8697qfvzz: Add tests regarding training meta-cats during supervised training

* CU-8697qfvzz: Fix small typo in comment

* CU-8697qfvzz: Allow using alternative category names if/when training meta cats through CAT.train_supervised

* CU-8698ek477: Fix AdamW import from tranformers to torch (#523)

* CU-8698ek477: Add TODO to MetaCAT ML utils regarding AdamW import

* CU-8698ek477: Fix AdamW import (trf->torch)

* CU-8698f8fgc: Fix negative sampling including indices for words without a vector (#524)

* CU-8698f8fgc: Add new test to check that the negative sampling indices do not include non-vectored indices

* CU-8698f8fgc: Add fix for negative sampling including indices for words without a vector

* CU-8698f8fgc: Update tests to make sure index frequencies are respected

* CU-8698f8fgc: Add 3.9-friendly counter totalling method

* CU-8698gkrqa: Add argument to allow specifying the changes warrenting a model save (#525)

* CU-8698hfkch: Add eval method to deid model

* CU-8698hfkch: lint checks

* CU-8698gqumv: Fix regression test vocab vector sizes (#526)

* CU-8698gqumv: Add tests for Vocab upon regression testing

* CU-8698gqumv: Fix regression time vocab data

* CU-86983ruw9 Fix test train split (#521)

* CU-86983ruw9: Fix train-test splitter leaving train set empty for smaller datasets

* CU-86983ruw9: Add additional optional arguments to test-train splitting for minimum concept count and maximum test fraction

* CU-86983ruw9: Add a few tests for test-train splitting

* CU-8698hfkch: Add eval method to deid model (#527)

* CU-8698hfkch: Add eval method to deid model

* CU-8698hfkch: lint checks

---------

Co-authored-by: Tom Searle <tom@cogstack.org>

* CU-8698jzjj3: pass in extra param if ignore_extra_labels is set, and test

* CU-8698mqu96 Transformers update (4.51.0) fix (#531)

* CU-8698mqu96: Update special tokens lengths attribute

* CU-8698mqu96: Update MetaCAT usage of BertTokenizer.from_pretrained for type safety

* CU-8698mqu96: Ignore typing where mypy is wrong + add note in code

* CU-8698mqu96: Ignore typing where mypy may be wrong + add comment

* CU-8698mqu96: Fix tokenizer wrapper import for rel cat

* CU-8698mqu96: Rename evaluation strategy keyword argument in line with changes

* CU-8698mqu96: Type-ignore method where mypy says it does not exist

* CU-8698mqu96: Fix TRF-NER output dir typing issue

* CU-8698mqu96: Update a doc string for darglint

* CU-8698mqu96: Fix typing issue for TrfNER trainer callback

* Relation extraction llama (#522)

* Added files.

* More additions to rel extraction.

* Rel base.

* Update.

* Updates.

* Dependency parsing.

* Updates.

* Added pre-training steps.

* Added training & model utils.

* Cleanup & fixes.

* Update.

* Evaluation updates for pretraining.

* Removed duplicate relation storage.

* Moved RE model file location.

* Structure revisions.

* Added custom config for RE.

* Implemented custom dataset loader for RE.

* More changes.

* Small fix.

* Latest additions to RelCAT (pipe + predictions)

* Setup.py fix.

* RE utils update.

* rel model update.

* rel dataset + tokenizer improvements.

* RelCAT updates.

* RelCAT saving/loading improvements.

* RelCAT saving/loading improvements.

* RelCAT model fixes.

* Attempted gpu learning fix. Dataset label generation fixes.

* Minor train dataset gen fix.

* Minor train dataset gen fix No.2.

* Config updates.

* Gpu support fixes. Added label stats.

* Evaluation stat fixes.

* Cleaned stat output mode during training.

* Build fix.

* removed unused dependencies and fixed code formatting

* Mypy compliance.

* Fixed linting.

* More Gpu mode train fixes.

* Fixed model saving/loading issues when using other baes models.

* More fixes to stat evaluation. Added proper CAT integration of RelCAT.

* Setup.py typo fix.

* RelCAT loading fix.

* RelCAT Config changes.

* Type fix. Minor additions to RelCAT model.

* Type fixes.

* Type corrections.

* RelCAT update.

* Type fixes.

* Fixed type issue.

* RelCATConfig: added seed param.

* Adaptations to the new codebase + type fixes..

* Doc/type fixes.

* Fixed input size issue for model.

* Fixed issue(s) with model size and config.

* RelCAT: updated configs to new style.

* RelCAT: removed old refs to logging.

* Fixed GPU training + added extra stat print for train set.

* Type fixes.

* Updated dev requirements.

* Linting.

* Fixed pin_memory issue when training on CPU.

* Updated RelCAT dataset get + default config.

* Updated RelDS generator + default config

* Linting.

* Updated RelDatset + config.

* Pushing updates to model

Made changes to:
1) Extracting given number of context tokens left and right of the entities
2) Extracting hidden state from bert for all the tokens of the entities and performing max pooling on them

* Fixing formatting

* Update rel_dataset.py

* Update rel_dataset.py

* Update rel_dataset.py

* RelCAT: added test resource files.

* RelCAT: Fixed model load/checkpointing.

* RelCAT: updated to pipe spacy doc call.

* RelCAT: added tests.

* Fixed lint/type issues & added rel tag to test DS.

* Fixed ann id to token issue.

* RelCAT: updated test dataset + tests.

* RelCAT: updates to requested changes + dataset improvements.

* RelCAT: updated docs/logs according to commends.

* RelCAT: type fix.

* RelCAT: mct export dataset updates.

* RelCAT: test updates + requested changes p2.

* RelCAT: log for MCT export train.

* Updated docs + split train_test & dataset for benchmarks.

* type fixes.

* RelCAT: Initial Llama integration.

* RelCAT: updates to Llama impl.

* RelCAT: model typo fix.

* RelCAT: label_id /sample no. mixup fix.

* Updated cleaned up Relataset, added new ways to create relations via anno types (doc/export only for now).

* Added option to predict any text /w annotations via RelCAT. MCT export train fixes.

* RelCAT: added sample limiter / class, more logging info.

* RelCAT: test/train ds shuffle update.

* RelCAT: added option to keep original text when using reldataset class.

* Pushing change for stratified batching

Implement stratified batching for improved class representation and balanced training

* RelCAT: fixed doc processing issue + class weights.

* RelCAT: class weights addtions to cfg + param.

* RelCAT: added config params for Adam optimizer.

* RelCAT updated default config.

* RelCAT: config update + optimizer change.

* RelCAT: fixed model freeze flags.

* RelCAT: model optimizer save/load fix.

* RelCAT: added export ent tag check.

* Fixed issues when saving/loading model for class weights + inference device cast.

* RelCAT: bug fix for ents that are @ EoS.

* Rel Dataset updates.

* Rel Dataset updates.

* Pushing change for ModernBERT

* Bumped transformers version.

* Updated rel dataset generation from fake Spacy Docs.

* ModernBert updates.

* Updated RelCAT model-load/save.

* Minor relCAT updates, code format.

* Type check updates.

* Fixed inference issue.

* RelCAT: testing updates.

* Type fixes.

* Type fixes.

* Type fixes.

* Type fixes IV.

* Type fixes python 3.9.

* RelCAT: flake8 fixes.

* RelCAT: flake8 fixes.

* RelCAT: Updates (fixed model loading after save).

* Fixed test.

* Update RelCAT stuff for improved abstraction

* Move separate model implementations to separate packages

* Some minor abstraction changes

* Remove accidentally copied abstract method decorator

* Fix import in test

* Fix RelCAT impport in pipe tests

* Update base relcat model implementation to include config

* Latest RelCAT module updates.

* Type fixes + run issues.

* Type fixes.

* Fixed Llama tokenizer.

* Type fixes.

* Type fixes: Python3.10 adjustements.

* Linting.

* Fix base flake8 lint issues

* Fix doc string in ConfigRelCAT.load

* Fix base component init doc string

* Fixed BaseComponent.load method doc string

* Fix doc strings in rel_cat ml_utils

* Fix doc strings in rel_cat models module

* Fix rel-cat test time import

* Fix type casting

* Align pipe tests with rel cat changes

* Fix property paths in rel cat tests

* Updates.

* Fixed tests.

* Fixed relCAT config save.

* Latest fixes for model saving/loading.

* Lint fix.

* RelCAT cfg load test fix.

* Remove install requirements from gitignore

---------

Co-authored-by: Shubham Agarwal <66172189+shubham-s-agarwal@users.noreply.github.com>
Co-authored-by: mart-r <mart.ratas@gmail.com>

* CU-8698vewzp: Fix docs requirements (hopefully) (#534)

* CU-8698veb6y: Use Ubuntu 24.04 for publishing to test PyPI (#533)

---------

Co-authored-by: Shubham Agarwal <66172189+shubham-s-agarwal@users.noreply.github.com>
Co-authored-by: Xi Bai <82581439+baixiac@users.noreply.github.com>
Co-authored-by: Tom Searle <tom@cogstack.org>
Co-authored-by: tomolopolis <tsearle88@gmail.com>
Co-authored-by: Vlad Dinu <62345326+vladd-bit@users.noreply.github.com>
alhendrickson pushed a commit to CogStack/cogstack-nlp that referenced this pull request Jul 1, 2025
…ack/MedCAT#517)

* CU-8697x7y9x: Fix issue with transformers 4.47+ affecting DeID

* CU-8697x7y9x: Add type-ignore to module unrelated to current change
alhendrickson pushed a commit to CogStack/cogstack-nlp that referenced this pull request Jul 1, 2025
* CU-8693bc9kc: Add python 3.12 support (CogStack/MedCAT#511)

* CU-8693bc9kc: Add python 3.12 support

* CU-8693bc9kc: Amend dependencies so as to be compatible with python 3.12

* Bump default spacy model version (to 3.8)

* CU-8693bc9kc: Fix some typing issues due to numpy2

* CU-8693bc9kc: Fix some typing issues due to numpy2 (try 2)

* CU-8693bc9kc: Change spacy models to 3.7.2

* CU-8693bc9kc: Pin numpy to v1

* CU-8693bc9kc: Fix numpy requirement comment

* CU-8693bc9kc: Fix usage of old/deprecated assert methods in tests

* CU-8693bc9kc: Update some requirement comments

* CU-8697c86rf: Update docs build requirements (CogStack/MedCAT#514)

* CU-8697c86rf: Update docs build requirements

* CU-8697c86rf: Fix docs build requirements (hopefully)

* CU-8697c86rf: Fix docs build requirements (hopefully) x2

* CU-8697x7y9x: Fix issue with transformers 4.47+ affecting DeID (CogStack/MedCAT#517)

* CU-8697x7y9x: Fix issue with transformers 4.47+ affecting DeID

* CU-8697x7y9x: Add type-ignore to module unrelated to current change

* Updates for MetaCAT (CogStack/MedCAT#515)

* Pushing update for MetaCAT

- Addressing the multiple zero-division-error warnings per epoch while training
- Accommodating the variations in category name and class name across NHS sites

* Adding comments

* Pushing requested changes

* Pushing type fix

* Pushing updates to metacat config

* Support expansion of transformers ner models to include new concepts (CogStack/MedCAT#519)

* CU-8697v6qr2 support expansion of transformers ner models to include new concepts
* CU-8697v6qr2 add logging suggested by the review

* CU-869805t7e alt names fixes (CogStack/MedCAT#520)

* CU-869805t7e: Move getting of applicable category name to the config

* CU-869805t7e: Use alternative category names in eval method

* CU-869805t7e: Reduce indentation

* CU-869805t7e: Reduce indentation (again)

* CU-869805t7e: Some comment fixing due to rearrangements before

* CU-869805t7e: Fix usage of matched class name when encoding category values

* CU-869805t7e: Avoid duplicating exception message

* CU-8697qfvzz train metacat on sup train (CogStack/MedCAT#516)

* CU-8697qfvzz: Add new optional keyword argumnet to allow training MetaCAT models during supervised training

* CU-8697qfvzz: Add tests regarding training meta-cats during supervised training

* CU-8697qfvzz: Fix small typo in comment

* CU-8697qfvzz: Allow using alternative category names if/when training meta cats through CAT.train_supervised

* CU-8698ek477: Fix AdamW import from tranformers to torch (CogStack/MedCAT#523)

* CU-8698ek477: Add TODO to MetaCAT ML utils regarding AdamW import

* CU-8698ek477: Fix AdamW import (trf->torch)

* CU-8698f8fgc: Fix negative sampling including indices for words without a vector (CogStack/MedCAT#524)

* CU-8698f8fgc: Add new test to check that the negative sampling indices do not include non-vectored indices

* CU-8698f8fgc: Add fix for negative sampling including indices for words without a vector

* CU-8698f8fgc: Update tests to make sure index frequencies are respected

* CU-8698f8fgc: Add 3.9-friendly counter totalling method

* CU-8698gkrqa: Add argument to allow specifying the changes warrenting a model save (CogStack/MedCAT#525)

* CU-8698hfkch: Add eval method to deid model

* CU-8698hfkch: lint checks

* CU-8698gqumv: Fix regression test vocab vector sizes (CogStack/MedCAT#526)

* CU-8698gqumv: Add tests for Vocab upon regression testing

* CU-8698gqumv: Fix regression time vocab data

* CU-86983ruw9 Fix test train split (CogStack/MedCAT#521)

* CU-86983ruw9: Fix train-test splitter leaving train set empty for smaller datasets

* CU-86983ruw9: Add additional optional arguments to test-train splitting for minimum concept count and maximum test fraction

* CU-86983ruw9: Add a few tests for test-train splitting

* CU-8698hfkch: Add eval method to deid model (CogStack/MedCAT#527)

* CU-8698hfkch: Add eval method to deid model

* CU-8698hfkch: lint checks

---------

Co-authored-by: Tom Searle <tom@cogstack.org>

* CU-8698jzjj3: pass in extra param if ignore_extra_labels is set, and test

* CU-8698mqu96 Transformers update (4.51.0) fix (CogStack/MedCAT#531)

* CU-8698mqu96: Update special tokens lengths attribute

* CU-8698mqu96: Update MetaCAT usage of BertTokenizer.from_pretrained for type safety

* CU-8698mqu96: Ignore typing where mypy is wrong + add note in code

* CU-8698mqu96: Ignore typing where mypy may be wrong + add comment

* CU-8698mqu96: Fix tokenizer wrapper import for rel cat

* CU-8698mqu96: Rename evaluation strategy keyword argument in line with changes

* CU-8698mqu96: Type-ignore method where mypy says it does not exist

* CU-8698mqu96: Fix TRF-NER output dir typing issue

* CU-8698mqu96: Update a doc string for darglint

* CU-8698mqu96: Fix typing issue for TrfNER trainer callback

* Relation extraction llama (CogStack/MedCAT#522)

* Added files.

* More additions to rel extraction.

* Rel base.

* Update.

* Updates.

* Dependency parsing.

* Updates.

* Added pre-training steps.

* Added training & model utils.

* Cleanup & fixes.

* Update.

* Evaluation updates for pretraining.

* Removed duplicate relation storage.

* Moved RE model file location.

* Structure revisions.

* Added custom config for RE.

* Implemented custom dataset loader for RE.

* More changes.

* Small fix.

* Latest additions to RelCAT (pipe + predictions)

* Setup.py fix.

* RE utils update.

* rel model update.

* rel dataset + tokenizer improvements.

* RelCAT updates.

* RelCAT saving/loading improvements.

* RelCAT saving/loading improvements.

* RelCAT model fixes.

* Attempted gpu learning fix. Dataset label generation fixes.

* Minor train dataset gen fix.

* Minor train dataset gen fix No.2.

* Config updates.

* Gpu support fixes. Added label stats.

* Evaluation stat fixes.

* Cleaned stat output mode during training.

* Build fix.

* removed unused dependencies and fixed code formatting

* Mypy compliance.

* Fixed linting.

* More Gpu mode train fixes.

* Fixed model saving/loading issues when using other baes models.

* More fixes to stat evaluation. Added proper CAT integration of RelCAT.

* Setup.py typo fix.

* RelCAT loading fix.

* RelCAT Config changes.

* Type fix. Minor additions to RelCAT model.

* Type fixes.

* Type corrections.

* RelCAT update.

* Type fixes.

* Fixed type issue.

* RelCATConfig: added seed param.

* Adaptations to the new codebase + type fixes..

* Doc/type fixes.

* Fixed input size issue for model.

* Fixed issue(s) with model size and config.

* RelCAT: updated configs to new style.

* RelCAT: removed old refs to logging.

* Fixed GPU training + added extra stat print for train set.

* Type fixes.

* Updated dev requirements.

* Linting.

* Fixed pin_memory issue when training on CPU.

* Updated RelCAT dataset get + default config.

* Updated RelDS generator + default config

* Linting.

* Updated RelDatset + config.

* Pushing updates to model

Made changes to:
1) Extracting given number of context tokens left and right of the entities
2) Extracting hidden state from bert for all the tokens of the entities and performing max pooling on them

* Fixing formatting

* Update rel_dataset.py

* Update rel_dataset.py

* Update rel_dataset.py

* RelCAT: added test resource files.

* RelCAT: Fixed model load/checkpointing.

* RelCAT: updated to pipe spacy doc call.

* RelCAT: added tests.

* Fixed lint/type issues & added rel tag to test DS.

* Fixed ann id to token issue.

* RelCAT: updated test dataset + tests.

* RelCAT: updates to requested changes + dataset improvements.

* RelCAT: updated docs/logs according to commends.

* RelCAT: type fix.

* RelCAT: mct export dataset updates.

* RelCAT: test updates + requested changes p2.

* RelCAT: log for MCT export train.

* Updated docs + split train_test & dataset for benchmarks.

* type fixes.

* RelCAT: Initial Llama integration.

* RelCAT: updates to Llama impl.

* RelCAT: model typo fix.

* RelCAT: label_id /sample no. mixup fix.

* Updated cleaned up Relataset, added new ways to create relations via anno types (doc/export only for now).

* Added option to predict any text /w annotations via RelCAT. MCT export train fixes.

* RelCAT: added sample limiter / class, more logging info.

* RelCAT: test/train ds shuffle update.

* RelCAT: added option to keep original text when using reldataset class.

* Pushing change for stratified batching

Implement stratified batching for improved class representation and balanced training

* RelCAT: fixed doc processing issue + class weights.

* RelCAT: class weights addtions to cfg + param.

* RelCAT: added config params for Adam optimizer.

* RelCAT updated default config.

* RelCAT: config update + optimizer change.

* RelCAT: fixed model freeze flags.

* RelCAT: model optimizer save/load fix.

* RelCAT: added export ent tag check.

* Fixed issues when saving/loading model for class weights + inference device cast.

* RelCAT: bug fix for ents that are @ EoS.

* Rel Dataset updates.

* Rel Dataset updates.

* Pushing change for ModernBERT

* Bumped transformers version.

* Updated rel dataset generation from fake Spacy Docs.

* ModernBert updates.

* Updated RelCAT model-load/save.

* Minor relCAT updates, code format.

* Type check updates.

* Fixed inference issue.

* RelCAT: testing updates.

* Type fixes.

* Type fixes.

* Type fixes.

* Type fixes IV.

* Type fixes python 3.9.

* RelCAT: flake8 fixes.

* RelCAT: flake8 fixes.

* RelCAT: Updates (fixed model loading after save).

* Fixed test.

* Update RelCAT stuff for improved abstraction

* Move separate model implementations to separate packages

* Some minor abstraction changes

* Remove accidentally copied abstract method decorator

* Fix import in test

* Fix RelCAT impport in pipe tests

* Update base relcat model implementation to include config

* Latest RelCAT module updates.

* Type fixes + run issues.

* Type fixes.

* Fixed Llama tokenizer.

* Type fixes.

* Type fixes: Python3.10 adjustements.

* Linting.

* Fix base flake8 lint issues

* Fix doc string in ConfigRelCAT.load

* Fix base component init doc string

* Fixed BaseComponent.load method doc string

* Fix doc strings in rel_cat ml_utils

* Fix doc strings in rel_cat models module

* Fix rel-cat test time import

* Fix type casting

* Align pipe tests with rel cat changes

* Fix property paths in rel cat tests

* Updates.

* Fixed tests.

* Fixed relCAT config save.

* Latest fixes for model saving/loading.

* Lint fix.

* RelCAT cfg load test fix.

* Remove install requirements from gitignore

---------

Co-authored-by: Shubham Agarwal <66172189+shubham-s-agarwal@users.noreply.github.com>
Co-authored-by: mart-r <mart.ratas@gmail.com>

* CU-8698vewzp: Fix docs requirements (hopefully) (CogStack/MedCAT#534)

* CU-8698veb6y: Use Ubuntu 24.04 for publishing to test PyPI (CogStack/MedCAT#533)

---------

Co-authored-by: Shubham Agarwal <66172189+shubham-s-agarwal@users.noreply.github.com>
Co-authored-by: Xi Bai <82581439+baixiac@users.noreply.github.com>
Co-authored-by: Tom Searle <tom@cogstack.org>
Co-authored-by: tomolopolis <tsearle88@gmail.com>
Co-authored-by: Vlad Dinu <62345326+vladd-bit@users.noreply.github.com>