Commit
Support arbitrarily long docs (#332)
* Add context length info. Refactor BuiltinTask and models to facilitate this.
* Add token count estimator plumbing.
* Add plumbing for mapper and reducer.
* Add ShardMapper prototype.
* Integrating mapping into prompt generation workflow.
* Update response parsing and component to support sharding (WIP).
* Fix shard & prompt flow.
* Fix shard & prompt flow.
* Remove todo comments.
* Fix Anthropic, Cohere, NoOp model tests.
* Fix test_llm_pipe().
* Fix type checking test.
* Fix span parsing tests.
* Fix internal tests.
* Fix _CountTask.
* Fix sentiment and summarization tasks and tests.
* Fix Azure connection URL. Fix Model test pings.
* Fix Lemma parsing.
* Start work on doc-to-shard property copying.
* Fix REL doc preprocessing.
* Remove comment on doc attribute handling during sharding, as this is done by spaCy's slicing directly.
* Add reducer implementations.
* Implement outstanding task reducers.
* Add shardable/non-shardable LLM task typing distinction. Add support for handling both types of tasks. Update tests.
* Fix EL task.
* Fix EL tokenization and highlighting partially.
* Fix tokenization and whitespaces for EL task.
* Add new registry handlers (with context length and arbitrary model names) for all REST models.
* Add sharding test with simple count task.
* Fix sharding algorithm.
* Add test with simple count task.
* Add context length as init arg in HF models.
* Fix tests. Don't stringify IO lists if sharded.
* Fix tests.
* Add NER sharding test.
* Add REL and sentiment sharding tests.
* Add summary sharding tests.
* Add EL sharding task. Fix bug in shard mapper.
* Fix REL error with RELExample parsing.
* Use regex for punctuation in REL conversion.
* Maintain custom doc attributes, incl. test.
* Filter merge warnings in textcat reduction.
* Fix custom doc data merging.
* Update spacy_llm/models/langchain/model.py
  Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy_llm/pipeline/llm.py
  Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Incorporate feedback.
* Move sharding compatibility warning to component constructor.
* Update spacy_llm/tasks/entity_linker/util.py
  Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy_llm/models/hf/base.py
  Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Incorporate feedback.
* Fix doc string

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
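
The commit message names a token count estimator, a shard mapper, reducers, and a sharding test built on a simple count task, but the page does not show how they fit together. The following is a minimal, library-free sketch of that general map/reduce idea, not the spacy-llm implementation: every name here (`estimate_tokens`, `shard_words`, `count_task`, `reduce_counts`, `run_sharded`, `prompt_overhead`) is hypothetical, and the whitespace-based token estimate is an assumption made purely for illustration.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    """Crude token estimate: number of whitespace-separated words (assumption)."""
    return len(text.split())


def shard_words(text: str, context_length: int, prompt_overhead: int = 0) -> List[str]:
    """Map step: split text into shards whose estimated token count fits the
    context length left over after a fixed prompt overhead (hypothetical)."""
    budget = context_length - prompt_overhead
    if budget <= 0:
        raise ValueError("context_length must exceed prompt_overhead")
    words = text.split()
    # Consecutive, non-overlapping windows of at most `budget` words.
    return [" ".join(words[i:i + budget]) for i in range(0, len(words), budget)]


def count_task(shard: str) -> int:
    """Stand-in for a per-shard model call: counts the words in one shard."""
    return len(shard.split())


def reduce_counts(results: List[int]) -> int:
    """Reduce step: merge per-shard results into one document-level result."""
    return sum(results)


def run_sharded(text: str, context_length: int, prompt_overhead: int = 0) -> int:
    """Shard the text, run the task on each shard, then reduce the results."""
    shards = shard_words(text, context_length, prompt_overhead)
    return reduce_counts([count_task(shard) for shard in shards])
```

For a count task the reducer is a plain sum, so the sharded result should equal the unsharded one; tasks such as NER or summarization would need task-specific reducers, which this sketch does not attempt.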
Showing 94 changed files with 3,441 additions and 1,113 deletions.