Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add tokenizer and sparse encoding (#1301)
* add tokenizer and sparse encoding
* add tokenizer and sparse encoding
* add tokenizer and sparse encoding
* add tokenizer and sparse encoding
* add tokenizer and sparse encoding
* remove special token
* add filter
* try empty model
* remove warm up
* try empty model
* add block
* add log
* add log
* add log
* remove log
* remove pt file detect
* add log
* add functionName pipeline
* remove verify log
* skip special token in sparse encoding
* skip omit tokenize config
* skip omit tokenize config-change warm up logic
* reArch
* deduplicate
* omit ml config in sparse encoding
* add null config in warm up
* fix original test
* add tokenize ut half
* fix sparse encoding bug
* add UT for sparse encoding and tokenize
* remove useless framework type
* common/src/test/java/org/opensearch/ml/common/input/MLInputTest.java
* change key for tokenize
* reArch DLModel
* reArch DLModel again
* response format
* tokenize only one output
* clean sparse output
* clean sparse output
* change UT number
* remove useless predict code
* remove useless part
* change tokenize way
* reArch add textEmbedding model
* add tokenize logic
* add abstract
* clear code
* fix it class
* fix it class
* add IT file
* reformulate
* reformulate remote inference
* reformulate remote inference
* reformulate remote inference json and array
* verify
* undo string utils
* skip dummy model
* skip dummy model
* skip dummy model
* skip dummy model
* skip dummy model
* skip dummy model
* add inner load Model
* rename variable
* add default for idf
* add ut for sparse encoding and tokenizer
* add close model
* change mock class
* remove buffer for sparse encoding output
* change tokenize model ready logic
* rewrite input functionName
* deduplicate
* change UT usage
* fix downloadAndSplit test
* fix Helper test
* remove meaningless change
* remove compile change
* rename
* fix typo error and simplify wrap code
* add comment
* using gson and remove useless close logic
* update comment and import problem
* add static idf name
* fix format problem
* extract an abstract model for sparse and dense sentence transformer translator
* fix typo error
* remove duplicate tokenizer file, fix import problem and add comment for tokenizer model

---------

Signed-off-by: xinyual <xinyual@amazon.com>
(cherry picked from commit 31a4e25)