Add support for text data & tokenization by s314cy · Pull Request #577 · epfml/disco

s314cy · 2023-04-20T09:15:35Z

closes #572 and closes #491

add support for text data & tokenization:

- tokenize samples
- load labelled text data in the browser/node
- load unlabelled text data in the browser/node
- lazily load text data in node
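
To illustrate what lazy loading of labelled and unlabelled text could look like, here is a minimal sketch. It is not the code from this PR: `TextSample`, `parseLines`, `take` and the tab-separated `text<TAB>label` line format are all assumptions made for the example. The source is any `Iterable<string>`; in Node it could be backed by a line reader over a file, which is left out here.

```typescript
// Hypothetical sample type: the label is absent for unlabelled data.
interface TextSample {
  text: string
  label?: string
}

// Lazily turn lines into samples. Assumed format: "text<TAB>label" for
// labelled data, plain "text" for unlabelled data. Empty lines are skipped.
function * parseLines (lines: Iterable<string>): Generator<TextSample> {
  for (const line of lines) {
    if (line.trim() === '') continue
    const tab = line.indexOf('\t')
    yield tab === -1
      ? { text: line }
      : { text: line.slice(0, tab), label: line.slice(tab + 1) }
  }
}

// Pull at most n items from a source, without reading the rest of it.
function * take<T> (source: Iterable<T>, n: number): Generator<T> {
  if (n <= 0) return
  let taken = 0
  for (const item of source) {
    yield item
    taken++
    if (taken >= n) return
  }
}

// Usage: only the lines needed to produce one sample are ever read.
const exampleLines = ['great movie\tpositive', 'some unlabelled sentence']
const firstSample = Array.from(take(parseLines(exampleLines), 1))
console.log(firstSample)
```

Because both functions are generators, nothing is read until a consumer asks for the next sample, which is the property that matters for datasets too large to hold in memory.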

this PR also reworks the data preprocessing pipeline, which is now much more modular and makes it easy to add new preprocessing functions!
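
As a rough picture of what "modular" can mean here, the sketch below keeps each preprocessing function as an independent entry in a registry and builds a pipeline from a list of names. It is an assumption-laden illustration, not the disco implementation: `TextStep`, `registry`, `buildPipeline` and the three example steps are hypothetical names.

```typescript
// A preprocessing step maps a text sample to a new text sample.
type TextStep = (text: string) => string

// Hypothetical registry: adding a new preprocessing function means
// adding one entry here, without touching the pipeline logic.
const registry: Record<string, TextStep> = {
  lowercase: (text) => text.toLowerCase(),
  stripPunctuation: (text) => text.replace(/[.,!?;:]/g, ''),
  collapseWhitespace: (text) => text.trim().replace(/\s+/g, ' ')
}

// Build a single function that applies the named steps in order.
function buildPipeline (names: readonly string[]): TextStep {
  const steps = names.map((name) => {
    const step = registry[name]
    if (step === undefined) {
      throw new Error(`unknown preprocessing step: ${name}`)
    }
    return step
  })
  return (text) => steps.reduce((acc, step) => step(acc), text)
}

// Usage: the list of names could come from a task's configuration.
const preprocess = buildPipeline(['lowercase', 'stripPunctuation', 'collapseWhitespace'])
console.log(preprocess('  Hello,   World! ')) // prints "hello world"
```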

it also fixes the CI by:

- making the GitHub Actions data cache run-specific
- ensuring the data download script bypasses the gbucket cache
- repackaging the example data archive as a GNU tar instead of a BSD tar (macOS vs. Linux); the mismatch caused issues in the CI

martinjaggi · 2023-04-20T09:37:20Z

discojs/discojs-core/src/dataset/preprocessing/text_preprocessing.ts

+  Tokenize = 'tokenize'
+}
+
+export function getPreprocessImage (task: Task): PreprocessText {


should this one be called image?

also mind adding a comment if you will output a stream of token ids?

for LLMs, we can then also support datasets without any label being needed

also let's say where/how people could load different tokenizers (task config or hardcoded either is fine)

I'll make sure that the PR follows your comments once it's out of the "draft" stage!
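
For readers following the thread, here is one way the two requests above (outputting token ids, and choosing the tokenizer from the task config) could fit together. This is only a sketch under stated assumptions: `Tokenizer`, `TextTaskConfig`, `getTokenizer`, `tokenizeAll` and the two toy tokenizers are hypothetical and are not taken from the PR or from any tokenizer library.

```typescript
// A tokenizer maps raw text to a sequence of token ids.
type Tokenizer = (text: string) => number[]

// Toy tokenizer: splits on whitespace and assigns ids in order of first appearance.
function makeWhitespaceTokenizer (): Tokenizer {
  const vocabulary = new Map<string, number>()
  return (text) => text
    .split(/\s+/)
    .filter((word) => word.length > 0)
    .map((word) => {
      const known = vocabulary.get(word)
      if (known !== undefined) return known
      const id = vocabulary.size
      vocabulary.set(word, id)
      return id
    })
}

// Toy tokenizer: one token per character, using its Unicode code point as id.
const characterTokenizer: Tokenizer = (text) =>
  Array.from(text, (char) => char.codePointAt(0) ?? 0)

// Hypothetical slice of a task configuration selecting the tokenizer.
interface TextTaskConfig {
  tokenizer: 'whitespace' | 'character'
}

function getTokenizer (config: TextTaskConfig): Tokenizer {
  switch (config.tokenizer) {
    case 'whitespace': return makeWhitespaceTokenizer()
    case 'character': return characterTokenizer
  }
}

// Tokenize unlabelled samples: no label is needed, as for LLM-style training.
function tokenizeAll (samples: readonly string[], tokenizer: Tokenizer): number[][] {
  return samples.map((sample) => tokenizer(sample))
}

const tokenizer = getTokenizer({ tokenizer: 'whitespace' })
console.log(tokenizeAll(['the cat saw the dog'], tokenizer)) // prints [ [ 0, 1, 2, 0, 3 ] ]
```

A hardcoded tokenizer would simply skip `getTokenizer`; keeping the choice in the config makes it visible to anyone defining a new task.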

martinjaggi · 2023-04-20T09:39:20Z

very cool, thanks for getting this started!

s314cy added the feature (New feature or request) and discojs (Related to Disco.js) labels Apr 20, 2023

s314cy self-assigned this Apr 20, 2023

martinjaggi reviewed Apr 20, 2023

s314cy force-pushed the 572-tokenizer-support-s314cy branch 2 times, most recently from 283003b to 8834d99 on April 24, 2023 14:26

s314cy mentioned this pull request May 4, 2023

add documentation in TASK for nlp and lstm task #565

Closed

s314cy force-pushed the 572-tokenizer-support-s314cy branch from 57803d6 to e8c307f on May 4, 2023 10:37

s314cy force-pushed the 572-tokenizer-support-s314cy branch 2 times, most recently from 55be642 to 6711773 on May 23, 2023 12:54

s314cy force-pushed the 572-tokenizer-support-s314cy branch 2 times, most recently from 699116f to 9c96d71 on July 6, 2023 12:13

s314cy added 2 commits July 31, 2023 13:48

feat: add support for text data & tokenization

0892e40

fix export of immutable collections

f78a7e9

s314cy force-pushed the 572-tokenizer-support-s314cy branch from bce22ad to f78a7e9 on July 31, 2023 11:48

s314cy marked this pull request as ready for review July 31, 2023 11:48

s314cy merged commit acd4250 into develop Jul 31, 2023

s314cy deleted the 572-tokenizer-support-s314cy branch July 31, 2023 11:49

s314cy merged 2 commits into develop from 572-tokenizer-support-s314cy

2 participants