
Commit 4eff3cb

Merge branch 'master' into peft

2 parents: 1a90bb2 + 3eb727f

37 files changed (+636, -8344 lines)

README.md (+34, -83)
@@ -20,9 +20,6 @@ Transform texts in a hundred different [languages](https://github.com/artitw/tex
 * [Index](https://github.com/artitw/text2text#index)
 * [Distance](https://github.com/artitw/text2text#levenshtein-sub-word-edit-distance)
 * [Translation](https://github.com/artitw/text2text#translation)
-* [Question Answering](https://github.com/artitw/text2text#question-answering)
-* [Question Generation](https://github.com/artitw/text2text#question-generation)
-* [Summarization](https://github.com/artitw/text2text#summarization)
 * [Data Augmentation](https://github.com/artitw/text2text#data-augmentation--back-translation)
 * [Finetuning](https://github.com/artitw/text2text#training--finetuning)
 * [Identification](https://github.com/artitw/text2text#identification)
@@ -55,15 +52,12 @@ Module Importing | `import text2text as t2t` | Libraries imported
 [Assistant](https://github.com/artitw/text2text#assistant) | `t2t.Assistant().transform("Describe Text2Text in a few words: ")` | `['Text2Text is an AI-powered text generation tool that creates coherent and continuous text based on prompts.']`
 [Language Model Setting](https://github.com/artitw/text2text#byot-bring-your-own-translator) | `t2t.Transformer.PRETRAINED_TRANSLATOR = "facebook/m2m100_418M"` | Change from default
 [Tokenization](https://github.com/artitw/text2text#tokenization) | `t2t.Tokenizer().transform(["Hello, World!"])` | `[['▁Hello', ',', '▁World', '!']]`
-[Embedding](https://github.com/artitw/text2text#embedding--vectorization) | `t2t.Vectorizer().transform(["Hello, World!"])` | `array([[0.18745188, 0.05658336, ..., 0.6332584 , 0.43805206]], dtype=float32)`
+[Embedding](https://github.com/artitw/text2text#embedding--vectorization) | `t2t.Vectorizer().transform(["Hello, World!"])` | `[[0.18745188, 0.05658336, ..., 0.6332584 , 0.43805206]]`
 [TF-IDF](https://github.com/artitw/text2text#tf-idf) | `t2t.Tfidfer().transform(["Hello, World!"])` | `[{'!': 0.5, ',': 0.5, '▁Hello': 0.5, '▁World': 0.5}]`
 [BM25](https://github.com/artitw/text2text#bm25) | `t2t.Bm25er().transform(["Hello, World!"])` | `[{'!': 0.3068528194400547, ',': 0.3068528194400547, '▁Hello': 0.3068528194400547, '▁World': 0.3068528194400547}]`
 [Indexer](https://github.com/artitw/text2text#index) | `index = t2t.Indexer().transform(["Hello, World!"])` | Index object for information retrieval
 [Translation](https://github.com/artitw/text2text#translation) | `t2t.Translator().transform(["Hello, World!"], src_lang="en", tgt_lang="zh")` | `['你好,世界!']`
-[Question Generation](https://github.com/artitw/text2text#question-generation) | `t2t.Questioner().transform(["Hello, World!"], src_lang="en")` | `[('What is the name of the world you are in?', 'The world')]`
-[Summarization](https://github.com/artitw/text2text#summarization) | `t2t.Summarizer().transform(["Hello, World!"], src_lang="en")` | `["World ' s largest world"]`
 [Data Augmentation](https://github.com/artitw/text2text#data-augmentation--back-translation) | `t2t.Variator().transform(["Hello, World!"], src_lang="en")` | `['Hello the world!', 'Welcome to the world.', 'Hello to the world!',...`
-[Question Answering](https://github.com/artitw/text2text#question-answering) | `t2t.Answerer().transform(["Hello, World! [SEP] Hello, what?"], src_lang="en")` | `['World']`
 [Distance](https://github.com/artitw/text2text#levenshtein-sub-word-edit-distance) | `t2t.Measurer().transform(["Hello, World! [SEP] Hello, what?"])` | `[2]`
 [Training/Finetuning](https://github.com/artitw/text2text#training--finetuning) | `t2t.Fitter().transform(["Hello, World! [TGT] Hello, what?"])` | Finetuned model saved
 [Identification](https://github.com/artitw/text2text#identification) | `t2t.Identifier().transform(["Aj keď sa Buzz Aldrin stal až „druhým človekom“..."])` | `['sk', 'Slovak']`
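
Each row in this table is a self-contained one-liner. A minimal sketch chaining a few of them; the expected outputs shown in comments are taken directly from the table above:

```
import text2text as t2t

# One-liners from the table above; expected outputs per the table.
tokens = t2t.Tokenizer().transform(["Hello, World!"])    # [['▁Hello', ',', '▁World', '!']]
vectors = t2t.Vectorizer().transform(["Hello, World!"])  # [[0.18745188, 0.05658336, ...]]
weights = t2t.Tfidfer().transform(["Hello, World!"])     # [{'!': 0.5, ',': 0.5, '▁Hello': 0.5, '▁World': 0.5}]
chinese = t2t.Translator().transform(["Hello, World!"], src_lang="en", tgt_lang="zh")  # ['你好,世界!']
```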
@@ -203,6 +197,33 @@ chat_history = [
 result = asst.chat_completion(chat_history, stream=True) #{'role': 'assistant', 'content': '1. Make a list of things to be grateful for.\n2. Go outside and take a walk in nature.\n3. Practice mindfulness meditation.\n4. Connect with a loved one or friend.\n5. Do something kind for someone else.\n6. Engage in a creative activity like drawing or writing.\n7. Read an uplifting book or listen to motivational podcasts.'}
 for chunk in result:
     print(chunk['message']['content'], end='', flush=True)
+
+# Running conversation
+messages = []
+while True:
+    user_input = input("User: ")
+    print()
+    messages.append({"role": "user", "content": user_input})
+    print("Assistant: ")
+    result = asst.chat_completion(messages, stream=False)
+    print(result["message"]["content"])
+    messages.append(result["message"])
+    print()
+
+# Schema for structured output
+from pydantic import BaseModel
+
+class Song(BaseModel):
+    name: str
+    artist: str
+
+result = asst.chat_completion([
+    {"role": "user", "content": "What is Britney Spears's best song?"}
+], schema=Song)
+# Song(name='Toxic', artist='Britney Spears')
+
+# Embeddings
+asst.embed(["hello, world!", "this will be embedded"])
 ```
 
 ### Tokenization
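
A note on the new running-conversation example: it uses `stream=False` and appends `result["message"]` back onto the history. A minimal sketch of a streaming variant, assuming (as the streamed example above suggests) that the `chunk['message']['content']` pieces concatenate into the complete reply:

```
# Sketch only: streaming variant of the running conversation above.
# Assumes streamed chunk['message']['content'] pieces concatenate
# into the full assistant reply.
messages = []
while True:
    user_input = input("User: ")
    messages.append({"role": "user", "content": user_input})
    print("Assistant: ")
    reply = ""
    for chunk in asst.chat_completion(messages, stream=True):
        piece = chunk['message']['content']
        reply += piece
        print(piece, end='', flush=True)
    print()
    # Append the reconstructed reply so the history stays complete.
    messages.append({"role": "assistant", "content": reply})
```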
@@ -228,12 +249,12 @@ t2t.Vectorizer().transform([
 ])
 
 # Embeddings
-array([[-0.00352954,  0.0260059 ,  0.00407429, ..., -0.04830331,
-        -0.02540749, -0.00924972],
-       [ 0.00043362,  0.00249816,  0.01755436, ...,  0.04451273,
-         0.05118701,  0.01895813],
-       [-0.03563676, -0.04856304,  0.00518898, ..., -0.00311068,
-         0.00071953, -0.00216325]])
+[[-0.00352954,  0.0260059 ,  0.00407429, ..., -0.04830331,
+  -0.02540749, -0.00924972],
+ [ 0.00043362,  0.00249816,  0.01755436, ...,  0.04451273,
+   0.05118701,  0.01895813],
+ [-0.03563676, -0.04856304,  0.00518898, ..., -0.00311068,
+   0.00071953, -0.00216325]]
 ```
 
 ### TF-IDF
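
Since this commit changes the `Vectorizer` output from a NumPy `array(...)` to plain nested lists, downstream code no longer needs NumPy to consume it. A minimal sketch comparing two returned rows; the `cosine` helper is ours for illustration, not part of text2text:

```
import math
import text2text as t2t

def cosine(u, v):
    # Plain-Python cosine similarity; enough here because the new
    # Vectorizer output is a list of lists rather than a NumPy array.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

embeddings = t2t.Vectorizer().transform(["Hello, World!", "Hello, what?"])
print(cosine(embeddings[0], embeddings[1]))
```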
@@ -429,76 +450,6 @@ t2t.Translator().transform(
 
 </details>
 
-### Question Answering
-Question must follow context with ` [SEP] ` in between.
-```
-t2t.Answerer().transform([
-    "Hello, this is Text2Text! [SEP] What is this?",
-    "It works very well. It's awesome! [SEP] How is it?"
-])
-
-t2t.Answerer().transform([
-    "很喜欢陈慧琳唱歌。[SEP] 喜欢做什么?"
-], src_lang="zh")
-
-# Answers
-['Text2Text', 'awesome']
-['唱歌']
-```
-
-### Question Generation
-```
-t2t.Questioner().transform(["很喜欢陈慧琳唱歌。"], src_lang='zh')
-t2t.Questioner().transform([
-    bio_str,
-    bio_str,
-    bio_str,
-    bio_str,
-    bio_str,
-    "I will go to school today to take my math exam.",
-    "I will go to school today to take my math exam.",
-    "Tomorrow is my cousin's birthday. He will turn 24 years old.",
-    notre_dame_str,
-    bacteria_str,
-    bacteria_str,
-    bacteria_str,
-    "I will go to school today to take my math exam. [SEP] school",
-    "I will go to school today to take my math exam. [SEP] exam",
-    "I will go to school today to take my math exam. [SEP] math",
-], src_lang='en')
-
-```
-Note that the last three answers were controlled by specifying the `[SEP]` token in the input above.
-```
-# Questions
-[('我喜欢做什么?', '唱歌')]
-[('What is biology the science that studies?', 'life'),
- ('What is the study of life?', 'studies'),
- ('What would you find the question " life "?', 'sound'),
- ('What can viruses do to living organisms?', 'attack'),
- ('What is the study of life?', 'studies'),
- ('Where will I go to to take my math exam?', 'school'),
- ('Where will I go to to take my math exam?', 'school'),
- ("What will my cousin's birthday?", 'turn'),
- ('What type of oversight does The Observer not have?', 'editorial'),
- ('What shape can bacteria be found in?', 'rods'),
- ('What is the typical length of bacteria?', 'micrometres'),
- ('What is the typical length of bacteria?', 'micrometres'),
- ('Where will I go to to take my math exam?', 'school'),
- ('What will I take after school?', 'exam'),
- ('What exam will I take?', 'math')]
-```
-
-### Summarization
-```
-t2t.Summarizer().transform([notre_dame_str, bacteria_str, bio_str], src_lang='en')
-
-# Summaries
-["Notre Dame's students run nine student - run outlets . [X_SEP] Scholastic magazine claims to be the oldest continuous collegiate publication in the United States . [X_SEP] The Observer is an independent publication .",
- 'Bacteria were among the first life forms to appear on Earth .',
- 'biology is the science that studies life .']
-```
-
 ### Data Augmentation / Back-Translation
 Back-translations are useful for augmenting training data
 ```
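
This hunk drops the dedicated Question Answering, Question Generation, and Summarization sections, matching the rows removed from the table earlier in the diff. The diff itself names no replacement; one plausible path, sketched here purely as an assumption, is to route such requests through the `Assistant` chat interface this commit expands:

```
# Assumption, not from this commit: covering the removed Summarizer
# use case via the Assistant's chat_completion interface shown above.
import text2text as t2t

asst = t2t.Assistant()
result = asst.chat_completion([
    {"role": "user", "content": "Summarize in one sentence: "
        "Bacteria were among the first life forms to appear on Earth."}
])
print(result["message"]["content"])
```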

0 commit comments