Da 24/reading comprehension #74

Merged 74 commits on Dec 4, 2023
298e100
Broken reading-comprehension generators and generator training
metric-space Nov 15, 2023
ce4a23f
Refactor out shared utils and add domain tokenizer training as an opti…
metric-space Nov 16, 2023
9987c06
Corrections
metric-space Nov 16, 2023
b842667
Further corrections
metric-space Nov 16, 2023
f6d253a
Add q&a extractor as a util and to the pipeline example script
metric-space Nov 16, 2023
571b41c
Util correction and context addition to output
metric-space Nov 16, 2023
5c349ca
Revert previous corrections
metric-space Nov 16, 2023
f11797b
Chatml outputformat for regex based rc and remove domain keyword
metric-space Nov 17, 2023
21b6210
Generator additions
metric-space Nov 17, 2023
3c49970
More generator corrections and additions
metric-space Nov 17, 2023
c7b63f6
Add train num epochs as an arg to the generator training script
metric-space Nov 17, 2023
8a9ec67
Add new dependencies
metric-space Nov 18, 2023
c0c530f
1. json dump when writing to file
metric-space Nov 18, 2023
fcdfa3b
Post pipeline test run corrections
metric-space Nov 19, 2023
0f7b7e0
Remove (explicit) use of ConstantLengthDataset
metric-space Nov 21, 2023
7515b0a
Lift state management to a higher level and some corrections
metric-space Nov 21, 2023
75c7811
Add option to save dataset as huggingface dataset
metric-space Nov 21, 2023
979408c
Training script corrections
metric-space Nov 21, 2023
42f77f5
Trainer script cleanup and corrections
metric-space Nov 21, 2023
de7f60c
Add cloud friendly logger
metric-space Nov 21, 2023
ead9e7b
Reformatted synthetic dataset generation + corrections
metric-space Nov 21, 2023
3efa61b
More formatting for the synth-gen script
metric-space Nov 21, 2023
1d39491
More corrections to synth-gen
metric-space Nov 21, 2023
3abc635
Regex-gen changes and banner addition
metric-space Nov 21, 2023
980cc6a
Missing comma
metric-space Nov 21, 2023
1868423
Type hint all the functions in utils
metric-space Nov 21, 2023
ce9b352
Lightly refactor the pipeline
metric-space Nov 21, 2023
3760fc5
Address proper negation of lags
metric-space Nov 21, 2023
fe85703
Switch out generator for iterator when type hinting
metric-space Nov 21, 2023
331eaed
Util typing correction
metric-space Nov 21, 2023
8c1a35a
Correct all linting issues
metric-space Nov 22, 2023
e77d273
Pipeline corrections
metric-space Nov 22, 2023
2f91c25
More corrections for output type of generator
metric-space Nov 22, 2023
3b8e270
More corrections to the pipeline
metric-space Nov 22, 2023
a37b5d3
Appeasing the linter for the pipeline code
metric-space Nov 22, 2023
280e2dd
Appeasing the linter for llm synth script
metric-space Nov 22, 2023
3facf92
Appeasing the linter for the training script
metric-space Nov 22, 2023
3f50761
Linter based corrections for utils
metric-space Nov 22, 2023
8bb10c4
More appeasing of the linter and work arounds
metric-space Nov 22, 2023
42330eb
Incorporate csv reading and associated changes
metric-space Nov 22, 2023
194cf91
Unicode decoding revisit
metric-space Nov 22, 2023
451015b
More fixes
metric-space Nov 22, 2023
2e3fd08
Forgot to put in replace line
metric-space Nov 22, 2023
9a430e5
Better logging and removal of statefile and more corrections
metric-space Nov 23, 2023
3eb3b44
Add missing general spm input validation line to pipeline script
metric-space Nov 23, 2023
14f65c5
More validation lines for pipeline
metric-space Nov 23, 2023
55f03ef
More corrections
metric-space Nov 23, 2023
4d5802d
Banner correction, corrections
metric-space Nov 23, 2023
3f17238
Start of README.md and add general sentencepiece model to resources
metric-space Nov 23, 2023
88a74c1
Add defaults for cli args
metric-space Nov 23, 2023
4ab1211
Add more detail to README.md
metric-space Nov 23, 2023
b6fa0ae
Add defaults to function
metric-space Nov 23, 2023
644c294
Defaults
metric-space Nov 23, 2023
9c2f00d
README.md for rc pipeline
metric-space Nov 23, 2023
1d4ec97
transformers version dependency constraint
metric-space Nov 23, 2023
9171734
alpha -> beta
metric-space Nov 23, 2023
1af81cd
Better warning message
metric-space Nov 23, 2023
3039050
Correct description in README.md
metric-space Nov 23, 2023
4d93d4a
Stream arg correction for trainer
metric-space Nov 23, 2023
5fb1252
Add prompt link to README
metric-space Nov 23, 2023
958f1f4
Add general spm to resources
metric-space Nov 27, 2023
749f11c
- Better input content generator (deals with directory of csv(s))
metric-space Nov 29, 2023
903ba18
Vocab size second-try and key error fix
metric-space Nov 29, 2023
bc63323
Correct logging
metric-space Nov 29, 2023
a6f0e6e
Update dalm/pipelines/reading_comprehension_pipeline.py
metric-space Dec 2, 2023
0b89e1d
Update dalm/pipelines/reading_comprehension_pipeline.py
metric-space Dec 2, 2023
340b969
Update dalm/datasets/reading_comprehension_generation/synthetic_based.py
metric-space Dec 2, 2023
9ae906d
Update dalm/datasets/reading_comprehension_generation/utils.py
metric-space Dec 2, 2023
1d3ed52
Update dalm/pipelines/reading_comprehension_pipeline.py
metric-space Dec 2, 2023
a7ab91c
Update dalm/pipelines/reading_comprehension_pipeline.py
metric-space Dec 2, 2023
9cb16ee
Corrections
metric-space Dec 2, 2023
4025e45
Post linting
metric-space Dec 2, 2023
39b7f1e
Update README with suggested corrections
metric-space Dec 2, 2023
81a2ea0
grammar corrections
metric-space Dec 2, 2023
Add train num epochs as an arg to the generator training script
metric-space committed Nov 17, 2023
commit c7b63f63d0b6d157a42ade5c54a906d1ab455594
5 changes: 4 additions & 1 deletion dalm/training/generator_only/trainer.py
@@ -95,6 +95,7 @@ def parse_args():
 parser.add_argument("--num_workers", type=int, default=4, help="the number of workers")

 parser.add_argument("--eval_steps", type=int, default=200, help="the evaluation frequency")
+parser.add_argument("--num_train_epochs", type=int, default=3, help="the number of training epochs")
 parser.add_argument("--logging_steps", type=int, default=10, help="the logging frequency")
 parser.add_argument("--per_device_train_batch_size", type=int, default=1, help="the per device train batch size")
 parser.add_argument("--per_device_eval_batch_size", type=int, default=1, help="the per device eval batch size")
@@ -124,6 +125,7 @@ def parse_args():
 def train_generator(
 model_name,
 dataset_name,
+num_train_epochs,
 split,
 size_valid_set,
 streaming,
@@ -183,7 +185,7 @@ def train_generator(
 per_device_eval_batch_size=per_device_eval_batch_size,
 learning_rate=learning_rate,
 logging_steps=logging_steps,
-num_train_epochs=3,
+num_train_epochs=num_train_epochs,
 report_to=log_with,
 save_strategy="epoch",
 evaluation_strategy="steps",
@@ -244,6 +246,7 @@ def main():
 shuffle_buffer=args.shuffle_buffer,
 seq_length=args.seq_length,
 num_workers=args.num_workers,
+num_train_epochs=args.num_train_epochs,
 eval_steps=args.eval_steps,
 logging_steps=args.logging_steps,
 per_device_train_batch_size=args.per_device_train_batch_size,
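
The change in this commit follows a common pattern: a value that was hard-coded inside the training function (3 epochs) becomes a CLI flag with the same default, and is passed through explicitly at each call site. The sketch below illustrates that pattern in isolation. It is not the code from `trainer.py`: `build_training_kwargs` is a hypothetical stand-in for the keyword arguments that `train_generator` hands to its training configuration, and the `argv` parameter on `parse_args` is added here only so the example can be run without a real command line.

```python
import argparse
from typing import Any, Dict, List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    # Same flag name, type, and default as the line added in this commit.
    parser.add_argument("--num_train_epochs", type=int, default=3, help="the number of training epochs")
    parser.add_argument("--logging_steps", type=int, default=10, help="the logging frequency")
    return parser.parse_args(argv)


def build_training_kwargs(num_train_epochs: int, logging_steps: int) -> Dict[str, Any]:
    # Hypothetical helper: stands in for the keyword arguments built inside
    # train_generator. The epoch count now comes from the caller instead of
    # being the literal 3.
    return {
        "num_train_epochs": num_train_epochs,
        "logging_steps": logging_steps,
    }


# Simulate `--num_train_epochs 5` on the command line.
args = parse_args(["--num_train_epochs", "5"])
kwargs = build_training_kwargs(
    num_train_epochs=args.num_train_epochs,
    logging_steps=args.logging_steps,
)
print(kwargs)
```

Because the flag's default is 3, invocations that do not pass `--num_train_epochs` keep the behavior they had before this commit.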