V1.0 candidate; new deduper options, new taggers #100

Merged on Feb 1, 2024 (99 commits)
Commits
5634389
added more runs
soldni Nov 27, 2023
936bae3
new plots
soldni Nov 28, 2023
11be06d
tokenizer fix
soldni Nov 28, 2023
4e43dbe
squatted
soldni Nov 28, 2023
05a6656
new lang id
soldni Nov 29, 2023
997cf7d
all fasttext lang id
soldni Nov 29, 2023
dde3bb5
plots
soldni Nov 29, 2023
5bcbcd8
further plots
soldni Dec 1, 2023
bebd46f
wip
soldni Dec 1, 2023
907de63
progress!
soldni Dec 1, 2023
747eb52
style
soldni Dec 1, 2023
e6a1fd0
fixed format
soldni Dec 1, 2023
fdc9b13
added configs
soldni Dec 1, 2023
4040b55
dts
soldni Dec 1, 2023
876b9d4
configs
soldni Dec 2, 2023
6d874dd
more
soldni Dec 2, 2023
172172d
refine
soldni Dec 2, 2023
fdcb9bc
fix
soldni Dec 2, 2023
fa8ae25
fix
soldni Dec 2, 2023
5a59215
adding new features to deduper
soldni Dec 3, 2023
ed7c990
accidentally removed tests
soldni Dec 3, 2023
9c45b91
added cli options
soldni Dec 3, 2023
a6c89d0
big commit
soldni Dec 3, 2023
4d0ef02
improvement to tokenizer
soldni Dec 3, 2023
87d2801
bumping version
soldni Dec 3, 2023
f8da3db
fix error in empty
soldni Dec 3, 2023
430f7f2
new dedupe docs
soldni Dec 8, 2023
8d1f1f6
names
soldni Dec 8, 2023
fca1bae
configs
soldni Dec 19, 2023
4808b15
fixed paths
soldni Dec 19, 2023
0d49ec4
stack
soldni Dec 19, 2023
c80ca46
switched to v2
soldni Dec 19, 2023
486a350
fixed dedupe config
soldni Dec 19, 2023
4e25e4d
updated
soldni Dec 20, 2023
6ad8b1c
middle dedupe
soldni Dec 20, 2023
9c80a8b
mix text length
soldni Dec 20, 2023
b9dca47
Reddit processing code (#74)
drschwenk Nov 30, 2023
729d2e4
Merge branch 'main' into soldni/paper
soldni Dec 20, 2023
e8e2e98
more plots
soldni Dec 20, 2023
fd6b730
fixed version
soldni Dec 20, 2023
266548f
names
soldni Dec 20, 2023
0e83e52
different path
soldni Dec 20, 2023
4df8bff
added support for retries
soldni Dec 20, 2023
9541957
wip test
soldni Dec 21, 2023
be42570
fixed tests
soldni Dec 21, 2023
d2ab428
fixed
soldni Dec 21, 2023
ced2a2d
removing repetitions
soldni Dec 21, 2023
62a8d8c
dedupe docs
soldni Dec 21, 2023
1c86ee5
Merge branch 'main' into soldni/paper
soldni Dec 21, 2023
7335601
reddit stats
soldni Dec 21, 2023
785ac9e
paths
soldni Dec 21, 2023
63a1d1d
bugfix
soldni Dec 21, 2023
698a968
base
soldni Dec 21, 2023
357a740
version of pycld2 that compiles on M macs
soldni Dec 22, 2023
f4c3b9e
new config middle
soldni Dec 22, 2023
1f5f7d2
3 parts
soldni Dec 22, 2023
cad2030
further s3 tests
soldni Dec 23, 2023
f5fa8e6
decode
soldni Dec 23, 2023
1505c83
still write empty docs to attributes when skip_empty is True
soldni Dec 23, 2023
f2f1008
wiki adjusted
soldni Dec 27, 2023
c7dfbc7
wiki config
soldni Dec 27, 2023
9b6a526
simple counts
soldni Dec 28, 2023
1b88496
changed path
soldni Dec 30, 2023
170e0af
added new features
soldni Jan 2, 2024
a94d38f
plots
soldni Jan 9, 2024
e5f6f09
added new digits vocab
soldni Jan 9, 2024
4af1ef3
added config to sample
soldni Jan 4, 2024
378641d
small
soldni Jan 9, 2024
c740b8e
added tokenizer script
soldni Jan 10, 2024
2133298
merging
soldni Jan 15, 2024
13d809e
code abl
soldni Jan 15, 2024
35a21cd
cargo
soldni Jan 15, 2024
898374e
version bump
soldni Jan 17, 2024
586cc32
made it stable
soldni Jan 17, 2024
eb58c57
topics
soldni Jan 17, 2024
2dd17ac
sampling
soldni Jan 18, 2024
1afe414
rename
soldni Jan 18, 2024
9afab09
new config for 1.6
soldni Jan 20, 2024
9679158
Merge branch 'main' into soldni/paper
soldni Jan 20, 2024
5acff2f
llama config
soldni Jan 20, 2024
4bcaaa8
llama config (fix)
soldni Jan 20, 2024
f3dae82
Merge branch 'main' into soldni/paper
soldni Jan 23, 2024
b938db8
figures
soldni Jan 25, 2024
887a9b0
adding docs dedupe
soldni Jan 28, 2024
b4d70a8
added more dedup configs
soldni Jan 28, 2024
2e27442
style
soldni Jan 28, 2024
aeaf924
added counts
soldni Jan 28, 2024
7f93446
more cli
soldni Jan 28, 2024
a3ab54b
style
soldni Jan 29, 2024
2063ef0
style
soldni Jan 29, 2024
dd1a848
removed autopep8
soldni Jan 29, 2024
4cabad8
resorted
soldni Jan 29, 2024
d4e1b9b
testing change
soldni Jan 29, 2024
80898bb
corner cases
soldni Jan 31, 2024
625fc44
Merge branch 'main' into soldni/paper
soldni Jan 31, 2024
e46a0b6
figures
soldni Jan 31, 2024
d1e0975
added current paper
soldni Feb 1, 2024
93b4651
reverted cli
soldni Feb 1, 2024
fc3754d
documentation
soldni Feb 1, 2024
3 parts
soldni committed Dec 22, 2023
commit 1f5f7d293963084a958ea53619b4d109eac18b9f
17 changes: 17 additions & 0 deletions configs/dolma-v1_5r2/doc_dedupe/cc_en_tail_part1.yaml
@@ -0,0 +1,17 @@
+documents:
+  - s3://ai2-llm/pretraining-data/sources/common-crawl/v1-c4-cleaned/documents/cc_en_tail/cc_en_tail-0*.json.gz
+
+dedupe:
+  name: dedupe_docs_v2
+  documents:
+    attribute_name: bff_duplicate_docs
+    key: $.text
+  skip_empty: true
+
+bloom_filter:
+  file: /tmp/cc_en_tail_dedupe_docs.bloom
+  read_only: false
+  estimated_doc_count: 30000000000
+  desired_false_positive_rate: 1e-06
+
+processes: 188
@@ -1,5 +1,5 @@
 documents:
-  - s3://ai2-llm/pretraining-data/sources/common-crawl/v1-c4-cleaned/documents/cc_en_tail/*.gz
+  - s3://ai2-llm/pretraining-data/sources/common-crawl/v1-c4-cleaned/documents/cc_en_tail/cc_en_tail-1*.json.gz

 dedupe:
   name: dedupe_docs_v2
@@ -9,9 +9,9 @@ dedupe:
   skip_empty: true

 bloom_filter:
-  file: /tmp/cc_en_head_dedupe_docs.bloom
+  file: /tmp/cc_en_tail_dedupe_docs.bloom
   read_only: false
-  estimated_doc_count: 60000000000
+  estimated_doc_count: 30000000000
   desired_false_positive_rate: 1e-06

 processes: 188
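
For a rough sense of what the `estimated_doc_count` change from 60000000000 to 30000000000 means at `desired_false_positive_rate: 1e-06`, here is a minimal sketch using the textbook Bloom filter sizing formulas. This is an assumption for illustration only: the deduper's own sizing logic is not shown in this PR and may differ, and `bloom_filter_size` is a hypothetical helper, not part of the tool.

```python
import math


def bloom_filter_size(n_items: int, fp_rate: float) -> tuple[int, int]:
    """Textbook Bloom filter sizing (hypothetical helper, not the deduper's code).

    Returns (number of bits, number of hash functions) for n_items
    expected insertions at the target false positive rate.
    """
    # m = -n * ln(p) / (ln 2)^2
    bits = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    # k = (m / n) * ln 2, rounded to the nearest whole hash function
    num_hashes = max(1, round(bits / n_items * math.log(2)))
    return bits, num_hashes


# Compare the old and new estimated_doc_count values from the diff above.
for n in (60_000_000_000, 30_000_000_000):
    bits, k = bloom_filter_size(n, 1e-6)
    print(f"{n:,} items -> {bits / 8 / 2**30:.1f} GiB, {k} hash functions")
```

Under these formulas the filter size scales linearly with the estimated count, so halving `estimated_doc_count` halves the memory needed while the number of hash functions stays the same.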
17 changes: 17 additions & 0 deletions configs/dolma-v1_5r2/doc_dedupe/cc_en_tail_part3.yaml
@@ -0,0 +1,17 @@
+documents:
+  - s3://ai2-llm/pretraining-data/sources/common-crawl/v1-c4-cleaned/documents/cc_en_tail/cc_en_tail-2*.json.gz
+
+dedupe:
+  name: dedupe_docs_v2
+  documents:
+    attribute_name: bff_duplicate_docs
+    key: $.text
+  skip_empty: true
+
+bloom_filter:
+  file: /tmp/cc_en_tail_dedupe_docs.bloom
+  read_only: false
+  estimated_doc_count: 30000000000
+  desired_false_positive_rate: 1e-06
+
+processes: 188
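
The three configs in this commit split the `cc_en_tail` documents by the first character after `cc_en_tail-` (`0*`, `1*`, `2*`). The sketch below checks that such patterns partition a set of shard names without overlap. The shard names are hypothetical (the actual bucket listing is not part of this PR), and Python's `fnmatch` is used as a stand-in for whatever glob matching the tool applies to S3 paths.

```python
from fnmatch import fnmatch

# Hypothetical shard names, made up for illustration only.
shards = [f"cc_en_tail-{i:04d}.json.gz" for i in range(0, 3000, 250)]

# File-name portion of the three glob patterns used in the configs.
patterns = [
    "cc_en_tail-0*.json.gz",
    "cc_en_tail-1*.json.gz",
    "cc_en_tail-2*.json.gz",
]


def matching_patterns(name: str) -> list[str]:
    """Return every pattern that matches the given shard name."""
    return [pat for pat in patterns if fnmatch(name, pat)]


# Group shards by pattern; each shard should match exactly one pattern.
groups = {pat: [s for s in shards if fnmatch(s, pat)] for pat in patterns}
for pat, members in groups.items():
    print(pat, len(members))
```

A shard whose name starts with any other digit after the prefix (for example `cc_en_tail-3...`) would match none of the three patterns, so it is worth confirming the real shard numbering before relying on a split like this.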