Showing 69 changed files with 9,950 additions and 0 deletions.
@@ -165,3 +165,4 @@ cython_debug/
figures/
.vscode/
data/
.DS_Store
@@ -0,0 +1,80 @@
# Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to make participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or
  advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
  address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
  professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

## Scope

This Code of Conduct applies within all project spaces, and it also applies when
an individual is representing the project or its community in public spaces.
Examples of representing a project or community include using an official
project e-mail address, posting via an official social media account, or acting
as an appointed representative at an online or offline event. Representation of
a project may be further defined and clarified by project maintainers.

This Code of Conduct also applies outside the project spaces when there is a
reasonable belief that an individual's behavior may have a negative impact on
the project or its community.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at <opensource-conduct@meta.com>. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq
@@ -0,0 +1,36 @@
# Contributing to BLT

We want to make contributing to this project as easy and transparent as
possible.

## Pull Requests

We actively welcome your pull requests.

1. Fork the repo and create your branch from `main`.
2. If you've added code that should be tested, add tests.
3. If you've changed APIs, update the documentation.
4. Ensure the test suite passes.
5. Make sure your code lints.
6. If you haven't already, complete the Contributor License Agreement ("CLA").

## Contributor License Agreement ("CLA")

In order to accept your pull request, we need you to submit a CLA. You only need
to do this once to work on any of Meta's open source projects.

Complete your CLA here: <https://code.facebook.com/cla>

## Issues

We use GitHub issues to track public bugs. Please ensure your description is
clear and has sufficient instructions to be able to reproduce the issue.

Meta has a [bounty program](https://www.facebook.com/whitehat/) for the safe
disclosure of security bugs. In those cases, please go through the process
outlined on that page and do not file a public issue.

## License

By contributing to BLT, you agree that your contributions will be licensed
under the LICENSE file in the root directory of this source tree.
@@ -0,0 +1,28 @@
BSD 3-Clause License

Copyright 2024 Meta

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list
of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this
list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors may
be used to endorse or promote products derived from this software without specific
prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT
SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
DAMAGE.
@@ -0,0 +1,117 @@
# Byte Latent Transformer

This repository contains code for our paper: "Byte Latent Transformer: Patches Scale Better Than Tokens"

- [Paper Link](https://dl.fbaipublicfiles.com/blt/BLT__Patches_Scale_Better_Than_Tokens.pdf)

## Abstract

We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that,
for the first time, matches tokenization-based LLM performance at scale, with significant improvements
in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve
as the primary units of computation. Patches are segmented dynamically based on the entropy of the
next byte, allocating more compute and model capacity where there is more data complexity. The BLT
architecture includes new attention mechanisms to maximize the information flow between byte and
patch hidden representations and a new type of byte-sequence memory. We present the first scaling
study of byte-level models up to 8B parameters and 8T training bytes, showing for the first time
that we can train a model end-to-end at scale from bytes with no tokenization or other preprocessing.
Scaling trends reveal training and inference efficiency benefits from dynamically selecting very long
patches on average, along with qualitative improvements in reasoning and long-tail generalization
from modeling byte sequences.

![BLT Architecture Diagram](blt-figure.jpg)

## Development Status

We are actively updating the BLT code to make it easier to reproduce our results.
Please file an issue and/or be patient while we make more of our code public!

## Quick start

The following commands create the environment, either locally or as a SLURM job.
Environment creation should take around 5 minutes, not counting downloads.

```bash
git clone https://github.com/facebookresearch/blt
cd blt

bash setup/create_env.sh
# or if you have access to a SLURM cluster
sbatch setup/create_env.sh
```

Once that is done, you can activate the environment:

```bash
conda activate blt_<date>
```

Use the provided script to download and prepare data from Hugging Face (choosing among `fineweb_edu`, `fineweb_edu_10bt`, or `dclm_baseline_1.0`).
This command will download the `fineweb_edu` dataset and prepare it for training in the `./data` directory, specifying the amount of memory `terashuf` (the tool used to shuffle samples) will be allocated. By default, the number of chunks (`nchunks`) is 32. If you are running on fewer than 32 GPUs, it is recommended to set `nchunks` to 1 or to match `nchunks` with the number of GPUs (`nchunks` = NGPUs). See [here](https://github.com/facebookresearch/lingua/issues/55#issuecomment-2483643076) for more details.

```bash
python setup/download_prepare_hf_data.py fineweb_edu <MEMORY> --data_dir ./data --seed 42 --nchunks <NCHUNKS>
```

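As a concrete illustration of the guidance above, on a single 8-GPU node one might match `nchunks` to the GPU count. The values below are illustrative only, not recommendations; the memory budget is assumed to be in GB.

```bash
# Illustrative values only: 64 is an assumed terashuf memory budget (GB), nchunks matches 8 GPUs.
python setup/download_prepare_hf_data.py fineweb_edu 64 --data_dir ./data --seed 42 --nchunks 8
```
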
To download the tokenizer (here, llama3), use the following script:

```bash
python setup/download_tokenizer.py llama3 <SAVE_PATH> --api_key <HUGGINGFACE_TOKEN>
```

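For example, a hypothetical invocation that saves the tokenizer under a local `tokenizers/` directory and reads the token from an environment variable (both the save path and the variable name are placeholders, not values required by the script):

```bash
# Hypothetical example: adjust the save path and token source to your setup.
python setup/download_tokenizer.py llama3 tokenizers/ --api_key $HUGGINGFACE_TOKEN
```
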
Now launch a debug job to check that everything works. **The provided configurations are templates; you need to adapt them before they will work (change `dump_dir`, `data.root_dir`, `data.tokenizer.path`, etc.).**

```bash
# stool stands for SLURM tool!
python -m bytelatent.stool script=bytelatent.train config=apps/bytelatent/configs/debug.yaml nodes=1 partition=<partition>
# if you want to launch locally you can use torchrun
torchrun --nproc-per-node 8 -m bytelatent.train config=apps/bytelatent/configs/debug.yaml
# or you can also launch on 1 GPU
python -m bytelatent.train config=apps/bytelatent/configs/debug.yaml
```

When using `stool`, if a job crashes, it can be relaunched using sbatch:

```bash
sbatch path/to/dump_dir/submit.slurm
```

## Linting

To lint, run the following command:

```
bash dev/lint.sh
```

## Citation

BLT is partially based on Meta Lingua, so consider citing it in addition to our BLT paper if you reuse our work.

BLT Paper Citation (will be updated to arXiv soon):

```
@article{meta_blt,
  author = {Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman†, Srinivasan Iyer},
  title = {Byte Latent Transformer: Patches Scale Better Than Tokens},
  url = {https://github.com/facebookresearch/blt},
  year = {2024}
}
```

Lingua Code:

```
@misc{meta_lingua,
  author = {Mathurin Videau, Badr Youbi Idrissi, Daniel Haziza, Luca Wehrstedt, Jade Copet, Olivier Teytaud, David Lopez-Paz},
  title = {{Meta Lingua}: A minimal {PyTorch LLM} training library},
  url = {https://github.com/facebookresearch/lingua},
  year = {2024}
}
```

## License

The BLT code is partially based on Meta Lingua.

Meta Lingua is licensed under the BSD-3-Clause license. Refer to the LICENSE file in the top-level directory.
@@ -0,0 +1,35 @@
name: "debug_evals"
# ckpt_dir: !!CHANGETHIS!!
# dump_dir: !!CHANGETHIS!!
generator:
  max_tokens: 8192
  dtype: bf16
  temperature: 1.0
  top_p: 0.95
harness:
  tasks:
    - hellaswag
    - task: boolq
      dataset_kwargs:
        trust_remote_code: true
    - task: nq_open
      num_fewshot: 5
    - piqa
    - task: social_iqa
      dataset_kwargs:
        trust_remote_code: true
    - triviaqa
    - winogrande
    - openbookqa
    - arc_easy
    - arc_challenge
    - race
    - commonsense_qa
    # - coqa
    - copa
    - gsm8k
    - bbh
    - mmlu
    - mmlu_pro
validation:
  max_steps: 1000
@@ -0,0 +1,87 @@
# dump_dir: !!!CHANGE_THIS!!!
name: large_lm
steps: 60_000
probe_freq: null
seed: 777

optim:
  lr: 3e-3
  weight_decay: 0.033
  warmup: 5000
  lr_min_ratio: 0.000001
  clip: 1.0

distributed:
  fsdp_type: full_shard
  compile: true
  model_dtype: bf16
  matmul_allow_tf32: false
  selective_activation_checkpointing: false
  tp_size: 1

model:
  dim: 2048
  n_layers: 25
  n_heads: 16

data:
  root_dir: data/shuffled
  sources:
    dclm_baseline_1.0: 100.0
  batch_size: 4
  prefetch_size: 1024
  seq_len: 4096
  n_views: 2
  load_async: true
  add_bos: true
  add_eos: true
  tokenizer:
    name: tiktoken
    path: tokenizers/cl_toplang_128k.tiktoken

profiling:
  run: true
  mem_warmup: 0
  mem_steps: 4
  profile_warmup: 100
  profile_steps: 4

checkpoint:
  dump:
    every: 2500
    keep: 3
  eval:
    every: 5000
    keep: -1

logging:
  freq: 1

async_eval_gpus: 8
eval:
  harness:
    tasks:
      - hellaswag
      - task: boolq
        dataset_kwargs:
          trust_remote_code: true
      - piqa
      - task: social_iqa
        dataset_kwargs:
          trust_remote_code: true
      - winogrande
      - openbookqa
      - arc_easy
      - arc_challenge
      - race
      - commonsense_qa
      - copa
      # - coqa
      # - task: nq_open
      #   num_fewshot: 5
      # - triviaqa
  validation:
    max_steps: 1000
  generator:
    max_tokens: 16384
    dtype: bf16
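
A training config like the one above would presumably be passed to the `bytelatent.train` entry point shown in the Quick start; the config path below is a placeholder for wherever this file is saved, not a path confirmed by this commit.

```bash
# Placeholder path: point this at the saved config file.
python -m bytelatent.train config=path/to/your_config.yaml
```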