Add LLaMA support #323
Conversation
  run: |
-   export HF_TAG=${{ hashFiles('setup.cfg', 'src/petals/cli/convert_model.py') }}
-   export MODEL_NAME=bloom-testing/test-bloomd-560m-$HF_TAG
+   export MODEL_NAME=bigscience/bloom-560m
Loading the entire 560m model takes only 5 sec, so I'd stick to using it.
However, it takes a little more RAM, so we run 4 servers instead of 5 below.
- name: Delete any test models older than 1 week
  if: steps.cache-model.outputs.cache-hit != 'true'
  run: |
    python tests/scripts/remove_old_models.py --author bloom-testing --use_auth_token $BLOOM_TESTING_WRITE_TOKEN
let's not forget to manually delete them in a week or so
value_states = value_states.view(batch_size * self.self_attn.num_heads, seq_length, self.self_attn.head_dim)
key_states = key_states.view(*value_states.shape)
key_states = key_states.permute(0, 2, 1)
return (key_states, value_states)
It might be better to allow transformer blocks to define these methods themselves rather than reordering from BLOOM's layout here.
If you agree, it's definitely okay to do it in a separate PR.
If you don't, please explain why.
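A minimal sketch of what that could look like, assuming hypothetical hook names (`cache_to_backend` / `cache_from_backend`) that are not part of this PR; with hooks like these, the backend could treat the KV cache as an opaque pair of tensors:

```python
import torch


class ExampleBlockCacheMixin:
    """Illustrative mixin: the block owns its own cache-layout conversion."""

    def cache_to_backend(self, key_states: torch.Tensor, value_states: torch.Tensor):
        # Native layout: [batch, num_heads, seq, head_dim]
        batch_size, num_heads, seq_length, head_dim = value_states.shape
        value_states = value_states.reshape(batch_size * num_heads, seq_length, head_dim)
        # BLOOM-style keys are stored transposed: [batch * num_heads, head_dim, seq]
        key_states = key_states.reshape(*value_states.shape).permute(0, 2, 1)
        return key_states, value_states

    def cache_from_backend(self, key_states: torch.Tensor, value_states: torch.Tensor, batch_size: int):
        # Invert the conversion above, back to [batch, num_heads, seq, head_dim]
        key_states = key_states.permute(0, 2, 1)
        seq_length, head_dim = value_states.shape[1:]
        num_heads = value_states.shape[0] // batch_size
        key_states = key_states.reshape(batch_size, num_heads, seq_length, head_dim)
        value_states = value_states.reshape(batch_size, num_heads, seq_length, head_dim)
        return key_states, value_states
```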
    initial_peers=initial_peers,
    start=True,
-   num_workers=self.block_config.n_layer,
+   num_workers=self.block_config.num_hidden_layers,
Sanity check: is this field guaranteed to exist for all models, or only for BLOOM and LLaMA? I couldn't find it documented.
I think this field is standard: models like BLOOM remap model-specific config keys such as `n_layer` to `num_hidden_layers` for compatibility (not vice versa): https://github.com/huggingface/transformers/blob/6ab045d6fe7a859ddc219cd144e638bb4d8ab2fe/src/transformers/models/bloom/configuration_bloom.py#L108
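For reference, a small check of how the remapping behaves for both configs (assuming only that `transformers` is installed):

```python
# BloomConfig stores the layer count as `n_layer` but exposes it as `num_hidden_layers`
# through its `attribute_map`; LlamaConfig defines `num_hidden_layers` directly.
from transformers import BloomConfig, LlamaConfig

bloom_config = BloomConfig(n_layer=24)
llama_config = LlamaConfig(num_hidden_layers=32)

assert bloom_config.num_hidden_layers == 24  # resolved via attribute_map
assert llama_config.num_hidden_layers == 32
```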
Wow, that's probably the largest Petals PR of all time :)
The new structure appears sound. The only real concern is that we hard-code BLOOM's caching order on the backend side, instead of allowing each block to define its own cache order as that block's methods (or some other cunning plan). If you plan to do something similar, it's perfectly okay to do that later. If not, let's quickly discuss it; I might be missing something.
Another non-urgent point is covering LLaMA with tests. We can randomly initialize a LLaMA-pattern ~0.2B model and use it for testing (see the sketch below).
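A rough sketch of how such a test model could be built; the config values are illustrative guesses and the repo name is hypothetical, not from this PR:

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Tiny LLaMA-shaped config: roughly 0.2B parameters, small enough for CI.
config = LlamaConfig(
    hidden_size=1024,
    intermediate_size=2816,
    num_hidden_layers=8,
    num_attention_heads=16,
    vocab_size=32000,
)
model = LlamaForCausalLM(config)  # randomly initialized, no pretrained download
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")

# model.push_to_hub("bloom-testing/test-llama-tiny")  # hypothetical repo name
```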
…339) Before this PR, `free_disk_space_for()` was able to remove **(a)** only entire cached revisions (= git commits/branches) and **(b)** only from the repository we're loading right now. This PR allows this function to remove arbitrary files from any repository. This is useful for the transition to Petals 1.2.0+, since it now uses original repos instead of the ones with converted models (see #323). In particular, the cache for `bigscience/bloom-petals` is now deprecated and should be removed in favor of `bigscience/bloom`. This is also useful as a way to free space before loading LoRA adapters (#335).
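A minimal sketch of the general idea, not the PR's actual implementation; it assumes only `huggingface_hub`'s public `scan_cache_dir()` API and deletes the least recently used cached blobs, regardless of which repo they belong to, until enough space is free:

```python
import os
import shutil

from huggingface_hub import scan_cache_dir


def free_disk_space(cache_dir: str, extra_space: int) -> None:
    """Delete least recently used cached blobs until `extra_space` bytes are free."""
    available = shutil.disk_usage(cache_dir).free
    if available >= extra_space:
        return

    # Gather every cached file across all repos and revisions, oldest access first
    files = [
        file
        for repo in scan_cache_dir(cache_dir).repos
        for revision in repo.revisions
        for file in revision.files
    ]
    files.sort(key=lambda file: file.blob_last_accessed)

    deleted_blobs = set()
    for file in files:
        if available >= extra_space:
            break
        if file.blob_path in deleted_blobs:
            continue  # the same blob may back several revisions
        os.remove(file.blob_path)  # note: symlinks pointing to this blob become dangling
        deleted_blobs.add(file.blob_path)
        available += file.size_on_disk
```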
This PR:

- Abolishes the model conversion procedure. Now, models are downloaded directly from original repositories like https://huggingface.co/bigscience/bloom. Servers download only shards with blocks to be hosted, and clients download only shards with input/output embeddings and layernorms.
  - BLOOM is loaded from the original repo `bigscience/bloom`, but we use the DHT prefix `bigscience/bloom-petals` for backward compatibility. Same with smaller BLOOMs and BLOOMZ.
  - LLaMA can be loaded from a repo like `username/llama-65b-hf`, but we use the DHT prefix `llama-65b-hf` (without the username) to accommodate blocks from different repos (there are a few of them with minor differences, such as `Llama` vs. `LLaMA` in the class name).
- Refactors the client to generalize it for multiple models. Now, we have `petals.models` packages that contain model-specific code (e.g. `petals.models.bloom`, `petals.models.llama`). General code (e.g. CPU-efficient LM head, p-tuning) is kept in `petals.client`.
- Introduces `WrappedLlamaBlock`, `DistributedLlamaConfig`, `DistributedLlamaForCausalLM`, `DistributedLlamaForSequenceClassification`, and `DistributedLlamaModel` compatible with Petals functionality (p-tuning, adapters, etc.).
- Introduces `AutoDistributedConfig` that automatically chooses the correct config class (`DistributedLlamaConfig` or `DistributedBloomConfig`). The refactored configs contain all model-specific info for both clients and servers. (A usage sketch follows right after this list.)
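A hedged usage sketch of the new auto class; the import path and the LLaMA repo name are assumptions:

```python
from petals import AutoDistributedConfig

config = AutoDistributedConfig.from_pretrained("bigscience/bloom")
print(type(config).__name__)  # expected: DistributedBloomConfig

config = AutoDistributedConfig.from_pretrained("username/llama-65b-hf")  # placeholder repo name
print(type(config).__name__)  # expected: DistributedLlamaConfig
```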
Upgrade instructions:

- Remove the `~/.cache/petals/model--bigscience--bloom-petals` and `~/.cache/petals/model--bigscience--bloomz-petals` directories (if present).

Tested:
Expected additions to this PR:
- `AutoDistributedModel`, `AutoDistributedModelForCausalLM`, `AutoDistributedModelForSequenceClassification` (so that we have Colab notebooks where it's enough to replace only the model name)

Future work for other PRs: