Conversation

borzunov commented on Jun 8, 2023

This PR:

  1. Abolishes the model conversion procedure. Now, models are downloaded directly from original repositories like https://huggingface.co/bigscience/bloom. Servers download only shards with blocks to be hosted, and clients download only shards with input/output embeddings and layernorms.

    • BLOOM is loaded from bigscience/bloom, but we use the DHT prefix bigscience/bloom-petals for backward compatibility. Same with smaller BLOOMs and BLOOMZ.
    • LLaMA can be loaded from any repo like username/llama-65b-hf, but we use the DHT prefix llama-65b-hf (without the username) to accommodate blocks from different repos (there are a few of them with minor differences, such as Llama vs. LLaMA in the class name). A hypothetical shard-download sketch is shown after this list.
  2. Refactors the client to generalize it for multiple models. Now, we have a petals.models package whose subpackages contain model-specific code (e.g. petals.models.bloom, petals.models.llama). General code (e.g. CPU-efficient LM head, p-tuning) is kept in petals.client.

  3. Introduces WrappedLlamaBlock, DistributedLlamaConfig, DistributedLlamaForCausalLM, DistributedLlamaForSequenceClassification, and DistributedLlamaModel compatible with Petals functionality (p-tuning, adapters, etc.).

  4. Introduces AutoDistributedConfig that automatically chooses the correct config class (DistributedLlamaConfig or DistributedBloomConfig). The refactored configs contain all model-specific info for both clients and servers.
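
To make item 4 concrete, here is a simplified sketch of how such a dispatcher can work: read `model_type` from the repo's config and pick the matching Distributed*Config. The registry, method body, and exact import paths below are illustrative assumptions, not the PR's actual code.

```python
from transformers import AutoConfig

# Import paths follow the petals.models layout from item 2; exact re-exports may differ.
from petals.models.bloom import DistributedBloomConfig
from petals.models.llama import DistributedLlamaConfig

_CONFIG_REGISTRY = {"bloom": DistributedBloomConfig, "llama": DistributedLlamaConfig}


class AutoDistributedConfig:
    @classmethod
    def from_pretrained(cls, model_name_or_path, **kwargs):
        # Peek at the original config to learn the model family...
        model_type = AutoConfig.from_pretrained(model_name_or_path, **kwargs).model_type
        # ...then delegate to the matching distributed config class.
        return _CONFIG_REGISTRY[model_type].from_pretrained(model_name_or_path, **kwargs)
```

Usage would then be uniform across models, e.g. `AutoDistributedConfig.from_pretrained("bigscience/bloom")` or `AutoDistributedConfig.from_pretrained("username/llama-65b-hf")`.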
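
For item 1, a hypothetical sketch of selective shard downloading: the checkpoint's weight index maps each parameter to its shard file, so a server can fetch only the shards containing the blocks it hosts. The function name and the index filename below are my assumptions for illustration, not Petals' actual loader.

```python
import json

from huggingface_hub import hf_hub_download


def download_shards_for_block(repo_id: str, block_idx: int) -> list[str]:
    # The index file maps parameter names to the shard files that store them.
    index_path = hf_hub_download(repo_id, "pytorch_model.bin.index.json")
    with open(index_path) as f:
        weight_map = json.load(f)["weight_map"]

    # LLaMA-style parameter naming; other architectures (e.g. BLOOM) use different prefixes.
    prefix = f"model.layers.{block_idx}."
    shard_names = {shard for name, shard in weight_map.items() if name.startswith(prefix)}

    # Download only those shards, skipping the rest of the (possibly huge) checkpoint.
    return [hf_hub_download(repo_id, shard) for shard in sorted(shard_names)]
```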

Upgrade instructions:

  • Remove disk caches for blocks in old (converted) format to save disk space. That is, remove ~/.cache/petals/model--bigscience--bloom-petals and ~/.cache/petals/model--bigscience--bloomz-petals directories (if present).
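
The upgrade step above amounts to deleting two directories; a minimal cleanup sketch (paths taken from the instruction, everything else assumed):

```python
import shutil
from pathlib import Path

cache_dir = Path.home() / ".cache" / "petals"
for name in ("model--bigscience--bloom-petals", "model--bigscience--bloomz-petals"):
    path = cache_dir / name
    if path.exists():
        shutil.rmtree(path)  # drop the old converted-format block cache
        print(f"Removed {path}")
```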

Tested:

  • Servers hosting BLOOM and LLaMA
  • Clients running inference, p-tuning and adapter tuning for BLOOM and LLaMA

Expected additions to this PR:

  • Fix NaNs in prompt embeddings during LLaMA p-tuning
  • Add AutoDistributedModel, AutoDistributedModelForCausalLM, AutoDistributedModelForSequenceClassification (so that we have Colab notebooks where it's enough to replace only the model name)

Future work for other PRs:

  • Add log messages regarding model terms of use
  • Cover llama with tests
  • Decide on cache reordering code
  • Add guanaco
  • Add falcon-40b and falcon-40b-instruct
  • Update the "Host your own model" guide
  • Update http://health.petals.ml and http://chat.petals.ml
  • Upgrade example notebooks
  • Add speed measurements vs. llama.cpp

borzunov changed the title from "Support loading LLaMA and BLOOM blocks from existing repos" to "Add LLaMA support" on Jun 8, 2023
borzunov force-pushed the llama branch 5 times, most recently from b094f17 to 89aba9d on June 8, 2023 16:38
borzunov force-pushed the llama branch 9 times, most recently from 7a4e801 to 6367fb8 on June 10, 2023 02:47
  run: |
-   export HF_TAG=${{ hashFiles('setup.cfg', 'src/petals/cli/convert_model.py') }}
-   export MODEL_NAME=bloom-testing/test-bloomd-560m-$HF_TAG
+   export MODEL_NAME=bigscience/bloom-560m
borzunov (Collaborator Author) commented:

Loading the entire 560m model takes only 5 sec, so I'd stick to using it.

However, it takes a little more RAM, so we run 4 servers instead of 5 below.

borzunov requested a review from justheuristic on June 23, 2023 00:48
borzunov marked this pull request as ready for review on June 23, 2023 00:48
  - name: Delete any test models older than 1 week
    if: steps.cache-model.outputs.cache-hit != 'true'
    run: |
      python tests/scripts/remove_old_models.py --author bloom-testing --use_auth_token $BLOOM_TESTING_WRITE_TOKEN
A collaborator commented:

let's not forget to manually delete them in a week or so

value_states = value_states.view(batch_size * self.self_attn.num_heads, seq_length, self.self_attn.head_dim)
key_states = key_states.view(*value_states.shape)
key_states = key_states.permute(0, 2, 1)
return (key_states, value_states)
A collaborator commented:

Might be better to allow transformer blocks to define these methods rather than hard-coding the reordering from bloom.
If you agree, it's definitely okay to do it in a separate PR.
If you don't, please explain why.
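
A hypothetical sketch of what this suggestion could look like, with each block owning its cache-layout conversions instead of the backend assuming BLOOM's order. The method names and signatures are illustrative, not an existing Petals API; the forward conversion just restates the hunk above.

```python
import torch


class WrappedLlamaBlock:  # illustrative excerpt only, not the actual class body
    num_heads: int
    head_dim: int

    def cache_to_backend_layout(self, key_states, value_states, batch_size, seq_length):
        # Same reordering as the hunk above: values as [batch*heads, seq_len, head_dim],
        # keys as [batch*heads, head_dim, seq_len] (the backend's BLOOM-style layout).
        value_states = value_states.reshape(batch_size * self.num_heads, seq_length, self.head_dim)
        key_states = key_states.reshape(*value_states.shape).permute(0, 2, 1)
        return key_states, value_states

    def cache_from_backend_layout(self, key_states, value_states, batch_size, seq_length):
        # Inverse conversion; the backend would call these hooks instead of
        # hard-coding one model's layout.
        key_states = key_states.permute(0, 2, 1)
        key_states = key_states.reshape(batch_size, self.num_heads, seq_length, self.head_dim)
        value_states = value_states.reshape(batch_size, self.num_heads, seq_length, self.head_dim)
        return key_states, value_states
```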

    initial_peers=initial_peers,
    start=True,
-   num_workers=self.block_config.n_layer,
+   num_workers=self.block_config.num_hidden_layers,
A collaborator commented:

sanity check: is this field guaranteed for all models or only for bloom and llama? couldn't find it

borzunov (Collaborator Author) replied:

I think this field is standard: models like BLOOM remap their model-specific config keys (e.g. n_layer) to num_hidden_layers for compatibility, not vice versa: https://github.com/huggingface/transformers/blob/6ab045d6fe7a859ddc219cd144e638bb4d8ab2fe/src/transformers/models/bloom/configuration_bloom.py#L108
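
For reference, the remapping in the linked configuration_bloom.py is done via `attribute_map`, roughly like this (paraphrased excerpt from transformers, not part of this PR):

```python
from transformers import PretrainedConfig


class BloomConfig(PretrainedConfig):
    model_type = "bloom"
    # Generic names resolve to BLOOM's model-specific keys, so both
    # config.n_layer and config.num_hidden_layers work.
    attribute_map = {
        "num_hidden_layers": "n_layer",
        "num_attention_heads": "n_head",
    }
```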

justheuristic left a comment:

Wow, that's probably the largest petals PR of all time)

The new structure appears sound. The only real concern is that we hard-code bloom caching order on the backend side -- instead of allowing each block to define its own cache order as that block's methods or some other cunning plan. If you plan to do something similar, it's perfectly okay to do that later. If not, let's quickly discuss it, I might be missing something.

Another non-urgent point is covering LLaMA with tests. We can randomly initialize a llama-pattern 0.2B model and use it for testing.
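
A possible sketch of that test fixture, assuming a randomly initialized llama-pattern model built with transformers (all sizes below are made-up values that come out to roughly 0.2B parameters):

```python
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=32000,
    hidden_size=1024,
    intermediate_size=2816,
    num_hidden_layers=12,
    num_attention_heads=16,
)
model = LlamaForCausalLM(config)            # random weights, small enough for CI
model.save_pretrained("tiny-random-llama")  # hypothetical local/test-repo name
```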

borzunov merged commit cb3f018 into main on Jun 23, 2023
borzunov deleted the llama branch on June 23, 2023 11:46
borzunov added a commit that referenced this pull request on Jul 5, 2023
…339)

Before this PR, `free_disk_space_for()` was able to remove **(a)** only entire cached revisions (= git commits/branches) and **(b)** only from the repository we're loading right now.

This PR allows this function to remove arbitrary files individually, from any repository.

This is useful for the transition to Petals 1.2.0+, since it now uses the original repos instead of the ones with converted models (see #323). In particular, the cache for `bigscience/bloom-petals` is now deprecated and should be removed in favor of `bigscience/bloom`. It is also useful as a way to free space before loading LoRA adapters (#335).
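
A rough sketch of the behavior described above, assuming the huggingface_hub cache-scanning API (this is not the actual Petals implementation, and the function body is mine):

```python
from huggingface_hub import scan_cache_dir


def free_disk_space_for(bytes_needed: int) -> None:
    # Collect every cached file across all repos and revisions, oldest-accessed first.
    cache = scan_cache_dir()
    files = [f for repo in cache.repos for rev in repo.revisions for f in rev.files]
    files.sort(key=lambda f: f.blob_last_accessed)

    freed = 0
    for f in files:
        if freed >= bytes_needed:
            break
        f.blob_path.unlink(missing_ok=True)  # delete the underlying blob file
        freed += f.size_on_disk
```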