
Fix memory URL parsing #81

Merged (1 commit merged into main on Sep 19, 2024)

Conversation

@yanxi0830 (Contributor) commented Sep 19, 2024

Why

Fix

  • Check the content against a URL pattern match to determine whether it should be treated as a URL to fetch (a minimal sketch of this check follows below).
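
A minimal sketch of the kind of prefix check described above; the helper name and exact regex are illustrative and not necessarily the code in this PR:

import re

# Illustrative URI prefix pattern; the real check in this PR may use a
# different pattern or live elsewhere in the memory provider.
URL_PATTERN = re.compile(r"^(https?://|file://|data:)")

def looks_like_url(content: str) -> bool:
    """Return True when the content string should be treated as a URL to fetch."""
    return URL_PATTERN.match(content) is not None

# Example: only fetch and parse when the content is actually a URL;
# otherwise embed the raw text as-is.
for content in [
    "https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/llama3.rst",
    ".. _lora_finetune_label: plain document text ...",
]:
    print("fetch" if looks_like_url(content) else "embed raw", "->", content[:60])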

Test

  • Test with the memory SDK client:
python sdk_examples/memory/client.py
  • Before Fix:
Score: 0.8463434066929127
Chunk:
========
Chunk(content='https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/llama3.rst', document_id='num-2', token_count=22)
========

Score: 0.6421850715500633
Chunk:
========
Chunk(content='https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/lora_finetune.rst', document_id='num-5', token_count=23)
========

Score: 0.5757617875215492
Chunk:
========
Chunk(content='XRvciBMaWNlbnNlIEFncmVlbWVudCAoIkNMQSIpCkluIG9yZGVyIHRvIGFjY2VwdCB5b3VyIHB1bGwgcmVxdWVzdCwgd2UgbmVlZCB5b3UgdG8gc3VibWl0IGEgQ0xBLiBZb3Ugb25seSBuZWVkCnRvIGRvIHRoaXMgb25jZSB0byB3b3JrIG9uIGFueSBvZiBNZXRhJ3Mgb3BlbiBzb3VyY2UgcHJvamVjdHMuCgpDb21wbGV0ZSB5b3VyIENMQSBoZXJlOiA8aHR0cHM6Ly9jb2RlLmZhY2Vib29rLmNvbS9jbGE+CgojIyBJc3N1ZXMKV2UgdXNlIEdpdEh1YiBpc3N1ZXMgdG8gdHJhY2sgcHVibGljIGJ1Z3MuIFBsZWFzZSBlbnN1cmUgeW91ciBkZXNjcmlwdGlvbiBpcwpjbGVhciBhbmQgaGFzIHN1ZmZpY2llbnQgaW5zdHJ1Y3Rpb25zIHRvIGJlIGFibGUgdG8gcmVwcm9kdWNlIHRoZSBpc3N1ZS4KCk1ldGEgaGFzIGEgW2JvdW50eSBwcm9ncmFtXShodHRwOi8vZmFjZWJvb2suY29tL3doaXRlaGF0L2luZm8pIGZvciB0aGUgc2FmZQpkaXNjbG9zdXJlIG9mIHNlY3VyaXR5IGJ1Z3MuIEluIHRob3NlIGNhc2VzLCBwbGVhc2UgZ28gdGhyb3VnaCB0aGU', document_id='num-0', token_count=512)
  • After Fix:
Score: 1.085271610677067
Chunk:
========
Chunk(content=' of :code:`lora_model` and\n:code:`base_model`, would show that they are both instances of the same :class:`~torchtune.modules.TransformerDecoder`.\n(Feel free to verify this for yourself.)\n\nWhy does this matter? torchtune makes it easy to load checkpoints for LoRA directly from our Llama2\nmodel without any wrappers or custom checkpoint conversion logic.\n\n.. code-block:: python\n\n  # Assuming that base_model already has the pretrained Llama2 weights,\n  # this will directly load them into your LoRA model without any conversion necessary.\n  lora_model.load_state_dict(base_model.state_dict(), strict=False)\n\n.. note::\n    Whenever loading weights with :code:`strict=False`, you should verify that any missing or extra keys in\n    the loaded :code:`state_dict` are as expected. torchtune\'s LoRA recipes do this by default via e.g.\n    :func:`validate_state_dict_for_lora() <torchtune.modules.peft.validate_state_dict_for_lora>` or\n    :func:`validate_missing_and_unexpected_for_lora() <torchtune.modules.peft.validate_missing_and_unexpected_for_lora>`.\n\nOnce we\'ve loaded the base model weights, we also want to set only LoRA parameters to trainable.\n\n.. _setting_trainable_params:\n\n.. code-block:: python\n\n  from torchtune.modules.peft.peft_utils import get_adapter_params, set_trainable_params\n\n  # Fetch all params from the model that are associated with LoRA.\n  lora_params = get_adapter_params(lora_model)\n\n  # Set requires_grad=True on lora_params, and requires_grad=False on all others.\n  set_trainable_params(lora_model, lora_params)\n\n  # Print the total number of parameters\n  total_params = sum([p.numel() for p in lora_model.parameters()])\n  trainable_params = sum([p.numel() for p in lora_model.parameters() if p.requires_grad])\n  print(\n    f"""\n    {total_params} total params,\n    {trainable_params}" trainable params,\n    {(100.0 * trainable_params / total_params):.2f}% of all params are trainable.\n    """\n  )\n\n  6742609920 total params,\n  4194304 trainable params,\n  0.06% of all params are trainable.\n\n.. note::\n    If you are directly using the LoRA recipe (as detailed :ref:`here<lora_recipe_label>`), you', document_id='num-5', token_count=512)
========

Score: 1.1161195066066212
Chunk:
========
Chunk(content=".. _lora_finetune_label:\n\n============================\nFine-Tuning Llama2 with LoRA\n============================\n\nThis guide will teach you about `LoRA <https://arxiv.org/abs/2106.09685>`_, a parameter-efficient finetuning technique,\nand show you how you can use torchtune to finetune a Llama2 model with LoRA.\nIf you already know what LoRA is and want to get straight to running\nyour own LoRA finetune in torchtune, you can jump to :ref:`LoRA finetuning recipe in torchtune<lora_recipe_label>`.\n\n.. grid:: 2\n\n    .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn\n\n      * What LoRA is and how it saves memory during finetuning\n      * An overview of LoRA components in torchtune\n      * How to run a LoRA finetune using torchtune\n      * How to experiment with different LoRA configurations\n\n    .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites\n\n      * Be familiar with :ref:`torchtune<overview_label>`\n      * Make sure to :ref:`install torchtune<install_label>`\n      * Make sure you have downloaded the :ref:`Llama2-7B model weights<download_llama_label>`\n\nWhat is LoRA?\n-------------\n\n`LoRA <https://arxiv.org/abs/2106.09685>`_ is an adapter-based method for\nparameter-efficient finetuning that adds trainable low-rank decomposition matrices to different layers of a neural network,\nthen freezes the network's remaining parameters. LoRA is most commonly applied to\ntransformer models, in which case it is common to add the low-rank matrices\nto some of the linear projections in each transformer layer's self-attention.\n\n.. note::\n\n    If you're unfamiliar, check out these references for the `definition of rank <https://en.wikipedia.org/wiki/Rank_(linear_algebra)>`_\n    and discussion of `low-rank approximations <https://en.wikipedia.org/wiki/Low-rank_approximation>`_.\n\nBy finetuning with LoRA (as opposed to finetuning all model parameters),\nyou can expect to see memory savings due to a substantial reduction in the\nnumber of parameters with gradients. When using an optimizer with momentum,\nlike `AdamW <https://py", document_id='num-5', token_count=512)
========

Score: 1.0961540886504595
Chunk:
========
Chunk(content='.. _llama3_label:\n\n========================\nMeta Llama3 in torchtune\n========================\n\n.. grid:: 2\n\n    .. grid-item-card:: :octicon:`mortar-board;1em;` You will learn how to:\n\n      * Download the Llama3-8B-Instruct weights and tokenizer\n      * Fine-tune Llama3-8B-Instruct with LoRA and QLoRA\n      * Evaluate your fine-tuned Llama3-8B-Instruct model\n      * Generate text with your fine-tuned model\n      * Quantize your model to speed up generation\n\n    .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites\n\n      * Be familiar with :ref:`torchtune<overview_label>`\n      * Make sure to :ref:`install torchtune<install_label>`\n\n\nLlama3-8B\n---------\n\n`Meta Llama 3 <https://llama.meta.com/llama3>`_ is a new family of models released by Meta AI that improves upon the performance of the Llama2 family\nof models across a `range of different benchmarks <https://huggingface.co/meta-llama/Meta-Llama-3-8B#base-pretrained-models>`_.\nCurrently there are two different sizes of Meta Llama 3: 8B and 70B. In this tutorial we will focus on the 8B size model.\nThere are a few main changes between Llama2-7B and Llama3-8B models:\n\n- Llama3-8B uses `grouped-query attention <https://arxiv.org/abs/2305.13245>`_ instead of the standard multi-head attention from Llama2-7B\n- Llama3-8B has a larger vocab size (128,256 instead of 32,000 from Llama2 models)\n- Llama3-8B uses a different tokenizer than Llama2 models (`tiktoken <https://github.com/openai/tiktoken>`_ instead of `sentencepiece <https://github.com/google/sentencepiece>`_)\n- Llama3-8B uses a larger intermediate dimension in its MLP layers than Llama2-7B\n- Llama3-8B uses a higher base value to calculate theta in its `rotary positional embeddings <https://arxiv.org/abs/2104.09864>`_\n\n|\n\nGetting access to Llama3', document_id='num-2', token_count=512)
========

Score: 0.9733073447702968
Chunk:
========
Chunk(content="_7b <torchtune.models.llama2.lora_llama2_7b>` alone will not handle the definition of which parameters are trainable.\n    See :ref:`below<setting_trainable_params>` for how to do this.\n\nLet's inspect each of these models a bit more closely.\n\n.. code-block:: bash\n\n  # Print the first layer's self-attention in the usual Llama2 model\n  >>> print(base_model.layers[0].attn)\n  MultiHeadAttention(\n    (q_proj): Linear(in_features=4096, out_features=4096, bias=False)\n    (k_proj): Linear(in_features=4096, out_features=4096, bias=False)\n    (v_proj): Linear(in_features=4096, out_features=4096, bias=False)\n    (output_proj): Linear(in_features=4096, out_features=4096, bias=False)\n    (pos_embeddings): RotaryPositionalEmbeddings()\n  )\n\n  # Print the same for Llama2 with LoRA weights\n  >>> print(lora_model.layers[0].attn)\n  MultiHeadAttention(\n    (q_proj): LoRALinear(\n      (dropout): Dropout(p=0.0, inplace=False)\n      (lora_a): Linear(in_features=4096, out_features=8, bias=False)\n      (lora_b): Linear(in_features=8, out_features=4096, bias=False)\n    )\n    (k_proj): Linear(in_features=4096, out_features=4096, bias=False)\n    (v_proj): LoRALinear(\n      (dropout): Dropout(p=0.0, inplace=False)\n      (lora_a): Linear(in_features=4096, out_features=8, bias=False)\n      (lora_b): Linear(in_features=8, out_features=4096, bias=False)\n    )\n    (output_proj): Linear(in_features=4096, out_features=4096, bias=False)\n    (pos_embeddings): RotaryPositionalEmbeddings()\n  )\n\n\nNotice that our LoRA model's layer contains additional weights in the Q and V projections,\nas expected. Additionally, inspecting the type of :code:`lora_model` and\n:code:`base_model`, would show that they are both instances of the same :class:`~torchtune.modules.TransformerDecoder`.\n(Feel free to verify this for yourself.)\n\nWhy does this matter? torchtune makes it easy to load checkpoints for LoRA directly", document_id='num-5', token_count=512)
========

Score: 1.1386078582408372
Chunk:
========
Chunk(content="8B uses a larger intermediate dimension in its MLP layers than Llama2-7B\n- Llama3-8B uses a higher base value to calculate theta in its `rotary positional embeddings <https://arxiv.org/abs/2104.09864>`_\n\n|\n\nGetting access to Llama3-8B-Instruct\n------------------------------------\n\nFor this tutorial, we will be using the instruction-tuned version of Llama3-8B. First, let's download the model from Hugging Face. You will need to follow the instructions\non the `official Meta page <https://github.com/meta-llama/llama3/blob/main/README.md>`_ to gain access to the model.\nNext, make sure you grab your Hugging Face token from `here <https://huggingface.co/settings/tokens>`_.\n\n\n.. code-block:: bash\n\n    tune download meta-llama/Meta-Llama-3-8B-Instruct \\\n        --output-dir <checkpoint_dir> \\\n        --hf-token <ACCESS TOKEN>\n\n|\n\nFine-tuning Llama3-8B-Instruct in torchtune\n-------------------------------------------\n\ntorchtune provides `LoRA <https://arxiv.org/abs/2106.09685>`_, `QLoRA <https://arxiv.org/abs/2305.14314>`_, and full fine-tuning\nrecipes for fine-tuning Llama3-8B on one or more GPUs. For more on LoRA in torchtune, see our :ref:`LoRA Tutorial <lora_finetune_label>`.\nFor more on QLoRA in torchtune, see our :ref:`QLoRA Tutorial <qlora_finetune_label>`.\n\nLet's take a look at how we can fine-tune Llama3-8B-Instruct with LoRA on a single device using torchtune. In this example, we will fine-tune\nfor one epoch on a common instruct dataset for illustrative purposes. The basic command for a single-device LoRA fine-tune is\n\n.. code-block:: bash\n\n    tune run lora_finetune_single_device --config llama3/8B_lora_single_device\n\n.. note::\n    To see a full list of recipes and their corresponding configs, simply run ``tune ls`` from the command line.\n\nWe can also add :ref:`command-line overrides <cli_override>` as needed, e.g.\n\n.. code-block:: bash\n\n    tune run lora", document_id='num-2', token_count=512)
========

Score: 0.8992728724432936
Chunk:
========
Chunk(content="_7b <torchtune.models.llama2.lora_llama2_7b>` alone will not handle the definition of which parameters are trainable.\n    See :ref:`below<setting_trainable_params>` for how to do this.\n\nLet's inspect each of these models a bit more closely.\n\n.. code-block:: bash\n\n  # Print the first layer's self-attention in the usual Llama2 model\n  >>> print(base_model.layers[0].attn)\n  MultiHeadAttention(\n    (q_proj): Linear(in_features=4096, out_features=4096, bias=False)\n    (k_proj): Linear(in_features=4096, out_features=4096, bias=False)\n    (v_proj): Linear(in_features=4096, out_features=4096, bias=False)\n    (output_proj): Linear(in_features=4096, out_features=4096, bias=False)\n    (pos_embeddings): RotaryPositionalEmbeddings()\n  )\n\n  # Print the same for Llama2 with LoRA weights\n  >>> print(lora_model.layers[0].attn)\n  MultiHeadAttention(\n    (q_proj): LoRALinear(\n      (dropout): Dropout(p=0.0, inplace=False)\n      (lora_a): Linear(in_features=4096, out_features=8, bias=False)\n      (lora_b): Linear(in_features=8, out_features=4096, bias=False)\n    )\n    (k_proj): Linear(in_features=4096, out_features=4096, bias=False)\n    (v_proj): LoRALinear(\n      (dropout): Dropout(p=0.0, inplace=False)\n      (lora_a): Linear(in_features=4096, out_features=8, bias=False)\n      (lora_b): Linear(in_features=8, out_features=4096, bias=False)\n    )\n    (output_proj): Linear(in_features=4096, out_features=4096, bias=False)\n    (pos_embeddings): RotaryPositionalEmbeddings()\n  )\n\n\nNotice that our LoRA model's layer contains additional weights in the Q and V projections,\nas expected. Additionally, inspecting the type of :code:`lora_model` and\n:code:`base_model`, would show that they are both instances of the same :class:`~torchtune.modules.TransformerDecoder`.\n(Feel free to verify this for yourself.)\n\nWhy does this matter? torchtune makes it easy to load checkpoints for LoRA directly", document_id='num-5', token_count=512)
========

Score: 0.875921749979012
Chunk:
========
Chunk(content=' support for all our models, and also use the ``lora_`` prefix, e.g.\nthe :func:`torchtune.models.llama3.llama3` model has a corresponding :func:`torchtune.models.llama3.lora_llama3`.\nWe aim to provide a comprehensive set of configurations to allow you to get started with training with LoRA quickly,\njust specify any config with ``_lora`` in its name, e.g:\n\n.. code-block:: bash\n\n  tune run lora_finetune_single_device --config llama3/8B_lora_single_device\n\n\nThere are two sets of parameters to customize LoRA to suit your needs. Firstly, the parameters which control\nwhich linear layers LoRA should be applied to in the model:\n\n* ``lora_attn_modules: List[str]`` accepts a list of strings specifying which layers of the model to apply\n  LoRA to:\n\n  * ``q_proj`` applies LoRA to the query projection layer.\n  * ``k_proj`` applies LoRA to the key projection layer.\n  * ``v_proj`` applies LoRA to the value projection layer.\n  * ``output_proj`` applies LoRA to the attention output projection layer.\n\n  Whilst adding more layers to be fine-tuned may improve model accuracy,\n  this will come at the cost of increased memory usage and reduced training speed.\n\n* ``apply_lora_to_mlp: Bool`` applies LoRA to the MLP in each transformer layer.\n* ``apply_lora_to_output: Bool`` applies LoRA to the model\'s final output projection.\n  This is usually a projection to vocabulary space (e.g. in language models), but\n  other modelling tasks may have different projections - classifier models will project\n  to the number of classes, for example\n\n.. note::\n\n  Models which use tied embeddings (such as Gemma and Qwen2 1.5B and 0.5B) for the\n  final output projection do not support ``apply_lora_to_output``.\n\nThese are all specified under the ``model`` flag or config entry, i.e:\n\n.. code-block:: bash\n\n  tune run lora_finetune_single_device --config llama3/8B_lora_single_device  \\\n  model.apply_lora_to_mlp=True \\\n  model.lora_attn_modules=["q_proj","k_proj","v_proj"]\n\n.. code-block:: yaml\n\n  model:\n    apply_lora_to_mlp: True\n    model.lora_attn', document_id='num-0', token_count=512)
========

@facebook-github-bot added the CLA Signed label (this label is managed by the Meta Open Source bot) on Sep 19, 2024
@yanxi0830 marked this pull request as ready for review on September 19, 2024 18:23
@ashwinb (Contributor) commented Sep 19, 2024

I think we need to take a closer look at our type system to get this fixed properly.

URL is defined as a string type in the OpenAPI spec, so clients do not have a way to specify a URL vs. raw content.

This seems to be the problem: we have a URL(uri=...) way of specifying a URL, and maybe the @json_schema_type annotation around it is causing problems.

@yanxi0830 (Contributor, Author) replied, quoting the above:

This seems to be the problem: we have a URL(uri=...) way of specifying a URL, and maybe the @json_schema_type annotation around it is causing problems.

Yes, this is the problem. The URL is of type "string". We could either change the json_schema_type annotation or use this workaround, depending on how we want the client to handle the contents. This is the fastest solution to unblock @hardikjshah.

@json_schema_type(
    schema={"type": "string", "format": "uri", "pattern": "^(https?://|file://|data:)"}
)
class URL(BaseModel):
    uri: str

    def __str__(self) -> str:
        return self.uri
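
For illustration only (not the server code in this PR): because the schema above declares URL as a plain string, the wrapper collapses to the bare URI on the wire, so the receiving side only ever sees a str and has to re-detect intent with the same prefix pattern. The URL class is re-declared below without the decorator so the snippet runs standalone.

import re

from pydantic import BaseModel

class URL(BaseModel):
    uri: str

    def __str__(self) -> str:
        return self.uri

doc_content = "https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/llama3.rst"

# The wrapper collapses to the bare string, which is all that survives
# serialization under a {"type": "string"} schema.
assert str(URL(uri=doc_content)) == doc_content

# Receiving side: recover the intent with the same prefix pattern as the schema.
if re.match(r"^(https?://|file://|data:)", doc_content):
    print("treat as URL: fetch the document before chunking")
else:
    print("treat as raw content: embed the text directly")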

@ashwinb (Contributor) commented Sep 19, 2024

OK I will approve this, but we need to file a task and fix this in our type system because that's the root of the issue. Otherwise it will trip us up in more horrible ways later.

@yanxi0830 merged commit 59af1c8 into main on Sep 19, 2024
3 checks passed
@yanxi0830 deleted the fix_memory_url branch on September 19, 2024 20:35