You can download model weights from [🤗HuggingFace](https://huggingface.co/hpcai-tech/Colossal-LLaMA-2-7b-base) or [👾Modelscope](https://modelscope.cn/models/colossalai/Colossal-LLaMA-2-7b-base/summary).
#### Quick Start
You can run [`inference_example.py`](inference_example.py) to quickly try inference with our base model, loading the model weights from HF. The script accepts the following CLI arguments (an example command is shown after the list):
* Model path: `--model_path`. HF repo name or local path of the model.
* Device: `--device`. Set the device on which to run inference.
* Max new tokens: `--max_new_tokens`. Set the maximum number of tokens to generate, ignoring the number of tokens in the prompt.
* Do sample: `--do_sample`. Set whether or not to use sampling.
* Temperature: `--temperature`. Set the temperature value for sampling.
* Top_k: `--top_k`. Set the top_k value for top-k filtering.
* Top_p: `--top_p`. Set the top_p value for generation.
* Input_txt: `--input_txt`. The prompt string input to the model.
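For example, an invocation might look like the sketch below. The values are illustrative rather than recommended defaults, and the exact syntax for boolean flags such as `--do_sample` depends on how the script parses them:

```bash
python inference_example.py \
    --model_path "hpcai-tech/Colossal-LLaMA-2-7b-base" \
    --device "cuda:0" \
    --max_new_tokens 256 \
    --do_sample True \
    --temperature 0.3 \
    --top_k 50 \
    --top_p 0.95 \
    --input_txt "Introduce some landmarks in Beijing."
```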
## Usage
### Install
❗️**Important**: Once you initialize the new model checkpoint, copy your new tokenizer files (`special_tokens_map.json`, `tokenizer.model` and `tokenizer_config.json`) to your new model folder.
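For instance, with `<TOKENIZER_DIR>` and `<NEW_MODEL_DIR>` as placeholders for your own paths, the copy step could look like:

```bash
# Copy the new tokenizer files alongside the newly initialized model checkpoint.
cp <TOKENIZER_DIR>/special_tokens_map.json \
   <TOKENIZER_DIR>/tokenizer.model \
   <TOKENIZER_DIR>/tokenizer_config.json \
   <NEW_MODEL_DIR>/
```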
#### 3. Data Preparation
##### 3.1 Data for Pretraining
Raw data should be in `jsonl` format. Each data point should have the following fields (an illustrative example follows the list):
* `source` (str, compulsory): This part is ignored when calculating loss. It can be left empty.
* `target` (str, compulsory): Loss is calculated on this part.
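A minimal illustrative data point (the text is made up for this example) might look like:

```JSON
{"source": "", "target": "Colossal-LLaMA-2 is continually pre-trained from LLaMA-2 on both Chinese and English corpora."}
```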
* Max length: `max_length`. Max length of spliced samples. Default value is 4096.
* Number of bins for each category: `num_spliced_dataset_bins`. Used for bucket-based training.
##### 3.2 Data for Supervised Fine-tuning
We prepare data for supervised fine-tuning in a similar way. The main difference lies in the data format. Each data point should have the following field:
* `messages` (list, compulsory): A conversation between a human and the assistant. The length of `messages` can vary, and only content from the `assistant` is used for calculating loss.
Examples:
```JSON
{"messages": [{"from": "human", "content": "What are the three primary colors?"}, {"from": "assistant", "content": "The three primary colors are red, blue, and yellow."}]}
```
The command to convert a `jsonl` dataset to arrow format is similar to the one in [3.1 Data for Pretraining](#31-data-for-pretraining). In `prepare_sft_dataset.py`, we do not concatenate different data samples.
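As a rough sketch only: the invocation below assumes flag names (`--data_input_dirs`, `--tokenizer_dir`, `--data_output_dirs`, `--max_length`) that are not confirmed in this document, so check `prepare_sft_dataset.py --help` for the actual interface before use.

```bash
# Hypothetical flag names for illustration -- verify them against the script.
python prepare_sft_dataset.py \
    --data_input_dirs "<JSONL_DIR>" \
    --tokenizer_dir "<TOKENIZER_DIR>" \
    --data_output_dirs "<ARROW_OUTPUT_DIR>" \
    --max_length 4096
```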
You can use `colossalai run` to launch multi-node training:
```bash
colossalai run --nproc_per_node YOUR_GPU_PER_NODE --hostfile YOUR_HOST_FILE \
    train.py --OTHER_CONFIGURATIONS
```
* Tensor parallelism size: `--tp`. TP size for 3D parallelism. The default value is 1.
* ZeRO stage: `--zero`. ZeRO stage for 3D parallelism. The default value is 1. A short sketch of how these flags are passed is shown after this list.
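As a sketch (assuming the `train.py` entry point from the launch command above, with example parallelism values rather than recommendations), the two flags are simply appended to the launch command:

```bash
# Example: tensor parallel size 2 with ZeRO stage 1; adjust to your hardware.
colossalai run --nproc_per_node YOUR_GPU_PER_NODE --hostfile YOUR_HOST_FILE \
    train.py --OTHER_CONFIGURATIONS \
    --tp 2 \
    --zero 1
```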
##### 4.2 Arguments for Supervised Fine-tuning
We add support for gradient accumulation and NEFTuning for supervised fine-tuning, so there are two additional arguments on top of those listed in [4.1 Arguments for Pretraining](#41-arguments-for-pretraining).
Here are the details about the additional CLI arguments:
* Accumulation steps: `--accumulation_steps`. The default value is `8`.
* NEFTuning: `--use_neft`. The default value is `False`. It can help improve the performance of chat models.
#### 5. Running Command
##### 5.1 Command for Pretraining
An [example bash script](train.example.sh) is also provided for the experiment. Here are the steps to run it:
* Create your own hostfile: `cp hostfile.example hostfile`.
* Create your own bash script: `cp train.example.sh train.sh`.
* Declare your dataset paths in `train.sh`, for example:
```bash
declare -a dataset=(
    "<DIR_2>/part-00000"
)
```
##### 5.2 Command for Supervised Fine-tuning
An [example bash script](train_sft.example.sh) is also provided. The only difference from the pretraining command is the two additional arguments (`--accumulation_steps` and `--use_neft`) in the script; see [4.2 Arguments for Supervised Fine-tuning](#42-arguments-for-supervised-fine-tuning) for details.
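As a sketch only, the launch could look like the following; the entry-point name `train_sft.py` is an assumption inferred from `train_sft.example.sh`, and `--use_neft` is assumed to be a boolean switch, so rely on the provided script for the exact invocation:

```bash
# Sketch of an SFT launch: same launcher as pretraining plus the two SFT-specific flags.
# train_sft.py and the bare --use_neft switch are assumptions; see train_sft.example.sh.
colossalai run --nproc_per_node YOUR_GPU_PER_NODE --hostfile YOUR_HOST_FILE \
    train_sft.py --OTHER_CONFIGURATIONS \
    --accumulation_steps 8 \
    --use_neft
```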
## Technical Insights
In order to enhance LLaMA-2's capabilities for understanding and generating Chinese content, the [Colossal-AI](https://github.com/hpcaitech/ColossalAI) team proposes continuing the pre-training of the LLaMA-2 model on both Chinese and English corpora. The overall pipeline can be described as follows:
## Citations

```bibtex
@article{jain2023neftune,
    title={NEFTune: Noisy Embeddings Improve Instruction Finetuning},
    author={Jain, Neel and Chiang, Ping-yeh and Wen, Yuxin and Kirchenbauer, John and Chu, Hong-Min and Somepalli, Gowthami and Bartoldson, Brian R and Kailkhura, Bhavya and Schwarzschild, Avi and Saha, Aniruddha and others},
    journal={arXiv preprint arXiv:2310.05914},
    year={2023}
}
```