
Commit 07e3ace

Merge pull request #285 from FlagAI-Open/revert-284-master
Revert "update master into gpm_dev"
2 parents ea9a2b8 + 905b1a0 commit 07e3ace

File tree

126 files changed: +1,762,315 -252 lines


.gitignore

Lines changed: 5 additions & 0 deletions

@@ -28,3 +28,8 @@ qqp
 glm_large_qqp_pytorch
 wandb
 clip_benchmark_datasets
+examples/AltCLIP/clip_benchmark_datasets
+examples/glm_pretrain/data.lazy
+examples/glm_pretrain/examples/glm_pretrain/data.lazy
+examples/vit_cifar100/cifar100
+examples/vit_cifar100/data

README.md

Lines changed: 0 additions & 1 deletion

@@ -20,7 +20,6 @@ FlagAI (Fast LArge-scale General AI models) is a fast, easy-to-use and extensibl
 The code is partially based on [GLM](https://github.com/THUDM/GLM), [Transformers](https://github.com/huggingface/transformers), [timm](https://github.com/rwightman/pytorch-image-models) and [DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM).

 ## News
-- [17 Mar 2023] release v1.6.2, Support application of new optimizers [#266](https://github.com/FlagAI-Open/FlagAI/pull/266);
 - [2 Mar 2023] release v1.6.1, Support Galactica model [#234](https://github.com/FlagAI-Open/FlagAI/pull/234); BMInf, a low-resource inference package [#238](https://github.com/FlagAI-Open/FlagAI/pull/238), and examples for p-tuning [#227](https://github.com/FlagAI-Open/FlagAI/pull/238)
 - [12 Jan 2023] release v1.6.0, support a new parallel lib called [**BMTrain**](https://github.com/OpenBMB/BMTrain) and integrate [**Flash Attention**](https://github.com/HazyResearch/flash-attention) to speed up training of Bert and Vit models, examples in [FlashAttentionBERT](https://github.com/FlagAI-Open/FlagAI/blob/master/examples/bert_title_generation_english/train_flash_atten.py) and [FlashAttentionViT](https://github.com/FlagAI-Open/FlagAI/blob/master/examples/vit_cifar100/train_single_gpu_flash_atten.py). Also add the contrastive search based text generation method [**SimCTG**](https://github.com/yxuansu/SimCTG) and DreamBooth finetuning based on AltDiffusion, examples in [AltDiffusionNaruto](https://github.com/FlagAI-Open/FlagAI/blob/master/examples/AltDiffusion/dreambooth.py).
 - [28 Nov 2022] release v1.5.0, support 1.1B [**EVA-CLIP**](https://github.com/FlagAI-Open/FlagAI/tree/master/examples/EVA_CLIP) and [ALM: A large Arabic Language Model based on GLM], examples in [**ALM**](https://github.com/FlagAI-Open/FlagAI/tree/master/examples/ALM)

README_zh.md

Lines changed: 0 additions & 1 deletion

@@ -21,7 +21,6 @@
 Part of this project's code is based on [GLM](https://github.com/THUDM/GLM), [Transformers](https://github.com/huggingface/transformers), [timm](https://github.com/rwightman/pytorch-image-models) and [DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM).

 ## News
-- [17 Mar 2023] Release v1.6.2, adding support for new optimizers [#266](https://github.com/FlagAI-Open/FlagAI/pull/266);
 - [2 Mar 2023] Release v1.6.1, adding the Galactica model [#234](https://github.com/FlagAI-Open/FlagAI/pull/234), BMInf, a low-resource inference toolkit for large models [#238](https://github.com/FlagAI-Open/FlagAI/pull/238), and P-tuning examples [#227](https://github.com/FlagAI-Open/FlagAI/pull/238)
 - [12 Jan 2023] Release v1.6.0, adding the parallel training library [**BMTrain**](https://github.com/OpenBMB/BMTrain) and integrating [**Flash Attention**](https://github.com/HazyResearch/flash-attention) into the Bert and Vit models to speed up end-to-end training; see [FlashAttentionBERT](https://github.com/FlagAI-Open/FlagAI/blob/master/examples/bert_title_generation_english/train_flash_atten.py) and [FlashAttentionViT](https://github.com/FlagAI-Open/FlagAI/blob/master/examples/vit_cifar100/train_single_gpu_flash_atten.py). Also adds the contrastive-search text generation method [**SimCTG**](https://github.com/yxuansu/SimCTG) and DreamBooth personalized finetuning based on AltDiffusion; see [AltDiffusionNaruto](https://github.com/FlagAI-Open/FlagAI/blob/master/examples/AltDiffusion/dreambooth.py).
 - [28 Nov 2022] Release v1.5.0, supporting the 1.1B-parameter [**EVA-CLIP**](https://github.com/FlagAI-Open/FlagAI/tree/master/examples/EVA_CLIP) and [ALM: a large Arabic language model based on GLM]; see [**ALM**](https://github.com/FlagAI-Open/FlagAI/tree/master/examples/ALM)

examples/clip/train_clip_deepspeed.py

Lines changed: 2 additions & 2 deletions

@@ -5,8 +5,8 @@

 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 # cd examples/clip
-data_path = "./data/pairs.csv"
-img_dir = "./data/img"
+data_path = "./data/pairs.csv"#"/mnt/datasets/multimodal/ConceptualCaptions/Train_GCC-training_output.csv"
+img_dir = "./data/img"#"/mnt/datasets/multimodal/ConceptualCaptions"

 trainer = Trainer(
     env_type="deepspeed",
Lines changed: 341 additions & 0 deletions
@@ -0,0 +1,341 @@
# coding=utf-8
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""GPT style dataset."""

import os
import time

import numpy as np
import torch

from megatron import mpu, print_rank_0
from megatron.data.blendable_dataset import BlendableDataset
from megatron.data.dataset_utils import get_datasets_weights_and_num_samples
from megatron.data.dataset_utils import get_train_valid_test_split_
from megatron.data.indexed_dataset import make_dataset as make_indexed_dataset
from megatron.data.gpt_dataset import _build_shuffle_idx, _build_doc_idx, _num_epochs, _num_tokens, get_indexed_dataset_, _build_sample_idx

class GPTDataset(torch.utils.data.Dataset):

    def __init__(self, name, data_prefix, documents, indexed_dataset,
                 num_samples, seq_length, seed):

        self.name = name
        self.indexed_dataset = indexed_dataset

        # Checks
        assert np.min(documents) >= 0
        assert np.max(documents) < indexed_dataset.sizes.shape[0]

        # Build index mappings.
        self.doc_idx, self.sample_idx, self.shuffle_idx = _build_index_mappings(
            self.name, data_prefix, documents, self.indexed_dataset.sizes,
            num_samples, seq_length, seed)

    def __len__(self):
        # -1 is due to the data structure used to retrieve the index:
        #    sample i --> [sample_idx[i], sample_idx[i+1])
        return self.sample_idx.shape[0] - 1

    def __getitem__(self, idx):
        # Get the shuffled index.
        idx = self.shuffle_idx[idx]
        # Start and end documents and offsets.
        doc_index_f = self.sample_idx[idx][0]
        doc_index_l = self.sample_idx[idx + 1][0]
        offset_f = self.sample_idx[idx][1]
        offset_l = self.sample_idx[idx + 1][1]
        # If we are within the same document, just extract the chunk.
        if doc_index_f == doc_index_l:
            sample = self.indexed_dataset.get(self.doc_idx[doc_index_f],
                                              offset=offset_f,
                                              length=offset_l - offset_f + 1)
        else:
            # Otherwise, get the rest of the initial document.
            sample_list = [self.indexed_dataset.get(self.doc_idx[doc_index_f],
                                                    offset=offset_f)]
            # Loop over all in-between documents and add each entire document.
            for i in range(doc_index_f + 1, doc_index_l):
                sample_list.append(self.indexed_dataset.get(self.doc_idx[i]))
            # And finally add the relevant portion of the last document.
            sample_list.append(self.indexed_dataset.get(
                self.doc_idx[doc_index_l],
                length=offset_l + 1))
            sample = np.concatenate(sample_list)

        return {'input_ids': np.array(sample, dtype=np.int64)}

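To make the (document index, offset) bookkeeping in __getitem__ concrete, here is a minimal editor's illustration; it is not part of the committed file, and the toy docs, doc_idx, sample_idx values and the get_sample helper are invented for the example. It mimics the same slicing logic with plain NumPy arrays:

# Editor's illustration only -- not part of the committed file.
import numpy as np

docs = [np.arange(0, 6), np.arange(10, 13), np.arange(20, 28)]   # three toy "documents"
doc_idx = np.array([0, 1, 2])            # document order after shuffling
# Each row of sample_idx is (index into doc_idx, token offset within that document);
# sample i covers everything between row i and row i + 1, end token included.
sample_idx = np.array([[0, 0], [1, 1], [2, 5]])

def get_sample(i):
    (doc_f, off_f), (doc_l, off_l) = sample_idx[i], sample_idx[i + 1]
    if doc_f == doc_l:                                    # sample fits in one document
        return docs[doc_idx[doc_f]][off_f:off_l + 1]
    pieces = [docs[doc_idx[doc_f]][off_f:]]               # tail of the first document
    pieces += [docs[doc_idx[j]] for j in range(doc_f + 1, doc_l)]  # whole middle documents
    pieces.append(docs[doc_idx[doc_l]][:off_l + 1])       # head of the last document
    return np.concatenate(pieces)

print(get_sample(0))   # [ 0  1  2  3  4  5 10 11]  -> spans documents 0 and 1
print(get_sample(1))   # [11 12 20 21 22 23 24 25]  -> spans documents 1 and 2

Both toy samples are 8 tokens long; in the real class, as in Megatron's GPT dataset, each sample is seq_length + 1 tokens so inputs and shifted labels can be taken from the same chunk.
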
def _build_index_mappings(name, data_prefix, documents, sizes,
                          num_samples, seq_length, seed):
    """Build doc-idx, sample-idx, and shuffle-idx.
    doc-idx: is an array (ordered) of documents to be used in training.
    sample-idx: is the start document index and document offset for each
       training sample.
    shuffle-idx: maps the sample index into a random index into sample-idx.
    """
    # Number of tokens in each epoch and number of required epochs.
    tokens_per_epoch = _num_tokens(documents, sizes)
    num_epochs = _num_epochs(tokens_per_epoch, seq_length, num_samples)
    # RNG state.
    np_rng = np.random.RandomState(seed=seed)

    # Filename of the index mappings.
    _filename = data_prefix
    _filename += '_{}_indexmap'.format(name)
    _filename += '_{}ns'.format(num_samples)
    _filename += '_{}sl'.format(seq_length)
    _filename += '_{}s'.format(seed)
    doc_idx_filename = _filename + '_doc_idx.npy'
    sample_idx_filename = _filename + '_sample_idx.npy'
    shuffle_idx_filename = _filename + '_shuffle_idx.npy'

    # Build the index mappings if they do not exist.
    if True:
        if (not os.path.isfile(doc_idx_filename)) or \
           (not os.path.isfile(sample_idx_filename)) or \
           (not os.path.isfile(shuffle_idx_filename)):

            print_rank_0(' > WARNING: could not find index map files, building '
                         'the indices on rank 0 ...')

            # For the last epoch, decide whether to include the entire epoch
            # in the global shuffle or not.

            # If we need only one epoch, then separating out the last epoch
            # does not mean anything.
            if num_epochs == 1:
                separate_last_epoch = False
                print(' > only one epoch required, setting '
                      'separate_last_epoch to False', flush=True)

            else:
                # Get the number of samples for the last epoch.
                num_samples_from_epochs_minus_one = (
                    (num_epochs - 1) * tokens_per_epoch - 1) // seq_length
                last_epoch_num_samples = num_samples - \
                    num_samples_from_epochs_minus_one
                assert last_epoch_num_samples >= 0, \
                    'last epoch number of samples should be non-negative.'
                num_samples_per_epoch = (tokens_per_epoch - 1) // seq_length
                assert last_epoch_num_samples < (num_samples_per_epoch + 1), \
                    'last epoch number of samples exceeded max value.'
                # If we have less than 80% of the samples for the last epoch,
                # separate out the epoch and treat it differently.
                # Note: the 80% number is just based on common sense and can
                # be adjusted if needed.
                separate_last_epoch = (last_epoch_num_samples <
                                       int(0.80 * num_samples_per_epoch))
                if separate_last_epoch:
                    string = ' > last epoch number of samples ({}) is smaller '\
                             'than 80% of number of samples per epoch ({}), '\
                             'setting separate_last_epoch to True'
                else:
                    string = ' > last epoch number of samples ({}) is larger '\
                             'than 80% of number of samples per epoch ({}), '\
                             'setting separate_last_epoch to False'
                print(string.format(last_epoch_num_samples,
                                    num_samples_per_epoch), flush=True)

            # doc-idx.
            start_time = time.time()
            doc_idx = _build_doc_idx(documents, num_epochs, np_rng,
                                     separate_last_epoch)
            np.save(doc_idx_filename, doc_idx, allow_pickle=True)
            print_rank_0(' > elapsed time to build and save doc-idx mapping '
                         '(seconds): {:4f}'.format(time.time() - start_time))
            # sample-idx.
            start_time = time.time()
            # Use C++ implementation for speed.
            # First compile and then import.
            # from megatron.data import helpers
            assert doc_idx.dtype == np.int32
            assert sizes.dtype == np.int32
            sample_idx = _build_sample_idx(sizes, doc_idx, seq_length,
                                           num_epochs, tokens_per_epoch)
            # sample_idx = _build_sample_idx(sizes, doc_idx, seq_length,
            #                                num_epochs, tokens_per_epoch)
            np.save(sample_idx_filename, sample_idx, allow_pickle=True)
            print_rank_0(' > elapsed time to build and save sample-idx mapping '
                         '(seconds): {:4f}'.format(time.time() - start_time))
            # shuffle-idx.
            start_time = time.time()
            # -1 is due to the data structure used to retrieve the index:
            #    sample i --> [sample_idx[i], sample_idx[i+1])
            if separate_last_epoch:
                num_samples_ = num_samples_from_epochs_minus_one
            else:
                num_samples_ = sample_idx.shape[0] - 1
            shuffle_idx = _build_shuffle_idx(num_samples_,
                                             sample_idx.shape[0] - 1, np_rng)
            np.save(shuffle_idx_filename, shuffle_idx, allow_pickle=True)
            print_rank_0(' > elapsed time to build and save shuffle-idx mapping'
                         ' (seconds): {:4f}'.format(time.time() - start_time))

    # Load mappings.
    start_time = time.time()
    print_rank_0(' > loading doc-idx mapping from {}'.format(
        doc_idx_filename))
    doc_idx = np.load(doc_idx_filename, allow_pickle=True, mmap_mode='r')
    print_rank_0(' > loading sample-idx mapping from {}'.format(
        sample_idx_filename))
    sample_idx = np.load(sample_idx_filename, allow_pickle=True, mmap_mode='r')
    print_rank_0(' > loading shuffle-idx mapping from {}'.format(
        shuffle_idx_filename))
    shuffle_idx = np.load(shuffle_idx_filename, allow_pickle=True, mmap_mode='r')
    print_rank_0('    loaded indexed file in {:3.3f} seconds'.format(
        time.time() - start_time))
    print_rank_0('    total number of samples: {}'.format(
        sample_idx.shape[0]))
    print_rank_0('    total number of epochs: {}'.format(num_epochs))

    return doc_idx, sample_idx, shuffle_idx

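The number of passes over the corpus that _build_index_mappings makes is determined entirely by tokens_per_epoch, seq_length and num_samples. The helpers _num_tokens and _num_epochs are imported from megatron.data.gpt_dataset; the sketch below is an editor's approximation of the standard Megatron-LM logic, not a verbatim copy, and approx_num_tokens / approx_num_epochs are hypothetical names:

# Editor's sketch -- approximates the imported Megatron-LM helpers, not a verbatim copy.
import numpy as np

def approx_num_tokens(documents, sizes):
    # Total tokens contained in the selected documents (one "epoch" of data).
    return int(np.sum(sizes[documents]))

def approx_num_epochs(tokens_per_epoch, seq_length, num_samples):
    # Smallest number of passes that yields at least num_samples samples,
    # where each sample consumes seq_length tokens (plus one shared boundary token).
    epochs, total_tokens = 0, 0
    while True:
        epochs += 1
        total_tokens += tokens_per_epoch
        if (total_tokens - 1) // seq_length >= num_samples:
            return epochs

print(approx_num_epochs(1_000_000, 1024, 3_000))   # 4: three epochs would give only 2929 samples
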
def _build_train_valid_test_datasets(data_prefix, data_impl, splits_string,
                                     train_valid_test_num_samples,
                                     seq_length, seed, skip_warmup):
    """Build train, valid, and test datasets."""

    # Indexed dataset.
    indexed_dataset = get_indexed_dataset_(data_prefix,
                                           data_impl,
                                           skip_warmup)

    total_num_of_documents = indexed_dataset.sizes.shape[0]
    splits = get_train_valid_test_split_(splits_string, total_num_of_documents)

    # Print stats about the splits.
    print_rank_0(' > dataset split:')

    def print_split_stats(name, index):
        print_rank_0('    {}:'.format(name))
        print_rank_0('     document indices in [{}, {}) total of {} '
                     'documents'.format(splits[index], splits[index + 1],
                                        splits[index + 1] - splits[index]))
    print_split_stats('train', 0)
    print_split_stats('validation', 1)
    print_split_stats('test', 2)

    def build_dataset(index, name):
        dataset = None
        if splits[index + 1] > splits[index]:
            documents = np.arange(start=splits[index], stop=splits[index + 1],
                                  step=1, dtype=np.int32)
            dataset = GPTDataset(name, data_prefix,
                                 documents, indexed_dataset,
                                 train_valid_test_num_samples[index],
                                 seq_length, seed)
        return dataset

    train_dataset = build_dataset(0, 'train')
    valid_dataset = build_dataset(1, 'valid')
    test_dataset = build_dataset(2, 'test')

    return (train_dataset, valid_dataset, test_dataset)

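The splits_string value '9999,1,0' used in the __main__ block below is turned into document-index boundaries by get_train_valid_test_split_, imported from megatron.data.dataset_utils. The following is a simplified editor's sketch of that behavior (the real helper also corrects rounding drift across all boundaries); split_boundaries is a hypothetical name:

# Editor's sketch -- simplified version of the imported split helper, not a verbatim copy.
def split_boundaries(splits_string, num_documents):
    weights = [float(w) for w in splits_string.split(',')]
    weights = [w / sum(weights) for w in weights]            # normalize to 1.0
    bounds = [0]
    for w in weights:
        bounds.append(bounds[-1] + int(round(w * num_documents)))
    bounds[-1] = num_documents                               # absorb rounding drift
    return bounds

print(split_boundaries('9999,1,0', 10000))   # [0, 9999, 10000, 10000]
# -> train uses documents [0, 9999), valid uses [9999, 10000), test is empty.
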
if __name__ == '__main__':
    ### Fill these in according to the dataset; see documents_stat.py.
    ### Sample counts and epochs are worked out in advance; the shuffle is built here in one pass.

    ### gpt2
    data_prefix = '/share/project/ldwang/data/indexed_dataset/gpt2/merged_text_document'
    data_impl = 'mmap'
    splits_string = '9999,1,0'
    train_valid_test_num_samples = [41313229, 4132, 0]
    seq_length = 1024
    seed = 2023
    skip_warmup = False

    ### debug
    data_prefix = '00_text_document'
    data_impl = 'mmap'
    splits_string = '9999,1,0'
    train_valid_test_num_samples = [1, 1, 0]
    seq_length = 1024
    seed = 2023
    skip_warmup = False

    ### gpm
    data_prefix = '/share/project/ldwang/data/indexed_dataset/gpm/merged_text_document'
    data_impl = 'mmap'
    splits_string = '9999,1,0'
    train_valid_test_num_samples = [343969381, 344314, 0]
    seq_length = 2048
    seed = 2023
    skip_warmup = False

    ### gpm part
    data_prefix = '/share/project/ldwang/data/indexed_dataset/gpm/part_merged_text_document'
    data_impl = 'mmap'
    splits_string = '9999,1,0'
    train_valid_test_num_samples = [99136540, 99236, 0]
    seq_length = 2048
    seed = 2023
    skip_warmup = False

    ### gpm 10
    data_prefix = '/share/project/ldwang/data/indexed_dataset/gpm/10_merged_text_document'
    data_impl = 'mmap'
    splits_string = '9999,1,0'
    train_valid_test_num_samples = [29375962, 29406, 0]
    seq_length = 2048
    seed = 2023
    skip_warmup = False

    ### gpm 20
    data_prefix = '/share/project/ldwang/data/indexed_dataset/gpm/20_merged_text_document'
    data_impl = 'mmap'
    splits_string = '9999,1,0'
    train_valid_test_num_samples = [70166341, 70237, 0]
    seq_length = 2048
    seed = 2023
    skip_warmup = False

    ### gpm 12
    data_prefix = '/share/project/ldwang/data/indexed_dataset/gpm/12_merged_text_document'
    data_impl = 'mmap'
    splits_string = '9999,1,0'
    train_valid_test_num_samples = [33605368, 33606, 0]
    seq_length = 2048
    seed = 2023
    skip_warmup = False

    ### gpm debug
    data_prefix = '/share/project/ldwang/data/indexed_dataset/gpm/debug_merged_text_document'
    data_impl = 'mmap'
    splits_string = '9999,1,0'
    train_valid_test_num_samples = [29375962, 29406, 0]
    seq_length = 2048
    seed = 2023
    skip_warmup = False

    ### gpm
    data_prefix = '/share/project/ldwang/data/indexed_dataset/gpm/merged_text_document'
    data_impl = 'mmap'
    splits_string = '9999,1,0'
    train_valid_test_num_samples = [344379254, 34441, 0]
    seq_length = 2048
    seed = 2023
    skip_warmup = True

    train_dataset, valid_dataset, test_dataset = _build_train_valid_test_datasets(
        data_prefix, data_impl, splits_string,
        train_valid_test_num_samples,
        seq_length, seed, skip_warmup)
    print(len(train_dataset))
    print(type(train_dataset))
    print(train_dataset[0])
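Note that the preset blocks above are plain assignments, so each one overwrites the previous; as committed, only the final ### gpm block (with skip_warmup = True) actually reaches _build_train_valid_test_datasets. A possible alternative, shown here as an editor's sketch rather than part of the commit, is to keep the presets in a dict and select one by name:

# Editor's sketch -- hypothetical refactoring, not part of the committed file.
CONFIGS = {
    'gpt2': dict(data_prefix='/share/project/ldwang/data/indexed_dataset/gpt2/merged_text_document',
                 train_valid_test_num_samples=[41313229, 4132, 0], seq_length=1024,
                 skip_warmup=False),
    'gpm': dict(data_prefix='/share/project/ldwang/data/indexed_dataset/gpm/merged_text_document',
                train_valid_test_num_samples=[344379254, 34441, 0], seq_length=2048,
                skip_warmup=True),
}

cfg = CONFIGS['gpm']
train_dataset, valid_dataset, test_dataset = _build_train_valid_test_datasets(
    cfg['data_prefix'], 'mmap', '9999,1,0',
    cfg['train_valid_test_num_samples'],
    cfg['seq_length'], 2023, cfg['skip_warmup'])
print(len(train_dataset))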
