Description
What is the problem?
I have a large corpus of data that I need to train an NER model on. The corpus consists of 300k PDF documents, each around 6 pages. The size of the training data in the v2 format [(text, {'entities': []}), ...], before converting it to the v3 format, is about 16 GiB.
1- The first issue I ran into was converting the training data to the v3 format. I tried following the "Migrating from v2" guidelines and used: python -m spacy convert ./training.json ./output
However, I got the following message:
UserWarning: [W027] Found a large training file of 5429543893 bytes. Note that it may be more efficient to split your training data into multiple smaller JSON files instead.
for json_doc in json_iterate(input_data):
✔ Generated output file (0 documents):
output3/dataset_multilabel1.spacy
When I looked at the output, it was 118 bytes and the conversion had clearly crashed.
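The warning suggests splitting the training data into multiple smaller JSON files before converting. For reference, a minimal sketch of how that could look, assuming training.json is a single JSON array of records that fits in memory (the paths and chunk size below are placeholders):

import json
from pathlib import Path

SOURCE = Path("training.json")   # placeholder path to the big v2 file
OUT_DIR = Path("chunks")         # placeholder output directory
CHUNK_SIZE = 10_000              # records per smaller file

OUT_DIR.mkdir(exist_ok=True)
records = json.loads(SOURCE.read_text())  # assumes one big JSON array

for i in range(0, len(records), CHUNK_SIZE):
    part = OUT_DIR / f"training_{i // CHUNK_SIZE:04d}.json"
    part.write_text(json.dumps(records[i : i + CHUNK_SIZE]))
    # each smaller file can then be passed to `python -m spacy convert` on its own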
2- So, I changed my approach and used the following code to convert my data:
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

def make_v3_dataset(data, db=None):
    nlp = spacy.blank('en')
    failed_record = []
    if db is None:
        db = DocBin()
    for text, annot, _, _ in tqdm(data):
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in annot['entities']:
            span = doc.char_span(start, end, label=label, alignment_mode='contract')
            if span is None:
                print(f'empty entity, {text}, {annot["entities"]}')  # I expect this to never happen
            else:
                ents.append(span)
        try:
            doc.ents = ents
        except ValueError:  # e.g. overlapping entity spans
            failed_record.append((text, annot))
            continue  # skip docs whose entities could not be set
        db.add(doc)
    return db, failed_record
This approach worked fine and successfully converted the data, though I should mention that I had to run it on a p2.8xlarge AWS EC2 instance with 488 GiB of RAM. So far so good.
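For what it's worth, a variant of the function above that flushes a DocBin to disk every N documents would keep memory bounded and would also sidestep the saving problem described in the next step. This is only a sketch under the same assumptions as the function above (4-tuple records with character-offset entities); the chunk size and file name prefix are placeholders:

import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

def make_v3_dataset_chunked(data, out_prefix="train", chunk_size=10_000):
    # Convert v2-style records into several .spacy files, writing a DocBin
    # to disk every `chunk_size` docs so that memory use stays bounded.
    nlp = spacy.blank("en")
    failed_record = []
    db = DocBin()
    part = 0
    for i, (text, annot, _, _) in enumerate(tqdm(data), start=1):
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in annot["entities"]:
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is not None:
                ents.append(span)
        try:
            doc.ents = ents
        except ValueError:
            failed_record.append((text, annot))
            continue
        db.add(doc)
        if i % chunk_size == 0:
            db.to_disk(f"{out_prefix}_{part:03d}.spacy")
            db = DocBin()
            part += 1
    if len(db) > 0:
        db.to_disk(f"{out_prefix}_{part:03d}.spacy")
    return failed_record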
3- Then, I tried to save the v3 data, because I need a saved file to use the spaCy v3 CLI training approach. I used the following code to save the training data:
v3_data.to_disk("train.spacy")
4- However, I got the error "bytes object is too large", similar to issue #5219 in the closed issues here on GitHub.
I looked around further, and it seems like the only solution is to break the training data down into smaller sections before saving. After a few attempts, I found that I could break it into 30 pieces and save them successfully.
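A sketch of one way to split an existing DocBin into smaller pieces for saving (it assumes the vocab used to build the DocBin is still available; the number of parts and the file name prefix are arbitrary):

from spacy.tokens import DocBin

def split_docbin(db, vocab, n_parts, out_prefix="train"):
    # Rebuild the Docs (this needs the same vocab) and distribute them across
    # n_parts smaller DocBins, saving each one to its own .spacy file.
    docs = list(db.get_docs(vocab))
    per_part = -(-len(docs) // n_parts)  # ceiling division
    for part in range(n_parts):
        piece = DocBin(docs=docs[part * per_part : (part + 1) * per_part])
        piece.to_disk(f"{out_prefix}_{part:02d}.spacy")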
5- Then, to use these files, I created two config files: one for a "cold" start, which uses the very first batch of training data, and one for a "hot" start, which uses the following 29 batches. The primary difference between these two config files is the following:
For cold start:
[components.ner]
factory = "ner"
moves = null
update_with_oracle_cut_size = 100
[components.tok2vec]
factory = "tok2vec"
For hot start:
[components.ner]
source = "./output2/model-best"
[components.tok2vec]
source = "./output2/model-best"
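For completeness, a training run with these configs would be invoked along these lines (the config file names and paths here are just placeholders; the --paths.train / --paths.dev values override the corresponding entries in the config):

python -m spacy train config_cold.cfg --output ./output2 --paths.train ./train_00.spacy --paths.dev ./dev.spacy --gpu-id 0
python -m spacy train config_hot.cfg --output ./output_next --paths.train ./train_01.spacy --paths.dev ./dev.spacy --gpu-id 0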
This approach seems to be working. For the first batch of saved v3 training data (i.e. the cold start), the report says:
ℹ Using GPU: 0
=========================== Initializing pipeline ===========================
[2021-06-09 15:54:44,637] [INFO] Set up nlp object from config
**[2021-06-09 15:54:44,652] [INFO] Pipeline: ['tok2vec', 'ner']**
[2021-06-09 15:54:44,657] [INFO] Created vocabulary
[2021-06-09 15:55:24,815] [INFO] Added vectors: ./wordvec
[2021-06-09 15:55:24,815] [INFO] Finished initializing nlp object
[2021-06-09 15:58:31,397] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E # LOSS TOK2VEC LOSS NER ENTS_F ENTS_P ENTS_R SCORE
--- ------ ------------ -------- ------ ------ ------ ------
0 0 0.00 864.15 0.00 0.00 0.00 0.00
0 200 7569.50 17518.94 24.94 32.41 20.26 0.25
0 400 15093.60 2880.46 36.62 37.27 36.00 0.37
0 600 8486.77 1953.45 42.19 48.01 37.63 0.42
0 800 2062.73 1591.75 39.89 43.21 37.05 0.40
...
0 3400 2483.47 911.21 47.15 62.86 37.72 0.47
0 3600 2587.18 782.31 47.80 57.47 40.91 0.48
0 3800 2160.07 791.27 47.31 59.70 39.18 0.47
✔ Saved pipeline to output directory
output2/model-last
As for the second batch of saved v3 training data (i.e. the hot start), the report says that training is being resumed:
ℹ Using GPU: 0
=========================== Initializing pipeline ===========================
[2021-06-09 17:13:14,429] [INFO] Set up nlp object from config
[2021-06-09 17:13:14,442] [INFO] Pipeline: ['tok2vec', 'ner']
**[2021-06-09 17:13:14,442] [INFO] Resuming training for: ['ner', 'tok2vec']**
[2021-06-09 17:13:14,448] [INFO] Created vocabulary
[2021-06-09 17:13:14,448] [INFO] Finished initializing nlp object
[2021-06-09 17:13:14,448] [INFO] Initialized pipeline components: []
✔ Initialized pipeline
============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E # LOSS TOK2VEC LOSS NER ENTS_F ENTS_P ENTS_R SCORE
--- ------ ------------ -------- ------ ------ ------ ------
0 0 4.14 4.08 50.89 59.96 44.21 0.51
0 200 1066.24 709.79 51.08 60.53 44.17 0.51
0 400 1880.10 942.89 47.21 58.50 39.57 0.47
...
The issues that I am having are:
- Is this even a reasonable approach? It feels too "hacky" and does not inspire confidence. It is also very slow, even though I am using a GPU.
- What is the best approach for NER training on very large datasets like mine? I understand that spaCy's philosophy is industrial applications and that scalability is taken into account, so I thought there should be a better way to do this. I considered skipping the step of saving the v3 training data to disk and instead feeding it to nlp.update() to train my model (see the sketch after this list), but I am not sure whether that is recommended in v3; I used to do it that way in v2 and I am afraid I might lose computational efficiency. Please advise on the best approach for large-scale training. Thank you!
- I tried using Ray for multiprocessing, but I never had success with it, as it always crashed halfway through, which is reported in the last comment in issue Training NER models on multiple GPUs (not just one) #8093.
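To make the nlp.update() idea above concrete, this is roughly what I have in mind (a sketch only; stream_records() is a placeholder generator over the same (text, annotations) pairs as before, and the label and batch size are made up):

import spacy
from spacy.training import Example

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
# The NER labels have to be known before initialize(), e.g.:
# ner.add_label("SOME_LABEL")
optimizer = nlp.initialize()

losses = {}
batch = []
for text, annot in stream_records():  # placeholder: yields (text, {"entities": [...]}) pairs
    doc = nlp.make_doc(text)
    batch.append(Example.from_dict(doc, {"entities": annot["entities"]}))
    if len(batch) == 32:
        nlp.update(batch, sgd=optimizer, losses=losses)
        batch = []
if batch:
    nlp.update(batch, sgd=optimizer, losses=losses)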
Your Environment
All of the code above was run in one conda_python3 notebook on AWS SageMaker, using an ml.p2.8xlarge EC2 instance.
Python Version Used: 3
spaCy Version Used: 3.0.6