Training NER on Large Dataset #8324

@Julia-Penfield

Description

What is the problem?

I have a large corpus of data that I need to train an NER model on. The corpus is 300k PDF documents, each around 6 pages. The training data in the v2 format [(text, {'entities': []}), ...], before converting it to the v3 format, is about 16 GiB.
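
For reference, a single record in that v2 format looks like this (the text, offsets, and labels here are made up purely for illustration):

("Apple is hiring in New York.", {"entities": [(0, 5, "ORG"), (19, 27, "GPE")]})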

1- The first issue that I ran into was in trying to convert the training data to the v3 format. I tried following the "Migrating from v2" guidelines and used: python -m spacy convert ./training.json ./output

However, I got the following message:

UserWarning: [W027] Found a large training file of 5429543893 bytes. Note that it may be more efficient to split your training data into multiple smaller JSON files instead.
  for json_doc in json_iterate(input_data):
✔ Generated output file (0 documents):
output3/dataset_multilabel1.spacy

When I looked at the output, it was 118 bytes, so clearly the conversion had failed.
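
Taking the warning at face value, the intended workaround seems to be to split the big JSON file into several smaller ones before running spacy convert, roughly like this (a minimal sketch; the shards/ directory name, chunk size, and file naming are my own choices, and it still loads the whole file into memory, which was fine on my machine):

import os
import srsly

os.makedirs("shards", exist_ok=True)
docs = srsly.read_json("training.json")  # top-level list of v2 JSON training documents
chunk_size = 10_000                      # documents per shard; tune as needed
for i in range(0, len(docs), chunk_size):
    srsly.write_json(f"shards/training_{i // chunk_size:03d}.json", docs[i:i + chunk_size])

Each shard could then be converted separately (or, if I read the docs right, spacy convert can be pointed at the shards/ directory) instead of at the single 5.4 GB file.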

2- So, I changed my approach and used the following code to convert my data:

import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

def make_v3_dataset(data, db=None):
    nlp = spacy.blank('en')
    failed_record = []
    if db is None:  # avoid the mutable-default-argument pitfall
        db = DocBin()
    for text, annot, _, _ in tqdm(data):  # the last two fields of each record are unused
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in annot['entities']:
            span = doc.char_span(start, end, label=label, alignment_mode='contract')
            if span is None:
                print(f'empty entity, {text}, {annot["entities"]}')  # I expect this to never happen
            else:
                ents.append(span)
        try:
            doc.ents = ents
        except ValueError:  # e.g. overlapping spans
            failed_record.append((text, annot))  # the doc is still added, just without entities
        db.add(doc)
    return db, failed_record

This approach worked fine and successfully converted the data, though I should mention that I had to run it on a p2.8xlarge AWS EC2 instance with 488 GiB of RAM. So far so good.

3- Then, I tried to save the v3 data, because I need a saved file to use the spaCy v3 CLI training approach. I used the following code to save the training data:

v3_data.to_disk("train.spacy")

4- However, I got the error "bytes object is too large", similar to closed issue #5219 here on GitHub.

I looked around further and it seems like the only solution is to break the training data down into smaller sections before saving. After a few attempts, I found that I could break it into 30 pieces and save them successfully.
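
Concretely, the sharded saving looks roughly like this (a minimal sketch using the make_v3_dataset helper from above; the chunk size, directory, and file names are my own placeholders):

import os

def save_in_shards(data, out_dir="corpus/train", chunk_size=10_000):
    os.makedirs(out_dir, exist_ok=True)
    all_failed = []
    for i in range(0, len(data), chunk_size):
        # build one DocBin per chunk so no single serialized file gets too large
        db, failed = make_v3_dataset(data[i:i + chunk_size])
        all_failed.extend(failed)
        db.to_disk(os.path.join(out_dir, f"train_{i // chunk_size:03d}.spacy"))
    return all_failed

If I read the docs correctly, the v3 corpus reader can also take a directory of .spacy files for paths.train, so in principle all shards could go into a single training run rather than my cold/hot-start loop below, but I have not verified that at this scale.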

5- Then, to use these files, I created two config files: one for a "cold" start, used for the very first batch of training data, and one for a "hot" start, used for the remaining 29 batches. The primary difference between these two config files is the following:

For cold start:

[components.ner]
factory = "ner"
moves = null
update_with_oracle_cut_size = 100

[components.tok2vec]
factory = "tok2vec"

For hot start:

[components.ner]
source = "./output2/model-best"

[components.tok2vec]
source = "./output2/model-best"
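
For completeness, this is roughly how I invoke the CLI for each batch (the config file names and shard paths are placeholders for my actual ones):

# first batch with the cold-start config
python -m spacy train config_cold.cfg --output ./output2 --gpu-id 0 \
    --paths.train ./corpus/train/train_000.spacy --paths.dev ./corpus/dev.spacy

# subsequent batches with the hot-start config, sourcing ./output2/model-best
python -m spacy train config_hot.cfg --output ./output3 --gpu-id 0 \
    --paths.train ./corpus/train/train_001.spacy --paths.dev ./corpus/dev.spacy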

This approach seems to be working. For the first batch of saved v3 training data (aka cold start), the report says:

ℹ Using GPU: 0

=========================== Initializing pipeline ===========================
[2021-06-09 15:54:44,637] [INFO] Set up nlp object from config
**[2021-06-09 15:54:44,652] [INFO] Pipeline: ['tok2vec', 'ner']**
[2021-06-09 15:54:44,657] [INFO] Created vocabulary
[2021-06-09 15:55:24,815] [INFO] Added vectors: ./wordvec
[2021-06-09 15:55:24,815] [INFO] Finished initializing nlp object
[2021-06-09 15:58:31,397] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00    864.15    0.00    0.00    0.00    0.00
  0     200       7569.50  17518.94   24.94   32.41   20.26    0.25
  0     400      15093.60   2880.46   36.62   37.27   36.00    0.37
  0     600       8486.77   1953.45   42.19   48.01   37.63    0.42
  0     800       2062.73   1591.75   39.89   43.21   37.05    0.40
...
  0    3400       2483.47    911.21   47.15   62.86   37.72    0.47
  0    3600       2587.18    782.31   47.80   57.47   40.91    0.48
  0    3800       2160.07    791.27   47.31   59.70   39.18    0.47
✔ Saved pipeline to output directory
output2/model-last

As for the second batch of saved v3 training data (aka hot start), the report says that training is being resumed:

ℹ Using GPU: 0

=========================== Initializing pipeline ===========================
[2021-06-09 17:13:14,429] [INFO] Set up nlp object from config
[2021-06-09 17:13:14,442] [INFO] Pipeline: ['tok2vec', 'ner']
**[2021-06-09 17:13:14,442] [INFO] Resuming training for: ['ner', 'tok2vec']**
[2021-06-09 17:13:14,448] [INFO] Created vocabulary
[2021-06-09 17:13:14,448] [INFO] Finished initializing nlp object
[2021-06-09 17:13:14,448] [INFO] Initialized pipeline components: []
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          4.14      4.08   50.89   59.96   44.21    0.51
  0     200       1066.24    709.79   51.08   60.53   44.17    0.51
  0     400       1880.10    942.89   47.21   58.50   39.57    0.47
...

The issues that I am having are:

  1. Is this even a reasonable approach? It feels too "hacky" and does not inspire confidence. It is also very slow, even though I am using a GPU.

  2. What is the best approach for NER training on very large datasets like mine? I understand that spaCy's philosophy is industrial applications and that scalability is taken into account, so I thought there should be a better way to do this. I considered using the API directly and, instead of saving the v3 training data to disk, feeding it to nlp.update() to train my model (roughly as sketched after this list), but I am not sure whether that is still recommended in v3. I used to do it that way in v2 and I'm afraid I might lose computational efficiency. Please advise on the best approach for large-scale training. Thank you!

  3. I tried using Ray for multiprocessing, but I never had success with it: it always crashed halfway through, as reported in the last comment on issue Training NER models on multiple GPUs (not just one) #8093.
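
For reference, this is roughly what I mean by the nlp.update() route in v3 (a minimal sketch only; the placeholder data, epoch count, and batch size are mine, and I don't know whether this pattern is still recommended or whether it gives up the efficiencies of the CLI pipeline):

import random
import spacy
from spacy.training import Example
from spacy.util import minibatch

# Placeholder records; in reality these would stream from my 300k documents.
train_data = [
    ("Apple is hiring in New York.", {"entities": [(0, 5, "ORG"), (19, 27, "GPE")]}),
]

nlp = spacy.blank("en")
nlp.add_pipe("ner")

examples = [Example.from_dict(nlp.make_doc(text), annot) for text, annot in train_data]
optimizer = nlp.initialize(lambda: examples)    # NER labels are inferred from the examples

for epoch in range(3):                          # epoch count is arbitrary here
    random.shuffle(examples)
    losses = {}
    for batch in minibatch(examples, size=64):  # batch size is arbitrary here
        nlp.update(batch, sgd=optimizer, losses=losses)
    print(epoch, losses)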

Your Environment

All of the code above is in one conda_python3 notebook on AWS SageMaker, using an ml.p2.8xlarge EC2 instance.
Python Version Used: 3
spaCy Version Used: 3.0.6

Labels

feat / ner · scaling · training
