Training NER models on multiple GPUs (not just one) #8093
The basic … I think in theory you would want to configure …

Taking a quick look at https://github.com/explosion/spacy-ray/blob/master/spacy_ray/worker.py#L300-L308, the main thing I don't understand is why this bit calls …

I think it would be difficult to use it with a script based around `nlp.update`. Let us know how it works for you if you try it out!
Thank you for your reply. I think it is time for me to move on to spaCy v3, so I converted my code and data to use `spacy ray train`. I am providing some details below before showing you the error I get.

Code to convert the training data to the v3 format - the data is in the v2 format `[(text, {'entities': [(start, end, label), ...]}), ...]`:
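Roughly, the conversion goes like this (a simplified sketch; `train_data` and `val_data` stand in for my v2-format lists):

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

def to_docbin(data, output_path):
    """Convert v2-style [(text, {'entities': [(start, end, label), ...]}), ...] to a .spacy file."""
    db = DocBin()
    for text, annotations in data:
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in annotations["entities"]:
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is not None:  # skip entities that do not align to token boundaries
                ents.append(span)
        doc.ents = ents
        db.add(doc)
    db.to_disk(output_path)

to_docbin(train_data, "./train.spacy")  # train_data / val_data are placeholders
to_docbin(val_data, "./val.spacy")
```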
End of the training-data conversion code. The output is saved to train.spacy and val.spacy for training. So far so good.

I downloaded the base config file from the spaCy guidelines and edited the first two lines (the train and dev paths):
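That is, the `[paths]` block at the top of base_config.cfg now points at the two files above (paths relative to my working directory):

```ini
[paths]
train = "./train.spacy"
dev = "./val.spacy"
```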
Then I used: !python -m spacy init fill-config base_config.cfg config.cfg

Before moving to ray, I thought I'd try training without the GPU first, so I executed: … This worked fine and training started. Afterwards, to push the envelope a little, I tried: … This worked fine too, using the GPU - one core, I presume. Eventually it failed due to memory: … Then I tried the following to use ray: … However, this time I got an error about max_epochs: …

At this point, I have a few questions that I am researching to find answers for.

1- Does max_epochs belong in the config file?

2- I share your confusion about the spacy-ray code, as no matter what the GPU ID is, it uses 0. Along those lines, once the max_epochs issue is resolved, should I use the command below for GPU: … In the command above, the issue is that I am not sure how to tell ray which GPU IDs to use. Does it take a list of IDs? Any ideas?

3- As a note unrelated to GPUs, I already have a trained word2vec model. In v2, I used the following to load it:
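In essence it was a small loop like the sketch below (simplified; this assumes the gensim vectors are added to the spaCy vocab one word at a time, and that w2v_model.txt is my exported text file):

```python
import spacy
from gensim.models import KeyedVectors

nlp = spacy.blank("en")
# w2v_model.txt is the word2vec model exported from gensim in text format
wv = KeyedVectors.load_word2vec_format("w2v_model.txt")
for word in wv.index_to_key:  # "wv.vocab" in gensim < 4.0
    nlp.vocab.set_vector(word, wv[word])
```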
Do you know how to include it in the config file in v3, so that training does not spend time learning the token-to-vector embedding matrix from scratch?
If you have created a model with vectors using …
I created my word2vec model using gensim. Here is the code:
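In outline (simplified; the hyperparameters here are illustrative, not my exact settings):

```python
from gensim.models import Word2Vec

# `sentences` stands in for an iterable of tokenized documents (list of lists of strings)
model = Word2Vec(
    sentences=sentences,
    vector_size=300,   # "size" in gensim < 4.0
    window=5,
    min_count=2,
    workers=4,
)
# export in the plain-text word2vec format so it can be converted for spaCy later
model.wv.save_word2vec_format("w2v_model.txt")
```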
In v2, I used the code at the end of my last post to load the word2vec model. I am experimenting to see if I can do something similar in v3.
Yes, just run `spacy init vectors`. Since spacy actually doesn't include any code for training static word vectors, the only way to include them is to use the `initialize.vectors` setting.
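For example (the input file and output directory names are placeholders):

```bash
python -m spacy init vectors en w2v_model.txt ./my_vectors
```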
Thanks for that! I really appreciate your support. I read the guideline and successfully converted the word2vec txt model to the v3 format, which is saved in a folder called "vocab", using: !python -m spacy init vectors en w2v_model.txt ./

I also found the `include_static_vectors` field under `[components.tok2vec.model.embed]` in the base_config.cfg file and changed it to true. Interestingly, "True" returned an error and "true" worked.

How could I use `initialize.vectors`? Should I add it under `[initialize]` in base_config.cfg, as sketched at the end of this comment? It seems to be working, and the NER loss is lower than it used to be. Is there any way/test to ensure that my word vectors are actually being used as the initial values?

Going forward, I will shift my focus to ray. In the first phase, I will try multiple CPUs using the following command, and once I get this working, I will try multiple GPUs: !python -m spacy ray train config.cfg --n-workers 2 --output ./output

At this time, the problem is that I am getting an error right after "ℹ Using CPU". I posted it as a separate issue. Once I get it resolved, I will proceed with a test for multiple GPUs.
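For reference, the two config edits together look roughly like this (values are placeholders; the `[initialize]` path should point at whatever directory `init vectors` wrote its output to):

```ini
[components.tok2vec.model.embed]
# ...other keys in this section unchanged...
include_static_vectors = true

[initialize]
# directory created by `python -m spacy init vectors ...`
vectors = "./my_vectors"
```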
I'm having the same issue regarding the max_epochs error. I've looked at …
@thejamesmarq I am happy that it was not just me. Also, I'm glad that someone else is working towards multiple-GPU training. @adrianeboyd fixed the "max_epoch" issue about 10 hours ago (see #8137) and released a new version of spacy-ray. I reinstalled and the max_epoch problem is gone. If they had a Patreon, I would not hesitate for a second to contribute! I have not tried the multiple-GPU solution yet, but the multi-CPU solution using ray seems to be working! I cannot say whether it increased the training speed, though - I need to dig in further. @thejamesmarq Could I ask you to give it a shot and let me know if spacy-ray v0.1.2 works for you - with and without GPUs? I am a little surprised that all 4 of your GPU cores were working, as the workers are for CPUs as far as I understand, not GPUs. Did you use the following command? !python -m spacy ray train config.cfg --n-workers 2 --output ./output --gpu-id 0
Tried this out on a machine with 4 GPUs, using …

I'm wondering if that is at all related to how …
I believe this will use at most 1 GPU, although I could be wrong. Also, I'm noticing the …
Gotcha! I think we could determine the GPU ID using the CLI option like you mentioned: !python -m spacy ray train config.cfg --n-workers 4 --output ./output --gpu-id 0

However, there are two issues:
Could you try one more time using the CLI command and see if changing the --gpu-id makes a difference in terms of which GPU core is used? I will do the same on my end.
(No need for patreon: this is my job!) For local testing I only have one GPU, so I may not be of much immediate help. The …
Thanks @adrianeboyd. I actually did not know it's your job. You're great at it!! :) I have two questions and four reports for you. I really hope those reports are helpful for the further development of spaCy v3, given how amazing it is!

The questions:

Q1) What am I doing wrong in using -g 0? Below is my CLI: … The error is: … Also, isn't -g 0 a spaCy v2 notion? The spaCy 3 guidelines say to use --gpu-id.

Q2) Isn't the 0 in "-g 0" referring to the GPU ID? How is that different from using --gpu-id as in: …

Now onto the reports. I tried 4 main experiments. All experiments were executed on an ml.p3.8xlarge AWS EC2 instance with: …

1) In the first experiment, I did not use ray, nor did I use the GPU. The command was:

!python -m spacy train config.cfg --output ./output

Starting time: 18:41:46. The output showed "Initializing pipeline" and "Training pipeline", and the first row of the training table was: 0 0 0.00 848.15 0.01 0.01 0.01 0.00. Here is a snapshot of the CPU and GPU usage; it is as expected - one CPU at 100% and 0% on all 4 GPU cores (top at 18:47:27, up 31 min, reported load average 1.00, 0.94, 0.90).

Sorry for the table format; I do not know how to paste it as neatly as @thejamesmarq did. Execution-wise, so far so good.

2) In the second experiment, I still did not use ray, but I did use the GPU, with the following command:

!python -m spacy train config.cfg --output ./output --gpu-id 0

Starting time: 20:18:14. The pipeline initialized and training started, but it stopped with: "⚠ Aborting and saving the final best model. Encountered exception: …"

3) In the third experiment, I did use ray with 8 CPUs, but no GPU, with the following command:

!python -m spacy ray train config.cfg --n-workers 8 --output ./output

Training started, but it then failed with a per-process memory report (PID / MEM / COMMAND), the message that up to 11.08 GiB of shared memory was being used by the Ray object store (with tips about setting the object store size), and repeated errors of the form:

2021-05-20 21:30:22,277 ERROR worker.py:1074 -- Possible unhandled error from worker: ray::Worker.set_param() (pid=50800, ip=172.16.41.157)

each followed by "During handling of the above exception, another exception occurred" and the same ray::Worker.set_param() traceback. Also, here is a snapshot of the CPU and GPU usage before it crashed. It is as expected - multiple CPUs above 100% usage and 0% on all 4 GPU cores (top at 20:26:13, up 24 min, reported load average 9.45, 5.35, 2.89; nvidia-smi at 20:27:06 showed the GPUs idle).

4) In the fourth experiment, I used ray and also the GPU, with the following command:

!python -m spacy ray train config.cfg --n-workers 8 --output ./output --gpu-id 0

Starting time: 20:20:36. …

Follow-up thoughts and questions:

1- ray does not seem to be working for me. It started and successfully completed a few iterations, as shown in experiment 3, but then it failed. When I use a high number of vCPUs with ray, for example 20 cores, it crashes right from the start (starting time: 22:29:04) with an error. So, given that the outcome changes with the number of CPU cores used, I repeated experiment 3 with only 4 cores/workers, and this time it worked fine with no errors (starting time: 21:01:57)! Is there a limit on the number of vCPUs? Below is a snapshot of the CPU and GPU usage for that run - I am not sure why only 2 CPUs are heavily utilized as opposed to 4 (top at 22:30:43, up 2:19, reported load average 2.62, 2.58, 2.74; the GPUs stayed at 0%).

2- It looks like there are issues in spaCy v3 with using the GPU, whether or not ray is used. I got two different kinds of error, one with ray and one without ray, as described in experiments 3 and 4. I understand that the issue may be on my end, not spaCy's, but I successfully ran the training on the GPU using spaCy v2 as described at the beginning of this thread. Any ideas? At this point, I cannot even get one GPU working :)

3- I re-executed experiment 1 to see if the results are reproducible. I am happy to say that I got the exact same result, with a slightly different execution time, which is understandable.

4- When I use ray, it "appears" that the word2vec initialization is ignored. The reason I think that is the tok2vec loss in experiment 1 versus experiment 3: when I use ray in experiment 3, the loss starts at a significantly higher value. Does ray have a different initialization mechanism?

5- When I used 4 cores in the re-execution of experiment 3, the model took a lot longer to train than in experiment 1 with one core - neither experiment used the GPU, since I have not succeeded in using the GPU in spaCy v3 yet. As shown in the results of experiment 1, with one core the training took 1.84 hours; with 4 cores it took 3.21 hours. This is a counter-intuitive outcome, even if we only consider the same number of iterations as the single-core run in experiment 1 - I checked on my end and it took twice as long for the same number of iterations! This gave me the idea that maybe the issue is with ray. So I re-executed experiment 3 (ray without GPU) with one core only, and it took a similar time (under 2 hours) to experiment 1, which does not use ray - the accuracies were also different, but I get that part. In conclusion, something is fishy when using ray: the more cores I use, the slower the execution gets!! I checked ray with 2 cores and it took just over 2 hours! :) Are there any known issues with ray that are being worked on for the v3.1 release?

6- If you teach me how to use -g 0, I am happy to re-run the two GPU experiments and see whether multiple GPU cores get used. Though at this point, even one GPU blows up the memory! Please let me know if there are any experiments you're interested in. I am happy to assist you. Thanks!
Any updates on this?
Hi @delucca, not yet, unfortunately.
Hello,
I am training my NER model using the following code:
Start of Code
End of Code
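In outline, it is the classic `nlp.update` training loop; the sketch below is a simplified stand-in, with `TRAIN_DATA` and the hyperparameters purely illustrative:

```python
import random
import spacy
from spacy.training import Example

spacy.require_gpu()  # only the first GPU is used

# TRAIN_DATA is illustrative: [(text, {"entities": [(start, end, label), ...]}), ...]
TRAIN_DATA = [
    ("Apple is looking at buying a U.K. startup.", {"entities": [(0, 5, "ORG")]}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for _, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        ner.add_label(label)

optimizer = nlp.initialize()
for itn in range(30):
    random.shuffle(TRAIN_DATA)
    losses = {}
    # simple fixed-size batching to keep the sketch short
    for i in range(0, len(TRAIN_DATA), 8):
        batch = TRAIN_DATA[i : i + 8]
        examples = [
            Example.from_dict(nlp.make_doc(text), annots)
            for text, annots in batch
        ]
        nlp.update(examples, drop=0.35, sgd=optimizer, losses=losses)
    print(itn, losses)
```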
The problem:
The issue is that each iteration takes about 30 minutes - I have 8,000 training records, which include very long texts, and 6 labels.
So I was hoping to speed this up by using more GPU cores, but it seems that only one core is being used - when I execute print(util.gpu) in the code above, only the first core returns a non-zero value.
Question 1: Is there any way I could use more GPU cores in the training process to make it faster? I would appreciate any leads.
Edit: After some more research, it seems that spacy-ray is intended to enable parallel training. But I cannot find documentation on using Ray with nlp.update; all I can find is about using "python -m spacy ray train config.cfg --n-workers 2".
Question 2: Does Ray enable parallel processing using GPUs, or is it only for CPU cores?
Question 3: How could I integrate Ray into the Python code I have using nlp.update, as opposed to using "python -m spacy ray train config.cfg --n-workers 2"?
Thank you!
Environment
All of the code above is in one conda_python3 notebook on AWS Sagemaker using ml.p3.2xlarge EC2 instance.
Python Version Used: 3
spaCy Version Used: 3.0.6