Converted BART model is slower than the original one during inference time #14
Comments
Hi,
Hi grimoire, you can test by running python issue.py directly, and you should see the outputs below. As you can see, if the TensorRT model is run directly, it is faster than the original model, but if it is followed by some GPU-to-CPU operation, it becomes much slower than the original model. How should I solve this problem?
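For reference (not stated in the thread): CUDA kernels are launched asynchronously, so a bare time.time() around a forward pass mostly measures kernel-launch overhead, while the first GPU-to-CPU transfer afterwards has to wait for all queued work and therefore absorbs the real compute time. A minimal sketch of that effect, assuming model is any CUDA module that returns a tensor and x is a matching input; both names are placeholders:

import time
import torch

def time_with_and_without_sync(model, x, device="cuda"):
    # Without synchronization the timer stops as soon as the kernels are
    # queued, not when they finish, so the forward pass looks "fast".
    t0 = time.time()
    y = model(x)
    t_launch = time.time() - t0

    # The first GPU-to-CPU copy (or .item(), .numpy(), printing the tensor)
    # must wait for the queued kernels, so it looks "slow" even though the
    # copy itself is cheap.
    t0 = time.time()
    _ = y.cpu()
    t_copy_plus_wait = time.time() - t0

    # Fair timing: synchronize before starting and before stopping the clock.
    torch.cuda.synchronize(device)
    t0 = time.time()
    y = model(x)
    torch.cuda.synchronize(device)
    t_true = time.time() - t0
    return t_launch, t_copy_plus_wait, t_true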
OK, I will let you know when I find something. I am not a pro at NLP, so this might take some time.
OK, thanks for your suggestion. I will try it and do more investigation in the meantime.
I just updated the demo project repository and added the method I used to convert to the TensorRT model; it is in the issue.py file, under the method name decoderlayer_convertor_dynamic. I tested with torch.cuda.synchronize() again, and the result is below: it looks like the converted model is slower than the original one. I have also converted with another repo called torch2trt, but without dynamic shape support; the converted TensorRT model with a fixed input shape is 6 times faster, but that does not work for me, because the decoding process has to deal with dynamic shapes.
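This is not the conversion path used in issue.py, but for comparison: another common way to get a dynamic-shape TensorRT engine is to export one decoder layer to ONNX with dynamic_axes and build it with trtexec, giving min/opt/max shape profiles. The checkpoint, wrapper, and shapes below are illustrative assumptions, not the repo's converter:

import torch
from transformers import BartModel

# Assumption: bart-base (d_model = 768); the issue does not say which variant is used.
model = BartModel.from_pretrained("facebook/bart-base").eval().cuda()
layer = model.decoder.layers[0]

class DecoderLayerWrapper(torch.nn.Module):
    """Expose only the tensor inputs we want as ONNX graph inputs."""
    def __init__(self, layer):
        super().__init__()
        self.layer = layer

    def forward(self, hidden_states, encoder_hidden_states):
        # Return only the hidden-state output of the decoder layer.
        return self.layer(hidden_states,
                          encoder_hidden_states=encoder_hidden_states)[0]

hidden = torch.randn(1, 1, 768).cuda()        # (batch, tgt_len, d_model)
enc_hidden = torch.randn(1, 64, 768).cuda()   # (batch, src_len, d_model)

torch.onnx.export(
    DecoderLayerWrapper(layer), (hidden, enc_hidden), "decoder_layer.onnx",
    input_names=["hidden_states", "encoder_hidden_states"],
    output_names=["output"],
    dynamic_axes={"hidden_states": {1: "tgt_len"},
                  "encoder_hidden_states": {1: "src_len"}},
    opset_version=13)

# Then build a dynamic-shape engine with explicit profiles, e.g.:
#   trtexec --onnx=decoder_layer.onnx \
#     --minShapes=hidden_states:1x1x768,encoder_hidden_states:1x1x768 \
#     --optShapes=hidden_states:1x32x768,encoder_hidden_states:1x64x768 \
#     --maxShapes=hidden_states:1x256x768,encoder_hidden_states:1x256x768 --fp16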
Are you using this repo to accelerate the decoder?
The difference between the output of fp16 and fp32 might be inevitable. The significand precision of fp16 is 10 bits and the exponent is 5 bits. That limits the precision of the model.

self.q_proj.bias = None
query_states = self.q_proj(hidden_states)  # * self.scaling
query_states = query_states * 10
return query_states

A 5-bit exponent cannot give enough precision. Still working on it.
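A quick way to see the fp16 limits grimoire describes (10-bit significand, 5-bit exponent) is to run the same matmul in half and single precision and compare; the sizes below are arbitrary:

import torch

torch.manual_seed(0)
a = torch.randn(64, 1024, device="cuda")
w = torch.randn(1024, 1024, device="cuda")

ref = a @ w                           # fp32 reference
half = (a.half() @ w.half()).float()  # same matmul in fp16

# With a 10-bit significand, fp16 keeps roughly three decimal digits, so a
# relative error around 1e-3 is expected rather than a sign of a bug.
rel_err = (ref - half).abs().max() / ref.abs().max()
print(f"max relative error fp16 vs fp32: {rel_err.item():.2e}")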
Hello grimoire,
Sorry, I still do not have any solution.
And, by the way, have you tried to increase num_test?
I set num_test = 100:

# raw test
## warmup
if first_token:
    y = decoder_layer(
        decoder_layer_hidden_states,
        encoder_hidden_states=de_encoder_hidden_states,
        encoder_attention_mask=de_encoder_layer_attention_mask)
else:
    y = decoder_layer(
        decoder_layer_hidden_states,
        encoder_hidden_states=de_encoder_hidden_states,
        encoder_attention_mask=de_encoder_layer_attention_mask,
        attention_mask=decoder_layer_attention_mask)
for _ in range(num_test):
    start = time.time()
    torch.cuda.synchronize(device)
    with torch.no_grad():
        if first_token:
            y = decoder_layer(
                decoder_layer_hidden_states,
                encoder_hidden_states=de_encoder_hidden_states,
                encoder_attention_mask=de_encoder_layer_attention_mask)
        else:
            y = decoder_layer(
                decoder_layer_hidden_states,
                encoder_hidden_states=de_encoder_hidden_states,
                encoder_attention_mask=de_encoder_layer_attention_mask,
                attention_mask=decoder_layer_attention_mask)
    torch.cuda.synchronize(device)
    end = time.time()
    raw_time = end - start
    raw_times.append(raw_time)

# trt_test
## warmup
if first_token:
    y_trt = decoder_layer_tensorrt(decoder_layer_hidden_states,
                                   de_encoder_hidden_states,
                                   de_encoder_layer_attention_mask)
else:
    y_trt = decoder_layer_tensorrt(decoder_layer_hidden_states,
                                   de_encoder_hidden_states,
                                   de_encoder_layer_attention_mask,
                                   decoder_layer_attention_mask)
for _ in range(num_test):
    start = time.time()
    torch.cuda.synchronize(device)
    with torch.no_grad():
        if first_token:
            y_trt = decoder_layer_tensorrt(
                decoder_layer_hidden_states, de_encoder_hidden_states,
                de_encoder_layer_attention_mask)
        else:
            y_trt = decoder_layer_tensorrt(
                decoder_layer_hidden_states, de_encoder_hidden_states,
                de_encoder_layer_attention_mask,
                decoder_layer_attention_mask)
    torch.cuda.synchronize(device)
    end = time.time()
    tensorrt_time = end - start
    tensorrt_times.append(tensorrt_time)

times = [raw_time / tensorrt_time
         for raw_time, tensorrt_time in zip(raw_times, tensorrt_times)]

This gives me nearly 1:1, as expected. And this is the log of trtexec:
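Not something proposed in the thread, but a common cross-check for this kind of benchmark: CUDA events time the work on the GPU itself, so they are less sensitive to exactly where the host-side timer and synchronize calls sit. A sketch, with run_layer standing in for either decoder_layer or decoder_layer_tensorrt wrapped in a no-argument callable:

import torch

def cuda_event_time_ms(run_layer, num_test=100):
    """Average GPU time in milliseconds for run_layer(), a no-arg callable."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    times_ms = []
    with torch.no_grad():
        run_layer()                   # warmup
        for _ in range(num_test):
            start.record()
            run_layer()
            end.record()
            torch.cuda.synchronize()  # wait so elapsed_time is valid
            times_ms.append(start.elapsed_time(end))
    return sum(times_ms) / len(times_ms)

# e.g. cuda_event_time_ms(lambda: decoder_layer_tensorrt(
#          decoder_layer_hidden_states, de_encoder_hidden_states,
#          de_encoder_layer_attention_mask))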
Hi there,
I have a project that uses Facebook BART for news summarization. In order to make inference faster, we are trying to convert part of the model to TensorRT and then integrate it back into the original model.
Via this repo, I have successfully converted the Facebook BART decoder layers to a TensorRT model and successfully integrated them. However, the total inference time for generated tokens of the new BART model (i.e. the model integrated with the converted TensorRT decoder layers) is 2 times slower than the original one. So I tried to find out why, and I found that the new BART model itself is faster than the original one; see the code below. line1 is faster than before after switching to the new BART model,
but it became much slower after line2:
line1: outputs = self(model_inputs, return_dict=True)
line2: next_token_logits = outputs.logits[:, -1, :]
line3: next_token_logits = self.adjust_logits_during_generation(
line4: next_token_logits, cur_len=cur_len, max_length=max_length)
Below you can find the speed comparison between the new BART model and the original one (corresponding to the results of code line1 above),
and below that the speed comparison between the new BART model and the original one (corresponding to the results of the code after line2 above).
Does anyone know why it becomes slow after the line1 code above?
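One way to make the comparison independent of where the hidden synchronization point lands (not something from the thread, just a suggestion): time the whole generation call with explicit synchronization, once for the original model and once for the TensorRT-patched one. model, input_ids, and the generation arguments are placeholders:

import time
import torch

def timed_generate(model, input_ids, num_runs=10, **gen_kwargs):
    """End-to-end generation latency; per-line timings inside the generation
    loop can be misleading because CUDA ops run asynchronously."""
    model.eval()
    times = []
    with torch.no_grad():
        model.generate(input_ids, **gen_kwargs)      # warmup
        for _ in range(num_runs):
            torch.cuda.synchronize()
            t0 = time.time()
            model.generate(input_ids, **gen_kwargs)
            torch.cuda.synchronize()
            times.append(time.time() - t0)
    return sum(times) / len(times)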