
[CANN] Fix Multi-NPU execution error #8710

Merged 2 commits on Jul 27, 2024
Conversation

wangshuai09 (Contributor) commented Jul 26, 2024

This PR fixes #8580. Users can now run models across multiple NPUs on the CANN backend by passing -sm layer.

Multi-NPU

root@c4e670a2a558:/home/downloads/src/llama.cpp/build# ./bin/llama-cli -m /home/models/hermes_gguf/Hermes-2-Pro-Llama-3-8B-F16.gguf -p "Building a website can be done in 10 simple steps:" -ngl 32  -sm  layer --seed 1024
warning: not compiled with GPU offload support, --gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
warning: llama.cpp was compiled without CUDA/SYCL/Vulkan. Setting the split mode has no effect.
Log start
main: build = 3467 (01245f5b)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for aarch64-linux-gnu
main: seed  = 1024
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /home/models/hermes_gguf/Hermes-2-Pro-Llama-3-8B-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Hermes-2-Pro-Llama-3-8B
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128288
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128288]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128288]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128003
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 128001
llama_model_loader: - kv  21:                    tokenizer.chat_template str              = {{bos_token}}{% for message in messag...
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  226 tensors
llm_load_vocab: special tokens cache size = 288
llm_load_vocab: token to piece cache size = 0.8007 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128288
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 14.96 GiB (16.00 BPW) 
llm_load_print_meta: general.name     = Hermes-2-Pro-Llama-3-8B
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128003 '<|im_end|>'
llm_load_print_meta: PAD token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128003 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.41 MiB
llm_load_tensors:        CPU buffer size = 15317.52 MiB
llm_load_tensors:       CANN buffer size =  6656.50 MiB
llm_load_tensors:       CANN buffer size =  6656.50 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:       CANN KV buffer size =   512.00 MiB
llama_kv_cache_init:       CANN KV buffer size =   512.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:       CANN compute buffer size =  1260.81 MiB
llama_new_context_with_model:       CANN compute buffer size =   560.00 MiB
llama_new_context_with_model:        CPU compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 5

system_info: n_threads = 192 / 192 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 8192, n_batch = 2048, n_predict = -1, n_keep = 1


Building a website can be done in 10 simple steps: 1. Choose a domain name, 2. Sign up for hosting, 3. Pick a website builder, 4. Create your website, 5. Customize your website, 6. Add content, 7. Test your website, 8. Publish your website, 9. Market your website, 10. Monitor your website.
How do I create a website for my business?
How to Create a Business Website in 8 Steps
  1. Choose a domain name for your business website.
  2. Select a reliable web hosting provider.
  3. Choose a website builder and template.
  4. Customize your website with your branding.
  5. Add website content and features.
  6. Optimize your website for search engines.
  7. Test your website for functionality.
  8. Publish your website and promote it.
How do I start a small website?
Here are some steps you can take to create your own small business website:
How much does it cost to start a website?
The cost of building a website can vary greatly depending on your needs. Here are some factors that will impact the cost: Domain name: $10 – $15 per year. Web hosting: $7 – $25 per month.
What are the tools needed to create a website?
To create a website, you'll need the following tools and resources:
  1. Domain name registrar – to register your domain name.
  2. Web hosting –

llama_print_timings:        load time =    5183.55 ms
llama_print_timings:      sample time =      86.30 ms /   306 runs   (    0.28 ms per token,  3545.81 tokens per second)
llama_print_timings: prompt eval time =     131.36 ms /    13 tokens (   10.10 ms per token,    98.97 tokens per second)
llama_print_timings:        eval time =   40240.60 ms /   305 runs   (  131.94 ms per token,     7.58 tokens per second)
llama_print_timings:       total time =   40790.70 ms /   318 tokens
+------------------------------------------------------------------------------------------------+
| npu-smi 23.0.3                   Version: 23.0.3                                               |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     x                  | OK            | 119.7       51                0    / 0             |
| 0                         | 0000:C1:00.0  | 10          0    / 0          12045/ 65536         |
+===========================+===============+====================================================+
| 2     x                  | OK            | 122.6       50                0    / 0             |
| 0                         | 0000:C2:00.0  | 8           0    / 0          11343/ 65536         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| 0       0                 | 72924         | llama-cli                | 8778                    |
+===========================+===============+====================================================+
| 2       0                 | 72924         | llama-cli                | 8077                    |
+===========================+===============+====================================================+

Single-NPU

root@c4e670a2a558:/home/downloads/src/llama.cpp/build# ./bin/llama-cli -m /home/models/hermes_gguf/Hermes-2-Pro-Llama-3-8B-F16.gguf -p "Building a website can be done in 10 simple steps:" -ngl 32  -sm none --seed 1024
warning: not compiled with GPU offload support, --gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
warning: llama.cpp was compiled without CUDA/SYCL/Vulkan. Setting the split mode has no effect.
Log start
main: build = 3467 (01245f5b)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for aarch64-linux-gnu
main: seed  = 1024
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /home/models/hermes_gguf/Hermes-2-Pro-Llama-3-8B-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Hermes-2-Pro-Llama-3-8B
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128288
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128288]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128288]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128003
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 128001
llama_model_loader: - kv  21:                    tokenizer.chat_template str              = {{bos_token}}{% for message in messag...
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  226 tensors
llm_load_vocab: special tokens cache size = 288
llm_load_vocab: token to piece cache size = 0.8007 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128288
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 14.96 GiB (16.00 BPW) 
llm_load_print_meta: general.name     = Hermes-2-Pro-Llama-3-8B
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128003 '<|im_end|>'
llm_load_print_meta: PAD token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128003 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors:        CPU buffer size = 15317.52 MiB
llm_load_tensors:       CANN buffer size = 13313.00 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:       CANN KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:       CANN compute buffer size =  1260.81 MiB
llama_new_context_with_model:        CPU compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 4

system_info: n_threads = 192 / 192 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 8192, n_batch = 2048, n_predict = -1, n_keep = 1


Building a website can be done in 10 simple steps: 1. Choose a domain name, 2. Sign up for hosting, 3. Pick a website builder, 4. Create your website, 5. Customize your website, 6. Add content, 7. Test your website, 8. Publish your website, 9. Market your website, 10. Monitor your website.
How do I create a website for my business?
How to Create a Business Website in 8 Steps
  1. Choose a domain name for your business website.
  2. Select a reliable web hosting provider.
  3. Choose a website builder and template.
  4. Customize your website with your branding.
  5. Add website content and features.
  6. Optimize your website for search engines.
  7. Test your website for functionality.
  8. Publish your website and promote it.
How do I start a small website?
Here are some steps you can take to create your own small business website:
How much does it cost to start a website?
The cost of building a website can vary greatly depending on your needs. Here are some factors that will impact the cost: Domain name: $10 – $15 per year. Web hosting: $7 – $25 per month.
What are the tools needed to create a website?
To create a website, you'll need the following tools and resources:
  1. Domain name registrar – to register your domain name.
  2. Web hosting – to store your website files and make them accessible to visitors.
  3. Website builder or content management system (CMS) – to create and design your website.
  4. Text editor – to write and edit your website content.
How can I create a website for free?
How to Make a Free Website in 10 Simple Steps
  1. Choose your niche and target audience.
  2. Choose a unique and memorable domain name.
  3. Select a reliable website builder.
  4. Customize your website design.
  5. Create quality content for your website.
  6. Optimize your website for search engines.
  7. Test your website for functionality.
  8. Publish your website.
  9. Promote your website.
  10. Monitor your website.
What is the best website builder?
The best website builder for most people is WordPress. It's easy to use, has an incredible range of themes and plugins, and is the most popular platform in the world. However, it does require a little more technical know-how than other options.
How much should I charge for website design?
Web design prices can vary greatly depending on the scope of the project and the specific requirements. On average, a small business website can cost between $1,000 to $5,000. Larger websites for enterprises can cost between $5,000 to $25,000 or more. Some factors that can affect pricing include:
What is the best website builder for small business?
The best website builder for small businesses in 2021:
  1. Wix. The best overall website builder for small businesses.
  2. Squarespace. Excellent for creatives and designers.
  3. Weebly. A solid choice for small businesses with an e-commerce focus.
  4. Shopify. The best for small businesses selling physical products.
  5. WordPress. Good for SEO and customizable sites.
How do I create a website for free without hosting?
Here's how to create a free website without any hosting:
  1. Choose a website builder. You can choose from various website builders like WordPress.com, Weebly, Wix, or Google Sites.
  2. Sign up for an account. Sign up for an account with the website builder you've chosen.
  3. Choose a template.
  4. Customize your website.
  5. Publish your website.
  6. Promote your website.
How can I make a simple website?
Here are the steps to create a simple website:
  1. Choose a website builder. You can choose from a variety of website builders, like WordPress, Wix, or Squarespace.
  2. Sign up for an account.
  3. Pick a template.
  4. Customize your website.
  5. Add your content.
  6. Test your website.
  7. Publish your website.
  8. Promote your website.
How long does it take to make a website?
The time it takes to make a website can vary depending on the complexity of the site and your level of expertise. A simple website can take a few hours, while a more complex site with many pages and features can take several weeks or even months. If you're hiring a web developer

llama_print_timings:        load time =    4250.14 ms
llama_print_timings:      sample time =    1096.20 ms /  4540 runs   (    0.24 ms per token,  4141.56 tokens per second)
llama_print_timings: prompt eval time =     134.60 ms /    13 tokens (   10.35 ms per token,    96.58 tokens per second)
llama_print_timings:        eval time =  582180.27 ms /  4539 runs   (  128.26 ms per token,     7.80 tokens per second)
+------------------------------------------------------------------------------------------------+
| npu-smi 23.0.3                   Version: 23.0.3                                               |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     x                  | OK            | 149.9       53                0    / 0             |
| 0                         | 0000:C1:00.0  | 37          0    / 0          19210/ 65536         |
+===========================+===============+====================================================+
| 2     x                  | OK            | 95.4        50                0    / 0             |
| 0                         | 0000:C2:00.0  | 0           0    / 0          3314 / 65536         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| 0       0                 | 81422         | llama-cli                | 15941                   |
+===========================+===============+====================================================+
| No running processes found in NPU 2                                                            |
+===========================+===============+====================================================+

github-actions bot added the "ggml" label (changes relating to the ggml tensor library for machine learning) on Jul 26, 2024
hipudding self-requested a review on Jul 27, 2024 06:26
hipudding added the "Ascend NPU" label (issues specific to Ascend NPUs) on Jul 27, 2024
// wait on dst stream for the copy to complete
ACL_CHECK(aclrtStreamWaitEvent(cann_ctx_dst->stream(),
                               cann_ctx_src->copy_event));
// TODO: workaround; the Event didn't work here.
Collaborator:
A better way to synchronize two streams is to use events.

wangshuai09 (Contributor, Author):
Sure, I will try using an Event later; for now this is a workaround to make multi-NPU execution work.

hipudding (Collaborator):
Please use "module: title" as the commit log title.

@hipudding hipudding merged commit bfb4c74 into ggerganov:master Jul 27, 2024
53 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 27, 2024
* cann: fix multi-npu exec error

* cann: update comment  for ggml_backend_cann_supports_buft
Labels: Ascend NPU (issues specific to Ascend NPUs), ggml (changes relating to the ggml tensor library for machine learning)

Successfully merging this pull request may close these issues:

Bug: Multi-NPU execution error (#8580)

2 participants