gguf : enforce that tensor names are unique #6905

Merged
merged 4 commits into ggml-org:master from xsn/duplicated_tensor_name on Apr 28, 2024

Conversation

ngxson (Collaborator) commented Apr 25, 2024

Resolves #6836

For now, only adding a duplicated tensor is prevented:

  • ggml.c uses gguf_find_tensor to check for duplicated tensors. This is a linear search, so it may not be a suitable implementation when loading a gguf (though it is acceptable when writing one, since writing does not happen very often). What do you think about this point, @slaren?
  • GGUFWriter from gguf-py uses a set to keep track of tensor names when writing a gguf. We can use the same logic for GGUFReader (see the sketch below).
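
To make the set-based idea concrete, here is a minimal, self-contained sketch of the duplicate check; the class and method names are illustrative, not gguf-py's actual API:

```python
class TensorWriterSketch:
    """Illustrative writer that rejects duplicated tensor names."""

    def __init__(self) -> None:
        # A set gives O(1) membership tests, unlike a linear scan.
        self.tensor_names: set[str] = set()

    def add_tensor(self, name: str) -> None:
        if name in self.tensor_names:
            raise ValueError(f"Duplicated tensor name: {name!r}")
        self.tensor_names.add(name)


w = TensorWriterSketch()
w.add_tensor("blk.0.attn_norm.weight")
w.add_tensor("blk.0.attn_norm.weight")  # raises ValueError
```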

github-actions bot commented Apr 25, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 211 iterations 🚀

Details:
  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=22836.94ms p(95)=42547.38ms fails=, finish reason: stop=99 truncated=112
  • Prompt processing (pp): avg=261.91tk/s p(95)=737.96tk/s
  • Token generation (tg): avg=19.17tk/s p(95)=27.0tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=xsn/duplicated_tensor_name commit=b1df2f212e4ab03b6cffbf66a1c26a718bc6677c

[Benchmark charts: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing — llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 211 iterations]

slaren (Member) commented Apr 25, 2024

A linear search is definitely not optimal, but we don't really have a good alternative at the moment, since we don't have implementations of the data structures that would be necessary to prevent this in ggml. Currently the models are small enough that it is not likely to cause a significant performance issue, and fixing the immediate problem is more important. So I think it is ok to use gguf_find_tensor.
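
For illustration only, a Python rendering of the trade-off discussed here — the actual check lives in C in ggml.c, but the shape of the cost is the same: rejecting duplicates via linear search is O(n) per tensor, O(n²) per file:

```python
def find_tensor(loaded: list[str], name: str) -> int:
    """Linear search over already-loaded names, like ggml.c's gguf_find_tensor."""
    for i, existing in enumerate(loaded):
        if existing == name:
            return i
    return -1


def load_tensor_names(names_in_file: list[str]) -> list[str]:
    loaded: list[str] = []
    for name in names_in_file:
        # O(n) scan per tensor; fine while models have only a few
        # hundred tensors, as noted above.
        if find_tensor(loaded, name) != -1:
            raise ValueError(f"duplicated tensor name: {name}")
        loaded.append(name)
    return loaded
```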

@ngxson ngxson marked this pull request as ready for review April 27, 2024 13:49
@ngxson ngxson requested a review from slaren April 27, 2024 13:49
ngxson (Collaborator, Author) commented Apr 27, 2024

@slaren FYI, here is the model that I used for testing: https://huggingface.co/ngxson/test_gguf_models/blob/main/stories260K_duplicated_tensors.gguf

I edited it with a hex editor so that blk.0.attn_norm.weight appears twice.
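
For reference, one hypothetical way to spot such a duplicate from Python, assuming gguf-py's GGUFReader (and a reader version that still tolerates duplicated names rather than rejecting the file outright):

```python
from collections import Counter

from gguf import GGUFReader  # gguf-py, i.e. `pip install gguf`

reader = GGUFReader("stories260K_duplicated_tensors.gguf")
counts = Counter(t.name for t in reader.tensors)
print({name: n for name, n in counts.items() if n > 1})
# expected for the test file above: {'blk.0.attn_norm.weight': 2}
```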

slaren (Member) left a comment

Looks good, I verified that this prevents both loading existing models with duplicated tensors, and creating models with duplicated tensors. The GPT-2 arch conversion in convert-hf-to-gguf.py will need to be updated to avoid duplicating output.weight.
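
A hedged sketch of the kind of follow-up described here (not the actual convert-hf-to-gguf.py change): when an architecture ties its output projection to the token embedding, the conversion loop can skip names it has already emitted instead of writing output.weight a second time:

```python
def emit_tensors(tensors: list[tuple[str, object]], writer) -> None:
    """Write converted tensors, emitting each gguf name at most once (sketch)."""
    seen: set[str] = set()
    for name, data in tensors:
        if name in seen:
            # e.g. GPT-2's tied lm_head maps to "output.weight" a second time
            continue
        seen.add(name)
        writer.add_tensor(name, data)
```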

ngxson (Collaborator, Author) commented Apr 28, 2024

@slaren Thanks for the confirmation. I'm not very familiar with convert-hf-to-gguf.py, so unfortunately I have no clue how to do that. Without that fix, do you think this PR is still OK to merge?

slaren (Member) commented Apr 28, 2024

Yes, that's a different issue; I was just pointing out that this is the reason why some models have duplicated tensors. Merge at will.

@ngxson ngxson merged commit 7bb36cc into ggml-org:master Apr 28, 2024
nopperl pushed a commit to nopperl/llama.cpp that referenced this pull request May 5, 2024
* not allow adding duplicated tensor name

* no duplicated tensor while reading gguf

* typo

* throw exception inside llama_model_loader

Co-authored-by: slaren <slarengh@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>