ggml : full ALiBi support #7192

Merged
merged 10 commits into from
May 11, 2024

Conversation

ggerganov
Owner

@ggerganov ggerganov commented May 10, 2024

Implementing ALiBi as explained here: https://github.com/ofirpress/attention_with_linear_biases

If I understand correctly, the ALiBi bias can become part of the KQ_mask:

A - ALiBi integer matrix (0.0 when no ALiBi)
m - ALiBi head-specific slope parameter (1.0 when no ALiBi)

KQ_mask = causal_mask + A

soft_max(KQ*scale + KQ_mask*m)

Therefore, there is no need to create the KQ_pos tensor as I initially thought. If this is correct, we can simplify the ggml_soft_max_ext() operator and no longer pass the positions tensor. Extending Flash Attention support should also be possible and simple.

This PR is needed to properly support Jina embedding models: #6826

Workflow

  • Remove ggml_alibi()
  • Update ggml_soft_max_ext() to no longer accept pos tensor:
    • CPU
    • Metal
    • CUDA
    • SYCL
    • Vulkan (requires change similar to d0592d4, cc @0cc4m)
  • Update ggml_flash_attn_ext() to support the new ALiBi KQ_mask:
    • CPU
    • Metal
    • CUDA

Tests

make -j && ./main -m ./models/refact-1b-base/ggml-model-f16.gguf -p "bool is_prime(" -e -n 256 -s 1 --temp 0.0 --verbose-prompt
make -j && ./infill -m models/refact-1b-fim/ggml-model-f16.gguf --in-prefix "def helloworld(): print(\"hel" --in-suffix " print(\"goodbye world\") " -ngl 99 --temp 0 --verbose-prompt

@ggerganov ggerganov force-pushed the gg/refactor-alibi-2 branch from 922a5b3 to d0592d4 Compare May 10, 2024 08:17
@ggerganov ggerganov force-pushed the gg/refactor-alibi-2 branch from a4c7cf7 to 166e60b Compare May 10, 2024 08:48
@mofosyne mofosyne added the Review Complexity : High, enhancement, and model labels May 10, 2024
@ggerganov ggerganov force-pushed the gg/refactor-alibi-2 branch from ba4d12a to 97c27f5 Compare May 10, 2024 10:58
Comment on lines +2820 to +2821
// TODO: is there a better way to handle -INFINITY?
dst_data[i00] = src[0] == -INFINITY ? -MAXHALF : src[0];
Owner Author

@ggerganov ggerganov May 10, 2024


I am not sure I fully understand the problem, but when we cast the KQ_mask from F32 to F16 in this kernel, the F32 -INFINITY values are converted to some F16 value that, when multiplied by the ALiBi slope, results in garbage. Even if I force the slope to be 1.0h, it still produces garbage. I expected that it would still be -INFINITY, but that is not the case. Since there is no way to print these values in Metal, this is the workaround that I found to work, but it feels a bit poor.


for (int i1 = ir0; i1 < ir1; i1++) {
// ALiBi
const uint32_t h = (i1/ne01)%ne02; // head
const float slope = (max_bias > 0.0f) ? h < n_head_log2 ? powf(m0, h + 1) : powf(m1, 2*(h - n_head_log2) + 1) : 1.0f;
Contributor


I do not see how this works directly. How does it implement this logic?

[0, 1, 2, 3], [1, 0, 1, 2]

And where does the negative slope, or the negativity of this matrix, come from?

Owner Author


We compute the integer matrix here:

llama.cpp/llama.cpp

Lines 10966 to 10987 in f7055d3

// For causal attention, use only the previous KV cells
// of the correct sequence for each token of the batch.
// It's assumed that if a token in the batch has multiple sequences, they are equivalent.
for (int h = 0; h < 1; ++h) {
    for (int j = 0; j < n_tokens; ++j) {
        const llama_pos    pos    = batch.pos[j];
        const llama_seq_id seq_id = batch.seq_id[j][0];
        for (int i = 0; i < n_kv; ++i) {
            float f;
            if (!lctx.kv_self.cells[i].has_seq_id(seq_id) || lctx.kv_self.cells[i].pos > pos) {
                f = -INFINITY;
            } else {
                if (hparams.use_alibi) {
                    f = -fabs(lctx.kv_self.cells[i].pos - pos);
                } else {
                    f = 0.0f;
                }
            }
            data[h*(n_kv*n_tokens) + j*n_kv + i] = f;
        }
    }

We store it in KQ_mask. The slope is just the head-specific hyperparameter m from the link.
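A stripped-down, self-contained version of the mask construction above may make the layout easier to see. This sketch is hypothetical: it uses plain integer position arrays instead of the KV cache and batch structures, and it omits the sequence-id check entirely.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>

// Hypothetical helper: fill an n_tokens x n_kv mask in row-major order.
//   kv_pos[i]  - position stored in KV cell i
//   tok_pos[j] - position of batch token j
// Cells in the future get -INFINITY; otherwise the value is -|distance| when
// ALiBi is used, or 0.0f when it is not. The sequence-id check is omitted.
static void build_kq_mask_sketch(const int * kv_pos, int n_kv,
                                 const int * tok_pos, int n_tokens,
                                 bool use_alibi, float * data) {
    assert(n_kv > 0 && n_tokens > 0);
    for (int j = 0; j < n_tokens; ++j) {
        for (int i = 0; i < n_kv; ++i) {
            float f;
            if (kv_pos[i] > tok_pos[j]) {
                f = -INFINITY;                                   // causal masking
            } else if (use_alibi) {
                f = -fabsf((float) (kv_pos[i] - tok_pos[j]));    // ALiBi distance
            } else {
                f = 0.0f;
            }
            data[j*n_kv + i] = f;
        }
    }
}
```

The matrix is already negative when it is stored, so the kernel only has to multiply it by a positive per-head slope.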

Contributor


ah ok, good thanks!

@ggerganov
Owner Author

I think this is ready. @JoanFM, I will now rebase the Jina branch on top of this branch and will adapt to the changes. Will ping you when ready so we can do some tests and verify that this works. If it is good, will mark this PR ready for review and proceed

@ggerganov
Owner Author

ggerganov commented May 10, 2024

Updated the branch in #6826 and my embedding tests using Jina worked correctly, so it is ready for further tests. Let me know if you spot something that is not right

@ggerganov ggerganov marked this pull request as ready for review May 10, 2024 12:32
@ggerganov ggerganov requested a review from slaren May 10, 2024 12:33
@JoanFM
Contributor

JoanFM commented May 10, 2024

> Updated the branch in #6826 and my embedding tests using Jina worked correctly, so it is ready for further tests. Let me know if you spot something that is not right

I have done some tests on my end and it seems to work fine.

ggml.c (outdated)
float scale = 1.0f;
float max_bias = 0.0f;

memcpy(&scale, (float *) dst->op_params + 0, sizeof(float));
Contributor


Just to confirm: this is needed because we also use a memcpy to set the params, instead of making op_params a float[], for alignment reasons, right?

Owner Author


Yes, correct
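The pattern being confirmed here, writing and reading float parameters through memcpy into a parameter array of a different element type, can be sketched as follows. The struct and function names are hypothetical stand-ins, and the 32-bit integer storage with 16 slots is an assumption for illustration rather than a statement about ggml's actual layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Hypothetical stand-in for a tensor that stores operator parameters in
// integer slots (assumed layout, for illustration only).
struct op_params_sketch {
    int32_t op_params[16];
};

// Store a float in slot idx by copying its bytes.
static void set_param_f32_sketch(struct op_params_sketch * t, int idx, float v) {
    assert(idx >= 0 && idx < 16);
    assert(sizeof(float) == sizeof(int32_t));
    memcpy(&t->op_params[idx], &v, sizeof(float));
}

// Read the float back the same way, without dereferencing a cast pointer.
static float get_param_f32_sketch(const struct op_params_sketch * t, int idx) {
    assert(idx >= 0 && idx < 16);
    float v;
    memcpy(&v, &t->op_params[idx], sizeof(float));
    return v;
}
```

Copying bytes in both directions keeps the setter and getter symmetric, which is the point raised in the question above.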

Collaborator

@NeoZhangJianyu NeoZhangJianyu left a comment


I tested the SYCL CI on an Intel GPU.
The quality is OK!
soft_max is the same as base.

@ggerganov ggerganov force-pushed the gg/refactor-alibi-2 branch from a616605 to 0faf92e Compare May 10, 2024 14:21
@mofosyne mofosyne added the refactoring Refactoring label May 10, 2024
@ggerganov ggerganov merged commit 9cb317f into master May 11, 2024
59 of 64 checks passed
@JohannesGaessler
Collaborator

I think the total amount of work needed for conflict resolution would have been lower if #7188 had been merged first, but what's done is done.

Contributor

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 548 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8511.54ms p(95)=20878.8ms fails=, finish reason: stop=484 truncated=64
  • Prompt processing (pp): avg=91.75tk/s p(95)=403.42tk/s
  • Token generation (tg): avg=34.45tk/s p(95)=45.69tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=gg/refactor-alibi-2 commit=03e940cdec1b91b848b9652e61cc3a2f4541d171

Charts: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 548 iterations (x-axis range 1715437926 --> 1715438554)

  • llamacpp:prompt_tokens_seconds
  • llamacpp:predicted_tokens_seconds
  • llamacpp:kv_cache_usage_ratio
  • llamacpp:requests_processing

Labels
enhancement, model, refactoring, Review Complexity : High
6 participants