Vectorize TensorPrimitives.CosineSimilarity<Half> #116898

Merged
merged 1 commit into from
Jun 23, 2025

Conversation

stephentoub
Member

Vectorize for Half by processing it as shorts, using the existing widening routine to two vectors of floats, and operating on those floats. Even for non-vectorized, this improves throughput as each intermediate operation is operating on floats rather than constantly needing to convert back to Half.

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Numerics.Tensors;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    private Half[] _x, _y;

    [Params(1, 10, 100, 1000)]
    public int Length { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        _x = new Half[Length];
        _y = new Half[Length];
        var random = new Random(42);
        for (int i = 0; i < Length; i++)
        {
            _x[i] = (Half)random.NextSingle();
            _y[i] = (Half)random.NextSingle();
        }
    }

    [Benchmark]
    public Half CosineSimilarity() => TensorPrimitives.CosineSimilarity(_x, _y);
}

Before:

| Method           | Length | Mean         |
|------------------|--------|--------------|
| CosineSimilarity | 1      | 64.24 ns     |
| CosineSimilarity | 10     | 241.02 ns    |
| CosineSimilarity | 100    | 2,077.22 ns  |
| CosineSimilarity | 1000   | 20,033.55 ns |

After:

| Method           | Length | Mean      |
|------------------|--------|-----------|
| CosineSimilarity | 1      | 14.59 ns  |
| CosineSimilarity | 10     | 29.79 ns  |
| CosineSimilarity | 100    | 69.57 ns  |
| CosineSimilarity | 1000   | 465.07 ns |
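
The core idea in the description (widen each Half to float once, keep every intermediate value in float, and convert back to Half only at the end) can be illustrated with a scalar-only sketch. This is not the code from the PR: `CosineSketch` and `CosineSimilarityViaFloat` are hypothetical names, the vectorized widening step is omitted, and the final formula is assumed to be the usual dot product divided by the product of the two norms.

```csharp
using System;

// Hypothetical illustration; not the library's implementation.
static class CosineSketch
{
    public static Half CosineSimilarityViaFloat(ReadOnlySpan<Half> x, ReadOnlySpan<Half> y)
    {
        if (x.Length != y.Length)
        {
            throw new ArgumentException("Spans must have the same length.");
        }

        // All three accumulators stay in float for the whole loop.
        float dot = 0f, xx = 0f, yy = 0f;

        for (int i = 0; i < x.Length; i++)
        {
            // Widen each element once.
            float xf = (float)x[i];
            float yf = (float)y[i];

            dot += xf * yf;
            xx += xf * xf;
            yy += yf * yf;
        }

        // Narrow to Half exactly once, at the very end.
        return (Half)(dot / (MathF.Sqrt(xx) * MathF.Sqrt(yy)));
    }
}
```

The only conversions back to Half happen in the return statement, which is the property the description credits for the scalar-path improvement.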

Contributor

Tagging subscribers to this area: @dotnet/area-system-numerics-tensors
See info in area-owners.md if you want to be subscribed.

Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds explicit vectorization support for Half inputs in TensorPrimitives.CosineSimilarity, refactors the core implementation to use common Update/Finalize helpers, and introduces a specialized CosineSimilarityHalfCore that processes Half as widened floats.

  • Adds a generic wrapper for CosineSimilarity<T> that dispatches to a new Half-specific path
  • Refactors existing vector‐and‐scalar loops into shared Update and Finalize methods
  • Implements CosineSimilarityHalfCore with 128/256/512-bit vector and scalar fallbacks for Half
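
As a rough picture of what sharing Update and Finalize steps between a vector loop and a scalar loop might look like, here is a minimal sketch over float inputs using the portable `Vector<float>` type. The helper names `UpdateAccumulators` and `FinalizeResult`, their signatures, and the class name are assumptions made for illustration; the PR's actual helpers and its 128/256/512-bit paths are not reproduced here.

```csharp
using System;
using System.Numerics;

// Hypothetical illustration; names and signatures are assumptions.
static class UpdateFinalizeSketch
{
    // Shared per-step update: accumulate x.y, x.x and y.y lane by lane.
    static void UpdateAccumulators(Vector<float> x, Vector<float> y,
        ref Vector<float> dot, ref Vector<float> xx, ref Vector<float> yy)
    {
        dot += x * y;
        xx += x * x;
        yy += y * y;
    }

    // Shared final step: combine the three sums into the similarity.
    static float FinalizeResult(float dot, float xx, float yy) =>
        dot / (MathF.Sqrt(xx) * MathF.Sqrt(yy));

    public static float CosineSimilarity(ReadOnlySpan<float> x, ReadOnlySpan<float> y)
    {
        if (x.Length != y.Length)
        {
            throw new ArgumentException("Spans must have the same length.");
        }

        Vector<float> vDot = Vector<float>.Zero;
        Vector<float> vXX = Vector<float>.Zero;
        Vector<float> vYY = Vector<float>.Zero;

        // Vector loop: process full vectors while enough elements remain.
        int i = 0;
        for (; i <= x.Length - Vector<float>.Count; i += Vector<float>.Count)
        {
            UpdateAccumulators(new Vector<float>(x.Slice(i)), new Vector<float>(y.Slice(i)),
                ref vDot, ref vXX, ref vYY);
        }

        // Reduce the lanes, then handle leftover elements with scalar code.
        float dot = Vector.Sum(vDot), xx = Vector.Sum(vXX), yy = Vector.Sum(vYY);
        for (; i < x.Length; i++)
        {
            dot += x[i] * y[i];
            xx += x[i] * x[i];
            yy += y[i] * y[i];
        }

        return FinalizeResult(dot, xx, yy);
    }
}
```

Keeping the accumulation and the final division in two small helpers means a Half-specific path only has to add the widening step in front of the same update.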
Comments suppressed due to low confidence (2)

src/libraries/System.Numerics.Tensors/src/System/Numerics/Tensors/netcore/TensorPrimitives.CosineSimilarity.cs:184

  • A new specialized path for Half has been added but no tests for TensorPrimitives.CosineSimilarity on Half arrays appear in this PR. Please add unit tests covering both vectorized and scalar code paths to validate correctness.
        private static Half CosineSimilarityHalfCore(ReadOnlySpan<Half> x, ReadOnlySpan<Half> y)

src/libraries/System.Numerics.Tensors/src/System/Numerics/Tensors/netcore/TensorPrimitives.CosineSimilarity.cs:31

  • The XML doc for CosineSimilarity<T> does not mention the new Half-specialized path. Please update the summary to note that Half inputs are now vectorized via Half → short → float widening.
        public static T CosineSimilarity<T>(ReadOnlySpan<T> x, ReadOnlySpan<T> y)

Member

@tannergooding tannergooding left a comment

LGTM.

It's a bit unfortunate we need to duplicate the CosineSimilarityCore function here. I expect we could have a general operate with m-to-n intermediate helper, but that would be a larger refactoring (and I don't think it's worth blocking this on making that happen).

@stephentoub
Member Author

It's a bit unfortunate we need to duplicate the CosineSimilarityCore function here. I expect we could have a general operate with m-to-n intermediate helper, but that would be a larger refactoring (and I don't think it's worth blocking this on making that happen).

I have such a helper in another PR I'll put up for other methods, but applying it to CosineSimilarity (which doesn't use any of the shared helpers or operators) results in roundtripping between Half and float for each operation, which is measurably worse than staying with float as the accumulator. We can subsequently look at a larger refactoring around our aggregations to enable a) making the accumulation configurable and b) getting CosineSimilarity onto the same helpers (which is desirable, anyway, as it's not currently as robust in its optimizations as the shared helpers are).
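
To make the two accumulation styles concrete, here is a small scalar sketch of a dot product written both ways: one that narrows back to Half after every multiply and add, and one that keeps a float accumulator throughout. `AccumulatorSketch` and both method names are hypothetical, and nothing here measures timing; it only shows where the extra conversions occur.

```csharp
using System;

// Hypothetical illustration of the two accumulation styles.
static class AccumulatorSketch
{
    // Every intermediate result is narrowed to Half and widened again.
    public static Half DotWithHalfRoundTrips(ReadOnlySpan<Half> x, ReadOnlySpan<Half> y)
    {
        if (x.Length != y.Length)
        {
            throw new ArgumentException("Spans must have the same length.");
        }

        Half acc = (Half)0f;
        for (int i = 0; i < x.Length; i++)
        {
            Half product = (Half)((float)x[i] * (float)y[i]); // narrow after multiply
            acc = (Half)((float)acc + (float)product);        // narrow after add
        }
        return acc;
    }

    // Elements are widened once; the accumulator never leaves float.
    public static float DotWithFloatAccumulator(ReadOnlySpan<Half> x, ReadOnlySpan<Half> y)
    {
        if (x.Length != y.Length)
        {
            throw new ArgumentException("Spans must have the same length.");
        }

        float acc = 0f;
        for (int i = 0; i < x.Length; i++)
        {
            acc += (float)x[i] * (float)y[i];
        }
        return acc;
    }
}
```

Besides the extra conversions per element, the round-trip version also rounds to Half precision at every step. For example, with 4096 elements all equal to one, the Half accumulator stops growing at 2048 because 2049 is not representable in Half, while the float accumulator reaches 4096.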

@stephentoub stephentoub merged commit 594f85c into dotnet:main Jun 23, 2025
82 of 89 checks passed
@stephentoub stephentoub deleted the vectorizecshalf branch June 23, 2025 21:13