Performance optimization opportunities in common pixel formats.

### Prerequisites

- [X] I have written a descriptive issue title
- [X] I have verified that I am running the latest version of ImageSharp
- [X] I have verified if the problem exists in both `DEBUG` and `RELEASE` mode
- [X] I have searched [open](https://github.com/SixLabors/ImageSharp/issues) and [closed](https://github.com/SixLabors/ImageSharp/issues?q=is%3Aissue+is%3Aclosed) issues to ensure it has not already been reported

### ImageSharp version

v3 alpha + 

### Other ImageSharp packages and versions

NA

### Environment (Operating system, version and so on)

NA

### .NET Framework version

NA

### Description

As described [here](https://github.com/SixLabors/ImageSharp/pull/2230#issuecomment-1248800429), there are several performance optimizations that can be implemented in many of our pixel format types. This should be fairly low-hanging fruit with a good return.

Notably on .NET 6/7, you could make this even more efficient by doing something like:
```csharp
public void Pack(Vector4 vector)
{
    vector *= MaxBytes;
    vector += Half;
    vector = Vector4.Clamp(vector, Vector4.Zero, MaxBytes);

    Vector128<byte> result = Sse2.ConvertToVector128Int32WithTruncation(vector.AsVector128()).AsByte();
    // In .NET 7+ the above can be `result = Vector128.ConvertToInt32(vector.AsVector128()).AsByte()` so it works on Arm64 too

    R = result.GetElement(0);
    G = result.GetElement(4);
    B = result.GetElement(8);
    A = result.GetElement(12);
}
```

This converts all 4 elements at once and then extracts the truncated bytes directly:
```asm
vzeroupper
vmovupd xmm0, [0x7ffd160105c0]
vmovaps xmm1, xmm0
vmulps xmm1, xmm1, [rdx]
vmovupd [rdx], xmm1
vmovupd xmm1, [0x7ffd160105d0]
vaddps xmm1, xmm1, [rdx]
vmovupd [rdx], xmm1
vmovupd xmm1, [rdx]
vxorps xmm2, xmm2, xmm2
vmaxps xmm1, xmm1, xmm2
vminps xmm0, xmm1, xmm0
vmovupd [rdx], xmm0
vcvttps2dq xmm0, [rdx]
vpextrb eax, xmm0, 0
mov [rcx+2], al
vpextrb eax, xmm0, 4
mov [rcx+1], al
vpextrb eax, xmm0, 8
mov [rcx], al
vpextrb eax, xmm0, 0xc
mov [rcx+3], al
ret
```
* We'll also be improving the codegen around `vpextrb` more in the future so it can be just `vpextrb [rcx+2], xmm0, 0` instead of `vpextrb eax, xmm0, 0` followed by `mov [rcx+2], al`.
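
For comparison, here is a minimal scalar sketch of what the vectorized `Pack` computes per channel: scale, add one half, clamp, then truncate. The names `PackSketch`, `PackChannel` and `PackReference` are hypothetical and not part of ImageSharp. They are only meant as a reference to check a vectorized version against.

```csharp
using System;
using System.Numerics;

public static class PackSketch
{
    // Hypothetical helper: scalar equivalent of one lane of the vectorized Pack.
    public static byte PackChannel(float value)
    {
        float scaled = (value * 255f) + 0.5f;   // scale to byte range, bias so truncation rounds
        scaled = Math.Clamp(scaled, 0f, 255f);  // same bounds as Vector4.Clamp(vector, Zero, MaxBytes)
        return (byte)scaled;                    // truncating conversion, as cvttps2dq does per lane
    }

    // Hypothetical helper: packs all four channels one at a time.
    public static (byte R, byte G, byte B, byte A) PackReference(Vector4 vector)
        => (PackChannel(vector.X), PackChannel(vector.Y), PackChannel(vector.Z), PackChannel(vector.W));
}
```

Because the clamp runs after the `+ 0.5f` bias, an input of `1f` becomes `255.5f`, is clamped to `255f`, and packs to `255`.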

You can also optimize in .NET 6+ by directly using `Vector128.Create()`. This creates a method-local constant and avoids the static initializer entirely:
```csharp
    private static Vector4 MaxBytes => Vector128.Create(255f).AsVector4();
    private static Vector4 Half => Vector128.Create(0.5f).AsVector4();
```
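
Putting the two suggestions together, a rough sketch of a .NET 7+ variant could look like the following. It assumes a hypothetical `Rgba32Sketch` struct with `R`, `G`, `B`, `A` byte fields (not the actual ImageSharp type) and a little-endian target, so that the low byte of each `int` lane sits at byte offsets 0, 4, 8 and 12. It uses the cross-platform `Vector128.ConvertToInt32` mentioned in the comment above instead of the SSE2-only intrinsic.

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;

// Hypothetical pixel type used for illustration only.
public struct Rgba32Sketch
{
    public byte R;
    public byte G;
    public byte B;
    public byte A;

    // Expression-bodied properties: the constants are created at the use site,
    // so no static constructor is required.
    private static Vector4 MaxBytes => Vector128.Create(255f).AsVector4();
    private static Vector4 Half => Vector128.Create(0.5f).AsVector4();

    public void Pack(Vector4 vector)
    {
        vector *= MaxBytes;
        vector += Half;
        vector = Vector4.Clamp(vector, Vector4.Zero, MaxBytes);

        // .NET 7+: hardware-accelerated where supported (x86/x64 and Arm64),
        // with a software fallback elsewhere.
        Vector128<byte> result = Vector128.ConvertToInt32(vector.AsVector128()).AsByte();

        // Assumes little-endian: the low byte of each 32-bit lane holds the packed value.
        R = result.GetElement(0);
        G = result.GetElement(4);
        B = result.GetElement(8);
        A = result.GetElement(12);
    }
}
```

Whether this actually beats the existing implementation on a given pixel type would need to be confirmed with benchmarks.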

### Steps to Reproduce

NA

### Images

_No response_

Performance optimization opportunities in common pixel formats. #2232

Prerequisites

ImageSharp version

Other ImageSharp packages and versions

Environment (Operating system, version and so on)

.NET Framework version

Description

Steps to Reproduce

Images

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Performance optimization opportunities in common pixel formats. #2232

Description

Prerequisites

ImageSharp version

Other ImageSharp packages and versions

Environment (Operating system, version and so on)

.NET Framework version

Description

Steps to Reproduce

Images

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions