
Add support for masked loads & stores #374

Closed
farnoy wants to merge 8 commits

Conversation

farnoy
Contributor

@farnoy farnoy commented Nov 16, 2023

Resolves the masked loads/stores checklist item from #16

Depends on rustc exposing new intrinsics - rust-lang/rust#117953

Added a basic test; do we need to auto-generate tests with macros to cover all types & lane sizes?

#[inline]
pub fn masked_load_select(slice: &[T], enable: Mask<isize, N>, or: Self) -> Self {
    let ptr = slice.as_ptr();
    let idxs = Simd::<usize, N>::from_slice(&[
Contributor Author

I don't like this. The rotate_lanes_left code does a clever thing with a const expression; maybe that approach should be preferred?

But ideally, I'd like to have something like Simd<T, N>::lane_indices() -> Simd<T, N> (generating the lane indices 0 through N-1) built in, as it's a useful function to have in user code.
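A minimal sketch of what such a helper could look like (the name lane_indices and its placement are only a suggestion, not an existing API; it assumes the usual core::simd imports):

use core::simd::{LaneCount, Simd, SupportedLaneCount};

fn lane_indices<const N: usize>() -> Simd<usize, N>
where
    LaneCount<N>: SupportedLaneCount,
{
    // build the index array with a const-friendly loop so the compiler can fold
    // the whole thing into a constant vector
    let mut index = [0usize; N];
    let mut i = 0;
    while i < N {
        index[i] = i;
        i += 1;
    }
    Simd::from_array(index)
}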

Comment on lines 334 to 341
enable &= {
    let mask = bzhi_u64(u64::MAX, core::cmp::min(N, slice.len()) as u32);
    let mask_bytes: [u8; 8] = unsafe { core::mem::transmute(mask) };
    let mut in_bounds_arr = Mask::splat(true).to_bitmask_array();
    let len = in_bounds_arr.as_ref().len();
    in_bounds_arr.as_mut().copy_from_slice(&mask_bytes[..len]);
    Mask::from_bitmask_array(in_bounds_arr)
};
Contributor Author

The previous version was generating suboptimal code. This implementation is still a bit gnarly because I wasn't able to easily cast a u64 into a Mask<_, N>. This is because ToBitmask uses u8/u16/u32/u64. I don't think we can pull in num_traits for an easy solution to this?

This version generates code that matches my handwritten AVX-512 intrinsics.

Member

I made some (not yet merged) changes to bitmasks in #375 that may help you (with that change, it's possible to always return a u64), but I'm suspicious of this implementation, since it's not particularly portable. Bitmasks are really only efficient on x86, and most x86 hardware, for the time being, will not have AVX-512. I'm confident this codegen will be horrendous on aarch64 :)

I think you should use the old version of the bounds checking, and conditionally check for x86 and N < 64 to use bitmasks. I would actually split off the x86-optimized enable-adjusting code to its own function.
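A sketch of that split, keeping a portable comparison-based bounds mask as the default (the function name is illustrative and import paths follow the current nightly layout); an x86-only bitmask variant could then sit behind a #[cfg(any(target_arch = "x86", target_arch = "x86_64"))] gate for N <= 64:

#![feature(portable_simd)]
use core::simd::prelude::*;
use core::simd::{LaneCount, SupportedLaneCount};

#[inline]
fn in_bounds_mask<const N: usize>(len: usize) -> Mask<isize, N>
where
    LaneCount<N>: SupportedLaneCount,
{
    // portable default: enable a lane iff its index is within the slice length
    let idx: Simd<isize, N> = Simd::from_array(core::array::from_fn(|i| i as isize));
    idx.simd_lt(Simd::splat(len.min(N) as isize))
}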

Contributor Author

I'll check it out. So far, the bitmasks haven't been a problem for me because of other codegen issues.

I hoped that LLVM would implement the fallback by breaking the vector in half at each step (log2(N) steps in total). I was imagining something like this:

fn load_select<T, const N: usize>(slice: &[T], enable: Mask<T::Mask, N>, or: Simd<T, N>) -> Simd<T, N> where ... {
    if likely(enable.all() && in_bounds(N - 1)) {
      Self::from_slice_unchecked(slice)
    } else {
        let low = Simd::<T, N/2>::load_select(slice, enable.low_half(), or.low_half());
        let high = Simd::<T, N/2>::load_select(&slice[N/2..], enable.high_half(), or.high_half());
        merge(low, high)
    }
}

But instead it generates code that's more like this:

fn load_select<T, const N: usize>(slice: &[T], enable: Mask<T::Mask, N>, or: Simd<T, N>) -> Simd<T, N> where ... {
    let mut elems = or.to_array();
    for lane_ix in 0..N {
        if in_bounds(lane_ix) && enable.test(lane_ix) {
            elems[lane_ix] = slice[lane_ix];
        }
    }
    Self::from_array(elems)
}

If I wanted to implement this efficiently for other architectures, could I do it with some form of recursive swizzle & macros? I think there's potential for writing simpler code while retaining efficiency. The intended use case is for the likely(enable.all() && in_bounds(N-1)) branch to always be taken, except on the last iteration. That should lead to good codegen: you don't need to duplicate loop bodies for an epilogue, and the branch should be perfectly predictable until the last iteration.

This could still be inefficient in certain cases:

  • Even for AVX-512, not needing to do bitmask operations in all but one iteration could free up the core. In my experience, you can probably spare scalar resources for this so it doesn't seem like a big deal
  • The caller depends on the mask for their computation after load_select. This could be a problem if they compute the mask differently than this library, and the compiler doesn't notice. Basically computing the masks twice, in different ways

Member

I don't think const generics are powerful enough to allow anything recursive, yet. I think you're being too tricky for the compiler--if you'd like to skip the checks, you need to add the explicit if enable.all() branch.
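A rough sketch of that explicit fast path, using u8x8 for concreteness; masked_load_fallback below is an illustrative scalar stand-in for the PR's bounds-checked masked load, not an existing API:

#![feature(portable_simd)]
use core::simd::{mask8x8, u8x8};

fn load_select(slice: &[u8], enable: mask8x8, or: u8x8) -> u8x8 {
    if enable.all() && slice.len() >= 8 {
        // fully enabled and fully in bounds: plain unmasked load
        u8x8::from_slice(slice)
    } else {
        masked_load_fallback(slice, enable, or)
    }
}

// Scalar stand-in for the bounds-checked masked load.
fn masked_load_fallback(slice: &[u8], enable: mask8x8, or: u8x8) -> u8x8 {
    let mut out = or.to_array();
    for i in 0..8.min(slice.len()) {
        if enable.test(i) {
            out[i] = slice[i];
        }
    }
    u8x8::from_array(out)
}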

Member

This was obviously easier for me to implement, since I didn't write it generically, but does this look right?
https://rust.godbolt.org/z/TvYqPh4js

First, I couldn't convince the compiler to emit bzhi so I had to resort to asm. But even then--the naive approach seems more efficient to me.

Contributor Author

Check out my comment on StackOverflow here: https://stackoverflow.com/questions/75179720/how-to-get-rust-compiler-to-emit-bzhi-instruction-without-resorting-to-platform#comment136619035_75179720

I found two ways to emit bzhi but neither are production-ready IMO.

I will benchmark this some more. I like your constexpr array solution. When I did something similar before (but using a &'static [usize; 64]), it loaded the entire slice into multiple vector registers before shuffling.

Your solution is precise and indeed looks clean here, including for the -C target-feature=+sse4.1,+avx2 version. _mm_maskload comes from AVX/AVX2. Not sure about SSE & NEON, but it's promising.

I don't think it's valid to do len as i32; you'd need something like i32::try_from(len).unwrap_or(i32::MAX) to get a saturating cast. Otherwise, passing in a slice of length i32::MAX as usize + 1 could make the bounds check reject valid elements, right? This applies to smaller masks too, like i16 and i8.
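For illustration, a standalone snippet of the difference (on a 64-bit target, with len standing in for slice.len()):

fn main() {
    let len: usize = i32::MAX as usize + 1;
    // wrapping cast: the bound goes negative, so the comparison rejects every lane
    assert_eq!(len as i32, i32::MIN);
    // saturating cast: the bound stays at i32::MAX, so all lanes remain enabled
    assert_eq!(i32::try_from(len).unwrap_or(i32::MAX), i32::MAX);
}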

As for const generics and N/2, I'll try adding just a single branch if likely(enable.all() && in_bounds(N-1)) to see if there's a perf hit on AVX-512, because it's definitely going to be a big win for other targets.

Member

This is silly, but maybe you can rely on the optimizer here: https://rust.godbolt.org/z/TxPYhx6K4

You could add that index function to one of the Sealed traits.

    Mask<<T as SimdElement>::Mask, N>: ToBitMask + ToBitMaskArray,
{
    Self::masked_load_select(slice, Mask::splat(true), or)
}
Member

I don't think this function does anything?

Member

I suppose it's masking the slice length, but I'm not sure I would even call this a masked load. Also, I'm not sure this is the best way to load from slices on most architectures.

Contributor Author

It's supposed to enable us to write extremely nice loops like this:

let mut accum = Simd::<u8, N>::default();
for i in (0..data.len()).step_by(N) {
    accum ^= Simd::masked_load_or(&data[i..], Simd::default());
}

No epilogues or scalar fallbacks needed. This could be even shorter since SimdElement implies Default?

But for now it's only a reality on AVX-512, on account of the codegen.

Member

I would at least rename it--it might be implemented via masked loads, but API-wise I would consider it something else entirely. This is perhaps Simd::from_slice.

Contributor Author

After the rename -- I don't like that this function seems more discoverable than from_slice. Users should prefer from_slice if it fits their use case. This isn't a problem with gather/scatter because that's a specific name with a more complex signature; it's hard to use by accident.

Member

from_slice already has a branch on the length anyway; what I'm suggesting is that from_slice should perhaps instead do a masked load for len < N:

assert!(
    slice.len() >= Self::LEN,
    "slice length must be at least the number of elements"
);
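A sketch of the suggestion with a concrete element type (u8x4) and a scalar stand-in for the masked load; the name from_slice_or and the or fallback are illustrative choices, while the actual proposal is for from_slice itself to handle short slices:

#![feature(portable_simd)]
use core::simd::u8x4;

fn from_slice_or(slice: &[u8], or: u8x4) -> u8x4 {
    if slice.len() >= 4 {
        // long enough: plain full-width load
        u8x4::from_slice(slice)
    } else {
        // short slice: fill the missing lanes from `or` instead of panicking
        let mut arr = or.to_array();
        arr[..slice.len()].copy_from_slice(slice);
        u8x4::from_array(arr)
    }
}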


#[must_use]
#[inline]
pub fn masked_load_select(
Member

I would call this just masked_load, or load_select. IMO masked and select mean the same thing.

let mut arr = [u8::MAX; 7];

u8x4::splat(0).masked_store(&mut arr[5..], Mask::from_array([false, true, false, true]));
// write to index 8 is OOB and dropped
Contributor Author

I'll make the stack array larger to prove that the store doesn't spill beyond a smaller slice
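A sketch of that enlarged test, written against the PR's draft masked_store and the out-of-bounds-lanes-are-dropped semantics discussed above (names may differ in the final API):

#![feature(portable_simd)]
use core::simd::{u8x4, Mask};

#[test]
fn masked_store_does_not_write_past_slice() {
    let mut arr = [u8::MAX; 16];
    // arr[13..] has length 3, so lane 3 would target index 16, past the slice
    u8x4::splat(0).masked_store(&mut arr[13..], Mask::from_array([false, true, false, true]));
    assert_eq!(arr[13], u8::MAX); // lane 0: disabled
    assert_eq!(arr[14], 0);       // lane 1: written
    assert_eq!(arr[15], u8::MAX); // lane 2: disabled; lane 3: out of bounds, dropped
}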

@farnoy
Contributor Author

farnoy commented Nov 19, 2023

@calebzulawski I pushed out a new version with the changes we discussed.

I've also added temporary toggles for easier benchmarking (USE_BRANCH and USE_BITMASK).

Here's some data on a simple benchmark reading from one u8 vector and zero-extending it to u16 to write back to another vector. The write buffer is small (1300B) and fully cached. The benchmark is DRAM-bound. This is on Zen 4 (7950X3D) with -C target-cpu=znver4

  1. USE_BRANCH=true, USE_BITMASK=true: ~45 GB/s (hot loop)
  2. USE_BRANCH=false, USE_BITMASK=true: ~41 GB/s (hot loop)
  3. USE_BRANCH=true, USE_BITMASK=false: 32 GB/s (hot loop)
  4. USE_BRANCH=false, USE_BITMASK=false: 2.4 GB/s (hot loop)

That last one is interesting as an outlier - uiCA disagrees that it should be >10x slower. I will investigate these on Sapphire Rapids and Graviton 3. I'm also curious if there's any way to do i8::try_from(slice.len()).unwrap_or(i8::MAX) more efficiently. There also seem to be too many branches per iteration; perhaps it has something to do with how I defined my input & output vectors?

    let len = bytestring.len();
    output.clear();
    output.reserve(characters);

    unsafe {
        // contents are uninitialized until the loop below ends
        output.set_len(len);
    }

    for i in (0..len).step_by(N) {
        let data = Simd::<u8, N>::load_or_default(&bytestring[i..]);
        let data2: Simd<u16, N> = data.cast();
        data2.masked_store(&mut output[i..], Mask::splat(true));
    }

And I've set N=32 for this run, expecting a ymm load that expands into a zmm store

EDIT: It's also curious that the masks get calculated twice for USE_BITMASK=false, once to load, then separately to store. It seems to be because of varying lane size - u8 vs u16.


#[must_use]
#[inline]
pub fn load_or(slice: &[T], or: Self) -> Self
Contributor Author

TODO: I need to test if this works with T=*const u8 and other pointers. These trait bounds are involved, at least for now, but they shouldn't require T: Default for this and other functions that accept an or: Self

@farnoy
Contributor Author

farnoy commented Nov 19, 2023

I optimized USE_BITMASK=false to always do the comparison on i8 masks, which avoids the redundant computation. It's a substantial perf win, but it still misbehaves at specific vector sizes.

// TODO: optimize it further if possible
// https://stackoverflow.com/questions/75179720/how-to-get-rust-compiler-to-emit-bzhi-instruction-without-resorting-to-platform
#[inline(always)]
fn bzhi_u64(a: u64, ix: u32) -> u64 {
Contributor Author

If we made ix a u64, this would remove the need to do std::cmp::min(N, slice.len() as u32) in call-sites above.

Which is doubly important if we take it further:

extern "C" {
    #[link_name = "llvm.x86.bmi.bzhi.64"]
    fn x86_bmi2_bzhi_64(x: u64, y: u64) -> u64;
}

#[inline(always)]
fn bzhi_u64(a: u64, ix: u64) -> u64 {
    #[cfg(target_feature = "bmi2")]
    unsafe {
        return x86_bmi2_bzhi_64(a, ix);
    }

    if ix > 63 {
        a
    } else {
        a & (1u64 << ix) - 1
    }
}

For whatever reason, the Intel intrinsic _bzhi_u64 takes ix: u32, also in the C/C++ versions. Even a u64 as u32 cast generates a redundant mov to a 32-bit register, clearing the high bits. All of this is truly unnecessary because the instruction operates on 64-bit registers.

It's the difference between

        cmp rsi, 64
        mov r9d, 64
        cmovb r9, rsi
        bzhi r9, r8, r9

in the current pure-Rust version,

        mov r9d, esi
        bzhi r9, rdx, r9

when calling the core::arch::x86_64::_bzhi_u64 intrinsic with slice.len() as u32 (which is incorrect for slices longer than u32::MAX), and

        bzhi r9, rdx, rsi

when using the code above with #![feature(link_llvm_intrinsics)].

The layers of mismatched semantics are working against us

Contributor Author

Nevermind, the version without min(N, slice.len()) is wrong because the instruction only looks at the lowest 8 bits. So it effectively does a modulo 256 on the second operand.

So x86_bmi2_bzhi_64(u64::MAX, 525) is equal to 0x1FFF, 13 low bits set, because 525 % 256 = 13
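A clamped sketch, for illustration, using the stable core::arch::x86_64::_bzhi_u64 intrinsic rather than the raw LLVM link:

#[inline(always)]
fn bzhi_u64(a: u64, ix: u64) -> u64 {
    // clamp first: indices >= 64 keep all bits, and the clamp also keeps the
    // value within the low 8 bits the hardware instruction actually reads
    let ix = core::cmp::min(ix, 64);
    #[cfg(all(target_arch = "x86_64", target_feature = "bmi2"))]
    unsafe {
        return core::arch::x86_64::_bzhi_u64(a, ix as u32);
    }
    if ix >= 64 {
        a
    } else {
        a & ((1u64 << ix) - 1)
    }
}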

@farnoy
Contributor Author

farnoy commented Dec 11, 2023

I've updated it for the final version of these intrinsics that shipped in nightly, but I have no idea why it fails on CI; something about the ToBitmask imports I've added.

Member

@workingjubilee workingjubilee left a comment

Please add example doctests for these functions. The doctests are consistently exercised by Miri, so we want there to be an example for every function that exercises its usual behavior.
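A sketch of the kind of example being asked for, written against the PR's draft masked_load_select and the out-of-bounds-lanes-fall-back-to-or semantics discussed above (names and details may change before merge):

#![feature(portable_simd)]
use core::simd::{Mask, Simd};

fn main() {
    let slice = [1u8, 2, 3];
    let or = Simd::from_array([9u8, 9, 9, 9]);
    let enable = Mask::from_array([true; 4]);
    // lane 3 is past the end of the slice, so it falls back to `or`
    let v = Simd::masked_load_select(&slice, enable, or);
    assert_eq!(v.to_array(), [1, 2, 3, 9]);
}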

@workingjubilee
Member

I've also added temporary toggles for easier benchmarking (USE_BRANCH and USE_BITMASK).

These may be most interesting to keep around via cfg, so that people can rebuild the libcore with these using -Zbuild-std, but I have no strong opinion here.

I'll take a look at the import problem.

@calebzulawski
Member

I would like fuzz tests especially; I know it's not enabled, but I'm particularly suspicious of the bitmask implementation.

        + core::ops::Add<<T as SimdElement>::Mask, Output = <T as SimdElement>::Mask>,
    Simd<<T as SimdElement>::Mask, N>: SimdPartialOrd,
    Mask<<T as SimdElement>::Mask, N>: core::ops::BitAnd<Output = Mask<<T as SimdElement>::Mask, N>>
        + core::convert::From<Mask<i8, N>>,
Member

These bounds are extreme, and are inflexible enough to make it difficult to adjust the implementation in the future. If it's possible to implement this in a sealed trait or something and get some type erasure to hide the bounds, that would be preferred.

@farnoy
Contributor Author

farnoy commented Dec 18, 2023

I would like fuzz tests especially; I know it's not enabled, but I'm particularly suspicious of the bitmask implementation.

@calebzulawski What do you mean by that, exactly? Do we have facilities in this library already for fuzz testing?

These may be most interesting to keep around via cfg, so that people can rebuild the libcore with these using -Zbuild-std, but I have no strong opinion here.

@workingjubilee Is that the approach we want to take, given it will have an impact on the trait bounds we publish for these new functions? If we focused on USE_BRANCH=true and especially USE_BITMASK=true/false, we could simplify the trait bounds. Unless it's not a problem if we use a Sealed trait, but I don't usually write library code so I wouldn't know.

I'm not sure if these toggles are needed, or if the API interface is the final one we should use.

In my testing, even AVX-512 can benefit from USE_BRANCH=true, which you'd think is strange because it generates more code for both cases, but the optimizer is able to unroll loops better when it can promote it to an unmasked load. I don't have a good variety of benchmarks though, and I'm not sure of the best way to explore our options here.

As for the interface potentially not being the final one -- I find that in real code using this, you often end up recomputing the effective Mask that was used for the load/store, because the functions only return the vector, if they return anything at all. This is probably inefficient, because user code might compute the mask slightly differently and the final code would then contain two computations that are effectively the same. To improve on this, we might want to return the mask the library actually used - basically the user-provided mask with the effects of bounds checking applied.
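A sketch of that API shape, purely illustrative and not the PR's code, with a scalar stand-in for the masked load itself:

#![feature(portable_simd)]
use core::simd::prelude::*;
use core::simd::{LaneCount, SupportedLaneCount};

// Returns both the loaded vector and the effective (bounds-checked) mask, so the
// caller can reuse it instead of recomputing it. Assumes N <= 127 so that lane
// indices fit in i8.
fn load_select_with_mask<const N: usize>(
    slice: &[u8],
    enable: Mask<i8, N>,
    or: Simd<u8, N>,
) -> (Simd<u8, N>, Mask<i8, N>)
where
    LaneCount<N>: SupportedLaneCount,
{
    let idx: Simd<i8, N> = Simd::from_array(core::array::from_fn(|i| i as i8));
    let effective = enable & idx.simd_lt(Simd::splat(slice.len().min(N) as i8));
    let mut buf = or.to_array();
    for i in 0..N.min(slice.len()) {
        if effective.test(i) {
            buf[i] = slice[i];
        }
    }
    (Simd::from_array(buf), effective)
}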

@calebzulawski
Member

Regarding fuzz testing, we have it implemented via wrappers around proptest. You'll see the helpers in the test_helpers crate. This will be a little more difficult, since it involves pointers.

@workingjubilee
Member

workingjubilee commented Jan 30, 2024

These may be most interesting to keep around via cfg, so that people can rebuild the libcore with these using -Zbuild-std, but I have no strong opinion here.

@workingjubilee Is that the approach we want to take, given it will have an impact on the trait bounds we publish for these new functions? If we focused on USE_BRANCH=true and especially USE_BITMASK=true/false, we could simplify the trait bounds. Unless it's not a problem if we use a Sealed trait, but I don't usually write library code so I wouldn't know.

Oh, I hadn't thought about that. The Sealed trait pattern still leaves those bounds as visible and thus something a user can depend on, even if we can relax bounds later, so we should eschew anything that affects the bounds.

@calebzulawski
Member

I think the direction I would like to see these changes go (in this PR or in a new one, your preference) is for the initial implementation to have no x86-specific code, no bitmask tricks, and no extra bounds on the functions, and then to develop it from there.

I don't disagree that there may be better codegen possibilities (though I'd like to see a little more evidence of a particular implementation being faster than another, since the supposedly slower implementation actually has a lower instruction count), but I am getting the feeling that this could be better implemented as improvements to the LLVM intrinsics, or at least in the rustc intrinsics, rather than in library code.

@farnoy
Contributor Author

farnoy commented Feb 17, 2024

That's fair. Should we worry about backwards compatibility if we introduce a function now but then add extra bounds later? I know it's an unstable feature for now; I just want to confirm that's fine.

Sorry about the delay too. My priorities at work changed and this is not blocking me for now. I still want to ship this as I'll have other uses for it.

@calebzulawski
Member

No worries--I've been too busy to focus much on std::simd lately as well. My goal is that there should be effectively no bounds on the functions (which is true of all of the other functions for the most part). This is one of the benefits of fixing the codegen on the LLVM or rust intrinsic side--there are no additional rust bounds to expose. However, if that's not possible for some reason, it may still be possible to avoid the bounds but still implement this within std::simd by using the intrinsics directly rather than Simd's functions.

@jhorstmann

Of note is that LLVM has separate intrinsics for masked loads with an additional len parameter, the vector-predication intrinsics: https://llvm.org/docs/LangRef.html#vector-predication-intrinsics

In my experiments these generate nice code for AVX-512, and, with a mask of all ones and only a len, also good code for AVX2 and 32/64-bit values: https://llvm.godbolt.org/z/Yx6fKM853

But I would agree to start with a simpler version, and maybe initially just offer functions taking a mask parameter.

A pattern I often use in my code is to have a helper function that processes a chunk of data and has a const MASKED: bool generic parameter. Based on that parameter I then use either normal or masked loads, and call the masked version only for the remainder chunk that is smaller than the vector size. Calculating the load mask is then relatively simple, for example u8::MAX >> (8 - len) for an 8-bit bitmask. Since this is the remainder chunk, len is guaranteed to be less than 8.
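A minimal sketch of that chunking pattern (xor_chunk and xor_all are illustrative names, and the per-chunk work is just an XOR accumulator; the tail path pads into a temporary buffer as a portable stand-in for the masked load being discussed):

#![feature(portable_simd)]
use core::simd::{LaneCount, Simd, SupportedLaneCount};

fn xor_chunk<const MASKED: bool, const N: usize>(chunk: &[u8], acc: &mut Simd<u8, N>)
where
    LaneCount<N>: SupportedLaneCount,
{
    let v = if MASKED {
        // remainder chunk: pad into a buffer; a real implementation would use the
        // masked load here (e.g. with the bitmask u8::MAX >> (8 - len) for N = 8)
        let mut buf = [0u8; N];
        buf[..chunk.len()].copy_from_slice(chunk);
        Simd::from_array(buf)
    } else {
        // full chunk: plain unmasked load
        Simd::from_slice(chunk)
    };
    *acc ^= v;
}

fn xor_all(data: &[u8]) -> Simd<u8, 8> {
    let mut acc = Simd::splat(0);
    let mut chunks = data.chunks_exact(8);
    // full-width chunks take the unmasked path...
    for c in chunks.by_ref() {
        xor_chunk::<false, 8>(c, &mut acc);
    }
    // ...and only the remainder (len < 8) takes the masked path
    let rem = chunks.remainder();
    if !rem.is_empty() {
        xor_chunk::<true, 8>(rem, &mut acc);
    }
    acc
}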

@workingjubilee
Member

aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAA

hi just wanted to let you know I sympathize with

Sorry about the delay too.

No worries--I've been too busy to focus much on std::simd lately as well.

anyways, re:

My goal is that there should be effectively no bounds on the functions (which is true of all of the other functions for the most part).

yeah, ideally. I'd rather any bounds be the kind that type inference can easily take care of 99.9% of the time.

@farnoy
Contributor Author

farnoy commented Feb 29, 2024

I'll close this in favor of the simplified PR I just opened.

I agree with letting LLVM optimize it. If someone needs performant code today, they probably need to use the unsafe functions and calculate bitmasks themselves.

@farnoy farnoy closed this Feb 29, 2024