
Tags: ashvardanian/NumKong


v7.6.0

Release: v7.6.0 [skip ci]

- Add: DLPack 1.3 interop bridge for numkong.Tensor (ea74fe1)
- Add: Back-port tensor API to C++20 for CUDA (ad93068)

- Improve: FP8 GEMM throughput on Skylake/Haswell + Granite Rapids E5M2 kernel (c19bec9)
- Improve: FP8 pairwise distance kernels via Giesen trick + F16 widen path (679f55f)
- Fix: Keep `*_serial` kernels scalar across LTO (455d535)
- Make: Enable symbol exports for `nk_shared` Emscripten builds (482e4fd)
- Improve: SSD trace-identity fold across all mesh backends + Genoa/NEONFHM kernels (e9d40e5)
- Make: Normalize base PowerPC & LoongArch cap for JS (ab81191)
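
The "Giesen trick" named in the FP8 entry refers to Fabian Giesen's shift-and-multiply approach to widening small floats. The following is a minimal scalar sketch of that idea, assuming the OCP E4M3FN encoding (exponent bias 7, no infinities, `0x7F`/`0xFF` reserved for NaN). It widens to float32 rather than F16 for readability. The exponent and mantissa bits are moved into place in a float32 word, and a single multiply by 2^120 (that is, 2^(127-7)) rebiases the exponent and also turns E4M3 subnormals into correctly scaled values. `e4m3_to_f32` is a hypothetical name; this is not NumKong's SIMD kernel.

```python
import math
import struct


def e4m3_to_f32(byte: int) -> float:
    """Decode one E4M3FN byte with the shift-and-multiply (Giesen-style) trick."""
    # E4M3FN has no infinities; only S.1111.111 encodes NaN.
    if byte & 0x7F == 0x7F:
        return math.nan
    sign = (byte & 0x80) << 24  # bit 7 -> float32 sign bit 31
    body = (byte & 0x7F) << 20  # 4 exponent + 3 mantissa bits -> bits 26..20
    (value,) = struct.unpack("<f", struct.pack("<I", sign | body))
    # The word now holds 2^(e-127) * 1.m (or a float32 subnormal when e == 0);
    # one multiply by 2^(127-7) rebiases normals and subnormals alike.
    return value * 2.0**120
```

The design point is that no branch is needed for subnormals: an E4M3 subnormal lands in the float32 subnormal range, where the same multiply yields `m/8 * 2^-6`.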

v7.5.0

Release: v7.5.0 [skip ci]

- Add: NEON popcount kernel for nk_reduce_moments_u1 (2181e0c)
- Add: Tensor constructors, sealed trait family, div_ceil cleanup (2792279)
- Add: Span-based matrix `_into` APIs, parallel Hammings/Jaccards, full-crate docs (99289df)
- Add: OpenMP for Python & JavaScript (499ecc9)
- Add: Granite Rapids AMX for F16 & F32 (28036ea)
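
For reference, Hamming and Jaccard distances between bit vectors reduce to population counts of XOR, AND, and OR. The following is a minimal scalar sketch over packed `bytes`, assuming the usual definitions (Jaccard distance is 1 - |a AND b| / |a OR b|, taken as 0 when both vectors are empty). The function names are hypothetical, and the sketch says nothing about the layout or threading of NumKong's parallel `_into` APIs.

```python
def _popcount(x: int) -> int:
    """Count set bits in a small non-negative integer."""
    return bin(x).count("1")


def hamming(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length packed bit vectors."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same length")
    return sum(_popcount(x ^ y) for x, y in zip(a, b))


def jaccard(a: bytes, b: bytes) -> float:
    """Jaccard distance 1 - |a AND b| / |a OR b| over packed bits."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same length")
    intersection = sum(_popcount(x & y) for x, y in zip(a, b))
    union = sum(_popcount(x | y) for x, y in zip(a, b))
    # Convention: two all-zero vectors are treated as identical.
    return 1.0 - intersection / union if union else 0.0
```

Because both metrics are sums of per-byte popcounts, rows of a pairwise matrix are independent, which is what makes them straightforward to split across threads.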

- Fix: Native ISA probe on Apple Clang + compile/runtime glyph (bc13e02)
- Make: Detect illegal instructions in macOS CI (289cdaf)
- Fix: Drop `-march=` on macOS setup.py builds (28aac74)
- Fix: Exclude `std::signal` from WASM builds (14814c5)
- Improve: Drop GNU statement-expression macros in SVE reduce helpers (b8b4ca0)
- Make: Drop `+nosimd` from AArch64 baseline (23f5195)
- Make: Forbid auto-vectorization in portable baseline builds (43e8324)
- Make: Pin TU baseline to per-arch ABI floor across build systems (453ed5f)
- Fix: Mitigate GCC 13 wrong BF16 splat in Arm NEON (#346) (fc3d8ec)
- Improve: Log faulting capability detection (a401f8a)
- Improve: Log faulting kernel on fatal signals in `nk_test` (22c7c79)
- Make: Normalize Python test dependencies across CI and docs (8a0f3d4)
- Make: Baseline-only ISA for shared-library test, harden Windows CI (1907685)
- Fix: Wrong compiler probes for SMEBF16 & SMEBI32 (8b19ddb)
- Make: Log host CPU capabilities in macOS and Windows CI jobs (988eeb2)
- Fix: Pre-declare OpenMP loop counter, universal libomp for macOS (493a021)
- Fix: Use int for OpenMP loop counters, absolute libomp install name (ccc0118)
- Fix: GCC requires +sme prefix in target attribute for __arm_sc_* stubs (291dc0a)
- Fix: Signed OpenMP iterators, source-built libomp, JS KMP guard (dc1ae75)
- Fix: OpenMP wheel builds on macOS and Windows (f569121)
- Fix: Add target("sme") to __arm_sc_* stubs for GCC compatibility (ad2add0)
- Fix: Unpoison SVE scalar reductions for MemorySanitizer (#342) (b42eda7)
- Improve: Move SME runtime stubs to types.h as weak inline definitions (64ca934)
- Improve: Manual SME streaming control, single enter/exit per API call (6432837)
- Fix: Update `cdist` edge-case test for re-added `threads=` kwarg (50681af)
- Make: Allow force-enabling ISA targets via environment variables (0e58702)
- Improve: Abandon F32→F64 via Ozaki on Granite Rapids (94a5f19)
- Make: FreeBSD, PPC64le, LoongArch, RISC-V releases & compress Windows (a9a0d83)
- Make: Standardize CI compilers and add Windows test job (9a22ea4)
- Make: Shrink serial fallbacks with scoped size optimization (83154a8)
- Make: Compress Windows builds (e30ad3d)
- Fix: Streaming-compatible stubs for LLVM SME builds (0be7b2f)

v7.4.5

Release: v7.4.5 [skip ci]

- Improve: Vectorize F32 SME MaxSim finalizer (0daacf3)
- Improve: Remove centering from RMSD kernels (1a83ab4)
- Fix: Emulated vs native test durations (4266451)
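
The following is a minimal sketch of what "without centering" means, under our reading of the RMSD entry. The deviation is measured on the points as given, with no centroid subtraction, so a pure translation between the two sets contributes to the result. `rmsd` is a hypothetical name, not NumKong's API.

```python
import math


def rmsd(a, b):
    """Root-mean-square deviation between paired 3D points, without centering."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty point sets of equal length")
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(a, b):
        # Squared Euclidean distance between the i-th pair of points.
        total += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total / len(a))
```

For example, `rmsd([(0, 0, 0)], [(3, 4, 0)])` is `5.0`. A centered variant would first subtract each set's centroid and report `0.0` for the same input.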

v7.4.4

Release: v7.4.4 [skip ci]

- Fix: ARMv7 Rust cross-compilation with CC for versioned GCC (a5e67e6)
- Make: `check_source_runs`-probing like `march=native` on MSVC (7a152f3)
- Fix: Drop `_MM_FROUND_NO_EXC` from `_mm256_cvtps_ph` calls (8649b0c)
- Fix: Guard against old MSVC preprocessor (25d3304)
- Make: Enforce newer preprocessor in MSVC (be966af)
- Make: Cleaner CIBW artifact names & env forwarding (a6cf642)
- Make: Forward cross-compilation flags for macOS wheels (6ed3b8c)
- Make: Split ppc64le, s390x, i686 CIBW runs (c01795c)

v7.4.3

Release: v7.4.3 [skip ci]

- Fix: Require AArch64 for NEON kernels (2ba1b34)
- Docs: Table order & formatting (8673a56)
- Make: Avoid `--all-features` in Rust cross-compilation CI (8be8bff)
- Improve: Arm32 compatibility (6404172)
- Make: `cancel-in-progress` CI to shift compute resources (dfc8fa0)
- Improve: Harden Swift SDK for 6.1+ toolkit (965cd52)
- Make: Strip `.unsafeFlags` & list platforms for SPM consumption (b061b78)
- Make: Expose `CNumKongDispatch` target to Swift users (6aa00a8)

v7.4.2

Release: v7.4.2 [skip ci]

- Docs: Shrink tables in the main README (6d2ea34)
- Make: Inline PowerShell cross-compilation logic in CI (974c30c)
- Make: Define `_ARM64_` for Arm JS builds in MSVC (f303042)
- Make: Skip same-named artifacts on CI reruns (7c098e5)

v7.4.1

Release: v7.4.1 [skip ci]

- Make: Set `repository.url` for NPM (385480d)
- Make: Pull MSVC ARM64 Cross-Compiler (e20c93e)
- Fix: Swap `f16x8` for `u16x8` in `cast_neon` (154ec5d)

v7.4.0

Release: v7.4.0 [skip ci]

- Add: WASM elementwise ops & spatial mini-float kernels (81b8c44)
- Add: WASM type-casting kernels (e09df31)
- Add: SVE+SDOT ops for 8-bit integers (913fc6b)

- Fix: Misplaced NEON loads/stores in Sierra (05e3045)
- Fix: Avoid unconditional `np` symbols (9dffb68)
- Make: Resolve probe locations for NPM consumers (c602f45)
- Docs: Refined "What's Inside" (28f35cd)
- Docs: Mini-float kernel selection strategy (04e6598)
- Improve: Accelerate PyTests, reduce `Decimal` use (2417248)
- Make: Move `.pyi` for PyLance (688ec2d)
- Fix: Inconsistent SME function qualifiers (5b4148a)
- Improve: Smaller test inputs under QEMU (ee36bf2)
- Improve: Vectorize GEMM "packers" (86127a4)
- Make: Longer timeouts for QEMU in CI (a9cc732)
- Fix: `vec_t` store helper args order (eecbcac)
- Fix: Negative stride tensor reductions (3ea81be)
- Improve: Recursive stride collapsing and axis-lane fast paths for N-D reductions (cf8eaf6)
- Improve: Faster reductions in strided tensors (61651ed)
- Improve: Wider NEON curved, mesh, & probability F16 kernels (1c17678)
- Fix: Harden mini-float type-casting (1911b89)
- Make: Ship `win32-arm64` NPM builds (578b7ad)
- Make: Auto-bump JS platform-specific versions (5617f75)
- Fix: `vcombine` instead of initializer lists for NEON arrays in MSVC (906c178)
- Fix: Avoid flaky `vld1_f16` for MSVC (7a987d2)
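
The idea behind stride collapsing can be sketched as follows. Two adjacent axes can be fused whenever stepping the outer axis once moves exactly as far as walking the whole inner axis (`outer_stride == inner_stride * inner_size`), and size-1 axes can be dropped. This is an iterative sketch that assumes strides in elements (the same test works in bytes) and handles negative strides. `collapse_strides` is a hypothetical helper, not the library's function.

```python
def collapse_strides(shape, strides):
    """Fuse adjacent axes that are contiguous with respect to each other."""
    if 0 in shape:
        return [0], [0]  # empty tensor: nothing to iterate
    out_shape, out_strides = [], []
    for size, stride in zip(shape, strides):
        if size == 1:
            continue  # size-1 axes never move the data pointer
        if out_shape and out_strides[-1] == stride * size:
            # Outer axis steps over exactly one full run of this axis: merge.
            out_shape[-1] *= size
            out_strides[-1] = stride
        else:
            out_shape.append(size)
            out_strides.append(stride)
    return out_shape, out_strides
```

A C-contiguous 2×3×4 array collapses to a single axis of 24, so a reduction can run one tight loop instead of three nested ones. Rows padded to a stride of 6 stay as two axes.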

v7.3.0

Release: v7.3.0 [skip ci]

- Add: NEON & SDOT fallbacks for `i4` & `e3m2` (0c6afa5)

- Docs: M5 perf stats for Wasmtime v43 (43c2881)
- Fix: Alternative MSVC-friendly cast (4744b9b)
- Make: Disable LTCG due to MSVC issues (3d37684)
- Make: Try `PREBUILDS_ONLY=0` in CI (64c5f95)
- Improve: Lower NEONHALF → NEON requirements (37f99ec)
- Fix: Wire `nk_cast_neon` benchmarks (3793af2)
- Docs: Apple M5 native stats for secondary workloads (d7c81c4)
- Improve: Faster in-vector 4-way finalizers in NEON (968dcd1)
- Improve: Drop `nk_f16x4_to_f32x4_neon` (84bb20a)
- Improve: `vcvt_high` for faster unpacking (a5f4a19)
- Docs: Refresh GEMM/SYRK measurements Apple M4 → M5 (3e010de)
- Fix: Harden strided reductions in NEON & AVX2 (61ac67b)
- Fix: Double-counted tail in Skylake `f64` RMSD, Kabsch, and Umeyama (5391344)
- Improve: Share `decimal.Context.traps` rules (3c28ae9)
- Fix: Padding partial tail 32-bit words for `BMOPA` (2598487)
- Fix: Missing scale type definitions of mini-floats (91862da)
- Fix: Scalar buffer cast internal overwrites & aliasing (7b0e129)
- Fix: Top-bottom variable names (a014134)
- Improve: Giesen's E4M3 → F16 in Streaming SVE (25322b5)
- Improve: Fewer branches in SME GEMMs (858263c)
- Fix: Up-round dimensions count in sub-byte C++ tests (87a72d0)
- Make: Focus on M4 CPUs for SME probing (5ff63eb)
- Improve: PyTesting across more shapes (4bc3e44)
- Improve: Cleaner type-casting & promotion rules (23c2474)
- Make: Hide formatting commits for v7-7.2 (f6ce2da)
- Make: Native addon resolution for Deno & Bun (0d502d5)
- Docs: Citations (6220137)
- Improve: Faster mini-float norms in Streaming SVE (088de57)
- Make: Integrate PyRight (0fe56c0)
- Fix: F16 norms in SSVE skipped odd entries (bf3bfee)
- Fix: Harden SVE MaxSim upcasting logic (803eb33)
- Fix: Disable `FPCR.AH` bit (7b2b850)
- Make: Node 24 for trusted publishing (9f1a4ef)
- Fix: `_m` to zero-out predicated SVE/SME ops (16c157b)
- Fix: `_m` to zero-out predicated SVE lanes in `spatial/` (ac27cde)
- Make: Replace stale `prebuildify` (74c5454)

v7.2.4

Release: v7.2.4 [skip ci]

- Make: 2h timeout budget for JS & Py builds (2e8f081)