Tags: linkedin/isolation-forest
Bump pytest from 8.3.2 to 9.0.3 in /isolation-forest-onnx (#84)

Bumps [pytest](https://github.com/pytest-dev/pytest) from 8.3.2 to 9.0.3.
- [Release notes](https://github.com/pytest-dev/pytest/releases)
- [Changelog](https://github.com/pytest-dev/pytest/blob/main/CHANGELOG.rst)
- [Commits](pytest-dev/pytest@8.3.2...9.0.3)

---
updated-dependencies:
- dependency-name: pytest
  dependency-version: 9.0.3
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Upgrade ONNX to 1.21.0 and align dependencies (onnxruntime, numpy, Python >=3.11) (#83)

* Bump onnx from 1.17.0 to 1.21.0 in /isolation-forest-onnx

Bumps [onnx](https://github.com/onnx/onnx) from 1.17.0 to 1.21.0.
- [Release notes](https://github.com/onnx/onnx/releases)
- [Changelog](https://github.com/onnx/onnx/blob/main/docs/Changelog-ml.md)
- [Commits](onnx/onnx@v1.17.0...v1.21.0)

---
updated-dependencies:
- dependency-name: onnx
  dependency-version: 1.21.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* Pin ONNX model IR version to 10 for onnxruntime compatibility

onnx 1.21.0 defaults to IR version 13, which is unsupported by onnxruntime < 1.24.1. Since the model only uses opset 14, IR version 10 is sufficient and ensures broad onnxruntime compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Upgrade onnxruntime, numpy, and Python version for onnx 1.21.0 compatibility

- Upgrade onnxruntime from 1.19.2/1.18.0 to 1.24.1 (supports IR version 13 that onnx 1.21.0 produces by default)
- Upgrade numpy from 1.26.4 to 2.2.6 and fix np.trapz -> np.trapezoid (trapz is deprecated as of numpy 2.0)
- Update python_requires from >=3.9 to >=3.10 (required by onnx 1.21.0)
- Remove the ir_version=10 pin since onnxruntime 1.24.1 natively supports IR version 13

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Align Python version in CI and Black config with python_requires >=3.10

- Update pypi-publish job from Python 3.9 to 3.10
- Update Black target-version from py39 to py310

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Pin ONNX model IR version to 10 for maximum runtime portability

The model only uses opset 14, which is fully supported by IR version 10. Pinning avoids requiring onnxruntime >= 1.24.1 or other recent runtimes just to load the model, maximizing cross-platform compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add explicit Python setup to the build job

The build job runs Gradle, which creates a Python venv for the isolation-forest-onnx tests. Without setup-python, it relies on whatever python3 the runner provides, which is implicit and fragile. Pin to 3.10 to match python_requires.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Raise minimum Python to 3.12

onnxruntime 1.24.x only ships Linux wheels for Python 3.11+, and 3.12 is the current ubuntu-latest default. Since this is a converter tool (not a foundational library), targeting 3.12 is a practical baseline that ensures all dependencies have prebuilt wheels.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Set python_requires to >=3.11 based on actual dependency floor

onnxruntime 1.24.1 only ships wheels for Python 3.11+, making that the true minimum. CI remains on 3.12 (runner default with full wheel coverage), but the package contract should reflect what users can actually install, not what CI happens to run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: James Verbus <james.verbus@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
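The `np.trapz` -> `np.trapezoid` rename mentioned above can be handled with simple feature detection when code has to run against both older and newer NumPy releases. The commit itself just switches to `np.trapezoid`; the helper below is a hypothetical sketch (the name `resolve_trapezoid` is not from the repository) and assumes the caller passes in the imported NumPy module.

```python
def resolve_trapezoid(np_module):
    """Return the trapezoidal-rule integrator exposed by the given NumPy module.

    Prefers `trapezoid` (the NumPy 2.x name) and falls back to the older
    `trapz` name. `resolve_trapezoid` is an illustrative helper, not part of
    isolation-forest-onnx.
    """
    fn = getattr(np_module, "trapezoid", None)
    if fn is None:
        # Older NumPy releases only provide the `trapz` spelling.
        fn = np_module.trapz
    return fn
```

A caller would typically write `import numpy as np` followed by `trapezoid = resolve_trapezoid(np)` once at import time and use `trapezoid(y, x)` afterwards. Since this package now requires numpy 2.x, calling `np.trapezoid` directly, as the commit does, is the simpler choice.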
Add eif heatmap plots to README.md (#82)

* Add Standard IF vs Extended IF score heatmap plots to README

Add synthetic 2D score heatmaps illustrating the axis-aligned bias of Standard Isolation Forest that Extended Isolation Forest eliminates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update heatmap plots with improved versions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update heatmap plots

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update heatmap plots

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add project description to isolation-forest-onnx PyPI package (#81)

Add a README.md for the isolation-forest-onnx package and configure setup.cfg to use it as the long description on PyPI. Include the README in MANIFEST.in so it is packaged in the source distribution.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Upgrade GitHub Actions for Node 24 compatibility (#80)

* Upgrade GitHub Actions for Node 24 compatibility

Signed-off-by: Salman Muin Kayser Chishti <13schishti@gmail.com>

* Add required distribution parameter to setup-java@v5

The upgrade to actions/setup-java@v5 requires the distribution input. Use Eclipse Temurin as the JDK distribution for all CI jobs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix Java version format for setup-java@v5

setup-java@v5 uses semver and does not recognize '1.8'. Use '8' instead, which matches the available Temurin distribution versions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: Salman Muin Kayser Chishti <13schishti@gmail.com>
Co-authored-by: Salman Muin Kayser Chishti <13schishti@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add Extended Isolation Forest (EIF) support to the Scala/Spark isolation-forest library. (#79)

* First working version of extended isolation forest training and scoring. Results look reasonable, but detailed correctness not yet verified.

* Updated rough draft code for EIF.

* Refactor Extended Isolation Forest for clearer logic more in line with Isolation Forest, improved docs, and parameter validation

- Renamed local variables in `ExtendedIsolationForest.scala` for clarity (`dataset` → `data`).
- Moved and refined parameter validation in `validateAndResolveParams`, logging chosen samples/features.
- Updated Javadoc-style comments in `ExtendedIsolationForest`, `ExtendedIsolationForestModel`, and related classes.
- Changed schema checks to use `VectorType` instead of `SQLDataTypes.VectorType`.
- Renamed and documented internal methods (e.g., `pathLengthInternal`) in `ExtendedIsolationTree`.
- Ensured consistent naming across `ExtendedIsolationForestModel` fields (e.g., `extendedIsolationTrees`).
- Cleaned up imports, minor style fixes, and removed commented-out debug prints.

There are still likely opportunities to factor out more shared logic into `core`.

* Got standard isolation forest R/W working after major refactor. Still a work in progress.

* Fixed package structure.

* WORK IN PROGRESS - Have prototype extended isolation forest read / write working with tests.

* Did linting for eif code.

* fix(EIF): align hyperplane split + path test with paper; correct intercept sampling, ≤ semantics, and degeneracy handling

- Sample normal in the selected subspace with up to (extensionLevel+1) non-zero coords; normalize and guard zero-norm.
- Sample intercept as point p by drawing each active coordinate uniformly from that node's data range; set offset = n·p.
- Use inclusive left-branch test x·n ≤ n·p in both training and scoring so the split predicate matches the paper.
- Treat minDot == maxDot or an empty partition as a leaf (stores numInstances); keeps trees well-formed.
- Compute dot against a full-length normal (zeros for unused coords) to match the (x − p)·n test.
- Minor: log message tweaks; one-pass min/max scan instead of materializing arrays; consistent ≤ in train/score.
- No change to model IO or public params.

* fix(EIF): retry degenerate hyperplane splits instead of premature leafing

Previously, a single failed split attempt (constant feature, all-same dot products, or empty partition) immediately produced a leaf node. This meant extensionLevel=0 was not equivalent to standard IF when the first randomly chosen feature happened to be constant. Now retries up to 50 times before falling back to a leaf.

* chore(EIF): remove dead code, unused imports, and fix test description typos

* fix(EIF): validate extensionLevel at fit time instead of silent clamping

Remove Int.MaxValue-1 sentinel default. If the user sets extensionLevel above numFeatures-1, throw immediately. If unset, default to numFeatures-1 (fully extended). The resolved value is persisted in the model rather than the sentinel.

* fix: fail fast on empty partition in shared tree training

Guard against dataForTree.head crash when a partition receives zero sampled points. Throws a clear IllegalStateException instead of a confusing NoSuchElementException.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use actual tree count instead of numEstimators param in scoring

Divide path length sum by the actual number of trees in the model rather than the $(numEstimators) parameter, preventing model/param drift from producing incorrect anomaly scores.

* test(EIF): replace toString tree comparison with structural equality check

Use recursive node-by-node comparison with epsilon tolerance for doubles instead of fragile toString matching.
* docs: add Extended Isolation Forest documentation to README

Add EIF section covering when to use it, the extensionLevel parameter and its interaction with maxFeatures, and a usage example. Call out that ONNX export is not supported for EIF. Add Hariri et al. 2018 to references.

* Added citation info to readme.

* fix(EIF): use strict < for hyperplane split and stop fit() from mutating estimator

Change the split criterion in ExtendedIsolationTree from <= to strict <, matching both the reference implementation (sahandha/eif) and our own standard IsolationTree. Affects tree building (partition) and scoring (path traversal).

Remove the set(extensionLevel, resolvedExtensionLevel) call in ExtendedIsolationForest.fit() that mutated the estimator. When extensionLevel was unset (defaulting to fully extended), the first fit() permanently set it, causing reuse on a dataset with fewer features to either fail validation or silently use the wrong level.

* fix(EIF): match reference implementation split semantics instead of retry loop

Remove bounded retry loop for degenerate splits. Instead, follow the EIF paper and reference implementation: allow empty partitions to become ExtendedExternalNode(0) leaves. Change split predicate from <= to strict < to match reference implementation's (x-p)·n < 0. Relax ExtendedExternalNode to accept numInstances >= 0.

* docs: update benchmarks with StandardIF, ExtendedIF_0, and ExtendedIF_max results

Replace the old IF-only benchmark table with comprehensive results across 13 datasets comparing all three model variants against Liu et al. and the reference Python EIF implementation.

* docs: update benchmark table with reference Python results for all 13 datasets

* fix(EIF): persist resolved extensionLevel on trained model

Set the resolved extensionLevel on the estimator before copyValues so it flows into the model's param map. Without this, models trained without explicitly calling setExtensionLevel() would lose the effective value on save/load.

Add test covering default resolution and round-trip persistence.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test(EIF): add pre-merge tests for zero-size leaves and ext=0 axis-aligned splits

Exercise the numInstances >= 0 semantics that became first-class EIF behavior when degenerate hyperplane splits were allowed to produce empty children.

New tests cover:
- ExtendedExternalNode(0) construction and subtreeDepth
- Path length through a zero-size leaf contributes avgPathLength(0) = 0
- Save/load round-trip preserves a tree containing a zero-size leaf
- extensionLevel=0 produces strictly axis-aligned normals (1 non-zero coordinate)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: soften benchmark claims and clarify EIF_0 vs StandardIF wording

- Use "closely matches" for ExtendedIF_max reference comparison
- Note mulcross as an open outlier in ExtendedIF_0 parity (12 of 13)
- Describe extensionLevel=0 as "uses axis-aligned splits" instead of "recovers standard axis-aligned splits"
- Frame low-dimensional underperformance as our benchmark observation, not a broad established finding from the paper

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test(EIF): enable saved model tree structure regression test

Uncomment savedExtendedIsolationForestModelTreeStructureTest and add the required resource files: a saved ExtendedIsolationForestModel and its expected first-tree toString golden file. This provides a regression guard against accidental changes to tree serialization or structure.

* What was done:

Extracted the duplicated validateAndResolveParams method into SharedTrainLogic (where the other shared training helpers already live). Both IsolationForest.scala and ExtendedIsolationForest.scala now call the single shared implementation, passing $(maxFeatures) and $(maxSamples) as arguments.

Files changed:
- core/SharedTrainLogic.scala: added validateAndResolveParams(dataset, maxFeatures, maxSamples) method and its ResolvedParams import
- IsolationForest.scala: removed private method, updated import and call site
- extended/ExtendedIsolationForest.scala: removed private method, updated import and call site

* refactor: extract duplicated transformSchema into Utils.validateAndTransformSchema

All four Estimator/Model classes had identical 15-line transformSchema overrides. Extract the shared logic into Utils and delegate with a one-liner in each class.

* chore(EIF): remove unused import, fix docstring, and align threshold comparison style

- Remove unused IsolationForestModel import from ExtendedIsolationForestModelReadWrite
- Fix reader docstring that incorrectly said "standard" instead of "Extended"
- Change `outlierScoreThreshold > 0.0` to `> 0` to match standard IF style

* test(EIF): add tests for L2-normalized normals, invalid extensionLevel, and intermediate levels

- Verify all hyperplane normals are L2-normalized across extension levels and seeds
- Verify extensionLevel > numFeatures - 1 throws IllegalArgumentException at fit time
- Verify intermediate extensionLevel values (1-4) train valid models with reasonable AUROC

* chore(EIF): fix redundant import and stale docstrings in ExtendedIsolationForestModel

- Remove unnecessary self-import of ExtendedIsolationForestModel in ReadWrite file
- Fix companion object and threshold comments that said "IsolationForestModel" instead of "ExtendedIsolationForestModel"

* fix: address EIF review findings and harden model edge cases

Resolve the review issues uncovered while comparing the extended isolation forest branch against master and the EIF reference implementation.
ExtendedIsolationForest
- stop mutating the estimator with a resolved default extensionLevel during fit()
- keep dataset-dependent extensionLevel resolution local to each fit and apply the resolved value only to the trained model
- add a regression test that reuses the same estimator across datasets with different feature dimensions to ensure default extensionLevel does not leak across fits

IsolationForestModel / ExtendedIsolationForestModel
- fail fast when transform() is called on an empty ensemble instead of dividing by zero and producing invalid scores
- keep scoring normalized by the actual loaded tree count, but guard the zero-tree case explicitly
- add transform-throws coverage for manually constructed empty standard and extended models
- preserve existing empty model write/read tests so persistence still round-trips this edge case correctly

Tests and style cleanup
- move ExtendedIsolationForestModelWriteReadTest into the com.linkedin.relevance.isolationforest.extended package so package names match file paths and the surrounding test suites
- restore the Spark-derived attribution header on the moved/copied read-write helpers
- align ExtendedIsolationForestModelReadWrite visibility with the rest of the package-private isolation forest internals

Verification
- ./gradlew -g /tmp/codex-gradle-home :isolation-forest:test
- ./gradlew -g /tmp/codex-gradle-home :isolation-forest:test --tests com.linkedin.relevance.isolationforest.extended.ExtendedIsolationForestTest
- ./gradlew -g /tmp/codex-gradle-home :isolation-forest:test --tests com.linkedin.relevance.isolationforest.IsolationForestModelWriteReadTest --tests com.linkedin.relevance.isolationforest.extended.ExtendedIsolationForestModelWriteReadTest
- ./gradlew -g /tmp/codex-gradle-home :isolation-forest:compileScala :isolation-forest:compileTestScala

* docs: refresh README for EIF and current build defaults

Update the README so the documented examples and version references match the current repo state and are copy-paste runnable.

README updates
- change the documented default Spark version from 3.5.1 to 3.5.5
- update the example build command to use the current default Spark/Scala combination
- replace stale hardcoded library and ONNX package versions with <latest-version> / <matching-version> placeholders
- switch the Gradle dependency example from deprecated `compile` to `implementation`
- add the missing `org.apache.spark.sql.functions.col` import to the Scala training example
- fix the training example text to refer to the `label` column instead of `labels`
- clarify the EIF `extensionLevel(5)` example comment so the dimensional assumption is explicit
- define `dataset_name` and `num_examples_to_print` in the ONNX Python inference example so the snippet is runnable as written
- remove the benchmark prose reference to a `LI IF` comparison column that is not present in the table

This is a documentation-only change.

* feat: add extended isolation forest with sparse hyperplane persistence

Add Extended Isolation Forest (EIF) support alongside the existing standard Isolation Forest implementation, and harden the standard/extended model persistence and scoring paths.

Extended Isolation Forest
- add ExtendedIsolationForest estimator, ExtendedIsolationForestModel, ExtendedIsolationForestParams, ExtendedIsolationTree, ExtendedNodes, and ExtendedUtils
- implement EIF training with extensionLevel-controlled random hyperplane splits based on the Hariri et al. algorithm
- resolve extensionLevel per fit without mutating estimator state
- support axis-aligned EIF (extensionLevel = 0) through fully extended EIF (extensionLevel = numFeatures - 1)

Sparse EIF model representation
- store hyperplanes sparsely as (indices, weights, offset) instead of dense per-node normal vectors
- canonicalize stored sparse coordinates by sorting feature indices before constructing SplitHyperplane
- use sparse dot products for tree traversal and add a direct Spark Vector scoring path so EIF scoring benefits from sparsity end to end
- enforce sparse hyperplane invariants: non-empty, length-matched, non-negative, distinct, sorted indices

Persistence and read/write refactor
- move standard model read/write into a top-level IsolationForestModelReadWrite implementation
- add shared metadata helpers in IsolationForestModelReadWriteUtils
- add sparse EIF model read/write support and checked-in EIF persistence fixtures
- preserve standard-model backward compatibility when loading older saved models that do not contain totalNumFeatures metadata, logging that dimension validation is unavailable for those legacy models

Model/scoring hardening
- reject numSamples values that resolve to fewer than 2 samples during training
- fail fast when transform() is called on empty standard or extended models
- store totalNumFeatures in newly saved models and validate scoring input dimension when that training dimension is known
- keep standard IF backward compatibility by restoring the legacy public 4-arg IsolationForestModel constructor while making the richer internal constructor package-private
- restrict the extended model constructor to package-private use so totalNumFeatures remains internal to fit/load/copy flows

Tests
- add comprehensive EIF estimator, tree, sparse-hyperplane, and write/read tests
- add regression coverage for repeated EIF fits, empty-model scoring guards, numSamples >= 2 enforcement, scoring-time feature dimension validation, standard legacy metadata loading, and standard legacy constructor behavior
- update saved model metadata/tree-structure fixtures for the new extended persistence format and formatting changes

Documentation
- refresh README dependency/version examples and fix copy-paste issues in the Scala and ONNX examples
- add EIF usage and persistence examples
- document benchmark results for standard IF vs EIF variants
- fix benchmark/doc typos and soften the benchmark agreement statement to avoid overstating row-by-row verification

Verification
- ./gradlew -g /tmp/codex-gradle-home :isolation-forest:test

* Updated readme.

* docs: update README benchmark table and references

- Apply rounding to all value ± error pairs (1 sig fig on error, 2 if leading digit is 1)
- Move Ref Python results from StandardIF to ExtendedIF_0 rows since the reference Python EIF at ext=0 is not a true standard IF
- Add DOI to EIF paper reference and add reference Python eif repo
- Clarify column headers (Liu et al., Ref. Python with IF/EIF labels)
- Simplify key observations and fix overstated dimensionality claim
- Minor wording improvements throughout

* Added scroll to results table.

* updated readme

1. Non-breaking spaces around ±: replaced the regular spaces around ± with non-breaking spaces in all value cells so values like 0.813 ± 0.004 won't wrap mid-value.
2. Dashes in empty cells: all empty reference cells now show - instead of blank:
   - StandardIF rows: - in both Ref. Python columns
   - ExtendedIF rows: - in the Liu et al. column

* Updated readme.

* fix(EIF): use float-precision hyperplane weights for Spark 4.x Avro compatibility

Spark 4.x's Avro encoder silently demotes Array[Double] elements to float (32-bit) precision during serialization, while scalar Double fields survive intact. This caused all five EIF model write/read tests to fail on Spark 4.0.1 and 4.1.1, with weight mismatches at ~1e-8 (the exact double→float→double precision boundary).

The fix changes SplitHyperplane weights from Array[Double] to Array[Float].
This is the correct design from first principles:
- Features are already Array[Float] (DataPoint.features)
- Weights define the hyperplane *direction* (analogous to the feature index in standard IF, which is just an Int)
- The offset defines *where* to split and remains Double (analogous to splitValue in standard IF, which is Double for the same reason)
- The dot product is accumulated in Double regardless of operand type
- The split comparison (dot < offset) is always Double vs Double

Weights are converted to float after normalization but before computing the offset, so training and scoring are consistent. Benchmarks confirm the change is invisible after rounding: only one value across all 13 datasets changed (breastw ExtendedIF_max AUPRC: 0.9568 → 0.9569, well within the ±0.0015 error bar).

Production code:
- ExtendedUtils.scala: SplitHyperplane.weights Array[Double] → Array[Float]
- ExtendedIsolationTree.scala: normalize to float before offset computation
- ExtendedIsolationForestModelReadWrite.scala: ExtendedNodeData.weights and NullWeights updated to float

Test code:
- ExtendedIsolationTreeTest.scala: float literals, L2 norm tolerance widened from 1e-10 to 1e-6 (appropriate for float precision)
- ExtendedIsolationForestModelWriteReadTest.scala: float literals, added disabled regenerateGoldenExtendedModel helper
- Regenerated golden model and expected tree structure

README:
- Updated breastw ExtendedIF_max AUPRC from 0.9568 to 0.9569

Verified all 67 tests pass on Spark 3.5.5, 4.0.1, and 4.1.1.

* docs: address Copilot review feedback on PR #79

Copilot review responses:

1. Grammar fix (accepted): "some dataset" → "some datasets" in README benchmark observations.
2. .toFloat cast comment (accepted): Added clarifying comment explaining why features(indices(i)).toFloat is intentional: it matches the DataPoint (Array[Float]) precision used during training, ensuring scoring consistency between the DataPoint and Vector code paths.
3. Shuffle optimization (declined): Copilot suggested replacing Random.shuffle + take with reservoir sampling. The shuffle operates on a tiny array (≤ dim features, typically < 100 elements) once per tree node during training, so it is not a hot path. Readability outweighs the micro-optimization.
4. outlierScoreThreshold > 0 sentinel (declined): Copilot noted that threshold=0.0 would be treated as "unset". Technically correct, but this mirrors the existing standard IF pattern identically. A threshold of 0.0 (label everything as outlier) is not a practical use case. Fixing it properly requires changing both IF and EIF together in a separate PR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
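The sparse hyperplane design described in this commit (sorted `indices`, float-precision `weights`, a double-precision `offset`, and a strict `<` split test) can be illustrated with a small Python sketch. This is not the library's Scala code: `SplitHyperplane` and its three fields are named after the commit message, while `to_float32`, `make_hyperplane`, `dot`, and `goes_left` are hypothetical names chosen for illustration, and the exact rounding points are assumptions based on the description above.

```python
import math
import struct
from dataclasses import dataclass
from typing import Sequence, Tuple


def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


@dataclass(frozen=True)
class SplitHyperplane:
    """Sparse hyperplane: only the active coordinates of the normal are stored."""

    indices: Tuple[int, ...]    # feature indices: non-negative, distinct, sorted
    weights: Tuple[float, ...]  # normal coordinates, rounded to float32
    offset: float               # n . p, kept in double precision

    def __post_init__(self) -> None:
        # Invariants listed in the commit: non-empty, length-matched,
        # non-negative, distinct, sorted indices.
        if not self.indices or len(self.indices) != len(self.weights):
            raise ValueError("indices and weights must be non-empty and equal length")
        if self.indices[0] < 0:
            raise ValueError("indices must be non-negative")
        if any(a >= b for a, b in zip(self.indices, self.indices[1:])):
            raise ValueError("indices must be distinct and sorted")

    def dot(self, features: Sequence[float]) -> float:
        """Sparse dot product over the active coordinates, accumulated in double."""
        return sum(w * to_float32(features[i]) for i, w in zip(self.indices, self.weights))

    def goes_left(self, features: Sequence[float]) -> bool:
        """Strict '<' split test: a point exactly on the hyperplane goes right."""
        return self.dot(features) < self.offset


def make_hyperplane(
    indices: Sequence[int], raw_normal: Sequence[float], point: Sequence[float]
) -> SplitHyperplane:
    """Build a hyperplane from an unnormalized normal and an intercept point p."""
    norm = math.sqrt(sum(v * v for v in raw_normal))
    if norm == 0.0:
        raise ValueError("zero-norm normal vector")
    # Canonicalize by sorting on feature index.
    triples = sorted(zip(indices, raw_normal, point))
    idx = tuple(i for i, _, _ in triples)
    # Normalize, then round to float32 *before* computing the offset, so the
    # offset is consistent with the weights that will be stored and scored.
    weights = tuple(to_float32(v / norm) for _, v, _ in triples)
    offset = sum(w * p for w, (_, _, p) in zip(weights, triples))
    return SplitHyperplane(idx, weights, offset)
```

For instance, `make_hyperplane([2, 0], [3.0, 4.0], [1.0, 2.0])` stores the indices as `(0, 2)`, and a point whose active coordinates equal the intercept point lands exactly on the hyperplane, so the strict test sends it to the right branch. Keeping the offset in double while the weights are float mirrors the rationale given in the commit: the weights only fix the direction, whereas the offset fixes where the split falls.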
Bump protobuf from 5.29.5 to 5.29.6 in /isolation-forest-onnx (#78)

Bumps [protobuf](https://github.com/protocolbuffers/protobuf) from 5.29.5 to 5.29.6.
- [Release notes](https://github.com/protocolbuffers/protobuf/releases)
- [Commits](https://github.com/protocolbuffers/protobuf/commits)

---
updated-dependencies:
- dependency-name: protobuf
  dependency-version: 5.29.6
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bump wheel from 0.38.1 to 0.46.2 in /isolation-forest-onnx (#75)

Bumps [wheel](https://github.com/pypa/wheel) from 0.38.1 to 0.46.2.
- [Release notes](https://github.com/pypa/wheel/releases)
- [Changelog](https://github.com/pypa/wheel/blob/main/docs/news.rst)
- [Commits](pypa/wheel@0.38.1...0.46.2)

---
updated-dependencies:
- dependency-name: wheel
  dependency-version: 0.46.2
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>