121 commits
c54e18e
feat: isolation forest based has_no_anomaly
vb-dbrks Dec 8, 2025
b41ead1
feat: changed to using sklearn
vb-dbrks Dec 8, 2025
0f7b1c4
tests: int tests
vb-dbrks Dec 8, 2025
472cc7e
demo: demo notebook updated
vb-dbrks Dec 8, 2025
1dc4c6b
bug fixes on Algorithm naming, feature importance, drift warning etc
vb-dbrks Dec 9, 2025
172fef4
feat: progressive api for setup complexity
vb-dbrks Dec 9, 2025
0214d88
fix: tests
vb-dbrks Dec 10, 2025
3fada9b
fix: explainability and tests
vb-dbrks Dec 10, 2025
aa79dda
bug: model drift detection and yaml check tests, model name and uri i…
vb-dbrks Dec 11, 2025
830c8b5
fix: integration tests with feature scaling
vb-dbrks Dec 12, 2025
30029cc
feat: feature scaling with robust scaler for long tails
vb-dbrks Dec 12, 2025
bce81fa
fix: integration tests
vb-dbrks Dec 12, 2025
352b198
feat: mlflow model registration updates
vb-dbrks Dec 12, 2025
a8d2b6f
fix: merg_cols added back
vb-dbrks Dec 12, 2025
e1e3bb3
feat: auto-discovery (heuristics and profile based)
vb-dbrks Dec 18, 2025
76db51c
fix: column selection and model baseline statistics
vb-dbrks Dec 18, 2025
dba8860
cleanup
vb-dbrks Dec 18, 2025
0bed8a0
fmt
vb-dbrks Dec 18, 2025
3aa27c5
fmt
vb-dbrks Dec 18, 2025
b2e3149
fmt
vb-dbrks Dec 18, 2025
04a272c
fmt
vb-dbrks Dec 19, 2025
9af5be0
refactor: helper functions, dead code, formatting
vb-dbrks Dec 22, 2025
d8096a9
fmt: fixed a ton of formatting issues
vb-dbrks Dec 22, 2025
788b676
fix: merge_columns are always provided and validated
vb-dbrks Dec 22, 2025
7eaf71e
bug fixes, optimisation, code formatting, documentation, tests
vb-dbrks Dec 23, 2025
55b712f
bug fix: model registry
vb-dbrks Dec 23, 2025
3646c1d
performance improvements for demp and documentation
vb-dbrks Dec 23, 2025
f91c7eb
demo: bug fixes
vb-dbrks Dec 23, 2025
f21e397
feat: improve pk and fk key identification
vb-dbrks Dec 23, 2025
d74ba0b
fix: collision on model registry table + test fixes
vb-dbrks Jan 2, 2026
2ad4354
fix+fmt: demo, tests and code fomatting
vb-dbrks Jan 2, 2026
22b0161
fix: integration tests, fixtures
vb-dbrks Jan 5, 2026
c1e01ba
fix: integration tests and demo
vb-dbrks Jan 6, 2026
90d9416
refactoring: code complexity.
vb-dbrks Jan 6, 2026
61facae
fix: mlflow artifact path to name change and demo updates, integratio…
vb-dbrks Jan 6, 2026
b528f21
fix: mlflow warnings and deprecation
vb-dbrks Jan 6, 2026
1db8c4f
fix: demo
vb-dbrks Jan 6, 2026
598b830
feat: improve message and feature contributions + _info column renami…
vb-dbrks Jan 6, 2026
e163607
feat: update anomaly detection examples and enhance documentation and…
vb-dbrks Jan 7, 2026
b0cba9a
feat: enhance anomaly detection configuration and auto-discovery logi…
vb-dbrks Jan 7, 2026
70cec2d
cleanup: removed nightly markers and makefile changes
vb-dbrks Jan 7, 2026
a2743ef
fix and test improvements: improvements to shared fixtures
vb-dbrks Jan 7, 2026
e898209
chore: enhance anomaly detection documentation with detailed drift de…
vb-dbrks Jan 7, 2026
2450aaa
feat: update anomaly detection demos and documentation, enhance model…
vb-dbrks Jan 8, 2026
3d757b7
feat: update anomaly detection demo and documentation, adjust default…
vb-dbrks Jan 8, 2026
4d4a743
fix: update anomaly detection demo with improved markdown formatting …
vb-dbrks Jan 8, 2026
2837586
fix and docs: ci and docs issue + unit tests
vb-dbrks Jan 8, 2026
d6cec1b
fmt: minor import for mlflow issue
vb-dbrks Jan 8, 2026
5b94b06
fmt: removing overrides
vb-dbrks Jan 8, 2026
91cab17
feat: add telemetry logging for anomaly detection checks and training…
vb-dbrks Jan 8, 2026
269abba
fix: remove dataset rule registration decorator from anomaly check fu…
vb-dbrks Jan 8, 2026
23574b7
fix: update exception handling to catch all exceptions for anomaly de…
vb-dbrks Jan 8, 2026
08cbefa
feat: integrate Azure authentication and environment variable setup f…
vb-dbrks Jan 8, 2026
5df0360
feat: remove outdated anomaly detection demo notebook and add corresp…
vb-dbrks Jan 8, 2026
37fcef5
fmt
vb-dbrks Jan 8, 2026
ce05754
test: fixes to int and e2e
vb-dbrks Jan 9, 2026
5ced589
test fix for int tests
vb-dbrks Jan 9, 2026
732cd24
Merge branch 'main' into 957-ml-has_no_anomaly
mwojtyczka Jan 9, 2026
71cfc9b
chore: update dependencies and enhance MLflow configuration for anoma…
vb-dbrks Jan 9, 2026
d3144bd
Merge branch 'main' into 957-ml-has_no_anomaly
mwojtyczka Jan 9, 2026
e147e84
fix: update MLFLOW_TRACKING_URI in CI workflows and enhance feature e…
vb-dbrks Jan 9, 2026
29fae5b
Merge branch '957-ml-has_no_anomaly' of https://github.com/databricks…
vb-dbrks Jan 9, 2026
4620ce6
fmt
vb-dbrks Jan 9, 2026
59303e7
refactor: Set MLFLOW_TRACKING_URI and MLFLOW_REGISTRY_URI directly in…
vb-dbrks Jan 14, 2026
ad0f9e9
chore: update anomaly dependency and add azure-cli auth type to CI wo…
vb-dbrks Jan 14, 2026
eac5842
chore: Azure authentication in CI workflows
vb-dbrks Jan 14, 2026
a67907d
test: checking a different approach
vb-dbrks Jan 14, 2026
bcc6f82
trial 2: minimal
vb-dbrks Jan 14, 2026
bf3da2b
test3
vb-dbrks Jan 14, 2026
af19080
test4: DATABRICKS_HOST env variable set
vb-dbrks Jan 14, 2026
361fa35
fix: integration test config
vb-dbrks Jan 15, 2026
211d2cf
fix: testing workaround for mlflow with dummy databrickscfg
vb-dbrks Jan 15, 2026
de9cd03
fix: mlflow experiment_id
vb-dbrks Jan 15, 2026
416a77f
fix: create schema fixture for anomaly tests
vb-dbrks Jan 16, 2026
5f661b0
fix: udf self contained for datbaricks-connect
vb-dbrks Jan 16, 2026
430084d
fmt: code formatting fixes and pytest ignore for files for udfs
vb-dbrks Jan 16, 2026
9407ae1
fix: Correct SHAP value calculation for single-feature models and ref…
vb-dbrks Jan 16, 2026
9caab66
fix+feat: anomaly test split and simplification
vb-dbrks Jan 16, 2026
54d70de
fix: dorny/pathfilter to git diff
vb-dbrks Jan 16, 2026
5f4e9e8
fix: pytest filters and mlflow fixture
vb-dbrks Jan 16, 2026
de596f0
refactor: delay TreeSHAP explainer creation until needed in anomaly c…
vb-dbrks Jan 16, 2026
c50175b
fix: glob pattern
vb-dbrks Jan 16, 2026
3bd00b0
refactor: enhance anomaly detection UDFs with improved SHAP contribut…
vb-dbrks Jan 16, 2026
334a511
refactor: streamline SHAP value computation in anomaly detection UDFs…
vb-dbrks Jan 16, 2026
b36031b
refactor: change spark fixture to session-scoped for improved paralle…
vb-dbrks Jan 16, 2026
caf46f4
refactor: enhance anomaly detection profiling and training functions …
vb-dbrks Jan 19, 2026
dd2dc4b
refactor: improve column shuffling logic in anomaly trainer by introd…
vb-dbrks Jan 19, 2026
04bdbbf
debug
vb-dbrks Jan 19, 2026
b0ed20b
debug
vb-dbrks Jan 19, 2026
1ee0796
mlflow debug
vb-dbrks Jan 19, 2026
fce341e
debug mlflow xdist pytest n 0
vb-dbrks Jan 19, 2026
0faa67f
refactor: Inject WorkspaceClient as a fixture argument into anomaly t…
vb-dbrks Jan 19, 2026
1ff4831
fix: databricks-sdk to install dependencies for anomal detection
vb-dbrks Jan 20, 2026
c49ac53
debug: pytest_collection_modifyitems for preinstallation step
vb-dbrks Jan 20, 2026
4bd6061
fmt
vb-dbrks Jan 20, 2026
cca504f
fmt
vb-dbrks Jan 20, 2026
1b578f7
xdist controller only install
vb-dbrks Jan 20, 2026
28f8f37
fix: re-enable all but integration_anomaly
vb-dbrks Jan 20, 2026
7dea5e7
fix: driver_only mode for databricks connect compatibility
vb-dbrks Jan 21, 2026
9785215
fix: flaky tests
vb-dbrks Jan 21, 2026
1f81a90
Merge branch 'main' into 957-ml-has_no_anomaly
mwojtyczka Jan 21, 2026
56920d8
Update demos/dqx_anomaly_detection_101_demo.py
vb-dbrks Jan 21, 2026
543bcb9
fix: improve handling of empty segments and is_not_null errors in ano…
vb-dbrks Jan 21, 2026
18d8231
fix:max_by instead of first to join back the _info
vb-dbrks Jan 21, 2026
605f08a
refactor: enhance anomaly testing framework and CI configuration
vb-dbrks Jan 21, 2026
43f013a
chore: update anomaly workflow and refactor test fixtures
vb-dbrks Jan 21, 2026
e0a03e4
refactor: remove HAS_ANOMALY checks from test fixtures
vb-dbrks Jan 21, 2026
2e9dfad
refactor: streamline anomaly engine fixture and clean up integration …
vb-dbrks Jan 21, 2026
b23d533
refactor: improve MLflow tracking configuration in integration tests
vb-dbrks Jan 21, 2026
3df1196
refactor: reorganize anomaly test structure and update configurations
vb-dbrks Jan 21, 2026
a546aa1
debug
vb-dbrks Jan 21, 2026
c47cc67
feat: add DATABRICKS_WAREHOUSE_ID to GitHub Actions workflow
vb-dbrks Jan 21, 2026
dbad305
refactor: enhance anomaly detection workflow and documentation
vb-dbrks Jan 21, 2026
b446f7a
fix: demo
vb-dbrks Jan 21, 2026
7686342
test fixes
vb-dbrks Jan 21, 2026
a6af046
refactor: enhance anomaly detection demo and improve documentation
vb-dbrks Jan 21, 2026
d798c64
refactor: enhance dataframe equality assertions in integration tests
vb-dbrks Jan 21, 2026
09c1fc5
refactor: update anomaly detection parameters and enhance demo
vb-dbrks Jan 21, 2026
27b7abe
refactor: pytest:numba compatilbity root conftest.py removed. adding …
vb-dbrks Jan 22, 2026
adef56f
fmt
vb-dbrks Jan 22, 2026
c3bbcbd
refactor: enhance GitHub Actions and local testing for DQX library re…
vb-dbrks Jan 22, 2026
6 changes: 4 additions & 2 deletions .github/workflows/acceptance.yml
@@ -70,6 +70,7 @@ jobs:
ARM_CLIENT_ID: ${{ secrets.ARM_CLIENT_ID }}
ARM_TENANT_ID: ${{ secrets.ARM_TENANT_ID }}
COVERAGE_FILE: ${{ github.workspace }}/.coverage # make sure the coverage report is preserved
+ PYTEST_ADDOPTS: "--ignore=tests/integration_anomaly/" # Exclude anomaly tests (run separately in anomaly workflow)

- name: Merge coverage reports and convert them to XML
run: |
@@ -129,6 +130,7 @@ jobs:
ARM_TENANT_ID: ${{ secrets.ARM_TENANT_ID }}
DATABRICKS_SERVERLESS_COMPUTE_ID: ${{ env.DATABRICKS_SERVERLESS_COMPUTE_ID }}
COVERAGE_FILE: ${{ github.workspace }}/.coverage # make sure the coverage report is preserved
+ PYTEST_ADDOPTS: "--ignore=tests/integration_anomaly/" # Exclude anomaly tests (run separately in anomaly workflow)

- name: Merge coverage reports and convert them to XML
run: |
@@ -189,7 +191,7 @@ jobs:
timeout: 2h
codegen_path: tests/e2e/.codegen.json
env:
- REF_NAME: ${{ github.ref_name }} # NOTE: end-to-end tests use this to pip install from the current PR branch
+ REF_NAME: ${{ github.head_ref || github.ref_name }} # NOTE: end-to-end tests use this to pip install from the current PR branch (head_ref for PRs, ref_name for push events)

Review comment (Contributor): why do we need this change?

GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
ARM_CLIENT_ID: ${{ secrets.ARM_CLIENT_ID }}
ARM_TENANT_ID: ${{ secrets.ARM_TENANT_ID }}
@@ -245,7 +247,7 @@ jobs:
timeout: 2h
codegen_path: tests/e2e/.codegen.json
env:
- REF_NAME: ${{ github.ref_name }} # NOTE: end-to-end tests use this to pip install from the current PR branch
+ REF_NAME: ${{ github.head_ref || github.ref_name }} # NOTE: end-to-end tests use this to pip install from the current PR branch (head_ref for PRs, ref_name for push events)

Review comment (Contributor): why do we need this change?

GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
ARM_CLIENT_ID: ${{ secrets.ARM_CLIENT_ID }}
ARM_TENANT_ID: ${{ secrets.ARM_TENANT_ID }}
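
A possible answer to the review question on `REF_NAME`, offered as a reading of the diff rather than the author's stated reason: on `pull_request` events `github.ref_name` resolves to the merge ref (of the form `<pr_number>/merge`), which is generally not the PR branch name, while `github.head_ref` holds the source branch and is empty on push events. The sketch below mimics the `||` fallback in Python. The function name `resolve_ref_name` and the PR number `1234` are hypothetical and used only for illustration.

```python
def resolve_ref_name(head_ref: str, ref_name: str) -> str:
    """Mimic the GitHub Actions expression `github.head_ref || github.ref_name`."""
    # `||` yields the first truthy operand; an empty string counts as falsy.
    return head_ref or ref_name


# Pull request event: head_ref is the source branch, ref_name is the merge ref.
# The PR number 1234 is a made-up value for illustration.
print(resolve_ref_name("957-ml-has_no_anomaly", "1234/merge"))

# Push event to main: head_ref is empty, so ref_name is used.
print(resolve_ref_name("", "main"))
```

With the old expression, a PR run would have passed the merge ref to the end-to-end tests instead of the branch they are meant to `pip install` from.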
209 changes: 209 additions & 0 deletions .github/workflows/anomaly.yml
@@ -0,0 +1,209 @@
name: anomaly

on:
  pull_request:
    types: [ opened, synchronize, ready_for_review ]
  merge_group:
    types: [ checks_requested ]
  push:
    branches:
      - main

permissions:
  id-token: write
  contents: read
  pull-requests: write

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: false # don't cancel ongoing runs to ensure fixtures are completed and resources terminated

jobs:
  anomaly-tests:
    if: github.event_name == 'pull_request' && !github.event.pull_request.draft && !github.event.pull_request.head.repo.fork
    environment: tool
    runs-on: larger
    steps:
      - name: Checkout Code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Install Python
        uses: actions/setup-python@v5
        with:
          cache: 'pip'
          cache-dependency-path: '**/pyproject.toml'
          python-version: '3.12'

      - name: Install hatch
        run: pip install hatch==1.15.0

      # Anomaly tests are run from within tests/integration_anomaly folder.
      # Create .coveragerc with correct relative path to source code.
      - name: Prepare code coverage configuration for anomaly tests
        run: |
          cat > tests/integration_anomaly/.coveragerc << EOF
          [run]
          source = ../../../src
          parallel = true
          relative_files = true
          EOF

      - name: Azure login (OIDC)
        uses: azure/login@v2
        with:
          client-id: ${{ secrets.ARM_CLIENT_ID }}
          tenant-id: ${{ secrets.ARM_TENANT_ID }}
          allow-no-subscriptions: true

      - name: Set env vars for Azure CLI auth + MLflow
        shell: bash
        run: |
          val=$(az keyvault secret show --id "${{ secrets.VAULT_URI }}/secrets/DATABRICKS-HOST" --query value -o tsv)
          # Ensure host has https:// prefix (SDK and MLflow expect full URL)
          if [[ ! "$val" =~ ^https?:// ]]; then
            val="https://$val"
          fi
          # Workaround for MLflow OIDC auth: MLflow requires a profile to exist even when it uses SDK auth.
          dummy_profile="${RUNNER_TEMP}/databricks_profile"
          cat > "$dummy_profile" << EOF
          [DEFAULT]
          host = $val
          token = dummy
          EOF
          echo "DATABRICKS_HOST=$val" >> $GITHUB_ENV
          # Set cluster ID without printing to logs
          echo "DATABRICKS_CLUSTER_ID=$(az keyvault secret show --id "${{ secrets.VAULT_URI }}/secrets/DATABRICKS-CLUSTER-ID" --query value -o tsv)" >> $GITHUB_ENV
          # Set warehouse ID without printing to logs
          echo "DATABRICKS_WAREHOUSE_ID=$(az keyvault secret show --id "${{ secrets.VAULT_URI }}/secrets/DATABRICKS-WAREHOUSE-ID" --query value -o tsv)" >> $GITHUB_ENV
          echo "DATABRICKS_AUTH_TYPE=azure-cli" >> $GITHUB_ENV
          echo "DATABRICKS_CONFIG_FILE=$dummy_profile" >> $GITHUB_ENV

          # MLflow: Use databricks scheme so MLflow uses SDK auth (with dummy profile present).
          echo "MLFLOW_ENABLE_DB_SDK=true" >> $GITHUB_ENV
          echo "MLFLOW_TRACKING_URI=databricks" >> $GITHUB_ENV
          echo "MLFLOW_REGISTRY_URI=databricks-uc" >> $GITHUB_ENV

      - name: Run anomaly integration tests and generate test coverage report
        timeout-minutes: 120
        env:
          COVERAGE_FILE: ${{ github.workspace }}/.coverage
          DATABRICKS_HOST: ${{ env.DATABRICKS_HOST }}
          DATABRICKS_CLUSTER_ID: ${{ env.DATABRICKS_CLUSTER_ID }}
          DATABRICKS_WAREHOUSE_ID: ${{ env.DATABRICKS_WAREHOUSE_ID }}
          DATABRICKS_AUTH_TYPE: ${{ env.DATABRICKS_AUTH_TYPE }}
          DATABRICKS_CONFIG_FILE: ${{ env.DATABRICKS_CONFIG_FILE }}
          MLFLOW_ENABLE_DB_SDK: "true"
          MLFLOW_TRACKING_URI: ${{ env.MLFLOW_TRACKING_URI }}
          MLFLOW_REGISTRY_URI: "databricks-uc"
          MLFLOW_HTTP_REQUEST_TIMEOUT: "600"
          MLFLOW_HTTP_REQUEST_MAX_RETRIES: "10"
        run: |
          hatch run pytest tests/integration_anomaly/ -v -rs -n 10 --cov --cov-report=xml --timeout=1200 --reruns 2 --reruns-delay 5

      - name: Merge coverage reports and convert them to XML
        if: ${{ false }} # disabled temporarily
        run: |
          hatch run combine_coverage

      # Recursively search the entire workspace directory for all coverage reports.
      # All uploaded test coverage reports will be used even if publish is done multiple time.
      - name: Publish test coverage
        if: ${{ false }} # disabled temporarily
        uses: codecov/codecov-action@v5
        with:
          use_oidc: true

  anomaly-tests-serverless:
    if: github.event_name == 'pull_request' && !github.event.pull_request.draft && !github.event.pull_request.head.repo.fork
    environment: tool
    runs-on: larger
    steps:
      - name: Checkout Code
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Install Python
        uses: actions/setup-python@v5
        with:
          cache: 'pip'
          cache-dependency-path: '**/pyproject.toml'
          python-version: '3.12'

      - name: Install hatch
        run: pip install hatch==1.15.0

      # Anomaly tests are run from within tests/integration_anomaly folder.
      # Create .coveragerc with correct relative path to source code.
      - name: Prepare code coverage configuration for anomaly tests
        run: |
          cat > tests/integration_anomaly/.coveragerc << EOF
          [run]
          source = ../../../src
          parallel = true
          relative_files = true
          EOF

      - name: Azure login (OIDC)
        uses: azure/login@v2
        with:
          client-id: ${{ secrets.ARM_CLIENT_ID }}
          tenant-id: ${{ secrets.ARM_TENANT_ID }}
          allow-no-subscriptions: true

      - name: Set env vars for Azure CLI auth + MLflow
        shell: bash
        run: |
          val=$(az keyvault secret show --id "${{ secrets.VAULT_URI }}/secrets/DATABRICKS-HOST" --query value -o tsv)
          # Ensure host has https:// prefix (SDK and MLflow expect full URL)
          if [[ ! "$val" =~ ^https?:// ]]; then
            val="https://$val"
          fi
          # Workaround for MLflow OIDC auth: MLflow requires a profile to exist even when it uses SDK auth.
          dummy_profile="${RUNNER_TEMP}/databricks_profile"
          cat > "$dummy_profile" << EOF
          [DEFAULT]
          host = $val
          token = dummy
          EOF
          echo "DATABRICKS_HOST=$val" >> $GITHUB_ENV
          echo "DATABRICKS_AUTH_TYPE=azure-cli" >> $GITHUB_ENV
          echo "DATABRICKS_CONFIG_FILE=$dummy_profile" >> $GITHUB_ENV
          # Set warehouse ID without printing to logs
          echo "DATABRICKS_WAREHOUSE_ID=$(az keyvault secret show --id "${{ secrets.VAULT_URI }}/secrets/DATABRICKS-WAREHOUSE-ID" --query value -o tsv)" >> $GITHUB_ENV

          # MLflow: Use databricks scheme so MLflow uses SDK auth (with dummy profile present).
          echo "MLFLOW_ENABLE_DB_SDK=true" >> $GITHUB_ENV
          echo "MLFLOW_TRACKING_URI=databricks" >> $GITHUB_ENV
          echo "MLFLOW_REGISTRY_URI=databricks-uc" >> $GITHUB_ENV

      - name: Run anomaly integration tests on serverless cluster
        timeout-minutes: 120
        env:
          COVERAGE_FILE: ${{ github.workspace }}/.coverage
          DATABRICKS_SERVERLESS_COMPUTE_ID: auto
          DATABRICKS_HOST: ${{ env.DATABRICKS_HOST }}
          DATABRICKS_WAREHOUSE_ID: ${{ env.DATABRICKS_WAREHOUSE_ID }}
          DATABRICKS_AUTH_TYPE: ${{ env.DATABRICKS_AUTH_TYPE }}
          DATABRICKS_CONFIG_FILE: ${{ env.DATABRICKS_CONFIG_FILE }}
          MLFLOW_ENABLE_DB_SDK: "true"
          MLFLOW_TRACKING_URI: ${{ env.MLFLOW_TRACKING_URI }}
          MLFLOW_REGISTRY_URI: "databricks-uc"
          MLFLOW_HTTP_REQUEST_TIMEOUT: "600"
          MLFLOW_HTTP_REQUEST_MAX_RETRIES: "10"
        run: |
          hatch run pytest tests/integration_anomaly/ -v -rs -n 10 --cov --cov-report=xml --timeout=1200 --reruns 2 --reruns-delay 5

      - name: Merge coverage reports and convert them to XML
        if: ${{ false }} # disabled temporarily
        run: |
          hatch run combine_coverage

      # collects all coverage reports
      - name: Publish test coverage
        if: ${{ false }} # disabled temporarily
        uses: codecov/codecov-action@v5
        with:
          use_oidc: true
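
For anyone reproducing the `Set env vars for Azure CLI auth + MLflow` step outside CI, the host normalization and placeholder profile can be sketched in Python as below. This is a minimal illustration of the bash logic in the workflow, not code from the PR: the function names `normalize_host` and `render_dummy_profile` and the sample host are hypothetical. The regular expression is case-sensitive, matching the bash `=~ ^https?://` test.

```python
import re
import tempfile
from pathlib import Path


def normalize_host(host: str) -> str:
    """Prefix the host with https:// unless it already has an http(s) scheme."""
    if re.match(r"^https?://", host):
        return host
    return f"https://{host}"


def render_dummy_profile(host: str) -> str:
    """Render the placeholder profile; the token value is intentionally fake."""
    return f"[DEFAULT]\nhost = {host}\ntoken = dummy\n"


# Hypothetical host value; in CI it comes from Azure Key Vault.
host = normalize_host("adb-123.azuredatabricks.net")

# Write the profile to a temporary directory, as the workflow does under RUNNER_TEMP.
with tempfile.TemporaryDirectory() as tmp:
    profile_path = Path(tmp) / "databricks_profile"
    profile_path.write_text(render_dummy_profile(host))
    print(profile_path.read_text())
```

Pointing `DATABRICKS_CONFIG_FILE` at such a file reproduces the workaround described in the workflow comment, where the profile only needs to exist and authentication is handled through `DATABRICKS_AUTH_TYPE=azure-cli`.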
9 changes: 9 additions & 0 deletions .gitignore
@@ -58,6 +58,15 @@ coverage-integration.xml
.pytest_cache/
cover/

# MLflow
mlruns/
mlartifacts/
mlflow.db
mlflow.db-*

# Test output logs
test_*.log

# Translations
*.mo
*.pot