CRITICAL: Data leakage in scaler fitting #13

@iAmGiG

Problem

The scaler is fit on both the training and the validation data, so validation statistics leak into training (data leakage).

Affected files:

  • anomaly-detection/train_og.py:29
  • anomaly-detection/test.py:40

Current code:

scaler.fit(x_train.append(x_opt))

Issue: The scaler learns mean/std statistics from the validation data (x_opt), which it should not have access to during training. Validation scores computed after this scaling are no longer an independent estimate and can be inflated.

Correct approach:

scaler.fit(x_train)  # Only fit on training data
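
A minimal sketch of the difference, assuming a mean/std scaler as described above. To stay dependency-free it uses a small hypothetical stand-in class (SimpleStandardScaler) and made-up numbers rather than the project's actual scaler or data; only the names x_train and x_opt come from this issue. With scikit-learn's StandardScaler the pattern would be the same: fit on x_train, then transform both splits with that fitted scaler.

```python
from statistics import mean, pstdev


class SimpleStandardScaler:
    """Hypothetical stand-in for a mean/std scaler (single feature)."""

    def fit(self, values):
        self.mean_ = mean(values)
        self.std_ = pstdev(values) or 1.0  # fall back to 1.0 for constant data
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]


# Illustrative values only, not taken from the project.
x_train = [1.0, 2.0, 3.0, 4.0, 5.0]
x_opt = [10.0, 12.0, 14.0]

# Leaky: statistics are computed from training and validation data together.
leaky = SimpleStandardScaler().fit(x_train + x_opt)

# Leak-free: statistics come from the training split only.
clean = SimpleStandardScaler().fit(x_train)

# The scaler fitted on training data is then applied to both splits.
x_train_scaled = clean.transform(x_train)
x_opt_scaled = clean.transform(x_opt)

print("leaky mean:", leaky.mean_, "clean mean:", clean.mean_)
```

The two fitted means differ, which shows that the leaky scaler has absorbed information from x_opt. The fix changes only what fit() sees; transform() is still called on the validation data.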

Impact

This is likely the cause of the suspected overtraining. The reported 99.98% accuracy may be artificially inflated.

Priority

CRITICAL - This affects the validity of published results.

References

  • Archive branch: Lines identified in code review
  • See: RETROSPECTIVE.md for context

Labels

  • archive: Related to archiving old research code
  • technical-debt: Technical debt and code quality
