Does your trained regression model have a heteroskedasticity problem? Does it struggle to predict extreme values? These pre-processing techniques can help when the issues are caused by an imbalanced target distribution.
- Random Oversampling (RO)
- Gaussian Noise and Undersampling (GN)
- Weighted Relevance-based Combination Strategy (WERCS)

Basic usage:
- Pass your data as a pandas DataFrame to any of the techniques.
- Define a relevance function that maps the target variable to [0, 1] (higher value = rarer samples).
- Set a threshold to flag rare vs normal samples.
- Set method-specific parameters (e.g. oversampling/undersampling ratios).
- Call .get() to obtain the resampled dataset.
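
The relevance and threshold steps above can be illustrated without the library. The sketch below is not PyImbalReg's implementation: `toy_relevance` and `split_rare_normal` are hypothetical helpers, and both the median-distance relevance and the `>=` comparison against the threshold are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd


def toy_relevance(y: pd.Series) -> pd.Series:
    """Hypothetical relevance: scaled distance from the median, in [0, 1]."""
    distance = (y - y.median()).abs()
    max_distance = distance.max()
    if max_distance == 0:
        # All targets are identical, so no sample is rarer than another.
        return pd.Series(0.0, index=y.index)
    return distance / max_distance


def split_rare_normal(df: pd.DataFrame, y_col: str, threshold: float):
    """Split rows into (rare, normal) by comparing relevance to the threshold."""
    relevance = toy_relevance(df[y_col])
    is_rare = relevance >= threshold  # assumption: ties count as rare
    return df[is_rare], df[~is_rare]


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=200), "y": rng.normal(size=200)})

rare, normal = split_rare_normal(df, y_col="y", threshold=0.7)
print(len(rare), len(normal))
```

Raising the threshold flags fewer samples as rare; lowering it flags more.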

Install with one of the following:

    pip install PyImbalReg
    uv pip install PyImbalReg
    pip install git+https://github.com/vd1371/PyImbalReg.git

This project uses uv for fast, reliable dependency management.
- Install uv (if needed):

      curl -LsSf https://astral.sh/uv/install.sh | sh   # or: pip install uv

- Clone and sync:

      git clone https://github.com/vd1371/PyImbalReg.git
      cd PyImbalReg
      uv sync

  This creates a virtual environment, installs the package in editable mode, and installs dev dependencies (e.g. pytest, seaborn).

- Run tests:

      uv sync --extra dev
      uv run pytest tests/

- Run examples (requires dev deps for seaborn/matplotlib):

      uv run python examples/ro.py
      uv run python examples/gn.py
      uv run python examples/gnhf.py
      uv run python examples/wercs.py

- Lock dependencies (optional):

      uv lock

  Commit uv.lock for reproducible installs.

Example (Random Oversampling):

    import PyImbalReg as pir
    from seaborn import load_dataset

    data = load_dataset("dots")

    ro = pir.RandomOversampling(
        df=data,
        rel_func="default",
        threshold=0.7,
        o_percentage=5,  # (o_percentage - 1) × n_rare_samples will be added
    )
    new_data = ro.get()

Requirements:
- Python ≥ 3.8
- NumPy
- Pandas
- SciPy
(All are declared in pyproject.toml and installed automatically with the package.)
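
To make the `o_percentage` arithmetic from the usage example concrete, here is a minimal pandas sketch of random oversampling. It is an illustration under stated assumptions, not PyImbalReg's code: `oversample_rare` is a hypothetical helper, and sampling rare rows with replacement is assumed as the duplication strategy.

```python
import pandas as pd


def oversample_rare(rare: pd.DataFrame, normal: pd.DataFrame,
                    o_percentage: float, random_state: int = 0) -> pd.DataFrame:
    """Append (o_percentage - 1) * len(rare) resampled rare rows (hypothetical helper)."""
    n_new = int((o_percentage - 1) * len(rare))
    # Assumption: new rows are plain duplicates drawn with replacement.
    extra = rare.sample(n=n_new, replace=True, random_state=random_state)
    return pd.concat([normal, rare, extra], ignore_index=True)


# Toy data: 90 normal rows and 10 rare rows, already split.
normal = pd.DataFrame({"x": range(90), "y": [0.0] * 90})
rare = pd.DataFrame({"x": range(90, 100), "y": [10.0] * 10})

resampled = oversample_rare(rare, normal, o_percentage=5)
print(len(resampled))  # 90 normal + 10 rare + 40 duplicates = 140
```

With `o_percentage=5`, the 10 rare rows grow to 50 in total, while the normal rows are left unchanged.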
Runnable scripts such as ro.py, gn.py, gnhf.py, and wercs.py live in examples/. Run them with uv run python examples/<name>.py after uv sync --extra dev.
Issues, new techniques, and pull requests are welcome.
If you use this repository, please cite:
Branco, P., Torgo, L. and Ribeiro, R.P., 2019.
Pre-processing approaches for imbalanced distributions in regression.
Neurocomputing, 343, pp.76–99.
© Vahid Asghari, 2020. Licensed under the GNU General Public License v3.0 (GPLv3).
Some parts of the README and code were inspired by smogn.
