PyImbalReg

Pre-processing techniques for imbalanced datasets in regression


Dealing with imbalanced datasets for regression

Does your trained regression model show heteroskedasticity? Does it predict extreme values poorly? When these issues stem from an imbalanced target distribution, the following pre-processing techniques can help:

  • Random Oversampling (RO)
  • Gaussian Noise and Undersampling (GN)
  • Weighted Relevance-based Combination Strategy (WERCS)

How to use (2-minute read)

  1. Pass your data as a pandas DataFrame to any of the techniques.
  2. Define a relevance function that maps the target variable to [0, 1] (higher value = rarer samples).
  3. Set a threshold to flag rare vs normal samples.
  4. Set method-specific parameters (e.g. oversampling/undersampling ratios).
  5. Call .get() to obtain the resampled dataset.
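
To make steps 2 and 3 concrete, here is a minimal standalone sketch of a relevance function and a threshold. It does not call PyImbalReg and is not necessarily what rel_func="default" does internally. The names gaussian_relevance and flag_rare are hypothetical, and the Gaussian-density form is an assumption chosen for illustration.

```python
from statistics import NormalDist, mean, stdev


def gaussian_relevance(targets):
    """Map each target value to [0, 1]; values far from the mean score higher."""
    dist = NormalDist(mean(targets), stdev(targets))
    peak = dist.pdf(dist.mean)  # density at the mean, the most common region
    return [1.0 - dist.pdf(y) / peak for y in targets]


def flag_rare(targets, threshold=0.7):
    """Return one boolean per sample: True when relevance reaches the threshold."""
    return [r >= threshold for r in gaussian_relevance(targets)]


# Hypothetical target column: nine ordinary values and one extreme value.
targets = [10, 11, 9, 10, 12, 10, 11, 9, 10, 30]
rare_mask = flag_rare(targets, threshold=0.7)
print(rare_mask)
```

Only the extreme value 30 is flagged as rare here. Lowering the threshold flags more samples as rare, which in turn changes how many rows the resampling techniques add or drop.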

Installation

From PyPI (pip)

pip install PyImbalReg

From PyPI with uv

uv pip install PyImbalReg

From GitHub

pip install git+https://github.com/vd1371/PyImbalReg.git

Development (uv)

This project uses uv for fast, reliable dependency management.

  1. Install uv (if needed):

    curl -LsSf https://astral.sh/uv/install.sh | sh
    # or: pip install uv
  2. Clone and sync:

    git clone https://github.com/vd1371/PyImbalReg.git
    cd PyImbalReg
    uv sync

    This creates a virtual environment, installs the package in editable mode, and installs dev dependencies (e.g. pytest, seaborn).

  3. Run tests:

    uv sync --extra dev
    uv run pytest tests/
  4. Run examples (requires dev deps for seaborn/matplotlib):

    uv run python examples/ro.py
    uv run python examples/gn.py
    uv run python examples/gnhf.py
    uv run python examples/wercs.py
  5. Lock dependencies (optional):

    uv lock

    Commit uv.lock for reproducible installs.


Example

import PyImbalReg as pir
from seaborn import load_dataset

data = load_dataset("dots")

ro = pir.RandomOversampling(
    df=data,
    rel_func="default",
    threshold=0.7,
    o_percentage=5,  # (o_percentage - 1) × n_rare_samples will be added
)
new_data = ro.get()
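
The o_percentage comment above implies simple arithmetic: with n_rare rare samples, (o_percentage - 1) × n_rare rows are added. The sketch below illustrates that count in plain Python. It is an assumption-laden illustration, not the library's implementation: random_oversample is a hypothetical helper, and sampling with replacement is assumed.

```python
import random


def random_oversample(rows, rare_mask, o_percentage, seed=0):
    """Return rows plus (o_percentage - 1) * n_rare randomly drawn copies of rare rows."""
    rare_rows = [row for row, is_rare in zip(rows, rare_mask) if is_rare]
    n_extra = int((o_percentage - 1) * len(rare_rows))
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    # Assumption: rare rows are duplicated by sampling with replacement.
    extra = rng.choices(rare_rows, k=n_extra) if rare_rows else []
    return rows + extra


rows = [{"y": 10}, {"y": 11}, {"y": 9}, {"y": 30}, {"y": 31}]
rare_mask = [False, False, False, True, True]
resampled = random_oversample(rows, rare_mask, o_percentage=5)
print(len(rows), "->", len(resampled))  # 2 rare rows, so (5 - 1) * 2 = 8 rows are added
```

With two rare rows and o_percentage=5, the dataset grows from 5 to 13 rows, and the rare share rises from 2 of 5 to 10 of 13.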

Requirements

  • Python ≥ 3.8
  • NumPy
  • Pandas
  • SciPy

(All are declared in pyproject.toml and installed automatically with the package.)


More examples

Runnable scripts live in examples/ (ro.py, gn.py, gnhf.py, wercs.py). Run one with uv run python examples/<name>.py after uv sync --extra dev.


Contributing

Issues, new techniques, and pull requests are welcome.


Citation

If you use this repository, please cite:

Branco, P., Torgo, L. and Ribeiro, R.P., 2019.
Pre-processing approaches for imbalanced distributions in regression.
Neurocomputing, 343, pp.76–99.


License

© Vahid Asghari, 2020. Licensed under the GNU General Public License v3.0 (GPLv3).

Some parts of the README and code were inspired by smogn.
