v2.1.18.0

@cezary986 cezary986 released this 09 Sep 10:12
· 53 commits to main since this release

What's new in RuleKit version 2.1.18.0?

This release mainly focuses on fixing various inconsistencies between this package and the original Java RuleKit v2 library.

1. Add utility function for reading .arff files.

The ARFF file format was originally created by the Machine Learning Project at the University of Waikato's Department of Computer Science for use with the Weka machine learning software. Once popular, the format has now become rather niche. However, some older but still widely used public benchmark datasets are only available as ARFF files.

Modern Python, however, lacks a good package for reading such files. Most existing examples on the internet use the scipy.io.arff package. However, this package has some drawbacks that can be problematic (they certainly were in our own experiments). First of all, it does not return the data as a pandas DataFrame. Although the returned data can be easily converted into a DataFrame, string columns are still not decoded properly and are left as bytes. We also encountered problems parsing empty values, especially in numeric columns.
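To illustrate the byte-string issue described above, here is a minimal sketch of reading ARFF data with scipy.io.arff directly and decoding the string column by hand. The inline dataset is made up for illustration, and the manual decoding loop is only one possible workaround, not necessarily what this package does.

```python
from io import StringIO

import pandas as pd
from scipy.io import arff

# Tiny made-up ARFF document, used only for illustration.
content = """
@relation foo
@attribute width  numeric
@attribute height numeric
@attribute color  {red,green,blue,yellow,black}
@data
5.0,3.25,blue
4.5,3.75,green
3.0,4.00,red
"""

# loadarff returns a NumPy structured array plus metadata, not a DataFrame.
data, meta = arff.loadarff(StringIO(content))
df = pd.DataFrame(data)

# Nominal values arrive as bytes (for example b'blue'), not as str.
raw_first_color = df["color"].iloc[0]

# Manual workaround: decode every object column that holds bytes.
for column in df.columns:
    if df[column].dtype == object:
        df[column] = df[column].str.decode("utf-8")
```

After the loop, the color column holds ordinary Python strings, while the numeric columns are left untouched.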

After encountering all these problems and drinking considerable amounts of coffee ☕ to fix all sorts of strange bugs they caused, we decided to add a custom function for reading ARFF files to this package. It is not a completely new implementation: it builds on scipy.io.arff, fixes the previously mentioned problems, and returns a ready-to-use pandas DataFrame compatible with the models available in this package. An example is shown below.

import pandas as pd
from rulekit.arff import read_arff

df: pd.DataFrame = read_arff('./tests/additional_resources/cholesterol.arff')

2. Add ability to write verbose rule induction process logs to a file.

The original RuleKit provides detailed logs of the entire rule induction process. Such logs may not be of interest to the average user, but may be of value to others. They can also be helpful in the debugging process (they certainly were for us).

To configure such logs, use the RuleKit class:

from rulekit import RuleKit

RuleKit.configure_java_logger(
    log_file_path='./java.logs',
    verbosity_level=1
)
# train your model later

3. Add validation of model parameter configuration.

This package acts as a wrapper for the original RuleKit library written in Java, offering an analogous but more Pythonic API. However, this architecture has led to many bugs in the past. Most of them were caused by differences between the parameter values of models configured in Python and the values actually set in Java. In this version, we have added automatic validation, which compares the parameter values configured by the user with those configured in Java and raises a rulekit.exceptions.RuleKitMisconfigurationException when they differ. This exception should not occur during normal use of this package; it was added mainly to make debugging easier and to prevent such bugs in the future.
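The idea behind this kind of validation can be sketched in plain Python. This is only an illustration under stated assumptions, not the package's actual implementation: MisconfigurationError, validate_parameters, and the parameter names are hypothetical, and the Java-side values are represented here by an ordinary dictionary.

```python
class MisconfigurationError(Exception):
    """Hypothetical stand-in for RuleKitMisconfigurationException."""


def validate_parameters(python_params: dict, java_params: dict) -> None:
    """Raise if any Python-side value differs from the Java-side value."""
    # Collect every parameter whose two values disagree.
    mismatches = {
        name: (expected, java_params.get(name))
        for name, expected in python_params.items()
        if java_params.get(name) != expected
    }
    if mismatches:
        details = "; ".join(
            f"{name}: Python={py_value!r}, Java={java_value!r}"
            for name, (py_value, java_value) in sorted(mismatches.items())
        )
        raise MisconfigurationError(f"Parameter mismatch: {details}")


# Made-up parameter names and values, used only for illustration.
configured = {"min_support": 5, "max_rules": 10}

# Matching values pass silently.
validate_parameters(configured, {"min_support": 5, "max_rules": 10})

# A differing value is reported by name with both values.
try:
    validate_parameters(configured, {"min_support": 5, "max_rules": 20})
except MisconfigurationError as error:
    print(error)
```

Reporting the parameter name together with both values is what makes such a check useful for debugging: the message points directly at the setting that was lost or altered between the two layers.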

Fixed issues

  • Inconsistent results of induction for survival #22
  • Fixed numerous inconsistencies between this package and the original Java RuleKit v2 library.