7 Contrast set mining
RuleKit includes an algorithm for contrast set (CS) identification (Gudyś et al., 2024). Currently, this mode is available only through the XML batch interface.
Below we present an example parameter set suitable for contrast set mining:
<parameter_set name="Contrast set mining">
<param name="induction_measure">Correlation</param>
<param name="pruning_measure">Correlation</param>
<param name="voting_measure">Correlation</param>
<param name="minsupp_all">0.8 0.5 0.2 0.1</param>
<param name="minsupp_new">0.1</param>
<param name="max_neg2pos">0.5</param>
<param name="max_passes_count">5</param>
<param name="penalty_strength">0.5</param>
<param name="penalty_saturation">0.2</param>
</parameter_set>
Parameter meaning (symbols from the RuleKit-CS manuscript are given in parentheses):
- induction_measure/pruning_measure/voting_measure - name of the rule quality measure used during growing/pruning/voting (ignored in regression/survival analysis, where a special measure is used); recommended: Correlation,
- minsupp_all - a minimum positive support of a contrast set (p/P). When multiple values are specified, a metainduction is performed; recommended sequence: 0.8, 0.5, 0.2, 0.1,
- minsupp_new - a minimum positive support of a contrast set calculated w.r.t. previously uncovered examples (pnew/P); an alias of the min_rule_covered parameter in traditional rule induction; recommended: 0.1,
- max_neg2pos - a maximum ratio of negative to positive supports (nP/pN); recommended: 0.5,
- max_passes_count (max-passes) - a maximum number of sequential covering passes for a single minsupp-all; recommended: 5,
- penalty_strength (s) - penalty strength; recommended: 0.5,
- penalty_saturation - the value of pnew/P at which the penalty reward saturates; recommended: 0.2.
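As a rough illustration of how the support-based thresholds read, consider a purely hypothetical candidate contrast set. The numbers below are invented for the example, and we assume that P and N denote the total numbers of positive and negative examples, p and n the numbers of covered positives and negatives, and pnew the number of covered positives not covered before. With P = 100, N = 200, p = 40, n = 10 and pnew = 15:

```latex
\begin{align*}
\frac{p}{P} &= \frac{40}{100} = 0.4, \\
\frac{p_{\mathrm{new}}}{P} &= \frac{15}{100} = 0.15 \ge 0.1, \\
\frac{nP}{pN} &= \frac{10 \cdot 100}{40 \cdot 200} = 0.125 \le 0.5 .
\end{align*}
```

Under these assumptions the candidate satisfies the recommended minsupp_new and max_neg2pos limits. Its positive support of 0.4 meets the 0.2 and 0.1 levels of the recommended minsupp_all sequence, but not the 0.8 or 0.5 levels.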
To enable contrast set mining, one needs to specify the contrast_attribute tag in the data set description; it indicates the group attribute. Note that the prediction section is ignored in contrast set mining, so it can be omitted.
In the case of traditional contrast sets, the group attribute must be the same as the discrete label:
<dataset>
<label>class</label>
<contrast_attribute>class</contrast_attribute>
<out_directory>./classification/final/anneal</out_directory>
<training>
<report_file>training.log</report_file>
<train>
<in_file>../data/classification/anneal.arff</in_file>
<model_file>anneal.mdl</model_file>
<model_csv>anneal.csv</model_csv>
</train>
</training>
</dataset>
For regression problems, one needs to specify a discrete group attribute and a continuous label:
<dataset>
<label>class</label>
<contrast_attribute>group</contrast_attribute>
<out_directory>./regression/final/plastic</out_directory>
<training>
<report_file>plastic.train.log</report_file>
<train>
<in_file>../data/regression/plastic.arff</in_file>
<model_file>plastic.mdl</model_file>
<model_csv>plastic.csv</model_csv>
</train>
</training>
</dataset>
When analyzing survival data, the following attributes are needed: a discrete group attribute, a binary survival status (label), and a continuous survival time:
<dataset>
<survival_time>survival_time</survival_time>
<label>survival_status</label>
<contrast_attribute>group</contrast_attribute>
<out_directory>./survival/final/actg320</out_directory>
<training>
<report_file>actg320.train.log</report_file>
<train>
<in_file>../data/survival/actg320.arff</in_file>
<model_file>actg320.mdl</model_file>
<model_csv>actg320.csv</model_csv>
</train>
</training>
</dataset>
All the data sets investigated in the RuleKit-CS study can be downloaded from here. This location also contains an XLS spreadsheet with detailed information on the data sets. The XML files defining the experiments from the paper are also available in the examples/contrast-sets repository folder. To reproduce the analyses, copy the RuleKit jar file into that folder and run
java -jar rulekit-<version>-all.jar <experiments.xml>
where <experiments.xml> is the name of the XML file for a particular experiment.
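For orientation, the fragments shown in this section might be combined into a single experiment file along the following lines. This is only a sketch: the wrapping experiment, parameter_sets and datasets elements are not shown in this section and are assumed here from the general RuleKit batch format, so they should be checked against the files in examples/contrast-sets before use. The parameter values and paths are copied from the classification example above.

```xml
<?xml version="1.0"?>
<!-- Assumed top-level layout; verify against the shipped example files -->
<experiment>
    <parameter_sets>
        <parameter_set name="Contrast set mining">
            <param name="induction_measure">Correlation</param>
            <param name="pruning_measure">Correlation</param>
            <param name="voting_measure">Correlation</param>
            <param name="minsupp_all">0.8 0.5 0.2 0.1</param>
            <param name="minsupp_new">0.1</param>
            <param name="max_neg2pos">0.5</param>
            <param name="max_passes_count">5</param>
            <param name="penalty_strength">0.5</param>
            <param name="penalty_saturation">0.2</param>
        </parameter_set>
    </parameter_sets>
    <datasets>
        <dataset>
            <label>class</label>
            <!-- Traditional contrast sets: group attribute equals the label -->
            <contrast_attribute>class</contrast_attribute>
            <out_directory>./classification/final/anneal</out_directory>
            <training>
                <report_file>training.log</report_file>
                <train>
                    <in_file>../data/classification/anneal.arff</in_file>
                    <model_file>anneal.mdl</model_file>
                    <model_csv>anneal.csv</model_csv>
                </train>
            </training>
            <!-- No prediction section: it is ignored in contrast set mining -->
        </dataset>
    </datasets>
</experiment>
```

Such a file would then be passed as the <experiments.xml> argument of the java -jar command.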