
🚀 Yunbase, the first submission of your algorithm competition

yunbase title image

In data mining competitions, many of the same operations have to be performed every time. Much of this work, from data preprocessing to k-fold cross validation, is repetitive, and writing the same code over and over is tedious, so I extracted the common parts of these operations into the Yunbase class. ('Yun' comes from my name, yunsuxiaozi, and 'base' stands for the competition baseline.)

Get Started Quickly

1.git clone

!git clone https://github.com/yunsuxiaozi/Yunbase.git

2.download the wheels listed in requirements.txt

!pip download -r Yunbase/requirements.txt

3.install according to requirements.txt

!pip install -q --requirement yourpath/Yunbase/requirements.txt  \
--no-index --find-links file:yourpath

4.import Yunbase

from Yunbase.baseline import Yunbase

5.create Yunbase.

All of the parameters are listed below; you can choose them flexibly according to your task.

yunbase=Yunbase(  num_folds:int=5,
                  n_repeats:int=1,
                  models:list[tuple]=[],
                  FE=None,
                  CV_sample=None,
                  group_col=None,
                  target_col:str='target',
                  weight_col:str='weight',
                  drop_cols:list[str]=[],
                  seed:int=2024,
                  objective:str='regression',
                  metric:str='mse',
                  nan_margin:float=0.95,
                  num_classes=None,
                  infer_size:int=10000,
                  save_oof_preds:bool=True,
                  save_test_preds:bool=True,
                  device:str='cpu',
                  one_hot_max:int=50,
                  custom_metric=None,
                  use_optuna_find_params:int=0,
                  optuna_direction=None,
                  early_stop:int=100,
                  use_pseudo_label:bool=False,
                  use_high_corr_feat:bool=True,
                  cross_cols:list[str]=[],
                  labelencoder_cols:list[str]=[],
                  list_stat:list[tuple]=[],
                  word2vec_models:list[tuple]=[],
                  text_cols:list[str]=[],
                  plot_feature_importance:bool=False,
                  log:int=100,
                  exp_mode:bool=False,
                  use_reduce_memory:bool=False,
                  use_data_augmentation:bool=False,
                  use_oof_as_feature:bool=False,
                  use_CIR:bool=False,
                  use_median_as_pred:bool=False,
                  use_scaler:bool=False,
                  use_TTA:bool=False,
                  use_eval_metric:bool=True,
                  feats_stat:list[tuple]=[],
                  target_stat:list[tuple]=[],
                  use_spellchecker:bool=False,
                  AGGREGATIONS:list=['nunique','count','min','max','first',
                                     'last','mean','median','sum','std','skew',kurtosis],
    )
  • num_folds:int.The number of folds for k-fold cross validation.

  • n_repeats:int.Repeat k-fold cross validation several times with different seeds. This parameter is generally used on small datasets to make the model's results more stable.

  • models:list[tuple].Three GBDTs are built in as the baseline; you can also pass in custom models, such as

    models=[(LGBMRegressor(**lgb_params),'lgb')]
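
    A fuller (illustrative) list might combine the three built-in GBDT families; lgb_params, xgb_params, and cat_params are placeholder dictionaries that you define yourself:

    from lightgbm import LGBMRegressor
    from xgboost import XGBRegressor
    from catboost import CatBoostRegressor

    #illustrative custom model list; the parameter dicts are placeholders
    models=[(LGBMRegressor(**lgb_params),'lgb'),
            (XGBRegressor(**xgb_params),'xgb'),
            (CatBoostRegressor(**cat_params),'cat')]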
  • FE:function.In addition to the built-in feature engineering, you can also define your own feature engineering function. For example:

    def FE(df):
        return df.drop(['id'],axis=1)
  • CV_sample:function.You can customize your undersampling and oversampling operations here. To keep the CV score reliable, operations on the validation set should in principle be avoided; however, to meet personalized needs, operations on the validation set are still allowed. Besides sampling, related feature engineering can also be customized in this function.

    For example:

    import numpy as np
    import pandas as pd

    def CV_sample(X_train,y_train,X_valid,y_valid,
                  sample_weight_train,sample_weight_valid):
        less_idx=list(np.where(y_train==1)[0])#minority-class indices
        more_idx=list(np.where(y_train==0)[0])#majority-class indices
        np.random.shuffle(more_idx)
        #undersample:keep 90% of the majority class
        more_idx=more_idx[:int(len(more_idx)*0.9)]
        #Adversarial learning:duplicate the minority samples with their label flipped to 0
        X_train_copy=X_train.iloc[less_idx].copy()
        y_train_copy=y_train.iloc[less_idx].copy()
        y_train_copy[:]=0
        sample_weight_train_copy=sample_weight_train.iloc[less_idx].copy()

        X_train=pd.concat((X_train.iloc[more_idx+less_idx],X_train_copy)).reset_index(drop=True)
        y_train=pd.concat((y_train.iloc[more_idx+less_idx],y_train_copy)).reset_index(drop=True)
        sample_weight_train=pd.concat((sample_weight_train.iloc[more_idx+less_idx],sample_weight_train_copy)).reset_index(drop=True)
        return X_train,y_train,X_valid,y_valid,sample_weight_train,sample_weight_valid

In purged CV (time-series CV) there is no validation set, so in order to keep the training set and test set closer, this function becomes the following:

def CV_sample(X_train,y_train,sample_weight_train):
    #your code
    return X_train,y_train,sample_weight_train
  • group_col:str.If you want to use GroupKFold, define this group column.

  • target_col:str.The column that you want to predict.

  • weight_col:str.You can set the weight of each sample during model training. If it is not defined by the user, a weight of 1 will be used for every sample.

  • drop_cols:list.The columns to be dropped after all feature engineering is completed.

  • seed:int.Random seed.

  • objective:str.Which task do you want to do: regression, binary, or multi_class?

  • metric:str.The metric used to evaluate your model.

  • nan_margin:float.When the proportion of missing values in a column is greater than this value, the column is dropped.

  • num_classes:int.If the objective is multi_class or binary, you should define this parameter.

  • infer_size:int.The test data might be large, so we predict in batches to deal with memory issues.

  • save_oof_preds:bool.You can save the OOF predictions for your own offline study.

  • save_test_preds:bool.You can save the test predictions for your own offline study.

  • device:str.GBDT models can be trained on GPU; set this parameter to 'gpu' if you want to train on GPU.

  • one_hot_max:int.If the number of unique values in a column is less than this value, the column is one-hot encoded.

  • custom_metric:function.You can define your own custom metric. For example:

    def weighted_MAE(y_true,y_pred,
                     weight=train['weight'].values):
        return np.sum(weight*np.abs(y_true-y_pred))/np.sum(weight)

    1.custom_metric can only take the parameters y_true and y_pred. For regular cross validation, anything else (such as the weight parameter above) must be bound in advance as a default value. For time-series CV, the use_weighted_metric parameter can be used without defining the weight parameter.

    2.When the objective is multi_class, y_pred in custom_metric(y_true,y_pred) is a probability matrix of shape (len(y_true),num_classes).
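
    For instance, a minimal sketch of a multi-class custom metric that consumes this probability matrix (an illustration, not part of Yunbase; it assumes y_true holds integer class indices):

    import numpy as np

    def multiclass_logloss(y_true,y_pred):
        #y_pred:probabilities with shape (len(y_true),num_classes)
        eps=1e-15
        p=np.clip(y_pred,eps,1-eps)
        #mean negative log-probability of the true class
        return -np.mean(np.log(p[np.arange(len(y_true)),np.asarray(y_true).astype(int)]))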

  • use_optuna_find_params:int.The number of optuna trials used to search for the best parameters; 0 means optuna is not used. Currently only LGBM is supported.

  • optuna_direction:str.'minimize' or 'maximize'. When you use a custom metric, you must define the direction of optimization.

  • early_stop:int.The common early-stopping parameter of GBDT models.

  • use_pseudo_label:bool.Whether to use pseudo labels. When it is true, after obtaining predictions for the test data, the test data is added to the training data and the model is trained again. To obtain a reliable CV, the test set is concatenated with each fold's training set and evaluation is still done on the validation set.

  • use_high_corr_feat:bool.Whether to keep highly correlated features or not.

  • cross_cols:list[str].Brute-force construct features from these columns using addition, subtraction, multiplication, and division, for example:
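
    A rough illustration of what these cross features look like (a sketch of the idea, not Yunbase's exact implementation):

    import itertools

    for a,b in itertools.combinations(cross_cols,2):
        df[f'{a}+{b}']=df[a]+df[b]
        df[f'{a}-{b}']=df[a]-df[b]
        df[f'{a}*{b}']=df[a]*df[b]
        df[f'{a}/{b}']=df[a]/df[b]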

  • labelencoder_cols:list.Convert categorical string variables into [1, 2, ..., n].

  • list_stat:list[tuple]=[].Example: [('step_list',list_gap=[1,2,4])]. If the data in a column such as 'step_list' is a list or a str(list), for example [] or '[]', this can be used to extract diff and shift features for those list columns.

  • word2vec_models:list[tuple].Use models such as TF-IDF to extract features from string columns. For example:

    word2vec_models=[(TfidfVectorizer(),col,model_name='tfidf',use_svd=False)]
    
  • text_cols:list[str].Extract word-, sentence-, and paragraph-level features from these text columns.

  • plot_feature_importance:bool.Whether to plot feature importance after model training.

  • log:int.How many iterations between printing the score on the validation set.

  • exp_mode:bool.In regression tasks where the distribution of target_col is long-tailed, this parameter can be used to apply a log transform to target_col.

  • use_reduce_memory:bool.When facing large datasets, this can be used to reduce memory usage.

  • use_data_augmentation:bool.If data augmentation is used, during cross validation the training data will undergo a PCA transformation followed by an inverse transformation. See the function pca_augmentation for more details.

  • use_oof_as_feature:bool.For the training data, use the previous model's oof_preds as a feature; for the test data, use the previous model's predictions as a feature for the next model.

  • use_CIR:bool.Use CenteredIsotonicRegression to fit (oof_preds,target) at the end.

  • use_median_as_pred:bool.Model ensembles usually use the mean as the prediction; this parameter uses the median instead, which sometimes gives slightly better results.

  • use_scaler:bool.Although scaling is usually not useful for GBDT models, after scaling the data a clip operation can be used to remove outliers. RobustScaler is used here.

  • use_TTA:bool.Apply the data augmentation operation above to the test set and then average the predicted results.

  • use_eval_metric:bool.Use the metric to evaluate models during training with lightgbm and xgboost.

  • feats_stat:list[tuple]=[].Construct groupby features. For example, when the training data contains some patients and the test data contains other patients, each with multiple samples, this can be used.

    feats_stat=
    [('patient_id','year',['max','min','median','mean','std','skew',kurtosis,'(x-mean)/std','max-min','mean/std'])]
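
    Roughly, each entry groups the second column by the first and attaches the listed statistics as new features. A pandas sketch of the idea (not Yunbase's exact code; df is your DataFrame):

    stats=df.groupby('patient_id')['year'].agg(['max','min','median','mean','std'])
    stats.columns=[f'patient_id_year_{c}' for c in stats.columns]
    df=df.merge(stats.reset_index(),on='patient_id',how='left')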
  • target_stat:list[tuple]=[].It also performs a groupby operation, but to target-encode categorical variables, for example by calculating the mean of the target for each category in the training set.

  • use_spellchecker:bool.This is an immature feature that checks for spelling errors in text and then corrects them. The main issue is that it takes too long.

  • AGGREGATIONS=['nunique','count','min','max','first', 'last', 'mean','median','sum','std','skew',kurtosis]
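
Putting a few of these parameters together, a minimal construction might look like this (the choices below are illustrative, not required defaults):

yunbase=Yunbase(num_folds=5,
                models=[],#an empty list means the built-in GBDT baseline is used
                target_col='target',
                objective='regression',
                metric='mse',
                seed=2024,
                device='cpu',
               )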

6.yunbase training

At present, it supports reading csv or parquet files from a path, as well as DataFrames that have already been loaded.

yunbase.fit(train_path_or_file:str|pd.DataFrame|pl.DataFrame='train.csv',
            category_cols:list[str]=[],date_cols:list[str]=[],
            target2idx:dict|None=None,
           )
            
  • train_path_or_file:You can pass a file path or an already loaded DataFrame.
  • category_cols:You can specify which columns to convert to 'category' in the training data.
  • date_cols:If a column consists entirely of dates, for example "2024-04-23", this can be used to construct date features.
  • target2idx:The label-mapping dictionary for classification tasks; if you want to predict a person's gender, you can specify {'Male':0,'Female':1}. If you do not specify it yourself, targets will be mapped to 0, 1, ..., n in order of how often each target appears.
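
A minimal usage sketch (the column names here are illustrative, not required by Yunbase):

import pandas as pd

train=pd.read_csv('train.csv')#you can also just pass the path string
yunbase.fit(train_path_or_file=train,
            category_cols=['city'],#example categorical column
            date_cols=['date'],#columns holding dates like "2024-04-23"
            target2idx={'Male':0,'Female':1},#only needed for classification tasks
           )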

7.yunbase inference

test_preds=yunbase.predict(test_path_or_file:str|pd.DataFrame|pl.DataFrame='test.csv',weights=np.zeros(0))
  • weights:This sets the weights for the model ensemble. For example, if you trained lgb, xgb, and cat, you can set weights to [3,4,3].
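
For example, assuming the lgb, xgb, and cat models were trained:

test_preds=yunbase.predict('test.csv',weights=[3,4,3])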

8.save test_preds to submission.csv

yunbase.submit(submission_path_or_file='submission.csv',test_preds=np.ones(3),save_name='yunbase')
  • save_name can be changed; if you set it to 'submission', it will give you a csv file named 'submission.csv'.
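
For example, a typical call after prediction (the expected submission format depends on your competition):

yunbase.submit(submission_path_or_file='submission.csv',
               test_preds=test_preds,
               save_name='yunbase')#writes yunbase.csv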

9.ensemble

yunbase.ensemble(solution_paths_or_files:list[str]=[],weights=None)
  • For example:

    solution_paths_or_files=[
    'submission1.csv',
    'submission2.csv',
    'submission3.csv'
    ]
    weights=[3,3,4]
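
    Putting it together, the call would then be (a sketch based on the signature above):

    yunbase.ensemble(solution_paths_or_files=solution_paths_or_files,
                     weights=weights)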

10.If training and inference need to be separated.

#model save
yunbase.pickle_dump(yunbase,'yunbase.model')

#model load
import dill #serialize and deserialize objects (such as saving and loading tree models)
def pickle_load(path):
    #open the file in binary-read mode
    with open(path, mode="rb") as f:
        data = dill.load(f)
        return data
yunbase=Yunbase()
yunbase=pickle_load("yunbase.model")
yunbase.model_save_path=your_model_save_path#set this to your own model save path

11.The train data and test data can be viewed as follows.

yunbase.train.head(),yunbase.test.head()
Here is a static version that can be used to play Kaggle competitions. You can refer to this notebook to learn how to use Yunbase.

TimeSeries Purged CV

yunbase.purged_cross_validation(train_path_or_file:str|pd.DataFrame|pl.DataFrame='train.csv',
                                test_path_or_file:str|pd.DataFrame|pl.DataFrame='test.csv',
                                date_col:str='date',train_gap_each_fold:int=31,#one month
                                train_test_gap:int=7,#a week
                                train_date_range:int=0,test_date_range:int=0,
                                category_cols:list[str]=[],
                                use_seasonal_features:bool=True,
                                use_weighted_metric:bool=False,
                                only_inference:bool=False,
                                timestep:str='day',
                                target2idx:dict|None=None
                               )
  • only_inference:If you don't want to see the offline scores of the time-series CV, or want to save time, you can directly train the final model for submission.
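
A minimal usage sketch with the gaps shown in the signature above (values are illustrative):

yunbase.purged_cross_validation(train_path_or_file='train.csv',
                                test_path_or_file='test.csv',
                                date_col='date',
                                train_gap_each_fold=31,#one month
                                train_test_gap=7,#a week
                                timestep='day',
                               )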

Demo notebook:Rohlik Yunbase

Adversarial Validation

demo notebook

Follow-up work

The code now has a rough framework in place and will continue to be improved, with new functions added alongside bug fixes.

In principle, I will fix as many bugs as I discover and add as many new features as I can think of.

1.Support fitting on np.array (such as model.fit(train_X,train_y), model.predict(test_X)).

2.Add more common metrics.

3.In addition to k-fold, implement single-model training and inference.

4.Hill climbing to find blending weights.

5.Optimize memory and time to cope with larger datasets.

6.Make the code more beautiful, concise, and easy to understand.

Stay tuned for updates.

Kaggle:https://www.kaggle.com/yunsuxiaozi


Update time: 2025/01/06 (baseline.py and the README may not be updated in sync).
