MachineLearningLM

Scaling In-context Learning from Few-shot to 1,024-shot on Tabular ML

📄 Paper: https://arxiv.org/abs/2509.06806

🤗 Hugging Face: https://huggingface.co/MachineLearningLM

Model

The model is available on Hugging Face.

Pretraining Dataset

All datasets have been open-sourced on Hugging Face. Because of their large size, they are split into multiple parts there. Complete copies are also hosted on Google Drive:

🔹 Warmup Dataset

🔹 Full Dataset

Evaluation Framework

A comprehensive framework for evaluating Large Language Models on machine learning tasks, supporting both traditional machine learning models and deep learning approaches with automated pipeline processing.

Overview

This framework provides end-to-end evaluation capabilities for LLMs on machine learning tasks, featuring automated data preprocessing, prompt generation, model inference, and comprehensive evaluation metrics.

Important Notes

  1. Special Character Handling: CSV filenames in TALENT datasets may contain shell-reserved characters such as (. We recommend renaming these files to use only numbers, letters, and underscores before processing.

  2. Text Data Processing: Text data is supported, but because commas (,) are used as feature separators, replace any commas inside text fields so they are not mistaken for separators. In our evaluation, commas are replaced with spaces; a preprocessing sketch covering both notes follows below.
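
If you want to apply both recommendations in bulk, a minimal sketch along the following lines may help. The pandas-based approach and the data_raw path are assumptions for illustration only, not part of the framework:

# Illustrative preprocessing sketch (not part of the framework): rename CSV files
# to safe names and replace commas inside text cells with spaces.
import re
from pathlib import Path

import pandas as pd

data_dir = Path("./datahub_inputs/data_raw")  # adjust to your dataset location

for csv_path in sorted(data_dir.glob("*.csv")):
    # Keep only letters, digits, and underscores in the filename stem.
    safe_stem = re.sub(r"[^0-9A-Za-z_]", "_", csv_path.stem)
    safe_path = csv_path.with_name(safe_stem + ".csv")
    if safe_path != csv_path:
        csv_path = csv_path.rename(safe_path)

    # Replace commas in text cells with spaces so they cannot clash with the
    # comma used as the feature separator in prompts.
    df = pd.read_csv(csv_path)
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].str.replace(",", " ", regex=False)
    df.to_csv(csv_path, index=False)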

Installation

# Install Python dependencies
pip install -r requirements.txt

Batch Processing Usage

For batch processing, you need to provide input and output path parameters. The framework supports three execution modes:

Step 1: Activate Parameters

source ./scripts/evaluate_parameters.sh

Step 2: Choose Execution Mode

See the "Execution Options" section below for detailed commands based on your preferred processing mode (sequential, parallel, or end-to-end pipeline).

Execution Options

Option 1: Sequential Processing

Use scripts in single_process/ directory to run steps sequentially:

./scripts/single_process/data_prep.sh
./scripts/single_process/prompt_gen.sh  # For deep learning only
./scripts/single_process/model_pred.sh
./scripts/single_process/evaluation.sh
./scripts/single_process/report.sh

Option 2: Parallel Processing

Use scripts in multi_process/ directory for accelerated parallel execution:

./scripts/multi_process/data_prep_mp.sh
./scripts/multi_process/prompt_gen_mp.sh  # For deep learning only
./scripts/multi_process/model_pred.sh
./scripts/multi_process/evaluation_mp.sh
./scripts/multi_process/report_mp.sh

Option 3: End-to-End Pipeline

Run the complete pipeline with parallelization optimizations:

./scripts/evaluate_pipeline.sh

Single File Processing

For direct inference on individual JSONL files, we support single-file processing mode.

Important: The input file must have a .jsonl extension; the code uses this suffix to identify the file type.

The file should contain prompts in LLaMA Factory's Alpaca format with the following structure:

  • instruction: The task instruction
  • input: The input data
  • output: The expected output
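
For illustration, a single record in this format could be written as follows. The instruction text and table contents are placeholders, not the prompts generated by the framework:

# Minimal sketch of writing one Alpaca-format record to a .jsonl file.
# Only the three keys are prescribed; the values below are placeholders.
import json

record = {
    "instruction": "Predict the class label of the final row from the labeled rows above.",
    "input": "feature_1,feature_2,label\n12,507,1\n33,42,0\n18,615,?",
    "output": "1",
}

with open("demo_input.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")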

Local Model Usage Example

python ./src/evaluation/model_pred/dl_model_pred.py \
  --input_dir ./demo_input.jsonl \
  --output_dir ./demo_output.jsonl \
  --model_name MachineLearningLM/MachineLearningLM-7B-v1

Cloud Model Usage

For cloud model calls, the model name must start with the openai:: prefix so that it is parsed correctly and executed through the OpenAI SDK:

python3 ./src/evaluation/model_pred/dl_model_pred.py \
  --input_dir ./input_demo.jsonl \
  --output_dir ./output_demo.jsonl \
  --model_name openai::gpt-4o-mini \
  --api_key your_own_api_key \
  --base_url your_own_base_url \
  --max_samples 5
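
Conceptually, the openai:: prefix routes the request through an OpenAI-compatible SDK client instead of a local model. Stripped down, such a call looks roughly like the sketch below; this is not the actual implementation of dl_model_pred.py, which additionally handles batching, retries, and output parsing:

# Rough sketch of an OpenAI-compatible chat call.
from openai import OpenAI

client = OpenAI(api_key="your_own_api_key", base_url="your_own_base_url")

response = client.chat.completions.create(
    model="gpt-4o-mini",  # the part after the "openai::" prefix
    messages=[{"role": "user", "content": "instruction + input from one JSONL record"}],
)
print(response.choices[0].message.content)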

Single File Evaluation

You can also perform evaluation on individual files directly:

python ./src/evaluation/result_proc/evaluator.py \
  --input_dir ./demo_response.jsonl \
  --output_dir ./output_demo.txt   # Can also be .jsonl

Note: Our evaluation framework is specifically designed for results generated by our dl_model_pred inference pipeline. Please use outputs from our inference module as input for evaluation to ensure compatibility.

Configuration

All parameters are managed through ./scripts/evaluate_parameters.sh. Modify this file to customize:

  • Input/output paths
  • Model configurations
  • Processing parameters
  • Evaluation settings

Features

  • Dual Model Support: Traditional ML and Deep Learning models
  • Flexible Processing: Single-process or multi-process execution
  • Automated Pipeline: End-to-end workflow automation
  • Single File Support: Direct inference on individual JSONL files
  • Comprehensive Evaluation: Multi-metric evaluation framework
  • Parallel Optimization: Built-in parallelization for performance

TabICL Evaluation

This part of the code needs to run in an environment with the tabicl and openpyxl libraries installed.

The evaluation code for tabicl is placed separately in the ./src/evaluation/tabicl_evaluate.py file. Use ./scripts/tabicl_evaluate.sh to obtain the evaluation results for tabicl.

Use --datasets to specify the datasets to evaluate and --sample_sizes to set the number of shots. To evaluate multiple datasets, separate their names with spaces; to evaluate all CSV files in the input folder, pass all.
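
For example (dataset names and shot counts below are illustrative; depending on how the wrapper script forwards arguments, you may need to set these values inside the script instead):

./scripts/tabicl_evaluate.sh --datasets all --sample_sizes 8 64 512
./scripts/tabicl_evaluate.sh --datasets bank adult --sample_sizes 32 256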

Prior Data

MachineLearningLM uses the code from tabicl to generate prior data.

Use ./scripts/generate_data.sh to generate the prior data. It produces the corresponding .pt and .csv files and normalizes the feature values in the CSV files to the range 0–999, as described in the paper.
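
The 0–999 normalization is essentially per-feature min-max scaling onto an integer grid. A minimal sketch of that idea follows; it assumes plain min-max scaling and may differ in detail from what generate_data.sh does:

# Minimal sketch: scale each feature column to integers in [0, 999].
import numpy as np

def scale_to_0_999(X: np.ndarray) -> np.ndarray:
    X = X.astype(float)
    mins = X.min(axis=0)
    spans = X.max(axis=0) - mins
    spans[spans == 0] = 1.0  # constant columns: avoid division by zero
    return np.round((X - mins) / spans * 999).astype(int)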

Parameter Introduction (refer to the comments in tabicl/src/tabicl/prior/dataset.py)

Data Scale & Structure

| Parameter | Type | Description |
| --- | --- | --- |
| min_features | int | Minimum number of features per dataset |
| max_features | int | Maximum number of features per dataset |
| max_classes | int | Maximum number of target classes |
| min_seq_len | int | Minimum samples per dataset; uses max_seq_len if None |
| max_seq_len | int | Maximum samples per dataset (exclusive) |

Batch Configuration

| Parameter | Type | Description |
| --- | --- | --- |
| batch_size | int | Total number of datasets to generate per batch |
| batch_size_per_gp | int | Number of datasets per group (shared characteristics) |
| batch_size_per_subgp | int | Number of datasets per subgroup (similar causal structures); defaults to batch_size_per_gp if None |

Sequence Length Control

| Parameter | Type | Description |
| --- | --- | --- |
| log_seq_len | bool | Sample sequence length from a log-uniform distribution if True |
| seq_len_per_gp | bool | Sample sequence length per group (enables variable-sized datasets) |
| replay_small | bool | Occasionally sample smaller sequences for model robustness |

Train-Test Split

| Parameter | Type | Description |
| --- | --- | --- |
| min_train_size | int/float | Start position/ratio of the train split (int: absolute, float: fractional) |
| max_train_size | int/float | End position/ratio of the train split (int: absolute, float: fractional) |

Generation Method

| Parameter | Type | Description |
| --- | --- | --- |
| prior_type | str | Prior type: 'mlp_scm', 'tree_scm', or 'mix_scm' (random selection) |
| fixed_hp | dict | Fixed structural configuration parameters |
| sampled_hp | dict | Parameters sampled during generation |

Computation Settings

| Parameter | Type | Description |
| --- | --- | --- |
| n_jobs | int | Number of parallel jobs (-1 = use all processors) |
| num_threads_per_generate | int | Number of threads per generation job |
| device | str | Computation device ('cpu' or 'cuda') |
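
Taken together, a generation run combines the parameters documented above. The sketch below only illustrates how they fit into one configuration; every value is a placeholder rather than a setting used in the paper (see dataset.py for the real defaults):

# Illustrative combination of the parameters documented above.
# All values are placeholders; see tabicl/src/tabicl/prior/dataset.py for defaults.
prior_config = dict(
    # Data scale & structure
    min_features=5, max_features=30, max_classes=10,
    min_seq_len=None, max_seq_len=2048,  # max_seq_len is exclusive
    # Batch configuration
    batch_size=64, batch_size_per_gp=8, batch_size_per_subgp=None,
    # Sequence length control
    log_seq_len=True, seq_len_per_gp=True, replay_small=True,
    # Train-test split (float values are interpreted as fractions)
    min_train_size=0.1, max_train_size=0.9,
    # Generation method ('mlp_scm', 'tree_scm', or 'mix_scm')
    prior_type="mix_scm",
    # Computation settings
    n_jobs=-1, num_threads_per_generate=1, device="cpu",
)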

Train

MachineLearningLM uses the LLaMA-Factory framework for training.

Training Environment Configuration

cd ./third_party/LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation
pip install wandb

Use ./scripts/train.sh for training.
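
Under the hood, LLaMA-Factory training runs are typically launched with its CLI, so train.sh presumably wraps a call of roughly this shape (the YAML path is a placeholder for your own training configuration):

llamafactory-cli train path/to/your_train_config.yaml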

Project Structure

MachineLearningLM/
├── src/
│   ├── evaluation/
│   │   ├── data_prep/          # Data preprocessing and chunking utilities
│   │   ├── prompt_gen/         # Prompt generation for deep learning models
│   │   ├── model_pred/         # Model inference (ML and DL prediction engines)
│   │   ├── result_proc/        # 5-layer evaluation architecture and metrics processing
│   │   ├── zero_summary/       # Result summarization and report generation
│   │   └── tabicl_evaluate.py
│   └── prior_data/
│       └── pt_to_csv.py
├── scripts/
│   ├── single_process/         # Sequential execution shell scripts
│   ├── multi_process/          # Parallel execution shell scripts (with _mp suffix)
│   ├── evaluate_parameters.sh  # Global parameter configuration
│   ├── evaluate_pipeline.sh    # Automated end-to-end pipeline
│   ├── generate_data.sh
│   ├── tabicl_evaluate.sh
│   └── train.sh
├── datahub_inputs/
│   ├── data_demo/              # Demo datasets for testing
│   └── data_raw/               # Raw input datasets
├── third_party/
│   ├── tabicl/
│   └── LLaMA-Factory/
├── requirements.txt            # Python dependencies for the evaluation framework
├── README.md
├── README_zh.md
├── THIRD_PARTY_NOTICES.md
└── LICENSE

Acknowledgement

We thank LLaMA-Factory and TabICL for their open-source code.

Reference

@article{dong2025machinelearninglm,
  title={MachineLearningLM: Scaling Many-shot In-Context Learning via Continued Pretraining},
  author={Dong, Haoyu and Zhang, Pengkun and Lu, Mingzhe and Shen, Yanzhen and Ke, Guolin},
  journal={arXiv preprint arXiv:2509.06806},
  year={2025}
}
