📄 Paper: https://arxiv.org/abs/2509.06806
🤗 huggingface: https://huggingface.co/MachineLearningLM
Model
The model is available on Hugging Face.
Pretraining Dataset
All datasets have been open-sourced on Hugging Face. Due to the large file size, the dataset has been split into multiple parts. Complete copies of the datasets are also hosted on Google Drive:
A comprehensive framework for evaluating large language models (LLMs) on machine learning tasks, supporting both traditional machine learning models and deep learning approaches with automated pipeline processing.
The framework provides end-to-end evaluation capabilities, featuring automated data preprocessing, prompt generation, model inference, and comprehensive evaluation metrics.
- Special Character Handling: CSV filenames in TALENT datasets may contain shell reserved characters such as (. We recommend renaming these files to use only numbers, letters, and underscores before processing (see the preprocessing sketch below).
- Text Data Processing: Text data is supported, but since commas (,) are used as feature separators, please replace any commas in your dataset text to avoid confusing the model. In our evaluation, we replace commas with spaces.
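The following is a minimal preprocessing sketch covering both points, assuming your raw CSVs live under ./datahub_inputs/data_raw (adjust the path to your own data); it renames files to use only letters, digits, and underscores, and replaces commas in text columns with spaces:

```python
import re
from pathlib import Path

import pandas as pd

data_dir = Path("./datahub_inputs/data_raw")  # adjust to your dataset folder

for csv_path in data_dir.glob("*.csv"):
    # Sanitize the filename: keep only letters, digits, and underscores.
    safe_stem = re.sub(r"[^0-9A-Za-z_]", "_", csv_path.stem)
    safe_path = csv_path.with_name(f"{safe_stem}.csv")
    if safe_path != csv_path:
        csv_path = csv_path.rename(safe_path)

    # Replace commas inside text (object) columns with spaces so they are
    # not mistaken for feature separators.
    df = pd.read_csv(csv_path)
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].str.replace(",", " ", regex=False)
    df.to_csv(csv_path, index=False)
```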
# Install Python dependencies
pip install -r requirements.txt
For batch processing, you need to provide the input and output path parameters. The framework supports three execution modes; in each mode, first load the shared parameter configuration:
source ./scripts/evaluate_parameters.sh
See the "Execution Options" section below for detailed commands based on your preferred processing mode (sequential, parallel, or end-to-end pipeline).
Use the scripts in the single_process/ directory to run the steps sequentially:
./scripts/single_process/data_prep.sh
./scripts/single_process/prompt_gen.sh # For deep learning only
./scripts/single_process/model_pred.sh
./scripts/single_process/evaluation.sh
./scripts/single_process/report.sh
Use the scripts in the multi_process/ directory for accelerated parallel execution:
./scripts/multi_process/data_prep_mp.sh
./scripts/multi_process/prompt_gen_mp.sh # For deep learning only
./scripts/multi_process/model_pred.sh
./scripts/multi_process/evaluation_mp.sh
./scripts/multi_process/report_mp.sh
Run the complete pipeline with parallelization optimizations:
./scripts/evaluate_pipeline.sh
For direct inference on individual JSONL files, we support a single-file processing mode.
Important: The input file must have a .jsonl extension; the code uses this suffix to identify the file type.
The file should contain prompts in LLaMA Factory's Alpaca format with the following structure:
- instruction: the task instruction
- input: the input data
- output: the expected output
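As a minimal sketch, a single line of demo_input.jsonl can be produced as follows (the instruction, input, and output values here are illustrative placeholders, not the exact prompt template generated by the prompt generation step):

```python
import json

# One record in LLaMA Factory's Alpaca format; the field contents are
# placeholders, not the framework's actual prompt template.
record = {
    "instruction": "Classify the query row given the labeled examples.",
    "input": "feature_1: 12, feature_2: 0.37, feature_3: 5",
    "output": "1",
}

# Each line of the .jsonl file is one such JSON object.
with open("demo_input.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

You can then run single-file inference on this file with the command below.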
python ./src/evaluation/model_pred/dl_model_pred.py \
--input_dir ./demo_input.jsonl \
--output_dir ./demo_output.jsonl \
--model_name MachineLearningLM/MachineLearningLM-7B-v1
For cloud model calls, the model path must start with openai:: so that it is parsed correctly and executed through the OpenAI SDK:
python3 ./src/evaluation/model_pred/dl_model_pred.py \
--input_dir ./input_demo.jsonl \
--output_dir ./output_demo.jsonl \
--model_name openai::gpt-4o-mini \
--api_key your_own_api_key \
--base_url your_own_base_url \
--max_samples 5
You can also perform evaluation on individual files directly:
python ./src/evaluation/result_proc/evaluator.py \
--input_dir ./demo_response.jsonl \
--output_dir ./output_demo.txt # Can also be .jsonl
Note: Our evaluation framework is specifically designed for results generated by our dl_model_pred inference pipeline. Please use outputs from our inference module as input for evaluation to ensure compatibility.
All parameters are managed through ./scripts/evaluate_parameters.sh. Modify this file to customize:
- Input/output paths
- Model configurations
- Processing parameters
- Evaluation settings
- Dual Model Support: Traditional ML and Deep Learning models
- Flexible Processing: Single-process or multi-process execution
- Automated Pipeline: End-to-end workflow automation
- Single File Support: Direct inference on individual JSONL files
- Comprehensive Evaluation: Multi-metric evaluation framework
- Parallel Optimization: Built-in parallelization for performance
This part of the code needs to run in an environment with the tabicl and openpyxl libraries installed.
The evaluation code for tabicl is placed separately in the ./src/evaluation/tabicl_evaluate.py file. Use ./scripts/tabicl_evaluate.sh to obtain the evaluation results for tabicl.
Use --datasets to specify the datasets to be evaluated, and --sample_sizes to indicate the number of shots.
If multiple datasets need to be evaluated, separate them with spaces. To evaluate all CSV files in the input folder, use all.
MachineLearningLM uses the code from tabicl to generate prior data.
Use ./scripts/generate_data.sh to generate the prior data. It generates the corresponding .pt and .csv files, and normalizes the feature values in the CSV files to the range of 0–999, as we did in the paper.
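For reference, the sketch below shows one way the 0–999 normalization could be done, assuming per-column min-max scaling of the numeric features and a target column named label; the actual implementation in the generation code may differ:

```python
import numpy as np
import pandas as pd

def normalize_features_0_999(df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """Min-max scale each feature column into the integer range [0, 999]."""
    out = df.copy()
    for col in out.columns:
        if col == label_col:
            continue  # leave the target column untouched
        values = out[col].astype(float)
        span = values.max() - values.min()
        if span == 0:
            out[col] = 0  # a constant column maps to 0
        else:
            out[col] = np.round((values - values.min()) / span * 999).astype(int)
    return out
```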
Data Scale & Structure
| Parameter | Type | Description |
|---|---|---|
| min_features | int | Minimum number of features per dataset |
| max_features | int | Maximum number of features per dataset |
| max_classes | int | Maximum number of target classes |
| min_seq_len | int | Minimum samples per dataset; uses max_seq_len if None |
| max_seq_len | int | Maximum samples per dataset (exclusive upper bound) |
Batch Configuration
| Parameter | Type | Description |
|---|---|---|
| batch_size | int | Total number of datasets to generate per batch |
| batch_size_per_gp | int | Number of datasets per group (shared characteristics) |
| batch_size_per_subgp | int | Number of datasets per subgroup (similar causal structures); defaults to batch_size_per_gp if None |
Sequence Length Control
| Parameter | Type | Description |
|---|---|---|
| log_seq_len | bool | Sample sequence length from a log-uniform distribution if True |
| seq_len_per_gp | bool | Sample sequence length per group (enables variable-sized datasets) |
| replay_small | bool | Occasionally sample smaller sequences for model robustness |
Train-Test Split
| Parameter | Type | Description |
|---|---|---|
| min_train_size | int/float | Start position/ratio of the train split (int: absolute, float: fractional) |
| max_train_size | int/float | End position/ratio of the train split (int: absolute, float: fractional) |
Generation Method
| Parameter | Type | Description |
|---|---|---|
| prior_type | str | Prior type: 'mlp_scm', 'tree_scm', or 'mix_scm' (random selection) |
| fixed_hp | dict | Fixed structural configuration parameters |
| sampled_hp | dict | Parameters sampled during generation |
Computation Settings
| Parameter | Type | Description |
|---|---|---|
| n_jobs | int | Number of parallel jobs (-1 = use all processors) |
| num_threads_per_generate | int | Number of threads per generation job |
| device | str | Computation device ('cpu' or 'cuda') |
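To show how these parameters fit together, here is a hypothetical configuration sketch; the keys mirror the tables above, but the values are illustrative and the exact way the tabicl generation code consumes them may differ:

```python
# Hypothetical prior-data generation config; keys mirror the parameter tables above.
prior_config = {
    # Data scale & structure
    "min_features": 5,
    "max_features": 50,
    "max_classes": 10,
    "min_seq_len": None,            # falls back to max_seq_len
    "max_seq_len": 1024,            # exclusive upper bound
    # Batch configuration
    "batch_size": 64,
    "batch_size_per_gp": 8,
    "batch_size_per_subgp": None,   # defaults to batch_size_per_gp
    # Sequence length control
    "log_seq_len": True,
    "seq_len_per_gp": True,
    "replay_small": False,
    # Train-test split (floats are fractional positions)
    "min_train_size": 0.1,
    "max_train_size": 0.9,
    # Generation method
    "prior_type": "mix_scm",        # randomly mixes 'mlp_scm' and 'tree_scm'
    "fixed_hp": {},
    "sampled_hp": {},
    # Computation
    "n_jobs": -1,                   # use all processors
    "num_threads_per_generate": 1,
    "device": "cpu",
}
```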
MachineLearningLM uses the LLaMA-Factory framework for training.
cd ./third_party/LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation
pip install wandb
Use ./scripts/train.sh for training.
MachineLearningLM/
├── src/
│   ├── evaluation/
│   │   ├── data_prep/             # Data preprocessing and chunking utilities
│   │   ├── prompt_gen/            # Prompt generation for deep learning models
│   │   ├── model_pred/            # Model inference (ML and DL prediction engines)
│   │   ├── result_proc/           # 5-layer evaluation architecture and metrics processing
│   │   ├── zero_summary/          # Result summarization and report generation
│   │   └── tabicl_evaluate.py
│   └── prior_data/
│       └── pt_to_csv.py
├── scripts/
│   ├── single_process/            # Sequential execution shell scripts
│   ├── multi_process/             # Parallel execution shell scripts (with _mp suffix)
│   ├── evaluate_parameters.sh     # Global parameter configuration
│   ├── evaluate_pipeline.sh       # Automated end-to-end pipeline
│   ├── generate_data.sh
│   ├── tabicl_evaluate.sh
│   └── train.sh
├── datahub_inputs/
│   ├── data_demo/                 # Demo datasets for testing
│   └── data_raw/                  # Raw input datasets
├── third_party/
│   ├── tabicl/
│   └── LLaMA-Factory/
├── requirements.txt               # Python dependencies for the evaluation framework
├── README.md
├── README_zh.md
├── THIRD_PARTY_NOTICES.md
└── LICENSE
We thank LLaMA-Factory and TabICL for their open-source code.
@article{dong2025machinelearninglm,
title={MachineLearningLM: Scaling Many-shot In-Context Learning via Continued Pretraining},
author={Dong, Haoyu and Zhang, Pengkun and Lu, Mingzhe and Shen, Yanzhen and Ke, Guolin},
journal={arXiv preprint arXiv:2509.06806},
year={2025}
}