This repository contains the code used in this paper. The overall aim is to construct datasets of members and non-members from Gutenberg while minimizing bias.
The code contained here can be used to produce datasets, analyse n-gram distribution overlap, train classifiers, and evaluate MIAs. If you find it useful, please cite it as follows:
Cédric Eichler, Nathan Champeil, Nicolas Anciaux, Alexandra Bensamoun, Héber H. Arcolezi, et al. Nob-MIAs: Non-biased Membership Inference Attacks Assessment on Large Language Models with Ex-Post Dataset Construction. WISE 2024 - 25th International Conference on Web Information Systems Engineering, Dec. 2024, Doha, Qatar. pp. 441-456.
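For convenience, here is a BibTeX entry assembled from the reference above (the entry key and field layout are ours, not an official export):

```bibtex
@inproceedings{eichler2024nobmias,
  author    = {Eichler, C{\'e}dric and Champeil, Nathan and Anciaux, Nicolas and Bensamoun, Alexandra and Arcolezi, H{\'e}ber H. and others},
  title     = {Nob-MIAs: Non-biased Membership Inference Attacks Assessment on Large Language Models with Ex-Post Dataset Construction},
  booktitle = {WISE 2024 - 25th International Conference on Web Information Systems Engineering},
  year      = {2024},
  address   = {Doha, Qatar},
  pages     = {441--456}
}
```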
Every pipeline needs the initial dataset. Go to `get_data` to execute the pipeline that fetches and filters the Gutenberg dataset to build the initial dataset.
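For a rough idea of what this pipeline does, here is a minimal sketch of splitting books into members and non-members by their Gutenberg release date; the file name, column names, and cutoff date below are hypothetical placeholders, not the actual scripts' values:

```python
import pandas as pd

# Hypothetical metadata dump with one row per Gutenberg book.
meta = pd.read_csv("gutenberg_metadata.csv", parse_dates=["issued"])

# Books released on Gutenberg before the models' training cutoff are
# candidate members; books released after it are candidate non-members.
CUTOFF = pd.Timestamp("2023-01-01")  # hypothetical cutoff date
members = meta[meta["issued"] < CUTOFF]
non_members = meta[meta["issued"] >= CUTOFF]
```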
To analyse the n-gram distribution overlap and minimize the related bias:
- Compile bff (see `ngram_analysis/bff`)
- Follow the pipeline in `ngram_analysis/altered_distribution` (a sketch of the overlap measure follows this list)
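One natural way to measure n-gram overlap is the fraction of a document's n-grams that also appear on the member side; here is a naive set-based sketch with toy texts (bff does this at scale, and the exact measure used by the pipeline may differ):

```python
def ngrams(text: str, n: int = 5) -> set:
    """Set of whitespace-token n-grams of `text`."""
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(candidate: str, member_ngrams: set, n: int = 5) -> float:
    """Fraction of the candidate's n-grams that also occur among members."""
    cand = ngrams(candidate, n)
    return len(cand & member_ngrams) / max(len(cand), 1)

# Toy data; in practice these are the member books and candidate non-members.
member_texts = ["the quick brown fox jumps over the lazy dog every day"]
candidate_texts = ["a quick brown fox jumps over the lazy dog at night"]

member_ngrams = set().union(*(ngrams(t) for t in member_texts))
scores = [overlap_ratio(c, member_ngrams) for c in candidate_texts]
```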
Use the two sampling scripts in `data_models` (a rough sketch of both follows).
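We do not reproduce the scripts here, but conceptually one samples non-members uniformly at random while the other matches the members' overlap-score distribution. A rough sketch, where all names and the binning strategy are our own:

```python
import random

random.seed(0)  # fix the seed so the sampled datasets are reproducible

def sample_random(candidates: list, k: int) -> list:
    """Plain uniform sampling of k non-members."""
    return random.sample(candidates, k)

def sample_matched(candidates: list, scores: list, target: list,
                   k: int, bins: int = 20) -> list:
    """Sample so the candidates' overlap-score histogram mirrors `target`."""
    lo, hi = min(target), max(target)
    width = (hi - lo) / bins or 1.0
    bucket = lambda s: min(max(int((s - lo) / width), 0), bins - 1)

    pools = [[] for _ in range(bins)]   # candidates grouped by score bucket
    for cand, s in zip(candidates, scores):
        pools[bucket(s)].append(cand)

    quota = [0] * bins                  # target mass per bucket
    for s in target:
        quota[bucket(s)] += 1

    scale = k / len(target)
    picked = []
    for pool, q in zip(pools, quota):
        picked += random.sample(pool, min(round(q * scale), len(pool)))
    return picked
```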
To build the dataset minimizing classifiability, use the scripts in `data_model/data_m3`.
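For intuition on the classifiability criterion, here is a sketch that keeps the candidate non-members a simple TF-IDF logistic-regression stand-in cannot confidently tell apart from members; the actual classifier, features, and selection rule in the scripts may differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy member / candidate non-member texts; replace with the real datasets.
members = ["the quick brown fox", "a lazy dog sleeps"]
candidates = ["foxes are quick and brown", "dogs sleep all day"]

vec = TfidfVectorizer().fit(members + candidates)
X = vec.transform(members + candidates)
y = [1] * len(members) + [0] * len(candidates)

clf = LogisticRegression().fit(X, y)

# Keep the candidates the classifier is least sure about (membership
# probability closest to 0.5): a low-classifiability non-member set.
proba = clf.predict_proba(vec.transform(candidates))[:, 1]
ranked = sorted(zip(candidates, proba), key=lambda t: abs(t[1] - 0.5))
selected = [c for c, _ in ranked[: len(candidates) // 2]]
```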
The resulting datasets can be found here.
The repository is organised as follows:
- `data_models` contains scripts to construct the three datasets used in the paper (random, minimizing n-gram bias, and minimizing classifiability) from the initial dataset, along with the related models, the scripts to train them, and the results of their evaluation.
- A directory is dedicated to the initial dataset, extracted from Gutenberg following the method of [Meeus et al.](https://arxiv.org/abs/2310.15007).
- `get_data` contains scripts and a pipeline to replicate the data collection.
- Another directory contains the code necessary to evaluate MIAs on Pythia and OpenLlama with our datasets (a minimal loss-based sketch follows this list).
- `ngram_analysis` contains everything needed to compute the n-gram overlap distribution and sample a set of non-members minimizing the related bias.
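As background, the simplest attack in this family is the loss-based MIA: predict membership when the model's loss on a text falls below a threshold. A minimal sketch (the checkpoint, threshold, and text are placeholders; the attacks implemented here may be more sophisticated):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-160m"  # any Pythia / OpenLlama checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def nll(text: str) -> float:
    """Average negative log-likelihood of `text` under the model."""
    ids = tok(text, return_tensors="pt", truncation=True).input_ids
    return model(ids, labels=ids).loss.item()

# LOSS attack: flag as 'member' when the loss is below a threshold
# calibrated on a held-out split (0.5 here is an arbitrary placeholder).
score = nll("Some passage from a Gutenberg book ...")
is_member = score < 0.5
```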