You will need Python 3.9 or later, pip, and optionally virtualenv.
git clone https://github.com/SupervisedStylometry/SuperStyl.git
cd SuperStyl
virtualenv -p python3.9 env #or later
source env/bin/activate
pip install -r requirements.txt
To use Superstyl, you have two options:
- Use the provided command-line interface from your OS terminal (tested on Linux)
- Import Superstyl in a Python script or notebook, and use the API commands
You also need a collection of files containing the texts that you wish to analyse. Source files in Superstyl follow this naming convention:
Class_anythingthatyouwant
For instance:
Moliere_Amphitryon.txt
The text before the first underscore will be used as the class for training models.
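As an illustration of this convention, here is a minimal sketch of how a class label could be derived from a file name. The helper name `class_from_filename` is hypothetical and is not part of the Superstyl API; the sketch only mirrors the rule stated above.

```python
from pathlib import Path


def class_from_filename(path):
    """Return the text before the first underscore of the file name.

    Hypothetical helper, for illustration only (not Superstyl's own code).
    """
    name = Path(path).name
    # Split only once, so any later underscores stay in the remainder.
    return name.split("_", 1)[0]


print(class_from_filename("data/train/Moliere_Amphitryon.txt"))
```

Note that, in this sketch, a file name without any underscore would yield the whole file name (extension included) as its class, so it is worth checking your file names before building a corpus.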
A very simple usage, which builds a corpus of character 3-gram frequencies, trains an SVM model with leave-one-out cross-validation, and predicts the class of unknown texts, would be:
# Creating the corpus and extracting character 3-grams from text files
python load_corpus.py -s data/train/*.txt -t chars -n 3 -o train
python load_corpus.py -s data/test/*.txt -t chars -n 3 -o unknown -f train_feats.json
# Training an SVM, with cross-validation, and using it to predict the class of unknown samples
python train_svm.py train.csv --test_path unknown.csv --cross_validate leave-one-out --final
The first two commands will write to disk the files train.csv and unknown.csv, containing the metadata and feature frequencies for both sets of files, and a file train_feats.json, containing the list of features used.
The last one will print the cross-validation scores, and then write to disk a file FINAL_PREDICTIONS.csv, containing the class predictions for the unknown texts.
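To make "character 3-gram frequencies" more concrete, the following is a minimal sketch of how such frequencies can be computed for a single text. It is not Superstyl's implementation: the function name `char_ngram_freqs` is hypothetical, and the sketch assumes relative frequencies over overlapping n-grams with no text normalisation, which may differ from what load_corpus.py actually does.

```python
from collections import Counter


def char_ngram_freqs(text, n=3):
    """Relative frequencies of overlapping character n-grams in `text`.

    Hypothetical helper for illustration; no normalisation is applied.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    if total == 0:
        # Text shorter than n characters: no n-gram can be extracted.
        return {}
    return {gram: count / total for gram, count in counts.items()}


freqs = char_ngram_freqs("abcabc", n=3)
print(freqs)
```

Each document then becomes one row of such frequencies, with one column per retained feature.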
This is just a small sample of all available corpus and training options.
To learn more, run:
python load_corpus.py --help
python train_svm.py --help
A very simple usage, for building a corpus, training an SVM model with cross-validation, and predicting the class of an unknown text, would be:
import superstyl as sty
import glob
# Creating the corpus and extracting character 3-grams from text files
train, train_feats = sty.load_corpus(glob.glob("data/train/*.txt"),
                                     feats="chars", n=3)
unknown, unknown_feats = sty.load_corpus(glob.glob("data/test/*.txt"),
                                         feat_list=train_feats,
                                         feats="chars", n=3)
# Training an SVM, with cross-validation, and using it
# to predict the class of unknown samples
sty.train_svm(train, unknown, cross_validate="leave-one-out",
              final_pred=True)
This is just a small sample of all available corpus and training options.
To learn more, run:
help(sty.load_corpus)
help(sty.train_svm)
FIXME: look inside the scripts, or do
python load_corpus.py --help
python train_svm.py --help
for full documentation on the main functionalities of the CLI, regarding data generation (load_corpus.py) and SVM training (train_svm.py).
For more particular data processing usages (splitting and merging datasets), see also:
python split.py --help
python merge_datasets.csv.py --help
With or without preexisting feature list:
python load_corpus.py -s path/to/docs/* -t chars -n 3
# with it
python load_corpus.py -s path/to/docs/* -f feature_list.json -t chars -n 3
# There are several other available options
# See --help
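The point of passing a preexisting feature list is that the new documents are described in the same feature space as the training set. Here is a minimal sketch of that idea, assuming the feature list is available as a plain Python list of n-gram strings. The on-disk format of feature_list.json is not described here, and `count_with_feature_list` is a hypothetical helper, not a Superstyl function.

```python
from collections import Counter


def count_with_feature_list(text, feature_list, n=3):
    """Count character n-grams, keeping only those in `feature_list`.

    The result follows the order of `feature_list`, so every document
    yields a vector with the same columns. Illustration only.
    """
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    # Features absent from the text get a count of 0;
    # n-grams absent from the list are ignored.
    return [counts.get(feat, 0) for feat in feature_list]


train_features = ["abc", "bca", "xyz"]  # assumed to come from the training set
vector = count_with_feature_list("abcabc", train_features)
print(vector)
```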
Alternatively, you can build samples out of the data, for a given number of verses or words:
# words from txt
python load_corpus.py -s data/psyche/train/* -t chars -n 3 -x txt --sampling --sample_units words --sample_size 1000
# verses from TEI encoded docs
python load_corpus.py -s data/psyche/train/* -t chars -n 3 -x tei --sampling --sample_units verses --sample_size 200
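Conceptually, sampling by words cuts each text into consecutive slices of a fixed size. The sketch below shows one way to do this; it assumes whitespace tokenisation and drops a trailing incomplete slice, both of which are assumptions of this example rather than documented Superstyl behaviour. The name `sample_words` is hypothetical.

```python
def sample_words(text, sample_size=1000):
    """Split `text` into consecutive samples of `sample_size` words.

    Assumptions of this sketch: words are whitespace-separated, and a
    final slice shorter than `sample_size` is discarded.
    """
    words = text.split()
    n_full = len(words) // sample_size
    return [words[i * sample_size:(i + 1) * sample_size]
            for i in range(n_full)]


samples = sample_words("a b c d e", sample_size=2)
print(samples)
```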
Many other options are available for feature extraction, inclusion or exclusion of punctuation and symbols, sampling, source file formats, and more; they are all listed in the help.
You can merge several sets of features, extracted to CSV with the previous commands, by doing:
python merge_datasets.csv.py -o merged.csv char3grams.csv words.csv affixes.csv
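Merging amounts to joining, for each document, the columns of several feature tables. A rough sketch of the idea with plain dictionaries follows; it does not read or write the CSV files produced by the commands above, and `merge_feature_tables` is a hypothetical helper. How the real script handles documents missing from one table, or features with identical names, is not covered here.

```python
def merge_feature_tables(*tables):
    """Merge tables mapping document name -> {feature: value}.

    Illustration only: if two tables share a feature name, the value
    from the later table overwrites the earlier one.
    """
    merged = {}
    for table in tables:
        for doc, feats in table.items():
            merged.setdefault(doc, {}).update(feats)
    return merged


chars = {"Moliere_Amphitryon.txt": {"_am": 0.02}}
words = {"Moliere_Amphitryon.txt": {"amour": 0.001}}
merged = merge_feature_tables(chars, words)
print(merged)
```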
You can either perform k-fold cross-validation (including leave-one-out), in which case this step is unnecessary, or do a classical random train/test split.
If you want to do an initial random split,
python split.py feats_tests.csv
If you want to split according to an existing JSON file,
python split.py feats_tests.csv -s split.json
There are other available options (see --help), e.g.
python split.py feats_tests.csv -m langcert_revised.csv -e wilhelmus_train.csv
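For readers unfamiliar with the idea, a random train/test split can be sketched as follows. This is a generic illustration, not the logic of split.py: the function name, the 10% test fraction and the seed are all assumptions made for the example.

```python
import random


def random_split(items, test_fraction=0.1, seed=42):
    """Shuffle `items` and return (train, test) lists.

    Generic sketch; `test_fraction` and `seed` are illustrative defaults.
    """
    rng = random.Random(seed)  # local generator, for reproducibility
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train, test = random_split(range(10), test_fraction=0.2)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which plays the same role as saving the split and reusing it later.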
It's quite simple really,
python train_svm.py path-to-train-data.csv [--test_path TEST_PATH] [--cross_validate {leave-one-out,k-fold}] [--k K] [--dim_reduc {pca}] [--norms] [--balance {class_weight,downsampling,Tomek,upsampling,SMOTE,SMOTETomek}] [--class_weights] [--kernel {LinearSVC,linear,polynomial,rbf,sigmoid}] [--final] [--get_coefs]
For instance, using leave-one-out or 10-fold cross-validation
# e.g.
python train_svm.py data/feats_tests_train.csv --norms --cross_validate leave-one-out
python train_svm.py data/feats_tests_train.csv --norms --cross_validate k-fold --k 10
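The relation between the two modes can be seen in a small sketch: k-fold cross-validation partitions the samples into k test folds, and leave-one-out is the special case where k equals the number of samples. The generator below is a plain-Python illustration with a hypothetical name (`k_fold_indices`); it does not shuffle the data and is not how train_svm.py is implemented.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds.

    Folds are contiguous and unshuffled; sizes differ by at most one.
    """
    indices = list(range(n_samples))
    # Spread the remainder over the first folds.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))  # 3-fold
loo = list(k_fold_indices(4, 4))     # leave-one-out on 4 samples
print([len(test) for _, test in folds])
```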
Or a train/test split
# e.g.
python train_svm.py data/feats_tests_train.csv --test_path test_feats.csv --norms
And for a final analysis, applied on unseen data:
# e.g.
python train_svm.py data/feats_tests_train.csv --test_path unseen.csv --norms --final
With a few more options,
# e.g.
python train_svm.py data/feats_tests_train.csv --test_path unseen.csv --norms --class_weights --final --get_coefs
If you've created samples using --sampling to segment your text into consecutive slices (e.g., every 1000 words):
python load_corpus.py -s data/text/*.txt -t chars -n 3 -o rolling_train --sampling --sample_units words --sample_size 1000
python load_corpus.py -s data/text_to_predict/*.txt -t chars -n 3 -o rolling_unknown -f rolling_train_feats.json --sampling --sample_units words --sample_size 1000
You can then train and produce final predictions, and directly visualize how the decision function changes across these segments:
python train_svm.py rolling_train.csv --test_path rolling_unknown.csv --final --plot_rolling --plot_smoothing 5
This will produce FINAL_PREDICTIONS.csv and a plot showing how the classifier's authorial signals vary segment by segment through the text. The --plot_smoothing option applies a simple moving average to make trends clearer. If no smoothing value is given, the default is 3. Smoothing can also be disabled by setting it to None.
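As a reminder of what a simple moving average does to a sequence of decision-function values, here is a minimal sketch. It assumes an odd window centred on each segment and truncated at both ends of the text; how the plotting code actually treats the edges is not documented here, and `moving_average` is a hypothetical helper.

```python
def moving_average(values, window=3):
    """Smooth `values` with a centred moving average.

    Assumptions of this sketch: `window` is odd, and the window is
    truncated at the edges. A window of None returns the values as-is.
    """
    if window is None or window <= 1:
        return list(values)
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


scores = [1, 2, 3, 4, 5]
print(moving_average(scores, window=3))
```

A larger window gives a smoother curve but blurs short passages where the signal changes.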
You can cite SuperStyl using the CITATION.cff file (and GitHub's citation feature), as follows:
BIBTEX:
@software{camps_cafiero_2024,
author = {Jean-Baptiste Camps and Florian Cafiero},
title = {{SUPERvised STYLometry (SuperStyl)}},
month = {11},
year = {2024},
version = {v1.0},
doi = {10.5281/zenodo.14069799},
url = {https://doi.org/10.5281/zenodo.14069799}
}
MLA:
Camps, Jean-Baptiste, and Florian Cafiero. *SUPERvised STYLometry (SuperStyl)*. Version 1.0, 11 Nov. 2024, doi:10.5281/zenodo.14069799.
APA:
Camps, J.-B., & Cafiero, F. (2024). SUPERvised STYLometry (SuperStyl) (Version v1.0) [Computer software]. https://doi.org/10.5281/zenodo.14069799