Behavior-based machine learning for classifying Linux malware types from system call (syscall) sequences.
This project trains a model on syscall sequences and predicts a malware type for unseen sequences. It uses a TF‑IDF + handcrafted features pipeline with a LightGBM classifier, and provides both CLI and importable Python usage for training and inference.
- Train: detector.py
- Predict: predict.py
- Features: features.py
- Artifacts: artifacts/model.joblib, artifacts/metrics.json
Repository: https://github.com/Zierax/Malware-LinuxTypes-Detector-ML/
- TF‑IDF on syscall n‑grams (1–5)
- Lightweight handcrafted features (HandcraftedFeatures) for basic sequence statistics
- LightGBM classifier with balanced class weights
- Label simplification and rare-class filtering
- Cross‑validation and metrics export to JSON
- Flexible input for prediction: single text, CSV, or newline‑delimited file
- Optional probability output and CSV export of predictions
.
├─ detector.py # Train and evaluate; saves model + metrics to artifacts/
├─ predict.py # CLI + importable predictions from syscall sequences
├─ features.py # HandcraftedFeatures transformer (simple stats)
├─ features_shim.py # (present if used elsewhere)
├─ _dataset.csv # Training dataset (label,syscalls)
├─ sample.csv # Example input for predict CLI
├─ artifacts/
│ ├─ model.joblib # Saved pipeline (TF-IDF + features + LGBM)
│ └─ metrics.json # Evaluation metrics and label info
├─ requirements.txt
├─ test_samples.csv # Additional sample inputs (optional, for testing)
└─ README.md
See requirements.txt. Core dependencies:
- scikit-learn, numpy, scipy, pandas, joblib
- lightgbm
- (Optional for future API use) fastapi, uvicorn, pydantic, python-multipart
- psutil (optional utilities)
Python 3.10+ recommended.
Install:
pip install -r requirements.txt
Training data file: _dataset.csv (note the leading underscore). Each line is:
<label>,<space-separated-syscall-sequence>
- Column 1: label (e.g., Trojan, Ransomware)
- Column 2: syscall sequence as a single string (space‑separated tokens)
- Lines with bad formatting are skipped with a warning.
Example:
Trojan,execve brk mmap openat fstat close read write ...
Ransomware,execve openat read write fsync ...
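The format above can be read by splitting each line on its first comma. The following is a minimal sketch of that idea, not the repository's load_dataset(); the function name parse_dataset_lines and the warning text are assumptions made for illustration.

```python
import warnings
from typing import Iterable, List, Tuple


def parse_dataset_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse '<label>,<syscalls>' lines (hypothetical helper, not load_dataset())."""
    rows = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored silently
        # Split on the first comma only: column 1 is the label, the rest is the sequence.
        label, sep, sequence = line.partition(",")
        label, sequence = label.strip(), sequence.strip()
        if not sep or not label or not sequence:
            warnings.warn(f"skipping malformed line {lineno}")
            continue
        rows.append((label, sequence))
    return rows
```

Pass it an open file handle (for example `open("_dataset.csv", encoding="utf-8")`) or any iterable of strings.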
Prediction input examples:
- --text for a single sequence
- --csv for a CSV file with either:
  - a single column of sequences, or
  - a header containing one of: text, sequence, api, calls, syscalls
  - if multiple columns and no header, rows are concatenated into a single sequence
- --file for a newline‑delimited text file of sequences
See sample.csv for a minimal CSV example.
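The CSV column rules described above might be sketched as follows. This is an illustration under stated assumptions, not the logic in predict.py: the helper name sequences_from_csv is hypothetical, and "concatenated" is read here as joining the cells of each row with spaces.

```python
import csv
from typing import List

# Header names listed in this README.
KNOWN_HEADERS = ("text", "sequence", "api", "calls", "syscalls")


def sequences_from_csv(handle) -> List[str]:
    """Extract one sequence per row from an open CSV file (illustrative only)."""
    rows = [row for row in csv.reader(handle) if any(cell.strip() for cell in row)]
    if not rows:
        return []
    header = [cell.strip().lower() for cell in rows[0]]
    for name in KNOWN_HEADERS:
        if name in header:
            col = header.index(name)
            # Recognised header: skip it and read only that column.
            return [row[col].strip() for row in rows[1:] if len(row) > col]
    # No recognised header: treat every row as data and join its cells with spaces.
    # For a single-column file this simply returns the cell itself.
    return [" ".join(cell.strip() for cell in row if cell.strip()) for row in rows]
```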
Run:
python detector.py
What it does (detector.py):
- Loads _dataset.csv with robust parsing (load_dataset()).
- Simplifies labels (simplify_labels()), e.g. mapping variants to broader families (Trojan, Ransomware, Miner, etc.). If all labels are “Trojan”, it simplifies to a single class accordingly.
- Filters out classes with fewer than 5 samples.
- Chooses evaluation strategy:
  - If >= 100 samples: train/validation split + k‑fold CV on train, final eval on holdout.
  - Else: k‑fold CV on all data.
- Builds pipeline (build_pipeline()):
  - TfidfVectorizer on 1–5‑grams of space‑separated tokens, up to 20k features
  - HandcraftedFeatures (stats like token count, unique ratio, etc.)
  - LGBMClassifier with balanced class weights
- Saves:
  - artifacts/model.joblib
  - artifacts/metrics.json (includes CV report, class distributions)
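The rare-class filter and the choice of evaluation strategy listed above can be sketched in a few lines. This is a minimal illustration, not the code in detector.py: the function names, the strategy strings, and the error message are assumptions, while the thresholds (5 and 100) come from the description above.

```python
from collections import Counter
from typing import List, Tuple

MIN_SAMPLES_PER_CLASS = 5  # classes with fewer samples are dropped
HOLDOUT_MIN_SAMPLES = 100  # at or above this size a holdout split is used


def filter_rare_classes(labels: List[str], texts: List[str]) -> Tuple[List[str], List[str]]:
    """Drop samples whose class is too small (hypothetical helper)."""
    counts = Counter(labels)
    kept = [(lab, txt) for lab, txt in zip(labels, texts)
            if counts[lab] >= MIN_SAMPLES_PER_CLASS]
    kept_labels = [lab for lab, _ in kept]
    kept_texts = [txt for _, txt in kept]
    # A classifier needs at least two classes to separate.
    if len(set(kept_labels)) < 2:
        raise ValueError("need at least two classes after filtering")
    return kept_labels, kept_texts


def choose_strategy(n_samples: int) -> str:
    """Return an illustrative name for the evaluation strategy."""
    return "holdout+cv" if n_samples >= HOLDOUT_MIN_SAMPLES else "cv-only"
```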
Important notes:
- If only one class remains after filtering, training is aborted with an error message.
- If accuracy is suspiciously high on sizable data, a warning suggests reviewing for leakage.
predict.py supports multiple input modes.
Examples:
# Single sequence, print top predictions with probabilities
python predict.py --text "execve brk mmap openat fstat" --proba
# CSV input; auto-detects the column or uses the first if single-column
python predict.py --csv sample.csv --proba --output results.csv
# Newline-delimited text file; limit output to first 5 predictions
python predict.py --file sequences.txt --limit 5
Flags:
- --proba: include max probability and top-N breakdown per sample
- --limit N: show only the first N results
- --top-n K: number of top classes shown in the probability string (default: 3)
- --output path.csv: write results to CSV
- --json: print JSON to stdout instead of a tabular view
Output columns:
- Sample_ID
- Predicted_Malware_Type
- Text (truncated to 80 chars in CLI output)
- Optional: Probability, Top_Predictions (when --proba)
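To make the Probability and Top_Predictions columns concrete, here is one way a top-N breakdown could be derived from a vector of class probabilities. It is a sketch only: the helper names and the "label: 0.70" string format are assumptions, and predict.py may format its output differently.

```python
from typing import List, Sequence, Tuple


def top_n_predictions(classes: Sequence[str], probs: Sequence[float],
                      top_n: int = 3) -> List[Tuple[str, float]]:
    """Return the top_n (label, probability) pairs, highest probability first."""
    ranked = sorted(zip(classes, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


def format_top_predictions(pairs: Sequence[Tuple[str, float]]) -> str:
    """Render pairs as a single string (the format is illustrative)."""
    return ", ".join(f"{label}: {prob:.2f}" for label, prob in pairs)
```

With a fitted scikit-learn classifier, `classes` would typically come from `classes_` and `probs` from one row of `predict_proba`.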
Model file requirement:
- Expects artifacts/model.joblib to exist. Train first if missing.
from predict import run_prediction
results = run_prediction(
text="execve brk mmap openat fstat",
proba=True,
limit=10,
top_n=3,
output=None,
)
print(results)  # list of dicts
Or from CSV:
results = run_prediction(csv_path="sample.csv", proba=True)
- artifacts/model.joblib: joblib‑saved scikit‑learn pipeline
- artifacts/metrics.json: includes:
  - classification report (averaged across folds when CV)
  - labels used in training (after filtering)
  - original, simplified, and final class distributions
You can parse and display metrics in notebooks or dashboards as needed.
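A small sketch for inspecting the metrics file without assuming its key names. The helper names are hypothetical, and the only assumption about the file is that its top level is a JSON object.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_metrics(path: str = "artifacts/metrics.json") -> Dict[str, Any]:
    """Load the metrics file written by training."""
    with Path(path).open(encoding="utf-8") as fh:
        return json.load(fh)


def describe_top_level(metrics: Dict[str, Any]) -> Dict[str, str]:
    """Map each top-level key to the type name of its value."""
    return {key: type(value).__name__ for key, value in metrics.items()}
```

Calling `describe_top_level(load_metrics())` after training shows which sections are available before you build a table or chart from them.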
- features.HandcraftedFeatures: computes simple stats per sequence:
  - token count, unique ratio, average token length, uppercase token ratio, digit‑containing token ratio
- TfidfVectorizer uses token_pattern=r"[^ ]+" to treat space‑separated tokens as words.
- The vector union is built with FeatureUnion([("tfidf", ...), ("handcrafted", ...)]).
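The feature union described above could look roughly like this. It is a sketch, not the code in features.py or detector.py: SimpleSequenceStats is a hypothetical stand-in for HandcraftedFeatures, and the exact statistic definitions are assumptions. The token_pattern, the 1–5 n-gram range, and the 20k feature cap are taken from this README.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


def sequence_stats(text: str) -> list:
    """Five simple statistics for one space-separated sequence."""
    tokens = text.split()
    n = len(tokens)
    if n == 0:
        return [0.0] * 5
    return [
        float(n),                                              # token count
        len(set(tokens)) / n,                                  # unique ratio
        sum(len(t) for t in tokens) / n,                       # average token length
        sum(t.isupper() for t in tokens) / n,                  # uppercase token ratio
        sum(any(c.isdigit() for c in t) for t in tokens) / n,  # digit-containing ratio
    ]


class SimpleSequenceStats(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for features.HandcraftedFeatures."""

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        return np.array([sequence_stats(text) for text in X], dtype=float)


# TF-IDF columns and the five statistics are stacked side by side.
union = FeatureUnion([
    ("tfidf", TfidfVectorizer(token_pattern=r"[^ ]+", ngram_range=(1, 5),
                              max_features=20000)),
    ("handcrafted", SimpleSequenceStats()),
])
```

Calling `union.fit_transform(list_of_sequences)` returns one row per sequence, with the TF-IDF columns followed by the five statistics.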
- random_state=42 in LightGBM for reproducibility.
- Ensure consistent environments (Python version and requirements.txt) across runs.
- Model not found:
  - Run python detector.py to train and generate artifacts/model.joblib.
- CSV column detection:
  - If the wrong column is selected, ensure a single column or rename the desired column to one of: text, sequence, api, calls, syscalls.
- Small dataset:
  - You may see warnings; consider collecting more samples or reducing label granularity.
- Class imbalance:
  - The model uses class_weight="balanced", but more data may still be needed for rare classes.
Contributions are welcome. Please open an issue or PR in the repository: https://github.com/Zierax/Malware-LinuxTypes-Detector-ML/