Malware Linux Types Detector (ML)

Behavior-based machine learning for classifying Linux malware types from system call (syscall) sequences.

This project trains a model on syscall sequences and predicts a malware type for unseen sequences. It uses a TF‑IDF + handcrafted features pipeline with a LightGBM classifier, and provides both CLI and importable Python usage for training and inference.

Train: detector.py
Predict: predict.py
Features: features.py
Artifacts: artifacts/model.joblib, artifacts/metrics.json

Repository: https://github.com/Zierax/Malware-LinuxTypes-Detector-ML/

Features

TF‑IDF on syscall n‑grams (1–5)
Lightweight handcrafted features (HandcraftedFeatures) for basic sequence statistics
LightGBM classifier with balanced class weights
Label simplification and rare-class filtering
Cross‑validation and metrics export to JSON
Flexible input for prediction: single text, CSV, or newline‑delimited file
Optional probability output and CSV export of predictions
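
To make the n‑gram feature concrete, here is a minimal standard-library sketch of how 1–5‑grams can be enumerated from a space‑separated syscall sequence. The function name syscall_ngrams is hypothetical and not part of this repository; the project itself relies on TfidfVectorizer for this step.

```python
def syscall_ngrams(sequence, n_min=1, n_max=5):
    """Return all contiguous n-grams (n_min..n_max) of a space-separated sequence.

    Illustrative helper only; the real pipeline delegates this to TfidfVectorizer.
    """
    # Split on single spaces and drop empty tokens caused by repeated spaces.
    tokens = [tok for tok in sequence.split(" ") if tok]
    grams = []
    for n in range(n_min, n_max + 1):
        # A window of size n fits len(tokens) - n + 1 times.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


if __name__ == "__main__":
    for gram in syscall_ngrams("execve brk mmap openat", n_max=2):
        print(gram)
```

Longer n‑grams capture short behavioral patterns (for example, an open followed by a read and a write) that single tokens cannot express.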

Project Structure

.
├─ detector.py              # Train and evaluate; saves model + metrics to artifacts/
├─ predict.py               # CLI + importable predictions from syscall sequences
├─ features.py              # HandcraftedFeatures transformer (simple stats)
├─ features_shim.py         # (present if used elsewhere)
├─ _dataset.csv             # Training dataset (label,syscalls)
├─ sample.csv               # Example input for predict CLI
├─ artifacts/
│  ├─ model.joblib          # Saved pipeline (TF-IDF + features + LGBM)
│  └─ metrics.json          # Evaluation metrics and label info
├─ requirements.txt
├─ test_samples.csv         # Additional sample inputs (optional, for testing)
└─ README.md

Requirements

See requirements.txt. Core dependencies:

scikit-learn, numpy, scipy, pandas, joblib
lightgbm
(Optional for future API use) fastapi, uvicorn, pydantic, python-multipart
psutil (optional utilities)

Python 3.10+ recommended.

Install:

pip install -r requirements.txt

Data Format

Training data file: _dataset.csv (note the leading underscore). Each line is:

<label>,<space-separated-syscall-sequence>

Column 1: label (e.g., Trojan, Ransomware)
Column 2: syscall sequence as a single string (space‑separated tokens)
Lines with bad formatting are skipped with a warning.

Example:

Trojan,execve brk mmap openat fstat close read write ...
Ransomware,execve openat read write fsync ...
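
The line format above can be parsed with a few lines of standard-library Python. This is a minimal sketch under stated assumptions, not the repository's load_dataset(): the name parse_dataset_lines is hypothetical, the label is assumed to end at the first comma, and blank lines are assumed to be skipped silently.

```python
import warnings


def parse_dataset_lines(lines):
    """Parse '<label>,<syscalls>' lines into parallel lists of labels and sequences.

    Hypothetical helper: malformed lines are skipped with a warning.
    """
    labels, sequences = [], []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are ignored without a warning
        # Split on the first comma only; the sequence itself is space-separated.
        label, sep, sequence = line.partition(",")
        label, sequence = label.strip(), sequence.strip()
        if not sep or not label or not sequence:
            warnings.warn(f"Skipping malformed line {lineno}: {line[:40]!r}")
            continue
        labels.append(label)
        sequences.append(sequence)
    return labels, sequences


if __name__ == "__main__":
    demo = ["Trojan,execve brk mmap", "no-comma-here", "Ransomware,openat read write"]
    print(parse_dataset_lines(demo))
```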

Prediction input examples:

--text for a single sequence
--csv for a CSV file, handled as follows:
- a single column of sequences is used as-is
- a header containing one of text, sequence, api, calls, syscalls selects that column
- with multiple columns and no recognized header, rows are concatenated into a single sequence
--file for a newline‑delimited text file of sequences

See sample.csv for a minimal CSV example.
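
The CSV handling described above could look roughly like the following sketch. It is not the code in predict.py: the function name extract_sequences is hypothetical, and it assumes that "concatenated" means the cells of each row are joined with spaces to form one sequence per row.

```python
import csv
import io

# Header names listed in this README.
KNOWN_HEADERS = {"text", "sequence", "api", "calls", "syscalls"}


def extract_sequences(csv_text):
    """Return syscall sequences from CSV text using the documented heuristics."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        return []
    # Case 1: the first row is a header that names a known column.
    header = [cell.strip().lower() for cell in rows[0]]
    for index, name in enumerate(header):
        if name in KNOWN_HEADERS:
            return [row[index].strip() for row in rows[1:] if index < len(row)]
    # Case 2: a single column of sequences with no header.
    if all(len(row) == 1 for row in rows):
        return [row[0].strip() for row in rows]
    # Case 3 (assumption): join the cells of each row into one sequence.
    return [" ".join(cell.strip() for cell in row if cell.strip()) for row in rows]


if __name__ == "__main__":
    print(extract_sequences("id,syscalls\n1,execve brk\n2,openat read\n"))
```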

Training

Run:

python detector.py

What it does (detector.py):

Loads _dataset.csv with robust parsing (load_dataset()).
Simplifies labels (simplify_labels()), e.g. mapping variants to broader families (Trojan, Ransomware, Miner, etc.). If all labels are “Trojan”, it simplifies to a single class accordingly.
Filters out classes with fewer than 5 samples.
Chooses evaluation strategy:
- If >= 100 samples: train/validation split + k‑fold CV on train, final eval on holdout.
- Else: k‑fold CV on all data.
Builds pipeline (build_pipeline()):
- TfidfVectorizer on 1–5‑grams with tokenization over space, up to 20k features
- HandcraftedFeatures (stats like token count, unique ratio, etc.)
- LGBMClassifier with balanced class weights
Saves:
- artifacts/model.joblib
- artifacts/metrics.json (includes CV report, class distributions)
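
The rare-class filter and the choice of evaluation strategy described above can be sketched as follows. The function names and the returned strategy strings are hypothetical, and applying the 100-sample threshold after filtering is an assumption; only the thresholds (5 and 100) and the single-class abort come from this README.

```python
from collections import Counter

MIN_SAMPLES_PER_CLASS = 5   # classes with fewer samples are dropped
HOLDOUT_THRESHOLD = 100     # at or above this size, a holdout split is used


def filter_rare_classes(labels, sequences, min_count=MIN_SAMPLES_PER_CLASS):
    """Drop every sample whose class has fewer than min_count samples."""
    counts = Counter(labels)
    kept = [(lab, seq) for lab, seq in zip(labels, sequences) if counts[lab] >= min_count]
    kept_labels = [lab for lab, _ in kept]
    if len(set(kept_labels)) < 2:
        # Mirrors the documented behavior: training cannot proceed with one class.
        raise ValueError("Need at least two classes after filtering")
    return kept_labels, [seq for _, seq in kept]


def choose_strategy(n_samples, threshold=HOLDOUT_THRESHOLD):
    """Return a hypothetical tag naming the evaluation strategy."""
    return "holdout+cv" if n_samples >= threshold else "cv-only"


if __name__ == "__main__":
    labels = ["Trojan"] * 6 + ["Miner"] * 5 + ["Worm"] * 2
    kept_labels, _ = filter_rare_classes(labels, ["execve"] * len(labels))
    print(Counter(kept_labels), choose_strategy(len(kept_labels)))
```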

Important notes:

If only one class remains after filtering, training is aborted with an error message.
If accuracy is suspiciously high on sizable data, a warning suggests reviewing for leakage.

Prediction (CLI)

predict.py supports multiple input modes.

Examples:

# Single sequence, print top predictions with probabilities
python predict.py --text "execve brk mmap openat fstat" --proba

# CSV input; auto-detects the column or uses the first if single-column
python predict.py --csv sample.csv --proba --output results.csv

# Newline-delimited text file; limit output to first 5 predictions
python predict.py --file sequences.txt --limit 5

Flags:

--proba: include max probability and top-N breakdown per sample
--limit N: show only the first N results
--top-n K: number of top classes shown in the probability string (default: 3)
--output path.csv: write results to CSV
--json: print JSON to stdout instead of a tabular view

Output columns:

Sample_ID
Predicted_Malware_Type
Text (truncated to 80 chars in CLI output)
Optional: Probability, Top_Predictions (when --proba)
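
As an illustration of how one output row might be assembled from class probabilities, consider the sketch below. The column names and the 80-character truncation come from this README; the function name format_result and the exact layout of the Top_Predictions string are assumptions.

```python
def format_result(sample_id, text, classes, probabilities, top_n=3, max_text=80):
    """Build one result row from parallel lists of class names and probabilities."""
    # Rank classes from most to least probable.
    ranked = sorted(zip(classes, probabilities), key=lambda pair: pair[1], reverse=True)
    best_label, best_prob = ranked[0]
    # Assumed string layout: "Label: 0.700; Label: 0.200".
    top = "; ".join(f"{label}: {prob:.3f}" for label, prob in ranked[:top_n])
    return {
        "Sample_ID": sample_id,
        "Predicted_Malware_Type": best_label,
        "Text": text[:max_text],
        "Probability": best_prob,
        "Top_Predictions": top,
    }


if __name__ == "__main__":
    row = format_result(1, "execve brk mmap", ["Miner", "Trojan"], [0.3, 0.7])
    print(row)
```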

Model file requirement:

Expects artifacts/model.joblib to exist. Train first if missing.

Prediction (Python API)

from predict import run_prediction

results = run_prediction(
    text="execve brk mmap openat fstat",
    proba=True,
    limit=10,
    top_n=3,
    output=None,
)
print(results)  # list of dicts

Or from CSV:

results = run_prediction(csv_path="sample.csv", proba=True)
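
Because the result is a list of dicts, it can be post-processed with plain Python. The sketch below tallies predictions per class; tally_predictions is a hypothetical helper, and it assumes the dict keys match the CLI column names (for example Predicted_Malware_Type), which should be verified against the actual output.

```python
from collections import Counter


def tally_predictions(results, key="Predicted_Malware_Type"):
    """Count how many samples were assigned to each predicted class."""
    # Rows lacking the key are ignored rather than raising KeyError.
    return Counter(row[key] for row in results if key in row)


if __name__ == "__main__":
    # Hand-written rows standing in for the output of run_prediction().
    fake_results = [
        {"Sample_ID": 1, "Predicted_Malware_Type": "Trojan"},
        {"Sample_ID": 2, "Predicted_Malware_Type": "Miner"},
        {"Sample_ID": 3, "Predicted_Malware_Type": "Trojan"},
    ]
    print(tally_predictions(fake_results).most_common())
```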

Artifacts and Metrics

artifacts/model.joblib: joblib‑saved scikit‑learn pipeline
artifacts/metrics.json: includes:
- classification report (averaged across folds when CV)
- labels used in training (after filtering)
- original, simplified, and final class distributions

You can parse and display metrics in notebooks or dashboards as needed.
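
Since the exact key names inside metrics.json are not listed here, a safe first step is to inspect its top-level structure. The following sketch assumes only that the file holds a JSON object; summarize_metrics is a hypothetical helper.

```python
import json
from pathlib import Path


def summarize_metrics(path="artifacts/metrics.json"):
    """Return a one-line description of each top-level entry in the metrics file."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")
    summary = {}
    for key, value in data.items():
        if isinstance(value, dict):
            summary[key] = f"dict with {len(value)} keys"
        elif isinstance(value, list):
            summary[key] = f"list with {len(value)} items"
        else:
            summary[key] = repr(value)
    return summary


if __name__ == "__main__":
    metrics_path = Path("artifacts/metrics.json")
    if metrics_path.exists():
        for name, description in summarize_metrics(metrics_path).items():
            print(f"{name}: {description}")
```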

Implementation Details

features.HandcraftedFeatures: computes simple stats per sequence:
- token count, unique ratio, average token length, uppercase token ratio, digit‑containing token ratio
TfidfVectorizer uses token_pattern=r"[^ ]+" to treat space‑separated tokens as words.
The vector union is built with FeatureUnion([("tfidf", ...), ("handcrafted", ...)]).
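
The feature union described above can be approximated with the sketch below. It is not the repository's code: SequenceStats stands in for features.HandcraftedFeatures, the definition of an "uppercase token" (str.isupper) is an assumption, and the classifier step is omitted so that only scikit-learn and NumPy are needed.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


def sequence_stats(sequence):
    """Return [token count, unique ratio, avg token length, uppercase ratio, digit ratio]."""
    tokens = [tok for tok in sequence.split(" ") if tok]
    if not tokens:
        return [0.0, 0.0, 0.0, 0.0, 0.0]
    n = len(tokens)
    return [
        float(n),
        len(set(tokens)) / n,
        sum(len(tok) for tok in tokens) / n,
        sum(tok.isupper() for tok in tokens) / n,                      # assumed definition
        sum(any(ch.isdigit() for ch in tok) for tok in tokens) / n,
    ]


class SequenceStats(BaseEstimator, TransformerMixin):
    """Stateless transformer producing one row of statistics per sequence."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([sequence_stats(seq) for seq in X], dtype=float)


def build_feature_union():
    """Combine TF-IDF n-grams with the handcrafted statistics."""
    return FeatureUnion([
        ("tfidf", TfidfVectorizer(token_pattern=r"[^ ]+", ngram_range=(1, 5), max_features=20000)),
        ("handcrafted", SequenceStats()),
    ])


if __name__ == "__main__":
    docs = ["execve brk mmap openat", "openat read write fsync close"]
    features = build_feature_union().fit_transform(docs)
    print(features.shape)
```

FeatureUnion stacks the sparse TF‑IDF matrix and the dense statistics column-wise, so the classifier sees one combined feature matrix.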

Reproducibility

The LightGBM classifier is configured with random_state=42 so that its randomized steps are seeded consistently.
Ensure consistent environments (Python version and requirements.txt) across runs.

Troubleshooting

Model not found:
- Run python detector.py to train and generate artifacts/model.joblib.
CSV column detection:
- If the wrong column is selected, ensure a single column or rename the desired column to one of: text, sequence, api, calls, syscalls.
Small dataset:
- You may see warnings; consider collecting more samples or reducing label granularity.
Class imbalance:
- The model uses class_weight="balanced", but more data may still be needed for rare classes.
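
To see what class_weight="balanced" does, the sketch below reproduces the documented formula n_samples / (n_classes * class_count) with the standard library. The helper name balanced_class_weights is hypothetical; the library computes these weights internally.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return per-class weights using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


if __name__ == "__main__":
    labels = ["Trojan"] * 8 + ["Miner"] * 2
    print(balanced_class_weights(labels))
```

Rare classes receive larger weights, which offsets their low frequency during training but does not replace the information that additional samples would provide.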

Contributing

Contributions are welcome. Please open an issue or PR in the repository: https://github.com/Zierax/Malware-LinuxTypes-Detector-ML/
