Malware Linux Types Detector (ML)


Behavior-based machine learning for classifying Linux malware types from system call (syscall) sequences.

This project trains a model on syscall sequences and predicts a malware type for unseen sequences. It uses a TF‑IDF + handcrafted features pipeline with a LightGBM classifier, and provides both CLI and importable Python usage for training and inference.

  • Train: detector.py
  • Predict: predict.py
  • Features: features.py
  • Artifacts: artifacts/model.joblib, artifacts/metrics.json

Repository: https://github.com/Zierax/Malware-LinuxTypes-Detector-ML/

Features

  • TF‑IDF on syscall n‑grams (1–5)
  • Lightweight handcrafted features (HandcraftedFeatures) for basic sequence statistics
  • LightGBM classifier with balanced class weights
  • Label simplification and rare-class filtering
  • Cross‑validation and metrics export to JSON
  • Flexible input for prediction: single text, CSV, or newline‑delimited file
  • Optional probability output and CSV export of predictions

Project Structure

.
├─ detector.py              # Train and evaluate; saves model + metrics to artifacts/
├─ predict.py               # CLI + importable predictions from syscall sequences
├─ features.py              # HandcraftedFeatures transformer (simple stats)
├─ features_shim.py         # (present if used elsewhere)
├─ _dataset.csv             # Training dataset (label,syscalls)
├─ sample.csv               # Example input for predict CLI
├─ artifacts/
│  ├─ model.joblib          # Saved pipeline (TF-IDF + features + LGBM)
│  └─ metrics.json          # Evaluation metrics and label info
├─ requirements.txt
├─ test_samples.csv         # Additional sample inputs (optional, for testing)
└─ README.md

Requirements

See requirements.txt. Core dependencies:

  • scikit-learn, numpy, scipy, pandas, joblib
  • lightgbm
  • (Optional for future API use) fastapi, uvicorn, pydantic, python-multipart
  • psutil (optional utilities)

Python 3.10+ recommended.

Install:

pip install -r requirements.txt

Data Format

Training data file: _dataset.csv (note the leading underscore). Each line is:

<label>,<space-separated-syscall-sequence>
  • Column 1: label (e.g., Trojan, Ransomware)
  • Column 2: syscall sequence as a single string (space‑separated tokens)
  • Lines with bad formatting are skipped with a warning.

Example:

Trojan,execve brk mmap openat fstat close read write ...
Ransomware,execve openat read write fsync ...
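The line format above can be parsed with a few lines of standard-library Python. The following is a minimal sketch, not the project's load_dataset(): the function name parse_dataset_lines is hypothetical, and it assumes the label never contains a comma, so splitting on the first comma is enough. Silently skipping blank lines is also a choice made for this sketch.

```python
import warnings


def parse_dataset_lines(lines):
    """Parse '<label>,<syscalls>' lines into (labels, sequences).

    Hypothetical helper for illustration; the real load_dataset() may differ.
    """
    labels, sequences = [], []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored without a warning (sketch choice)
        # Split on the first comma only: everything after it is the sequence.
        label, sep, sequence = line.partition(",")
        label, sequence = label.strip(), sequence.strip()
        if not sep or not label or not sequence:
            warnings.warn(f"Skipping malformed line {lineno}: {line[:40]!r}")
            continue
        labels.append(label)
        sequences.append(sequence)
    return labels, sequences
```

To use it on the training file, pass an open file handle, for example parse_dataset_lines(open("_dataset.csv", encoding="utf-8")).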

Prediction input examples:

  • --text for a single sequence
  • --csv for a CSV file in one of these layouts:
    • a single column of sequences
    • a header containing one of: text, sequence, api, calls, syscalls
    • multiple columns and no header, in which case rows are concatenated into a single sequence
  • --file for a newline‑delimited text file of sequences

See sample.csv for a minimal CSV example.
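The CSV rules listed above could be implemented roughly as follows. This is a sketch under stated assumptions, not the logic in predict.py: the function name sequences_from_csv is hypothetical, header matching is assumed to be case-insensitive, and "concatenated" is read here as joining the cells of each row with spaces to form one sequence per row.

```python
import csv
import io

# Header names the README lists as recognized.
KNOWN_HEADERS = ("text", "sequence", "api", "calls", "syscalls")


def sequences_from_csv(handle):
    """Return one syscall sequence per data row (hypothetical helper)."""
    rows = [row for row in csv.reader(handle) if row]
    if not rows:
        return []
    header = [cell.strip().lower() for cell in rows[0]]
    # Case 1: a recognized header names the column to use.
    for name in KNOWN_HEADERS:
        if name in header:
            idx = header.index(name)
            return [row[idx].strip() for row in rows[1:] if len(row) > idx]
    # Case 2: a single column with no header; every row is a sequence.
    if len(rows[0]) == 1:
        return [row[0].strip() for row in rows]
    # Case 3 (assumed reading): several columns, no recognized header;
    # join the cells of each row into one space-separated sequence.
    return [" ".join(cell.strip() for cell in row) for row in rows]
```

For a quick check without touching the filesystem, wrap a string in io.StringIO, for example sequences_from_csv(io.StringIO("syscalls\nexecve brk mmap\n")).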

Training

Run:

python detector.py

What it does (detector.py):

  • Loads _dataset.csv with robust parsing (load_dataset()).
  • Simplifies labels (simplify_labels()), e.g. mapping variants to broader families (Trojan, Ransomware, Miner, etc.). If all labels are “Trojan”, it simplifies to a single class accordingly.
  • Filters out classes with fewer than 5 samples.
  • Chooses an evaluation strategy:
    • With 100 or more samples: hold out a validation split, run k‑fold CV on the training portion, then evaluate once on the held‑out split.
    • Otherwise: k‑fold CV on all data.
  • Builds pipeline (build_pipeline()):
    • TfidfVectorizer on 1–5‑grams with tokenization over space, up to 20k features
    • HandcraftedFeatures (stats like token count, unique ratio, etc.)
    • LGBMClassifier with balanced class weights
  • Saves:
    • artifacts/model.joblib
    • artifacts/metrics.json (includes CV report, class distributions)
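Two of the steps above, rare-class filtering and the choice of evaluation strategy, are simple enough to sketch in plain Python. The thresholds (5 samples per class, 100 samples overall) come from the description above; the function names and the returned strategy strings are hypothetical and are not taken from detector.py.

```python
from collections import Counter

MIN_SAMPLES_PER_CLASS = 5   # classes below this count are dropped
HOLDOUT_THRESHOLD = 100     # at or above this size, a holdout split is used


def filter_rare_classes(labels, sequences, min_count=MIN_SAMPLES_PER_CLASS):
    """Keep only samples whose label occurs at least min_count times."""
    counts = Counter(labels)
    kept = [(y, s) for y, s in zip(labels, sequences) if counts[y] >= min_count]
    return [y for y, _ in kept], [s for _, s in kept]


def choose_strategy(n_samples, threshold=HOLDOUT_THRESHOLD):
    """Return an illustrative name for the evaluation strategy."""
    return "holdout+cv" if n_samples >= threshold else "cv-only"
```

After filtering, checking len(set(labels)) < 2 is one way to detect the single-class case before any training starts.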

Important notes:

  • If only one class remains after filtering, training is aborted with an error message.
  • If accuracy is suspiciously high on sizable data, a warning suggests reviewing for leakage.

Prediction (CLI)

predict.py supports multiple input modes.

Examples:

# Single sequence, print top predictions with probabilities
python predict.py --text "execve brk mmap openat fstat" --proba

# CSV input; auto-detects the column or uses the first if single-column
python predict.py --csv sample.csv --proba --output results.csv

# Newline-delimited text file; limit output to first 5 predictions
python predict.py --file sequences.txt --limit 5

Flags:

  • --proba: include max probability and top-N breakdown per sample
  • --limit N: show only the first N results
  • --top-n K: number of top classes shown in the probability string (default: 3)
  • --output path.csv: write results to CSV
  • --json: print JSON to stdout instead of a tabular view

Output columns:

  • Sample_ID
  • Predicted_Malware_Type
  • Text (truncated to 80 chars in CLI output)
  • Optional: Probability, Top_Predictions (when --proba)
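To illustrate how --top-n relates to the Top_Predictions column, here is a small sketch that ranks per-class probabilities and renders the top K as a string. The function name and the exact "label: probability" format are assumptions made for illustration; the string produced by predict.py may look different.

```python
def top_predictions(probabilities, top_n=3):
    """Format the top_n most likely classes as 'label: p, label: p, ...'.

    probabilities maps class label -> probability (hypothetical input shape).
    """
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    return ", ".join(f"{label}: {p:.2f}" for label, p in ranked[:top_n])
```

The value reported under Probability would then correspond to the first entry in this ranking, that is, the maximum class probability.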

Model file requirement:

  • Expects artifacts/model.joblib to exist. Train first if missing.

Prediction (Python API)

from predict import run_prediction

results = run_prediction(
    text="execve brk mmap openat fstat",
    proba=True,
    limit=10,
    top_n=3,
    output=None,
)
print(results)  # list of dicts

Or from CSV:

results = run_prediction(csv_path="sample.csv", proba=True)

Artifacts and Metrics

  • artifacts/model.joblib: joblib‑saved scikit‑learn pipeline
  • artifacts/metrics.json: includes:
    • classification report (averaged across folds when CV)
    • labels used in training (after filtering)
    • original, simplified, and final class distributions

You can parse and display metrics in notebooks or dashboards as needed.
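As a starting point for inspecting the metrics, the sketch below loads the JSON file and summarizes its top-level entries. It deliberately avoids assuming any key names, since the exact layout of metrics.json is not spelled out here; it only assumes the top level is a JSON object. The function name summarize_metrics is hypothetical.

```python
import json
from pathlib import Path


def summarize_metrics(path="artifacts/metrics.json"):
    """Return a {key: short description} overview of a metrics JSON file."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    summary = {}
    for key, value in data.items():
        if isinstance(value, dict):
            summary[key] = f"dict with {len(value)} entries"
        elif isinstance(value, list):
            summary[key] = f"list with {len(value)} items"
        else:
            summary[key] = repr(value)  # scalars are shown as-is
    return summary
```

Printing the returned dict is a quick way to discover which keys hold the classification report and the class distributions before building a notebook or dashboard view.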

Implementation Details

  • features.HandcraftedFeatures: computes simple stats per sequence:
    • token count, unique ratio, average token length, uppercase token ratio, digit‑containing token ratio
  • TfidfVectorizer uses token_pattern=r"[^ ]+" to treat space‑separated tokens as words.
  • The vector union is built with FeatureUnion([("tfidf", ...), ("handcrafted", ...)]).
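The feature stage described above can be sketched with scikit-learn as follows. The token_pattern, the 1 to 5 n-gram range, the 20k feature cap, and the five statistics come from this README; the class name SequenceStats is a hypothetical stand-in for features.HandcraftedFeatures, and its exact definitions (for example, treating a token as "uppercase" when str.isupper() is true) are assumptions. The LightGBM classifier is left out so the sketch depends only on scikit-learn and NumPy.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


class SequenceStats(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for features.HandcraftedFeatures."""

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        rows = []
        for text in X:
            tokens = text.split()
            n = len(tokens)
            denom = max(n, 1)  # avoid division by zero on empty sequences
            rows.append([
                n,                                                        # token count
                len(set(tokens)) / denom,                                 # unique ratio
                sum(len(t) for t in tokens) / denom,                      # average token length
                sum(t.isupper() for t in tokens) / denom,                 # uppercase token ratio
                sum(any(c.isdigit() for c in t) for t in tokens) / denom, # digit-containing ratio
            ])
        return np.asarray(rows, dtype=float)


def build_feature_union():
    """Combine TF-IDF n-grams with the handcrafted statistics."""
    return FeatureUnion([
        ("tfidf", TfidfVectorizer(token_pattern=r"[^ ]+",
                                  ngram_range=(1, 5),
                                  max_features=20000)),
        ("handcrafted", SequenceStats()),
    ])
```

Calling build_feature_union().fit_transform(list_of_sequences) yields a sparse matrix whose last five columns are the handcrafted statistics and whose remaining columns are TF-IDF weights.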

Reproducibility

  • random_state=42 in LightGBM for reproducibility.
  • Ensure consistent environments (Python version and requirements.txt) across runs.

Troubleshooting

  • Model not found:
    • Run python detector.py to train and generate artifacts/model.joblib.
  • CSV column detection:
    • If the wrong column is selected, ensure a single column or rename the desired column to one of: text, sequence, api, calls, syscalls.
  • Small dataset:
    • You may see warnings; consider collecting more samples or reducing label granularity.
  • Class imbalance:
    • The model uses class_weight="balanced", but more data may still be needed for rare classes.

Contributing

Contributions are welcome. Please open an issue or PR in the repository: https://github.com/Zierax/Malware-LinuxTypes-Detector-ML/
