On the Feasibility of Cross-Language Detection of Malicious Packages in npm and PyPI - Paper Artifacts
This supplementary material constitutes the artifact for our paper titled "On the Feasibility of Cross-Language Detection of Malicious Packages in npm and PyPI". Specifically:
- `Labelled_Dataset.csv` (1.8 MB): the labelled dataset used in our evaluation. It contains the data employed for evaluating the features, assessing the models, and training the classifiers, and it can be used to gain further insight into how the features were selected, how the models perform, and how training was carried out.
- `.pkl` files (cross-language, 125.9 kB; JavaScript mono-language, 73.7 kB; Python mono-language, 110.1 kB): the best-performing classifiers resulting from our model evaluation, saved in pickle format. They can be imported into Python using the scikit-learn library and used directly for classification tasks.
- `Malicious_Packages_Discovered.csv`: a comprehensive list of the malicious packages detected during our real-world experiment, together with their characteristics, behavior types, and obfuscation techniques.
- `scripts/*`: the scripts to reproduce the evaluation and training of the models presented in the paper.
(1) Install the dependencies via:
$ pip install -r ./scripts/requirements.txt
(2) Run the script according to the model you want to reproduce (i.e., cross-language or mono-language). For example, in the case of the cross-language XGBoost model you can run from the main folder:
$ python ./scripts/Crosslanguage_XGBoost_train_test.py
The produced classifier can then be imported in Python using the joblib function `load()`, for example:
classifier_XGBoost = joblib.load(model_path)
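As a minimal end-to-end sketch, the loaded classifier can be applied to the labelled dataset. The file name `Crosslanguage_XGBoost.pkl`, the paths, and the dropped columns below are assumptions for illustration; the authoritative preprocessing is the one implemented in the training scripts.

```python
import joblib
import pandas as pd

# Hypothetical paths: adjust them to where the artifact files are stored.
model_path = "./Crosslanguage_XGBoost.pkl"
dataset_path = "./Labelled_Dataset.csv"

# Load the pre-trained classifier shipped with the artifact.
classifier_XGBoost = joblib.load(model_path)

# Load the labelled dataset and separate the target from the features
# (assumption: the model expects the numeric feature columns of the CSV).
data = pd.read_csv(dataset_path)
y = data["Malicious"]
X = data.drop(columns=["Malicious", "Package Repository", "Package Name"])

# Classify the packages and compare against the ground-truth labels.
predictions = classifier_XGBoost.predict(X)
print("Agreement with labels:", (predictions == y).mean())
```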
`Labelled_Dataset.csv` is a labelled dataset containing packages from two distinct public repositories: NPM and PyPI. It addresses the problem of classifying malicious packages while taking the class imbalance into account: in this scenario, we assume that the percentage of malicious packages among all packages is approximately 10%.

The columns of the CSV file are explained below (a minimal loading example is given after the feature descriptions).
- `Malicious`: encodes the target value, i.e., whether a package is benign (value of 0) or malicious (value of 1)
- `Package Repository`: specifies from which package repository the package comes (i.e., NPM or PyPI)
- `Package Name`: specifies the package name
- `Number of Words in source code`: count of words in the source code files
- `Number of lines in source code`: count of lines in the source code files
- `Number of sospicious token in source code`: count of suspicious words (e.g., bash commands, paths to sensitive files) in the source code files
- `Number of URLs in source code`: count of URLs detected in the source code files
- `Number of base64 chunks in source code`: number of detected valid base64 strings in the source code files
- `Number of IP adress in source code`: number of detected IP addresses in the source code files
- `bracker ratio mean`: mean of the ratio (no. of square brackets / byte size) among source code files (a computation sketch for these ratio and entropy statistics follows this list)
- `bracker ratio std`: standard deviation of the ratio (no. of square brackets / byte size) among source code files
- `bracker ratio max`: maximum of the ratio (no. of square brackets / byte size) among source code files
- `bracker ratio q3`: third quartile of the ratio (no. of square brackets / byte size) among source code files
- `eq ratio mean`: mean of the ratio (equal signs / byte size) among source code files
- `eq ratio std`: standard deviation of the ratio (equal signs / byte size) among source code files
- `eq ratio max`: maximum of the ratio (equal signs / byte size) among source code files
- `eq ratio q3`: third quartile of the ratio (equal signs / byte size) among source code files
- `plus ratio mean`: mean of the ratio (plus signs / byte size) among source code files
- `plus ratio std`: standard deviation of the ratio (plus signs / byte size) among source code files
- `plus ratio max`: maximum of the ratio (plus signs / byte size) among source code files
- `plus ratio q3`: third quartile of the ratio (plus signs / byte size) among source code files
- `shannon mean ID source code`: mean of the Shannon entropy computed on identifiers, after applying the generalization language
- `shannon std ID source code`: standard deviation of the Shannon entropy computed on identifiers, after applying the generalization language
- `shannon max ID source code`: maximum value of the Shannon entropy computed on identifiers, after applying the generalization language
- `shannon q3 ID source code`: third quartile of the Shannon entropy computed on identifiers, after applying the generalization language
- `shannon mean string source code`: mean of the Shannon entropy computed on strings, after applying the generalization language
- `shannon std string source code`: standard deviation of the Shannon entropy computed on strings, after applying the generalization language
- `shannon max string source code`: maximum value of the Shannon entropy computed on strings, after applying the generalization language
- `shannon q3 string source code`: third quartile of the Shannon entropy computed on strings, after applying the generalization language
- `homogeneous identifiers in source code`: number of homogeneous identifiers, i.e., identifiers having all characters equal after transforming them through the generalization language
- `homogeneous strings in source code`: number of homogeneous strings, i.e., strings having all characters equal after transforming them through the generalization language
- `heterogeneous identifiers in source code`: number of heterogeneous identifiers, i.e., identifiers with more than one distinct symbol after transforming them through the generalization language
- `heterogeneous strings in source code`: number of heterogeneous strings, i.e., strings with more than one distinct symbol after transforming them through the generalization language
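For clarity, here is a minimal, self-contained sketch of how such per-file ratio statistics and Shannon entropies can be computed. It is an assumed simplification for illustration only (no generalization language, a crude non-interpolated quartile, hypothetical file paths); the exact feature definitions are the ones described in the paper.

```python
import math
from pathlib import Path
from statistics import mean, pstdev

def shannon_entropy(token):
    """Shannon entropy (bits per symbol) of a token."""
    counts = {c: token.count(c) for c in set(token)}
    total = len(token)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

def ratio_stats(files, chars):
    """Mean, std, max and (crude) third quartile of
    (#occurrences of `chars` / file byte size) across a package's files."""
    ratios = []
    for path in files:
        data = Path(path).read_bytes()
        if data:
            count = sum(data.count(c.encode()) for c in chars)
            ratios.append(count / len(data))
    ratios.sort()
    q3 = ratios[round(0.75 * (len(ratios) - 1))]  # non-interpolated quartile
    return mean(ratios), pstdev(ratios), max(ratios), q3

# Hypothetical source files of one package.
source_files = ["pkg/index.js", "pkg/lib/helper.js"]
bracket_mean, bracket_std, bracket_max, bracket_q3 = ratio_stats(source_files, "[]")

# Entropy would be computed on identifiers/strings after the generalization step
# described in the paper; here we only show the raw entropy computation.
print(shannon_entropy("aaaa"))  # 0.0 -> homogeneous token
print(shannon_entropy("a1_Z"))  # 2.0 -> heterogeneous token
```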
The following features are extracted from the 'metadata' file of the packages, i.e., the `package.json` file for NPM and the `setup.py` file for PyPI:
- `Number of Words in metadata`: count of words in the metadata files (i.e., the `package.json` file for NPM and the `setup.py` file for PyPI)
- `Number of lines in metadata`: count of lines in the metadata files
- `Number of sospicious token in metadata`: count of suspicious words (e.g., bash commands, paths to sensitive files) in the metadata files
- `Number of URLs in metadata`: count of URLs detected in the metadata files
- `Number of base64 chunks in metadata`: number of detected valid base64 strings in the metadata files
- `Number of IP adress in metadata`: number of detected IP addresses in the metadata files
- `presence of installation script`: boolean for the presence of an installation script (PyPI: install script; NPM: presence of the keys `postinstall`, `preinstall`, `install`); see the sketch after this list
- `shannon mean ID metadata`: mean of the Shannon entropy computed on identifiers, after applying the generalization language
- `shannon std ID metadata`: standard deviation of the Shannon entropy computed on identifiers, after applying the generalization language
- `shannon max ID metadata`: maximum value of the Shannon entropy computed on identifiers, after applying the generalization language
- `shannon q3 ID metadata`: third quartile of the Shannon entropy computed on identifiers, after applying the generalization language
- `shannon mean metadata`: mean of the Shannon entropy computed on strings, after applying the generalization language
- `shannon std string metadata`: standard deviation of the Shannon entropy computed on strings, after applying the generalization language
- `shannon max string metadata`: maximum value of the Shannon entropy computed on strings, after applying the generalization language
- `shannon q3 string metadata`: third quartile of the Shannon entropy computed on strings, after applying the generalization language
- `homogeneous identifiers in metadata`: number of homogeneous identifiers, i.e., identifiers having all characters equal after transforming them through the generalization language
- `homogeneous strings in metadata`: number of homogeneous strings, i.e., strings having all characters equal after transforming them through the generalization language
- `heterogeneous identifiers in metadata`: number of heterogeneous identifiers, i.e., identifiers with more than one distinct symbol after transforming them through the generalization language
- `heterogeneous strings in metadata`: number of heterogeneous strings, i.e., strings with more than one distinct symbol after transforming them through the generalization language
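To make the installation-script feature concrete on the npm side, the following sketch checks a `package.json` for the `preinstall`, `install`, and `postinstall` keys. It is an assumed reconstruction, not the authors' extraction code; the function name `has_install_script_npm` and the example path are hypothetical. The PyPI case analogously looks for an install script in `setup.py`.

```python
import json

# Install-time hooks that npm runs automatically when a package is installed.
INSTALL_HOOKS = {"preinstall", "install", "postinstall"}

def has_install_script_npm(package_json_path):
    """Return True if package.json declares a pre/post/install script."""
    with open(package_json_path, encoding="utf-8") as fh:
        manifest = json.load(fh)
    scripts = manifest.get("scripts", {})
    return bool(INSTALL_HOOKS & set(scripts))

# Example (hypothetical path):
# print(has_install_script_npm("some-package/package.json"))
```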
The following features count the number of files per selected extension:
'bat', 'bz2', 'c', 'cert', 'conf', 'cpp', 'crt', 'css', 'csv', 'deb', 'erb', 'gemspec', 'gif', 'gz', 'h', 'html', 'ico', 'ini', 'jar', 'java', 'jpg', 'js', 'json', 'key', 'm4v', 'markdown', 'md', 'pdf', 'pem', 'png', 'ps', 'py', 'rb', 'rpm', 'rst', 'sh', 'svg', 'toml', 'ttf', 'txt', 'xml', 'yaml', 'yml', 'eot', 'exe', 'jpeg', 'properties', 'sql', 'swf', 'tar', 'woff', 'woff2', 'aac', 'bmp', 'cfg', 'dcm', 'dll', 'doc', 'flac', 'flv', 'ipynb', 'm4a', 'mid', 'mkv', 'mp3', 'mp4', 'mpg', 'ogg', 'otf', 'pickle', 'pkl', 'psd', 'pxd', 'pxi', 'pyc', 'pyx', 'r', 'rtf', 'so', 'sqlite', 'tif', 'tp', 'wav', 'webp', 'whl', 'xcf', 'xz', 'zip', 'mov', 'wasm', 'webm'.
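As a quick starting point for working with the dataset, the following sketch (assuming pandas is installed and the CSV is in the working directory) loads `Labelled_Dataset.csv` and reports the class balance overall and per repository, using the column names listed above.

```python
import pandas as pd

data = pd.read_csv("Labelled_Dataset.csv")

# Overall class balance (the dataset assumes roughly 10% malicious packages).
print(data["Malicious"].value_counts(normalize=True))

# Fraction of malicious packages per repository (NPM vs. PyPI).
print(data.groupby("Package Repository")["Malicious"].mean())
```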
The cost values reported below for training our models were measured on the following configuration:
macOS 13.5.1; CPU: 2 GHz Quad-Core Intel Core i5; RAM: 16 GB 3733 MHz LPDDR4X
These costs cover the whole process of searching for the best hyperparameters and training the models.
For XGBoost we have the following estimations:
- Cross-language model:
  - Estimated train time: 271.04s
  - Estimated space cost: 64.08 MB
- Mono-language model (JavaScript):
  - Estimated train time: 198.24s
  - Estimated space cost: 57.06 MB
- Mono-language model (Python):
  - Estimated train time: 195.50s
  - Estimated space cost: 60.98 MB
For Decision Tree (DT) we have the following estimations:
- Cross-language model:
  - Estimated train time: 86.76s
  - Estimated space cost: 118.24 MB
- Mono-language model (JavaScript):
  - Estimated train time: 51.01s
  - Estimated space cost: 108.03 MB
- Mono-language model (Python):
  - Estimated train time: 46.86s
  - Estimated space cost: 125.35 MB
For Random Forest (RF) we have the following estimations:
- Cross-language model:
  - Estimated train time: 696.67s
  - Estimated space cost: 193.31 MB
- Mono-language model (JavaScript):
  - Estimated train time: 544.89s
  - Estimated space cost: 125.66 MB
- Mono-language model (Python):
  - Estimated train time: 550.65s
  - Estimated space cost: 122.81 MB
Create an issue in this repository if you find a bug or have questions about the content.
For additional support, ask a question in SAP Community.
If you wish to contribute code or offer fixes or improvements, please send a pull request. Due to legal reasons, contributors will be asked to accept a DCO when they create their first pull request to this project. This happens in an automated fashion during the submission process. SAP uses the standard DCO text of the Linux Foundation.
Copyright (c) 2023 SAP SE or an SAP affiliate company. All rights reserved. This project is licensed under the Apache Software License, version 2.0 except as noted otherwise in the LICENSE file.