On the Feasibility of Cross-Language Detection of Malicious Packages in npm and PyPI - Paper Artifacts
This supplementary material constitutes the artifact for our paper titled "On the Feasibility of Cross-Language Detection of Malicious Packages in npm and PyPI". Specifically:
- `Labelled_Dataset.csv` (1.8 MB): the labelled dataset used in our evaluation. It contains the data employed for evaluating the features, assessing the models, and training the classifiers, and it can be used to gain further insight into how the features were selected, how the models perform, and how training was carried out.
- `.pkl` files (cross-language, 125.9 kB; JavaScript mono-language, 73.7 kB; Python mono-language, 110.1 kB): the best-performing classifiers resulting from our model evaluation, saved in pickle format. They can be imported into Python using the scikit-learn library and used directly for classification tasks.
- `Malicious_Packages_Discovered.csv`: a comprehensive list of the malicious packages detected during our real-world experiment, together with their characteristics, behavior types, and obfuscation techniques.
- `scripts/*`: the scripts to reproduce the evaluation and training of the models presented in the paper.
(1) Install the dependencies via:
$ pip install -r ./scripts/requirements.txt
(2) Run the script according to the model you want to reproduce (i.e., cross-language or mono-language). For example, in the case of the cross-language XGBoost model you can run from the main folder:
$ python ./scripts/Crosslanguage_XGBoost_train_test.py
The produced classifier can then be imported in Python using the joblib function `load()`, for example:
classifier_XGBoost = joblib.load(model_path)
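As a minimal end-to-end sketch, the loaded classifier can be applied to the labelled dataset. The file name `Crosslanguage_XGBoost.pkl`, the paths, and the dropped columns below are assumptions for illustration; the authoritative preprocessing is the one implemented in the training scripts.

```python
import joblib
import pandas as pd

# Hypothetical paths: adjust them to where the artifact files are stored.
model_path = "./Crosslanguage_XGBoost.pkl"
dataset_path = "./Labelled_Dataset.csv"

# Load the pre-trained classifier shipped with the artifact.
classifier_XGBoost = joblib.load(model_path)

# Load the labelled dataset and separate the target from the features
# (assumption: the model expects the numeric feature columns of the CSV).
data = pd.read_csv(dataset_path)
y = data["Malicious"]
X = data.drop(columns=["Malicious", "Package Repository", "Package Name"])

# Classify the packages and compare against the ground-truth labels.
predictions = classifier_XGBoost.predict(X)
print("Agreement with labels:", (predictions == y).mean())
```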
`Labelled_Dataset.csv` is a labelled dataset containing packages from two distinct public repositories: NPM and PyPI. It addresses the problem of classifying malicious packages while taking the class imbalance into account: in this scenario, we assume that the percentage of malicious packages among all packages is approximately 10%.

The columns of the CSV file are explained below (a minimal loading example is given after the feature descriptions).
- `Malicious`: encodes the target value, i.e., whether a package is benign (value of 0) or malicious (value of 1)
- `Package Repository`: specifies from which package repository the package comes (i.e., NPM or PyPI)
- `Package Name`: specifies the package name
- `Number of Words in source code`: count of words in the source code files
- `Number of lines in source code`: count of lines in the source code files
- `Number of sospicious token in source code`: count of suspicious words (e.g., bash commands, paths to sensitive files) in the source code files
- `Number of URLs in source code`: count of URLs detected in the source code files
- `Number of base64 chunks in source code`: number of detected valid base64 strings in the source code files
- `Number of IP adress in source code`: number of detected IP addresses in the source code files
- `bracker ratio mean`: mean of the ratio (no. of square brackets / byte size) among source code files (a computation sketch for these ratio and entropy statistics follows this list)
- `bracker ratio std`: standard deviation of the ratio (no. of square brackets / byte size) among source code files
- `bracker ratio max`: maximum of the ratio (no. of square brackets / byte size) among source code files
- `bracker ratio q3`: third quartile of the ratio (no. of square brackets / byte size) among source code files
- `eq ratio mean`: mean of the ratio (equal signs / byte size) among source code files
- `eq ratio std`: standard deviation of the ratio (equal signs / byte size) among source code files
- `eq ratio max`: maximum of the ratio (equal signs / byte size) among source code files
- `eq ratio q3`: third quartile of the ratio (equal signs / byte size) among source code files
- `plus ratio mean`: mean of the ratio (plus signs / byte size) among source code files
- `plus ratio std`: standard deviation of the ratio (plus signs / byte size) among source code files
- `plus ratio max`: maximum of the ratio (plus signs / byte size) among source code files
- `plus ratio q3`: third quartile of the ratio (plus signs / byte size) among source code files
- `shannon mean ID source code`: mean of the Shannon entropy computed on identifiers, after applying the generalization language
- `shannon std ID source code`: standard deviation of the Shannon entropy computed on identifiers, after applying the generalization language
- `shannon max ID source code`: maximum value of the Shannon entropy computed on identifiers, after applying the generalization language
- `shannon q3 ID source code`: third quartile of the Shannon entropy computed on identifiers, after applying the generalization language
- `shannon mean string source code`: mean of the Shannon entropy computed on strings, after applying the generalization language
- `shannon std string source code`: standard deviation of the Shannon entropy computed on strings, after applying the generalization language
- `shannon max string source code`: maximum value of the Shannon entropy computed on strings, after applying the generalization language
- `shannon q3 string source code`: third quartile of the Shannon entropy computed on strings, after applying the generalization language
- `homogeneous identifiers in source code`: number of homogeneous identifiers, i.e., identifiers having all characters equal after transforming them through the generalization language
- `homogeneous strings in source code`: number of homogeneous strings, i.e., strings having all characters equal after transforming them through the generalization language
- `heterogeneous identifiers in source code`: number of heterogeneous identifiers, i.e., identifiers with more than one distinct symbol after transforming them through the generalization language
- `heterogeneous strings in source code`: number of heterogeneous strings, i.e., strings with more than one distinct symbol after transforming them through the generalization language
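For clarity, here is a minimal, self-contained sketch of how such per-file ratio statistics and Shannon entropies can be computed. It is an assumed simplification for illustration only (no generalization language, a crude non-interpolated quartile, hypothetical file paths); the exact feature definitions are the ones described in the paper.

```python
import math
from pathlib import Path
from statistics import mean, pstdev

def shannon_entropy(token):
    """Shannon entropy (bits per symbol) of a token."""
    counts = {c: token.count(c) for c in set(token)}
    total = len(token)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

def ratio_stats(files, chars):
    """Mean, std, max and (crude) third quartile of
    (#occurrences of `chars` / file byte size) across a package's files."""
    ratios = []
    for path in files:
        data = Path(path).read_bytes()
        if data:
            count = sum(data.count(c.encode()) for c in chars)
            ratios.append(count / len(data))
    ratios.sort()
    q3 = ratios[round(0.75 * (len(ratios) - 1))]  # non-interpolated quartile
    return mean(ratios), pstdev(ratios), max(ratios), q3

# Hypothetical source files of one package.
source_files = ["pkg/index.js", "pkg/lib/helper.js"]
bracket_mean, bracket_std, bracket_max, bracket_q3 = ratio_stats(source_files, "[]")

# Entropy would be computed on identifiers/strings after the generalization step
# described in the paper; here we only show the raw entropy computation.
print(shannon_entropy("aaaa"))  # 0.0 -> homogeneous token
print(shannon_entropy("a1_Z"))  # 2.0 -> heterogeneous token
```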
The following features are extracted from the 'metadata' file of the packages, i.e., the `package.json` file for NPM and the `setup.py` file for PyPI:
- `Number of Words in metadata`: count of words in the metadata files (i.e., the `package.json` file for NPM and the `setup.py` file for PyPI)
- `Number of lines in metadata`: count of lines in the metadata files
- `Number of sospicious token in metadata`: count of suspicious words (e.g., bash commands, paths to sensitive files) in the metadata files
- `Number of URLs in metadata`: count of URLs detected in the metadata files
- `Number of base64 chunks in metadata`: number of detected valid base64 strings in the metadata files
- `Number of IP adress in metadata`: number of detected IP addresses in the metadata files
- `presence of installation script`: boolean for the presence of an installation script (PyPI: install script; NPM: presence of the keys `postinstall`, `preinstall`, `install`); see the sketch after this list
- `shannon mean ID metadata`: mean of the Shannon entropy computed on identifiers, after applying the generalization language
- `shannon std ID metadata`: standard deviation of the Shannon entropy computed on identifiers, after applying the generalization language
- `shannon max ID metadata`: maximum value of the Shannon entropy computed on identifiers, after applying the generalization language
- `shannon q3 ID metadata`: third quartile of the Shannon entropy computed on identifiers, after applying the generalization language
- `shannon mean metadata`: mean of the Shannon entropy computed on strings, after applying the generalization language
- `shannon std string metadata`: standard deviation of the Shannon entropy computed on strings, after applying the generalization language
- `shannon max string metadata`: maximum value of the Shannon entropy computed on strings, after applying the generalization language
- `shannon q3 string metadata`: third quartile of the Shannon entropy computed on strings, after applying the generalization language
- `homogeneous identifiers in metadata`: number of homogeneous identifiers, i.e., identifiers having all characters equal after transforming them through the generalization language
- `homogeneous strings in metadata`: number of homogeneous strings, i.e., strings having all characters equal after transforming them through the generalization language
- `heterogeneous identifiers in metadata`: number of heterogeneous identifiers, i.e., identifiers with more than one distinct symbol after transforming them through the generalization language
- `heterogeneous strings in metadata`: number of heterogeneous strings, i.e., strings with more than one distinct symbol after transforming them through the generalization language
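To make the installation-script feature concrete on the npm side, the following sketch checks a `package.json` for the `preinstall`, `install`, and `postinstall` keys. It is an assumed reconstruction, not the authors' extraction code; the function name `has_install_script_npm` and the example path are hypothetical. The PyPI case analogously looks for an install script in `setup.py`.

```python
import json

# Install-time hooks that npm runs automatically when a package is installed.
INSTALL_HOOKS = {"preinstall", "install", "postinstall"}

def has_install_script_npm(package_json_path):
    """Return True if package.json declares a pre/post/install script."""
    with open(package_json_path, encoding="utf-8") as fh:
        manifest = json.load(fh)
    scripts = manifest.get("scripts", {})
    return bool(INSTALL_HOOKS & set(scripts))

# Example (hypothetical path):
# print(has_install_script_npm("some-package/package.json"))
```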
The following features count the number of files per selected extension:
'bat', 'bz2', 'c', 'cert', 'conf', 'cpp', 'crt', 'css', 'csv', 'deb', 'erb', 'gemspec', 'gif', 'gz', 'h', 'html', 'ico', 'ini', 'jar', 'java', 'jpg', 'js', 'json', 'key', 'm4v', 'markdown', 'md', 'pdf', 'pem', 'png', 'ps', 'py', 'rb', 'rpm', 'rst', 'sh', 'svg', 'toml', 'ttf', 'txt', 'xml', 'yaml', 'yml', 'eot', 'exe', 'jpeg', 'properties', 'sql', 'swf', 'tar', 'woff', 'woff2', 'aac', 'bmp', 'cfg', 'dcm', 'dll', 'doc', 'flac', 'flv', 'ipynb', 'm4a', 'mid', 'mkv', 'mp3', 'mp4', 'mpg', 'ogg', 'otf', 'pickle', 'pkl', 'psd', 'pxd', 'pxi', 'pyc', 'pyx', 'r', 'rtf', 'so', 'sqlite', 'tif', 'tp', 'wav', 'webp', 'whl', 'xcf', 'xz', 'zip', 'mov', 'wasm', 'webm'.
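As a quick starting point for working with the dataset, the following sketch (assuming pandas is installed and the CSV is in the working directory) loads `Labelled_Dataset.csv` and reports the class balance overall and per repository, using the column names listed above.

```python
import pandas as pd

data = pd.read_csv("Labelled_Dataset.csv")

# Overall class balance (the dataset assumes roughly 10% malicious packages).
print(data["Malicious"].value_counts(normalize=True))

# Fraction of malicious packages per repository (NPM vs. PyPI).
print(data.groupby("Package Repository")["Malicious"].mean())
```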
The cost values reported below for training our models were measured on the following configuration:
macOS 13.5.1; CPU: 2 GHz Quad-Core Intel Core i5; RAM: 16 GB 3733 MHz LPDDR4X
These costs cover the whole process of searching for the best hyperparameters and training the models.
For XGBoost we have the following estimations:
- Cross-language model:
  - Estimated train time: 271.04s
  - Estimated space cost: 64.08 MB
- Mono-language model (JavaScript):
  - Estimated train time: 198.24s
  - Estimated space cost: 57.06 MB
- Mono-language model (Python):
  - Estimated train time: 195.50s
  - Estimated space cost: 60.98 MB
For Decision Tree (DT) we have the following estimations:
- Cross-language model:
  - Estimated train time: 86.76s
  - Estimated space cost: 118.24 MB
- Mono-language model (JavaScript):
  - Estimated train time: 51.01s
  - Estimated space cost: 108.03 MB
- Mono-language model (Python):
  - Estimated train time: 46.86s
  - Estimated space cost: 125.35 MB
For Random Forest (RF) we have the following estimations:
- Cross-language model:
  - Estimated train time: 696.67s
  - Estimated space cost: 193.31 MB
- Mono-language model (JavaScript):
  - Estimated train time: 544.89s
  - Estimated space cost: 125.66 MB
- Mono-language model (Python):
  - Estimated train time: 550.65s
  - Estimated space cost: 122.81 MB
Create an issue in this repository if you find a bug or have questions about the content.
For additional support, ask a question in SAP Community.
If you wish to contribute code or offer fixes or improvements, please send a pull request. Due to legal reasons, contributors will be asked to accept a DCO when they create their first pull request to this project. This happens in an automated fashion during the submission process. SAP uses the standard DCO text of the Linux Foundation.
Copyright (c) 2023 SAP SE or an SAP affiliate company. All rights reserved. This project is licensed under the Apache Software License, version 2.0 except as noted otherwise in the LICENSE file.