Advanced X-ray Vision for Windows Executables
- Detect malicious `.exe` files using machine learning. Extracts static features (entropy, imports, metadata) and combines ML with heuristic rules for fast, automated classification.
- 50+ New Detection Features (VM detection, anti-debugging, API call chains)
- Enhanced Prediction Engine that prints a detailed suspicious-behavior report when malware is detected
- Recall-Optimized Training: Custom scorer prioritizing malware detection
- Streamlined 3-Script Architecture (faster workflow)
- Improved Accuracy (F1-score up to 0.99 in testing)
- Dataset provided
- Source & Composition:
Dataset | Sources | Examples | Total |
---|---|---|---|
Malicious Dataset | DikeDataset, theZoo, MalwareBazaar | WannaCry.exe, njRAT.exe | 10,925 |
Benign Dataset | Windows Files, Ninite.com, PortableApps.com | Putty.exe, notepad.exe, ida.exe | 3,590 |
Total | | | 14,515 |
- Dataset Processing: Of the 10,925 malware samples, 4,200 were processed for feature extraction and then undersampled to balance against 3,500 benign samples (7,000 total). `RandomUnderSampler` (`random_state=42`) was used to prevent bias toward the malware class while preserving key patterns.
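The balancing step can be illustrated without any dependencies. The sketch below is a plain-Python stand-in for what random undersampling does conceptually, not the project's code: it randomly drops samples from the larger class until both classes are the same size. The helper name `undersample` and the toy file names are assumptions made for illustration.

```python
import random
from collections import defaultdict

def undersample(samples, labels, seed=42):
    """Randomly drop samples from larger classes until every class
    matches the size of the smallest one (hypothetical helper)."""
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible

    balanced = []
    for label, group in by_label.items():
        kept = rng.sample(group, target)  # sampling without replacement
        balanced.extend((sample, label) for sample in kept)
    rng.shuffle(balanced)
    return balanced

# Toy data: 6 "malware" (label 1) versus 3 "benign" (label 0) file names.
files = [f"mal_{i}.exe" for i in range(6)] + [f"ok_{i}.exe" for i in range(3)]
labels = [1] * 6 + [0] * 3
balanced = undersample(files, labels)
counts = {lbl: sum(1 for _, l in balanced if l == lbl) for lbl in (0, 1)}
print(counts)  # both classes end up with 3 samples
```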
- Hybrid AI detection (XGBoost + Random Forest)
- Detailed Malware Fingerprinting:
- VM/Sandbox detection markers
- Anti-debugging technique identification
- Suspicious API call patterns
- Confidence Scoring with threat level classification
- Advanced PE Analysis: Full directory parsing (TLS, Debug, Resources)
- String Analysis: Unicode/ASCII pattern detection
- Behavioral Indicators: 15+ new malware behavior signatures
# PE File Structure
'num_sections',
'num_unique_sections',
'section_names_entropy',
'avg_section_size',
'min_section_size',
'max_section_size',
'total_section_size',
'avg_entropy',
'min_entropy',
'max_entropy',
'has_packed_sections',
'has_executable_sections',
'writable_executable_sections',
'is_dll',
'is_executable',
'is_system_file',
'has_aslr',
'has_dep',
'is_signed',
'has_rich_header',
'rich_header_entries',
'has_resources',
'num_resources',
'has_embedded_exe',
'has_debug',
'has_tls',
'has_relocations',
'ep_in_first_section',
'ep_in_last_section',
'ep_section_entropy',
'has_suspicious_sections'
# API/Import Analysis
'num_imports',
'num_unique_dlls',
'num_unique_imports',
'imports_to_dlls_ratio',
'has_import_name_mismatches',
'suspicious_imports_count',
'num_exports',
'suspicious_exports',
'suspicious_api_chains',
'has_delayed_imports',
'has_vm_detection_imports',
'has_anti_debug_imports',
'has_process_creation_imports',
'has_createprocess',
'has_setwindowshookex',
# String Patterns
'num_strings',
'avg_string_length',
'has_suspicious_strings',
'has_anti_debug',
'has_vm_detection_strings',
'has_vm_mac_addresses',
'has_anti_debug_strings',
'has_nop_sleds'
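The entropy features listed above (`avg_entropy`, `min_entropy`, `max_entropy`, `has_packed_sections`) rest on the Shannon entropy of a section's raw bytes. The following is a minimal sketch of that calculation, not the project's extractor. The 7.0 cutoff for "packed" is a commonly used heuristic and an assumption here, and the helper names are hypothetical.

```python
import math
from collections import Counter

PACKED_ENTROPY_CUTOFF = 7.0  # assumed heuristic, not a value taken from the project

def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of `data` in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    probs = (n / total for n in Counter(data).values())
    return sum(-p * math.log2(p) for p in probs)

def entropy_features(sections: dict) -> dict:
    """Summarise per-section entropy; `sections` maps section name -> raw bytes."""
    values = [shannon_entropy(raw) for raw in sections.values()]
    if not values:
        return {"avg_entropy": 0.0, "min_entropy": 0.0,
                "max_entropy": 0.0, "has_packed_sections": 0}
    return {
        "avg_entropy": sum(values) / len(values),
        "min_entropy": min(values),
        "max_entropy": max(values),
        "has_packed_sections": int(max(values) >= PACKED_ENTROPY_CUTOFF),
    }

# Toy sections: uniformly distributed bytes (maximum entropy) and all zeros.
toy_sections = {".text": bytes(range(256)) * 4, ".data": b"\x00" * 1024}
print(entropy_features(toy_sections))
```

Compressed or encrypted data sits close to 8 bits per byte, which is why high section entropy is often used as a packing indicator.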
vm_detection_strings = {
b'vbox', b'vmware', b'virtualbox', b'qemu', b'xen', b'hypervisor',
b'virtual machine', b'vmcheck', b'vboxguest', b'vboxsf', b'vboxvideo'
}
vm_mac_prefixes = {
b'00:0C:29', b'00:1C:14', b'00:05:69', b'00:50:56', # VMware
b'08:00:27', # VirtualBox
b'00:16:3E', # Xen
b'00:1C:42', # Parallels
b'00:15:5D' # Hyper-V
}
anti_debug_strings = {
b'IsDebuggerPresent', b'CheckRemoteDebuggerPresent', b'OutputDebugString',
b'NtQueryInformationProcess', b'NtSetInformationThread', b'ZwSetInformationThread'
}
suspicious_patterns = {
b'payload', b'malware', b'inject', b'virus', b'trojan',
b'backdoor', b'rat', b'worm', b'spyware', b'keylog',
b'xored', b'encrypted', b'packed', b'obfus'
}
# API Groups
vm_detection_apis = {
'cpuid', 'hypervisor', 'vmcheck', 'vbox', 'vmware', 'virtualbox',
'wine_get_unix_file_name', 'wine_get_dos_file_name'
}
anti_debug_apis = {
'IsDebuggerPresent', 'CheckRemoteDebuggerPresent', 'OutputDebugStringA',
'NtQueryInformationProcess', 'NtSetInformationThread', 'NtQuerySystemInformation',
'GetTickCount', 'QueryPerformanceCounter', 'RDTSC', 'GetProcessHeap',
'ZwSetInformationThread', 'DbgBreakPoint', 'DbgUiRemoteBreakin'
}
process_creation_apis = {
'CreateProcessA', 'CreateProcessW', 'CreateProcessAsUserA', 'CreateProcessAsUserW',
'SetWindowsHookExA', 'SetWindowsHookExW', 'ShellExecuteA', 'ShellExecuteW',
'WinExec', 'System'
}
# Suspicious API Chains
api_sequences = {
('VirtualAlloc', 'WriteProcessMemory', 'CreateRemoteThread'): 'Process Injection',
('RegCreateKey', 'RegSetValue', 'RegCloseKey'): 'Registry Persistence',
('LoadLibraryA', 'GetProcAddress', 'VirtualProtect'): 'Dynamic API Resolution',
('OpenProcess', 'ReadProcessMemory', 'WriteProcessMemory'): 'Process Hollowing',
('NtUnmapViewOfSection', 'MapViewOfFile', 'ResumeThread'): 'RunPE Technique',
('CreateProcessA', 'WriteProcessMemory', 'ResumeThread'): 'Process Injection',
('SetWindowsHookExA', 'GetMessage', 'DispatchMessage'): 'Hook Injection'
}
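One way to apply the signature tables above is a case-insensitive substring scan for the byte markers and a set-containment check for the API chains. The sketch assumes a chain counts as present when all of its APIs appear in the import list, regardless of order; whether the project matches this way is not stated. `find_markers` and `find_api_chains` are hypothetical names, and only a subset of each table is repeated so the block runs on its own.

```python
# Subsets of the signature tables above, repeated so the sketch is self-contained.
VM_DETECTION_STRINGS = {b"vbox", b"vmware", b"qemu"}
VM_MAC_PREFIXES = {b"00:0C:29", b"08:00:27"}
API_SEQUENCES = {
    ("VirtualAlloc", "WriteProcessMemory", "CreateRemoteThread"): "Process Injection",
    ("LoadLibraryA", "GetProcAddress", "VirtualProtect"): "Dynamic API Resolution",
}

def find_markers(data, markers):
    """Return the markers found in `data`, ignoring ASCII case."""
    haystack = data.lower()
    return sorted(m for m in markers if m.lower() in haystack)

def find_api_chains(imports, sequences):
    """Return labels of chains whose APIs are all imported (order ignored)."""
    imported = set(imports)
    return sorted({label for chain, label in sequences.items()
                   if imported.issuperset(chain)})

# Toy inputs standing in for a file's raw bytes and its import table.
sample_bytes = b"\x00\x01VBoxGuest.sys\x00adapter 08:00:27:AB:CD:EF\x00"
sample_imports = ["VirtualAlloc", "GetProcAddress",
                  "CreateRemoteThread", "WriteProcessMemory"]

print(find_markers(sample_bytes, VM_DETECTION_STRINGS))  # [b'vbox']
print(find_markers(sample_bytes, VM_MAC_PREFIXES))       # [b'08:00:27']
print(find_api_chains(sample_imports, API_SEQUENCES))    # ['Process Injection']
```

Plain substring matching is deliberately simple. Short markers such as `b'rat'` also occur inside ordinary words (for example "operation"), so a stricter scanner may want word boundaries or longer markers.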
ExeShield_AI/
├── assets/                  # Repo Images
├── data/                    # Raw Samples
│   ├── malware/             # Malicious Executables
│   └── benign/              # Clean Executables
├── dependencies/            # Installation Dependencies
├── models/                  # Saved Models/Thresholds
│   ├── malware_detector.joblib
│   └── optimal_threshold.npy
├── output/                  # Processed Data (CSV/features)
│   └── processed_features_dataset.csv
├── scripts/                 # Core Scripts
│   ├── extract_features.py
│   ├── train_model.py
│   └── predict.py
└── README.md
git clone https://github.com/MohamedMostafa010/ExeRay.git
cd ExeRay
pip install -r dependencies/requirements.txt
> python extract_features.py
[*] Processing benign samples from ../data\benign...
[!] Not a valid PE file: adaminstall.exe
[!] Not a valid PE file: adamsync.exe
[!] Not a valid PE file: AddSuggestedFoldersToLibraryDialog.exe
[!] Not a valid PE file: AgentService.exe
[!] Not a valid PE file: AggregatorHost.exe
[!] Not a valid PE file: appcmd.exe
[!] Not a valid PE file: AppHostRegistrationVerifier.exe
[!] Not a valid PE file: ApplySettingsTemplateCatalog.exe
[!] Not a valid PE file: ApplyTrustOffline.exe
[!] Not a valid PE file: ApproveChildRequest.exe
[!] Not a valid PE file: AppVClient.exe
[!] Not a valid PE file: ARPPRODUCTICON.exe
[!] Not a valid PE file: audit.exe
[!] Not a valid PE file: AuditShD.exe
[!] Not a valid PE file: autofstx.exe
...
[*] Processing malware samples (limited to 3500) from ../data\malware...
[+] Processed Features Dataset saved to ../output/processed_features_dataset.csv
[+] Total samples: 6857
[+] Malware samples: 3500
[+] Benign samples: 3357
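The `Not a valid PE file` messages appear when a file fails PE parsing. How the script performs that check is not shown here. As a dependency-free approximation, the sketch below tests only the two signatures every PE file carries: the `MZ` magic at offset 0 and the `PE\0\0` signature at the offset stored in the DOS header field `e_lfanew` (at 0x3C). `looks_like_pe` is a hypothetical helper and is far less thorough than a full parser.

```python
import struct

def looks_like_pe(data: bytes) -> bool:
    """Cheap structural check: DOS 'MZ' magic plus 'PE\\0\\0' signature."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # e_lfanew: 4-byte little-endian offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    return data[pe_offset:pe_offset + 4] == b"PE\x00\x00"

# Build a minimal fake header in memory instead of reading a real file.
header = bytearray(0x80)
header[:2] = b"MZ"
struct.pack_into("<I", header, 0x3C, 0x40)  # PE signature lives at 0x40
header[0x40:0x44] = b"PE\x00\x00"

print(looks_like_pe(bytes(header)))          # True
print(looks_like_pe(b"MZ" + b"\x00" * 100))  # False: no PE signature
print(looks_like_pe(b"plain text"))          # False: no MZ magic
```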
> python train_model.py
Training models: 0%| | 0/2 [00:00<?, ?it/s]
New best model: XGBoost (Recall=0.990)
Training models: 100%|████████████████████████████████████████| 2/2 [00:07<00:00, 3.90s/it]
=== Evaluation ===
precision recall f1-score support
0 0.99 0.99 0.99 672
1 0.99 0.99 0.99 700
accuracy 0.99 1372
macro avg 0.99 0.99 0.99 1372
weighted avg 0.99 0.99 0.99 1372
ROC AUC: 1.000
Model saved to ../models/malware_detector.joblib
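Training also stores a decision threshold next to the model (`optimal_threshold.npy`). The exact recall-oriented scorer is not shown in this README, so the sketch below uses the F-beta score with beta = 2, which weights recall more heavily than precision, as an assumed stand-in, and scans candidate thresholds for the best value. All function names and the toy scores are illustrative.

```python
def fbeta(y_true, y_pred, beta=2.0):
    """F-beta score for the positive class (1 = malware)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def best_threshold(y_true, probs, beta=2.0):
    """Try each observed probability as a threshold; keep the best F-beta."""
    best_t, best_score = 0.5, -1.0
    for t in sorted(set(probs)):
        preds = [1 if p >= t else 0 for p in probs]
        score = fbeta(y_true, preds, beta)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Toy validation set: true labels and predicted malware probabilities.
y_true = [0, 0, 0, 1, 1, 1]
probs = [0.05, 0.20, 0.45, 0.40, 0.80, 0.95]
threshold, score = best_threshold(y_true, probs)
print(threshold, score)  # 0.4 0.9375
```

With beta above 1, a threshold that lets one benign file through is preferred over one that misses a malware sample, which is the trade-off a recall-first detector is after.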
> python predict.py "path/to/[benign_file]"
Malware Detection Results:
========================================
File: CapCut.exe
Prediction: BENIGN
Malware Probability: 0.41%
Confidence Level: VERY_LOW
Decision Threshold: 35.93%
> python predict.py "path/to/[suspicious_file]"
Malware Detection Results:
========================================
File: Mh1.exe
Prediction: MALWARE
Malware Probability: 98.39%
Confidence Level: VERY_HIGH
Decision Threshold: 35.93%
Top Suspicious Features:
- High maximum section entropy: 7.887
- Suspicious API imports: 6
- High average section entropy: 5.743
- High section name entropy: 2.807
- Writable and executable sections: 2
- Suspicious API call chains: 1
- Anti-debugging API imports: 1
- Anti-debugging strings in code: 1
- While ExeShield AI achieves high accuracy, occasional false positives (legitimate files flagged as malware) may occur. Common causes:
- Legitimate tools with behaviors resembling malware (e.g., putty.exe).
- Packed/obfuscated benign files (high entropy).
- Example false negative output (the opposite error: a malicious file, 1.exe, predicted as BENIGN):
> python predict.py "C:\Users\[USERNAME]\Documents\Maicious_TEST\1.exe"
Malware Detection Results:
========================================
File: 1.exe
Prediction: BENIGN
Malware Probability: 1.56%
Confidence Level: VERY_LOW
Decision Threshold: 35.93%
- Adjust Threshold:
- Lower the decision threshold in predict.py for stricter filtering (more malware caught, but more false positives); raise it to reduce false positives.
- Whitelist Trusted Files:
- Manually verify and exclude known-safe executables.
- Retrain the Model:
- Add misclassified samples to your dataset and rerun train_model.py.
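The first two mitigations can be sketched together: a decision function that takes the threshold as a parameter, and a whitelist keyed by SHA-256 so that manually verified files are skipped. How `predict.py` makes its decision is an assumption here (malware when the probability is at or above the threshold). `classify` and `TRUSTED_SHA256` are hypothetical names.

```python
import hashlib

# Hypothetical whitelist of SHA-256 digests for manually verified files.
TRUSTED_SHA256 = set()

def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(data).hexdigest()

def classify(probability, threshold=0.3593, file_bytes=None):
    """Label a file from its malware probability (0.0 to 1.0).

    The default threshold mirrors the 35.93% shown in the sample output;
    the comparison rule itself is an assumption.
    """
    if file_bytes is not None and sha256_of(file_bytes) in TRUSTED_SHA256:
        return "BENIGN (whitelisted)"
    return "MALWARE" if probability >= threshold else "BENIGN"

print(classify(0.0041))                  # BENIGN
print(classify(0.9839))                  # MALWARE
print(classify(0.0156))                  # BENIGN at the default threshold
print(classify(0.0156, threshold=0.01))  # MALWARE once the threshold is lowered

# Whitelisting a verified tool overrides a high score for that exact file.
tool = b"bytes of a manually verified executable"
TRUSTED_SHA256.add(sha256_of(tool))
print(classify(0.60, file_bytes=tool))   # BENIGN (whitelisted)
```

Hash-based whitelisting only covers byte-identical files, so an updated version of the same tool needs to be verified and added again.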
Model Performance Metrics:
- 🟦 Our Model (AUC = 1.00): Perfect classification capability
- 🟥 Random Guess (AUC = 0.5): Baseline for comparison
- Optimal Threshold: 36% (0.36 in probability units)
Key Interpretation:
- X-axis (False Positive Rate): Lower values = fewer false alarms
- Y-axis (True Positive Rate): Higher values = more malware caught
- Perfect Score: Curve touching top-left corner (achieved in our case)
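Both axes follow directly from confusion-matrix counts at each threshold, and the AUC is the area under the resulting curve. The sketch below computes ROC points and a trapezoidal AUC on toy scores. It is illustrative only, not the project's plotting code, and it assumes both classes are present in the labels.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, sweeping the threshold from high to low."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Overlapping scores: one benign file outranks one malware sample.
mixed = auc(roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
# Fully separated scores: the curve passes through the top-left corner.
separated = auc(roc_points([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))
print(mixed, separated)  # 0.75 1.0
```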
- Pull requests are welcome! If you have ideas for new user profiles, simulation modes, or forensic artifacts, feel free to contribute.
- This project is released under the MIT License.