Binary Similarity Model for Malware Classification
This project developed a binary similarity model that uses a Random Forest classifier to assign malware samples to families. The model reached roughly 98% accuracy on SmokeLoader, ZeusBot, and benign samples by analyzing imported functions and modules.
Project Overview
This project involved developing a binary similarity model to classify malware into various families, enhancing our capability to detect and mitigate cyber threats. The challenges encompassed acquiring and preprocessing binary data (both benign and malicious), extracting meaningful features from binary code, training and evaluating a machine learning model, and interpreting the results.
The project focused on classifying sub-variants of trojan malware, specifically SmokeLoader and ZeusBot families, using a Random Forest machine learning model. By analyzing imported operating system functions and modules, the team developed a feature model that achieved high accuracy in malware classification.

What Was Built
Binary Similarity Classification Model
The team built a Random Forest machine learning model capable of classifying malware samples into their respective families. The model analyzes binary samples by extracting features from reverse-engineered malware, focusing on imported operating system functions and modules that reveal malware behavior.
Feature Extraction Pipeline
The project involved reverse engineering 150 binary samples (SmokeLoader, ZeusBot, and benign) to extract key features including:
- Imported functions and modules (e.g., KERNEL32.dll, GetProcAddress, VirtualAlloc)
- Number of imports per sample
- Image size of binaries
- MD5 and SHA256 hashes
These features were one-hot encoded to create a feature set suitable for machine learning classification.
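As a minimal sketch of the encoding step, assuming the import names have already been pulled out of each binary, the snippet below builds a shared vocabulary and turns each sample into a 0/1 vector. The sample IDs, the import lists, and the helper name one_hot_imports are hypothetical and chosen only for illustration. Appending the import count as a plain number is also an assumption about how numeric features were handled.

```python
# Hypothetical per-sample import lists; in practice these would come from a
# PE parser or disassembler run over each binary.
samples = {
    "sample_a": ["KERNEL32.dll", "GetProcAddress", "VirtualAlloc"],
    "sample_b": ["KERNEL32.dll", "GetProcAddress"],
    "sample_c": ["USER32.dll", "MessageBoxA"],
}


def one_hot_imports(samples):
    """Return (vocabulary, rows): each row is a 0/1 vector over the sorted
    vocabulary, followed by the number of distinct imports."""
    vocabulary = sorted({name for imports in samples.values() for name in imports})
    rows = {}
    for sample_id, imports in samples.items():
        present = set(imports)
        vector = [1 if name in present else 0 for name in vocabulary]
        vector.append(len(present))  # numeric feature kept as-is (assumption)
        rows[sample_id] = vector
    return vocabulary, rows


vocabulary, rows = one_hot_imports(samples)
for sample_id, vector in rows.items():
    print(sample_id, vector)
```

Samples that share imports end up with overlapping 1s in the same columns, which is the kind of pattern a tree-based classifier can split on.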
Random Forest Model
The team selected Random Forest as the machine learning algorithm due to its flexibility, its ability to learn non-linear decision boundaries, and its relative robustness against overfitting. The model was trained on a 60/40 train-test split with hyperparameter tuning; the best-performing configuration set min_samples_split to 10.
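The training setup described above could look roughly like the following sketch, assuming scikit-learn (the parameter name min_samples_split matches its RandomForestClassifier). The data here is synthetic, and the even class balance, the stratified split, the feature count, and the random seeds are all assumptions rather than details from the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the real feature matrix: 150 samples, 3 classes,
# 20 binary "imports this function" features (all sizes are assumptions).
n_per_class, n_features = 50, 20
labels = np.repeat(["benign", "SmokeLoader", "ZeusBot"], n_per_class)

# Give each class its own probability of importing each function.
class_probs = rng.uniform(0.1, 0.9, size=(3, n_features))
X = np.vstack(
    [rng.random((n_per_class, n_features)) < class_probs[i] for i in range(3)]
).astype(int)

# 60/40 split: 90 training samples and 60 test samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.4, stratify=labels, random_state=0
)

# min_samples_split=10 is the tuned value reported in the write-up.
model = RandomForestClassifier(min_samples_split=10, random_state=0)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"test accuracy on synthetic data: {test_accuracy:.3f}")
```

A larger min_samples_split stops trees from splitting very small nodes, which tends to limit overfitting on a dataset of this size.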
Results
Model Performance
The team tested two different feature sets to optimize model performance:
All Features (Including Hashes)
- Accuracy: 95%
- Precision: ~95.41%
- Recall: 95%
- F1 Score: ~94.94%
Without Hash Features
- Accuracy: ~98.33%
- Precision: ~98.43%
- Recall: ~98.33%
- F1 Score: ~98.34%
Removing hash features improved accuracy by approximately 3.3 percentage points (from 95% to ~98.33%), suggesting that imported functions and modules provide more meaningful classification signals than hash values.
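To put these figures in context: if the metrics were computed on the 40% test portion of the 150 samples, the test set holds 60 samples, so 95% corresponds to 57 correct predictions and ~98.33% to 59. In both runs recall equals accuracy, which is what support-weighted averaging over classes produces, though the write-up does not state which averaging was used. The sketch below checks that relationship on a hypothetical confusion matrix. The matrix values and the helper name weighted_metrics are illustrative and are not the project's actual results.

```python
# Hypothetical 3-class confusion matrix for a 60-sample test set with one
# misclassification. Rows are true classes, columns are predicted classes.
classes = ["benign", "SmokeLoader", "ZeusBot"]
confusion = [
    [20, 0, 0],
    [0, 19, 1],
    [0, 0, 20],
]


def weighted_metrics(confusion):
    """Support-weighted precision, recall, and F1 from a confusion matrix."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    precision = recall = f1 = 0.0
    for i in range(n):
        support = sum(confusion[i])                       # true count of class i
        predicted = sum(confusion[r][i] for r in range(n))  # predicted as class i
        p = confusion[i][i] / predicted if predicted else 0.0
        r = confusion[i][i] / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1


total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(len(confusion))) / total
precision, recall, f1 = weighted_metrics(confusion)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

Because weighted recall sums correct predictions over the whole test set, it always matches accuracy, whereas precision and F1 shift slightly depending on which class absorbs the errors.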
Key Findings
- Imported operating system functions and modules serve as highly effective features for malware classification
- Hash values (MD5, SHA256) carry no useful signal for classification: each sample has a unique hash, so a hash feature cannot generalize to unseen samples
- The Random Forest model successfully learned to distinguish between malware families based on function usage patterns
- The model achieved recall scores above the 90% benchmark, which matters for malware detection because false negatives are particularly dangerous
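The point about hash values can be illustrated with a short sketch. It uses made-up byte strings in place of real binaries and shows that one-hot encoding a hash column yields one feature per sample, each active exactly once. The variable names and inputs are hypothetical.

```python
import hashlib

# Hypothetical stand-ins for binary contents; real samples would be file bytes.
binaries = [b"sample-one", b"sample-two", b"sample-three"]
hashes = [hashlib.sha256(data).hexdigest() for data in binaries]

# One-hot encode the hash column: one feature per distinct hash value.
columns = sorted(set(hashes))
encoded = [[1 if h == col else 0 for col in columns] for h in hashes]

# Count how many samples activate each hash column.
column_sums = [sum(row[j] for row in encoded) for j in range(len(columns))]
print(column_sums)
```

A column that is 1 for a single training sample can only help a tree memorize that sample, so such features add dimensionality and noise without helping on new binaries.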
Team Members
Karanpreet Singh
Rishik Kolli
Riley Hall
Krystian Bista
Darci Vincent
Gursharan Singh