Binary Similarity Model for Malware Classification

This project developed a binary similarity model that uses a Random Forest classifier to sort malware into families. The model reached about 98% accuracy classifying SmokeLoader, ZeusBot, and benign samples by analyzing imported functions and modules.

Project Overview

This project involved developing a binary similarity model to classify malware into various families, enhancing our capability to detect and mitigate cyber threats. The challenges encompassed acquiring and preprocessing binary data (both benign and malicious), extracting meaningful features from binary code, training and evaluating a machine learning model, and interpreting the results.

The project focused on classifying sub-variants of trojan malware, specifically SmokeLoader and ZeusBot families, using a Random Forest machine learning model. By analyzing imported operating system functions and modules, the team developed a feature model that achieved high accuracy in malware classification.

Binary Similarity Model team presenting their project

What Was Built

Binary Similarity Classification Model

The team built a Random Forest machine learning model capable of classifying malware samples into their respective families. The model analyzes binary samples by extracting features from reverse-engineered malware, focusing on imported operating system functions and modules that reveal malware behavior.

Feature Extraction Pipeline

The project involved reverse engineering 150 binary samples (SmokeLoader, ZeusBot, and benign) to extract key features including:

Imported functions and modules (e.g., KERNEL32.dll, GetProcAddress, VirtualAlloc)
Number of imports per sample
Image size of binaries
MD5 and SHA256 hashes

These features were one-hot encoded to create a feature set suitable for machine learning classification.
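As a rough illustration of the encoding step, the sketch below one-hot encodes import names for three made-up samples and appends the import count as a numeric column. The sample names, import lists, and the helper `one_hot_imports` are hypothetical and are not taken from the project's pipeline; real import lists would come from a PE parser or a disassembler.

```python
# Hypothetical import lists, used only to illustrate the encoding.
samples = {
    "sample_a": ["KERNEL32.dll", "GetProcAddress", "VirtualAlloc"],
    "sample_b": ["KERNEL32.dll", "GetProcAddress"],
    "sample_c": ["USER32.dll", "MessageBoxA"],
}


def one_hot_imports(import_lists):
    """Return (vocabulary, rows) with one binary column per distinct import name."""
    vocabulary = sorted({name for names in import_lists.values() for name in names})
    rows = {}
    for sample_id, names in import_lists.items():
        present = set(names)
        rows[sample_id] = [1 if name in present else 0 for name in vocabulary]
    return vocabulary, rows


vocabulary, rows = one_hot_imports(samples)
print(vocabulary)
for sample_id, row in rows.items():
    # The number of imports is kept as a plain numeric feature next to the binary columns.
    print(sample_id, row + [len(samples[sample_id])])
```

Each distinct module or function name becomes one column, so two samples that import the same functions end up with similar rows, which is what lets a classifier group them.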

Random Forest Model

The team selected Random Forest as the machine learning algorithm for its flexibility, its ability to learn non-linear decision boundaries, and its robustness against overfitting. The model was trained and evaluated on a 60/40 train-test split, and hyperparameter tuning found the best performance with min_samples_split set to 10.
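A minimal sketch of this setup with scikit-learn might look like the following. Only the 60/40 split and min_samples_split=10 come from the project description. The synthetic data, the equal class sizes, the stratified split, the number of trees, and the random seeds are all assumptions made so the example runs on its own.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 150 samples, 3 classes, 40 binary "import present" columns.
n_per_class, n_features = 50, 40
labels = np.repeat([0, 1, 2], n_per_class)  # e.g. 0=benign, 1=SmokeLoader, 2=ZeusBot
# Each class gets its own probability of using each import.
class_profiles = rng.uniform(0.1, 0.9, size=(3, n_features))
features = (rng.random((labels.size, n_features)) < class_profiles[labels]).astype(int)

# 60/40 split as described; stratification is an assumption, not stated in the source.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.4, stratify=labels, random_state=0
)

# min_samples_split=10 is the reported tuned value; n_estimators is the library default.
model = RandomForestClassifier(n_estimators=100, min_samples_split=10, random_state=0)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"train={len(X_train)} test={len(X_test)} accuracy={test_accuracy:.3f}")
```

With 150 samples, a 60/40 split leaves 90 samples for training and 60 for testing. A larger min_samples_split stops trees from splitting very small nodes, which tends to reduce overfitting on a dataset this small.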

Results

Model Performance

The team tested two different feature sets to optimize model performance:

All Features (Including Hashes)

Accuracy: 95%
Precision: ~95.41%
Recall: 95%
F1 Score: ~94.94%

Without Hash Features

Accuracy: ~98.33%
Precision: ~98.43%
Recall: ~98.33%
F1 Score: ~98.34%

Removing hash features improved accuracy by approximately 3.3 percentage points (from 95% to about 98.33%), indicating that imported functions and modules provide more meaningful classification signals than hash values.
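To make the four metrics concrete, the sketch below computes accuracy and support-weighted precision, recall, and F1 from a confusion matrix. The matrix is hypothetical and is not the project's actual result. The reported recall equals the reported accuracy in both runs, which is what support-weighted averaging produces, but that the project used weighted averaging is an inference. Likewise, a 60-sample test set (40% of 150) is inferred, under which 95% and 98.33% would correspond to 57 and 59 correct predictions.

```python
def weighted_metrics(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[i][i] for i in range(n_classes)) / total
    precision = recall = f1 = 0.0
    for i in range(n_classes):
        support = sum(confusion[i])                   # true members of class i
        predicted = sum(row[i] for row in confusion)  # samples predicted as class i
        p = confusion[i][i] / predicted if predicted else 0.0
        r = confusion[i][i] / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total                      # weight each class by its support
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1


# Hypothetical 60-sample test set with a single misclassification (assumption).
hypothetical = [
    [20, 0, 0],
    [0, 19, 1],
    [0, 0, 20],
]
accuracy, precision, recall, f1 = weighted_metrics(hypothetical)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Support-weighted recall always equals overall accuracy, because each class's recall is weighted by exactly the share of samples it contributes.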

Key Findings

Imported operating system functions and modules serve as highly effective features for malware classification
Hash values (MD5, SHA256) carry no usable signal for classification: each sample has a unique hash, so a hash feature identifies one sample and cannot generalize to others
The Random Forest model successfully learned to distinguish between malware families based on function usage patterns
The model achieved recall scores above the 90% benchmark, critical for malware detection where false negatives are particularly dangerous
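The point about hashes can be illustrated with a short sketch. The byte strings below are hypothetical stand-ins for binaries. Once the SHA256 digests are one-hot encoded, every column is set for exactly one sample, so no column can help the model recognize a sample it has not seen.

```python
import hashlib

# Hypothetical byte strings standing in for three nearly identical binaries.
binaries = {
    "sample_a": b"MZ\x90\x00 payload variant 1",
    "sample_b": b"MZ\x90\x00 payload variant 2",
    "sample_c": b"MZ\x90\x00 payload variant 3",
}

sha256_by_sample = {
    name: hashlib.sha256(data).hexdigest() for name, data in binaries.items()
}

# One-hot encode the digests: one column per distinct hash value.
hash_columns = sorted(set(sha256_by_sample.values()))
one_hot = {
    name: [1 if digest == column else 0 for column in hash_columns]
    for name, digest in sha256_by_sample.items()
}

# Count how many samples activate each column.
column_totals = [
    sum(row[i] for row in one_hot.values()) for i in range(len(hash_columns))
]
print(column_totals)
```

Even a one-byte difference between binaries yields an unrelated digest, so hash columns add dimensions without adding shared structure between samples of the same family.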

Team Members

Karanpreet Singh

Rishik Kolli

Riley Hall

Krystian Bista

Darci Vincent

Gursharan Singh

Documentation

Research Whitepaper

Complete research paper detailing methodology, results, and findings

Project Presentation

Presentation slides from the project demonstration
