Malicious Software Prediction

Traditional antivirus systems rely on comparing MD5 hashes of binaries against a database of manually registered known-bad hashes. As the threat landscape evolves, polymorphic samples can evade hash-based detection entirely, leaving systems that depend on it increasingly vulnerable to exploitation. This raises the need for dynamic analysis of suspicious code and for heuristics that identify malicious behavior at runtime.
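
To make the limitation concrete, here is a minimal sketch of hash-based detection using Python's standard hashlib module. The payload bytes, the known_bad_hashes set, and the is_known_bad helper are hypothetical stand-ins for illustration, not part of this project's code.

```python
import hashlib

# Hypothetical placeholder bytes standing in for a known malicious binary.
KNOWN_BAD_PAYLOAD = b"placeholder bytes standing in for a malicious binary"

# Hypothetical signature database: MD5 digests of known-bad samples.
known_bad_hashes = {hashlib.md5(KNOWN_BAD_PAYLOAD).hexdigest()}


def is_known_bad(sample: bytes) -> bool:
    """Return True if the sample's MD5 digest is in the signature database."""
    return hashlib.md5(sample).hexdigest() in known_bad_hashes


# Simulate a trivially mutated variant: same payload plus one padding byte.
variant = KNOWN_BAD_PAYLOAD + b"\x00"

print(is_known_bad(KNOWN_BAD_PAYLOAD))  # True: exact match in the database
print(is_known_bad(variant))            # False: one extra byte changes the digest
```

Because any change to the bytes produces a different digest, a sample that rewrites itself on each infection never matches a stored hash, which is what motivates looking at behavior instead.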

We use a dataset of malicious samples collected from the malware repositories VxHeaven and VirusTotal. The dataset is available via the UCI ML Repository. Our features combine static features extracted from a hex dump with dynamic features recorded in a Cuckoo sandbox runtime.

Using a simple neural network, we achieved 95% accuracy in classifying samples as malicious or benign. A random forest model on the same dataset achieved 97.4% accuracy. Our deep-learning-based approach also demonstrated considerably lower precision than our random forest model (91 false positives vs. 8, respectively). Overall, we find that, for malware samples with tabular behavioral data generated from a sandbox runtime, a random forest model outperforms a simple neural network.
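
The link between false positives and precision can be sketched as follows. Only the false-positive counts (91 and 8) come from the results above. The true-positive count is a hypothetical placeholder chosen for illustration, so the printed values are not the project's measured precision.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Precision = TP / (TP + FP): the share of 'malicious' verdicts that were correct."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("precision is undefined when nothing was flagged")
    return true_positives / flagged


# ASSUMPTION: hypothetical true-positive count, identical for both models.
# The real count is not stated in this summary.
HYPOTHETICAL_TRUE_POSITIVES = 1000

# False-positive counts reported above.
NN_FALSE_POSITIVES = 91
RF_FALSE_POSITIVES = 8

nn_precision = precision(HYPOTHETICAL_TRUE_POSITIVES, NN_FALSE_POSITIVES)
rf_precision = precision(HYPOTHETICAL_TRUE_POSITIVES, RF_FALSE_POSITIVES)

print(f"neural network precision (illustrative): {nn_precision:.3f}")
print(f"random forest precision (illustrative):  {rf_precision:.3f}")
```

With the number of true positives held fixed, fewer false positives always means higher precision, which is why the gap in false positives matters for a detector that should not flag benign software.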

You can see the source code for this model, along with a more extensive writeup, here.

This project was completed as a part of CS-434 Applied Machine Learning @ OSU-Cascades.