NeuroYara: Learning to Rank for Yara Rules Generation through Deep Language Modeling & Discriminative N-gram Encoding


Authors

Mansour, Ziad

Type

thesis

Language

eng

Keyword

Deep Learning, Malware Analysis, Malware Detection, Yara Rules, Discriminative Encoding, Language Modeling, Automatic Signatures Generation, Adversarial Malware

Abstract

Signature-based malware detection methods are simple, explainable, and efficient. One of the most ubiquitous tools is Yara, a widely used syntax for writing malware signatures. Compared to machine learning models, Yara rules have a lower false-positive rate and are easier to maintain when new malware variants must be incorporated. Producing a high-quality rule, however, requires extensive experience in reverse engineering and malware analysis, and training an experienced malware analyst is costly in both resources and time. Only a few works have attempted to automate the generation of high-quality signatures, and the rules they generate generally perform worse than manually written ones. Moreover, these tools rely on a huge, static, and non-inclusive database of hard-coded byte n-grams, which serves as the reference set for the automated Yara rule generator and helps reduce the number of false-positive predictions. Instead of storing such a database to score byte n-grams, we propose a novel architecture that uses two learning-to-rank neural networks to model the effectiveness of, and correlations among, the n-grams extracted for rule construction. This approach provides better flexibility and coverage of possible n-grams while reducing the storage required for this task from several GB to only 10 MB. Combining these two models with a hierarchical density-based clustering method allows us to group multiple n-grams into logical conditions, yielding Yara rules of higher quality. Our experimental results show that our framework, NeuroYara, reduces the effort required of the human analyst while generating rules with a low false-positive rate, outperforming both state-of-the-art tools for automatic Yara rule generation and rules written manually by expert malware analysts.
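
To make the pipeline described in the abstract more concrete, the following is a minimal illustrative sketch, not the thesis implementation. It extracts byte n-grams from samples, ranks them, and assembles the top-ranked ones into a Yara rule. The scoring function here is a plain frequency difference standing in for the learned ranking networks, the clustering step is omitted, and all function names (`byte_ngrams`, `score_ngrams`, `top_ngrams`, `build_rule`) are hypothetical.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 4) -> set[bytes]:
    """Return the distinct byte n-grams found in one sample."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def score_ngrams(malware: list[bytes], benign: list[bytes], n: int = 4) -> dict[bytes, float]:
    """Assumed stand-in for a learned ranker: the fraction of malware samples
    containing an n-gram minus the fraction of benign samples containing it."""
    mal = Counter(g for sample in malware for g in byte_ngrams(sample, n))
    ben = Counter(g for sample in benign for g in byte_ngrams(sample, n))
    return {g: mal[g] / len(malware) - ben[g] / max(len(benign), 1) for g in mal}


def top_ngrams(scores: dict[bytes, float], k: int) -> list[bytes]:
    """Pick the k best-scoring n-grams (ties broken by byte order)."""
    return sorted(scores, key=lambda g: (-scores[g], g))[:k]


def build_rule(name: str, ngrams: list[bytes], min_hits: int) -> str:
    """Render the n-grams as Yara hex strings with an 'N of them' condition."""
    lines = [f"rule {name}", "{", "    strings:"]
    for i, gram in enumerate(ngrams):
        hex_bytes = " ".join(f"{b:02X}" for b in gram)
        lines.append(f"        $s{i} = {{ {hex_bytes} }}")
    lines += ["    condition:", f"        {min_hits} of them", "}"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy byte strings, not real malware or benign files.
    malware = [b"\x90\x90EVIL\x00", b"EVILxx"]
    benign = [b"\x90\x90GOOD"]
    best = top_ngrams(score_ngrams(malware, benign), k=1)
    print(build_rule("demo_family", best, min_hits=1))
```

In this sketch a fixed "N of them" condition replaces the logical conditions that the abstract says are derived by clustering n-grams, so it only illustrates the overall shape of rule generation.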

License

Queen's University's Thesis/Dissertation Non-Exclusive License for Deposit to QSpace and Library and Archives Canada
ProQuest PhD and Master's Theses International Dissemination Agreement
Intellectual Property Guidelines at Queen's University
Copying and Preserving Your Thesis
This publication is made available by the authority of the copyright owner solely for the purpose of private study and research and may not be copied or reproduced except as permitted by the copyright laws without written authority from the copyright owner.