NeuroYara: Learning to Rank for Yara Rules Generation through Deep Language Modeling & Discriminative N-gram Encoding
Deep Learning, Malware Analysis, Malware Detection, Yara Rules, Discriminative Encoding, Language Modeling, Automatic Signatures Generation, Adversarial Malware
Signature-based malware detection methods are simple, explainable, and efficient. Yara, one of the most ubiquitous tools of this kind, provides a widely used syntax for writing malware signatures. Compared to machine learning models, Yara rules have a lower false-positive rate and are easier to maintain when new malware variants must be incorporated. Producing a high-quality rule, however, requires extensive experience in reverse engineering and malware analysis, and training an experienced malware analyst is resource- and time-consuming. Only a few works have attempted to automate the generation of high-quality signatures, and the rules they produce generally perform worse than manually written ones. Moreover, they rely on a huge, static, and non-inclusive database of hard-coded byte n-grams, which serves as the reference set for the automated Yara rule generator and helps reduce the number of false-positive predictions. Instead of storing such a database to score byte n-grams, we propose a novel architecture that uses two learning-to-rank neural networks to capture the effectiveness of, and the correlations among, the n-grams extracted for rule construction. This approach offers better flexibility and coverage of possible n-grams while reducing the storage required for this task from several GB to only 10 MB. Combining these two models with a hierarchical density-based clustering method allows us to group multiple n-grams into logical conditions, yielding Yara rules of higher quality. Our experimental results show that our framework, NeuroYara, reduces the effort required of the human analyst while generating rules with a low false-positive rate, outperforming both state-of-the-art tools for automatic Yara rule generation and rules written manually by expert malware analysts.
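
To make the pipeline described above concrete, the following is a minimal sketch of how ranked byte n-grams might be assembled into a Yara rule. It is an illustration under stated assumptions, not the NeuroYara implementation: the scorer is a simple frequency count standing in for the two learning-to-rank networks, the grouping of n-grams is passed in by hand instead of coming from hierarchical density-based clustering, and the names byte_ngrams, rank_ngrams, and render_rule are hypothetical.

```python
from collections import Counter


def byte_ngrams(data, n=4):
    """Return the set of distinct byte n-grams in a bytes object."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def rank_ngrams(malicious, benign, n=4, top_k=3):
    """Rank n-grams by how many malicious samples contain them.

    ASSUMPTION: this frequency count is only a stand-in for a learned
    ranker. N-grams that occur in any benign sample are discarded.
    """
    benign_seen = set()
    for sample in benign:
        benign_seen |= byte_ngrams(sample, n)
    counts = Counter()
    for sample in malicious:
        counts.update(byte_ngrams(sample, n) - benign_seen)
    # Highest count first; ties broken by byte value for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def render_rule(name, groups):
    """Render groups of n-grams as Yara rule text.

    ASSUMPTION: n-grams inside one group are joined with `and`, and the
    groups are joined with `or`. Empty groups are skipped.
    """
    lines = ["rule " + name, "{", "    strings:"]
    clauses = []
    idx = 0
    for group in groups:
        ids = []
        for gram in group:
            ident = "$s%d" % idx
            hex_body = " ".join("%02X" % b for b in gram)
            lines.append("        %s = { %s }" % (ident, hex_body))
            ids.append(ident)
            idx += 1
        if ids:
            clauses.append("(" + " and ".join(ids) + ")")
    lines += ["    condition:", "        " + " or ".join(clauses), "}"]
    return "\n".join(lines)


if __name__ == "__main__":
    malicious = [b"\xde\xad\xbe\xefAAAA", b"zz\xde\xad\xbe\xef"]
    benign = [b"AAAAzz"]
    top = rank_ngrams(malicious, benign, n=4, top_k=1)
    print(render_rule("demo_rule", [top]))
```

In this toy setup each n-gram is rendered as a Yara hex string and each group becomes one parenthesized clause of the condition. A learned ranker would replace the frequency count, and a clustering step would decide which n-grams share a group.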