NeuroYara: Learning to Rank for Yara Rules Generation through Deep Language Modeling & Discriminative N-gram Encoding

dc.contributor.author: Mansour, Ziad [en]
dc.contributor.department: Computing [en]
dc.contributor.supervisor: Ding, Steven
dc.date.accessioned: 2022-01-28T16:58:20Z
dc.date.available: 2022-01-28T16:58:20Z
dc.degree.grantor: Queen's University at Kingston [en]
dc.description.abstract: Signature-based malware detection methods are simple, explainable, and efficient. Yara, one of the most ubiquitous tools, provides a widely used syntax for writing malware signatures. Compared to machine learning models, Yara rules have a lower false-positive rate and are easier to maintain when new malware variants must be incorporated. Producing a high-quality rule, however, requires extensive experience in reverse engineering and malware analysis, and training an experienced malware analyst is resource- and time-consuming. Only a few works have attempted to automate the generation of high-quality signatures, and the rules they produce generally perform worse than manually written ones. Moreover, these works rely on a huge, static, and non-inclusive database of hard-coded byte n-grams, which serves as the reference set for the automated Yara rule generator and helps reduce the number of false-positive predictions. Instead of storing such a database to score byte n-grams, we propose a novel architecture that uses two learning-to-rank neural networks to model the effectiveness of, and correlations among, the n-grams extracted for rule construction. This approach provides better flexibility and coverage of possible n-grams while reducing the storage required for this task from several GB to only 10 MB. Combining these two models with a hierarchical density-based clustering method allows us to group multiple n-grams into logical conditions, yielding Yara rules of higher quality. Our experimental results show that our framework, NeuroYara, reduces the effort required of the human analyst while generating rules with a low false-positive rate, outperforming both state-of-the-art tools for automatic Yara rule generation and rules written manually by expert malware analysts. [en]
dc.description.degree: M.Sc. [en]
dc.identifier.uri: http://hdl.handle.net/1974/29913
dc.language.iso: eng [en]
dc.relation.ispartofseries: Canadian theses [en]
dc.rights: Queen's University's Thesis/Dissertation Non-Exclusive License for Deposit to QSpace and Library and Archives Canada [en]
dc.rights: ProQuest PhD and Master's Theses International Dissemination Agreement [en]
dc.rights: Intellectual Property Guidelines at Queen's University [en]
dc.rights: Copying and Preserving Your Thesis [en]
dc.rights: This publication is made available by the authority of the copyright owner solely for the purpose of private study and research and may not be copied or reproduced except as permitted by the copyright laws without written authority from the copyright owner. [en]
dc.subject: Deep Learning [en]
dc.subject: Malware Analysis [en]
dc.subject: Malware Detection [en]
dc.subject: Yara Rules [en]
dc.subject: Discriminative Encoding [en]
dc.subject: Language Modeling [en]
dc.subject: Automatic Signatures Generation [en]
dc.subject: Adversarial Malware [en]
dc.title: NeuroYara: Learning to Rank for Yara Rules Generation through Deep Language Modeling & Discriminative N-gram Encoding [en]
dc.type: thesis [en]
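Illustrative sketch (not part of the deposited record): the abstract describes scoring byte n-grams and combining the selected ones into the logical condition of a Yara rule. The Python below shows that general pipeline under loud assumptions. The function names (extract_ngrams, rank_ngrams, render_yara_rule) are hypothetical, the score is a naive document-frequency count that stands in for the thesis's two learned ranking networks, and a simple "N of them" condition stands in for the clustering-based grouping. It is not the NeuroYara implementation.

```python
from collections import Counter


def extract_ngrams(data: bytes, n: int = 4) -> set:
    """Return the set of distinct byte n-grams found in one sample."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def rank_ngrams(malware, benign, n=4, top_k=3):
    """Rank candidate n-grams with a naive stand-in score.

    Assumption: an n-gram is scored by the number of malware samples that
    contain it, and is discarded if any benign sample contains it. The
    thesis uses learned ranking models here instead.
    """
    malware_df = Counter(g for sample in malware for g in extract_ngrams(sample, n))
    benign_seen = set().union(*(extract_ngrams(sample, n) for sample in benign))
    candidates = [(count, g) for g, count in malware_df.items() if g not in benign_seen]
    # Highest document frequency first; ties broken by byte order for determinism.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [g for _, g in candidates[:top_k]]


def render_yara_rule(name, ngrams, min_match=1):
    """Render the selected n-grams as Yara hex strings with an 'N of them' condition."""
    if not ngrams:
        raise ValueError("a rule needs at least one n-gram")
    lines = [f"rule {name}", "{", "    strings:"]
    for i, gram in enumerate(ngrams):
        hex_bytes = " ".join(f"{b:02X}" for b in gram)
        lines.append(f"        $s{i} = {{ {hex_bytes} }}")
    lines += ["    condition:", f"        {min_match} of them", "}"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy byte strings, invented purely for illustration.
    malware_samples = [b"\x00EVIL\x01", b"zzEVILzz"]
    benign_samples = [b"hello world"]
    selected = rank_ngrams(malware_samples, benign_samples, n=4, top_k=1)
    print(render_yara_rule("Demo", selected, min_match=1))
```

Filtering against a benign set in this way is roughly what the abstract attributes to the reference databases of earlier tools; the thesis's stated contribution is to replace that lookup with compact learned models.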
Files
Original bundle
Name: Mansour_Ziad_K_202201_MSc.pdf
Size: 1.87 MB
Format: Adobe Portable Document Format