Semantic-Aware Fuzzing: A Reasoning-Driven Framework for LLM-Guided Input Mutation
Authors
Lu, Mengdi
Date
2025-09-11
Type
thesis
Language
eng
Keyword
Cybersecurity, Vulnerability detection, Fuzzing, Machine learning, Large language models (LLMs), Prompt engineering, Reasoning models
Abstract
Security vulnerabilities in Internet-of-Things (IoT) devices, mobile platforms, and autonomous systems continue to pose significant risks, yet traditional mutation-based fuzzers—while effective at exploring code paths—primarily perform byte- or bit-level edits without true semantic reasoning. Even coverage-guided tools such as AFL++ rely on dictionaries, grammars, and splicing heuristics that impose only shallow structural constraints, leaving deeper protocol logic, inter-field dependencies, and domain-specific semantics unaddressed. In contrast, reasoning-capable large language models (LLMs) have the potential to leverage human knowledge embedded during pretraining to understand input formats, respect complex constraints, and propose targeted mutations much as an experienced reverse engineer or testing expert would. However, because there is no ground truth for “correct” reasoning in mutation generation, supervised fine-tuning methods fall short, motivating an investigation of whether off-the-shelf LLMs—using prompt-based few-shot learning alone—can meaningfully enhance fuzzing efficiency and coverage.
To bridge this gap and to address the asynchronous pacing and divergent hardware demands (GPU- vs. CPU-intensive) of LLMs and fuzzers, we present an open-source framework built on microservices that integrates reasoning LLMs with AFL++ on Google’s FuzzBench. We evaluate it to answer four core research questions: (RQ1) How can reasoning-based LLMs be integrated into the fuzzing mutation loop? (RQ2) Do few-shot prompts yield higher-quality mutations than zero-shot? (RQ3) Are off-the-shelf reasoning models capable of improving fuzzing directly via prompt engineering? and (RQ4) Which open-source reasoning LLMs perform best under prompt-only conditions? Our experiments with Llama3.3, Deepseek-r1-Distill-Llama-70B, QwQ-32B, and Gemma3 reveal that Deepseek-r1-Distill-Llama-70B shows the most promise. Mutation effectiveness depends on balancing prompt complexity with model choice rather than shot count alone. Response latency and throughput bottlenecks remain key obstacles. We release our framework as open-source to promote reproducibility and community extension, and we outline future directions for dynamic scheduling, lightweight feedback integration, and scalable deployment in LLM-guided fuzzing.
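The integration pattern the abstract describes—routing mutation requests from AFL++ to an LLM microservice running on separate (GPU) hardware—can be illustrated with AFL++'s Python custom-mutator API (`init`/`fuzz` hooks loaded via `AFL_PYTHON_MODULE`). The sketch below is a minimal illustration, not the thesis's actual implementation: the service URL, JSON schema, and fallback strategy are all assumptions.

```python
# Sketch: AFL++ Python custom mutator that asks a (hypothetical) LLM
# microservice for a semantic mutation, falling back to a plain byte
# flip when the service is slow or unreachable so fuzzing never stalls.
import json
import random
import urllib.request

LLM_ENDPOINT = "http://localhost:8000/mutate"  # hypothetical service URL


def init(seed):
    # AFL++ calls this once at startup with its RNG seed.
    random.seed(seed)


def _ask_llm(buf, max_size):
    # Assumed request/response schema: hex-encoded input in, hex out.
    req = urllib.request.Request(
        LLM_ENDPOINT,
        data=json.dumps({"input": buf.hex(), "max_size": max_size}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return bytes.fromhex(json.load(resp)["mutated"])[:max_size]


def fuzz(buf, add_buf, max_size):
    """Called by AFL++ for each mutation; must return a bytearray."""
    try:
        return bytearray(_ask_llm(bytes(buf), max_size))
    except Exception:
        # Fallback: classic single-byte flip, keeping the CPU-side
        # fuzzer productive while the GPU-side service catches up.
        out = bytearray(buf)
        if out:
            out[random.randrange(len(out))] ^= 0xFF
        return out[:max_size]
```

Decoupling the two sides behind a network call (with a cheap local fallback) is one way to reconcile the asynchronous pacing and divergent hardware demands the abstract highlights.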
License
Queen's University's Thesis/Dissertation Non-Exclusive License for Deposit to QSpace and Library and Archives Canada
Intellectual Property Guidelines at Queen's University
Copying and Preserving Your Thesis
This publication is made available by the authority of the copyright owner solely for the purpose of private study and research and may not be copied or reproduced except as permitted by the copyright laws without written authority from the copyright owner.
Attribution-NonCommercial-NoDerivatives 4.0 International
