Semantic-Aware Fuzzing: A Reasoning-Driven Framework for LLM-Guided Input Mutation

Authors

Lu, Mengdi

Date

2025-09-11

Type

thesis

Language

eng

Keyword

Cybersecurity, Vulnerability detection, Fuzzing, Machine learning, Large language models (LLMs), Prompt engineering, Reasoning models

Abstract

Security vulnerabilities in Internet-of-Things (IoT) devices, mobile platforms, and autonomous systems continue to pose significant risks, yet traditional mutation-based fuzzers—while effective at exploring code paths—primarily perform byte- or bit-level edits without true semantic reasoning. Even coverage-guided tools such as AFL++ rely on dictionaries, grammars, and splicing heuristics that impose only shallow structural constraints, leaving deeper protocol logic, inter-field dependencies, and domain-specific semantics unaddressed. In contrast, reasoning-capable large language models (LLMs) have the potential to leverage human knowledge embedded during pretraining to understand input formats, respect complex constraints, and propose targeted mutations much as an experienced reverse engineer or testing expert would. However, because there is no ground truth for "correct" reasoning in mutation generation, supervised fine-tuning falls short, motivating an investigation of whether off-the-shelf LLMs—using prompt-based few-shot learning alone—can meaningfully enhance fuzzing efficiency and coverage. To bridge this gap, and to accommodate the asynchronous pacing and divergent hardware demands (GPU- vs. CPU-intensive) of LLMs and fuzzers, we present an open-source, microservices-based framework that integrates reasoning LLMs with AFL++ on Google's FuzzBench. We evaluate it against four core research questions: (RQ1) How can reasoning-based LLMs be integrated into the fuzzing mutation loop? (RQ2) Do few-shot prompts yield higher-quality mutations than zero-shot prompts? (RQ3) Can off-the-shelf reasoning models improve fuzzing directly via prompt engineering? (RQ4) Which open-source reasoning LLMs perform best under prompt-only conditions? Our experiments with Llama3.3, Deepseek-r1-Distill-Llama-70B, QwQ-32B, and Gemma3 reveal that Deepseek-r1-Distill-Llama-70B shows the most promise. Mutation effectiveness depends on balancing prompt complexity with model choice rather than on shot count alone, and response latency and throughput bottlenecks remain key obstacles. We release our framework as open source to promote reproducibility and community extension, and we outline future directions for dynamic scheduling, lightweight feedback integration, and scalable deployment in LLM-guided fuzzing.
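The prompt-based mutation loop the abstract describes can be sketched in a few lines. This is a minimal illustration only, not the thesis's actual implementation: the function names (`build_fewshot_prompt`, `parse_mutations`), the base64 encoding of seeds, and the JSON reply format are all assumptions made for this sketch of how few-shot demonstrations and model replies might be exchanged between an LLM service and a fuzzer.

```python
import base64
import json


def build_fewshot_prompt(seed: bytes, examples: list[tuple[bytes, bytes]]) -> str:
    """Assemble a few-shot prompt asking a reasoning LLM to propose
    semantically valid mutations of a fuzzing seed. Seeds are
    base64-encoded so arbitrary binary inputs survive the text channel.
    (Hypothetical format; not the thesis's actual prompt.)"""
    lines = [
        "You are an expert fuzzer. Given an input, propose mutations that",
        "respect the input format's structure and inter-field constraints.",
        "Reply with a JSON list of base64-encoded mutated inputs.",
        "",
    ]
    for before, after in examples:  # few-shot demonstrations
        lines.append(f"Input: {base64.b64encode(before).decode()}")
        lines.append(f"Mutation: {base64.b64encode(after).decode()}")
        lines.append("")
    lines.append(f"Input: {base64.b64encode(seed).decode()}")
    lines.append("Mutations:")
    return "\n".join(lines)


def parse_mutations(response: str) -> list[bytes]:
    """Decode the model's JSON reply back into raw candidate inputs,
    dropping malformed entries rather than crashing the fuzzing loop."""
    candidates = []
    for item in json.loads(response):
        try:
            candidates.append(base64.b64decode(item, validate=True))
        except Exception:
            continue  # skip entries that are not valid base64
    return candidates
```

In a microservices deployment such as the one the abstract outlines, the prompt builder would run in the GPU-bound LLM service while the parser feeds decoded candidates into the CPU-bound fuzzer's queue, letting the two sides run at their own pace.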

License

Queen's University's Thesis/Dissertation Non-Exclusive License for Deposit to QSpace and Library and Archives Canada
Intellectual Property Guidelines at Queen's University
Copying and Preserving Your Thesis
This publication is made available by the authority of the copyright owner solely for the purpose of private study and research and may not be copied or reproduced except as permitted by the copyright laws without written authority from the copyright owner.
Attribution-NonCommercial-NoDerivatives 4.0 International
