Empirical Study of Fuzzing
Title: Empirical Study of Fuzzing
SNIC Project: LiU-compute-2022-23
Project Type: LiU Compute
Principal Investigator: Ulf Kargén <ulf.kargen@liu.se>
Affiliation: Linköpings universitet
Duration: 2022-09-01 – 2022-11-01
Classification: 10201


In recent years, fuzzing has become widely adopted in industry as the go-to method for automated security and robustness testing of software. Fuzzing has, for example, exposed tens of thousands of bugs in popular open-source software projects, many of which are critical to modern digital infrastructure. At its core, fuzzing is a random testing technique that generates semi-valid inputs to the program under test (PUT) by applying mutations to a set of valid inputs. This approach has proven remarkably effective at uncovering critical security bugs, such as memory corruption flaws. Owing to its popularity in industry, fuzzing has also garnered significant interest in academia, with fuzzing-related papers regularly appearing at every major academic security or software engineering conference in recent years.
However, the difficulty of measuring and comparing the performance of fuzzing tools (known as fuzzers) remains an impediment to fuzzing research. Most importantly, there is currently a lack of generally adopted methodologies and metrics for fuzzer evaluation. The number of real bugs found in real-world software is generally considered the "gold standard" for comparing fuzzers. However, since the relative performance of different fuzzers tends to vary significantly across PUTs, comparisons need to be made on a per-PUT basis, which in turn requires known bugs in each PUT of interest. Many researchers and practitioners are therefore forced to rely on less accurate proxy metrics, such as code coverage (i.e., the amount of PUT code that is exercised by the fuzzer). For example, when a developer needs to find the best-performing fuzzer for a piece of software still under development, there is typically no corpus of ground-truth bugs available for that software, leaving proxy metrics as the only viable alternative.
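To make the mutation-based approach described above concrete, the following is a minimal sketch of a mutational fuzzing loop. It is not taken from any of the fuzzers studied in this project: the function names (`mutate`, `fuzz`), the three mutation operators, and the toy program `toy_put` with its simulated crash are all assumptions chosen for illustration. Real fuzzers additionally use coverage feedback, corpus management, and crash triage, none of which is shown here.

```python
import random


def mutate(seed: bytes, rng: random.Random, max_mutations: int = 4) -> bytes:
    """Apply a few random byte-level mutations to a valid seed input."""
    data = bytearray(seed)
    for _ in range(rng.randint(1, max_mutations)):
        choice = rng.choice(("flip", "insert", "delete"))
        if choice == "flip" and data:
            # Flip a single random bit in a random byte.
            pos = rng.randrange(len(data))
            data[pos] ^= 1 << rng.randrange(8)
        elif choice == "insert":
            # Insert a random byte at a random position.
            data.insert(rng.randint(0, len(data)), rng.randrange(256))
        elif choice == "delete" and len(data) > 1:
            # Remove one random byte, keeping the input non-empty.
            del data[rng.randrange(len(data))]
    return bytes(data)


def toy_put(data: bytes) -> None:
    """Hypothetical PUT: 'crashes' when a length field exceeds the input size."""
    if len(data) >= 4 and data[:2] == b"HD" and data[2] > len(data):
        raise RuntimeError("simulated out-of-bounds read")


def fuzz(put, seeds, iterations: int, rng: random.Random) -> list:
    """Run the PUT on mutated seeds and collect the inputs that crash it."""
    crashes = []
    for _ in range(iterations):
        candidate = mutate(rng.choice(seeds), rng)
        try:
            put(candidate)
        except Exception:
            crashes.append(candidate)
    return crashes


if __name__ == "__main__":
    # The seed is valid for the toy PUT: its length field (4) fits the input.
    found = fuzz(toy_put, [b"HD\x04data"], iterations=2000, rng=random.Random(0))
    print(f"{len(found)} crashing inputs found")
```

Most mutated inputs remain close to the valid seed, so they pass the header check in `toy_put` and reach the deeper length check. This is the sense in which the generated inputs are semi-valid.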
While code coverage and other proxy metrics are recognized to be moderately correlated with the likelihood of bug discovery, there is an ongoing debate in the fuzzing community over how well they reflect a fuzzer's actual bug-finding ability. There is therefore a pressing need both to better understand the limitations of current metrics and to develop new, more accurate ones. The aim of this project is to perform a large-scale empirical study of fuzzing, using the recently published Magma fuzzing benchmark suite [https://doi.org/10.1145/3428334]. The gathered data will be used in two research studies. Study 1 investigates how strongly several proxy metrics correlate with bug-discovery rate, using 10 modern fuzzers. The study will encompass both existing metrics and a set of new metrics that we propose. Its goal is threefold: (1) to better understand the limitations of current metrics, (2) to devise new metrics for more accurate fuzzer performance prediction, and (3) to gain better insights into fuzzer performance characteristics on different kinds of PUTs. In Study 2, we revisit a method for improving fuzzer efficiency that we proposed in earlier work [https://doi.org/10.1145/3230833.3230867], with the goal of performing a substantially expanded evaluation of an enhanced version of the method, adapted for modern fuzzing settings.
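As an illustration of the kind of analysis Study 1 involves, the sketch below computes a rank correlation between a proxy metric and bug counts across fuzzers on a single PUT. The choice of Spearman's rank correlation is an assumption made for this example, since the project description does not name a specific correlation measure. The helper names and the numbers in the usage example are likewise made up for illustration and are not measurements.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, assigning tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Made-up values for five hypothetical fuzzers on one PUT (not real data).
    branches_covered = [1200, 1500, 900, 1800, 1500]
    bugs_found = [3, 4, 1, 4, 2]
    print(f"Spearman rho: {spearman(branches_covered, bugs_found):.2f}")
```

A rank-based measure is used in this sketch because it only asks whether fuzzers that score higher on the proxy metric also tend to find more bugs, without assuming a linear relationship between the two. Because relative fuzzer performance varies across PUTs, such a coefficient would be computed separately for each PUT.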