Software Security - Fuzzing Project
Fuzzing group project
The goal of this project is to become familiar with
fuzzing in combination with code
instrumentation as a way to do security testing.
Trying out different fuzzers, with and without code
instrumentation, should also give some insight into the pros,
cons, and effectiveness of the various approaches.
Step 1: Choose a target
For this assignment you can target any software you want as the
System-Under-Test (SUT). The only constraints are that it is written
in C(++) and that it is open source. You can choose some old,
outdated software, where it should be easier to find bugs, or
some newer, up-to-date software, where finding bugs may be harder
but also more interesting. To keep things simple, choose
software that can take a single file as input on the command line.
To make things rewarding, choose software that
takes a complex input format: the more complex the input
format, the more likely that there are bugs in the input handling
and the more likely that fuzzing will detect them. One obvious
choice here is software that handles graphical data.
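For instance - purely as a made-up sketch, not a suggestion of what to test - an SUT with a convenient interface for fuzzing looks roughly like this minimal C program, which takes one file name on the command line and processes that file:

    /* toy_sut.c - made-up sketch of a convenient SUT interface:
     * the program takes exactly one input file on the command line. */
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <inputfile>\n", argv[0]);
            return 1;
        }
        FILE *f = fopen(argv[1], "rb");
        if (!f) {
            perror("fopen");
            return 1;
        }
        /* A real SUT would parse the file contents here; the parsing
         * code is where fuzzing is most likely to trigger bugs. */
        int c, bytes = 0;
        while ((c = fgetc(f)) != EOF)
            bytes++;
        printf("read %d bytes\n", bytes);
        fclose(f);
        return 0;
    }

A fuzzer can then simply run the program over and over on mutated copies of some example input files.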
Step 2: Get fuzzing
Try the fuzzers below to fuzz your chosen application, initially
without and then with the instrumentation tools.
Each group should at least try out afl, as the leading tool in this
area, and one of the dumber fuzzers, Radamsa or zzuf - ideally
both - and try to get ASan (or valgrind) working in combination with
these tools. You are more than welcome to try out other fuzzers that
you happen to know of or are curious about.
This project assignment is fairly open-ended. Make sure you all
spend at least one full afternoon per week on it for the next two
months. We will discuss progress and experiences in class, so
that we can still shift or refocus efforts depending on the results
so far, or maybe cut things short if spending more time does not
seem worthwhile. Final reports with a summary of your
experiments and a reflection on the tools will be due at the end of
November/early December, so that we can still discuss them at one
of the last lectures in December.
NB Document your experiments carefully
When you run your experiments, document the set-up
and input files precisely. Ideally, you should be able to re-run an
experiment and reproduce the exact same results.
NB Warning
Do not open input files created by the fuzzers
with any 'production' software on a system that you
care about. For instance, if you are fuzzing image files then
opening the fuzzed image files on your mobile phone or with the
app that you use to manage all your holiday photos could have
disastrous effects.
Fuzzers
More information and pointers about these fuzzers can be found on the
tools page.
In Brightspace there are discussion forums to exchange
good and bad experiences on installing and using the
tools.
You can also have a look at this very gentle introduction to zzuf, ASan and afl.
Instrumentation tools
Below are some instrumentation tools to try out. These should help
to spot more errors when fuzzing. At least try out ASan, which
will probably give you the biggest improvement relative to the effort
and overhead involved. If you have trouble with ASan, you could use valgrind's
Memcheck as a fallback.
- AddressSanitizer (ASan)
[ASan github]
ASan adds memory safety checks to C(++) code, both for spatial and temporal memory flaws.
ASan is part of the LLVM clang compiler starting with version 3.1 and part of GCC starting with version 4.8. (A small example of the kind of flaw ASan catches is given after this list.)
- MemorySanitizer (MSan)
[MSan github]
MSan adds memory safety checks to C(++) code to detect
reading of uninitialised memory.
MSan is available as a compile-time option in clang since version 4.0.
- valgrind
[http://valgrind.org]
Valgrind is the classic
instrumentation framework for building tools that do dynamic
analysis. It provides
several detectors to find different classes of bugs.
Valgrind's Memcheck detector for memory errors would be interesting to
try for these fuzzing experiments.
(Valgrind provides many more detectors and tools, for instance
detectors to spot thread errors and tools to profile heap usage.)
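To make the 'spatial' flaws mentioned for ASan concrete, the made-up sketch below contains a heap buffer overflow. Built with ASan enabled (for instance with clang -g -fsanitize=address), ASan should abort the program and report a heap-buffer-overflow; a build without ASan run under valgrind's Memcheck should flag the same write as invalid. The file name and buffer sizes are made up for illustration.

    /* overflow.c - made-up example of a spatial memory flaw.
     * With ASan:      clang -g -fsanitize=address overflow.c -o overflow
     * With Memcheck:  clang -g overflow.c -o overflow && valgrind ./overflow
     */
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        char *buf = malloc(8);
        if (!buf)
            return 1;
        /* Copies 11 bytes (10 characters plus the terminating '\0')
         * into an 8-byte buffer: a heap buffer overflow. */
        strcpy(buf, "0123456789");
        free(buf);
        return 0;
    }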
Valgrind vs ASan/MSan
Valgrind is much slower than ASan and MSan, and has a
much larger memory overhead. ASan does set up an insane amount of
virtual memory (20 TB), but does not use all of it.
ASan can spot some flaws that valgrind's Memcheck cannot.
Conversely, Memcheck can spot some flaws that ASan can't. For
instance, ASan does not try to spot access to uninitialised
memory, which Memcheck, like MSan, does. So using ASan together
with MSan comes closer to what Memcheck does.
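As a small, made-up illustration of this difference: in the sketch below an uninitialised heap value decides a branch. ASan stays silent on this, while MSan (clang -g -fsanitize=memory) and Memcheck should both report a use of an uninitialised value.

    /* uninit.c - made-up example of reading uninitialised memory.
     * ASan (-fsanitize=address) does not flag this;
     * MSan (clang -g -fsanitize=memory uninit.c) and
     * Memcheck (valgrind ./a.out) should. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        int *p = malloc(4 * sizeof(int));
        if (!p)
            return 1;
        /* p[2] is never written, so the branch below depends on an
         * uninitialised value. */
        if (p[2] > 0)
            printf("positive\n");
        free(p);
        return 0;
    }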
Memcheck can detect more problems than MSan. For instance,
Memcheck tries to detect memory leaks, which MSan does not. ASan has
some support for detecting memory leaks via LeakSanitizer, which is
not enabled by default on all systems. For this project detecting
memory leaks is probably not that interesting: memory leaks typically
only become a problem when an application runs for a longer time, not
when it is used one-off to process a single input file.
afl-cmin and afl-tmin
afl comes with two helper programs that may be interesting to
play with:
- For some initial set of input files (aka the initial test corpus)
that you use to get the fuzzers started, afl-cmin tries to determine
the smallest subset that still has the same code coverage as the full
set. Code coverage here is measured using the instrumentation that
afl adds to a program to observe its control flow and distinguish
different execution paths. So, for example, if there are two input
files that trigger exactly the same execution path, then afl-cmin
will remove one of these files from the test corpus (see the sketch
below for a toy illustration of what 'the same execution path'
means). This will speed up fuzzing, not just for afl but also for
other mutation-based fuzzers, as mutating one of these files is
likely to generate similar inputs as mutating the other, so having
both in the test set is likely to double the amount of work for the
same outcome.
- Given a single input file that triggers a crash, afl-tmin tries
to generate a smaller input file that still triggers this crash.
Note that afl-cmin and afl-tmin try very different forms
of minimisation: afl-cmin tries to minimise a set of inputs and afl-tmin tries to minimise a single input.
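To give a toy, made-up illustration of what 'the same execution path' means here: in the fragment below, any two input files whose first byte is 'A' drive the program down exactly the same path, regardless of the rest of their contents, so afl-cmin would keep only one of them; a file starting with any other byte exercises the other branch and adds coverage.

    /* paths.c - toy illustration of execution paths (made up). */
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        if (argc != 2)
            return 1;
        FILE *f = fopen(argv[1], "rb");
        if (!f)
            return 1;
        int first = fgetc(f);
        if (first == 'A')   /* same path for every file starting with 'A' */
            printf("branch 1: header recognised\n");
        else                /* different path: new coverage */
            printf("branch 2: unknown header\n");
        fclose(f);
        return 0;
    }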
Report
Your final report should
- summarise the outcomes of your fuzzing experiments
- reflect on these outcomes and the tools
More details are given below.
Outcomes of the fuzzing experiments
For your fuzzing experiments, describe for each tool
or each combination of fuzzer and sanitiser:
- how many initial files you used;
- how many mutations were generated;
- how much time this took;
- how many -- if any -- flaws were found.
You could present this in one big table or several smaller tables.
For any flaws found, possible issues to discuss would be:
- if different tools found the same flaws;
- if the same flaws are found if different initial files are
used or if a different random number seed is used;
- if any of the flaws are known CVEs, or other known bugs not
reported as a CVE but, say, reported on the application's website,
github repo, or in release notes. Figuring out whether some crash
corresponds to some CVE might be really tricky, so don't feel
obliged to dig into this for too long!
- if not, whether they should be reported as new bugs.
Reflection
Some research questions to reflect on are given below. (You do
NOT have to stick to precisely these questions; feel free
to reflect on other interesting issues that came up, or questions
that were left unresolved.)
- How good are the tools (or tool combinations) at finding
flaws in these 'typical' applications? What does this tell us
about the quality of the tools and/or about the quality of the
applications tested?
- Is afl, as a modern evolutionary fuzzer, much better than more
basic mutation-based fuzzers? Or are the latter still good
enough to find flaws in (at least some) applications?
- What are the overheads of instrumentation approaches?
Does the improved detection rate with instrumentation
outweigh the slower speed?
- How difficult are the fuzzing and instrumentation tools to set up for the application tested?
- How many test cases can be generated in a given time frame?
Are there big differences here?
- How sensitive are the tools to the initial set of valid
inputs, or to the initial random number seed? Does a given tool find the same bugs when given different initial
inputs? Do different tools find the same bugs when given identical initial inputs? If you have more time to fuzz, is it more effective to simply let the tool run longer, or to start with a larger set of initial inputs?
- Especially in case you do not find bugs, it would be
interesting to check if the project is already using a
fuzzer (this may be mentioned on the github pages or the
discussion forum for the software project). That might
explain why you did not find anything.
Of course, the hope is that the tools do find some flaws.
But even if they don't, this result - albeit a negative result -
is still interesting.
If you find that the tools do not find flaws in the application you're
testing, you may want to switch to an earlier release of
the software, or to another application that takes the same
input format. If the interfaces and input languages are the same,
swapping one application for another should be easy to do
without having to change the configuration of the tools and the test set-ups.