Mini Fuzzer Shootout


From: Ben Nagy <ben () iagu net>
Date: Mon, 19 Jul 2010 12:58:09 +0545

Hi all,

So, I've been presenting recently on 'industrial' fuzzing. This
basically means going really fast, then scrambling around, trying to
catch up by writing surrounding tools that can scale well enough to
make use of the speed. I was talking to Charlie Miller about his CSW
slides, which contain some awesome metrics for people in the game -
crash percentages, what percentage of those bugs are any good, etc. -
and decided that one of the things I should do with my speed is
compare some case generation approaches. I promised Charlie that I
would include the results in my next talk, but given all the new
material I am adding I realised I just won't have time. So instead, I
figured that most of the people who would be interested read
dailydave, so I'll just stick 'em up here.

The target is Word 2007, fully patched as of last month or so. Yes, I
have an unhealthy obsession with .doc. I'll switch soon, I swear. I
run with full page heap, and I stop on first-chance exceptions for
access violation, integer overflow (Word uses SafeInt, though, which
rains on my parade), stack buffer overflow (I've had some stack bugs,
but never one that was detected as an SBO by the debugger) and illegal
instruction. Wherever you read 'crash', remember that it might not
actually crash; I'm just too lazy to write 'exception event'.

Fuzzer 1: "millerfuzz"

Charlie's '5 lines of python' from the CSW 2010 talk. The algorithm is
to pick a number between 1 and (file length / 10) and randomly corrupt
that many bytes in random locations. It's a bit like fuzzing with a
shotgun. As he said in his slides, it's so retarded it shouldn't find
any bugs at all. As a note, in the CSW slides, the length divisor is a
variable called 'fuzzfactor' - I just guessed at 10. I have no idea
what Charlie used, so it's not exactly apples to apples.
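For reference, a minimal Python sketch of that algorithm (my
reconstruction, not Charlie's actual code, and the divisor of 10 is the
guess mentioned above):

    import random

    def millerfuzz(data, fuzzfactor=10):
        # corrupt between 1 and (len / fuzzfactor) bytes at random offsets
        buf = bytearray(data)
        numwrites = random.randrange(1, max(len(buf) // fuzzfactor, 1) + 1)
        for _ in range(numwrites):
            buf[random.randrange(len(buf))] = random.randrange(256)
        return bytes(buf)

    case = millerfuzz(open("template.doc", "rb").read())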

Fuzzer 2: "benfuzz"

I have a fuzzer class which, at its most basic, will take any string
and mess with it. It's still less than 5 lines, but the included code
is a bit smarter - let's call it 'top tier retarded fuzzing'. It runs
a 'rolling corrupt' where it takes successive chunks and mutates them
(increments a bit, decrements a bit, tries some binary corner cases)
also allowing for endianness. Every so often (in this case, every 128th of
the file) it also inserts a bunch of junk. The junk is a mix of raw
random, ascii, some wacked unicode, some terminators and special chars
and so on - the usual stuff. It is not at all structure aware.
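Hedged heavily again - the real class is smarter, and the junk list and
mutation set here are just illustrative - but in Python it's roughly:

    import random

    def junk():
        # raw random, ascii, some wacked unicode, terminators and specials
        return random.choice([
            bytes(random.randrange(256) for _ in range(random.randrange(1, 128))),
            b"A" * random.randrange(1, 1024),
            "\u202e\ufeff".encode("utf-16le"),
            b"\x00\xff\r\n;'\"",
        ])

    def mutations(chunk, endian):
        # increment, decrement, and a few binary corner cases
        n = int.from_bytes(chunk, endian)
        mask = (1 << (len(chunk) * 8)) - 1
        yield ((n + 1) & mask).to_bytes(len(chunk), endian)
        yield ((n - 1) & mask).to_bytes(len(chunk), endian)
        for c in (0x00, 0x7f, 0x80, 0xff):
            yield bytes([c]) * len(chunk)

    def rolling_corrupt(data, chunksize=4):
        # one case per (offset, mutation) pair, rolling through the file;
        # splice in junk roughly every 128th of the file
        junk_every = max(len(data) // 128, 1)
        for off in range(0, len(data) - chunksize, chunksize):
            for endian in ("little", "big"):
                for m in mutations(data[off:off + chunksize], endian):
                    case = data[:off] + m + data[off + chunksize:]
                    if off % junk_every < chunksize:
                        case = case[:off] + junk() + case[off:]
                    yield case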

Fuzzer 3: "Tablestream Dumbfuzz"

This fuzzer unpacks the OLE structure, parses the Word FIB (file
information block) and gets a list of offset/length pairs that
describe internal Word structures. The offsets point into a separate
stream, called the Table Stream, where these structures are packed up.
Then, for each structure, it does the same thing as benfuzz, above,
but if any lengths have changed it will patch up the offsets in the
FIB for structures after the mutated one. You could call it smart
fuzzing, but I'd prefer something like 'structure aware dumb fuzzing'.
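The only interesting bit is the offset fixup. Roughly, in Python
(fuzz_one_structure and its arguments are hypothetical names - the OLE
unpacking and FIB parsing that produce the (offset, length) pairs, and
the repacking afterwards, are elided):

    def fuzz_one_structure(entries, stream, idx, mutate):
        # entries: (offset, length) pairs from the FIB, pointing into the
        # table stream; mutate structure `idx` with a benfuzz-style corrupt
        off, length = entries[idx]
        mutated = mutate(stream[off:off + length])
        delta = len(mutated) - length
        new_stream = stream[:off] + mutated + stream[off + length:]
        # patch up the FIB offsets for structures after the mutated one
        new_entries = [(o + delta, l) if o > off else (o, l)
                       for (o, l) in entries]
        new_entries[idx] = (off, len(mutated))
        return new_stream, new_entries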

Fuzzer 4: "Deep Fuzz"

One of the structures I happen to know is disproportionately bug prone,
so I wrote a fuzzer to mess with it. We unpack the OLE, parse the FIB,
find this structure, and "semi" parse it. It's basically a nested
type-length-value (TLV) container format which has jillions of subtypes.
I treat the contents (subtypes) as blobs, which is why it's "semi"
parsed. Then, the fuzzer recursively goes through the nesting, messing
with combinations of TLV / contents, fixing up some stuff which has
constrained values. It sounds a lot more complicated than it is,
because a lot of this I get for free from the structure definition
code. This fuzzer produces hundreds of millions of cases - I've never
run it to completion - because it's testing the full cross product of
four mutation sets for every atom.
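In outline, it's something like this (a toy Python sketch: made-up
mutation sets, three of them instead of my four, and no
constrained-value fixups - just the recursive shape of the thing):

    import itertools

    # Illustrative mutation sets - the real ones are much bigger, which
    # is where the hundreds of millions of cases come from.
    TYPE_MUTS  = [lambda t: t, lambda t: 0xff, lambda t: t ^ 1]
    LEN_MUTS   = [lambda l: l, lambda l: 0, lambda l: l + 1, lambda l: 0xffff]
    VALUE_MUTS = [lambda v: v, lambda v: b"\xff" * len(v), lambda v: v[::-1]]

    def deep_fuzz(node):
        # node = (type, length, value-or-children); recurse through the
        # nesting, and for each atom yield the full product of the sets
        t, l, v = node
        if isinstance(v, list):
            for i, child in enumerate(v):
                for fuzzed in deep_fuzz(child):
                    yield (t, l, v[:i] + [fuzzed] + v[i + 1:])
        else:
            for tm, lm, vm in itertools.product(TYPE_MUTS, LEN_MUTS, VALUE_MUTS):
                yield (tm(t), lm(l), vm(v))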

So here are the results. I didn't run anywhere near as many iterations
as I wanted - one of the fuzzers was generating so many crashes and
hangs that it reduced my speed to a crawl - barely 50 tests / second
for all 4 fuzzers combined.

Millerfuzz

1107728 tests - success: 413312, fail: 692792, crash: 1614 (101)
E: 12, PE: 654, PNE: 350, U: 597, <: 1

OK, so let me decode that output for you. 'fail' means the file was
detected by Word as corrupt; the other two are pretty obvious. The
number in brackets is the number of !exploitable 'buckets', based on
the major and minor hashes it produces. The breakdown after that is by
!exploitable classification, but by count only - no secondary breakdown
by bucket (sorry): Exploitable (E), Probably Exploitable (PE), Probably
Not Exploitable (PNE), Unknown (U), and '<' means that full crash
details weren't retrieved, which screwed my parser up. That happened
when the reaper killed the process before it was done getting crash
info, or when !exploitable itself faulted in some way. There were
actually a few good bugs in '<', by the way.

Analysis:

Millerfuzz hits the broadest set of buckets, by a long way.
Unfortunately, I happen to know that at least half of those PE crashes
are what I call 'chaff' - 3-4 crashes that come up WAY too often and
are not useful. So 0.14% of tests caused a crash (higher than the
numbers Charlie saw against PDF parsers, I think). Sadly, at the
moment I can't connect exact crashes as I see them in my analysis tool
with the fuzzer that produced them, so I can't tell you if those 12 Es
look genuinely exploitable, but let's assume that some of them are.
What I get from the numbers is that millerfuzz is broad and actually
works - a fact that annoys me intensely. As a side issue - when you
come to analyse these crashes, you will see a lot of differences all
over the file, so how are you going to work out which change to even
try and track?

Benfuzz

2008423 tests - success: 1891838, fail: 112817, crash: 3709 (59)
E: 48, PE: 484, PNE: 23, U: 3150, <: 4

Because this is a rolling corrupt, and it was incomplete, we don't
know if it would have changed the general footprint if I had left it
to run the whole way through. 0.18% crashes seems broadly in line.
Fewer buckets, more Es as a percentage of crashes. Masses of U, but a
lot of those I think are chaff as well (just from watching it run).
Overall, this looks like a slightly higher quality fuzzer, but
considering the extra code it's leveraging, this is not surprising.
The crashes will be easier to analyse, because the changes are more or
less in one place at a time. Some of those Es are the real deal, just
based on watching a few as they came up.

Tablestream Dumbfuzz

932858 tests - success: 724257, fail: 183628, crash: 24953 (36)
E: 14, PE: 94, PNE: 17987, U: 6817, <: 41

2.67% crashes, but the huge bulk of them are divided among three of the
chaff crashes - I swear they do that on purpose. Even if we strip the
PNEs that's still 0.74% though. Very few Es and PEs by ratio. Where
this fuzzer has been awesome for me, though, is indicating structures
that might have deeper issues. I found that areas that throw a lot of
chaff crashes when fuzzed with the dumb algorithms were good targets
for deeper fuzzing. Again, this is an incomplete run, so the stats
might be different if I let it run all the way through - it's just
that I end up with tens of thousands of useless crashes in my sorting
bin. This is only a single data point, but it doesn't look like the
extra smarts did me much good here - I probably could have identified
the pain points from the dumb rolling corruptor as well, by parsing
out the mutated files; it would just be a bit more work.

Deep Fuzz

1057066 tests - success: 22033, fail: 1034652, crash: 372 (30)
E: 47, PE: 316, PNE: 3, U: 6

The first thing to notice is the minuscule number of successfully
opened files. Basically either the file was detected as corrupt or it
crashed. The number of crashes is tiny - 0.03%, BUT they're almost all
gold - I'm in deep now, so I avoid a lot of the chaff. This is where
the scale thing comes in - if you ran this fuzzer for just a couple of
hundred thousand iterations you'd probably not see any crashes at all;
you'd throw it away. The other nice thing is that files that are
detected as corrupt error out faster, so this is one of the fastest
case generators in my stable - I can crank a million tests an hour if
I run only this fuzzer (and the results above are based on a million
tests...). The thing to remember, though, is that I wouldn't be
running it if I hadn't done the dumb coverage first, to find 'sore
spots' in the parser.

My general interpretation of the numbers is that it makes sense to run
lots of different case generators. This is in line with 'accepted
wisdom' based on the MS Fuzzing Olympics and their results with their
DFF. It looks like the really tasty bugs are still easier to mine with
the deep fuzzers, though, so the approach would be broad coverage with
millerfuzz or benfuzz to identify pain points, followed by some actual
work. If you run a millerfuzz style fuzzer then I guess you can revert
bytes one by one until you find the minimum set of changes that still
hits the crash (see the sketch below) - I've got alpha code for this,
but it's not one of the things I've fully integrated into Bugmine
(yet). Some of you might find other ideas in the raw numbers - let me
know about any insights I have missed. :)
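For the curious, that byte-revert idea looks something like this (a
sketch, not the actual alpha code; repro() is a hypothetical stand-in
for running the target and checking for the same crash, and I'm
assuming the corruptor only overwrites bytes, so lengths match):

    def minimise(template, crasher, repro):
        # revert mutated bytes one at a time, keeping each revert only
        # if the candidate still reproduces the crash
        candidate = bytearray(crasher)
        for i in (i for i in range(len(template)) if template[i] != crasher[i]):
            saved = candidate[i]
            candidate[i] = template[i]
            if not repro(bytes(candidate)):
                candidate[i] = saved    # this byte is needed for the crash
        return bytes(candidate)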

What I don't have, yet, is a feel for how using different template
files will affect these numbers - all these crashes are from a single
template file which contains a lot of different Word features. I know
automatic template selection has been explored by MS and also by Dr
Miller, but there's no valgrind for Windows, which means we can't just
steal his code. That research is in progress; I'll have some stuff to
share 'soonish'.

Anyway, hopefully some of you will find this useful / interesting.

Cheers,

ben
_______________________________________________
Dailydave mailing list
Dailydave () lists immunitysec com
http://lists.immunitysec.com/mailman/listinfo/dailydave

