
Dailydave mailing list archives
Mini Fuzzer Shootout
From: Ben Nagy <ben () iagu net>
Date: Mon, 19 Jul 2010 12:58:09 +0545
Hi all,

So, I've been presenting recently on 'industrial' fuzzing. This basically means going really fast, then scrambling around trying to catch up by writing surrounding tools that can scale well enough to make use of the speed. I was talking to Charlie Miller about his CSW slides, which contain some awesome metrics for people in the game - crash percentages, what percentage of those bugs are any good, etc. - and decided that one of the things I should do with my speed is compare some case generation approaches. I promised Charlie that I would include the results in my next talk, but given all the new material I am adding I realised I just won't have time. So instead, I figured that most of the people who would be interested read dailydave, so I'll just stick 'em up here.

The target is Word 2007, fully patched as of last month or so. Yes, I have an unhealthy obsession with .doc. I'll switch soon, I swear. I run with full page heap, and I stop on first-chance exceptions for access violation, integer overflow (Word uses SafeInt, though, which rains on my parade), stack buffer overflow (I've had some stack bugs, but never one that is detected as an sbo by the debugger) and illegal instruction. Wherever you read 'crash', remember that it might not actually crash - I'm just too lazy to write 'exception event'. Rough sketches of all four case generators follow the descriptions, for the curious.

Fuzzer 1: "millerfuzz"

Charlie's '5 lines of python' from the CSW 2010 talk. The algorithm is to pick a number between 1 and (file length / 10) and randomly corrupt that many bytes in random locations. It's a bit like fuzzing with a shotgun. As he said in his slides, it's so retarded it shouldn't find any bugs at all. As a note, in the CSW slides the length divisor is a variable called 'fuzzfactor' - I just guessed at 10. I have no idea what Charlie used, so it's not exactly apples to apples.

Fuzzer 2: "benfuzz"

I have a fuzzer class which, at its most basic, will take any string and mess with it. It's still less than 5 lines, but the included code is a bit smarter - let's call it 'top tier retarded fuzzing'. It runs a 'rolling corrupt' where it takes successive chunks and mutates them (increments a bit, decrements a bit, tries some binary corner cases), also allowing for endianness. Every so often (in this case, every 128th of the file) it also inserts a bunch of junk. The junk is a mix of raw random, ascii, some wacked unicode, some terminators and special chars and so on - the usual stuff. It is not at all structure aware.

Fuzzer 3: "Tablestream Dumbfuzz"

This fuzzer unpacks the OLE structure, parses the Word FIB (File Information Block) and gets a list of offset/length pairs that describe internal Word structures. The offsets point into a separate stream, called the Table Stream, where these structures are packed up. Then, for each structure, it does the same thing as benfuzz, above, but if any lengths have changed it will patch up the offsets in the FIB for structures after the mutated one. You could call it smart fuzzing, but I'd prefer something like 'structure aware dumb fuzzing'.

Fuzzer 4: "Deep Fuzz"

One of the structures I happen to know is disproportionately bug-prone, so I wrote a fuzzer to mess with it. We unpack the OLE, parse the FIB, find this structure, and "semi" parse it. It's basically a type, length, value nested container format which has jillions of subtypes. I treat the contents (subtypes) as blobs, which is why it's "semi" parsed. Then, the fuzzer recursively goes through the nesting, messing with combinations of TLV / contents, fixing up some stuff which has constrained values. It sounds a lot more complicated than it is, because a lot of this I get for free from the structure definition code. This fuzzer produces hundreds of millions of cases - I've never run it to completion - because it's testing the full cross product of 4 mutation sets for every atom.
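To make millerfuzz concrete, here's a minimal reconstruction - my guess at the details from the slides, not Charlie's actual code, and as above fuzzfactor=10 is a guess:

    import random

    def millerfuzz(data, fuzzfactor=10):
        # Pick between 1 and len/fuzzfactor bytes and slam random values
        # into random offsets. That's the whole algorithm.
        buf = bytearray(data)
        numwrites = random.randint(1, max(1, len(buf) // fuzzfactor))
        for _ in range(numwrites):
            buf[random.randrange(len(buf))] = random.randrange(256)
        return bytes(buf)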
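The benfuzz rolling corrupt looks roughly like this - a toy approximation assuming 8-byte chunks, with a far smaller mutation and junk repertoire than the real class:

    import random

    JUNK = [bytes(random.randrange(256) for _ in range(64)),  # raw random
            b"A" * 1024, b"%n%x%s", b"\x00\xff\x7f;\r\n",     # ascii, terminators
            "\ufeff\u202e\ufffe".encode("utf-16le")]          # wacked unicode

    def mutate_chunk(chunk):
        # Treat the chunk as an integer in each endianness, then nudge it
        # up and down and slam in some binary corner cases.
        bits = 8 * len(chunk)
        for endian in ("little", "big"):
            val = int.from_bytes(chunk, endian)
            for v in (val + 1, val - 1, 0, 1, (1 << bits) - 1, 1 << (bits - 1)):
                yield (v % (1 << bits)).to_bytes(len(chunk), endian)

    def rolling_corrupt(data, chunksize=8):
        # One case per chunk per mutation, rolling through the file, so
        # each case differs from the template in more or less one place.
        junk_interval = max(1, len(data) // 128)
        for off in range(0, len(data), chunksize):
            chunk = data[off:off + chunksize]
            for mutated in mutate_chunk(chunk):
                case = data[:off] + mutated + data[off + len(chunk):]
                if off % junk_interval < chunksize:
                    # every 128th of the file, also shove in some junk
                    case = case[:off] + random.choice(JUNK) + case[off:]
                yield case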
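For the Tablestream fuzzer, the only 'smart' part worth sketching is the offset fix-up. Schematically, and assuming the FIB has already been parsed into (offset, length) pairs (hypothetical names - the parsing is where the real work is):

    def fuzz_table_entry(table_stream, fib_entries, idx, mutate):
        # Mutate structure idx in the table stream with any dumb mutator,
        # then shift the FIB entries of everything packed after it.
        off, length = fib_entries[idx]
        mutated = mutate(table_stream[off:off + length])
        delta = len(mutated) - length
        new_stream = table_stream[:off] + mutated + table_stream[off + length:]
        new_entries = []
        for o, l in fib_entries:
            if o == off:
                new_entries.append((o, len(mutated)))  # new length
            elif o > off:
                new_entries.append((o + delta, l))     # shifted by the size change
            else:
                new_entries.append((o, l))             # before the mutation, untouched
        return new_stream, new_entries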
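And Deep Fuzz, very loosely - assume the "semi" parse gives you a tree of (type, length, payload) nodes where payload is either a list of children or an opaque blob. Three mutation axes here rather than my four, and no constrained-value fix-ups, but it shows where the case explosion comes from:

    import itertools

    TYPE_MUTS = [lambda t: t, lambda t: 0, lambda t: t ^ 0xff, lambda t: t + 1]
    LEN_MUTS  = [lambda l: l, lambda l: 0, lambda l: l + 1, lambda l: 0x7fffffff]
    BLOB_MUTS = [lambda b: b, lambda b: b[:1], lambda b: b * 2,
                 lambda b: b"\xff" * len(b)]

    def deep_fuzz(node):
        t, l, payload = node
        if isinstance(payload, list):
            # Container: recurse into each child, leaving its siblings intact.
            for i, child in enumerate(payload):
                for fuzzed in deep_fuzz(child):
                    yield (t, l, payload[:i] + [fuzzed] + payload[i + 1:])
        else:
            # Atom: emit the full cross product of the mutation sets. A few
            # sets of a few mutators each, over every atom in a deeply
            # nested structure, is how you get to hundreds of millions of
            # cases.
            for tm, lm, bm in itertools.product(TYPE_MUTS, LEN_MUTS, BLOB_MUTS):
                yield (tm(t), lm(l), bm(payload))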
So here are the results. I didn't run anywhere near as many iterations as I wanted - one of the fuzzers was generating so many crashes and hangs that it reduced my speed to a crawl - barely 50 tests / second for all 4 fuzzers combined.

Millerfuzz: 1107728 tests
fail: 692792, success: 413312, crash: 1614 (101)
E: 12, PE: 654, PNE: 350, U: 597, <: 1

OK, so let me decode that output for you. 'fail' means the file was detected by Word as corrupt. The other two are pretty obvious. The number in brackets is the number of !exploitable 'buckets', based on the major and minor hashes it produces. The breakdown is by !exploitable classification, but by number only - no secondary breakdown by bucket (sorry): E for Exploitable, Probably Exploitable (PE), Probably Not Exploitable (PNE), Unknown (U), and < means that full crash details weren't retrieved and it screwed my parser up. That happened when the reaper killed the process before it was done getting crash info, or when !exploitable itself faulted in some way. There were actually a few good bugs in '<', by the way.
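Bucketing, for what it's worth, is nothing fancy - something like the sketch below, where the hashes come straight out of the !exploitable output (field names hypothetical):

    from collections import defaultdict

    def bucket(crashes):
        # Two crashes land in the same bucket iff !exploitable gave them
        # the same major AND minor hash.
        buckets = defaultdict(list)
        for c in crashes:
            buckets[(c["major_hash"], c["minor_hash"])].append(c)
        return buckets

    # len(bucket(crashes)) is the number in brackets above; tallying
    # c["classification"] over all crashes gives the E/PE/PNE/U line.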
Analysis: Millerfuzz hits the broadest set of buckets, by a long way. Unfortunately, I happen to know that at least half of those PE crashes are what I call 'chaff' - 3-4 crashes that come up WAY too often and are not useful. So 0.14% of tests caused a crash (higher than the numbers Charlie saw against PDF parsers, I think). Sadly, at the moment I can't connect exact crashes as I see them in my analysis tool with the fuzzer that produced them, so I can't tell you if those 12 Es look genuinely exploitable, but let's assume that some of them are. What I get from the numbers is that millerfuzz is broad and actually works - a fact that annoys me intensely. As a side issue - when you come to analyse these crashes, you will see a lot of differences all over the file, so how are you going to work out which change to even try to track?

Benfuzz: 2008423 tests
fail: 112817, success: 1891838, crash: 3709 (59)
E: 48, PE: 484, PNE: 23, U: 3150, <: 4

Because this is a rolling corrupt, and the run was incomplete, we don't know whether the general footprint would have changed had I left it to run the whole way through. 0.18% crashes seems broadly in line. Fewer buckets, more Es as a percentage of crashes. Masses of U, but I think a lot of those are chaff as well (just from watching it run). Overall, this looks like a slightly higher quality fuzzer, but considering the extra code it's leveraging, that's not surprising. The crashes will be easier to analyse, because the changes are more or less in one place at a time. Some of those Es are the real deal, just based on watching a few as they came up.

Tablestream Dumbfuzz: 932858 tests
fail: 183628, success: 724257, crash: 24953 (36)
E: 14, PE: 94, PNE: 17987, U: 6817, <: 41

2.67% crashes, but the huge bulk of them are divided between 3 of the chaff crashes - I swear they do that on purpose. Even if we strip the PNEs that's still 0.74%, though. Very few Es and PEs by ratio. Where this fuzzer has been awesome for me, though, is in indicating structures that might have deeper issues. I found that areas that throw a lot of chaff crashes when fuzzed with the dumb algorithms were good targets for deeper fuzzing. Again, this is an incomplete run, so the stats might be different if I let it run all the way through - it's just that I end up with tens of thousands of useless crashes in my sorting bin. This is only a single data point, but it doesn't look like the extra smarts did me much good here - I probably could have identified the pain points from the dumb rolling corruptor as well, by parsing out the mutated file; it would just be a bit more work.

Deep Fuzz: 1057066 tests
fail: 1034652, success: 22033, crash: 372 (30)
E: 47, PE: 316, PNE: 3, U: 6

The first thing to notice is the minuscule number of successfully opened files. Basically, either the file was detected as corrupt or it crashed. The number of crashes is tiny - 0.03% - BUT they're almost all gold. I'm in deep now, so I avoid a lot of the chaff. This is where the scale thing comes in - if you ran this fuzzer for just a couple of hundred thousand iterations you'd probably not see any crashes at all; you'd throw it away. The other nice thing is that files that are detected as corrupt error out faster, so this is one of the fastest case generators in my stable - I can crank a million tests an hour if I run only this fuzzer (and the results above are based on a million tests...). The thing to remember, though, is that I wouldn't be running it if I hadn't done the dumb coverage first, to find 'sore spots' in the parser.

My general interpretation of the numbers is that it makes sense to run lots of different case generators. This is in line with 'accepted wisdom' based on the MS Fuzzing Olympics and their results with their DFF. It looks like the really tasty bugs are still easier to mine with the deep fuzzers, though, so: some broad coverage with millerfuzz or benfuzz to identify pain points, followed by some actual work. If you run a millerfuzz style then I guess you can revert bytes one by one until you find the minimum set of changes that still hits the crash - I've got alpha code for this, but it's not one of the things I've fully integrated into Bugmine (yet).
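The revert-bytes idea is simple enough to sketch (not my alpha code - this assumes a causes_crash() oracle that replays a case against the target, and that the fuzzer didn't change the file length, which holds for millerfuzz):

    def minimise(original, fuzzed, causes_crash):
        # Revert corrupted bytes back to the original one at a time,
        # keeping a revert only if the case still crashes. Whatever is
        # still different at the end is (close to) the minimum set of
        # changes that hits the bug.
        buf = bytearray(fuzzed)
        for i in range(len(buf)):
            if buf[i] != original[i]:
                saved = buf[i]
                buf[i] = original[i]
                if not causes_crash(bytes(buf)):
                    buf[i] = saved  # this byte matters - restore the corruption
        return bytes(buf)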
Some of you might find other ideas in the raw numbers - let me know any insights I have missed. :)

What I don't have yet is a feel for how using different template files will affect these numbers - all these crashes are from a single template file which contains a lot of different Word features. I know automatic template selection has been explored by MS and also by Dr Miller, but there's no valgrind for Windows, which means we can't just steal his code. That research is in progress; I'll have some stuff to share 'soonish'.

Anyway, hopefully some of you will find this useful / interesting.

Cheers,

ben