Intrusion Detection Systems mailing list archives
RE: Determining when something is NOT random
From: "Bill Royds" <broyds () home com>
Date: Tue, 25 Jul 2000 20:00:29 -0400
Archive: http://msgs.securepoint.com/ids FAQ: http://www.ticm.com/kb/faq/idsfaq.html IDS: http://www-rnks.informatik.tu-cottbus.de/~sobirey/ids.html HELP: Having problems... email questions to ids-owner () uow edu au NOTE: Remove this section from reply msgs otherwise the msg will bounce. SPAM: DO NOT send unsolicted mail to this list. UNSUBSCRIBE: email "unsubscribe ids" to majordomo () uow edu au ----------------------------------------------------------------------------- The chi-squared is the most commonly applied, even if it is not robust. It is based on the assumption that deviations from expected bucket count are generated by a (0,1) normal distribution. It is rigorous based on assumptions. It is not very good if the assumptions are violated (unlike the F test). I assumed Lance was trying to test whether the IP's came from a decoy generator as in nmap or actually had some pattern. The simple chi-squared test can give an idea if they were evenly distributed (but not really prove it). It can't show much more than that but it is a fairly quick way to give a handle on the situation Common advice is to ensure at least 5 observations per bucket. Dividing 47 observations into ranges of 28 class A blocks each gives 8 buckets (224 class A,B and C blocks). On null hypothesis of equally likely, one would have ~6 IP's per range. Real IP addresses are not equally distributed so a chi square would help separate a naive IP address generator from more systematic approaches. It wouldn't help in determining whether the IP hosts are penetrated or not. -----Original Message----- From: Stephen P. Berry [mailto:spb () meshuggeneh net] Sent: Tuesday, July 25, 2000 15:22 To: broyds () home com Cc: lance () spitzner net; ids () uow edu au; spb () meshuggeneh net Subject: Re: Determining when something is NOT random *** PGP Signature Status: good *** Signer: Stephen P. Berry <spb () meshuggeneh net> *** Signed: 07/25/2000 3:18:07 PM *** Verified: 07/25/2000 7:33:32 PM *** BEGIN PGP VERIFIED MESSAGE *** Bill Royds writes:
The statistical method of determining whether something is random on not is the Chi-squared test. You calculate the sum of squares of (expected-observed)/ expected for classes of something.
Chi-square is -a- test which can be used to measure how close to a
random distribution a given distribution is. It isn't -the-
method, and it isn't even a particularly rigorous method.
There are a couple problems with attempting to determine if bits
of network data were randomly generated. Like:
-The data tends to be sparse
-A `normal' distribution is difficult to determine
-The `normal' distribution can change rapidly or be
highly time dependant (i.e., vary with time of day, day
of week, u.s.w.)
The most promising stuff I've personally worked on[0] has been
with birthday spacings tests of various sorts and OPSO (overlapping
pairs, sparse occupancy) tests---which seem to work comparatively
well against with the sort of data one gets from actual packet
captures.
I'm pretty sure, however, that you're pretty much S.O.L. if all
you've got is 47 datapoints (the data Mr Spitzner was intrested
in testing). It's not clear that you're any less S.O.L. in the
general case---one of the problems with PRNGs is that they tend
to be more pseudo than random. So you're not getting a perfectly
even distribution from their output. But that actually makes the
data look -more- like normal network traffic, which you wouldn't
expect to be randomly distributed.
About the best case you could hope for is a very naive PRNG with
a very short period---i.e., short enough that you'd see it repeat
a couple times in the data you're looking at. And hope that you're
actually seeing contiguous output bits from it.
Here the expected distribution of IP's can be determined by allocation of IP blocks. You would expect more IP's hitting you from the densely populated 24.x.x.x class A space. You would not expect something from NortelNetworks internal class A space. Classify the source IP's by class A prefix say, then see if they are fairly evenly distributed over class A space. If they are, then it is most likely random. If not, then examine to see which groups are more clumped. If they are the groups with more active IP addresses, it would give evidence of actual hacked machines.
I don't think this is a particularly good approach to the given
dataset. It would make more sense if you were attempting to
determine if all source addresses seen by one of your sensors (presumably
barring the local ones) were randomly distributed or something like that.
Given that you've got a collection of crafted packets (Mr Spitzner says
he's determined that they're all made by the same tool), I'd concentrate
on the distribution of the bits which change within your test data,
independent of what those bits represent. I.e., treat the IP
addresses as 32 bit numbers instead of dotted quads, and see
whether the distribution looks random. I suppose that you could
look at the distribution for the least 8 bits, then the least 16,
u.s.w., but I don't think that'll buy you much with this data.
If you were actually looking at a very large dataset (say, all the public
IP traffic generated in the world in a given 24 hour period), breaking
it down by subnets and pulling out private, reserved, and other
special-purpose subnets would make more sense. You'd also want
to know more about normal traffic distribution, how it varies with
time of day u.s.w., and you'd want a hell of a lot more crunch
power at your disposal than most of us have.
When I looked at this sort of problem a while back, I came to the
conclusion that it was essentially intractable unless you had
fairly detailed information about expected traffic distribution---and
that even given that, all it accomplished was upgrading simple guesswork
about the distribution of the data to educated guesswork.
That all being said, testing the output of RNGs and PRNGs is reasonably
well-explored territory in crypto. Interested folks should look
to proceedings papers[1] for more information.
-Steve
-----
0 Disclaimer: I'm not a professional statistician or mathematician.
I'm sure there's been quite a lot of work done on this subject
by folks that are much better at it than I am. My guess is
that most of this work has been done by TLAs and so isn't
publically available. If anyone has references or pointers to
material on the subject, by all means offer 'em.
1 The Menezes _Handbook of Applied Cryptography_, otherwise a good
starter reference for all things cryptographic, doesn't delve into
randomness testing past the venerable old chi square goodness test.
I think there's a paper by Marsaglia which gives a quick overview
of the methods used by DIEHARD, and I think it's included in
the distribution. If you're completely unfamiliar with randomness
testing, you could do worse than grab a copy of DIEHARD.
*** END PGP VERIFIED MESSAGE ***
Current thread:
- Determining when something is NOT random Lance Spitzner (Jul 24)
- Re: Determining when something is NOT random Joshua Stein (Jul 24)
- RE: Determining when something is NOT random Bill Royds (Jul 25)
- <Possible follow-ups>
- RE: Determining when something is NOT random Martins, Fernando (Lisbon) (Jul 24)
- Re: Determining when something is NOT random Robert Graham (Jul 25)
- RE: Determining when something is NOT random Meritt, Jim (Jul 25)
- RE: RE: Determining when something is NOT random Max Kilger (Jul 25)
- RE: Determining when something is NOT random Bill Royds (Jul 26)
