Intrusion Detection Systems mailing list archives

RE: Determining when something is NOT random


From: "Bill Royds" <broyds () home com>
Date: Tue, 25 Jul 2000 20:00:29 -0400

-----------------------------------------------------------------------------
The chi-squared test is the most commonly applied, even if it is not robust. It is based on the assumption that deviations 
from the expected bucket counts follow a standard (0,1) normal distribution. It is rigorous when its assumptions hold, but 
not very good when they are violated (unlike the F test).


I assumed Lance was trying to test whether the IPs came from a decoy generator, as in nmap, or actually had some 
pattern. The simple chi-squared test can give an idea of whether they were evenly distributed (but not really prove it). It 
can't show much more than that, but it is a fairly quick way to get a handle on the situation.

Common advice is to ensure at least 5 observations per bucket. Grouping the 224 class A, B and C /8 blocks into ranges of 
28 blocks each gives 8 buckets for the 47 observations. Under the null hypothesis that each bucket is equally likely, one 
would expect ~6 IPs per range (47/8 is about 5.9). Real IP addresses are not equally distributed, so a chi-square test 
would help separate a naive IP address generator from more systematic approaches, as in the sketch below.
It wouldn't help in determining whether the IP hosts are penetrated or not.
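
For concreteness, a rough sketch (in Python) of that quick check. The bucket layout follows the 8 ranges of 28 /8 blocks 
above; the sample addresses are just placeholders, not Lance's data:

    def first_octet(addr):
        """Leading (class A) octet of a dotted-quad address."""
        return int(addr.split(".")[0])

    def chi_squared_uniform(addrs, n_buckets=8, blocks_per_bucket=28):
        """Chi-squared statistic against an even spread over the buckets."""
        observed = [0] * n_buckets
        for a in addrs:
            bucket = min(first_octet(a) // blocks_per_bucket, n_buckets - 1)
            observed[bucket] += 1
        expected = len(addrs) / n_buckets    # ~5.9 per bucket for 47 IPs
        return sum((o - expected) ** 2 / expected for o in observed)

    addrs = ["24.1.2.3", "192.0.2.4", "203.0.113.9"]   # placeholder sample
    print(chi_squared_uniform(addrs))
    # Compare against the chi-squared critical value for 7 degrees of
    # freedom (about 14.1 at the 5% level) to decide whether to reject
    # the "evenly distributed" null hypothesis.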

-----Original Message-----
From: Stephen P. Berry [mailto:spb () meshuggeneh net]
Sent: Tuesday, July 25, 2000 15:22
To: broyds () home com
Cc: lance () spitzner net; ids () uow edu au; spb () meshuggeneh net
Subject: Re: Determining when something is NOT random



*** PGP Signature Status: good
*** Signer: Stephen P. Berry <spb () meshuggeneh net>
*** Signed: 07/25/2000 3:18:07 PM
*** Verified: 07/25/2000 7:33:32 PM
*** BEGIN PGP VERIFIED MESSAGE ***


Bill Royds writes:

The statistical method of determining whether something is random
or not is the Chi-squared test.
You calculate the sum over the classes of
(observed - expected)^2 / expected.

Chi-square is -a- test which can be used to measure how close to a
random distribution a given distribution is.  It isn't -the-
method, and it isn't even a particularly rigorous method.

There are a couple problems with attempting to determine if bits
of network data were randomly generated.  Like:

        -The data tends to be sparse
        -A `normal' distribution is difficult to determine
        -The `normal' distribution can change rapidly or be
         highly time dependent (i.e., vary with time of day, day
         of week, and so on)

The most promising stuff I've personally worked on[0] has been
with birthday spacings tests of various sorts and OPSO (overlapping
pairs, sparse occupancy) tests---which seem to work comparatively
well with the sort of data one gets from actual packet
captures.
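
Roughly, a birthday spacings test draws m `birthdays' in [0, n),
sorts them, and counts duplicated spacings between neighbours; for
random input that count should be approximately Poisson with mean
m^3/(4n).  A minimal sketch (the parameters m=512, n=2**24 are just
illustrative, and this isn't meant as a rigorous implementation):

    import random
    from collections import Counter

    def duplicate_spacings(values, n):
        """Count spacings that repeat an earlier spacing, for
        'birthdays' reduced into [0, n)."""
        birthdays = sorted(v % n for v in values)
        spacings = [b - a for a, b in zip(birthdays, birthdays[1:])]
        return sum(c - 1 for c in Counter(spacings).values() if c > 1)

    m, n = 512, 2**24
    expected = m**3 / (4 * n)              # = 2.0 for these parameters
    sample = [random.randrange(n) for _ in range(m)]
    print(duplicate_spacings(sample, n), "duplicates; expected ~", expected)
    # Repeating this many times and testing the counts against a
    # Poisson(2.0) distribution is what turns the sketch into a real test.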

I'm pretty sure, however, that you're pretty much S.O.L. if all
you've got is 47 datapoints (the data Mr Spitzner was interested
in testing).  It's not clear that you're any less S.O.L. in the
general case---one of the problems with PRNGs is that they tend
to be more pseudo than random.  So you're not getting a perfectly
even distribution from their output.  But that actually makes the
data look -more- like normal network traffic, which you wouldn't
expect to be randomly distributed.

About the best case you could hope for is a very naive PRNG with
a very short period---i.e., short enough that you'd see it repeat
a couple times in the data you're looking at.  And hope that you're
actually seeing contiguous output bits from it.
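
A crude way to check for that---nothing rigorous, just a sketch, and
the sample sequence below is made up---is to look for the smallest
offset at which the observed values repeat exactly:

    def smallest_period(seq):
        """Smallest p such that seq[i] == seq[i+p] wherever both exist,
        or None if nothing repeats within the sample."""
        n = len(seq)
        for p in range(1, n):
            if all(seq[i] == seq[i + p] for i in range(n - p)):
                return p
        return None

    print(smallest_period([3, 1, 4, 3, 1, 4, 3, 1]))   # -> 3
    # A match found only near p == len(seq) is weak evidence; you want a
    # period short enough to see several full repetitions in the data.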


Here the expected distribution of IPs can be determined by the
allocation of IP blocks. You would expect more IPs hitting
you from the densely populated 24.x.x.x class A space. You
would not expect something from Nortel Networks' internal class A space.
Classify the source IPs by class A prefix, say, then see if they
are fairly evenly distributed over class A space. If they are,
then it is most likely random. If not, then examine to see which
groups are more clumped. If they are the groups with more active
IP addresses, it would give evidence of actual hacked machines.

I don't think this is a particularly good approach to the given
dataset.  It would make more sense if you were attempting to
determine if all source addresses seen by one of your sensors (presumably
barring the local ones) were randomly distributed or something like that.
Given that you've got a collection of crafted packets (Mr Spitzner says
he's determined that they're all made by the same tool), I'd concentrate
on the distribution of the bits which change within your test data,
independent of what those bits represent.  I.e., treat the IP
addresses as 32 bit numbers instead of dotted quads, and see
whether the distribution looks random.  I suppose that you could
look at the distribution of the least significant 8 bits, then the least 16,
and so on, but I don't think that'll buy you much with this data.
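
A rough sketch of that bit-level view (the addresses below are
placeholders, not the 47 in question):

    import socket, struct

    def ip_to_int(addr):
        """Dotted quad -> unsigned 32 bit integer."""
        return struct.unpack("!I", socket.inet_aton(addr))[0]

    def bit_frequencies(addrs):
        """Fraction of addresses with each of the 32 bits set (MSB first)."""
        ints = [ip_to_int(a) for a in addrs]
        return [sum((v >> (31 - pos)) & 1 for v in ints) / len(ints)
                for pos in range(32)]

    sample = ["10.1.2.3", "192.168.4.5", "172.16.7.8"]   # placeholders
    for pos, freq in enumerate(bit_frequencies(sample)):
        print("bit %2d: %.2f" % (pos, freq))
    # Uniformly random 32 bit values would put each fraction near 0.5;
    # heavy skew in the high-order bits is what real allocation patterns
    # (or a lazy decoy generator) would show instead.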

If you were actually looking at a very large dataset (say, all the public
IP traffic generated in the world in a given 24 hour period), breaking
it down by subnets and pulling out private, reserved, and other
special-purpose subnets would make more sense.  You'd also want
to know more about normal traffic distribution, how it varies with
time of day and so on, and you'd want a hell of a lot more crunch
power at your disposal than most of us have.

When I looked at this sort of problem a while back, I came to the
conclusion that it was essentially intractable unless you had
fairly detailed information about expected traffic distribution---and
that even given that, all it accomplished was upgrading simple guesswork
about the distribution of the data to educated guesswork.

That all being said, testing the output of RNGs and PRNGs is reasonably
well-explored territory in crypto.  Interested folks should look
to proceedings papers[1] for more information.





-Steve

-----
0     Disclaimer:  I'm not a professional statistician or mathematician.
      I'm sure there's been quite a lot of work done on this subject
      by folks that are much better at it than I am.  My guess is
      that most of this work has been done by TLAs and so isn't
      publicly available.  If anyone has references or pointers to
      material on the subject, by all means offer 'em.
1     The Menezes _Handbook of Applied Cryptography_, otherwise a good
      starter reference for all things cryptographic, doesn't delve into
      randomness testing past the venerable old chi-square goodness-of-fit test.
      I think there's a paper by Marsaglia which gives a quick overview
      of the methods used by DIEHARD, and I think it's included in
      the distribution.  If you're completely unfamiliar with randomness
      testing, you could do worse than grab a copy of DIEHARD.



*** END PGP VERIFIED MESSAGE ***

