Nmap Development mailing list archives

Request for Comments: New IPv6 OS detection machine learning engine


From: Prabhjyot Singh Sodhi <prabhjyotsingh95 () gmail com>
Date: Thu, 21 Jul 2016 12:17:50 +0530

Hi devs,

This document explains the solution we experimented with and are proposing
as a replacement for the current implementation.

*Aim* : The goal of this model is to correctly identify a target's
operating system based on network probes and the differences in how
operating systems respond to those probes.

*Data* : As of now we have a total of 301 prints. We use these prints to
generate the features for our models (695 features as of now).

*Model being used* : The current model uses a logistic regression learner
to predict operating systems.

*Change in data representation* : The current database representation,
which is used by the logistic regression model, is based on the fact that
all prints which are members of a group are very similar to each other
(value-wise). This is in contrast to how classes work in typical learning
systems (a learning system being one where you teach a system to perform a
task -- OS prediction in our case). Usually, we'd have a target variable
(the operating system in our case) and one group per operating system (or
set of operating systems), with all prints corresponding to that operating
system going into the group. And this is exactly what we have attempted
with the new representation.

Now, to achieve this, one simple solution could have been to have one group
per operating system (each version, so one per Linux kernel). Given the low
number of prints, this would have resulted in a very high number of groups
with very few prints in each group, which would have made prediction more
difficult. That is why we tried to keep similar operating systems in the
same group.

We were able to do this for Windows, IBM, Macintosh and FreeBSD-type
systems. For Linux, we decided to stick with the existing representation
(with small changes) due to the complexity of how those groups were made.

*Models experimented with* :
i) Random Forest (RF):
A random forest is an example of what we call an ensemble model. Ensemble
models combine two or more predictive models (decision trees in our case).

A decision tree, like logistic regression, is an algorithm for creating a
predictive model. It gets its name from its tree structure, where each node
represents a decision based on a feature of the dataset.

So, we combine around 400 decision trees into our random forest.
That explains the "forest" part, but where does the randomness come from?

All the trees in the forest are trained on the same parameters but with
different training sets. These different training sets are generated from
the original training prints using random selection (with replacement).
Apart from this, at each node of a tree being trained, not all the
variables are used to find the best split, but only a random subset of them
(a new subset is generated for each split).

These measures help avoid the problem of overfitting [1] the dataset.
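To make the two sources of randomness concrete, here is a minimal sketch of
how a training set and a per-split feature subset could be drawn. The
function names and sizes are illustrative, not from our code; the 26 is just
roughly sqrt(695), a common default for the per-split subset size.

```python
import random

def bootstrap_sample(prints, rng):
    # Draw a training set of the same size by sampling with replacement,
    # so each tree sees a slightly different collection of prints.
    return [rng.choice(prints) for _ in prints]

def feature_subset(n_features, k, rng):
    # Pick a random subset of k feature indices to consider at one split.
    return rng.sample(range(n_features), k)

rng = random.Random(42)
prints = ["print_%d" % i for i in range(10)]
sample = bootstrap_sample(prints, rng)      # per-tree training set
subset = feature_subset(695, 26, rng)       # per-split candidate features
```

Each tree gets its own `bootstrap_sample`, and every split inside a tree
calls `feature_subset` afresh, which is what decorrelates the trees.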

ii) Multi Stage Random Forest (MSRF):
This classifier aims to use multiple layers (with classifiers at each
level) for OS classification instead of previous single-stage attempts.
The model uses 2 layers of classifiers as of now. The first layer is
responsible for understanding and differentiating between broader sets of
operating systems, namely, Linux, BSDs, Windows, Macintosh and Others.

Once this classification is made, the print is sent to the second layer for
a more specific classification. The second layer has a separate classifier
for each of the OS classes (Linux, BSDs, Windows, Macintosh and Others).
The classification produced by the second layer is the output for the given
print.

Each of the models described above is a random forest, trained on a
different set of prints to specialize it for its class.
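The two-layer dispatch can be sketched as follows. The toy stand-in
classifiers and the TTL-based rule are purely illustrative assumptions; in
the real engine both stages would be trained random forests.

```python
def classify_two_stage(print_features, first_stage, second_stages):
    # Stage 1 picks a broad OS family; the matching stage-2 classifier
    # then produces the specific group for that family.
    family = first_stage(print_features)
    return family, second_stages[family](print_features)

# Toy stand-ins for the trained forests (illustrative only):
first = lambda x: "Linux" if x["ttl"] == 64 else "Windows"
second = {
    "Linux":   lambda x: "Linux 3.X",
    "Windows": lambda x: "Windows 10",
}

family, group = classify_two_stage({"ttl": 64}, first, second)
```

The dictionary of second-stage classifiers is also where extra per-family
probes or features could be plugged in later.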

*Experimentation Results* : Both RF and MSRF perform better than the
logistic regression model in per group 80:20 testing (with 80% prints from
a group in the training set and rest in the testing set) and randomized
testing (with 20 random prints as the test set and rest as the training
set).
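For clarity, the per-group 80:20 split can be sketched like this -- each
group contributes 80% of its prints to the training set and the rest to the
test set, so every group is represented during training. The helper below
is an illustrative sketch, not our actual test harness.

```python
import random

def per_group_split(groups, frac=0.8, seed=0):
    # Split each group's prints frac:(1-frac) into train/test,
    # keeping at least one training print per group.
    rng = random.Random(seed)
    train, test = [], []
    for name, prints in groups.items():
        shuffled = prints[:]
        rng.shuffle(shuffled)
        cut = max(1, int(len(shuffled) * frac))
        train += [(name, p) for p in shuffled[:cut]]
        test += [(name, p) for p in shuffled[cut:]]
    return train, test

groups = {"Linux": list(range(10)), "Windows": list(range(5))}
train, test = per_group_split(groups)
```

The randomized test we mention works the other way around: pick 20 prints
at random for the test set, regardless of group.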

Between RF and MSRF, they perform equally well in most cases, and each
outperforms the other in some cases. So between them, we don't have a clear
winner in terms of accuracy. That said, we think MSRF is the way to go.

Advantages of MSRF:
i) It represents the actual hierarchy of operating systems more closely.
What I mean is that two Linux kernels are more similar to each other than a
Linux kernel and a Windows system are.
ii) The combined size of all models in MSRF is 3.6 MB, less than half of
RF's (8 MB).
iii) There is a lot of scope for plugging in more features when using MSRF.
For example, we may choose to send a different set of probes if the first
stage tells us that the target looks like a Linux device.

It'd be great if you could review this and help us with some valuable
feedback.

Cheers,
Prabhjyot

[1]: Overfitting is when your model performs beautifully on the data it was
trained on but falls flat on unseen data.
http://blog.fliptop.com/wp-content/uploads/2015/03/goodfit_overfitting.png
_______________________________________________
Sent through the dev mailing list
https://nmap.org/mailman/listinfo/dev
Archived at http://seclists.org/nmap-dev/
