Nmap Development mailing list archives
Request for Comments: New IPv6 OS detection machine learning engine
From: Prabhjyot Singh Sodhi <prabhjyotsingh95 () gmail com>
Date: Thu, 21 Jul 2016 12:17:50 +0530
Hi devs,

This document explains the solution we experimented with and are proposing over the current implementation.

*Aim* : The goal of this model is to guess the target operating system correctly, based on network probes and the differences in how operating systems react to those probes.

*Data* : As of now we have a total of 301 prints. We use these prints to generate the features for our models (695 features as of now).

*Model being used* : The current model uses a logistic regression learner to predict operating systems.

*Change in data representation* : The current database representation used by the logistic regression model is based on the fact that all prints that are members of a group are very similar to each other (value-wise). This is in contrast to how classes usually work in learning systems (a learning system is one in which you try to teach a system to do something; in our case, predicting the OS). Usually, we'd have a target variable (the operating system in our case), one group per operating system (or per set of operating systems), and all prints corresponding to that operating system would go into the group. And this is exactly what we have attempted with the new representation.

One simple way to achieve this would have been to have one group per operating system (each version, so one per Linux kernel). Given the low number of prints, this would have resulted in a very high number of groups with very few prints in each, which would have made prediction more difficult. That is why we tried to keep similar operating systems in the same group. We were able to do this for Windows, IBM, Macintosh, and FreeBSD-type systems. For Linux, we decided to stick with the existing representation (with small changes) due to the complexity in the way the groups were made.

*Models experimented with* :

i) Random Forest (RF): A random forest is an example of what we call an ensemble model.
Ensemble models are models that combine two or more predictive models (decision trees in our case). A decision tree, just like logistic regression, is an algorithm for creating a predictive model. It gets its name from the tree structure in which each node represents a decision that depends on a feature of the dataset. So, we ensemble/combine around 400 decision trees into our random forest. This explains the "forest" bit, but where does the randomness come from? All the trees in the forest are trained on the same parameters but with different training sets. These different training sets are generated from the original training prints using random selection with replacement. Apart from this, at each node of a tree, not all the variables are used to find the best split, but only a random subset of them (a new subset is drawn for each node). These measures help avoid the problem of overfitting [1] the dataset.

ii) Multi Stage Random Forest (MSRF): This classifier uses multiple layers (with classifiers at each level) for OS classification, instead of the previous single-stage attempts. The model currently uses 2 layers of classifiers. The first layer is responsible for differentiating between broader sets of operating systems, namely Linux, BSDs, Windows, Macintosh, and Others. Once this classification succeeds, the print is sent to the second layer for a more specific classification. The second layer has a different classifier for each of the OS classes (Linux, BSDs, Windows, Macintosh, and Others). The classification produced by the second layer is the output for the given print. Each of the models described here is a random forest, trained on different prints to change what it learns.
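To make the two-stage idea concrete, here is a minimal sketch using scikit-learn. This is illustrative only, not the actual Nmap implementation: the feature vectors are synthetic, and the family and OS labels are stand-ins for real prints.

```python
# Illustrative sketch of a two-stage (MSRF-style) classifier.
# All data here is synthetic; real prints would supply the features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy data: 200 "prints", 20 features each, with a broad family label
# (stage 1) and a more specific OS label (stage 2).
X = rng.normal(size=(200, 20))
families = rng.choice(["Linux", "BSD", "Windows"], size=200)
specific = np.array([f + "-" + str(rng.integers(2)) for f in families])

# Stage 1: one forest that separates the broad OS families.
stage1 = RandomForestClassifier(n_estimators=400, random_state=0)
stage1.fit(X, families)

# Stage 2: one forest per family, trained only on that family's prints.
stage2 = {}
for fam in np.unique(families):
    mask = families == fam
    clf = RandomForestClassifier(n_estimators=400, random_state=0)
    clf.fit(X[mask], specific[mask])
    stage2[fam] = clf

def classify(x):
    """Route a print through stage 1, then the matching stage-2 forest."""
    fam = stage1.predict(x.reshape(1, -1))[0]
    return stage2[fam].predict(x.reshape(1, -1))[0]

print(classify(X[0]))
```

The point of the sketch is the routing: stage 1 only has to tell families apart, and each stage-2 forest only ever sees prints from its own family, which is what keeps the per-classifier problem small.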
*Experimentation Results* : Both RF and MSRF perform better than the logistic regression model in per-group 80:20 testing (with 80% of the prints from each group in the training set and the rest in the testing set) and in randomized testing (with 20 random prints as the test set and the rest as the training set). Between RF and MSRF, the two perform equally well in most cases but outperform each other in some, so we don't have a clear winner between them in terms of accuracy. Having said that, we think MSRF is the way to go.

Advantages of MSRF:

i) It represents the actual hierarchy of operating systems more closely. What I mean is that two Linux kernels are more similar to each other than a Linux kernel is to a Windows system.

ii) The combined size of all models in MSRF is 3.6 MB, which is less than half of RF's (8 MB).

iii) There is a lot of scope for plugging in more features when using MSRF. For example, we may choose to send a different set of probes if the first stage tells us that the target looks like a Linux device.

It'd be great if you could review this and help us with some valuable feedback.

Cheers,
Prabhjyot

[1]: Overfitting is when your model performs beautifully on your test data but falls flat on unseen data. http://blog.fliptop.com/wp-content/uploads/2015/03/goodfit_overfitting.png
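As an appendix, the per-group 80:20 evaluation described in the results can be sketched as follows. Again this is illustrative, using scikit-learn on synthetic data; `stratify` is one way to keep 80% of each group's prints in the training set.

```python
# Illustrative sketch of the per-group 80:20 evaluation.
# Synthetic data stands in for the real prints and groups.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 20))          # roughly the ~301 prints mentioned above
groups = rng.choice(["Linux", "Windows", "FreeBSD", "Macintosh"], size=300)

# 80:20 split per group: stratifying on the group labels keeps the
# per-group proportions the same in the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, groups, test_size=0.2, stratify=groups, random_state=0)

clf = RandomForestClassifier(n_estimators=400, random_state=0)
clf.fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
print(f"held-out accuracy: {accuracy:.2f}")
```

On real prints one would report this accuracy per group as well as overall; on this random data it is, of course, only at chance level.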
_______________________________________________ Sent through the dev mailing list https://nmap.org/mailman/listinfo/dev Archived at http://seclists.org/nmap-dev/
