Nmap Development mailing list archives
Re: Request for Comments: New IPv6 OS detection machine learning engine
From: Prabhjyot Singh Sodhi <prabhjyotsingh95 () gmail com>
Date: Wed, 10 Aug 2016 01:04:51 +0530
Hey David, thanks a lot for your suggestions.
> A decision tree-based classifier sounds like a good idea. Of course it all comes down to the performance of the implementation. Do you have some code or evaluation results?
We ran accuracy-based tests with an 80:20 class-wise training/testing split. The random forest model performed best with 76.4% accuracy, the multi-stage random forest achieved around 76%, and liblinear around 60%. We also tried randomized testing (with 20 and 40 randomly chosen test prints): random forest got 75.5% accuracy, multi-stage random forest got 75%, and liblinear finished with 69%.
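For readers who want to reproduce this kind of evaluation, here is a minimal sketch of a class-wise 80:20 split plus an accuracy measure. It is only an illustration, not the code used for the numbers above: the function names, the `(features, label)` sample layout, and the rule of keeping at least one print per class for training are all assumptions.

```python
import random
from collections import defaultdict


def class_wise_split(samples, test_fraction=0.2, seed=0):
    """Split (features, label) pairs so every class gives ~test_fraction to the test set."""
    by_class = defaultdict(list)
    for features, label in samples:
        by_class[label].append((features, label))

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for label in sorted(by_class):
        group = by_class[label]
        rng.shuffle(group)
        # Assumption: always leave at least one print of each class for training.
        n_test = min(len(group) - 1, round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def accuracy(predict, test):
    """Fraction of test prints for which predict(features) equals the true label."""
    if not test:
        return 0.0
    correct = sum(1 for features, label in test if predict(features) == label)
    return correct / len(test)
```

Here `predict` stands for any trained model wrapped as a function from a feature vector to a label, so the same harness could compare random forest, multi-stage random forest, and liblinear on identical splits.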
> One problem you might have with making classes based strictly on OS versions is that some fingerprints may belong to different versions but be indistinguishable. For example, you might have training samples for Linux 2.6.22 and Linux 2.6.23 as leaves in one of your decision trees, even though they have the same network behavior. You might have to have some kind of cutoff where you decide that all leaves below certain nodes belong to the same class. (That kind of cutoff is basically what we are trying to simulate in the current system of manually curated classes. We rely on the human integrator having some expert knowledge of what version ranges should be distinguishable from what others.)
Every decision tree in the random forest has a max depth parameter that decides when to stop trying to differentiate between the prints at that level of the tree. At this point it is difficult to predict the optimal value of this parameter for when the dataset has, say, 5K prints. But the hope is that tweaking these parameters will ensure good results from the model. Ideally, though, if we could identify that Linux 2.6.22 and Linux 2.6.23 are indistinguishable and group them together, it'd be amazing for the learning. Also, since it is difficult to do so for the existing classes, we should probably try to keep this in mind for the classes we add in the future.
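One very rough way to find candidate groups automatically is sketched below. It assumes, as a deliberate simplification, that two classes are "indistinguishable" when at least one print from each has exactly the same feature vector; real prints would likely need a tolerance or a tree-based cutoff instead. The function name and the merged-label format are hypothetical.

```python
from collections import defaultdict


def merge_indistinguishable(samples):
    """Map each label to a merged label; labels sharing an identical feature vector are merged."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    first_label = {}  # feature vector -> first label seen with that vector
    for features, label in samples:
        find(label)  # register the label even if it never merges
        key = tuple(features)
        if key in first_label:
            union(first_label[key], label)
        else:
            first_label[key] = label

    groups = defaultdict(list)
    for label in list(parent):
        groups[find(label)].append(label)
    return {
        label: " / ".join(sorted(members))
        for members in groups.values()
        for label in members
    }
```

With two Linux 2.6.22 and 2.6.23 prints that share a feature vector, both labels would map to "Linux 2.6.22 / Linux 2.6.23", and the classifier could then be trained on the merged labels.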
> How do you plan to handle novel fingerprints? One way to evaluate this would be to hold out an entire class during training, and test whether the fingerprints of the held-out class match some other existing class or are properly detected as novel.
The final system will return an answer only if the operating system is predicted with confidence above some threshold.

Cheers,
Prabhjyot
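P.S. A minimal sketch of the thresholding idea and of the hold-out evaluation suggested above. It assumes that "confidence" is the fraction of trees voting for the winning class and uses a placeholder threshold of 0.6; neither is a decided design, and the function names are hypothetical.

```python
from collections import Counter


def predict_with_threshold(tree_votes, threshold=0.6):
    """Return (label, confidence); label is None when confidence is below threshold."""
    if not tree_votes:
        return None, 0.0
    label, count = Counter(tree_votes).most_common(1)[0]
    confidence = count / len(tree_votes)  # assumed confidence: share of trees agreeing
    if confidence >= threshold:
        return label, confidence
    return None, confidence


def novel_detection_rate(votes_per_print, threshold=0.6):
    """Fraction of held-out-class prints that get no answer, i.e. are treated as novel."""
    if not votes_per_print:
        return 0.0
    rejected = sum(
        1 for votes in votes_per_print
        if predict_with_threshold(votes, threshold)[0] is None
    )
    return rejected / len(votes_per_print)
```

For the hold-out test, one would train without a class, collect the per-tree votes for each print of that class, and pass them to `novel_detection_rate`; a high rate means the held-out prints were mostly rejected rather than matched to some other class.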
_______________________________________________ Sent through the dev mailing list https://nmap.org/mailman/listinfo/dev Archived at http://seclists.org/nmap-dev/
Current thread:
- Request for Comments: New IPv6 OS detection machine learning engine Prabhjyot Singh Sodhi (Jul 20)
- Re: Request for Comments: New IPv6 OS detection machine learning engine David Fifield (Aug 07)
- Re: Request for Comments: New IPv6 OS detection machine learning engine Prabhjyot Singh Sodhi (Aug 09)
- Re: Request for Comments: New IPv6 OS detection machine learning engine David Fifield (Aug 07)
