Nmap Development mailing list archives

Re: Request for Comments: New IPv6 OS detection machine learning engine


From: Prabhjyot Singh Sodhi <prabhjyotsingh95 () gmail com>
Date: Wed, 10 Aug 2016 01:04:51 +0530

Hey David,

Thanks a lot for your suggestions.


A decision tree–based classifier sounds like a good idea. Of course it
all comes down to the performance of the implementation. Do you have
some code or evaluation results?


We ran accuracy tests on an 80:20 class-wise training/testing split.
The random forest model performed best, with 76.4% accuracy; the
multi-stage random forest achieved around 76%, and liblinear around
60%.

We also tried randomized testing (with 20 and 40 test prints chosen at
random): random forest got 75.5% accuracy, multi-stage random forest
got 75%, and liblinear finished with 69%.
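
For reference, here is a minimal sketch of what a class-wise 80:20
split and the accuracy measure look like. It is not our actual test
harness; the function names, the fixed seed, and the rounding rule for
small classes are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices so each class gives ~test_fraction of its prints to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Assumption: round to the nearest whole print; tiny classes may
        # end up with no test prints at all.
        n_test = round(len(idxs) * test_fraction)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return train, test

def accuracy(predicted, actual):
    """Fraction of test prints whose predicted class equals the true class."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)
```

With this rounding rule, a class with only one or two prints never
appears in the test set, so the reported accuracy mostly reflects the
larger classes.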



One problem you might have with making classes based strictly on OS
versions is that some fingerprints may belong to different versions but
be indistinguishable. For example, you might have training samples for
Linux 2.6.22 and Linux 2.6.23 as leaves in one of your decision trees,
even though they have the same network behavior. You might have to have
some kind of cutoff where you decide that all leaves below certain nodes
belong to the same class. (That kind of cutoff is basically what we are
trying to simulate in the current system of manually curated classes. We
rely on the human integrator having some expert knowledge of what
version ranges should be distinguishable from what others.)


Every decision tree in the random forest has a max depth parameter,
which decides when the tree stops trying to differentiate between the
prints at that level.
At this point it is difficult to predict the optimal value of this
parameter for when the dataset has, say, 5K prints. But the hope is
that tuning these parameters will give good results from the model.
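
To illustrate the effect of the depth cap, here is a toy sketch of a
single tree grown with Gini impurity. This is not the code from our
model; the helper names and the two-feature prints in the test data are
hypothetical. The point is only that a smaller max_depth leaves similar
classes sharing one leaf.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, max_depth, depth=0):
    """Grow a tree; stop at max_depth, on a pure node, or when no split helps."""
    counts = Counter(labels)
    if depth >= max_depth or len(counts) == 1:
        return {"leaf": dict(counts)}
    best = None  # (weighted impurity, feature index, threshold)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None or best[0] >= gini(labels):
        return {"leaf": dict(counts)}
    _, f, t = best
    lo = [(r, l) for r, l in zip(rows, labels) if r[f] <= t]
    hi = [(r, l) for r, l in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in lo], [l for _, l in lo], max_depth, depth + 1),
        "right": build_tree([r for r, _ in hi], [l for _, l in hi], max_depth, depth + 1),
    }

def leaves(tree):
    """Return the class counts stored at each leaf, left to right."""
    if "leaf" in tree:
        return [tree["leaf"]]
    return leaves(tree["left"]) + leaves(tree["right"])

# Hypothetical prints: the two Linux versions differ only slightly.
rows = [(0, 0), (0, 1), (1, 5), (1, 6)]
labels = ["os-a", "os-a", "linux-2.6.22", "linux-2.6.23"]
shallow = leaves(build_tree(rows, labels, max_depth=1))  # Linux versions share a leaf
deep = leaves(build_tree(rows, labels, max_depth=2))     # Linux versions are separated
```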


Ideally, though, if we could identify that Linux 2.6.22 and Linux 2.6.23
are indistinguishable and group them together, it would be great for
the learning. Since that is difficult to do for the existing classes,
we should probably keep it in mind for the classes we add in the
future.
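
One simple way such grouping could be approximated is sketched below,
under a strong assumption: two classes are treated as indistinguishable
only if they share at least one identical feature vector. Real prints
are noisy, so this is an illustration of the idea and not something we
have implemented. The function name and the input layout are
hypothetical.

```python
from collections import defaultdict

def group_indistinguishable(prints):
    """prints: list of (feature_tuple, class_name).

    Returns sorted groups of class names that are linked, directly or
    transitively, by sharing an identical feature tuple.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    by_features = defaultdict(set)
    for feats, name in prints:
        find(name)  # register every class, even unmatched ones
        by_features[feats].add(name)

    for names in by_features.values():
        names = sorted(names)
        for other in names[1:]:
            parent[find(other)] = find(names[0])

    groups = defaultdict(set)
    for name in list(parent):
        groups[find(name)].add(name)
    return sorted(sorted(g) for g in groups.values())
```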



How do you plan to handle novel fingerprints? One way to evaluate this
would be to hold out an entire class during training, and test whether
the fingerprints of the held-out class match some other existing class
or are properly detected as novel.


The final system will return an answer only if the operating system is
predicted with confidence above some threshold.
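
A minimal sketch of that thresholding, assuming the confidence is taken
to be the fraction of trees voting for the winning class. Both that
definition and the 0.6 default are assumptions for illustration; the
actual confidence measure and cutoff are not decided here.

```python
from collections import Counter

def predict_with_threshold(tree_votes, threshold=0.6):
    """tree_votes: one predicted class name per tree in the forest.

    Returns (label, confidence); label is None when the winning vote
    share is below the threshold, i.e. the print is treated as novel.
    """
    winner, count = Counter(tree_votes).most_common(1)[0]
    confidence = count / len(tree_votes)
    return (winner if confidence >= threshold else None), confidence
```

A held-out class could then be evaluated by checking how often its
prints come back as None instead of being assigned to some other class.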

Cheers,
Prabhjyot
_______________________________________________
Sent through the dev mailing list
https://nmap.org/mailman/listinfo/dev
Archived at http://seclists.org/nmap-dev/