mailing list archives
Re: Options to replace select in Nsock
From: doug () hcsw org
Date: Sat, 20 Jun 2009 22:09:35 +0000
On Fri, Jun 19, 2009 at 11:48:44PM -0600 or thereabouts, David Fifield wrote:
poll is a function that is similar to select but it doesn't have a limit
on the number of descriptors it can follow. poll is pretty portable,
except it isn't available on Windows before Windows Vista.
poll is also not available on old BSD derived systems (all modern BSDs
support it now) and was not available on Mac OS X until 10.3 or so.
poll has the
same scalability problem as select, namely that you have to iterate over
the entire descriptor list to find out which ones are ready, but that
hasn't been a problem for us.
The problem with select and poll is that the user-space code must inform
the kernel of all the descriptors it is interested in EVERY SINGLE TIME
it blocks to wait for the next event.
Assuming that event frequency is roughly proportional to the number of
descriptors the program is interested in, this means that the total
CPU time cost of monitoring events with select and poll is roughly
proportional to THE SQUARE of the number of descriptors being monitored.
The goal of modern event APIs is to make the CPU time cost of monitoring
events LINEAR to the number of events and UNCORRELATED to the number of
descriptors being monitored.
So the reason we haven't noticed a scalability problem with select
is that the time spent handling events grows roughly quadratically
with the number of descriptors, and until the number of descriptors gets
significantly above 1000, that CPU time is not substantial relative
to the cost of context switching into the kernel when calling select().
As you mention, the one nice thing about poll is that there is no fixed
limit on the number of descriptors that can be monitored. Unfortunately,
poll has worse scalability properties than select because each descriptor
you monitor uses more memory than it would with select, causing
more copying and higher cache pressure.
See, the way select works is that the fd_set contains a bit array of
1024 bits. The first bit refers to descriptor 0, the second to descriptor
1, and so on. When you pass an fd_set to select, the kernel will loop
through this bit array up to the value of the nfds parameter. This is
why nfds must be the highest numbered descriptor in the set plus 1,
NOT the total number of descriptors. So a read fd_set and a write fd_set
can be stored in (1024/8)*2 = 256 bytes. Since each pollfd struct
consumes 8 bytes, monitoring 1024 descriptors would consume 8K.
Furthermore, a smart application can build up and scan bit arrays
word-by-word -- not bit-by-bit -- for improved performance.
The idea behind event APIs like kqueue and epoll is that you have
the kernel remember which descriptors you are interested in between
calls to fetch the next event. For this reason, these are called
STATEFUL event APIs.
I said above that modern event APIs try to make the time handling
events linear with the number of events. This means that the
CPU time should be uncorrelated to the number of descriptors the
program is interested in. In the default modes of kqueue and
epoll this is not entirely the case because they use "level
triggering". This sounds complicated but is actually simple.
All it means is that if the kernel tells you there is some new
event on a socket and you don't process that event (i.e. you don't
read all the data available), it's no big deal because the kernel
will tell you again later that there is still data available on
that socket.
The problem with level triggering is that the kernel has to
monitor the descriptor set continually because you may not have
processed all events even though the kernel told you about them.
This means that there is CPU activity proportional to the number
of descriptors -- exactly what we want to avoid.
The opposite of level triggering is called "edge triggering".
This mode is supported by both kqueue and epoll. In this mode,
the kernel will only notify you when the state of a descriptor
changes from unready to ready. So when a packet comes in, the
kernel will add it to the descriptor's input mbuf and queue a
ready event for notification to user-space. The kernel's event
API never has to re-visit that descriptor until the next event
happens on it.
The consequence is that processes must ensure that they read all
the data from a descriptor because if they don't, the descriptor
will appear unready even though there may still be data available
to read. Typically this is done by using non-blocking sockets
and always calling read() and write() until they return EAGAIN.
So in summary, using poll is obviously preferable to having nmap
crash when it reaches 1024 descriptors, but if the goal is to have
nmap perform well when handling in excess of 1024 descriptors,
a stateful event API should be used.
Hope this helps,
Sent through the nmap-dev mailing list
Archived at http://SecLists.Org