Nmap Development mailing list archives

Re: Web crawling library proposal
From: David Fifield <david () bamsoftware com>
Date: Sat, 5 Nov 2011 10:08:57 -0700

On Wed, Oct 19, 2011 at 12:25:19AM -0700, Paulino Calderon wrote:
> Hi list,
>
> I'm attaching my working copies of the web crawling library and a
> few scripts that use it. It would be great if I can get some
>
> All the documentation is here:
>
> I'm including three scripts using the library:
> * http-sitemap - Returns a list of URIs found. (Useful for target enum)
> * http-phpselfxss-scan - Returns a list of PHP files vulnerable to
> Cross Site Scripting via infecting the variable
> * http-email-harvest - Returns a list of the email accounts found on
> the web server.
>
> NSE scripts would start a crawling process and then get a list of
> URIs to be processed as the programmer wishes. For example, if we
> wanted to write a script to look for backup files, we could simply do:
>
>   httpspider.crawl(host, port)
>   local uris = httpspider.get_sitemap()
>   for _, uri in pairs(uris) do
>     local obj = http.get(uri .. ".bak")
>     if page_exists(obj and other params...) then
>       results[#results+1] = uri
>     end
>   end

I'll repeat others' sentiments that the crawler needs to work
incrementally, rather than building a whole site map in advance. In
other words, the code you wrote above should look like this instead:

for obj in httpspider.crawl(host, port) do
  if page_exists(obj) then
    results[#results+1] = obj.uri
  end
end

I'm attaching two scripts that show what I think spidering scripts
should look like. The first is a simple port of http-email-harvest. The
second, http-find, shows one reason why the spidering has to be
incremental: the script needs to stop after finding the first page that
matches a pattern.
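To make the early-termination point concrete, here is a toy sketch in plain Lua: the crawl iterator is built with coroutine.wrap and the fetches are faked, but the shape matches http-find. Once the loop breaks, no later page is ever generated (in the real crawler, no later request would ever be sent). Everything here is illustrative, not the library's API.

```lua
-- Stand-in incremental crawler: yields one "page" at a time. In the
-- real library each yield would follow an HTTP fetch, so breaking out
-- of the loop also stops further network traffic.
local function crawl(pages)
  return coroutine.wrap(function ()
    for _, page in ipairs(pages) do
      coroutine.yield(page)
    end
  end)
end

local pages = {
  { uri = "/index.html", body = "hello" },
  { uri = "/admin.html", body = "secret panel" },
  { uri = "/other.html", body = "never fetched" },
}

-- http-find style: stop at the first page matching a pattern.
local found, fetched = nil, 0
for page in crawl(pages) do
  fetched = fetched + 1
  if page.body:match("secret") then
    found = page.uri
    break
  end
end
```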

In http-email-harvest, I included a proposed object-oriented API sketch.
You would use the crawler like this:

crawler = httpspider.Crawler:new()
for response in crawler:crawl(host, port) do
  -- process each response as it arrives
end

http-find uses a convenience function httpspider.crawl, which will
create a Crawler with default options and run it.
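As a rough illustration of how that convenience function could relate to the OO API, here is a stubbed sketch. The option names (maxdepth, timelimit) and the canned pages are assumptions for illustration, not the settled httpspider API.

```lua
-- Stubbed sketch of the proposed OO API plus its convenience wrapper.
local Crawler = {}
Crawler.__index = Crawler

function Crawler:new(options)
  return setmetatable({ options = options or {} }, Crawler)
end

function Crawler:crawl(host, port)
  -- The real crawler would fetch pages from host:port; this stub
  -- just yields two canned pages through an iterator function.
  local pages = { { uri = "/" }, { uri = "/about" } }
  local i = 0
  return function ()
    i = i + 1
    return pages[i]
  end
end

-- The convenience function: default options in, iterator out.
local function crawl(host, port)
  return Crawler:new({ maxdepth = 3, timelimit = 60 }):crawl(host, port)
end

local n = 0
for response in crawl("scanme.example", 80) do
  n = n + 1
end
```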

The way I think this should work is, the crawler should only keep a
short queue of pages fetched in advance (maybe 10 of them). When you
call the crawler iterator, it will just return the first element of the
queue (and set a condition variable to allow another page to be
fetched), or else block until a page is available. We will rely on the
HTTP cache to prevent downloading a file twice, when there is more than
one crawler operating. Multiple crawlers shouldn't need to interact
with each other directly.
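A toy model of that bounded prefetch queue, in plain Lua: in NSE the producer would be a separate worker thread and the two wait points would go through a real condition variable (nmap.condvar); here a coroutine stands in for it, pausing whenever the queue holds QUEUE_MAX pages. All names are made up for the sketch.

```lua
-- Bounded prefetch queue: the producer fills the queue up to
-- QUEUE_MAX pages ahead; the iterator drains it, waking the
-- producer when room opens up.
local QUEUE_MAX = 10

local function make_iterator(fetch_next)
  local queue = {}
  local producer = coroutine.create(function ()
    for page in fetch_next do
      while #queue >= QUEUE_MAX do
        coroutine.yield()          -- queue full: wait for the consumer
      end
      queue[#queue + 1] = page
    end
  end)
  return function ()
    -- Run the producer until a page is queued or it has finished.
    while #queue == 0 and coroutine.status(producer) ~= "dead" do
      coroutine.resume(producer)
    end
    return table.remove(queue, 1)  -- nil ends the for-in loop
  end
end

-- Fake page source standing in for HTTP fetches.
local function fake_pages(n)
  local i = 0
  return function ()
    i = i + 1
    if i <= n then return { uri = "/page" .. i } end
  end
end

local seen = {}
for page in make_iterator(fake_pages(25)) do
  seen[#seen + 1] = page.uri
end
```

Because the queue is drained first-in first-out, pages come back in crawl order even though the producer runs in bursts.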

The default crawler would use a blacklist or whitelist of extensions,
and a default time limit. But you should be able to set your own options
before beginning the crawl. You should be able to control which URLs get
downloaded by setting a predicate function. And ideally, you should be
able to modify what pages get crawled during the crawl itself. What I
mean is, if you change the predicate function with set_url_filter, the
crawler will remove anything from the queue of downloaded pages that
doesn't match, and continue with the new options. And really great would
be a way to call crawler:stop_recursion_this_page() and have it stop
following links from the page most recently returned.
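The filter-swapping idea could look something like this sketch: changing the predicate also prunes pages already sitting in the queue. The names (set_url_filter, queue) are assumptions about the proposed API, not existing httpspider code.

```lua
-- Sketch of interactive filter control on a running crawl.
local Crawler = {}
Crawler.__index = Crawler

function Crawler:new()
  return setmetatable({
    queue = {},                              -- pages fetched in advance
    filter = function (uri) return true end, -- default: accept all
  }, Crawler)
end

function Crawler:set_url_filter(predicate)
  self.filter = predicate
  -- Drop queued pages the new predicate rejects; keep the rest,
  -- preserving their order, and continue crawling with the new filter.
  local kept = {}
  for _, page in ipairs(self.queue) do
    if predicate(page.uri) then kept[#kept + 1] = page end
  end
  self.queue = kept
end

local c = Crawler:new()
c.queue = { { uri = "/a.php" }, { uri = "/b.html" }, { uri = "/c.php" } }
c:set_url_filter(function (uri) return uri:match("%.php$") ~= nil end)
```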

In summary, iterative crawling is an essential feature. Interactive
control of a running crawl would be nice, and probably wouldn't be too
hard to add on top of iterativity.

David Fifield

Attachment: http-email-harvest.nse

Attachment: http-find.nse

Sent through the nmap-dev mailing list
Archived at http://seclists.org/nmap-dev/
