Nmap Development mailing list archives

Re: Web crawling library proposal
From: Patrik Karlsson <patrik () cqure net>
Date: Wed, 30 Nov 2011 21:23:01 +0100

On Sat, Nov 5, 2011 at 6:08 PM, David Fifield <david () bamsoftware com> wrote:

On Wed, Oct 19, 2011 at 12:25:19AM -0700, Paulino Calderon wrote:
Hi list,

I'm attaching my working copies of the web crawling library and a
few scripts that use it. It would be great if I could get some feedback.

All the documentation is here:

I'm including 3 scripts using the library:
* http-sitemap - Returns a list of URIs found (useful for target enumeration).
* http-phpselfxss-scan - Returns a list of PHP files vulnerable to
Cross-Site Scripting via injection into the variable
* http-email-harvest - Returns a list of the email accounts found on
the web server.
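
(For illustration only: the kind of extraction http-email-harvest performs can be sketched with a plain Lua pattern. The function name and the pattern below are rough approximations, not the script's actual code.)

```lua
-- Rough approximation of email harvesting; the pattern is simplistic
-- and harvest_emails is an invented name for illustration.
local function harvest_emails(body, results)
  for email in body:gmatch("[%w._%%+-]+@[%w.-]+%.%a%a+") do
    results[email] = true   -- table keys deduplicate repeated addresses
  end
end
```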

NSE scripts would start a crawling process and then get a list of
URIs to be processed as the programmer wishes. For example, if we
wanted to write a script to look for backup files, we could simply do:
  httpspider.crawl(host, port)
  local uris = httpspider.get_sitemap()
  local results = {}
  for _, uri in pairs(uris) do
    local obj = http.get(host, port, uri .. ".bak")
    if page_exists(obj) then  -- plus whatever other checks are needed
      results[#results+1] = uri
    end
  end

I'll repeat others' sentiments that the crawler needs to work
incrementally, rather than building a whole site map in advance. In
other words, the code you wrote above should look like this instead:

for obj in httpspider.crawl(host, port) do
  if page_exists(obj) then
    results[#results+1] = obj.uri
  end
end

I'm attaching two scripts that show what I think spidering scripts
should look like. The first is a simple port of http-email-harvest. The
second, http-find, shows one reason why the spidering has to be
incremental: the script needs to stop after finding the first page that
matches a pattern.
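
A sketch of that http-find loop under the incremental model (pattern, found, and the response fields are illustrative names, not the script's actual code):

```lua
-- With an incremental iterator, the crawl stops fetching as soon as
-- one page matches; a prebuilt sitemap would have fetched everything.
for response in httpspider.crawl(host, port) do
  if response.body and response.body:match(pattern) then
    found = response.url
    break  -- no further pages are fetched
  end
end
```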

In http-email-harvest, I included a proposed object-oriented API sketch.
You would use the crawler like this:

crawler = httpspider.Crawler:new()
for response in crawler:crawl(host, port) do
  -- process response
end

http-find uses a convenience function httpspider.crawl, which will
create a Crawler with default options and run it.

The way I think this should work is, the crawler should only keep a
short queue of pages fetched in advance (maybe 10 of them). When you
call the crawler iterator, it will just return the first element of the
queue (and set a condition variable to allow another page to be
fetched), or else block until a page is available. We will rely on the
HTTP cache to prevent downloading a file twice, when there is more than
one crawler operating. Multiple crawlers shouldn't need to interact.
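
A rough sketch of that bounded-prefetch design, assuming NSE's stdnse.new_thread and nmap.condvar facilities; Crawler, queue, done, and fetch_next_page are invented names for illustration, not a real implementation:

```lua
local MAX_QUEUE = 10  -- pages fetched in advance

-- Producer: runs in its own NSE worker thread, keeps the queue topped up.
local function worker(crawler)
  local condvar = nmap.condvar(crawler.queue)
  while true do
    while #crawler.queue >= MAX_QUEUE do
      condvar("wait")                  -- queue full; wait for the consumer
    end
    local response = fetch_next_page(crawler)  -- hypothetical helper
    if not response then
      crawler.done = true
      condvar("broadcast")             -- wake a consumer blocked on an empty queue
      break
    end
    table.insert(crawler.queue, response)
    condvar("signal")
  end
end

-- Consumer side: the iterator handed back to the script.
function Crawler:crawl(host, port)
  stdnse.new_thread(worker, self)
  local condvar = nmap.condvar(self.queue)
  return function()
    while #self.queue == 0 and not self.done do
      condvar("wait")                  -- block until a page is available
    end
    local response = table.remove(self.queue, 1)
    condvar("signal")                  -- let the worker fetch another page
    return response                    -- nil once the crawl is finished
  end
end
```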

The default crawler would use a blacklist or whitelist of extensions,
and a default time limit. But you should be able to set your own options
before beginning the crawl. You should be able to control which URLs get
downloaded by setting a predicate function. And ideally, you should be
able to modify what pages get crawled during the crawl itself. What I
mean is, if you change the predicate function with set_url_filter, the
crawler will remove anything from the queue of downloaded pages that
doesn't match, and continue with the new options. And really great would
be a way to call crawler:stop_recursion_this_page() and have it stop
following links from the page most recently returned.
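
The set_url_filter part of that might look roughly like this (a sketch only; url_filter, queue, and the response shape are invented for illustration):

```lua
-- Install a new predicate and prune queued pages it no longer accepts.
function Crawler:set_url_filter(predicate)
  self.url_filter = predicate
  for i = #self.queue, 1, -1 do
    if not predicate(self.queue[i].url) then
      table.remove(self.queue, i)
    end
  end
end
```

So, for example, a script could narrow a running crawl to PHP pages with crawler:set_url_filter(function(url) return url:match("%.php$") end).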

In summary, iterative crawling is an essential feature. Interactive
control of a running crawl would be nice, and probably wouldn't be too
hard to add on top of iterativity.

David Fifield

Sent through the nmap-dev mailing list
Archived at http://seclists.org/nmap-dev/

I was experimenting a little with this today but ran into some problems.
If I run http.get from within a closure I get the following error message:
"attempt to yield across metamethod/C-call boundary"
The same happens if I simply try to connect a socket. Have you had any
success with this, Paulino? Anyone else have any ideas on how to get around
this? The attached script can be used to trigger the error.
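
For reference, this error can be reproduced in plain Lua 5.1 (the version NSE was built on): the VM invokes metamethods through C, so a coroutine.yield inside one crosses a C-call boundary. A minimal example:

```lua
local co = coroutine.create(function()
  local t = setmetatable({}, {
    __index = function()
      coroutine.yield()  -- yields while a C frame (the metamethod call) is active
    end,
  })
  return t.anything      -- triggers __index, which tries to yield
end)
print(coroutine.resume(co))
-- Lua 5.1: false   attempt to yield across metamethod/C-call boundary
```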

Patrik Karlsson

Attachment: test-iter-sock.nse

