Re: Proposed improvements on httpspider.lua
From: George Chatzisofroniou <sophron () latthi com>
Date: Sat, 13 Jul 2013 23:41:05 +0300
On Mon, Jun 24, 2013 at 02:00:18PM +0300, George Chatzisofroniou wrote:
There are two different operations that need approval:
spidering - If this is true, the crawler will return the resource.
scraping - If this is true, the crawler will scrape the resource for any links.
Progress has been slow on this issue because Patrick and I were trying to
design a callback API. Eventually, things got too complicated, so we took a step
back and I implemented a single callback mechanism.
To achieve backwards compatibility, the default callbacks implement the current
behavior. The user may use, as always, the boolean options (withinhost and
withindomain) to adjust the crawler's behaviour for spidering.
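One way to picture how a boolean option and a callback can share the same option
name is the small sketch below. It is only an illustration under my own
assumptions: as_callback and default_check are hypothetical names, not functions
from httpspider, and the library may do this differently.

```lua
-- Hypothetical helper, NOT part of httpspider: turn an option that may be
-- a boolean or a function into a function that can always be called.
local function as_callback(option, default_check)
  if type(option) == "function" then
    return option                      -- user-supplied callback wins
  elseif option then
    return default_check               -- boolean true: keep the built-in check
  else
    return function() return true end  -- false/nil: this option restricts nothing
  end
end

-- Example: a made-up default check that only accepts one host.
local function default_withinhost(url)
  return url:find("^http://example%.com/") ~= nil
end

local check = as_callback(true, default_withinhost)
print(check("http://example.com/index.html"))
```

With this shape, scripts that still pass true or false keep working, while newer
scripts can pass a function instead.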
But for any advanced case, the user can choose to replace the default behaviour
of the withinhost and withindomain callbacks. New callbacks are easy to define
using some utility functions provided by the httpspider library.
For example, consider the following sample code:
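    crawler.options.withinhost = function(url)
      -- allow only URLs on the target host that are not .js or .css files
      if crawler:iswithinhost(url)
      and not crawler:isresource(url, "js")
      and not crawler:isresource(url, "css") then
        return true
      end
    end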
In this example, we override the default withinhost callback and allow spidering
only on resources within the host that are not "js" or "css". We make use of
two utility functions (iswithinhost and isresource).
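To try the predicate logic without running a scan, a stand-alone sketch like the
following may help. Note that the crawler table here is a mock: its iswithinhost
and isresource are simplified stand-ins that work on plain strings (an
assumption; the real library passes its own URL type), and example.com is a
made-up host.

```lua
-- Mock crawler for illustration only; this is NOT the httpspider Crawler.
local crawler = { host = "example.com", options = {} }

-- Assumed behaviour: true when the URL's host part equals the crawler's host.
function crawler:iswithinhost(url)
  local host = url:match("^%a+://([^/:]+)")
  return host == self.host
end

-- Assumed behaviour: true when the URL, minus query/fragment, ends in ".<ext>".
function crawler:isresource(url, ext)
  local path = url:gsub("[?#].*$", "")
  return path:sub(-(#ext + 1)) == "." .. ext
end

-- The override from the example: same host, but neither "js" nor "css".
crawler.options.withinhost = function(url)
  if crawler:iswithinhost(url)
  and not crawler:isresource(url, "js")
  and not crawler:isresource(url, "css") then
    return true
  end
end

for _, url in ipairs({
  "http://example.com/index.html",
  "http://example.com/app.js",
  "http://other.org/index.html",
}) do
  print(url, crawler.options.withinhost(url) == true)
end
```

The callback returns true for accepted URLs and nothing (nil) otherwise, so
callers should treat any non-true result as a rejection.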
There is also a doscraping callback that can be replaced to adjust the crawler's
behaviour for scraping.
A full working example is my http-referer-checker script.
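The difference between the two decisions can be shown with a toy model. This is
not the httpspider implementation: the site table, the crawl loop and the
ends_with helper are all made up for illustration, and only the option names
withinhost and doscraping come from the library.

```lua
-- Toy model of the two decisions; NOT the httpspider implementation.
-- `site` is a made-up in-memory map from a URL to the links on that page.
local site = {
  ["http://example.com/"]       = { "http://example.com/about", "http://example.com/app.js" },
  ["http://example.com/about"]  = { "http://other.org/" },
  ["http://example.com/app.js"] = { "http://example.com/hidden" },
}

local function ends_with(s, suffix)
  return s:sub(-#suffix) == suffix
end

local options = {}

-- Spidering decision: should this resource be returned to the script?
options.withinhost = function(url)
  return url:find("^http://example%.com/") ~= nil
end

-- Scraping decision: should this resource be parsed for further links?
options.doscraping = function(url)
  return options.withinhost(url) and not ends_with(url, ".js")
end

local function crawl(start)
  local queue, seen, returned = { start }, { [start] = true }, {}
  while #queue > 0 do
    local url = table.remove(queue, 1)
    if options.withinhost(url) then
      returned[#returned + 1] = url
    end
    if options.doscraping(url) then
      for _, link in ipairs(site[url] or {}) do
        if not seen[link] then
          seen[link] = true
          queue[#queue + 1] = link
        end
      end
    end
  end
  return returned
end

local result = crawl("http://example.com/")
for _, url in ipairs(result) do print(url) end
```

Here app.js is still returned because it passes the spidering check, but it is
never scraped, so the link it contains is not discovered.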
Apart from this feature, I made some more changes, like fixing a syntax mistake,
replacing annoying tabs with spaces and rewriting some parts.
I've attached the upgraded httpspider. If you have developed a script that
makes use of this library, please use the upgraded version and let me know if it
still works as expected. I've done some tests, but I may have missed something.
Sent through the dev mailing list
Archived at http://seclists.org/nmap-dev/