Re: Web crawling library proposal
From: Paulino Calderon <paulino () calderonpale com>
Date: Wed, 19 Oct 2011 16:15:09 -0700
On 10/19/2011 01:38 PM, Patrick Donnelly wrote:
On Wed, Oct 19, 2011 at 6:17 PM, Paulino Calderon
<paulino () calderonpale com> wrote:
On 10/19/2011 12:45 PM, Patrick Donnelly wrote:
On Wed, Oct 19, 2011 at 3:25 AM, Paulino Calderon
<paulino () calderonpale com> wrote:
I'm attaching my working copies of the web crawling library and a few
scripts that use it. It would be great if I could get some feedback.
For the library itself:
o I'm not convinced a Queue implementation is necessary. I'd prefer
just using table.insert/table.remove until evidence is presented that it
is a performance bottleneck.
I thought table.insert adds the element at the end and table.remove removes
the last element. The purpose of this implementation was to use a FIFO
mechanism (Oldest item inserted gets removed first). I'll look into it to
see if I can use these standard table methods.
table.insert(t, 1, value) --> inserts at "front" of array, pushing
other elements back
table.remove(t, 1) --> removes first element of array, moving other
elements forward
You want the last one for a Queue. See the Lua Manual.
Thank you! I'll get rid of the queue after making sure insertion runs
in constant time using this method.
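A minimal sketch of the FIFO usage discussed above, using only the standard table functions. The URLs and the helper names (new_queue, push, pop) are illustrative assumptions, not part of the proposed library. Note that appending with table.insert(t, value) is constant time, while table.remove(t, 1) shifts the remaining elements and is therefore linear in the queue length; the second half shows one common alternative that keeps head and tail indices instead.

```lua
-- FIFO queue with the standard table functions only.
local queue = {}

-- Enqueue: append at the end of the array.
table.insert(queue, "http://example.com/")
table.insert(queue, "http://example.com/a")
table.insert(queue, "http://example.com/b")

-- Dequeue: remove from the front; remaining elements shift down by one.
local first = table.remove(queue, 1)
local second = table.remove(queue, 1)

-- Alternative (hypothetical helpers): a two-index queue that avoids
-- shifting elements on every dequeue.
local function new_queue()
  return { head = 1, tail = 0, items = {} }
end

local function push(q, value)
  q.tail = q.tail + 1
  q.items[q.tail] = value
end

local function pop(q)
  if q.head > q.tail then
    return nil -- queue is empty
  end
  local value = q.items[q.head]
  q.items[q.head] = nil -- release the slot
  q.head = q.head + 1
  return value
end

local q = new_queue()
push(q, "first")
push(q, "second")
local popped = pop(q)
```

For small queues the plain table.insert/table.remove version is usually sufficient; the indexed variant only matters if profiling shows the dequeue step to be a cost.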
o Libraries should not use the registry. Provide an interface to
access private data instead.
The registry was used to keep a record of state between multiple instances
of the library. For example, if I run scripts A and B, script B checks whether
there is a crawler already running by checking the registry entry created
first by script A. I did not find another way of knowing that the library
had already been called, to avoid running multiple crawlers. How can I use a
private data interface to know when there is a copy of the library running?
There are many solutions but it depends on what you want to do. If you
want completely separate crawlers with no sharing, the most common
solution is to wrap everything in a table that the script holds onto.
If you want to share some data between crawlers, you can use a table
that is local to the library and thus accessible by all the crawlers
that are spawned.
I need to play around with these solutions a little bit more.
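The two approaches described above could be sketched roughly as follows. This is an illustration only: the module table M, the Crawler object, and the functions new, is_running, and finish are hypothetical names, not the interface of the attached library. Each crawler keeps its own state in a table the script holds onto, while a table local to the library is shared by every crawler it creates.

```lua
-- Hypothetical library module; all names here are illustrative.
local M = {}

-- Local to the library: shared by every crawler created through M.new,
-- but not reachable by scripts except through the functions below.
local shared = { running = {} }

local Crawler = {}
Crawler.__index = Crawler

-- Create an independent crawler; its queue and visited set are private
-- to the returned table.
function M.new(host)
  local self = setmetatable({ host = host, queue = {}, visited = {} }, Crawler)
  shared.running[host] = self
  return self
end

-- Interface for asking whether a crawler for this host already exists.
function M.is_running(host)
  return shared.running[host] ~= nil
end

-- Mark the crawler as done and drop it from the shared table.
function Crawler:finish()
  shared.running[self.host] = nil
end

-- Usage: two crawlers with separate queues, both visible to the library.
local a = M.new("10.0.0.1")
local b = M.new("10.0.0.2")
table.insert(a.queue, "/index.html")
```

A second script can call M.is_running(host) before starting its own crawl, which covers the case the registry entry was used for.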
You are correct. I will fix this limitation/bug and allow multiple
crawlers if multiple hosts are being scanned. There is definitely an
o is_url_absolute should anchor the pattern search to the beginning of the
string.
o Make get_sitemap return an iterator instead of a table of results.
o Does get_sitemap return the URI for every site that's been crawled?
Shouldn't it return what we requested it to crawl? It would appear if
two scripts try to crawl at the same time, bad things happen with the
global queue structures (among other things)
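As an illustration of the first two suggestions in the list above, here is a rough sketch. The function bodies are assumptions and do not reproduce the attached library: is_url_absolute below uses a pattern starting with "^" so that a scheme appearing later in the string (for example inside a query parameter) does not count, and sitemap_iter is a hypothetical stand-in showing how results could be handed out through an iterator instead of a table.

```lua
-- Hypothetical version of is_url_absolute: "^" anchors the match at the
-- start of the string, so only a leading scheme such as "http://" matches.
local function is_url_absolute(url)
  return string.find(url, "^%a[%w+.%-]*://") ~= nil
end

-- The same pattern without the anchor, for comparison.
local function unanchored(url)
  return string.find(url, "%a[%w+.%-]*://") ~= nil
end

-- Hypothetical iterator over crawl results: returns one URL per call
-- and nil when the list is exhausted, which ends a generic for loop.
local function sitemap_iter(urls)
  local i = 0
  return function()
    i = i + 1
    return urls[i]
  end
end

-- Usage: walk the results without exposing the underlying table.
local seen = {}
for url in sitemap_iter({ "http://example.com/", "http://example.com/about" }) do
  table.insert(seen, url)
end
```

With the unanchored pattern, a relative link such as "/redirect?to=http://example.org/" would be classified as absolute, which is the case the anchor is meant to rule out.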
It does return the list of URIs we requested. When two scripts are running
at the same time, only one of them runs the crawler and the other one waits
for the results. That means that even when running multiple scripts the web
crawler only runs once. The disadvantage of this is that we can't have
different "crawling profiles" for different scripts right now. It's a simple
thing to add but during my tests I found no use for it.
Still, artificially limiting the parallelism by design should be
avoided. If the user wants to crawl multiple sites simultaneously then
we shouldn't stop them. For example, consider an enterprise
environment where most of the internal network is a collection of web
portals accessible by the employees. A network administrator will want
to crawl all of these web servers simultaneously.
Also, you will want to consider adding HTTP pipelining to further
improve the efficiency of the library.
The reason I'm not using pipelining is its lack of a cache.
http.pipeline does not have a cache system right now and I thought the
library would benefit from using the cache.
Did you have a chance to try out the scripts and library? Thank you for the
I haven't yet, sorry. I'm merely commenting on some design problems I see.
Thanks again for your valuable feedback!
Paulino Calderón Pale
Sent through the nmap-dev mailing list
Archived at http://seclists.org/nmap-dev/