Re: Looking for a good web spider
From: Robin Wood <robin () digininja org>
Date: Sat, 25 Sep 2010 14:46:23 +0200
On 25 September 2010 02:46, Adrian Crenshaw <irongeek () irongeek com> wrote:
I'm looking at some of the tools in BT4R1, and will be looking at what
Samurai WTF has to offer once I finish downloading the latest version. I'm
looking for some sort of spider that lets me do the following:
1. Follow every link on a page, even onto other domains, as long as the
top-level domain is the same (edu, com, cn, whatever).
2. For every page it visits, collect the file names of all resources.
3. Grab the response headers so I can see the server version.
4. Grab the robots.txt if possible.
Any ideas on the best tool for the job, or do I need to roll my own?
If you want to roll your own, you can take my CeWL code and look at the
spider. It does a full spider, checks whether each link is on the same
site or off it, and grabs all the documents, so you should easily be
able to modify it to do what you want.
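
For anyone who does end up rolling their own, here is a rough sketch of the four requirements above. It is not CeWL's code (CeWL is written in Ruby); it is a Python standard-library illustration, and every name in it (same_tld, robots_url, LinkParser, file_name, fetch, crawl, the "sketch-spider" user agent) is made up for the example. It treats the last label of the host name as the top-level domain, so it would need more work for suffixes like co.uk.

```python
import posixpath
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit
from urllib.request import Request, urlopen


def tld(url):
    """Return the last label of the host name ('edu', 'com', ...), or ''."""
    try:
        host = urlsplit(url).hostname or ""
    except ValueError:  # malformed URL
        return ""
    return host.rsplit(".", 1)[-1].lower()


def same_tld(url_a, url_b):
    """Requirement 1: True when both URLs share a top-level domain."""
    return tld(url_a) != "" and tld(url_a) == tld(url_b)


def robots_url(url):
    """Requirement 4: build the robots.txt URL for the host serving `url`."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def file_name(url):
    """Requirement 2: last path segment of a URL, ignoring the query string."""
    return posixpath.basename(urlsplit(url).path)


class LinkParser(HTMLParser):
    """Collect absolute URLs from every href and src attribute on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # <a href> targets, candidates to follow
        self.resources = []  # everything referenced via href or src

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name not in ("href", "src") or not value:
                continue
            try:
                absolute = urljoin(self.base_url, value)
            except ValueError:  # skip malformed references
                continue
            self.resources.append(absolute)
            if tag == "a":
                self.links.append(absolute)


def fetch(url):
    """Fetch a URL and return (Server header or None, decoded body)."""
    req = Request(url, headers={"User-Agent": "sketch-spider"})
    with urlopen(req, timeout=10) as resp:
        body = resp.read().decode("utf-8", "replace")
        return resp.headers.get("Server"), body  # requirement 3


def crawl(start_url, max_pages=50):
    """Breadth-first crawl that follows links only within the start TLD."""
    seen, queue = {start_url}, deque([start_url])
    pages, robots = {}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        host = urlsplit(url).netloc
        if host not in robots:  # fetch robots.txt once per host
            try:
                robots[host] = fetch(robots_url(url))[1]
            except (OSError, ValueError):
                robots[host] = None
        try:
            server, body = fetch(url)
        except (OSError, ValueError):
            continue  # skip pages that fail to load
        parser = LinkParser(url)
        parser.feed(body)
        names = {file_name(r) for r in parser.resources} - {""}
        pages[url] = {"server": server, "files": sorted(names)}
        for link in parser.links:
            if link not in seen and same_tld(start_url, link):
                seen.add(link)
                queue.append(link)
    return pages, robots
```

Calling crawl("http://example.edu/") would return one dictionary of per-page server headers and file names and another of robots.txt bodies keyed by host. The max_pages cap is there because following links across a whole top-level domain will otherwise never finish.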
Pauldotcom mailing list
Pauldotcom () mail pauldotcom com
Main Web Site: http://pauldotcom.com