RE: Where to get spam?
From: "Weir, Jason" <jason.weir () nhrs org>
Date: Tue, 20 Feb 2007 15:30:31 -0500
I agree 100% - I've been using ASSP for 4 years and continually watch
SPAM change on a monthly basis. My corpus has a large diverse SPAM
collection, some of the messages going back to the day I put the server
Look to ASSP users for your spam needs; every one of us has lots of SPAM
to donate and if we are doing our jobs it will be void of false
From: listbounce () securityfocus com [mailto:listbounce () securityfocus com]
On Behalf Of Micheal Espinola Jr
Sent: Monday, February 19, 2007 10:33 PM
To: security-basics () lists securityfocus com
Subject: Re: Where to get spam?
That's where signatures and heuristics fail, and a properly tuned
(truly randomized) Bayesian database succeeds.
For instance: The anti-spam product ASSP keeps two directories of
saved emails: ham and spam. These collections are referred to as the
corpuses. All messages, ham and spam, are saved into the corpuses. On
a scheduled basis, the corpuses are trimmed to a set maximum number of
messages with which to [re]build the Bayesian database.
BUT to prevent intentional spammer abuse, the cleanup of the corpus is
randomized. Deletion of corpus messages is not based on date, thus
leaving in the corpus messages that can be years old. This also helps
prevent newer waves of spam from intentionally skewing the Bayesian
ham/spam word tables.
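
The randomized cleanup described above can be sketched roughly as follows.
This is only an illustration of the idea, not ASSP's actual code: the
function name pick_messages_to_delete and its parameters are hypothetical,
and the caller is assumed to do the real file deletion.

```python
import random

def pick_messages_to_delete(message_names, max_messages, rng=None):
    """Return the messages to drop so that at most max_messages remain.

    Victims are chosen uniformly at random, never by date, so old
    messages have the same chance of surviving as new ones.
    """
    rng = rng or random.Random()
    excess = len(message_names) - max_messages
    if excess <= 0:
        return []  # corpus is already within its limit
    # sample() picks distinct items without replacement
    return rng.sample(list(message_names), excess)

# Hypothetical corpus of 10 saved messages, trimmed to a maximum of 4.
corpus = [f"msg{i}.eml" for i in range(10)]
doomed = pick_messages_to_delete(corpus, 4)
survivors = [name for name in corpus if name not in set(doomed)]
```

Because the selection ignores message age, a sudden flood of new spam
cannot push all the older messages out of the corpus.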
You might think that this poses a problem of the database becoming
"stale" to newer forms of spam, but this simply is not the case as
I would never claim it's perfect, because I don't believe any anti-spam
product is, but anything that does make it through is quickly
corrected and compensated for once the user that received the
false-negative spam properly reports it back to ASSP.
ASSP is able to manage this appropriately by maintaining ~18,000
messages in each corpus.
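
To make the correction step concrete, here is a minimal naive Bayes
sketch of word tables built from a ham corpus and a spam corpus. It is
a generic textbook approach under assumed equal priors and add-one
smoothing, not ASSP's implementation; all function names and the tiny
sample corpora are made up for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split a message into lowercase word tokens."""
    return re.findall(r"[a-z0-9$']+", text.lower())

def build_table(messages):
    """Count, for each word, how many messages contain it."""
    table = Counter()
    for msg in messages:
        table.update(set(tokenize(msg)))
    return table

def spam_probability(text, ham_msgs, spam_msgs):
    """Estimate P(spam | text), assuming equal priors (an assumption)."""
    ham_table, spam_table = build_table(ham_msgs), build_table(spam_msgs)
    n_ham, n_spam = len(ham_msgs), len(spam_msgs)
    log_odds = 0.0
    for word in set(tokenize(text)):
        # Add-one smoothing keeps unseen words from zeroing the estimate.
        p_word_spam = (spam_table[word] + 1) / (n_spam + 2)
        p_word_ham = (ham_table[word] + 1) / (n_ham + 2)
        log_odds += math.log(p_word_spam / p_word_ham)
    # Numerically safe logistic function.
    if log_odds >= 0:
        return 1 / (1 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1 + e)

# Hypothetical corpora; a real corpus would hold thousands of messages.
ham = ["meeting agenda for tuesday", "lunch on friday?"]
spam = ["cheap pills online", "win money now"]
missed = "hot stock tip buy now"

before = spam_probability(missed, ham, spam)
spam.append(missed)  # the user reports the false negative
after = spam_probability(missed, ham, spam)
```

After the missed message is added to the spam corpus and the tables are
rebuilt, its score rises, which is the kind of compensation described
above for reported false negatives.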
I can't speak for other products, but this illustrates where the
Bayesian aspects of ASSP prevail over other products that rely more
heavily on signatures and heuristics.
On 2/19/07, Mark Teicher <mht3 () earthlink net> wrote:
This is a very interesting question. Why do you need spam from
2006/2007? SPAM TTL is <24, and most SPAM engines will not detect SPAM > 30
days old. I have researched this problem over a long period of
time. Most anti-spam products out there will have issues detecting any
type of spam over 2 weeks old, since keeping signature/heuristic bases
that huge will slow down the performance of the product, which is an
interesting question in and of itself. Why..
You are better off working with a university or local school that
retains its mail for some period of time.
At 04:16 PM 2/17/2007, secbasics () dusty ece cmu edu wrote:
That was almost perfect. Unfortunately, since I am correlating spam
data against other traffic types, I need the spam to be from 2006/2007,
and the most recent one there is 2005.
Thanks anyway though.
On Sat, Feb 17, 2007 at 01:23:39PM +1100, David West wrote:
Try the SpamAssassin public mail corpus..
On 2/16/07, secbasics () dusty ece cmu edu
<secbasics () dusty ece cmu edu> wrote:
Does anyone know of organizations which give away spam captures?
Obviously I will get lots of spam just from posting on this list
I would like to
get more to analyze. It seems like every couple of months some student
project requires spam, but they always have to start from
there anywhere which gives spam to security researchers?