Firewall Wizards mailing list archives

parsing logs ultra-fast inline
From: "Marcus J. Ranum" <mjr () ranum com>
Date: Wed, 01 Feb 2006 16:03:38 -0500

OK, based on some offline discussion with a few people about
doing "large amounts" of system log processing inline at high
speeds, I thought I'd post a few code fragments and some ideas
that have worked for me in the past.

First off, if you want to handle truly ginormous amounts of log
data quickly, you need to build a structure wherein you're making
decisions quickly at a broad level, then "drilling down" based on
the results of the decision. This allows you to parallelize infinitely
because all you do is make the first branch in your decision tree
"stripe" across all your analysis engines. So, hypothetically,
let's say we're handling typical UNIX syslogs at a ginormous
volume: we might have one engine (CPU/process or even a
separate box/bus/backplane/CPU/drive array) responsible for
(sendmail | named) and another one responsible for (apache | imapd)
etc. If you put some simple counters in your analysis routines
(hits versus misses) you can "load balance" your first tree-branch
appropriately using a flat percentage. Also, remember, if you
standardize your processing, it doesn't matter where it happens;
it can happen at the edge/source or back in a central location
or any combination of the two. Simply design your analysis as
a process that always has these 3 stages:
- weeding out and counting instances of uninteresting events
- selecting, parsing sub-fields of, and processing interesting events
- retaining events that fell through the first two steps as "unusual"
The results of these 3 stages are
- a set of event-IDs and counts
- a set of event-IDs and interesting fields and counts
- residual data in raw form
Back-haul the event-IDs, counts, and fields and graph them or stuff
them into a database, and bring the residual data to the attention
of a human being.
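
For concreteness, here's a minimal sketch of that 3-stage skeleton
in C. The "-- MARK --" and sendmail tests are just stand-ins for
whatever your site's stage-1 and stage-2 rules turn out to be:

#include        <stdio.h>
#include        <string.h>

int
main(void)
{
        char    in[BUFSIZ * 8];
        long    n_boring = 0;           /* stage 1: count, don't keep */

        while(fgets(in,sizeof(in),stdin) != NULL) {
                /* stage 1: weed out and count uninteresting events */
                if(strstr(in,"-- MARK --") != NULL) {
                        n_boring++;
                        continue;
                }

                /* stage 2: select and sub-parse interesting events */
                if(strstr(in,"sendmail[") != NULL) {
                        /* pull out sender/recipient/size, bump counters */
                        continue;
                }

                /* stage 3: whatever fell through is "unusual" - keep raw */
                fputs(in,stderr);
        }
        fprintf(stderr,"uninteresting events: %ld\n",n_boring);
        return(0);
}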

I suppose if you needed to you could implement a "log load
balancer" in the form of a box that had N interfaces that
collected a fat stream of log data, ran a simple program
that sorted the stream into 1/N sub-streams and forwarded
them to backend engines for more involved processing. You
could scale your logging architecture to very very large
loads this way. It works for Google and it'd work for you, too.
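
If you wanted to prototype that splitter, something like this does
the 1/N sort: hash the hostname so a given host always lands on the
same backend. The per-backend spool files here are just a stand-in
for whatever forwarding you'd actually do (syslog relay, TCP, etc.):

#include        <stdio.h>
#include        <string.h>

#define NBACK           4               /* number of backend engines */
#define B_DATEOFF       16              /* end of the date/time stamp */

int
main(void)
{
        char    in[BUFSIZ * 8];
        char    b_ho[256];
        char    path[64];
        FILE    *out[NBACK];
        unsigned long   h;
        unsigned char   *cp;
        int     i;

        for(i = 0; i < NBACK; i++) {
                snprintf(path,sizeof(path),"spool.%d",i);
                if((out[i] = fopen(path,"a")) == NULL)
                        return(1);
        }

        while(fgets(in,sizeof(in),stdin) != NULL) {
                /* hostname is the first token after the timestamp */
                if(sscanf(in + B_DATEOFF,"%255s",b_ho) != 1)
                        continue;

                /* cheap string hash: same host -> same backend */
                for(h = 0, cp = (unsigned char *)b_ho; *cp != '\0'; cp++)
                        h = h * 31 + *cp;

                fputs(in,out[h % NBACK]);
        }
        for(i = 0; i < NBACK; i++)
                fclose(out[i]);
        return(0);
}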

The first phase of processing is to stripe across engines if
necessary, then inside each engine you stripe the processing
into functional sub-parsers that deal with a given message
format. The implementation is language-irrelevant, though your
language choice will affect performance. Typically you write a
main loop that looks like:
while ( get a message ) {
        if(message is a sendmail message)
                parse sendmail message
        if(message is an imap message)
                parse imap message
        ...
}

Once your system has run on a sample dataset you will be able
to determine which messages come most frequently and you can
put that test at the top of the loop. This can result in an enormous
performance boost.
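
The counters are trivial to add - something like this, where the
strstr() tests and the commented-out parse calls are placeholders;
dump the counts at EOF and you know which test deserves to be first:

#include        <stdio.h>
#include        <string.h>

int
main(void)
{
        char    in[BUFSIZ * 8];
        long    n_sendmail = 0, n_imapd = 0, n_other = 0;

        while(fgets(in,sizeof(in),stdin) != NULL) {
                if(strstr(in," sendmail[") != NULL) {
                        n_sendmail++;   /* parse_sendmail(in); */
                        continue;
                }
                if(strstr(in," imapd[") != NULL) {
                        n_imapd++;      /* parse_imapd(in); */
                        continue;
                }
                n_other++;
        }
        fprintf(stderr,"sendmail=%ld imapd=%ld other=%ld\n",
                n_sendmail,n_imapd,n_other);
        return(0);
}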

Each sub-parse routine follows the same structure as the
main loop, performing a sorted series of checks to sub-parse
the fields of the message-specific formats. I.e.:

parse sendmail message( )  {
        if(message is a stat=sent message) {
                pull out recipient;
                pull out sender;
                increment message sent count;
                add message size to sender score;
                done
        }
        if(message is a stat=retry message) {
                ignore; //done
        }
        if(message is a whatever) {
                whatever;
                done
        }
        
        // if we fell through to here we have a new message structure
        // we have never seen before!!
        vector message to interesting folder;
}
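
In C that sub-parser might look roughly like the fragment below.
The format string is a guess at one common sendmail layout
(queue-id, then "to=<rcpt>, ... stat=Sent"); sendmail's fields vary
by version and configuration, so treat it as illustrative, not
gospel. msg is assumed to point just past the "sendmail[pid]:" tag:

#include        <stdio.h>
#include        <string.h>

static long     sent_count;             /* messages seen with stat=Sent */

void
parse_sendmail(char *msg)
{
        char    rcpt[256];

        /* stat=Sent: pull out the recipient, bump the sent counter */
        if(strstr(msg,"stat=Sent") != NULL &&
            sscanf(msg,"%*s to=<%255[^>]>",rcpt) == 1) {
                sent_count++;
                printf("sent to %s\n",rcpt);
                return;
        }

        /* retries are noise - count them or just drop them */
        if(strstr(msg,"stat=Deferred") != NULL)
                return;

        /* fell through: a message structure we have never seen before! */
        fputs(msg,stderr);
}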

Variant messages are a massive pain in the butt; you need to decide
whether to deal with variants as separate cases or to make the
sub-parser smarter in order to deal with them. This is one of the
reasons I keep saying that system log analysis is highly site
specific! If your site doesn't get system logs from a DECStation
5000 running ULTRIX 3.1D then you don't need to parse that data.
Indeed, if you build your parse tree around that notion and you
suddenly start getting ULTRIX-format log records in your data
stream, that'd be - shall we say - interesting, and you'd want to
know about it. I remember when I was looking at some log data at one
site (Hi Abe!) we found one log message that was about 400K long
on a single line. It appeared that a fairly crucial piece of software
decided to spew a chunk of its memory in a log message, for no
apparent reason. This is what I mean by "interesting."

A philosophical digression: Why not use regexps? Other than the
fact that they look like modem line noise (kids these days probably
don't even know what that looks like, I bet... arrgh!) it's hard to look
quickly at a regexp and a string and tell if it'll match. So most of
the log processing approaches that I've seen resort to having a
loop like:
while( get a message) {
        if(regexp #1 matches) {
                sub_parse format #1;
        } else
        if(regexp #2 matches) {
                ...
}

What you wind up doing is parsing the string twice or, worse,
having loads of regexps in the main loop, which makes your
parse tree too broad to execute quickly. The worst case of this
is really really ugly because you wind up executing every
regexp in your pattern-set against every line of the input,
which can suck quite badly if your pattern-set is large.
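
For illustration, here's that loop with POSIX regex(3) - every line
pays for up to NPAT regexec() calls, which is exactly the cost I'm
complaining about. The patterns are made up:

#include        <stdio.h>
#include        <regex.h>

#define NPAT    2

int
main(void)
{
        static const char *pat[NPAT] = {
                "sendmail\\[[0-9]+\\]: .* stat=Sent",
                "imapd\\[[0-9]+\\]: Login user=",
        };
        regex_t re[NPAT];
        char    in[BUFSIZ * 8];
        int     i;

        for(i = 0; i < NPAT; i++)
                if(regcomp(&re[i],pat[i],REG_EXTENDED|REG_NOSUB) != 0)
                        return(1);

        while(fgets(in,sizeof(in),stdin) != NULL) {
                for(i = 0; i < NPAT; i++) {
                        if(regexec(&re[i],in,0,NULL,0) == 0) {
                                /* sub_parse format #i */
                                break;
                        }
                }
        }
        return(0);
}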

If you have to use regexps for this, use a pre-parse switch, i.e.:

while( get a message) {
        if(message begins with "sendmail") {
                if(regexp #1 matches) {
                        sub_parse format #1;
                } else
                ...
        }
        ...
}

I think that's pretty pointless, though, since by the time you've
done the first match you can just jump straight into the sub-parse
and forget the regexp entirely.

So here's a code-fragment of evil that demonstrates this kind of
technique.
------------------------
#include        <stdio.h>
#include        <string.h>              /* for strcmp() */

#define B_DATEOFF       16              /* the end of the date/time stamp */

int
main(void)
{
        char    in[BUFSIZ * 8];

        /* hypothetical fields we want to skin out of the messages */
        char    b_ho[256];              /* hostname */
        char    b_prog[256];            /* program name */
        int     b_nstore1;              /* stash a number */
        char    b_cstore1[256];         /* stash a char */

        int     p;                      /* offset */
        int     r;

        while(fgets(in,sizeof(in),stdin)!= NULL) {
                /* pluck the program name from the message */
                /* magic here is that p stores the match offset into the string */
                if((r = sscanf(in + B_DATEOFF,"%s %s%n",b_ho,b_prog,&p)) != 2)
                        continue;

                /* branch on the program name - skip anything not from /bsd: */
                if(strcmp(b_prog,"/bsd:") != 0)
                        continue;

                /* -------------------------------------------------------
                from here on demux on next term(s) with individual matches

                you should match with the case that is most likely to
                occur first, then the second most likely, etc.
                */

                /* RESIDUAL:  dkcsum: wd0 matched BIOS disk 80 */
                r = sscanf(in + B_DATEOFF + p," dkcsum: %s matched BIOS disk %d",b_cstore1,&b_nstore1);
                if(r == 2) {
                        printf("found BIOS disk b_cstore1=%s b_nstore1=%d\n",b_cstore1,b_nstore1);
                        continue;
                }


                /* RESIDUAL:  root on wd0a */
                r = sscanf(in + B_DATEOFF + p," root on %s",b_cstore1);
                if(r == 1) {
                        printf("found root on b_cstore1=%s\n",b_cstore1);
                        continue;
                }
        }
        return(0);
}

You can see how this structure could be nested a bit deeper - in fact,
the deeper you nest it on the logical tokens, left to right, the faster
it will run, because each line survives fewer tests. So in the example
above you could branch on ("root" | "dkcsum") if there were more than
a few cases.
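
Here's the same program with that extra level of demux added; each
residual kernel line now hits at most one sscanf() instead of being
tried against every format in turn:

#include        <stdio.h>
#include        <string.h>

#define B_DATEOFF       16              /* the end of the date/time stamp */

int
main(void)
{
        char    in[BUFSIZ * 8];
        char    b_ho[256], b_prog[256], b_cstore1[256];
        int     b_nstore1, p;

        while(fgets(in,sizeof(in),stdin) != NULL) {
                if(sscanf(in + B_DATEOFF,"%255s %255s%n",b_ho,b_prog,&p) != 2)
                        continue;
                if(strcmp(b_prog,"/bsd:") != 0)
                        continue;

                /* demux on the first word of the residual */
                if(strncmp(in + B_DATEOFF + p," dkcsum:",8) == 0) {
                        if(sscanf(in + B_DATEOFF + p,
                            " dkcsum: %255s matched BIOS disk %d",
                            b_cstore1,&b_nstore1) == 2)
                                printf("found BIOS disk %s %d\n",
                                    b_cstore1,b_nstore1);
                        continue;
                }
                if(strncmp(in + B_DATEOFF + p," root on",8) == 0) {
                        if(sscanf(in + B_DATEOFF + p," root on %255s",
                            b_cstore1) == 1)
                                printf("found root on %s\n",b_cstore1);
                        continue;
                }
        }
        return(0);
}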

This approach runs faster than hell on even low-end hardware and
can crunch through a lot of logs extremely rapidly. It has the disadvantage
that it's hard-coded, but the fact that it's a hand-constructed hard-coded
parse tree is what makes it so fast. So suppose you were getting
8 squadjillion firewall permit/deny log messages and wanted to do
something with them - filtering in this manner you could pretty
easily rip statistics into memory, and periodically checkpoint
counts to a BSD DB b-tree or even (if you like slow!) an SQL
database. Just don't do the updates in-line; collect statistics in-line
and checkpoint them to the SQL every minute or something
like that. If your logging system crashes, the fact that you lost
a minute's worth of your log data is kind of irrelevant, because
you'll lose several minutes' worth by the time it's down and back up.
With something like a firewall that's spewing out scads of
permit/deny, your parse routine is very small, right? It's about
the size of that little program I posted up above. And if you
get something other than permit/deny - well, that's interesting,
so bounce it out to a human's attention.
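
Here's a sketch of that collect-then-checkpoint loop, with plain
counters and a printf() standing in for the BSD DB or SQL write.
The " permit "/" deny " substrings are placeholders for whatever
your firewall actually logs:

#include        <stdio.h>
#include        <string.h>
#include        <time.h>

#define CHECKPOINT_SECS 60

int
main(void)
{
        char    in[BUFSIZ * 8];
        long    permits = 0, denies = 0, odd = 0;
        time_t  last = time(NULL);

        while(fgets(in,sizeof(in),stdin) != NULL) {
                if(strstr(in," permit ") != NULL)
                        permits++;
                else if(strstr(in," deny ") != NULL)
                        denies++;
                else
                        odd++;          /* not permit/deny: interesting! */

                if(time(NULL) - last >= CHECKPOINT_SECS) {
                        /* replace this with your database update */
                        printf("checkpoint: permit=%ld deny=%ld odd=%ld\n",
                            permits,denies,odd);
                        permits = denies = odd = 0;
                        last = time(NULL);
                }
        }
        return(0);
}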

The "gotcha" with this approach is that it is data-centric and
the level of work expands linearly with the number of different
types of data that you want to deal with. So if you want to
rapidly parse 400 different kinds of log messages, you're going
to write some code. Maybe a thousand lines, by the time
you're done. That's not scary!!

What _is_ scary is if you wanted to handle _all_ the variant
forms of log messages. My summer project 2 summers ago
("what I did on my summer vacation, by Marcus Ranum)
was going out to San Diego Supercomputer Center and
plowing around in SDSC's syslogs with Abe and a couple
bottles of tequila and a ton of gummi bears. The goal of the
project was to attempt to automatically generate a parse
tree engine that could recognize all the variants of syslog
messages seen. So first I wrote something that attempted
to enumerate the total # of variants seen. After it ran for a
day (much data, but the computer I was using was fast!)
my analysis program concluded there were 50,000+ different
forms of syslog messages in 10 years of syslog data.
Arrrgh. That's a lot. What the hell impels programmers to
keep changing the structure of their log messages? There
were lots of variants of sendmail messages (dang it!) - the
biggest culprit. Anyhow, you might only want a few of them
so it's not like you'd need parse rules for all 50,000+
forms of message. The important part of the process is
the sitting down and seeing what you've got and deciding
what you want to do with it.
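
I won't try to reproduce the SDSC tool here, but a crude way to get
a feel for how many message "shapes" you have is to collapse the
variable bits - just runs of digits, in this toy version - and then
count what's left with something like "sort | uniq -c | sort -rn":

#include        <stdio.h>
#include        <ctype.h>

int
main(void)
{
        int     c, prev = '\0';

        while((c = getchar()) != EOF) {
                if(isdigit(c)) {
                        /* collapse every run of digits to a single '#' */
                        if(prev != '#')
                                putchar(prev = '#');
                        continue;
                }
                putchar(prev = c);
        }
        return(0);
}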

It comes back to what I was saying in my last few postings on this
topic:
        - look at your data
        - then decide what you want to collect from it
        - then collect it
        - count the stuff you throw away
        - vector the rest to a human's attention

mjr.

_______________________________________________
firewall-wizards mailing list
firewall-wizards () honor icsalabs com
http://honor.icsalabs.com/mailman/listinfo/firewall-wizards

