Sawmill-Tutorial
Using Sawmill To Report On Custom Log Data
Sawmill is a universal log analyzer that can parse and report on any
type of textual log data. As of this writing, Sawmill supports
850 different common log formats, from the full range of popular
devices: web servers, media servers, mail servers, firewalls, gateways,
etc. But Sawmill's log analysis capabilities don't stop at the 850
included formats; it can analyze any log file. The 850 formats
are implemented using log format plug-ins, small text files which
describe the layout of the log data, the fields, the filtering to
apply to the data, and the reports to be generated. These log
format plug-ins are user-editable and user-creatable, so to
support a custom log format, all you need to do is create your own
plug-in. This newsletter gives an example of creating one. The
specific log data is an internal format generated by a Perl script to
analyze disk usage, compression, and number of lines in a log dataset,
but again, this same approach can be used to analyze any
textual log data.
The Log Generator
The script which generates the logs is shown below. Our Professional
Services team created this script on a customer site, when we needed to
know how much disk space was used, and how many lines of log data there
were, in a very large multi-file compressed dataset. This was on a UNIX
system, so we could have done it with "du", or with "gunzip -c"
combined with "find" and/or "wc", but that would have given us
information for only one directory, without any ability to zoom in to
see information by directory, or by date; and we would not have been
able to filter. Importing the data into Sawmill allows very high
granularity in examining the data, filtering by date or other criteria.
So, we wrote this script to compute, for each file in a directory
(recursively), the file size, uncompressed file size (for gzipped
logs), and number of lines in the file:
#!/usr/bin/perl
use strict;
use warnings;

my $usage = "compute_size_data.pl <pathname>";
my $pathname = $ARGV[0];
if (!defined($pathname) || $pathname eq "") {
    print "Usage: $usage\n";
    exit(1);
}

# List every regular file under the given directory, recursively
my $findcmd = "find $pathname -type f";
open(FIND, "$findcmd|") || die("Can't run $findcmd: $!");
while (<FIND>) {
    my $foundpathname = $_;
    chomp($foundpathname);
    my $filesize = -s $foundpathname;
    my $uncompressedsize = $filesize;
    my $lines = 0;
    if ($foundpathname =~ /[.]gz$/) {
        # "gunzip -l" lists compressed size, uncompressed size, and ratio;
        # the sed expression keeps only the uncompressed size
        $uncompressedsize = `gunzip -l $foundpathname | fgrep % | sed -e 's/^ *[0-9][0-9]* * \\([0-9][0-9]*\\) .*/\\1/'`;
        chomp($uncompressedsize);
        $lines = `gunzip -c $foundpathname | wc -l`;
    }
    else {
        # Read from stdin so that wc prints only the count, not the filename
        $lines = `wc -l < $foundpathname`;
    }
    # Some wc implementations pad the count with spaces
    $lines =~ s/\s+//g;
    print "pathname=$foundpathname|size=$filesize|uncompressedsize=$uncompressedsize|lines=$lines\n";
}
close(FIND);
The Log Generator Script: compute_size_data.pl
The details of the script are beyond the scope of this article, but
it generates log data like this:
pathname=/logs/12345/log_12345.200806292100-2200-0.log.gz|size=542192|uncompressedsize=4046692|lines=7883
pathname=/logs/12345/log_12345.200808172000-2100-0.log.gz|size=667984|uncompressedsize=5331102|lines=11740
pathname=/logs/12345/log_12345.200806131300-1400-0.log.gz|size=380606|uncompressedsize=2970825|lines=5608
pathname=/logs/12345/log_12345.200805222000-2100-0.log.gz|size=589198|uncompressedsize=4567431|lines=8284
pathname=/logs/12345/log_12345.200803252100-2200-0.log.gz|size=691357|uncompressedsize=6072894|lines=12695
pathname=/logs/12346/log_12346.200803012200-2300-0.log.gz|size=513444|uncompressedsize=3881224|lines=7514
pathname=/logs/12346/log_12346.200805101400-1500-0.log.gz|size=322774|uncompressedsize=2501874|lines=4937
pathname=/logs/12346/log_12346.200712311800-1900-0.log.gz|size=461202|uncompressedsize=3422076|lines=6165
pathname=/logs/12346/log_12346.200806270700-0800-0.log.gz|size=105324|uncompressedsize=813253|lines=1807
pathname=/logs/12346/log_12346.200803172000-2100-0.log.gz|size=751699|uncompressedsize=5731115|lines=10523
The first line, for instance, means that there is a file
/logs/12345/log_12345.200806292100-2200-0.log.gz, which is 542,192
bytes in size compressed, or 4,046,692 bytes uncompressed, and is
7,883 lines long.
In this case, this is the log file for customer 12345, generated on
June 29, 2008, which is for the period 21:00 - 22:00.
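To make the layout of a line concrete, here is a minimal Python sketch (an illustration only, not part of Sawmill or of the original script) that splits one such line into its fields. The function name parse_size_line is hypothetical; the field names come from the sample output above.

```python
def parse_size_line(line):
    """Split one 'key=value|key=value' line into a dict; numeric fields become ints."""
    fields = dict(part.split("=", 1) for part in line.rstrip("\n").split("|"))
    for key in ("size", "uncompressedsize", "lines"):
        fields[key] = int(fields[key])
    return fields


sample = (
    "pathname=/logs/12345/log_12345.200806292100-2200-0.log.gz"
    "|size=542192|uncompressedsize=4046692|lines=7883"
)
record = parse_size_line(sample)
print(record["pathname"])
print(record["size"], record["uncompressedsize"], record["lines"])
```

This simple split works only because no pathname in the sample contains a "|" character; the plug-in described below uses a regular expression instead.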
The Log Format Plug-in
So now we have a chunk of log data, and we want to analyze it with
Sawmill to answer questions like:
- What is the total compressed size of the log data, in bytes?
- What is the total uncompressed size of the log data, in bytes?
- How many lines are there in the log data?
- How many lines are there for May 22, 2008?
- What is the total uncompressed size of log data for the customer
12346?
- How much compressed log data has been generated between 3PM and
4PM, for all customers combined?
- How much compressed log data has been generated between 3PM and
4PM, for customer 12347?
- And so on--once the data is in Sawmill, any type of reporting and
filtering is possible.
In order to import this custom data into Sawmill, we need to create a
log format plug-in. The plug-in below recognizes and parses this format
of log data. Each section of the plug-in will be described separately
below.
compute_size_data = {

  plugin_version = "1.0"

  # The name of the log format
  log.format.format_label = "compute_size_data.pl Log Format"
  log.miscellaneous.log_data_type = "other"
  log.miscellaneous.log_format_type = "other"

  # The log is in this format if any of the first ten lines match this regular expression
  log.format.autodetect_regular_expression = "^pathname=.*uncompressedsize="

  # Log fields
  log.fields = {
    date = ""
    time = ""
    pathname = {
      type = "page"
      hierarchy_dividers = "/"
      left_to_right = true
      leading_divider = "true"
    } # pathname
    size = ""
    uncompressed_size = ""
    lines = ""
    files = ""
  } # log.fields

  # Database fields
  database.fields = {
    date_time = ""
    day_of_week = ""
    hour_of_day = ""
    pathname = {
      suppress_bottom = 99999
    }
  } # database.fields

  database.numerical_fields = {
    files = {
      default = true
    }
    lines = {
      default = true
    }
    size = {
      type = "float"
      default = true
      display_format_type = "bandwidth"
    } # size
    uncompressed_size = {
      label = "uncompressed size"
      type = "float"
      default = true
      display_format_type = "bandwidth"
    } # uncompressed_size
  } # database.numerical_fields

  log.parsing_filters.parse = `
if (matches_regular_expression(current_log_line(), '^pathname=([^|]+)[|]size=([0-9]+)[|]uncompressedsize=([0-9]+)[|]lines=([0-9]+)')) then (

  # Add an entry which reports total usage by all files
  pathname = $1;
  size = $2;
  uncompressed_size = $3;
  lines = $4;
  files = 1;

  if (matches_regular_expression(pathname, '[.]([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])([0-9][0-9])([0-9][0-9])-')) then (
    date = $1 . '-' . $2 . '-' . $3;
    time = $4 . ':' . $5 . ':00';
  );

); # if matches line
`

  create_profile_wizard_options = {

    # The reports menu
    report_groups = {
      date_time_group = ""
    } # report_groups

  } # create_profile_wizard_options

} # compute_size_data
The Log Format plug-in: compute_size_data.cfg
If you put this plug-in in the LogAnalysisInfo/log_formats folder of
your Sawmill installation, then create a profile using the log data
above, Sawmill will recognize the data as "compute_size_data.pl Log
Format". Sawmill will then generate fully filterable and zoomable
reports, with numerical fields files, lines, size, and uncompressed
size, and reports for date, time, and pathnames and directories.
The Log Format Plug-in, Dissected
In this section we will take the log format plug-in, one part at a
time, describing what each part does.
The Log Format Plug-in: The Header
A log format plug-in starts with a header like this:
compute_size_data = {

  plugin_version = "1.0"

  # The name of the log format
  log.format.format_label = "compute_size_data.pl Log Format"
  log.miscellaneous.log_data_type = "other"
  log.miscellaneous.log_format_type = "other"
The first line has the internal name of the plug-in, in this case,
"compute_size_data". The name of the file must be the same as this,
with a .cfg extension, so this file must be called
compute_size_data.cfg. Plug-ins are "nodes" in Sawmill terminology
(like most of Sawmill's configuration text files), so they use curly
brackets ( { } ) for grouping. The first line shows that this plug-in
is called compute_size_data, and the "= {" indicates that it is
a group node, with multiple parameters below it. The entire remainder
of the file is information within this node, describing the plug-in;
and the final line of the file contains "}" to close the
"compute_size_data" node.
plugin_version is an optional parameter which is useful for tracking
multiple versions of a plug-in.
The line beginning with # is a comment. Everything after the #, until
the end of the line, is ignored by Sawmill, and has no effect on the
functionality of the plug-in.
The next three lines give the label of the plug-in as it will appear in
the Create Profile Wizard, and categories for the plug-in which
determine where it is listed in the documentation. The category
options do not affect the functionality of the plug-in, and can safely
be left as "other."
The Log Format Plug-in: The Autodetection Regular Expression
The next line is the autodetection regular expression:
  # The log is in this format if any of the first ten lines match this regular expression
  log.format.autodetect_regular_expression = "^pathname=.*uncompressedsize="
This is a regular expression (see Regular Expressions in the Sawmill
documentation, or look it up in a search engine, to learn regular
expression syntax) which describes what the log data looks like, for
the purposes of detecting it. When a profile is first created, Sawmill
will detect the format by comparing the first few lines of the log
data with this expression. If any line matches, the format will be
listed as a matching format in the Create Profile wizard. In this
case, the regular expression means that any line starting with
"pathname=", and then containing "uncompressedsize=" later in the
line, is considered to match this format. The autodetect regular
expression should be as tightly focused as possible, so it detects
every line of the format, but is very unlikely to match a line of any
other format; this ensures that the Create Profile Wizard
shows the format, and no other formats, when it autodetects.
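The matching rule can be tried outside Sawmill. The following Python sketch is only an illustration of "any of the first ten lines matches"; it is not Sawmill's implementation, the function name looks_like_size_data is hypothetical, and the Apache-style line is an invented counterexample.

```python
import re

# Same pattern as log.format.autodetect_regular_expression in the plug-in
AUTODETECT = re.compile(r"^pathname=.*uncompressedsize=")


def looks_like_size_data(lines, max_lines=10):
    """Return True if any of the first max_lines lines matches the pattern."""
    return any(AUTODETECT.search(line) for line in lines[:max_lines])


size_log = [
    "pathname=/logs/12345/log_12345.200806292100-2200-0.log.gz"
    "|size=542192|uncompressedsize=4046692|lines=7883",
]
other_log = [
    '127.0.0.1 - - [29/Jun/2008:21:00:00 +0000] "GET / HTTP/1.0" 200 512',
]

print(looks_like_size_data(size_log))   # True
print(looks_like_size_data(other_log))  # False
```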
The Log Format Plug-in: The Log Fields
The log format continues with the log fields:
  # Log fields
  log.fields = {
    date = ""
    time = ""
    pathname = {
      type = "page"
      hierarchy_dividers = "/"
      left_to_right = true
      leading_divider = "true"
    } # pathname
    size = ""
    uncompressed_size = ""
    lines = ""
    files = ""
  } # log.fields
Like the plug-in itself, this section is a node, using { } syntax for
grouping. The node is the "fields" node within the "log" node of the
plug-in, which describes the fields in the log data. This is a list of
fields which we will extract from the log data: date, time, pathname,
size, uncompressed_size, lines, and files.
All fields are simple, default fields, except for "pathname," so all
other fields are given an empty value, which simply defines the
existence of the field, and lets the Create Profile Wizard decide what
the field parameters are. But the pathname field is complicated in this
case, because we're doing something fancy: we want this to be a
hierarchically drillable field, so you can click "/logs/" in the
Pathname report, and zoom in to see a report showing "/logs/12345/" and
"/logs/12346/", and then click one of those to see another report with
just the files in that subdirectory. This gives a "file browser"
feel to the field, allowing you to zoom into directories by clicking
them. But the default behavior of the Create Profile Wizard is to make
reports list full field values, with no internal hierarchical
structure, so we need to override that behavior for this field. The
options mean that this is a "page" field (a pathname), with "/" between
directories, with the containing items to the left (parent directories
appear to the left of their children in a pathname), and with a leading
divider (the pathname starts with a "/"). More information about these
options is available in the Log Fields chapter of the Sawmill
documentation.
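As a rough picture of what "hierarchically drillable" means, this Python sketch lists the levels a reader would click through for one pathname. It is an illustration under the assumptions stated in the text (divider "/", parents on the left, leading divider); the function name hierarchy_levels is hypothetical, and nothing here claims to show how Sawmill stores the hierarchy internally.

```python
def hierarchy_levels(pathname, divider="/"):
    """Return each directory prefix of pathname, then the full pathname."""
    parts = pathname.strip(divider).split(divider)
    levels = []
    prefix = divider  # the pathname starts with a leading divider
    for part in parts[:-1]:
        prefix += part + divider
        levels.append(prefix)
    levels.append(prefix + parts[-1])
    return levels


levels = hierarchy_levels("/logs/12345/log_12345.200806292100-2200-0.log.gz")
for level in levels:
    print(level)
```

For this pathname the levels are "/logs/", then "/logs/12345/", then the file itself, which is the zoom path described above.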
The Log Format Plug-in: The Database Fields (Non-Aggregating)
The next section lists the non-aggregating (non-numerical) database
fields:
  # Database fields
  database.fields = {
    date_time = ""
    day_of_week = ""
    hour_of_day = ""
    pathname = {
      suppress_bottom = 99999
    }
  } # database.fields
This section describes the database fields, i.e., the fields as they
will appear in Sawmill's database. There is generally one report per
database field.
Database fields roughly correspond to log fields, but there are often
"derived" database fields, which are computed from specific log fields.
The full list of derived fields is included in the Creating Log Format
Plug-ins chapter of the Sawmill online documentation. In this case, we
are using the date and time log fields to derive three other fields to
include in the database: date_time, day_of_week, and hour_of_day. This
allows us to see an integrated date/time report, as
well as separate "Day of Week" and "Hour of Day" reports. The pathname
field is also tracked, which will give us both a "Pathnames" report and
a hierarchical "Pathnames/directories" report--the wizard automatically
creates these two reports for a "page" field like "pathname" (where
type="page" in the log field).
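The idea of a derived field can be sketched in Python. This is an illustration only: Sawmill computes these fields itself, and the exact values and formats it stores are not specified in this article, so the output shapes below are assumptions. The function name derive_fields is hypothetical.

```python
from datetime import datetime

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")


def derive_fields(date, time):
    """Derive date_time, day_of_week and hour_of_day from date and time strings."""
    stamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
    return {
        "date_time": stamp.strftime("%Y-%m-%d %H:%M:%S"),
        "day_of_week": DAY_NAMES[stamp.weekday()],  # weekday(): Monday is 0
        "hour_of_day": stamp.hour,
    }


# Date and time of the first sample log line
derived = derive_fields("2008-06-29", "21:00:00")
print(derived)
```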
Most fields are listed just as fieldname="", which lets the wizard pick
the values of their parameters, but we do need to override one
parameter: we set suppress_bottom to a large number in the pathname
field, to allow any number of levels of hierarchy in the pathname
field. Otherwise, zooming would only go two levels deep in the
Pathnames/directories report (the default suppress_bottom value is 2).
We didn't include the numerical fields here, because those don't become
reports: they become columns in reports. They belong in the aggregating
(numerical) fields section, which is next.
The Log Format Plug-in: The Database Fields (Aggregating)
The next section lists the aggregating (numerical) database fields:
  database.numerical_fields = {
    files = {
      default = true
    }
    lines = {
      default = true
    }
    size = {
      type = "float"
      default = true
      display_format_type = "bandwidth"
    } # size
    uncompressed_size = {
      label = "uncompressed size"
      type = "float"
      default = true
      display_format_type = "bandwidth"
    } # uncompressed_size
  } # database.numerical_fields
Aggregating fields automatically combine their values, usually by
summing them, into reports. So for instance, the "size" field
automatically sums the number of bytes; if the log data contains two
lines with 100 and 200 bytes listed, the Overview will show size=300.
All four fields here are summing fields (the default), so the Overview
will contain four entries: files, lines, size, and uncompressed size;
and all reports will contain those four columns (with a non-aggregating
field as the leftmost column).
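The summing behavior can be illustrated with a short Python sketch using three of the sample log lines. This only mimics the arithmetic; it is not how Sawmill builds its database, and the grouping by directory is just one example of a non-aggregating field in the leftmost column.

```python
from collections import defaultdict

# (pathname, size, uncompressed_size, lines) from three of the sample log lines
entries = [
    ("/logs/12345/log_12345.200806292100-2200-0.log.gz", 542192, 4046692, 7883),
    ("/logs/12345/log_12345.200808172000-2100-0.log.gz", 667984, 5331102, 11740),
    ("/logs/12346/log_12346.200803012200-2300-0.log.gz", 513444, 3881224, 7514),
]


def new_totals():
    return {"files": 0, "lines": 0, "size": 0, "uncompressed_size": 0}


overview = new_totals()                 # one row of totals for everything
by_directory = defaultdict(new_totals)  # one row of totals per directory

for pathname, size, uncompressed_size, lines in entries:
    directory = pathname.rsplit("/", 1)[0] + "/"
    for totals in (overview, by_directory[directory]):
        totals["files"] += 1  # each log line counts as one file
        totals["lines"] += lines
        totals["size"] += size
        totals["uncompressed_size"] += uncompressed_size

print(overview)
for directory, totals in sorted(by_directory.items()):
    print(directory, totals)
```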
The value default=true is specified for all fields, which causes them
to be checked by default, in the Create Profile Wizard.
The two "size" fields are listed as type=float, which allows them to
represent large numbers on 32-bit systems. The other two fields are
left as default type integers, which is faster and smaller, because
they are unlikely to exceed the maximum size of a 32-bit integer (about
2 billion).
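A quick arithmetic check shows why the byte totals need more room than the line counts. The dataset size below (1,000 files like the first sample line) is a hypothetical figure chosen for illustration, not a number from the customer site.

```python
INT32_MAX = 2**31 - 1  # largest signed 32-bit integer: 2,147,483,647


def fits_in_int32(total):
    """Return True if total can be held in a signed 32-bit integer."""
    return total <= INT32_MAX


# Hypothetical dataset: 1,000 files, each like the first sample line
total_uncompressed = 1000 * 4046692  # bytes
total_lines = 1000 * 7883

print(fits_in_int32(total_uncompressed))  # False
print(fits_in_int32(total_lines))         # True
```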
The two bandwidth fields specify display_format_type=bandwidth, which
tells Sawmill to display them using bandwidth formatting. So where a
value of 1024 would be displayed as "1024" if it is the number of
lines, it will be displayed as "1 K" if it is the uncompressed size, or
the size.
The uncompressed_size field specifies label="uncompressed size".
Without this, it would appear with its default label, which is the
name, uncompressed_size. It looks better to have a space instead of an
underbar, so we have overridden the label in this case. The Create
Profile wizard also looks in the field_labels node of the file
LogAnalysisInfo\language\english\lang_stats.cfg to try to find a
matching label, and uses the one there if there is one (replace
"english" with the name of your language if you're using a non-English
translation), so you can also modify that file to change labels.
The Log Format Plug-in: The Parsing Filters
The next section lists the parsing filters:
  log.parsing_filters.parse = `
if (matches_regular_expression(current_log_line(), '^pathname=([^|]+)[|]size=([0-9]+)[|]uncompressedsize=([0-9]+)[|]lines=([0-9]+)')) then (

  # Add an entry which reports total usage by all files
  pathname = $1;
  size = $2;
  uncompressed_size = $3;
  lines = $4;
  files = 1;

  if (matches_regular_expression(pathname, '[.]([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])([0-9][0-9])([0-9][0-9])-')) then (
    date = $1 . '-' . $2 . '-' . $3;
    time = $4 . ':' . $5 . ':00';
  );

); # if matches line
`
The parsing filters contain a description of how Sawmill parses each
line of log data. There are several ways to parse the data which
do not involve writing a parsing filter, including using delimited
parsing (index/subindex), and using a single parsing regular
expression. But parsing filters provide the most functionality, and in
this case, they were useful because we wanted to reformat the date and
time values from the log pathname (otherwise, we could have used a
parsing regular expression). This parsing filter is an expression
written in the Salang language (Sawmill's language; see The
Configuration Language in the documentation for a reference). The
entire expression is contained in backtick quotes (`) in this case,
which are convenient because they allow you to use single (') or double
(") quotes in the expression.
This expression calls the current_log_line() function of Salang to get
the value of the current line, then uses matches_regular_expression()
to match that line with a regular expression which has parenthesized
subexpressions to extract all the field values into the variables $1,
$2, $3, etc. Then, it assigns those variables to pathname, size, etc.
It sets the "files" field to 1, so that the sum of the "files" field
will be the total number of files. Finally, it uses
matches_regular_expression() again to extract the date and time values
from the pathname, and rebuilds them in YYYY-MM-DD and HH:MM:SS format,
putting them into the date and time fields for Sawmill to use.
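For readers more familiar with general-purpose scripting, the same logic can be written as a Python sketch. This is a translation for illustration only (Sawmill runs the Salang version shown above, not this code); the function name parse_entry is hypothetical, and the two regular expressions are copied from the filter, with the repeated [0-9] groups shortened using {4} and {2}.

```python
import re

LINE_RE = re.compile(
    r"^pathname=([^|]+)[|]size=([0-9]+)[|]uncompressedsize=([0-9]+)[|]lines=([0-9]+)"
)
STAMP_RE = re.compile(r"[.]([0-9]{4})([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{2})-")


def parse_entry(line):
    """Return a dict of log fields for one line, or None if it does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    entry = {
        "pathname": match.group(1),
        "size": int(match.group(2)),
        "uncompressed_size": int(match.group(3)),
        "lines": int(match.group(4)),
        "files": 1,  # one log line describes one file
    }
    stamp = STAMP_RE.search(entry["pathname"])
    if stamp is not None:
        year, month, day, hour, minute = stamp.groups()
        entry["date"] = f"{year}-{month}-{day}"
        entry["time"] = f"{hour}:{minute}:00"
    return entry


entry = parse_entry(
    "pathname=/logs/12345/log_12345.200806292100-2200-0.log.gz"
    "|size=542192|uncompressedsize=4046692|lines=7883"
)
print(entry["date"], entry["time"], entry["lines"])
```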
Writing a parsing filter is similar to writing a script or computer
program, and requires some experience with scripting. This is usually
the most difficult part of the plug-in. Fortunately, many log formats
do not require this step; they can be parsed using index/subindex or a
"parsing regular expression." See Creating Log Format Plug-ins in the
Sawmill documentation for examples of these simpler approaches to
parsing. Parsing filters are required if a
single log entry spans multiple lines, or if field values need to be
converted before being put into the log fields (as in this case), or in
some other cases where advanced calculations are required.
The Log Format Plug-in: The Wizard Options
The next section lists the Create Profile Wizard options:
  create_profile_wizard_options = {

    # The reports menu
    report_groups = {
      date_time_group = ""
    } # report_groups

  } # create_profile_wizard_options
There are many options which can be included here (see Creating Log
Format Plug-ins), which specify report groupings, report
details, field associations by report, final_step cleanup/reworking,
and more. But in this example, we're sticking with the basics: all we
want is a date/time group in the reports menu, and all default reports
(Overview, one report per database field, Single-page Summary, and Log
Detail). This is done by having a date_time_group specified in the
report_groups section of the create_profile_wizard_options section.
The Log Format Plug-in: The Closing Bracket
The plug-in is a single CFG node, which starts with "compute_size_data
= {". Therefore, it must have a closing bracket at the end:
} # compute_size_data
The comment is optional, but it is useful to put the node name as a
comment on every closing bracket to improve legibility.
Conclusion
This newsletter describes the process of creating a plug-in for parsing
and reporting on a custom log format. This type of plug-in can be
created to parse and report on any textual log data.
The process of creating a plug-in is somewhat complex and detailed,
especially if it involves parsing filters. Our experts have created
hundreds of plug-ins, and can do it quickly and accurately. If you
would like assistance in creating a plug-in for a custom log format,
Sawmill Professional Services can help.
[Article revision v1.0]
Professional Services
If you would rather not customize Sawmill Analytics yourself, we can do it for you as a service. Our experts will be happy to get in touch with you to tailor the reports or other aspects of Sawmill Analytics to your environment and requirements.