HAAGE&PARTNER Computer GmbH


Sawmill Tutorial

Using Sawmill To Report On Custom Log Data


Sawmill is a universal log analyzer that can parse and report on any type of textual log data. As of this writing, Sawmill supports 850 different common log formats, from the full range of popular devices: web servers, media servers, mail servers, firewalls, gateways, etc. But Sawmill's log analytic capabilities don't stop at the 850 included formats; it can analyze any log file. The 850 formats are implemented using log format plug-ins: small text files which describe the layout of the log data, the fields, the filtering to apply to the data, and the reports to be generated. These log format plug-ins are user-editable and user-creatable, so to support a custom log format, all you need to do is create your own plug-in. This newsletter gives an example of creating such a plug-in. The specific log data is an internal format generated by a Perl script to analyze disk usage, compression, and number of lines in a log dataset; but again, this same approach can be used to analyze any textual log data.


The Log Generator

The script which generates the logs is shown below. Our Professional Services team created this script on a customer site, when we needed to know how much disk space was used, and how many lines of log data there were, in a very large multi-file compressed dataset. This was on a UNIX system, so we could have done it with "du", or with "gunzip -c" combined with "find" and/or "wc", but that would have given us information for only one directory, without any ability to zoom in to see information by directory, or by date; and we would not have been able to filter. Importing the data into Sawmill allows very high granularity in examining the data, filtering by date or other criteria. So, we wrote this script to compute, for each file in a directory (recursively), the file size, uncompressed file size (for gzipped logs), and number of lines in the file:


  #!/usr/bin/perl
  use strict;

  my $usage = "compute_size_data.pl <pathname>";

  my $pathname = $ARGV[0];
  if ($pathname eq "") {
    print "Usage: $usage\n";
    exit(-1);
  }

  # Walk the directory tree, one file per line
  my $findcmd = "find $pathname -type f";
  open(FIND, "$findcmd|") || die("Can't run $findcmd: $!");
  while (<FIND>) {
    my $foundpathname = $_;
    chomp($foundpathname);
    my $filesize = -s $foundpathname;
    my $uncompressedsize = $filesize;
    my $lines = 0;
    if ($foundpathname =~ /[.]gz$/) {
      # The second column of the "gunzip -l" data line is the uncompressed size
      $uncompressedsize = `gunzip -l $foundpathname | fgrep % | sed -e 's/^ *[0-9][0-9]*  *\\([0-9][0-9]*\\).*/\\1/'`;
      chomp($uncompressedsize);
      $lines = `gunzip -c $foundpathname | wc -l`;
      chomp($lines);
    }
    else {
      # Redirect from the file so wc prints only the count, not the pathname
      $lines = `wc -l < $foundpathname`;
      chomp($lines);
    }
    $lines =~ s/^\s+//; # strip leading whitespace some wc implementations emit
    print "pathname=$foundpathname|size=$filesize|uncompressedsize=$uncompressedsize|lines=$lines\n";
  }


The Log Generator Script: compute_size_data.pl


The details of the script are beyond the scope of this article; what matters is the output: the script generates log data like this:


  pathname=/logs/12345/log_12345.200806292100-2200-0.log.gz|size=542192|uncompressedsize=4046692|lines=7883
  pathname=/logs/12345/log_12345.200808172000-2100-0.log.gz|size=667984|uncompressedsize=5331102|lines=11740
  pathname=/logs/12345/log_12345.200806131300-1400-0.log.gz|size=380606|uncompressedsize=2970825|lines=5608
  pathname=/logs/12345/log_12345.200805222000-2100-0.log.gz|size=589198|uncompressedsize=4567431|lines=8284
  pathname=/logs/12345/log_12345.200803252100-2200-0.log.gz|size=691357|uncompressedsize=6072894|lines=12695
  pathname=/logs/12346/log_12346.200803012200-2300-0.log.gz|size=513444|uncompressedsize=3881224|lines=7514
  pathname=/logs/12346/log_12346.200805101400-1500-0.log.gz|size=322774|uncompressedsize=2501874|lines=4937
  pathname=/logs/12346/log_12346.200712311800-1900-0.log.gz|size=461202|uncompressedsize=3422076|lines=6165
  pathname=/logs/12346/log_12346.200806270700-0800-0.log.gz|size=105324|uncompressedsize=813253|lines=1807
  pathname=/logs/12346/log_12346.200803172000-2100-0.log.gz|size=751699|uncompressedsize=5731115|lines=10523


The first line, for instance, means that there is a file /logs/12345/log_12345.200806292100-2200-0.log.gz, which is 542,192 bytes in size compressed (4,046,692 bytes uncompressed), and is 7,883 lines in length. In this case, this is the log file for customer 12345, generated on June 29, 2008, covering the period 21:00 - 22:00.
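The pipe-delimited key=value layout is easy to see in a short sketch. The following Python snippet is illustrative only (Sawmill does its own parsing, as shown later in the plug-in); the function name is ours:

```python
# Split one line of compute_size_data.pl output into its named fields.
# The format is "key=value" pairs separated by "|".
def parse_line(line):
    fields = {}
    for pair in line.strip().split("|"):
        key, _, value = pair.partition("=")
        fields[key] = value
    return fields

line = ("pathname=/logs/12345/log_12345.200806292100-2200-0.log.gz"
        "|size=542192|uncompressedsize=4046692|lines=7883")
fields = parse_line(line)
print(fields["size"])              # -> 542192
print(fields["uncompressedsize"])  # -> 4046692
```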


The Log Format Plug-in

So now we have a chunk of log data, and we want to analyze it with Sawmill to answer questions like: How much disk space does this dataset use, per directory or per date? How many lines of log data are there? In order to import this custom data into Sawmill, we need to create a log format plug-in. The plug-in below recognizes and parses this format of log data. Each section of the plug-in will be described separately below.


  compute_size_data = {

    plugin_version = "1.0"

    # The name of the log format
    log.format.format_label = "compute_size_data.pl Log Format"
    log.miscellaneous.log_data_type = "other"
    log.miscellaneous.log_format_type = "other"

    # The log is in this format if any of the first ten lines match this regular expression
    log.format.autodetect_regular_expression = "^pathname=.*uncompressedsize="

    # Log fields
    log.fields = {

      date = ""
      time = ""

      pathname = {
        type = "page"
        hierarchy_dividers = "/"
        left_to_right = true
        leading_divider = "true"
      } # pathname

      size = ""
      uncompressed_size = ""
      lines = ""
      files = ""

    } # log.fields


    # Database fields
    database.fields = {

      date_time = ""
      day_of_week = ""
      hour_of_day = ""

      pathname = {
        suppress_bottom = 99999
      }

    } # database.fields

    database.numerical_fields = {

      files = {
        default = true
      }
   
      lines = {
        default = true
      }

      size = {
        type = "float"
        default = true
        display_format_type = "bandwidth"
      } # size

      uncompressed_size = {
        label = "uncompressed size"
        type = "float"
        default = true
        display_format_type = "bandwidth"
      } # uncompressed_size

    } # database.numerical_fields


    log.parsing_filters.parse = `
  if (matches_regular_expression(current_log_line(), '^pathname=([^|]+)[|]size=([0-9]+)[|]uncompressedsize=([0-9]+)[|]lines=([0-9]+)')) then (

    # Add an entry which reports total usage by all files
    pathname = $1;
    size = $2;
    uncompressed_size = $3;
    lines = $4;
    files = 1;

    if (matches_regular_expression(pathname, '[.]([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])([0-9][0-9])([0-9][0-9])-')) then (
      date = $1 . '-' . $2 . '-' . $3;
      time = $4 . ':' . $5 . ':00';
    );

  ); # if matches line
  `

    create_profile_wizard_options = {

      # The reports menu
      report_groups = {

        date_time_group = ""

      } # report_groups

    } # create_profile_wizard_options

  } # compute_size_data



The Log Format plug-in: compute_size_data.cfg


If you put this plug-in in the LogAnalysisInfo/log_formats folder of your Sawmill installation and then create a profile using the log data above, Sawmill will recognize the data as "compute_size_data.pl Log Format". Sawmill will then generate fully filterable and zoomable reports, with the numerical fields size, uncompressed size, and lines, and reports for date, time, pathnames, and directories.


The Log Format Plug-in, Dissected

In this section we will go through the log format plug-in one part at a time, describing what each part does.


The Log Format Plug-in: The Header

A log format plug-in starts with a header like this:


  compute_size_data = {

    plugin_version = "1.0"

    # The name of the log format
    log.format.format_label = "compute_size_data.pl Log Format"
    log.miscellaneous.log_data_type = "other"
    log.miscellaneous.log_format_type = "other"



The first line has the internal name of the plug-in, in this case, "compute_size_data". The name of the file must be the same as this, with a .cfg extension, so this file must be called compute_size_data.cfg. Plug-ins are "nodes" in Sawmill terminology (like most of Sawmill's configuration text files), so they use curly brackets ( { } ) for grouping. The first line shows that this plug-in is called compute_size_data, and the "= {" indicates that it is a group node, with multiple parameters below it. The entire remainder of the file is information within this node, describing the plug-in; and the final line of the file contains "}" to close the "compute_size_data" node.

plugin_version is an optional parameter which is useful for tracking multiple versions of a plug-in.

The line beginning with # is a comment. Everything after the #, until the end of the line, is ignored by Sawmill, and has no effect on the functionality of the plug-in.

The next three lines give the label of the plug-in as it will appear in the Create Profile Wizard, and categories for the plug-in which determine where it is listed in the documentation. The category options do not affect the functionality of the plug-in, and can safely be left as "other."


The Log Format Plug-in: The Autodetection Regular Expression

The next line is the autodetection regular expression:


    # The log is in this format if any of the first ten lines match this regular expression
    log.format.autodetect_regular_expression = "^pathname=.*uncompressedsize="



This is a regular expression (See Regular Expressions in the Sawmill documentation, or look it up in a search engine, to learn regular expression syntax) which describes what the log data looks like, for the purposes of detecting it. When a profile is first created, Sawmill will detect the format by comparing the first few lines of the log data with this expression. If any line matches, the format will be listed as a matching format in the Create Profile wizard. In this case, the regular expression means that any line starting with "pathname=", and then containing "uncompressedsize=" later in the line, is considered to match this format. The autodetect regular expression should be as tightly focused as possible, so it detects every line of the format, but is very unlikely to match a line of any other format; this ensures that the Create Profile Wizard shows the format, and no other formats, when it autodetects.
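The effect of the autodetection expression can be demonstrated with Python's re module (the Apache-style second line is a made-up counterexample, not from the dataset):

```python
import re

# The plug-in's autodetection pattern, applied the way Sawmill applies it:
# if any of the first few lines of the log matches, the format is offered
# as a match in the Create Profile Wizard.
AUTODETECT = re.compile(r"^pathname=.*uncompressedsize=")

lines = [
    "pathname=/logs/12345/log_12345.200806292100-2200-0.log.gz|size=542192|uncompressedsize=4046692|lines=7883",
    '192.168.1.1 - - [29/Jun/2008:21:00:00] "GET / HTTP/1.1" 200 512',
]
matches = [bool(AUTODETECT.search(line)) for line in lines]
print(matches)  # -> [True, False]
```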


The Log Format Plug-in: The Log Fields

The log format continues with the log fields:


    # Log fields
    log.fields = {

      date = ""
      time = ""

      pathname = {
        type = "page"
        hierarchy_dividers = "/"
        left_to_right = true
        leading_divider = "true"
      } # pathname

      size = ""
      uncompressed_size = ""
      lines = ""
      files = ""


    } # log.fields


Like the plug-in itself, this section is a node, using { } syntax for grouping. The node is the "fields" node within the "log" node of the plug-in, which describes the fields in the log data. This is a list of fields which we will extract from the log data: date, time, pathname, size, uncompressed_size, lines, and files.

All fields are simple, default fields, except for "pathname," so all other fields are given an empty value, which simply defines the existence of the field, and lets the Create Profile Wizard decide what the field parameters are. But the pathname field is complicated in this case, because we're doing something fancy: we want this to be a hierarchically drillable field, so you can click "/logs/" in the Pathname report, and zoom in to see a report showing "/logs/12345/" and "/logs/12346/", and then click one of those to see another report with just the files in that subdirectory. This gives a "file browser" feel to the field, allowing you to zoom into directories by clicking them. But the default behavior of the Create Profile Wizard is to make reports list full field values, with no internal hierarchical structure, so we need to override those for this field. The options mean that this is a "page" field, a field (a pathname), with "/" between directories, with the containing items to the left (parent directories appear to the left of their children in a pathname), and with a leading divider (the pathname starts with a "/"). More information about these options is available in the Log Fields chapter of the Sawmill documentation.
All fields are simple, default fields, except for "pathname," so all other fields are given an empty value, which simply defines the existence of the field, and lets the Create Profile Wizard decide what the field parameters are. But the pathname field is complicated in this case, because we're doing something fancy: we want this to be a hierarchically drillable field, so you can click "/logs/" in the Pathname report, and zoom in to see a report showing "/logs/12345/" and "/logs/12346/", and then click one of those to see another report with just the files in that subdirectory. This gives a "file browser" feel to the field, allowing you to zoom into directories by clicking them. But the default behavior of the Create Profile Wizard is to make reports list full field values, with no internal hierarchical structure, so we need to override those defaults for this field. The options mean that this is a "page" field (a pathname), with "/" between directories, with the containing items to the left (parent directories appear to the left of their children in a pathname), and with a leading divider (the pathname starts with a "/"). More information about these options is available in the Log Fields chapter of the Sawmill documentation.

The Log Format Plug-in: The Database Fields (Non-Aggregating)

The next section lists the non-aggregating (non-numerical) database fields:


    # Database fields
    database.fields = {

      date_time = ""
      day_of_week = ""
      hour_of_day = ""

      pathname = {
        suppress_bottom = 99999
      }

    } # database.fields


This section describes the database fields, i.e., the fields as they will appear in Sawmill's database. There is generally one report per database field.

Database fields roughly correspond to log fields, but there are often "derived" database fields, which are computed from specific log fields. The full list of derived fields is included in the Creating Log Format Plug-ins chapter of the Sawmill online documentation. In this case, we are using the date and time log fields to derive three other fields to include in the database: date_time, day_of_week, and hour_of_day. This allows us to see an integrated date/time report, as well as separate "Day of Week" and "Hour of Day" reports. The pathname field is also tracked, which will give us both a "Pathnames" report and a hierarchical "Pathnames/directories" report--the wizard automatically creates these two reports for a "page" field like "pathname" (where type="page" in the log field).
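To make the derivation concrete, here is a Python sketch of how the derived fields relate to the date and time log fields (Sawmill computes these internally; the exact labels it displays may differ):

```python
from datetime import datetime

# Derive date_time, day_of_week, and hour_of_day from the date and time
# values the parsing filter produces for the first sample line.
date, time = "2008-06-29", "21:00:00"
stamp = datetime.strptime(date + " " + time, "%Y-%m-%d %H:%M:%S")

date_time   = stamp.strftime("%Y-%m-%d %H:%M:%S")
day_of_week = stamp.strftime("%A")   # -> "Sunday"
hour_of_day = stamp.hour             # -> 21
print(date_time, day_of_week, hour_of_day)
```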

Most fields are listed just as fieldname="", which lets the wizard pick the values of their parameters, but we do need to override one parameter: we set suppress_bottom to a large number in the pathname field, to allow any number of levels of hierarchy in the pathname field. Otherwise, zooming would only go two levels deep in the Pathnames/directories report (the default suppress_bottom value is 2).

We didn't include the numerical fields here, because those don't become reports: they become columns in reports. They belong in the aggregating (numerical) fields section, which is next.



The Log Format Plug-in: The Database Fields (Aggregating)

The next section lists the aggregating (numerical) database fields:


    database.numerical_fields = {

      files = {
        default = true
      }
   
      lines = {
        default = true
      }

      size = {
        type = "float"
        default = true
        display_format_type = "bandwidth"
      } # size

      uncompressed_size = {
        label = "uncompressed size"
        type = "float"
        default = true
        display_format_type = "bandwidth"
      } # uncompressed_size

    } # database.numerical_fields



Aggregating fields automatically combine their values, usually by summing them, into reports. So for instance, the "size" field automatically sums the number of bytes; if the log data contains two lines with 100 and 200 bytes listed, the Overview will show size=300. All four fields here are summing fields (the default), so the Overview will contain four entries: files, lines, size, and uncompressed size; and all reports will contain those four columns (with a non-aggregating field as the leftmost column).
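The summing behavior can be sketched in a few lines of Python (purely illustrative; Sawmill performs this aggregation inside its database, and the entry values here are made up):

```python
from collections import defaultdict

# Entries that share a report row are combined by summing each
# aggregating (numerical) field.
entries = [
    {"pathname": "/logs/12345/a.log.gz", "size": 100, "lines": 3, "files": 1},
    {"pathname": "/logs/12345/b.log.gz", "size": 200, "lines": 5, "files": 1},
]

totals = defaultdict(int)
for entry in entries:
    for field in ("size", "lines", "files"):
        totals[field] += entry[field]

print(dict(totals))  # -> {'size': 300, 'lines': 8, 'files': 2}
```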

The value default=true is specified for all fields, which causes them to be checked by default in the Create Profile Wizard.

The two "size" fields are listed as type=float, which allows them to represent large numbers on 32-bit systems. The other two fields are left as default type integers, which is faster and smaller, because they are unlikely to exceed the maximum size of a 32-bit integer (about 2 billion).

The two bandwidth fields specify display_format_type=bandwidth, which tells Sawmill to display them using bandwidth formatting. So where a value of 1024 would be displayed as "1024" if it is the number of lines, it will be displayed as "1 K" if it is the uncompressed size, or the size.
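A rough Python sketch of bandwidth-style formatting; the exact thresholds, rounding, and unit labels Sawmill uses are not documented here, so treat the details as an assumption:

```python
# Format a byte count with binary (1024-based) unit suffixes,
# approximating a bandwidth-style display.
def format_bandwidth(n):
    for unit in ("", "K", "M", "G", "T"):
        if n < 1024:
            return "%g %s" % (n, unit) if unit else "%g" % n
        n /= 1024.0
    return "%g P" % n

print(format_bandwidth(512))   # -> "512"
print(format_bandwidth(1024))  # -> "1 K"
print(format_bandwidth(1536))  # -> "1.5 K"
```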

The uncompressed_size field specifies label="uncompressed size". Without this, it would appear with its default label, which is the name, uncompressed_size. It looks better to have a space instead of an underscore, so we have overridden the label in this case. The Create Profile wizard also looks in the field_labels node of the file LogAnalysisInfo/language/english/lang_stats.cfg to try to find a matching label, and uses the one there if there is one (replace "english" with the name of your language if you're using a non-English translation), so you can also modify that file to change labels.


The Log Format Plug-in: The Parsing Filters

The next section lists the parsing filters:



    log.parsing_filters.parse = `
  if (matches_regular_expression(current_log_line(), '^pathname=([^|]+)[|]size=([0-9]+)[|]uncompressedsize=([0-9]+)[|]lines=([0-9]+)')) then (

    # Add an entry which reports total usage by all files
    pathname = $1;
    size = $2;
    uncompressed_size = $3;
    lines = $4;
    files = 1;

    if (matches_regular_expression(pathname, '[.]([0-9][0-9][0-9][0-9])([0-9][0-9])([0-9][0-9])([0-9][0-9])([0-9][0-9])-')) then (
      date = $1 . '-' . $2 . '-' . $3;
      time = $4 . ':' . $5 . ':00';
    );

  ); # if matches line
  `


The parsing filters contain a description of how Sawmill parses a line of log data. There are several ways to parse the data which do not involve writing a parsing filter, including delimited parsing (index/subindex) and a single parsing regular expression. But parsing filters provide the most functionality, and in this case they were useful because we wanted to reformat the date and time values from the log pathname (otherwise, we could have used a parsing regular expression). This parsing filter is an expression written in the Salang language (Sawmill's language; see The Configuration Language in the documentation for a reference). The entire expression is contained in backtick quotes (`) in this case, which are convenient because they allow you to use single (') or double (") quotes in the expression.

This expression calls the current_log_line() function of Salang to get the value of the current line, then uses matches_regular_expression() to match that line with a regular expression which has parenthesized subexpressions to extract all the field values into the variables $1, $2, $3, etc. Then, it assigns those variables to pathname, size, etc. It sets the "files" field to 1, so that the sum of the "files" field will be the total number of files. Finally, it uses matches_regular_expression() again to extract the date and time values from the pathname, and rebuilds them in YYYY-MM-DD and HH:MM:SS format, putting them into the date and time fields for Sawmill to use.
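The date/time reformatting step has a direct Python analogue, which may help readers more familiar with Python regular expressions than with Salang:

```python
import re

# Pull the YYYYMMDDHHMM timestamp out of the log filename and rebuild it
# as YYYY-MM-DD and HH:MM:SS, mirroring the plug-in's second regex.
pathname = "/logs/12345/log_12345.200806292100-2200-0.log.gz"
m = re.search(r"[.](\d{4})(\d{2})(\d{2})(\d{2})(\d{2})-", pathname)
if m:
    date = "%s-%s-%s" % m.group(1, 2, 3)
    time = "%s:%s:00" % m.group(4, 5)
    print(date, time)  # -> 2008-06-29 21:00:00
```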

Writing a parsing filter is similar to writing a script or computer program, and requires some experience with scripting. This is usually the most difficult part of the plug-in. Fortunately, many log formats do not require this step; they can be parsed using index/subindex or a "parsing regular expression." See Creating Log Format Plug-ins in the Sawmill documentation for examples of these simpler approaches to parsing. Parsing filters are required if a single log entry spans multiple lines, or if field values need to be converted before being put into the log fields (as in this case), or in some other cases where advanced calculations are required.


The Log Format Plug-in: The Wizard Options

The next section lists the Create Profile Wizard options:


    create_profile_wizard_options = {

      # The reports menu
      report_groups = {

        date_time_group = ""

      } # report_groups

    } # create_profile_wizard_options



There are many options which can be included here (see Creating Log Format Plug-ins), which specify report groupings, report details, field associations by report, final_step cleanup/reworking, and more. But in this example, we're sticking with the basics: all we want are a date/time group in the reports menu, and all default reports (Overview, one report per database field, Single-page Summary, and Log Detail). This is done by having a date_time_group specified in the report_groups section of the create_profile_wizard_options section.



The Log Format Plug-in: The Closing Bracket

The plug-in is a single CFG node, which starts with "compute_size_data = ". Therefore, it must have a closing bracket at the end:


  } # compute_size_data


The comment is optional, but it is useful to put the node name as a comment on every closing bracket to improve legibility.


Conclusion

This newsletter describes the process of creating a plug-in for parsing and reporting on a custom log format. This type of plug-in can be created to parse and report on any textual log data.

The process of creating a plug-in is somewhat complex and detailed, especially if it involves parsing filters. Our experts have created hundreds of plug-ins, and can do it quickly and accurately. If you would like assistance in creating a plug-in for a custom log format, you can also use Sawmill Professional Services.

[Article revision v1.0]


Professional Services

If you would prefer not to customize Sawmill Analytics yourself, we can offer this as a service. Our experts will be happy to get in touch with you to adapt the reports, or other aspects of Sawmill Analytics, to your environment and requirements. Contact

© 1995-2011 HAAGE & PARTNER Computer GmbH · Impressum · Datenschutz · www.haage-partner.de