HAAGE&PARTNER Computer GmbH

Sawmill Analytics

Analysis and Reporting
for Web | Network | Security

Access and data analysis of server logs (proxy, mail server, firewall, web server), plus monitoring of security & performance and vulnerability analysis.

Sawmill Analytics 8 | Log Analysis

Sawmill-Tutorial

Using Sawmill To Query Log Data With SQL


Sawmill Enterprise can use MySQL as its back-end database (this is available only with Enterprise; Sawmill Professional and Sawmill Lite cannot use MySQL). [Note: Sawmill has been certified against the commercial version of MySQL Server, and it is recommended that Sawmill users purchase the full MySQL Server rather than using the "community" database server.] Sawmill's own built-in "internal" database is faster than MySQL, and is therefore generally a better choice if reports are to be generated by Sawmill. MySQL, however, has several major advantages over the internal database, including the ability to run SQL queries. This newsletter discusses using SQL to extract information from a MySQL database created by Sawmill. The techniques described here can be used to run arbitrary SQL queries against log data in any of the 800+ log formats Sawmill supports; other log formats can be supported through custom log format plug-ins.


Creating A MySQL Profile And Database

To create a Sawmill profile which uses MySQL as its back-end database, create a profile just as you would for the internal database, but on the Database Options page of the Create Profile Wizard, select "Use MySQL database" (again, this is available only in Sawmill Enterprise). Then enter the hostname, username, and password of your MySQL database server. If you're using a socket to communicate with MySQL, and the socket is not in the default location, you can also enter its pathname on this page. Finally, if you want the name of the database (schema) to be different from the name of the profile, you can enter a different database name on this page:

[Screenshot: MySQL settings on the Database Options page]
Then continue with creating the profile, and at the end, let Sawmill build the database. It is not necessary to view reports through Sawmill now (or ever), if you intend to use the database only through external SQL queries--the "build database" operation will populate all tables in the MySQL database. When the "build database" completes, Sawmill has parsed, normalized, and inserted all log data into the MySQL database, and built the associated itemnum tables (see below).


Querying The Main Table And The Itemnum Tables

The main table of the database is called "logfile," and contains one row for each event in the log data. In this example, we are analyzing a small 5000-line Apache log file, in which each line of log data is a separate event, so the resulting logfile table contains 5000 rows (all examples below are captures from the "mysql" command-line program, and are best viewed in a mono-spaced font like Courier or Monaco):


  mysql> select count(*) from logfile;
  +----------+
  | count(*) |
  +----------+
  |     5000 |
  +----------+


Let's look at one row from logfile:


  mysql> select * from logfile limit 1;
+-----------+---------------------+-----------------+-------------+-------------+----------+------+-----------+------+-------------------+--------------+----------+--------------------+----------+--------------+------+--------+----------+----------------------+---------------+---------------+-------------+------------------+--------+---------------+--------------------+-----------------+------+------------+---------+-------+--------+--------------+------------------+----------+------+
| loadorder | date_time           | bottomleveldate | day_of_week | hour_of_day | hit_type | page | file_type | worm | screen_dimensions | screen_depth | hostname | domain_description | location | organization | isp  | domain | referrer | referrer_description | search_engine | search_phrase | web_browser | operating_system | spider | server_domain | authenticated_user | server_response | hits | page_views | spiders | worms | errors | broken_links | screen_info_hits | visitors | size |
+-----------+---------------------+-----------------+-------------+-------------+----------+------+-----------+------+-------------------+--------------+----------+--------------------+----------+--------------+------+--------+----------+----------------------+---------------+---------------+-------------+------------------+--------+---------------+--------------------+-----------------+------+------------+---------+-------+--------+--------------+------------------+----------+------+
|         1 | 1998-04-07 16:53:06 |               2 |           2 |           2 |        2 |    2 |         2 |    2 |                 2 |            2 |        2 |                  2 |        2 |            2 |    2 |      2 |        2 |                    2 |             2 |             2 |           2 |                2 |      2 |             2 |                  2 |               2 |    1 |          1 |       0 |     0 |      0 |            0 |                0 |        2 |  734 |
+-----------+---------------------+-----------------+-------------+-------------+----------+------+-----------+------+-------------------+--------------+----------+--------------------+----------+--------------+------+--------+----------+----------------------+---------------+---------------+-------------+------------------+--------+---------------+--------------------+-----------------+------+------------+---------+-------+--------+--------------+------------------+----------+------+


This corresponds to the first line of the log data, which is this:

  140.177.203.25 - - [07/Apr/1998:16:53:06 -0500] "GET / HTTP/1.0" 200 734 "-" "Mozilla/4.04 [en] (X11; I; SunOS 5.6 sun4u)"

The date_time clearly matches the log data, and the 734 in the size column matches the 734 in the log data, but the rest of the columns are not so clear. That's because all non-aggregating (non-numerical) fields are normalized in logfile: instead of being stored directly in logfile, their values are stored in auxiliary tables (the itemnum tables), and references to those values are stored in logfile. For instance, hostname is 2 in logfile, which corresponds to 140.177.203.25 in the hostname itemnum table (the table called hostnameitemnum, which has columns itemnum and hostname, and is used to map itemnums to hostnames, and vice versa). Querying the hostnameitemnum table with SQL, and selecting only the row where itemnum=2, shows the correlation:


  mysql> select * from hostnameitemnum where itemnum = 2;
  +---------+----------------+
  | itemnum | hostname       |
  +---------+----------------+
  |       2 | 140.177.203.25 |
  +---------+----------------+
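The normalization scheme above can be sketched in a few lines of code. This is a minimal illustration using Python's sqlite3 module as a stand-in for MySQL; the table and column names follow the article, but the sample row is hypothetical and reduced to a few columns:

```python
# Sketch of Sawmill's normalization pattern, using SQLite as a stand-in
# for MySQL (table and column names follow the article; the sample data
# is a reduced, hypothetical version of the logfile row shown above).
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# The itemnum table maps small integer keys to field values.
cur.execute("CREATE TABLE hostnameitemnum (itemnum INTEGER PRIMARY KEY, hostname TEXT)")
cur.execute("INSERT INTO hostnameitemnum VALUES (2, '140.177.203.25')")

# The main table stores only the itemnum reference, not the hostname itself.
cur.execute("CREATE TABLE logfile (loadorder INTEGER, hostname INTEGER, hits INTEGER, size INTEGER)")
cur.execute("INSERT INTO logfile VALUES (1, 2, 1, 734)")

# De-normalize by looking the itemnum up in the auxiliary table.
cur.execute("SELECT i.hostname, l.size FROM logfile l "
            "JOIN hostnameitemnum i ON l.hostname = i.itemnum")
print(cur.fetchone())  # → ('140.177.203.25', 734)
```

The same pattern applies to every non-numerical field in logfile: each has its own itemnum table named after the field.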



Joining The Main Table (logfile) To The Itemnum Tables

By joining one or more itemnum tables to the main table, you can get results similar to those shown in Sawmill's own reports. For instance, let's generate a "top 10 hostnames" report. We can do that by selecting/summing logfile, grouping on hostname, joining in hostnameitemnum to get real hostnames in the result (rather than normalized hostname itemnums), ordering by hits descending, and limiting to the top 10:


  mysql> select i.hostname, sum(hits) as hits, sum(page_views) as page_views, sum(size) as size from logfile l left join hostnameitemnum i on l.hostname = i.itemnum group by hostname order by hits desc limit 10;
  +-----------------------------+------+------------+----------+
  | hostname                    | hits | page_views | size     |
  +-----------------------------+------+------------+----------+
  | lipowitz.isdn.uiuc.edu      |  692 |        236 |  6348398 |
  | 192.17.19.150               |  466 |         46 |  6363739 |
  | 192.17.19.148               |  317 |        136 | 24279573 |
  | pale.kai.com                |  308 |         70 |  5289858 |
  | flowerfire.isdn.uiuc.edu    |  242 |         52 |  2337963 |
  | spider.unh.edu              |  171 |         19 |   433709 |
  | 195.101.37.244              |   87 |         21 |   575602 |
  | 206.148.222.50              |   79 |         25 |   876475 |
  | isdn-5.nii.enterconnect.net |   79 |         16 |   579010 |
  | gli2302.ctea.com            |   75 |          9 |   436318 |
  +-----------------------------+------+------------+----------+


Any of Sawmill's standard table reports can be generated similarly using SQL.
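The join-group-order pattern of the query above can also be run programmatically. The sketch below uses Python's sqlite3 module as a stand-in for MySQL, with a few made-up log events, to show the same select/sum/group/join/order/limit structure:

```python
# The grouped-join "top hostnames" pattern, sketched with SQLite as a
# stand-in for MySQL. The hostnames and event rows here are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE hostnameitemnum (itemnum INTEGER PRIMARY KEY, hostname TEXT);
INSERT INTO hostnameitemnum VALUES (1, 'pale.kai.com'), (2, 'spider.unh.edu');
CREATE TABLE logfile (hostname INTEGER, hits INTEGER, page_views INTEGER, size INTEGER);
INSERT INTO logfile VALUES (1, 1, 1, 500), (1, 1, 0, 300), (2, 1, 1, 200);
""")

# Join the itemnum table in, aggregate per hostname, and rank by hits.
rows = cur.execute("""
    SELECT i.hostname, SUM(l.hits) AS hits, SUM(l.size) AS size
    FROM logfile l LEFT JOIN hostnameitemnum i ON l.hostname = i.itemnum
    GROUP BY i.hostname ORDER BY hits DESC LIMIT 10
""").fetchall()
print(rows)  # → [('pale.kai.com', 2, 800), ('spider.unh.edu', 1, 200)]
```

Against a real Sawmill MySQL database, the same SELECT statement would be issued through a MySQL client connection instead of SQLite.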


Filtering By Itemnum

Now suppose we want to filter the result. Any type of filter is possible using a WHERE clause in the SQL query. In this case, we'll filter the report above to show only .com hostnames. By including "where i.hostname like '%.com'", the query now selects only those rows from logfile which contain hostname values ending with .com, when de-normalized. So, the resulting list contains all .com addresses:


  mysql> select i.hostname, sum(hits) as hits, sum(page_views) as page_views, sum(size) as size from logfile l left join hostnameitemnum i on l.hostname = i.itemnum where i.hostname like '%.com' group by hostname order by hits desc limit 10;
  +--------------------------------+------+------------+---------+
  | hostname                       | hits | page_views | size    |
  +--------------------------------+------+------------+---------+
  | pale.kai.com                   |  308 |         70 | 5289858 |
  | gli2302.ctea.com               |   75 |          9 |  436318 |
  | wat.thedj.com                  |   57 |         22 |  767059 |
  | tosainu.trimark.com            |   45 |         12 |  341875 |
  | gianduia.compecon.com          |   39 |          6 |  206124 |
  | ip252.ts4.phx.inficad.com      |   34 |          4 |  160689 |
  | h-205-217-240-156.netscape.com |   33 |          4 |  173806 |
  | clarendon.weblogic.com         |   32 |          4 |  151206 |
  | gw.vixel.com                   |   30 |          4 |  115755 |
  | cx51617-a.dnpt1.occa.home.com  |   30 |          4 |  150578 |
  +--------------------------------+------+------------+---------+

The filter doesn't have to work on only the primary field of the query (hostname); it can work on any fields in logfile, or the joined fields of any other table. MySQL supports many joins in a single query, so we can join in one table for the main column (hostname), and also join additional tables for filtering. For instance, here are the hostnames which accessed GIF images (the hits column shows how many GIF accesses each hostname had; the size column shows how many GIF bytes each hostname transferred):


  mysql> select i.hostname, sum(hits) as hits, sum(page_views) as page_views, sum(size) as size from logfile l left join hostnameitemnum i on l.hostname = i.itemnum left join file_typeitemnum fi on l.file_type = fi.itemnum where fi.file_type = 'GIF' group by hostname order by hits desc limit 10;
  +-----------------------------+------+------------+--------+
  | hostname                    | hits | page_views | size   |
  +-----------------------------+------+------------+--------+
  | lipowitz.isdn.uiuc.edu      |  432 |          0 | 447331 |
  | 192.17.19.150               |  420 |          0 | 209925 |
  | pale.kai.com                |  235 |          0 | 144966 |
  | 192.17.19.148               |  161 |          0 | 322118 |
  | spider.unh.edu              |  152 |          0 | 129566 |
  | flowerfire.isdn.uiuc.edu    |  151 |          0 | 433550 |
  | gli2302.ctea.com            |   66 |          0 |  56660 |
  | 195.101.37.244              |   65 |          0 |  87507 |
  | isdn-5.nii.enterconnect.net |   63 |          0 |  24508 |
  | 206.148.222.50              |   52 |          0 |  74631 |
  +-----------------------------+------+------------+--------+
  10 rows in set (0.03 sec)
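The double-join filter above follows the same shape: one itemnum table supplies the report column, another supplies the value tested in the WHERE clause. Here is a minimal sketch, again using Python's sqlite3 module as a stand-in for MySQL, with hypothetical rows:

```python
# Sketch of filtering through a second itemnum join, with SQLite as a
# stand-in for MySQL; the hostnames, file types, and events are made up.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE hostnameitemnum (itemnum INTEGER PRIMARY KEY, hostname TEXT);
INSERT INTO hostnameitemnum VALUES (1, 'pale.kai.com'), (2, 'spider.unh.edu');
CREATE TABLE file_typeitemnum (itemnum INTEGER PRIMARY KEY, file_type TEXT);
INSERT INTO file_typeitemnum VALUES (1, 'GIF'), (2, 'HTML');
CREATE TABLE logfile (hostname INTEGER, file_type INTEGER, hits INTEGER, size INTEGER);
INSERT INTO logfile VALUES (1, 1, 1, 400), (1, 1, 1, 300), (1, 2, 1, 900), (2, 1, 1, 250);
""")

# Join hostnameitemnum for the report column and file_typeitemnum for the
# filter; only GIF events survive the WHERE clause (the HTML hit is dropped).
rows = cur.execute("""
    SELECT h.hostname, SUM(l.hits) AS hits, SUM(l.size) AS size
    FROM logfile l
    LEFT JOIN hostnameitemnum h ON l.hostname = h.itemnum
    LEFT JOIN file_typeitemnum f ON l.file_type = f.itemnum
    WHERE f.file_type = 'GIF'
    GROUP BY h.hostname ORDER BY hits DESC
""").fetchall()
print(rows)  # → [('pale.kai.com', 2, 700), ('spider.unh.edu', 1, 250)]
```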



Conclusion

This newsletter describes the simple process for creating a MySQL profile in Sawmill, and using it to import log data into a SQL database. This process can be used to import any log data into a MySQL database. Sawmill already supports all common log formats (800+ different formats as of this writing), and if a particular format isn't on Sawmill's list, it can be added by creating a log format plug-in. This makes it possible to use Sawmill to run arbitrary SQL queries against any textual log data, by (1) creating a Sawmill profile from the log data with MySQL as the back-end database, (2) building the database in Sawmill, and (3) running SQL queries against the resulting database.

If you would like assistance in creating a plug-in for a log format you would like to query with SQL, or if you would like assistance creating SQL queries or scripts to extract the information you need from a Sawmill MySQL database, you can also use Sawmill Professional Services. Our experts have a thorough knowledge of Sawmill, log format plug-ins, and MySQL.


[Article revision v1.0]


Professional Services

If you would prefer not to customize Sawmill Analytics yourself, we can offer this as a service. Our experts will be happy to get in touch with you to adapt the reports or other aspects of Sawmill Analytics to your circumstances and requirements. Contact

© 1995-2011 HAAGE & PARTNER Computer GmbH · Impressum · Datenschutz · www.haage-partner.de