
Sawmill Tutorial

Using Database Merges


Note: Database merge is available only with the internal database; it is not available for profiles that use a MySQL database.

A default profile created in Sawmill uses a single processor (a single core) to parse log data and build the database. This is a good choice for shared environments, where using all processors could bog down the system; but for best performance, set "log processing threads" to the number of processors, in the Log Processing options of the profile's Config page. That splits log processing across multiple processors, improving the performance of database builds and updates by using all processors on the system. This is available only with Sawmill Enterprise; non-Enterprise versions of Sawmill can use only one processor.
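For example, if your Sawmill version supports overriding profile options by name on the command line, a multi-threaded build on an 8-core system might be started as follows (the internal option name log.processing.threads and the override syntax are assumptions here; verify both against your version's documentation):

  # hypothetical: override the profile's thread count for this build
  sawmill -p profile1 -a bd -log.processing.threads 8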

If the dataset is too large to process in an acceptable time on a single computer, even with multiple processors, it is possible to split the processing across multiple machines. This is accomplished by building a separate database on each system, and then merging them to form a single large database. For instance, this command line adds the data from the database for profile2 to the database for profile1:

  sawmill -p profile1 -a md -mdd Databases/profile2/main

or on Windows:

  SawmillCL -p profile1 -a md -mdd Databases\profile2\main

After this command completes, profile1 will show both the data it showed before the command and the data profile2 showed before the command (profile2 itself is unchanged).

This makes it possible to build a database nearly twice as fast, using this sequence:

  sawmill -p profile1 -a bd
  sawmill -p profile2 -a bd
  sawmill -p profile1 -a md -mdd Databases/profile2/main

(On Windows, use SawmillCL and backslashes, as shown above.)

The critical point is that the first two commands must run simultaneously; run one after the other, they take as long as building the whole database in one step. On a two-processor system, however, each build can use a full CPU, so together they finish in nearly half the time of a single build. The merge then adds some extra time, but overall this is still faster than a single-process build.

Running several builds simultaneously can be done by opening multiple windows and running a separate build in each, or by "backgrounding" each command before starting the next (available on UNIX and similar systems); a minimal sketch follows below. For a fully automated environment, this is best done with a script. The attached Perl script, multisawmill.pl, can be used to build multiple databases simultaneously. Modify the top of the script to match your environment and set the number of threads; when run, it spawns the specified number of database builds simultaneously, and as each completes, it starts another. The script is provided as-is, without warranty, as a proof of concept for a multiple-simultaneous-build script.
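As a minimal sketch of the backgrounding approach, the two-profile sequence from above can be run on UNIX like this (profile names as in the earlier examples):

  # Start both builds in the background so they run at the same time
  sawmill -p profile1 -a bd &
  sawmill -p profile2 -a bd &

  # Wait for both background builds to finish before merging
  wait

  # Merge profile2's database into profile1's database
  sawmill -p profile1 -a md -mdd Databases/profile2/main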

Using the attached script, or something like it, you can apply this approach to much larger datasets, for instance to build a year of data:

  1. Create a profile for each day in the year (it is probably easiest to use Create Many Profiles to do this; see Setting Up Multiple Users in the Sawmill documentation).

  2. Build all the profiles, 8 at a time (or however many cores you have available). If you have multiple machines, you can use multiple installations of Sawmill by partitioning the profiles across the systems. For instance, with two 8-core nodes in the Sawmill cluster you could build 16 databases at a time, and with four 8-core nodes you could build 32 at a time. This portion of the build scales nearly linearly, giving close to 32x faster log processing than a single process when using an 8-core, 4-node cluster.

  3. Merge all the databases. The simplest way, in a 365-day example, is to run 364 sequential merges, adding each day into the final one-year database (see the sketch below).
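As a sketch of step 3, assuming the daily profiles are named day1 through day365 and the merges accumulate into a separate profile named year whose database initially holds day 1's data (all names are hypothetical; adjust paths to your installation):

  # Merge the remaining 364 daily databases into the year profile,
  # one after another (each merge reads a daily database but does
  # not modify it)
  for n in $(seq 2 365); do
    sawmill -p year -a md -mdd "Databases/day$n/main"
  done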

When the merge is done, the one-year database will function as though it had been built in a single "build database" step, but it will have taken much less time to build.


Advanced Topic: Using Binary Merges

The example described above uses "sequential merges" for step 3: it runs 364 separate merges, one after another, to create the final database. Each merge uses only a single processor of a single node, so this portion of the build does not use the cluster efficiently; step 3 can therefore take longer than step 2, with the merging slower than the log processing and building. To improve this, a more sophisticated merge method can be scripted, using a "binary tree" of merges to build the final database. Roughly, each core of each node is assigned two one-day databases, which it merges to form a two-day database. Then each core of each node is assigned two two-day databases, which it merges to form a four-day database. This continues until a final merge combines two half-year databases into a one-year database. The number of merge stages is much smaller than the number of merges required when done sequentially.

For simplicity, let's assume we're merging 16 days on a 4-core cluster, which can run 4 merges at a time.

  Step 1, core 1: Merge day1 with day2, creating day[1,2].
  Step 1, core 2: Merge day3 with day4, creating day[3,4].
  Step 1, core 3: Merge day5 with day6, creating day[5,6].
  Step 1, core 4: Merge day7 with day8, creating day[7,8].

When those are complete, we would continue:

  Step 2, core 1: Merge day9 with day10, creating day[9,10].
  Step 2, core 2: Merge day11 with day12, creating day[11,12].
  Step 2, core 3: Merge day13 with day14, creating day[13,14].
  Step 2, core 4: Merge day15 with day16, creating day[15,16].

We have now taken the 16 databases and, in two steps, merged them into 8. Next we merge those into four databases:

  Step 3, core 1: Merge day[1,2] with day[3,4], creating day[1,2,3,4].
  Step 3, core 2: Merge day[5,6] with day[7,8], creating day[5,6,7,8].
  Step 3, core 3: Merge day[9,10] with day[11,12], creating day[9,10,11,12].
  Step 3, core 4: Merge day[13,14] with day[15,16], creating day[13,14,15,16].

Now we merge into two databases:

  Step 4, core 1: Merge day[1,2,3,4] with day[5,6,7,8], creating day[1,2,3,4,5,6,7,8].
  Step 4, core 2: Merge day[9,10,11,12] with day[13,14,15,16], creating day[9,10,11,12,13,14,15,16].

And finally:

  Step 5, core 1: Merge day[1,2,3,4,5,6,7,8] with day[9,10,11,12,13,14,15,16], creating day[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16].

So in 5 steps we have built what would have required 15 sequential merges: a 16-day database. This approach can speed up much larger merges even more.
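The pattern generalizes to a short script. A minimal UNIX shell sketch, assuming profiles named day1 through day16 and enough cores to run each round's merges at once (all names are hypothetical; on fewer cores, each round would be split into batches, as in steps 1 and 2 above):

  # Binary-tree merge: in each round, merge every right-hand database
  # into its left-hand partner, doubling the stride each round.
  # Note: the left-hand (destination) databases are modified; copy them
  # first if you want to keep every one-day database (see the next topic).
  n=16          # number of daily databases (a power of two, for simplicity)
  stride=1
  while [ "$stride" -lt "$n" ]; do
    i=1
    while [ "$i" -le "$n" ]; do
      sawmill -p "day$i" -a md -mdd "Databases/day$((i + stride))/main" &
      i=$((i + 2 * stride))
    done
    wait        # all merges in this round run in parallel
    stride=$((stride * 2))
  done
  # day1's database now contains all 16 days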


Advanced Topic: Re-using One-Day Databases

In the approach above, the one-day databases are not destroyed by the merge, which reads data from them but does not write to them. This makes it possible to keep the one-day databases for fast access to reports on a particular day. By leaving the one-day databases in place after the merge is complete, users can select a particular day's database from the Profiles list to see fast reports for just that day (reports generate much faster from a one-day database than from a 365-day database).


Advanced Topic: Using Different Merge Units

In the discussion above, we used one day as the unit of merging, but any unit can be used. For instance, if you are generating a database with reports for 1000 sites, you could use the site as the unit: after building 1000 one-site databases, merge them all to create an all-sites profile for administrative overview, leaving each of the 1000 one-site profiles to be accessed by its own users.




