Sawmill-Tutorial
Using Database Merges
Note: Database merge is available only with the internal database;
it is not available for profiles that use a MySQL database.
A default profile created in Sawmill uses a single processor (single
core) to parse log data and build the database. This is a good choice
for shared environments, where using all processors can bog down the
system. For best performance, set "log processing threads" to the
number of processors, in the Log Processing options in the Config page
of the profile. That splits log processing across all processors on the
system, improving the performance of database builds and updates. This
option is available only with Sawmill Enterprise; non-Enterprise
versions of Sawmill can use only one processor.
If the dataset is too large to process in an acceptable time on a
single computer, even with multiple processors, it is possible to split
the processing across multiple machines. This is accomplished by
building a separate database on each system, and then merging them to
form a single large database. For instance, this command line adds the
data from the database for profile2 to the database for profile1:
sawmill -p profile1 -a md -mdd Databases/profile2/main
or on Windows:
SawmillCL -p profile1 -a md -mdd Databases\profile2\main
After this command completes, profile1 will show the data it showed
before the command, and the data that profile2 showed before the
command (profile2 will be unchanged).
This makes it possible to build a database twice as fast using this
sequence:
sawmill -p profile1 -a bd
sawmill -p profile2 -a bd
sawmill -p profile1 -a md -mdd Databases/profile2/main
(Use SawmillCL and backslashes on Windows, as shown above.)
The critical piece is that the first two commands must run
simultaneously; if you run them one after another, they will take as
long as building the whole database in a single step. On a
two-processor system, each build can use a full CPU, so together they
run nearly twice as fast as a single build. The merge then takes some
extra time, but overall this is still faster than a single-process
build.
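As a rough illustration of that sequence, the Python sketch below starts both builds before waiting on either, and runs the merge only after both have finished. It is a sketch under assumptions, not part of Sawmill: the function names and the `windows` flag are hypothetical, and it assumes the commands are run from a directory where the executable and the relative Databases path resolve exactly as in the command lines above.

```python
import subprocess


def build_command(profile, windows=False):
    # "-a bd" is the build-database action used in the tutorial.
    exe = "SawmillCL" if windows else "sawmill"
    return [exe, "-p", profile, "-a", "bd"]


def merge_command(target, source, windows=False):
    # "-a md -mdd <path>" adds the source profile's database to the target's.
    exe = "SawmillCL" if windows else "sawmill"
    sep = "\\" if windows else "/"
    source_db = sep.join(["Databases", source, "main"])
    return [exe, "-p", target, "-a", "md", "-mdd", source_db]


def build_both_then_merge(target="profile1", source="profile2", windows=False):
    # Start both builds first so that they overlap in time.
    builds = [subprocess.Popen(build_command(p, windows)) for p in (target, source)]
    # Wait for both builds to exit before deciding whether to merge.
    exit_codes = [proc.wait() for proc in builds]
    if any(code != 0 for code in exit_codes):
        raise RuntimeError("a build failed; skipping the merge")
    subprocess.run(merge_command(target, source, windows), check=True)
```

Calling `build_both_then_merge()` would launch the three commands in the order shown above; the helper functions only assemble argument lists, so they can be inspected without Sawmill installed.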
Running a series of builds simultaneously can be done by opening
multiple windows and running a separate build in each window, or by
"backgrounding" each command before starting the next (available on
UNIX and similar systems). But for a fully automated environment, this
is best done with a script. The attached Perl script, multisawmill.pl,
can be used to build multiple databases simultaneously. You will need
to modify the top of the script to match your environment, and set the
number of threads; then when you run it, it will spawn many database
builds simultaneously (the number you specified), and as each
completes, it will start another one. This script is provided as-is,
with no warranty, as a proof-of-concept of a
multiple-simultaneous-build script.
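The attached script is not reproduced here. As a separate, minimal sketch of the same idea (not the contents of multisawmill.pl), a fixed-size worker pool in Python can keep a set number of builds running and start the next one whenever a slot frees up. The names `run_build`, `build_all`, and `max_parallel` are hypothetical; the command line is the build command shown earlier, and the sketch assumes `sawmill` can be run from the current directory.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_build(profile):
    # Blocks until this profile's build exits; returns (profile, exit code).
    result = subprocess.run(["sawmill", "-p", profile, "-a", "bd"])
    return profile, result.returncode


def build_all(profiles, max_parallel, runner=run_build):
    # Each worker thread just waits on one external build process, so at
    # most max_parallel builds run at once; when one finishes, the pool
    # gives that worker the next profile in the list.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        results = list(pool.map(runner, profiles))
    # Report the profiles whose build exited with a non-zero status.
    return [profile for profile, code in results if code != 0]
```

The `runner` parameter exists only so the scheduling can be tried with a stand-in function before pointing it at real builds.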
Using the attached script, or something like it, you can apply this
approach to much larger datasets, for instance to build a year of data:
1. Create a profile for each day in the year (it is probably easiest
to use Create Many Profiles to do this; see Setting Up Multiple Users
in the Sawmill documentation).
2. Build all profiles, 8 at a time (or however many cores you have
available). If you have multiple machines available, you can use
multiple installations of Sawmill, partitioning the profiles across the
systems. For instance, if you have two 8-core nodes in the Sawmill
cluster, you could build 16 databases at a time, or if you had four
8-core nodes in the cluster, you could build 32 databases at a time.
This portion of the build can give a nearly linear speedup: up to about
32x faster log processing than a single process, on an 8-core, 4-node
cluster.
3. Merge all the databases. The simplest way to do this, in a 365-day
example, is to run 364 merges, adding each day into the final one-year
database.
When the merge is done, the one-year database will function as though
it had been built in a single "build database" step--but it will have
taken much less time to build.
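The three steps above can be sketched in Python as a planning script that only generates names and command lines. Everything specific here is an assumption for illustration: the `day_YYYY_MM_DD` profile naming scheme, the round-robin split across nodes, and the choice to reach 364 merges by merging every later day into the first day's profile, whose database then becomes the one-year database. It also assumes each database directory is named after its profile, as in the Databases/profile2/main example.

```python
from datetime import date, timedelta


def daily_profiles(year):
    # Hypothetical naming scheme: one profile per calendar day.
    day = date(year, 1, 1)
    names = []
    while day.year == year:
        names.append(day.strftime("day_%Y_%m_%d"))
        day += timedelta(days=1)
    return names


def partition(profiles, nodes):
    # Deal profiles round-robin so each node gets a near-equal share.
    return [profiles[i::nodes] for i in range(nodes)]


def sequential_merges(profiles):
    # Merge every later profile into the first one, one command per merge.
    target = profiles[0]
    return [
        ["sawmill", "-p", target, "-a", "md", "-mdd", "Databases/%s/main" % source]
        for source in profiles[1:]
    ]
```

For a 365-day year, `sequential_merges(daily_profiles(2023))` yields 364 command lines, which would be run one after another once all builds have completed.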
Advanced Topic: Using Binary Merges
The example described above uses "sequential merges" for step 3--it
runs 364 separate merge steps, one after another, to create the final
database. Each of these merges uses only a single processor of a single
node, so this portion of the build does not use the cluster
efficiently, and step 3 can take longer than step 2: the merge can be
slower than the processing and building of data. To improve this, a
more sophisticated merge method can be scripted, using a "binary tree"
of merges to build the final database. Roughly, each core on each node
is assigned two one-day databases, which it merges to form a two-day
database. Then each core of each node is assigned two two-day
databases, which it merges to form a four-day database. This continues
until a final merge combines two half-year databases into a one-year
database. The number of merge stages is much smaller than the number of
merges required if done sequentially.
For simplicity, let's assume we're merging 16 days, on a 4-core
cluster. On a 4-core cluster, we can do 4 merges at a time.
Step 1, core 1: Merge day1 with day2, creating day[1,2].
Step 1, core 2: Merge day3 with day4, creating day[3,4].
Step 1, core 3: Merge day5 with day6, creating day[5,6].
Step 1, core 4: Merge day7 with day8, creating day[7,8].
When those are complete, we would continue:
Step 2, core 1: Merge day9 with day10, creating day[9,10].
Step 2, core 2: Merge day11 with day12, creating day[11,12].
Step 2, core 3: Merge day13 with day14, creating day[13,14].
Step 2, core 4: Merge day15 with day16, creating day[15,16].
Now we have taken 16 databases and merged them in two steps into 8
databases. Now we merge them into four databases:
Step 3, core 1: Merge day[1,2] with day[3,4], creating day[1,2,3,4].
Step 3, core 2: Merge day[5,6] with day[7,8], creating day[5,6,7,8].
Step 3, core 3: Merge day[9,10] with day[11,12], creating
day[9,10,11,12].
Step 3, core 4: Merge day[13,14] with day[15,16], creating
day[13,14,15,16].
Now we merge into two databases:
Step 4, core 1: Merge day[1,2,3,4] with day[5,6,7,8], creating
day[1,2,3,4,5,6,7,8].
Step 4, core 2: Merge day[9,10,11,12] with day[13,14,15,16], creating
day[9,10,11,12,13,14,15,16].
And finally:
Step 5, core 1: Merge day[1,2,3,4,5,6,7,8] with
day[9,10,11,12,13,14,15,16], creating
day[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16].
So in 5 steps, we have built what would have required 15 steps using
sequential merges: a 16-day database. This approach can be used to
speed up much larger merges even more.
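The schedule in this example can be generated mechanically. The Python sketch below is one possible way to do it, with hypothetical names throughout: it represents each database as the tuple of days it contains, pairs up neighbouring databases level by level, and splits each level into steps of at most `cores` simultaneous merges. It only plans the merges; it does not run any Sawmill commands.

```python
def binary_merge_schedule(days, cores):
    # Each database is represented by the tuple of days it contains.
    level = [(d,) for d in days]
    steps = []
    while len(level) > 1:
        # Pair up neighbours; an odd database out waits for the next level.
        pairs = [(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        leftover = [level[-1]] if len(level) % 2 else []
        # Only `cores` merges can run at once, so split the level into steps.
        for start in range(0, len(pairs), cores):
            steps.append(pairs[start:start + cores])
        # The merged databases form the next level of the tree.
        level = [left + right for left, right in pairs] + leftover
    return steps


if __name__ == "__main__":
    for number, step in enumerate(binary_merge_schedule(range(1, 17), 4), 1):
        for core, (left, right) in enumerate(step, 1):
            print("Step %d, core %d: merge %s with %s" % (number, core, left, right))
```

With 16 days and 4 cores this produces the same 5 steps as listed above (15 merges in total). Under the same scheme, 365 days on 32 cores would come to 17 steps, compared with 364 sequential merge steps.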
Advanced Topic: Re-using One-Day Databases
In the approach above, the one-day databases are not destroyed by the
merge, which reads data from them but does not write to them. This
makes it possible to keep the one-day databases for fast access to
reports from a particular day. If the one-day databases are left in
place after the merge is complete, users will be able to select a
particular database from the Profiles list, to see fast reports for
just that day (a one-day database generates reports much faster than a
365-day database).
Advanced Topic: Using Different Merge Units
In the discussion above, we used one day as the unit of merge, but any
unit can be used. In particular, if you are generating a database
showing reports from 1000 sites, you could use a site as the unit.
After building the databases from 1000 sites, you could then
merge all 1000 databases to create an all-sites profile for
administrative overview, leaving each of the 1000 one-site profiles to
be accessed by its users.