HAAGE&PARTNER Computer GmbH

Sawmill Analytics 8 | Log Analysis

Sawmill Tutorial

Improving the Performance of Database Updates


A typical installation of Sawmill updates the database for each profile nightly. Each profile points its log source to the growing dataset it is analyzing, and each night, a scheduled database update looks for new data in the log source, and adds it to the database.

This works fine for most situations, but if the dataset gets very large, or has a very large number of files in it, updates can become too slow. In a typical "nightly update" environment, the updates are too slow if they have not completed by the time reports are needed in the morning. For instance, if updates start at 8pm each night, and take 12 hours, and live reports are needed between 7am and 8pm, then the update is too slow, because it will not complete until 8am, and reports will not be available from 7am to 8am. The downtime can be completely eliminated by using separate installations of Sawmill (one for reporting, and one for updating), but even then, updates can be too slow, if they take more than 24 hours to update a day of data.
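The arithmetic in the example above can be checked with a short sketch. The function names and the calendar date are illustrative assumptions; only the 8pm start, the 12-hour duration, and the 7am reporting time come from the text.

```python
from datetime import datetime, timedelta


def update_finish(start, duration_hours):
    """Return the time at which an update started at `start` completes."""
    return start + timedelta(hours=duration_hours)


def report_downtime(start, duration_hours, reports_needed_at):
    """Return how long reports are unavailable after they are first needed."""
    finish = update_finish(start, duration_hours)
    return max(finish - reports_needed_at, timedelta(0))


# The example from the text: updates start at 8pm and take 12 hours,
# and reports are needed from 7am the next morning.
start = datetime(2011, 1, 1, 20, 0)    # arbitrary date, 8pm
needed = datetime(2011, 1, 2, 7, 0)    # 7am the following day
downtime = report_downtime(start, 12, needed)
print(downtime)  # 1:00:00, i.e. reports are missing from 7am to 8am
```

An update that finishes before reports are needed yields a downtime of zero, which is the target for a single-installation nightly setup.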

There are many ways of making database updates faster. This newsletter lists common approaches, and discusses each.


1. Use a local file log source

If you're pulling your log data from an FTP site, or from an HTTP site, or using a command line log source, Sawmill does not have as much information to efficiently skip the data in the log files. It will need to re-download and reconsider more data than if the files are on a local disk, or a locally mounted disk. Using a local file log source will speed up updates, by allowing Sawmill to skip previously seen files faster. A "local file log source" includes mounted or shared drives, including mapped drive letters, UNC paths, NFS mounts, and AppleShare mounts; these are more efficient than FTP, HTTP, or command line log sources, for skipping previously seen data.


2. Use a local file log source on a local disk

As mentioned above (1), network drives are still "local file log sources" to Sawmill, because it has full access to examine all their attributes as though they were local drives on the Sawmill server. This gives Sawmill a performance boost over FTP, HTTP, and command line log sources. But with network drives, all the information still has to be pulled over the network. For better performance, use a local drive, so the network is not involved at all. For instance, on Windows, put the logs on the C: drive, or some other drive physically inside the Sawmill server. Local disk access is much faster than network access, so using local files can significantly speed up updates.

If the logs are initially generated on a system other than the Sawmill server, this approach requires transferring them to the local disk before processing. This can be done in a separate step, using a third-party program. rsync is a good choice, and works on all operating systems (on Windows, it can be installed as part of Cygwin). On Windows, DeltaCopy is also a good choice. Most high-end FTP clients also support scheduling of transfers, and incremental transfers (transferring only files which have not been transferred earlier). The file transfers can be scheduled to run periodically during the day; unlike database updates, they can run during periods when reports must be available.
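To illustrate what "incremental transfer" means here, the following is a minimal sketch that copies only files not yet present in a local directory. It is not a replacement for rsync or DeltaCopy, which handle far more cases; the function name `sync_new_files` and the "same name and same size" rule are assumptions made for the example.

```python
import shutil
from pathlib import Path


def sync_new_files(source_dir, dest_dir):
    """Copy files from source_dir that are not yet present in dest_dir.

    Returns the names of the files copied. A file counts as already
    transferred if dest_dir holds a file with the same name and size
    (an assumption of this sketch; rsync uses more careful checks).
    """
    source = Path(source_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(source.iterdir()):
        if not src.is_file():
            continue
        target = dest / src.name
        if target.exists() and target.stat().st_size == src.stat().st_size:
            continue  # transferred on an earlier run
        shutil.copy2(src, target)  # copy2 also preserves the modification time
        copied.append(src.name)
    return copied
```

Run periodically (for example from cron or the Windows Task Scheduler) with the mounted remote log directory as `source_dir`, a function like this keeps the local copy current without re-copying files that have already been transferred.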


3. Turn on "Skip processed files on update by pathname"

During a database update, Sawmill must determine which log data it has already imported into the database, and import the rest. By default, Sawmill does this by comparing the first few kilobytes of each file with the first few kilobytes of files which have been imported (by comparing checksums). When it encounters a file it has seen before, it checks if there is new data at the end of it, by reading through the file past the previously seen data, and resuming reading when it reaches the end of the data it has seen before. This is a very robust way of detecting previously seen data, as it allows files to be renamed, compressed, or concatenated after processing; Sawmill will still recognize them as previously-seen. However, the algorithm requires Sawmill to look through all the log data, briefly, to determine what it has seen. For very large datasets, especially datasets with many files, this can become the longest part of the update process.
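The idea of recognizing a file by a checksum of its first few kilobytes can be sketched as follows. This is not Sawmill's actual implementation: the 4096-byte head size, the SHA-256 hash, the `seen` dictionary, and the function names are all assumptions, and the sketch resumes at a stored byte offset rather than reading through the file as described above.

```python
import hashlib
import os

# "First few kilobytes"; the exact amount Sawmill reads is not stated.
HEAD_BYTES = 4096


def head_checksum(path):
    """Checksum of the first HEAD_BYTES of a file, independent of its name."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read(HEAD_BYTES)).hexdigest()


def plan_update(paths, seen):
    """Decide where to start reading each file.

    `seen` maps a head checksum to the number of bytes already imported
    from the file with that head. Returns (path, offset) pairs for files
    that contain data beyond what was imported; fully imported files are
    omitted.

    Limitation of this sketch: a file shorter than HEAD_BYTES that later
    grows changes its head checksum and would be treated as new.
    """
    plan = []
    for path in paths:
        offset = seen.get(head_checksum(path), 0)
        if os.path.getsize(path) > offset:
            plan.append((path, offset))
    return plan
```

Because the lookup key is derived from content, a renamed file is still recognized. The cost is that every file must be opened and partly read on every update, which is the overhead the pathname option avoids.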

A solution is to skip files based on their pathnames, rather than their contents. Under Config -> Log Data -> Log Processing, there is an option "Skip processed files on update by pathname." If this option is checked, Sawmill will look only at the pathname of a file when determining if it has seen that data before. If the pathname matches the pathname of a previously processed file, Sawmill will skip the entire file. If the pathname does not match, Sawmill will process it. Skipping based on pathnames takes almost no time, so turning this option on can greatly speed updates, if the skipping step is taking a lot of the time.

This will not work if any of the files in the log source are growing. If the log source is a log file which is continually being appended to, Sawmill will put that log file's data into the database, and will skip that file on the next update even though it now has new data at the end, because the pathname matches (and with this option on, only the pathname is used to determine what is new). So this option works best for datasets which appear on the disk one complete file at a time, and where files do not gradually appear during the time when Sawmill might be updating.

Typically, this option can be used by processing log data on a local disk, setting up file synchronization (see 2, above), and having it synchronize only the complete files. It can also be used if logs are compressed each day, to create daily compressed logs; then the compressed logs are complete, and can be used as the log source, and the uncompressed, growing log will be ignored because it does not end with the compression extension (e.g., .zip, .gz, or .bz2).

Finally, there is another option, "Skip most recent file" (also under Config -> Log Data -> Log Processing), which looks at the modification date of each file in the log source (which works for "local file" log sources only, but remember, that includes network drives), and skips the file with the most recent modification date. This allows fast analysis of servers, like IIS, which timestamp their logs, but do not compress them or rotate them; only the most recent log is changing, and all previous days' logs are fixed, so by skipping the most recent one, we can safely skip based on pathnames.
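How pathname-based skipping combines with "Skip most recent file" can be sketched as below. This only illustrates the selection logic described above and is not Sawmill's code; the function name `select_files` and the `processed` set are hypothetical.

```python
import os


def select_files(paths, processed, skip_most_recent=True):
    """Pick the files to import, judging "already seen" by pathname only.

    `processed` is the set of pathnames imported by earlier updates.
    With skip_most_recent set, the file with the latest modification
    time is left out, on the assumption that it is still being written.
    """
    candidates = list(paths)
    if skip_most_recent and candidates:
        newest = max(candidates, key=os.path.getmtime)
        candidates.remove(newest)
    # No file contents are read: membership in `processed` decides everything.
    return [p for p in candidates if p not in processed]
```

With daily timestamped logs, today's still-growing file is skipped, yesterday's now-complete file is imported once, and every older file is skipped without being opened.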


4. Keep the new data in a separate directory

For fully automated Sawmill installations, there is often a scripting environment built around Sawmill, which manages log rotation, compression, import into the database, report generation, etc. In an environment like this, it is usually simple to handle the "previously seen data" algorithm at the master script level, by managing Sawmill's log source so it only has data which has not been imported into the database. This could be done by moving all processed logs to a "processed" location (a separate directory or folder), after each update; or it could be handled by copying the logs to be processed into a "newlogs" folder, updating using that folder as the log source, and then deleting everything in "newlogs" until the next update. By ensuring, at the master script level, that the log source contains only the new data, you can bypass Sawmill's skipping algorithm entirely, and get the best possible performance.
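A master script of this kind might look like the following sketch, which uses the "move to a processed location" variant. The directory names and the `run_update` callable are hypothetical; `run_update` stands in for however your environment triggers the Sawmill database update and is expected to return True on success.

```python
import shutil
from pathlib import Path


def update_and_archive(newlogs_dir, processed_dir, run_update):
    """Update from newlogs_dir, then move the imported files away.

    run_update(newlogs) is a hypothetical callable that runs the database
    update with newlogs_dir as the log source and returns True on success.
    Returns the names of the files archived.

    Assumes no files are added to newlogs_dir while the update is running;
    a file arriving mid-update could be imported now and again next time.
    """
    newlogs = Path(newlogs_dir)
    processed = Path(processed_dir)
    processed.mkdir(parents=True, exist_ok=True)

    batch = sorted(p for p in newlogs.iterdir() if p.is_file())
    if not batch:
        return []  # nothing new, so no update is needed
    if not run_update(newlogs):
        return []  # leave the files in place so the next run retries them

    for p in batch:
        shutil.move(str(p), str(processed / p.name))
    return [p.name for p in batch]
```

Moving files only after a successful update means a failed run loses nothing: the same files are still in the log source for the next attempt.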


5. Speed up the database build, or the merge

The choices above are about speeding up the "skip previously-seen data" part of a database update. But database updates have three parts: they skip the previously seen data, then build a separate database from the new data, and then merge that database into the main database. Anything that would normally speed up a database build, will speed up a database update, and it will usually speed up the merge too. For instance, deleting database fields, deleting cross-reference tables, rejecting unneeded log entries, and simplifying database fields with log filters, can all reduce the amount and complexity of data in the database, speeding up database builds and updates.

With Enterprise licensing, on a system with multiple processors or cores, it is also possible to set "Log processing threads" to 2, 4, or more, in Config -> Log Data -> Log Processing. This tells Sawmill to use multiple processors or cores during the "build" portion of the database update (when it's building the separate database from the new data), which can significantly improve the performance of that portion. However, it increases the amount of work to be done in the "merge" step, so using more threads does not always result in a speed increase for updates.

Active-scanning anti-virus can severely affect the performance of both builds and updates, by scanning Sawmill's database files continually as it attempts to modify them. Performance can be 10x slower in extreme cases, when active scanning is enabled. This is particularly marked on Windows systems. If you have an anti-virus product which actively scans all file system modifications, you should exclude Sawmill's installation directory, and its database directories (if separate) from the active scanning.

Use of a MySQL database has its advantages, but performance is not one of them--Sawmill's internal database is at least 2x faster than MySQL for most operations, and much faster for some. Unless you need MySQL for some other reason (like to query the imported data directly with SQL, from another program; or to overcome the address space limitations of a 32-bit server), use the internal database for best performance of both rebuilds and updates.

Finally, everything speeds up when you have faster hardware. A faster CPU will improve update times, and a faster disk may have an even bigger effect. Switching from RAID 5 to RAID 10 will typically double the speed of database builds and updates, and switching from 10Krpm to 15Krpm disks can give a 20% performance boost. Adding more memory can help too, if the system is near its memory limit.




© 1995-2011 HAAGE & PARTNER Computer GmbH · www.haage-partner.de