From: Eric Lease Morgan <eric_morgan@ncsu.edu>
Date: Fri, 28 Jan 2000 13:10:54 -0500
Newsgroups: comp.infosystems.harvest
Subject: log analysis


As a librarian I like to see what people are searching for when they
query my Harvest indexes; I do log analysis against my log files.

Below are the two files I use for such analysis. First I call
extract-queries.csh, which saves the current month's queries to a file.
It then runs analyse-queries.pl, which does the actual analysis and
saves the result as an HTML file.

Enjoy?


#!/bin/csh

# extract-queries.csh - extract the harvest queries and analyse them

# Eric Lease Morgan
# http://www.lib.ncsu.edu/staff/morgan/

# 01/28/99 - edited for tomato juice
# 09/17/99 - moved to hegel.lib
# 04/29/98 - changed the date format to remove the year 2000 problem
# 02/13/98 - removed extracting JOUR and NEWS
# 12/31/96 - first cut

# get the year and month
set d = `date +%Y%m`

# extract queries from LIB
grep "broker: $d" /usr/local/harvest/brokers/TOMATO-JUICE/broker.out | \
grep "#END" > /disk02/local/apache/htdocs/stacks/serials/tomato-juice/queries/$d.log
  
# analyse the results
cat /disk02/local/apache/htdocs/stacks/serials/tomato-juice/queries/$d.log | \
~/bin/analyse-queries.pl > \
/disk02/local/apache/htdocs/stacks/serials/tomato-juice/queries/$d.html

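To make the hand-off between the two scripts concrete, here is a minimal
sketch of how a query is pulled out of one log line. The sample line is
hypothetical: only the "broker: YYYYMM" prefix and the "#END " marker come
from the scripts in this post, and extract_query is a name made up for
illustration. Your broker.out may well look different.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# HYPOTHETICAL log line: everything between the "broker: 200001" prefix
# and the "#END " marker is invented for this example
my $line = "broker: 20000128 131054: processed query #END library and catalog\n";

# return the text after the first "#END ", or undef if the marker is missing
sub extract_query {
    my ($l) = @_;
    chomp $l;
    my (undef, $q) = split /#END /, $l, 2;
    return $q;
}

print extract_query($line), "\n";
```

The limit of 2 on split keeps the query in one piece even if the query
text itself happens to contain "#END ".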


#!/usr/local/bin/perl

# analyse-queries.pl - analyse the harvest queries

# example usage: cat ~/logfiles/queries.log | analyse-queries.pl > query-report.html

# Eric Lease Morgan
# http://www.lib.ncsu.edu/staff/morgan/

# 09/17/99 - moved to hegel
# 12/31/96 - first cut

# process every line of STDIN
while (<>) {
        
	# remove the trailing newline character, if there is one
	chomp;
        
	# split out the queries and add them to an array
	# where every line is expected to contain #END
	($trash, $q) = split (/\#END /);
	push (@queries, "$q\n");
	}

# analyse the queries in terms of their structure
for (@queries) {
	
	if (/"/)       { $phrase++; }
	if (/ and /i)  { $and++; }
	if (/ or /i)   { $or++; }
	if (/\*/)      { $truncation++; }
	if (/\(/)      { $compound++; }
	if (/: /)      { $field++; }
	if (!/ /)      { $singleTerm++; }
	}

# get the month and year; localtime returns the month as 0-11
# and the year as years since 1900
($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst) = localtime(time);
$mon  = $mon + 1;
$year = $year + 1900;
$theDate = "$mon/$year";

# start the html
print "<html>\n";
print "<head>\n";
print "<title>\n";
print "EMORGAN query analysis for $theDate";
print "</title>\n";
print "</head>\n";
print "<body>\n";
print "<h1>Index Morganagus query analysis for $theDate</h1>";
print "<pre>\n";

# print the results
$label = "Phrase";           $value = "$phrase";     write;
$label = "Truncation";       $value = "$truncation"; write;
$label = "Logical AND";      $value = "$and";        write;
$label = "Logical OR";       $value = "$or";         write;
$label = "Compound";         $value = "$compound";   write;
$label = "Field";            $value = "$field";      write;
$label = "Single term";      $value = "$singleTerm"; write;
$label = "Total searches";   $value = scalar(@queries); write;


# format the report
format top=

Searches classified by type

Search type           Number
----------------------------
.
format STDOUT =
@<<<<<<<<<<<<<<<<< @>>>>>>>>
$label,               $value
.

# sort the queries, count each one, tabulate the results, and sort again
@queries = sort @queries;
for (@queries) { $c{$_}++; }
for (sort keys %c) { push (@cQueries, sprintf "%d\t %s", $c{$_}, $_ ); }
@cQueries = sort numerically @cQueries;

# print tabulated results
print "\n\n";
print "Tabulated searches\n\n";
print "Number\t Search\n";
print "-------------------------------------------------\n";
print @cQueries;
print "\n";

# end the file
print "</pre>\n";
$theDate = "$mon/$mday/$year";
print "<HR>\n\n";
print "Updated: $theDate<br>";
print "Author: <a href=\"http://www.lib.ncsu.edu/staff/morgan/\">Eric Lease Morgan</a> (eric_morgan\@ncsu.edu).<p>";
print "This page is a part of <a href=\"/~emorgan/morganagus/\">Index Morganagus</a>.";
print "</body>\n";
print "</html>\n";

# exit gracefully
exit;

# a subroutine for sorting in reverse numeric order
sub numerically { $b <=> $a; }
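
If you want to try the classification rules without a real log file, the
following is a minimal sketch that applies the same pattern tests to a few
made-up queries. The classify function, the category names, and the sample
queries are assumptions for illustration; they are not part of the script
above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# classify one query string; returns the list of categories it matches,
# using the same pattern tests as analyse-queries.pl
sub classify {
    my ($q) = @_;
    my @types;
    push @types, 'phrase'     if $q =~ /"/;
    push @types, 'and'        if $q =~ / and /i;
    push @types, 'or'         if $q =~ / or /i;
    push @types, 'truncation' if $q =~ /\*/;
    push @types, 'compound'   if $q =~ /\(/;
    push @types, 'field'      if $q =~ /: /;
    push @types, 'single'     if $q !~ / /;
    return @types;
}

# made-up sample queries, for illustration only
my @samples = (
    'librar*',
    '"digital library" and serials',
    'title: (cats or dogs)',
);

# tally how many sample queries fall into each category
my %count;
for my $q (@samples) {
    $count{$_}++ for classify($q);
}

printf "%-12s %d\n", $_, $count{$_} for sort keys %count;
```

Note that one query can land in several categories, so the per-type counts
need not add up to the total number of searches.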


--
Eric Lease Morgan
Digital Library Initiatives Department, NCSU Libraries
http://www.lib.ncsu.edu/staff/morgan/

