The Harvest distribution contains several examples of how to configure, customize, and run Gatherers; by default, they install into $HARVEST_HOME/gatherers (see Section Installing the Harvest Software). This section walks you through several example Gatherers. The goal is to give you a sense of what you can do with a Gatherer and how to do it. You needn't work through all of the examples; each is instructive in its own right.
To use the Gatherer examples, you need the Harvest binary directory in your path, and HARVEST_HOME defined. For example:
% setenv HARVEST_HOME /usr/local/harvest
% set path = ($HARVEST_HOME/bin $path)
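The commands above assume a C-shell-style shell. Under a Bourne-compatible shell, the equivalent settings are:
% HARVEST_HOME=/usr/local/harvest; export HARVEST_HOME
% PATH=$HARVEST_HOME/bin:$PATH; export PATH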
The first example is a simple Gatherer that uses the default customizations. The only work needed to configure this Gatherer is to specify the list of URLs from which to gather (see Section The Gatherer).
To run this example, type:
% cd $HARVEST_HOME/gatherers/example-1
% ./RunGatherer
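RunGatherer is a small wrapper script that runs the Gatherer program on the example's configuration file. A minimal sketch of such a script, assuming the standard installation layout, is:
#!/bin/sh
# Illustrative sketch; the script shipped with the example may differ.
HARVEST_HOME=/usr/local/harvest; export HARVEST_HOME
PATH=$HARVEST_HOME/bin:$HARVEST_HOME/lib/gatherer:$PATH; export PATH
exec Gatherer example-1.cf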
To view the configuration file for this Gatherer, look at example-1.cf. The first few lines are variables that specify some local information about the Gatherer (see Section Setting variables in the Gatherer configuration file). For example, each content summary will contain the name of the Gatherer (Gatherer-Name) that generated it. The port number (Gatherer-Port) that will be used to export the indexing information is also specified, as is the directory that contains the Gatherer (Top-Directory). Notice that there is one RootNode URL and one LeafNode URL.
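Based on the variables and URLs described here, example-1.cf contains roughly the following (a sketch; see the installed file for the exact contents):
Gatherer-Name: Example Gatherer Number 1
Gatherer-Port: 9111
Top-Directory: /usr/local/harvest/gatherers/example-1

<RootNodes>
http://harvest.cs.colorado.edu/
</RootNodes>

<LeafNodes>
http://harvest.cs.colorado.edu/~schwartz/IRTF.html
</LeafNodes>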
After the Gatherer has finished, it will start up the Gatherer daemon which will export the content summaries. To view the content summaries, type:
% gather localhost 9111 | more
The following SOIF object is typical of those that this Gatherer generates.
@FILE { http://harvest.cs.colorado.edu/~schwartz/IRTF.html
Time-to-Live{7}: 9676800
Last-Modification-Time{1}: 0
Refresh-Rate{7}: 2419200
Gatherer-Name{25}: Example Gatherer Number 1
Gatherer-Host{22}: powell.cs.colorado.edu
Gatherer-Version{3}: 0.4
Update-Time{9}: 781478043
Type{4}: HTML
File-Size{4}: 2099
MD5{32}: c2fa35fd44a47634f39086652e879170
Partial-Text{151}: research problems
Mic Bowman
Peter Danzig
Udi Manber
Michael Schwartz
Darren Hardy
talk
talk
Harvest
talk
Advanced
Research Projects Agency
URL-References{628}:
ftp://ftp.cs.colorado.edu/pub/cs/techreports/schwartz/RD.ResearchProblems.Jour.ps.Z
ftp://grand.central.org/afs/transarc.com/public/mic/html/Bio.html
http://excalibur.usc.edu/people/danzig.html
http://glimpse.cs.arizona.edu:1994/udi.html
http://harvest.cs.colorado.edu/~schwartz/Home.html
http://harvest.cs.colorado.edu/~hardy/Home.html
ftp://ftp.cs.colorado.edu/pub/cs/misc/schwartz/HPCC94.Slides.ps.Z
ftp://ftp.cs.colorado.edu/pub/cs/misc/schwartz/HPC94.Slides.ps.Z
http://harvest.cs.colorado.edu/harvest/Home.html
ftp://ftp.cs.colorado.edu/pub/cs/misc/schwartz/IETF.Jul94.Slides.ps.Z
http://ftp.arpa.mil/ResearchAreas/NETS/Internet.html
Title{84}: IRTF Research Group on Resource Discovery
IRTF Research Group on Resource Discovery
Keywords{121}: advanced agency bowman danzig darren hardy harvest manber mic
michael peter problems projects research schwartz talk udi
}
Notice that although the Gatherer configuration file lists only 2 URLs (one in the RootNode section and one in the LeafNode section), there are more than 2 content summaries in the Gatherer's database. The Gatherer expanded the RootNode URL into dozens of LeafNode URLs by recursively extracting the links from the HTML file at the RootNode http://harvest.cs.colorado.edu/. Then, for each resulting LeafNode URL, the Gatherer generated a content summary, as in the example above for http://harvest.cs.colorado.edu/~schwartz/IRTF.html.
The HTML summarizer will extract structured information about the Author and Title of the file. It will also extract any URL links into the URL-References attribute, and the text of any anchors into the Partial-Text attribute. Other information about the HTML file, such as its MD5 checksum (see RFC1321) and its size in bytes (File-Size), is also added to the content summary.
The Gatherer is able to ``explode'' a resource into a stream of content summaries. This is useful for files that contain manually-generated information that may describe one or more resources, or for building a gateway between various structured formats and SOIF (see Section The Summary Object Interchange Format (SOIF)).
This example demonstrates an exploder for the Linux Software Map (LSM) format. LSM files contain structured information (like the author, location, etc.) about software available for the Linux operating system.
To run this example, type:
% cd $HARVEST_HOME/gatherers/example-2
% ./RunGatherer
To view the configuration file for this Gatherer, look at example-2.cf. Notice that the Gatherer has its own Lib-Directory (see Section Setting variables in the Gatherer configuration file for help on writing configuration files). The library directory contains the typing and candidate selection customizations for Essence. In this example, we've customized only the candidate selection step: lib/stoplist.cf defines the types that Essence should not index, and here we use an empty stoplist.cf file to direct Essence to index all files.
The Gatherer retrieves each of the LeafNode URLs, which are all Linux Software Map files from the Linux FTP archive tsx-11.mit.edu. The Gatherer recognizes that a ``.lsm'' file is of the LSM type because of the naming heuristic present in lib/byname.cf. The LSM type is a ``nested'' type, as specified in the Essence source code (src/gatherer/essence/unnest.c). Exploder programs (named TypeName.unnest) are run on nested types rather than the usual summarizers. The LSM.unnest program is the standard exploder program that takes an LSM file and generates one or more corresponding SOIF objects. When the Gatherer finishes, its database contains one or more corresponding SOIF objects for the software described within each LSM file.
After the Gatherer has finished, it will start up the Gatherer daemon which will serve the content summaries. To view the content summaries, type:
% gather localhost 9222 | more
Because tsx-11.mit.edu is a popular and heavily loaded archive, the Gatherer often won't be able to retrieve the LSM files. If you suspect that something went wrong, look in log.errors and log.gatherer to try to determine the problem.
The following two SOIF objects were generated by this Gatherer. The first object summarizes the LSM file itself, and the second summarizes the software described in the LSM file.
@FILE { ftp://tsx-11.mit.edu/pub/linux/docs/linux-doc-project/man-pages-1.4.lsm
Time-to-Live{7}: 9676800
Last-Modification-Time{9}: 781931042
Refresh-Rate{7}: 2419200
Gatherer-Name{25}: Example Gatherer Number 2
Gatherer-Host{22}: powell.cs.colorado.edu
Gatherer-Version{3}: 0.4
Type{3}: LSM
Update-Time{9}: 781931042
File-Size{3}: 848
MD5{32}: 67377f3ea214ab680892c82906081caf
}
@FILE { ftp://ftp.cs.unc.edu/pub/faith/linux/man-pages-1.4.tar.gz
Time-to-Live{7}: 9676800
Last-Modification-Time{9}: 781931042
Refresh-Rate{7}: 2419200
Gatherer-Name{25}: Example Gatherer Number 2
Gatherer-Host{22}: powell.cs.colorado.edu
Gatherer-Version{3}: 0.4
Update-Time{9}: 781931042
Type{16}: GNUCompressedTar
Title{48}: Section 2, 3, 4, 5, 7, and 9 man pages for Linux
Version{3}: 1.4
Description{124}: Man pages for Linux. Mostly section 2 is complete. Section
3 has over 200 man pages, but it still far from being finished.
Author{27}: Linux Documentation Project
AuthorEmail{11}: DOC channel
Maintainer{9}: Rik Faith
MaintEmail{16}: faith@cs.unc.edu
Site{45}: ftp.cs.unc.edu
sunsite.unc.edu
tsx-11.mit.edu
Path{94}: /pub/faith/linux
/pub/Linux/docs/linux-doc-project/man-pages
/pub/linux/docs/linux-doc-project
File{20}: man-pages-1.4.tar.gz
FileSize{4}: 170k
CopyPolicy{47}: Public Domain or otherwise freely distributable
Keywords{10}: man
pages
Entered{24}: Sun Sep 11 19:52:06 1994
EnteredBy{9}: Rik Faith
CheckedEmail{16}: faith@cs.unc.edu
}
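The LSM file that produced these objects contained fields corresponding directly to the SOIF attributes in the second object above. Reconstructed from those attributes, it would have looked roughly like this (a sketch only; the exact delimiters and layout of the period's LSM template may differ, and only one of the three Site/Path values is shown):
Title        = Section 2, 3, 4, 5, 7, and 9 man pages for Linux
Version      = 1.4
Description  = Man pages for Linux. Mostly section 2 is complete. ...
Author       = Linux Documentation Project
AuthorEmail  = DOC channel
Maintainer   = Rik Faith
MaintEmail   = faith@cs.unc.edu
Site         = ftp.cs.unc.edu
Path         = /pub/faith/linux
File         = man-pages-1.4.tar.gz
FileSize     = 170k
CopyPolicy   = Public Domain or otherwise freely distributable
Keywords     = man pages
Entered      = Sun Sep 11 19:52:06 1994
EnteredBy    = Rik Faith
CheckedEmail = faith@cs.unc.edu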
We've also built a Gatherer that explodes about a half-dozen index files from various PC archives into more than 25,000 content summaries. Each of these index files contains hundreds of one-line descriptions of PC software distributions that are available via anonymous FTP.
This example demonstrates how to customize the type recognition and candidate selection steps in the Gatherer (see Section Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps). This Gatherer recognizes World Wide Web home pages, and is configured to collect indexing information only from those home pages.
To run this example, type:
% cd $HARVEST_HOME/gatherers/example-3
% ./RunGatherer
To view the configuration file for this Gatherer, look at example-3.cf. As in Example 2, this Gatherer has its own library directory containing a customization for Essence. Since we're only interested in indexing home pages, we need only define the heuristics for recognizing them. As shown below, we use URL naming heuristics in lib/byurl.cf to define what counts as a home page. We've also added a default Unknown type to this file to make candidate selection easier.
HomeHTML ^http:.*/$
HomeHTML ^http:.*[hH]ome\.html$
HomeHTML ^http:.*[hH]ome[pP]age\.html$
HomeHTML ^http:.*[wW]elcome\.html$
HomeHTML ^http:.*/index\.html$
The lib/stoplist.cf configuration file contains a list of types not to index. In this example, Unknown is the only type name listed in stoplist.cf, so the Gatherer will reject only files of the Unknown type. You can also recognize URLs by their filename (in byname.cf) or by their content (in bycontent.cf and magic), although in this example we don't need to use those mechanisms. The default HomeHTML.sum summarizer summarizes each HomeHTML file.
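Since we reject only that one type, the entire lib/stoplist.cf for this example is a single line:
Unknown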
After the Gatherer has finished, it will start up the Gatherer daemon which will serve the content summaries. You'll notice that only content summaries for HomeHTML files are present. To view the content summaries, type:
% gather localhost 9333 | more
This example demonstrates how to customize the type recognition and summarizing steps in the Gatherer (see Section Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps). This Gatherer recognizes two new file formats and summarizes them appropriately.
To view the configuration file for this Gatherer, look at example-4.cf. As in Examples 2 and 3, this Gatherer has its own library directory that contains the configuration files for Essence. The Essence configuration files are the same as the default customization, except for lib/byname.cf, which contains two customizations for the new file formats.
The first new format is the ``ReferBibliographic'' type, which is the format that the refer program uses to represent bibliography information. To recognize that a file is in this format, we'll use the convention that the filename ends in ``.referbib''. So, we add that naming heuristic as a type recognition customization. Each naming heuristic is a regular expression matched against the filename in the lib/byname.cf file:
ReferBibliographic ^.*\.referbib$
Now, to write a summarizer for this type, we'll need a sample ReferBibliographic file:
%A A. S. Tanenbaum
%T Computer Networks
%I Prentice Hall
%C Englewood Cliffs, NJ
%D 1988
Essence summarizers extract structured information from files. One way to write a summarizer is to use regular expressions to define the extractions. For each type of information that you want to extract from a file, add to lib/quick-sum.cf a regular expression that matches the relevant lines. For example, the following entries in lib/quick-sum.cf will extract the author, title, date, and other information from ReferBibliographic files:
ReferBibliographic Author ^%A[ \t]+.*$
ReferBibliographic City ^%C[ \t]+.*$
ReferBibliographic Date ^%D[ \t]+.*$
ReferBibliographic Editor ^%E[ \t]+.*$
ReferBibliographic Comments ^%H[ \t]+.*$
ReferBibliographic Issuer ^%I[ \t]+.*$
ReferBibliographic Journal ^%J[ \t]+.*$
ReferBibliographic Keywords ^%K[ \t]+.*$
ReferBibliographic Label ^%L[ \t]+.*$
ReferBibliographic Number ^%N[ \t]+.*$
ReferBibliographic Comments ^%O[ \t]+.*$
ReferBibliographic Page-Number ^%P[ \t]+.*$
ReferBibliographic Unpublished-Info ^%R[ \t]+.*$
ReferBibliographic Series-Title ^%S[ \t]+.*$
ReferBibliographic Title ^%T[ \t]+.*$
ReferBibliographic Volume ^%V[ \t]+.*$
ReferBibliographic Abstract ^%X[ \t]+.*$
The first field in lib/quick-sum.cf is the name of the type. The second field is the attribute under which to store the information extracted from lines that match the regular expression in the third field.
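Applied to the sample refer file shown earlier, these rules would yield attribute-value pairs along these lines (a sketch; exactly how much of the matched line is kept, e.g. whether the leading %-tag is stripped, depends on the Essence version, so the values shown are approximate):
Author: A. S. Tanenbaum
Title: Computer Networks
Issuer: Prentice Hall
City: Englewood Cliffs, NJ
Date: 1988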
The second new file format is the ``Abstract'' type, which is a file that contains only the text of a paper abstract (a format that is common in technical report FTP archives). To recognize that a file is written in this format, we'll use the naming convention that the filename for ``Abstract'' files ends in ``.abs''. So, we add that type recognition customization to the lib/byname.cf file as a regular expression:
Abstract ^.*\.abs$
Another way to write a summarizer is to write a program or script that takes a filename as the first argument on the command line, extracts the structured information, then outputs the results as a list of SOIF attribute-value pairs.
Summarizer programs are named TypeName.sum, so we call our new summarizer Abstract.sum. Remember to place the summarizer program in a directory that is in your path so that the Gatherer can run it. As shown below, Abstract.sum is a Bourne shell script that takes the first 50 lines of the file, wraps them as the ``Abstract'' attribute, and outputs the result as a SOIF attribute-value pair.
#!/bin/sh
#
# Usage: Abstract.sum filename
#
head -50 "$1" | wrapit "Abstract"
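You can test a summarizer by hand before running the Gatherer; for example (paper.abs is a hypothetical abstract file, and wrapit, the helper used in the script above, must be in your path):
% Abstract.sum paper.abs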
To run this example, type:
% cd $HARVEST_HOME/gatherers/example-4
% ./RunGatherer
After the Gatherer has finished, it will start up the Gatherer daemon which will serve the content summaries. To view the content summaries, type:
% gather localhost 9444 | more
This example demonstrates how to use RootNode filters to customize the candidate selection in the Gatherer (see Section RootNode filters). Only items that pass RootNode filters will be retrieved across the network (see Section Gatherer enumeration vs. candidate selection).
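A RootNode filter file is a list of Allow and Deny lines, each paired with a regular expression that candidate URLs are matched against (see Section RootNode filters for the exact syntax and matching rules). A hypothetical filter that admits only HTML files from a single server might look like:
Allow ^http://harvest\.cs\.colorado\.edu/.*\.html$
Deny  .*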
To run this example, type:
% cd $HARVEST_HOME/gatherers/example-5
% ./RunGatherer
After the Gatherer has finished, it will start up the Gatherer daemon which will serve the content summaries. To view the content summaries, type:
% gather localhost 9555 | more