
8. Gatherer Examples

The following examples are installed into $HARVEST_HOME/gatherers by default (see Section Installing the Harvest Software).

The Harvest distribution contains several examples of how to configure, customize, and run Gatherers. This section will walk you through several example Gatherers. The goal is to give you a sense of what you can do with a Gatherer and how to do it. You needn't work through all of the examples; each is instructive in its own right.

To use the Gatherer examples, you need the Harvest binary directory in your path, and HARVEST_HOME defined. For example:

        % setenv HARVEST_HOME /usr/local/harvest
        % set path = ($HARVEST_HOME/bin $path)
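
If you use a Bourne-compatible shell instead of csh, the equivalent commands are:

        $ HARVEST_HOME=/usr/local/harvest; export HARVEST_HOME
        $ PATH=$HARVEST_HOME/bin:$PATH; export PATH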

8.1 Example 1 - A simple Gatherer

This example is a simple Gatherer that uses the default customizations. The only work that the user does to configure this Gatherer is to specify the list of URLs from which to gather (see Section The Gatherer).

To run this example, type:

        % cd $HARVEST_HOME/gatherers/example-1
        % ./RunGatherer

To view the configuration file for this Gatherer, look at example-1.cf. The first few lines are variables that specify some local information about the Gatherer (see Section Setting variables in the Gatherer configuration file). For example, each content summary will contain the name of the Gatherer (Gatherer-Name) that generated it. The configuration file also specifies the port number (Gatherer-Port) that will be used to export the indexing information, as well as the directory that contains the Gatherer (Top-Directory). Notice that there is one RootNode URL and one LeafNode URL.
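
The file is laid out roughly as follows. This is a sketch rather than a verbatim copy of example-1.cf: the Gatherer name and port are taken from the session below, and the LeafNode URL is only illustrative.

        Gatherer-Name:  Example Gatherer Number 1
        Gatherer-Port:  9111
        Top-Directory:  /usr/local/harvest/gatherers/example-1

        <RootNodes>
        http://harvest.cs.colorado.edu/
        </RootNodes>

        <LeafNodes>
        http://harvest.cs.colorado.edu/~schwartz/IRTF.html
        </LeafNodes>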

After the Gatherer has finished, it will start up the Gatherer daemon which will export the content summaries. To view the content summaries, type:

        % gather localhost 9111 | more

The following SOIF object is similar to those that this Gatherer generates.

        @FILE { http://harvest.cs.colorado.edu/~schwartz/IRTF.html
        Time-to-Live{7}:        9676800
        Last-Modification-Time{1}:      0
        Refresh-Rate{7}:        2419200
        Gatherer-Name{25}:      Example Gatherer Number 1
        Gatherer-Host{22}:      powell.cs.colorado.edu
        Gatherer-Version{3}:    0.4
        Update-Time{9}: 781478043
        Type{4}:        HTML
        File-Size{4}:   2099
        MD5{32}:        c2fa35fd44a47634f39086652e879170
        Partial-Text{151}:      research problems
        Mic Bowman
        Peter Danzig
        Udi Manber
        Michael Schwartz
        Darren Hardy
        talk
        talk
        Harvest
        talk
        Advanced
        Research Projects Agency

        URL-References{628}:
        ftp://ftp.cs.colorado.edu/pub/cs/techreports/schwartz/RD.ResearchProblems.Jour.ps.Z
        ftp://grand.central.org/afs/transarc.com/public/mic/html/Bio.html
        http://excalibur.usc.edu/people/danzig.html
        http://glimpse.cs.arizona.edu:1994/udi.html
        http://harvest.cs.colorado.edu/~schwartz/Home.html
        http://harvest.cs.colorado.edu/~hardy/Home.html
        ftp://ftp.cs.colorado.edu/pub/cs/misc/schwartz/HPCC94.Slides.ps.Z
        ftp://ftp.cs.colorado.edu/pub/cs/misc/schwartz/HPC94.Slides.ps.Z
        http://harvest.cs.colorado.edu/harvest/Home.html
        ftp://ftp.cs.colorado.edu/pub/cs/misc/schwartz/IETF.Jul94.Slides.ps.Z
        http://ftp.arpa.mil/ResearchAreas/NETS/Internet.html

        Title{84}:      IRTF Research Group on Resource Discovery
        IRTF Research Group on Resource Discovery

        Keywords{121}:  advanced agency bowman danzig darren hardy harvest manber mic
        michael peter problems projects research schwartz talk udi

        }

Notice that although the Gatherer configuration file lists only 2 URLs (one in the RootNode section and one in the LeafNode section), there are more than 2 content summaries in the Gatherer's database. The Gatherer expanded the RootNode URL into dozens of LeafNode URLs by recursively extracting the links from the HTML file at the RootNode http://harvest.cs.colorado.edu/. Then, for each LeafNode URL, the Gatherer generated a content summary, as in the above example summary for http://harvest.cs.colorado.edu/~schwartz/IRTF.html.

The HTML summarizer will extract structured information about the Author and Title of the file. It will also extract any URL links into the URL-References attribute, and the text of any anchors into the Partial-Text attribute. Other information about the HTML file, such as its MD5 checksum (see RFC1321) and its size in bytes (File-Size), is also added to the content summary.

8.2 Example 2 - Incorporating manually generated information

The Gatherer is able to ``explode'' a resource into a stream of content summaries. This is useful for files that contain manually-generated information that may describe one or more resources, or for building a gateway between various structured formats and SOIF (see Section The Summary Object Interchange Format (SOIF)).

This example demonstrates an exploder for the Linux Software Map (LSM) format. LSM files contain structured information (like the author, location, etc.) about software available for the Linux operating system.

To run this example, type:

        % cd $HARVEST_HOME/gatherers/example-2
        % ./RunGatherer

To view the configuration file for this Gatherer, look at example-2.cf. Notice that the Gatherer has its own Lib-Directory (see Section Setting variables in the Gatherer configuration file for help on writing configuration files). The library directory contains the typing and candidate selection customizations for Essence. In this example, we've only customized the candidate selection step. lib/stoplist.cf defines the types that Essence should not index. This example uses an empty stoplist.cf file to direct Essence to index all files.

The Gatherer retrieves each of the LeafNode URLs, which are all Linux Software Map files from the Linux FTP archive tsx-11.mit.edu. The Gatherer recognizes that a ``.lsm'' file is of the LSM type because of the naming heuristic in lib/byname.cf. The LSM type is a ``nested'' type, as specified in the Essence source code (src/gatherer/essence/unnest.c). Exploder programs (named TypeName.unnest) are run on nested types instead of the usual summarizers. The LSM.unnest program is the standard exploder that takes an LSM file and generates one or more corresponding SOIF objects. When the Gatherer finishes, its database contains one or more corresponding SOIF objects for the software described within each LSM file.
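
The naming heuristic in lib/byname.cf takes the same form as the byname.cf entries shown in Example 4; for LSM files it would be a line along the lines of the following sketch (assuming the ``.lsm'' suffix convention):

        LSM                     ^.*\.lsm$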

After the Gatherer has finished, it will start up the Gatherer daemon which will serve the content summaries. To view the content summaries, type:

        % gather localhost 9222 | more

Because tsx-11.mit.edu is a popular and heavily loaded archive, the Gatherer often won't be able to retrieve the LSM files. If you suspect that something went wrong, look in log.errors and log.gatherer to try to determine the problem.
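
For example, after a run you can inspect the end of both log files with standard shell tools:

        % cd $HARVEST_HOME/gatherers/example-2
        % tail log.errors log.gatherer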

The following two SOIF objects were generated by this Gatherer. The first object summarizes the LSM file itself, and the second summarizes the software described in the LSM file.

        @FILE { ftp://tsx-11.mit.edu/pub/linux/docs/linux-doc-project/man-pages-1.4.lsm
        Time-to-Live{7}:        9676800
        Last-Modification-Time{9}:      781931042
        Refresh-Rate{7}:        2419200
        Gatherer-Name{25}:      Example Gatherer Number 2
        Gatherer-Host{22}:      powell.cs.colorado.edu
        Gatherer-Version{3}:    0.4
        Type{3}:        LSM
        Update-Time{9}: 781931042
        File-Size{3}:   848
        MD5{32}:        67377f3ea214ab680892c82906081caf
        }

        @FILE { ftp://ftp.cs.unc.edu/pub/faith/linux/man-pages-1.4.tar.gz
        Time-to-Live{7}:        9676800
        Last-Modification-Time{9}:      781931042
        Refresh-Rate{7}:        2419200
        Gatherer-Name{25}:      Example Gatherer Number 2
        Gatherer-Host{22}:      powell.cs.colorado.edu
        Gatherer-Version{3}:    0.4
        Update-Time{9}: 781931042
        Type{16}:       GNUCompressedTar
        Title{48}:      Section 2, 3, 4, 5, 7, and 9 man pages for Linux
        Version{3}:     1.4
        Description{124}:       Man pages for Linux.  Mostly section 2 is complete.  Section
        3 has over 200 man pages, but it still far from being finished.
        Author{27}:     Linux Documentation Project
        AuthorEmail{11}:        DOC channel
        Maintainer{9}:  Rik Faith
        MaintEmail{16}: faith@cs.unc.edu
        Site{45}:       ftp.cs.unc.edu
        sunsite.unc.edu
        tsx-11.mit.edu
        Path{94}:       /pub/faith/linux
        /pub/Linux/docs/linux-doc-project/man-pages
        /pub/linux/docs/linux-doc-project
        File{20}:       man-pages-1.4.tar.gz
        FileSize{4}:    170k
        CopyPolicy{47}: Public Domain or otherwise freely distributable
        Keywords{10}:   man
        pages

        Entered{24}:    Sun Sep 11 19:52:06 1994
        EnteredBy{9}:   Rik Faith
        CheckedEmail{16}:       faith@cs.unc.edu
        }
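
For reference, the LSM entry that produced the second object above would have contained fields along these lines. This is a sketch reconstructed from a subset of the SOIF attributes; the exact field syntax and Begin/End markers varied across LSM template revisions.

        Begin
        Title       = Section 2, 3, 4, 5, 7, and 9 man pages for Linux
        Version     = 1.4
        Author      = Linux Documentation Project
        Maintainer  = Rik Faith
        MaintEmail  = faith@cs.unc.edu
        File        = man-pages-1.4.tar.gz
        FileSize    = 170k
        CopyPolicy  = Public Domain or otherwise freely distributable
        Entered     = Sun Sep 11 19:52:06 1994
        EnteredBy   = Rik Faith
        End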

We've also built a Gatherer that explodes about a half-dozen index files from various PC archives into more than 25,000 content summaries. Each of these index files contains hundreds of one-line descriptions of PC software distributions that are available via anonymous FTP.

8.3 Example 3 - Customizing type recognition and candidate selection

This example demonstrates how to customize the type recognition and candidate selection steps in the Gatherer (see Section Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps). This Gatherer recognizes World Wide Web home pages, and is configured to collect indexing information only from these home pages.

To run this example, type:

        % cd $HARVEST_HOME/gatherers/example-3
        % ./RunGatherer

To view the configuration file for this Gatherer, look at example-3.cf. As in Example 2, this Gatherer has its own library directory that contains a customization for Essence. Since we're only interested in indexing home pages, we need only define the heuristics for recognizing home pages. As shown below, we can use URL naming heuristics to define a home page in lib/byurl.cf. We've also added a default Unknown type to this file to make candidate selection easier; a sketch of that catch-all entry follows the listing.

        HomeHTML                ^http:.*/$
        HomeHTML                ^http:.*[hH]ome\.html$
        HomeHTML                ^http:.*[hH]ome[pP]age\.html$
        HomeHTML                ^http:.*[wW]elcome\.html$
        HomeHTML                ^http:.*/index\.html$
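
The catch-all entry for the default Unknown type matches any URL that the patterns above miss. A sketch of such an entry (assuming a match-everything expression, tried after the more specific patterns) is:

        Unknown                 ^.*$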

The lib/stoplist.cf configuration file contains a list of types not to index. In this example, Unknown is the only type name listed in stoplist.cf, so the Gatherer will reject only files of the Unknown type. You can also recognize URLs by their filename (in byname.cf) or by their content (in bycontent.cf and magic), although in this example we don't need those mechanisms. The default HomeHTML.sum summarizer summarizes each HomeHTML file.
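
The stoplist.cf for this example therefore consists of the single line:

        Unknown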

After the Gatherer has finished, it will start up the Gatherer daemon which will serve the content summaries. You'll notice that only content summaries for HomeHTML files are present. To view the content summaries, type:

        % gather localhost 9333 | more

8.4 Example 4 - Customizing type recognition and summarizing

This example demonstrates how to customize the type recognition and summarizing steps in the Gatherer (see Section Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps). This Gatherer recognizes two new file formats and summarizes them appropriately.

To view the configuration file for this Gatherer, look at example-4.cf. As in Examples 2 and 3, this Gatherer has its own library directory that contains the configuration files for Essence. The Essence configuration files are the same as the default customization, except for lib/byname.cf, which contains two customizations for the new file formats.

Using regular expressions to summarize a format

The first new format is the ``ReferBibliographic'' type, which is the format that the refer program uses to represent bibliography information. To recognize that a file is in this format, we'll use the convention that the filename ends in ``.referbib''. So, we add that naming heuristic as a type recognition customization. Naming heuristics are represented in the lib/byname.cf file as regular expressions matched against the filename:

        ReferBibliographic      ^.*\.referbib$

Now, to write a summarizer for this type, we'll need a sample ReferBibliographic file:

        %A A. S. Tanenbaum
        %T Computer Networks
        %I Prentice Hall
        %C Englewood Cliffs, NJ
        %D 1988

Essence summarizers extract structured information from files. One way to write a summarizer is by using regular expressions to define the extractions. For each type of information that you want to extract from a file, add the regular expression that will match lines in that file to lib/quick-sum.cf. For example, the following regular expressions in lib/quick-sum.cf will extract the author, title, date, and other information from ReferBibliographic files:

        ReferBibliographic      Author                  ^%A[ \t]+.*$
        ReferBibliographic      City                    ^%C[ \t]+.*$
        ReferBibliographic      Date                    ^%D[ \t]+.*$
        ReferBibliographic      Editor                  ^%E[ \t]+.*$
        ReferBibliographic      Comments                ^%H[ \t]+.*$
        ReferBibliographic      Issuer                  ^%I[ \t]+.*$
        ReferBibliographic      Journal                 ^%J[ \t]+.*$
        ReferBibliographic      Keywords                ^%K[ \t]+.*$
        ReferBibliographic      Label                   ^%L[ \t]+.*$
        ReferBibliographic      Number                  ^%N[ \t]+.*$
        ReferBibliographic      Comments                ^%O[ \t]+.*$
        ReferBibliographic      Page-Number             ^%P[ \t]+.*$
        ReferBibliographic      Unpublished-Info        ^%R[ \t]+.*$
        ReferBibliographic      Series-Title            ^%S[ \t]+.*$
        ReferBibliographic      Title                   ^%T[ \t]+.*$
        ReferBibliographic      Volume                  ^%V[ \t]+.*$
        ReferBibliographic      Abstract                ^%X[ \t]+.*$

The first field in lib/quick-sum.cf is the name of the type. The second field is the Attribute under which to store the information extracted from lines that match the regular expression in the third field.
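
Applied to the sample file above, these rules would yield attribute-value pairs along the following lines. This is an illustrative sketch; exactly how much of each matching line Essence keeps (for example, whether the leading %-tag is stripped) depends on the extraction code.

        Author{15}:     A. S. Tanenbaum
        Title{17}:      Computer Networks
        Issuer{13}:     Prentice Hall
        City{20}:       Englewood Cliffs, NJ
        Date{4}:        1988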

Using programs to summarize a format

The second new file format is the ``Abstract'' type, which is a file that contains only the text of a paper abstract (a format that is common in technical report FTP archives). To recognize that a file is written in this format, we'll use the naming convention that the filename for ``Abstract'' files ends in ``.abs''. So, we add that type recognition customization to the lib/byname.cf file as a regular expression:

        Abstract                ^.*\.abs$

Another way to write a summarizer is to write a program or script that takes a filename as the first argument on the command line, extracts the structured information, then outputs the results as a list of SOIF attribute-value pairs.

Summarizer programs are named TypeName.sum, so we call our new summarizer Abstract.sum. Remember to place the summarizer program in a directory that is in your path so that the Gatherer can run it. As you'll see below, Abstract.sum is a Bourne shell script that takes the first 50 lines of the file, wraps them as the ``Abstract'' attribute, and outputs them as a SOIF attribute-value pair.

        #!/bin/sh
        #
        #  Usage: Abstract.sum filename
        #
        #  Emit the first 50 lines of the file as the value of the
        #  SOIF ``Abstract'' attribute (wrapit does the SOIF encoding).
        #
        head -50 "$1" | wrapit "Abstract"
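
You can test the summarizer by hand before running the Gatherer. Given a hypothetical file paper.abs, a session would look something like this (the byte count in the output reflects the length of the wrapped text):

        % ./Abstract.sum paper.abs
        Abstract{2034}: This paper presents ...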

Running the example

To run this example, type:

        % cd $HARVEST_HOME/gatherers/example-4
        % ./RunGatherer

After the Gatherer has finished, it will start up the Gatherer daemon which will serve the content summaries. To view the content summaries, type:

        % gather localhost 9444 | more

8.5 Example 5 - Using RootNode filters

This example demonstrates how to use RootNode filters to customize the candidate selection in the Gatherer (see Section RootNode filters). Only items that pass RootNode filters will be retrieved across the network (see Section Gatherer enumeration vs. candidate selection).
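
A filter file in the style described in Section RootNode filters pairs Allow and Deny directives with regular expressions. A sketch of such a filter (the patterns shown are purely illustrative) might be:

        Deny    \.gif$
        Deny    \.tar\.Z$
        Allow   .*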

To run this example, type:

        % cd $HARVEST_HOME/gatherers/example-5
        % ./RunGatherer

After the Gatherer has finished, it will start up the Gatherer daemon which will serve the content summaries. To view the content summaries, type:

        % gather localhost 9555 | more

