
6. Programs and layout of the installed Harvest software

6.1 $HARVEST_HOME

The top-level directory in which you installed Harvest is known as $HARVEST_HOME. By default, $HARVEST_HOME is /usr/local/harvest. The following files and directories are located in $HARVEST_HOME:

        RunHarvest*         brokers/            gatherers/          tmp/
        bin/                cgi-bin/            lib/

RunHarvest is the script used to create and run Harvest servers (see Section Starting up the system: RunHarvest and related commands). RunHarvest has the same command line syntax as Harvest.

6.2 $HARVEST_HOME/bin

The $HARVEST_HOME/bin directory contains only programs that users would normally run directly. All other programs (e.g., individual summarizers for the Gatherer), as well as Perl library code, are in the lib directory. The bin directory contains the following programs:

CreateBroker

Creates a Broker.

Usage: CreateBroker [skeleton-tree [destination]]

Gatherer

Main user interface to the Gatherer. This program is run by the RunGatherer script found in a Gatherer's directory.

Usage: Gatherer [-manual|-export|-debug] file.cf

Harvest

The program used by RunHarvest to create and run Harvest servers according to the user's specifications.

Usage: Harvest [flags]

where flags can be any of the following:

        -novice         Simplest Q&A.  Mostly uses the defaults.
        -glimpse        Use Glimpse for the Broker. (default)
        -swish          Use Swish for the Broker.
        -wais           Use WAIS for the Broker.
        -dumbtty        Dumb TTY mode.
        -debug          Debug mode.
        -dont-run       Don't run the Broker or the Gatherer.
        -fake           Doesn't build the Harvest servers.
        -protect        Don't change the umask.
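
For example, to walk through the simplest question-and-answer setup without starting the Broker or Gatherer afterwards (an illustrative combination of the flags listed above):

        Harvest -novice -dont-run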

broker

The Broker program. This program is run by the RunBroker script found in a Broker's directory. Logs messages to both broker.out and admin/LOG.

Usage: broker [broker.conf file] [-nocol]

gather

The client interface to the Gatherer.

Usage: gather [-info] [-nocompress] host port [timestamp]
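
For example, to retrieve a Gatherer's database uncompressed and save it as a SOIF file (the host name, port number, and output file below are only illustrative):

        gather -nocompress localhost 8500 > gatherer.soif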

6.3 $HARVEST_HOME/brokers

The $HARVEST_HOME/brokers directory contains images and logos in the images directory, some basic tutorial HTML pages, and the skeleton files that CreateBroker uses to construct new Brokers. You can change the defaults used for newly created Brokers by editing the files in the skeleton directory.

6.4 $HARVEST_HOME/cgi-bin

The $HARVEST_HOME/cgi-bin directory contains the programs needed for the WWW interface to the Broker (described in Section CGI programs) and configuration files for search.cgi in the lib directory.

6.5 $HARVEST_HOME/gatherers

The $HARVEST_HOME/gatherers directory contains example Gatherers discussed in Section Gatherer Examples. By default, RunHarvest creates new Gatherers in this directory.

6.6 $HARVEST_HOME/lib

The $HARVEST_HOME/lib directory contains a number of Perl library routines and other programs needed by various parts of Harvest, as follows:

chat2.pl, ftp.pl, socket.ph

Perl libraries used to communicate with remote FTP servers.

dateconv.pl, lsparse.pl, timelocal.pl

Perl libraries used to parse ls output.

ftpget

Program used to retrieve files and directories from FTP servers.

Usage: ftpget [-htmlify] localfile hostname filename A,I username password

gopherget.pl

Perl program used to retrieve files and menus from Gopher servers.

Usage: gopherget.pl localfile hostname port command

harvest-check.pl

Perl program to check whether Gatherers and Brokers are up.

Usage: harvest-check.pl [-v]

md5

Program used to compute MD5 checksums.

Usage: md5 file [...]

newsget.pl

Perl program used to retrieve USENET articles and group summaries from NNTP servers.

Usage: newsget.pl localfile news-URL

soif.pl, soif-mem-efficient.pl

Perl library used to process SOIF.

urlget

Program used to retrieve a URL.

Usage: urlget URL
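
For example, assuming urlget writes the retrieved object to stdout, a single page can be fetched and saved with (the URL is only an example):

        urlget http://www.example.com/index.html > index.html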

urlpurge

Program to purge the local disk URL cache used by urlget and the Gatherer.

Usage: urlpurge

6.7 $HARVEST_HOME/lib/broker

The $HARVEST_HOME/lib/broker directory contains the search and index programs needed by the Broker, plus several utility programs needed for Broker administration, as follows:

BrokerRestart

Issues a restart command to a Broker.

Usage: BrokerRestart [-password passwd] host port

brkclient

Client interface to the Broker. Can be used to send queries or administrative commands to a Broker.

Usage: brkclient hostname port command-string

dumpregistry

Prints the Broker's Registry file in a human-readable format.

Usage: dumpregistry [-count] [BrokerDirectory]

agrep, glimpse, glimpseindex, glimpseindex.bin, glimpseserver

The Glimpse indexing and search system as described in Section The Broker.

swish

The Swish indexing and search program, an alternative to Glimpse.

info-to-html.pl, mkbrokerstats.pl

Perl programs used to generate Broker statistics and to create stats.html.

Usage: gather -info host port | info-to-html.pl > host.port.html

Usage: mkbrokerstats.pl broker-dir > stats.html

6.8 $HARVEST_HOME/lib/gatherer

The $HARVEST_HOME/lib/gatherer directory contains the default summarizers described in Section Extracting data for indexing: The Essence summarizing subsystem, plus various utility programs needed by the summarizers and the Gatherer, as follows:

URL-filter-default

Default URL filter as described in Section RootNode specifications.

bycontent.cf, byname.cf, byurl.cf, magic, stoplist.cf, quick-sum.cf

Essence configuration files as described in Section Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps.

*.sum

Essence summarizers as discussed in Section Extracting data for indexing: The Essence summarizing subsystem.

HTML-sum.pl

Alternative HTML summarizer written in Perl.

HTMLurls

Program to extract URLs from an HTML file.

Usage: HTMLurls [--base-url url] filename
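
For example, to list the URLs referenced by a local HTML file, resolving relative links against a base URL (the file name and URL are only illustrative):

        HTMLurls --base-url http://www.example.com/ index.html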

catdoc, xls2csv, catdoc-lib

Programs and files used by the Microsoft Word summarizer.

dvi2tty, print-c-comments, ps2txt, ps2txt-2.1, pstext, skim

Programs used by various summarizers.

gifinfo

Program used by summarizers to report information about GIF images.

l2h

Program used by the TeX summarizer.

rast, sgmls, sgmlsasp, sgmls-lib

Programs and files used by the SGML summarizer.

rtf2html

Program used by the RTF summarizer.

wp2x, wp2x.sh, wp2x-lib

Programs and files used by the WordPerfect summarizer.

hexbin, unshar, uudecode

Programs used to unnest nested objects.

cksoif

Program used to check the validity of a SOIF stream (e.g., to ensure that there are no parsing errors).

Usage: cksoif < INPUT.soif

cleandb, consoldb, expiredb, folddb, mergedb, mkgathererstats.pl, mkindex, rmbinary

Programs used to prepare a Gatherer's database for export by gatherd (see the example after the list below).

cleandb ensures that all SOIF objects are valid, and deletes any that are not;

consoldb consolidates n GDBM database files into a single GDBM database file;

expiredb deletes any SOIF objects that are no longer valid as defined by their Time-to-Live attributes;

folddb runs all of the operations needed to prepare the Gatherer's database for export by gatherd;

mergedb consolidates GDBM files as described in Section Incorporating manually generated information into a Gatherer;

mkgathererstats.pl generates the INFO.soif statistics file;

mkindex generates the cache of timestamps; and

rmbinary removes binary data from a GDBM database.
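
As a rough sketch, the database is typically prepared by running folddb from the Gatherer's data directory, since folddb invokes the other preparation steps; the path below is only illustrative, and the exact arguments folddb expects may vary:

        cd /usr/local/harvest/gatherers/example/data
        /usr/local/harvest/lib/gatherer/folddb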

enum, prepurls, staturl

Programs used by the Gatherer to perform RootNode and LeafNode enumeration, as described in Section RootNode specifications.

enum performs a RootNode enumeration on the given URLs;

prepurls is a wrapper program used to pipe Gatherer and essence together;

staturl retrieves LeafNode URLs to determine whether each URL has been modified.

fileenum, ftpenum, ftpenum.pl, gopherenum-*, httpenum-*, newsenum

Programs used by enum to perform protocol-specific enumeration.

fileenum performs a RootNode enumeration on ``file'' URLs;

ftpenum calls ftpenum.pl to perform a RootNode enumeration on ``ftp'' URLs;

gopherenum-breadth performs a breadth first RootNode enumeration on ``gopher'' URLs;

gopherenum-depth performs a depth first RootNode enumeration on ``gopher'' URLs;

httpenum-breadth performs a breadth first RootNode enumeration on ``http'' URLs;

httpenum-depth performs a depth first RootNode enumeration on ``http'' URLs;

newsenum performs a RootNode enumeration on ``news'' URLs.

essence

The Essence content extraction system as described in Section Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps.

Usage: essence [options] -f input-URLs
Usage: essence [options] URL ...

where options are:

        --dbdir directory       Directory to place database
        --full-text             Use entire file instead of summarizing
        --gatherer-host         Gatherer-Host value
        --gatherer-name         Gatherer-Name value
        --gatherer-version      Gatherer-Version value
        --help                  Print usage information
        --libdir directory      Directory to place configuration files
        --log logfile           Name of the file to log messages to
        --max-deletions n       Number of GDBM deletions before reorganization
        --minimal-bookkeeping   Generates a minimal amount of bookkeeping attrs
        --no-access             Do not read contents of objects
        --no-keywords           Do not automatically generate keywords
        --allowlist filename    File with list of types to allow
        --stoplist filename     File with list of types to remove
        --tmpdir directory      Name of directory to use for temporary files
        --type-only             Only type data; do not summarize objects
        --verbose               Verbose output
        --version               Version information
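
For example, to summarize the URLs listed in a file named input-URLs, using the standard configuration files and writing the database and log to illustrative locations (only the option names come from the list above; the directories and file names are assumptions):

        essence --libdir /usr/local/harvest/lib/gatherer \
                --dbdir data --log log.essence -f input-URLs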

print-attr

Reads in a SOIF stream from stdin and prints the data associated with the given attribute to stdout.

Usage: cat SOIF-file | print-attr Attribute

gatherd, in.gatherd

Daemons that export the Gatherer's database. in.gatherd is used to run this daemon from inetd.

Usage: gatherd [-db | -index | -log | -zip | -cf file] [-dir dir] port

Usage: in.gatherd [-db | -index | -log | -zip | -cf file] [-dir dir]
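
For example, to export the database in a Gatherer's data directory on a given port (the directory and port number below are only illustrative):

        gatherd -dir /usr/local/harvest/gatherers/example/data 8500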

gdbmutil

Program to perform various operations on a GDBM database.

Usage: gdbmutil consolidate [-d | -D] master-file file [file ...]
Usage: gdbmutil delete file key
Usage: gdbmutil dump file
Usage: gdbmutil fetch file key
Usage: gdbmutil keys file
Usage: gdbmutil print [-gatherd] file
Usage: gdbmutil reorganize file
Usage: gdbmutil restore file
Usage: gdbmutil sort file
Usage: gdbmutil stats file
Usage: gdbmutil store file key < data
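
For example, to print the contents of a Gatherer's GDBM database for inspection (the database file name is only illustrative):

        gdbmutil print PRODUCTION.gdbm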

mktemplate

Program to generate valid SOIF based on a more easily editable SOIF-like format (e.g., SOIF without the byte counts).

Usage: mktemplate < INPUT.txt > OUTPUT.soif
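
As a rough sketch of the input format, assuming mktemplate accepts SOIF-style records whose attributes omit the {byte-count} markers, an input file might look like this (the template type, URL, and attributes are only illustrative):

        @FILE { http://www.example.com/report.html
        title:  Quarterly Report
        author: J. Smith
        }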

quick-sum

Simple Perl program to emulate Essence's quick-sum.cf processing for those who cannot compile Essence with the corresponding C code.

template2db

Converts a stream of SOIF objects (from stdin or given files) into a GDBM database.

Usage: template2db database [tmpl tmpl...]

wrapit

Wraps the data from stdin into a SOIF attribute-value pair with a byte count. Used by Essence summarizers to easily generate SOIF.

Usage: wrapit [Attribute]
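
For example, a summarizer script might emit a single attribute-value pair by piping its output through wrapit (the attribute name and text are only illustrative):

        echo "An Example Report" | wrapit title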

kill-gatherd

Script to kill the gatherd process.

6.9 $HARVEST_HOME/tmp

The $HARVEST_HOME/tmp directory is used by search.cgi to store search result pages.

