The top-level directory in which you installed Harvest is known as $HARVEST_HOME. By default, $HARVEST_HOME is /usr/local/harvest. The following files and directories are located in $HARVEST_HOME:
RunHarvest*   bin/   brokers/   cgi-bin/   gatherers/   lib/   tmp/
RunHarvest
is the script used to create and run Harvest servers (see Section Starting up the system: RunHarvest and related commands). RunHarvest has the same command-line syntax as Harvest.
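For example, a hypothetical first run from the installation directory might use the -novice flag (RunHarvest accepts the same flags as Harvest, which are listed below):
Example: cd /usr/local/harvest; ./RunHarvest -novice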
The $HARVEST_HOME/bin directory contains only the programs that users would normally run directly. All other programs (e.g., the individual summarizers for the Gatherer), as well as Perl library code, are in the lib directory. The bin directory contains the following programs:
CreateBroker
Creates a Broker.
Usage: CreateBroker [skeleton-tree [destination]]
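For example, a hypothetical invocation that builds a new Broker from the skeleton files shipped in $HARVEST_HOME/brokers (the destination path is only an illustration):
Example: CreateBroker $HARVEST_HOME/brokers/skeleton /usr/local/harvest/brokers/mybroker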
Gatherer
Main user interface to the Gatherer. This program is run by the RunGatherer script found in a Gatherer's directory.
Usage: Gatherer [-manual|-export|-debug] file.cf
Harvest
The program used by RunHarvest to create and run Harvest servers as per the user's description.
Usage: Harvest [flags]
where flags can be any of the following (a sample invocation follows the list):
-novice Simplest Q&A. Mostly uses the defaults.
-glimpse Use Glimpse for the Broker. (default)
-swish Use Swish for the Broker.
-wais Use WAIS for the Broker.
-dumbtty Dumb TTY mode.
-debug Debug mode.
-dont-run Don't run the Broker or the Gatherer.
-fake Doesn't build the Harvest servers.
-protect Don't change the umask.
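For instance, assuming the flags may be combined, a hypothetical run on a terminal without cursor addressing could be:
Example: Harvest -novice -dumbtty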
broker
The Broker program. This program is run by the RunBroker script found in a Broker's directory. It logs messages to both broker.out and admin/LOG.
Usage: broker [broker.conf file] [-nocol]
gather
The client interface to the Gatherer.
Usage: gather [-info] [-nocompress] host port [timestamp]
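For example, to request a Gatherer's summary information from a hypothetical host and port (both values are placeholders):
Example: gather -info gatherer.example.com 8500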
The $HARVEST_HOME/brokers directory contains images and logos (in the images subdirectory), some basic tutorial HTML pages, and the skeleton files that CreateBroker uses to construct new Brokers. You can change the default values used in newly created Brokers by editing the files in the skeleton subdirectory.
The $HARVEST_HOME/cgi-bin directory contains the programs needed for the WWW interface to the Broker (described in Section CGI programs) and, in its lib subdirectory, the configuration files for search.cgi.
The $HARVEST_HOME/gatherers directory contains the example Gatherers discussed in Section Gatherer Examples. By default, RunHarvest will create new Gatherers in this directory.
The $HARVEST_HOME/lib directory contains a number of Perl library routines and other programs needed by various parts of Harvest, as follows:
Perl libraries used to communicate with remote FTP servers.
Perl libraries used to parse ls output.
ftpget
Program used to retrieve files and directories from FTP servers.
Usage: ftpget [-htmlify] localfile hostname filename A,I username password
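For example, a hypothetical anonymous retrieval using the I transfer type (A and I are presumably the standard FTP ASCII and image/binary types); the host, file name, and password are placeholders:
Example: ftpget /tmp/README ftp.example.com /pub/README I anonymous user@example.com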
gopherget.pl
Perl program used to retrieve files and menus from Gopher servers.
Usage: gopherget.pl localfile hostname port command
harvest-check.pl
Perl program to check whether Gatherers and Brokers are up.
Usage: harvest-check.pl [-v]
md5
Program used to compute MD5 checksums.
Usage: md5 file [...]
newsget.pl
Perl program used to retrieve USENET articles and group summaries from NNTP servers.
Usage: newsget.pl localfile news-URL
Perl library used to process SOIF.
urlget
Program used to retrieve a URL.
Usage: urlget URL
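For example, with a placeholder URL (redirecting the output to a file is an assumption about where urlget writes its result):
Example: urlget http://www.example.com/index.html > index.html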
urlpurge
Program to purge the local disk URL cache used by urlget and the Gatherer.
Usage: urlpurge
The $HARVEST_HOME/lib/broker directory contains the search and index programs needed by the Broker, plus several utility programs needed for Broker administration, as follows:
BrokerRestart
Issues a restart command to a Broker.
Usage: BrokerRestart [-password passwd] host port
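For example, with a placeholder password, host, and port:
Example: BrokerRestart -password secret broker.example.com 8501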
brkclient
Client interface to the Broker. Can be used to send queries or administrative commands to a Broker.
Usage: brkclient hostname port command-string
dumpregistry
Prints the Broker's Registry file in a human-readable format.
Usage: dumpregistry [-count] [BrokerDirectory]
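For example, against a hypothetical Broker directory:
Example: dumpregistry -count /usr/local/harvest/brokers/mybroker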
agrep, glimpse, glimpseindex, glimpseindex.bin, glimpseserver
The Glimpse indexing and search system as described in Section The Broker.
swish
The Swish indexing and search program as an alternative to Glimpse.
info-to-html.pl, mkbrokerstats.pl
Perl programs used to generate Broker statistics and to create stats.html.
Usage: gather -info host port | info-to-html.pl > host.port.html
Usage: mkbrokerstats.pl broker-dir > stats.html
The $HARVEST_HOME/lib/gatherer directory contains the default summarizers described in Section Extracting data for indexing: The Essence summarizing subsystem, plus various utility programs needed by the summarizers and the Gatherer, as follows:
Default URL filter as described in Section RootNode specifications.
Essence configuration files as described in Section Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps.
*.sum
Essence summarizers as discussed in Section Extracting data for indexing: The Essence summarizing subsystem.
HTML-sum.pl
Alternative HTML summarizer written in Perl.
HTMLurls
Program to extract URLs from an HTML file.
Usage: HTMLurls [--base-url url] filename
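For example, with a placeholder base URL and local file:
Example: HTMLurls --base-url http://www.example.com/docs/ index.html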
catdoc, xls2csv, catdoc-lib
Programs and files used by the Microsoft Word summarizer.
dvi2tty, print-c-comments, ps2txt, ps2txt-2.1, pstext, skim
Programs used by various summarizers.
gifinfo
Program to support summarizers.
l2h
Program used by the TeX summarizer.
rast, sgmls, sgmlsasp, sgmls-lib
Programs and files used by the SGML summarizer.
rtf2html
Program used by the RTF summarizer.
wp2x, wp2x.sh, wp2x-lib
Programs and files used by the WordPerfect summarizer.
hexbin, unshar, uudecode
Programs used to unnest nested objects.
cksoif
Program used to check the validity of a SOIF stream (e.g., to ensure that there are no parsing errors).
Usage: cksoif < INPUT.soif
cleandb, consoldb, expiredb, folddb, mergedb, mkgathererstats.pl, mkindex, rmbinary
Programs used to prepare a Gatherer's database to be exported by gatherd.
cleandb ensures that all SOIF objects are valid, and deletes any that are not;
consoldb consolidates n GDBM database files into a single GDBM database file;
expiredb deletes any SOIF objects that are no longer valid, as defined by their Time-to-Live attributes;
folddb runs all of the operations needed to prepare the Gatherer's database for export by gatherd;
mergedb consolidates GDBM files as described in Section Incorporating manually generated information into a Gatherer;
mkgathererstats.pl generates the INFO.soif statistics file;
mkindex generates the cache of timestamps; and
rmbinary removes binary data from a GDBM database.
enum, prepurls, staturl
Programs used by Gatherer to perform the RootNode and LeafNode enumeration, as described in Section RootNode specifications.
enum performs a RootNode enumeration on the given URLs;
prepurls is a wrapper program used to pipe Gatherer and essence together; and
staturl retrieves LeafNode URLs to determine whether each URL has been modified.
fileenum, ftpenum, ftpenum.pl, gopherenum-*, httpenum-*, newsenum
Programs used by enum to perform protocol-specific enumeration.
fileenum performs a RootNode enumeration on ``file'' URLs;
ftpenum calls ftpenum.pl to perform a RootNode enumeration on ``ftp'' URLs;
gopherenum-breadth performs a breadth-first RootNode enumeration on ``gopher'' URLs;
gopherenum-depth performs a depth-first RootNode enumeration on ``gopher'' URLs;
httpenum-breadth performs a breadth-first RootNode enumeration on ``http'' URLs;
httpenum-depth performs a depth-first RootNode enumeration on ``http'' URLs; and
newsenum performs a RootNode enumeration on ``news'' URLs.
essence
The Essence content extraction system as described in Section Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps.
Usage: essence [options] -f input-URLs
or essence [options] URL ...
where options are (a sample invocation follows the list):
--dbdir directory Directory to place database
--full-text Use entire file instead of summarizing
--gatherer-host Gatherer-Host value
--gatherer-name Gatherer-Name value
--gatherer-version Gatherer-Version value
--help Print usage information
--libdir directory Directory to place configuration files
--log logfile Name of the file to log messages to
--max-deletions n Number of GDBM deletions before reorganization
--minimal-bookkeeping Generates a minimal amount of bookkeeping attrs
--no-access Do not read contents of objects
--no-keywords Do not automatically generate keywords
--allowlist filename File with list of types to allow
--stoplist filename File with list of types to remove
--tmpdir directory Name of directory to use for temporary files
--type-only Only type data; do not summarize objects
--verbose Verbose output
--version Version information
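For example, a hypothetical stand-alone run that reads a list of URLs from a file and writes its database and log under the current directory (all file and directory names here are placeholders):
Example: essence --libdir $HARVEST_HOME/lib/gatherer --dbdir ./data --log ./essence.log -f input-urls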
print-attr
Reads in a SOIF stream from stdin and prints the data associated with the given attribute to stdout.
Usage: cat SOIF-file | print-attr Attribute
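For example, to print every Title attribute from a hypothetical SOIF file (both the file name and the choice of attribute are placeholders):
Example: cat objects.soif | print-attr Title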
gatherd, in.gatherd
Daemons that export the Gatherer's database. in.gatherd is used to run the daemon from inetd.
Usage: gatherd [-db | -index | -log | -zip | -cf file] [-dir dir] port
Usage: in.gatherd [-db | -index | -log | -zip | -cf file] [-dir dir]
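For example, a hypothetical manual invocation that serves a Gatherer's data directory on a placeholder port:
Example: gatherd -dir /usr/local/harvest/gatherers/mygatherer/data 8500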
gdbmutil
Program to perform various operations on a GDBM database.
Usage: gdbmutil consolidate [-d | -D] master-file file [file ...]
Usage: gdbmutil delete file key
Usage: gdbmutil dump file
Usage: gdbmutil fetch file key
Usage: gdbmutil keys file
Usage: gdbmutil print [-gatherd] file
Usage: gdbmutil reorganize file
Usage: gdbmutil restore file
Usage: gdbmutil sort file
Usage: gdbmutil stats file
Usage: gdbmutil store file key < data
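For example, to list the keys in, and then dump the contents of, a GDBM file (the file name is only an illustration):
Example: gdbmutil keys PRODUCTION.gdbm
Example: gdbmutil print PRODUCTION.gdbm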
mktemplate
Program to generate valid SOIF based on a more easily editable SOIF-like format (e.g., SOIF without the byte counts).
Usage: mktemplate < INPUT.txt > OUTPUT.soif
quick-sum
Simple Perl program to emulate Essence's quick-sum.cf processing for those who cannot compile Essence with the corresponding C code.
template2db
Converts a stream of SOIF objects (from stdin or given files) into a GDBM database.
Usage: template2db database [tmpl tmpl...]
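For example, loading a hypothetical file of manually written SOIF objects into a new database (both names are placeholders):
Example: template2db WORKING.gdbm manual-objects.soif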
wrapit
Wraps the data from stdin into a SOIF attribute-value pair with a byte count. Used by Essence summarizers to easily generate SOIF.
Usage: wrapit [Attribute]
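For example, a summarizer script could wrap extracted text into an attribute (the attribute name here is only an example):
Example: echo "Some extracted text" | wrapit Partial-Text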
kill-gatherd
Script to kill the gatherd process.
The $HARVEST_HOME/tmp directory is used by search.cgi to store search result pages.