The top-level directory where you installed Harvest is known as $HARVEST_HOME. By default, $HARVEST_HOME is /usr/local/harvest. The following files and directories are located in $HARVEST_HOME:
RunHarvest* brokers/ gatherers/ tmp/
bin/ cgi-bin/ lib/
RunHarvest is the script used to create and run Harvest servers (see Section Starting up the system: RunHarvest and related commands). RunHarvest has the same command line syntax as Harvest.
The $HARVEST_HOME/bin directory contains only the programs that users normally run directly. All other programs (e.g., the individual summarizers for the Gatherer), as well as the Perl library code, are in the lib directory. The bin directory contains the following programs:
CreateBroker - Creates a Broker.
Usage: CreateBroker [skeleton-tree [destination]]
Gatherer - Main user interface to the Gatherer. This program is run by the RunGatherer script found in a Gatherer's directory.
Usage: Gatherer [-manual|-export|-debug] file.cf
Harvest - The program used by RunHarvest to create and run Harvest servers as per the user's description.
Usage: Harvest [flags]
where flags can be any of the following:
-novice Simplest Q&A. Mostly uses the defaults.
-glimpse Use Glimpse for the Broker. (default)
-swish Use Swish for the Broker.
-wais Use WAIS for the Broker.
-dumbtty Dumb TTY mode.
-debug Debug mode.
-dont-run Don't run the Broker or the Gatherer.
-fake Doesn't build the Harvest servers.
-protect Don't change the umask.
broker - The Broker program. This program is run by the RunBroker script found in a Broker's directory. Logs messages to both broker.out and admin/LOG.
Usage: broker [broker.conf file] [-nocol]
gather - The client interface to the Gatherer.
Usage: gather [-info] [-nocompress] host port [timestamp]
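For example, gather can ask a running Gatherer for its summary information only, rather than its full database. The host name and port number below are assumptions for illustration; substitute the host and port your Gatherer actually exports on. The guard makes the example a no-op on systems where Harvest is not on the PATH.

```shell
# Assumed example host/port; replace with your Gatherer's values.
host=localhost
port=8500

# -info requests only the Gatherer's summary attributes.
if command -v gather >/dev/null 2>&1; then
    gather -info "$host" "$port"
fi
```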
The $HARVEST_HOME/brokers directory contains images and logos in the images directory, some basic tutorial HTML pages, and the skeleton files that CreateBroker uses to construct new Brokers. You can change the default values in newly created Brokers by editing the files in the skeleton directory.
The $HARVEST_HOME/cgi-bin directory contains the programs needed for the WWW interface to the Broker (described in Section CGI programs) and, in its lib directory, the configuration files for search.cgi.
The $HARVEST_HOME/gatherers directory contains the example Gatherers discussed in Section Gatherer Examples. By default, RunHarvest will create new Gatherers in this directory.
The $HARVEST_HOME/lib directory contains a number of Perl library routines and other programs needed by various parts of Harvest, as follows:
Perl libraries used to communicate with remote FTP servers.
Perl libraries used to parse ls output.
ftpget - Program used to retrieve files and directories from FTP servers.
Usage: ftpget [-htmlify] localfile hostname filename A,I username password
gopherget.pl - Perl program used to retrieve files and menus from Gopher servers.
Usage: gopherget.pl localfile hostname port command
harvest-check.pl - Perl program to check whether Gatherers and Brokers are up.
Usage: harvest-check.pl [-v]
md5 - Program used to compute MD5 checksums.
Usage: md5 file [...]
newsget.pl - Perl program used to retrieve USENET articles and group summaries from NNTP servers.
Usage: newsget.pl localfile news-URL
Perl library used to process SOIF.
urlget - Program used to retrieve a URL.
Usage: urlget URL
urlpurge - Program to purge the local disk URL cache used by urlget and the Gatherer.
Usage: urlpurge
The $HARVEST_HOME/lib/broker directory contains the search and index programs needed by the Broker, plus several utility programs needed for Broker administration, as follows:
BrokerRestart - Issues a restart command to a Broker.
Usage: BrokerRestart [-password passwd] host port
brkclient - Client interface to the Broker. Can be used to send queries or administrative commands to a Broker.
Usage: brkclient hostname port command-string
dumpregistry - Prints the Broker's Registry file in a human-readable format.
Usage: dumpregistry [-count] [BrokerDirectory]
agrep, glimpse, glimpseindex, glimpseindex.bin, glimpseserver - The Glimpse indexing and search system, as described in Section The Broker.
swish - The Swish indexing and search program, an alternative to Glimpse.
info-to-html.pl, mkbrokerstats.pl - Perl programs used to generate Broker statistics and to create stats.html.
Usage: gather -info host port | info-to-html.pl > host.port.html
Usage: mkbrokerstats.pl broker-dir > stats.html
The $HARVEST_HOME/lib/gatherer directory contains the default summarizers described in Section Extracting data for indexing: The Essence summarizing subsystem, plus various utility programs needed by the summarizers and the Gatherer, as follows:
Default URL filter as described in Section RootNode specifications.
Essence configuration files as described in Section Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps.
*.sum - Essence summarizers, as discussed in Section Extracting data for indexing: The Essence summarizing subsystem.
HTML-sum.pl - Alternative HTML summarizer written in Perl.
HTMLurls - Program to extract URLs from an HTML file.
Usage: HTMLurls [--base-url url] filename
catdoc, xls2csv, catdoc-lib - Programs and files used by the Microsoft Word summarizer.
dvi2tty, print-c-comments, ps2txt, ps2txt-2.1, pstext, skim - Programs used by various summarizers.
gifinfo - Program to support summarizers.
l2h - Program used by the TeX summarizer.
rast, sgmls, sgmlsasp, sgmls-lib - Programs and files used by the SGML summarizer.
rtf2html - Program used by the RTF summarizer.
wp2x, wp2x.sh, wp2x-lib - Programs and files used by the WordPerfect summarizer.
hexbin, unshar, uudecode - Programs used to unnest nested objects.
cksoif - Program used to check the validity of a SOIF stream (e.g., to ensure that there are no parsing errors).
Usage: cksoif < INPUT.soif
cleandb, consoldb, expiredb, folddb, mergedb, mkgathererstats.pl, mkindex, rmbinary - Programs used to prepare a Gatherer's database to be exported by gatherd.
cleandb ensures that all SOIF objects are valid, and deletes
any that are not;
consoldb consolidates several GDBM database files into a single GDBM database file;
expiredb deletes any SOIF object that is no longer valid, as determined by its Time-to-Live attribute;
folddb runs all of the operations needed to prepare the
Gatherer's database for export by gatherd;
mergedb consolidates GDBM files as described in Section
Incorporating manually generated information into a Gatherer;
mkgathererstats.pl generates the INFO.soif
statistics file;
mkindex generates the cache of timestamps; and
rmbinary removes binary data from a GDBM database.
enum, prepurls, staturl - Programs used by the Gatherer to perform the RootNode and LeafNode enumeration, as described in Section RootNode specifications.
enum performs a RootNode enumeration on the given URLs;
prepurls is a wrapper program used to pipe the Gatherer and essence together;
staturl retrieves LeafNode URLs to determine whether they have been modified.
fileenum, ftpenum, ftpenum.pl, gopherenum-*, httpenum-*, newsenum - Programs used by enum to perform protocol-specific enumeration.
fileenum performs a RootNode enumeration on ``file'' URLs;
ftpenum calls ftpenum.pl to perform a RootNode
enumeration on ``ftp'' URLs;
gopherenum-breadth performs a breadth first RootNode
enumeration on ``gopher'' URLs;
gopherenum-depth performs a depth first RootNode enumeration
on ``gopher'' URLs;
httpenum-breadth performs a breadth first RootNode
enumeration on ``http'' URLs;
httpenum-depth performs a depth first RootNode enumeration on
``http'' URLs;
newsenum performs a RootNode enumeration on ``news'' URLs.
essence - The Essence content extraction system, as described in Section Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps.
Usage: essence [options] -f input-URLs
or essence [options] URL ...
where options are:
--dbdir directory Directory to place database
--full-text Use entire file instead of summarizing
--gatherer-host Gatherer-Host value
--gatherer-name Gatherer-Name value
--gatherer-version Gatherer-Version value
--help Print usage information
--libdir directory Directory to place configuration files
--log logfile Name of the file to log messages to
--max-deletions n Number of GDBM deletions before reorganization
--minimal-bookkeeping Generates a minimal amount of bookkeeping attrs
--no-access Do not read contents of objects
--no-keywords Do not automatically generate keywords
--allowlist filename File with list of types to allow
--stoplist filename File with list of types to remove
--tmpdir directory Name of directory to use for temporary files
--type-only Only type data; do not summarize objects
--verbose Verbose output
--version Version information
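A standalone essence run, using only flags from the usage summary above, might look like the following sketch. The input file name and the URL in it are arbitrary examples, and the guard skips the invocation when essence is not installed.

```shell
# Build a small input file of URLs for essence's -f option.
# The file name and URL are made-up examples.
echo 'http://example.com/' > input-urls

# Summarize with full text instead of the default summarizers,
# logging progress to essence.log.
if command -v essence >/dev/null 2>&1; then
    essence --full-text --log essence.log -f input-urls
fi
```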
print-attr - Reads a SOIF stream from stdin and prints the data associated with the given attribute to stdout.
Usage: cat SOIF-file | print-attr Attribute
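For the simple case of single-line attribute-value pairs (Attribute{n}: value), print-attr's effect can be sketched with a small shell function. This is only an illustration of what the program does, not its implementation: the real print-attr honors the byte counts and therefore also handles multi-line and binary values, which this sketch does not.

```shell
# Print the value of one single-line SOIF attribute from stdin.
# Strips the "Name{count}:" prefix and any whitespace after the colon.
print_attr_sketch() {
    sed -n "s/^$1{[0-9]*}:[[:space:]]*//p"
}

# Feed one SOIF-style attribute line through the sketch.
val=$(printf 'Title{12}:\tExample Page\n' | print_attr_sketch Title)
echo "$val"    # Example Page
```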
gatherd, in.gatherd - Daemons that export the Gatherer's database. in.gatherd is used to run the daemon from inetd.
Usage: gatherd [-db | -index | -log | -zip | -cf file] [-dir dir] port
Usage: in.gatherd [-db | -index | -log | -zip | -cf file] [-dir dir]
gdbmutil - Program to perform various operations on a GDBM database.
Usage: gdbmutil consolidate [-d | -D] master-file file [file ...]
Usage: gdbmutil delete file key
Usage: gdbmutil dump file
Usage: gdbmutil fetch file key
Usage: gdbmutil keys file
Usage: gdbmutil print [-gatherd] file
Usage: gdbmutil reorganize file
Usage: gdbmutil restore file
Usage: gdbmutil sort file
Usage: gdbmutil stats file
Usage: gdbmutil store file key < data
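A store/fetch round trip, using only the subcommand forms documented above, could look like this. The database file name (db.gdbm) and key ("greeting") are arbitrary examples, and the whole block is skipped when gdbmutil is not installed.

```shell
db=db.gdbm       # example database file
key=greeting     # example key

if command -v gdbmutil >/dev/null 2>&1; then
    # "store" reads the value from stdin, per the usage line above.
    printf 'hello' | gdbmutil store "$db" "$key"
    gdbmutil fetch "$db" "$key"      # print the stored value
    gdbmutil keys "$db"              # list all keys in the database
fi
```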
mktemplate - Program to generate valid SOIF from a more easily editable SOIF-like format (e.g., SOIF without the byte counts).
Usage: mktemplate < INPUT.txt > OUTPUT.soif
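As an illustration, the editable input might look like a SOIF template with the byte counts omitted. The template type, URL, and attribute names below are made-up examples, and the conversion step is guarded so the sketch is harmless where Harvest is not installed.

```shell
# Hypothetical hand-editable input: SOIF-like text without byte counts.
cat > INPUT.txt <<'EOF'
@FILE { http://example.com/index.html
Title:  Example Page
Author: J. Doe
}
EOF

# mktemplate fills in the {byte-count} on each attribute,
# yielding valid SOIF.
if command -v mktemplate >/dev/null 2>&1; then
    mktemplate < INPUT.txt > OUTPUT.soif
fi
```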
quick-sum - Simple Perl program to emulate Essence's quick-sum.cf processing for those who cannot compile Essence with the corresponding C code.
template2db - Converts a stream of SOIF objects (from stdin or from the given files) into a GDBM database.
Usage: template2db database [tmpl tmpl...]
wrapit - Wraps the data from stdin into a SOIF attribute-value pair with a byte count. Used by Essence summarizers to easily generate SOIF.
Usage: wrapit [Attribute]
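The framing wrapit produces can be illustrated with a tiny shell function: the attribute name, the byte count of the data in braces, a colon, and then the data. This sketch assumes single-line ASCII input and a tab after the colon; the real wrapit handles arbitrary data, so treat this only as a picture of the output format.

```shell
# Sketch of wrapit's output framing for single-line ASCII data.
# Not the real wrapit: it only shows the Attribute{count}: shape.
soif_wrap() {
    data=$(cat)                                 # value from stdin
    printf '%s{%d}:\t%s\n' "$1" "${#data}" "$data"
}

pair=$(printf 'hello' | soif_wrap Description)
echo "$pair"    # Description{5}:	hello
```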
kill-gatherd - Script to kill the gatherd process.
The $HARVEST_HOME/tmp directory is used by search.cgi to store search result pages.