A good machine for running a typical Harvest server will have a reasonably
fast processor, 1-2 GB of free disk, and 128 MB of RAM.
A slower CPU will work, but it will slow down the Harvest server. More
important than CPU speed, however, is memory size. Harvest uses a number of
processes, some of which provide needed ``plumbing'' (e.g., search.cgi), and
some of which improve performance (e.g., the glimpseserver process). If you
do not have enough memory, your system will page too much, which drastically
reduces performance. The other factor affecting RAM usage is how much data
you are trying to index in a Harvest Broker. The more data, the more disk I/O
will be performed at query time, and the more RAM it will take to provide a
reasonably sized disk buffer pool.
The amount of disk you'll need depends on how much data you want to index in a single Broker. (It is possible to distribute your index over multiple Brokers if it gets too large for one disk.) A good rule of thumb is that you will need about 10% as much disk to hold the Gatherer and Broker databases as the total size of the data you want to index. The actual space needs will vary depending on the type of data you are indexing. For example, PostScript achieves a much higher indexing space reduction than HTML, because so much of the PostScript data (such as page positioning information) is discarded when building the index.
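As a rough illustration of the 10% rule of thumb, the following sketch turns a data size in kilobytes into an estimated database size. The helper name estimate_index_kb and the example path are hypothetical and not part of Harvest; actual space needs will vary with the type of data, as noted above.

```shell
# Estimate Gatherer/Broker database size as about 10% of the data to index.
# estimate_index_kb is a hypothetical helper, not a Harvest command.
estimate_index_kb() {
    echo $(( $1 / 10 ))
}

# Example: measure a (hypothetical) data directory with du, then apply the rule.
# data_kb=$(du -sk /path/to/data | awk '{print $1}')
# estimate_index_kb "$data_kb"
```

For about 2 GB of data (2000000 KB), this gives roughly 200000 KB, or about 200 MB, for the databases.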
To run a Harvest server, you need a UNIX-like operating system.
To use Harvest, you need the following software packages:
gzip v1.2.4 or higher.
To build Harvest from the source distribution you may need to install one or more of the following software packages:
gcc v2.5.8 or higher.
flex v2.4.7 or higher and bison v1.22 or higher.
The sources for gcc, gzip, flex, and bison are available at the GNU FTP server.
Anyone with a web browser (e.g., Internet Explorer, Lynx, Mozilla, Netscape, Opera, etc.) can access and use Harvest servers.
Currently we offer only a source distribution of Harvest, which contains all of the source code for the Harvest software. There are no binary distributions of Harvest.
You can retrieve the Harvest source distributions from the Harvest download site prdownloads.sourceforge.net/harvest/.
Harvest components are in the components directory. To use a component, follow the instructions included in the desired component directory.
There is a collection of unsupported user-contributed software in the contrib directory. If you would like to contribute some software, please send email to lee@arco.de.
The source distribution can be extracted in any directory. The following command will extract the gnu-zipped source archive:
% gzip -dc harvest-x.y.z.tar.gz | tar xf -
For archives compressed with bzip2, use:
% bzip2 -dc harvest-x.y.z.tar.bz2 | tar xf -
Harvest uses GNU's autoconf package to perform needed configuration
at installation time. If you want to override the default installation
location of /usr/local/harvest, change the ``prefix'' variable when
invoking ``configure''. If desired, you may edit
src/common/include/config.h before compiling to change various
Harvest compile-time limits and variables. To compile the source tree, type make.
For example, to build and install the entire Harvest system into the /usr/local/harvest directory, type:
% ./configure
% make
% make install
You may see some compiler warning messages, which you can ignore.
Building the entire Harvest distribution will take a few minutes on a reasonably fast machine. The compiled source tree takes approximately 25 megabytes of disk space.
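To install somewhere other than /usr/local/harvest, the prefix can be given on the configure command line. The sketch below assumes that Harvest's configure script accepts the standard autoconf --prefix option; the helper name build_harvest and the path /opt/harvest are only illustrations.

```shell
# Build and install Harvest under a chosen prefix.
# Assumes the standard autoconf --prefix option; build_harvest is a
# hypothetical helper, and the default mirrors the documented location.
build_harvest() {
    prefix=${1:-/usr/local/harvest}
    ./configure --prefix="$prefix" && make && make install
}

# Usage, from the top of the extracted source tree:
# build_harvest /opt/harvest
```

Chaining the steps with && stops the build at the first failing step, so make install never runs against a partially compiled tree.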
Later, after the installed software is working, you can remove the compiled
code (``.o'' files) and other intermediate files by typing make clean. If you
want to remove the configure-generated Makefiles, type make distclean.
The Broker interacts with your HTTP server in a number of ways. You should make sure that the HTTP server can properly access the files it needs. In many cases, the HTTP server will run under a different userid than the owner of the Harvest files.
First, make sure the HTTP server userid can read the query.html
files in each broker directory. Second, make sure the HTTP server
userid can access and execute the CGI programs in
$HARVEST_HOME/cgi-bin/. The search.cgi script reads files from the
$HARVEST_HOME/cgi-bin/lib/ directory, so check that as well. Finally,
check the files in $HARVEST_HOME/lib/. Some of the CGI Perl scripts
require ``include'' files in this directory.
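One way to spot permission problems is to list files that are not readable by ``other'' users. This is a minimal sketch, assuming the HTTP server userid is neither the owner of the Harvest files nor in their group; list_unreadable is a hypothetical helper, not a Harvest command.

```shell
# List regular files under a directory that lack the "other" read bit.
# Assumes the HTTP server userid falls under "other" permissions.
list_unreadable() {
    find "$1" -type f ! -perm -o=r -print
}

# Usage (HARVEST_HOME must be set):
# list_unreadable "$HARVEST_HOME/cgi-bin"
# list_unreadable "$HARVEST_HOME/lib"
```

Empty output means every file found is world-readable. The CGI programs also need execute permission, and the directories leading to them must be searchable by the HTTP server userid, which this sketch does not check.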
The Harvest Broker requires that an HTTP server is running, and that the HTTP server ``knows'' about the Broker's files. Below are some examples of how to configure various HTTP servers to work with the Harvest Broker.
Apache httpd requires a ScriptAlias and an Alias entry in httpd.conf, e.g.:
ScriptAlias /Harvest/cgi-bin/ Your-HARVEST_HOME/cgi-bin/
Alias /Harvest/ Your-HARVEST_HOME/
WARNING: The ScriptAlias entry must appear before the Alias entry.
Additionally, it might be necessary to configure Apache httpd to follow symbolic links. To do this, add the following to your httpd.conf:
<Directory Your-HARVEST_HOME>
Options FollowSymLinks
</Directory>
For other HTTP servers, install the server and modify its configuration file so that the /Harvest directory points to $HARVEST_HOME. You will also need to configure your HTTP server so that it knows that the directory /Harvest/cgi-bin contains valid CGI programs. If your HTTP server does not follow symbolic links by default, you will need to configure it to follow symbolic links in the /Harvest directory.
You cannot install version 1.8 on top of version 1.6. The change from version 1.6 to version 1.8 included some reorganization of the executables, and hence simply installing version 1.8 on top of version 1.6 would cause you to use old executables in some cases.
To upgrade from Harvest version 1.6 to 1.8, do the following:
Move the Gatherer's directory into $HARVEST_HOME/gatherers. Section RootNode specifications describes the Gatherer workload specifications if you want to modify your Gatherer's configuration file.
Rebuild your broker by using CreateBroker and merge in any customizations you have made to your old Broker.
There are no known incompatibilities between versions 1.5 and 1.6.
You cannot install version 1.5 on top of version 1.4. The change from version 1.4 to version 1.5 included some reorganization of the executables, and hence simply installing version 1.5 on top of version 1.4 would cause you to use old executables in some cases.
To upgrade from Harvest version 1.4 to 1.5, do the following:
Move the Gatherer's directory into $HARVEST_HOME/gatherers. Section RootNode specifications describes the Gatherer workload specifications if you want to modify your Gatherer's configuration file.
Move the Broker's directory into $HARVEST_HOME/brokers. Remove any .glimpse_* files from your Broker's directory and use the admin.html interface to force a full-index. You may want, however, to rebuild your broker by using CreateBroker so that you can use the updated query.html and related files.
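The Broker part of the 1.4 to 1.5 migration can be sketched as a small shell function. The name migrate_broker is a hypothetical helper, not a Harvest command, and the old Broker path in the usage line is a placeholder; $HARVEST_HOME is assumed to point at the new installation.

```shell
# Move an old Broker directory under $HARVEST_HOME/brokers and remove its
# stale Glimpse index files. migrate_broker is a hypothetical helper.
migrate_broker() {
    old_dir=$1
    name=$(basename "$old_dir")
    mkdir -p "$HARVEST_HOME/brokers" &&
    mv "$old_dir" "$HARVEST_HOME/brokers/$name" &&
    rm -f "$HARVEST_HOME/brokers/$name"/.glimpse_*
}

# Usage (placeholder path for the old Broker directory):
# migrate_broker /path/to/old/harvest/brokers/mybroker
```

After the move, a full-index still has to be forced through the admin.html interface; the sketch does not do that step.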
There are no known incompatibilities between versions 1.3 and 1.4.
Version 1.3 is mostly backwards compatible with 1.2, with the following exception:
Harvest 1.3 uses Glimpse 3.0. The .glimpse_* files in the broker directory created with Harvest 1.2 (Glimpse 2.0) are incompatible. After installing Harvest 1.3 you should:
Run rm .glimpse_* in each broker directory.
Restart each broker with RunBroker.
There are a few incompatibilities between Harvest version 1.1 and version 1.2.
% set path = ($HARVEST_HOME/lib/gatherer $path)
% cd data
% rm -f INDEX.gdbm
% mkindex
This should create the INDEX.gdbm and MD5.gdbm
files in the current directory.
If you already have an older version of Harvest installed and want to upgrade, you cannot unpack the new distribution on top of the old one. For example, the change from version 1.0 to version 1.1 included some reorganization of the executables, and hence simply installing version 1.1 on top of version 1.0 would cause you to use old executables in some cases.
On the other hand, you may not want to start over from scratch with a new software version, as that would not take advantage of the data you have already gathered and indexed. Instead, to upgrade from Harvest version 1.0 to 1.1, do the following:
Move the Gatherer's directory into $HARVEST_HOME/gatherers. Section RootNode specifications describes the new Gatherer workload specifications which were introduced in version 1.1; you may modify your Gatherer's configuration file to employ this new functionality.
Move the Broker's directory into $HARVEST_HOME/brokers. You may want, however, to rebuild your broker by using CreateBroker so that you can use the updated query.html and related files.
The simplest way to start the Harvest system is to use the RunHarvest
command. RunHarvest prompts the user with a short list of questions
about what data to index, etc., and then creates and runs a Gatherer and
Broker with a ``stock'' (non-customized) set of content extraction and
indexing mechanisms. Some more primitive commands are also available for
starting individual Gatherers and Brokers (e.g., if you want to distribute
the gathering process). The Harvest startup commands are:
RunHarvest: Checks that the Harvest software is installed correctly, prompts the user for basic configuration information, and then creates and runs a Gatherer and a Broker. If you have $HARVEST_HOME set, then it will use it; otherwise, it tries to determine $HARVEST_HOME automatically. Found in the $HARVEST_HOME directory.
RunBroker: Runs a Broker. Found in the Broker's directory.
RunGatherer: Runs a Gatherer. Found in the Gatherer's directory.
CreateBroker: Creates a single Broker which will collect its information from other existing Brokers or Gatherers. Used by RunHarvest, or can be run by a user to create a new Broker. Uses $HARVEST_HOME, and defaults to /usr/local/harvest. Found in the $HARVEST_HOME/bin directory.
There is no CreateGatherer command, but the RunHarvest command can create a
Gatherer, or you can create a Gatherer manually (see Section Customizing the
type recognition, candidate selection, presentation unnesting, and
summarizing steps or Section Gatherer Examples). The layout of the installed
Harvest directories and programs is discussed in Section Programs and layout
of the installed Harvest software.
Among other things, the RunHarvest command asks the user what port
numbers to use when running the Gatherer and the Broker. By default,
the Gatherer will use port 8500 and the Broker will use the Gatherer
port plus 1. The choice of port numbers depends on your particular
machine -- you need to choose ports that are not in use by other servers
on your machine. You might look at your /etc/services file to see
what ports are in use (although this file only lists some servers; other
servers use ports without registering that information anywhere).
Usually the above port numbers will not be in use by other processes.
Probably the easiest thing is simply to try using the default port
numbers, and see if it works.
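A quick look at /etc/services can be scripted as follows. This is only a sketch: port_is_listed is a hypothetical helper, and as noted above, a port that is not listed may still be in use by a server that never registered it.

```shell
# Report whether a port number appears in a services file.
# port_is_listed is a hypothetical helper; the file defaults to /etc/services.
port_is_listed() {
    port=$1
    services=${2:-/etc/services}
    grep -Eq "[[:space:]]${port}/(tcp|udp)" "$services" 2>/dev/null
}

gatherer_port=8500
broker_port=$(( gatherer_port + 1 ))   # Broker defaults to Gatherer port + 1

for p in "$gatherer_port" "$broker_port"; do
    if port_is_listed "$p"; then
        echo "port $p is listed in /etc/services"
    else
        echo "port $p is not listed in /etc/services"
    fi
done
```

A listed port is not necessarily busy on your machine, so the result is only a hint; trying the default ports remains the simplest test.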
The remainder of this manual provides information for users who wish to
customize or otherwise make more sophisticated use of Harvest than what
happens when you install the system and run RunHarvest.
If you have questions about the Harvest system or problems with the software, post a note to the USENET newsgroup comp.infosystems.harvest. Please note your machine type, operating system type, and Harvest version number in your correspondence.
If you have bug fixes, ports to new platforms or other software improvements, please email them to the Harvest maintainer lee@arco.de.