Harvest User's Manual: Installing the Harvest Software

3. Installing the Harvest Software

3.1 Requirements for Harvest Servers

Hardware

A good machine for running a typical Harvest server will have a reasonably fast processor, 1-2 GB of free disk, and 128 MB of RAM. A slower CPU will work, but it will slow down the Harvest server. More important than CPU speed, however, is memory size. Harvest uses a number of processes, some of which provide needed ``plumbing'' (e.g., search.cgi), and some of which improve performance (e.g., the glimpseserver process). If you do not have enough memory, your system will page too much, which drastically reduces performance. The other factor affecting RAM usage is how much data you are trying to index in a Harvest Broker: the more data, the more disk I/O is performed at query time, and the more RAM it takes to provide a reasonably sized disk buffer pool.

The amount of disk you'll need depends on how much data you want to index in a single Broker. (It is possible to distribute your index over multiple Brokers if it gets too large for one disk.) A good rule of thumb is that you will need about 10% as much disk to hold the Gatherer and Broker databases as the total size of the data you want to index. The actual space needs will vary depending on the type of data you are indexing. For example, PostScript achieves a much higher indexing space reduction than HTML, because so much of the PostScript data (such as page positioning information) is discarded when building the index.
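As a rough illustration of the 10% rule of thumb above, the following sketch estimates the disk needed for the Gatherer and Broker databases from the size of the data to be indexed. The function name estimate_index_kb is hypothetical and not part of Harvest, and the real figure will vary with the type of data.

```shell
#!/bin/sh
# Hypothetical helper, not a Harvest command.
# Applies the rule of thumb from the text: the Gatherer and Broker
# databases need about 10% of the size of the data being indexed.
estimate_index_kb() {
    data_kb=$1
    # Integer arithmetic: 10% of the data size, in kilobytes.
    echo $(( data_kb / 10 ))
}

# Example: 5 GB of data is 5 * 1024 * 1024 = 5242880 KB.
estimate_index_kb 5242880
```

The example prints 524288, that is, about 512 MB. To estimate from an existing directory tree, the size in kilobytes reported by du -sk can be passed as the argument.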

Platforms

To run a Harvest server, you need a UNIX-like operating system.

Software

To use Harvest, you need the following software packages:

All Harvest servers require: Perl v5.0 or higher.
The Harvest Broker and Gatherer require: GNU gzip v1.2.4 or higher.
The Harvest Broker requires: an HTTP server.

To build Harvest from the source distribution you may need to install one or more of the following software packages:

Compiling Harvest requires: GNU gcc v2.5.8 or higher.
Compiling the Harvest Broker requires: flex v2.4.7 or higher and bison v1.22 or higher.

The sources for gcc, gzip, flex, and bison are available at the GNU FTP server.
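Before building, it can help to confirm that the packages listed above are on your PATH. The sketch below is only a presence check, and missing_tools is a hypothetical helper name. It does not compare version numbers, so check those separately against the minimum versions given above.

```shell
#!/bin/sh
# Hypothetical helper: print the name of each tool that is NOT on the PATH.
# Prints nothing when every named tool is found.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Tools named in the requirements above.
missing_tools perl gzip gcc flex bison
```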

3.2 Requirements for Harvest Users

Anyone with a web browser (e.g., Internet Explorer, Lynx, Mozilla, Netscape, or Opera) can access and use Harvest servers.

3.3 Retrieving and Installing the Harvest Software

Distribution types

Harvest is currently offered only as a source distribution, which contains all of the source code for the Harvest software. There are no binary distributions of Harvest.

You can retrieve the Harvest source distributions from the Harvest download site prdownloads.sourceforge.net/harvest/.

Harvest components

Harvest components are in the components directory. To use a component, follow the instructions included in the desired component directory.

User-contributed software

There is a collection of unsupported user-contributed software in the contrib directory. If you would like to contribute some software, please send email to lee@arco.de.

3.4 Building the Source Distribution

The source distribution can be extracted in any directory. The following command will extract the gnu-zipped source archive:


        % gzip -dc harvest-x.y.z.tar.gz | tar xf -

For archives compressed with bzip2, use:


        % bzip2 -dc harvest-x.y.z.tar.bz2 | tar xf -

Harvest uses GNU's autoconf package to perform needed configuration at installation time. If you want to override the default installation location of /usr/local/harvest, change the ``prefix'' variable when invoking ``configure''. If desired, you may edit src/common/include/config.h before compiling to change various Harvest compile-time limits and variables. To compile the source tree, type make.

For example, to build and install the entire Harvest system into the /usr/local/harvest directory, type:


        % ./configure
        % make
        % make install

You may see some compiler warning messages, which you can ignore.

Building the entire Harvest distribution will take a few minutes on a reasonably fast machine. The compiled source tree takes approximately 25 megabytes of disk space.

Later, after the installed software is working, you can remove the compiled code (``.o'' files) and other intermediate files by typing make clean. If you want to remove the configure-generated Makefiles, type make distclean.
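To install somewhere other than the default location, the prefix can be given when configure is invoked. The sketch below assumes that Harvest's configure script accepts the standard autoconf --prefix option. The wrapper name build_harvest and the paths in the usage comment are examples only.

```shell
#!/bin/sh
# Hypothetical wrapper: build and install Harvest under a chosen prefix.
# Assumption: configure accepts --prefix, the standard autoconf option
# for setting the installation root.
build_harvest() {
    src_dir=$1    # directory holding the extracted source distribution
    prefix=$2     # installation root to use instead of /usr/local/harvest
    (
        # Run in a subshell so the caller's working directory is unchanged.
        cd "$src_dir" &&
        ./configure --prefix="$prefix" &&
        make &&
        make install
    )
}

# Usage (example paths only):
#   build_harvest ./harvest-x.y.z /opt/harvest
```

Each step runs only if the previous one succeeded, so a failed configure or make stops the build before anything is installed.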

3.5 Additional installation for the Harvest Broker

Checking the installation for HTTP access

The Broker interacts with your HTTP server in a number of ways. You should make sure that the HTTP server can properly access the files it needs. In many cases, the HTTP server will run under a different userid than the owner of the Harvest files.

First, make sure the HTTP server userid can read the query.html files in each broker directory. Second, make sure the HTTP server userid can access and execute the CGI programs in $HARVEST_HOME/cgi-bin/. The search.cgi script reads files from the $HARVEST_HOME/cgi-bin/lib/ directory, so check that as well. Finally, check the files in $HARVEST_HOME/lib/. Some of the CGI Perl scripts require ``include'' files in this directory.
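One way to spot files the HTTP server may be unable to read is to look for missing permission bits. The sketch below assumes that the HTTP server userid is neither the owner of the Harvest files nor in their group, so that it depends on the ``other'' permission bits. The helper name files_lacking_mode is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list regular files under a directory that do not
# have all of the given octal permission bits set.
# Assumption: the HTTP server userid relies on the "other" bits
# (004 = read, 001 = execute).
files_lacking_mode() {
    dir=$1
    mode=$2
    find "$dir" -type f ! -perm "-$mode" -print
}

# Usage (prints nothing when every file has the bits set):
#   files_lacking_mode "$HARVEST_HOME/lib" 004
#   files_lacking_mode "$HARVEST_HOME/cgi-bin/lib" 004
```

The directories leading to these files must also be searchable by the HTTP server userid. This sketch does not check that.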

Required modifications to your HTTP server

The Harvest Broker requires that an HTTP server is running, and that the HTTP server ``knows'' about the Broker's files. Below are some examples of how to configure various HTTP servers to work with the Harvest Broker.

Apache httpd

Requires a ScriptAlias and an Alias entry in httpd.conf, e.g.:


        ScriptAlias /Harvest/cgi-bin/ Your-HARVEST_HOME/cgi-bin/
        Alias /Harvest/ Your-HARVEST_HOME/

WARNING: The ScriptAlias entry must appear before the Alias entry.

Additionally, it might be necessary to configure Apache httpd to follow symbolic links. To do this, add the following to your httpd.conf:


        <Directory Your-HARVEST_HOME>
                Options FollowSymLinks
        </Directory>

Other HTTP servers

Install the HTTP server and modify its configuration file so that the /Harvest directory points to $HARVEST_HOME. You will also need to configure your HTTP server so that it knows that the directory /Harvest/cgi-bin contains valid CGI programs. If the default behaviour of your HTTP server is not to follow symbolic links, you will need to configure it so that it will follow symbolic links in the /Harvest directory.

3.6 Upgrading versions of the Harvest software

Upgrading from version 1.6 to version 1.8

You cannot install version 1.8 on top of version 1.6. The change from version 1.6 to version 1.8 included some reorganization of the executables, so simply installing version 1.8 on top of version 1.6 would cause you to use old executables in some cases.

To upgrade from Harvest version 1.6 to 1.8, do the following:

Move your old installation to a temporary location.
Install the new version as directed by the release notes.
Then, for each Gatherer and Broker that you were running under the old installation, migrate the server into the new installation.

Gatherers:
you need to move the Gatherer's directory into $HARVEST_HOME/gatherers. Section RootNode specifications describes the Gatherer workload specifications if you want to modify your Gatherer's configuration file.

Brokers:
rebuild your broker by using CreateBroker and merge in any customizations you have made to your old Broker.
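The Gatherer migration step above can be sketched as follows. The helper name migrate_gatherer, the Gatherer name ``example'', and the paths in the usage comment are hypothetical. In particular, the sketch does not assume where the Gatherer directory lived in the old installation: you pass that path explicitly.

```shell
#!/bin/sh
# Hypothetical helper: move one Gatherer directory from the old
# (moved-aside) installation into the new installation.
migrate_gatherer() {
    old_gatherer_dir=$1    # Gatherer directory in the old installation
    new_home=$2            # $HARVEST_HOME of the new installation
    mkdir -p "$new_home/gatherers" &&
    mv "$old_gatherer_dir" "$new_home/gatherers/"
}

# Usage (example paths only):
#   mv /usr/local/harvest /usr/local/harvest-old       # move old installation aside
#   ... install the new version as directed by the release notes ...
#   migrate_gatherer /usr/local/harvest-old/gatherers/example /usr/local/harvest
```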

Upgrading from version 1.5 to version 1.6

There are no known incompatibilities between versions 1.5 and 1.6.

Upgrading from version 1.4 to version 1.5

You cannot install version 1.5 on top of version 1.4. The change from version 1.4 to version 1.5 included some reorganization of the executables, so simply installing version 1.5 on top of version 1.4 would cause you to use old executables in some cases.

To upgrade from Harvest version 1.4 to 1.5, do the following:

Move your old installation to a temporary location.
Install the new version as directed by the release notes.
Then, for each Gatherer and Broker that you were running under the old installation, migrate the server into the new installation.

Gatherers:
you need to move the Gatherer's directory into $HARVEST_HOME/gatherers. Section RootNode specifications describes the Gatherer workload specifications if you want to modify your Gatherer's configuration file.

Brokers:
you need to move the Broker's directory into $HARVEST_HOME/brokers. Remove any .glimpse_* files from your Broker's directory and use the admin.html interface to force a full-index. You may want, however, to rebuild your broker by using CreateBroker so that you can use the updated query.html and related files.

Upgrading from version 1.3 to version 1.4

There are no known incompatibilities between versions 1.3 and 1.4.

Upgrading from version 1.2 to version 1.3

Version 1.3 is mostly backwards compatible with 1.2, with the following exception:

Harvest 1.3 uses Glimpse 3.0. The .glimpse_* files in the broker directory created with Harvest 1.2 (Glimpse 2.0) are incompatible. After installing Harvest 1.3 you should:

Shut down any running brokers.
Execute rm .glimpse_* in each broker directory.
Restart your brokers with RunBroker.
Force a full-index from the admin.html interface.
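The file removal step above can be sketched as a small helper that takes one or more broker directories. The name remove_glimpse_files is hypothetical, and the usage comment assumes that your brokers live under $HARVEST_HOME/brokers, which may not match your installation. Shut down the brokers before running it.

```shell
#!/bin/sh
# Hypothetical helper: remove the Glimpse index files (.glimpse_*) from
# each broker directory given as an argument. Other files are untouched.
remove_glimpse_files() {
    for broker_dir in "$@"; do
        # -f keeps rm quiet when a directory has no .glimpse_* files.
        rm -f "$broker_dir"/.glimpse_*
    done
}

# Usage (assumes brokers are kept under $HARVEST_HOME/brokers):
#   remove_glimpse_files "$HARVEST_HOME"/brokers/*
```

After that, restart each broker with RunBroker and force a full-index from the admin.html interface, as described in the steps above.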

Upgrading from version 1.1 to version 1.2

There are a few incompatibilities between Harvest version 1.1 and version 1.2.

The Gatherer has improved incremental gathering support which is incompatible with version 1.1. To update your existing Gatherer, change into the Gatherer's Data-Directory (usually the data subdirectory), and run the following commands:


        % set path = ($HARVEST_HOME/lib/gatherer $path)
        % cd data
        % rm -f INDEX.gdbm
        % mkindex

This should create the INDEX.gdbm and MD5.gdbm files in the current directory.

The Broker has a new log format for the admin/LOG file which is incompatible with version 1.1.

Upgrading to version 1.1 from version 1.0 or older

If you already have an older version of Harvest installed and want to upgrade, you cannot unpack the new distribution on top of the old one. For example, the change from version 1.0 to version 1.1 included some reorganization of the executables, and hence simply installing version 1.1 on top of version 1.0 would cause you to use old executables in some cases.

On the other hand, you may not want to start over from scratch with a new software version, as that would not take advantage of the data you have already gathered and indexed. Instead, to upgrade from Harvest version 1.0 to 1.1, do the following:

Move your old installation to a temporary location.
Install the new version as directed by the release notes.
Then, for each Gatherer and Broker that you were running under the old installation, migrate the server into the new installation.

Gatherers:
you need to move the Gatherer's directory into $HARVEST_HOME/gatherers. Section RootNode specifications describes the new Gatherer workload specifications which were introduced in version 1.1; you may modify your Gatherer's configuration file to employ this new functionality.

Brokers:
you need to move the Broker's directory into $HARVEST_HOME/brokers. You may want, however, to rebuild your broker by using CreateBroker so that you can use the updated query.html and related files.

3.7 Starting up the system: RunHarvest and related commands

The simplest way to start the Harvest system is to use the RunHarvest command. RunHarvest prompts the user with a short list of questions about what data to index, etc., and then creates and runs a Gatherer and Broker with a ``stock'' (non-customized) set of content extraction and indexing mechanisms. Some more primitive commands are also available, for starting individual Gatherers and Brokers (e.g., if you want to distribute the gathering process). The Harvest startup commands are:

RunHarvest

Checks that the Harvest software is installed correctly, prompts the user for basic configuration information, and then creates and runs a Gatherer and a Broker. If you have $HARVEST_HOME set, then it will use it; otherwise, it tries to determine $HARVEST_HOME automatically. Found in the $HARVEST_HOME directory.

RunBroker

Runs a Broker. Found in the Broker's directory.

RunGatherer

Runs a Gatherer. Found in the Gatherer's directory.

CreateBroker

Creates a single Broker which will collect its information from other existing Brokers or Gatherers. Used by RunHarvest, or can be run by a user to create a new Broker. Uses $HARVEST_HOME, and defaults to /usr/local/harvest. Found in the $HARVEST_HOME/bin directory.

There is no CreateGatherer command, but the RunHarvest command can create a Gatherer, or you can create a Gatherer manually (see Section Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps or Section Gatherer Examples). The layout of the installed Harvest directories and programs is discussed in Section Programs and layout of the installed Harvest software.

Among other things, the RunHarvest command asks the user what port numbers to use when running the Gatherer and the Broker. By default, the Gatherer will use port 8500 and the Broker will use the Gatherer port plus 1. The choice of port numbers depends on your particular machine -- you need to choose ports that are not in use by other servers on your machine. You might look at your /etc/services file to see what ports are in use (although this file only lists some servers; other servers use ports without registering that information anywhere). Usually the above port numbers will not be in use by other processes. Probably the easiest thing is simply to try using the default port numbers, and see if it works.
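As a sketch of the /etc/services check suggested above, the following looks up the default Gatherer port and the default Broker port (Gatherer port plus 1). The helper name port_listed is hypothetical. As noted above, a port that is not listed may still be in use, so this is only a first check.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given port number appears in a
# services file (default /etc/services) as a tcp or udp entry.
port_listed() {
    port=$1
    services_file=${2:-/etc/services}
    grep -Eq "[[:space:]]${port}/(tcp|udp)" "$services_file" 2>/dev/null
}

gatherer_port=8500
broker_port=$(( gatherer_port + 1 ))    # default: Gatherer port plus 1

for p in "$gatherer_port" "$broker_port"; do
    if port_listed "$p"; then
        echo "port $p is listed in /etc/services"
    else
        echo "port $p is not listed in /etc/services"
    fi
done
```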

The remainder of this manual provides information for users who wish to customize or otherwise make more sophisticated use of Harvest than what happens when you install the system and run RunHarvest.

3.8 Harvest team contact information

If you have questions about the Harvest system or problems with the software, post a note to the USENET newsgroup comp.infosystems.harvest. Please note your machine type, operating system type, and Harvest version number in your correspondence.

If you have bug fixes, ports to new platforms or other software improvements, please email them to the Harvest maintainer lee@arco.de.
