Harvest User's Manual: The Broker

The Broker retrieves and manages indexing information from Gatherers and other Brokers, and provides a WWW query interface to the indexing information.

5.2 Basic setup

The Broker is automatically started by the RunHarvest command. Other relevant commands are described in Section Starting up the system: RunHarvest and related commands.

In the current section we discuss various ways users can customize and tune the Broker, how to administer the Broker, and the various Broker programming interfaces.

As suggested in Figure 1, the Broker uses a flexible indexing interface that supports a variety of indexing subsystems. The default Harvest Broker uses Glimpse as its indexer, but other indexers, such as Swish and WAIS (both freeWAIS and commercial WAIS), also work with the Broker (see Section Using different index/search engines with the Broker).

To create a new Broker, run the CreateBroker program. It will ask you a series of questions about how you'd like to configure your Broker, and then automatically create and configure it. To start your Broker, use the RunBroker program that CreateBroker generates. The Broker should be started when your system reboots. To prevent a collection while starting the broker, use the -nocol option. There are a number of ways you can customize or tune the Broker, discussed in Sections Tuning Glimpse indexing in the Broker and Using different index/search engines with the Broker. You may also use the RunHarvest command, discussed in Section Starting up the system: RunHarvest and related commands, to create both a Broker and a Gatherer.

5.3 Querying a Broker

The Harvest Broker can handle many types of queries. The queries handled by a particular Broker depend on what index/search engine is being used inside of it (e.g., WAIS does not support some of the queries that Glimpse does). In this section we describe the full syntax. If a particular Broker does not support a certain type of query, it will return an error when the user requests that type of query.

The simplest query is a single keyword, such as:


        lightbulb

Searching for common words (like ``computer'' or ``html'') may take a lot of time.

Particularly for large Brokers, it is often helpful to use more powerful queries. Harvest supports many different index/search engines, with varying capabilities. At present, our most powerful (and commonly used) search engine is Glimpse, which supports:

case-insensitive and case-sensitive queries;
matching parts of words, whole words, or multiple word phrases (like ``resource discovery'');
Boolean (AND/OR) combinations of keywords;
approximate matches (e.g., allowing spelling errors);
structured queries (which allow you to constrain matches to certain attributes);
displaying matched lines or entire matching records (e.g., for citations);
specifying limits on the number of matches returned; and
a limited form of regular expressions (e.g., allowing ``wild card'' expressions that match all words ending in a particular suffix).

The different types of queries (and how to use them) are discussed below. Note that you use the same syntax regardless of what index/search engine is running in a particular Broker, but that not all engines support all of the above features. In particular, some of the Brokers use WAIS, which sometimes searches faster than Glimpse but supports only Boolean keyword queries and the ability to specify result set limits.

The different options - case-sensitivity, approximate matching, the ability to show matched lines vs. entire matching records, and the ability to specify match count limits - can all be specified with buttons and menus in the Broker query forms.

A structured query has the form:


        tag-name : value

where tag-name is a Content Summary attribute name, and value is the search value within the attribute. If you click on a Content Summary, you will see what attributes are available for a particular Broker. A list of common attributes is shown in Section List of common SOIF attribute names.

Keyword searches and structured queries can be combined using Boolean operators (AND and OR) to form complex queries. In the absence of parentheses, operators are evaluated from left to right. For multiple word phrases or regular expressions, you need to enclose the string in double quotes, e.g.,


        "internet resource discovery"


        "discov.*"

Double quotes should also be used when searching for non-alphanumeric characters.
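
To illustrate the left-to-right rule, the following Python sketch evaluates an unparenthesized chain of keywords against a list of words. It is not how the Broker or Glimpse evaluates queries internally; the function name eval_left_to_right and the whitespace-separated input format are assumptions made for illustration only.

```python
def eval_left_to_right(query, words):
    """Evaluate 'a OR b AND c' strictly left to right: ((a OR b) AND c)."""
    tokens = query.split()
    present = {w.lower() for w in words}
    # The first token is always a keyword.
    result = tokens[0].lower() in present
    # The remaining tokens come in (operator, keyword) pairs.
    for op, term in zip(tokens[1::2], tokens[2::2]):
        hit = term.lower() in present
        if op == "AND":
            result = result and hit
        elif op == "OR":
            result = result or hit
        else:
            raise ValueError(f"unknown operator: {op}")
    return result


doc = ["Arizona"]
# Left to right this is (Arizona OR desert) AND windsurfing.
print(eval_left_to_right("Arizona OR desert AND windsurfing", doc))  # False
# Left to right this is (windsurfing AND desert) OR Arizona.
print(eval_left_to_right("windsurfing AND desert OR Arizona", doc))  # True
```

The first query would be true under the conventional rule that AND binds tighter than OR, which is why parentheses are worth adding whenever AND and OR are mixed.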

Example queries

Simple keyword search query:

Arizona

This query returns all objects in the Broker containing the word Arizona.

Boolean query:

Arizona AND desert

This query returns all objects in the Broker that contain both words anywhere in the object in any order.

Phrase query:

"Arizona desert"

This query returns all objects in the Broker that contain Arizona desert as a phrase. Notice that you need to put double quotes around the phrase.

Boolean queries with phrases:

"Arizona desert" AND windsurfing

This query returns all objects in the Broker that contain Arizona desert as a phrase and the word windsurfing.

Simple Structured query:

Title : windsurfing

This query returns all objects in the Broker where the Title attribute contains the value windsurfing.

Complex query:

"Arizona desert" AND (Title : windsurfing)

This query returns all objects in the Broker that contain the phrase Arizona desert and where the Title attribute of the same object contains the value windsurfing.
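
The example queries above can also be assembled programmatically. The following Python sketch builds the complex query from its parts; the helper names quote, structured, and all_of are hypothetical and not part of Harvest, and the sketch does not handle values that themselves contain double quotes.

```python
def quote(term):
    """Wrap a term in double quotes unless it is a single alphanumeric word."""
    return term if term.isalnum() else f'"{term}"'


def structured(tag, value):
    """Build a structured query of the form 'tag-name : value'."""
    return f"{tag} : {quote(value)}"


def all_of(*parts):
    """Join sub-queries with AND, parenthesizing structured parts."""
    wrapped = [f"({p})" if " : " in p else p for p in parts]
    return " AND ".join(wrapped)


print(all_of(quote("Arizona desert"), structured("Title", "windsurfing")))
# "Arizona desert" AND (Title : windsurfing)
```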

Regular expressions

Some types of regular expressions are supported by Glimpse. A regular expression search can be much slower than other searches. The following is a partial list of possible patterns. (For more details see the Glimpse documentation.)

^joe will match ``joe'' at the beginning of a line.
joe$ will match ``joe'' at the end of a line.
[a-ho-z] matches any character between ``a'' and ``h'' or between ``o'' and ``z''.
. matches any single character except newline.
c* matches zero or more occurrences of the character ``c''.
.* matches any number of characters except newline.
\* matches the character ``*''. (\ escapes any of the above special characters.)

Regular expressions are currently limited to approximately 30 characters, not including meta characters. Regular expressions will generally not cross word boundaries (because only words are stored in the index). So, for example, "lin.*ing" will find ``linking'' or ``flinching,'' but not ``linear programming.''
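
The word-boundary behaviour can be pictured with a short Python sketch. It uses Python's re module as a stand-in, so details of the pattern dialect may differ from Glimpse; the function matching_words is hypothetical and simply applies the pattern to one word at a time, as an index that stores only words would.

```python
import re

pattern = re.compile(r"lin.*ing")


def matching_words(text):
    """Return the words of text that contain a match for the pattern."""
    return [w for w in text.split() if pattern.search(w)]


print(matching_words("linking flinching"))          # ['linking', 'flinching']
print(matching_words("linear programming"))         # []
# Against the raw string the pattern does match, spanning both words.
print(bool(pattern.search("linear programming")))   # True
```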

Query options selected by menus or buttons

The query page may have the following checkboxes to allow some control over the query specification.

Case insensitive:

By selecting this checkbox the query will become case insensitive (lower case and upper case letters don't differ). Otherwise, the query will be case sensitive. The default is case insensitive.

Keywords match on word boundaries:

By selecting this checkbox, keywords will match on word boundaries. Otherwise, a keyword will match part of a word (or phrase). For example, with this checkbox cleared, "network" will match ``networking'', "sensitive" will match ``insensitive'', and "Arizona desert" will match ``Arizona desertness''. The default is to match keywords on word boundaries.

Number of errors allowed:

Glimpse allows the search to contain a number of errors. An error is either a deletion, insertion, or substitution of a single character. The Best Match option will find the match(es) with the fewest errors. The default is 0 (zero) errors.

Note: The previous three options do not apply to attribute names. Attribute names are always case insensitive and allow no errors.
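
The notion of an error used by the ``Number of errors allowed'' option corresponds to the classic edit distance. The Python sketch below is a textbook dynamic-programming implementation, shown only to make the counting concrete; it is not Glimpse's own matching algorithm, and the name edit_distance is an assumption.

```python
def edit_distance(a, b):
    """Count single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("lightbulb", "lightbulp"))  # 1 (one substitution)
print(edit_distance("lightbulb", "ligtbulb"))   # 1 (one deletion)
```

With one error allowed, both misspellings above would count as matches for ``lightbulb''.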

Filtering query results

Harvest lets you filter the results of a query by any query term, using any attribute defined in the List of common SOIF attribute names. This is done by defining filter parameters in the query form. It is possible to define more than one filter parameter; they will be combined with Boolean AND. Filter parameters consist of two parts, separated by the pipe symbol ``|''. The first part is a query expression, which is attached to the user query using AND before the request is sent to the broker. The optional second part is HTML text to be displayed on the results page, to give the user some information about the applied filter.

Example:


        <SELECT NAME="filter">
        <OPTION VALUE=''>No Filter
        <OPTION VALUE='uri: "xyz\.edu"|Search only xyz.edu'>Search xyz.edu only
        <OPTION VALUE='type: html|HTML documents only'>Search HTML documents only
        </SELECT>

The first option returns unfiltered output. The second option returns only pages with ``xyz.edu'' in their URL. The third option returns only HTML documents. See the advanced search page of the broker for more examples.
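
How a filter parameter is split and attached to the user query can be sketched in Python as follows. This is an illustration of the rule described above, not the code used by search.cgi; the function name apply_filters is hypothetical, and the parentheses placed around each part are an assumption added for clarity.

```python
def apply_filters(user_query, filter_values):
    """Attach 'expression|label' filter parameters to a user query with AND."""
    expressions, labels = [], []
    for value in filter_values:
        if not value:
            continue  # the empty "No Filter" option
        # Split at the first pipe only; the label part is optional.
        expression, _, label = value.partition("|")
        expressions.append(expression)
        if label:
            labels.append(label)
    if not expressions:
        return user_query, labels
    # Parentheses are an assumption, added to keep each part separate.
    parts = [user_query] + expressions
    return " AND ".join(f"({p})" for p in parts), labels


query, notes = apply_filters(
    "windsurfing",
    ["", 'uri: "xyz\\.edu"|Search only xyz.edu'],
)
print(query)   # (windsurfing) AND (uri: "xyz\.edu")
print(notes)   # ['Search only xyz.edu']
```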

Result set presentation

The query page may have the following checkboxes to allow some control over the presentation of the query results.

Display matched lines (from content summaries):

By selecting this checkbox, the result set presentation will contain the lines of the Content Summary that matched the query. Otherwise, the matched lines will not be displayed. The default is to display the matched lines.

Display object descriptions (if available):

Some objects have short, one-line descriptions associated with them. By selecting this checkbox, the descriptions will be presented. Otherwise, the object descriptions will not be displayed. The default is to display object descriptions.

Display links to indexed content summary:

This checkbox allows you to set whether links to the indexed content summaries are displayed or not. The default is not to display links to indexed content summaries.

5.4 Customizing the Broker's Query Result Set

It is possible for the Harvest administrator to customize how the Broker query result set is generated, by modifying a configuration file that is interpreted by the search.cgi Perl program at query result time.

search.cgi allows you to customize almost every aspect of its HTML output. The file $HARVEST_HOME/cgi-bin/lib/search.cf contains the default output definitions. Individual brokers can be customized by creating a similar file which overrides the default definitions.

The search.cf configuration file

Definitions are enclosed within SGML-like beginning and ending tags. For example:


        <HarvestUrl>
        http://harvest.sourceforge.net/
        </HarvestUrl>

The last newline character is removed from each definition, so that the above becomes the string ``http://harvest.sourceforge.net/.''
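
The tag-delimited format and the removal of the final newline can be illustrated with a small Python parser. This is a sketch under the assumption that each opening tag sits on its own line; it is not the parser used by search.cgi, and the name parse_definitions is hypothetical.

```python
import re

# An opening tag on its own line, a body, and the matching closing tag.
DEFINITION = re.compile(r"<(\w+)>\n(.*?)</\1>", re.DOTALL)


def parse_definitions(text):
    """Return {name: body} with the final newline of each body removed."""
    result = {}
    for name, body in DEFINITION.findall(text):
        result[name] = body[:-1] if body.endswith("\n") else body
    return result


sample = "<HarvestUrl>\nhttp://harvest.sourceforge.net/\n</HarvestUrl>\n"
print(parse_definitions(sample))
# {'HarvestUrl': 'http://harvest.sourceforge.net/'}
```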

Variable substitution occurs on every definition before it is output. A number of specific variables are defined by search.cgi which can be used inside a definition. For example:


        <BrokerLoad>
        Sorry, the Broker at <STRONG>$host, port $port</STRONG>
        is currently too heavily loaded to process your request.
        Please try again later.<P>
        </BrokerLoad>

When this definition is printed out, the variables $host and $port would be replaced with the hostname and port of the broker.
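
The effect of variable substitution can be reproduced in Python with string.Template, which happens to use the same $name notation. The sketch only mimics the behaviour described above; search.cgi itself is a Perl program, and the host and port values are made up for the example.

```python
from string import Template

definition = (
    "Sorry, the Broker at <STRONG>$host, port $port</STRONG>\n"
    "is currently too heavily loaded to process your request.\n"
    "Please try again later.<P>"
)

# Hypothetical values; search.cgi would take these from the actual request.
variables = {"host": "broker.example.org", "port": 8501}

# safe_substitute leaves unknown $names untouched instead of raising.
message = Template(definition).safe_substitute(variables)
print(message)
```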

Defined Variables

The following variables are defined as soon as the query string is processed. They can be used before the broker returns any results.


        $maxresult    The maximum number of matched lines to be returned
        $host         The broker hostname
        $port         The broker port
        $query        The query string entered by the user
        $bquery       The whole query string sent to the broker

These variables are defined for each matched object returned by the broker.


        $objectnum   The number of the returned object
        $desc        The description attribute of the matched object
        $opaque      ALL the matched lines from the matched object
        $url         The original URL of the matched object
        $A           The access method of $url (e.g.: http)
        $H           The hostname (including port) from $url
        $P           The path part of $url
        $D           The directory part of $P
        $F           The filename part of $P
        $cs_url      The URL of the content summary in the broker database
        $cs_a        Access part of $cs_url
        $cs_h        Hostname part of $cs_url
        $cs_p        Path part of $cs_url
        $cs_d        Directory part of $cs_p
        $cs_f        Filename part of $cs_p
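
The relationship between $url and its parts ($A, $H, $P, $D, $F) can be sketched with Python's standard URL tools. The exact splitting rules of search.cgi are not given here, so details such as whether the directory part keeps a trailing slash are assumptions; the function name split_url and the sample URL are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def split_url(url):
    """Break a URL into access method, host, path, directory, and filename."""
    parts = urlsplit(url)
    directory, filename = posixpath.split(parts.path)
    return {
        "A": parts.scheme,   # access method, e.g. "http"
        "H": parts.netloc,   # hostname, including any port
        "P": parts.path,     # path part
        "D": directory,      # directory part of the path
        "F": filename,       # filename part of the path
    }


print(split_url("http://www.example.org:8080/docs/manual/broker.html"))
```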

List of Definitions

Below is a partial list of definitions. A complete list can be found in the search.cf file. Only definitions likely to be customized are described here.

<Timeout>

Timeout value for search.cgi. If the broker doesn't respond within this time, search.cgi will exit.

<ResultHeader>

The first part of the result page. Should probably contain the HTML <TITLE> element and the user query string.

<ResultTrailer>

The last part of the result page. The default has URL references to the broker home page and the Harvest project home page.

<ResultSetBegin>

This is output just before looping over all the matched objects.

<ResultSetEnd>

This is output just after ending the loop over matched objects.

<PrintObject>

This definition prints out a matched object. It should probably include the variables $url, $cs_url, $desc, and $opaque.

<EndBrokerResults>

Printed between <ResultSetEnd> and <ResultTrailer> if the query was successful. Should probably include a count of matched objects and/or matched lines.

<FailBrokerResults>

Similar to <EndBrokerResults> but prints if the broker returns an error in response to the query.

<ObjectNumPrintf>

A printf format string for the object number ($objectnum).

<TruncateWarning>

Prints a warning message if the result set was truncated at the maximum number of matched lines.

The following definitions are somewhat different because they are evaluated as Perl instructions rather than strings.

<MatchedLineSub>

Evaluated for every matched line returned by the broker. Can be used to indent matched lines or to remove the leading ``Matched line'' and attribute name strings.

<InitFunction>

Evaluated near the beginning of the search.cgi program. Can be used to set up special variables or read data files.

<PerObjectFunction>

Evaluated for each object just before <PrintObject> is called.

<FormatAttribute>

Evaluated for each SOIF attribute requested for matched objects (see Section Displaying SOIF attributes in results). $att is set to the attribute name, and $val is set to the attribute value.

Example search.cf customization file

The following definitions demonstrate how to change the search.cgi output. The <PerObjectFunction> ensures that the description is not empty: if the description supplied by Harvest is shorter than five characters, the filename is used instead. It also prepends the string ``matched lines:'' before any matched lines. The <PrintObject> specification prints the object number, description, and indexing data all on the first line. The description is wrapped in HTML anchor tags so that it is a link to the object originally gathered. The words ``indexing data'' are a link to the displaySOIF program which will format the content summary for HTML browsers. The object number is formatted as a number in parentheses such that the whole thing takes up four spaces.

The <MatchedLineSub> definition includes four substitution expressions. The first removes the words ``Matched line:'' from the beginning of each matched line. The second removes SOIF attributes of the form ``partial-text{43}:'' from the beginning of a line. The third displays the attribute names (e.g. partial-text#) in italics. The last expression indents each line by five spaces to align it with the description line. The definition for <EndBrokerResults> slightly modifies the report of how many objects were matched.


        # Demo to show some of the customization features for the Harvest output
        # More information can be found in the manual at:
        # http://harvest.sourceforge.net/harvest/doc/html/manual.html


        # The PerObjectFunction is Perl code evaluated for every hit
        <PerObjectFunction>
        # Create description
        # Is the description provided by Harvest very short (e.g. missing <TITLE>)?
        if (length($desc) < 5) {
          # Yes: use filename ($F) instead
          $description = "<I>File:</I> $F";
        } else {
          # No: use description provided by Harvest
          $description = $desc;
        }

        # Format matched lines ("opaque data") if data is present
        if ($opaque ne '') {
          $opaque = "<strong>matched lines:</strong><BR>$opaque"
        }
        </PerObjectFunction>


        # PrintObject defines the appearance of hits
        <PrintObject>
        $objectnum <A HREF="$url"><STRONG>$description</STRONG></A> \
        [<A HREF="$cs_a://$cs_h/Harvest/cgi-bin/displaySOIF.cgi?object=$cs_p">\
        indexing data</A>]
        <pre>
             $opaque
        </pre>\n
        </PrintObject>


        # Format the appearance of the hit number
        <ObjectNumPrintf>
        (%2d)
        </ObjectNumPrintf>


        # Format the appearance of every matched line
        <MatchedLineSub>
        s/^Matched line: *//;            # Remove "Matched line:"
        s/^([\w-]+# )[\w-]+{\d+}:\t/\1/; # Remove SOIF attributes of the form "partial-text{43}:"
        s/^([\w-]+#)/<I>\1<\/I>/;        # Format attribute names as italics
        s/^.*/     $&/;                  # Add spaces to indent text
        </MatchedLineSub>


        # Modifies the report of how many objects were matched
        <EndBrokerResults>
        <STRONG>Found $nopaquelines matched lines, $nobjects objects.</STRONG>
        <P>\n
        </EndBrokerResults>
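
To see what the four <MatchedLineSub> substitutions and the <ObjectNumPrintf> format do, the following Python sketch applies equivalent operations to a sample line. It is a translation for illustration only (search.cgi evaluates the Perl expressions directly), and the sample input is hypothetical, shaped only to fit the patterns in the example.

```python
import re


def format_matched_line(line):
    """Apply the four substitutions of the example, in the same order."""
    line = re.sub(r"^Matched line: *", "", line)
    line = re.sub(r"^([\w-]+# )[\w-]+\{\d+\}:\t", r"\1", line)
    line = re.sub(r"^([\w-]+#)", r"<I>\1</I>", line)
    return "     " + line  # indent by five spaces


# Hypothetical input, shaped only to fit the patterns above.
raw = "Matched line: partial-text# partial-text{43}:\tresource discovery"
print(repr(format_matched_line(raw)))
# '     <I>partial-text#</I> resource discovery'

# The object number format "(%2d)" always yields four characters
# for numbers below 100.
print(repr("(%2d)" % 7))   # '( 7)'
```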

Integrating your customized configuration file

The search.cgi configuration files are kept in $HARVEST_HOME/cgi-bin/lib. The name of a customized file is listed in the query.html form, and passed as an option to the search.cgi program.

The simplest way to specify the customized file is by placing an <INPUT> tag in the HTML form:


        <INPUT TYPE="hidden" NAME="brokerqueryconfig" VALUE="custom.cf">

Another way is to allow users to select from different customizations with a <SELECT> list:


        <SELECT NAME="brokerqueryconfig">
        <OPTION VALUE=""> Default
        <OPTION VALUE="custom1.cf"> Customized
        <OPTION VALUE="custom2.cf" SELECTED> Highly Customized
        </SELECT>

Displaying SOIF attributes in results

It is possible to request SOIF attributes from the HTML query form. A simple approach is to include a select list in the query form:


        <SELECT MULTIPLE NAME="attribute">
        <OPTION VALUE="title">
        <OPTION VALUE="author">
        <OPTION VALUE="date">
        <OPTION VALUE="subject">
        </SELECT>

In this manner, the user may control which attributes get displayed. The layout of these attributes when the results are displayed in HTML is controlled by the <FormatAttribute> specification in the search.cf file described in Section The search.cf configuration file.

5.5 World Wide Web interface description

To allow Web browsers to easily interface with the Broker, we implemented a World Wide Web interface to the Broker's query manager and administrative interfaces. This WWW interface, which includes several HTML files and a few programs that use the Common Gateway Interface (CGI), consists of the following:

HTML files that use Forms support to present a graphical user interface (GUI) to the user;
CGI programs that act as a gateway between the user and the Broker; and
Help files for the user.

Users go through the following steps when using a Broker to locate information:

The user issues a query to the Broker.
The Broker processes the query, and returns the query results to the user.
The user can then view content summaries from the result set, or access the URLs from the result set directly.

To provide a WWW-queryable interface, the Broker needs to run in conjunction with an HTTP server. Section Additional installation for the Harvest Broker describes how to configure your HTTP server to work with Harvest.

You can run the Broker on a different machine than your HTTP server runs on, but if you want users to be able to view the Broker's content summaries then the Broker's files will need to be accessible to your HTTP server. You can NFS mount those files or manually copy them over. You'll also need to change the Brokers.cf file to point to the host that is running the Broker.

HTML files for graphical user interface

CreateBroker creates some HTML files to provide GUIs to the user:

query.html

Contains the GUI for the query interface. CreateBroker will install different query.html files for Glimpse, Swish, and WAIS, since each subsystem requires different defaults and supports different functionality (e.g., WAIS doesn't support approximate matching like Glimpse). This is also the ``home page'' for the Broker and a link to this page is included at the bottom of all query results.

admin.html

Contains the GUI for the administrative interface. This file is installed into the admin directory of the Broker.

Brokers.cf

Contains the hostname and port information for the supported brokers. This file is installed into the $HARVEST_HOME/brokers directory. The query.html file uses the value of the ``broker'' FORM tag to pass the name of the broker to search.cgi which in turn retrieves the host and port information from Brokers.cf.

CGI programs

When you install the WWW interface (see Section The Broker), a few programs are installed into your HTTP server's /Harvest/cgi-bin directory:

search.cgi

This program takes the submitted query from query.html, and sends it to the specified Broker. It then retrieves the query results from the Broker, formats them in HTML, and sends the result set in HTML to the user.

displaySOIF.cgi

This program displays the content summaries from the Broker.

BrokerAdmin.pl.cgi

This program will take the submitted administrative command from admin.html and send it to the appropriate Broker. It retrieves the result of the command from the Broker and displays it to the user.

Help files for the user

The WWW interface to the Broker includes a few help files written in HTML. These files are installed on your HTTP server in the /Harvest/brokers directory when you install the broker (see Section The Broker):

queryhelp.html

Provides a tutorial on constructing Broker queries, and on using the query.html forms. query.html has a link to this help page.

adminhelp.html

Provides a tutorial on submitting Broker administrative commands using the admin.html form. admin.html has a link to this help page.

soifhelp.html

Provides a brief description of SOIF.

5.6 Administrating a Broker

Administrators have two basic ways of managing a Broker: through the broker.conf and Collection.conf configuration files, and through the interactive administrative interface. The interactive interface controls various facilities and operating parameters within the Broker. We provide an HTML interface page for these administrative commands. See Section Collector interface description: Collection.conf for additional information on the Broker administrative and collector interfaces.

The broker.conf file is a list of variable names and their values, which consists of information about the Broker (such as the directory in which it lives) and the port on which it runs. The Collection.conf file (see Section Collector interface description: Collection.conf for an example) is a list of collection points from which the Broker collects its indexing information. The CreateBroker program automatically generates both of these configuration files. You can manually edit these files if needed.

The CreateBroker program also creates the admin.html file, which is the WWW interface to the Broker's administrative commands. Note that all administrative commands require a password as defined in broker.conf.

Note: Changes to the Broker configuration are not saved when the Broker is restarted. Permanent changes to the Broker configuration should be made by manually editing the broker.conf file.

The administrative interface created by CreateBroker has the following window fields:


Command         Select an administrative command.  See below for a
                description of the commands.
Parameters      Specify parameters for those commands that need them.
Password        The administrative password.
Broker Host     The host where the broker is running.
Broker Port     The port where the broker is listening.

The administrative interface created by CreateBroker supports the following commands:

Add objects by file:

Add object(s) to the Broker. The parameter is a list of filenames that contain SOIF objects to be added to the Broker.

Close log:

Flush all accumulated log information and close the current log file. Causes the Broker to stop logging. No parameters.

Compress Registry:

Performs garbage collection on the Registry file. No parameters.

Delete expired objects:

Deletes any object from the Broker whose Time-to-Live has expired. No parameters.

Delete objects by query:

Deletes any object(s) that matches the given query. The parameter is a query with the same syntax as user queries. Query flags are currently unsupported.

Delete objects by oid:

Deletes the object(s) identified by the given OID numbers. The parameter is a list of OID numbers. The OID numbers can be obtained by using the dumpregistry command.

Disable log type:

Disables logging information about a particular type of event. The parameter is an event type. See Enable log type for a list of events.

Enable log type:

Enables logging information about a particular type of event. The parameter is the name of an event type. Currently, event types are limited to the following:


Update                  Log updated objects.
Delete                  Log deleted objects.
Refresh                 Log refreshed objects.
Query                   Log user queries.
Query-Return            Log objects returned from a query.
Cleaned                 Log objects removed by the cleaner.
Collection              Log collection events.
Admin                   Log administrative events.
Admin-Return            Log the results of administrative events.
Bulk-Transfer           Log bulk transfer events.
Bulk-Return             Log objects sent by bulk transfers.
Cleaner-On              Log cleaning events.
Compressing-Registry    Log registry compression events.
All                     Log all events.

Flush log:

Flush all accumulated log information to the current log file. No parameters.

Generate statistics:

Generates some basic statistics about the Broker object database. No parameters.

Index changes:

Index only the objects that have been added recently. No parameters.

Index corpus:

Index the entire object database. No parameters.

Open log:

Open a new log file. If the file does not exist, create a new one. The parameter is the name (relative to the broker) of a file to use for logging.

Restart server:

Force the broker to reread the Registry and reindex the corpus. This does not actually kill the broker process. No parameters.

Rotate log file:

Rotates the current log file to LOG.YYYYMMDD. Opens a new log file. No parameters.
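
The LOG.YYYYMMDD naming pattern can be reproduced as follows, for example in a script that looks for rotated logs. The Python sketch assumes the date is written with a four-digit year and zero-padded month and day, as the pattern suggests; the function name rotated_log_name is hypothetical.

```python
from datetime import date


def rotated_log_name(day, base="LOG"):
    """Return the rotated file name in the form LOG.YYYYMMDD."""
    return f"{base}.{day.strftime('%Y%m%d')}"


print(rotated_log_name(date(2001, 3, 7)))  # LOG.20010307
```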

Set variable:

Sets the value of a broker configuration variable. Takes two parameters, the name of a configuration variable and the new value for the variable. The configuration variables that can be set are those that occur in the broker.conf file. The change is valid only until the broker process dies.

Shutdown server:

Cleanly shutdown the Broker. No parameters.

Start collection:

Perform collections. No parameters.

Delete older objects of duplicate URLs:

Occasionally a broker may end up with multiple summaries for individual URLs. This can happen when the Gatherer changes its description, hostname, or port number. Use this command to search the broker for duplicated URLs. When two objects with the same URL are found, the object with the least-recent timestamp is removed.
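
The rule applied by this command (keep the most recent object for each URL) can be sketched in Python. The sketch works on plain (oid, url, timestamp) tuples, which is an assumed representation and not the Broker's Registry format; the function name older_duplicates is hypothetical.

```python
def older_duplicates(objects):
    """Return OIDs of objects that share a URL with a more recent object."""
    newest = {}   # url -> (timestamp, oid) of the most recent object seen
    remove = []
    for oid, url, timestamp in objects:
        if url not in newest:
            newest[url] = (timestamp, oid)
        elif timestamp > newest[url][0]:
            remove.append(newest[url][1])   # the previous holder is older
            newest[url] = (timestamp, oid)
        else:
            remove.append(oid)              # this one is older (or tied)
    return sorted(remove)


objects = [
    (1, "http://a.example.org/", 100),
    (2, "http://a.example.org/", 200),
    (3, "http://b.example.org/", 50),
    (4, "http://a.example.org/", 150),
]
print(older_duplicates(objects))  # [1, 4]
```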

Deleting unwanted Broker objects

If you build a Broker and then decide not to index some of that data (e.g., you decide it would make sense to split it into two different Brokers, each targeted to a different community), you need to change the Gatherer's configuration file, rerun the Gatherer, and then let the old objects time out in the Broker (since the Broker and Gatherer maintain separate databases). If you want to clean out the Broker's data sooner than that, you can use the Broker's administrative interface in one of three ways:

Use the 'Remove object by name' command. This is only reasonable if you have a small number of objects to remove in the Broker.
Use the 'Remove object by query' command. This might be the best option if, for example, you can construct a regular expression based on the URLs you want to remove.
Shutdown the server, manually remove the Broker's objects/* files, and then restart the Broker. This is easiest, although if you have a large number of objects it will take longer to rebuild the index. A simple way to accomplish this is by ``rebooting'' the Broker by deleting all the current objects, and doing a full collection, as follows:


        % mv objects objects.old
        % rm -rf objects.old &
        % broker ./admin/broker.conf -new

After removing objects, you should use the Index corpus command.

Command-line Administration

It is possible to perform administrative functions by using the brkclient program from the command-line and shell scripts. For example, to force a collection, run:


        % brkclient localhost 8501 '#ADMIN #Password secret #collection'

See your broker's raw admin.html file for a complete list of administrative commands.
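
When administrative commands are issued from scripts, it can help to build the brkclient invocation in one place. The Python sketch below only assembles the command line shown above and does not run it. Only the collection command is confirmed by this section; other command names, and the way parameters are passed, should be taken from admin.html. The function name brkclient_command is hypothetical.

```python
import shlex


def brkclient_command(host, port, password, command):
    """Build the brkclient argument list for one administrative command."""
    request = f"#ADMIN #Password {password} #{command}"
    return ["brkclient", host, str(port), request]


argv = brkclient_command("localhost", 8501, "secret", "collection")
# shlex.join quotes the request so it can be pasted into a shell.
print(shlex.join(argv))
# brkclient localhost 8501 '#ADMIN #Password secret #collection'
```

The argument list can be handed to subprocess.run as is, which avoids shell quoting problems with the ``#'' characters.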

5.7 Tuning Glimpse indexing in the Broker

The Glimpse indexing system can be tuned in a variety of ways to suit your particular needs. Probably the most noteworthy parameter is indexing granularity, for which Glimpse provides three options: a tiny index (2-3% of the total size of all files -- your mileage may vary), a small index (7-8%), and a medium-size index (20-30%). Search times are better with larger indexes. By changing the GlimpseIndex-Option in your Broker's broker.conf file, you can tune Glimpse to use one of these three indexing granularity options. By default, GlimpseIndex-Option builds a medium-size index using the glimpseindex program.
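
As a rough planning aid, the percentages quoted above translate into disk space as shown in this Python sketch. The figures are only the approximate ranges given in the text, and actual index sizes will vary with the data; the names GRANULARITY and index_size_range are hypothetical.

```python
# Percent-of-corpus ranges quoted in the text above.
GRANULARITY = {"tiny": (2, 3), "small": (7, 8), "medium": (20, 30)}


def index_size_range(corpus_mb, kind):
    """Return the (low, high) estimated index size in megabytes."""
    low, high = GRANULARITY[kind]
    return corpus_mb * low / 100, corpus_mb * high / 100


for kind in GRANULARITY:
    print(kind, index_size_range(500, kind))
# tiny (10.0, 15.0)
# small (35.0, 40.0)
# medium (100.0, 150.0)
```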

Note also that with Glimpse it is much faster to search with ``show matched lines'' turned off in the Broker query page.

Glimpse uses a ``stop list'' to avoid indexing very common words. This list is not fixed, but rather computed as the index is built. For a medium-size index, the default is to put any word that appears at least 500 times per Mbyte (on the average) in the stop-list. For a small-size index, the default is words that appear in at least 80% of all files (unless there are fewer than 256 files, in which case there is no stop-list). Both defaults can be changed using the -S option, which should be followed by the new number (average per Mbyte when -b indexing is used, or % of files when -o indexing is used). Tiny-size indexes do not maintain a stop-list (their effect is minimal).
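
The default stop-list rules can be summarized in a short Python sketch. It restates the thresholds given above and is not taken from the glimpseindex source; the function name is_stop_word and its parameters are assumptions.

```python
def is_stop_word(kind, occurrences, corpus_mb, files_with_word, total_files):
    """Apply the default stop-list rules for each index size."""
    if kind == "medium":
        # At least 500 occurrences per megabyte on average.
        return occurrences >= 500 * corpus_mb
    if kind == "small":
        # No stop-list at all for fewer than 256 files.
        if total_files < 256:
            return False
        # The word appears in at least 80% of all files.
        return files_with_word * 100 >= 80 * total_files
    return False  # tiny indexes keep no stop-list


# 6000 occurrences in a 10 Mbyte corpus is 600 per Mbyte.
print(is_stop_word("medium", 6000, 10, 0, 0))        # True
# Present in 900 of 1000 files, i.e. 90%.
print(is_stop_word("small", 0, 10, 900, 1000))       # True
# Fewer than 256 files: no stop-list for a small index.
print(is_stop_word("small", 0, 10, 100, 100))        # False
```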

glimpseindex includes a number of other options that may be of interest. You can find out more about these options (and more about Glimpse in general) in the Glimpse documentation. If you'd like to change how the Broker invokes the glimpseindex program, then edit the src/broker/Glimpse/index.c file from the Harvest source distribution.

The glimpseserver program

The Glimpse system comes with an auxiliary server called glimpseserver, which allows indexes to be read into a process and kept in memory. This avoids the added cost of reading the index and starting a large process for each search. glimpseserver is automatically started each time you run the Broker, or reindex the Broker's corpus. If you do not want to run glimpseserver, then set GlimpseServer-Host to ``false'' in your broker.conf.

5.8 Using different index/search engines with the Broker

By default, Harvest uses the Glimpse index/search subsystem. However, Harvest defines a flexible indexing interface, to allow Broker administrators to use different index/search subsystems to accommodate domain-specific requirements. For example, it might be useful to provide a relational database back-end.

At present we distribute code to support Glimpse, Swish, and both the free and the commercial WAIS index/search engines.

Below we discuss how to use other index/search engines instead of Glimpse in the Broker, and briefly discuss how to integrate a new index/search engine into the Broker.

Using Swish as an indexer

Harvest includes support for using Swish as the indexing engine with the Broker. Swish is a nice alternative to Glimpse if you need faster search support and are willing to lose the more powerful query features. It is also an alternative if Glimpse's copyright status is a problem for you.

To use Swish with an existing Broker, you need to change the Indexer-Type variable in broker.conf to ``Swish''.

You can also select Swish when you create a Broker with the RunHarvest command, by running: RunHarvest -swish.

Using WAIS as an indexer

Support for using WAIS (both freeWAIS and WAIS Inc.'s index/search engine) as the Broker's indexing and search subsystem is included in the Harvest distribution. WAIS is a nice alternative to Glimpse if you need faster search support and are willing to lose the more powerful query features.

To use WAIS with an existing Broker, you need to change the Indexer-Type variable in broker.conf to ``WAIS''; you can choose among the WAIS variants by setting the WAIS-Flavor variable in broker.conf to ``Commercial-WAIS'', ``freeWAIS'', or ``WAIS''. When you create a new Broker instead, CreateBroker will ask you whether you want to use WAIS, and where the WAIS programs (waisindex, waissearch, waisserver, and, for the commercial version of WAIS, waisparse) are located. When you run the Broker, a WAIS server will be started automatically after the index is built.

You can also select WAIS when you create a Broker with the RunHarvest command, by running: RunHarvest -wais.

5.9 Collector interface description: Collection.conf

The Broker retrieves indexing information from Gatherers or other Brokers through its Collector interface. A list of collection points is specified in the admin/Collection.conf configuration file. This file contains one collection point per line, each with 4 fields. The first field is the host of the remote Gatherer or Broker, the second field is the port number on that host, the third field is the collection type, and the fourth field is the query filter, or -- if there is no filter.

The Broker supports various types of collections as described below:


  Type  Remote Process       Description      Compression?
  --------------------------------------------------------
    0     Gatherer    Full collection each time     No
    1     Gatherer    Incremental collections       No
    2     Gatherer    Full collection each time     Yes
    3     Gatherer    Incremental collections       Yes
    4     Broker      Full collection each time     No
    5     Broker      Incremental collections       No
    6     Broker      Collection based on a query   No
    7     Broker      Incremental based on a query  No

The query filter specification for collection types 6 and 7 contains two parts: the --QUERY keywords portion and an optional --FLAGS flags portion. The --QUERY portion is passed on to the Broker as the keywords for the query (the keywords can be any Boolean and/or structured query); the --FLAGS portion is passed on to the Broker as the indexer-specific flags to the query. The following table shows the valid indexer-specific flags for the supported indexers:


Indexer         Flag                            Description
-----------------------------------------------------------------------------
All:            #desc                           Show Description Lines

Glimpse:        #index case insensitive         Case Insensitive
                #index case sensitive           Case sensitive
                #index error number             Allow "number" errors
                #index matchword                Matches on word boundaries
                #index maxresult number         Allow max of "number" results
                #opaque                         Show matched lines

WAIS:           #index maxresult number         Allow max of "number" results
                #opaque                         Show scores and rankings

The following is an example Collection.conf, which collects information from 2 Gatherers (one using compressed incremental transfers and the other using uncompressed full transfers), and from 3 Brokers (one incrementally based on a timestamp, and the others using query filters):


        gatherer-host1.foo.com 8500 3 --
        gatherer-host2.foo.com 8500 0 --
        broker-host1.foo.com   8501 5 --
        broker-host2.foo.com   8501 6 --QUERY (URL : document) AND gnu
        broker-host3.foo.com   8501 7 --QUERY Harvest --FLAGS #index case sensitive
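The four-field layout and the --QUERY/--FLAGS filter can be read mechanically. The following is a minimal sketch of such a reader, written for illustration only; it is not the Broker's own parser, and the names CollectionPoint, parse_collection_line, and parse_collection_conf are hypothetical. It assumes fields are separated by whitespace and that blank lines can be skipped.

```python
from typing import NamedTuple, Optional

# Collection types from the table above: (remote process, mode, compressed?)
COLLECTION_TYPES = {
    0: ("Gatherer", "full", False),
    1: ("Gatherer", "incremental", False),
    2: ("Gatherer", "full", True),
    3: ("Gatherer", "incremental", True),
    4: ("Broker", "full", False),
    5: ("Broker", "incremental", False),
    6: ("Broker", "query", False),
    7: ("Broker", "incremental query", False),
}


class CollectionPoint(NamedTuple):
    host: str
    port: int
    ctype: int
    query: Optional[str]  # text after --QUERY, or None when the field is "--"
    flags: Optional[str]  # text after --FLAGS, or None when absent


def parse_collection_line(line: str) -> CollectionPoint:
    """Split one line into host, port, collection type, and query filter."""
    host, port, ctype, rest = line.split(None, 3)
    ctype_num = int(ctype)
    if ctype_num not in COLLECTION_TYPES:
        raise ValueError(f"unknown collection type: {ctype}")
    rest = rest.strip()
    query = flags = None
    if rest != "--":
        if not rest.startswith("--QUERY"):
            raise ValueError(f"unexpected query filter: {rest!r}")
        body, sep, flag_text = rest[len("--QUERY"):].partition("--FLAGS")
        query = body.strip()
        flags = flag_text.strip() if sep else None
    return CollectionPoint(host, int(port), ctype_num, query, flags)


def parse_collection_conf(text: str) -> list:
    """Parse every non-blank line of a Collection.conf-style string."""
    return [parse_collection_line(l) for l in text.splitlines() if l.strip()]
```

Applied to the last line of the example above, this sketch yields type 7 with the query ``Harvest'' and the flags ``#index case sensitive''.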

5.10 Troubleshooting

Symptom

The Broker is running but always returns empty query results.

Solution

Look at the log messages in the broker.out file in the Broker's directory for error messages. If your Broker didn't index the data, use the administrative interface to force the Broker to build the index (see Section Administrating a Broker).

Symptom

When I query my Broker, I get a "500 Server Error".

Solution

Generally, the ``500'' errors are related to a CGI program not working correctly or a misconfigured httpd server. Make sure that the userid running the HTTP server has access to the Harvest cgi-bin directory and the Perl include files in $HARVEST_HOME/lib. Refer to Section Additional installation for the Harvest Broker for further details.

Symptom

I see duplicate documents in my Broker.

Solution

The Broker performs duplicate elimination based on a combination of MD5 checksums and Gatherer-Host, Name, Version. Therefore, you can end up with duplicate documents if your Broker collects from more than one Gatherer, each of which gathers from (a subset of) the same URLs. (As an aside, the reason for this notion of duplicate elimination is to allow a single Broker to contain several different SOIF objects for the same URL, but summarized in different ways.)

Two solutions to the problem are:

Run your Gatherers on the same host.
Remove the duplicate URLs in a customized version of the search.cgi program by doing a string comparison of the URLs.
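The second approach amounts to keeping only the first result seen for each URL. The following is a sketch of that idea in Python for illustration only; it is not code from search.cgi, and the result records (dictionaries with a ``url'' key) and the function name are assumptions of the example.

```python
def drop_duplicate_urls(results):
    """Keep the first result for each URL, comparing URLs as plain strings."""
    seen = set()
    unique = []
    for result in results:
        url = result["url"]  # hypothetical field name
        if url not in seen:
            seen.add(url)
            unique.append(result)
    return unique
```

Note that a plain string comparison treats URLs that differ only in spelling (for example, in letter case of the host name) as different documents.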

Symptom

The Broker takes a long time and does not answer queries.

Solution

Some queries are quite expensive, because they involve a great deal of I/O. For this reason we modified the Broker so that if a query takes longer than 5 minutes, the query process is killed. The best solution is to use a less expensive query, for example by using less common keywords.
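The general technique of killing a process that exceeds a time limit can be sketched as follows. This is not the Broker's code; it only illustrates the idea with Python's standard subprocess module. The function name is hypothetical, and the 300-second default mirrors the 5-minute limit mentioned above.

```python
import subprocess
import sys


def run_with_time_limit(argv, limit_seconds=300):
    """Run a command; return its output, or None if it was killed for
    running longer than `limit_seconds`."""
    try:
        done = subprocess.run(
            argv, capture_output=True, text=True, timeout=limit_seconds
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed the child at this point.
        return None
    return done.stdout


if __name__ == "__main__":
    # A stand-in for an expensive query: sleeps longer than the limit.
    slow = [sys.executable, "-c", "import time; time.sleep(10)"]
    print(run_with_time_limit(slow, limit_seconds=1))
```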

Symptom

Some of the query options (such as structured or case sensitive queries) aren't working.

Solution

This usually means you are using an index/search engine that does not support structured queries (like the current Harvest support for commercial WAIS). If you are setting up your own Broker (rather than using someone else's Broker), see Section Using different index/search engines with the Broker for details on how to switch to other index/search engines. Or, it could be that your search.cgi program is an old version and should be updated.

Symptom

I get syntax errors when I specify queries.

Solution

Usually this means you did not use double quotes where needed. See Section Querying a Broker.

Symptom

When I submit a query, I get an answer faster than I can believe it takes to perform the query, and the answer contains garbage data.

Solution

This probably indicates that your httpd is misconfigured. A common case is not putting the 'ScriptAlias' before the 'Alias' in your conf/httpd.conf file, when running the Apache httpd. See Section Additional installation for the Harvest Broker.

Symptom

When I make changes to the Broker configuration via the administration interface, they are lost after the Broker is restarted.

Solution

The Broker administration interface does not save changes across sessions. Permanent changes to the Broker configuration should be done through the broker.conf file.

Symptom

My Broker is running very slowly.

Solution

Performance tuning can be complicated, but the most likely problem is that you are running on a machine with insufficient RAM, and paging a lot because the query engine kicks pages out in order to access the needed index and data files. (In UNIX the disk buffer cache competes with program and data pages for memory.)

A simple way to tell is to run ``vmstat 5'' in one window, and after a couple of lines of output, issue a query from another window. This will print a line of measurements about the virtual memory status of your machine every 5 seconds. In particular, look at the ``pi'' and ``po'' columns. If the numbers suddenly jump into the 500-1,000 range after you issue the query, you are paging a lot.
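The check described above can also be scripted. The following is a minimal sketch that looks up the ``pi'' and ``po'' columns by name in one line of vmstat output. It assumes a vmstat whose column header contains ``pi'' and ``po'' (column names and layout differ between systems), and the function names and the threshold of 500 are taken from the rule of thumb above rather than from any Harvest tool.

```python
def paging_rates(header_line, data_line):
    """Return (pi, po) from a vmstat column header and one data line."""
    row = dict(zip(header_line.split(), data_line.split()))
    return int(row["pi"]), int(row["po"])


def is_paging_heavily(header_line, data_line, threshold=500):
    """True if either page-in or page-out reaches the threshold."""
    page_in, page_out = paging_rates(header_line, data_line)
    return page_in >= threshold or page_out >= threshold


if __name__ == "__main__":
    # Hypothetical sample lines; real vmstat output will look different.
    header = "r b w swap free re mf pi po fr de sr"
    quiet = "0 0 0 1000 2000 0 0 3 0 0 0 0"
    busy = "0 1 0 1000 200 0 0 640 720 0 0 0"
    print(is_paging_heavily(header, quiet), is_paging_heavily(header, busy))
```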

Note that paging problems are accentuated by running simultaneous memory-intensive or disk I/O-intensive programs on your machine. Simultaneous queries to a single Broker should not cause a paging problem, because the Broker processes the queries sequentially.

It is best to run Brokers on an otherwise mostly unused machine with at least 128 MB of RAM (or more, if the above ``vmstat'' experiment indicates you are paging a lot).

One other performance enhancer is to run an httpd-accelerator on your Broker machine, to intercept queries headed for your Broker. While it will not cache the results of queries, it will reduce load on the machine because it provides a very efficient means of returning results in the case of concurrent queries. Without the accelerator the results are sent back by a search.cgi UNIX process per query, and inefficiently time sliced by the UNIX kernel. With an accelerator the search.cgi processes exit quickly, and let the accelerator send the results back to the concurrent users. The accelerator will also reduce load for (non-query) retrievals of data from your httpd server.
