The Broker retrieves and manages indexing information from Gatherers and other Brokers, and provides a WWW query interface to the indexing information.
The Broker is automatically started by the RunHarvest command. Other relevant commands are described in Section Starting up the system: RunHarvest and related commands.
In the current section we discuss various ways users can customize and tune the Broker, how to administer the Broker, and the various Broker programming interfaces.
As suggested in Figure 1, the Broker uses a flexible indexing interface that supports a variety of indexing subsystems. The default Harvest Broker uses Glimpse as its indexer, but other indexers such as Swish and WAIS (both freeWAIS and commercial WAIS) also work with the Broker (see Section Using different index/search engines with the Broker).
To create a new Broker, run the CreateBroker program. It will ask you a series of questions about how you'd like to configure your Broker, and then automatically create and configure it. To start your Broker, use the RunBroker program that CreateBroker generates. The Broker should be started when your system reboots. To prevent a collection while starting the Broker, use the -nocol option. There are a number of ways you can customize or tune the Broker, discussed in Sections Tuning Glimpse indexing in the Broker and Using different index/search engines with the Broker. You may also use the RunHarvest command, discussed in Section Starting up the system: RunHarvest and related commands, to create both a Broker and a Gatherer.
The Harvest Broker can handle many types of queries. The queries handled by a particular Broker depend on what index/search engine is being used inside of it (e.g., WAIS does not support some of the queries that Glimpse does). In this section we describe the full syntax. If a particular Broker does not support a certain type of query, it will return an error when the user requests that type of query.
The simplest query is a single keyword, such as:
lightbulb
Searching for common words (like ``computer'' or ``html'') may take a lot of time.
Particularly for large Brokers, it is often helpful to use more powerful queries. Harvest supports many different index/search engines, with varying capabilities. At present, our most powerful (and commonly used) search engine is Glimpse, which supports:
The different types of queries (and how to use them) are discussed below. Note that you use the same syntax regardless of what index/search engine is running in a particular Broker, but that not all engines support all of the above features. In particular, some of the Brokers use WAIS, which sometimes searches faster than Glimpse but supports only Boolean keyword queries and the ability to specify result set limits.
The different options - case-sensitivity, approximate matching, the ability to show matched lines vs. entire matching records, and the ability to specify match count limits - can all be specified with buttons and menus in the Broker query forms.
A structured query has the form:
tag-name : value
where tag-name is a Content Summary attribute name, and value is the search value within the attribute. If you click on a Content Summary, you will see what attributes are available for a particular Broker. A list of common attributes is shown in Section List of common SOIF attribute names.
Keyword searches and structured queries can be combined using Boolean operators (AND and OR) to form complex queries. In the absence of parentheses, the operators are evaluated from left to right. For multiple word phrases or regular expressions, you need to enclose the string in double quotes, e.g.,
"internet resource discovery"
or
"discov.*"
Double quotes should also be used when searching for non-alphanumeric characters.
Arizona
This query returns all objects in the Broker containing the word Arizona.
Arizona AND desert
This query returns all objects in the Broker that contain both words anywhere in the object in any order.
"Arizona desert"
This query returns all objects in the Broker that contain Arizona desert as a phrase. Notice that you need to put double quotes around the phrase.
"Arizona desert" AND windsurfing
This query returns all objects in the Broker that contain Arizona desert as a phrase and the word windsurfing.
Title : windsurfing
This query returns all objects in the Broker where the Title attribute contains the value windsurfing.
"Arizona desert" AND (Title : windsurfing)
This query returns all objects in the Broker that contain the phrase Arizona desert and where the Title attribute of the same object contains the value windsurfing.
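The quoting and combination rules above can be sketched in a few lines of Python. This is only an illustration of how a client program might assemble query strings; the helper names (quote_term, structured, grouped, all_of) are hypothetical and are not part of Harvest, and terms that themselves contain double quotes are not handled.

```python
import re

def quote_term(term: str) -> str:
    """Return a single alphanumeric word unchanged; wrap anything else
    (phrases, regular expressions, non-alphanumeric characters) in
    double quotes, as the query syntax requires."""
    if re.fullmatch(r"[A-Za-z0-9]+", term):
        return term
    return '"' + term + '"'

def structured(tag_name: str, value: str) -> str:
    """Build a structured query of the form  tag-name : value."""
    return f"{tag_name} : {quote_term(value)}"

def grouped(subquery: str) -> str:
    """Parenthesize a sub-query so it is evaluated as a unit."""
    return f"({subquery})"

def all_of(*subqueries: str) -> str:
    """Combine sub-queries with the Boolean operator AND."""
    return " AND ".join(subqueries)

query = all_of(quote_term("Arizona desert"),
               grouped(structured("Title", "windsurfing")))
print(query)
```

The printed string is the last example query shown above, and can be typed into the query form as-is.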
Some types of regular expressions are supported by Glimpse. A regular expression search can be much slower than other searches. The following is a partial list of possible patterns. (For more details see the Glimpse documentation.)
Regular expressions are currently limited to approximately 30 characters, not including meta characters. Regular expressions will generally not cross word boundaries (because only words are stored in the index). So, for example, "lin.*ing" will find ``linking'' or ``flinching,'' but not ``linear programming.''
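The word-boundary behaviour can be illustrated with a short Python sketch. It assumes that the index holds individual words and that the pattern is tried against each word separately; it uses Python's re module, whose syntax is similar to but not the same as Glimpse's, so it demonstrates the principle rather than Glimpse itself. The function name is hypothetical.

```python
import re

def matching_words(pattern: str, text: str) -> list:
    """Split the text into words (as a word index would) and return
    the words in which the pattern matches somewhere."""
    words = re.findall(r"\w+", text)
    regex = re.compile(pattern)
    return [word for word in words if regex.search(word)]

sample = "linking flinching linear programming"
print(matching_words("lin.*ing", sample))
```

Because ``linear'' and ``programming'' are tested as separate words, neither one matches on its own, even though the phrase as a whole would match if it were searched as a single string.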
The query page may have the following checkboxes, which allow some control over the query specification.
By selecting this checkbox the query will become case insensitive (lower case and upper case letters don't differ). Otherwise, the query will be case sensitive. The default is case insensitive.
By selecting this checkbox, keywords will match on word boundaries. Otherwise, a keyword will match part of a word (or phrase). For example, "network" will match ``networking'', "sensitive" will match ``insensitive'', and "Arizona desert" will match ``Arizona desertness''. The default is to match keywords on word boundaries.
Glimpse allows the search to contain a number of errors. An error is either a deletion, insertion, or substitution of a single character. The Best Match option will find the match(es) with the fewest errors. The default is 0 (zero) errors.
Note: The previous three options do not apply to attribute names. Attribute names are always case insensitive and allow no errors.
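The error model described above (single-character deletions, insertions, and substitutions) corresponds to the classic edit distance. The following Python sketch computes that distance between two words and picks the candidates with the fewest errors. It only illustrates how errors are counted; it is not Glimpse's search algorithm, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Number of single-character deletions, insertions, and
    substitutions needed to turn a into b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute (or keep)
        prev = cur
    return prev[-1]

def best_matches(keyword: str, words: list) -> tuple:
    """Return (fewest errors, words reaching that number).
    Assumes the list of words is not empty."""
    scored = [(edit_distance(keyword, word), word) for word in words]
    fewest = min(errors for errors, _ in scored)
    return fewest, [word for errors, word in scored if errors == fewest]

print(best_matches("windserfing", ["windsurfing", "windsurfer"]))
```

With the default of zero errors only exact matches qualify; raising the number of allowed errors admits words at a correspondingly larger distance.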
Harvest allows you to filter the results of a query by any query term using any attribute defined in the List of common SOIF attribute names. This is done by defining filter parameters in the query form. It is possible to define more than one filter parameter; they will be concatenated by Boolean AND. Filter parameters consist of two parts, separated by the pipe symbol ``|''. The first part is a query expression which is attached to the user query using AND before sending the request to the broker. The optional second part is an HTML text that is displayed on the results page, to give the user some information on the applied filter.
Example:
<SELECT NAME="filter">
<OPTION VALUE=''>No Filter
<OPTION VALUE='uri: "xyz\.edu"|Search only xyz.edu'>Search xyz.edu only
<OPTION VALUE='type: html|HTML documents only'>Search HTML documents only
</SELECT>
The first option returns an unfiltered output. The second option returns only pages with ``xyz.edu'' in their URL. The third option returns only HTML documents. See the advanced search page of the broker for more examples.
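How a filter parameter is split and attached to the user query can be sketched as follows. This is a Python illustration of the rules described above, not the code used by search.cgi; in particular, the parentheses placed around each part are an assumption, and the function name is hypothetical.

```python
def apply_filters(user_query: str, filters: list) -> tuple:
    """Attach each filter expression to the query with AND and collect
    the optional display texts for the results page."""
    query = user_query
    notes = []
    for value in filters:
        if not value:
            continue  # an empty value means "No Filter"
        expression, _, display_text = value.partition("|")
        query = f"({query}) AND ({expression})"
        if display_text:
            notes.append(display_text)
    return query, notes

query, notes = apply_filters("lightbulb", ["", "type: html|HTML documents only"])
print(query)
print(notes)
```

Several filter parameters are simply applied one after the other, which gives the Boolean AND concatenation described above.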
The query page may have the following checkboxes, which allow some control over the presentation of the query results.
By selecting this checkbox, the result set presentation will contain the lines of the Content Summary that matched the query. Otherwise, the matched lines will not be displayed. The default is to display the matched lines.
Some objects have short, one-line descriptions associated with them. By selecting this checkbox, the descriptions will be presented. Otherwise, the object descriptions will not be displayed. The default is to display object descriptions.
This checkbox allows you to set whether links to the indexed content summaries are displayed or not. The default is not to display links to indexed content summaries.
It is possible for the Harvest administrator to customize how the Broker query result set is generated, by modifying a configuration file that is interpreted by the search.cgi Perl program at query result time.
search.cgi allows you to customize almost every aspect of its HTML output. The file $HARVEST_HOME/cgi-bin/lib/search.cf contains the default output definitions. Individual brokers can be customized by creating a similar file which overrides the default definitions.
Definitions are enclosed within SGML-like beginning and ending tags. For example:
<HarvestUrl>
http://harvest.sourceforge.net/
</HarvestUrl>
The last newline character is removed from each definition, so that the above becomes the string ``http://harvest.sourceforge.net/''.
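To make the file format concrete, here is a minimal Python sketch that reads such definitions into a dictionary and removes the last newline from each one. search.cgi itself is a Perl program and its parser may differ in details; the regular expression assumes that each opening tag is followed directly by a newline, and the function name is hypothetical.

```python
import re

# <Name>, a newline, the body, then the matching </Name>
DEFINITION = re.compile(r"<(\w+)>\n(.*?)</\1>", re.DOTALL)

def parse_definitions(text: str) -> dict:
    """Return {definition name: body}, with the last newline of each
    body removed."""
    definitions = {}
    for name, body in DEFINITION.findall(text):
        if body.endswith("\n"):
            body = body[:-1]
        definitions[name] = body
    return definitions

defaults = parse_definitions(
    "<HarvestUrl>\nhttp://harvest.sourceforge.net/\n</HarvestUrl>\n"
    "<ObjectNumPrintf>\n(%3d)\n</ObjectNumPrintf>\n"
)
custom = parse_definitions("<ObjectNumPrintf>\n(%2d)\n</ObjectNumPrintf>\n")

# Definitions in a customized file override the defaults.
effective = {**defaults, **custom}
print(effective)
```

The final merge shows the override behaviour: a customized file only needs to contain the definitions that differ from the defaults.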
Variable substitution occurs on every definition before it is output. A number of specific variables are defined by search.cgi which can be used inside a definition. For example:
<BrokerLoad>
Sorry, the Broker at <STRONG>$host, port $port</STRONG>
is currently too heavily loaded to process your request.
Please try again later.<P>
</BrokerLoad>
When this definition is printed out, the variables $host and $port are replaced with the hostname and port of the broker.
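The effect of variable substitution can be reproduced with Python's string.Template class, which happens to use the same $name notation. This is only a sketch of the behaviour; search.cgi performs the substitution in Perl, and the hostname and port used here are made-up example values.

```python
from string import Template

def expand(definition: str, variables: dict) -> str:
    """Replace $name placeholders; unknown names are left untouched."""
    return Template(definition).safe_substitute(variables)

broker_load = (
    "Sorry, the Broker at <STRONG>$host, port $port</STRONG>\n"
    "is currently too heavily loaded to process your request.\n"
    "Please try again later.<P>"
)

# Example values only; the real ones come from the query form and Brokers.cf.
print(expand(broker_load, {"host": "broker.example.com", "port": 8501}))
```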
The following variables are defined as soon as the query string is processed. They can be used before the broker returns any results.
$maxresult The maximum number of matched lines to be returned
$host The broker hostname
$port The broker port
$query The query string entered by the user
$bquery The whole query string sent to the broker
These variables are defined for each matched object returned by the broker.
$objectnum The number of the returned object
$desc The description attribute of the matched object
$opaque ALL the matched lines from the matched object
$url The original URL of the matched object
$A The access method of $url (e.g.: http)
$H The hostname (including port) from $url
$P The path part of $url
$D The directory part of $P
$F The filename part of $P
$cs_url The URL of the content summary in the broker database
$cs_a Access part of $cs_url
$cs_h Hostname part of $cs_url
$cs_p Path part of $cs_url
$cs_d Directory part of $cs_p
$cs_f Filename part of $cs_p
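The relationship between $url and the variables derived from it can be sketched in Python with the standard urllib.parse module. The exact form search.cgi gives to the directory part (for example, whether it ends in a slash) is not stated here, so treat the values of D below as an assumption; the example URL is made up.

```python
import posixpath
from urllib.parse import urlsplit

def url_variables(url: str) -> dict:
    """Split a URL into parts named after $A, $H, $P, $D, and $F."""
    parts = urlsplit(url)
    directory, filename = posixpath.split(parts.path)
    return {
        "A": parts.scheme,   # access method, e.g. http
        "H": parts.netloc,   # hostname, including the port if present
        "P": parts.path,     # path part of the URL
        "D": directory,      # directory part of the path
        "F": filename,       # filename part of the path
    }

print(url_variables("http://www.example.com:8080/docs/manual/index.html"))
```

The same decomposition applied to $cs_url gives the $cs_a, $cs_h, $cs_p, $cs_d, and $cs_f variables.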
Below is a partial list of definitions. A complete list can be found in the search.cf file. Only definitions likely to be customized are described here.
Timeout value for search.cgi. If the broker doesn't respond within this time, search.cgi will exit.
The first part of the result page. Should probably contain the HTML <TITLE> element and the user query string.
The last part of the result page. The default has URL references to the broker home page and the Harvest project home page.
This is output just before looping over all the matched objects.
This is output just after ending the loop over matched objects.
This definition prints out a matched object. It should probably include the variables $url, $cs_url, $desc, and $opaque.
Printed between <ResultSetEnd> and <ResultTrailer> if the query was successful. Should probably include a count of matched objects and/or matched lines.
Similar to <EndBrokerResults> but prints if the broker returns an error in response to the query.
A printf format string for the object number ($objectnum).
Prints a warning message if the result set was truncated at the maximum number of matched lines.
The following definitions are somewhat different because they are evaluated as Perl instructions rather than strings.
Evaluated for every matched line returned by the broker. Can be used to indent matched lines or to remove the leading ``Matched line'' and attribute name strings.
Evaluated near the beginning of the search.cgi program. Can be used to set up special variables or read data files.
Evaluated for each object just before <PrintObject> is called.
Evaluated for each SOIF attribute requested for matched objects (see Section Displaying SOIF attributes in results). $att is set to the attribute name, and $val is set to the attribute value.
The following definitions demonstrate how to change the search.cgi output. The <PerObjectFunction> ensures that the description is not empty. It also prepends the string ``matched lines:'' before any matched lines. The <PrintObject> specification prints the object number, description, and indexing data all on the first line. The description is wrapped in HTML anchor tags so that it is a link to the object originally gathered. The words ``indexing data'' are a link to the displaySOIF program, which will format the content summary for HTML browsers. The object number is formatted as a number in parentheses such that the whole thing takes up four spaces.
The <MatchedLineSub> definition includes four substitution expressions. The first removes the words ``Matched line:'' from the beginning of each matched line. The second removes SOIF attributes of the form ``partial-text{43}:'' from the beginning of a line. The third displays the attribute names (e.g. partial-text#) in italics. The last expression indents each line by five spaces to align it with the description line. The definition for <EndBrokerResults> slightly modifies the report of how many objects were matched.
# Demo to show some of the customization features for the Harvest output
# More information can be found in the manual at:
# http://harvest.sourceforge.net/harvest/doc/html/manual.html
# The PerObjectFunction is Perl code evaluated for every hit
<PerObjectFunction>
    # Create description
    # Is the description provided by Harvest very short (e.g. missing <TITLE>)?
    if (length($desc) < 5) {
        # Yes: use filename ($F) instead
        $description = "<I>File:</I> $F";
    } else {
        # No: use description provided by Harvest
        $description = $desc;
    }
    # Format matched lines ("opaque data") if data is present
    if ($opaque ne '') {
        $opaque = "<strong>matched lines:</strong><BR>$opaque";
    }
</PerObjectFunction>
# PrintObject defines the appearance of hits
<PrintObject>
$objectnum <A HREF="$url"><STRONG>$description</STRONG></A> \
[<A HREF="$cs_a://$cs_h/Harvest/cgi-bin/displaySOIF.cgi?object=$cs_p">\
indexing data</A>]
<pre>
$opaque
</pre>\n
</PrintObject>
# Format the appearance of the hit number
<ObjectNumPrintf>
(%2d)
</ObjectNumPrintf>
# Format the appearance of every matched line
<MatchedLineSub>
s/^Matched line: *//; # Remove "Matched line:"
s/^([\w-]+# )[\w-]+{\d+}:\t/\1/; # Remove SOIF attributes of the form "partial-text{43}:"
s/^([\w-]+#)/<I>\1<\/I>/; # Format attribute names as italics
s/^.*/     $&/; # Add five spaces to indent text
</MatchedLineSub>
# Modifies the report of how many objects were matched
<EndBrokerResults>
<STRONG>Found $nopaquelines matched lines, $nobjects objects.</STRONG>
<P>\n
</EndBrokerResults>
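To see what the <ObjectNumPrintf> and <MatchedLineSub> definitions above do to their input, the following Python sketch ports the printf format and the four substitutions to Python's re module. The sample matched line is hypothetical (the exact text the broker returns may differ), and the indentation uses the five spaces mentioned in the description above.

```python
import re

def format_object_number(objectnum: int) -> str:
    """Same printf format as in <ObjectNumPrintf>: four characters wide."""
    return "(%2d)" % objectnum

def format_matched_line(line: str) -> str:
    """Python port of the four <MatchedLineSub> substitutions."""
    line = re.sub(r"^Matched line: *", "", line)                        # 1: drop the prefix
    line = re.sub(r"^([\w-]+# )[\w-]+\{\d+\}:\t", r"\1", line)          # 2: drop "name{size}:"
    line = re.sub(r"^([\w-]+#)", r"<I>\1</I>", line)                    # 3: italic attribute name
    return "     " + line                                               # 4: indent by five spaces

# Hypothetical matched line, for illustration only.
sample = "Matched line: partial-text# partial-text{43}:\tArizona desert"
print(format_object_number(7))
print(format_matched_line(sample))
```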
The search.cgi configuration files are kept in $HARVEST_HOME/cgi-bin/lib. The name of a customized file is listed in the query.html form, and passed as an option to the search.cgi program.
The simplest way to specify the customized file is by placing an <INPUT> tag in the HTML form:
<INPUT TYPE="hidden" NAME="brokerqueryconfig" VALUE="custom.cf">
Another way is to allow users to select from different customizations with a <SELECT> list:
<SELECT NAME="brokerqueryconfig">
<OPTION VALUE=""> Default
<OPTION VALUE="custom1.cf"> Customized
<OPTION VALUE="custom2.cf" SELECTED> Highly Customized
</SELECT>
It is possible to request SOIF attributes from the HTML query form. A simple approach is to include a select list in the query form:
<SELECT MULTIPLE NAME="attribute">
<OPTION VALUE="title">
<OPTION VALUE="author">
<OPTION VALUE="date">
<OPTION VALUE="subject">
</SELECT>
In this manner, the user may control which attributes get displayed. The layout of these attributes when the results are displayed in HTML is controlled by the <FormatAttribute> specification in the search.cf file described in Section The search.cf configuration file.
To allow Web browsers to easily interface with the Broker, we implemented a World Wide Web interface to the Broker's query manager and administrative interfaces. This WWW interface, which includes several HTML files and a few programs that use the Common Gateway Interface (CGI), consists of the following:
Users go through the following steps when using a Broker to locate information:
To provide a WWW-queryable interface, the Broker needs to run in conjunction with an HTTP server. Section Additional installation for the Harvest Broker describes how to configure your HTTP server to work with Harvest.
You can run the Broker on a different machine than your HTTP server runs on, but if you want users to be able to view the Broker's content summaries then the Broker's files will need to be accessible to your HTTP server. You can NFS mount those files or manually copy them over. You'll also need to change the Brokers.cf file to point to the host that is running the Broker.
CreateBroker creates some HTML files to provide GUIs to the user:
Contains the GUI for the query interface. CreateBroker will install different query.html files for Glimpse, Swish, and WAIS, since each subsystem requires different defaults and supports different functionality (e.g., WAIS doesn't support approximate matching like Glimpse). This is also the ``home page'' for the Broker and a link to this page is included at the bottom of all query results.
Contains the GUI for the administrative interface. This file is installed into the admin directory of the Broker.
Contains the hostname and port information for the supported brokers.
This file is installed into the $HARVEST_HOME/brokers directory.
The query.html file uses the value of the ``broker'' FORM tag to pass the name of the broker to search.cgi, which in turn retrieves the host and port information from Brokers.cf.
When you install the WWW interface (see Section The Broker), a few programs are installed into your HTTP server's /Harvest/cgi-bin directory:
search.cgi
This program takes the submitted query from query.html, and sends it to the specified Broker. It then retrieves the query results from the Broker, formats them in HTML, and sends the result set in HTML to the user.
displaySOIF.cgi
This program displays the content summaries from the Broker.
BrokerAdmin.pl.cgi
This program will take the submitted administrative command from admin.html and send it to the appropriate Broker. It retrieves the result of the command from the Broker and displays it to the user.
The WWW interface to the Broker includes a few help files written in HTML. These files are installed on your HTTP server in the /Harvest/brokers directory when you install the broker (see Section The Broker):
Provides a tutorial on constructing Broker queries, and on using the query.html forms. query.html has a link to this help page.
Provides a tutorial on submitting Broker administrative commands using the admin.html form. admin.html has a link to this help page.
Provides a brief description of SOIF.
Administrators have two basic ways of managing a Broker: through the broker.conf and Collection.conf configuration files, and through the interactive administrative interface. The interactive interface controls various facilities and operating parameters within the Broker. We provide an HTML interface page for these administrative commands. See Section Collector interface description: Collection.conf for additional information on the Broker administrative and collector interfaces.
The broker.conf file is a list of variable names and their values, which consists of information about the Broker (such as the directory in which it lives) and the port on which it runs. The Collection.conf file (see Section Collector interface description: Collection.conf for an example) is a list of collection points from which the Broker collects its indexing information. The CreateBroker program automatically generates both of these configuration files. You can manually edit these files if needed.
The CreateBroker program also creates the admin.html file, which is the WWW interface to the Broker's administrative commands. Note that all administrative commands require a password as defined in broker.conf.
Note: Changes to the Broker configuration are not saved when the Broker is restarted. Permanent changes to the Broker configuration should be made by manually editing the broker.conf file.
The administrative interface created by CreateBroker has the following window fields:
Command Select an administrative command. See below for a description of the commands.
Parameters Specify parameters for those commands that need them.
Password The administrative password.
Broker Host The host where the broker is running.
Broker Port The port where the broker is listening.
The administrative interface created by CreateBroker supports the following commands:
Add object(s) to the Broker. The parameter is a list of filenames that contain SOIF objects to be added to the Broker.
Flush all accumulated log information and close the current log file. Causes the Broker to stop logging. No parameters.
Performs garbage collection on the Registry file. No parameters.
Deletes any object from the Broker whose Time-to-Live has expired. No parameters.
Deletes any object(s) that matches the given query. The parameter is a query with the same syntax as user queries. Query flags are currently unsupported.
Deletes the object(s) identified by the given OID numbers. The parameter is a list of OID numbers. The OID numbers can be obtained by using the dumpregistry command.
Disables logging information about a particular type of event. The parameter is an event type. See Enable log type for a list of events.
Enables logging information about a particular type of event. The parameter is the name of an event type. Currently, event types are limited to the following:
Update Log updated objects.
Delete Log deleted objects.
Refresh Log refreshed objects.
Query Log user queries.
Query-Return Log objects returned from a query.
Cleaned Log objects removed by the cleaner.
Collection Log collection events.
Admin Log administrative events.
Admin-Return Log the results of administrative events.
Bulk-Transfer Log bulk transfer events.
Bulk-Return Log objects sent by bulk transfers.
Cleaner-On Log cleaning events.
Compressing-Registry Log registry compression events.
All Log all events.
Flush all accumulated log information to the current log file. No parameters.
Generates some basic statistics about the Broker object database. No parameters.
Index only the objects that have been added recently. No parameters.
Index the entire object database. No parameters.
Open a new log file. If the file does not exist, create a new one. The parameter is the name (relative to the broker) of a file to use for logging.
Force the broker to reread the Registry and reindex the corpus. This does not actually kill the broker process. No parameters.
Rotates the current log file to LOG.YYYYMMDD. Opens a new log file. No parameters.
Sets the value of a broker configuration variable. Takes two parameters, the name of a configuration variable and the new value for the variable. The configuration variables that can be set are those that occur in the broker.conf file. The change is valid only until the broker process dies.
Cleanly shut down the Broker. No parameters.
Perform collections. No parameters.
Occasionally a broker may end up with multiple summaries for individual URLs. This can happen when the Gatherer changes its description, hostname, or port number. Use this command to search the broker for duplicated URLs. When two objects with the same URL are found, the object with the least-recent timestamp is removed.
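The rule used by the last command (keep the most recent object for each URL) can be sketched in Python as follows. The dictionary keys "oid", "url", and "timestamp" are hypothetical stand-ins for the Broker's registry data; this illustrates the rule only and is not the Broker's implementation.

```python
def remove_older_duplicates(objects: list) -> list:
    """For each URL keep only the object with the most recent
    timestamp; all older objects with the same URL are dropped."""
    newest = {}
    for obj in objects:
        kept = newest.get(obj["url"])
        if kept is None or obj["timestamp"] > kept["timestamp"]:
            newest[obj["url"]] = obj
    return list(newest.values())

registry = [
    {"oid": 1, "url": "http://www.example.com/a.html", "timestamp": 100},
    {"oid": 2, "url": "http://www.example.com/a.html", "timestamp": 200},
    {"oid": 3, "url": "http://www.example.com/b.html", "timestamp": 50},
]
print([obj["oid"] for obj in remove_older_duplicates(registry)])
```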
If you build a Broker and then decide not to index some of that data (e.g., you decide it would make sense to split it into two different Brokers, each targeted to a different community), you need to change the Gatherer's configuration file, rerun the Gatherer, and then let the old objects time out in the Broker (since the Broker and Gatherer maintain separate databases). If you want to clean out the Broker's data sooner than that you can use the Broker's administrative interface in one of three ways:
% mv objects objects.old
% rm -rf objects.old &
% broker ./admin/broker.conf -new
After removing objects, you should use the Index corpus command.
It is possible to perform administrative functions by using the brkclient program from the command-line and shell scripts.
For example, to force a collection, run:
% brkclient localhost 8501 '#ADMIN #Password secret #collection'
See your broker's raw admin.html file for a complete list of administrative commands.
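If you prefer to drive brkclient from a script, the command line above can be assembled programmatically. The Python sketch below only builds the argument list for a command without parameters, following the single example shown above; the function name is hypothetical, and the format of commands that take parameters is not covered here.

```python
import shlex

def admin_argv(host: str, port: int, password: str, command: str) -> list:
    """Build the brkclient argument list for a parameterless
    administrative command, following the example above."""
    request = f"#ADMIN #Password {password} #{command}"
    return ["brkclient", host, str(port), request]

argv = admin_argv("localhost", 8501, "secret", "collection")
print(shlex.join(argv))   # the equivalent shell command line
```

The list can be passed to subprocess.run(argv) on a host where brkclient is installed. Passing a list rather than a shell string means the ``#'' characters need no extra quoting.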
The Glimpse indexing system can be tuned in a variety of ways to suit your particular needs. Probably the most noteworthy parameter is indexing granularity, for which Glimpse provides three options: a tiny index (2-3% of the total size of all files -- your mileage may vary), a small index (7-8%), and a medium-size index (20-30%). Search times are better with larger indexes. By changing the GlimpseIndex-Option in your Broker's broker.conf file, you can tune Glimpse to use one of these three indexing granularity options. By default, GlimpseIndex-Option builds a medium-size index using the glimpseindex program.
Note also that with Glimpse it is much faster to search with ``show matched lines'' turned off in the Broker query page.
Glimpse uses a ``stop list'' to avoid indexing very common words. This list is not fixed, but rather computed as the index is built. For a medium-size index, the default is to put any word that appears at least 500 times per Mbyte (on the average) in the stop-list. For a small-size index, the default is words that appear in at least 80% of all files (unless there are fewer than 256 files, in which case there is no stop-list). Both defaults can be changed using the -S option, which should be followed by the new number (average per Mbyte when -b indexing is used, or % of files when -o indexing is used). Tiny-size indexes do not maintain a stop-list (their effect is minimal).
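The two default stop-list rules can be written down as a short Python sketch. It is an illustration of the thresholds described above, not Glimpse's code: splitting on whitespace, lower-casing, and taking one Mbyte as 2^20 bytes are all assumptions, and the function names are hypothetical.

```python
from collections import Counter

MBYTE = 2 ** 20  # assumption: one Mbyte is taken as 2^20 bytes

def stop_list_medium(files: list, per_mbyte: int = 500) -> set:
    """Medium-size index rule: words that appear at least per_mbyte
    times per Mbyte of text, on average."""
    mbytes = sum(len(text.encode()) for text in files) / MBYTE
    counts = Counter(word for text in files for word in text.lower().split())
    return {word for word, n in counts.items() if n >= per_mbyte * mbytes}

def stop_list_small(files: list, percent: int = 80, min_files: int = 256) -> set:
    """Small-size index rule: words that appear in at least percent %
    of all files; no stop-list when there are fewer than min_files files."""
    if len(files) < min_files:
        return set()
    in_files = Counter(word for text in files for word in set(text.lower().split()))
    return {word for word, n in in_files.items() if 100 * n >= percent * len(files)}

corpus = ["the cat", "the dog"] * 128   # 256 small files
print(stop_list_small(corpus))
```

The per_mbyte and percent arguments play the role of the number given after the -S option.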
glimpseindex includes a number of other options that may be of interest. You can find out more about these options (and more about Glimpse in general) in the Glimpse documentation. If you'd like to change how the Broker invokes the glimpseindex program, then edit the src/broker/Glimpse/index.c file from the Harvest source distribution.
The Glimpse system comes with an auxiliary server called glimpseserver, which allows indexes to be read into a process and kept in memory. This avoids the added cost of reading the index and starting a large process for each search. glimpseserver is automatically started each time you run the Broker, or reindex the Broker's corpus. If you do not want to run glimpseserver, then set GlimpseServer-Host to ``false'' in your broker.conf.
By default, Harvest uses the Glimpse index/search subsystem. However, Harvest defines a flexible indexing interface, to allow Broker administrators to use different index/search subsystems to accommodate domain-specific requirements. For example, it might be useful to provide a relational database back-end.
At present we distribute code to support an interface to both the free and the commercial WAIS index/search engines, Glimpse, and Swish.
Below we discuss how to use other index/search engines instead of Glimpse in the Broker, and provide some brief discussion of how to integrate a new index/search engine into the Broker.
Harvest includes support for using Swish as the indexing engine with the Broker. Swish is a nice alternative to Glimpse if you need faster search support and are willing to lose the more powerful query features. It is also an alternative if Glimpse's copyright status is a problem for you.
To use Swish with an existing Broker, you need to change the Indexer-Type variable in broker.conf to ``Swish''.
You can also specify that you want to use Swish for a Broker when you use the RunHarvest command, by running: RunHarvest -swish.
Support for using WAIS (both freeWAIS and WAIS Inc.'s index/search engine) as the Broker's indexing and search subsystem is included in the Harvest distribution. WAIS is a nice alternative to Glimpse if you need faster search support and are willing to lose the more powerful query features.
To use WAIS with an existing Broker, you need to change the Indexer-Type variable in broker.conf to ``WAIS''; you can choose among the WAIS variants by setting the WAIS-Flavor variable in broker.conf to ``Commercial-WAIS'', ``freeWAIS'', or ``WAIS''. Otherwise, CreateBroker will ask you if you want to use WAIS, and where the WAIS programs (waisindex, waissearch, waisserver, and with the commercial version of WAIS waisparse) are located. When you run the Broker, a WAIS server will be started automatically after the index is built.
You can also specify that you want to use WAIS for a Broker when you use the RunHarvest command, by running: RunHarvest -wais.
The Broker retrieves indexing information from Gatherers or other Brokers through its Collector interface. A list of collection points is specified in the admin/Collection.conf configuration file. This file contains a collection point on each line, with four fields. The first field is the host of the remote Gatherer or Broker, the second field is the port number on that host, the third field is the collection type, and the fourth field is the query filter or -- if there is no filter.
The Broker supports various types of collections as described below:
Type    Remote Process  Description                     Compression?
--------------------------------------------------------------------
0       Gatherer        Full collection each time       No
1       Gatherer        Incremental collections         No
2       Gatherer        Full collection each time       Yes
3       Gatherer        Incremental collections         Yes
4       Broker          Full collection each time       No
5       Broker          Incremental collections         No
6       Broker          Collection based on a query     No
7       Broker          Incremental based on a query    No
The query filter specification for collection types 6 and 7 contains two parts: the --QUERY keywords portion and an optional --FLAGS flags portion. The --QUERY portion is passed on to the Broker as the keywords for the query (the keywords can be any Boolean and/or structured query); the --FLAGS portion is passed on to the Broker as the indexer-specific flags to the query. The following table shows the valid indexer-specific flags for the supported indexers:
Indexer         Flag                            Description
-----------------------------------------------------------------------------
All:            #desc                           Show Description Lines
Glimpse:        #index case insensitive         Case Insensitive
                #index case sensitive           Case sensitive
                #index error number             Allow "number" errors
                #index matchword                Matches on word boundaries
                #index maxresult number         Allow max of "number" results
                #opaque                         Show matched lines
Wais:           #index maxresult number         Allow max of "number" results
                #opaque                         Show scores and rankings
The following is an example Collection.conf, which collects information from 2 Gatherers (one compressed incrementals and the other uncompressed full transfers), and collects information from 3 Brokers (one incrementally based on a timestamp, and the others using query filters):
gatherer-host1.foo.com 8500 3 --
gatherer-host2.foo.com 8500 0 --
broker-host1.foo.com 8501 5 --
broker-host2.foo.com 8501 6 --QUERY (URL : document) AND gnu
broker-host3.foo.com 8501 7 --QUERY Harvest --FLAGS #index case sensitive
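The four-field layout of a Collection.conf line can be made explicit with a small Python parser. This sketch is based only on the description and the example above; it is not the Broker's own parser, it does not handle blank lines or any other syntax the file may allow, and the names CollectionPoint and parse_collection_line are hypothetical.

```python
from typing import NamedTuple, Optional

class CollectionPoint(NamedTuple):
    host: str
    port: int
    collection_type: int          # 0-7, see the table of collection types
    query: Optional[str]          # keywords after --QUERY, if any
    flags: Optional[str]          # indexer-specific flags after --FLAGS, if any

def parse_collection_line(line: str) -> CollectionPoint:
    """Split one line into host, port, type, and the query filter."""
    host, port, ctype, query_filter = line.split(None, 3)
    query = flags = None
    if query_filter.startswith("--QUERY"):
        rest = query_filter[len("--QUERY"):]
        query_part, separator, flags_part = rest.partition("--FLAGS")
        query = query_part.strip()
        flags = flags_part.strip() if separator else None
    return CollectionPoint(host, int(port), int(ctype), query, flags)

print(parse_collection_line(
    "broker-host3.foo.com 8501 7 --QUERY Harvest --FLAGS #index case sensitive"))
```

A line whose fourth field is just -- yields a collection point with neither a query nor flags.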
The Broker is running but always returns empty query results.
Look at the log messages in the broker.out file in the Broker's directory for error messages. If your Broker didn't index the data, use the administrative interface to force the Broker to build the index (see Section Administrating a Broker).
When I query my Broker, I get a "500 Server Error".
Generally, the ``500'' errors are related to a CGI program not working correctly or a misconfigured httpd server. Make sure that the userid running the HTTP server has access to the Harvest cgi-bin directory and the Perl include files in $HARVEST_HOME/lib. Refer to Section Additional installation for the Harvest Broker for further details.
I see duplicate documents in my Broker.
The Broker performs duplicate elimination based on a combination of MD5 checksums and Gatherer-Host, Name, Version. Therefore, you can end up with duplicate documents if your Broker collects from more than one Gatherer, each of which gathers from (a subset of) the same URLs. (As an aside, the reason for this notion of duplicate elimination is to allow a single Broker to contain several different SOIF objects for the same URL, but summarized in different ways.)
Two solutions to the problem are:
search.cgi program by doing a string comparison of the URLs.
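The idea of filtering duplicates by comparing URLs as strings can be sketched as follows. This is a standalone Python illustration, not the search.cgi code; the function name drop_duplicate_urls and the result records with a "url" key are assumptions made for the example.

```python
def drop_duplicate_urls(results):
    """Keep the first result seen for each URL (exact string comparison).

    `results` is assumed to be a list of dicts with a "url" key; this
    record shape is hypothetical and not taken from Harvest.
    """
    seen = set()
    unique = []
    for result in results:
        url = result["url"]
        if url not in seen:
            seen.add(url)
            unique.append(result)
    return unique
```

Note that a filter of this kind also hides objects that were deliberately summarized in different ways for the same URL, so it is only appropriate when that feature is not needed.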
The Broker takes a long time and does not answer queries.
Some queries are quite expensive, because they involve a great deal of I/O. For this reason we modified the Broker so that if a query takes longer than 5 minutes, the query process is killed. The best solution is to use a less expensive query, for example by using less common keywords.
Some of the query options (such as structured or case sensitive queries) aren't working.
This usually means you are using an index/search engine that does not support structured queries (like the current Harvest support for commercial WAIS). If you are setting up your own Broker (rather than using someone else's Broker), see Section Using different index/search engines with the Broker for details on how to switch to other index/search engines. Or, it could be that your search.cgi program is an old version and should be updated.
I get syntax errors when I specify queries.
Usually this means you did not use double quotes where needed. See Section Querying a Broker.
When I submit a query, I get an answer faster than I can believe it takes to perform the query, and the answer contains garbage data.
This probably indicates that your httpd is misconfigured. A common case is not putting the 'ScriptAlias' before the 'Alias' in your conf/httpd.conf file, when running the Apache httpd. See Section Additional installation for the Harvest Broker.
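A quick way to check the directive order is to scan the configuration file. The Python sketch below is a rough heuristic under stated assumptions: it only reports whether any Alias line precedes the first ScriptAlias line, ignores comments, and does not examine the paths involved. The function name alias_before_scriptalias is hypothetical.

```python
def alias_before_scriptalias(conf_text):
    """Return True if an Alias directive appears before the first ScriptAlias.

    Rough heuristic only: directive names are compared case-insensitively
    and the aliased paths themselves are not inspected.
    """
    seen_alias = False
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        directive = line.split()[0].lower()
        if directive == "scriptalias":
            return seen_alias
        if directive == "alias":
            seen_alias = True
    return False  # no ScriptAlias found at all
```

To use it, read your conf/httpd.conf into a string and pass it to the function; a True result suggests the ordering problem described above is worth checking by hand.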
When I make changes to the Broker configuration via the administration interface, they are lost after the Broker is restarted.
The Broker administration interface does not save changes across sessions. Permanent changes to the Broker configuration should be done through the broker.conf file.
My Broker is running very slowly.
Performance tuning can be complicated, but the most likely problem is that you are running on a machine with insufficient RAM, and paging a lot because the query engine kicks pages out in order to access the needed index and data files. (In UNIX the disk buffer cache competes with program and data pages for memory.)
A simple way to tell is to run "vmstat 5" in one window, and after a couple of lines of output, issue a query from another window. This will print a line of measurements about the virtual memory status of your machine every 5 seconds. In particular, look at the "pi" and "po" columns. If the numbers suddenly jump into the 500-1,000 range after you issue the query, you are paging a lot.
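If you save the vmstat output to a file, the same check can be automated. The following Python sketch assumes the output contains a header row with columns named "pi" and "po"; some vmstat implementations label these columns differently, in which case the header lookup would need adjusting. The function name paging_spikes and the default threshold of 500 (taken from the range mentioned above) are choices made for this example.

```python
def paging_spikes(vmstat_output, threshold=500):
    """Return (pi, po) pairs from vmstat text where either value reaches threshold."""
    columns = None
    spikes = []
    for line in vmstat_output.splitlines():
        fields = line.split()
        if "pi" in fields and "po" in fields:
            # Header row: remember where the pi and po columns are.
            columns = (fields.index("pi"), fields.index("po"))
            continue
        if columns is None or not fields:
            continue
        try:
            pi = int(fields[columns[0]])
            po = int(fields[columns[1]])
        except (IndexError, ValueError):
            continue  # not a data row
        if pi >= threshold or po >= threshold:
            spikes.append((pi, po))
    return spikes
```

An empty result means no sampled interval reached the threshold; any returned pairs correspond to intervals in which the machine was paging heavily.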
Note that paging problems are accentuated by running simultaneous memory-intensive or disk I/O-intensive programs on your machine. Simultaneous queries to a single Broker should not cause a paging problem, because the Broker processes the queries sequentially.
It is best to run Brokers on an otherwise mostly unused machine with at least 128 MB of RAM (or more, if the above "vmstat" experiment indicates you are paging a lot).
One other performance enhancer is to run an httpd-accelerator on your Broker machine, to intercept queries headed for your Broker. While it will not cache the results of queries, it will reduce load on the machine because it provides a very efficient means of returning results in the case of concurrent queries. Without the accelerator, the results are sent back by one search.cgi UNIX process per query, and inefficiently time-sliced by the UNIX kernel. With an accelerator, the search.cgi processes exit quickly and let the accelerator send the results back to the concurrent users. The accelerator will also reduce load for (non-query) retrievals of data from your httpd server.