The Gatherer retrieves information resources using a variety of standard access methods (FTP, Gopher, HTTP, NNTP, and local files), and then summarizes those resources in various type-specific ways to generate structured indexing information. For example, a Gatherer can retrieve a technical report from an FTP archive, and then extract the author, title, and abstract from the paper to summarize the technical report. Harvest Brokers or other search services can then retrieve the indexing information from the Gatherer to use in a searchable index available via a WWW interface.
The Gatherer consists of a number of separate components. The Gatherer program reads a Gatherer configuration file and controls the overall process of enumerating and summarizing data objects. The structured indexing information that the Gatherer collects is represented as a list of attribute-value pairs using the Summary Object Interchange Format (SOIF, see Section The Summary Object Interchange Format (SOIF)). The gatherd daemon serves the Gatherer database to Brokers; it remains running in the background after a gathering session is complete. The stand-alone gather program is a client for the gatherd server. It can be used from the command line for testing, and is used by the Broker. The Gatherer uses a local disk cache to store objects it has retrieved. The disk cache is described in Section The local disk cache.
Even though the gatherd daemon remains in the background, a Gatherer does not automatically update or refresh its summary objects. Each object in a Gatherer has a Time-to-Live value, and objects remain in the database until they expire. See Section Periodic gathering and realtime updates for more information on keeping Gatherer objects up to date.
Several example Gatherers are provided with the Harvest software distribution (see Section Gatherer Examples).
To run a basic Gatherer, you need only list the Uniform Resource Locators (URLs, see RFC1630 and RFC1738) from which it will gather indexing information. This list is specified in the Gatherer configuration file, along with other optional information such as the Gatherer's name and the directory in which it resides (see Section Setting variables in the Gatherer configuration file for details on the optional information). Below is an example Gatherer configuration file:
#
# sample.cf - Sample Gatherer Configuration File
#
Gatherer-Name: My Sample Harvest Gatherer
Gatherer-Port: 8500
Top-Directory: /usr/local/harvest/gatherers/sample
<RootNodes>
# Enter URLs for RootNodes here
http://www.mozilla.org/
http://www.xfree86.org/
</RootNodes>
<LeafNodes>
# Enter URLs for LeafNodes here
http://www.arco.de/~kj/index.html
</LeafNodes>
As shown in the example configuration file, you may classify a URL as a RootNode or a LeafNode. For a LeafNode URL, the Gatherer simply retrieves the URL and processes it. LeafNode URLs are typically files like PostScript papers or compressed ``tar'' distributions. For a RootNode URL, the Gatherer expands it into zero or more LeafNode URLs by recursively enumerating it in an access method-specific way. For FTP or Gopher, the Gatherer performs a recursive directory listing on the FTP or Gopher server to expand the RootNode (typically a directory name). For HTTP, a RootNode URL is expanded by following the embedded HTML links to other URLs. For News, the enumeration returns all the messages in the specified USENET newsgroup.
PLEASE BE CAREFUL when specifying RootNodes as it is possible to specify an enormous amount of work with a single RootNode URL. To help prevent a misconfigured Gatherer from abusing servers or running wildly, by default the Gatherer will only expand a RootNode into 250 LeafNodes, and will only include HTML links that point to documents that reside on the same server as the original RootNode URL. There are several options that allow you to change these limits and otherwise enhance the Gatherer specification. See Section RootNode specifications for details.
The Gatherer is a ``robot'' that collects URLs starting from the URLs specified in RootNodes. It obeys the robots.txt convention and the robots META tag. It is also HTTP Version 1.1 compliant and sends the User-Agent and From request fields to HTTP servers for accountability.
After you have written the Gatherer configuration file, create a directory for
the Gatherer and copy the configuration file there. Then, run the
Gatherer
program with the configuration file as the only command-line
argument, as shown below:
% Gatherer GathName.cf
The Gatherer will generate a database of the content summaries, a log file
(log.gatherer), and an error log file (log.errors). It will
also start the gatherd
daemon which exports the indexing information
automatically to Brokers and other clients. To view the exported indexing
information, you can use the gather
client program, as shown below:
% gather localhost 8500 | more
The gather program accepts a few options and an optional timestamp. The -info option causes the Gatherer to respond only with the Gatherer summary information, which consists of the attributes available in the specified Gatherer's database, the Gatherer's host and name, the range of object update times, and the number of objects. Compression of the data stream is enabled by default, but can be disabled with the -nocompress option. The optional timestamp tells the Gatherer to send only the objects that have changed since the specified timestamp (in seconds since the UNIX ``epoch'' of January 1, 1970).
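For example, assuming the options precede the host and port and the timestamp follows them (the timestamp shown is illustrative):

% gather -info localhost 8500
% gather -nocompress localhost 8500 867715200 | more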
News URLs are somewhat different from the other access protocols because the URL generally does not contain a hostname. The Gatherer retrieves News URLs from an NNTP server, whose name must be placed in the environment variable $NNTPSERVER. It is a good idea to set this variable in your RunGatherer script. If the environment variable is not set, the Gatherer attempts to connect to a host named news at your site.
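For example, your RunGatherer script might include a line like the following (the hostname is illustrative):

NNTPSERVER=news.example.com; export NNTPSERVER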
Remember that the Gatherer's databases persist between runs. Objects remain in the databases until they expire. When experimenting with the Gatherer, it is a good idea to ``clean out'' the databases between runs. This is most easily accomplished by executing this command from the Gatherer directory:
% rm -rf data tmp log.*
The RootNode specification facility described in Section Basic setup provides a basic set of default enumeration actions for RootNodes. Often it is useful to enumerate beyond the default limits, for example, to increase the enumeration limit beyond 250 URLs, or to allow site boundaries to be crossed when enumerating HTML links. It is possible to specify these and other aspects of enumeration, using the following syntax:
<RootNodes>
URL EnumSpec
URL EnumSpec
...
</RootNodes>
where EnumSpec is on a single line (using ``\'' to escape linefeeds), with the following syntax:
URL=URL-Max[,URL-Filter-filename] \
Host=Host-Max[,Host-Filter-filename] \
Access=TypeList \
Delay=Seconds \
Depth=Number \
Enumeration=Enumeration-Program
The EnumSpec modifiers are all optional, and have the following meanings:
URL-Max: The number specified on the right hand side of the ``URL='' expression is the maximum number of LeafNode URLs to generate at all levels of depth, from the current URL. Note that URL-Max is the maximum number of URLs that are generated during the enumeration, and not a limit on how many URLs can pass through the candidate selection phase (see Section Customizing the candidate selection step).
URL-Filter-filename: This is the name of a file containing a set of regular expression filters (see Section RootNode filters) to allow or deny particular LeafNodes in the enumeration. The default filter is $HARVEST_HOME/lib/gatherer/URL-filter-default, which excludes many image and sound files.
Host-Max: The number specified on the right hand side of the ``Host='' expression is the maximum number of hosts that will be touched during the RootNode enumeration. The enumeration actually counts hosts by IP address so that aliased hosts are properly enumerated. Note that this does not work correctly for multi-homed hosts, or for hosts with rotating DNS entries (used by some sites for load balancing heavily accessed servers).
Note: Prior to Harvest Version 1.2 the ``Host=...'' line was called ``Site=...''. We changed the name to ``Host='' because it is more intuitively meaningful (being a host count limit, not a site count limit). For backwards compatibility with older Gatherer configuration files, we will continue to treat ``Site='' as an alias for ``Host=''.
Host-Filter-filename: This is the name of a file containing a set of regular expression filters to allow or deny particular hosts in the enumeration. Each expression can specify both a host name (or IP address) and a port number (in case you have multiple servers running on different ports of the same host and you want to index only one). The syntax is ``hostname:port''.
Access: If the RootNode is an HTTP URL, then you can specify which access methods to follow during enumeration. Valid access method types are: FILE, FTP, Gopher, HTTP, News, Telnet, or WAIS. Use a ``|'' character between type names to allow multiple access methods. For example, ``Access=HTTP|FTP|Gopher'' will follow HTTP, FTP, and Gopher URLs while enumerating an HTTP RootNode URL.
Note: We do not support cross-method enumeration from Gopher, because of the difficulty of ensuring that Gopher pointers do not cross site boundaries. For example, the Gopher URL gopher://powell.cs.colorado.edu:7005/1ftp3aftp.cs.washington.edu40pub/ would get an FTP directory listing of ftp.cs.washington.edu:/pub, even though the host part of the URL is powell.cs.colorado.edu.
Delay: This is the number of seconds to wait between server contacts. It defaults to one second. For example, Delay=3 will make the Gatherer sleep 3 seconds between server contacts.
Depth: This is the maximum number of levels of enumeration that will be followed during gathering. Depth=0 means that there is no limit to the depth of the enumeration. Depth=1 means the specified URL will be retrieved, and all the URLs referenced by the specified URL will be retrieved; and so on for higher Depth values. In other words, the enumeration will follow links up to Depth steps away from the specified URL.
Enumeration: This modifier provides a very flexible way to control a Gatherer. The Enumeration-Program is a filter which reads URLs as input and writes new enumeration parameters on output. See Section Generic Enumeration program description for specific details.
By default, URL-Max is 250, URL-Filter imposes no restrictions, Host-Max is 1, Host-Filter imposes no restrictions, Access is HTTP only, Delay is 1 second, and Depth is 0. There is no way to specify an unlimited value for URL-Max or Host-Max.
Filter files use the standard UNIX regular expression syntax (as defined by the POSIX standard), not the csh ``globbing'' syntax. For example, you would use ``.*abc'' to indicate any string ending with ``abc'', not ``*abc''. A filter file has the following syntax:
Deny regex
Allow regex
The URL-Filter regular expressions are matched only on the URL-path portion of each URL (the scheme, hostname and port are excluded). For example, the following URL-Filter file would allow all URLs except those containing the regular expression ``/gatherers/'':
Deny /gatherers/
Allow .
Another common use of URL-filters is to prevent the Gatherer from travelling ``up'' a directory. Automatically generated HTML pages for HTTP and FTP directories often contain a link for the parent directory ``..''. To keep the gatherer below a specific directory, use a URL-filter file such as:
Allow ^/my/cool/stuff/
Deny .
The Host-Filter regular expressions are matched on the ``hostname:port'' portion of each URL. Because the port is included, you cannot use ``$'' to anchor the end of a hostname. Beginning with version 1.3, IP addresses may be specified in place of hostnames. A class B address such as 128.138.0.0 would be written as ``^128\.138\..*'' in regular expression syntax. For example:
Deny bcn.boulder.co.us:8080
Deny bvsd.k12.co.us
Allow ^128\.138\..*
Deny .
The order of the Allow and Deny entries is important, since the filters are applied sequentially from first to last. So, for example, if you list ``Allow .*'' first, no subsequent Deny expressions will be used, since this Allow filter will allow all entries.
Flexible enumeration can be achieved by giving an Enumeration=Enumeration-Program modifier to a RootNode URL. The Enumeration-Program is a filter which takes URLs on standard input and writes new RootNode URLs on standard output.
The output format differs from the RootNode specification format used in a Gatherer configuration file. Each output line must have nine fields separated by spaces. These fields are:
URL
URL-Max
URL-Filter-filename
Host-Max
Host-Filter-filename
Access
Delay
Depth
Enumeration-Program
These are the same fields as described in section
RootNode specifications.
Values must be given for each field. Use /dev/null to disable
the URL-Filter-filename and Host-Filter-filename. Use /bin/false
to disable the Enumeration-Program.
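For example, an Enumeration-Program might emit a line such as the following (the URL is illustrative; the nine fields correspond, in order, to the list above, with both filters and the nested Enumeration-Program disabled):

http://www.example.org/docs/ 250 /dev/null 1 /dev/null HTTP 1 0 /bin/false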
Below is an example RootNode configuration:
<RootNodes>
(1) http://harvest.cs.colorado.edu/ URL=100,MyFilter
(2) http://www.cs.colorado.edu/ Host=50 Delay=60
(3) gopher://gopher.colorado.edu/ Depth=1
(4) file://powell.cs.colorado.edu/home/hardy/ Depth=2
(5) ftp://ftp.cs.colorado.edu/pub/cs/techreports/ Depth=1
(6) http://harvest.cs.colorado.edu/~hardy/hotlist.html \
Depth=1 Delay=60
(7) http://harvest.cs.colorado.edu/~hardy/ \
Depth=2 Access=HTTP|FTP
</RootNodes>
Each of the above RootNodes uses a different enumeration configuration: (1) generates at most 100 LeafNode URLs, filtered through the expressions in MyFilter; (2) touches at most 50 hosts and waits 60 seconds between server contacts; (3) enumerates the Gopher server one level deep; (4) enumerates the local file tree two levels deep; (5) enumerates the FTP directory one level deep; (6) enumerates the hotlist page one level deep, waiting 60 seconds between server contacts; and (7) enumerates two levels deep, following both HTTP and FTP links.
In addition to using the URL-Filter and Host-Filter files for the RootNode specification mechanism described in Section RootNode specifications, you can prevent documents from being indexed by customizing the stoplist.cf file, described in Section Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps. Since these mechanisms are invoked at different times, they have different effects. The URL-Filter and Host-Filter mechanisms are invoked by the Gatherer's ``RootNode'' enumeration programs. Using these filters as stop lists can prevent unwanted objects from being retrieved across the network, which can dramatically reduce gathering time and network traffic.
The stoplist.cf file is used by the Essence content extraction system (described in Section Extracting data for indexing: The Essence summarizing subsystem) after the objects are retrieved, to select which objects should be content extracted and indexed. This can be useful because Essence provides a more powerful means of rejecting indexing candidates, in which you can customize based not only on file naming conventions but also on file contents (e.g., looking at strings at the beginning of a file or at UNIX ``magic'' numbers), and on more sophisticated file-grouping schemes (e.g., deciding not to extract contents from object code files for which source code is available).
As an example of combining these mechanisms, suppose you want to index the ``.ps'' files linked into your WWW site. You could do this by having a stoplist.cf file that contains ``HTML'', and a RootNode URL-Filter that contains:
Allow \.html
Allow \.ps
Deny .*
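That is, in addition to the URL-Filter above, the lib/stoplist.cf file for this example would contain the line:

HTML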
As a final note, independent of these customizations the Gatherer attempts to avoid retrieving objects where possible, by using a local disk cache of objects, and by using the HTTP ``If-Modified-Since'' request header. The local disk cache is described in Section The local disk cache.
It is possible to generate RootNode or LeafNode URLs automatically from program output. This might be useful when gathering a large number of Usenet newsgroups, for example. The program is specified inside the RootNode or LeafNode section, preceded by a pipe symbol.
<LeafNodes>
|generate-news-urls.sh
</LeafNodes>
The script must output valid URLs, such as
news:comp.unix.voodoo
news:rec.pets.birds
http://www.nlanr.net/
...
In the case of RootNode URLs, enumeration parameters can be given after the program.
<RootNodes>
|my-fave-sites.pl Depth=1 URL=5000,url-filter
</RootNodes>
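A minimal sketch of such a generation script (the newsgroup names are only illustrative) might look like this:

#!/bin/sh
# generate-news-urls.sh - print one news: URL per newsgroup of interest
for group in comp.infosystems.harvest comp.infosystems.www.misc; do
    echo "news:$group"
done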
After the Gatherer retrieves a document, it passes the document through a subsystem called Essence to extract indexing information. Essence allows the Gatherer to collect indexing information easily from a wide variety of information, using different techniques depending on the type of data and the needs of the particular corpus being indexed. In a nutshell, Essence can determine the type of data pointed to by a URL (e.g., PostScript vs. HTML), ``unravel'' presentation nesting formats (such as compressed ``tar'' files), select which types of data to index (e.g., don't index Audio files), and then apply a type-specific extraction algorithm (called a summarizer) to the data to generate a content summary. Users can customize each of these aspects, but often this is not necessary. Harvest is distributed with a ``stock'' set of type recognizers, presentation unnesters, candidate selectors, and summarizers that work well for many applications.
Below we describe the stock summarizer set, the current components distribution, and how users can customize summarizers to change how they operate and add summarizers for new types of data. If you develop a summarizer that is likely to be useful to other users, please notify us via email at lee@arco.de so we may include it in our Harvest distribution.
Type Summarizer Function
--------------------------------------------------------------------
Bibliographic Extract author and titles
Binary Extract meaningful strings and manual page summary
C, CHeader Extract procedure names, included file names, and comments
Dvi Invoke the Text summarizer on extracted ASCII text
FAQ, FullText, README
Extract all words in file
Font Extract comments
HTML Extract anchors, hypertext links, and selected fields
LaTeX Parse selected LaTeX fields (author, title, etc.)
Mail Extract certain header fields
Makefile Extract comments and target names
ManPage Extract synopsis, author, title, etc., based on ``-man'' macros
News Extract certain header fields
Object Extract symbol table
Patch Extract patched file names
Perl Extract procedure names and comments
PostScript Extract text in word processor-specific fashion, and pass
through Text summarizer.
RCS, SCCS Extract revision control summary
RTF Up-convert to HTML and pass through HTML summarizer
SGML Extract fields named in extraction table
ShellScript Extract comments
SourceDistribution
Extract full text of README file and comments from Makefile
and source code files, and summarize any manual pages
SymbolicLink Extract file name, owner, and date created
TeX Invoke the Text summarizer on extracted ASCII text
Text Extract first 100 lines plus first sentence of each
remaining paragraph
Troff Extract author, title, etc., based on ``-man'', ``-ms'',
``-me'' macro packages, or extract section headers and
topic sentences.
Unrecognized Extract file name, owner, and date created.
The table in Section Extracting data for indexing: The Essence summarizing subsystem provides a brief reference for how documents are summarized depending on their type. These actions can be customized, as discussed in Section Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps. Some summarizers are implemented as UNIX programs while others are expressed as regular expressions; see Section Customizing the summarizing step or Section Example 4 for more information about how to write a summarizer.
It is possible to summarize documents that conform to the Standard Generalized Markup Language (SGML), for which you have a Document Type Definition (DTD). The World Wide Web's HyperText Markup Language (HTML) is actually a particular application of SGML, with a corresponding DTD. (In fact, the Harvest HTML summarizer can use the HTML DTD and our SGML summarizing mechanism, which provides various advantages; see Section The SGML-based HTML summarizer.) SGML is being used in an increasingly broad variety of applications, for example as a format for storing data for a number of physical sciences. Because SGML allows documents to contain a good deal of structure, Harvest can summarize SGML documents very effectively.
The SGML summarizer (SGML.sum) uses the sgmls program by James Clark to parse the SGML document. The parser needs both a DTD for the document and a Declaration file that describes the allowed character set. The SGML.sum program uses a table that maps SGML tags to SOIF attributes.
SGML support files can be found in $HARVEST_HOME/lib/gatherer/sgmls-lib/. For example, these are the default pathnames for HTML summarizing using the SGML summarizing mechanism:
$HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/html.dtd
$HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/HTML.decl
$HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/HTML.sum.tbl
The location of the DTD file must be specified in the sgmls catalog ($HARVEST_HOME/lib/gatherer/sgmls-lib/catalog). For example:
DOCTYPE HTML HTML/html.dtd
The SGML.sum program looks for the .decl file in the default location. An alternate pathname can be specified with the -d option to SGML.sum. The summarizer looks for the .sum.tbl file first in the Gatherer's lib directory and then in the default location. Both of these can be overridden with the -t option to SGML.sum.
The translation table provides a simple yet powerful way to specify how an SGML document is to be summarized. There are four ways to map SGML data into SOIF. The first two are concerned with placing the content of an SGML tag into a SOIF attribute.
A simple SGML-to-SOIF mapping looks like this:
<TAG> soif1,soif2,...
This places the content that occurs inside the tag ``TAG'' into the SOIF attributes ``soif1'' and ``soif2''. It is possible to select different SOIF attributes based on SGML attribute values. For example, if ``ATT'' is an attribute of ``TAG'', then it would be written like this:
<TAG,ATT=x> x-stuff
<TAG,ATT=y> y-stuff
<TAG> stuff
The second two mappings place values of SGML attributes into SOIF attributes. To place the value of the ``ATT'' attribute of the ``TAG'' tag into the ``att-stuff'' SOIF attribute you would write:
<TAG:ATT> att-stuff
It is also possible to place the value of an SGML attribute into a SOIF attribute named by a different SOIF attribute:
<TAG:ATT1> $ATT2
When the summarizer encounters an SGML tag that is not listed in the table, its content is passed to the parent tag and becomes a part of the parent's content. To force the content of some tag not to be passed up, specify the SOIF attribute as ``ignore''. To force the content of some tag to be passed to the parent in addition to being placed into a SOIF attribute, list an additional SOIF attribute named ``parent''.
Please see Section The SGML-based HTML summarizer for examples of these mappings.
The sgmls parser can generate an overwhelming volume of error and warning messages. This is especially true for HTML documents found on the Internet, which often do not conform to the strict HTML DTD. By default, errors and warnings are redirected to /dev/null so that they do not clutter the Gatherer's log files. To enable logging of these messages, edit the SGML.sum Perl script and set $syntax_check = 1.
To create an SGML summarizer for a new SGML-tagged data type (call it FOO) with an associated DTD, you need to install the DTD and Declaration file under $HARVEST_HOME/lib/gatherer/sgmls-lib/ and register the DTD in the sgmls catalog, write a FOO.sum.tbl translation table that maps the document's tags to SOIF attributes, configure the type recognition step so that your files are recognized as type FOO, and create a FOO.sum summarizer that invokes the generic SGML summarizer:
#!/bin/sh
exec SGML.sum FOO $*
At this point you can test everything from the command line as follows:
% FOO.sum myfile.foo
Harvest can summarize HTML using the generic SGML summarizer described in Section Summarizing SGML data. The advantage of this approach is that the summarizer is more easily customizable, and fits with the well-conceived SGML model (where you define DTDs for individual document types and build interpretation software to understand DTDs rather than individual document types). The downside is that the summarizer is now pickier about syntax, and many Web documents are not syntactically correct. Because of this pickiness, the default is for the HTML summarizer to run with syntax checking outputs disabled. If your documents are so badly formed that they confuse the parser, this may mean the summarizing process dies unceremoniously. If you find that some of your HTML documents do not get summarized or only get summarized in part, you can turn syntax-checking output on by setting $syntax_check = 1 in $HARVEST_HOME/lib/gatherer/SGML.sum. That will allow you to see which documents are invalid and where.
Note that part of the reason for this problem is that Web browsers do not insist on well-formed documents. So, users can easily create documents that are not completely valid, yet display fine.
Below is the default SGML-to-SOIF table used by the HTML summarizer:
HTML ELEMENT SOIF ATTRIBUTES
------------ -----------------------
<A> keywords,parent
<A:HREF> url-references
<ADDRESS> address
<B> keywords,parent
<BODY> body
<CITE> references
<CODE> ignore
<EM> keywords,parent
<H1> headings
<H2> headings
<H3> headings
<H4> headings
<H5> headings
<H6> headings
<HEAD> head
<I> keywords,parent
<IMG:SRC> images
<META:CONTENT> $NAME
<STRONG> keywords,parent
<TITLE> title
<TT> keywords,parent
<UL> keywords,parent
The pathname to this file is $HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/HTML.sum.tbl.
Individual Gatherers may do customized HTML summarizing by placing a
modified version of this file in the Gatherer lib directory.
Another way to customize is to modify the HTML.sum
script and
add a -t option to the SGML.sum command. For example:
SGML.sum -t $HARVEST_HOME/lib/my-HTML.table HTML $*
In HTML, the document title is written as:
<TITLE>My Home Page</TITLE>
The above translation table will place this in the SOIF summary as:
title{13}: My Home Page
Note that ``keywords,parent'' occurs frequently in the table. For any specially marked text (bold, emphasized, hypertext links, etc.), the words will be copied into the keywords attribute and also left in the content of the parent element. This keeps the body of the text readable by not removing certain words.
Any text that appears inside a pair of CODE tags will not show up in the summary because we specified ``ignore'' as the SOIF attribute.
URLs in HTML anchors are written as:
<A HREF="http://harvest.cs.colorado.edu/">
The specification for <A:HREF> in the above translation table causes this to appear as:
url-references{32}: http://harvest.cs.colorado.edu/
One of the most useful HTML tags is META. This allows the document writer to include arbitrary metadata in an HTML document. A typical usage of the META element is:
<META NAME="author" CONTENT="Joe T. Slacker">
By specifying ``<META:CONTENT> $NAME'' in the translation table, this comes out as:
author{15}: Joe T. Slacker
Using the META tags, HTML authors can easily add a list of keywords to their documents:
<META NAME="keywords" CONTENT="word1 word2">
<META NAME="keywords" CONTENT="word3 word4">
A very terse HTML summarizer could be specified with a table that only puts emphasized words into the keywords attribute:
HTML ELEMENT SOIF ATTRIBUTES
------------ -----------------------
<A> keywords
<B> keywords
<EM> keywords
<H1> keywords
<H2> keywords
<H3> keywords
<I> keywords
<META:CONTENT> $NAME
<STRONG> keywords
<TITLE> title,keywords
<TT> keywords
Conversely, a full-text summarizer can be easily specified with only:
HTML ELEMENT SOIF ATTRIBUTES
------------ -----------------------
<HTML> full-text
<TITLE> title,parent
The Harvest Gatherer's actions are defined by a set of configuration and utility files, and a corresponding set of executable programs referenced by some of the configuration files.
If you want to customize a Gatherer, you should create bin and lib subdirectories in the directory where you are running the Gatherer, and then copy $HARVEST_HOME/lib/gatherer/*.cf and $HARVEST_HOME/lib/gatherer/magic into your lib directory. Then add to your Gatherer configuration file:
Lib-Directory: lib
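For example, using the sample Gatherer directory from Section Basic setup (the pathnames are illustrative), the setup might look like this:

% cd /usr/local/harvest/gatherers/sample
% mkdir bin lib
% cp $HARVEST_HOME/lib/gatherer/*.cf lib
% cp $HARVEST_HOME/lib/gatherer/magic lib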
The details about what each of these files does are described below. The basic contents of a typical Gatherer's directory is as follows (note: some of the file names below can be changed by setting variables in the Gatherer configuration file, as described in Section Setting variables in the Gatherer configuration file):
RunGatherd* bin/ GathName.cf log.errors tmp/
RunGatherer* data/ lib/ log.gatherer
bin:
MyNewType.sum*
data:
All-Templates.gz INFO.soif PRODUCTION.gdbm gatherd.log
INDEX.gdbm MD5.gdbm gatherd.cf
lib:
bycontent.cf byurl.cf quick-sum.cf
byname.cf magic stoplist.cf
tmp:
The RunGatherd and RunGatherer scripts are used to export the Gatherer's database after a machine reboot and to run the Gatherer, respectively. The log.errors and log.gatherer files contain error messages and the output of the Essence typing step, respectively (Essence will be described shortly). The GathName.cf file is the Gatherer's configuration file.
The bin directory contains any summarizers and any other programs needed by the summarizers. If you were to customize the Gatherer by adding a summarizer, you would place those programs in this bin directory; MyNewType.sum is an example.
The data directory contains the Gatherer's database, which gatherd exports. The Gatherer's database consists of the All-Templates.gz, INDEX.gdbm, INFO.soif, MD5.gdbm and PRODUCTION.gdbm files. The gatherd.cf file is used to support access control as described in Section Controlling access to the Gatherer's database. The gatherd.log file is where the gatherd program logs its information.
The lib directory contains the configuration files used by the Gatherer's subsystems, namely Essence. These files are described briefly in the following table:
bycontent.cf Content parsing heuristics for type recognition step
byname.cf File naming heuristics for type recognition step
byurl.cf URL naming heuristics for type recognition step
magic UNIX ``file'' command specifications (matched against
bycontent.cf strings)
quick-sum.cf Attribute extraction rules for the summarizing step
stoplist.cf File types to reject during candidate selection
Essence recognizes types in three ways (in order of precedence): by URL naming heuristics, by file naming heuristics, and by locating identifying data within a file using the UNIX file command. To modify the type recognition step, edit lib/byname.cf to add file naming heuristics, lib/byurl.cf to add URL naming heuristics, or lib/bycontent.cf to add by-content heuristics. The by-content heuristics match the output of the UNIX file command, so you may also need to edit the lib/magic file. See Sections Example 3 and Example 4 for detailed examples of how to customize the type recognition step.
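For example, assuming byname.cf entries consist of a type name in column 1 followed by a filename regular expression (the same format suggested by the FullText example in Section Customizing the summarizing step), a line to recognize ``.foo'' files as the hypothetical FOO type from Section Summarizing SGML data might look like:

FOO     ^.*\.foo$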
The lib/stoplist.cf configuration file contains a list of types that are rejected by Essence. You can add or delete types from lib/stoplist.cf to control the candidate selection step.
To direct Essence to index only certain types, you can list the types to index in lib/allowlist.cf. Then, supply Essence with the --allowlist flag.
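For example, to index only HTML and PostScript documents, lib/allowlist.cf might contain the type names one per line (assuming the same one-type-per-line format as stoplist.cf):

HTML
PostScript

The flag itself can be passed through the Gatherer configuration file:

Essence-Options: --allowlist lib/allowlist.cf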
The file and URL naming heuristics used by the type recognition step (described in Section Customizing the type recognition step) are particularly useful for candidate selection when gathering remote data. They allow the Gatherer to avoid retrieving files that you don't want to index (in contrast, recognizing types by locating identifying data within a file requires that the file be retrieved first). This approach can save quite a bit of network traffic, particularly when used in combination with enumerated RootNode URLs. For example, many sites provide each of their files in both a compressed and uncompressed form. By building a lib/allowlist.cf containing only the Compressed types, you can avoid retrieving the uncompressed versions of the files.
Some types are declared as ``nested'' types. Essence treats these differently from other types, by running a presentation unnesting algorithm or ``Exploder'' on the data rather than a Summarizer. At present Essence can handle files nested in formats such as compressed (compress, gzip, bzip2) data, ``tar'' archives, and uuencoded files, using the external programs mentioned below.
To customize the presentation unnesting step you can modify the Essence source file src/gatherer/essence/unnest.c. This file lists the available presentation encodings, and also specifies the unnesting algorithm. Typically, an external program is used to unravel a file into one or more component files (e.g. bzip2, gunzip, uudecode, and tar).
An Exploder may also be used to explode a file into a stream of SOIF objects. An Exploder program takes a URL as its first command-line argument and a file containing the data to use as its second, and then generates one or more SOIF objects as output. For your convenience, the Exploder type is already defined as a nested type. To save some time, you can use this type and its corresponding Exploder.unnest program rather than modifying the Essence code.
See Section Example 2 for a detailed example on writing an Exploder. The unnest.c file also contains further information on defining the unnesting algorithms.
Essence supports two mechanisms for defining the type-specific extraction algorithms (called Summarizers) that generate content summaries: a UNIX program that takes as its only command line argument the filename of the data to summarize, and line-based regular expressions specified in lib/quick-sum.cf. See Section Example 4 for detailed examples on how to define both types of Summarizers.
The UNIX Summarizers are named using the convention TypeName.sum (e.g., PostScript.sum). These Summarizers output their content summary in a SOIF attribute-value list (see Section The Summary Object Interchange Format (SOIF)). You can use the wrapit command to wrap raw output into the SOIF format (i.e., to provide byte-count delimiters on the individual attribute-value pairs). There is a summarizer called FullText.sum that you can use to perform full text indexing of selected file types, by simply setting up the lib/bycontent.cf and lib/byname.cf configuration files to recognize the desired file types as FullText (i.e., using ``FullText'' in column 1 next to the matching regular expression).
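For example, to index plain ``.txt'' and ``.faq'' files in full, lib/byname.cf might contain lines like these (the patterns are illustrative):

FullText        ^.*\.txt$
FullText        ^.*\.faq$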
It is possible to ``fine-tune'' the summary information generated by the Essence summarizers. A typical application of this would be to change the Time-to-Live attribute based on some knowledge about the objects. So an administrator could use the post-summarizing feature to give quickly-changing objects a lower TTL, and very stable documents a higher TTL.
Objects are selected for post-summarizing if they meet a specified condition. A condition consists of three parts: An attribute name, an operation, and some string data. For example:
city == 'New York'
In this case we are checking if the city attribute is equal to the string `New York'. For exact string matching, the string data must be enclosed in single quotes. Regular expressions are also supported:
city ~ /New York/
Negative operators are also supported:
city != 'New York'
city !~ /New York/
Conditions can be joined with `&&' (logical and) or `||' (logical or) operators:
city == 'New York' && state != 'NY';
When all conditions are met for an object, some number of instructions are executed on it. There are four types of instructions: setting an attribute to a value, piping an attribute's data through a program, piping several attributes through a program, and deleting the object. For example:
time-to-live = "86400"
keywords | tr A-Z a-z
address,city,state,zip ! cleanup-address.pl
delete()
The conditions and instructions are combined together in a ``rules'' file. The format of this file is somewhat similar to a Makefile; conditions begin in the first column and instructions are indented by a tab-stop.
Example:
type == 'HTML'
partial-text | cleanup-html-text.pl
URL ~ /users/
time-to-live = "86400"
partial-text ! extract-owner.sh
type == 'SOIFStream'
delete()
This rules file is specified in the gatherer.cf file with the Post-Summarizing tag, e.g.:
Post-Summarizing: lib/myrules
Until version 1.4 it was not possible to rewrite the URL-part of an object summary. It is now possible, but only by using the ``pipe'' instruction. This may be useful for people wanting to run a Gatherer on file:// URLs, but have them appear as http:// URLs. This can be done with a post-summarizing rule such as:
url ~ 'file://localhost/web/htdocs/'
url | fix-url.pl
And the 'fix-url.pl' script might look like:
#!/usr/local/bin/perl -p
s'file://localhost/web/htdocs/'http://www.my.domain/';
In addition to customizing the steps described in Section Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps, you can customize the Gatherer by setting variables in the Gatherer configuration file. This file consists of two parts: a list of variables that specify information about the Gatherer (such as its name, host, and port number), and two lists of URLs (divided into RootNodes and LeafNodes) from which to collect indexing information. Section Basic setup shows an example Gatherer configuration file. In this section we focus on the variables that the user can set in the first part of the Gatherer configuration file.
Each variable name starts in the first column, ends with a colon, then is followed by the value. The following table shows the supported variables:
Access-Delay: Default delay between URL accesses.
Data-Directory: Directory where GDBM database is written.
Debug-Options: Debugging options passed to child programs.
Errorlog-File: File for logging errors.
Essence-Options: Any extra options to pass to Essence.
FTP-Auth: Username/password for protected FTP documents.
Gatherd-Inetd: Denotes that gatherd is run from inetd.
Gatherer-Host: Full hostname where the Gatherer is run.
Gatherer-Name: A unique name for the Gatherer.
Gatherer-Options: Extra options for the Gatherer.
Gatherer-Port: Port number for gatherd.
Gatherer-Version: Version string for the Gatherer.
HTTP-Basic-Auth: Username/password for protected HTTP documents.
HTTP-Proxy: host:port of your HTTP proxy.
Keep-Cache: ``yes'' to not remove local disk cache.
Lib-Directory: Directory where configuration files live.
Local-Mapping: Mapping information for local gathering.
Log-File: File for logging progress.
Post-Summarizing: A rules-file for post-summarizing.
Refresh-Rate: Object refresh-rate in seconds, default 1 week.
Time-To-Live: Object time-to-live in seconds, default 1 month.
Top-Directory: Top-level directory for the Gatherer.
Working-Directory: Directory for tmp files and local disk cache.
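For example, a Gatherer configuration file might set several of these variables as follows (the values shown are illustrative):

Gatherer-Name: My Sample Harvest Gatherer
Gatherer-Port: 8500
Top-Directory: /usr/local/harvest/gatherers/sample
Lib-Directory: lib
HTTP-Proxy: proxy.example.com:3128
Keep-Cache: yes
Refresh-Rate: 604800
Time-To-Live: 2592000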
Notes: The Gatherer program accepts a -background flag, which causes the Gatherer to run in the background. Extra options can be passed to Essence with the Essence-Options variable. The Essence options are:
Option Meaning
--------------------------------------------------------------------
--allowlist filename File with list of types to allow
--fake-md5s Generates MD5s for SOIF objects from a .unnest program
--fast-summarizing Trade speed for some consistency. Use only when
an external summarizer is known to generate clean,
unique attributes.
--full-text Use entire file instead of summarizing. Alternatively,
you can perform full text indexing of individual file
types by using the FullText.sum summarizer.
--max-deletions n Number of GDBM deletions before reorganization
--minimal-bookkeeping Generates a minimal amount of bookkeeping attrs
--no-access Do not read contents of objects
--no-keywords Do not automatically generate keywords
--stoplist filename File with list of types to remove
--type-only Only type data; do not summarize objects
A particular note about full text summarizing: Using the Essence --full-text option causes files not to be passed through the Essence content extraction mechanism. Instead, their entire content is included in the SOIF summary stream. In some cases this may produce unwanted results (e.g., it will directly include the PostScript for a document rather than first passing the data through a PostScript to text extractor, providing few searchable terms and large SOIF objects). Using the individual file type summarizing mechanism described in Section Customizing the summarizing step will work better in this regard, but will require you to specify how data are extracted for each individual file type. In a future version of Harvest we will change the Essence --full-text option to perform content extraction before including the full text of documents.
Although the Gatherer's work load is specified using URLs, often the files being gathered are located on a local file system. In this case it is much more efficient to gather directly from the local file system than via FTP/Gopher/HTTP/News, primarily because of all the UNIX forking required to gather information via these network processes. For example, our measurements indicate it causes from 4-7x more CPU load to gather from FTP than directly from the local file system. For large collections (e.g., archive sites containing many thousands of files), the CPU savings can be considerable.
Starting with Harvest Version 1.1, it is possible to tell the Gatherer how to translate URLs to local file system names, using the Local-Mapping Gatherer configuration file variable (see Section Setting variables in the Gatherer configuration file). The syntax is:
Local-Mapping: URL_prefix local_path_prefix
This causes all URLs starting with URL_prefix to be translated to files starting with the prefix local_path_prefix while gathering, but to be left as URLs in the results of queries (so the objects can be retrieved as usual). Note that no regular expressions are supported here. As an example, the specification
Local-Mapping: http://harvest.cs.colorado.edu/~hardy/ /homes/hardy/public_html/
Local-Mapping: ftp://ftp.cs.colorado.edu/pub/cs/ /cs/ftp/
would cause the URL http://harvest.cs.colorado.edu/~hardy/Home.html to be translated to the local file name /homes/hardy/public_html/Home.html, while the URL ftp://ftp.cs.colorado.edu/pub/cs/techreports/schwartz/Harvest.Conf.ps.Z would be translated to the local file name /cs/ftp/techreports/schwartz/Harvest.Conf.ps.Z.
Local gathering will work over NFS file systems. A local mapping will fail if: the local file cannot be opened for reading; or the local file is not a regular file; or the local file has execute bits set. So, for directories, symbolic links and CGI scripts, the server is always contacted rather than the local file system. Lastly, the Gatherer does not perform any URL syntax translations for local mappings. If your URL has characters that should be escaped (as in RFC1738), then the local mapping will fail. Starting with version 1.4 patchlevel 2 Essence will print [L] after URLs which were successfully accessed locally.
Note that if your network is highly congested, it may actually be faster to gather via HTTP/FTP/Gopher than via NFS, because NFS becomes very inefficient in highly congested situations. Even better would be to run local Gatherers on the hosts where the disks reside, and access them directly via the local file system.
You can gather password-protected documents from HTTP and FTP servers. In both cases, you can specify a username and password as a part of the URL. The format is as follows:
ftp://user:password@host:port/url-path
http://user:password@host:port/url-path
With this format, the ``user:password'' part is kept as a part of the URL string all throughout Harvest. This may enable anyone who uses your Broker(s) to access password-protected documents.
You can keep the username and password information ``hidden'' by specifying the authentication information in the Gatherer configuration file. For HTTP, the format is as follows:
HTTP-Basic-Auth: realm username password
where realm is the same as the AuthName parameter given in an Apache httpd httpd.conf or .htaccess file. In other httpd server configuration, the realm value is sometimes called ServerId.
For FTP, the format in the gatherer.cf file is
FTP-Auth: hostname[:port] username password
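For example (the realm, hostname, usernames, and passwords are illustrative):

HTTP-Basic-Auth: ProtectedPages webuser webpassword
FTP-Auth: ftp.example.com myftpuser myftppassword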
You can use the gatherd.cf file (placed in the Data-Directory of a Gatherer) to control access to the Gatherer's database. A line that begins with Allow is followed by any number of domain or host names that are allowed to connect to the Gatherer. If the word all is used, then all hosts are matched. Deny is the opposite of Allow. The following example will only allow hosts in the cs.colorado.edu or usc.edu domains to access the Gatherer's database:
Allow cs.colorado.edu usc.edu
Deny all
The Gatherer program does not automatically do any periodic updates -- when you run it, it processes the specified URLs, starts up a gatherd daemon (if one isn't already running), and then exits. If you want to update the data periodically (e.g., to capture new files as they are added to an FTP archive), you need to use the UNIX cron command to run the Gatherer program at some regular interval. To set up periodic gathering via cron, use the RunGatherer command that RunHarvest will create. An example RunGatherer script follows:
#!/bin/sh
#
# RunGatherer - Runs the ATT 800 Gatherer (from cron)
#
HARVEST_HOME=/usr/local/harvest; export HARVEST_HOME
PATH=${HARVEST_HOME}/bin:${HARVEST_HOME}/lib/gatherer:${HARVEST_HOME}/lib:$PATH
export PATH
NNTPSERVER=localhost; export NNTPSERVER
cd /usr/local/harvest/gatherers/att800
exec Gatherer "att800.cf"
You should run the RunGatherd command from your system startup file (e.g. /etc/rc.local), so the Gatherer's database is exported each time the machine reboots. An example RunGatherd script follows:
#!/bin/sh
#
# RunGatherd - starts up the gatherd process (from /etc/rc.local)
#
HARVEST_HOME=/usr/local/harvest; export HARVEST_HOME
PATH=${HARVEST_HOME}/lib/gatherer:${HARVEST_HOME}/bin:$PATH; export PATH
exec gatherd -d /usr/local/harvest/gatherers/att800/data 8500
The Gatherer maintains a local disk cache of files it gathers, to reduce network traffic when restarting aborted gathering attempts. However, since the remote server must still be contacted whenever the Gatherer runs, please do not set your cron job to run the Gatherer too frequently. A typical value might be weekly or monthly, depending on how congested the network is and how important it is to have the most current data.
By default, the Gatherer's local disk cache is deleted after each successful completion. To save the local disk cache between Gatherer sessions, define Keep-Cache: yes in your Gatherer configuration file (Section Setting variables in the Gatherer configuration file).
If you want your Broker's index to reflect new data, then you must run the Gatherer and run a Broker collection. By default, a Broker will perform collections once a day. If you want the Broker to collect data as soon as it's gathered, then you will need to coordinate the timing of the completion of the Gatherer and the Broker collections.
If you run your Gatherer frequently and you use Keep-Cache: yes in your Gatherer configuration file, then the Gatherer's local disk cache may interfere with retrieving updates. By default, objects in the local disk cache expire after 7 days; however, you can expire objects more quickly by setting the $GATHERER_CACHE_TTL environment variable to the number of seconds for the Time-To-Live (TTL) before you run the Gatherer, or you can change RunGatherer to remove the Gatherer's tmp directory after each Gatherer run. For example, to expire objects in the local disk cache after one day:
% setenv GATHERER_CACHE_TTL 86400 # one day
% ./RunGatherer
The Gatherer's local disk cache size defaults to 32 MB, but you can change this value by setting the $HARVEST_MAX_LOCAL_CACHE environment variable to the desired number of megabytes before you run the Gatherer. For example, to allow a maximum cache of 10 MB:
% setenv HARVEST_MAX_LOCAL_CACHE 10 # 10 MB
% ./RunGatherer
If you have access to the software that creates the files that you are indexing (e.g., if all updates are funneled through a particular editor, update script, or system call), you can modify this software to schedule realtime Gatherer updates whenever a file is created or updated. For example, if all users update the files being indexed using a particular program, this program could be modified to run the Gatherer upon completion of the user's update.
Note that, when used in conjunction with cron, the Gatherer provides a powerful data ``mirroring'' facility. You can use the Gatherer to replicate the contents of one or more sites, retrieve data in multiple formats via multiple protocols (FTP, HTTP, etc.), optionally perform a variety of type- or site-specific transformations on the data, and serve the results very efficiently as compressed SOIF object summary streams to other sites that wish to use the data for building indexes or for other purposes.
You may want to inspect the quality of the automatically-generated SOIF templates. In general, Essence's techniques for automatic information extraction produce imperfect results. Sometimes it is possible to customize the summarizers to better suit the particular context (see Section Customizing the summarizing step). Sometimes, however, it makes sense to augment or change the automatically generated keywords with manually entered information. For example, you may want to add Title attributes to the content summaries for a set of PostScript documents (since it's difficult to parse them out of PostScript automatically).
Harvest provides some programs that automatically clean up a Gatherer's database. The rmbinary program removes any binary data from the templates. The cleandb program does some simple validation of SOIF objects, and when given the -truncate flag it will truncate the Keywords data field to 8 kilobytes. To help in manually managing the Gatherer's databases, the gdbmutil GDBM database management tool is provided in $HARVEST_HOME/lib/gatherer.
In a future release of Harvest we will provide a forms-based mechanism to make it easy to provide manual annotations. In the meantime, you can annotate the Gatherer's database with manually generated information by using the mktemplate, template2db, mergedb, and mkindex programs. You first need to create a file (called, say, annotations) in the following format:
@FILE { url1
Attribute-Name-1: DATA
Attribute-Name-2: DATA
...
Attribute-Name-n: DATA
}
@FILE { url2
Attribute-Name-1: DATA
Attribute-Name-2: DATA
...
Attribute-Name-n: DATA
}
...
Note that the Attributes must begin in column 0 and have one tab after the colon, and the DATA must be on a single line.
Next, run the mktemplate and template2db programs to generate SOIF and then GDBM versions of these data (you can have several files containing the annotations, and generate a single GDBM database from them with the commands below):
% set path = ($HARVEST_HOME/lib/gatherer $path)
% mktemplate annotations [annotations2 ...] | template2db annotations.gdbm
Finally, you run mergedb to incorporate the annotations into the automatically generated data, and mkindex to generate an index for it. The usage line for mergedb is:
mergedb production automatic manual [manual ...]
The idea is that production is the final GDBM database that the Gatherer will serve. This is a new database that will be generated from the other databases on the command line. automatic is the GDBM database that a Gatherer automatically generated in a previous run (e.g., WORKING.gdbm or a previous PRODUCTION.gdbm). manual and so on are the GDBM databases that you manually created. When mergedb runs, it builds the production database by first copying the templates from the manual databases, and then merging in the attributes from the automatic database. In case of a conflict (the same attribute with different values in the manual and automatic databases), the manual values override the automatic values.
By keeping the automatically and manually generated data stored separately, you can avoid losing the manual updates when doing periodic automatic gathering. To do this, you will need to set up a script to remerge the manual annotations with the automatically gathered data after each gathering.
An example use of mergedb is:
% mergedb PRODUCTION.new PRODUCTION.gdbm annotations.gdbm
% mv PRODUCTION.new PRODUCTION.gdbm
% mkindex
If the manual database looked like this:
@FILE { url1
my-manual-attribute: this is a neat attribute
}
and the automatic database looked like this:
@FILE { url1
keywords: boulder colorado
file-size: 1034
md5: c3d79dc037efd538ce50464089af2fb6
}
then in the end, the production database will look like this:
@FILE { url1
my-manual-attribute: this is a neat attribute
keywords: boulder colorado
file-size: 1034
md5: c3d79dc037efd538ce50464089af2fb6
}
Extra information from specific programs and library routines can be logged by setting debugging flags. A debugging flag has the form -Dsection,level. Section is an integer in the range 1-255, and level is an integer in the range 1-9. Debugging flags can be given on a command line, with the Debug-Options: tag in a gatherer configuration file, or by setting the environment variable $HARVEST_DEBUG.
Examples:
Debug-Options: -D68,5 -D44,1
% httpenum -D20,1 -D21,1 -D42,1 http://harvest.cs.colorado.edu/
% setenv HARVEST_DEBUG '-D20,1 -D23,1 -D63,1'
Debugging sections and levels have been assigned to the following sections of the code:
section 20, level 1, 5, 9 Common liburl URL processing
section 21, level 1, 5, 9 Common liburl HTTP routines
section 22, level 1, 5 Common liburl disk cache routines
section 23, level 1 Common liburl FTP routines
section 24, level 1 Common liburl Gopher routines
section 25, level 1 urlget - standalone liburl program.
section 26, level 1 ftpget - standalone liburl program.
section 40, level 1, 5, 9 Gatherer URL enumeration
section 41, level 1 Gatherer enumeration URL verification
section 42, level 1, 5, 9 Gatherer enumeration for HTTP
section 43, level 1, 5, 9 Gatherer enumeration for Gopher
section 44, level 1, 5 Gatherer enumeration filter routines
section 45, level 1 Gatherer enumeration for FTP
section 46, level 1 Gatherer enumeration for file:// URLs
section 48, level 1, 5 Gatherer enumeration robots.txt stuff
section 60, level 1 Gatherer essence data object processing
section 61, level 1 Gatherer essence database routines
section 62, level 1 Gatherer essence main
section 63, level 1 Gatherer essence type recognition
section 64, level 1 Gatherer essence object summarizing
section 65, level 1 Gatherer essence object unnesting
section 66, level 1, 2, 5 Gatherer essence post-summarizing
section 67, level 1 Gatherer essence object-ID code
section 69, level 1, 5, 9 Common SOIF template processing
section 70, level 1, 5, 9 Broker registry
section 71, level 1 Broker collection routines
section 72, level 1 Broker SOIF parsing routines
section 73, level 1, 5, 9 Broker registry hash tables
section 74, level 1 Broker storage manager routines
section 75, level 1, 5 Broker query manager routines
section 75, level 4 Broker query_list debugging
section 76, level 1 Broker event management routines
section 77, level 1 Broker main
section 78, level 9 Broker select(2) loop
section 79, level 1, 5, 9 Broker gatherer-id management
section 80, level 1 Common utilities memory management
section 81, level 1 Common utilities buffer routines
section 82, level 1 Common utilities system(3) routines
section 83, level 1 Common utilities pathname routines
section 84, level 1 Common utilities hostname processing
section 85, level 1 Common utilities string processing
section 86, level 1 Common utilities DNS host cache
section 101, level 1 Broker PLWeb indexing engine
section 102, level 1, 2, 5 Broker Glimpse indexing engine
section 103, level 1 Broker Swish indexing engine
The Gatherer doesn't pick up all the objects pointed to by some of my RootNodes.
The Gatherer places various limits on enumeration to prevent a misconfigured Gatherer from abusing servers or running wildly. See section RootNode specifications for details on how to override these limits.
Local-Mapping did not work for me - it retrieved the objects via the usual remote access protocols.
A local mapping will fail if: the local file cannot be opened for reading; the local file is not a regular file; or the local file has execute bits set. So for directories, symbolic links, and CGI scripts, the HTTP server is always contacted. We don't perform URL translation for local mappings. If your URLs have characters that must be escaped, then the local mapping will also fail. Add debug option -D20,1 to understand how local mappings are taking place.
Using the --full-text option I see a lot of raw data in the content summaries, with few keywords I can search.
At present --full-text simply includes the full data content in the SOIF summaries. Using the individual file type summarizing mechanism described in Section Customizing the summarizing step will work better in this regard, but will require you to specify how data are extracted for each individual file type. In a future version of Harvest we will change the Essence --full-text option to perform content extraction before including the full text of documents.
No indexing terms are being generated in the SOIF summary for the META tags in my HTML documents.
This probably indicates that your HTML is not syntactically well-formed, and hence the SGML-based HTML summarizer is not able to recognize it. See Section Summarizing SGML data for details and debugging options.
Gathered data are not being updated.
The Gatherer does not automatically do periodic updates. See Section Periodic gathering and realtime updates for details.
The Gatherer puts slightly different URLs in the SOIF summaries than I specified in the Gatherer configuration file.
This happens because the Gatherer attempts to put URLs into a canonical format. It does this by removing default port numbers and similar cosmetic changes. Also, by default, Essence (the content extraction subsystem within the Gatherer) removes the standard stoplist.cf types, which includes HTTP-Query (the cgi-bin stuff).
There are no Last-Modification-Time or MD5 attributes in my gathered SOIF data, so the Broker can't do duplicate elimination.
If you gather remote, manually-created information, it is pulled into Harvest using ``exploders'' that translate from the remote format into SOIF. That means they don't have a direct way to fill in the Last-Modification-Time or MD5 information per record. Note also that this will mean one update to the remote records would cause all records to look updated, which will result in more network load for Brokers that collect from this Gatherer's data. As a solution, you can compute MD5s for all objects, and store them as part of the record. Then, when you run the exploder you only generate timestamps for the ones for which the MD5s changed - giving you real last-modification times.
The Gatherer substitutes a ``%7e'' for a ``~'' in all the user directory URLs.
The Gatherer conforms to RFC1738, which says that a tilde inside a URL should be encoded as ``%7e'', because it is considered an ``unsafe'' character.
When I search using keywords I know are in a document I have indexed with Harvest, the document isn't found.
Harvest uses a content extraction subsystem called Essence that by default does not extract every keyword in a document. Instead, it uses heuristics to try to select promising keywords. You can change what keywords are selected by customizing the summarizers for that type of data, as discussed in Section Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps. Or, you can tell Essence to use full text summarizing if you feel the added disk space costs are merited, as discussed in Section Setting variables in the Gatherer configuration file.
I'm running Harvest on HP-UX, but the essence process in the Gatherer takes too much memory.
The supplied regular expression library has memory leaks on HP-UX, so you need to use the regular expression library supplied with HP-UX. Change the Makefile in src/gatherer/essence to read:
REGEX_DEFINE = -DUSE_POSIX_REGEX
REGEX_INCLUDE =
REGEX_OBJ =
REGEX_TYPE = posix
I built the configuration files to customize how Essence types/content extracts data, but it uses the standard typing/extracting mechanisms anyway.
Verify that you have Lib-Directory set to the lib/ directory in which you put your configuration files. Lib-Directory is defined in your Gatherer configuration file.
I am having problems resolving host names on SunOS.
In order to gather data from hosts outside of your organization, your system must be able to resolve fully qualified domain names into IP addresses. If your system cannot resolve hostnames, you will see error messages such as ``Unknown Host.'' In this case, check your resolver configuration as described below.
To verify that your system is configured for DNS, make sure that the file /etc/resolv.conf exists and is readable. Read the resolv.conf(5) manual page for information on this file. You can verify that DNS is working with the nslookup command.
Some sites may use Sun Microsystems' Network Information Service (NIS) instead of, or in addition to, DNS. We believe that Harvest works on systems where NIS has been properly configured. The NIS servers (the names of which you can determine from the ypwhich command) must be configured to query DNS servers for hostnames they do not know about. See the -b option of the ypxfr command.
I cannot get the Gatherer to work across our firewall gateway.
Harvest only supports retrieving HTTP objects through a proxy. It is not yet possible to request Gopher and FTP objects through a firewall. For these objects, you may need to run Harvest internally (behind the firewall) or on the firewall host itself.
If you see the ``Host is unreachable'' message, the likely problem is that there is no network route from your host to the destination, for example because the firewall blocks direct connections.
If you see the ``Connection refused'' message, the likely problem is that you are trying to connect with an unused port on the destination machine. In other words, there is no program listening for connections on that port.
The Harvest gatherer is essentially a WWW client. You should expect it to work the same as any Web browser.