Harvest Gatherers and Brokers communicate using an attribute-value stream protocol called the Summary Object Interchange Format (SOIF); an example appears in Section Example 1. Gatherers generate content summaries for individual objects in SOIF and serve these summaries to Brokers that wish to collect and index them. SOIF provides a means of bracketing collections of summary objects, which allows a Harvest Broker to retrieve the content summaries for many objects from a Gatherer in a single, efficient compressed stream. Harvest Brokers support querying SOIF data with structured attribute-value queries and many other types of queries, as discussed in Section Querying a Broker.
The SOIF Grammar is as follows:
SOIF ::= OBJECT SOIF | OBJECT
OBJECT ::= @ TEMPLATE-TYPE { URL ATTRIBUTE-LIST }
ATTRIBUTE-LIST ::= ATTRIBUTE ATTRIBUTE-LIST | ATTRIBUTE
ATTRIBUTE ::= IDENTIFIER {VALUE-SIZE} DELIMITER VALUE
TEMPLATE-TYPE ::= Alpha-Numeric-String
IDENTIFIER ::= Alpha-Numeric-String
VALUE ::= Arbitrary-Data
VALUE-SIZE ::= Number
DELIMITER ::= ":<tab>"
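To make the grammar above concrete, here is a minimal Python sketch of a reader for it. It is not Harvest's own parser, and it rests on several assumptions the grammar does not state: that VALUE-SIZE is the length of VALUE in bytes, that tokens are separated by whitespace or newlines, that the URL contains no whitespace, and that identifiers may contain hyphens (as names such as File-Size suggest). The names parse_soif and SAMPLE and the example.com URLs are hypothetical.

```python
import re

# Assumed layout: "@TYPE { URL" followed by whitespace, then attributes, then "}".
HEADER = re.compile(rb"@([A-Za-z0-9-]+)\s*\{\s*(\S+)\s*")
# IDENTIFIER {VALUE-SIZE} DELIMITER, where DELIMITER is a colon followed by a tab.
ATTR = re.compile(rb"([A-Za-z0-9-]+)\{(\d+)\}:\t")


def _skip_ws(data: bytes, pos: int) -> int:
    """Return the index of the first non-whitespace byte at or after pos."""
    while pos < len(data) and data[pos:pos + 1].isspace():
        pos += 1
    return pos


def parse_soif(data: bytes):
    """Parse a SOIF byte stream into a list of (template_type, url, attrs).

    attrs maps attribute names to raw bytes values. VALUE-SIZE is assumed
    to be a byte count, so values may contain newlines or closing braces.
    """
    objects = []
    pos = _skip_ws(data, 0)
    while pos < len(data):
        header = HEADER.match(data, pos)
        if header is None:
            raise ValueError(f"expected object header at byte {pos}")
        template = header.group(1).decode("ascii")
        url = header.group(2).decode("ascii")
        pos = header.end()
        attrs = {}
        while True:
            pos = _skip_ws(data, pos)
            if data[pos:pos + 1] == b"}":
                pos += 1  # end of this object
                break
            attr = ATTR.match(data, pos)
            if attr is None:
                raise ValueError(f"expected attribute or closing brace at byte {pos}")
            size = int(attr.group(2))
            start = attr.end()
            value = data[start:start + size]
            if len(value) != size:
                raise ValueError(f"value of {attr.group(1)!r} is truncated")
            attrs[attr.group(1).decode("ascii")] = value
            pos = start + size
        objects.append((template, url, attrs))
        pos = _skip_ws(data, pos)
    return objects


# Hypothetical two-object stream used only for illustration.
SAMPLE = (
    b"@FILE { http://example.com/a.txt\n"
    b"Title{5}:\tHello\n"
    b"Abstract{11}:\tline1\nline2\n"
    b"}\n"
    b"@FILE { http://example.com/b.txt\n"
    b"Type{4}:\tText\n"
    b"}\n"
)

if __name__ == "__main__":
    for template, url, attrs in parse_soif(SAMPLE):
        print(template, url, sorted(attrs))
```

Reading each value by its declared size, rather than scanning for a line ending, is what lets a value carry arbitrary data. The sketch is slightly more lenient than the grammar, since it accepts an object with no attributes.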
Each Broker can support different attributes, depending on the data it holds. Below we list a set of the most common attributes:
Abstract
Brief abstract about the object.
Author
Author(s) of the object.
Description
Brief description about the object.
File-Size
Number of bytes in the object.
Full-Text
Entire contents of the object.
Gatherer-Host
Host on which the Gatherer ran to extract information from the object.
Gatherer-Name
Name of the Gatherer that extracted information from the object (e.g., Full-Text, Selected-Text, or Terse).
Gatherer-Port
Port number on the Gatherer-Host that serves the Gatherer's information.
Gatherer-Version
Version number of the Gatherer.
Update-Time
The time at which the Gatherer updated the content summary for the object.
Keywords
Searchable keywords extracted from the object.
Last-Modification-Time
The time that the object was last modified.
MD5
16-byte MD5 checksum of the object.
Refresh-Rate
The number of seconds after Update-Time when the summary object is to be re-generated. Defaults to 1 month.
Time-to-Live
The number of seconds after Update-Time when the summary object is no longer valid. Defaults to 6 months.
Title
Title of the object.
Type
The object's type. Some example types are:
Archive
Audio
Awk
Backup
Binary
C
CHeader
Command
Compressed
CompressedTar
Configuration
Data
Directory
DotFile
Dvi
FAQ
FYI
Font
FormattedText
GDBM
GNUCompressed
GNUCompressedTar
HTML
Image
Internet-Draft
MacCompressed
Mail
Makefile
ManPage
Object
OtherCode
PCCompressed
Patch
Pdf
Perl
PostScript
RCS
README
RFC
RTF
SCCS
ShellArchive
Tar
Tcl
Tex
Text
Troff
Uuencoded
WaisSource
Update-Time
The time that the summary object was last updated. REQUIRED field, no default.
URI
Uniform Resource Identifier.
URL-References
Any URL references present within HTML objects.
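The relationship between Update-Time, Refresh-Rate, and Time-to-Live in the list above can be sketched as a small Python check. This is an illustration only, under assumptions the text does not confirm: Update-Time is taken to be a count of seconds (for example, since the Unix epoch), "1 month" is approximated as 30 days, and a summary is treated as due at exactly the stated number of seconds. The function name summary_status and the labels "fresh", "stale", and "expired" are hypothetical.

```python
# Assumption: the "1 month" and "6 months" defaults are approximated
# using 30-day months; the source gives no exact value in seconds.
MONTH_SECONDS = 30 * 24 * 60 * 60
DEFAULT_REFRESH_RATE = MONTH_SECONDS
DEFAULT_TIME_TO_LIVE = 6 * MONTH_SECONDS


def summary_status(attrs, now):
    """Classify a summary object as "fresh", "stale", or "expired".

    attrs maps attribute names to values that int() can parse.
    now is the current time in the same units as Update-Time.
    """
    # Update-Time is a required field with no default, so a missing
    # value raises KeyError instead of being guessed.
    update_time = int(attrs["Update-Time"])
    refresh_rate = int(attrs.get("Refresh-Rate", DEFAULT_REFRESH_RATE))
    time_to_live = int(attrs.get("Time-to-Live", DEFAULT_TIME_TO_LIVE))

    age = now - update_time
    if age >= time_to_live:
        return "expired"  # the summary is no longer valid
    if age >= refresh_rate:
        return "stale"    # the summary is due to be re-generated
    return "fresh"


if __name__ == "__main__":
    import time

    example = {"Update-Time": str(int(time.time()) - 3600)}
    print(summary_status(example, int(time.time())))
```

A summary that is stale but not yet expired can still be served while it waits to be re-generated, which is why the two thresholds are checked separately.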