
7. The Summary Object Interchange Format (SOIF)

Harvest Gatherers and Brokers communicate using an attribute-value stream protocol called the Summary Object Interchange Format (SOIF); see Section Example 1 for an example. Gatherers generate content summaries for individual objects in SOIF and serve these summaries to Brokers that wish to collect and index them. SOIF provides a means of bracketing collections of summary objects, so a Broker can retrieve the content summaries for many objects from a Gatherer in a single, efficient compressed stream. Brokers support querying SOIF data with structured attribute-value queries and many other types of queries, as discussed in Section Querying a Broker.

7.1 Formal description of SOIF

The SOIF Grammar is as follows:

    SOIF            ::=  OBJECT SOIF | OBJECT
    OBJECT          ::=  @ TEMPLATE-TYPE { URL ATTRIBUTE-LIST }
    ATTRIBUTE-LIST  ::=  ATTRIBUTE ATTRIBUTE-LIST | ATTRIBUTE
    ATTRIBUTE       ::=  IDENTIFIER {VALUE-SIZE} DELIMITER VALUE
    TEMPLATE-TYPE   ::=  Alpha-Numeric-String
    IDENTIFIER      ::=  Alpha-Numeric-String
    VALUE           ::=  Arbitrary-Data
    VALUE-SIZE      ::=  Number
    DELIMITER       ::=  ":<tab>"
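
The grammar above can be read as a length-prefixed format: VALUE-SIZE gives the number of bytes in VALUE, so a value may itself contain newlines, tabs, or braces. The following Python sketch illustrates this; it is not part of Harvest. The function names format_soif and parse_soif are hypothetical, and the sketch assumes that identifiers may contain hyphens and underscores (as in Time-to-Live), that the URL contains no whitespace, and that each attribute is followed by a newline.

```python
import re

# Header of one object: "@TEMPLATE-TYPE { URL" up to the end of the line.
_HEADER = re.compile(rb"@([A-Za-z0-9_-]+)\s*\{\s*(\S+)[ \t]*\r?\n")
# Start of one attribute: "IDENTIFIER{VALUE-SIZE}:<tab>".
# Assumption: identifiers may contain "-" and "_" besides letters and digits.
_ATTR = re.compile(rb"([A-Za-z0-9_-]+)\{(\d+)\}:\t")


def format_soif(template_type, url, attributes):
    """Serialize one summary object; attributes maps str names to bytes values."""
    out = [b"@%s { %s\n" % (template_type.encode(), url.encode())]
    for name, value in attributes.items():
        # VALUE-SIZE is the byte length of the value that follows the delimiter.
        out.append(b"%s{%d}:\t%s\n" % (name.encode(), len(value), value))
    out.append(b"}\n")
    return b"".join(out)


def parse_soif(data):
    """Yield (template_type, url, attributes) for each object in a byte string."""
    pos = 0
    while True:
        header = _HEADER.search(data, pos)
        if header is None:
            return
        pos = header.end()
        attributes = {}
        while True:
            # Skip whitespace separating attributes.
            while data[pos:pos + 1].isspace():
                pos += 1
            if data[pos:pos + 1] == b"}":
                pos += 1  # closing brace ends this object
                break
            attr = _ATTR.match(data, pos)
            if attr is None:
                raise ValueError("malformed attribute at byte %d" % pos)
            size = int(attr.group(2))
            start = attr.end()
            if start + size > len(data):
                raise ValueError("truncated value at byte %d" % start)
            # Read exactly VALUE-SIZE bytes; never scan the value for delimiters.
            attributes[attr.group(1).decode()] = data[start:start + size]
            pos = start + size
        yield header.group(1).decode(), header.group(2).decode(), attributes
```

Because the parser trusts VALUE-SIZE rather than searching for the closing brace, a Full-Text value that happens to contain "}" or text resembling another attribute is read back unchanged. Several objects written one after another form a stream that parse_soif walks object by object.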

7.2 List of common SOIF attribute names

Each Broker can support different attributes, depending on the data it holds. Below we list a set of the most common attributes:

Abstract
     Brief abstract of the object.
Author
     Author(s) of the object.
Description
     Brief description of the object.
File-Size
     Number of bytes in the object.
Full-Text
     Entire contents of the object.
Gatherer-Host
     Host on which the Gatherer ran to extract information from the object.
Gatherer-Name
     Name of the Gatherer that extracted information from the object
     (e.g., Full-Text, Selected-Text, or Terse).
Gatherer-Port
     Port number on the Gatherer-Host that serves the Gatherer's information.
Gatherer-Version
     Version number of the Gatherer.
Update-Time
     The time at which the Gatherer updated the content summary for the object.
Keywords
     Searchable keywords extracted from the object.
Last-Modification-Time
     The time that the object was last modified.
MD5
     MD5 16-byte checksum of the object.
Refresh-Rate
     The number of seconds after Update-Time when the summary object is to
     be re-generated.  Defaults to 1 month.
Time-to-Live
     The number of seconds after Update-Time when the summary object is
     no longer valid.  Defaults to 6 months.
Title
     Title of the object.
Type
     The object's type. Some example types are:

             Archive
             Audio
             Awk
             Backup
             Binary
             C
             CHeader
             Command
             Compressed
             CompressedTar
             Configuration
             Data
             Directory
             DotFile
             Dvi
             FAQ
             FYI
             Font
             FormattedText
             GDBM
             GNUCompressed
             GNUCompressedTar
             HTML
             Image
             Internet-Draft
             MacCompressed
             Mail
             Makefile
             ManPage
             Object
             OtherCode
             PCCompressed
             Patch
             Pdf
             Perl
             PostScript
             RCS
             README
             RFC
             RTF
             SCCS
             ShellArchive
             Tar
             Tcl
             Tex
             Text
             Troff
             Uuencoded
             WaisSource

Update-Time
     The time that the summary object was last updated.
     REQUIRED field, no default.
URI
     Uniform Resource Identifier.
URL-References
     Any URL references present within HTML objects.
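
The relationship between Update-Time, Refresh-Rate, and Time-to-Live described above can be sketched in Python as follows. This is an illustration only, not Harvest code: the function name summary_status and its return strings are hypothetical, Update-Time is assumed to be expressed in seconds since the Unix epoch, and the defaults of "1 month" and "6 months" are approximated as 30 and 180 days.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60
# Assumption: "1 month" and "6 months" are approximated as 30 and 180 days.
DEFAULT_REFRESH_RATE = 30 * SECONDS_PER_DAY
DEFAULT_TIME_TO_LIVE = 180 * SECONDS_PER_DAY


def summary_status(attributes, now=None):
    """Classify a summary object as "fresh", "stale", or "expired".

    attributes maps attribute names to values (str, bytes, or int).
    now is the current time in seconds; it defaults to time.time().
    """
    if "Update-Time" not in attributes:
        # Update-Time is a required field with no default.
        raise KeyError("Update-Time is required and has no default")
    now = time.time() if now is None else now
    updated = int(attributes["Update-Time"])
    refresh = int(attributes.get("Refresh-Rate", DEFAULT_REFRESH_RATE))
    ttl = int(attributes.get("Time-to-Live", DEFAULT_TIME_TO_LIVE))
    age = now - updated
    if age >= ttl:
        return "expired"  # past Time-to-Live: no longer valid
    if age >= refresh:
        return "stale"    # past Refresh-Rate: due to be re-generated
    return "fresh"
```

A summary with only Update-Time set is judged against both defaults; one that also carries Refresh-Rate or Time-to-Live is judged against its own values.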

