Harvest User's Manual

Darren R. Hardy, Michael F. Schwartz, Duane Wessels, Kang-Jin Lee

2002-10-29

Harvest User's Manual was edited by Kang-Jin Lee and covers Harvest version 1.8. It was originally written by Darren R. Hardy, Michael F. Schwartz and Duane Wessels for Harvest 1.4.pl2 on 1996-01-31.
______________________________________________________________________

Table of Contents

1. Introduction to Harvest
   1.1 Copyright
   1.2 Online Harvest Resources
2. Subsystem Overview
   2.1 Distributing the Gathering and Brokering Processes
3. Installing the Harvest Software
   3.1 Requirements for Harvest Servers
       3.1.1 Hardware
       3.1.2 Platforms
       3.1.3 Software
   3.2 Requirements for Harvest Users
   3.3 Retrieving and Installing the Harvest Software
       3.3.1 Distribution types
       3.3.2 Harvest components
       3.3.3 User-contributed software
   3.4 Building the Source Distribution
   3.5 Additional installation for the Harvest Broker
       3.5.1 Checking the installation for HTTP access
       3.5.2 Required modifications to your HTTP server
       3.5.3 Apache httpd
       3.5.4 Other HTTP servers
   3.6 Upgrading versions of the Harvest software
       3.6.1 Upgrading from version 1.6 to version 1.8
       3.6.2 Upgrading from version 1.5 to version 1.6
       3.6.3 Upgrading from version 1.4 to version 1.5
       3.6.4 Upgrading from version 1.3 to version 1.4
       3.6.5 Upgrading from version 1.2 to version 1.3
       3.6.6 Upgrading from version 1.1 to version 1.2
       3.6.7 Upgrading to version 1.1 from version 1.0 or older
   3.7 Starting up the system: RunHarvest and related commands
   3.8 Harvest team contact information
4. The Gatherer
   4.1 Overview
   4.2 Basic setup
       4.2.1 Gathering News URLs with NNTP
       4.2.2 Cleaning out a Gatherer
   4.3 RootNode specifications
       4.3.1 RootNode filters
       4.3.2 Generic Enumeration program description
       4.3.3 Example RootNode configuration
       4.3.4 Gatherer enumeration vs. candidate selection
   4.4 Generating LeafNode/RootNode URLs from a program
   4.5 Extracting data for indexing: The Essence summarizing subsystem
       4.5.1 Default actions of ``stock'' summarizers
       4.5.2 Summarizing SGML data
             4.5.2.1 Location of support files
             4.5.2.2 The SGML to SOIF table
             4.5.2.3 Errors and warnings from the SGML Parser
             4.5.2.4 Creating a summarizer for a new SGML-tagged data type
             4.5.2.5 The SGML-based HTML summarizer
             4.5.2.6 Adding META data to your HTML
             4.5.2.7 Other examples
       4.5.3 Customizing the type recognition, candidate selection,
             presentation unnesting, and summarizing steps
             4.5.3.1 Customizing the type recognition step
             4.5.3.2 Customizing the candidate selection step
             4.5.3.3 Customizing the presentation unnesting step
             4.5.3.4 Customizing the summarizing step
   4.6 Post-Summarizing: Rule-based tuning of object summaries
       4.6.1 The Rules file
       4.6.2 Rewriting URLs
   4.7 Gatherer administration
       4.7.1 Setting variables in the Gatherer configuration file
       4.7.2 Local file system gathering for reduced CPU load
       4.7.3 Gathering from password-protected servers
       4.7.4 Controlling access to the Gatherer's database
       4.7.5 Periodic gathering and realtime updates
       4.7.6 The local disk cache
       4.7.7 Incorporating manually generated information into a Gatherer
   4.8 Troubleshooting
5. The Broker
   5.1 Overview
   5.2 Basic setup
   5.3 Querying a Broker
       5.3.1 Example queries
       5.3.2 Regular expressions
       5.3.3 Query options selected by menus or buttons
       5.3.4 Filtering query results
       5.3.5 Result set presentation
   5.4 Customizing the Broker's Query Result Set
       5.4.1 The search.cf configuration file
             5.4.1.1 Defined Variables
             5.4.1.2 List of Definitions
       5.4.2 Example search.cf customization file
       5.4.3 Integrating your customized configuration file
       5.4.4 Displaying SOIF attributes in results
   5.5 World Wide Web interface description
       5.5.1 HTML files for graphical user interface
       5.5.2 CGI programs
       5.5.3 Help files for the user
   5.6 Administrating a Broker
       5.6.1 Deleting unwanted Broker objects
       5.6.2 Command-line Administration
   5.7 Tuning Glimpse indexing in the Broker
       5.7.1 The glimpseserver program
   5.8 Using different index/search engines with the Broker
       5.8.1 Using Swish as an indexer
       5.8.2 Using WAIS as an indexer
   5.9 Collector interface description: Collection.conf
   5.10 Troubleshooting
6. Programs and layout of the installed Harvest software
   6.1 $HARVEST_HOME
   6.2 $HARVEST_HOME/bin
   6.3 $HARVEST_HOME/brokers
   6.4 $HARVEST_HOME/cgi-bin
   6.5 $HARVEST_HOME/gatherers
   6.6 $HARVEST_HOME/lib
   6.7 $HARVEST_HOME/lib/broker
   6.8 $HARVEST_HOME/lib/gatherer
   6.9 $HARVEST_HOME/tmp
7. The Summary Object Interchange Format (SOIF)
   7.1 Formal description of SOIF
   7.2 List of common SOIF attribute names
8. Gatherer Examples
   8.1 Example 1 - A simple Gatherer
   8.2 Example 2 - Incorporating manually generated information
   8.3 Example 3 - Customizing type recognition and candidate selection
   8.4 Example 4 - Customizing type recognition and summarizing
       8.4.1 Using regular expressions to summarize a format
       8.4.2 Using programs to summarize a format
       8.4.3 Running the example
   8.5 Example 5 - Using RootNode filters
9. History of Harvest
   9.1 History of Harvest
   9.2 History of Harvest User's Manual
______________________________________________________________________

1. Introduction to Harvest

Harvest is an integrated set of tools to gather, extract, organize, and search information across the Internet. With modest effort users can tailor Harvest to digest information in many different formats and offer custom search services on the Internet. A key goal of Harvest is to provide a flexible system that can be configured in various ways to create many types of indexes. Harvest also allows users to extract structured (attribute-value pair) information from many different information formats and build indexes that allow these attributes to be referenced during queries (e.g., searching for all documents with a certain regular expression in the title field). An important advantage of Harvest is that it allows users to build indexes using either manually constructed templates (for maximum control over index content), automatically constructed templates (for easy coverage of large collections), or a hybrid of the two methods. Harvest is designed to make it easy to distribute the search system across a pool of networked machines to handle higher load.

1.1. Copyright

The core of Harvest is licensed under the GPL <../../COPYING>. Additional components distributed with Harvest are also under the GPL or a similar license. Glimpse, the current default fulltext indexer, has a different license. Here is a clarification of Glimpse's copyright status <../glimpse-license-status>, kindly posted by Golda Velez to comp.infosystems.harvest.
1.2. Online Harvest Resources

This manual is available at harvest.sourceforge.net/harvest/doc/html/manual.html. More information about Harvest is available at harvest.sourceforge.net.

2. Subsystem Overview

Harvest consists of several subsystems. The Gatherer subsystem collects indexing information (such as keywords, author names, and titles) from the resources available at Provider sites (such as FTP and HTTP servers). The Broker subsystem retrieves indexing information from one or more Gatherers, suppresses duplicate information, incrementally indexes the collected information, and provides a WWW query interface to it.

[Figure: Harvest Software Components]

You should start using Harvest simply, by installing a single ``stock'' (i.e., not customized) Gatherer and Broker on one machine to index some of the FTP, World Wide Web, and NetNews data at your site. After you get the system working in this basic configuration, you can invest additional effort as warranted. First, as you scale up to index larger volumes of information, you can reduce the CPU and network load of indexing your data by distributing the gathering process. Second, you can customize how Harvest extracts, indexes, and searches your information, to better match the types of data you have and the ways your users would like to interact with the data. We discuss how to distribute the gathering process in the next subsection. We cover various forms of customization in Section ``Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps'' and in several parts of Section ``The Broker''.

2.1. Distributing the Gathering and Brokering Processes

Harvest Gatherers and Brokers can be configured in various ways. Running a Gatherer remotely from a Provider site allows Harvest to interoperate with sites that are not running Harvest Gatherers, by using standard object retrieval protocols like FTP, Gopher, HTTP, and NNTP. However, as suggested by the bold lines in the left side of Figure ``2'', this arrangement results in excess server and network load. Running a Gatherer locally is much more efficient, as shown in the right side of Figure ``2''. Nonetheless, running a Gatherer remotely is still better than having many sites independently collect indexing information, since many Brokers or other search services can share the indexing information that the Gatherer collects. If you have a number of FTP/HTTP/Gopher/NNTP servers at your site, it is most efficient to run a Gatherer on each machine where these servers run. On the other hand, you can reduce installation effort by running a Gatherer on just one machine at your site and letting it retrieve data across the network.

[Figure: Harvest Configuration Options]

Figure ``2'' also illustrates that a Broker can collect information from many Gatherers (to build an index of widely distributed information). Brokers can also retrieve information from other Brokers, in effect cascading indexed views from one another. Brokers retrieve this information using the query interface, allowing them to filter or refine the information from one Broker to the next.

3. Installing the Harvest Software

3.1. Requirements for Harvest Servers

3.1.1. Hardware

A good machine for running a typical Harvest server will have a reasonably fast processor, 1-2 GB of free disk, and 128 MB of RAM.
A slower CPU will work, but it will slow down the Harvest server. More important than CPU speed, however, is memory size. Harvest uses a number of processes, some of which provide needed ``plumbing'' (e.g., search.cgi), and some of which improve performance (e.g., the glimpseserver process). If you do not have enough memory, your system will page too much and drastically reduce performance. The other factor affecting RAM usage is how much data you are trying to index in a Harvest Broker. The more data you index, the more disk I/O will be performed at query time, and the more RAM it will take to provide a reasonably sized disk buffer pool.

The amount of disk you'll need depends on how much data you want to index in a single Broker. (It is possible to distribute your index over multiple Brokers if it gets too large for one disk.) A good rule of thumb is that you will need about 10% as much disk to hold the Gatherer and Broker databases as the total size of the data you want to index. The actual space needs will vary depending on the type of data you are indexing. For example, PostScript achieves a much higher indexing space reduction than HTML, because so much of the PostScript data (such as page positioning information) is discarded when building the index.

3.1.2. Platforms

To run a Harvest server, you need a UNIX-like operating system.

3.1.3. Software

To use Harvest, you need the following software packages:

   o  All Harvest servers require: Perl v5.0 or higher.
   o  The Harvest Broker and Gatherer require: GNU gzip v1.2.4 or higher.
   o  The Harvest Broker requires: an HTTP server.

To build Harvest from the source distribution you may need to install one or more of the following software packages:

   o  Compiling Harvest requires: GNU gcc v2.5.8 or higher.
   o  Compiling the Harvest Broker requires: flex v2.4.7 or higher and bison v1.22 or higher.

The sources for gcc, gzip, flex, and bison are available at the GNU FTP server.

3.2. Requirements for Harvest Users

Anyone with a web browser (e.g., Internet Explorer, Lynx, Mozilla, Netscape, Opera, etc.) can access and use Harvest servers.

3.3. Retrieving and Installing the Harvest Software

3.3.1. Distribution types

Currently we offer only a source distribution of Harvest. The source distribution contains all of the source code for the Harvest software. There are no binary distributions of Harvest. You can retrieve the Harvest source distributions from the Harvest download site prdownloads.sourceforge.net/harvest/.

3.3.2. Harvest components

Harvest components are in the components directory. To use a component, follow the instructions included in the desired component directory.

3.3.3. User-contributed software

There is a collection of unsupported user-contributed software in the contrib directory. If you would like to contribute some software, please send email to lee@arco.de.

3.4. Building the Source Distribution

The source distribution can be extracted in any directory. The following command will extract the gnu-zipped source archive:

   % gzip -dc harvest-x.y.z.tar.gz | tar xf -

For archives compressed with bzip2, use:

   % bzip2 -dc harvest-x.y.z.tar.bz2 | tar xf -

Harvest uses GNU's autoconf package to perform needed configuration at installation time.
If you want to override the default installation location of /usr/local/harvest, change the ``prefix'' variable when invoking ``configure''. If desired, you may edit src/common/include/config.h before compiling to change various Harvest compile-time limits and variables. To compile the source tree, type make. For example, to build and install the entire Harvest system into the /usr/local/harvest directory, type:

   % ./configure
   % make
   % make install

You may see some compiler warning messages, which you can ignore. Building the entire Harvest distribution takes a few minutes on a reasonably fast machine. The compiled source tree takes approximately 25 megabytes of disk space. Later, once the installed software is working, you can remove the compiled code (``.o'' files) and other intermediate files by typing make clean. If you want to remove the configure-generated Makefiles, type make distclean.

3.5. Additional installation for the Harvest Broker

3.5.1. Checking the installation for HTTP access

The Broker interacts with your HTTP server in a number of ways. You should make sure that the HTTP server can properly access the files it needs. In many cases, the HTTP server will run under a different userid than the owner of the Harvest files. First, make sure the HTTP server userid can read the query.html files in each broker directory. Second, make sure the HTTP server userid can access and execute the CGI programs in $HARVEST_HOME/cgi-bin/. The search.cgi script reads files from the $HARVEST_HOME/cgi-bin/lib/ directory, so check that as well. Finally, check the files in $HARVEST_HOME/lib/. Some of the CGI Perl scripts require ``include'' files in this directory.

3.5.2. Required modifications to your HTTP server

The Harvest Broker requires that an HTTP server is running, and that the HTTP server ``knows'' about the Broker's files. Below are some examples of how to configure various HTTP servers to work with the Harvest Broker.

3.5.3. Apache httpd

Requires a ScriptAlias and an Alias entry in httpd.conf, e.g.:

   ScriptAlias /Harvest/cgi-bin/ Your-HARVEST_HOME/cgi-bin/
   Alias       /Harvest/         Your-HARVEST_HOME/

WARNING: The ScriptAlias entry must appear before the Alias entry.

Additionally, it might be necessary to configure Apache httpd to follow symbolic links. To do this, add the following to your httpd.conf:

   Options FollowSymLinks
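For concreteness, a fuller httpd.conf fragment along these lines might look as follows. This is only a sketch: it assumes Harvest is installed in the default /usr/local/harvest, and the exact access-control directives depend on your Apache version:

   # Harvest CGI programs (the ScriptAlias must come before the Alias)
   ScriptAlias /Harvest/cgi-bin/ /usr/local/harvest/cgi-bin/
   Alias       /Harvest/         /usr/local/harvest/

   # Allow the HTTP server userid to read the Harvest files and to
   # follow the symbolic links that Harvest uses
   <Directory /usr/local/harvest>
       Options FollowSymLinks
       Order allow,deny
       Allow from all
   </Directory>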
3.5.4. Other HTTP servers

Install the HTTP server and modify its configuration file so that the /Harvest directory points to $HARVEST_HOME. You will also need to configure your HTTP server so that it knows that the directory /Harvest/cgi-bin contains valid CGI programs. If the default behaviour of your HTTP server is not to follow symbolic links, you will need to configure it so that it will follow symbolic links in the /Harvest directory.

3.6. Upgrading versions of the Harvest software

3.6.1. Upgrading from version 1.6 to version 1.8

You cannot install version 1.8 on top of version 1.6. For example, the change from version 1.6 to version 1.8 included some reorganization of the executables, and hence simply installing version 1.8 on top of version 1.6 would cause you to use old executables in some cases. To upgrade from Harvest version 1.6 to 1.8, do:

1. Move your old installation to a temporary location.
2. Install the new version as directed by the release notes.
3. Then, for each Gatherer and Broker that you were running under the old installation, migrate the server into the new installation.

   Gatherers: you need to move the Gatherer's directory into $HARVEST_HOME/gatherers. Section ``RootNode specifications'' describes the Gatherer workload specifications if you want to modify your Gatherer's configuration file.

   Brokers: rebuild your Broker by using CreateBroker and merge in any customizations you have made to your old Broker.

3.6.2. Upgrading from version 1.5 to version 1.6

There are no known incompatibilities between versions 1.5 and 1.6.

3.6.3. Upgrading from version 1.4 to version 1.5

You cannot install version 1.5 on top of version 1.4. For example, the change from version 1.4 to version 1.5 included some reorganization of the executables, and hence simply installing version 1.5 on top of version 1.4 would cause you to use old executables in some cases. To upgrade from Harvest version 1.4 to 1.5, do:

1. Move your old installation to a temporary location.
2. Install the new version as directed by the release notes.
3. Then, for each Gatherer and Broker that you were running under the old installation, migrate the server into the new installation.

   Gatherers: you need to move the Gatherer's directory into $HARVEST_HOME/gatherers. Section ``RootNode specifications'' describes the Gatherer workload specifications if you want to modify your Gatherer's configuration file.

   Brokers: you need to move the Broker's directory into $HARVEST_HOME/brokers. Remove any .glimpse_* files from your Broker's directory and use the admin.html interface to force a full index. You may want, however, to rebuild your Broker by using CreateBroker so that you can use the updated query.html and related files.

3.6.4. Upgrading from version 1.3 to version 1.4

There are no known incompatibilities between versions 1.3 and 1.4.

3.6.5. Upgrading from version 1.2 to version 1.3

Version 1.3 is mostly backwards compatible with 1.2, with the following exception: Harvest 1.3 uses Glimpse 3.0. The .glimpse_* files in the broker directory created with Harvest 1.2 (Glimpse 2.0) are incompatible. After installing Harvest 1.3 you should:

1. Shut down any running brokers.
2. Execute rm .glimpse_* in each broker directory.
3. Restart your brokers with RunBroker.
4. Force a full index from the admin.html interface.

3.6.6. Upgrading from version 1.1 to version 1.2

There are a few incompatibilities between Harvest version 1.1 and version 1.2.

   o  The Gatherer has improved incremental gathering support which is incompatible with version 1.1.
      To update your existing Gatherer, change into the Gatherer's Data-Directory (usually the data subdirectory), and run the following commands:

         % set path = ($HARVEST_HOME/lib/gatherer $path)
         % cd data
         % rm -f INDEX.gdbm
         % mkindex

      This should create the INDEX.gdbm and MD5.gdbm files in the current directory.

   o  The Broker has a new log format for the admin/LOG file which is incompatible with version 1.1.

3.6.7. Upgrading to version 1.1 from version 1.0 or older

If you already have an older version of Harvest installed and want to upgrade, you cannot unpack the new distribution on top of the old one. For example, the change from version 1.0 to version 1.1 included some reorganization of the executables, and hence simply installing version 1.1 on top of version 1.0 would cause you to use old executables in some cases. On the other hand, you may not want to start over from scratch with a new software version, as that would not take advantage of the data you have already gathered and indexed. Instead, to upgrade from Harvest version 1.0 to 1.1, do the following:

1. Move your old installation to a temporary location.
2. Install the new version as directed by the release notes.
3. Then, for each Gatherer and Broker that you were running under the old installation, migrate the server into the new installation.

   Gatherers: you need to move the Gatherer's directory into $HARVEST_HOME/gatherers. Section ``RootNode specifications'' describes the new Gatherer workload specifications which were introduced in version 1.1; you may modify your Gatherer's configuration file to employ this new functionality.

   Brokers: you need to move the Broker's directory into $HARVEST_HOME/brokers. You may want, however, to rebuild your Broker by using CreateBroker so that you can use the updated query.html and related files.

3.7. Starting up the system: RunHarvest and related commands

The simplest way to start the Harvest system is to use the RunHarvest command. RunHarvest prompts the user with a short list of questions about what data to index, etc., and then creates and runs a Gatherer and Broker with a ``stock'' (non-customized) set of content extraction and indexing mechanisms. Some more primitive commands are also available for starting individual Gatherers and Brokers (e.g., if you want to distribute the gathering process). The Harvest startup commands are:

   RunHarvest
      Checks that the Harvest software is installed correctly, prompts the user for basic configuration information, and then creates and runs a Gatherer and a Broker. If you have $HARVEST_HOME set, then it will use it; otherwise, it tries to determine $HARVEST_HOME automatically. Found in the $HARVEST_HOME directory.

   RunBroker
      Runs a Broker. Found in the Broker's directory.

   RunGatherer
      Runs a Gatherer. Found in the Gatherer's directory.

   CreateBroker
      Creates a single Broker which will collect its information from other existing Brokers or Gatherers. Used by RunHarvest, or can be run by a user to create a new Broker. Uses $HARVEST_HOME, and defaults to /usr/local/harvest. Found in the $HARVEST_HOME/bin directory.
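Once a Gatherer and a Broker have been created, many sites restart them after a reboot. The following is only a sketch of such a startup script: the gatherers/sample and brokers/sample directory names are hypothetical, and RunGatherd and RunBroker are the scripts found in each server's directory (see Section ``Programs and layout of the installed Harvest software''):

   #!/bin/sh
   # Example sketch: restart Harvest servers after a reboot.
   HARVEST_HOME=/usr/local/harvest; export HARVEST_HOME

   # Re-export the existing Gatherer database without re-gathering.
   (cd $HARVEST_HOME/gatherers/sample && ./RunGatherd)

   # Restart the Broker and its query interface.
   (cd $HARVEST_HOME/brokers/sample && ./RunBroker)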
There is no CreateGatherer command, but the RunHarvest command can create a Gatherer, or you can create a Gatherer manually (see Section ``Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps'' or Section ``Gatherer Examples''). The layout of the installed Harvest directories and programs is discussed in Section ``Programs and layout of the installed Harvest software''.

Among other things, the RunHarvest command asks the user what port numbers to use when running the Gatherer and the Broker. By default, the Gatherer will use port 8500 and the Broker will use the Gatherer port plus 1. The choice of port numbers depends on your particular machine -- you need to choose ports that are not in use by other servers on your machine. You might look at your /etc/services file to see what ports are in use (although this file only lists some servers; other servers use ports without registering that information anywhere). Usually the above port numbers will not be in use by other processes. Probably the easiest thing is simply to try the default port numbers and see if they work.

The remainder of this manual provides information for users who wish to customize or otherwise make more sophisticated use of Harvest than what happens when you install the system and run RunHarvest.

3.8. Harvest team contact information

If you have questions about the Harvest system or problems with the software, post a note to the USENET newsgroup comp.infosystems.harvest. Please note your machine type, operating system type, and Harvest version number in your correspondence. If you have bug fixes, ports to new platforms or other software improvements, please email them to the Harvest maintainer lee@arco.de.

4. The Gatherer

4.1. Overview

The Gatherer retrieves information resources using a variety of standard access methods (FTP, Gopher, HTTP, NNTP, and local files), and then summarizes those resources in various type-specific ways to generate structured indexing information. For example, a Gatherer can retrieve a technical report from an FTP archive, and then extract the author, title, and abstract from the paper to summarize the technical report. Harvest Brokers or other search services can then retrieve the indexing information from the Gatherer to use in a searchable index available via a WWW interface.

The Gatherer consists of a number of separate components. The Gatherer program reads a Gatherer configuration file and controls the overall process of enumerating and summarizing data objects. The structured indexing information that the Gatherer collects is represented as a list of attribute-value pairs using the Summary Object Interchange Format (SOIF, see Section ``The Summary Object Interchange Format (SOIF)''). The gatherd daemon serves the Gatherer database to Brokers. It hangs around, in the background, after a gathering session is complete. The stand-alone gather program is a client for the gatherd server. It can be used from the command line for testing, and is used by the Broker. The Gatherer uses a local disk cache to store objects it has retrieved. The disk cache is described in Section ``The local disk cache''. Even though the gatherd daemon remains in the background, a Gatherer does not automatically update or refresh its summary objects. Each object in a Gatherer has a Time-to-Live value. Objects remain in the database until they expire.
See Section ``Periodic gathering and realtime updates'' for more information on keeping Gatherer objects up to date. Several example Gatherers are provided with the Harvest software distribution (see Section ``Gatherer Examples'').

4.2. Basic setup

To run a basic Gatherer, you need only list the Uniform Resource Locators (URLs, see RFC1630 and RFC1738) from which it will gather indexing information. This list is specified in the Gatherer configuration file, along with other optional information such as the Gatherer's name and the directory in which it resides (see Section ``Setting variables in the Gatherer configuration file'' for details on the optional information). Below is an example Gatherer configuration file:

   #
   #  sample.cf - Sample Gatherer Configuration File
   #
   Gatherer-Name:  My Sample Harvest Gatherer
   Gatherer-Port:  8500
   Top-Directory:  /usr/local/harvest/gatherers/sample

   <RootNodes>
   # Enter URLs for RootNodes here
   http://www.mozilla.org/
   http://www.xfree86.org/
   </RootNodes>

   <LeafNodes>
   # Enter URLs for LeafNodes here
   http://www.arco.de/~kj/index.html
   </LeafNodes>

As shown in the example configuration file, you may classify a URL as a RootNode or a LeafNode. For a LeafNode URL, the Gatherer simply retrieves the URL and processes it. LeafNode URLs are typically files like PostScript papers or compressed ``tar'' distributions. For a RootNode URL, the Gatherer will expand it into zero or more LeafNode URLs by recursively enumerating it in an access method-specific way. For FTP or Gopher, the Gatherer will perform a recursive directory listing on the FTP or Gopher server to expand the RootNode (typically a directory name). For HTTP, a RootNode URL is expanded by following the embedded HTML links to other URLs. For News, the enumeration returns all the messages in the specified USENET newsgroup.

PLEASE BE CAREFUL when specifying RootNodes, as it is possible to specify an enormous amount of work with a single RootNode URL. To help prevent a misconfigured Gatherer from abusing servers or running wildly, by default the Gatherer will only expand a RootNode into 250 LeafNodes, and will only include HTML links that point to documents that reside on the same server as the original RootNode URL. There are several options that allow you to change these limits and otherwise enhance the Gatherer specification. See Section ``RootNode specifications'' for details.

The Gatherer is a ``robot'' and collects URLs starting from the URLs specified in RootNodes. It obeys the robots.txt convention and the robots META tag. It is also HTTP Version 1.1 compliant and sends the User-Agent and From request fields to HTTP servers for accountability.

After you have written the Gatherer configuration file, create a directory for the Gatherer and copy the configuration file there. Then run the Gatherer program with the configuration file as the only command-line argument, as shown below:

   % Gatherer GathName.cf

The Gatherer will generate a database of the content summaries, a log file (log.gatherer), and an error log file (log.errors). It will also start the gatherd daemon, which exports the indexing information automatically to Brokers and other clients.
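Putting the preceding steps together, a first run might look like the following shell session. The directory and file names here (gatherers/sample, sample.cf) are placeholders for whatever you chose, and the example assumes the Harvest programs are on your PATH:

   % mkdir -p /usr/local/harvest/gatherers/sample
   % cp sample.cf /usr/local/harvest/gatherers/sample
   % cd /usr/local/harvest/gatherers/sample
   % Gatherer sample.cf
   % tail log.errors        # check for retrieval or summarizing problems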
To view the exported indexing information, you can use the gather client program, as shown below:

   % gather localhost 8500 | more

The -info option causes the Gatherer to respond only with the Gatherer summary information, which consists of the attributes available in the specified Gatherer's database, the Gatherer's host and name, the range of object update times, and the number of objects. Compression is the default, but can be disabled with the -nocompress option. The optional timestamp argument tells the Gatherer to send only the objects that have changed since the specified timestamp (in seconds since the UNIX ``epoch'' of January 1, 1970).

4.2.1. Gathering News URLs with NNTP

News URLs are somewhat different from the other access protocols because the URL generally does not contain a hostname. The Gatherer retrieves News URLs from an NNTP server. The name of this server must be placed in the environment variable $NNTPSERVER. It is probably a good idea to add this to your RunGatherer script. If the environment variable is not set, the Gatherer attempts to connect to a host named news at your site.

4.2.2. Cleaning out a Gatherer

Remember that the Gatherer databases persist between runs. Objects remain in the databases until they expire. When experimenting with the Gatherer, it is always a good idea to ``clean out'' the databases between runs. This is most easily accomplished by executing this command from the Gatherer directory:

   % rm -rf data tmp log.*
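To follow the advice above about $NNTPSERVER, note that a RunGatherer script is normally just a small shell wrapper around the Gatherer program. A sketch of such a wrapper, with placeholder paths and news host, might be:

   #!/bin/sh
   # RunGatherer - example sketch of a Gatherer wrapper script
   HARVEST_HOME=/usr/local/harvest; export HARVEST_HOME

   # NNTP server used for news: URLs (see ``Gathering News URLs with NNTP'')
   NNTPSERVER=news.example.com; export NNTPSERVER

   PATH=$HARVEST_HOME/bin:$HARVEST_HOME/lib/gatherer:$PATH; export PATH

   cd $HARVEST_HOME/gatherers/sample
   exec Gatherer sample.cf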
4.3. RootNode specifications

The RootNode specification facility described in Section ``Basic setup'' provides a basic set of default enumeration actions for RootNodes. Often it is useful to enumerate beyond the default limits, for example, to increase the enumeration limit beyond 250 URLs, or to allow site boundaries to be crossed when enumerating HTML links. It is possible to specify these and other aspects of enumeration, using the following syntax:

   URL EnumSpec
   URL EnumSpec
   ...

where EnumSpec is on a single line (using ``\'' to escape linefeeds), with the following syntax:

   URL=URL-Max[,URL-Filter-filename] \
   Host=Host-Max[,Host-Filter-filename] \
   Access=TypeList \
   Delay=Seconds \
   Depth=Number \
   Enumeration=Enumeration-Program

The EnumSpec modifiers are all optional, and have the following meanings:

   URL-Max
      The number specified on the right hand side of the ``URL='' expression is the maximum number of LeafNode URLs to generate at all levels of depth, from the current URL. Note that URL-Max is the maximum number of URLs that are generated during the enumeration, and not a limit on how many URLs can pass through the candidate selection phase (see Section ``Customizing the candidate selection step'').

   URL-Filter-filename
      This is the name of a file containing a set of regular expression filters (see Section ``RootNode filters'') to allow or deny particular LeafNodes in the enumeration. The default filter is $HARVEST_HOME/lib/gatherer/URL-filter-default, which excludes many image and sound files.

   Host-Max
      The number specified on the right hand side of the ``Host='' expression is the maximum number of hosts that will be touched during the RootNode enumeration. The enumeration actually counts hosts by IP address so that aliased hosts are properly enumerated. Note that this does not work correctly for multi-homed hosts, or for hosts with rotating DNS entries (used by some sites for load balancing heavily accessed servers). Note: Prior to Harvest Version 1.2 the ``Host=...'' line was called ``Site=...''. We changed the name to ``Host='' because it is more intuitively meaningful (being a host count limit, not a site count limit). For backwards compatibility with older Gatherer configuration files, we will continue to treat ``Site='' as an alias for ``Host=''.

   Host-Filter-filename
      This is the name of a file containing a set of regular expression filters to allow or deny particular hosts in the enumeration. Each expression can specify both a host name (or IP address) and a port number (in case you have multiple servers running on different ports of the same host and you want to index only one). The syntax is ``hostname:port''.

   Access
      If the RootNode is an HTTP URL, then you can specify which access methods to enumerate across. Valid access method types are: FILE, FTP, Gopher, HTTP, News, Telnet, or WAIS. Use a ``|'' character between type names to allow multiple access methods. For example, ``Access=HTTP|FTP|Gopher'' will follow HTTP, FTP, and Gopher URLs while enumerating an HTTP RootNode URL. Note: We do not support cross-method enumeration from Gopher, because of the difficulty of ensuring that Gopher pointers do not cross site boundaries. For example, the Gopher URL gopher://powell.cs.colorado.edu:7005/1ftp%3aftp.cs.washington.edu%40pub/ would get an FTP directory listing of ftp.cs.washington.edu:/pub, even though the host part of the URL is powell.cs.colorado.edu.

   Delay
      This is the number of seconds to wait between server contacts. It defaults to one second when not specified otherwise. Delay=3 will let the Gatherer sleep 3 seconds between server contacts.

   Depth
      This is the maximum number of levels of enumeration that will be followed during gathering. Depth=0 means that there is no limit to the depth of the enumeration. Depth=1 means the specified URL will be retrieved, and all the URLs referenced by the specified URL will be retrieved; and so on for higher Depth values. In other words, the enumeration will follow links up to Depth steps away from the specified URL.

   Enumeration-Program
      This modifier adds a very flexible way to control a Gatherer. The Enumeration-Program is a filter which reads URLs as input and writes new enumeration parameters on output. See Section ``Generic Enumeration program description'' for specific details.

By default, URL-Max defaults to 250, URL-Filter defaults to no limit, Host-Max defaults to 1, Host-Filter defaults to no limit, Access defaults to HTTP only, Delay defaults to 1 second, and Depth defaults to zero. There is no way to specify an unlimited value for URL-Max or Host-Max.
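For instance, a RootNode entry combining several of these modifiers might look like the following; the URL and the two filter file names are placeholders:

   http://www.example.com/ URL=1000,MyURLFilter \
        Host=10,MyHostFilter Access=HTTP|FTP Delay=5 Depth=3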
4.3.1. RootNode filters

Filter files use the standard UNIX regular expression syntax (as defined by the POSIX standard), not the csh ``globbing'' syntax. For example, you would use ``.*abc'' to indicate any string ending with ``abc'', not ``*abc''. A filter file has the following syntax:

   Deny  regex
   Allow regex

The URL-Filter regular expressions are matched only on the URL-path portion of each URL (the scheme, hostname and port are excluded). For example, the following URL-Filter file would allow all URLs except those containing the regular expression ``/gatherers/'':

   Deny  /gatherers/
   Allow .

Another common use of URL-Filters is to prevent the Gatherer from travelling ``up'' a directory. Automatically generated HTML pages for HTTP and FTP directories often contain a link to the parent directory ``..''. To keep the Gatherer below a specific directory, use a URL-Filter file such as:

   Allow ^/my/cool/stuff/
   Deny  .

The Host-Filter regular expressions are matched on the ``hostname:port'' portion of each URL. Because the port is included, you cannot use ``$'' to anchor the end of a hostname. Beginning with version 1.3, IP addresses may be specified in place of hostnames. A class B address such as 128.138.0.0 would be written as ``^128\.138\..*'' in regular expression syntax. For example:

   Deny  bcn.boulder.co.us:8080
   Deny  bvsd.k12.co.us
   Allow ^128\.138\..*
   Deny  .

The order of the Allow and Deny entries is important, since the filters are applied sequentially from first to last. So, for example, if you list ``Allow .*'' first, no subsequent Deny expressions will be used, since this Allow filter will allow all entries.

4.3.2. Generic Enumeration program description

Flexible enumeration can be achieved by giving an Enumeration=Enumeration-Program modifier to a RootNode URL. The Enumeration-Program is a filter which takes URLs on standard input and writes new RootNode URLs on standard output. The output format is different from that used to specify a RootNode URL in a Gatherer configuration file. Each output line must have nine fields separated by spaces. These fields are:

   URL URL-Max URL-Filter-filename Host-Max Host-Filter-filename Access Delay Depth Enumeration-Program

These are the same fields as described in Section ``RootNode specifications''. Values must be given for each field. Use /dev/null to disable the URL-Filter-filename and Host-Filter-filename. Use /bin/false to disable the Enumeration-Program.
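Concretely, an Enumeration-Program is just a filter that prints one nine-field line per URL to be enumerated. The following toy sketch (not part of the distribution) turns every URL it reads into a RootNode enumerated one level deep, with default-style limits and with the filters and the nested Enumeration-Program disabled:

   #!/bin/sh
   # example-enum.sh - toy Enumeration-Program sketch.
   # Output fields: URL URL-Max URL-Filter Host-Max Host-Filter
   #                Access Delay Depth Enumeration-Program
   while read url; do
       echo "$url 250 /dev/null 1 /dev/null HTTP 1 1 /bin/false"
   done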
4.3.3. Example RootNode configuration

Below is an example RootNode configuration:

   (1)  http://harvest.cs.colorado.edu/ URL=100,MyFilter
   (2)  http://www.cs.colorado.edu/ Host=50 Delay=60
   (3)  gopher://gopher.colorado.edu/ Depth=1
   (4)  file://powell.cs.colorado.edu/home/hardy/ Depth=2
   (5)  ftp://ftp.cs.colorado.edu/pub/cs/techreports/ Depth=1
   (6)  http://harvest.cs.colorado.edu/~hardy/hotlist.html \
            Depth=1 Delay=60
   (7)  http://harvest.cs.colorado.edu/~hardy/ \
            Depth=2 Access=HTTP|FTP

Each of the above RootNodes follows a different enumeration configuration, as follows:

1. This RootNode will gather up to 100 documents that pass through the URL name filters contained within the file MyFilter.

2. This RootNode will gather the documents from up to the first 50 hosts it encounters while enumerating the specified URL, with no limit on the Depth of link enumeration. It will also wait 60 seconds between each retrieval.

3. This RootNode will gather only the documents from the top-level menu of the Gopher server at gopher.colorado.edu.

4. This RootNode will gather all documents that are in the /home/hardy directory, or that are in any subdirectory of /home/hardy.

5. This RootNode will gather only the documents that are in the /pub/cs/techreports directory which, in this case, contains some bibliographic files rather than the technical reports themselves.

6. This RootNode will gather all documents that are within 1 step of the specified RootNode URL, waiting 60 seconds between each retrieval. This is a good method by which to index your hotlist. By using an HTML file containing your ``hotlist'' pointers as this RootNode, this enumeration will gather the top-level pages of all of your hotlist pointers.

7. This RootNode will gather all documents that are at most 2 steps away from the specified RootNode URL. Furthermore, it will follow and enumerate any HTTP or FTP URLs that it encounters during enumeration.

4.3.4. Gatherer enumeration vs. candidate selection

In addition to using the URL-Filter and Host-Filter files for the RootNode specification mechanism described in Section ``RootNode specifications'', you can prevent documents from being indexed by customizing the stoplist.cf file, described in Section ``Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps''. Since these mechanisms are invoked at different times, they have different effects. The URL-Filter and Host-Filter mechanisms are invoked by the Gatherer's ``RootNode'' enumeration programs. Using these filters as stop lists can prevent unwanted objects from being retrieved across the network. This can dramatically reduce gathering time and network traffic. The stoplist.cf file is used by the Essence content extraction system (described in Section ``Extracting data for indexing: The Essence summarizing subsystem'') after the objects are retrieved, to select which objects should be content extracted and indexed. This can be useful because Essence provides a more powerful means of rejecting indexing candidates, in which you can customize based not only on file naming conventions but also on file contents (e.g., looking at strings at the beginning of a file or at UNIX ``magic'' numbers), and also by more sophisticated file-grouping schemes (e.g., deciding not to extract contents from object code files for which source code is available).

As an example of combining these mechanisms, suppose you want to index the ``.ps'' files linked into your WWW site. You could do this by having a stoplist.cf file that contains ``HTML'', and a RootNode URL-Filter that contains:

   Allow \.html
   Allow \.ps
   Deny  .*

As a final note, independent of these customizations the Gatherer attempts to avoid retrieving objects where possible, by using a local disk cache of objects, and by using the HTTP ``If-Modified-Since'' request header. The local disk cache is described in Section ``The local disk cache''.

4.4. Generating LeafNode/RootNode URLs from a program

It is possible to generate RootNode or LeafNode URLs automatically from program output. This might be useful when gathering a large number of Usenet newsgroups, for example. The program is specified inside the RootNode or LeafNode section, preceded by a pipe symbol:

   |generate-news-urls.sh

The script must output valid URLs, such as:

   news:comp.unix.voodoo
   news:rec.pets.birds
   http://www.nlanr.net/
   ...

In the case of RootNode URLs, enumeration parameters can be given after the program:

   |my-fave-sites.pl Depth=1 URL=5000,url-filter
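The distribution does not prescribe what generate-news-urls.sh contains; it is simply whatever prints the URLs you want. A sketch of such a generator, assuming a plain-text file newsgroups.txt listing one newsgroup per line, might be:

   #!/bin/sh
   # generate-news-urls.sh - print a news: URL for each newsgroup
   # listed in newsgroups.txt (one group name per line).
   while read group; do
       echo "news:$group"
   done < newsgroups.txt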
4.5. Extracting data for indexing: The Essence summarizing subsystem

After the Gatherer retrieves a document, it passes the document through a subsystem called Essence to extract indexing information. Essence allows the Gatherer to collect indexing information easily from a wide variety of information sources, using different techniques depending on the type of data and the needs of the particular corpus being indexed. In a nutshell, Essence can determine the type of data pointed to by a URL (e.g., PostScript vs. HTML), ``unravel'' presentation nesting formats (such as compressed ``tar'' files), select which types of data to index (e.g., don't index Audio files), and then apply a type-specific extraction algorithm (called a summarizer) to the data to generate a content summary. Users can customize each of these aspects, but often this is not necessary. Harvest is distributed with a ``stock'' set of type recognizers, presentation unnesters, candidate selectors, and summarizers that work well for many applications. Below we describe the stock summarizer set, the current components distribution, and how users can customize summarizers to change how they operate and add summarizers for new types of data.

If you develop a summarizer that is likely to be useful to other users, please notify us via email at lee@arco.de so we may include it in our Harvest distribution.

   Type                 Summarizer Function
   --------------------------------------------------------------------
   Bibliographic        Extract author and titles
   Binary               Extract meaningful strings and manual page summary
   C, CHeader           Extract procedure names, included file names, and
                        comments
   Dvi                  Invoke the Text summarizer on extracted ASCII text
   FAQ, FullText,
   README               Extract all words in file
   Font                 Extract comments
   HTML                 Extract anchors, hypertext links, and selected fields
   LaTex                Parse selected LaTeX fields (author, title, etc.)
   Mail                 Extract certain header fields
   Makefile             Extract comments and target names
   ManPage              Extract synopsis, author, title, etc., based on
                        ``-man'' macros
   News                 Extract certain header fields
   Object               Extract symbol table
   Patch                Extract patched file names
   Perl                 Extract procedure names and comments
   PostScript           Extract text in word processor-specific fashion, and
                        pass through Text summarizer
   RCS, SCCS            Extract revision control summary
   RTF                  Up-convert to HTML and pass through HTML summarizer
   SGML                 Extract fields named in extraction table
   ShellScript          Extract comments
   SourceDistribution   Extract full text of README file and comments from
                        Makefile and source code files, and summarize any
                        manual pages
   SymbolicLink         Extract file name, owner, and date created
   TeX                  Invoke the Text summarizer on extracted ASCII text
   Text                 Extract first 100 lines plus first sentence of each
                        remaining paragraph
   Troff                Extract author, title, etc., based on ``-man'',
                        ``-ms'', ``-me'' macro packages, or extract section
                        headers and topic sentences
   Unrecognized         Extract file name, owner, and date created

4.5.1. Default actions of ``stock'' summarizers

The table in Section ``Extracting data for indexing: The Essence summarizing subsystem'' provides a brief reference for how documents are summarized depending on their type.
These actions can be customized, as discussed in Section ``Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps''. Some summarizers are implemented as UNIX programs while others are expressed as regular expressions; see Section ``Customizing the summarizing step'' or Section ``Example 4'' for more information about how to write a summarizer.

4.5.2. Summarizing SGML data

It is possible to summarize documents that conform to the Standard Generalized Markup Language (SGML), for which you have a Document Type Definition (DTD). The World Wide Web's Hypertext Mark-up Language (HTML) is actually a particular application of SGML, with a corresponding DTD. (In fact, the Harvest HTML summarizer can use the HTML DTD and our SGML summarizing mechanism, which provides various advantages; see Section ``The SGML-based HTML summarizer''.) SGML is being used in an increasingly broad variety of applications, for example as a format for storing data for a number of physical sciences. Because SGML allows documents to contain a good deal of structure, Harvest can summarize SGML documents very effectively.

The SGML summarizer (SGML.sum) uses the sgmls program by James Clark to parse the SGML document. The parser needs both a DTD for the document and a Declaration file that describes the allowed character set. The SGML.sum program uses a table that maps SGML tags to SOIF attributes.

4.5.2.1. Location of support files

SGML support files can be found in $HARVEST_HOME/lib/gatherer/sgmls-lib/. For example, these are the default pathnames for HTML summarizing using the SGML summarizing mechanism:

   $HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/html.dtd
   $HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/HTML.decl
   $HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/HTML.sum.tbl

The location of the DTD file must be specified in the sgmls catalog ($HARVEST_HOME/lib/gatherer/sgmls-lib/catalog). For example:

   DOCTYPE HTML HTML/html.dtd

The SGML.sum program looks for the .decl file in the default location. An alternate pathname can be specified with the -d option to SGML.sum. The summarizer looks for the .sum.tbl file first in the Gatherer's lib directory and then in the default location. Both of these can be overridden with the -t option to SGML.sum.

4.5.2.2. The SGML to SOIF table

The translation table provides a simple yet powerful way to specify how an SGML document is to be summarized. There are four ways to map SGML data into SOIF. The first two are concerned with placing the content of an SGML tag into a SOIF attribute. A simple SGML-to-SOIF mapping looks like this:

   <TAG>        soif1,soif2,...

This places the content that occurs inside the tag ``TAG'' into the SOIF attributes ``soif1'' and ``soif2''. It is possible to select different SOIF attributes based on SGML attribute values. For example, if ``ATT'' is an attribute of ``TAG'', then it would be written like this:

   <TAG ATT=x>  x-stuff
   <TAG ATT=y>  y-stuff
   <TAG>        stuff

The second two mappings place values of SGML attributes into SOIF attributes.
To place the value of the ``ATT'' attribute of the ``TAG'' tag into the ``att-stuff'' SOIF attribute you would write:

   <TAG ATT>    att-stuff

It is also possible to place the value of an SGML attribute into a SOIF attribute named by a different SGML attribute:

   <TAG ATT1 ATT2>      $ATT2

When the summarizer encounters an SGML tag not listed in the table, the content is passed to the parent tag and becomes a part of the parent's content. To force the content of some tag not to be passed up, specify the SOIF attribute as ``ignore''. To force the content of some tag to be passed to the parent in addition to being placed into a SOIF attribute, list an additional SOIF attribute named ``parent''. Please see Section ``The SGML-based HTML summarizer'' for examples of these mappings.

4.5.2.3. Errors and warnings from the SGML Parser

The sgmls parser can generate an overwhelming volume of error and warning messages. This will be especially true for HTML documents found on the Internet, which often do not conform to the strict HTML DTD. By default, errors and warnings are redirected to /dev/null so that they do not clutter the Gatherer's log files. To enable logging of these messages, edit the SGML.sum Perl script and set $syntax_check = 1.

4.5.2.4. Creating a summarizer for a new SGML-tagged data type

To create an SGML summarizer for a new SGML-tagged data type with an associated DTD, you need to do the following:

1. Write a shell script named FOO.sum which simply contains:

      #!/bin/sh
      exec SGML.sum FOO $*

2. Modify the Essence configuration files (as described in Section ``Customizing the type recognition step'') so that your documents get typed as FOO.

3. Create the directory $HARVEST_HOME/lib/gatherer/sgmls-lib/FOO/ and copy your DTD and Declaration there as FOO.dtd and FOO.decl. Edit $HARVEST_HOME/lib/gatherer/sgmls-lib/catalog and add FOO.dtd to it.

4. Create the translation table FOO.sum.tbl and place it with the DTD in $HARVEST_HOME/lib/gatherer/sgmls-lib/FOO/.

At this point you can test everything from the command line as follows:

   % FOO.sum myfile.foo
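The contents of FOO.sum.tbl in step 4 depend entirely on your DTD. As a sketch, for a hypothetical DTD whose documents contain AUTHOR, TITLE, and ABSTRACT elements, the translation table might be simply:

   <AUTHOR>     author
   <TITLE>      title
   <ABSTRACT>   abstract,keywords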
4.5.2.5. The SGML-based HTML summarizer

Harvest can summarize HTML using the generic SGML summarizer described in Section ``Summarizing SGML data''. The advantage of this approach is that the summarizer is more easily customizable, and fits with the well-conceived SGML model (where you define DTDs for individual document types and build interpretation software to understand DTDs rather than individual document types). The downside is that the summarizer is now pickier about syntax, and many Web documents are not syntactically correct. Because of this pickiness, the default is for the HTML summarizer to run with syntax checking output disabled. If your documents are so badly formed that they confuse the parser, this may mean the summarizing process dies unceremoniously. If you find that some of your HTML documents do not get summarized or only get summarized in part, you can turn syntax-checking output on by setting $syntax_check = 1 in $HARVEST_HOME/lib/gatherer/SGML.sum. That will allow you to see which documents are invalid and where. Note that part of the reason for this problem is that Web browsers do not insist on well-formed documents. So, users can easily create documents that are not completely valid, yet display fine.

Below is the default SGML-to-SOIF table used by the HTML summarizer:

   HTML ELEMENT            SOIF ATTRIBUTES
   ------------            -----------------------
   <A>                     keywords,parent
   <A HREF>                url-references
   <ADDRESS>               address
   <B>                     keywords,parent
   <BODY>                  body
   <CITE>                  references
   <CODE>                  ignore
   <EM>                    keywords,parent
   <H1>                    headings
   <H2>                    headings
   <H3>                    headings
   <H4>                    headings
   <H5>                    headings
   <H6>                    headings
   <HEAD>                  head
   <I>                     keywords,parent
   <IMG SRC>               images
   <META NAME CONTENT>     $NAME
   <STRONG>                keywords,parent
   <TITLE>                 title
   <TT>                    keywords,parent
   <UL>                    keywords,parent

The pathname to this file is $HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/HTML.sum.tbl. Individual Gatherers may do customized HTML summarizing by placing a modified version of this file in the Gatherer lib directory. Another way to customize is to modify the HTML.sum script and add a -t option to the SGML.sum command. For example:

   SGML.sum -t $HARVEST_HOME/lib/my-HTML.table HTML $*

In HTML, the document title is written as:

   <TITLE>My Home Page</TITLE>

The above translation table will place this in the SOIF summary as:

   title{13}:      My Home Page

Note that ``keywords,parent'' occurs frequently in the table. For any specially marked text (bold, emphasized, hypertext links, etc.), the words will be copied into the keywords attribute and also left in the content of the parent element. This keeps the body of the text readable by not removing certain words. Any text that appears inside a pair of CODE tags will not show up in the summary because we specified ``ignore'' as the SOIF attribute.

URLs in HTML anchors are written as:

   <A HREF="http://harvest.cs.colorado.edu/">Harvest Home Page</A>

The specification for <A HREF> in the above translation table causes this to appear as:

   url-references{32}:     http://harvest.cs.colorado.edu/

4.5.2.6. Adding META data to your HTML

One of the most useful HTML tags is META. This allows the document writer to include arbitrary metadata in an HTML document. A typical usage of the META element is:

   <META NAME="author" CONTENT="Joe T. Slacker">

By specifying ``<META NAME CONTENT> $NAME'' in the translation table, this comes out as:

   author{15}:     Joe T. Slacker

Using the META tags, HTML authors can easily add a list of keywords to their documents:

   <META NAME="keywords" CONTENT="word1 word2 ...">

4.5.2.7. Other examples

A very terse HTML summarizer could be specified with a table that only puts emphasized words into the keywords attribute:

   HTML ELEMENT            SOIF ATTRIBUTES
   ------------            -----------------------
   <A>                     keywords
   <B>                     keywords
   <EM>                    keywords
   <H1>                    keywords
   <H2>                    keywords
   <H3>                    keywords
   <I>                     keywords
   <META NAME CONTENT>     $NAME
   <STRONG>                keywords
   <TITLE>                 title,keywords
   <TT>                    keywords
keywords

keywords

keywords keywords $NAME keywords title,keywords <TT> keywords Conversely, a full-text summarizer can be easily specified with only: HTML ELEMENT SOIF ATTRIBUTES ------------ ----------------------- <HTML> full-text <TITLE> title,parent 44..55..33.. CCuussttoommiizziinngg tthhee ttyyppee rreeccooggnniittiioonn,, ccaannddiiddaattee sseelleeccttiioonn,, pprreesseenn-- ttaattiioonn uunnnneessttiinngg,, aanndd ssuummmmaarriizziinngg sstteeppss The Harvest Gatherer's actions are defined by a set of configuration and utility files, and a corresponding set of executable programs referenced by some of the configuration files. If you want to customize a Gatherer, you should create _b_i_n and _l_i_b subdirectories in the directory where you are running the Gatherer, and then copy _$_H_A_R_V_E_S_T___H_O_M_E_/_l_i_b_/_g_a_t_h_e_r_e_r_/_*_._c_f and _$_H_A_R_V_E_S_T___H_O_M_E_/_l_i_b_/_g_a_t_h_e_r_e_r_/_m_a_g_i_c into your _l_i_b directory. Then add to your Gatherer configuration file: Lib-Directory: lib The details about what each of these files does are described below. The basic contents of a typical Gatherer's directory is as follows (note: some of the file names below can be changed by setting variables in the Gatherer configuration file, as described in Section ``Setting variables in the Gatherer configuration file''): RunGatherd* bin/ GathName.cf log.errors tmp/ RunGatherer* data/ lib/ log.gatherer bin: MyNewType.sum* data: All-Templates.gz INFO.soif PRODUCTION.gdbm gatherd.log INDEX.gdbm MD5.gdbm gatherd.cf lib: bycontent.cf byurl.cf quick-sum.cf byname.cf magic stoplist.cf tmp: The RunGatherd and RunGatherer are used to export the Gatherer's database after a machine reboot and to run the Gatherer, respectively. The _l_o_g_._e_r_r_o_r_s and _l_o_g_._g_a_t_h_e_r_e_r files contain error messages and the output of the _E_s_s_e_n_c_e typing step, respectively (Essence will be described shortly). The _G_a_t_h_N_a_m_e_._c_f file is the Gatherer's configuration file. The _b_i_n directory contains any summarizers and any other program needed by the summarizers. If you were to customize the Gatherer by adding a summarizer, you would place those programs in this _b_i_n directory; the MyNewType.sum is an example. The _d_a_t_a directory contains the Gatherer's database which gatherd exports. The Gatherer's database consists of the _A_l_l_-_T_e_m_p_l_a_t_e_s_._g_z_, _I_N_D_E_X_._g_d_b_m_, _I_N_F_O_._s_o_i_f_, _M_D_5_._g_d_b_m and _P_R_O_D_U_C_T_I_O_N_._g_d_b_m files. The _g_a_t_h_e_r_d_._c_f file is used to support access control as described in Section ``Controlling access to the Gatherer's database''. The _g_a_t_h_e_r_d_._l_o_g file is where the gatherd program logs its information. The _l_i_b directory contains the configuration files used by the Gatherer's subsystems, namely Essence. These files are described briefly in the following table: bycontent.cf Content parsing heuristics for type recognition step byname.cf File naming heuristics for type recognition step byurl.cf URL naming heuristics for type recognition step magic UNIX ``file'' command specifications (matched against bycontent.cf strings) quick-sum.cf Extracts attributes for summarizing step. stoplist.cf File types to reject during candidate selection 44..55..33..11.. CCuussttoommiizziinngg tthhee ttyyppee rreeccooggnniittiioonn sstteepp Essence recognizes types in three ways (in order of precedence): by URL naming heuristics, by file naming heuristics, and by locating _i_d_e_n_t_i_f_y_i_n_g data within a file using the UNIX file command. 
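As a rough illustration, the naming heuristics pair a type name with a
regular expression, one rule per line.  The entries below are
hypothetical examples rather than the stock rules; see the copies of
byname.cf and byurl.cf in $HARVEST_HOME/lib/gatherer for the
authoritative format and the full set of type names.  In lib/byname.cf
a rule might look like:

        HTML            \.html$
        PostScript      \.ps$

and in lib/byurl.cf:

        HTTP-Query      cgi-bin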
To modify the type recognition step, edit _l_i_b_/_b_y_n_a_m_e_._c_f to add file naming heuristics, or _l_i_b_/_b_y_u_r_l_._c_f to add URL naming heuristics, or _l_i_b_/_b_y_c_o_n_t_e_n_t_._c_f to add by-content heuristics. The by-content heuristics match the output of the UNIX file command, so you may also need to edit the _l_i_b_/_m_a_g_i_c file. See Section ``Example 3'' and ``Example 4'' for detailed examples on how to customize the type recognition step. 44..55..33..22.. CCuussttoommiizziinngg tthhee ccaannddiiddaattee sseelleeccttiioonn sstteepp The _l_i_b_/_s_t_o_p_l_i_s_t_._c_f configuration file contains a list of types that are rejected by Essence. You can add or delete types from _l_i_b_/_s_t_o_p_l_i_s_t_._c_f to control the candidate selection step. To direct Essence to index only certain types, you can list the types to index in _l_i_b_/_a_l_l_o_w_l_i_s_t_._c_f. Then, supply Essence with the ----aalllloowwlliisstt flag. The file and URL naming heuristics used by the type recognition step (described in Section ``Customizing the type recognition step'') are particularly useful for candidate selection when gathering remote data. They allow the Gatherer to avoid retrieving files that you don't want to index (in contrast, recognizing types by locating identifying data within a file requires that the file be retrieved first). This approach can save quite a bit of network traffic, particularly when used in combination with enumerated _R_o_o_t_N_o_d_e URLs. For example, many sites provide each of their files in both a compressed and uncompressed form. By building a _l_i_b_/_a_l_l_o_w_l_i_s_t_._c_f containing only the Compressed types, you can avoid retrieving the uncompressed versions of the files. 44..55..33..33.. CCuussttoommiizziinngg tthhee pprreesseennttaattiioonn uunnnneessttiinngg sstteepp Some types are declared as ``nested'' types. Essence treats these differently than other types, by running a presentation unnesting algorithm or ``Exploder'' on the data rather than a Summarizer. At present Essence can handle files nested in the following formats: 1. binhex 2. uuencode 3. shell archive (``shar'') 4. tape archive (``tar'') 5. bzip2 compressed (``bzip2'') 6. compressed 7. GNU compressed (``gzip'') 8. zip compressed archive To customize the presentation unnesting step you can modify the Essence source file _s_r_c_/_g_a_t_h_e_r_e_r_/_e_s_s_e_n_c_e_/_u_n_n_e_s_t_._c. This file lists the available presentation encodings, and also specifies the unnesting algorithm. Typically, an external program is used to unravel a file into one or more component files (e.g. bzip2, gunzip, uudecode, and tar). An _E_x_p_l_o_d_e_r may also be used to explode a file into a stream of SOIF objects. An Exploder program takes a URL as its first command-line argument and a file containing the data to use as its second, and then generates one or more SOIF objects as output. For your convenience, the _E_x_p_l_o_d_e_r type is already defined as a nested type. To save some time, you can use this type and its corresponding Exploder.unnest program rather than modifying the Essence code. See Section ``Example 2'' for a detailed example on writing an Exploder. The _u_n_n_e_s_t_._c file also contains further information on defining the unnesting algorithms. 44..55..33..44.. 
CCuussttoommiizziinngg tthhee ssuummmmaarriizziinngg sstteepp Essence supports two mechanisms for defining the type-specific extraction algorithms (called _S_u_m_m_a_r_i_z_e_r_s) that generate content summaries: a UNIX program that takes as its only command line argument the filename of the data to summarize, and line-based regular expressions specified in _l_i_b_/_q_u_i_c_k_-_s_u_m_._c_f. See Section ``Example 4'' for detailed examples on how to define both types of Summarizers. The UNIX Summarizers are named using the convention TypeName.sum (e.g., PostScript.sum). These Summarizers output their content summary in a SOIF attribute-value list (see Section ``The Summary Object Interchange Format (SOIF)''). You can use the wrapit command to wrap raw output into the SOIF format (i.e., to provide byte-count delimiters on the individual attribute-value pairs). There is a summarizer called FullText.sum that you can use to perform full text indexing of selected file types, by simply setting up the _l_i_b_/_b_y_c_o_n_t_e_n_t_._c_f and _l_i_b_/_b_y_n_a_m_e_._c_f configuration files to recognize the desired file types as FullText (i.e., using ``FullText'' in column 1 next to the matching regular expression). 44..66.. PPoosstt--SSuummmmaarriizziinngg:: RRuullee--bbaasseedd ttuunniinngg ooff oobbjjeecctt ssuummmmaarriieess It is possible to ``fine-tune'' the summary information generated by the Essence summarizers. A typical application of this would be to change the _T_i_m_e_-_t_o_-_L_i_v_e attribute based on some knowledge about the objects. So an administrator could use the post-summarizing feature to give quickly-changing objects a lower TTL, and very stable documents a higher TTL. Objects are selected for post-summarizing if they meet a specified condition. A condition consists of three parts: An attribute name, an operation, and some string data. For example: city == 'New York' In this case we are checking if the _c_i_t_y attribute is equal to the string `New York'. For exact string matching, the string data must be enclosed in single quotes. Regular expressions are also supported: city ~ /New York/ Negative operators are also supported: city != 'New York' city !~ /New York/ Conditions can be joined with `&&&&' (logical and) or `||||' (logical or) operators: city == 'New York' && state != 'NY'; When all conditions are met for an object, some number of instructions are executed on it. There are four types of instructions which can be specified: 1. Set an attribute exactly to some specific string. Example: time-to-live = "86400" 2. Filter an attribute through some program. The attribute value is given as input to the filter. The output of the filter becomes the new attribute value. Example: keywords | tr A-Z a-z 3. Filter multiple attributes through some program. In this case the filter must read and write attributes in the SOIF format. Example: address,city,state,zip ! cleanup-address.pl 4. A special case instruction is to delete an object. To do this, simply write: delete() 44..66..11.. TThhee RRuulleess ffiillee The conditions and instructions are combined together in a ``rules'' file. The format of this file is somewhat similar to a Makefile; conditions begin in the first column and instructions are indented by a tab-stop. Example: type == 'HTML' partial-text | cleanup-html-text.pl URL ~ /users/ time-to-live = "86400" partial-text ! 
extract-owner.sh type == 'SOIFStream' delete() This rules file is specified in the gatherer.cf file with the Post- Summarizing tag, e.g.: Post-Summarizing: lib/myrules 44..66..22.. RReewwrriittiinngg UURRLLss Until version 1.4 it was not possible to rewrite the URL-part of an object summary. It is now possible, but only by using the ``pipe'' instruction. This may be useful for people wanting to run a Gatherer on _f_i_l_e_:_/_/ URLs, but have them appear as _h_t_t_p_:_/_/ URLs. This can be done with a post-summarizing rule such as: url ~ 'file://localhost/web/htdocs/' url | fix-url.pl And the 'fix-url.pl' script might look like: #!/usr/local/bin/perl -p s'file://localhost/web/htdocs/'http://www.my.domain/'; 44..77.. GGaatthheerreerr aaddmmiinniissttrraattiioonn 44..77..11.. SSeettttiinngg vvaarriiaabblleess iinn tthhee GGaatthheerreerr ccoonnffiigguurraattiioonn ffiillee In addition to customizing the steps described in Section ``Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps'', you can customize the Gatherer by setting variables in the Gatherer configuration file. This file consists of two parts: a list of variables that specify information about the Gatherer (such as its name, host, and port number), and two lists of URLs (divided into RRoooottNNooddeess and LLeeaaffNNooddeess) from which to collect indexing information. Section ``Basic setup'' shows an example Gatherer configuration file. In this section we focus on the variables that the user can set in the first part of the Gatherer configuration file. Each variable name starts in the first column, ends with a colon, then is followed by the value. The following table shows the supported variables: Access-Delay: Default delay between URLs accesses. Data-Directory: Directory where GDBM database is written. Debug-Options: Debugging options passed to child programs. Errorlog-File: File for logging errors. Essence-Options: Any extra options to pass to Essence. FTP-Auth: Username/password for protected FTP documents. Gatherd-Inetd: Denotes that gatherd is run from inetd. Gatherer-Host: Full hostname where the Gatherer is run. Gatherer-Name: A Unique name for the Gatherer. Gatherer-Options: Extra options for the Gatherer. Gatherer-Port: Port number for gatherd. Gatherer-Version: Version string for the Gatherer. HTTP-Basic-Auth: Username/password for protected HTTP documents. HTTP-Proxy: host:port of your HTTP proxy. Keep-Cache: ``yes'' to not remove local disk cache. Lib-Directory: Directory where configuration files live. Local-Mapping: Mapping information for local gathering. Log-File: File for logging progress. Post-Summarizing: A rules-file for post-summarizing. Refresh-Rate: Object refresh-rate in seconds, default 1 week. Time-To-Live: Object time-to-live in seconds, default 1 month. Top-Directory: Top-level directory for the Gatherer. Working-Directory: Directory for tmp files and local disk cache. Notes: +o We recommend that you use the TToopp--DDiirreeccttoorryy variable, since it will set the DDaattaa--DDiirreeccttoorryy, LLiibb--DDiirreeccttoorryy, and WWoorrkkiinngg--DDiirreeccttoorryy variables. +o Both WWoorrkkiinngg--DDiirreeccttoorryy and DDaattaa--DDiirreeccttoorryy will have files in them after the Gatherer has run. The WWoorrkkiinngg--DDiirreeccttoorryy will hold the local-disk cache that the Gatherer uses to reduce network I/O, and the DDaattaa--DDiirreeccttoorryy will hold the GDBM databases that contain the content summaries. 
+o You should use full rather than relative pathnames. +o All variable definitions _m_u_s_t come before the RootNode or LeafNode URLs. +o Any line that starts with a ``#'' is a comment. +o LLooccaall--MMaappppiinngg is discussed in Section ``Local file system gathering for reduced CPU load''. +o HHTTTTPP--PPrrooxxyy will retrieve HTTP URLs via a proxy host. The syntax is hhoossttnnaammee::ppoorrtt; for example, pprrooxxyy..yyoouurrssiittee..ccoomm::33112288. +o EEsssseennccee--OOppttiioonnss is particularly useful, as it lets you customize basic aspects of the Gatherer easily. +o The only valid GGaatthheerreerr--OOppttiioonnss is ----ssaavvee--ssppaaccee which directs the Gatherer to be more space efficient when preparing its database for export. +o The Gatherer program will accept the --bbaacckkggrroouunndd flag which will cause the Gatherer to run in the background. The Essence options are: Option Meaning -------------------------------------------------------------------- --allowlist filename File with list of types to allow --fake-md5s Generates MD5s for SOIF objects from a .unnest program --fast-summarizing Trade speed for some consistency. Use only when an external summarizer is known to generate clean, unique attributes. --full-text Use entire file instead of summarizing. Alternatively, you can perform full text indexing of individual file types by using the FullText.sum summarizer. --max-deletions n Number of GDBM deletions before reorganization --minimal-bookkeeping Generates a minimal amount of bookkeeping attrs --no-access Do not read contents of objects --no-keywords Do not automatically generate keywords --stoplist filename File with list of types to remove --type-only Only type data; do not summarize objects A particular note about full text summarizing: Using the Essence ----ffuullll--tteexxtt option causes files not to be passed through the Essence content extraction mechanism. Instead, their entire content is included in the SOIF summary stream. In some cases this may produce unwanted results (e.g., it will directly include the PostScript for a document rather than first passing the data through a PostScript to text extractor, providing few searchable terms and large SOIF objects). Using the individual file type summarizing mechanism described in Section ``Customizing the summarizing step'' will work better in this regard, but will require you to specify how data are extracted for each individual file type. In a future version of Harvest we will change the Essence ----ffuullll--tteexxtt option to perform content extraction before including the full text of documents. 44..77..22.. LLooccaall ffiillee ssyysstteemm ggaatthheerriinngg ffoorr rreedduucceedd CCPPUU llooaadd Although the Gatherer's work load is specified using URLs, often the files being gathered are located on a local file system. In this case it is much more efficient to gather directly from the local file system than via FTP/Gopher/HTTP/News, primarily because of all the UNIX forking required to gather information via these network processes. For example, our measurements indicate it causes from 4-7x more CPU load to gather from FTP than directly from the local file system. For large collections (e.g., archive sites containing many thousands of files), the CPU savings can be considerable. 
Starting with Harvest Version 1.1, it is possible to tell the Gatherer how to translate URLs to local file system names, using the LLooccaall-- MMaappppiinngg Gatherer configuration file variable (see Section ``Setting variables in the Gatherer configuration file''). The syntax is: Local-Mapping: URL_prefix local_path_prefix This causes all URLs starting with UURRLL__pprreeffiixx to be translated to files starting with the prefix llooccaall__ppaatthh__pprreeffiixx while gathering, but to be left as URLs in the results of queries (so the objects can be retrieved as usual). Note that no regular expressions are supported here. As an example, the specification Local-Mapping: http://harvest.cs.colorado.edu/~hardy/ /homes/hardy/public_html/ Local-Mapping: ftp://ftp.cs.colorado.edu/pub/cs/ /cs/ftp/ would cause the URL _h_t_t_p_:_/_/_h_a_r_v_e_s_t_._c_s_._c_o_l_o_r_a_d_o_._e_d_u_/_~_h_a_r_d_y_/_H_o_m_e_._h_t_m_l to be translated to the local file name _/_h_o_m_e_s_/_h_a_r_d_y_/_p_u_b_l_i_c___h_t_m_l_/_H_o_m_e_._h_t_m_l, while the URL _f_t_p_:_/_/_f_t_p_._c_s_._c_o_l_o_r_a_d_o_._e_d_u_/_p_u_b_/_c_s_/_t_e_c_h_r_e_p_o_r_t_s_/_s_c_h_w_a_r_t_z_/_H_a_r_v_e_s_t_._C_o_n_f_._p_s_._Z would be translated to the local file name _/_c_s_/_f_t_p_/_t_e_c_h_r_e_p_o_r_t_s_/_s_c_h_w_a_r_t_z_/_H_a_r_v_e_s_t_._C_o_n_f_._p_s_._Z. Local gathering will work over NFS file systems. A local mapping will fail if: the local file cannot be opened for reading; or the local file is not a regular file; or the local file has execute bits set. So, for directories, symbolic links and CGI scripts, the server is always contacted rather than the local file system. Lastly, the Gatherer does not perform any URL syntax translations for local mappings. If your URL has characters that should be escaped (as in RFC1738), then the local mapping will fail. Starting with version 1.4 patchlevel 2 Essence will print _[_L_] after URLs which were successfully accessed locally. Note that if your network is highly congested, it may actually be faster to gather via HTTP/FTP/Gopher than via NFS, because NFS becomes very inefficient in highly congested situations. Even better would be to run local Gatherers on the hosts where the disks reside, and access them directly via the local file system. 44..77..33.. GGaatthheerriinngg ffrroomm ppaasssswwoorrdd--pprrootteecctteedd sseerrvveerrss You can gather password-protected documents from HTTP and FTP servers. In both cases, you can specify a username and password as a part of the URL. The format is as follows: ftp://user:password@host:port/url-path http://user:password@host:port/url-path With this format, the ``user:password'' part is kept as a part of the URL string all throughout Harvest. This may enable anyone who uses your Broker(s) to access password-protected documents. You can keep the username and password information ``hidden'' by specifying the authentication information in the Gatherer configuration file. For HTTP, the format is as follows: HTTP-Basic-Auth: realm username password where rreeaallmm is the same as the AAuutthhNNaammee parameter given in an Apache httpd _h_t_t_p_d_._c_o_n_f or _._h_t_a_c_c_e_s_s file. In other httpd server configuration, the realm value is sometimes called SSeerrvveerrIIdd. For FTP, the format in the gatherer.cf file is FTP-Auth: hostname[:port] username password 44..77..44.. 
CCoonnttrroolllliinngg aacccceessss ttoo tthhee GGaatthheerreerr''ss ddaattaabbaassee You can use the _g_a_t_h_e_r_d_._c_f file (placed in the DDaattaa--DDiirreeccttoorryy of a Gatherer) to control access to the Gatherer's database. A line that begins with AAllllooww is followed by any number of domain or host names that are allowed to connect to the Gatherer. If the word aallll is used, then all hosts are matched. DDeennyy is the opposite of AAllllooww. The following example will only allow hosts in the ccss..ccoolloorraaddoo..eedduu or uusscc..eedduu domain access the Gatherer's database: Allow cs.colorado.edu usc.edu Deny all 44..77..55.. PPeerriiooddiicc ggaatthheerriinngg aanndd rreeaallttiimmee uuppddaatteess The Gatherer program does not automatically do any periodic updates -- when you run it, it processes the specified URLs, starts up a gatherd daemon (if one isn't already running), and then exits. If you want to update the data periodically (e.g., to capture new files as they are added to an FTP archive), you need to use the UNIX cron command to run the Gatherer program at some regular interval. To set up periodic gathering via cron, use the RunGatherer command that RunHarvest will create. An example RunGatherer script follows: #!/bin/sh # # RunGatherer - Runs the ATT 800 Gatherer (from cron) # HARVEST_HOME=/usr/local/harvest; export HARVEST_HOME PATH=${HARVEST_HOME}/bin:${HARVEST_HOME}/lib/gatherer:${HARVEST_HOME}/lib:$PATH export PATH NNTPSERVER=localhost; export NNTPSERVER cd /usr/local/harvest/gatherers/att800 exec Gatherer "att800.cf" You should run the RunGatherd command from your system startup (e.g. _/_e_t_c_/_r_c_._l_o_c_a_l) file, so the Gatherer's database is exported each time the machine reboots. An example RunGatherd script follows: #!/bin/sh # # RunGatherd - starts up the gatherd process (from /etc/rc.local) # HARVEST_HOME=/usr/local/harvest; export HARVEST_HOME PATH=${HARVEST_HOME}/lib/gatherer:${HARVEST_HOME}/bin:$PATH; export PATH exec gatherd -d /usr/local/harvest/gatherers/att800/data 8500 44..77..66.. TThhee llooccaall ddiisskk ccaacchhee The Gatherer maintains a local disk cache of files it gathers to reduce network traffic from restarting aborted gathering attempts. However, since the remote server must still be contacted whenever Gatherer runs, please do not set your cron job to run Gatherer frequently. A typical value might be weekly or monthly, depending on how congested the network and how important it is to have the most current data. By default, the Gatherer's local disk cache is deleted after each successful completion. To save the local disk cache between Gatherer sessions, define KKeeeepp--CCaacchhee:: yyeess in your Gatherer configuration file (Section ``Setting variables in the Gatherer configuration file''). If you want your Broker's index to reflect new data, then you must run the Gatherer _a_n_d run a Broker collection. By default, a Broker will perform collections once a day. If you want the Broker to collect data as soon as it's gathered, then you will need to coordinate the timing of the completion of the Gatherer and the Broker collections. If you run your Gatherer frequently and you use the KKeeeepp--CCaacchhee:: yyeess in your Gatherer configuration file, then the Gatherer's local disk cache may interfere with retrieving updates. 
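As an illustration of the weekly or monthly schedule suggested above,
a crontab entry along the following lines runs the Gatherer from cron
(the day, time, and path are only examples; this assumes the
RunGatherer script for the att800 Gatherer shown earlier lives in
/usr/local/harvest/gatherers/att800):

        # Run the att800 Gatherer every Sunday at 02:30
        30 2 * * 0  /usr/local/harvest/gatherers/att800/RunGatherer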
By default, objects in the local disk cache expire after 7 days; however, you can expire objects more quickly by setting the $$GGAATTHHEERREERR__CCAACCHHEE__TTTTLL environment variable to the number of seconds for the Time-To-Live (TTL) before you run the Gatherer, or you can change RunGatherer to remove the Gatherer's _t_m_p directory after each Gatherer run. For example, to expire objects in the local disk cache after one day: % setenv GATHERER_CACHE_TTL 86400 # one day % ./RunGatherer The Gatherer's local disk cache size defaults to 32 MBs, but you can change this value by setting the $$HHAARRVVEESSTT__MMAAXX__LLOOCCAALL__CCAACCHHEE environment variable to the number of MBs before you run the Gatherer. For example, to have a maximum cache of 10 MB you can do as follows: % setenv HARVEST_MAX_LOCAL_CACHE 10 # 10 MB % ./RunGatherer If you have access to the software that creates the files that you are indexing (e.g., if all updates are funneled through a particular editor, update script, or system call), you can modify this software to schedule realtime Gatherer updates whenever a file is created or updated. For example, if all users update the files being indexed using a particular program, this program could be modified to run the Gatherer upon completion of the user's update. Note that, when used in conjunction with cron, the Gatherer provides a powerful data ``mirroring'' facility. You can use the Gatherer to replicate the contents of one or more sites, retrieve data in multiple formats via multiple protocols (FTP, HTTP, etc.), optionally perform a variety of type- or site-specific transformations on the data, and serve the results very efficiently as compressed SOIF object summary streams to other sites that wish to use the data for building indexes or for other purposes. 44..77..77.. IInnccoorrppoorraattiinngg mmaannuuaallllyy ggeenneerraatteedd iinnffoorrmmaattiioonn iinnttoo aa GGaatthheerreerr You may want to inspect the quality of the automatically-generated SOIF templates. In general, Essence's techniques for automatic information extraction produce imperfect results. Sometimes it is possible to customize the summarizers to better suit the particular context (see Section ``Customizing the summarizing step''). Sometimes, however, it makes sense to augment or change the automatically generated keywords with manually entered information. For example, you may want to add _T_i_t_l_e attributes to the content summaries for a set of PostScript documents (since it's difficult to parse them out of PostScript automatically). Harvest provides some programs that automatically clean up a Gatherer's database. The rmbinary program removes any binary data from the templates. The cleandb program does some simple validation of SOIF objects, and when given the --ttrruunnccaattee flag it will truncate the _K_e_y_w_o_r_d_s data field to 8 kilobytes. To help in manually managing the Gatherer's databases, the gdbmutil GDBM database management tool is provided in _$_H_A_R_V_E_S_T___H_O_M_E_/_l_i_b_/_g_a_t_h_e_r_e_r. In a future release of Harvest we will provide a forms-based mechanism to make it easy to provide manual annotations. In the meantime, you can annotate the Gatherer's database with manually generated information by using the mktemplate, template2db, mergedb, and mkindex programs. You first need to create a file (called, say, _a_n_n_o_t_a_t_i_o_n_s) in the following format: @FILE { url1 Attribute-Name-1: DATA Attribute-Name-2: DATA ... 
Attribute-Name-n: DATA } @FILE { url2 Attribute-Name-1: DATA Attribute-Name-2: DATA ... Attribute-Name-n: DATA } ... Note that the _A_t_t_r_i_b_u_t_e_s must begin in column 0 and have one tab after the colon, and the _D_A_T_A must be on a single line. Next, run the mktemplate and template2db programs to generate SOIF and then GDBM versions of these data (you can have several files containing the annotations, and generate a single GDBM database from the above commands): % set path = ($HARVEST_HOME/lib/gatherer $path) % mktemplate annotations [annotations2 ...] | template2db annotations.gdbm Finally, you run mergedb to incorporate the annotations into the automatically generated data, and mkindex to generate an index for it. The usage line for mergedb is: mergedb production automatic manual [manual ...] The idea is that _p_r_o_d_u_c_t_i_o_n is the final GDBM database that the Gatherer will serve. This is a _n_e_w database that will be generated from the other databases on the command line. _a_u_t_o_m_a_t_i_c is the GDBM database that a Gatherer automatically generated in a previous run (e.g., _W_O_R_K_I_N_G_._g_d_b_m or a previous _P_R_O_D_U_C_T_I_O_N_._g_d_b_m). _m_a_n_u_a_l and so on are the GDBM databases that you manually created. When mergedb runs, it builds the _p_r_o_d_u_c_t_i_o_n database by first copying the templates from the _m_a_n_u_a_l databases, and then merging in the attributes from the _a_u_t_o_m_a_t_i_c database. In case of a conflict (the same attribute with different values in the _m_a_n_u_a_l and _a_u_t_o_m_a_t_i_c databases), the _m_a_n_u_a_l values override the _a_u_t_o_m_a_t_i_c values. By keeping the automatically and manually generated data stored separately, you can avoid losing the manual updates when doing periodic automatic gathering. To do this, you will need to set up a script to remerge the manual annotations with the automatically gathered data after each gathering. An example use of mergedb is: % mergedb PRODUCTION.new PRODUCTION.gdbm annotations.gdbm % mv PRODUCTION.new PRODUCTION.gdbm % mkindex If the manual database looked like this: @FILE { url1 my-manual-attribute: this is a neat attribute } and the automatic database looked like this: @FILE { url1 keywords: boulder colorado file-size: 1034 md5: c3d79dc037efd538ce50464089af2fb6 } then in the end, the production database will look like this: @FILE { url1 my-manual-attribute: this is a neat attribute keywords: boulder colorado file-size: 1034 md5: c3d79dc037efd538ce50464089af2fb6 } 44..88.. TTrroouubblleesshhoooottiinngg DDeebbuuggggiinngg Extra information from specific programs and library routines can be logged by setting debugging flags. A debugging flag has the form --DDsseeccttiioonn,,lleevveell. _S_e_c_t_i_o_n is an integer in the range 1-255, and _l_e_v_e_l is an integer in the range 1-9. Debugging flags can be given on a command line, with the DDeebbuugg--OOppttiioonnss:: tag in a gatherer configuration file, or by setting the environment variable $$HHAARRVVEESSTT__DDEEBBUUGG. 
Examples: Debug-Options: -D68,5 -D44,1 % httpenum -D20,1 -D21,1 -D42,1 http://harvest.cs.colorado.edu/ % setenv HARVEST_DEBUG '-D20,1 -D23,1 -D63,1' Debugging sections and levels have been assigned to the following sections of the code: section 20, level 1, 5, 9 Common liburl URL processing section 21, level 1, 5, 9 Common liburl HTTP routines section 22, level 1, 5 Common liburl disk cache routines section 23, level 1 Common liburl FTP routines section 24, level 1 Common liburl Gopher routines section 25, level 1 urlget - standalone liburl program. section 26, level 1 ftpget - standalone liburl program. section 40, level 1, 5, 9 Gatherer URL enumeration section 41, level 1 Gatherer enumeration URL verification section 42, level 1, 5, 9 Gatherer enumeration for HTTP section 43, level 1, 5, 9 Gatherer enumeration for Gopher section 44, level 1, 5 Gatherer enumeration filter routines section 45, level 1 Gatherer enumeration for FTP section 46, level 1 Gatherer enumeration for file:// URLs section 48, level 1, 5 Gatherer enumeration robots.txt stuff section 60, level 1 Gatherer essence data object processing section 61, level 1 Gatherer essence database routines section 62, level 1 Gatherer essence main section 63, level 1 Gatherer essence type recognition section 64, level 1 Gatherer essence object summarizing section 65, level 1 Gatherer essence object unnesting section 66, level 1, 2, 5 Gatherer essence post-summarizing section 67, level 1 Gatherer essence object-ID code section 69, level 1, 5, 9 Common SOIF template processing section 70, level 1, 5, 9 Broker registry section 71, level 1 Broker collection routines section 72, level 1 Broker SOIF parsing routines section 73, level 1, 5, 9 Broker registry hash tables section 74, level 1 Broker storage manager routines section 75, level 1, 5 Broker query manager routines section 75, level 4 Broker query_list debugging section 76, level 1 Broker event management routines section 77, level 1 Broker main section 78, level 9 Broker select(2) loop section 79, level 1, 5, 9 Broker gatherer-id management section 80, level 1 Common utilities memory management section 81, level 1 Common utilities buffer routines section 82, level 1 Common utilities system(3) routines section 83, level 1 Common utilities pathname routines section 84, level 1 Common utilities hostname processing section 85, level 1 Common utilities string processing section 86, level 1 Common utilities DNS host cache section 101, level 1 Broker PLWeb indexing engine section 102, level 1, 2, 5 Broker Glimpse indexing engine section 103, level 1 Broker Swish indexing engine SSyymmppttoomm The Gatherer _d_o_e_s_n_'_t _p_i_c_k _u_p _a_l_l _t_h_e _o_b_j_e_c_t_s pointed to by some of my RootNodes. SSoolluuttiioonn The Gatherer places various limits on enumeration to prevent a misconfigured Gatherer from abusing servers or running wildly. See section ``RootNode specifications'' for details on how to override these limits. SSyymmppttoomm _L_o_c_a_l_-_M_a_p_p_i_n_g _d_i_d _n_o_t _w_o_r_k for me - it retrieved the objects via the usual remote access protocols. SSoolluuttiioonn A local mapping will fail if: +o the local filename cannot be opened for reading; or, +o the local filename is not a regular file; or, +o the local filename has execute bits set. So for directories, symbolic links, and CGI scripts, the HTTP server is always contacted. We don't perform URL translation for local mappings. If your URL's have funny characters that must be escaped, then the local mapping will also fail. 
Add debug option --DD2200,,11 to understand how local mappings are taking place. SSyymmppttoomm Using the ----ffuullll--tteexxtt option I see a lot of _r_a_w _d_a_t_a in the content summaries, with few keywords I can search. SSoolluuttiioonn At present ----ffuullll--tteexxtt simply includes the full data content in the SOIF summaries. Using the individual file type summarizing mechanism described in Section ``Customizing the summarizing step'' will work better in this regard, but will require you to specify how data are extracted for each individual file type. In a future version of Harvest we will change the Essence ----ffuullll--tteexxtt option to perform content extraction before including the full text of documents. SSyymmppttoomm No indexing terms are being generated in the SOIF summary for the META tags in my HTML documents. SSoolluuttiioonn This probably indicates that your HTML is not syntactically well-formed, and hence the SGML-based HTML summarizer is not able to recognize it. See Section ``Summarizing SGML data'' for details and debugging options. SSyymmppttoomm Gathered data are _n_o_t _b_e_i_n_g _u_p_d_a_t_e_d. SSoolluuttiioonn The Gatherer does not automatically do periodic updates. See Section ``Periodic gathering and realtime updates'' for details. SSyymmppttoomm The Gatherer puts _s_l_i_g_h_t_l_y _d_i_f_f_e_r_e_n_t _U_R_L_s in the _S_O_I_F summaries than I specified in the Gatherer _c_o_n_f_i_g_u_r_a_t_i_o_n _f_i_l_e. SSoolluuttiioonn This happens because the Gatherer attempts to put URLs into a canonical format. It does this by removing default port numbers and similar cosmetic changes. Also, by default, Essence (the content extraction subsystem within the Gatherer) removes the standard stoplist.cf types, which includes HTTP-Query (the cgi- bin stuff). SSyymmppttoomm There are _n_o _L_a_s_t_-_M_o_d_i_f_i_c_a_t_i_o_n_-_T_i_m_e or _M_D_5 _a_t_t_r_i_b_u_t_e_s in my gatherered SOIF data, so the Broker can't do duplicate elimination. SSoolluuttiioonn If you gather remote, manually-created information, it is pulled into Harvest using ``exploders'' that translate from the remote format into SOIF. That means they don't have a direct way to fill in the Last-Modification-Time or MD5 information per record. Note also that this will mean one update to the remote records would cause all records to look updated, which will result in more network load for Brokers that collect from this Gatherer's data. As a solution, you can compute MD5s for all objects, and store them as part of the record. Then, when you run the exploder you only generate timestamps for the ones for which the MD5s changed - giving you real last-modification times. SSyymmppttoomm The Gatherer substitutes a ``%7e'' for a ``~'' in all the user directory URLs. SSoolluuttiioonn The Gatherer conforms to RFC1738, which says that a tilde inside a URL should be encoded as ``%7e'', because it is considered an ``unsafe'' character. SSyymmppttoomm When I search using keywords I know are in a document I have indexed with Harvest, the _d_o_c_u_m_e_n_t _i_s_n_'_t _f_o_u_n_d. SSoolluuttiioonn Harvest uses a content extraction subsystem called _E_s_s_e_n_c_e that by default does not extract every keyword in a document. Instead, it uses heuristics to try to select promising keywords. You can change what keywords are selected by customizing the summarizers for that type of data, as discussed in Section ``Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps''. 
Or, you can tell _E_s_s_e_n_c_e to use full text summarizing if you feel the added disk space costs are merited, as discussed in Section ``Setting variables in the Gatherer configuration file''. SSyymmppttoomm I'm running Harvest on HP-UX, but the essence process in the Gatherer _t_a_k_e_s _t_o_o _m_u_c_h _m_e_m_o_r_y. SSoolluuttiioonn The supplied regular expression library has memory leaks on HP- UX, so you need to use the regular expression library supplied with HP-UX. Change the _M_a_k_e_f_i_l_e in _s_r_c_/_g_a_t_h_e_r_e_r_/_e_s_s_e_n_c_e to read: REGEX_DEFINE = -DUSE_POSIX_REGEX REGEX_INCLUDE = REGEX_OBJ = REGEX_TYPE = posix SSyymmppttoomm I built the configuration files to _c_u_s_t_o_m_i_z_e how Essence types/content extracts data, but it _u_s_e_s _t_h_e _s_t_a_n_d_a_r_d _t_y_p_i_n_g_/_e_x_t_r_a_c_t_i_n_g mechanisms anyway. SSoolluuttiioonn Verify that you have the LLiibb--DDiirreeccttoorryy set to the _l_i_b_/ directory that you put your configuration files. LLiibb--DDiirreeccttoorryy is defined in your Gatherer configuration file. SSyymmppttoomm I am having problems _r_e_s_o_l_v_i_n_g _h_o_s_t _n_a_m_e_s on SunOS. SSoolluuttiioonn In order to gather data from hosts outside of your organization, your system must be able to resolve fully qualified domain names into IP addresses. If your system cannot resolve hostnames, you will see error messages such as ``Unknown Host.'' In this case, either: +o the hostname you gave does not really exist; or +o your system is not configured to use the DNS. To verify that your system is configured for DNS, make sure that the file _/_e_t_c_/_r_e_s_o_l_v_._c_o_n_f exists and is readable. Read the resolv.conf(5) manual page for information on this file. You can verify that DNS is working with the nslookup command. Some sites may use Sun Microsystem's Network Information Service (NIS) instead of, or in addition to, DNS. We believe that Harvest works on systems where NIS has been properly configured. The NIS servers (the names of which you can determine from the ypwhich command) must be configured to query DNS servers for hostnames they do not know about. See the --bb option of the ypxfr command. SSyymmppttoomm I cannot get the Gatherer to work across our _f_i_r_e_w_a_l_l _g_a_t_e_w_a_y. SSoolluuttiioonn Harvest only supports retrieving HTTP objects through a proxy. It is not yet possible to request Gopher and FTP objects through a firewall. For these objects, you may need to run Harvest internally (behind the firewall) or on the firewall host itself. If you see the ``Host is unreachable'' message, these are the likely problems: +o your connection to the Internet is temporarily down due to a circuit or routing failure; or +o you are behind a firewall. If you see the ``Connection refused'' message, the likely problem is that you are trying to connect with an unused port on the destination machine. In other words, there is no program listening for connections on that port. The Harvest gatherer is essentially a WWW client. You should expect it to work the same as any Web browser. 55.. TThhee BBrrookkeerr 55..11.. OOvveerrvviieeww The Broker retrieves and manages indexing information from Gatherers and other Brokers, and provides a WWW query interface to the indexing information. 55..22.. BBaassiicc sseettuupp The Broker is automatically started by the RunHarvest command. Other relevant commands are described in Section ``Starting up the system: RunHarvest and related commands''. 
In the current section we discuss various ways users can customize and tune the Broker, how to administrate the Broker, and the various Broker programming interfaces. As suggested in Figure ``1'', the Broker uses a flexible indexing interface that supports a variety of indexing subsystems. The default Harvest Broker uses Glimpse as indexer, but other indexers such as Swish, and WAIS (both freeWAIS and commercial WAIS <ftp://ftp.cnidr.org/pub/software/freewais/>), also work with the Broker (see Section ``Using different index/search engines with the Broker''). To create a new Broker, run the CreateBroker program. It will ask you a series of questions about how you'd like to configure your Broker, and then automatically create and configure it. To start your Broker, use the RunBroker program that CreateBroker generates. The Broker should be started when your system reboots. To prevent a collection while starting the broker, use the --nnooccooll option. There are a number of ways you can customize or tune the Broker, discussed in Sections ``Tuning Glimpse indexing in the Broker'' and ``Using different index/search engines with the Broker''. You may also use the RunHarvest command, discussed in Section ``Starting up the system: RunHarvest and related commands'', to create both a Broker and a Gatherer. 55..33.. QQuueerryyiinngg aa BBrrookkeerr The Harvest Broker can handle many types of queries. The queries handled by a particular Broker depend on what index/search engine is being used inside of it (e.g., WAIS does not support some of the queries that Glimpse does). In this section we describe the full syntax. If a particular Broker does not support a certain type of query, it will return an error when the user requests that type of query. The simplest query is a single keyword, such as: lightbulb Searching for common words (like ``computer'' or ``html'') may take a lot of time. Particularly for large Brokers, it is often helpful to use more powerful queries. Harvest supports many different index/search engines, with varying capabilities. At present, our most powerful (and commonly used) search engine is Glimpse, which supports: +o case-insensitive and case-sensitive queries; +o matching parts of words, whole words, or multiple word phrases (like ``resource discovery''); +o Boolean (AND/OR) combinations of keywords; +o approximate matches (e.g., allowing spelling errors); +o structured queries (which allow you to constrain matches to certain attributes); +o displaying matched lines or entire matching records (e.g., for citations); +o specifying limits on the number of matches returned; and +o a limited form of regular expressions (e.g., allowing ``wild card'' expressions that match all words ending in a particular suffix). The different types of queries (and how to use them) are discussed below. Note that you use the same syntax regardless of what index/search engine is running in a particular Broker, but that not all engines support all of the above features. In particular, some of the Brokers use WAIS, which sometimes searches faster than Glimpse but supports only Boolean keyword queries and the ability to specify result set limits. The different options - case-sensitivity, approximate matching, the ability to show matched lines vs. entire matching records, and the ability to specify match count limits - can all be specified with buttons and menus in the Broker query forms. 
A structured query has the form: tag-name : value where _t_a_g_-_n_a_m_e is a Content Summary attribute name, and _v_a_l_u_e is the search value within the attribute. If you click on a Content Summary, you will see what attributes are available for a particular Broker. A list of common attributes is shown in Section ``List of common SOIF attribute names''. Keyword searches and structured queries can be combined using Boolean operators (AND and OR) to form complex queries. Lacking parentheses, logical operation precedence is based left to right. For multiple word phrases or regular expressions, you need to enclose the string in double quotes, e.g., "internet resource discovery" or "discov.*" Double quotes should also be used when searching for non-alphanumeric characters. 55..33..11.. EExxaammppllee qquueerriieess SSiimmppllee kkeeyywwoorrdd sseeaarrcchh qquueerryy:: _A_r_i_z_o_n_a This query returns all objects in the Broker containing the word _A_r_i_z_o_n_a. BBoooolleeaann qquueerryy:: _A_r_i_z_o_n_a _A_N_D _d_e_s_e_r_t This query returns all objects in the Broker that contain both words anywhere in the object in any order. PPhhrraassee qquueerryy:: _"_A_r_i_z_o_n_a _d_e_s_e_r_t_" This query returns all objects in the Broker that contain _A_r_i_z_o_n_a _d_e_s_e_r_t as a phrase. Notice that you need to put double quotes around the phrase. BBoooolleeaann qquueerriieess wwiitthh pphhrraasseess:: _"_A_r_i_z_o_n_a _d_e_s_e_r_t_" _A_N_D _w_i_n_d_s_u_r_f_i_n_g This query returns all objects in the Broker that contain _A_r_i_z_o_n_a _d_e_s_e_r_t as a phrase and the word windsurfing. SSiimmppllee SSttrruuccttuurreedd qquueerryy:: _T_i_t_l_e _: _w_i_n_d_s_u_r_f_i_n_g This query returns all objects in the Broker where the _T_i_t_l_e attribute contains the value _w_i_n_d_s_u_r_f_i_n_g. CCoommpplleexx qquueerryy:: _"_A_r_i_z_o_n_a _d_e_s_e_r_t_" _A_N_D _(_T_i_t_l_e _: _w_i_n_d_s_u_r_f_i_n_g_) This query returns all objects in the Broker that contain the phrase _A_r_i_z_o_n_a _d_e_s_e_r_t and where the _T_i_t_l_e attribute of the same object contains the value _w_i_n_d_s_u_r_f_i_n_g. 55..33..22.. RReegguullaarr eexxpprreessssiioonnss Some types of regular expressions are supported by Glimpse. A regular expression search can be much slower that other searches. The following is a partial list of possible patterns. (For more details see the Glimpse documentations.) +o _^_j_o_e will match ``joe'' at the beginning of a line. +o _j_o_e_$ will match ``joe'' at the end of a line. +o _[_a_-_h_o_-_z_] matches any character between ``a'' and ``h'' or between ``o'' and ``z''. +o _. matches any single character except newline. +o _c_* matches zero or more occurrences of the character ``c''. +o _._* matches any number of characters except newline. +o _\_* matches the character ``*''. (_\ escapes any of the above special characters.) Regular expressions are currently limited to approximately 30 characters, not including meta characters. Regular expressions will generally not cross word boundaries (because only words are stored in the index). So, for example, _"_l_i_n_._*_i_n_g_" will find ``linking'' or ``flinching,'' but not ``linear programming.'' 55..33..33.. QQuueerryy ooppttiioonnss sseelleecctteedd bbyy mmeennuuss oorr bbuuttttoonnss The query page may have following checkboxes to allow some control of the query specification. CCaassee iinnsseennssiittiivvee:: By selecting this checkbox the query will become case insensitive (lower case and upper case letters don't differ). 
Otherwise, the query will be case sensitive.  The default is case
insensitive.

KKeeyywwoorrddss mmaattcchh oonn wwoorrdd bboouunnddaarriieess:: By selecting this checkbox,
keywords will match on word boundaries.  Otherwise, a keyword will
match part of a word (or phrase).  For example, "network" will match
``networking'', "sensitive" will match ``insensitive'', and "Arizona
desert" will match ``Arizona desertness''.  The default is to match
keywords on word boundaries.

NNuummbbeerr ooff eerrrroorrss aalllloowweedd:: Glimpse allows the search to contain a
number of errors.  An error is either a deletion, insertion, or
substitution of a single character.  The Best Match option will find
the match(es) with the least number of errors.  The default is 0
(zero) errors.

_N_o_t_e_: The previous three options do not apply to attribute names.
Attribute names are always case insensitive and allow no errors.

55..33..44..  FFiilltteerriinngg qquueerryy rreessuullttss

Harvest allows you to filter the results of a query by any query term
using any attribute defined in the ``List of common SOIF attribute
names''.  This is done by defining ffiilltteerr parameters in the query
form.  It is possible to define more than one filter parameter; they
will be concatenated by boolean AANNDD.

Filter parameters consist of two parts, separated by the pipe symbol
``|''.  The first part is a query expression which is attached to the
user query using AANNDD before sending the request to the broker.  The
optional second part is an HTML text that will be displayed on the
results page, to give the user some information on the applied filter.
Example:

        <SELECT NAME="filter">
        <OPTION VALUE=''>No Filter
        <OPTION VALUE='uri: "xyz\.edu"|Search only xyz.edu'>Search xyz.edu only
        <OPTION VALUE='type: html|HTML documents only'>Search HTML documents only
        </SELECT>

The first option returns an unfiltered output.  The second option
returns only pages with ``xyz.edu'' in their URL.  The third option
returns only HTML documents.  See the advanced search page of the
broker for more examples.

55..33..55..  RReessuulltt sseett pprreesseennttaattiioonn

The query page may have the following checkboxes to allow some control
over the presentation of the query results.

DDiissppllaayy mmaattcchheedd lliinneess ((ffrroomm ccoonntteenntt ssuummmmaarriieess)):: By selecting this
checkbox, the result set presentation will contain the lines of the
Content Summary that matched the query.  Otherwise, the matched lines
will not be displayed.  The default is to display the matched lines.

DDiissppllaayy oobbjjeecctt ddeessccrriippttiioonnss ((iiff aavvaaiillaabbllee)):: Some objects have short,
one-line descriptions associated with them.  By selecting this
checkbox, the descriptions will be presented.  Otherwise, the object
descriptions will not be displayed.  The default is to display object
descriptions.

DDiissppllaayy lliinnkkss ttoo iinnddeexxeedd ccoonntteenntt ssuummmmaarryy:: This checkbox allows you
to set whether links to the indexed content summaries are displayed or
not.  The default is not to display links to indexed content
summaries.

55..44..  CCuussttoommiizziinngg tthhee BBrrookkeerr''ss QQuueerryy RReessuulltt SSeett

It is possible for the Harvest administrator to customize how the
Broker query result set is generated, by modifying a configuration
file that is interpreted by the search.cgi Perl program at query
result time.  search.cgi allows you to customize almost every aspect
of its HTML output.
The file _$_H_A_R_V_E_S_T___H_O_M_E_/_c_g_i_-_b_i_n_/_l_i_b_/_s_e_a_r_c_h_._c_f contains the default output definitions. Individual brokers can be customized by creating a similar file which overrides the default definitions. 55..44..11.. TThhee sseeaarrcchh..ccff ccoonnffiigguurraattiioonn ffiillee Definitions are enclosed within SGML-like beginning and ending tags. For example: <HarvestUrl> http://harvest.sourceforge.net/ </HarvestUrl> The last newline character is removed from each definition, so that the above becomes the string ``http://harvest.sourceforge.net/.'' Variable substitution occurs on every definition before it is output. A number of specific variables are defined by search.cgi which can be used inside a definition. For example: <BrokerLoad> Sorry, the Broker at <STRONG>$host, port $port</STRONG> is currently too heavily loaded to process your request. Please try again later.<P> </BrokerLoad> When this definition is printed out, the variables _$_h_o_s_t and _$_p_o_r_t would be replaced with the hostname and port of the broker. 55..44..11..11.. DDeeffiinneedd VVaarriiaabblleess The following variables are defined as soon as the query string is processed. They can be used before the broker returns any results. $maxresult The maximum number of matched lines to be returned $host The broker hostname $port The broker port $query The query string entered by the user $bquery The whole query string sent to the broker These variables are defined for each matched object returned by the broker. $objectnum The number of the returned object $desc The description attribute of the matched object $opaque ALL the matched lines from the matched object $url The original URL of the matched object $A The access method of $url (e.g.: http) $H The hostname (including port) from $url $P The path part of $url $D The directory part of $P $F The filename part of $P $cs_url The URL of the content summary in the broker database $cs_a Access part of $cs_url $cs_h Hostname part of $cs_url $cs_p Path part of $cs_url $cs_d Directory part of $cs_p $cs_f Filename part of $cs_p 55..44..11..22.. LLiisstt ooff DDeeffiinniittiioonnss Below is a partial list of definitions. A complete list can be found in the search.cf file. Only definitions likely to be customized are described here. <<TTiimmeeoouutt>> Timeout value for search.cgi. If the broker doesn't respond within this time, search.cgi will exit. <<RReessuullttHHeeaaddeerr>> The first part of the result page. Should probably contain the HTML <<TTIITTLLEE>> element and the user query string. <<RReessuullttTTrraaiilleerr>> The last part of the result page. The default has URL references to the broker home page and the Harvest project home page. <<RReessuullttSSeettBBeeggiinn>> This is output just before looping over all the matched objects. <<RReessuullttSSeettEEnndd>> This is output just after ending the loop over matched objects. <<PPrriinnttOObbjjeecctt>> This definition prints out a matched object. It should probably include the variables _$_u_r_l_, _$_c_s___u_r_l_, _$_d_e_s_c, and _$_o_p_a_q_u_e. <<EEnnddBBrrookkeerrRReessuullttss>> Printed between <<RReessuullttSSeettEEnndd>> and <<RReessuullttTTrraaiilleerr>> if the query was successful. Should probably include a count of matched objects and/or matched lines. <<FFaaiillBBrrookkeerrRReessuullttss>> Similar to <<EEnnddBBrrookkeerrRReessuullttss>> but prints if the broker returns an error in response to the query. 
<<OObbjjeeccttNNuummPPrriinnttff>> A printf format string for the object number (_$_o_b_j_e_c_t_n_u_m). <<TTrruunnccaatteeWWaarrnniinngg>> Prints a warning message if the result set was truncated at the maximum number of matched lines. These following definitions are somewhat different because they are evaluated as Perl instructions rather than strings. <<MMaattcchheeddLLiinneeSSuubb>> Evaluated for every matched line returned by the broker. Can be used to indent matched lines or to remove the leading ``Matched line'' and attribute name strings. <<IInniittFFuunnccttiioonn>> Evaluated near the beginning of the search.cgi program. Can be used to set up special variables or read data files. <<PPeerrOObbjjeeccttFFuunnccttiioonn>> Evaluated for each object just before <<PPrriinnttOObbjjeecctt>> is called. <<FFoorrmmaattAAttttrriibbuuttee>> Evaluated for each SOIF attribute requested for matched objects (see Section ``Displaying SOIF attributes in results''). _$_a_t_t is set to the attribute name, and _$_v_a_l is set to the attribute value. 55..44..22.. EExxaammppllee sseeaarrcchh..ccff ccuussttoommiizzaattiioonn ffiillee The following definitions demonstrate how to change the search.cgi output. The <<PPeerrOObbjjeeccttFFuunnccttiioonn>> ensures that the description is not empty. It also prepends the string ``matched data:'' before any matched lines. The <<PPrriinnttOObbjjeecctt>> specification prints the object number, description, and indexing data all on the first line. The description is wrapped around HMTL anchor tags so that it is a link to the object originally gathered. The words ``indexing data'' are a link to the displaySOIF program which will format the content summary for HTML browsers. The object number is formatted as a number in parenthesis such that the whole thing takes up four spaces. The <<MMaattcchheeddLLiinneeSSuubb>> definition includes four substitution expressions. The first removes the words ``Matched line:'' from the beginning of each matched line. The second removes SOIF attributes of the form ``_p_a_r_t_i_a_l_-_t_e_x_t_{_4_3_}_:'' from the beginning of a line. The third displays the attribute names (e.g. _p_a_r_t_i_a_l_-_t_e_x_t_#) in italics. The last expression indents each line by five spaces to align it with the description line. The definition for <<EEnnddBBrrookkeerrRReessuullttss>> slightly modifies the report of how many objects were matched. # Demo to show some of the customization features for the Harvest output # More information can be found in the manual at: # http://harvest.sourceforge.net/harvest/doc/html/manual.html # The PerObjectFunction is Perl code evaluated for every hit <PerObjectFunction> # Create description # Is the descriptions provided by Harvest very short (e.g. missing <TITLE>)? 
         if (length($desc) < 5) {
           # Yes: use filename ($F) instead
           $description = "<I>File:</I> $F";
         } else {
           # No: use description provided by Harvest
           $description = $desc;
         }
         # Format matched lines ("opaque data") if data is present
         if ($opaque ne '') {
           $opaque = "<strong>matched lines:</strong><BR>$opaque"
         }
       </PerObjectFunction>

       # PrintObject defines the appearance of hits
       <PrintObject>
       $objectnum <A HREF="$url"><STRONG>$description</STRONG></A> \
       [<A HREF="$cs_a://$cs_h/Harvest/cgi-bin/displaySOIF.cgi?object=$cs_p">\
       indexing data</A>]
       <pre>
       $opaque
       </pre>\n
       </PrintObject>

       # Format the appearance of the hit number
       <ObjectNumPrintf>
       (%2d)
       </ObjectNumPrintf>

       # Format the appearance of every matched line
       <MatchedLineSub>
       s/^Matched line: *//;              # Remove "Matched line:"
       s/^([\w-]+# )[\w-]+{\d+}:\t/\1/;   # Remove SOIF attributes of the form "partial-text{43}:"
       s/^([\w-]+#)/<I>\1<\/I>/;          # Format attribute names as italics
       s/^.*/     $&/;                    # Add spaces to indent text
       </MatchedLineSub>

       # Modifies the report of how many objects were matched
       <EndBrokerResults>
       <STRONG>Found $nopaquelines matched lines, $nobjects objects.</STRONG> <P>\n
       </EndBrokerResults>

55..44..33.. IInntteeggrraattiinngg yyoouurr ccuussttoommiizzeedd ccoonnffiigguurraattiioonn ffiillee

The search.cgi configuration files are kept in _$_H_A_R_V_E_S_T___H_O_M_E_/_c_g_i_-_b_i_n_/_l_i_b. The name of a customized file is listed in the _q_u_e_r_y_._h_t_m_l form, and passed as an option to the search.cgi program. The simplest way to specify the customized file is by placing an <<IINNPPUUTT>> tag in the HTML form:

       <INPUT TYPE="hidden" NAME="brokerqueryconfig" VALUE="custom.cf">

Another way is to allow users to select from different customizations with a <<SSEELLEECCTT>> list:

       <SELECT NAME="brokerqueryconfig">
       <OPTION VALUE=""> Default
       <OPTION VALUE="custom1.cf"> Customized
       <OPTION VALUE="custom2.cf" SELECTED> Highly Customized
       </SELECT>

55..44..44.. DDiissppllaayyiinngg SSOOIIFF aattttrriibbuutteess iinn rreessuullttss

It is possible to request SOIF attributes from the HTML query form. A simple approach is to include a select list in the query form:

       <SELECT MULTIPLE NAME="attribute">
       <OPTION VALUE="title">
       <OPTION VALUE="author">
       <OPTION VALUE="date">
       <OPTION VALUE="subject">
       </SELECT>

In this manner, the user may control which attributes get displayed. The layout of these attributes when the results are displayed in HTML is controlled by the <<FFoorrmmaattAAttttrriibbuuttee>> specification in the _s_e_a_r_c_h_._c_f file described in Section ``The search.cf configuration file''.

55..55.. WWoorrlldd WWiiddee WWeebb iinntteerrffaaccee ddeessccrriippttiioonn

To allow Web browsers to easily interface with the Broker, we implemented a World Wide Web interface to the Broker's query manager and administrative interfaces. This WWW interface, which includes several HTML files and a few programs that use the Common Gateway Interface (CGI), consists of the following:

+o HTML files that use Forms support to present a graphical user interface (GUI) to the user;

+o CGI programs that act as a gateway between the user and the Broker; and

+o Help files for the user.

Users go through the following steps when using a Broker to locate information:

1. The user issues a query to the Broker.

2. The Broker processes the query, and returns the query results to the user.

3. The user can then view content summaries from the result set, or access the URLs from the result set directly.
To provide a WWW-queryable interface, the Broker needs to run in conjunction with an HTTP server. Section ``Additional installation for the Harvest Broker'' describes how to configure your HTTP server to work with Harvest. You can run the Broker on a different machine than your HTTP server runs on, but if you want users to be able to view the Broker's content summaries then the Broker's files will need to be accessible to your HTTP server. You can NFS mount those files or manually copy them over. You'll also need to change the _B_r_o_k_e_r_s_._c_f file to point to the host that is running the Broker. 55..55..11.. HHTTMMLL ffiilleess ffoorr ggrraapphhiiccaall uusseerr iinntteerrffaaccee CreateBroker creates some HTML files to provide GUIs to the user: _q_u_e_r_y_._h_t_m_l Contains the GUI for the query interface. CreateBroker will install different _q_u_e_r_y_._h_t_m_l files for Glimpse, Swish, and WAIS, since each subsystem requires different defaults and supports different functionality (e.g., WAIS doesn't support approximate matching like Glimpse). This is also the ``home page'' for the Broker and a link to this page is included at the bottom of all query results. _a_d_m_i_n_._h_t_m_l Contains the GUI for the administrative interface. This file is installed into the _a_d_m_i_n directory of the Broker. _B_r_o_k_e_r_s_._c_f Contains the hostname and port information for the supported brokers. This file is installed into the _$_H_A_R_V_E_S_T___H_O_M_E_/_b_r_o_k_e_r_s directory. The _q_u_e_r_y_._h_t_m_l file uses the value of the ``broker'' FORM tag to pass the name of the broker to search.cgi which in turn retrieves the host and port information from _B_r_o_k_e_r_s_._c_f. 55..55..22.. CCGGII pprrooggrraammss When you install the WWW interface (see Section ``The Broker''), a few programs are installed into your HTTP server's _/_H_a_r_v_e_s_t_/_c_g_i_-_b_i_n directory: search.cgi This program takes the submitted query from _q_u_e_r_y_._h_t_m_l, and sends it to the specified Broker. It then retrieves the query results from the Broker, formats them in HTML, and sends the result set in HTML to the user. displaySOIF.cgi This program displays the content summaries from the Broker. BrokerAdmin.pl.cgi This program will take the submitted administrative command from _a_d_m_i_n_._h_t_m_l and send it to the appropriate Broker. It retrieves the result of the command from the Broker and displays it to the user. 55..55..33.. HHeellpp ffiilleess ffoorr tthhee uusseerr The WWW interface to the Broker includes a few help files written in HTML. These files are installed on your HTTP server in the _/_H_a_r_v_e_s_t_/_b_r_o_k_e_r_s directory when you install the broker (see Section ``The Broker''): _q_u_e_r_y_h_e_l_p_._h_t_m_l Provides a tutorial on constructing Broker queries, and on using the _q_u_e_r_y_._h_t_m_l forms. _q_u_e_r_y_._h_t_m_l has a link to this help page. _a_d_m_i_n_h_e_l_p_._h_t_m_l Provides a tutorial on submitting Broker administrative commands using the _a_d_m_i_n_._h_t_m_l form. _a_d_m_i_n_._h_t_m_l has a link to this help page. _s_o_i_f_h_e_l_p_._h_t_m_l Provides a brief description of SOIF. 55..66.. AAddmmiinniissttrraattiinngg aa BBrrookkeerr Administrators have two basic ways for managing a Broker: through the _b_r_o_k_e_r_._c_o_n_f and _C_o_l_l_e_c_t_i_o_n_._c_o_n_f configuration files, and through the interactive administrative interface. The interactive interface controls various facilities and operating parameters within the Broker. 
We provide an HTML interface page for these administrative commands. See Section ``Collector interface description: Collection.conf'' for additional information on the Broker administrative and collector interfaces.

The _b_r_o_k_e_r_._c_o_n_f file is a list of variable names and their values, which consists of information about the Broker (such as the directory in which it lives) and the port on which it runs. The _C_o_l_l_e_c_t_i_o_n_._c_o_n_f file (see Section ``Collector interface description: Collection.conf'' for an example) is a list of collection points from which the Broker collects its indexing information. The CreateBroker program automatically generates both of these configuration files. You can manually edit these files if needed. The CreateBroker program also creates the _a_d_m_i_n_._h_t_m_l file, which is the WWW interface to the Broker's administrative commands. Note that all administrative commands require a password as defined in _b_r_o_k_e_r_._c_o_n_f.

_N_o_t_e_: Changes to the Broker configuration are not saved when the Broker is restarted. Permanent changes to the Broker configuration should be made by manually editing the _b_r_o_k_e_r_._c_o_n_f file.

The administrative interface created by CreateBroker has the following window fields:

Command
       Select an administrative command. See below for a description of the commands.

Parameters
       Specify parameters for those commands that need them.

Password
       The administrative password.

Broker Host
       The host where the broker is running.

Broker Port
       The port where the broker is listening.

The administrative interface created by CreateBroker supports the following commands:

AAdddd oobbjjeeccttss bbyy ffiillee::
       Add object(s) to the Broker. The parameter is a list of filenames that contain SOIF objects to be added to the Broker.

CClloossee lloogg::
       Flush all accumulated log information and close the current log file. Causes the Broker to stop logging. No parameters.

CCoommpprreessss RReeggiissttrryy::
       Performs garbage collection on the Registry file. No parameters.

DDeelleettee eexxppiirreedd oobbjjeeccttss::
       Deletes any object from the Broker whose _T_i_m_e_-_t_o_-_L_i_v_e has expired. No parameters.

DDeelleettee oobbjjeeccttss bbyy qquueerryy::
       Deletes any object(s) that matches the given query. The parameter is a query with the same syntax as user queries. Query flags are currently unsupported.

DDeelleettee oobbjjeeccttss bbyy ooiidd::
       Deletes the object(s) identified by the given OID numbers. The parameter is a list of OID numbers. The OID numbers can be obtained by using the dumpregistry command.

DDiissaabbllee lloogg ttyyppee::
       Disables logging information about a particular type of event. The parameter is an event type. See Enable log type for a list of events.

EEnnaabbllee lloogg ttyyppee::
       Enables logging information about a particular type of event. The parameter is the name of an event type. Currently, event types are limited to the following:

       Update                  Log updated objects.
       Delete                  Log deleted objects.
       Refresh                 Log refreshed objects.
       Query                   Log user queries.
       Query-Return            Log objects returned from a query.
       Cleaned                 Log objects removed by the cleaner.
       Collection              Log collection events.
       Admin                   Log administrative events.
       Admin-Return            Log the results of administrative events.
       Bulk-Transfer           Log bulk transfer events.
       Bulk-Return             Log objects sent by bulk transfers.
       Cleaner-On              Log cleaning events.
       Compressing-Registry    Log registry compression events.
       All                     Log all events.
FFlluusshh lloogg::
       Flush all accumulated log information to the current log file. No parameters.

GGeenneerraattee ssttaattiissttiiccss::
       Generates some basic statistics about the Broker object database. No parameters.

IInnddeexx cchhaannggeess::
       Index only the objects that have been added recently. No parameters.

IInnddeexx ccoorrppuuss::
       Index the _e_n_t_i_r_e object database. No parameters.

OOppeenn lloogg::
       Open a new log file. If the file does not exist, create a new one. The parameter is the name (relative to the broker) of a file to use for logging.

RReessttaarrtt sseerrvveerr::
       Force the broker to reread the Registry and reindex the corpus. This does not actually kill the broker process. No parameters.

RRoottaattee lloogg ffiillee::
       Rotates the current log file to LOG.YYYYMMDD. Opens a new log file. No parameters.

SSeett vvaarriiaabbllee::
       Sets the value of a broker configuration variable. Takes two parameters, the name of a configuration variable and the new value for the variable. The configuration variables that can be set are those that occur in the _b_r_o_k_e_r_._c_o_n_f file. The change is only valid until the broker process dies.

SShhuuttddoowwnn sseerrvveerr::
       Cleanly shut down the Broker. No parameters.

SSttaarrtt ccoolllleeccttiioonn::
       Perform collections. No parameters.

DDeelleettee oollddeerr oobbjjeeccttss ooff dduupplliiccaattee UURRLLss::
       Occasionally a broker may end up with multiple summaries for individual URLs. This can happen when the Gatherer changes its description, hostname, or port number. Use this command to search the broker for duplicated URLs. When two objects with the same URL are found, the object with the least-recent timestamp is removed.

55..66..11.. DDeelleettiinngg uunnwwaanntteedd BBrrookkeerr oobbjjeeccttss

If you build a Broker and then decide not to index some of that data (e.g., you decide it would make sense to split it into two different Brokers, each targeted to a different community), you need to change the Gatherer's configuration file, rerun the Gatherer, and then let the old objects time out in the Broker (since the Broker and Gatherer maintain separate databases). If you want to clean out the Broker's data sooner than that you can use the Broker's administrative interface in one of three ways:

1. Use the 'Remove object by name' command. This is only reasonable if you have a small number of objects to remove in the Broker.

2. Use the 'Remove object by query' command. This might be the best option if, for example, you can construct a regular expression based on the URLs you want to remove.

3. Shut down the server, manually remove the Broker's _o_b_j_e_c_t_s_/_* files, and then restart the Broker. This is easiest, although if you have a large number of objects it will take longer to rebuild the index.

A simple way to accomplish this is by ``rebooting'' the Broker by deleting all the current objects, and doing a full collection, as follows:

       % mv objects objects.old
       % rm -rf objects.old &
       % broker ./admin/broker.conf -new

After removing objects, you should use the _I_n_d_e_x _c_o_r_p_u_s command.

55..66..22.. CCoommmmaanndd--lliinnee AAddmmiinniissttrraattiioonn

It is possible to perform administrative functions by using the brkclient program from the command line or from shell scripts. For example, to force a collection, run:

       % brkclient localhost 8501 '#ADMIN #Password secret #collection'

See your broker's raw _a_d_m_i_n_._h_t_m_l file for a complete list of administrative commands.
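Because brkclient is an ordinary command-line program, such administrative commands can also be scheduled. As a minimal sketch (the host, port, password, and schedule are placeholders, and the path assumes the default _$_H_A_R_V_E_S_T___H_O_M_E of _/_u_s_r_/_l_o_c_a_l_/_h_a_r_v_e_s_t), a crontab entry that triggers a collection every night at 2am might look like this:

       # Placeholder crontab entry: run a Broker collection nightly at 02:00
       0 2 * * * /usr/local/harvest/lib/broker/brkclient localhost 8501 '#ADMIN #Password secret #collection'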
55..77.. TTuunniinngg GGlliimmppssee iinnddeexxiinngg iinn tthhee BBrrookkeerr

The Glimpse indexing system can be tuned in a variety of ways to suit your particular needs. Probably the most noteworthy parameter is indexing granularity, for which Glimpse provides three options: a tiny index (2-3% of the total size of all files -- your mileage may vary), a small index (7-8%), and a medium-size index (20-30%). Search times are better with larger indexes. By changing the GGlliimmppsseeIInnddeexx--OOppttiioonn in your Broker's _b_r_o_k_e_r_._c_o_n_f file, you can tune Glimpse to use one of these three indexing granularity options. By default, GGlliimmppsseeIInnddeexx--OOppttiioonn builds a medium-size index using the glimpseindex program. Note also that with Glimpse it is much faster to search with ``show matched lines'' turned off in the Broker query page.

Glimpse uses a ``stop list'' to avoid indexing very common words. This list is not fixed, but rather computed as the index is built. For a medium-size index, the default is to put any word that appears at least 500 times per Mbyte (on the average) in the stop-list. For a small-size index, the default is words that appear in at least 80% of all files (unless there are fewer than 256 files, in which case there is no stop-list). Both defaults can be changed using the --SS option, which should be followed by the new number (average per Mbyte when --bb indexing is used, or % of files when --oo indexing is used). Tiny-size indexes do not maintain a stop-list (their effect is minimal).

glimpseindex includes a number of other options that may be of interest. You can find out more about these options (and more about Glimpse in general) in the Glimpse documentation. If you'd like to change how the Broker invokes the glimpseindex program, then edit the _s_r_c_/_b_r_o_k_e_r_/_G_l_i_m_p_s_e_/_i_n_d_e_x_._c file from the Harvest source distribution.

55..77..11.. TThhee gglliimmppsseesseerrvveerr pprrooggrraamm

The Glimpse system comes with an auxiliary server called glimpseserver, which allows indexes to be read into a process and kept in memory. This avoids the added cost of reading the index and starting a large process for each search. glimpseserver is automatically started each time you run the Broker, or reindex the Broker's corpus. If you do not want to run glimpseserver, then set GGlliimmppsseeSSeerrvveerr--HHoosstt to ``false'' in your _b_r_o_k_e_r_._c_o_n_f.

55..88.. UUssiinngg ddiiffffeerreenntt iinnddeexx//sseeaarrcchh eennggiinneess wwiitthh tthhee BBrrookkeerr

By default, Harvest uses the Glimpse index/search subsystem. However, Harvest defines a flexible indexing interface to allow Broker administrators to use different index/search subsystems to accommodate domain-specific requirements. For example, it might be useful to provide a relational database back-end. At present we distribute code to support an interface to both the free and the commercial WAIS index/search engines, Glimpse, and Swish. Below we discuss how to use another index/search engine instead of Glimpse in the Broker, and provide some brief discussion of how to integrate a new index/search engine into the Broker.

55..88..11.. UUssiinngg SSwwiisshh aass aann iinnddeexxeerr

Harvest includes support for using Swish as an indexing engine with the Broker. Swish is a nice alternative to Glimpse if you need faster search support and are willing to lose the more powerful query features. It is also an alternative if there are concerns about Glimpse' copyright status.
To use Swish with an existing Broker, you need to change the _I_n_d_e_x_e_r_-_T_y_p_e variable in _b_r_o_k_e_r_._c_o_n_f to ``Swish''. You can also specify that you want to use Swish for a Broker when you use the RunHarvest command, by running: RunHarvest -swish.

55..88..22.. UUssiinngg WWAAIISS aass aann iinnddeexxeerr

Support for using WAIS (both freeWAIS and WAIS Inc.'s index/search engine) as the Broker's indexing and search subsystem is included in the Harvest distribution. WAIS is a nice alternative to Glimpse if you need faster search support and are willing to lose the more powerful query features.

To use WAIS with an existing Broker, you need to change the _I_n_d_e_x_e_r_-_T_y_p_e variable in _b_r_o_k_e_r_._c_o_n_f to ``WAIS''; you can choose among the WAIS variants by setting the _W_A_I_S_-_F_l_a_v_o_r variable in _b_r_o_k_e_r_._c_o_n_f to ``Commercial-WAIS'', ``freeWAIS'', or ``WAIS''. Otherwise, CreateBroker will ask you if you want to use WAIS, and where the WAIS programs (waisindex, waissearch, waisserver, and with the commercial version of WAIS waisparse) are located. When you run the Broker, a WAIS server will be started automatically after the index is built. You can also specify that you want to use WAIS for a Broker when you use the RunHarvest command, by running: RunHarvest -wais.

55..99.. CCoolllleeccttoorr iinntteerrffaaccee ddeessccrriippttiioonn:: CCoolllleeccttiioonn..ccoonnff

The Broker retrieves indexing information from Gatherers or other Brokers through its _C_o_l_l_e_c_t_o_r interface. A list of collection points is specified in the _a_d_m_i_n_/_C_o_l_l_e_c_t_i_o_n_._c_o_n_f configuration file. This file contains a collection point on each line, with 4 fields. The first field is the host of the remote Gatherer or Broker, the second field is the port number on that host, the third field is the collection type, and the fourth field is the query filter or ---- if there is no filter. The Broker supports various types of collections as described below:

       Type    Remote Process  Description                     Compression?
       --------------------------------------------------------------------
       0       Gatherer        Full collection each time       No
       1       Gatherer        Incremental collections         No
       2       Gatherer        Full collection each time       Yes
       3       Gatherer        Incremental collections         Yes
       4       Broker          Full collection each time       No
       5       Broker          Incremental collections         No
       6       Broker          Collection based on a query     No
       7       Broker          Incremental based on a query    No

The query filter specification for collection types 6 and 7 contains two parts: the ----QQUUEERRYY kkeeyywwoorrddss portion and an optional ----FFLLAAGGSS ffllaaggss portion. The ----QQUUEERRYY portion is passed on to the Broker as the keywords for the query (the keywords can be any Boolean and/or structured query); the ----FFLLAAGGSS portion is passed on to the Broker as the indexer-specific flags to the query.
The following table shows the valid indexer-specific flags for the supported indexers:

       Indexer         Flag                            Description
       -----------------------------------------------------------------------------
       All:            #desc                           Show Description Lines
       Glimpse:        #index case insensitive         Case Insensitive
                       #index case sensitive           Case sensitive
                       #index error number             Allow "number" errors
                       #index matchword                Matches on word boundaries
                       #index maxresult number         Allow max of "number" results
                       #opaque                         Show matched lines
       Wais:           #index maxresult number         Allow max of "number" results
                       #opaque                         Show scores and rankings

The following is an example _C_o_l_l_e_c_t_i_o_n_._c_o_n_f, which collects information from 2 Gatherers (one using compressed incremental collections and the other uncompressed full transfers), and collects information from 3 Brokers (one incrementally based on a timestamp, and the others using query filters):

       gatherer-host1.foo.com 8500 3 --
       gatherer-host2.foo.com 8500 0 --
       broker-host1.foo.com 8501 5 --
       broker-host2.foo.com 8501 6 --QUERY (URL : document) AND gnu
       broker-host3.foo.com 8501 7 --QUERY Harvest --FLAGS #index case sensitive

55..1100.. TTrroouubblleesshhoooottiinngg

SSyymmppttoomm
       The Broker is running but always returns _e_m_p_t_y _q_u_e_r_y _r_e_s_u_l_t_s.

SSoolluuttiioonn
       Look at the log messages in the broker.out file in the Broker's directory for error messages. If your Broker didn't index the data, use the administrative interface to force the Broker to build the index (see Section ``Administrating a Broker'').

SSyymmppttoomm
       When I query my Broker, I get a "500 Server Error".

SSoolluuttiioonn
       Generally, the ``500'' errors are related to a CGI program not working correctly or a misconfigured httpd server. Make sure that the userid running the HTTP server has access to the Harvest cgi-bin directory and the Perl include files in _$_H_A_R_V_E_S_T___H_O_M_E_/_l_i_b. Refer to Section ``Additional installation for the Harvest Broker'' for further details.

SSyymmppttoomm
       I see _d_u_p_l_i_c_a_t_e _d_o_c_u_m_e_n_t_s in my Broker.

SSoolluuttiioonn
       The Broker performs duplicate elimination based on a combination of MD5 checksums and the Gatherer-Host, Gatherer-Name, and Gatherer-Version attributes. Therefore, you can end up with duplicate documents if your Broker collects from more than one Gatherer, each of which gathers from (a subset of) the same URLs. (As an aside, the reason for this notion of duplicate elimination is to allow a single Broker to contain several different SOIF objects for the same URL, but summarized in different ways.) Two solutions to the problem are:

       1. Run your Gatherers on the same host.

       2. Remove the duplicate URLs in a customized version of the search.cgi program by doing a string comparison of the URLs.

SSyymmppttoomm
       The Broker takes a _l_o_n_g _t_i_m_e and does not answer queries.

SSoolluuttiioonn
       Some queries are quite expensive, because they involve a great deal of I/O. For this reason we modified the Broker so that if a query takes longer than 5 minutes, the query process is killed. The best solution is to use a less expensive query, for example by using less common keywords.

SSyymmppttoomm
       Some of the _q_u_e_r_y _o_p_t_i_o_n_s (such as structured or case sensitive queries) _a_r_e_n_'_t _w_o_r_k_i_n_g.

SSoolluuttiioonn
       This usually means you are using an index/search engine that does not support structured queries (like the current Harvest support for commercial WAIS).
       If you are setting up your own Broker (rather than using someone else's Broker), see Section ``Using different index/search engines with the Broker'' for details on how to switch to other index/search engines. Or, it could be that your search.cgi program is an old version and should be updated.

SSyymmppttoomm
       I get _s_y_n_t_a_x _e_r_r_o_r_s when I specify queries.

SSoolluuttiioonn
       Usually this means you did not use double quotes where needed. See Section ``Querying a Broker''.

SSyymmppttoomm
       When I submit a query, I get an _a_n_s_w_e_r _f_a_s_t_e_r _t_h_a_n _I _c_a_n _b_e_l_i_e_v_e it takes to perform the query, and the answer contains _g_a_r_b_a_g_e _d_a_t_a.

SSoolluuttiioonn
       This probably indicates that your httpd is misconfigured. A common case is not putting the 'ScriptAlias' before the 'Alias' in your _c_o_n_f_/_h_t_t_p_d_._c_o_n_f file when running the Apache httpd. See Section ``Additional installation for the Harvest Broker''.

SSyymmppttoomm
       When I make _c_h_a_n_g_e_s to the Broker configuration via the _a_d_m_i_n_i_s_t_r_a_t_i_o_n _i_n_t_e_r_f_a_c_e, they are _l_o_s_t after the Broker is restarted.

SSoolluuttiioonn
       The Broker administration interface does not save changes across sessions. Permanent changes to the Broker configuration should be done through the _b_r_o_k_e_r_._c_o_n_f file.

SSyymmppttoomm
       My Broker is _r_u_n_n_i_n_g _v_e_r_y _s_l_o_w_l_y.

SSoolluuttiioonn
       Performance tuning can be complicated, but the most likely problem is that you are running on a machine with insufficient RAM, and paging a lot because the query engine kicks pages out in order to access the needed index and data files. (In UNIX the disk buffer cache competes with program and data pages for memory.) A simple way to tell is to run ``vmstat 5'' in one window, and after a couple of lines of output, issue a query from another window. This will print a line of measurements about the virtual memory status of your machine every 5 seconds. In particular, look at the ``pi'' and ``po'' columns. If the numbers suddenly jump into the 500-1,000 range after you issue the query, you are paging a lot. Note that paging problems are accentuated by running simultaneous memory-intensive or disk I/O-intensive programs on your machine. Simultaneous queries to a single Broker should not cause a paging problem, because the Broker processes the queries sequentially. It is best to run Brokers on an otherwise mostly unused machine with at least 128 MB of RAM (or more, if the above ``vmstat'' experiment indicates you are paging a lot).

       One other performance enhancer is to run an _h_t_t_p_d_-_a_c_c_e_l_e_r_a_t_o_r on your Broker machine, to intercept queries headed for your Broker. While it will not cache the results of queries, it will reduce load on the machine because it provides a very efficient means of returning results in the case of concurrent queries. Without the accelerator the results are sent back by a search.cgi UNIX process per query, and inefficiently time sliced by the UNIX kernel. With an accelerator the search.cgi processes exit quickly, and let the accelerator send the results back to the concurrent users. The accelerator will also reduce load for (non-query) retrievals of data from your httpd server.

66.. PPrrooggrraammss aanndd llaayyoouutt ooff tthhee iinnssttaalllleedd HHaarrvveesstt ssooffttwwaarree

66..11.. $$HHAARRVVEESSTT__HHOOMMEE

The top directory where you installed Harvest is known as _$_H_A_R_V_E_S_T___H_O_M_E. By default, _$_H_A_R_V_E_S_T___H_O_M_E is _/_u_s_r_/_l_o_c_a_l_/_h_a_r_v_e_s_t.
The following files and directories are located in _$_H_A_R_V_E_S_T___H_O_M_E:

       RunHarvest*     brokers/        gatherers/      tmp/
       bin/            cgi-bin/        lib/

RunHarvest is the script used to create and run Harvest servers (see Section ``Starting up the system: RunHarvest and related commands''). RunHarvest has the same command line syntax as Harvest.

66..22.. $$HHAARRVVEESSTT__HHOOMMEE//bbiinn

The _$_H_A_R_V_E_S_T___H_O_M_E_/_b_i_n directory only contains programs that users would normally run directly. All other programs (e.g., individual summarizers for the Gatherer) as well as Perl library code are in the _l_i_b directory. The _b_i_n directory contains the following programs:

CreateBroker
       Creates a Broker.
       Usage: CreateBroker [skeleton-tree [destination]]

Gatherer
       Main user interface to the Gatherer. This program is run by the RunGatherer script found in a Gatherer's directory.
       Usage: Gatherer [-manual|-export|-debug] file.cf

Harvest
       The program used by RunHarvest to create and run Harvest servers as per the user's description.
       Usage: Harvest [flags]
       where flags can be any of the following:
       -novice         Simplest Q&A. Mostly uses the defaults.
       -glimpse        Use Glimpse for the Broker. (default)
       -swish          Use Swish for the Broker.
       -wais           Use WAIS for the Broker.
       -dumbtty        Dumb TTY mode.
       -debug          Debug mode.
       -dont-run       Don't run the Broker or the Gatherer.
       -fake           Doesn't build the Harvest servers.
       -protect        Don't change the umask.

broker
       The Broker program. This program is run by the RunBroker script found in a Broker's directory. Logs messages to both _b_r_o_k_e_r_._o_u_t and to _a_d_m_i_n_/_L_O_G.
       Usage: broker [broker.conf file] [-nocol]

gather
       The client interface to the Gatherer.
       Usage: gather [-info] [-nocompress] host port [timestamp]

66..33.. $$HHAARRVVEESSTT__HHOOMMEE//bbrrookkeerrss

The _$_H_A_R_V_E_S_T___H_O_M_E_/_b_r_o_k_e_r_s directory contains images and logos in the _i_m_a_g_e_s directory, some basic tutorial HTML pages, and the skeleton files that CreateBroker uses to construct new Brokers. You can change the default values in these created Brokers by editing the files in _s_k_e_l_e_t_o_n.

66..44.. $$HHAARRVVEESSTT__HHOOMMEE//ccggii--bbiinn

The _$_H_A_R_V_E_S_T___H_O_M_E_/_c_g_i_-_b_i_n directory contains the programs needed for the WWW interface to the Broker (described in Section ``CGI programs'') and configuration files for search.cgi in the _l_i_b subdirectory.

66..55.. $$HHAARRVVEESSTT__HHOOMMEE//ggaatthheerreerrss

The _$_H_A_R_V_E_S_T___H_O_M_E_/_g_a_t_h_e_r_e_r_s directory contains example Gatherers discussed in Section ``Gatherer Examples''. RunHarvest, by default, will create the new Gatherer in this directory.

66..66.. $$HHAARRVVEESSTT__HHOOMMEE//lliibb

The _$_H_A_R_V_E_S_T___H_O_M_E_/_l_i_b directory contains a number of Perl library routines and other programs needed by various parts of Harvest, as follows:

_c_h_a_t_2_._p_l_, _f_t_p_._p_l_, _s_o_c_k_e_t_._p_h
       Perl libraries used to communicate with remote FTP servers.

_d_a_t_e_c_o_n_v_._p_l_, _l_s_p_a_r_s_e_._p_l_, _t_i_m_e_l_o_c_a_l_._p_l
       Perl libraries used to parse ls output.

ftpget
       Program used to retrieve files and directories from FTP servers.
       Usage: ftpget [-htmlify] localfile hostname filename A,I username password

gopherget.pl
       Perl program used to retrieve files and menus from Gopher servers.
       Usage: gopherget.pl localfile hostname port command

harvest-check.pl
       Perl program to check whether gatherers and brokers are up.
       Usage: harvest-check.pl [-v]

md5
       Program used to compute MD5 checksums.
       Usage: md5 file [...]
newsget.pl
       Perl program used to retrieve USENET articles and group summaries from NNTP servers.
       Usage: newsget.pl localfile news-URL

_s_o_i_f_._p_l_, _s_o_i_f_-_m_e_m_-_e_f_f_i_c_i_e_n_t_._p_l
       Perl libraries used to process SOIF.

urlget
       Program used to retrieve a URL.
       Usage: urlget URL

urlpurge
       Program to purge the local disk URL cache used by urlget and the Gatherer.
       Usage: urlpurge

66..77.. $$HHAARRVVEESSTT__HHOOMMEE//lliibb//bbrrookkeerr

The _$_H_A_R_V_E_S_T___H_O_M_E_/_l_i_b_/_b_r_o_k_e_r directory contains the search and index programs needed by the Broker, plus several utility programs needed for Broker administration, as follows:

BrokerRestart
       This program will issue a restart command to a broker.
       Usage: BrokerRestart [-password passwd] host port

brkclient
       Client interface to the broker. Can be used to send queries or administrative commands to a broker.
       Usage: brkclient hostname port command-string

dumpregistry
       Prints the Broker's Registry file in a human-readable format.
       Usage: dumpregistry [-count] [BrokerDirectory]

agrep, glimpse, glimpseindex, glimpseindex.bin, glimpseserver
       The Glimpse indexing and search system as described in Section ``The Broker''.

swish
       The Swish indexing and search program as an alternative to Glimpse.

info-to-html.pl, mkbrokerstats.pl
       Perl programs used to generate Broker statistics and to create _s_t_a_t_s_._h_t_m_l.
       Usage: gather -info host port | info-to-html.pl > host.port.html
       Usage: mkbrokerstats.pl broker-dir > stats.html

66..88.. $$HHAARRVVEESSTT__HHOOMMEE//lliibb//ggaatthheerreerr

The _$_H_A_R_V_E_S_T___H_O_M_E_/_l_i_b_/_g_a_t_h_e_r_e_r directory contains the default summarizers described in Section ``Extracting data for indexing: The Essence summarizing subsystem'', plus various utility programs needed by the summarizers and the Gatherer, as follows:

_U_R_L_-_f_i_l_t_e_r_-_d_e_f_a_u_l_t
       Default URL filter as described in Section ``RootNode specifications''.

_b_y_c_o_n_t_e_n_t_._c_f_, _b_y_n_a_m_e_._c_f_, _b_y_u_r_l_._c_f_, _m_a_g_i_c_, _s_t_o_p_l_i_s_t_._c_f_, _q_u_i_c_k_-_s_u_m_._c_f
       Essence configuration files as described in Section ``Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps''.

*.sum
       Essence summarizers as discussed in Section ``Extracting data for indexing: The Essence summarizing subsystem''.

HTML-sum.pl
       Alternative HTML summarizer written in Perl.

HTMLurls
       Program to extract URLs from an HTML file.
       Usage: HTMLurls [--base-url url] filename

catdoc, xls2csv, _c_a_t_d_o_c_-_l_i_b
       Programs and files used by the Microsoft Word summarizer.

dvi2tty, print-c-comments, ps2txt, ps2txt-2.1, pstext, skim
       Programs used by various summarizers.

gifinfo
       Program to support summarizers.

l2h
       Program used by the TeX summarizer.

rast, sgmls, sgmlsasp, _s_g_m_l_s_-_l_i_b
       Programs and files used by the SGML summarizer.

rtf2html
       Program used by the RTF summarizer.

wp2x, wp2x.sh, _w_p_2_x_-_l_i_b
       Programs and files used by the WordPerfect summarizer.

hexbin, unshar, uudecode
       Programs used to unnest nested objects.

cksoif
       Program used to check the validity of a SOIF stream (e.g., to ensure that there are no parsing errors).
       Usage: cksoif < INPUT.soif

cleandb, consoldb, expiredb, folddb, mergedb, mkgathererstats.pl, mkindex, rmbinary
       Programs used to prepare a Gatherer's database to be exported by gatherd.
       cleandb ensures that all SOIF objects are valid, and deletes any that are not; consoldb will consolidate n GDBM database files into a single GDBM database file; expiredb deletes any SOIF objects that are no longer valid as defined by their _T_i_m_e_-_t_o_-_L_i_v_e attribute; folddb runs all of the operations needed to prepare the Gatherer's database for export by gatherd; mergedb consolidates GDBM files as described in Section ``Incorporating manually generated information into a Gatherer''; mkgathererstats.pl generates the _I_N_F_O_._s_o_i_f statistics file; mkindex generates the cache of timestamps; and rmbinary removes binary data from a GDBM database.

enum, prepurls, staturl
       Programs used by the Gatherer to perform the RootNode and LeafNode enumeration as described in Section ``RootNode specifications''. enum performs a RootNode enumeration on the given URLs; prepurls is a wrapper program used to pipe the Gatherer and essence together; staturl retrieves LeafNode URLs to determine if the URL has been modified or not.

fileenum, ftpenum, ftpenum.pl, gopherenum-*, httpenum-*, newsenum
       Programs used by enum to perform protocol-specific enumeration. fileenum performs a RootNode enumeration on ``file'' URLs; ftpenum calls ftpenum.pl to perform a RootNode enumeration on ``ftp'' URLs; gopherenum-breadth performs a breadth first RootNode enumeration on ``gopher'' URLs; gopherenum-depth performs a depth first RootNode enumeration on ``gopher'' URLs; httpenum-breadth performs a breadth first RootNode enumeration on ``http'' URLs; httpenum-depth performs a depth first RootNode enumeration on ``http'' URLs; newsenum performs a RootNode enumeration on ``news'' URLs.

essence
       The Essence content extraction system as described in Section ``Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps''.
       Usage: essence [options] -f input-URLs
       or essence [options] URL ...
       where options are:
       --dbdir directory       Directory to place database
       --full-text             Use entire file instead of summarizing
       --gatherer-host         Gatherer-Host value
       --gatherer-name         Gatherer-Name value
       --gatherer-version      Gatherer-Version value
       --help                  Print usage information
       --libdir directory      Directory to place configuration files
       --log logfile           Name of the file to log messages to
       --max-deletions n       Number of GDBM deletions before reorganization
       --minimal-bookkeeping   Generates a minimal amount of bookkeeping attrs
       --no-access             Do not read contents of objects
       --no-keywords           Do not automatically generate keywords
       --allowlist filename    File with list of types to allow
       --stoplist filename     File with list of types to remove
       --tmpdir directory      Name of directory to use for temporary files
       --type-only             Only type data; do not summarize objects
       --verbose               Verbose output
       --version               Version information

print-attr
       Reads in a SOIF stream from stdin and prints the data associated with the given attribute to stdout.
       Usage: cat SOIF-file | print-attr Attribute

gatherd, in.gatherd
       Daemons that export the Gatherer's database. in.gatherd is used to run this daemon from inetd.
       Usage: gatherd [-db | -index | -log | -zip | -cf file] [-dir dir] port
       Usage: in.gatherd [-db | -index | -log | -zip | -cf file] [-dir dir]

gdbmutil
       Program to perform various operations on a GDBM database.
       Usage: gdbmutil consolidate [-d | -D] master-file file [file ...]
       Usage: gdbmutil delete file key
       Usage: gdbmutil dump file
       Usage: gdbmutil fetch file key
       Usage: gdbmutil keys file
       Usage: gdbmutil print [-gatherd] file
       Usage: gdbmutil reorganize file
       Usage: gdbmutil restore file
       Usage: gdbmutil sort file
       Usage: gdbmutil stats file
       Usage: gdbmutil store file key < data

mktemplate
       Program to generate valid SOIF based on a more easily editable SOIF-like format (e.g., SOIF without the byte counts).
       Usage: mktemplate < INPUT.txt > OUTPUT.soif

quick-sum
       Simple Perl program to emulate Essence's _q_u_i_c_k_-_s_u_m_._c_f processing for those who cannot compile Essence with the corresponding C code.

template2db
       Converts a stream of SOIF objects (from stdin or given files) into a GDBM database.
       Usage: template2db database [tmpl tmpl...]

wrapit
       Wraps the data from stdin into a SOIF attribute-value pair with a byte count. Used by Essence summarizers to easily generate SOIF.
       Usage: wrapit [Attribute]

kill-gatherd
       Script to kill the gatherd process.

66..99.. $$HHAARRVVEESSTT__HHOOMMEE//ttmmpp

The _$_H_A_R_V_E_S_T___H_O_M_E_/_t_m_p directory is used by search.cgi to store search result pages.

77.. TThhee SSuummmmaarryy OObbjjeecctt IInntteerrcchhaannggee FFoorrmmaatt ((SSOOIIFF))

Harvest Gatherers and Brokers communicate using an attribute-value stream protocol called the _S_u_m_m_a_r_y _O_b_j_e_c_t _I_n_t_e_r_c_h_a_n_g_e _F_o_r_m_a_t _(_S_O_I_F_), an example of which is available in Section ``Example 1''. Gatherers generate content summaries for individual objects in SOIF, and serve these summaries to Brokers that wish to collect and index them. SOIF provides a means of bracketing collections of summary objects, allowing Harvest Brokers to retrieve SOIF content summaries from a Gatherer for many objects in a single, efficient compressed stream. Harvest Brokers provide support for querying SOIF data using structured attribute-value queries and many other types of queries, as discussed in Section ``Querying a Broker''.

77..11.. FFoorrmmaall ddeessccrriippttiioonn ooff SSOOIIFF

The SOIF Grammar is as follows:

       SOIF            ::= OBJECT SOIF | OBJECT
       OBJECT          ::= @ TEMPLATE-TYPE { URL ATTRIBUTE-LIST }
       ATTRIBUTE-LIST  ::= ATTRIBUTE ATTRIBUTE-LIST | ATTRIBUTE
       ATTRIBUTE       ::= IDENTIFIER {VALUE-SIZE} DELIMITER VALUE
       TEMPLATE-TYPE   ::= Alpha-Numeric-String
       IDENTIFIER      ::= Alpha-Numeric-String
       VALUE           ::= Arbitrary-Data
       VALUE-SIZE      ::= Number
       DELIMITER       ::= ":<tab>"

77..22.. LLiisstt ooff ccoommmmoonn SSOOIIFF aattttrriibbuuttee nnaammeess

Each Broker can support different attributes, depending on the data it holds. Below we list a set of the most common attributes:

Abstract
       Brief abstract about the object.

Author
       Author(s) of the object.

Description
       Brief description about the object.

File-Size
       Number of bytes in the object.

Full-Text
       Entire contents of the object.

Gatherer-Host
       Host on which the Gatherer ran to extract information from the object.

Gatherer-Name
       Name of the Gatherer that extracted information from the object (e.g., Full-Text, Selected-Text, or Terse).

Gatherer-Port
       Port number on the Gatherer-Host that serves the Gatherer's information.

Gatherer-Version
       Version number of the Gatherer.

Update-Time
       The time that the Gatherer updated the content summary for the object.

Keywords
       Searchable keywords extracted from the object.

Last-Modification-Time
       The time that the object was last modified.

MD5
       MD5 16-byte checksum of the object.

Refresh-Rate
       The number of seconds after Update-Time when the summary object is to be re-generated. Defaults to 1 month.
Time-to-Live
       The number of seconds after Update-Time when the summary object is no longer valid. Defaults to 6 months.

Title
       Title of the object.

Type
       The object's type. Some example types are:

       Archive Audio Awk Backup Binary C CHeader Command Compressed
       CompressedTar Configuration Data Directory DotFile Dvi FAQ FYI Font
       FormattedText GDBM GNUCompressed GNUCompressedTar HTML Image
       Internet-Draft MacCompressed Mail Makefile ManPage Object OtherCode
       PCCompressed Patch Pdf Perl PostScript RCS README RFC RTF SCCS
       ShellArchive Tar Tcl Tex Text Troff Uuencoded WaisSource

Update-Time
       The time that the summary object was last updated. REQUIRED field, no default.

URI
       Uniform Resource Identifier.

URL-References
       Any URL references present within HTML objects.

88.. GGaatthheerreerr EExxaammpplleess

The Harvest distribution contains several examples of how to configure, customize, and run Gatherers. They install into _$_H_A_R_V_E_S_T___H_O_M_E_/_g_a_t_h_e_r_e_r_s by default (see Section ``Installing the Harvest Software''). This section will walk you through several example Gatherers. The goal is to give you a sense of what you can do with a Gatherer and how to do it. You needn't work through all of the examples; each is instructive in its own right.

To use the Gatherer examples, you need the Harvest binary directory in your path, and _H_A_R_V_E_S_T___H_O_M_E defined. For example:

       % setenv HARVEST_HOME /usr/local/harvest
       % set path = ($HARVEST_HOME/bin $path)

88..11.. EExxaammppllee 11 -- AA ssiimmppllee GGaatthheerreerr

This example is a simple Gatherer that uses the default customizations. The only work that the user does to configure this Gatherer is to specify the list of URLs from which to gather (see Section ``The Gatherer''). To run this example, type:

       % cd $HARVEST_HOME/gatherers/example-1
       % ./RunGatherer

To view the configuration file for this Gatherer, look at _e_x_a_m_p_l_e_-_1_._c_f. The first few lines are variables that specify some local information about the Gatherer (see Section ``Setting variables in the Gatherer configuration file''). For example, each content summary will contain the name of the Gatherer (GGaatthheerreerr--NNaammee) that generated it. The port number (GGaatthheerreerr--PPoorrtt) that will be used to export the indexing information is also specified, as is the directory that contains the Gatherer (TToopp--DDiirreeccttoorryy). Notice that there is one RootNode URL and one LeafNode URL; a sketch of what such a configuration file looks like is shown below.
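For reference, a minimal Gatherer configuration file of this shape typically looks something like the following sketch. The values below are illustrative placeholders, not a verbatim copy of the distributed _e_x_a_m_p_l_e_-_1_._c_f; see Section ``Setting variables in the Gatherer configuration file'' for the full set of variables.

       #  Sketch of a simple Gatherer configuration file (values are placeholders)
       Gatherer-Name:  Example Gatherer Number 1
       Gatherer-Port:  9111
       Top-Directory:  /usr/local/harvest/gatherers/example-1

       <RootNodes>
       http://harvest.cs.colorado.edu/
       </RootNodes>

       <LeafNodes>
       http://some.host.example/some/leaf/document.html
       </LeafNodes>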
After the Gatherer has finished, it will start up the Gatherer daemon which will export the content summaries. To view the content summaries, type:

       % gather localhost 9111 | more

The following SOIF object should look similar to those that this Gatherer generates.

       @FILE { http://harvest.cs.colorado.edu/~schwartz/IRTF.html
       Time-to-Live{7}:        9676800
       Last-Modification-Time{1}:      0
       Refresh-Rate{7}:        2419200
       Gatherer-Name{25}:      Example Gatherer Number 1
       Gatherer-Host{22}:      powell.cs.colorado.edu
       Gatherer-Version{3}:    0.4
       Update-Time{9}: 781478043
       Type{4}:        HTML
       File-Size{4}:   2099
       MD5{32}:        c2fa35fd44a47634f39086652e879170
       Partial-Text{151}:      research problems Mic Bowman Peter Danzig Udi Manber
       Michael Schwartz Darren Hardy talk talk Harvest talk Advanced Research
       Projects Agency
       URL-References{628}:    ftp://ftp.cs.colorado.edu/pub/cs/techreports/schwartz/RD.ResearchProblems.Jour.ps.Z
       ftp://grand.central.org/afs/transarc.com/public/mic/html/Bio.html
       http://excalibur.usc.edu/people/danzig.html
       http://glimpse.cs.arizona.edu:1994/udi.html
       http://harvest.cs.colorado.edu/~schwartz/Home.html
       http://harvest.cs.colorado.edu/~hardy/Home.html
       ftp://ftp.cs.colorado.edu/pub/cs/misc/schwartz/HPCC94.Slides.ps.Z
       ftp://ftp.cs.colorado.edu/pub/cs/misc/schwartz/HPC94.Slides.ps.Z
       http://harvest.cs.colorado.edu/harvest/Home.html
       ftp://ftp.cs.colorado.edu/pub/cs/misc/schwartz/IETF.Jul94.Slides.ps.Z
       http://ftp.arpa.mil/ResearchAreas/NETS/Internet.html
       Title{84}:      IRTF Research Group on Resource Discovery IRTF Research Group on Resource Discovery
       Keywords{121}:  advanced agency bowman danzig darren hardy harvest manber mic
       michael peter problems projects research schwartz talk udi
       }

Notice that although the Gatherer configuration file lists only 2 URLs (one in the RootNode section and one in the LeafNode section), there are more than 2 content summaries in the Gatherer's database. The Gatherer expanded the RootNode URL into dozens of LeafNode URLs by recursively extracting the links from the HTML file at the RootNode _h_t_t_p_:_/_/_h_a_r_v_e_s_t_._c_s_._c_o_l_o_r_a_d_o_._e_d_u_/. Then, for each LeafNode given to the Gatherer, it generated a content summary, as in the above example summary for _h_t_t_p_:_/_/_h_a_r_v_e_s_t_._c_s_._c_o_l_o_r_a_d_o_._e_d_u_/_~_s_c_h_w_a_r_t_z_/_I_R_T_F_._h_t_m_l. The HTML summarizer will extract structured information about the Author and Title of the file. It will also extract any URL links into the _U_R_L_-_R_e_f_e_r_e_n_c_e_s attribute, and any anchor tags into the _P_a_r_t_i_a_l_-_T_e_x_t attribute. Other information about the HTML file, such as its MD5 (see RFC1321) and its size (_F_i_l_e_-_S_i_z_e) in bytes, is also added to the content summary.

88..22.. EExxaammppllee 22 -- IInnccoorrppoorraattiinngg mmaannuuaallllyy ggeenneerraatteedd iinnffoorrmmaattiioonn

The Gatherer is able to ``explode'' a resource into a stream of content summaries. This is useful for files that contain manually generated information that may describe one or more resources, or for building a gateway between various structured formats and SOIF (see Section ``The Summary Object Interchange Format (SOIF)''). This example demonstrates an exploder for the Linux Software Map (LSM) format. LSM files contain structured information (like the author, location, etc.) about software available for the Linux operating system. To run this example, type:

       % cd $HARVEST_HOME/gatherers/example-2
       % ./RunGatherer

To view the configuration file for this Gatherer, look at _e_x_a_m_p_l_e_-_2_._c_f. Notice that the Gatherer has its own _L_i_b_-_D_i_r_e_c_t_o_r_y (see Section ``Setting variables in the Gatherer configuration file'' for help on writing configuration files). The library directory contains the typing and candidate selection customizations for Essence.
In this example, we've only customized the candidate selection step. _l_i_b_/_s_t_o_p_l_i_s_t_._c_f defines the types that Essence should not index. This example uses an empty _s_t_o_p_l_i_s_t_._c_f file to direct Essence to index all files.

The Gatherer retrieves each of the LeafNode URLs, which are all Linux Software Map files from the Linux FTP archive _t_s_x_-_1_1_._m_i_t_._e_d_u. The Gatherer recognizes that a ``.lsm'' file is of type _L_S_M because of the naming heuristic present in _l_i_b_/_b_y_n_a_m_e_._c_f. The _L_S_M type is a ``nested'' type as specified in the Essence source code (_s_r_c_/_g_a_t_h_e_r_e_r_/_e_s_s_e_n_c_e_/_u_n_n_e_s_t_._c). Exploder programs (named TypeName.unnest) are run on nested types rather than the usual summarizers. The LSM.unnest program is the standard exploder program that takes an _L_S_M file and generates one or more corresponding SOIF objects. When the Gatherer finishes, its database contains one or more SOIF objects for the software described within each _L_S_M file.

After the Gatherer has finished, it will start up the Gatherer daemon which will serve the content summaries. To view the content summaries, type:

       % gather localhost 9222 | more

Because _t_s_x_-_1_1_._m_i_t_._e_d_u is a popular and heavily loaded archive, the Gatherer often won't be able to retrieve the LSM files. If you suspect that something went wrong, look in _l_o_g_._e_r_r_o_r_s and _l_o_g_._g_a_t_h_e_r_e_r to try to determine the problem.

The following two SOIF objects were generated by this Gatherer. The first object summarizes the _L_S_M file itself, and the second object summarizes the software described in the _L_S_M file.

       @FILE { ftp://tsx-11.mit.edu/pub/linux/docs/linux-doc-project/man-pages-1.4.lsm
       Time-to-Live{7}:        9676800
       Last-Modification-Time{9}:      781931042
       Refresh-Rate{7}:        2419200
       Gatherer-Name{25}:      Example Gatherer Number 2
       Gatherer-Host{22}:      powell.cs.colorado.edu
       Gatherer-Version{3}:    0.4
       Type{3}:        LSM
       Update-Time{9}: 781931042
       File-Size{3}:   848
       MD5{32}:        67377f3ea214ab680892c82906081caf
       }

       @FILE { ftp://ftp.cs.unc.edu/pub/faith/linux/man-pages-1.4.tar.gz
       Time-to-Live{7}:        9676800
       Last-Modification-Time{9}:      781931042
       Refresh-Rate{7}:        2419200
       Gatherer-Name{25}:      Example Gatherer Number 2
       Gatherer-Host{22}:      powell.cs.colorado.edu
       Gatherer-Version{3}:    0.4
       Update-Time{9}: 781931042
       Type{16}:       GNUCompressedTar
       Title{48}:      Section 2, 3, 4, 5, 7, and 9 man pages for Linux
       Version{3}:     1.4
       Description{124}:       Man pages for Linux. Mostly section 2 is complete. Section 3
       has over 200 man pages, but it still far from being finished.
       Author{27}:     Linux Documentation Project
       AuthorEmail{11}:        DOC channel
       Maintainer{9}:  Rik Faith
       MaintEmail{16}: faith@cs.unc.edu
       Site{45}:       ftp.cs.unc.edu sunsite.unc.edu tsx-11.mit.edu
       Path{94}:       /pub/faith/linux /pub/Linux/docs/linux-doc-project/man-pages
       /pub/linux/docs/linux-doc-project
       File{20}:       man-pages-1.4.tar.gz
       FileSize{4}:    170k
       CopyPolicy{47}: Public Domain or otherwise freely distributable
       Keywords{10}:   man pages
       Entered{24}:    Sun Sep 11 19:52:06 1994
       EnteredBy{9}:   Rik Faith
       CheckedEmail{16}:       faith@cs.unc.edu
       }

We've also built a Gatherer that explodes about a half-dozen index files from various PC archives into more than 25,000 content summaries. Each of these index files contains hundreds of one-line descriptions of PC software distributions that are available via anonymous FTP.
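The LSM exploder is specific to one format, but hand-maintained records can also be brought into a Gatherer without writing an exploder: a template in the byte-count-free SOIF-like format can be converted into valid SOIF with the mktemplate program (see Section ``$HARVEST_HOME/lib/gatherer'') and then incorporated as described in Section ``Incorporating manually generated information into a Gatherer''. The sketch below assumes that input format; the URL, attributes, and filenames are made up for illustration, and the exact input conventions are described in that section.

       @FILE { http://www.example.org/pkg/foo-1.0.tar.gz
       Title:  Foo 1.0 source distribution
       Author: Jane Doe
       Description:    A hand-written record describing a package.
       }

       % mktemplate < foo.txt > foo.soif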
88..33.. EExxaammppllee 33 -- CCuussttoommiizziinngg ttyyppee rreeccooggnniittiioonn aanndd ccaannddiiddaattee sseelleeccttiioonn

This example demonstrates how to customize the type recognition and candidate selection steps in the Gatherer (see Section ``Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps''). This Gatherer recognizes World Wide Web home pages, and is configured only to collect indexing information from these home pages. To run this example, type:

       % cd $HARVEST_HOME/gatherers/example-3
       % ./RunGatherer

To view the configuration file for this Gatherer, look at _e_x_a_m_p_l_e_-_3_._c_f. As in Section ``Example 2'', this Gatherer has its own library directory that contains a customization for Essence. Since we're only interested in indexing home pages, we need only define the heuristics for recognizing home pages. As shown below, we can use URL naming heuristics to define a home page in _l_i_b_/_b_y_u_r_l_._c_f. We've also added a default _U_n_k_n_o_w_n type to make candidate selection easier in this file.

       HomeHTML        ^http:.*/$
       HomeHTML        ^http:.*[hH]ome\.html$
       HomeHTML        ^http:.*[hH]ome[pP]age\.html$
       HomeHTML        ^http:.*[wW]elcome\.html$
       HomeHTML        ^http:.*/index\.html$

The _l_i_b_/_s_t_o_p_l_i_s_t_._c_f configuration file contains a list of types not to index. In this example, _U_n_k_n_o_w_n is the only type name listed in _s_t_o_p_l_i_s_t_._c_f, so the Gatherer will only reject files of the _U_n_k_n_o_w_n type. You can also recognize URLs by their filename (in _b_y_n_a_m_e_._c_f) or by their content (in _b_y_c_o_n_t_e_n_t_._c_f and _m_a_g_i_c), although in this example we don't need to use those mechanisms. The default HomeHTML.sum summarizer summarizes each _H_o_m_e_H_T_M_L file.

After the Gatherer has finished, it will start up the Gatherer daemon which will serve the content summaries. You'll notice that only content summaries for HomeHTML files are present. To view the content summaries, type:

       % gather localhost 9333 | more

88..44.. EExxaammppllee 44 -- CCuussttoommiizziinngg ttyyppee rreeccooggnniittiioonn aanndd ssuummmmaarriizziinngg

This example demonstrates how to customize the type recognition and summarizing steps in the Gatherer (see Section ``Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps''). This Gatherer recognizes two new file formats and summarizes them appropriately. To view the configuration file for this Gatherer, look at _e_x_a_m_p_l_e_-_4_._c_f. As in the examples in ``Example 2'' and ``Example 3'', this Gatherer has its own library directory that contains the configuration files for Essence. The Essence configuration files are the same as the default customization, except for _l_i_b_/_b_y_n_a_m_e_._c_f which contains two customizations for the new file formats.

88..44..11.. UUssiinngg rreegguullaarr eexxpprreessssiioonnss ttoo ssuummmmaarriizzee aa ffoorrmmaatt

The first new format is the ``ReferBibliographic'' type, which is the format that the refer program uses to represent bibliography information. To recognize that a file is in this format, we'll use the convention that the filename ends in ``.referbib''. So, we add that naming heuristic as a type recognition customization. Naming heuristics are represented as regular expressions matched against the filename in the _l_i_b_/_b_y_n_a_m_e_._c_f file:

       ReferBibliographic      ^.*\.referbib$

Now, to write a summarizer for this type, we'll need a sample ReferBibliographic file:
       %A A. S. Tanenbaum
       %T Computer Networks
       %I Prentice Hall
       %C Englewood Cliffs, NJ
       %D 1988

Essence summarizers extract structured information from files. One way to write a summarizer is by using regular expressions to define the extractions. For each type of information that you want to extract from a file, add the regular expression that will match lines in that file to _l_i_b_/_q_u_i_c_k_-_s_u_m_._c_f. For example, the following regular expressions in _l_i_b_/_q_u_i_c_k_-_s_u_m_._c_f will extract the author, title, date, and other information from ReferBibliographic files:

       ReferBibliographic      Author                  ^%A[ \t]+.*$
       ReferBibliographic      City                    ^%C[ \t]+.*$
       ReferBibliographic      Date                    ^%D[ \t]+.*$
       ReferBibliographic      Editor                  ^%E[ \t]+.*$
       ReferBibliographic      Comments                ^%H[ \t]+.*$
       ReferBibliographic      Issuer                  ^%I[ \t]+.*$
       ReferBibliographic      Journal                 ^%J[ \t]+.*$
       ReferBibliographic      Keywords                ^%K[ \t]+.*$
       ReferBibliographic      Label                   ^%L[ \t]+.*$
       ReferBibliographic      Number                  ^%N[ \t]+.*$
       ReferBibliographic      Comments                ^%O[ \t]+.*$
       ReferBibliographic      Page-Number             ^%P[ \t]+.*$
       ReferBibliographic      Unpublished-Info        ^%R[ \t]+.*$
       ReferBibliographic      Series-Title            ^%S[ \t]+.*$
       ReferBibliographic      Title                   ^%T[ \t]+.*$
       ReferBibliographic      Volume                  ^%V[ \t]+.*$
       ReferBibliographic      Abstract                ^%X[ \t]+.*$

The first field in _l_i_b_/_q_u_i_c_k_-_s_u_m_._c_f is the name of the type. The second field is the Attribute under which to extract the information on lines that match the regular expression in the third field.

88..44..22.. UUssiinngg pprrooggrraammss ttoo ssuummmmaarriizzee aa ffoorrmmaatt

The second new file format is the ``Abstract'' type, which is a file that contains only the text of a paper abstract (a format that is common in technical report FTP archives). To recognize that a file is written in this format, we'll use the naming convention that the filename for ``Abstract'' files ends in ``.abs''. So, we add that type recognition customization to the _l_i_b_/_b_y_n_a_m_e_._c_f file as a regular expression:

       Abstract        ^.*\.abs$

Another way to write a summarizer is to write a program or script that takes a filename as the first argument on the command line, extracts the structured information, then outputs the results as a list of SOIF attribute-value pairs. Summarizer programs are named TypeName.sum, so we call our new summarizer Abstract.sum. Remember to place the summarizer program in a directory that is in your path so that the Gatherer can run it. You'll see below that Abstract.sum is a Bourne shell script that takes the first 50 lines of the file, wraps them as the ``Abstract'' attribute, and outputs the result as a SOIF attribute-value pair.

       #!/bin/sh
       #
       # Usage: Abstract.sum filename
       #
       head -50 "$1" | wrapit "Abstract"

88..44..33.. RRuunnnniinngg tthhee eexxaammppllee

To run this example, type:

       % cd $HARVEST_HOME/gatherers/example-4
       % ./RunGatherer

After the Gatherer has finished, it will start up the Gatherer daemon which will serve the content summaries. To view the content summaries, type:

       % gather localhost 9444 | more

88..55.. EExxaammppllee 55 -- UUssiinngg RRoooottNNooddee ffiilltteerrss

This example demonstrates how to use RootNode filters to customize the candidate selection in the Gatherer (see Section ``RootNode filters''). Only items that pass RootNode filters will be retrieved across the network (see Section ``Gatherer enumeration vs. candidate selection'').
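The following is only a rough illustration: it assumes the Allow/Deny line format documented in Section ``RootNode filters'', and the patterns are placeholders rather than the filter shipped with this example. A filter file that skips common image files while admitting everything else might look like this:

       # Illustrative RootNode filter: reject images, accept the rest
       Deny    \.gif$
       Deny    \.jpg$
       Allow   .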
To run this example, type: % cd $HARVEST_HOME/gatherers/example-5 % ./RunGatherer After the Gatherer has finished, it will start up the Gatherer daemon which will serve the content summaries. To view the content summaries, type: % gather localhost 9555 | more 99.. HHiissttoorryy ooff HHaarrvveesstt 99..11.. HHiissttoorryy ooff HHaarrvveesstt +o 1996-01-31: Harvest 1.4pl2 was the last official release by Darren R. Hardy, Michael F. Schwartz, and Duane Wessels. +o 1997-04-21: Harvest 1.5 was released by Simon Wilkinson. +o 1998-06-12: Harvest 1.5.20 was released by Simon Wilkinson. +o 1999-05-26: Harvest-MathNet100.tar.gz released. +o 2000-01-14: harvest-modified-by-RL-Stajsic.tar.gz released. +o 2000-02-07: Harvest 1.6.1 was released by Kang-Jin Lee in cooperation with Simon Wilkinson. +o 2002-10-25: Harvest 1.8.0 was released by Harald Weinreich and Kang-Jin Lee. 99..22.. HHiissttoorryy ooff HHaarrvveesstt UUsseerr''ss MMaannuuaall +o 1996-01-31: Harvest User's Manual for Harvest 1.4.pl2 was written by Darren R. Hardy, Michael F. Schwartz, and Duane Wessels. The document was written in LaTeX. The HTML (converted with LaTeX2HTML) and the Postscript versions were made available to the public. +o 2001-04-27: The HTML version of this document was updated and bundled with the Harvest distribution by Kang-Jin Lee. Notable changes were removing the sections about Harvest Object Cache and the Replicator which are not part of Harvest any more. +o 2002-01-28: This Harvest User's Manual was converted to linuxdoc. It is now available in PostScript, PDF, text and HTML format.