Harvest FAQ
  Kang-Jin Lee lee@arco.de
  2002-12-22

  Harvest frequently asked questions (FAQ) with answers
  ______________________________________________________________________

  Table of Contents


  1. Harvest

     1.1 Where can I get more information about Harvest?
     1.2 Where can I download Harvest?
     1.3 What is Harvest-ng?
     1.4 What is the copyright status of Harvest?
     1.5 Which Operating System do I need to run Harvest?
     1.6 Does Harvest run under Windows NT/2000/XP?
     1.7 What Hardware do I need to use Harvest?
     1.8 Which version of Harvest should I use?
     1.9 What are "harvest-modified-by-RL-Stajsic", "harvest-MathNet", and "harvest-1.5.20-kj"?
     1.10 What are the limits of Harvest?
     1.11 Do I need root access to install and run Harvest?
     1.12 How do I block Harvest from my site? How do I identify Harvest?
     1.13 What can I do to help?

  2. Building Harvest

     2.1 Where can I get bison and flex?
     2.2 How can I install Harvest in "/my/directory/harvest" instead of "/usr/local/harvest"?
     2.3 How can I avoid "syntax error before `regoff_t'" error message when compiling Harvest?
     2.4 Where can I get more information for building Harvest on FreeBSD?

  3. Gatherer

     3.1 Does the Gatherer support cookies?
     3.2 Why doesn't Local-Mapping work?
     3.3 Does the Gatherer gather the Root- and LeafNode-URLs periodically?
     3.4 Can Harvest gather https URLs?
     3.5 When will Harvest be able to gather https URLs?
     3.6 Does Harvest support client based scripting/plugin like Javascript, Flash?
     3.7 Why does the gatherer stop after gathering few pages?
     3.8 How can I index local newsgroups? How can I put hostname into News URL?
     3.9 What do the gatherer options "Search=Breadth" and "Search=Depth" do and which keywords are available for "Search=" option?
     3.10 How can I index html pages generated by cgi scripts? How can I index URLs which has a "?" (question mark) in it?
     3.11 Why is the gatherer so slow? How can I make it faster?
     3.12 Why is the gatherer still so slow?
     3.13 How do I request "304 Not Modified" answers from HTTP servers?
     3.14 Why does Harvest gather different URLs between gatherings?
     3.15 Why has the Gatherer's database vanished after gathering?
     3.16 How can I avoid GDBM files growing very big during Gathering?
     3.17 Can I use Htdig as Gatherer? Can the Broker import data from Htdig?
     3.18 How can I control access to Gatherer's database?
     3.19 Does Harvest's Gatherer support WAP/WML, Gnutella, Napster?
     3.20 How do I gather ftp URLs from wu-ftp daemons?
     3.21 Why doesn't file URLs in LeafNodes work as expected?
     3.22 Why does gathering from a site fail completely or for parts of the site?

  4. Summarizer

     4.1 Why doesn't Post-Summarizing work?
     4.2 How can I summarize meta tags in HTML documents?
     4.3 Why are raw HTML tags in some query results?
     4.4 How can I summarize DVI files?
     4.5 How can I summarize Pdf files?
     4.6 Where can I get pdftotext?
     4.7 How can I improve summarizer for Microsoft Word files?
     4.8 Where can I get wvWare?
     4.9 How can I add support for new file type?
     4.10 How can I use nsgmls instead of sgmls to summarize documents?

  5. Broker

     5.1 How can I start a Broker at boot time?
     5.2 How can I start a Broker without starting a collection?
     5.3 Why don't the documents which I have gathered right now show up in the Broker?
     5.4 Why do I get error messages when I try to access "http://some.host/Harvest/brokers/your-broker-path/" after running $HARVEST_HOME/RunHarvest?
     5.5 Why are NEWS URLs broken? Where are the hostnames in NEWS URLs? How can I follow NEWS URLs?
     5.6 Why don't I get any results if I use a long or complex query string?
     5.7 Can I use wildcards in attribute value for structured queries?
     5.8 Are the attribute names case sensitive?
     5.9 Why doesn't collecting from broker work?
     5.10 How can I customize the Harvest user interface?
     5.11 How do I localize/translate user interface?
     5.12 How can I replace the bundled Glimpse with an other version of Glimpse?

  6. Terms

     6.1 What is a Gatherer?
     6.2 What is Local-Mapping?
     6.3 What is a Summarizer?
     6.4 What is a Broker?

  7. Miscellaneous

     7.1 Who are the maintainers of Harvest?
     7.2 I have found a bug. What should I do?
     7.3 Is there a mailinglist for Harvest? What about a newsgroup?


  ______________________________________________________________________

  11..  HHaarrvveesstt


  11..11..  WWhheerree ccaann II ggeett mmoorree iinnffoorrmmaattiioonn aabboouutt HHaarrvveesstt??

  See Harvest homepage http://harvest.sourceforge.net/ for informations
  about Harvest.


  11..22..  WWhheerree ccaann II ddoowwnnllooaadd HHaarrvveesstt??

  Harvest is available for download at Harvest download page
  http://prdownloads.sourceforge.net/harvest/.


  11..33..  WWhhaatt iiss HHaarrvveesstt--nngg??

  Harvest-ng is a reimplementation of Harvest's gatherer by Simon
  Wilkinson. You can get more info about Harvest-ng at Harvest-ng
  homepage http://webharvest.sourceforge.net/ng/.


  11..44..  WWhhaatt iiss tthhee ccooppyyrriigghhtt ssttaattuuss ooff HHaarrvveesstt??

  The core of Harvest located in _s_r_c directory is under GPL. Additional
  components, located in _c_o_m_p_o_n_e_n_t_s directory are under GPL or similar
  copyright.


  11..55..  WWhhiicchh OOppeerraattiinngg SSyysstteemm ddoo II nneeeedd ttoo rruunn HHaarrvveesstt??

  Harvest should run on any *nix like platforms including FreeBSD, Linux
  and Solaris.


  11..66..  DDooeess HHaarrvveesstt rruunn uunnddeerr WWiinnddoowwss NNTT//22000000//XXPP??

  Michael Schlenker has ported Harvest to Windows platforms using Cygwin
  http://sources.redhat.com/cygwin/.
  11..77..  WWhhaatt HHaarrddwwaarree ddoo II nneeeedd ttoo uussee HHaarrvveesstt??

  A Pentium 120MHz with 64MB RAM should achieve reasonable performance
  for around 350 MB of fulltext data in about 20,000 objects. A Pentium
  650MHz with 256MB RAM should be able to handle around 1.5 GB of
  fulltext data in about 100,000 objects.


  11..88..  WWhhiicchh vveerrssiioonn ooff HHaarrvveesstt sshhoouulldd II uussee??


  +o  If you want to help developing Harvest, use the most recent version
     of Harvest.

  +o  If you are cautious, a version older than a week should reasonably
     be safe to use.

  +o  If you don't want to use development versions of Harvest, use the
     last version marked as stable.


  11..99..  WWhhaatt aarree ""hhaarrvveesstt--mmooddiiffiieedd--bbyy--RRLL--SSttaajjssiicc"",, ""hhaarrvveesstt--MMaatthhNNeett"",,
  aanndd ""hhaarrvveesstt--11..55..2200--kkjj""??

  After the original authors ceased working on Harvest, there were some
  periods where Harvest was unmaintained. During this time there were
  following forked versions of Harvest:


  +o  "harvest-modified-by-RL-Stajsic" was released by R.L. Stajsic and
     Tim Samshuijzen with some bugfixes.

  +o  "harvest-MathNet" is a modified version of Harvest-1.5.20 to
     improve the handling of German specical characters ("Umlaute",
     "scharfes S").

  +o  "harvest-1.5.20-kj" series were released by me with bugfixes to
     Harvest 1.5.20.

  All these forked trees were merged into Harvest 1.6.


  11..1100..  WWhhaatt aarree tthhee lliimmiittss ooff HHaarrvveesstt??


  +o  Harvest's Gatherer uses GDBM database to store the summarized data.
     On some architecture/OS, the maximum file size is 2 GB, so you
     can't have a database larger than 2 GB per Gatherer on those
     systems. To collect more data, you have to set up multiple
     Gatherers.

  +o  The Broker stores the data as single files. On most OS, performance
     degrades noticeably with increasing number of files in a directory.
     Since the Broker uses finite number of directories defined in
     _s_r_c_/_b_r_o_k_e_r_/_s_t_o_r___m_a_n_._c to store the files, the broker will slow down
     with increasing number files.


  11..1111..  DDoo II nneeeedd rroooott aacccceessss ttoo iinnssttaallll aanndd rruunn HHaarrvveesstt??

  For initial setup, you must be able to modify the webserver
  configuration and to schedule cron jobs. After the initial setup, it
  is recommended to run Harvest as a different user for security
  reasons.


  11..1122..  HHooww ddoo II bblloocckk HHaarrvveesstt ffrroomm mmyy ssiittee?? HHooww ddoo II iiddeennttiiffyy HHaarrvveesstt??

  Put a line like this to your robots.txt:


               User-agent: Harvest
               Disallow: /


  11..1133..  WWhhaatt ccaann II ddoo ttoo hheellpp??

  There are many ways to help depending your skills and time you want to
  contribute to improve Harvest:


  +o  Use Harvest and let others know that you are using Harvest.

  +o  Use Harvest and let me know why you are using Harvest.

  +o  Submit ideas, feature requests and bug reports.

  +o  Contribute localization.

  +o  Contribute documentation.

  +o  Contribute code.


  22..  BBuuiillddiinngg HHaarrvveesstt


  22..11..  WWhheerree ccaann II ggeett bbiissoonn aanndd fflleexx??

  Bison and flex are available at GNU FTP Site <ftp://ftp.gnu.org/> and
  its mirrors.


  22..22..  HHooww ccaann II iinnssttaallll HHaarrvveesstt iinn ""//mmyy//ddiirreeccttoorryy//hhaarrvveesstt"" iinnsstteeaadd ooff
  ""//uussrr//llooccaall//hhaarrvveesstt""??

  Do


               # ./configure --prefix=/my/directory/harvest
               # make
               # make install


  22..33..  HHooww ccaann II aavvooiidd ""ssyynnttaaxx eerrrroorr bbeeffoorree ``rreeggooffff__tt''"" eerrrroorr mmeessssaaggee
  wwhheenn ccoommppiilliinngg HHaarrvveesstt??

  On some systems, building Harvest may fail with following message:


          Making all in util
          gcc  -I../include -I./../include -c buffer.c
          In file included from ../include/config.h:350,
                           from ../include/util.h:112,
                           from buffer.c:86:
          /usr/include/regex.h:46: syntax error before `regoff_t'
          /usr/include/regex.h:46: warning: data definition has no type or storage class
          /usr/include/regex.h:56: syntax error before `regoff_t'
          *** Error code 1


  If you get this error, edit src/common/include/autoconf.h and add
  "#define USE_GNU_REGEX 1" before typing make to build Harvest.


  22..44..  WWhheerree ccaann II ggeett mmoorree iinnffoorrmmaattiioonn ffoorr bbuuiillddiinngg HHaarrvveesstt oonn
  FFrreeeeBBSSDD??

  See FreshPorts Harvest page http://www.freshports.org/www/harvest/ for
  more informations about building Harvest on FreeBSD.


  33..  GGaatthheerreerr


  33..11..  DDooeess tthhee GGaatthheerreerr ssuuppppoorrtt ccooookkiieess??

  No, Harvest's Gatherer doesn't support cookies.


  33..22..  WWhhyy ddooeessnn''tt LLooccaall--MMaappppiinngg wwoorrkk??

  In Harvest 1.7.7, the default HTML enumerator was switched from
  httpenum-depth to httpenum-breadth. The breadth first enumerator had a
  bug in LLooccaall--MMaappppiinngg, which was fixed in Harvest 1.7.19. To make
  LLooccaall--MMaappppiinngg work, use depth first enumerator or update to Harvest
  1.7.19 or later.

  Local mapping will fail if the file is not readable by the gatherer
  process, or the file is not a regular file, or the file has execute
  bits set, or the filename contains characters that have to be escaped
  (like tilde, space, curly brace, quote, etc). So, for directories,
  symbolic links and cgi scripts, the gatherer will always contact the
  server instead of using local file.


  33..33..  DDooeess tthhee GGaatthheerreerr ggaatthheerr tthhee RRoooott-- aanndd LLeeaaffNNooddee--UURRLLss ppeerriiooddii--
  ccaallllyy??

  No, the Gatherer gathers Root- and LeafNode URLs only once. To check
  the URLs periodically, you have to use cron (see "man 8 cron") to run
  $HARVEST_HOME/gatherers/YOUR_GATHERER/RunGatherer.
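
  A crontab entry along the following lines would rerun the Gatherer
  regularly. This is only a sketch: the schedule (daily at 02:00) and
  the /usr/local/harvest installation path are assumptions, and
  YOUR_GATHERER is a placeholder for your gatherer's directory.

```
# minute hour day-of-month month day-of-week command
0 2 * * * /usr/local/harvest/gatherers/YOUR_GATHERER/RunGatherer
```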


  33..44..  CCaann HHaarrvveesstt ggaatthheerr hhttttppss UURRLLss??

  No, https is not supported by Harvest. To gather https URLs, use
  Harvest-ng from Simon Wilkinson. It is available at Harvest-ng
  homepage http://webharvest.sourceforge.net/ng/.


  33..55..  WWhheenn wwiillll HHaarrvveesstt bbee aabbllee ttoo ggaatthheerr hhttttppss UURRLLss??

  This is not on top of my to-do list and may take some time.


  33..66..  DDooeess HHaarrvveesstt ssuuppppoorrtt cclliieenntt bbaasseedd ssccrriippttiinngg//pplluuggiinn lliikkee
  JJaavvaassccrriipptt,, FFllaasshh??

  No, Harvest's gatherer does not support Javascript, Flash, etc., and
  there are no plans to add support for them.


  33..77..  WWhhyy ddooeess tthhee ggaatthheerreerr ssttoopp aafftteerr ggaatthheerriinngg ffeeww ppaaggeess??

  Harvest's gatherer doesn't support Javascript, Flash, etc.  Check the
  site you want to gather and make sure that the site is browsable
  without any plugins, Javascript, etc.


  33..88..  HHooww ccaann II iinnddeexx llooccaall nneewwssggrroouuppss?? HHooww ccaann II ppuutt hhoossttnnaammee iinnttoo
  NNeewwss UURRLL??

  You will find a News URL hostname patch by Collin Smith in the _c_o_n_t_r_i_b
  directory.

  NOTE: Even though most web browsers support this, this violates
  RFC-1738.


  33..99..  WWhhaatt ddoo tthhee ggaatthheerreerr ooppttiioonnss ""SSeeaarrcchh==BBrreeaaddtthh"" aanndd ""SSeeaarrcchh==DDeepptthh""
  ddoo aanndd wwhhiicchh kkeeyywwoorrddss aarree aavvaaiillaabbllee ffoorr ""SSeeaarrcchh=="" ooppttiioonn??

  Search option selects an enumerator for http and gopher URLs. Harvest
  comes with breadth first (Search=Breadth) and depth first
  (Search=Depth) enumerator for http and gopher. They have different
  strategy when following the URLs to get a list of candidates for
  processing. The breadth first enumerator processes all links in a
  level before descending to next level. In case of limiting the number
  of URLs to gather from a site, it will give you a more representative
  overview of the site. The depth first enumerator will descend to next
  level as soon as possible. When there are no links left for the
  current branch, it will  process the next branch. The depth first
  enumerator doesn't use as much memory as the breadth first enumerator.
  If you don't have compelling reasons to switch from an enumerator to
  the other, the default value should be a reasonable choice.


  33..1100..  HHooww ccaann II iinnddeexx hhttmmll ppaaggeess ggeenneerraatteedd bbyy ccggii ssccrriippttss?? HHooww ccaann II
  iinnddeexx UURRLLss wwhhiicchh hhaass aa ""??"" ((qquueessttiioonn mmaarrkk)) iinn iitt??

  Remove _H_T_T_P_-_Q_u_e_r_y from _$_H_A_R_V_E_S_T___H_O_M_E_/_l_i_b_/_g_a_t_h_e_r_e_r_/_s_t_o_p_l_i_s_t_._c_f and
  _$_H_A_R_V_E_S_T___H_O_M_E_/_g_a_t_h_e_r_e_r_s_/_Y_O_U_R___G_A_T_H_E_R_E_R_/_l_i_b_/_s_t_o_p_l_i_s_t_._c_f. For versions
  earlier than 1.7.5, you also have to create a (symbolic) link from
  $HARVEST_HOME/lib/gatherer/HTML.sum to
  $HARVEST_HOME/lib/gatherer/HTTP-Query.sum. To do this, type:


               # cd $HARVEST_HOME/lib/gatherer
               # ln -s HTML.sum HTTP-Query.sum
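

  Removing the HTTP-Query entry from stoplist.cf by hand in an editor
  works fine. If you prefer to script it, the following shell sketch
  shows one way. It assumes that stoplist.cf lists one type name per
  line; remove_stoplist_entry is a hypothetical helper, not part of
  Harvest, and it keeps a backup copy with a .bak suffix.

```shell
#!/bin/sh
# Hypothetical helper (not part of Harvest): remove one type name from a
# stoplist file, assuming the file lists one type name per line.
remove_stoplist_entry() {
    file=$1
    entry=$2
    # Keep a backup copy before rewriting the file.
    cp "$file" "$file.bak" || return 1
    # Rewrite the file without lines that are exactly equal to the entry
    # (-x whole line, -F fixed string, -v invert the match).
    # grep exits with 1 when no lines are left to print; accept that too.
    grep -v -x -F -e "$entry" "$file.bak" > "$file" || [ $? -eq 1 ]
}
```

  For example, remove_stoplist_entry
  "$HARVEST_HOME/lib/gatherer/stoplist.cf" HTTP-Query would drop the
  entry from the global stoplist; repeat it for the stoplist.cf of your
  gatherer.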


  33..1111..  WWhhyy iiss tthhee ggaatthheerreerr ssoo ssllooww?? HHooww ccaann II mmaakkee iitt ffaasstteerr??

  The gatherer's default setting is to sleep one second after retrieving
  an URL. This is to avoid an overload of the webserver. If you gather
  from webservers under your control and know that they can handle the
  additional load caused by the gatherer add "Delay=0" in your root node
  specification to disable the sleep.

  The lines should look like:


               <RootNodes>
               http://www.SOMESERVER.com/ Search=Breadth Delay=0
               </RootNodes>


  Alternatively, you can set the delay value for all root nodes by
  adding Access-Delay: 0 to your configuration file.

  It should look like:


               Gatherer-Name:  YOUR Gatherer
               Gatherer-Port:  8500
               Top-Directory:  /HARVEST_DIR/work1/gatherers/testgather
               Access-Delay:   0

               <RootNodes>
               http://www.MYSITE.com/ Search=Breadth
               </RootNodes>


  33..1122..  WWhhyy iiss tthhee ggaatthheerreerr ssttiillll ssoo ssllooww??

  Harvest's gatherer is designed to handle many types of documents and
  many types of protocols. To achieve this flexibility it uses external
  programs to handle the different types of documents and protocols. For
  example, when gathering HTML documents via HTTP, the document is
  parsed twice. First to get list of candidates to gather and then to
  get a summary of the document. The summarizer is started each time
  when a document arrives, quits after summarizing that document and has
  to be restarted for the next document. Compared to more HTTP/HTML
  oriented approaches this causes a significant overhead when gathering
  HTTP/HTML only.

  Harvest retrieves one document at a time which causes slowdown if you
  encounter a slow site. Due to implementation, the Gathering process is
  quite heavyweight and uses up to 25 MB of RAM per Gatherer. For this
  reason, there were no attempts to spawn more gatherers to optimize the
  bandwidth usage.


  33..1133..  HHooww ddoo II rreeqquueesstt ""330044 NNoott MMooddiiffiieedd"" aannsswweerrss ffrroomm HHTTTTPP sseerrvveerrss??

  To send "Last Modified: xx" headers and get "304 Not Modified" answers
  from HTTP servers, add following line to the gatherer's configuration
  file:


               HTTP-If-Modified-Since: Yes


  If the document hasn't changed since the last gathering, the gatherer
  will use the data from its database instead of retrieving it again.
  This will save bandwidth and speed up gathering significantly.


  33..1144..  WWhhyy ddooeess HHaarrvveesstt ggaatthheerr ddiiffffeerreenntt UURRLLss bbeettwweeeenn ggaatthheerriinnggss??

  When HHTTTTPP--IIff--MMooddiiffiieedd--SSiinnccee is enabled, the candidate selection scheme
  of the http enumerators will change for successful database lookups.
  For unchanged URLs, the enumerators will behave more like depth first
  gatherer. The result of the gatherings should be the same if you are
  gathering all URLs of a site, but if you gather only parts of a site
  by using UURRLL==nn with nn << nnuummbbeerr ooff UURRLLss ooff aa ssiittee you will get
  different subset of the system you gather.


  33..1155..  WWhhyy hhaass tthhee GGaatthheerreerr''ss ddaattaabbaassee vvaanniisshheedd aafftteerr ggaatthheerriinngg??

  The Gatherer uses GDBM databases to store its data on disk. Database
  files for Gatherer can grow very large depending on how much data you
  gather. On some systems, (e.g. i386 based Linux) the maximum file size
  is 2GB. If the amount of data surpasses this limit, the GDBM database
  file will be wiped from the disk.


  33..1166..  HHooww ccaann II aavvooiidd GGDDBBMM ffiilleess ggrroowwiinngg vveerryy bbiigg dduurriinngg GGaatthheerriinngg??

  The Gatherer's temporary GDMB database file _W_O_R_K_I_N_G_._g_d_b_m will grow
  very rapidly when gathering nested objects like tar, tar.gz, zip etc.
  archives. GDBM databases keep growing when tuples are inserted and
  deleted from them, because GDBM reuses only fractions of the empty
  filespace. To get rid of unused space, the GDBM database has to be
  reorganized. The reorganization however is slow and will slow down the
  gathering, so the default is not to reorganize the gatherer's
  temporary database. This should work well for small to medium sized
  Gatherers, but for large Gatherers it may be necessary to reorganize
  the temporary database during gathering to keep the size of the
  database at manageable level. To reorganize the _W_O_R_K_I_N_G_._g_d_b_m every 100
  deletions add following line to your gatherer configuration file:


               Essence-Options: --max-deletions 100


  Don't set this value too low, since reorganizing will consume a
  significant share of CPU time and disk I/O. Reorganizing every 10 to
  100 deletions seems to be a reasonable value.


  33..1177..  CCaann II uussee HHttddiigg aass GGaatthheerreerr?? CCaann tthhee BBrrookkeerr iimmppoorrtt ddaattaa ffrroomm
  HHttddiigg??

  The perl module _M_e_t_a_d_a_t_a from Dave Beckett can dump data from Htdig
  database into a SOIF stream. Metadata only supports GDBM databases, so
  this only works with versions earlier than Htdig 3.1, because newer
  versions of Htdig switched from GDBM to Sleepycat's Berkeley DB.


  33..1188..  HHooww ccaann II ccoonnttrrooll aacccceessss ttoo GGaatthheerreerr''ss ddaattaabbaassee??

  Edit _$_H_A_R_V_E_S_T___H_O_M_E_/_g_a_t_h_e_r_e_r_s_/_Y_O_U_R___G_A_T_H_E_R_E_R_/_d_a_t_a_/_g_a_t_h_e_r_d_._c_f to allow or
  deny access. A line that begins with AAllllooww is followed by any number
  of domain or host names that are allowed to connect to the Gatherer.
  If the word aallll is used, then all hosts are matched. DDeennyy is the
  opposite of AAllllooww. The following example will only allow hosts in the
  ccss..ccoolloorraaddoo..eedduu or uusscc..eedduu domain access the Gatherer's database:


               Allow  cs.colorado.edu usc.edu
               Deny   all


  33..1199..  DDooeess HHaarrvveesstt''ss GGaatthheerreerr ssuuppppoorrtt WWAAPP//WWMMLL,, GGnnuutteellllaa,, NNaappsstteerr??

  No. Harvest's Gatherer doesn't support WAP. Peer to peer services like
  Gnutella, Napster, etc. are also unsupported.


  33..2200..  HHooww ddoo II ggaatthheerr ffttpp UURRLLss ffrroomm wwuu--ffttpp ddaaeemmoonnss??

  Changes in wu-ftpd 2.6.x broke ftpget. There is a replacement for it
  in contrib directory which wraps any ftp client to behave like ftpget.


  33..2211..  WWhhyy ddooeessnn''tt ffiillee UURRLLss iinn LLeeaaffNNooddeess wwoorrkk aass eexxppeecctteedd??

  File URLs pointing to directories like _f_i_l_e_:_/_/_m_i_s_c_/_d_o_c_u_m_e_n_t_s_/ in
  LeafNodes are considered as nested object which will be unnested.


  33..2222..  WWhhyy ddooeess ggaatthheerriinngg ffrroomm aa ssiittee ffaaiill ccoommpplleetteellyy oorr ffoorr ppaarrttss ooff
  tthhee ssiittee??

  This may be caused by the site's _r_o_b_o_t_s_._t_x_t. You can check this by
  typing "http://www.SOME.SITE.com/robots.txt" into your favourite web
  browser.


  44..  SSuummmmaarriizzeerr


  44..11..  WWhhyy ddooeessnn''tt PPoosstt--SSuummmmaarriizziinngg wwoorrkk??

  The most common error is that the instructions are indented by spaces
  instead of a tab-stop. Check the PPoosstt--SSuummmmaarriizziinngg rule file and make
  sure that instructions are indented by a tab-stop. The PPoosstt--
  SSuummmmaarriizziinngg rule file uses a syntax like in _M_a_k_e_f_i_l_e.  Conditions
  begin in the first column and instructions are indented by a tab-stop.
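
  A quick way to spot instructions indented by spaces is to search the
  rule file for lines that begin with a space. The following shell
  sketch does that; find_space_indented is a hypothetical helper name,
  and the rule file path is whatever you use in your own setup.

```shell
#!/bin/sh
# Hypothetical helper: list (with line numbers) the lines of a rule file
# that start with a space character instead of a tab.
find_space_indented() {
    grep -n '^ ' "$1"
}
```

  Every line it prints is a candidate for re-indenting with a tab-stop.
  No output means no line starts with a space.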


  44..22..  HHooww ccaann II ssuummmmaarriizzee mmeettaa ttaaggss iinn HHTTMMLL ddooccuummeennttss??

  In Harvest 1.5.20.kj-0.3, the default summarizer for HTML data was
  switched to HTML-lax.sum which does not handle meta tags. Edit
  $HARVEST_HOME/lib/gatherer/HTML.sum and uncomment the SGML or Perl
  based summarizer.


  44..33..  WWhhyy aarree rraaww HHTTMMLL ttaaggss iinn ssoommee qquueerryy rreessuullttss??

  If you see raw HTML tags in query results, the HTML summarizer was not
  able to parse the page correctly. Harvest comes with three different
  summarizers for HTML. If the default summarizer fails try the other
  two summarizers. To do this, edit $HARVEST_HOME/lib/gatherer/HTML.sum
  and uncomment one of the summarizers.


  44..44..  HHooww ccaann II ssuummmmaarriizzee DDVVII ffiilleess??

  Use Harvest older than 1.5.20-kj-0.8 or newer than 1.7.2.  The
  versions between these two versions have a bug which prevents DVI
  files being summarized.


  44..55..  HHooww ccaann II ssuummmmaarriizzee PPddff ffiilleess??

  You need _x_p_d_f to summarize Pdf files. Harvest uses pdftotext from _x_p_d_f
  to summarize Pdf files.

  Alternatively, you can use acroread to convert Pdf files to Postscript
  and pass it to Postscript summarizer. To do this, edit
  $HARVEST_HOME/lib/gatherer/Pdf.sum accordingly.


  44..66..  WWhheerree ccaann II ggeett ppddffttootteexxtt??

  pdftotext is part of _x_p_d_f. It is available at Xpdf homepage
  http://www.foolabs.com/xpdf/.


  44..77..  HHooww ccaann II iimmpprroovvee ssuummmmaarriizzeerr ffoorr MMiiccrroossoofftt WWoorrdd ffiilleess??

  Harvest uses _c_a_t_d_o_c to summarize Microsoft Word files. If you get bad
  summaries for Microsoft Word files, you might want to try wvHtml,
  which is part of _w_v_W_a_r_e, instead of _c_a_t_d_o_c.


  44..88..  WWhheerree ccaann II ggeett wwvvWWaarree??

  _w_v_W_a_r_e is available at wvWare homepage http://www.wvware.com/.


  44..99..  HHooww ccaann II aadddd ssuuppppoorrtt ffoorr nneeww ffiillee ttyyppee??

  Give the new file type a name and make Harvest know how to recognize
  the new file type by modifying _b_y_n_a_m_e_._c_f (to determine filetype by its
  name), _b_y_u_r_l_._c_f (to determine filetype by the URL), or _m_a_g_i_c and
  _b_y_c_o_n_t_e_n_t_._c_f (to determine filetype by looking at the content of the
  file). You will find _b_y_c_o_n_t_e_n_t_._c_f, _b_y_n_a_m_e_._c_f, _b_y_u_r_l_._c_f and _m_a_g_i_c in
  your _$_H_A_R_V_E_S_T___H_O_M_E_/_l_i_b_/_g_a_t_h_e_r_e_r_/ directory.

  Create a summarizer (a programm or script) which takes the filename as
  first argument and prints a SOIF stream "Attributename{length of
  data}:<tab>your data" to stdout. For file type "Xyz", you have to
  create a summarizer called Xyz.sum in the _$_H_A_R_V_E_S_T___H_O_M_E_/_l_i_b_/_g_a_t_h_e_r_e_r_/
  directory.

  In most of the cases it might be easiest to convert filetype "Xyz" to
  a supported filetype like HTML, PostScript, etc. and use an existing
  summarizer on the converted file.
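
  As an illustration of the "Attributename{length of data}:<tab>your
  data" format described above, here is a minimal shell sketch of what
  the body of an Xyz.sum script might look like. The helper names
  soif_attr and summarize_xyz are hypothetical, the choice of the Title
  attribute and of the file's first line as its value is only an
  assumption for the example, and the length is counted in bytes.

```shell
#!/bin/sh
# Hypothetical helper: print one attribute line in the form
# Attributename{length of data}:<tab>data
soif_attr() {
    name=$1
    value=$2
    # Length of the data in bytes; tr strips padding some wc versions add.
    len=$(printf '%s' "$value" | wc -c | tr -d ' ')
    printf '%s{%s}:\t%s\n' "$name" "$len" "$value"
}

# Hypothetical summarizer body for file type "Xyz": emit the first line
# of the file given as argument as the Title attribute.
summarize_xyz() {
    title=$(head -n 1 "$1")
    soif_attr "Title" "$title"
}
```

  A real Xyz.sum would end by calling summarize_xyz "$1", so that the
  filename passed as first argument is summarized to stdout.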


  44..1100..  HHooww ccaann II uussee nnssggmmllss iinnsstteeaadd ooff ssggmmllss ttoo ssuummmmaarriizzee ddooccuummeennttss??

  Edit $HARVEST_HOME/lib/gatherer/SGML.sum and set $$ssggmmllss__ccmmdd ==
  ""//uussrr//llooccaall//bbiinn//nnssggmmllss"" or where ever you have installed nsgmls.


  55..  BBrrookkeerr


  55..11..  HHooww ccaann II ssttaarrtt aa BBrrookkeerr aatt bboooott ttiimmee??

  Some user contributed startup scripts are located in _c_o_n_t_r_i_b_/_e_t_c_/
  directory of Harvest source distribution. Modify apropriate files and
  copy them to your startup script directory.


  55..22..  HHooww ccaann II ssttaarrtt aa BBrrookkeerr wwiitthhoouutt ssttaarrttiinngg aa ccoolllleeccttiioonn??

  When a Broker starts, it starts collecting data, which can take some
  time. To avoid this, use the --nnooccooll option when invoking RunBroker.

  If you have installed Harvest in _/_u_s_r_/_l_o_c_a_l_/_h_a_r_v_e_s_t_/, put following
  line into your startup file, e.g. /etc/rc.local:


               /usr/local/harvest/brokers/YOUR_BROKER/RunBroker -nocol


  Replace _/_u_s_r_/_l_o_c_a_l_/_h_a_r_v_e_s_t_/ with the directory where you have
  installed Harvest.


  55..33..  WWhhyy ddoonn''tt tthhee ddooccuummeennttss wwhhiicchh II hhaavvee ggaatthheerreedd rriigghhtt nnooww sshhooww uupp
  iinn tthhee BBrrookkeerr??

  The Broker imports data from the Gatherer once in every 24 hours. If
  you want to import the data immediately after gathering, just restart
  the Broker or signal the Broker to import data.

  You can signal the broker with the command line client brkclient,
  located in _$_H_A_R_V_E_S_T___H_O_M_E_/_l_i_b_/_b_r_o_k_e_r_/ by typing:


               # brkclient localhost 8501 '#ADMIN #Password secret #collection'


  Replace hostname, port and password if necessary.

  Another, easier method is to use the WWW based admin interface at:
  "http://www.YOUR_SERVER.com/Harvest/brokers/YOUR_BROKER/admin/admin.html".


  55..44..  WWhhyy ddoo II ggeett eerrrroorr mmeessssaaggeess wwhheenn II ttrryy ttoo aacccceessss
  ""hhttttpp::////ssoommee..hhoosstt//HHaarrvveesstt//bbrrookkeerrss//yyoouurr--bbrrookkeerr--ppaatthh//"" aafftteerr rruunnnniinngg
  $$HHAARRVVEESSTT__HHOOMMEE//RRuunnHHaarrvveesstt??

  Check the error log of your http daemon. The http daemon must be able
  to follow symbolic links. For apache httpd you can do this by adding:

               <Location /Harvest/brokers/your-broker-path/>
                       Options FollowSymLinks
               </Location>


  to your _h_t_t_p_d_._c_o_n_f.

  If you don't want symbolic links, delete the symbolic link and copy
  the file to the new name.


  55..55..  WWhhyy aarree NNEEWWSS UURRLLss bbrrookkeenn?? WWhheerree aarree tthhee hhoossttnnaammeess iinn NNEEWWSS UURRLLss??
  HHooww ccaann II ffoollllooww NNEEWWSS UURRLLss??

  Harvest's Gatherer doesn't put hostnames into NEWS URLs. If your web
  browser complains about a missing news server, configure your web
  browser to use the news server of your provider, company or
  organization as your default news server.

  For more information on why Harvest doesn't put hostnames into NEWS
  URLs, see RFC 1738, sections 3.6 and 3.7.


  55..66..  WWhhyy ddoonn''tt II ggeett aannyy rreessuullttss iiff II uussee aa lloonngg oorr ccoommpplleexx qquueerryy
  ssttrriinngg??

  The length of a query string is limited to 30 characters when using
  regular expressions (wildcards), excluding the escape characters.
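
  To check a query against this limit before sending it, something
  like the following sketch can be used. It assumes that the escape
  character is the backslash, which is not stated above, and the
  function names query_length and query_fits are made up for this
  example.

```shell
# Print the length of a query, not counting backslashes.
# Assumption: backslash is the escape character.
# $1 = query string
query_length() {
    stripped=$(printf '%s' "$1" | tr -d '\\')
    printf '%s\n' "${#stripped}"
}

# Succeed if the query stays within the 30 character limit.
# $1 = query string
query_fits() {
    [ "$(query_length "$1")" -le 30 ]
}

# Example call:
#   query_fits 'harv.*\.org' && echo "query is short enough"
```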


  55..77..  CCaann II uussee wwiillddccaarrddss iinn aattttrriibbuuttee vvaalluuee ffoorr ssttrruuccttuurreedd qquueerriieess??

  No, regular expressions for attribute names and attribute values in
  structured queries aren't supported. So, queries like "Author: Smi.*"
  or "Auth.*: Smith" won't do what you might expect.


  55..88..  AArree tthhee aattttrriibbuuttee nnaammeess ccaassee sseennssiittiivvee??

  No, the attribute names are not case sensitive. So, "Time-To-Live" is
  the same as "Time-to-Live", "Time-to-live", "time-to-live", etc.


  55..99..  WWhhyy ddooeessnn''tt ccoolllleeccttiinngg ffrroomm bbrrookkeerr wwoorrkk??

  This is due to a bug introduced in Harvest 1.5.18. The bug was fixed
  in 1.7.8. To make it work again, update to 1.7.8 or higher.


  55..1100..  HHooww ccaann II ccuussttoommiizzee tthhee HHaarrvveesstt uusseerr iinntteerrffaaccee??

  The query pages are located in
  $HARVEST_HOME/brokers/YOUR_BROKER/query-*.  Most likely, you don't
  want to make all the variables visible to users who want to query your
  broker. Edit query-* and use the hidden type to set suitable defaults
  for variables you want to hide.

  The result set presentation can be customized by choosing or modifying
  the configuration files located in the $HARVEST_HOME/cgi-bin/lib/
  directory. The configuration files Sample.cf, classic.cf, modern.cf
  and some LANGUAGE.cf are already installed there. You can either
  create a new configuration file or modify one of the existing
  configuration files to get the result set presentation you want. See
  the Harvest User's Manual for information about available options for
  the configuration file.

  If you want to customize the result presentation even further, then
  edit $HARVEST_HOME/cgi-bin/search.cgi.


  55..1111..  HHooww ddoo II llooccaalliizzee//ttrraannssllaattee uusseerr iinntteerrffaaccee??

  To localize the user interface, do:


  1. Create _s_r_c_/_b_r_o_k_e_r_/_e_x_a_m_p_l_e_/_b_r_o_k_e_r_s_/_s_k_e_l_e_t_o_n_/_q_u_e_r_y_-_g_l_i_m_p_s_e_-
     _m_o_d_e_r_n_._h_t_m_l_._x_x_._i_n, where _x_x is a two letter abbreviation for your
     language/country, by translating either _q_u_e_r_y_-_g_l_i_m_p_s_e_-
     _m_o_d_e_r_n_._h_t_m_l_._i_n or other _q_u_e_r_y_-_g_l_i_m_p_s_e_-_m_o_d_e_r_n_._h_t_m_l_._y_y_._i_n. This is
     the localized query page.

  2. Create _c_o_m_p_o_n_e_n_t_s_/_b_r_o_k_e_r_/_s_t_a_n_d_a_r_d_/_W_W_W_/_l_a_n_g_u_a_g_e_._c_f by translating
     _m_o_d_e_r_n_._c_f or other translated configuration file like _s_p_a_n_i_s_h_._c_f,
     _g_e_r_m_a_n_._c_f, etc. This will localize the result pages and error
     messages.

  3. Create _s_r_c_/_b_r_o_k_e_r_/_e_x_a_m_p_l_e_/_b_r_o_k_e_r_s_/_s_k_e_l_e_t_o_n_/_q_u_e_r_y_-_g_l_i_m_p_s_e_._h_t_m_l_._x_x_._i_n
     by translating _q_u_e_r_y_-_g_l_i_m_p_s_e_._h_t_m_l_._i_n or _q_u_e_r_y_-_g_l_i_m_p_s_e_._h_t_m_l_._y_y_._i_n.
     This is the advanced query page.

  4. Translate _s_r_c_/_b_r_o_k_e_r_/_e_x_a_m_p_l_e_/_b_r_o_k_e_r_s_/_*_._h_t_m_l to get localized
     additional help pages.
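
  Steps 1 and 3 both start from a copy of an existing template. The
  following sketch creates these copies so that you can translate them
  afterwards. The function name start_translation is made up for this
  example, and it only covers the two query pages, not the
  configuration file or the help pages.

```shell
# Copy the untranslated query page templates to language specific names.
# The copies still have to be translated by hand.
# $1 = top of the Harvest source tree, $2 = two letter language code
start_translation() {
    skel="$1/src/broker/example/brokers/skeleton"
    for page in query-glimpse-modern.html query-glimpse.html; do
        # Never overwrite an existing translation.
        [ -e "$skel/$page.$2.in" ] || cp "$skel/$page.in" "$skel/$page.$2.in"
    done
}

# Example call (the source directory is a placeholder):
#   start_translation /path/to/harvest-source xx
```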


  55..1122..  HHooww ccaann II rreeppllaaccee tthhee bbuunnddlleedd GGlliimmppssee wwiitthh aann ootthheerr vveerrssiioonn ooff
  GGlliimmppssee??

  Edit _$_H_A_R_V_E_S_T___H_O_M_E_/_b_r_o_k_e_r_s_/_Y_O_U_R___B_R_O_K_E_R_/_a_d_m_i_n_/_b_r_o_k_e_r_._c_o_n_f to let
  Harvest know the location of your glimpse, glimpseindex, and
  glimpseserver.


  66..  TTeerrmmss


  66..11..  WWhhaatt iiss aa GGaatthheerreerr??

  A Gatherer is a system that retrieves documents from various sources
  (Web, News and FTP servers, local files) for processing. In the
  HTML/HTTP context, it is also often called crawler, robot, or spider.


  66..22..  WWhhaatt iiss LLooccaall--MMaappppiinngg??

  To reduce the CPU load and speed up gathering, Harvest can map local
  files to URLs. The Gatherer can bypass the server and use the local
  file, while pretending to the rest of the Harvest system that the
  objects were gathered as usual.
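
  The idea behind such a mapping can be illustrated with a few lines of
  shell. This is not how Harvest implements it, only a sketch of the
  principle: a URL prefix is replaced by a local directory. The
  function name map_url and the example URL and directory are made up.

```shell
# Map a URL to a local file name by replacing a URL prefix with a
# local directory.
# $1 = URL prefix, $2 = local directory, $3 = URL to map
# Prints the local path, or returns 1 if the URL is outside the prefix.
map_url() {
    case $3 in
        "$1"*) printf '%s%s\n' "$2" "${3#"$1"}" ;;
        *)     return 1 ;;
    esac
}

# Example call:
#   map_url http://www.example.com/ /var/www/htdocs/ \
#       http://www.example.com/docs/index.html
```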


  66..33..  WWhhaatt iiss aa SSuummmmaarriizzeerr??

  A Summarizer transforms a document into a form which is more suitable
  for fulltext searching.

  The HTML summarizer, for example, extracts the title of a document,
  removes all HTML tags, generates a wordlist, etc.
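
  To give a rough idea of what such a transformation looks like, here
  is a toy version built from standard Unix tools. It is not the
  summarizer that comes with Harvest. The function names html_title and
  html_words are made up, and the sketch assumes lower case tags and a
  title that fits on one line.

```shell
# Print the document title (lower case <title> tag on a single line).
html_title() {
    sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p'
}

# Print a sorted list of the distinct words of a document in lower
# case, with all HTML tags removed.
html_words() {
    sed 's/<[^>]*>/ /g' |
        tr -cs '[:alnum:]' '\n' |
        tr '[:upper:]' '[:lower:]' |
        sed '/^$/d' |
        sort -u
}

# Example calls (the file name is a placeholder):
#   html_title < page.html
#   html_words < page.html
```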


  66..44..  WWhhaatt iiss aa BBrrookkeerr??

  A Broker processes search requests received from a user by a cgi-
  script and presents the search results.


  77..  MMiisscceellllaanneeoouuss


  77..11..  WWhhoo aarree tthhee mmaaiinnttaaiinneerrss ooff HHaarrvveesstt??

  Kang-Jin Lee lee@arco.de and Harald Weinreich harald@weinreichs.de are
  maintaining Harvest.


  77..22..  II hhaavvee ffoouunndd aa bbuugg.. WWhhaatt sshhoouulldd II ddoo??

  Post a bug report to the newsgroup comp.infosystems.harvest or mail it
  to Kang-Jin Lee lee@arco.de and Harald Weinreich harald@weinreichs.de.


  77..33..  IIss tthheerree aa mmaaiilliinngglliisstt ffoorr HHaarrvveesstt?? WWhhaatt aabboouutt aa nneewwssggrroouupp??

  There is a Harvest developer's mailing list,
  http://lists.sourceforge.net/lists/listinfo/harvest-devel/, for Harvest
  users and developers. There is also a Harvest newsgroup,
  news:comp.infosystems.harvest.