Harvest FAQ
Kang-Jin Lee lee@arco.de
2002-12-22
Harvest frequently asked questions (FAQ) with answers
______________________________________________________________________
Table of Contents
1. Harvest
1.1 Where can I get more information about Harvest?
1.2 Where can I download Harvest?
1.3 What is Harvest-ng?
1.4 What is the copyright status of Harvest?
1.5 Which Operating System do I need to run Harvest?
1.6 Does Harvest run under Windows NT/2000/XP?
1.7 What Hardware do I need to use Harvest?
1.8 Which version of Harvest should I use?
1.9 What are "harvest-modified-by-RL-Stajsic", "harvest-MathNet", and "harvest-1.5.20-kj"?
1.10 What are the limits of Harvest?
1.11 Do I need root access to install and run Harvest?
1.12 How do I block Harvest from my site? How do I identify Harvest?
1.13 What can I do to help?
2. Building Harvest
2.1 Where can I get bison and flex?
2.2 How can I install Harvest in "/my/directory/harvest" instead of "/usr/local/harvest"?
2.3 How can I avoid "syntax error before `regoff_t'" error message when compiling Harvest?
2.4 Where can I get more information for building Harvest on FreeBSD?
3. Gatherer
3.1 Does the Gatherer support cookies?
3.2 Why doesn't Local-Mapping work?
3.3 Does the Gatherer gather the Root- and LeafNode-URLs periodically?
3.4 Can Harvest gather https URLs?
3.5 When will Harvest be able to gather https URLs?
3.6 Does Harvest support client based scripting/plugin like Javascript, Flash?
3.7 Why does the gatherer stop after gathering few pages?
3.8 How can I index local newsgroups? How can I put hostname into News URL?
3.9 What do the gatherer options "Search=Breadth" and "Search=Depth" do and which keywords are available for "Search=" option?
3.10 How can I index html pages generated by cgi scripts? How can I index URLs which has a "?" (question mark) in it?
3.11 Why is the gatherer so slow? How can I make it faster?
3.12 Why is the gatherer still so slow?
3.13 How do I request "304 Not Modified" answers from HTTP servers?
3.14 Why does Harvest gather different URLs between gatherings?
3.15 Why has the Gatherer's database vanished after gathering?
3.16 How can I avoid GDBM files growing very big during Gathering?
3.17 Can I use Htdig as Gatherer? Can the Broker import data from Htdig?
3.18 How can I control access to Gatherer's database?
3.19 Does Harvest's Gatherer support WAP/WML, Gnutella, Napster?
3.20 How do I gather ftp URLs from wu-ftp daemons?
3.21 Why doesn't file URLs in LeafNodes work as expected?
3.22 Why does gathering from a site fail completely or for parts of the site?
4. Summarizer
4.1 Why doesn't Post-Summarizing work?
4.2 How can I summarize meta tags in HTML documents?
4.3 Why are raw HTML tags in some query results?
4.4 How can I summarize DVI files?
4.5 How can I summarize Pdf files?
4.6 Where can I get pdftotext?
4.7 How can I improve summarizer for Microsoft Word files?
4.8 Where can I get wvWare?
4.9 How can I add support for new file type?
4.10 How can I use nsgmls instead of sgmls to summarize documents?
5. Broker
5.1 How can I start a Broker at boot time?
5.2 How can I start a Broker without starting a collection?
5.3 Why don't the documents which I have gathered right now show up in the Broker?
5.4 Why do I get error messages when I try to access "http://some.host/Harvest/brokers/your-broker-path/" after running $HARVEST_HOME/RunHarvest?
5.5 Why are NEWS URLs broken? Where are the hostnames in NEWS URLs? How can I follow NEWS URLs?
5.6 Why don't I get any results if I use a long or complex query string?
5.7 Can I use wildcards in attribute value for structured queries?
5.8 Are the attribute names case sensitive?
5.9 Why doesn't collecting from broker work?
5.10 How can I customize the Harvest user interface?
5.11 How do I localize/translate user interface?
5.12 How can I replace the bundled Glimpse with an other version of Glimpse?
6. Terms
6.1 What is a Gatherer?
6.2 What is Local-Mapping?
6.3 What is a Summarizer?
6.4 What is a Broker?
7. Miscellaneous
7.1 Who are the maintainers of Harvest?
7.2 I have found a bug. What should I do?
7.3 Is there a mailinglist for Harvest? What about a newsgroup?
______________________________________________________________________
1. Harvest
1.1. Where can I get more information about Harvest?
See the Harvest homepage http://harvest.sourceforge.net/ for
information about Harvest.
1.2. Where can I download Harvest?
Harvest is available for download at the Harvest download page
http://prdownloads.sourceforge.net/harvest/.
1.3. What is Harvest-ng?
Harvest-ng is a reimplementation of Harvest's gatherer by Simon
Wilkinson. You can get more information about Harvest-ng at the
Harvest-ng homepage http://webharvest.sourceforge.net/ng/.
1.4. What is the copyright status of Harvest?
The core of Harvest, located in the src directory, is under the GPL.
Additional components, located in the components directory, are under
the GPL or a similar license.
1.5. Which Operating System do I need to run Harvest?
Harvest should run on any Unix-like platform, including FreeBSD, Linux
and Solaris.
1.6. Does Harvest run under Windows NT/2000/XP?
Michael Schlenker has ported Harvest to Windows platforms using Cygwin
http://sources.redhat.com/cygwin/.
1.7. What Hardware do I need to use Harvest?
A Pentium 120MHz with 64MB RAM should achieve reasonable performance
for around 350 MB of fulltext data in about 20,000 objects. A Pentium
650MHz with 256MB RAM should be able to handle around 1.5 GB of
fulltext data in about 100,000 objects.
1.8. Which version of Harvest should I use?
o If you want to help develop Harvest, use the most recent version
of Harvest.
o If you are cautious, a version older than a week should be
reasonably safe to use.
o If you don't want to use development versions of Harvest, use the
last version marked as stable.
1.9. What are "harvest-modified-by-RL-Stajsic", "harvest-MathNet",
and "harvest-1.5.20-kj"?
After the original authors ceased working on Harvest, there were
periods during which Harvest was unmaintained. During this time the
following forked versions of Harvest appeared:
o "harvest-modified-by-RL-Stajsic" was released by R.L. Stajsic and
Tim Samshuijzen with some bugfixes.
o "harvest-MathNet" is a modified version of Harvest-1.5.20 which
improves the handling of German special characters ("Umlaute",
"scharfes S").
o The "harvest-1.5.20-kj" series was released by me with bugfixes to
Harvest 1.5.20.
All these forked trees were merged into Harvest 1.6.
1.10. What are the limits of Harvest?
o Harvest's Gatherer uses a GDBM database to store the summarized
data. On some architectures and operating systems, the maximum file
size is 2 GB, so you can't have a database larger than 2 GB per
Gatherer on those systems. To collect more data, you have to set up
multiple Gatherers.
o The Broker stores the data as single files. On most operating
systems, performance degrades noticeably as the number of files in a
directory increases. Since the Broker uses a finite number of
directories, defined in src/broker/stor_man.c, to store the files, the
Broker will slow down as the number of files increases.
1.11. Do I need root access to install and run Harvest?
For initial setup, you must be able to modify the webserver
configuration and to schedule cron jobs. After the initial setup, it
is recommended to run Harvest as a different user for security
reasons.
1.12. How do I block Harvest from my site? How do I identify Harvest?
Put lines like these into your robots.txt:
User-agent: Harvest
Disallow: /
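To check what these two lines mean to a robot that honours robots.txt, the following minimal sketch feeds them to Python's standard urllib.robotparser module. It does not run Harvest; the host www.example.com and the second robot name are placeholders that do not come from this FAQ.

```python
from urllib.robotparser import RobotFileParser

# The two robots.txt lines from the answer above.
ROBOTS_TXT = """\
User-agent: Harvest
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # www.example.com is a placeholder host, not taken from the FAQ.
    page = "http://www.example.com/index.html"
    print("Harvest allowed:", is_allowed(ROBOTS_TXT, "Harvest", page))
    print("Other robot allowed:", is_allowed(ROBOTS_TXT, "SomeOtherRobot", page))
```

Robots that are not named in the file are unaffected, so these rules block only Harvest.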
1.13. What can I do to help?
There are many ways to help, depending on your skills and the time you
want to contribute to improving Harvest:
o Use Harvest and let others know that you are using Harvest.
o Use Harvest and let me know why you are using Harvest.
o Submit ideas, feature requests and bug reports.
o Contribute localization.
o Contribute documentation.
o Contribute code.
2. Building Harvest
2.1. Where can I get bison and flex?
Bison and flex are available at GNU FTP Site and
its mirrors.
2.2. How can I install Harvest in "/my/directory/harvest" instead of
"/usr/local/harvest"?
Run:
# ./configure --prefix=/my/directory/harvest
# make
# make install
2.3. How can I avoid "syntax error before `regoff_t'" error message
when compiling Harvest?
On some systems, building Harvest may fail with the following message:
Making all in util
gcc -I../include -I./../include -c buffer.c
In file included from ../include/config.h:350,
from ../include/util.h:112,
from buffer.c:86:
/usr/include/regex.h:46: syntax error before `regoff_t'
/usr/include/regex.h:46: warning: data definition has no type or storage class
/usr/include/regex.h:56: syntax error before `regoff_t'
*** Error code 1
If you get this error, edit src/common/include/autoconf.h and add
"#define USE_GNU_REGEX 1" before typing make to build Harvest.
2.4. Where can I get more information for building Harvest on
FreeBSD?
See the FreshPorts Harvest page http://www.freshports.org/www/harvest/
for more information about building Harvest on FreeBSD.
3. Gatherer
3.1. Does the Gatherer support cookies?
No, Harvest's Gatherer doesn't support cookies.
3.2. Why doesn't Local-Mapping work?
In Harvest 1.7.7, the default HTML enumerator was switched from
httpenum-depth to httpenum-breadth. The breadth first enumerator had a
bug in Local-Mapping, which was fixed in Harvest 1.7.19. To make
Local-Mapping work, use the depth first enumerator or update to Harvest
1.7.19 or later.
Local mapping will fail if the file is not readable by the gatherer
process, or the file is not a regular file, or the file has execute
bits set, or the filename contains characters that have to be escaped
(like tilde, space, curly brace, quote, etc). So, for directories,
symbolic links and cgi scripts, the gatherer will always contact the
server instead of using the local file.
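The conditions above can be checked for a given file with a short script. This is only a sketch of the rules as stated in this answer, not Harvest's own code. The set of characters that have to be escaped is an assumption (the answer lists some and ends with "etc"), and the check looks at the last path component only.

```python
import os
import stat

# Characters assumed to need escaping. The FAQ names tilde, space, curly
# braces and quotes and ends with "etc", so this set is only a guess.
ESCAPED_CHARS = set("~ {}\"'")


def local_mapping_problems(path):
    """Return the reasons why path would not be used for Local-Mapping."""
    problems = []
    if any(ch in ESCAPED_CHARS for ch in os.path.basename(path)):
        problems.append("filename contains characters that have to be escaped")
    try:
        info = os.lstat(path)  # lstat: a symbolic link is not a regular file
    except OSError:
        problems.append("file does not exist or cannot be examined")
        return problems
    if not stat.S_ISREG(info.st_mode):
        problems.append("not a regular file")
    elif info.st_mode & 0o111:
        problems.append("execute bits are set")
    if not os.access(path, os.R_OK):
        problems.append("not readable by this process")
    return problems


if __name__ == "__main__":
    import sys
    for name in sys.argv[1:]:
        reasons = local_mapping_problems(name)
        print(name, "->", "; ".join(reasons) if reasons else "ok")
```

Run it as the user the gatherer runs as, since the readability check depends on the calling process.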
3.3. Does the Gatherer gather the Root- and LeafNode-URLs periodically?
No, the Gatherer gathers Root- and LeafNode URLs only once. To check
the URLs periodically, you have to use cron (see "man 8 cron") to run
$HARVEST_HOME/gatherers/YOUR_GATHERER/RunGatherer.
3.4. Can Harvest gather https URLs?
No, https is not supported by Harvest. To gather https URLs, use
Harvest-ng by Simon Wilkinson. It is available at the Harvest-ng
homepage http://webharvest.sourceforge.net/ng/.
3.5. When will Harvest be able to gather https URLs?
This is not on top of my to-do list and may take some time.
3.6. Does Harvest support client based scripting/plugin like
Javascript, Flash?
No, Harvest's gatherer does not support Javascript, Flash, etc., and
there are no plans to add support for them.
3.7. Why does the gatherer stop after gathering few pages?
Harvest's gatherer doesn't support Javascript, Flash, etc. Check the
site you want to gather and make sure that the site is browsable
without any plugins, Javascript, etc.
3.8. How can I index local newsgroups? How can I put hostname into
News URL?
You will find a News URL hostname patch by Collin Smith in the contrib
directory.
NOTE: Even though most web browsers support this, this violates
RFC-1738.
3.9. What do the gatherer options "Search=Breadth" and "Search=Depth"
do and which keywords are available for "Search=" option?
The Search option selects an enumerator for http and gopher URLs.
Harvest comes with a breadth first (Search=Breadth) and a depth first
(Search=Depth) enumerator for http and gopher. They use different
strategies when following URLs to get a list of candidates for
processing. The breadth first enumerator processes all links in a
level before descending to the next level. If you limit the number
of URLs to gather from a site, it will give you a more representative
overview of the site. The depth first enumerator descends to the next
level as soon as possible. When there are no links left in the
current branch, it processes the next branch. The depth first
enumerator doesn't use as much memory as the breadth first enumerator.
If you don't have compelling reasons to switch from one enumerator to
the other, the default value should be a reasonable choice.
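The difference between the two strategies can be illustrated with a toy example. This sketch is not Harvest's enumerator code. The link graph SITE and the function enumerate_urls are made up for illustration, and the limit parameter stands in for a cap on the number of URLs gathered from a site.

```python
from collections import deque

# Hypothetical link graph of a small site: page -> links found on that page.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
    "/a/1": [],
    "/a/2": [],
    "/b/1": [],
}


def enumerate_urls(site, root, search="Breadth", limit=None):
    """Return URLs in the order a breadth or depth first walk visits them."""
    pending = deque([root])
    seen = {root}
    order = []
    while pending and (limit is None or len(order) < limit):
        # Breadth first takes the oldest pending URL, depth first the newest.
        url = pending.popleft() if search == "Breadth" else pending.pop()
        order.append(url)
        links = site.get(url, [])
        if search != "Breadth":
            # Reverse so that the first link on a page is followed first.
            links = reversed(links)
        for link in links:
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return order


if __name__ == "__main__":
    print(enumerate_urls(SITE, "/", "Breadth", limit=3))
    print(enumerate_urls(SITE, "/", "Depth", limit=3))
```

With a limit of three URLs, the breadth first walk returns the root and both top level pages, while the depth first walk returns the root, "/a" and "/a/1", which is why the breadth first result gives a more representative overview of a site.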
3.10. How can I index html pages generated by cgi scripts? How can I
index URLs which has a "?" (question mark) in it?
Remove HTTP-Query from $HARVEST_HOME/lib/gatherer/stoplist.cf and
$HARVEST_HOME/gatherers/YOUR_GATHERER/lib/stoplist.cf. For versions
earlier than 1.7.5, you also have to create a (symbolic) link from
$HARVEST_HOME/lib/gatherer/HTML.sum to
$HARVEST_HOME/lib/gatherer/HTTP-Query.sum. To do this, type:
# cd $HARVEST_HOME/lib/gatherer
# ln -s HTML.sum HTTP-Query.sum
3.11. Why is the gatherer so slow? How can I make it faster?
The gatherer's default setting is to sleep one second after retrieving
a URL. This is to avoid overloading the webserver. If you gather
from webservers under your control and know that they can handle the
additional load caused by the gatherer, add "Delay=0" to your root node
specification to disable the sleep.
The line should look like:
http://www.SOMESERVER.com/ Search=Breadth Delay=0
Alternatively, you can set the delay value for all root nodes by
adding Access-Delay: 0 to your configuration file.
It should look like:
Gatherer-Name: YOUR Gatherer
Gatherer-Port: 8500
Top-Directory: /HARVEST_DIR/work1/gatherers/testgather
Access-Delay: 0
http://www.MYSITE.com/ Search=Breadth
3.12. Why is the gatherer still so slow?
Harvest's gatherer is designed to handle many types of documents and
many types of protocols. To achieve this flexibility it uses external
programs to handle the different types of documents and protocols. For
example, when gathering HTML documents via HTTP, the document is
parsed twice: first to get a list of candidates to gather, and then to
get a summary of the document. The summarizer is started each time
a document arrives, quits after summarizing that document and has
to be restarted for the next document. Compared to more HTTP/HTML
oriented approaches, this causes a significant overhead when gathering
HTTP/HTML only.
Harvest retrieves one document at a time, which causes a slowdown if
you encounter a slow site. Due to its implementation, the gathering
process is quite heavyweight and uses up to 25 MB of RAM per Gatherer.
For this reason, there have been no attempts to spawn more gatherers to
optimize the bandwidth usage.
3.13. How do I request "304 Not Modified" answers from HTTP servers?
To send "If-Modified-Since" headers and get "304 Not Modified" answers
from HTTP servers, add the following line to the gatherer's
configuration file:
HTTP-If-Modified-Since: Yes
If the document hasn't changed since last gathering, the gatherer will
use the data from its database, instead of retrieving it again. This
will save bandwidth and speed up gathering significantly.
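For readers unfamiliar with conditional requests, the following sketch shows what such a request looks like when built with Python's standard library. It illustrates the HTTP mechanism only and is not how Harvest implements the option. The URL and the timestamp are placeholders.

```python
import urllib.request
from email.utils import formatdate


def conditional_request(url, last_gathered):
    """Build a GET request that asks for the document only if it changed.

    last_gathered is a Unix timestamp (seconds since the epoch).
    """
    request = urllib.request.Request(url)
    # HTTP dates are always expressed in GMT.
    request.add_header("If-Modified-Since",
                       formatdate(last_gathered, usegmt=True))
    return request


if __name__ == "__main__":
    # www.example.com and the timestamp are placeholders.
    req = conditional_request("http://www.example.com/", 1040515200)
    # urllib stores header names in capitalized form.
    print(req.get_header("If-modified-since"))
```

A server that has no newer version answers such a request with "304 Not Modified" and no document body. When the request is sent with urllib.request.urlopen, that answer is reported as an HTTPError with code 304.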
3.14. Why does Harvest gather different URLs between gatherings?
When HTTP-If-Modified-Since is enabled, the candidate selection scheme
of the http enumerators changes for successful database lookups.
For unchanged URLs, the enumerators behave more like a depth first
gatherer. The result of the gatherings should be the same if you are
gathering all URLs of a site, but if you gather only parts of a site
by using URL=n with n < number of URLs of a site, you will get a
different subset of the site you gather.
3.15. Why has the Gatherer's database vanished after gathering?
The Gatherer uses GDBM databases to store its data on disk. Database
files for the Gatherer can grow very large, depending on how much data
you gather. On some systems (e.g. i386-based Linux), the maximum file
size is 2 GB. If the amount of data surpasses this limit, the GDBM
database file will be wiped from the disk.
3.16. How can I avoid GDBM files growing very big during Gathering?
The Gatherer's temporary GDBM database file WORKING.gdbm will grow
very rapidly when gathering nested objects like tar, tar.gz, zip etc.
archives. GDBM databases keep growing when tuples are inserted and
deleted from them, because GDBM reuses only fractions of the empty
filespace. To get rid of unused space, the GDBM database has to be
reorganized. The reorganization, however, is slow and will slow down
the gathering, so the default is not to reorganize the gatherer's
temporary database. This should work well for small to medium sized
Gatherers, but for large Gatherers it may be necessary to reorganize
the temporary database during gathering to keep the size of the
database at a manageable level. To reorganize WORKING.gdbm every 100
deletions, add the following line to your gatherer configuration file:
Essence-Options: --max-deletions 100
Don't set this value too low, since reorganizing consumes a significant
share of CPU time and disk I/O. Reorganizing every 10 to 100 deletions
seems to be a reasonable value.
3.17. Can I use Htdig as Gatherer? Can the Broker import data from
Htdig?
The perl module Metadata from Dave Beckett can dump data from an Htdig
database into a SOIF stream. Metadata only supports GDBM databases, so
this only works with versions earlier than Htdig 3.1, because newer
versions of Htdig switched from GDBM to Sleepycat's Berkeley DB.
3.18. How can I control access to Gatherer's database?
Edit $HARVEST_HOME/gatherers/YOUR_GATHERER/data/gatherd.cf to allow or
deny access. A line that begins with Allow is followed by any number
of domain or host names that are allowed to connect to the Gatherer.
If the word all is used, then all hosts are matched. Deny is the
opposite of Allow. The following example will only allow hosts in the
cs.colorado.edu or usc.edu domain to access the Gatherer's database:
Allow cs.colorado.edu usc.edu
Deny all
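To reason about which hosts such a rule set lets in, the example can be modelled with a few lines of Python. This is a sketch under stated assumptions, not the Gatherer's actual matching code: it assumes that rules are checked from top to bottom, that the first matching rule decides, and that a host matching no rule is allowed. The function names are made up for illustration.

```python
def parse_rules(text):
    """Parse lines such as 'Allow cs.colorado.edu usc.edu' into (action, names)."""
    rules = []
    for line in text.splitlines():
        words = line.split()
        if words and words[0] in ("Allow", "Deny"):
            rules.append((words[0], [w.lower() for w in words[1:]]))
    return rules


def matches(host, name):
    """True if name is 'all', the host itself, or a domain the host belongs to."""
    host = host.lower()
    return name == "all" or host == name or host.endswith("." + name)


def is_allowed(host, rules, default=True):
    """Assumption: rules are checked top to bottom and the first match decides."""
    for action, names in rules:
        if any(matches(host, name) for name in names):
            return action == "Allow"
    return default  # assumption: hosts matching no rule are allowed


if __name__ == "__main__":
    rules = parse_rules("Allow cs.colorado.edu usc.edu\nDeny all\n")
    for host in ("piper.cs.colorado.edu", "usc.edu", "www.example.com"):
        print(host, "allowed" if is_allowed(host, rules) else "denied")
```

Under these assumptions the order of the lines matters: putting "Deny all" first would shut out every host.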
3.19. Does Harvest's Gatherer support WAP/WML, Gnutella, Napster?
No. Harvest's Gatherer doesn't support WAP. Peer to peer services like
Gnutella, Napster, etc. are also unsupported.
3.20. How do I gather ftp URLs from wu-ftp daemons?
Changes in wu-ftpd 2.6.x broke ftpget. There is a replacement for it
in the contrib directory which wraps any ftp client to behave like
ftpget.
3.21. Why doesn't file URLs in LeafNodes work as expected?
File URLs pointing to directories, like file://misc/documents/, in
LeafNodes are treated as nested objects, which will be unnested.
3.22. Why does gathering from a site fail completely or for parts of
the site?
This may be caused by the site's robots.txt. You can check this by
typing "http://www.SOME.SITE.com/robots.txt" into your favourite web
browser.
4. Summarizer
4.1. Why doesn't Post-Summarizing work?
The most common error is that the instructions are indented by spaces
instead of a tab-stop. Check the Post-Summarizing rule file and make
sure that instructions are indented by a tab-stop. The
Post-Summarizing rule file uses a syntax like that of a Makefile:
conditions begin in the first column and instructions are indented by
a tab-stop.
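Since tabs and spaces look alike in most editors, a small script can help to spot the problem. The sketch below knows nothing about the rule syntax itself. It only reports lines that start with a space, and the rule file name is passed as the first command line argument.

```python
def find_space_indented(lines):
    """Return (line number, text) for lines indented with spaces, not a tab."""
    suspects = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        # A leading tab is fine. A leading space means the instruction
        # is indented in the way this answer warns about.
        if text.startswith(" ") and text.strip():
            suspects.append((number, text))
    return suspects


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as rule_file:
        for number, text in find_space_indented(rule_file):
            print(f"line {number}: indented with spaces: {text!r}")
```

Every reported line should be changed to start with a single tab.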
4.2. How can I summarize meta tags in HTML documents?
In Harvest 1.5.20.kj-0.3, the default summarizer for HTML data was
switched to HTML-lax.sum which does not handle meta tags. Edit
$HARVEST_HOME/lib/gatherer/HTML.sum and uncomment the SGML or Perl
based summarizer.
4.3. Why are raw HTML tags in some query results?
If you see raw HTML tags in query results, the HTML summarizer was not
able to parse the page correctly. Harvest comes with three different
summarizers for HTML. If the default summarizer fails try the other
two summarizers. To do this, edit $HARVEST_HOME/lib/gatherer/HTML.sum
and uncomment one of the summarizers.
4.4. How can I summarize DVI files?
Use a Harvest version older than 1.5.20-kj-0.8 or newer than 1.7.2.
The versions in between have a bug which prevents DVI files from being
summarized.
4.5. How can I summarize Pdf files?
You need xpdf to summarize Pdf files. Harvest uses pdftotext from xpdf
to summarize Pdf files.
Alternatively, you can use acroread to convert Pdf files to PostScript
and pass the result to the PostScript summarizer. To do this, edit
$HARVEST_HOME/lib/gatherer/Pdf.sum accordingly.
4.6. Where can I get pdftotext?
pdftotext is part of xpdf. It is available at the Xpdf homepage
http://www.foolabs.com/xpdf/.
4.7. How can I improve summarizer for Microsoft Word files?
Harvest uses catdoc to summarize Microsoft Word files. If you get bad
summaries for Microsoft Word files, you might want to try wvHtml,
which is part of wvWare, instead of catdoc.
4.8. Where can I get wvWare?
wvWare is available at the wvWare homepage http://www.wvware.com/.
4.9. How can I add support for new file type?
Give the new file type a name and tell Harvest how to recognize
the new file type by modifying byname.cf (to determine the file type by
its name), byurl.cf (to determine the file type by the URL), or magic
and bycontent.cf (to determine the file type by looking at the content
of the file). You will find bycontent.cf, byname.cf, byurl.cf and magic
in your $HARVEST_HOME/lib/gatherer/ directory.
Create a summarizer (a program or script) which takes the filename as
its first argument and prints a SOIF stream "Attributename{length of
data}:your data" to stdout. For file type "Xyz", you have to
create a summarizer called Xyz.sum in the $HARVEST_HOME/lib/gatherer/
directory.
In most cases it might be easiest to convert file type "Xyz" to
a supported file type like HTML, PostScript, etc. and use an existing
summarizer on the converted file.
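A summarizer of the kind described above can be very small. The following sketch would be saved as Xyz.sum for the hypothetical file type "Xyz". It treats the file as plain text and emits two attributes. The attribute names Title and Full-Text are chosen for illustration, and the tab after the colon is an assumption based on the SOIF description in the Harvest User's Manual, so check it against your version.

```python
#!/usr/bin/env python3
import sys


def soif_attribute(name, value):
    """Format one attribute as name{length}:<tab>value, length in bytes."""
    data = value.encode("utf-8")
    # Assumption: SOIF separates the size and the value with ":" and a tab.
    return b"%s{%d}:\t%s\n" % (name.encode("ascii"), len(data), data)


def summarize(filename):
    """Toy summarizer: first line as Title, whole file as Full-Text."""
    with open(filename, encoding="utf-8", errors="replace") as handle:
        text = handle.read()
    first_line = text.splitlines()[0] if text.strip() else ""
    # Title and Full-Text are illustrative attribute names.
    return soif_attribute("Title", first_line) + soif_attribute("Full-Text", text)


if __name__ == "__main__":
    # The file to summarize is passed as the first argument.
    sys.stdout.buffer.write(summarize(sys.argv[1]))
```

The length is counted in bytes rather than characters, so values containing non-ASCII text are measured as they are written to stdout.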
4.10. How can I use nsgmls instead of sgmls to summarize documents?
Edit $HARVEST_HOME/lib/gatherer/SGML.sum and set $sgmls_cmd =
"/usr/local/bin/nsgmls" or wherever you have installed nsgmls.
5. Broker
5.1. How can I start a Broker at boot time?
Some user contributed startup scripts are located in the contrib/etc/
directory of the Harvest source distribution. Modify the appropriate
files and copy them to your startup script directory.
5.2. How can I start a Broker without starting a collection?
When a Broker starts, it starts collecting data, which can take some
time. To avoid this, use the -nocol option when invoking RunBroker.
If you have installed Harvest in /usr/local/harvest/, put the following
line into your startup file, e.g. /etc/rc.local:
/usr/local/harvest/brokers/YOUR_BROKER/RunBroker -nocol
Replace /usr/local/harvest/ with the directory where you have
installed Harvest.
5.3. Why don't the documents which I have gathered right now show up
in the Broker?
The Broker imports data from the Gatherer once every 24 hours. If
you want to import the data immediately after gathering, just restart
the Broker or signal the Broker to import data.
You can signal the Broker with the command line client brkclient,
located in $HARVEST_HOME/lib/broker/, by typing:
# brkclient localhost 8501 '#ADMIN #Password secret #collection'
Replace hostname, port and password if necessary.
Another, easier method is to use the WWW based admin interface at:
"http://www.YOUR_SERVER.com/Harvest/brokers/YOUR_BROKER/admin/admin.html".
5.4. Why do I get error messages when I try to access
"http://some.host/Harvest/brokers/your-broker-path/" after running
$HARVEST_HOME/RunHarvest?
Check the error log of your http daemon. The http daemon must be able
to follow symbolic links. For apache httpd you can do this by adding:
Options FollowSymLinks
to your httpd.conf.
If you don't want symbolic links, delete the symbolic link and copy
the file to the new name.
5.5. Why are NEWS URLs broken? Where are the hostnames in NEWS URLs?
How can I follow NEWS URLs?
Harvest's Gatherer doesn't put hostnames into NEWS URLs. If your web
browser complains about a missing news server, configure your web
browser to use the news server of your provider, company or
organization as your default news server.
For more information on why Harvest doesn't put hostnames into NEWS
URLs, see RFC-1738, sections 3.6 and 3.7.
5.6. Why don't I get any results if I use a long or complex query
string?
The length of a query string is limited to 30 characters when using
regular expressions (wildcards), excluding the escape characters.
5.7. Can I use wildcards in attribute value for structured queries?
No, regular expressions for attribute names and attribute values in
structured queries aren't supported. So, queries like "Author: Smi.*"
or "Auth.*: Smith" won't do what you might expect.
5.8. Are the attribute names case sensitive?
No, the attribute names are not case sensitive. So, "Time-To-Live" is
the same as "Time-to-Live", "Time-to-live", "time-to-live", etc.
5.9. Why doesn't collecting from broker work?
This is due to a bug introduced in Harvest 1.5.18. The bug was fixed
in 1.7.8. To make it work again, update to 1.7.8 or higher.
5.10. How can I customize the Harvest user interface?
The query pages are located in
$HARVEST_HOME/brokers/YOUR_BROKER/query-*. Most likely, you don't
want to make all the variables visible to users who want to query your
broker. Edit query-* and use the hidden type to set suitable defaults
for variables you want to hide.
The result set presentation can be customized by choosing or modifying
the configuration files located in the $HARVEST_HOME/cgi-bin/lib/
directory. The configuration files Sample.cf, classic.cf, modern.cf
and some LANGUAGE.cf are already installed in the
$HARVEST_HOME/cgi-bin/lib/ directory. You can either create a new
configuration file or modify one of the configuration files to get the
result set presentation you want. See the Harvest User's Manual for
information about available options for the configuration file.
If you want to customize the result presentation even further, then
edit $HARVEST_HOME/cgi-bin/search.cgi.
5.11. How do I localize/translate user interface?
To localize the user interface, do:
1. Create
src/broker/example/brokers/skeleton/query-glimpse-modern.html.xx.in,
where xx is a two letter abbreviation for your language/country, by
translating either query-glimpse-modern.html.in or other
query-glimpse-modern.html.yy.in. This is the localized query page.
2. Create components/broker/standard/WWW/language.cf by translating
modern.cf or other translated configuration file like spanish.cf,
german.cf, etc. This will localize the result pages and error
messages.
3. Create src/broker/example/brokers/skeleton/query-glimpse.html.xx.in
by translating query-glimpse.html.in or query-glimpse.html.yy.in.
This is the advanced query page.
4. Translate src/broker/example/brokers/*.html to get localized
additional help pages.
5.12. How can I replace the bundled Glimpse with an other version of
Glimpse?
Edit $HARVEST_HOME/brokers/YOUR_BROKER/admin/broker.conf to let
Harvest know the location of your glimpse, glimpseindex, and
glimpseserver.
6. Terms
6.1. What is a Gatherer?
A Gatherer is a system that retrieves documents from various sources
(Web, News and FTP servers, local files) for processing. In an
HTML/HTTP context, it is also often called a crawler, robot, or spider.
6.2. What is Local-Mapping?
To reduce the CPU load and speed up gathering, Harvest can map local
files to URLs. The gatherer can bypass the server and use the local
file, while pretending to the rest of the Harvest system that the
objects were gathered as usual.
6.3. What is a Summarizer?
A Summarizer transforms a document into a form which is more suitable
for fulltext searching.
The HTML summarizer for example, extracts the title of a document,
removes all HTML tags, generates a wordlist, etc.
6.4. What is a Broker?
A Broker processes search requests received from a user via a CGI
script and presents the search results.
7. Miscellaneous
7.1. Who are the maintainers of Harvest?
Kang-Jin Lee lee@arco.de and Harald Weinreich harald@weinreichs.de are
maintaining Harvest.
7.2. I have found a bug. What should I do?
Post a bug report to the newsgroup comp.infosystems.harvest or mail it
to Kang-Jin Lee lee@arco.de and Harald Weinreich harald@weinreichs.de.
7.3. Is there a mailinglist for Harvest? What about a newsgroup?
There is a Harvest developers' mailing list
http://lists.sourceforge.net/lists/listinfo/harvest-devel/ for Harvest
users and developers. There is also a Harvest newsgroup
news:comp.infosystems.harvest .