Harvest FAQ: Gatherer

3. Gatherer

3.1 Does the Gatherer support cookies?

No, Harvest's Gatherer doesn't support cookies.

3.2 Why doesn't Local-Mapping work?

In Harvest 1.7.7, the default HTML enumerator was switched from httpenum-depth to httpenum-breadth. The breadth-first enumerator had a bug in Local-Mapping, which was fixed in Harvest 1.7.19. To make Local-Mapping work, use the depth-first enumerator or update to Harvest 1.7.19 or later.

Local mapping will fail if the file is not readable by the gatherer process, is not a regular file, has execute bits set, or has a filename containing characters that have to be escaped (such as tilde, space, curly brace, or quote). So for directories, symbolic links, and CGI scripts, the gatherer will always contact the server instead of using the local file.
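
The conditions above can be checked by hand before relying on Local-Mapping. The following is a minimal sketch, not part of Harvest: can_map_locally is a hypothetical helper, and the set of "safe" filename characters it accepts is a conservative assumption rather than Harvest's exact escaping rule. Run it as the user that runs the gatherer, since readability depends on that user.

```shell
#!/bin/sh
# Hypothetical helper, not part of Harvest.
# Returns 0 if FILE looks eligible for Local-Mapping, 1 otherwise.
can_map_locally() {
    f=$1
    # Assumption: only these characters are treated as safe; anything
    # else (tilde, space, curly brace, quote, ...) is rejected.
    case $f in
        *[!A-Za-z0-9_./-]*) return 1 ;;
    esac
    case $f in
        -*) f=./$f ;;                 # keep find from reading it as an option
    esac
    [ -L "$f" ] && return 1           # symbolic link
    [ -f "$f" ] || return 1           # directory, device, or missing
    [ -r "$f" ] || return 1           # not readable by the current user
    # Any execute bit set for user, group, or other?
    if [ -n "$(find "$f" -prune \( -perm -100 -o -perm -010 -o -perm -001 \))" ]; then
        return 1
    fi
    return 0
}
```

For example, "can_map_locally /var/www/index.html && echo eligible" prints "eligible" only when every condition holds; the path is a placeholder.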

3.3 Does the Gatherer gather the Root- and LeafNode-URLs periodically?

No, the Gatherer gathers Root- and LeafNode URLs only once. To check the URLs periodically, you have to use cron (see "man 8 cron") to run $HARVEST_HOME/gatherers/YOUR_GATHERER/RunGatherer.
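
As a sketch, a crontab entry that reruns the gatherer every night could be generated as follows. The schedule (02:30 daily) and the fallback install path /usr/local/harvest are assumptions chosen for illustration; YOUR_GATHERER is the placeholder used above. The path is expanded when the entry is built because cron jobs run with a minimal environment in which HARVEST_HOME is normally not set.

```shell
#!/bin/sh
# Assumption: Harvest is installed under /usr/local/harvest unless
# HARVEST_HOME says otherwise.
HARVEST_HOME=${HARVEST_HOME:-/usr/local/harvest}
GATHERER=YOUR_GATHERER

# Fields: minute hour day-of-month month day-of-week command
entry="30 2 * * * $HARVEST_HOME/gatherers/$GATHERER/RunGatherer"
printf '%s\n' "$entry"

# To install it for the current user, append it to the existing crontab:
#   ( crontab -l 2>/dev/null; printf '%s\n' "$entry" ) | crontab -
```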

3.4 Can Harvest gather https URLs?

No, https is not supported by Harvest. To gather https URLs, use Harvest-ng from Simon Wilkinson. It is available at the Harvest-ng homepage, http://webharvest.sourceforge.net/ng/.

3.5 When will Harvest be able to gather https URLs?

This is not at the top of my to-do list and may take some time.

3.6 Does Harvest support client-side scripting or plugins such as JavaScript and Flash?

No, Harvest's gatherer does not support JavaScript, Flash, etc., and there are no plans to add support for them.

3.7 Why does the gatherer stop after gathering a few pages?

Harvest's gatherer doesn't support JavaScript, Flash, etc., so it cannot follow links that are only reachable through them. Check the site you want to gather and make sure that it is browsable without any plugins or JavaScript.

3.8 How can I index local newsgroups? How can I put hostname into News URL?

You will find a News URL hostname patch by Collin Smith in the contrib directory.

NOTE: Most web browsers support a hostname in News URLs, but it violates RFC 1738.

3.9 What do the gatherer options "Search=Breadth" and "Search=Depth" do, and which keywords are available for the "Search=" option?

The Search option selects an enumerator for http and gopher URLs. Harvest comes with a breadth-first (Search=Breadth) and a depth-first (Search=Depth) enumerator for both protocols. They use different strategies when following URLs to build the list of candidates for processing. The breadth-first enumerator processes all links on one level before descending to the next level. If you limit the number of URLs to gather from a site, it will give you a more representative overview of the site. The depth-first enumerator descends to the next level as soon as possible; when there are no links left in the current branch, it processes the next branch. The depth-first enumerator doesn't use as much memory as the breadth-first enumerator. Unless you have a compelling reason to switch enumerators, the default value should be a reasonable choice.
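
The difference between the two strategies shows up most clearly when the number of URLs is limited. The following toy model is only an illustration and is not Harvest's enumerator code: the link graph in links_of is made up, and enumerate is a hypothetical function that visits at most a given number of pages starting from "/".

```shell
#!/bin/sh
# Toy link graph (hypothetical pages): prints the links found on a page.
links_of() {
    case $1 in
        /)  echo "/a /b" ;;
        /a) echo "/a/1 /a/2" ;;
        /b) echo "/b/1" ;;
        *)  echo "" ;;
    esac
}

# enumerate MODE LIMIT: print up to LIMIT pages, starting from "/".
#   breadth: new links go to the end of the pending list (queue)
#   depth:   new links go to the front of the pending list (stack)
enumerate() {
    mode=$1
    limit=$2
    pending="/"
    seen=" "
    out=""
    count=0
    while [ "$count" -lt "$limit" ]; do
        set -- $pending                 # split the pending list into words
        [ $# -gt 0 ] || break           # nothing left to visit
        page=$1
        shift
        pending="$*"
        case $seen in
            *" $page "*) continue ;;    # already visited
        esac
        seen="$seen$page "
        out="$out${out:+ }$page"
        count=$((count + 1))
        if [ "$mode" = breadth ]; then
            pending="$pending $(links_of "$page")"
        else
            pending="$(links_of "$page") $pending"
        fi
    done
    echo "$out"
}

enumerate breadth 4    # prints: / /a /b /a/1
enumerate depth 4      # prints: / /a /a/1 /a/2
```

With a limit of four pages, the breadth-first order touches both top-level branches, while the depth-first order never reaches /b. Without a limit, both orders visit the same six pages.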

3.10 How can I index HTML pages generated by CGI scripts? How can I index URLs that have a "?" (question mark) in them?

Remove HTTP-Query from $HARVEST_HOME/lib/gatherer/stoplist.cf and $HARVEST_HOME/gatherers/YOUR_GATHERER/lib/stoplist.cf. For versions earlier than 1.7.5, you also have to create a (symbolic) link from $HARVEST_HOME/lib/gatherer/HTML.sum to $HARVEST_HOME/lib/gatherer/HTTP-Query.sum. To do this, type:


        # cd $HARVEST_HOME/lib/gatherer
        # ln -s HTML.sum HTTP-Query.sum
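
The stoplist edit described above can also be scripted. This is a minimal sketch under the assumption that stoplist.cf lists one type name per line; remove_from_stoplist is a hypothetical helper, not a Harvest tool. It keeps a backup copy next to the original file.

```shell
#!/bin/sh
# Hypothetical helper: remove the line naming TYPE from a stoplist FILE,
# keeping the previous version as FILE.bak.
# Assumption: the stoplist holds one type name per line.
remove_from_stoplist() {
    file=$1
    type=$2
    cp "$file" "$file.bak" || return 1
    # -F: fixed string, -x: whole line must match, -v: keep all other lines.
    grep -v -x -F -- "$type" "$file.bak" > "$file"
    return 0
}

# Usage, for both files named above:
#   remove_from_stoplist "$HARVEST_HOME/lib/gatherer/stoplist.cf" HTTP-Query
#   remove_from_stoplist "$HARVEST_HOME/gatherers/YOUR_GATHERER/lib/stoplist.cf" HTTP-Query
```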

3.11 Why is the gatherer so slow? How can I make it faster?

The gatherer's default setting is to sleep one second after retrieving a URL. This is to avoid overloading the web server. If you gather from web servers under your control and know that they can handle the additional load caused by the gatherer, add "Delay=0" to your root node specification to disable the sleep.

The lines should look like:


        <RootNodes>
        http://www.SOMESERVER.com/ Search=Breadth Delay=0
        </RootNodes>

Alternatively, you can set the delay value for all root nodes by adding "Access-Delay: 0" to your configuration file.

It should look like:


        Gatherer-Name:  YOUR Gatherer
        Gatherer-Port:  8500
        Top-Directory:  /HARVEST_DIR/work1/gatherers/testgather
        Access-Delay:   0

        <RootNodes>
        http://www.MYSITE.com/ Search=Breadth
        </RootNodes>

3.12 Why is the gatherer still so slow?

Harvest's gatherer is designed to handle many types of documents and many types of protocols. To achieve this flexibility, it uses external programs to handle the different types of documents and protocols. For example, when gathering HTML documents via HTTP, the document is parsed twice: first to get the list of candidates to gather, and then to produce a summary of the document. The summarizer is started each time a document arrives, quits after summarizing that document, and has to be restarted for the next document. Compared to more HTTP/HTML-oriented approaches, this causes significant overhead when gathering HTTP/HTML only.

Harvest retrieves one document at a time, which causes a slowdown when you encounter a slow site. Due to its implementation, the gathering process is quite heavyweight and uses up to 25 MB of RAM per Gatherer. For this reason, no attempt has been made to spawn more gatherers to optimize bandwidth usage.

3.13 How do I request "304 Not Modified" answers from HTTP servers?

To send "If-Modified-Since: xx" headers and get "304 Not Modified" answers from HTTP servers, add the following line to the gatherer's configuration file:


        HTTP-If-Modified-Since: Yes

If the document hasn't changed since the last gathering, the gatherer will use the data from its database instead of retrieving it again. This saves bandwidth and speeds up gathering significantly.
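
To see how a particular server answers such a conditional request, you can try one by hand. The following is a sketch only and does not show how Harvest issues requests internally: it assumes curl is installed, conditional_status and action_for_status are hypothetical helper names, and the URL and timestamp in the usage comment are placeholders.

```shell
#!/bin/sh
# Print the HTTP status code of a conditional GET.
#   $1: URL
#   $2: time of the last gathering as an HTTP date,
#       e.g. "Sat, 01 Jan 2005 00:00:00 GMT"
conditional_status() {
    curl -s -o /dev/null -w '%{http_code}' \
         -H "If-Modified-Since: $2" "$1"
}

# Map the status code to what a gatherer would do with it.
action_for_status() {
    case $1 in
        304) echo "reuse stored data" ;;
        200) echo "fetch and summarize again" ;;
        *)   echo "handle as error or redirect" ;;
    esac
}

# Usage (needs network access):
#   status=$(conditional_status http://www.SOMESERVER.com/ \
#            'Sat, 01 Jan 2005 00:00:00 GMT')
#   action_for_status "$status"
```

A server that ignores the header simply answers 200 with the full document, so enabling the option does not break gathering on such servers; it just saves nothing there.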

3.14 Why does Harvest gather different URLs between gatherings?

When HTTP-If-Modified-Since is enabled, the candidate selection scheme of the http enumerators changes for successful database lookups. For unchanged URLs, the enumerators behave more like a depth-first gatherer. The results should be the same if you gather all URLs of a site, but if you gather only part of a site by using URL=n with n smaller than the number of URLs on the site, you will get a different subset of the site.

3.15 Why has the Gatherer's database vanished after gathering?

The Gatherer uses GDBM databases to store its data on disk. Database files for the Gatherer can grow very large, depending on how much data you gather. On some systems (e.g. i386-based Linux), the maximum file size is 2 GB. If the amount of data surpasses this limit, the GDBM database file will be wiped from the disk.
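
One way to notice the problem before the limit is reached is to watch the size of the database files. This is a minimal sketch, not a Harvest tool: warn_large_gdbm is a hypothetical helper, and searching the whole gatherer directory for files named *.gdbm is an assumption about where the databases live.

```shell
#!/bin/sh
# Hypothetical helper: print every *.gdbm file under DIR whose size is
# at least LIMIT bytes.
warn_large_gdbm() {
    dir=$1
    limit=$2
    find "$dir" -type f -name '*.gdbm' | while IFS= read -r f; do
        size=$(( $(wc -c < "$f") ))     # arithmetic strips any padding
        if [ "$size" -ge "$limit" ]; then
            echo "$f"
        fi
    done
}

# Usage: warn at 1.5 GiB (1610612736 bytes), well below the 2 GB limit.
#   warn_large_gdbm "$HARVEST_HOME/gatherers/YOUR_GATHERER" 1610612736
```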

3.16 How can I avoid GDBM files growing very big during Gathering?

The Gatherer's temporary GDBM database file WORKING.gdbm will grow very rapidly when gathering nested objects such as tar, tar.gz, and zip archives. GDBM databases keep growing when tuples are inserted into and deleted from them, because GDBM reuses only fractions of the empty file space. To get rid of unused space, the GDBM database has to be reorganized. Reorganization, however, is slow and will slow down the gathering, so the default is not to reorganize the gatherer's temporary database. This should work well for small to medium-sized Gatherers, but for large Gatherers it may be necessary to reorganize the temporary database during gathering to keep its size at a manageable level. To reorganize WORKING.gdbm every 100 deletions, add the following line to your gatherer configuration file:


        Essence-Options: --max-deletions 100

Don't set this value too low, since reorganization will then consume a significant share of CPU time and disk I/O. Reorganizing every 10 to 100 deletions seems to be a reasonable range.

3.17 Can I use Htdig as Gatherer? Can the Broker import data from Htdig?

The Perl module Metadata by Dave Beckett can dump data from an Htdig database into a SOIF stream. Metadata only supports GDBM databases, so this only works with versions earlier than Htdig 3.1, because newer versions of Htdig switched from GDBM to Sleepycat's Berkeley DB.

3.18 How can I control access to Gatherer's database?

Edit $HARVEST_HOME/gatherers/YOUR_GATHERER/data/gatherd.cf to allow or deny access. A line that begins with Allow is followed by any number of domain or host names that are allowed to connect to the Gatherer. If the word all is used, then all hosts are matched. Deny is the opposite of Allow. The following example will only allow hosts in the cs.colorado.edu or usc.edu domains to access the Gatherer's database:


        Allow  cs.colorado.edu usc.edu
        Deny   all