The most common error is that instructions are indented with spaces instead of a tab. The Post-Summarizing rule file uses a Makefile-like syntax: conditions begin in the first column, and instructions are indented by one tab-stop. Check the rule file and make sure that every instruction line starts with a tab character.
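Space indentation is hard to see in most editors, so a small script can help find it. The following is a minimal sketch in Python, not part of Harvest. It assumes that every indented, non-blank line in the rule file is an instruction and should therefore start with a tab; the function name find_space_indented is made up for this example.

```python
import sys


def find_space_indented(lines):
    """Return (line_number, text) pairs for lines that start with a space.

    Assumption: any indented, non-blank line is an instruction and must
    begin with a tab character, as in a Makefile.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        # Blank or whitespace-only lines are ignored.
        if line.startswith(" ") and line.strip():
            problems.append((number, line.rstrip("\n")))
    return problems


# Usage: python check_tabs.py <rule-file>
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for number, text in find_space_indented(handle):
            print(f"line {number}: indented with spaces: {text!r}")
```

Every line it reports should be re-indented with a single tab.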
In Harvest 1.5.20-kj-0.3, the default summarizer for HTML data was
switched to HTML-lax.sum, which does not handle meta tags. Edit
$HARVEST_HOME/lib/gatherer/HTML.sum and uncomment the SGML or Perl
based summarizer.
If you see raw HTML tags in query results, the HTML summarizer was not
able to parse the page correctly. Harvest comes with three different
summarizers for HTML. If the default summarizer fails, try the other
two. To do this, edit $HARVEST_HOME/lib/gatherer/HTML.sum and uncomment
one of the summarizers.
Use a Harvest version older than 1.5.20-kj-0.8 or newer than 1.7.2. The versions in between have a bug that prevents DVI files from being summarized.
You need xpdf to summarize PDF files. Harvest uses pdftotext, which is
part of xpdf, to summarize them. Alternatively, you can use acroread to
convert PDF files to PostScript and pass the result to the PostScript
summarizer. To do this, edit $HARVEST_HOME/lib/gatherer/Pdf.sum
accordingly. xpdf is available at the Xpdf homepage,
http://www.foolabs.com/xpdf/.
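Before editing any summarizer, it can be worth confirming that the external converters are actually installed. The sketch below is an illustration in Python, not something shipped with Harvest. It only checks whether the named programs can be found on the PATH of the current shell, which may differ from the environment the Gatherer runs in. The helper name missing_tools is made up for this example.

```python
import shutil


def missing_tools(names):
    """Return the names from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # pdftotext and acroread are the two converters mentioned above.
    for name in missing_tools(["pdftotext", "acroread"]):
        print(f"{name}: not found on PATH")
```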
Harvest uses catdoc to summarize Microsoft Word files. If you get
bad summaries for Microsoft Word files, you might want to try wvHtml,
which is part of wvWare, instead of catdoc. wvWare is available at the
wvWare homepage, http://www.wvware.com/.
Give the new file type a name and tell Harvest how to recognize it by modifying byname.cf (to determine the file type by file name), byurl.cf (to determine the file type by URL), or magic and bycontent.cf (to determine the file type by looking at the content of the file). You will find bycontent.cf, byname.cf, byurl.cf and magic in your $HARVEST_HOME/lib/gatherer/ directory.
Create a summarizer (a program or script) which takes the filename as
its first argument and prints a SOIF stream of the form
"Attributename{length of data}:<tab>your data" to stdout. For file type
"Xyz", you have to create a summarizer called Xyz.sum in the
$HARVEST_HOME/lib/gatherer/ directory.
In most cases it is easiest to convert file type "Xyz" to a supported file type such as HTML or PostScript and run an existing summarizer on the converted file.
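The output format of a summarizer can be sketched as follows. This is a minimal illustration in Python, not a summarizer shipped with Harvest. It assumes that the length in braces is the byte count of the data, that the data is encoded as UTF-8, and that "Xyz" files are plain text. The attribute names title and full-text are chosen only for illustration.

```python
import sys


def soif_attribute(name, value):
    """Format one attribute as b'Name{length}:<tab>value' plus a newline.

    The length is the number of bytes in the encoded value; UTF-8 is an
    assumption made for this sketch.
    """
    data = value.encode("utf-8")
    return name.encode("ascii") + b"{%d}:\t" % len(data) + data + b"\n"


def summarize(path):
    """Hypothetical summarizer body for an imaginary text-based type "Xyz"."""
    with open(path, "rb") as handle:
        text = handle.read().decode("utf-8", errors="replace")
    lines = text.splitlines()
    # Use the first line as a title; the attribute names are illustrative.
    title = lines[0].strip() if lines else ""
    return soif_attribute("title", title) + soif_attribute("full-text", text)


# The file to summarize is passed as the first argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.stdout.buffer.write(summarize(sys.argv[1]))
```

Writing bytes to stdout keeps the declared length consistent with what is actually printed, which matters once the data contains non-ASCII characters.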
Edit $HARVEST_HOME/lib/gatherer/SGML.sum and set $sgmls_cmd to the
location of nsgmls on your system, for example
$sgmls_cmd = "/usr/local/bin/nsgmls", or wherever you have installed
nsgmls.