
4.10 Troubleshooting

   

Symptom
The Gatherer doesn't pick up all the objects pointed to by some of my RootNodes.

Solution
The Gatherer places various limits on enumeration to prevent a misconfigured Gatherer from abusing servers or running wildly. See section 4.3 for details on how to override these limits.

Symptom
Local-Mapping did not work for me---it retrieved the objects via the usual remote access protocols.

Solution
A local mapping will fail if the object is a directory, a symbolic link, or a CGI script; in those cases the HTTP server is always contacted. We do not perform URL translation for local mappings, so the local mapping will also fail if your URLs contain special characters that must be escaped.

If you are using the source distribution, you can turn on debugging in src/common/url.c and see how the local filenames are constructed.
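The constraints above can be illustrated with a short Python sketch (illustrative only, not Harvest's actual C code; the mapping table and function names are hypothetical):

```python
import os

# Hypothetical prefix table: URL prefix -> local directory, as a
# Local-Mapping configuration might provide.
MAPPINGS = {
    "http://example.com/docs/": "/usr/local/htdocs/docs/",
}

def map_to_local(url):
    """Return a local filename for url, or None to fall back to HTTP.

    Mirrors the constraints described above: directories, symbolic
    links, and URLs needing escaping are not read from the local
    file system.
    """
    for prefix, root in MAPPINGS.items():
        if url.startswith(prefix):
            path = os.path.join(root, url[len(prefix):])
            if os.path.islink(path) or os.path.isdir(path):
                return None          # always contact the HTTP server
            if "?" in url or "%" in url:
                return None          # no URL translation is performed
            if os.path.isfile(path):
                return path
    return None                      # no mapping matched
```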

Symptom
Using the --full-text option I see a lot of raw data in the content summaries, with few keywords I can search.

Solution
At present --full-text simply includes the full data content in the SOIF summaries. Using the individual file type summarizing mechanism described in Section 4.4.4 will work better in this regard, but will require you to specify how data are extracted for each individual file type. In a future version of Harvest we will change the Essence --full-text option to perform content extraction before including the full text of documents.

Symptom
The ``Last-Modification-Time'' of the gathered data is always 0.

Solution
At present we do not fill in this field for HTTP documents; we use MD5 [21] checksums instead. In a future version of Harvest, we will set this field from the Last-Modified MIME header in the HTTP response.

Symptom
Gathered data are not being updated.

Solution
The Gatherer does not automatically do periodic updates. See Section 4.8 for details.
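A common approach is to invoke the Gatherer periodically from cron via the RunGatherer script in your Gatherer's directory; the path and schedule below are illustrative:

```
        # Example crontab entry: re-run the Gatherer every Sunday at 2am
        0 2 * * 0 /usr/local/harvest/gatherers/example/RunGatherer
```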

Symptom
When I run my Gatherer after changing one of the files it recently gathered, it does not retrieve the changed file.

Solution
The Gatherer maintains a local disk cache to reduce network load when restarting after a machine crash. You can force a reload by removing the local disk cache, or by running the urlpurge program, before running the Gatherer again (see Appendix A).
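For example (the Gatherer directory shown is illustrative; see Appendix A for urlpurge's exact usage):

```
        % cd /usr/local/harvest/gatherers/example
        % urlpurge
        % ./RunGatherer
```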

Symptom
The Gatherer puts slightly different URLs in the SOIF summaries than I specified in the Gatherer configuration file.

Solution
This happens because the Gatherer attempts to put URLs into a canonical format. It does this by removing default port numbers, stripping ``#'' bookmark references, and making similar cosmetic changes. Also, by default, Essence (the content extraction subsystem within the Gatherer) removes the types listed in the standard stoplist.cf, which include HTTP-Query (CGI queries).
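The kind of canonicalization involved can be sketched in Python (a minimal illustration, not Harvest's actual rules):

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """Drop the default HTTP port and any '#' bookmark from a URL."""
    parts = urlsplit(url)
    netloc = parts.netloc
    if parts.scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]      # :80 is the default port, so drop it
    # A '#' bookmark names a position inside the object, not a
    # different object, so it is removed as well.
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, ""))
```

For example, canonicalize("http://example.com:80/doc.html#intro") yields "http://example.com/doc.html".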

   

Symptom
There are no Last-Modification-Time or MD5 attributes in my gathered SOIF data, so the Broker can't do duplicate elimination.

Solution
If you gather remote, manually created information (as in our PC Software Broker), it is pulled into Harvest using ``exploders'' that translate the remote format into SOIF. Exploders have no direct way to fill in the Last-Modification-Time or MD5 information per record. Note also that a single update to the remote records would then cause all records to look updated, resulting in more network load for Brokers that collect this Gatherer's data. As a solution, you can compute MD5s for all objects and store them as part of each record. Then, when you run the exploder, you generate new timestamps only for the records whose MD5s changed---giving you real last-modification times.
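The MD5-based approach can be sketched in Python (a minimal sketch; the record layout here is hypothetical, not SOIF):

```python
import hashlib

def md5_of(data):
    """MD5 checksum of an object's raw bytes."""
    return hashlib.md5(data).hexdigest()

def update_record(store, key, data, now):
    """Refresh the stored timestamp only when the object's MD5 changed,
    so unchanged objects keep their real last-modification time."""
    digest = md5_of(data)
    rec = store.get(key)
    if rec is None or rec["md5"] != digest:
        rec = {"md5": digest, "last_modification_time": now}
        store[key] = rec
    return rec
```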

Symptom
When I search using keywords I know are in a document I have indexed with Harvest, the document isn't found.

Solution
Harvest uses a content extraction subsystem called Essence that by default does not extract every keyword in a document. Instead, it uses heuristics to try to select promising keywords. You can change what keywords are selected by customizing the summarizers for that type of data, as discussed in Section 4.4.4. Or, you can tell Essence to use full text summarizing if you feel the added disk space costs are merited, as discussed in Section 4.5.

Symptom
I'm running Harvest on HP-UX, but the essence process in the Gatherer takes too much memory.

Solution
The supplied regular expression library has memory leaks on HP-UX, so you need to use the regular expression library supplied with HP-UX. Change the Makefile in src/gatherer/essence to read:

        REGEX_DEFINE    = -DUSE_POSIX_REGEX
        REGEX_INCLUDE   =
        REGEX_OBJ       =
        REGEX_TYPE      = posix

Symptom
I built the configuration files to customize how Essence types data and extracts content, but it uses the standard typing/extracting mechanisms anyway.

Solution
Verify that Lib-Directory is set to the lib/ directory in which you put your configuration files. Lib-Directory is defined in your Gatherer configuration file.

Symptom
Essence dumps core when run (from the Gatherer).

Solution
Check if you're running a non-stock version of the Domain Name System (DNS) resolver under SunOS. There is a version that fixes some security holes, but it is not compatible with the version of the DNS resolver library with which we link essence for the binary Harvest distribution. If this is indeed the problem, you can either run the binary Harvest distribution on a stock SunOS machine, or rebuild Harvest from source (more specifically, rebuild essence, linking with the non-stock DNS resolver library).

     

Symptom
I am having problems resolving host names on SunOS.

Solution
In order to gather data from hosts outside of your organization, your system must be able to resolve fully qualified domain names into IP addresses. If your system cannot resolve hostnames, you will see error messages such as ``Unknown Host.'' In this case, either:

  1. the hostname you gave does not really exist; or
  2. your system is not configured to use the DNS.

To verify that your system is configured for DNS, make sure that the file /etc/resolv.conf exists and is readable. Read the resolv.conf(5) manual page for information on this file. You can verify that DNS is working with the nslookup command.
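A resolver check can also be done programmatically; here is a minimal Python sketch (localhost is used so the example is self-contained, but any hostname works):

```python
import socket

def resolves(hostname):
    """Return the IPv4 address for hostname, or None if the lookup
    fails (the ``Unknown Host'' case)."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None
```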

The Harvest executables for SunOS (4.1.3_U1) are statically linked with the stock resolver library from /usr/lib/libresolv.a. If you seem to have problems with the statically linked executables, please try to compile Harvest from the source code (see Section 3). This will make use of your local libraries, which may have been modified for your particular organization.

Some sites may use Sun Microsystems' Network Information Service (NIS) instead of, or in addition to, DNS. We believe that Harvest works on systems where NIS has been properly configured. The NIS servers (the names of which you can determine from the ypwhich command) must be configured to query DNS servers for hostnames they do not know about. See the -b option of the ypxfr command.

We would welcome reports of Harvest successfully working with NIS. Please email us at harvest-dvl@cs.colorado.edu.

 

Symptom
I cannot get the Gatherer to work across our firewall gateway.

Solution
Harvest currently will not operate across a strict Internet firewall. The Gatherer, Broker, and Replicator cannot (yet) request objects through a proxy server. You can either run these Harvest components internally (behind the firewall) or else on the firewall host itself.

If you see the ``Host is unreachable'' message, these are the likely problems:

  1. your connection to the Internet is temporarily down due to a circuit or routing failure; or
  2. you are behind a firewall.

If you see the ``Connection refused'' message, the likely problem is that you are trying to connect with an unused port on the destination machine. In other words, there is no program listening for connections on that port.
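The ``Connection refused'' case is easy to demonstrate; this Python sketch connects to a local port that nothing is listening on and observes the refusal:

```python
import socket

# Ask the OS for a free port, then close it so no one is listening there.
probe = socket.socket()
probe.bind(("127.0.0.1", 0))
unused_port = probe.getsockname()[1]
probe.close()

try:
    socket.create_connection(("127.0.0.1", unused_port), timeout=2)
    outcome = "connected"
except ConnectionRefusedError:
    # No program is listening on that port, so the connection is refused.
    outcome = "refused"
```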

The Harvest Gatherer is essentially a WWW client. You should expect it to work the same as Mosaic, but without proxy support. We would be interested to hear of cases where the Gatherer is unable to contact a host even though other network programs (Mosaic, telnet, ping) can reach that host without going through a proxy.

 






Darren Hardy
Mon Apr 3 15:22:37 MDT 1995