How to handle Heritrix stale file handle exception

Heritrix is the Internet Archive's web archival software: essentially a web crawler that takes a list of web sites and saves them as ARC/WARC files in order to create a web archive like the one at archive.org.

Like every other piece of software, it sometimes produces error messages whose cause is not obvious.

One of them is the following:

Caused by: java.nio.file.FileSystemException: /path/to/file: Stale file handle

Other than the exception, you might face the following problems:

  • The REST API returns empty responses for certain jobs, instead of their status.
  • The web UI shows a long chain of exceptions (with the Stale file handle FileSystemException as the root cause) when navigating to the job’s status page.

Cause:
One possible cause of this issue is that Heritrix has a file open on a remote filesystem (an NFS mount, for example), and the connection to that filesystem broke while Heritrix was running, for instance due to a network outage.
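To check whether a path on the remote filesystem currently reports this condition, you can probe it from outside Heritrix. The following is a minimal sketch, not part of Heritrix: the function names are made up for illustration, and it assumes the operating system reports the problem as the ESTALE error code. Note that a fresh probe may succeed even though the handle Heritrix already holds is still stale, so a clean result does not rule out the problem.

```python
import errno
import os


def is_stale(path):
    """Return True if stat() on `path` fails with ESTALE (stale file handle)."""
    try:
        os.stat(path)
    except OSError as exc:
        # Other errors (missing file, permission denied, ...) are not
        # treated as stale handles here.
        return exc.errno == errno.ESTALE
    return False


def find_stale(paths):
    """Return the subset of `paths` that currently report a stale handle."""
    return [p for p in paths if is_stale(p)]
```

You would call `find_stale` with the directories your jobs write to (the paths depend on your own setup); an empty list means none of them reported ESTALE at the time of the check.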

Solution:

  • Safely shut down Heritrix’s other jobs (pause, then checkpoint)
  • Restart Heritrix
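If you have many jobs, the pause and checkpoint step can be scripted against the REST API. The sketch below makes several assumptions you should verify against your installation: a Heritrix 3 engine at `https://localhost:8443/engine`, HTTP digest authentication, a self-signed certificate (so certificate verification is switched off), and the `pause` and `checkpoint` job actions. The helper names and the job names in the usage note are hypothetical.

```python
import ssl
import urllib.parse
import urllib.request


def make_opener(base_url, user, password):
    """Build an opener that uses digest auth and skips TLS verification."""
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(None, base_url, user, password)
    # Assumption: the engine uses a self-signed certificate.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return urllib.request.build_opener(
        urllib.request.HTTPSHandler(context=ctx),
        urllib.request.HTTPDigestAuthHandler(password_mgr),
    )


def job_action_request(base_url, job, action):
    """Create the POST request that sends `action` to the given job."""
    url = f"{base_url}/job/{urllib.parse.quote(job)}"
    data = urllib.parse.urlencode({"action": action}).encode("ascii")
    return urllib.request.Request(
        url, data=data, headers={"Accept": "application/xml"}, method="POST"
    )


def pause_and_checkpoint(opener, base_url, jobs):
    """Send pause, then checkpoint, to every job in `jobs`."""
    for job in jobs:
        for action in ("pause", "checkpoint"):
            request = job_action_request(base_url, job, action)
            with opener.open(request, timeout=30) as response:
                print(job, action, response.status)
```

A call might look like `pause_and_checkpoint(make_opener(base, "admin", "secret"), base, ["job-a", "job-b"])` with `base = "https://localhost:8443/engine"`. Pausing may not take effect immediately, so in practice you may want to confirm each job is paused and its checkpoint has finished before restarting Heritrix.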

After the restart, you can resume the jobs; they should run normally and the error should be gone.
