Caching in PrizmDoc for Java

PrizmDoc® for Java uses an in-memory cache (Ehcache v3.3.1) to reduce document retrieval time. When PrizmDoc® for Java retrieves a document from the content handler, the document is inserted into the cache to speed up subsequent retrievals.

What is cached?

PrizmDoc® for Java’s most-used cache is the document cache. It holds two types of objects. The first is a wrapper for document data. The second is a considerably more complex object representing the layout of certain document formats, including Office formats. Each PrizmDoc® for Java instance uses its own cache. To keep instances synchronized, PrizmDoc® for Java removes a document from the cache when it is modified and saved.

PrizmDoc® for Java also uses two other memory caches, though to a lesser degree:

- PrizmDoc® for Java caches OCR data: when OCR completes, it produces a PDF defining positional text data, and this PDF is cached to avoid repeating OCR in the same session.
- PrizmDoc® for Java maintains a cache called the validation cache: when the content handler allows or disallows use of the document cache for a certain document, that response is stored in the validation cache.

Configuration

The memory cache is configured in the file WEB-INF/ehcache.xml, which defines three caches. The first is the main cache, labelled “vvDocumentCache”; when PrizmDoc® for Java documentation mentions a cache without qualification, it is referring to this main document cache. The second is the OCR cache, labelled “vvOcrCache”. It caches OCR data so that OCR does not have to be repeated in a session. Finally, “vvValidationCache” caches responses from the content handler method validateCache.

Each cache is identified by an alias attribute on its cache tag and by two child tags, key-type and value-type. These values must not be changed.

The expiry tag and the resources tag may be modified. By default, the document cache will remove entries that haven’t been used in 60 minutes, and the validation cache will remove entries after 5 minutes; the OCR cache will keep entries without an expiry time.

Within the resources tag, the heap tag configures the maximum size of the cache. In each cache, the heap is described in the unit “entries.” This means that the cache will limit how much it can store based on the count of entries rather than their size. While it is possible to set the units attribute to some memory unit like MB or GB, this is not recommended.

Using a unit other than “entries” forces Ehcache to compute the size of each entry by walking that entry's entire object tree when it is inserted into the cache. This significantly decreases performance and increases memory usage.
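
To make the idea of a count-based limit concrete, here is a minimal sketch in plain Java. It is not Ehcache and not how PrizmDoc® for Java implements its caches; the class name EntryBoundedCache is hypothetical. It only shows that a cache limited by "entries" needs to consult the entry count alone and never has to measure the stored objects.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Illustrative only (not Ehcache): a cache bounded by entry count.
 * The class name is hypothetical.
 */
class EntryBoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    EntryBoundedCache(int maxEntries) {
        // accessOrder = true keeps the least recently used entry first.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Only the entry count is checked; the size of each value in
        // memory is never inspected, so insertion stays cheap.
        return size() > maxEntries;
    }
}
```

With a limit of two entries, inserting a third entry evicts the least recently used one, regardless of whether the stored values are ten bytes or ten megabytes. A byte-based limit would instead have to measure every value on insertion, which is the cost described above.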

Additional configuration can be found in WEB-INF/web.xml, via init-params.

The init-param enableDocumentCache takes a boolean. If this is set to false, the document cache will not be used. It is highly recommended to leave the document cache enabled; disabling the cache will cause significant performance degradation. The document cache should never be disabled if users are viewing document formats that use SnowDoc, like Microsoft Office formats. SnowDoc formats require the document cache for performance optimization. For other format types, however, the document cache could be disabled in favor of another cache solution implemented in the content handler.

The init-param clearCacheOnSave also takes a boolean. If this is set to true, when a user saves a document, the document will be removed from the cache. The document will then be re-requested from the content handler if it needs to be displayed again. This allows the content handler to implement synchronization of user sessions. It is recommended to keep this item set to true.

Server & Client API

Aside from configuration, there are several ways to control cache behavior dynamically.

On the client, the API functions virtualViewer.seedCache(documentId, pages, clientInstanceId) and virtualViewer.removeDocumentFromCache(documentId, clientInstanceId) will respectively add and remove documents from the document cache:

- seedCache will retrieve a document from the content handler and add it to the cache. For SnowDoc documents, this may also initiate page layout operations. This function takes three parameters. The documentId parameter identifies the document to be added to the cache and is mandatory. The pages parameter is optional and only affects Sparse Documents; it holds an array of page numbers to add to the cache. Finally, the clientInstanceId parameter is optional; it directly passes a clientInstanceId, a piece of data that is passed all the way to the content handler.
- removeDocumentFromCache will manually remove a document from the cache. It takes two parameters: the mandatory ID of the document to remove, and the optional clientInstanceId.

On the server, implementing the content handler interface CacheValidator allows fine-grained control over which documents are allowed to enter the cache. The interface defines one function, validateCache.

validateCache is called before each document is stored in or retrieved from the PrizmDoc® for Java document cache. It can confirm the operation or prevent it on a document-by-document basis.

The response for each document and operation is cached for a short time in PrizmDoc® for Java to prevent asking about the same operation multiple times in quick succession. In other words, if a specific document is prevented from being cached, PrizmDoc® for Java will not ask again for a few minutes and the document will remain uncached for that time. To modify how long a response from validateCache will be remembered, configure the expiry time attribute of the validation cache in ehcache.xml.
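
The effect of remembering a response for a short time can be sketched in plain Java. This is an illustration only, not the PrizmDoc® for Java implementation (which uses the validation cache configured in ehcache.xml); the class ExpiringDecisionMemo, its methods, and the injected clock are all hypothetical names chosen for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/**
 * Illustrative only: remembers an allow/deny answer per document and
 * operation for a fixed time. All names here are hypothetical.
 */
class ExpiringDecisionMemo {
    private static final class Entry {
        final boolean allowed;
        final long expiresAtMillis;

        Entry(boolean allowed, long expiresAtMillis) {
            this.allowed = allowed;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injected so the sketch is testable

    ExpiringDecisionMemo(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Stores the answer for this document and operation. */
    void remember(String documentId, String action, boolean allowed) {
        long expiresAt = clock.getAsLong() + ttlMillis;
        entries.put(action + ":" + documentId, new Entry(allowed, expiresAt));
    }

    /** Returns the remembered answer, or null if none is stored or it expired. */
    Boolean recall(String documentId, String action) {
        String key = action + ":" + documentId;
        Entry entry = entries.get(key);
        if (entry == null) {
            return null;
        }
        if (clock.getAsLong() >= entry.expiresAtMillis) {
            entries.remove(key); // expired: the question must be asked again
            return null;
        }
        return entry.allowed;
    }
}
```

In this sketch, a denial remembered with a five-minute lifetime is returned for every lookup during those five minutes, which mirrors why a document that was refused caching stays uncached until the remembered response expires.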

Like all content handler API functions, validateCache takes a ContentHandlerInput object and returns a ContentHandlerResult object.

The ContentHandlerInput object contains the following values:

- The key KEY_CACHE_ACTION gets a value of either ContentHandlerInput.VALUE_CACHE_GET or ContentHandlerInput.VALUE_CACHE_PUT, the action to be confirmed for the specified document. GET asks whether the document should be retrieved from the cache, while PUT asks whether it should be stored.
- The key KEY_DOCUMENT_ID stores the ID value that represents the document. This can be retrieved with the code String documentId = input.getDocumentId();
- The key KEY_CLIENT_INSTANCE_ID stores a custom configurable value used to pass data from the client to the content handler. If it is not set, the session ID is used instead. This can be retrieved with the code String clientInstanceId = input.getClientInstanceId();
- The key KEY_HTTP_SERVLET_REQUEST stores the request that called this method. This can be retrieved with the code HttpServletRequest request = input.getHttpServletRequest();

The returned ContentHandlerResult must contain one value:

- The key KEY_USE_OF_CACHE_ALLOWED must store a boolean value. True allows the operation to continue, and false prevents it. This response will be remembered for a few minutes.
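
As a rough sketch of the kind of decision a validateCache implementation might make, the following plain-Java class computes the boolean that would be stored under KEY_USE_OF_CACHE_ALLOWED. It deliberately does not use the PrizmDoc® for Java classes, so everything in it is an assumption made for illustration: the class name, the "GET" and "PUT" strings standing in for VALUE_CACHE_GET and VALUE_CACHE_PUT, and the "restricted/" document ID prefix.

```java
/**
 * Illustrative only: the allow/deny decision behind a validateCache
 * implementation. The action strings and the "restricted/" prefix are
 * hypothetical stand-ins, not PrizmDoc constants.
 */
class CacheDecisionSketch {
    // Stand-ins for ContentHandlerInput.VALUE_CACHE_GET / VALUE_CACHE_PUT.
    static final String ACTION_GET = "GET";
    static final String ACTION_PUT = "PUT";

    /**
     * Returns the boolean a real validateCache would place in its
     * ContentHandlerResult under KEY_USE_OF_CACHE_ALLOWED.
     */
    static boolean isUseOfCacheAllowed(String documentId, String cacheAction) {
        if (documentId == null || documentId.isEmpty()) {
            return false; // nothing sensible to cache
        }
        boolean knownAction =
                ACTION_GET.equals(cacheAction) || ACTION_PUT.equals(cacheAction);
        if (!knownAction) {
            return false; // refuse anything that is not a GET or a PUT
        }
        // Assumed policy: restricted documents are never stored in the
        // cache and never served from it.
        return !documentId.startsWith("restricted/");
    }
}
```

In a real content handler, the document ID would come from input.getDocumentId() and the action from the KEY_CACHE_ACTION value, and the returned boolean would be stored in the ContentHandlerResult. Because the response is remembered for a few minutes, a policy like this is best based on facts about the document that do not change from one request to the next.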