Search Tasks

The search task API is designed for a viewer to perform server-side searching and text retrieval against the source document of a viewing session.

A search task represents an asynchronous full-text search of a document and yields results as they become available.

Available URLs

URL	Description
`POST /v2/viewingSessions/{viewingSessionId}/searchTasks`	Starts an asynchronous full-text search against a viewing session's source document.
`GET /v2/searchTasks/{processId}/results`	Gets available search results.

POST /v2/viewingSessions/{viewingSessionId}/searchTasks

Starts an asynchronous full-text search against a viewing session's source document.

After a successful POST to create the search task, we immediately begin a background process to start populating search results for you to GET. You do not need to wait for the full set of results to be available; you can start retrieving partial search results as soon as they are available. Once the full text of the document has been searched and no more results will be added, the search task state will change from "processing" to "complete".

Request

Request Headers

Name	Description
`Content-Type`	Must be `application/json`

Request Body

input
- searchTerms[] (Array of Objects) Required and must contain at least one item. Each item must be an object which conforms to one of the following:
  - Simple (finds all occurrences of a single regex pattern):
    - type: "simple" (String) Required. Must be set to "simple" to indicate this is a simple term object.
    - pattern (String) Required. Regular expression to search for, written as a JavaScript-flavored regular expression string.
    - caseSensitive (Boolean) Determines whether we consider case when matching this term. Default is false.
    - contextPadding (Integer) Maximum number of characters to include both before and after the search result in the returned context string. For example, a value of 25 would allow up to 25 preceding and 25 following characters of content. Default is 25.
    - termId (String) Optional id of your choosing which, if provided, will be included as a termId property on each search result produced by this term. When used, we do not enforce uniqueness; it is your responsibility to use a unique termId for each term.
  - Proximity (finds all occurrences of multiple regex patterns which are near each other):
    - type: "proximity" (String) Required. Must be set to "proximity" to indicate this is a proximity term object.
    - subTerms[] (Array of Objects) Required and must contain at least two items. Each item may contain:
      - pattern (String) Required. Regular expression for this particular sub-term, written as a JavaScript-flavored regular expression string.
      - caseSensitive (Boolean) Determines whether we consider case when matching this term. Default is false.
    - distance (Integer) Required. Maximum number of words allowed between any two consecutive sub-terms.
    - contextPadding (Integer) Maximum number of characters to include both before and after the search result in the returned context string. For example, a value of 25 would allow up to 25 preceding and 25 following characters of content. Default is 25.
    - termId (String) Optional id of your choosing which, if provided, will be included as a termId property on each search result produced by this term. When used, we do not enforce uniqueness; it is your responsibility to use a unique termId for each term.
minSecondsAvailable (Integer) The minimum number of seconds this search task will remain available. The actual lifetime may be longer.
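As a rough illustration of the request body described above, the following Python sketch assembles "simple" and "proximity" term objects and wraps them in a request body. The helper names (`simple_term`, `proximity_term`, `search_task_body`) are hypothetical and not part of the API. Placing `minSecondsAvailable` at the top level, alongside `input`, is an assumption based on how the fields are listed above.

```python
from typing import Any, Dict, List, Optional


def simple_term(pattern: str, case_sensitive: bool = False,
                context_padding: Optional[int] = None,
                term_id: Optional[str] = None) -> Dict[str, Any]:
    """Build a "simple" term object; optional fields are omitted when unset."""
    term: Dict[str, Any] = {"type": "simple", "pattern": pattern,
                            "caseSensitive": case_sensitive}
    if context_padding is not None:
        term["contextPadding"] = context_padding
    if term_id is not None:
        term["termId"] = term_id
    return term


def proximity_term(patterns: List[str], distance: int,
                   term_id: Optional[str] = None) -> Dict[str, Any]:
    """Build a "proximity" term object from two or more regex patterns."""
    if len(patterns) < 2:
        raise ValueError("a proximity term needs at least two sub-terms")
    term: Dict[str, Any] = {"type": "proximity",
                            "subTerms": [{"pattern": p} for p in patterns],
                            "distance": distance}
    if term_id is not None:
        term["termId"] = term_id
    return term


def search_task_body(terms: List[Dict[str, Any]],
                     min_seconds_available: Optional[int] = None) -> Dict[str, Any]:
    """Wrap term objects in the JSON structure expected by the POST."""
    if not terms:
        raise ValueError("searchTerms must contain at least one item")
    body: Dict[str, Any] = {"input": {"searchTerms": terms}}
    if min_seconds_available is not None:
        # Assumption: minSecondsAvailable is a sibling of "input".
        body["minSecondsAvailable"] = min_seconds_available
    return body
```

The client-side checks only mirror the "at least one item" and "at least two items" rules stated above; the server remains the authority on what is valid.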

Successful Response

Response Body

JSON with metadata about the created search task.

input (Object) Input we accepted to create the search task.
processId (String) Unique id for this search task.
affinityToken (String) Affinity token for this search task. Present when clustering is enabled.
state (String) Current state of the search task.
- "processing" - The search is still being executed. Additional results may become available.
- "complete" - The search is complete. No additional results will become available.
- "error" - There was a problem performing the search. No additional results will become available.
percentComplete (Integer) Percentage of the document text which has been searched (from 0 to 100).
expirationDateTime (String) Currently planned date and time when the search task resource will expire and no longer be available for use. Format is RFC 3339 Internet Date/Time profile of ISO 8601, e.g. "2016-11-05T08:15:30.494Z".
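A client may want to check `state` and `expirationDateTime` before relying on a stored search task. The sketch below is one possible way to do that in Python; the function names are hypothetical, and the parser only handles timestamps shaped like the example above (a trailing "Z" or a numeric offset), not every RFC 3339 variant.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_expiration(value: str) -> datetime:
    """Parse a timestamp such as "2016-11-05T08:15:30.494Z" into an aware datetime."""
    # Older Python versions of fromisoformat() reject a trailing "Z",
    # so rewrite it as an explicit UTC offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def is_usable(task: dict, now: Optional[datetime] = None) -> bool:
    """Return True when the task is not in the "error" state and has not expired."""
    now = now or datetime.now(timezone.utc)
    return task["state"] != "error" and now < parse_expiration(task["expirationDateTime"])
```

Because `expirationDateTime` is only the currently planned expiration, a check like this is a hint rather than a guarantee.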

Error Responses

Status Code	JSON errorCode	Description
`404`		No viewing session with the provided `{viewingSessionId}` could be found.
`480`	`"DocumentNotProvidedYet"`	The viewing session does not yet have a source document attached.
`480`	`"MissingInput"`	A required input value was not provided. See `errorDetails` in the response body.
`480`	`"InvalidInput"`	An invalid input value was used. See `errorDetails` in the response body.
`480`	`"MissingInputForSimpleTerm"`	A required input value was not provided in a `"simple"` term object. See `errorDetails` in the response body.
`480`	`"InvalidInputForSimpleTerm"`	An invalid input value was used in a `"simple"` term object. See `errorDetails` in the response body.
`480`	`"MissingInputForProximityTerm"`	A required input value was not provided in a `"proximity"` term object. See `errorDetails` in the response body.
`480`	`"InvalidInputForProximityTerm"`	An invalid input value was used in a `"proximity"` term object. See `errorDetails` in the response body.
`480`	`"FeatureDisabled"`	The viewing session was created with `"serverSideSearch"` disabled.
`501`	`"NotImplemented"`	Server-side searching is not yet implemented for a viewing session which uses a cached viewing package.
`580`	`"InternalError"`	The server encountered an internal error when handling the request.
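When logging a failed request, it can help to combine the status code with the `errorCode` and `errorDetails` values from the response body. The following Python sketch shows one way to do so. The function name is hypothetical, and because the structure of `errorDetails` is not described here, the code treats it as opaque JSON.

```python
import json


def summarize_error(status: int, body_text: str) -> str:
    """Combine the HTTP status with errorCode and errorDetails, when present."""
    try:
        body = json.loads(body_text) if body_text else {}
    except ValueError:
        body = {}  # Not JSON (a 404 may carry no JSON body at all).
    if not isinstance(body, dict):
        body = {}
    parts = [f"HTTP {status}"]
    if body.get("errorCode"):
        parts.append(str(body["errorCode"]))
    if body.get("errorDetails") is not None:
        # errorDetails is passed through as-is; its shape is not assumed.
        parts.append(json.dumps(body["errorDetails"]))
    return ": ".join(parts)
```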

Example

Request

This POST begins a search task which finds all instances of the word "quick":

POST /v2/viewingSessions/DLbVh9sTmXJAmd1GeXbS9Gn3WHxs8oib2xPsW2xEFjnIDdoJcudPtxciodSYFQq6zYGabQ_rJIecdbkImTTkSA/searchTasks
Content-Type: application/json

{
  "input": {
    "searchTerms": [{
      "type": "simple",
      "pattern": "quick"
    }]
  }
}

Response

HTTP/1.1 200 OK
Content-Type: application/json

{
  "input": {
    "searchTerms": [{
      "type": "simple",
      "pattern": "quick",
      "caseSensitive": false,
      "contextPadding": 25
    }]
  },
  "processId": "pR5X6nPDgMwat6cxlmn0Q3",
  "state": "processing",
  "percentComplete": 0,
  "expirationDateTime": "2016-12-17T20:38:39.796Z"
}
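For reference, the same request could be prepared from Python using only the standard library, roughly as follows. This is a minimal sketch: `BASE_URL` and `build_create_request` are hypothetical names, and the host and port are placeholders for wherever your server is reachable.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:3000"  # Hypothetical; replace with your server's address.


def build_create_request(viewing_session_id: str, body: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST which creates a search task."""
    session = urllib.parse.quote(viewing_session_id, safe="")
    url = f"{BASE_URL}/v2/viewingSessions/{session}/searchTasks"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_create_request(
    "DLbVh9sTmXJAmd1GeXbS9Gn3WHxs8oib2xPsW2xEFjnIDdoJcudPtxciodSYFQq6zYGabQ_rJIecdbkImTTkSA",
    {"input": {"searchTerms": [{"type": "simple", "pattern": "quick"}]}},
)
```

Passing `request` to `urllib.request.urlopen` would send it; note that `urlopen` raises `urllib.error.HTTPError` for non-2xx responses such as 480 or 580.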

Additional Examples

Start a case-sensitive search for an exact phrase

This POST begins a case-sensitive search for the exact phrase "The quick brown fox jumped over the lazy dog.". Notice that we had to escape the period character because it is a special regex character (\.), and because this is a JSON string value, the backslash itself must also be escaped ("\\."):

POST /v2/viewingSessions/DLbVh9sTmXJAmd1GeXbS9Gn3WHxs8oib2xPsW2xEFjnIDdoJcudPtxciodSYFQq6zYGabQ_rJIecdbkImTTkSA/searchTasks
Content-Type: application/json

{
  "input": {
    "searchTerms": [{
      "type": "simple",
      "pattern": "The quick brown fox jumped over the lazy dog\\.",
      "caseSensitive": true
    }]
  }
}
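If the phrase comes from user input, the two levels of escaping can be handled programmatically. The Python sketch below first escapes the regex metacharacters and then lets the JSON serializer escape the backslashes. `escape_for_pattern` is a hypothetical helper, and the set of characters it escapes is the commonly used list of JavaScript regex metacharacters, not something defined by this API.

```python
import json
import re

# Characters with special meaning in a JavaScript regular expression.
_JS_REGEX_SPECIALS = re.compile(r"[.*+?^${}()|\[\]\\]")


def escape_for_pattern(literal: str) -> str:
    """Prefix each regex metacharacter with a backslash so it matches literally."""
    return _JS_REGEX_SPECIALS.sub(lambda m: "\\" + m.group(0), literal)


phrase = "The quick brown fox jumped over the lazy dog."
body = {
    "input": {
        "searchTerms": [{
            "type": "simple",
            "pattern": escape_for_pattern(phrase),  # regex-level escaping
            "caseSensitive": True,
        }]
    }
}
payload = json.dumps(body)  # JSON-level escaping of the backslash happens here
```

Python's built-in `re.escape` is avoided here because it also escapes characters such as spaces, which is harmless in Python but not guaranteed to be accepted by every JavaScript regex mode.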

Start a search for every instance of the word "quick" or "fox" or "dog"

This POST begins a search for the words "quick", "fox", and "dog", locating all instances of each of these words:

POST /v2/viewingSessions/DLbVh9sTmXJAmd1GeXbS9Gn3WHxs8oib2xPsW2xEFjnIDdoJcudPtxciodSYFQq6zYGabQ_rJIecdbkImTTkSA/searchTasks
Content-Type: application/json

{
  "input": {
    "searchTerms": [{
      "type": "simple",
      "pattern": "quick"
    }, {
      "type": "simple",
      "pattern": "fox"
    }, {
      "type": "simple",
      "pattern": "dog"
    }]
  }
}

Start a search for "quick" and "fox" and "dog" where there are no more than 5 words between any two consecutive occurrences of them

POST /v2/viewingSessions/DLbVh9sTmXJAmd1GeXbS9Gn3WHxs8oib2xPsW2xEFjnIDdoJcudPtxciodSYFQq6zYGabQ_rJIecdbkImTTkSA/searchTasks
Content-Type: application/json

{
  "input": {
    "searchTerms": [{
      "type": "proximity",
      "subTerms": [{
        "pattern": "quick"
      }, {
        "pattern": "fox"
      }, {
        "pattern": "dog"
      }],
      "distance": 5
    }]
  }
}

Start a case-sensitive search for "John Doe" within 30 words of what looks like a social security number

POST /v2/viewingSessions/DLbVh9sTmXJAmd1GeXbS9Gn3WHxs8oib2xPsW2xEFjnIDdoJcudPtxciodSYFQq6zYGabQ_rJIecdbkImTTkSA/searchTasks
Content-Type: application/json

{
  "input": {
    "searchTerms": [{
      "type": "proximity",
      "subTerms": [{
        "pattern": "John Doe",
        "caseSensitive": true
      }, {
        "pattern": "\\d{3}-\\d{2}-\\d{4}"
      }],
      "distance": 30
    }]
  }
}

GET /v2/searchTasks/{processId}/results?limit={limit}&continueToken={continueToken}

Gets a block of newly-available search results up to a limit.

This URL is designed to give you the results in chunks as they become available. Each GET request will return the currently-known results up to a limit (default is 100). If a response contains a continueToken, it indicates that additional results may be available and that you should issue another GET request using that continueToken as a query string parameter to skip the results you have already received. As long as a response contains a continueToken, use it to issue a subsequent GET for more results. When you encounter a response which does not have a continueToken, you have received all of the results and no more GET requests are necessary.

To help you avoid unnecessary network requests, any response which contains a continueToken will also contain a continueAfter value: the recommended number of milliseconds to wait before sending the next GET request.
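The polling procedure described above can be sketched in Python as a generator. To keep the example independent of any HTTP library, the actual GET is delegated to a caller-supplied `fetch` function (a hypothetical name) which takes the current continueToken, or `None` for the first request, and returns the parsed JSON response body.

```python
import time
from typing import Callable, Dict, Iterator, Optional


def iter_results(fetch: Callable[[Optional[str]], Dict],
                 sleep: Callable[[float], None] = time.sleep) -> Iterator[Dict]:
    """Yield search results as they arrive, following continueToken until it is absent."""
    token: Optional[str] = None
    while True:
        page = fetch(token)
        yield from page["results"]
        token = page.get("continueToken")
        if token is None:
            return  # No continueToken: all results have been received.
        # continueAfter is expressed in milliseconds; sleep() expects seconds.
        sleep(page.get("continueAfter", 0) / 1000.0)
```

Accepting `sleep` as a parameter is a design choice for testability; in normal use the default `time.sleep` applies the recommended delay between requests.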

Request

URL Parameters

Parameter	Description
`{processId}`	The `processId` which identifies the search task.
`{limit}`	The maximum number of results to return for this HTTP request. Must be an integer greater than `0`. Default is `100`.
`{continueToken}`	Used to continue getting results from the point where a previous GET request left off.

Request Headers

Name	Description
`Accusoft-Affinity-Token`	The `affinityToken` of the search task. Required when server clustering is enabled.
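The URL parameters and header above might be assembled in Python as shown in this sketch. The function name and the default `base_url` are hypothetical; only the path, the query string parameter names, and the header name come from this reference.

```python
import urllib.parse
import urllib.request
from typing import Optional


def build_results_request(process_id: str,
                          affinity_token: Optional[str] = None,
                          limit: Optional[int] = None,
                          continue_token: Optional[str] = None,
                          base_url: str = "http://localhost:3000") -> urllib.request.Request:
    """Prepare (but do not send) a GET for the next block of search results."""
    query = {}
    if limit is not None:
        if limit <= 0:
            raise ValueError("limit must be an integer greater than 0")
        query["limit"] = limit
    if continue_token is not None:
        query["continueToken"] = continue_token
    url = f"{base_url}/v2/searchTasks/{urllib.parse.quote(process_id, safe='')}/results"
    if query:
        url += "?" + urllib.parse.urlencode(query)
    headers = {}
    if affinity_token:
        # Only needed when server clustering is enabled.
        headers["Accusoft-Affinity-Token"] = affinity_token
    return urllib.request.Request(url, headers=headers, method="GET")
```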

Successful Response

Response Body

JSON with any available search results.

results (Array of Objects) Always present. Array of newly-available search results. If no new results are available, this array will be empty.
- id (Integer) Unique number assigned to this search result.
- pageIndex (Integer) Zero-indexed page number where this search result occurs in the document.
- text (String) Text which was matched.
- context (String) Contextual excerpt, including the matched text itself. The number of leading and trailing characters included in this value is controlled by the contextPadding value of the corresponding term in input.searchTerms in the initial POST to create the search task.
- boundingRectangle (Object) Bounding rectangle dimensions of the matched text on the page where it occurs.
  - x (Number) Distance from the left edge of the page to the left edge of the search result bounding box.
  - y (Number) Distance from the top edge of the page to the top edge of the search result bounding box.
  - width (Number) Width of the search result bounding box.
  - height (Number) Height of the search result bounding box.
- searchTerm (Object) Search term which produced this result. The value will correspond to one of the items passed in to input.searchTerms in the initial POST to create the search task.
  - When type is "simple":
    - type (String) Always present with a value of "simple".
    - pattern (String) Always present. Regular expression which produced the result.
    - caseSensitive (Boolean) Always present. Indicates whether or not case was considered for this result.
    - contextPadding (Integer) Always present. Amount of context padding requested for this term in the initial POST.
    - termId (String) When provided in the initial POST, termId of the term which produced this result.
  - When type is "proximity":
    - type (String) Always present with a value of "proximity".
    - subTerms[] (Array of Objects) Always present. The sub-terms which contributed to this result. Each item will contain:
      - pattern (String) Always present. Regular expression for this particular sub-term.
      - caseSensitive (Boolean) Always present. Indicates whether or not case was considered when matching this particular sub-term in the result.
    - distance (Integer) Always present. Maximum number of words allowed between any two consecutive sub-terms.
    - contextPadding (Integer) Always present. Amount of context padding requested for this term in the initial POST.
    - termId (String) When provided in the initial POST, termId of the term which produced this result.
- startIndex (Integer) JavaScript string index into the full-page text string where the matched text begins.
- startIndexInContext (Integer) JavaScript string index into the returned context string where the matched text begins.
pagesWithoutText (Array of Integers) Always present. Currently known pages in the document which do not contain any text content at all. Values are zero-indexed page numbers. If the search task is still processing (a continueToken is present in the response), the data should be considered partial. Note that, unlike results, this value is cumulative (we always deliver the entire set of pages we know to not contain text data).
continueToken (String) When present, indicates that more search results may be available. An additional GET request should be made for more results using this value as the continueToken query string parameter. When not present, indicates that the search is complete and no further results will be available.
continueAfter (Number) Recommended milliseconds to delay before issuing the next GET request for more results.
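Two details of this response are easy to get wrong on the client: `startIndexInContext` is a JavaScript string index (which counts UTF-16 code units, not Python characters), and `pagesWithoutText` is cumulative while `results` is incremental. The Python sketch below illustrates one way to handle both. All function names are hypothetical, and the UTF-16 handling reflects an assumption about what "JavaScript string index" implies for text outside the Basic Multilingual Plane.

```python
from typing import Optional


def utf16_len(s: str) -> int:
    """Length of s in UTF-16 code units, as JavaScript would report it."""
    return len(s.encode("utf-16-le")) // 2


def utf16_slice(s: str, start: int, end: Optional[int] = None) -> str:
    """Slice s using JavaScript-style (UTF-16 code unit) indices."""
    units = s.encode("utf-16-le")
    stop = None if end is None else 2 * end
    return units[2 * start:stop].decode("utf-16-le")


def highlight(result: dict, open_mark: str = "[", close_mark: str = "]") -> str:
    """Wrap the matched text inside its context string with markers."""
    context = result["context"]
    start = result["startIndexInContext"]
    end = start + utf16_len(result["text"])
    return (utf16_slice(context, 0, start) + open_mark
            + utf16_slice(context, start, end) + close_mark
            + utf16_slice(context, end))


def merge_page(collected: dict, page: dict) -> dict:
    """Fold one GET response into an accumulated view of the search."""
    # results only contains newly-available items, so append them.
    collected.setdefault("results", []).extend(page["results"])
    # pagesWithoutText is cumulative, so replace rather than append.
    collected["pagesWithoutText"] = list(page["pagesWithoutText"])
    return collected
```

For purely ASCII text, the UTF-16 helpers behave exactly like ordinary Python slicing; they only matter when the context contains characters such as emoji.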

Error Responses

Status Code	JSON errorCode	Description
`404`		No search task with the provided `{processId}` could be found.
`480`	`"MissingInput"`	A required input value was not provided. See `errorDetails` in the response body.
`480`	`"InvalidInput"`	An invalid input value was used. See `errorDetails` in the response body.
`480`	`"ResourceNotUsable"`	Can occur when the search task is in a `state` of `"error"`.
`580`	`"InternalError"`	The server encountered an internal error when handling the request.

Example

Say you have a search task which was created to find the regex "manag[a-z]*" in a particular whitepaper. Here is an example sequence of requests and responses illustrating how you would acquire the full set of results for the search task (for brevity, the total number of search results in this example is small).

You would start with an initial GET:

GET /v2/searchTasks/pR5X6nPDgMwat6cxlmn0Q3/results
Accusoft-Affinity-Token: ejN9/kXEYOuken4Pb9ic9hqJK45XIad9LQNgCgQ+BkM=

HTTP/1.1 200 OK
Content-Type: application/json

{
  "results": [
    {
      "id": 0,
      "pageIndex": 0,
      "text": "Management",
      "context": "Enterprise Content Management Best Practices",
      "boundingRectangle": { "x": 24.20, "y": 13.74, "width": 234.20, "height": 26.10 },
      "searchTerm": {
        "type": "simple",
        "pattern": "manag[a-z]*",
        "caseSensitive": false,
        "contextPadding": 25
      },
      "startIndex": 19,
      "startIndexInContext": 19
    },
    {
      "id": 1,
      "pageIndex": 0,
      "text": "management",
      "context": "ue of enterprise content management software should go way b",
      "boundingRectangle": { "x": 156.07, "y": 352.19, "width": 105.00, "height": 13.41 },
      "searchTerm": {
        "type": "simple",
        "pattern": "manag[a-z]*",
        "caseSensitive": false,
        "contextPadding": 25
      },
      "startIndex": 527,
      "startIndexInContext": 25
    }
  ],
  "pagesWithoutText": [],
  "continueToken": "Cx07GHlkmi32gxAQhv49WZ",
  "continueAfter": 500
}

The initial response has given us two results for the first page of the document (page index 0) and a continueToken which we should use to get more results after waiting 500 milliseconds.

So, half a second later, we issue a follow-up request with the continueToken passed in as a query string parameter (so we skip over the results we already have):

GET /v2/searchTasks/pR5X6nPDgMwat6cxlmn0Q3/results?continueToken=Cx07GHlkmi32gxAQhv49WZ
Accusoft-Affinity-Token: ejN9/kXEYOuken4Pb9ic9hqJK45XIad9LQNgCgQ+BkM=

HTTP/1.1 200 OK
Content-Type: application/json

{
  "results": [
    {
      "id": 2,
      "pageIndex": 1,
      "text": "management",
      "context": "Enterprise content management software helps eliminate",
      "boundingRectangle": { "x": 310.21, "y": 562.14, "width": 254.03, "height": 26.10 },
      "searchTerm": {
        "type": "simple",
        "pattern": "manag[a-z]*",
        "caseSensitive": false,
        "contextPadding": 25
      },
      "startIndex": 652,
      "startIndexInContext": 19
    }
  ],
  "pagesWithoutText": [2,3],
  "continueToken": "B4uGe7m0ZtxR3lkqA07Nmj",
  "continueAfter": 500
}

This time we get back a new result as well as some new information about pagesWithoutText: we now know that at least page indices 2 and 3 (zero-indexed page numbers) have no text at all.

The presence of a new continueToken tells us there may be more results, so we submit another request with the new continueToken:

GET /v2/searchTasks/pR5X6nPDgMwat6cxlmn0Q3/results?continueToken=B4uGe7m0ZtxR3lkqA07Nmj
Accusoft-Affinity-Token: ejN9/kXEYOuken4Pb9ic9hqJK45XIad9LQNgCgQ+BkM=

HTTP/1.1 200 OK
Content-Type: application/json

{
  "results": [
    {
      "id": 3,
      "pageIndex": 5,
      "text": "management",
      "context": "upply chains to contract management, or HR processes to gove",
      "boundingRectangle": { "x": 67.00, "y": 142.53, "width": 254.03, "height": 26.10 },
      "searchTerm": {
        "type": "simple",
        "pattern": "manag[a-z]*",
        "caseSensitive": false,
        "contextPadding": 25
      },
      "startIndex": 113,
      "startIndexInContext": 25
    }
  ],
  "pagesWithoutText": [2,3,4]
}

This time we get a new result for page index 5, and we now know that page indices 2, 3, and 4 all contain no text at all (apparently this was not much of a whitepaper!). The lack of a continueToken tells us we have received all of the results, so there are no more GET requests to make.