Form Extractors
The form extractors API allows you to detect form field elements in PDF and raster documents.
A form extractor resource represents an asynchronous form extraction process. Each form extractor that is created is assigned a unique processId.
Available URLs
URL | Description |
---|---|
GET /PCCIS/V1/ViewingSession/u{viewingSessionId}/FormInfo | Returns what kind of form field data, if any, is available in a viewing session's source document. |
POST /v2/formExtractors | Creates a new form extractor for a work file, starting the process of extracting form field data. |
GET /v2/formExtractors/{processId} | Gets the status and final output of a form extractor. |
Output Schemas
GET /PCCIS/V1/ViewingSession/u{viewingSessionId}/FormInfo
Returns what kind of form field data, if any, is available in a viewing session's source document.
Request
URL Parameters
Parameter | Description |
---|---|
{viewingSessionId} | The viewingSessionId which identifies the viewing session. |
Successful Response
Response Body
JSON with information about what kind of form data, if any, is available in the source document of the viewing session.
- formType[] (Array of strings) Array of values indicating what types of form data, if any, are available for extraction from this viewing session's source document. Values will be one of the following:
  - "acroform" - The source document is a PDF which contains AcroForm data. The data can be extracted by using an input.formType of "acroform" in a subsequent POST to create a form extractor process.
  - "xfa" - The source document is a PDF which contains XFA form data. We do not yet support extraction of XFA data.
  - "rasterForm" - The source document is a raster file which may or may not contain detectable form fields. You can attempt to extract form data by using an input.formType of "rasterForm" in a subsequent POST to create a form extractor process.
Error Responses
Status Code | JSON errorCode | Description |
---|---|---|
404 | | No viewing session with the provided {viewingSessionId} could be found. |
480 | "DocumentNotProvidedYet" | A source document has not been provided to the viewing session. |
580 | "InternalError" | The server encountered an internal error when handling the request. |
Example
Request
GET /PCCIS/V1/ViewingSession/uDLbVh9sTmXJAmd1GeXbS9Gn3WHxs8oib2xPsW2xEFjnIDdoJcudPtxciodSYFQq6zYGabQ_rJIecdbkImTTkSA/FormInfo
Response
HTTP/1.1 200 OK
Content-Type: application/json
{
"formType": ["acroform"]
}
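As a minimal sketch of how a client might act on this response, the helper below picks a formType that can be passed to a later POST /v2/formExtractors. The function name pick_form_type and the preference order (AcroForm before raster detection) are assumptions made for illustration, not part of the API.

```python
from typing import Optional

# Form types that POST /v2/formExtractors accepts as input.formType.
# "xfa" can be reported by FormInfo but cannot be extracted.
EXTRACTABLE_FORM_TYPES = ("acroform", "rasterForm")


def pick_form_type(form_info: dict) -> Optional[str]:
    """Return the first extractable form type in a FormInfo response, or None."""
    available = form_info.get("formType", [])
    for candidate in EXTRACTABLE_FORM_TYPES:
        if candidate in available:
            return candidate
    return None  # empty array, or only non-extractable types such as "xfa"


# The example response above yields "acroform".
chosen = pick_form_type({"formType": ["acroform"]})
```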
POST /v2/formExtractors
Creates a new form extractor for a work file, starting the process of extracting form field data.
Request
Request Headers
Name | Description |
---|---|
Content-Type | Must be application/json |
Accusoft-Affinity-Token | The affinityToken of the work file specified by input.fileId. Required when server clustering is enabled. |
Request Body
- input
  - fileId (String) Required. The id of the work file to extract form field data from.
  - password (String) Password to open the source document, if required.
  - formType (String) Required. Type of form field data to extract. Must be one of the following:
    - "acroform" - Extract AcroForm field data from a PDF and return results in our "acroform" JSON format.
    - "rasterForm" - Detect visible form fields in a raster document and return results in our "rasterForm" JSON format.
- minSecondsAvailable (Integer) The minimum number of seconds this process will remain available to GET its status. The actual lifetime may be longer.
Successful Response
Response Body
JSON with metadata about the created form extractor process. You can check on the status of the form extraction process with additional GET requests.
- input (Object) Input we accepted to create the form extractor process.
- processId (String) Unique id for the newly-created form extractor process.
- affinityToken (String) Affinity token for this form extractor. Present when clustering is enabled.
- state (String) State of extracting form field data:
  - "processing" - The server is extracting form field data.
  - "complete" - All form field data has been extracted.
  - "error" - There was a problem extracting form field data.
- percentComplete (Integer) Percentage of form extraction which has completed (from 0 to 100).
- expirationDateTime (String) Currently planned date and time when the form extractor resource will expire and no longer be available. This time may be extended if we have need to keep using the data. Format is the RFC 3339 Internet Date/Time profile of ISO 8601, e.g. "2016-11-05T08:15:30.494Z".
- errorCode (String) Descriptive error code. Present when state is "error".
- errorDetails (Object) Additional error details, if any. May be present when errorCode is present.
Error Responses
Status Code | JSON errorCode | Description |
---|---|---|
400 | "MissingInput" | Can occur when clustering is enabled and an Accusoft-Affinity-Token request header was not provided. |
480 | "MissingInput" | A required input value was not provided. See errorDetails in the response body. |
480 | "InvalidInput" | An invalid input value was used. See errorDetails in the response body. |
480 | "FeatureNotLicensed" | You are not licensed to use the form extraction feature. |
580 | "InternalError" | The server encountered an internal error when handling the request. |
Example
Request
POST /v2/formExtractors
Content-Type: application/json
Accusoft-Affinity-Token: ejN9/kXEYOuken4Pb9ic9hqJK45XIad9LQNgCgQ+BkM=
{
"input": {
"fileId": "ek5Zb123oYHSUEVx1bUrVQ",
"formType": "acroform"
}
}
Response
HTTP/1.1 200 OK
Content-Type: application/json
{
"input": {
"fileId": "ek5Zb123oYHSUEVx1bUrVQ",
"formType": "acroform"
},
"processId": "ElkNzWtrUJp4rXI5YnLUgw",
"state": "processing",
"percentComplete": 0,
"expirationDateTime": "2016-12-17T20:38:39.796Z",
"affinityToken": "ejN9/kXEYOuken4Pb9ic9hqJK45XIad9LQNgCgQ+BkM="
}
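The request in this example can be assembled programmatically. The sketch below only builds the headers and JSON body described above; build_create_request is a hypothetical helper name, and sending the request is left to whatever HTTP client you use.

```python
import json
from typing import Dict, Optional, Tuple


def build_create_request(file_id: str,
                         form_type: str,
                         affinity_token: Optional[str] = None,
                         password: Optional[str] = None) -> Tuple[Dict[str, str], str]:
    """Return (headers, body) for POST /v2/formExtractors."""
    if form_type not in ("acroform", "rasterForm"):
        raise ValueError(f"unsupported formType: {form_type!r}")

    headers = {"Content-Type": "application/json"}
    if affinity_token is not None:
        # Required when server clustering is enabled.
        headers["Accusoft-Affinity-Token"] = affinity_token

    input_obj = {"fileId": file_id, "formType": form_type}
    if password is not None:
        # Only needed when the source document is password protected.
        input_obj["password"] = password

    return headers, json.dumps({"input": input_obj})


headers, body = build_create_request(
    "ek5Zb123oYHSUEVx1bUrVQ",
    "acroform",
    affinity_token="ejN9/kXEYOuken4Pb9ic9hqJK45XIad9LQNgCgQ+BkM=",
)
```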
GET /v2/formExtractors/{processId}
Gets the status and final output of a form extractor.
Request
URL Parameters
Parameter | Description |
---|---|
{processId} | The processId which identifies the form extractor process. |
Request Headers
Name | Description |
---|---|
Accusoft-Affinity-Token | The affinityToken of the form extraction process. Required when server clustering is enabled. |
Successful Response
Response Body
JSON with metadata about the form extractor process and the final output, if available. You can check on the status of the form extraction process with additional GET requests.
- input (Object) Input we accepted to create the form extraction process.
- processId (String) Unique id for this form extractor process.
- affinityToken (String) Affinity token for this form extractor. Present when clustering is enabled.
- state (String) State of extracting form field data:
  - "processing" - The server is extracting form field data.
  - "complete" - All form field data has been extracted.
  - "error" - There was a problem extracting form field data.
- percentComplete (Integer) Percentage of form extraction which has completed (from 0 to 100).
- expirationDateTime (String) Currently planned date and time when the form extractor resource will expire and no longer be available. This time may be extended if we have need to keep using the data. Format is the RFC 3339 Internet Date/Time profile of ISO 8601, e.g. "2016-11-05T08:15:30.494Z".
- errorCode (String) Descriptive error code. Present when state is "error".
- errorDetails (Object) Additional error details, if any. May be present when errorCode is present.
- output (Object) Present when state is "complete":
  - acroform (Object) Present when input.formType is "acroform". See "acroform" Output below for details.
  - rasterForm (Object) Present when input.formType is "rasterForm". See "rasterForm" Output below for details.
Error Responses
Status Code | JSON errorCode | Description |
---|---|---|
400 | "MissingInput" | Can occur when clustering is enabled and an Accusoft-Affinity-Token request header was not provided. |
404 | | No form extractor could be found for the given {processId}. |
580 | "InternalError" | The server encountered an internal error when handling the request. |
Example
Request
GET /v2/formExtractors/gLoltqCVnRKzXz2QFNptqw
Accusoft-Affinity-Token: D+Rmn9kB4FrLfrHoNL2bag6WpuNn2ox2qhT2GbLdf9A=
Response
HTTP/1.1 200 OK
Content-Type: application/json
{
"input": {
"fileId": "-eo_zmq3qmPS0WKZlP_Lug",
"formType": "acroform"
},
"output": {
"acroform": {
"pages": [
{
"page": 1,
"height": 792,
"width": 612,
"fields": [
{
"fieldType": "Text",
"name": "email",
"required": true,
"readOnly": "true",
"tabOrder": 0,
"appearance": {
"textColor": "0 g",
"font": "Helvetica"
},
"boundingBox": {
"lowerLeftX": 89,
"lowerLeftY": 646,
"upperRightX": 239,
"upperRightY": 668
},
"options": {
"multiline": false,
"maxLen": -1
},
"format": {
"formatCategory": "None"
}
},
{
"fieldType": "Text",
"name": "fullName",
"required": false,
"readOnly": "false",
"tabOrder": 1,
"appearance": {
"textColor": "0 g",
"font": "Helvetica"
},
"boundingBox": {
"lowerLeftX": 89,
"lowerLeftY": 676,
"upperRightX": 239,
"upperRightY": 698
},
"options": {
"multiline": false,
"maxLen": -1
},
"format": {
"formatCategory": "None"
}
}
]
}
]
}
},
"expirationDateTime": "2016-10-11T03:30:33.166Z",
"percentComplete": 100,
"processId": "gLoltqCVnRKzXz2QFNptqw",
"state": "complete",
"affinityToken": "D+Rmn9kB4FrLfrHoNL2bag6WpuNn2ox2qhT2GbLdf9A="
}
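Because extraction is asynchronous, a client typically repeats this GET until state leaves "processing". The following is a sketch of such a polling loop under stated assumptions: fetch_status is a hypothetical callable that performs GET /v2/formExtractors/{processId} (with the Accusoft-Affinity-Token header when clustering is enabled) and returns the parsed JSON body, and the polling interval and retry budget are arbitrary illustrative values.

```python
import time
from typing import Callable


def wait_for_form_extractor(fetch_status: Callable[[], dict],
                            poll_seconds: float = 1.0,
                            max_polls: int = 60,
                            sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until state is "complete" and return the output object."""
    for _ in range(max_polls):
        status = fetch_status()
        state = status.get("state")
        if state == "complete":
            return status["output"]  # output is present when state is "complete"
        if state == "error":
            # errorCode is present when state is "error".
            raise RuntimeError(f"form extraction failed: {status.get('errorCode')}")
        sleep(poll_seconds)
    raise TimeoutError("form extractor did not finish within the polling budget")


# Demonstration with canned responses instead of real HTTP calls.
canned = iter([
    {"state": "processing", "percentComplete": 40},
    {"state": "complete", "percentComplete": 100,
     "output": {"acroform": {"pages": []}}},
])
output = wait_for_form_extractor(lambda: next(canned), sleep=lambda seconds: None)
```

Injecting fetch_status and sleep keeps the loop independent of any particular HTTP library and easy to exercise without a server.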
"acroform" Output
The output.acroform object will conform to the following. All properties are always present unless otherwise noted:
- pages[] (Array of Objects) Pages in the document which contain AcroForm fields. Array will be empty if the document does not contain any AcroForm fields. Each item will contain:
  - page (Integer) One-indexed page number.
  - height (Number) Page height in points.
  - width (Number) Page width in points.
  - fields[] (Array of Objects) AcroForm fields in the current page. Items may contain:
    - fieldType (String) Field type. Will be one of the following:
      - "Text" - Text field
      - "Button" - Push button, check box, or radio button:
        - push button when options.pushButton is true
        - check box when options.pushButton and options.radio are both false
        - radio button when options.radio is true
      - "Signature" - Signature field
    - name (String) Unique field or radio button group name.
    - required (Boolean) Indicates whether or not this field is required for the form to be considered complete.
    - readOnly (Boolean) Indicates whether or not this field is read only inside the form.
    - tabOrder (Integer) Tab order of the field within the document.
    - boundingBox (Object) Position and size of this field. Object will contain:
      - lowerLeftX (Number) Distance in points from the left edge of the page to the left edge of this field.
      - lowerLeftY (Number) Distance in points from the bottom edge of the page to the bottom edge of this field.
      - upperRightX (Number) Distance in points from the left edge of the page to the right edge of this field.
      - upperRightY (Number) Distance in points from the bottom edge of the page to the top edge of this field.
    - appearance (Object) Field appearance details:
      - textColor (String) Text fill color. Not always present.
      - font (String) Font name to use for this field. Not always present.
    - format (Object) Field formatting details:
      - formatCategory (String) Will be one of the following:
        - "None" - Indicates there are no additional formatOptions for this field.
        - "Date" - For text fields, requires the field value to be a date.
      - formatOptions Additional options for the given formatCategory, if any:
        - When formatCategory is "Date": (String) Date format string to use when formatting the date value for display.
    - options (Object) Additional field options, present for some field types:
      - When fieldType is "Text":
        - multiline (Boolean) Indicates whether or not this is a multi-line text field.
        - maxLen (Integer) Indicates the maximum number of characters this form field accepts, or -1 if there is no limit.
      - When fieldType is "Button":
        - pushButton (Boolean) true if this field is a push button, false otherwise.
        - radio (Boolean) true if this field is a radio button, false otherwise.
        - When both pushButton and radio are false, this field is a check box.
      - When fieldType is "Button" and pushButton is false:
        - buttonOnValue (String) Indicates the form value to use when this radio button or check box is selected/checked.
        - buttonOffValue (String) Indicates the form value to use when this radio button or check box is not selected/checked. Value will always be "Off".
        - buttonValue (String) Indicates whether or not this radio button or check box should be initially selected/checked. When the value matches buttonOnValue, this radio button or check box should be initially selected/checked. Otherwise (when the value is "Off"), it should not be initially selected/checked.
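The rules above for telling button kinds apart, and for reading the bottom-left-origin bounding box, can be sketched as follows. The helper names (field_kind, initially_checked, to_top_left_box) and the returned labels are hypothetical; the conversion to a top-left origin is shown only because many UI toolkits expect it.

```python
def field_kind(field: dict) -> str:
    """Map an acroform field to a concrete kind using fieldType and options."""
    field_type = field["fieldType"]
    if field_type != "Button":
        return {"Text": "text", "Signature": "signature"}.get(field_type, "unknown")
    options = field.get("options", {})
    if options.get("pushButton"):
        return "pushButton"
    if options.get("radio"):
        return "radioButton"
    return "checkBox"  # pushButton and radio are both false


def initially_checked(field: dict) -> bool:
    """True when buttonValue matches buttonOnValue (radio buttons and check boxes)."""
    options = field.get("options", {})
    on_value = options.get("buttonOnValue")
    return on_value is not None and options.get("buttonValue") == on_value


def to_top_left_box(bounding_box: dict, page_height: float) -> dict:
    """Convert a bottom-left-origin box in points to x/y/width/height from the top-left."""
    return {
        "x": bounding_box["lowerLeftX"],
        "y": page_height - bounding_box["upperRightY"],
        "width": bounding_box["upperRightX"] - bounding_box["lowerLeftX"],
        "height": bounding_box["upperRightY"] - bounding_box["lowerLeftY"],
    }


# The "email" field from the example response, on a page 792 points high.
email_box = to_top_left_box(
    {"lowerLeftX": 89, "lowerLeftY": 646, "upperRightX": 239, "upperRightY": 668},
    page_height=792,
)
# email_box == {"x": 89, "y": 124, "width": 150, "height": 22}
```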
Fill Color Strings
A string of one or more numbers followed by an operator indicating what the numbers represent:
- Grayscale value (when string ends in "g"): A single number between 0 and 1 followed by "g" represents the amount of white which forms a grayscale color value. For example:
  - "0 g" - black
  - "0.5 g" - 50% gray
  - "1 g" - white
- RGB value (when string ends in "rg"): Three numbers between 0 and 1 followed by "rg" represent the amount of red, green, and blue light which are additively mixed to form the final color. For example:
  - "1 0 0 rg" - red
  - "1 1 0 rg" - yellow
  - "0.5 0.25 0.75 rg" - 50% red, 25% green, 75% blue
- CMYK (when string ends in "k"): Four numbers between 0 and 1 followed by "k" represent the amount of cyan, magenta, yellow, and black which should be subtractively mixed to form the final color. For example:
  - "0 0 0 1 k" - black
  - "1 1 1 0 k" - black
  - "1 1 1 1 k" - black
  - "1 0 0 0 k" - cyan
  - "0.25 0.88 0.2 0.16 k" - 25% cyan, 88% magenta, 20% yellow, 16% black
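A fill color string can be parsed by splitting on whitespace and dispatching on the trailing operator. The sketch below converts each of the three forms to an RGB triple in the 0 to 1 range. The function name fill_color_to_rgb is hypothetical, and the CMYK branch uses the common naive formula (1 - ink) * (1 - black), which ignores color profiles and is only an approximation.

```python
from typing import Tuple


def fill_color_to_rgb(fill_color: str) -> Tuple[float, float, float]:
    """Convert "0 g", "1 0 0 rg", or "0 0 0 1 k" into (r, g, b) floats in 0..1."""
    *numbers, operator = fill_color.split()
    values = [float(n) for n in numbers]

    if operator == "g" and len(values) == 1:
        gray = values[0]  # amount of white
        return (gray, gray, gray)
    if operator == "rg" and len(values) == 3:
        return (values[0], values[1], values[2])
    if operator == "k" and len(values) == 4:
        cyan, magenta, yellow, black = values
        # Naive CMYK to RGB approximation (assumption, not profile-accurate).
        return ((1 - cyan) * (1 - black),
                (1 - magenta) * (1 - black),
                (1 - yellow) * (1 - black))
    raise ValueError(f"unrecognized fill color string: {fill_color!r}")


yellow = fill_color_to_rgb("1 1 0 rg")   # (1.0, 1.0, 0.0)
cyan = fill_color_to_rgb("1 0 0 0 k")    # (0.0, 1.0, 1.0)
```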
Date Format Strings
Date format strings use the following special substitution patterns:
- yy - 2-digit year (e.g. 16 for the year 2016)
- yyyy - 4-digit year (e.g. 2016)
- m - Month number with no zero padding (e.g. 7 for July)
- mm - Month number zero-padded to always be two characters long (e.g. 07 for July)
- mmm - Abbreviated month name (e.g. Jan)
- mmmm - Full month name (e.g. January)
- d - Day of the month with no zero padding (e.g. 4 for the fourth day of the month)
- dd - Day of the month zero-padded to always be two characters (e.g. 04 for the fourth day of the month)
- ddd - Abbreviated day of the week (e.g. Sun)
- dddd - Full name for the day of the week (e.g. Sunday)
- h - Hour number in 12-hour time with no zero padding (e.g. 2 for 2 o'clock)
- hh - Hour number in 12-hour time zero-padded to always be two characters (e.g. 02 for 2 o'clock)
- H - Hour number in 24-hour time with no zero padding (e.g. 13 for the 1:00 pm hour)
- HH - Hour number in 24-hour time zero-padded to always be two characters (e.g. 02 for the 2:00 am hour)
- M - Minute without zero padding
- MM - Minute, zero-padded to always be two digits
- s - Second without zero padding
- ss - Second, zero-padded to always be two digits
- z - Offset from UTC (e.g. -0400)
- j - Abbreviated Japanese era and year (e.g. H28 for the year 2016)
- jj - Full Japanese era and year (e.g. 平成28 for the year 2016)
- jjj - Japanese era year without specifying the era (e.g. 28 for the year 2016)
All other characters are considered literal punctuation for the format string. The special characters used above may be used literally by escaping them with a backslash.
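To make the substitution rules concrete, here is a sketch of a formatter for a subset of these patterns. It is an illustration, not the product's implementation: the name format_date is hypothetical, month and day names are assumed to be English, midnight and noon are assumed to render as 12 in 12-hour time, and the z and j/jj/jjj patterns are recognized but deliberately left unimplemented.

```python
import re
from datetime import datetime

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
        "Saturday", "Sunday"]  # datetime.weekday() returns 0 for Monday

# A backslash escape, or the longest matching pattern at the current position.
_TOKEN = re.compile(
    r"\\(.)|(yyyy|yy|mmmm|mmm|mm|m|dddd|ddd|dd|d|hh|h|HH|H|MM|M|ss|s|z|jjj|jj|j)",
    re.DOTALL,
)


def format_date(value: datetime, pattern: str) -> str:
    """Format value using a subset of the date format string patterns."""
    hour12 = value.hour % 12 or 12  # assumption: 0 and 12 both display as 12
    replacements = {
        "yy": f"{value.year % 100:02d}", "yyyy": f"{value.year:04d}",
        "m": str(value.month), "mm": f"{value.month:02d}",
        "mmm": MONTHS[value.month - 1][:3], "mmmm": MONTHS[value.month - 1],
        "d": str(value.day), "dd": f"{value.day:02d}",
        "ddd": DAYS[value.weekday()][:3], "dddd": DAYS[value.weekday()],
        "h": str(hour12), "hh": f"{hour12:02d}",
        "H": str(value.hour), "HH": f"{value.hour:02d}",
        "M": str(value.minute), "MM": f"{value.minute:02d}",
        "s": str(value.second), "ss": f"{value.second:02d}",
    }

    def substitute(match: "re.Match") -> str:
        if match.group(1) is not None:
            return match.group(1)  # escaped character, emitted literally
        token = match.group(2)
        if token not in replacements:
            raise NotImplementedError(f"pattern {token!r} is not handled in this sketch")
        return replacements[token]

    # Characters not matched by _TOKEN are left in place as literal punctuation.
    return _TOKEN.sub(substitute, pattern)


moment = datetime(2016, 7, 4, 14, 5, 9)
us_style = format_date(moment, "mm/dd/yyyy")          # "07/04/2016"
long_style = format_date(moment, "dddd, mmmm d, yyyy")  # "Monday, July 4, 2016"
```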
"rasterForm" Output
The output.rasterForm object will conform to the following. All properties are always present unless otherwise noted:
- pages[] (Array of Objects) Information about each page in the raster document. Each item will contain:
  - page (Integer) One-indexed page number.
  - height (Number) Page height in pixels.
  - width (Number) Page width in pixels.
  - fields[] (Array of Objects) Fields detected in the current page. Array will be empty if no fields were detected. Items will contain:
    - name (String) Unique name we have automatically assigned to this field in the document (e.g. "field5").
    - fieldType (String) Field type. Will be one of the following:
      - "Text" - Text field
      - "CheckBox" - Check box
    - confidence (Number) Our confidence in the correct detection of this field using a scale of 0 (no confidence) to 100 (complete confidence).
    - boundingBox (Object) Position and size of this field. Object will contain:
      - x (Number) Distance in pixels from the left edge of the page to the left edge of this field.
      - y (Number) Distance in pixels from the top edge of the page to the top edge of this field.
      - width (Number) Distance in pixels from the left edge of this field (x) to the right edge of this field.
      - height (Number) Distance in pixels from the top edge of this field (y) to the bottom edge of this field.
  - tables[] (Array of Objects) Tables detected in the current page. Array will be empty if no tables were detected. Items will contain:
    - numOfColumns (Integer) Number of columns in the detected table.
    - numOfRows (Integer) Number of rows in the detected table.
    - fields[] (Array of Objects) Fields detected in the current table. Items will contain:
      - name (String) Unique name we have automatically assigned to this field in the document (e.g. "field5").
      - fieldType (String) Field type. Will be one of the following:
        - "Text" - Text field
        - "CheckBox" - Check box
      - confidence (Number) Our confidence in the correct detection of this field using a scale of 0 (no confidence) to 100 (complete confidence).
      - boundingBox (Object) Position and size of this field. Object will contain:
        - x (Number) Distance in pixels from the left edge of the page to the left edge of this field.
        - y (Number) Distance in pixels from the top edge of the page to the top edge of this field.
        - width (Number) Distance in pixels from the left edge of this field (x) to the right edge of this field.
        - height (Number) Distance in pixels from the top edge of this field (y) to the bottom edge of this field.
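Since raster detection reports a confidence per field, a consumer will often filter on it. The sketch below walks both the page-level fields[] and the fields[] inside each table, keeps fields at or above a threshold, and converts each boundingBox to corner coordinates. The function name confident_fields, the default threshold of 80, the sample data, and the de-duplication by name (in case a table field is also listed at page level) are all assumptions for illustration.

```python
from typing import List


def confident_fields(raster_form: dict, min_confidence: float = 80.0) -> List[dict]:
    """Collect page-level and table fields whose confidence meets the threshold."""
    results = []
    seen_names = set()
    for page in raster_form.get("pages", []):
        candidates = list(page.get("fields", []))
        for table in page.get("tables", []):
            candidates.extend(table.get("fields", []))
        for field in candidates:
            if field["name"] in seen_names or field["confidence"] < min_confidence:
                continue
            seen_names.add(field["name"])
            box = field["boundingBox"]
            results.append({
                "page": page["page"],
                "name": field["name"],
                "fieldType": field["fieldType"],
                # (left, top, right, bottom) in pixels from the page's top-left corner
                "corners": (box["x"], box["y"],
                            box["x"] + box["width"], box["y"] + box["height"]),
            })
    return results


# Hypothetical output.rasterForm value used only to exercise the function.
sample = {"pages": [{
    "page": 1, "height": 1100, "width": 850,
    "fields": [
        {"name": "field1", "fieldType": "Text", "confidence": 95,
         "boundingBox": {"x": 100, "y": 200, "width": 300, "height": 40}},
        {"name": "field2", "fieldType": "CheckBox", "confidence": 35,
         "boundingBox": {"x": 50, "y": 400, "width": 20, "height": 20}},
    ],
    "tables": [{
        "numOfColumns": 1, "numOfRows": 1,
        "fields": [
            {"name": "field3", "fieldType": "Text", "confidence": 88,
             "boundingBox": {"x": 10, "y": 600, "width": 200, "height": 30}},
        ],
    }],
}]}
kept = confident_fields(sample)
```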