
Steps in forms processing

Just as paper documents must be prepared before they are fed into a scanner (staples removed, wrinkles smoothed, pages positioned for optimal registration), so the image of a form document must be prepared by following these steps before it can be intelligently recognized:

Document scanning - Pages of forms are scanned and converted into bit-mapped (usually TIFF) images of forms which are either compressed and stored for later batch processing, or are passed immediately in an uncompressed format to an ICR engine for recognition.

Image analysis - The document image is cleaned up. Character image quality is improved, using image enhancement techniques. Background "noise" is removed from the form.
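One simple kind of noise removal can be sketched as follows. This is only an illustration, assuming a binary image stored as a list of rows (1 = ink, 0 = background); the function name `despeckle` and the "no inked neighbours" rule are assumptions made for the example, not the method of any particular ICR product.

```python
def despeckle(image):
    """Remove isolated ink pixels (those with no inked 8-neighbours).

    Assumption for illustration: `image` is a list of equal-length rows
    holding 1 for ink and 0 for background.
    """
    height, width = len(image), len(image[0])
    cleaned = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if not image[y][x]:
                continue
            # Count ink in the 3x3 window, then subtract the pixel itself.
            neighbours = sum(
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ) - 1
            if neighbours == 0:
                cleaned[y][x] = 0
    return cleaned


# A vertical stroke plus one stray speck in the top-right corner.
noisy = [
    [0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
]
clean = despeckle(noisy)
```

The stroke survives because each of its pixels touches another inked pixel, while the lone speck is dropped. Real image enhancement typically uses more robust filters than this.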

Form alignment - The image is registered and deskewed by the ICR software, which automatically aligns the form by locating special symbols on the document called registration marks as guides.
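The geometry behind deskewing from registration marks can be sketched as follows, assuming the software has already located two marks that should lie on the same horizontal line. The function names and the pixel coordinates are hypothetical and chosen only for the example.

```python
import math


def skew_angle_degrees(left_mark, right_mark):
    """Angle of the line through two registration marks, relative to horizontal."""
    (x1, y1), (x2, y2) = left_mark, right_mark
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def deskew_point(point, pivot, angle_degrees):
    """Rotate `point` about `pivot` by -angle_degrees to undo the measured skew."""
    theta = math.radians(-angle_degrees)
    dx, dy = point[0] - pivot[0], point[1] - pivot[1]
    x = pivot[0] + dx * math.cos(theta) - dy * math.sin(theta)
    y = pivot[1] + dx * math.sin(theta) + dy * math.cos(theta)
    return (x, y)


# Hypothetical mark positions: the right mark sits 35 pixels lower than the left.
left, right = (100.0, 100.0), (1100.0, 135.0)
angle = skew_angle_degrees(left, right)
corrected_right = deskew_point(right, left, angle)
```

After correction the right-hand mark has the same vertical coordinate as the left-hand one. A full implementation would apply the same rotation to every pixel of the image and then shift the result so the marks land where the template expects them.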

Form identification - The document is identified by certain predefined characteristics that the ICR software is trained to look for, so that the zones containing the fields designated for recognition can be located by a customized, predefined ICR template. Form ID attributes can include form numbers, corporate logos, or the name of the form itself imprinted somewhere on the form.

Form background removal - This stage is not necessary if the document is a form that was originally printed in a colored ("drop out") ink that is invisible to the scanner being used. If colored ink is not used, the form image may contain lines, boxes, fine print, and other form attributes (the passive data) that tend to confuse the ICR engine. These form attributes must be extracted from the image of the form, so that only the character images (the active data) are left behind. Broken and fragmented characters are automatically repaired and restored to their original shapes.
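A naive version of background removal can be sketched by subtracting an image of the blank form from the filled-in image. This assumes both images are binary (1 = ink), already aligned, and the same size; `remove_background` is a hypothetical name, and production software uses more tolerant matching than an exact pixel comparison.

```python
def remove_background(filled, blank):
    """Keep only ink that is present on the filled form but not on the blank form.

    Assumption for illustration: both images are aligned lists of rows of 0/1.
    """
    return [
        [1 if f and not b else 0 for f, b in zip(filled_row, blank_row)]
        for filled_row, blank_row in zip(filled, blank)
    ]


# A box printed on the form (passive data) ...
blank = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
# ... and the same box with a vertical stroke written inside it (active data).
filled = [
    [1, 1, 1, 1, 1],
    [1, 0, 1, 0, 1],
    [1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1],
]
active = remove_background(filled, blank)
```

Only the handwritten stroke remains in `active`. Note that wherever handwriting overlaps a printed line, this subtraction also erases part of the character, which is one reason a repair pass for broken characters is needed afterwards.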

Character field location - The predefined ICR template automatically locates the fields that contain character data. The template identifies which individual fields on the form image require character recognition, and what the nature of each field is: hand print, machine print, numeric, alphabetic, alphanumeric, etc. The template also identifies which areas are barcode or check box recognition zones.
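One way to picture such a template is as a list of zone records. The sketch below is purely illustrative: the `FieldZone` class, its attribute names, and the coordinates are all assumptions, not the template format of any real ICR package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldZone:
    """One recognition zone in a hypothetical form template."""
    name: str
    box: tuple                      # (left, top, right, bottom) in pixels
    kind: str                       # "hand_print", "machine_print", "barcode", "checkbox"
    charset: str = "alphanumeric"   # "numeric", "alphabetic" or "alphanumeric"


# Hypothetical template for a small form.
TEMPLATE = [
    FieldZone("last_name", (50, 120, 450, 160), "hand_print", "alphabetic"),
    FieldZone("zip_code", (50, 200, 200, 240), "hand_print", "numeric"),
    FieldZone("form_barcode", (500, 20, 700, 60), "barcode"),
    FieldZone("consent", (50, 300, 70, 320), "checkbox"),
]


def zones_needing_icr(template):
    """Return the zones that must be passed to character recognition."""
    return [z for z in template if z.kind in ("hand_print", "machine_print")]


def crop(image, box):
    """Cut a zone out of an image stored as a list of pixel rows."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


icr_names = [z.name for z in zones_needing_icr(TEMPLATE)]
```

Here the barcode and check box zones are kept in the same template but routed away from character recognition, which mirrors the distinction drawn above.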

Character segmentation - Sophisticated software routines analyze, separate, and break down the character fields into isolated characters. If the form is "ICR-friendly," characters are segmented with the aid of graphic devices such as boxes, tick-marks, and connected boxes called "combs" that serve to force the form user to legibly separate the characters from one another.
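For well-separated characters, a very simple segmentation technique is a vertical projection profile: count the ink in each pixel column and cut wherever a column is empty. The sketch below assumes a binary field image (1 = ink); `segment_columns` is a hypothetical name, and this approach fails on touching or overlapping characters, which is exactly the case that boxes and combs are meant to prevent.

```python
def segment_columns(field):
    """Return (start, end) column ranges, end exclusive, that contain ink.

    Assumption for illustration: `field` is a list of equal-length rows of 0/1.
    """
    width = len(field[0])
    # Vertical projection profile: amount of ink in each column.
    profile = [sum(row[x] for row in field) for x in range(width)]
    segments, start = [], None
    for x, ink in enumerate(profile):
        if ink and start is None:
            start = x                      # a character begins
        elif not ink and start is not None:
            segments.append((start, x))    # a blank column ends it
            start = None
    if start is not None:
        segments.append((start, width))    # character runs to the field edge
    return segments


# Three tiny "characters" separated by blank columns.
field = [
    [0, 1, 1, 0, 0, 1, 0, 0, 1, 1],
    [0, 1, 0, 0, 0, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1, 0, 0, 1, 1],
]
segments = segment_columns(field)
```

Each returned range can then be cropped out and handed to the classifier as an isolated character image.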

Character classification - Individual characters are classified by ICR algorithms according to their ASCII category and assigned a confidence value, which is an index of how "certain" the ICR engine "feels" about the selection it has made. Alternate character choices are ranked according to those values, so that they can be incorporated into editing procedures that improve ICR accuracy. For example, the alternate choice "1" might be used instead of the first-ranked choice "I" when contextual analysis reports that the field is all-numeric.
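The use of ranked alternates can be sketched as follows, using the "1" versus "I" example above. The function name `pick_character`, the candidate format, and the confidence numbers are assumptions made for illustration only.

```python
def pick_character(candidates, field_type):
    """Pick the highest-confidence candidate that is legal for the field type.

    `candidates` is a list of (character, confidence) pairs, a hypothetical
    representation of a classifier's ranked output. Returns None if no
    candidate fits, so the field can be rejected.
    """
    allowed = {
        "numeric": str.isdigit,
        "alphabetic": str.isalpha,
        "alphanumeric": str.isalnum,
    }[field_type]
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    for char, confidence in ranked:
        if allowed(char):
            return char, confidence
    return None


# Made-up classifier output for one ambiguous vertical stroke.
candidates = [("I", 0.62), ("1", 0.58), ("l", 0.40)]
numeric_choice = pick_character(candidates, "numeric")
alphabetic_choice = pick_character(candidates, "alphabetic")
```

In an all-numeric field the second-ranked "1" is chosen over the first-ranked "I", while in an alphabetic field the first-ranked choice stands.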

Post-processing - The initial or "raw" recognition results are validated using edit procedures such as grammatical rules, spell-checkers, dictionaries, check-sum routines, and look-up tables. Ambiguous and erroneous data fields (the "rejects") are identified and sent to data entry operators at workstations for manual correction.
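A minimal sketch of this validation and reject routing is shown below. It uses the Luhn check-digit algorithm as one example of a check-sum routine; the choice of Luhn, the `route_field` function, and the 0.80 confidence threshold are all assumptions for illustration and are not taken from any specific ICR system.

```python
def luhn_valid(digits):
    """Return True if a string of ASCII digits passes the Luhn check-sum."""
    if not (digits.isascii() and digits.isdigit()):
        return False
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def route_field(value, confidences, min_confidence=0.80, validator=None):
    """Decide whether a recognized field is accepted or sent for manual correction.

    `confidences` holds one value per recognized character; the threshold is
    a hypothetical setting that would be tuned per application.
    """
    if not confidences or min(confidences) < min_confidence:
        return "reject"     # at least one character is ambiguous
    if validator is not None and not validator(value):
        return "reject"     # confident, but fails the edit check
    return "accept"


good = route_field("79927398713", [0.95] * 11, validator=luhn_valid)
bad_checksum = route_field("79927398710", [0.95] * 11, validator=luhn_valid)
low_confidence = route_field("79927398713", [0.95] * 10 + [0.40], validator=luhn_valid)
```

The second field is rejected even though every character was recognized with high confidence, which shows why edit checks are applied on top of the classifier's own confidence values.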

Manual correction of rejected character fields - The manner in which the data entry operator is presented the rejected data for correction can dramatically impact both the speed and the accuracy of the reject repair process. In particular, the data entry GUI is important because the ergonomics of data entry are what enable a given data entry operator to reach his or her maximum correction speed.

What is interesting in forms processing is that only one of the steps, character classification, is specifically concerned with identifying character data. The rest of the steps have to do with either preparing the imaged characters for classification or interpreting the results of character classification. With the opportunity for error accumulating at each successive step, it is remarkable that ICR accuracy rates can attain (and sometimes exceed) human performance levels.
 
