Content Scraping

Content scraping is an Optical Character Recognition (OCR) function that captures document content electronically. Three methods are used:

  1. Zones. Data capture can be limited to specific areas within a document, called “zones.” Each zone is scraped to capture machine-generated characters, barcodes, and other printed data. The scraped data is then interpreted and written to a hidden text file attached to the image, where it can be used as a search or index field value. This is a fast and effective method for capturing specific data on a form.

  2. Full OCR. Scraping the entire document is the generally accepted method for enabling a full-text search of the document’s content. It was employed effectively for all machine-generated documents, including typewritten and typeset documents. As in the Zone method, a hidden text file is attached to the document image and can be used for content search. Many documents in the library could be processed in this manner.

  3. Custom Reference Library. A custom reference library is used when a document is handwritten or uses a non-standard character set, so that conventional scraping is ineffective. In this process, sample letters are captured in pixel format and entered into a character table, which maps each handwritten sample to its matching standard character; several language interpretations are allowed. Although the probability of success is lower than with conventional scraping, most Custom Reference Libraries can provide a reasonable rendering of the content in electronic format and, most importantly, allow both electronic viewing and content searching.
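
The zone and custom reference library methods can be illustrated with a minimal sketch in Python. It is not the system described above: the 3x3 bitmaps, the names REFERENCE_TABLE, PAGE, and TITLE_ZONE, and the pixel-difference score are all assumptions chosen for illustration. A zone is treated as a rectangle on a page of pixels, and each glyph inside it is assigned the character whose sample bitmap in the table differs from it by the fewest pixels.

```python
# Hypothetical character table: each character maps to a 3x3 sample bitmap.
# '#' is ink, '.' is background. A real table would hold scanned samples.
REFERENCE_TABLE = {
    "L": ("#..", "#..", "###"),
    "T": ("###", ".#.", ".#."),
    "O": ("###", "#.#", "###"),
}


def crop(page, top, left, height, width):
    """Return the rectangular zone of the page as a tuple of row strings."""
    return tuple(row[left:left + width] for row in page[top:top + height])


def distance(a, b):
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def match_glyph(glyph, table=REFERENCE_TABLE):
    """Return (character, distance) for the closest sample in the table."""
    best = min(table, key=lambda ch: distance(glyph, table[ch]))
    return best, distance(glyph, table[best])


def scrape_zone(page, zone, glyph_width=3):
    """Read a zone left to right, one fixed-width glyph at a time."""
    top, left, height, width = zone
    region = crop(page, top, left, height, width)
    chars = []
    for x in range(0, width, glyph_width):
        glyph = tuple(row[x:x + glyph_width] for row in region)
        chars.append(match_glyph(glyph)[0])
    return "".join(chars)


# A tiny illustrative page with a one-pixel margin. The middle glyph is an
# "O" whose centre pixel is filled in, standing in for imperfect handwriting.
PAGE = (
    "...........",
    ".#..######.",
    ".#..###.#..",
    ".######.#..",
    "...........",
)

# Zone as (top, left, height, width); the field name "title" is hypothetical.
TITLE_ZONE = (1, 1, 3, 9)

index_fields = {"title": scrape_zone(PAGE, TITLE_ZONE)}
print(index_fields)
```

The nearest-match rule is what lets an imperfect sample still resolve to a character, and the reported distance could serve as a rough confidence value for flagging glyphs that need manual review.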