Our standard optical character recognition (OCR) process uses Able2Extract and then uses the running total on the bank statement to check that the scan is correct. If there are errors, it can correct them automatically in a number of steps, highlighting each correction so that a human supervisor can check it. Scanning conditions are sometimes poor, so we display the scanned results to the human operator before the automatic correction. We rely on the fact that humans are good at pattern recognition, so the operator can make changes before the running total check.
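The running total check can be illustrated with a short Python sketch. This is not our production code. The function name `find_balance_errors` and the row layout (a signed amount plus the balance printed on the statement, both as text) are assumptions made for the example. It only shows the idea of comparing each printed balance with the previous balance plus the amount.

```python
from decimal import Decimal


def find_balance_errors(opening_balance, rows):
    """Return the indexes of rows that fail the running total check.

    rows is a list of (amount, printed_balance) pairs held as text.
    Amounts are signed: money in is positive, money out is negative.
    This layout is an assumption made for the illustration.
    """
    suspects = []
    previous = Decimal(opening_balance)
    for index, (amount, printed_balance) in enumerate(rows):
        expected = previous + Decimal(amount)
        if expected != Decimal(printed_balance):
            suspects.append(index)
        # Carry the printed figure forward so that one misread amount
        # does not make every later row fail as well.
        previous = Decimal(printed_balance)
    return suspects


scanned = [
    ("-20.00", "80.00"),
    ("80.00", "130.00"),   # the amount was really 50.00 but was misread
    ("-10.00", "120.00"),
]
print(find_balance_errors("100.00", scanned))
```

Using `Decimal` rather than floating point keeps the pence exact, so a mismatch always means a scanning problem and never a rounding artefact. A misread balance, as opposed to a misread amount, would flag the row it sits on and the row after it, which still points the supervisor to the right place.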
Some bank statements lack a running total, we sometimes get credit card statements as well, and clients sometimes send spreadsheets or CSV files in a non-standard format. Taken together, these four types of data make up a significant fraction of our data entry work.
We now have an auxiliary OCR system to deal with this. As soon as we have the information on a spreadsheet, we can examine it visually for errors. We have an on-screen toolkit that we can summon up to make amendments and process the information into a standard format. For example, we can bulk-amend dates so that a date like 30.5.17, as typed in by a client, is converted to 30/5/2017. If the data is in reverse order, we can tip it the right way up. We can concatenate two columns of narrative into one. We can split a mixed column of positive and negative numbers so that all the positives are in one column and all the negatives are in another, with any pound signs and minus signs removed. We can remove lots of blank rows in one go, and we can space out the columns to make the data readable, each with the click of a button. This system is easy to add to as new types of data arrive, and we hope to be able to re-use existing code with a bit of editing. We can then use buttons to copy the tidied data and paste it, as values only (Paste Special), into our standard system.
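To give a flavour of how short these tidying routines can be, here is a Python sketch of four of them: the date conversion, the reversal, the narrative concatenation and the positive/negative split, together with blank row removal. Our toolkit itself works on a spreadsheet with buttons, so this is an illustration only. The function names, the four-column row layout and the rule that two-digit years belong to the 2000s are all assumptions made for the example.

```python
import re
from datetime import datetime


def normalise_date(text):
    """Convert a day-first date such as 30.5.17 to 30/5/2017."""
    day, month, year = (int(part) for part in re.split(r"[./-]", text.strip()))
    if year < 100:
        year += 2000  # assumption: two-digit years belong to this century
    datetime(year, month, day)  # raises ValueError for an impossible date
    return f"{day}/{month}/{year}"


def split_signed(text):
    """Return (positive, negative) with pound and minus signs removed."""
    cleaned = text.replace("£", "").strip()
    if not cleaned:
        return ("", "")
    if cleaned.startswith("-"):
        return ("", cleaned[1:])
    return (cleaned, "")


def tidy(rows, newest_first=False):
    """Tidy rows laid out as [date, narrative_a, narrative_b, amount].

    The four-column layout is hypothetical; real client data varies.
    """
    # Remove rows in which every cell is blank.
    rows = [row for row in rows if any(cell.strip() for cell in row)]
    # Tip the data the right way up if the client listed newest first.
    if newest_first:
        rows.reverse()
    tidied = []
    for date, narrative_a, narrative_b, amount in rows:
        parts = (narrative_a.strip(), narrative_b.strip())
        narrative = " ".join(part for part in parts if part)
        positive, negative = split_signed(amount)
        tidied.append([normalise_date(date), narrative, positive, negative])
    return tidied


client_rows = [
    ["31.5.17", "CARD", "TESCO", "-£12.50"],
    ["", "", "", ""],
    ["30.5.17", "SALARY", "", "£1000.00"],
]
for row in tidy(client_rows, newest_first=True):
    print(row)
```

Each routine is a few lines long and does one job, which is what makes the collection easy to extend when a new kind of client data turns up.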
The point of this is our insistence on graceful degradation. We need to have the resources to do something useful when scanning is less than perfect, or when we get data on client-supplied spreadsheets in an unusual format, and our auxiliary OCR system is the remedy. There’s nothing complicated about this system. It is just a collection of short routines to tidy up the data, but it is well worth having. We are now adding routines for various hypothetical situations so that, when real data appears, we should have something that can be re-used after a little amendment.
We are also in a position to give a verdict on the adoption of OCR. Yes, it is very useful, but no panacea. We need to support it with a variety of other new technologies such as datepointing and narrative prediction, and there’s more than one kind of OCR system to be considered. In the future, we will carry on improving what we have, and add the ability to work with OCR software other than Able2Extract, so the accounts clerk will get a choice. This is consistent with professional status and the Law of Requisite Variety.