Up to now we have been using optical character recognition (OCR) on bank statements, but we only scan dates and numbers, which makes the job a bit easier. To enter narratives, we type in a few, run a Narrative Prediction (NP) routine to guess the rest, and then overtype the guesses that are wrong. Sometimes NP works really well, and at other times it doesn’t.
We are upgrading to a hybrid system where we use full OCR to scan narratives as well as dates and numbers. The accounts clerk then goes through the narratives to check them. If narratives need cleaning up, the clerk can run a tidy-up routine to do this. As an example, a narrative which reads “DD Acme Trading Co” is tidied to read just “Acme Trading Co”. A narrative containing unexpected characters is simply deleted, on the assumption that something has gone wrong with the OCR system. All narratives below the active line are processed at once by the routine.
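The tidy-up step described above can be sketched roughly as follows. This is an illustration only, not our production routine: the names tidy and tidy_below, the prefix list (only “DD” comes from the example above) and the set of characters treated as acceptable are all assumptions, as is the choice to include the active line itself in the pass.

```python
import re

# Hypothetical list of prefixes to strip; only "DD" comes from the text.
PREFIXES = ("DD",)

# Assumption: a narrative is acceptable if it uses only these characters.
ALLOWED = re.compile(r"[A-Za-z0-9 &.,'/()-]*")


def tidy(narrative):
    """Return a tidied narrative, or "" if it looks like an OCR failure."""
    text = narrative.strip()
    if not ALLOWED.fullmatch(text):
        # Unexpected characters: delete the narrative altogether.
        return ""
    for prefix in PREFIXES:
        if text.startswith(prefix + " "):
            # Drop the prefix and the space that follows it.
            text = text[len(prefix) + 1:].lstrip()
            break
    return text


def tidy_below(narratives, active):
    """Tidy every narrative from index `active` downwards in one pass."""
    return narratives[:active] + [tidy(n) for n in narratives[active:]]
```

Deleted narratives come back as empty strings, so they look like any other blank line to whatever routine runs next.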
When the clerk comes to a blank line, the NP routine is run to fill it in, along with all other blank lines below the active line. Any narrative that is then wrong will need to be overtyped. The clerk has autocomplete to help out, and can also reprogram the function keys F1 to F10 to produce key narratives.
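To show the shape of the fill-in pass, here is a minimal sketch. How NP actually makes its guesses is not covered here, so the rule used below (reuse the narrative of a line with the same amount) is a stand-in assumption, and predict_narratives and the (amount, narrative) row layout are hypothetical names chosen for the example.

```python
def predict_narratives(rows, active):
    """Fill blank narratives at or below index `active`.

    rows is a list of (amount_in_pence, narrative) pairs; a blank
    narrative is the empty string. Non-blank narratives are never changed.
    """
    # Learn from every line that already has a narrative.
    known = {}
    for amount, narrative in rows:
        if narrative:
            known.setdefault(amount, narrative)

    filled = []
    for i, (amount, narrative) in enumerate(rows):
        if i >= active and not narrative:
            # Stand-in guess: same amount, same narrative; else stay blank.
            narrative = known.get(amount, "")
        filled.append((amount, narrative))
    return filled
```

Because narratives that are already present are left alone, the routine can be run again after the clerk has typed in a few more lines, and only the remaining blanks are affected.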
The NP and tidy-up routines may both need to be run more than once, and can be run in any order. Most narratives will have been scanned correctly by OCR, so this stage is just a matter of making adjustments. Cheque book stubs will need to be typed in separately, but these are becoming rare.
It should be remembered that the NP routine is also useful for dealing with handwritten records that happen to resemble bank statements. We enter the data column by column: all the numbers, all the dates, and a few narratives. We then run NP and overtype to make any corrections. In our opinion, it would be foolish to develop an OCR system and overlook NP.
In statistical terms, dates have low entropy, which means that missing or doubtful ones can be reconstructed by simple inference from their neighbours. Numbers have high entropy, so we need direct scanning by OCR. Narratives have variable entropy, for which a hybrid OCR/NP system is appropriate. If the client has ticked and written all over the bank statements, our system will go on working, albeit a bit more slowly, and will not push us over any metaphorical cliff edge.
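One rough way to put a figure on this is the empirical Shannon entropy of the values in a column, measured in bits. The sketch below is only an illustration of the idea: the function name column_entropy is our own, and treating each column as a bag of independent values is a simplifying assumption.

```python
from collections import Counter
from math import log2


def column_entropy(values):
    """Empirical Shannon entropy, in bits, of a list of values."""
    counts = Counter(values)
    total = len(values)
    # H = -sum(p * log2(p)) over the observed relative frequencies.
    return -sum((n / total) * log2(n / total) for n in counts.values())
```

A column in which every entry is the same scores zero, while a column of four entries that are all different scores two bits. On this measure, a column of dates that mostly repeat would score low, and a column of amounts that rarely repeat would score high.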
If lots of numbers are scanned wrongly due to the presence of extraneous ticks and annotations, then the computer-assisted blink comparator will sort it out. If lots of narratives are scanned wrongly, then they will be deleted and NP will be used to fill in the gaps in the first round, and the clerk can check and overtype in the second round. If lots of dates are scanned wrongly, they will be deleted and simple interpolation will be used to replace them. If the bank statements are of poor quality, then the job may take longer, but not longer out of all proportion. This is called graceful degradation.
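The date repair mentioned above could look something like the following sketch. It assumes that deleted dates are held as None, that dates run in order down the page, and that gaps are filled linearly by row position between the nearest surviving dates; the name interpolate_dates is hypothetical.

```python
from datetime import date, timedelta


def interpolate_dates(dates):
    """Fill None entries by linear interpolation between known neighbours."""
    known = [i for i, d in enumerate(dates) if d is not None]
    if not known:
        # Nothing to interpolate from; leave the column as it is.
        return list(dates)

    out = list(dates)
    # Gaps before the first or after the last known date copy that date.
    for i in range(known[0]):
        out[i] = dates[known[0]]
    for i in range(known[-1] + 1, len(dates)):
        out[i] = dates[known[-1]]

    # Gaps between two known dates are spread evenly, rounding down.
    for lo, hi in zip(known, known[1:]):
        span = (dates[hi] - dates[lo]).days
        for i in range(lo + 1, hi):
            offset = span * (i - lo) // (hi - lo)
            out[i] = dates[lo] + timedelta(days=offset)
    return out
```

For example, two deleted dates between 1 March and 7 March are filled in as 3 March and 5 March. The guesses will not always be right, but they are close enough for the clerk to correct quickly.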