A few years ago I conceived the idea that accountants could process bank statements using optical character recognition (OCR). Quite a few other people had the same idea, but I had the advantage of being both a computer programmer and a working accountant. Knowing the records we get, what I wanted was a system with lots of graceful degradation. What happens, for example, if there are a few scribbles on the bank statements? Scribbles are frequent enough to wreck any standard OCR system, but it would still be a mistake to abandon OCR altogether. These are just the bank statements we get in the real world.
My system is essentially “merely” an optical number recognition (ONR) system: it reads only the numbers, which gives me an easier task to tackle. It can now do dates as well, but as an afterthought. The ONR system is backed up by a blink-comparator-like display, so a visual check can be made that what has been scanned onto a spreadsheet approximates to what we were expecting. Anything which has been disrupted by a scribble is easy to spot, and we just need to type it in from the keyboard. As an accountant, I am the number-happy sort, and I can program the computer to spot when a number is wrong, or when a statement is upside-down, and so on, and to make amendments. Bank statements have a running total against which errors can actually be fixed, and after the blink comparator stage the error rate is likely to be low, so the fix is reliable. My ONR software is basically a Fortran program!
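To make the running-total idea concrete, here is a minimal sketch in Python (not the Fortran of the actual program, whose internals are not shown here). It assumes amounts are signed integers in pence, with debits negative, and that errors are sparse after the visual check. The function name `repair_rows` and the rule "trust the balance if the next row agrees with it" are assumptions made for illustration.

```python
def repair_rows(opening_balance, rows):
    """Repair sparse misreads using the running total.

    rows is a list of (amount, balance) pairs in pence; amounts are
    signed (debits negative). Returns a corrected copy of rows.
    """
    fixed = []
    prev = opening_balance
    for i, (amount, balance) in enumerate(rows):
        if prev + amount != balance:
            nxt = rows[i + 1] if i + 1 < len(rows) else None
            if nxt is None or balance + nxt[0] == nxt[1]:
                # The next row is consistent with this balance (or there
                # is no next row), so assume the amount was misread.
                amount = balance - prev
            else:
                # The next row disagrees with this balance, so assume
                # the balance itself was misread.
                balance = prev + amount
        fixed.append((amount, balance))
        prev = balance
    return fixed


# Hypothetical example: 1000 was misread as 7000 in the second row.
scanned = [(-2500, 7500), (7000, 8500), (-300, 8200)]
repaired = repair_rows(10000, scanned)
```

This only works when errors are rare: two misreads in adjacent rows can defeat the rule, which is why a low error rate going into this stage matters.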
We don’t have to date absolutely everything, so if a date is obviously wrong we just throw it away. Dates in the future, or more than four years (1,461 days) in the past, are also ignored, so ancient bank statements will need to be done manually. Dates are definitely an afterthought. However, a block of dates on the blink comparator is helpful. A run of dates can be used to detect an upside-down statement (it could have been printed off from an online bank account), although the software can also detect inversion by internal inspection of the numbers as a backup system. These uses of dates tolerate misreading and rejection of the odd individual date.
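The date window can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the program's actual code: the names `usable_date` and `filter_dates` are hypothetical, and a rejected date is represented here as `None` so that the row keeps its place.

```python
from datetime import date, timedelta

MAX_AGE_DAYS = 1461  # four years including one leap day, as in the text


def usable_date(d, today):
    """True if d is not in the future and not more than 1,461 days old."""
    oldest = today - timedelta(days=MAX_AGE_DAYS)
    return oldest <= d <= today


def filter_dates(dates, today):
    """Replace out-of-window dates with None instead of dropping the row."""
    return [d if usable_date(d, today) else None for d in dates]


# Hypothetical example with a fixed "today" so the result is repeatable.
today = date(2024, 6, 30)
read = [date(2024, 5, 1), date(2042, 5, 1), date(2019, 1, 1)]
kept = filter_dates(read, today)
```

Here the 2042 date is rejected as being in the future and the 2019 date as too old, while the May 2024 date survives.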
I wasn’t sure that this was all going to work, so independently of this I developed fast conventional methods of data entry. Take narratives. If every narrative were the same, then Excel’s Autocomplete would help with typing them in. Autocomplete kicks in as soon as it detects a unique trigger, but if we had two narratives like “Accommodation in Carlisle” and “Accommodation in Penrith”, which do not diverge until the eighteenth character, then Autocomplete would be close to useless. My system would detect these long-trigger narratives and reprogram the function keys to generate them, so I have a clever keyboard. I call this super-Autocomplete.
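One way to find long-trigger narratives is to measure how many characters must be typed before a narrative becomes unique. The Python sketch below is an assumption about how this could be done, not a description of the actual system; the names `trigger_length` and `assign_function_keys`, the threshold of 8 characters, and the limit of twelve function keys are all hypothetical.

```python
import os


def trigger_length(narrative, others):
    """Characters to type before narrative differs from every other one."""
    longest_shared = max(
        (len(os.path.commonprefix([narrative, o])) for o in others),
        default=0,
    )
    # Cap at the full length in case narrative is a prefix of another.
    return min(longest_shared + 1, len(narrative))


def assign_function_keys(narratives, threshold=8, max_keys=12):
    """Map F1, F2, ... to the narratives with the longest triggers."""
    distinct = sorted(set(narratives))
    scored = []
    for n in distinct:
        others = [o for o in distinct if o != n]
        scored.append((trigger_length(n, others), n))
    # Only narratives that need more than `threshold` keystrokes qualify.
    long_ones = sorted(
        (item for item in scored if item[0] > threshold),
        key=lambda item: (-item[0], item[1]),
    )
    return {f"F{i + 1}": n for i, (_, n) in enumerate(long_ones[:max_keys])}


keys = assign_function_keys(
    ["Accommodation in Carlisle", "Accommodation in Penrith", "Bank charges"]
)
```

In this example the two accommodation narratives each need 18 keystrokes and so get a function key, while "Bank charges" is unique after one character and is left to ordinary Autocomplete.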
Continuing in this style, I have a system where a month’s worth of narratives can be typed in with my super-Autocomplete system. I can then run a Narrative Predictor which predicts the next 11 months. The narratives which are wrong need to be overtyped, but there are often not many that are wrong, and we still have super-Autocomplete. After two months we can rerun the Narrative Predictor to refine the prediction for the remaining 10 months.
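How the Narrative Predictor works internally is not described here, so the following Python sketch is only a naive stand-in that shows the workflow: type some months, predict the rest, correct, and rerun. It assumes each month is a list of narratives in statement order and predicts, for each position, the narrative seen most often at that position so far, preferring the most recent month on a tie. The function name `predict_months` and this positional rule are assumptions, not the actual method.

```python
def predict_months(typed_months, months_ahead):
    """Predict future months from the months typed in so far.

    typed_months is a list of months, each a list of narratives in
    statement order. Returns months_ahead copies of the predicted month.
    """
    slots = max(len(m) for m in typed_months)
    predicted = []
    for slot in range(slots):
        counts = {}
        # Walk the newest month first so that ties go to recent data:
        # max() returns the first key with the highest count.
        for month in reversed(typed_months):
            if slot < len(month):
                counts[month[slot]] = counts.get(month[slot], 0) + 1
        predicted.append(max(counts, key=counts.get))
    return [list(predicted) for _ in range(months_ahead)]


# Hypothetical data: predict 11 months from one, then 10 from two.
january = ["Rent", "Gas", "Wages"]
february = ["Rent", "Electricity", "Wages"]
first_pass = predict_months([january], 11)
second_pass = predict_months([january, february], 10)
```

With only January typed, every predicted month is a copy of January. After February is typed, the second position is a one-all tie and the sketch follows the more recent month.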
I think that with the bank statements we get, Narrative Predictor will easily outperform OCR. Narratives have low information density (the English language has a lot of redundancy) but also low entropy (they tend to be predictable). Numbers have high information density and high entropy. The use of OCR tends to introduce a few random errors, so for narratives the signal-to-noise ratio could be problematic given the low information density. Errors in numbers can be fixed by reference to the running total. This is the justification for using OCR for numbers and a different technique for narratives.
So it’s ONR for numbers, Narrative Prediction for narratives, and dates are an afterthought. I have now got the system I wanted a few years ago. My ambition has been realised. Of course, someone out there will tell me why a pure OCR system is so much better. Well, good luck to them!