
Numeric Recognition Study 1998

Introduction

Statistics New Zealand (SNZ) used imaging and Intelligent Character Recognition (ICR) for the first time in the processing of the 1996 Census of Population and Dwellings. Although the numeric recognition results achieved were within contractual limits, the data quality, for certain variables in particular, was not as high as anticipated. To some extent, these deficiencies in the data quality stemmed from our lack of understanding of the total recognition process.

The specific aim of this research was to recommend whether or not to use numeric recognition for the 2001 Census. To do this, we felt it necessary to gain a better understanding of the recognition process and to assess some currently available systems.

The research was a joint venture between SNZ and Datamail. The study used images retained for research purposes from the 1996 Census. Data that had been key entered and verified was compared to recognised data. Datamail was responsible for performing the numeric recognition and also provided the software for capturing the key entry data. SNZ designed the study and carried out all aspects of the analysis.

To investigate possible differences between recognition engines we included two engines in the trial - RE and MITEK. For the RE engine, we also assessed the benefit of applying image enhancement techniques prior to recognition. The RE engine is Datamail’s current engine, while the MITEK engine is regularly used by their associate company in Australia.

In order to look at the trade-off between cost and quality, we established the relationship between the proportion of data rejected (requiring key entry) and the associated error rate in the data. We considered the digit error rate and the variable error rate in the data, both before and after simple edits.
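The relationship between rejection and error can be illustrated with a minimal sketch. It is not the method used in the study: the record layout, the function name, the sample values and the convention that a digit is rejected when its confidence falls below the threshold are all assumptions made for illustration, as is the choice to express the error rate as a proportion of accepted digits. Keyed and verified values are treated as the truth, as in the study design.

```python
from dataclasses import dataclass


@dataclass
class RecognisedDigit:
    keyed: str        # key-entered and verified value, treated as the truth
    recognised: str   # value returned by the recognition engine
    confidence: int   # engine confidence score (scale is an assumption)


def rejection_and_error_rates(digits, threshold):
    """Return (rejection rate, substitution rate) at one confidence threshold.

    Assumed conventions: digits below the threshold are rejected and sent
    to key entry; the substitution rate is measured over accepted digits.
    """
    if not digits:
        raise ValueError("no digits supplied")
    accepted = [d for d in digits if d.confidence >= threshold]
    rejection_rate = (len(digits) - len(accepted)) / len(digits)
    errors = sum(1 for d in accepted if d.recognised != d.keyed)
    substitution_rate = errors / len(accepted) if accepted else 0.0
    return rejection_rate, substitution_rate


# Hypothetical sample, not data from the study.
sample = [
    RecognisedDigit("3", "3", 900),
    RecognisedDigit("7", "1", 700),
    RecognisedDigit("5", "5", 640),
    RecognisedDigit("8", "3", 400),
    RecognisedDigit("0", "0", 980),
]

for threshold in (0, 650, 800):
    rejected, substituted = rejection_and_error_rates(sample, threshold)
    print(f"threshold {threshold}: rejected {rejected:.2f}, substituted {substituted:.2f}")
```

Raising the threshold in this sketch sends more digits to key entry (higher cost) while lowering the error rate among the digits that are accepted, which is the trade-off described above.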


List of tables

  1. Digit substitution pattern for the RE engine on unenhanced images at a rejection rate of 17.54 percent (hand-written numerics)
  2. Digit substitution pattern for the RE engine on enhanced images at a rejection rate of 18.04 percent (hand-written numerics)
  3. Digit substitution pattern for the MITEK engine on enhanced images at a rejection rate of 18.68 percent (hand-written numerics)
  4. Variable error and rejection rates for the RE engine on enhanced images, with the confidence threshold for all digits at 650
  5. Number and type of variable errors for the RE engine on enhanced images, with the confidence threshold for all digits at 650
  6. Variable error and rejection rates before and after basic edits, for the RE engine on enhanced images, with the confidence threshold for all digits at 650
  7. Number and type of variable errors after basic edits, for the RE engine on enhanced images, with the confidence threshold for all digits at 650 (compare with Table 5)
  8. Variable error and rejection rates for ‘number of bedrooms’ before and after basic edits, for the RE engine on enhanced images, for different confidence thresholds
  9. Variable error and rejection rates before and after size edits, for the RE engine on enhanced images, with the confidence threshold for all digits at 650
  10. Number and type of variable errors after basic edits and size edits, for the RE engine on enhanced images, with the confidence threshold for all digits at 650 (compare with Table 7)
  11. Digit specific confidence thresholds
  12. Variable error and rejection rates after basic edits and size edits, for the RE engine on enhanced images, with the confidence threshold for all digits at 650 and with different confidence thresholds as in Table 11
  13. Number and type of variable errors after basic edits and size edits, for the RE engine on enhanced images, with different confidence thresholds as in Table 11 (compare with Table 10)

List of figures

  1. Rejection rates and digit substitution rates for the RE and MITEK engines, as well as those achieved in the 1996 Census and the 1992 US OCR Conference (hand-written numerics)
  2. Keyed and recognised distributions of ‘number of bedrooms’ for the RE engine on enhanced images
  3. Keyed and recognised distributions of ‘hours worked in main job’ for the RE engine on enhanced images
  4. Keyed and recognised distributions of ‘fertility’ for the RE engine on enhanced images


