Understanding data quality

The census programme actively manages quality across all phases of the census, from planning, collection, processing, and data evaluation, to creation of products and services. This chapter explains how we ensure data is ‘fit for use’ before we release it.

How do we manage data quality?

Census variables and topics are ranked by three 'quality levels' – foremost, defining, and supplementary. These levels are used to guide the amount of resources spent on quality control at all phases of the census.

Level 1 – foremost variables / topics are core census variables. Their outputs are the main reason for conducting a census and include information on age, sex, ethnicity, and location. Some of these variables produce the key outputs used for maintaining the accuracy of population estimates. Across all phases of the census, foremost variables are given the highest priority in terms of quality control, time, and resources.

Level 2 – defining variables / topics describe the key subject populations that the census provides measures for. They are important for policy development, evaluation, or monitoring. Defining variables are used frequently in cross-tabulations with foremost variables. They represent key sub-populations and the measures that are of high public interest, for example, birthplace and labour force status. These variables are closely linked to the main purpose of the census, and in the New Zealand context may only be available in detail, for example, at the subnational level. These variables have second priority in terms of quality control, time, and resources across all phases of the census.

Level 3 – supplementary variables / topics do not fit directly with the primary purpose of the census but are important to some groups of users. Examples include occupation, language, and religious affiliation. These variables have third priority in terms of effort and resources.
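As an illustration only, the three quality levels can be modelled as an ordered enumeration, with each variable mapped to its level. The assignments below use only the example variables named above; the names `QualityLevel`, `VARIABLE_LEVELS`, and `by_priority` are hypothetical and are not part of any census system.

```python
from enum import IntEnum


class QualityLevel(IntEnum):
    """Lower numbers receive higher priority for quality control."""
    FOREMOST = 1
    DEFINING = 2
    SUPPLEMENTARY = 3


# Example assignments, limited to the variables named in the text above.
VARIABLE_LEVELS = {
    "age": QualityLevel.FOREMOST,
    "sex": QualityLevel.FOREMOST,
    "ethnicity": QualityLevel.FOREMOST,
    "location": QualityLevel.FOREMOST,
    "birthplace": QualityLevel.DEFINING,
    "labour force status": QualityLevel.DEFINING,
    "occupation": QualityLevel.SUPPLEMENTARY,
    "language": QualityLevel.SUPPLEMENTARY,
    "religious affiliation": QualityLevel.SUPPLEMENTARY,
}


def by_priority(variables):
    """Order variable names so that higher-priority levels come first."""
    return sorted(variables, key=lambda name: (VARIABLE_LEVELS[name], name))
```

For example, `by_priority(["occupation", "birthplace", "age"])` places age first, then birthplace, then occupation.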

A list of census output variables / topics, together with the quality level assigned to each variable, is given in Census variables / topics by quality level.

The evaluation phase assesses each variable’s quality for its ‘fitness for use’. Every variable needs to meet minimum standards before we release any output data.

2013 Census information by variable has information on the data quality of variables.

How do we ensure census data is fit for use?

A comprehensive testing programme is undertaken before the census to ensure that its key processes and systems are working efficiently and to specification. The programme consists of different types of tests. These usually include form tests (cognitive and usability tests), field tests, system tests, integration tests, and a dress rehearsal.

Before a new question is included on the census form, or before a change is made to an existing question, the question undergoes cognitive testing. Cognitive testing shows how well a question meets the combined needs of respondents and users. It confirms that the question collects good-quality data.

Field tests ensure that the census forms collect reliable information and that field procedures are working effectively. The data processing system is thoroughly tested to ensure that the system treats the responses on the census form correctly. Individual systems are tested, and then combined and tested as a group.

After the data has been processed it is evaluated before it is published. This evaluation phase ensures that the data provided by respondents has not been changed and that errors have not been introduced by the systems or processes used to input the data. The results of this evaluation are published in 2013 Census information by variable.

The 2011 Census testing strategy was adopted for the 2013 Census.

What are the possible sources of error?

The census covers the entire population of New Zealand, and is not subject to sampling error. Sampling error occurs when a sample of people in the population is surveyed and their responses are used to estimate the results of a survey of the whole population. A number of errors may be introduced due to how the sample is drawn, the sample size, and population variability.
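To make the point about sampling error concrete, here is a minimal sketch using the standard textbook formula for the standard error of a proportion under simple random sampling without replacement. It is not a description of any Stats NZ method, and the function name and example figures are hypothetical. The finite population correction shrinks the error as the sample approaches the whole population, and the error is zero when everyone is enumerated.

```python
import math


def proportion_standard_error(p, sample_size, population_size):
    """Standard error of an estimated proportion p under simple random
    sampling without replacement, with the finite population correction."""
    if not 0 < sample_size <= population_size:
        raise ValueError("sample_size must be between 1 and population_size")
    if population_size == 1:
        return 0.0
    # The correction factor is 0 when the sample is the whole population.
    fpc = (population_size - sample_size) / (population_size - 1)
    return math.sqrt(p * (1 - p) / sample_size * fpc)


# Hypothetical population of 4,000,000 and a true proportion of 0.3.
survey_error = proportion_standard_error(0.3, 1_000, 4_000_000)
full_count_error = proportion_standard_error(0.3, 4_000_000, 4_000_000)
```

Here `full_count_error` is exactly zero, while `survey_error` is positive. This is the sense in which a full enumeration is not subject to sampling error.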

However, census data may be subject to non-sampling errors resulting from respondent errors, collection or processing errors, and undercounts. We strive to reduce each of these error types and provide data that is fit for use.

Being self-administered, the census may be subject to errors made by individuals when filling in census forms. These could happen because individuals misunderstand the question, accidentally mark the wrong box, or give a partial response or no response at all to census questions that were relevant to them. To minimise these errors, census forms have been designed so that questions are as easy to understand and as simple to answer as possible. Online census forms also help minimise respondent errors. Built-in editing functionality directs individuals to the appropriate questions and ensures that their responses are valid.

To minimise purposeful distortion of information by individuals, the importance of the census is communicated through a variety of media channels – such as television, radio, the Internet, and newspapers – and through a community outreach programme. Guide notes (delivered with both paper and online census forms), other online help, and the toll-free census helpline number help individuals complete their census forms.

Collection errors are errors made by collectors when forms are delivered to, or collected from, dwellings. These could include assigning dwellings to an incorrect meshblock, misidentifying a dwelling as occupied or unoccupied, or incorrectly classifying a dwelling as private or non-private. We have checks and balances at different stages of the collection, processing, and evaluation phases to identify and fix these errors.

Examples of errors that can occur during data processing include incorrectly classifying responses and misrecognition of written responses (processing of Internet forms is less subject to these types of errors). Checks are made during data processing to identify possible errors and correct them if necessary. The data processing phase is followed by a data evaluation phase, where further checks on the data are done to ensure that it meets quality standards and is fit for use.

While we aim to collect information on everyone living in New Zealand, some people may be missed and some may be counted more than once. Our collection processes seek to minimise these errors. In most censuses, more people are missed than overcounted, which results in a net undercount. This is measured through the Post-enumeration Survey discussed in How do we know how many people were missed?
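The arithmetic behind a net undercount can be sketched as follows. This is only an illustration of the definition given above: the function names and figures are hypothetical, and it says nothing about how the Post-enumeration Survey actually estimates the number of people missed or counted more than once.

```python
def net_undercount(missed, counted_more_than_once):
    """People missed minus people counted more than once."""
    return missed - counted_more_than_once


def net_undercount_rate(missed, counted_more_than_once, census_count):
    """Net undercount as a share of the estimated true population,
    where the true population is taken to be the census count plus
    the net undercount (an assumption made for this sketch)."""
    net = net_undercount(missed, counted_more_than_once)
    estimated_population = census_count + net
    return net / estimated_population


# Hypothetical figures: 120 people missed, 20 counted twice, 4,900 counted.
net = net_undercount(120, 20)
rate = net_undercount_rate(120, 20, 4_900)
```

With these made-up figures the net undercount is 100 people, which is 2 percent of an estimated population of 5,000.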

What about missing forms or questions that aren’t answered?

The census aims to give complete coverage of New Zealand’s population. The previous chapter explains how everyone is counted in the census. The collection phase is a key part of the process of ensuring high-quality data. Our collection programme includes strategies for hard-to-reach groups such as those living in apartments. However, some people will not fill in census forms that have been delivered and some people will not fill in all the questions that are relevant to them.

These uncollected forms or non-responses to questions affect data quality. The higher the non-response, the less certain we can be that we have information on the characteristics of all the people in an area. We use statistical techniques, called imputation, to fill these data gaps with information that is most likely to have been on the forms. These imputed records are created to improve census coverage. Imputed records are included in the final population and dwelling counts.

What is imputation?

There are two types of imputation: imputation of records to create substitute individual or dwelling records, and imputation of variables where respondents have not answered a question.

What are substitute records?

Substitute dwelling records are created where there is sufficient evidence that an occupied dwelling exists but we have no corresponding dwelling form. Similarly, substitute individual records are created where there is sufficient evidence that a person exists but we have no corresponding individual form.

These imputed records are created to maintain the high level of census coverage. The first step in ascertaining if there are missing dwelling forms or individual forms is to attach individual forms to dwelling forms in household groups. The dwelling form shows the number of people present in a dwelling on census night. There are three key processes within this household balancing step.

First, a three-way check is done on the number of occupants, number of names on the household grid from the dwelling form, and the number of individual forms received. Most households pass this check.

Second, collectors' field books are checked to resolve errors and flag households with unresolved errors as 'unbalanced'. Further checks, to look for missing dwelling or individual forms, extra dwelling forms, and unoccupied dwellings, result in more households being balanced.

Third, substitute dwelling and individual records are created to complete balancing.
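The first of these steps, the three-way check, can be sketched as a simple comparison of three counts. The `Household` structure and its field names are hypothetical, and the later steps (checking field books and creating substitute records) involve clerical judgement that is not modelled here.

```python
from dataclasses import dataclass


@dataclass
class Household:
    occupants_reported: int  # number of occupants stated on the dwelling form
    names_on_grid: int       # names on the dwelling form's household grid
    individual_forms: int    # individual forms received for the dwelling


def passes_three_way_check(household):
    """True when all three counts agree."""
    return (
        household.occupants_reported
        == household.names_on_grid
        == household.individual_forms
    )


def split_by_balance(households):
    """Separate households that pass the check from those needing follow-up."""
    balanced = [h for h in households if passes_three_way_check(h)]
    unbalanced = [h for h in households if not passes_three_way_check(h)]
    return balanced, unbalanced
```

A household with three occupants, three names on the grid, and only two individual forms would fail the check and go on to the second step.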

Substitute forms have some variables imputed, but the majority are given a non-response code. Substitute dwelling forms have all variables coded to the non-response category apart from number of occupants, dwelling address, dwelling type, and dwelling occupancy status. Substitute individual forms have all variables coded to the non-response category apart from age, sex, usual residence address, census night address, and individual record type.
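A substitute record can be pictured as a record in which only the listed variables carry a value and every other variable holds the non-response code. The sketch below assumes records are plain dictionaries; the variable names, the `NON_RESPONSE` marker, and the `make_substitute` helper are all hypothetical.

```python
NON_RESPONSE = "non-response"

# Variables that the text above lists as populated on substitute records.
DWELLING_KEPT = {
    "number_of_occupants",
    "dwelling_address",
    "dwelling_type",
    "dwelling_occupancy_status",
}
INDIVIDUAL_KEPT = {
    "age",
    "sex",
    "usual_residence_address",
    "census_night_address",
    "individual_record_type",
}


def make_substitute(all_variables, kept, known_values):
    """Build a substitute record: kept variables take their known or
    imputed value, and every other variable gets the non-response code."""
    record = {}
    for name in all_variables:
        if name in kept:
            record[name] = known_values[name]
        else:
            record[name] = NON_RESPONSE
    return record
```

For instance, a substitute individual record built this way would carry a value for age and sex, while a variable such as religious affiliation would be set to the non-response code.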

How are variables imputed?

Imputation is the replacement of missing information with a best estimate of what the true value might be. The estimate is typically based on available non-missing information. In the 2013 Census, four variables are imputed on the output dataset. These are: sex, age, census usual residence meshblock, and work and labour force status.

The census uses three main types of imputation. They are:

  • Deterministic imputation. This involves gathering information from other responses on the census form to determine a response to a question without a valid answer. For example, using age information on the dwelling form to impute a missing age on the individual form.
  • Stochastic imputation. This involves imputing missing values according to an existing distribution. For example, using New Zealand's age distribution to impute a person who has not responded to the age question. Available information on the respondent from their census forms narrows the stochastic distribution range. For example, if a respondent lives in a rest home, they are likely to be aged over 58 years.
  • Donor method imputation. This involves matching the non-respondent (recipient) to a respondent (donor) for a particular question, based on a set of matching variables that are closely related to the missing variable. The method copies the missing information from the donor to the recipient. For example, if the work and labour force status information is missing for a 35-year-old male, a male in the same age group is found as a donor of this information.
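The first two approaches in the list above can be sketched for a missing age as follows. This is an illustration under stated assumptions, not the production method: forms are plain dictionaries, people are matched between the dwelling and individual forms by a hypothetical `name` key, and the age distribution and lower bound passed in are made up.

```python
import random


def impute_age_deterministic(individual_form, dwelling_form_ages):
    """Use the age recorded for this person on the dwelling form when the
    individual form has no age. Returns None if neither form has it."""
    age = individual_form.get("age")
    if age is not None:
        return age
    return dwelling_form_ages.get(individual_form["name"])


def impute_age_stochastic(age_distribution, rng, min_age=0):
    """Draw an age from a distribution given as {age: weight}, restricted
    to ages of at least min_age when other information narrows the range."""
    ages = [a for a in age_distribution if a >= min_age]
    weights = [age_distribution[a] for a in ages]
    return rng.choices(ages, weights=weights, k=1)[0]
```

Passing a higher `min_age` for a respondent known to live in a rest home mirrors the way available information narrows the range of the stochastic draw.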

In practice, we use one or several of these imputation approaches depending on the level of information respondents provide.
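Donor method imputation can be sketched in the same spirit. The example assumes that sex and a five-year age group are the matching variables and that a donor is picked at random among the matches; both choices are assumptions made for illustration, since the text does not specify the matching variables or the selection rule actually used.

```python
import random


def age_group(age, width=5):
    """Index of the age band containing age (band width is an assumption)."""
    return age // width


def impute_from_donor(recipient, respondents, target, rng):
    """Copy the target variable from a randomly chosen donor who answered
    the question and matches the recipient on sex and age group.
    Returns None when no suitable donor exists."""
    donors = [
        r for r in respondents
        if r.get(target) is not None
        and r["sex"] == recipient["sex"]
        and age_group(r["age"]) == age_group(recipient["age"])
    ]
    if not donors:
        return None
    return rng.choice(donors)[target]
```

For a 35-year-old male with no work and labour force status, only males aged 35 to 39 who answered the question would be eligible donors under these assumptions.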

What about questions that aren’t imputed?

We only impute the answer for sex, age, census usual residence meshblock, and work and labour force status, as these variables are used as the base for detailed population estimates and projections. For questions that are not answered and not imputed, we classify the answer as ‘not stated’.

We calculate non-response rates by counting the number of responses entered as ‘not stated’ as a proportion of the subject population for that question, unless otherwise stated. For the 1981–96 Censuses, these were called ‘not specified’.
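The calculation described above amounts to a simple proportion. In this sketch the subject population is represented as a list of coded responses; the function name and the exact code string are assumptions.

```python
def non_response_rate(responses, not_stated="not stated"):
    """Percentage of the subject population coded as 'not stated'."""
    if not responses:
        raise ValueError("subject population is empty")
    missing = sum(1 for response in responses if response == not_stated)
    return 100.0 * missing / len(responses)


# Hypothetical subject population of 50 people, 3 of whom did not answer.
example = ["not stated"] * 3 + ["answered"] * 47
rate = non_response_rate(example)
```

With these made-up figures, 3 'not stated' responses in a subject population of 50 give a non-response rate of 6.0 percent.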

High non-response rates may affect data quality. There is no standard scale for deciding what low, moderate, or high non-response rates are. The scale shown in the table below is based on the non-response rate together with information about the likely effects of different non-response rates. You can use this scale to think about the impacts of non-response on data quality.

Effects of different non-response rates on data quality(1)

Non-response rate    Description of non-response level    Likely effect on data quality
<3.0 percent         Low                                  Little or none
3.0–4.9 percent      Relatively low                       Low
5.0–6.9 percent      Moderate                             Some reduction in data quality may have occurred
7.0–8.9 percent      Relatively high                      Some reduction in data quality is likely
9.0+ percent         High                                 Data quality will have been reduced

1. Adapted from: A guide to using data from the New Zealand census: 1981–2006 (Errington, Cotterell, van Randow, & Milligan, 2008).
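The scale in the table can be applied programmatically with a small lookup. The sketch below treats each band as starting at its lower bound and running up to the next band's lower bound (so a rate of 4.95 percent falls in 'Relatively low'); that handling of values between the printed bands is an assumption, and the function name is hypothetical.

```python
def describe_non_response(rate_percent):
    """Return (non-response level, likely effect on data quality)
    for a non-response rate given in percent."""
    if rate_percent < 0:
        raise ValueError("rate_percent cannot be negative")
    if rate_percent < 3.0:
        return ("Low", "Little or none")
    if rate_percent < 5.0:
        return ("Relatively low", "Low")
    if rate_percent < 7.0:
        return ("Moderate", "Some reduction in data quality may have occurred")
    if rate_percent < 9.0:
        return ("Relatively high", "Some reduction in data quality is likely")
    return ("High", "Data quality will have been reduced")
```

For example, a non-response rate of 6.0 percent is described as 'Moderate'.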

Response rates to questions may vary. Where non-response rates are available, these are provided in 2013 Census information by variable.

We suggest you check the information about each variable you intend to use, so that you are aware of any data you need to interpret with caution.
