Imputation and Balancing Methodologies for the 2006 Census

Summary

The proposed imputation and balancing methodologies for the 2006 Census have been produced on a minimal change basis. Minor improvements have been made where issues were identified in the 2001 Census, and where new initiatives will improve the quality of the data.

The review of the 2001 methodology, and all of the proposed changes, went through a consultation and review process with stakeholders. This has ensured that the 2006 methodology has the support of subject matter experts and other people affected by the imputations. A summary of the changes to the methodologies, all of them minor, is outlined in the following table.

Balancing
  • Key elements unchanged: Balancing organises census forms into household groups. The overall balancing process and checks on households have not changed from 2001.
  • Changes from the 2001 methodology: The introduction of the Internet response option means more duplicate forms may be received. To prepare for this, specific guidelines have been introduced for operators on how to handle duplicate forms discovered at balancing (see Section 2.2).

Work and labour force status (WLFS) imputation
  • Key elements unchanged: WLFS imputation imputes individuals using a donor method. This is the same method that was used in 2001.
  • Percentage imputed in 2001: Overall level of imputation 8.2 percent
  • Changes from the 2001 methodology:
      • Geographic variation in the WLFS imputation will be accounted for by introducing region as a matching variable (see Section 3.3).
      • An individual record can be used as a donor record no more than 15 times in the WLFS imputation (see Section 3.4).

Sex imputation
  • Key elements unchanged: Sex imputation uses a mix of deterministic and stochastic methods to impute a sex for respondents missing this information. This is the same method that was used in 2001.
  • Percentage imputed in 2001: Overall level of imputation 4.1 percent
  • Changes from the 2001 methodology:
      • Improved methods for identifying and imputing single-sex non-private dwellings, eg prisons, have been incorporated (see Section 4.2).
      • The number of sex imputation monitoring codes has been increased (see Section 4.4).

Usual residence imputation
  • Key elements unchanged: Usual residence imputation uses deterministic and donor methods to assign each individual to a meshblock. This is the same method that was used in 2001.
  • Percentage imputed in 2001: Overall level of imputation 0.7 percent
  • Changes from the 2001 methodology: An investigation into the usual residence imputation predictor variables led to sex being dropped, and employment status added, as a predictor variable (see Section 5.2).

Age imputation
  • Key elements unchanged: In 2006, age imputation uses a mixture of deterministic, stochastic and donor methods. In 2001, the same deterministic and stochastic methods were used for age imputation.
  • Percentage imputed in 2001: Overall level of imputation 3.7 percent
  • Changes from the 2001 methodology:
      • When an entire substitute household is created, the ages of the residents will be imputed from donors (as opposed to the general distribution). This donor method maintains the natural distribution of ages within a household (see Section 6.2).
      • Minor shifts in the age distribution of the New Zealand population have caused the stochastic imputation distributions to be updated (see Section 6.3).
      • Changes in the family coding methodology have led to the dropping of two stochastic imputation methods (see Section 6.3).

1. Introduction

Imputation involves inserting a value when a respondent has not provided a valid response. In the 2006 Census, four variables will be imputed on the output dataset. These are:

  • work and labour force status (WLFS)
  • sex
  • usual residence meshblock (URM)
  • age.

A fifth imputation, for income, was proposed and investigated. Because of resource constraints, census management decided not to proceed with an income imputation for 2006. Methodologies and specifications for four imputations, and a balancing methodology, have been created. This document provides a short summary of how each method worked in the 2001 Census, together with the proposed changes to the methodologies for 2006.

The census uses three main types of imputation. They are:

  • Deterministic imputation. This involves gathering information from other responses on the census form to determine a response to the unresolved question. For example, using age information on the dwelling form to impute a missing date of birth on the individual form.
  • Stochastic imputation. This involves imputing missing values according to an existing distribution. For example, using New Zealand's age distribution to impute a person who has not responded to the age question. The stochastic distribution ranges can be narrowed using any available information on the respondent. For example, if a respondent lives in a rest home, they are likely to be aged over 58 years.
  • Donor method imputation. This involves matching the non-respondent (recipient) to a respondent (donor) for a particular question, based on a set of matching variables that are closely related to the missing variable. The missing information is copied from the donor to the recipient. For example, where work and labour force status information is missing for a 35-year-old male, they may be matched with a male in the same age group who possesses this information.
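
The three approaches above can be illustrated with a minimal Python sketch. It is not the census implementation: the function names, record fields and the tiny age distribution used in the usage note are hypothetical, and the donor function simply takes the first matching donor, which is an assumption made for brevity.

```python
import random


def impute_age_deterministic(individual_form, dwelling_form_age):
    """Deterministic: fall back on the age reported on the dwelling form."""
    if individual_form.get("age") is not None:
        return individual_form["age"]
    return dwelling_form_age


def impute_age_stochastic(age_distribution, rng, min_age=0, max_age=None):
    """Stochastic: draw an age from a distribution, restricted to a range.

    age_distribution maps age -> count (or weight) in the population.
    """
    ages = [a for a in age_distribution
            if a >= min_age and (max_age is None or a <= max_age)]
    weights = [age_distribution[a] for a in ages]
    return rng.choices(ages, weights=weights, k=1)[0]


def impute_from_donor(recipient, donors, matching_vars, target):
    """Donor: copy `target` from the first donor agreeing on all matching variables."""
    for donor in donors:
        has_value = donor.get(target) is not None
        if has_value and all(donor[v] == recipient[v] for v in matching_vars):
            return donor[target]
    return None  # no suitable donor found
```

For example, `impute_age_stochastic({10: 5, 40: 3, 80: 2}, random.Random(0), min_age=59)` can only return 80, because the rest home style restriction removes every younger age from the draw.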

These examples and descriptions have been simplified to explain the general methods of imputation we use in the census. Often, one methodology will use several of these imputation approaches, depending on the level of information provided by the respondent.

The census dress rehearsal was held in March 2005 and it tested the imputation methodologies before the census proper. Some parts of the methodologies will be tested in later phases. In addition, all of the methodologies undergo system and user acceptance testing as part of the census testing programme.

The proposed improvements for the 2006 methodologies have been absorbed within the current development timeframe. While taking a little extra time to specify and program, the cost impact has been minimal. These changes are unlikely to negatively affect the processing time, because most of the imputation systems are run automatically, and they will have no effect on the timing of the census outputs.

As the following sections show, no major changes have been proposed to any of the four 2001 imputation methodologies. This minimal change approach was adopted because the 2001 methodologies were successful. Most of the imputation methodologies used in 2006 have been in use since 1996 (over the last two censuses), with only minimal changes being made. The approach maintains consistency between censuses, in line with Quality Goal 2 of the 2006 Census Quality Management Strategy (QMS). It also reduces the pressure on time and budgets, in line with Quality Goal 5 of the 2006 Census QMS.

Internationally, there is a trend towards using donor imputation methods in order to preserve the relationships between variables. Statistics Canada has led this movement with their CANCEIS editing and imputation system.

2. Balancing

This section covers issues relating to the 2006 balancing methodology.

Overview

Balancing is the process where dwelling forms are attached to individual forms in household groups. This information is used to work out the final population and dwelling counts, and to create substitute forms where there is evidence that a form is missing.

In 2001, balancing data went through a three-way check on the number of occupants, the number of names on the household grid, and the number of individual forms received. Households that passed this check were considered balanced, while those that failed it were sent on to household balancing repair. Checks were made against the collectors' field books to try to resolve errors. If inconsistencies could not be resolved, the households were flagged as 'unbalanced'.
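
The three-way check can be sketched as follows. This is an illustration only: it assumes that a household passes when the three counts agree exactly, and the field names are hypothetical.

```python
def is_balanced(occupant_count, grid_name_count, individual_form_count):
    """Three-way check: assumed to pass only when all three counts agree."""
    return occupant_count == grid_name_count == individual_form_count


def split_households(households):
    """Separate balanced households from those needing balancing repair."""
    balanced, needs_repair = [], []
    for household in households:
        passed = is_balanced(
            household["occupants"],
            household["grid_names"],
            household["individual_forms"],
        )
        (balanced if passed else needs_repair).append(household["id"])
    return balanced, needs_repair
```

Households returned in the second list would go on to the repair step, where field book checks either resolve the discrepancy or lead to an 'unbalanced' flag.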

Near the end of processing, checks were performed to look for missing dwelling or individual forms, extra dwelling forms, and unoccupied dwellings. At the very end of form processing, substitutes were created for the missing forms.

The success of the 2001 balancing methodology has led to the 2006 methodology being closely modelled on the 2001 approach.

Internet response option

The introduction of an Internet census form is one significant change between the 2001 and 2006 censuses. This raises the likelihood of additional duplicates (paper and Internet forms) being received. However, this is unlikely to cause problems for the balancing methodology itself, because the Internet forms are integrated into input processing before balancing. Forms will be flagged to identify which mode of return was used and, during balancing, Internet forms will be treated the same as paper forms.

In 2001, duplicates identified at balancing were judged on a case-by-case basis, with one of the forms being deleted. In 2006, this process will be formalised with a set of guidelines for operators on how to handle duplicate forms discovered at the point of balancing.

The operator guidelines for the dress rehearsal cover two or more forms (dwelling, individual or continuation forms) with the same ID numbers. When an operator discovers a duplicate, they will check to ensure the ID and household are correctly recognised and matched. Any errors will be corrected and the operator will check to see if exactly the same information is shown on both forms.

If the forms are identical, the first form will be deleted by the operator (with their team leader). If there are differences between the duplicate forms, then the one with the fewest inconsistencies will be selected (the guidelines detail specific questions to compare between the forms).
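
The decision rule for duplicates can be sketched as below. The guidelines themselves are applied by operators, so this is only a schematic: `count_inconsistencies` is a hypothetical stand-in for the specific question comparisons in the guidelines, and keeping the first form on a tie is an assumption.

```python
def resolve_duplicate(first, second, count_inconsistencies):
    """Return (kept_form, deleted_form) for two forms sharing an ID number."""
    if first["answers"] == second["answers"]:
        # Identical forms: the first form is the one deleted.
        return second, first
    # Different forms: keep the one with the fewest inconsistencies.
    if count_inconsistencies(first) <= count_inconsistencies(second):
        return first, second
    return second, first
```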

The operator guidelines will be reviewed and, if necessary, revised for the census proper.

Other changes

Improvements in the operational side of balancing for 2006 include creating a holding status for loose forms (individual forms without a dwelling form), introducing an edit facility on the dwelling control file, and adding a facility to delete loose forms. There will also be a bulk attachment and deletion facility for loose forms from non-private dwellings. A non-private dwelling register is being introduced, which will help substitute form creation and ensure all non-private dwellings have been processed. These improvements have been introduced to address processing issues identified in 2001.

The age imputation, which is described in Section 6, proposes the copying of the dwelling form ages from a donor household for full substitute households at the same time as copying the number of occupants. This will involve a small addition to the balancing process.

Key changes

The following key changes have been made to the balancing methodology for 2006:

  • introduction of the Internet response option, and specific guidelines for operators on how to handle duplicate forms
  • for full substitute households, dwelling form age will be copied, in addition to number of occupants from donor households.

3. Work and labour force status imputation

This section covers issues relating to the 2006 work and labour force status (WLFS) imputation methodology.

Overview

Derivation 62 classifies people aged 15 years and over living in New Zealand according to their inclusion in, or exclusion from, the labour force, and derives WLFS for the majority of census respondents. The derivation uses a variety of fields to assign each respondent a WLFS of employed (full- or part-time), unemployed, or not in the labour force. Respondents are sent for imputation when there is insufficient information to allocate them to one of these categories.

In 2001, the WLFS imputation was divided into two parts:

  • For persons aged 75 years and over, the WLFS imputation was deterministic, depending on the individual's response to the job indicator question. Positive responses were coded to employed part-time, and negative responses were coded to not in the labour force.
  • For persons aged 15–74, a 'hot deck' donor method was used to impute WLFS. The non-imputed records formed a hot deck from which donors could be found. The deck grew as more records were processed. Donors were paired up with recipients using a set of matching variables, which ensured, on average, the imputation of realistic WLFS depending on other characteristics of the non-respondent.

Overall, the WLFS imputation worked well in 2001, and an approach of minimal change has been adopted for the 2006 methodology.
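
The growing hot deck described above can be sketched in a few lines of Python. This is a simplified illustration, not the 2001 system: the matching variables are passed in by the caller, and it assumes that a recipient takes the value of the most recently processed donor in the same matching cell.

```python
def hot_deck_impute(records, matching_vars, target="wlfs"):
    """Process records in order; the deck grows as non-imputed records arrive."""
    deck = {}  # matching-variable values -> latest donor value for that cell
    results = []
    for record in records:
        key = tuple(record[v] for v in matching_vars)
        out = dict(record)
        out["imputed"] = False
        if record.get(target) is not None:
            deck[key] = record[target]  # respondent joins the deck as a donor
        elif key in deck:
            out[target] = deck[key]     # copy the value from the matched donor
            out["imputed"] = True
        results.append(out)
    return results
```

Because the deck is filled in processing order, a recipient is matched to a donor processed shortly before it. This is why the order in which records are processed influences which donors are used.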

Region as a matching variable

In 2001, it was assumed that any geographic variation in WLFS would be incorporated into the hot deck imputation via the processing order (because records are processed regionally). Unfortunately, the reprocessing of records had a flow-on effect, and the geographic order of processing was lost. To prevent this issue arising in 2006, regional council has now been added as a matching variable for the hot deck imputation.

Setting a donor limit

Although there were no problems in 2001, the issue of repeated donor use was raised. In any hot deck imputation system with many matching variables there is a risk that certain donor records will be used too many times. This can have an undesirable effect, for example, distorting distributions. Because there is an additional matching variable (region), there is a greater risk of repeated donor usage. In order to alleviate this potential problem in 2006, a limit of 15 will be placed on the number of times a specific donor can be used. In 2001, the maximum number of times a donor was used was 14, so reaching the limit of 15 is unlikely, but possible. In situations where the limit is reached, region will be dropped as a matching variable, which will widen the available set of donors.
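
A minimal sketch of the donor limit follows. The limit of 15 and the dropping of region come from the text above. The record fields and matching variable names are hypothetical, and the sketch also falls back to matching without region when no donor matches on region at all, which is a simplifying assumption.

```python
from collections import Counter

DONOR_LIMIT = 15  # maximum number of times one donor record may be used


def find_donor(recipient, donors, matching_vars, usage, limit=DONOR_LIMIT):
    """Return the first matching donor that is still under the usage limit."""
    for donor in donors:
        if usage[donor["id"]] >= limit:
            continue
        if all(donor[v] == recipient[v] for v in matching_vars):
            return donor
    return None


def impute_wlfs(recipient, donors, usage,
                matching_vars=("age_group", "sex", "region")):
    """Impute WLFS, dropping region if no donor under the limit is available."""
    donor = find_donor(recipient, donors, matching_vars, usage)
    if donor is None:
        relaxed = tuple(v for v in matching_vars if v != "region")
        donor = find_donor(recipient, donors, relaxed, usage)
    if donor is None:
        return None
    usage[donor["id"]] += 1  # record that this donor has been used once more
    return donor["wlfs"]
```

Here `usage` is a `Counter` shared across all recipients, so the limit applies over the whole imputation run rather than per recipient.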

Other issues

Early in the 2001 WLFS imputation, an alarming number of records were found to require imputation. This anomaly was due to the dropping of the seeking work variable from the WLFS derivation (which feeds into the imputation). The derivation error was corrected and the records were reprocessed. For 2006, seeking work has been added back into the derivation, which will prevent this issue recurring in the imputation.

Key changes

The following key changes have been made to the 2006 WLFS imputation methodology:

  • Region has been introduced as a matching variable.
  • A donor limit of 15 has been introduced.

4. Sex imputation

This section covers issues relating to the 2006 sex imputation methodology.

Overview

Sex imputation occurs for all New Zealand residents, absentees and overseas visitors, who do not have values for the sex variable. In 2001, two forms of sex imputation were used:

  • Deterministic sex imputation involved operators examining the first name of an individual and their relationship to a reference person.
  • Stochastic sex imputation was applied where a name and/or relationship was not present or did not help identify the sex of an individual. Using stochastic imputation, sex was imputed randomly for private households (and most non-private dwellings) according to a nationally calculated sex distribution. For certain single sex non-private dwellings, different sex distributions were used for the imputation (eg a male prison may have a 94 percent male distribution).
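
The stochastic step can be sketched as a single weighted draw. The 94 percent figure is the male prison example from the text. The national proportion shown is a placeholder rather than the value used in the census, and the dwelling labels are hypothetical.

```python
import random

# Proportion of imputed records assigned 'male', by kind of dwelling.
# 'national' is a placeholder value; 'male_prison' is the example from the text.
PROPORTION_MALE = {"national": 0.49, "male_prison": 0.94}


def impute_sex(dwelling_kind, rng, proportions=PROPORTION_MALE):
    """Draw a sex at random using the distribution for the dwelling kind."""
    p_male = proportions.get(dwelling_kind, proportions["national"])
    return "male" if rng.random() < p_male else "female"
```

Any dwelling kind without its own entry falls back to the national distribution, mirroring the treatment of private households and most non-private dwellings.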

In general, sex imputation worked well in 2001, and an approach of minimal change has been adopted for the creation of the 2006 sex imputation methodology.

Imputing single-sex non-private dwellings

One of the main improvements for 2006 involves the approach to imputing single-sex non-private dwellings. In 2001, a global edit had to be run at the end of processing to fix issues with single-sex non-private dwellings. In 2006, the imputation of sex for non-private dwellings will take place at the code and edit stage, using the dwelling control file variable npdsex. Information from a new non-private dwelling register will update the dwelling control file and single-sex dwellings will be identifiable by their npdsex codes.

The Statistical Methods unit investigated the proportion of males imputed into non-private dwellings and determined that only very minor changes to the proportions used in 2001 were needed.

Other issues

In order to aid evaluation, an increased number of monitor codes will be used in the 2006 sex imputation. The monitor codes will be applied automatically when a change to sex is made. This is a change from 2001, when the deterministic monitor codes were applied manually, and there was a risk that operators would forget to turn the monitor code on.

There is a risk of creating same-sex couples when sex is imputed. If left in the data, these spurious couples may distort the numbers of genuine same-sex couples. To combat this problem, couples that have at least one stochastically imputed member will be edited after family coding, and the sex of the imputed partner will be changed.
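
The couple edit can be sketched as follows. The field names are hypothetical, and where both partners were stochastically imputed the sketch changes only the first, which is an assumption the text does not address.

```python
def fix_imputed_couple(partner_a, partner_b):
    """Flip the stochastically imputed partner's sex in a same-sex couple.

    Couples where neither sex was stochastically imputed are left untouched,
    so genuine same-sex couples are preserved.
    """
    if partner_a["sex"] != partner_b["sex"]:
        return False  # not a same-sex couple, nothing to edit
    for partner in (partner_a, partner_b):
        if partner["sex_imputed_stochastically"]:
            partner["sex"] = "female" if partner["sex"] == "male" else "male"
            return True
    return False
```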

Key changes

The following key changes have been made:

  • improved methods for imputing single-sex non-private dwellings
  • an increased number of sex imputation monitoring codes.

5. Usual residence meshblock imputation

This section covers issues relating to the 2006 usual residence meshblock (URM) imputation methodology.

Overview

URM imputation assigns a meshblock to all New Zealand residents (including substitutes) who are away from home, but elsewhere in New Zealand, on census night. URM imputation is performed for census as well as electoral purposes. The imputation is based on the assumption that similar people live near each other.

Information provided by the respondent is first used to try to identify the location down to an area unit. If an area unit cannot be identified, the most detailed geographic information provided (eg territorial authority, regional council or workplace regional council) is matched up with predictor variables. These predictor variables are used to assign a probability that the respondent lived in a particular area unit, based upon the characteristics of non-imputees.

Area units with fewer than 350 people, and those not in a territorial authority (mostly meshblocks covering a sea, lake or water area), are excluded from having people imputed into them. If there are insufficient numbers of respondents who match the predictor variables, one or more of the predictors can be collapsed to increase the number of matching donors.
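
The exclusion rule and the collapsing of predictors can be sketched as below. Several assumptions are made for illustration: collapsing is treated as dropping a predictor entirely, predictors are dropped from the end of the list first, and the `min_donors` threshold and field names are hypothetical.

```python
def eligible_area_units(populations, in_territorial_authority, minimum=350):
    """Area units that may receive imputed people."""
    return {
        unit for unit, population in populations.items()
        if population >= minimum and in_territorial_authority[unit]
    }


def matching_donors(recipient, donors, predictors, min_donors):
    """Collapse predictors one at a time until enough donors match.

    Returns the matching donors and the predictors that remained in use.
    """
    active = list(predictors)
    while True:
        matches = [d for d in donors
                   if all(d[p] == recipient[p] for p in active)]
        if len(matches) >= min_donors or not active:
            return matches, active
        active.pop()  # collapse the last (assumed least important) predictor
```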

Once all respondents have an area unit (either known or imputed), a meshblock is randomly chosen from within that area unit based on the relative populations of each meshblock. This methodology ensures the imputed respondents are spread over a variety of meshblocks, and located near similar people.
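
The final step, choosing a meshblock in proportion to population, amounts to a weighted random draw. A minimal sketch, with hypothetical meshblock identifiers and populations:

```python
import random


def choose_meshblock(meshblock_populations, rng):
    """Pick a meshblock within an area unit, weighted by its population."""
    meshblocks = list(meshblock_populations)
    weights = [meshblock_populations[m] for m in meshblocks]
    return rng.choices(meshblocks, weights=weights, k=1)[0]
```

With populations of `{"MB1": 300, "MB2": 100}`, about three quarters of imputed respondents would be placed in MB1 over many draws.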

The 2001 URM methodology worked well, and it has therefore been maintained for 2006. The following sections describe the details of minor changes made to the 2006 URM imputation methodology.

Predictor variables

In 2006, the Statistical Methods unit investigated the predictor variables used in URM imputation. The analysis demonstrated that occupation was closely correlated with employment status, and it was therefore not needed as a predictor variable. In 2006, workplace address will be used to give an estimate of usual residence to regional council level. The investigation also recommended sex be dropped as a predictor variable because of its low predictive power.

Other issues

In 2001, URM imputation was carried out as part of the output system in order to meet the timing of electoral requirements. An alternative is to run URM imputation at the end of processing. The Statistical Methods unit investigated this proposal and found no evidence that moving URM imputation to the end of processing would significantly improve quality. This conclusion, coupled with the additional risks and complexities of shifting the imputation, and the fact that region is now used in the WLFS imputation, led to the decision to leave URM imputation in the output system for 2006. Separately, the procedure of collapsing imputation cells when the number of donors is low has been streamlined for 2006.

Key changes

The following key change has been made to the 2006 URM imputation methodology:

  • Sex was dropped, and employment status added, as a predictor variable.

6. Age imputation

This section covers issues relating to the 2006 age imputation methodology.

Overview

Age imputation assigns an age to all people present in New Zealand on census night (including absentees, overseas visitors and substitutes) when this information is not provided by the respondent. Age information is requested in two places on the census forms: on the individual form, Question 4 asks for date of birth; and on the dwelling form, Question 6 asks for the age of each resident. The age imputation methodology consists of several parts. The following paragraph describes the methodology used in 2001.

If a valid age was shown on the dwelling form, this was copied into the age variable in a deterministic imputation. If records could not be imputed by this method, a stochastic imputation was carried out. The stochastic imputation set minimum and maximum age limits with a default range of 0–90. Stochastic age imputation was positioned after sex imputation and family coding to enable the output from these procedures to be used in determining an age.

Information about the respondent (eg marital status, dwelling type, living arrangements, qualifications, income, and years in New Zealand) was used to narrow the range of imputable ages. Despite this, the majority of respondents requiring age imputation in 2001 were from full substitute households. These respondents were imputed from the general distributions, because there was no further information on which to base the imputation.
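
The narrowing of the stochastic age range can be sketched as follows. The default range of 0–90 is from the text. The individual constraints and the age distribution are supplied by the caller and are hypothetical here, and ignoring a constraint that would leave an empty range is an assumption.

```python
import random

DEFAULT_RANGE = (0, 90)  # default minimum and maximum imputable ages


def narrow_range(current, constraint):
    """Intersect two (min, max) ranges, ignoring a conflicting constraint."""
    low = max(current[0], constraint[0])
    high = min(current[1], constraint[1])
    return (low, high) if low <= high else current


def impute_age(age_distribution, constraints, rng):
    """Draw an age from the distribution, restricted to the narrowed range."""
    low, high = DEFAULT_RANGE
    for constraint in constraints:
        low, high = narrow_range((low, high), constraint)
    ages = [a for a in range(low, high + 1) if age_distribution.get(a, 0) > 0]
    weights = [age_distribution[a] for a in ages]
    return rng.choices(ages, weights=weights, k=1)[0]
```

A respondent from a full substitute household supplies no constraints, so the draw is made across the whole default range, which is the situation described above.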

Age imputation worked well in 2001, and an approach of minimal change has been adopted for the 2006 methodology.

Age imputation in substitute households from donors

In 2006, instead of imputing age for substitute persons in full substitute households using the general distribution, the age on the dwelling form will be copied from a donor household. This will be known as Step 1 of the age imputation process, and had it been employed in 2001, approximately 80 percent of the imputees would have been imputed by this method. This new method maintains the natural distribution of ages within households. The number of occupants and their ages are transferred from a donor household to a substitute at the balancing stage.
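
Step 1 can be sketched as a copy from a donor household to a substitute household. How the donor household is selected is not covered here, so the donor is taken as given, and the field names are hypothetical.

```python
def create_substitute_household(dwelling_id, donor_household):
    """Copy the number of occupants and their dwelling form ages from a donor."""
    ages = list(donor_household["dwelling_form_ages"])
    return {
        "dwelling_id": dwelling_id,
        "occupants": len(ages),
        "people": [
            {"age": age, "substitute": True, "age_imputation_step": 1}
            for age in ages
        ],
    }
```

Because the ages are copied together as a set, a donor household of two adults and two children yields a substitute household with the same age structure, rather than four independent draws from the general distribution.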

Changes to stochastic age imputation

The stochastic imputation used in 2006 will vary slightly from that used in 2001. Default minimum and maximum ages are set and refined using other information. These minimum and maximum ages have been revised for 2006 on the basis of the 2001 distributions. Two stochastic methods have been dropped to simplify age imputation. The stochastic age range is narrowed using the person's closest relative, with ranges available for partner, child, parent and same-generation relatives. The imputation is dependent on family coding information. Changes to family coding have caused flow-on changes to the age imputation methodology.

Other issues

In 2006, family coding operators will be able to impute incorrectly recognised ages using dwelling form data. This will require the permission of the family coding team leader, and will only be applied where there is clear evidence of an incorrect age (eg a child aged seven has been incorrectly recognised as 70). This represents a change from 2001, when the operators could not alter incorrectly recognised age data. In order to monitor the imputation more closely, new monitor codes have been introduced.

Key changes

The following key changes have been made to the 2006 age imputation methodology:

  • Substitute households will be imputed from donors as opposed to the general distribution.
  • Changes have been made to stochastic imputation, stemming from changes in family coding methodology and updated age distributions.
  • Stochastic age imputation has been simplified.
