How should we derive ethnicity from linked individual responses?

We have several sources of a person’s ethnicity, each with different coverage patterns and different quality issues. An individual may have ethnicity recorded in anywhere from one to six of the main sources, in any combination. We need to determine the ‘best’ ethnic response for each individual. We have compared two basic ways of solving this problem.

Ever-recorded ethnicity

The method used in the IDI personal details table up to 2015 is to take a ‘yes’ on any individual source for each ethnic group to be a ‘yes’ in the final ethnic profile, regardless of what is recorded in other sources.

However, this method is likely to result in too many people being counted as members of some ethnic groups. This is because every ‘yes’ response, whether it arises from a mistake at source, a linking error in the IDI, or a change over time in the person’s self-identification, is carried through to the person’s final ethnic profile.

Up to 10 percent of individuals have more than one ethnicity recorded in the census and in individual administrative sources, but approximately 20 percent of individuals have more than one ethnicity in the personal details table.
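As a minimal sketch of the ‘ever-recorded’ rule, assuming each source is held as a set of ethnic groups recorded for one person (the data layout, source keys, and function name are illustrative assumptions, not the IDI implementation):

```python
# Hypothetical layout: each source maps to the set of ethnic groups
# for which it records a 'yes' for one person.
def ever_recorded(source_responses):
    """Return the union of all ethnic groups recorded in any source."""
    profile = set()
    for groups in source_responses.values():
        profile |= groups
    return profile


# Made-up example: one source carries an extra 'yes', possibly an error.
person = {
    "births": {"European"},
    "moh": {"European"},
    "acc": {"European", "Māori"},
}
print(sorted(ever_recorded(person)))  # ['European', 'Māori']
```

The single extra ‘yes’ in one source is enough to add that group to the final profile, which is the overcounting mechanism described above.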

Source ranking

Alternatively, we could define a ranking of the administrative sources and use the highest-ranked information available for each individual. This method would be expected to reduce some of the overcounting problems seen in the ‘ever-recorded’ approach, but would not be able to correct for linking errors and measurement errors in particular sources.

Using the earlier results for individual sources, we created three different source rankings to use for comparison and to investigate how sensitive the results are to the ranking chosen. These example rankings are shown in table 6.

We put births first in each ranking, as it was the best source on all measures. It covers approximately 18 percent of linked census people. Different rankings of the education and health data were tested because they each have good quality but different patterns of performance for different ethnic profiles. Depending on the ordering, health was the source for 30 percent to 80 percent of records. The education sources each accounted for only a few percent of people when ranked below health, but for up to 14 percent (schools) and up to 44 percent (tertiary enrolments) when ranked above health.

Table 6

 The three source rankings used to decide from which source to take an individual’s ethnicity response


 Rank  Ranking 1      Ranking 2      Ranking 3
 1     Births         Births         Births
 2     MOH            MOE schools    MOE tertiary
 3     MOE schools    MOH            MOE schools
 4     MOE tertiary   MOE tertiary   MOH
 5     ACC claims     ACC claims     ACC claims
 6     MSD benefits   MSD benefits   MSD benefits
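The source ranking rule can be sketched as follows. The three orderings are taken from table 6; the source keys, the data layout, and the choice to treat an empty response as ‘not available’ are assumptions made for illustration.

```python
# Orderings from table 6; the short source keys are hypothetical labels.
RANKINGS = {
    1: ["births", "moh", "moe_schools", "moe_tertiary", "acc", "msd"],
    2: ["births", "moe_schools", "moh", "moe_tertiary", "acc", "msd"],
    3: ["births", "moe_tertiary", "moe_schools", "moh", "acc", "msd"],
}


def ranked_source(source_responses, ranking):
    """Return (source, groups) from the highest-ranked source with a response.

    Assumption: a missing key or an empty set means the source has no
    ethnicity response for this person.
    """
    for source in ranking:
        groups = source_responses.get(source)
        if groups:
            return source, set(groups)
    return None, set()


# Made-up example person with no birth record.
person = {
    "moh": {"European"},
    "moe_tertiary": {"European", "Māori"},
    "acc": {"Māori"},
}
for number, ranking in RANKINGS.items():
    print(number, ranked_source(person, ranking))
```

For this person, rankings 1 and 2 take the health response, while ranking 3 takes the tertiary enrolment response, so the final profile depends on the ordering chosen.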

Comparison of the methods

We can use the same comparisons as we did for individual sources to evaluate different methods for combining ethnicity from multiple sources. Table 7 is the same as table 4 except it uses the results of the combination rules described in the previous section.

Table 7

 Comparison of total response counts between census and different methods of combining sources

 Columns: Ethnic group; combined administrative sources to census ratio for total responses to ethnic group (Ranking 1, Ranking 2, Ranking 3)

 [Table values not preserved in this copy. The only surviving row label is Pacific peoples.]

 Note: MELAA = Middle Eastern/Latin American/African
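A ratio of this kind could be computed along the following lines. This is a sketch only: it assumes both inputs cover the same linked population and that the ratio is the combined-source total response count divided by the census total response count for each group. The function names and the toy data are illustrative and are not Stats NZ figures.

```python
from collections import Counter


def total_response_counts(profiles):
    """Count people per ethnic group; a person counts once in each group they hold."""
    counts = Counter()
    for groups in profiles.values():
        counts.update(groups)
    return counts


def ratio_to_census(combined, census):
    """Ratio of combined-source count to census count for each census group."""
    combined_counts = total_response_counts(combined)
    census_counts = total_response_counts(census)
    return {group: combined_counts[group] / n for group, n in census_counts.items()}


# Toy data keyed by a hypothetical person identifier.
census = {1: {"European"}, 2: {"European", "Māori"}, 3: {"Māori"}, 4: {"Asian"}}
ever = {1: {"European", "Māori"}, 2: {"European", "Māori"}, 3: {"Māori"}, 4: {"Asian"}}
ranked = {1: {"European"}, 2: {"European"}, 3: {"Māori"}, 4: {"Asian"}}

print(ratio_to_census(ever, census))
print(ratio_to_census(ranked, census))
```

A ratio above 1 indicates that the combined sources count more people in a group than the census does, and a ratio below 1 indicates fewer.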

The pattern of this table demonstrates the limitations of the ‘ever-recorded’ ethnicity method used in the IDI personal details table. The ‘ever-recorded’ method overestimates the number of people in all ethnic groups, and for the Māori, Pacific peoples, and especially MELAA groups, the overestimate is very large.

In contrast, the ranked source method underestimates the number of people in each ethnic group. That is, people tend to be missing ethnicities that they reported in the census. The different rankings do produce different results, but the general pattern is much the same. Ranking 3 performs best overall across the ethnic groups, and for Māori in particular.

In figure 6 we have compared the ethnicity aggregation methods using the same ethnic profile comparison as we used in figure 5 for the individual sources. The same definitions of ethnic profile agreement were used for these graphs as for the ones in the earlier section. We used ranking 3 in the comparison because it performed the best in the total response ratio comparison in table 7.

Figure 6
Graph, Percent agreement with census ethnic profiles, for two methods of combining sources.

The ranked source method results in higher rates of agreement than the ever-recorded ethnicity method for all of the single ethnic group profiles. The agreement rates for multiple ethnicity profiles are mixed: ‘European and Māori’ shows closer agreement with census under the ever-recorded method, while the other two-way ethnic combinations agree slightly better under the ranked source method.
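The definition of ethnic profile agreement comes from an earlier section that is not reproduced here. The sketch below therefore assumes one plausible definition: among people with a given census profile, the percentage whose combined-source profile is exactly the same set of groups. The function name and toy data are illustrative only.

```python
from collections import defaultdict


def profile_agreement(combined, census):
    """Percent of people per census profile whose combined profile matches exactly.

    Assumption: 'agreement' means the two sets of ethnic groups are identical.
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for person_id, census_groups in census.items():
        profile = frozenset(census_groups)
        totals[profile] += 1
        if frozenset(combined.get(person_id, ())) == profile:
            matches[profile] += 1
    return {p: 100.0 * matches[p] / totals[p] for p in totals}


# Toy data keyed by a hypothetical person identifier.
census = {1: {"European"}, 2: {"European", "Māori"}, 3: {"Māori"}}
ever = {1: {"European", "Māori"}, 2: {"European", "Māori"}, 3: {"Māori"}}
ranked = {1: {"European"}, 2: {"European"}, 3: {"Māori"}}

for name, combined in [("ever-recorded", ever), ("ranked", ranked)]:
    for profile, pct in profile_agreement(combined, census).items():
        print(name, sorted(profile), pct)
```

In this toy data the ranked profiles agree on the single-group census profiles but drop a group from the two-group profile, while the ever-recorded profiles do the reverse.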

Other methods

We have compared two simple but contrasting methods of combining multiple data sources. The results show that the choice of method can have a major impact on the estimates for different ethnic groups. Figure 6 and table 7 show clearly that the ranked source method performs better overall than the ever-recorded method. They also show, however, that the ranked source method is still far from perfect, especially for people with more than one ethnicity in the census. Although we will always be ultimately limited by the accuracy of the administrative source data, complex rulesets or statistical models could improve on the simple methods presented above.

One possibility would be to introduce a more complex set of rules, which could be based on a majority vote idea (eg if 3 out of 4 sources say a person has an ethnicity, then assign them that ethnicity) or on a combined rank/vote system (eg the ethnicity reported in a high-ranked source might be overruled if three ‘low-quality’ sources disagree). Removing older data may mean results align more closely with census results.
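The two rule ideas above could be sketched as follows. Both functions are illustrative assumptions rather than tested methods: the vote threshold, the overrule count of three, the source keys, and the data layout (one set of groups per source with a record) are all hypothetical choices.

```python
def majority_vote(source_responses, threshold=0.5):
    """Assign a group if more than `threshold` of the person's sources record it."""
    sources = list(source_responses.values())
    if not sources:
        return set()
    candidates = set().union(*sources)
    return {
        group
        for group in candidates
        if sum(group in s for s in sources) / len(sources) > threshold
    }


def rank_with_overrule(source_responses, ranking, overrule=3):
    """Start from the top-ranked source; flip its answer for a group when at
    least `overrule` lower-ranked sources disagree with it on that group."""
    available = [s for s in ranking if s in source_responses]
    if not available:
        return set()
    top, rest = available[0], available[1:]
    profile = set(source_responses[top])
    candidates = set(profile).union(*(source_responses[s] for s in rest))
    for group in candidates:
        top_says = group in source_responses[top]
        disagreeing = sum((group in source_responses[s]) != top_says for s in rest)
        if disagreeing >= overrule:
            if top_says:
                profile.discard(group)
            else:
                profile.add(group)
    return profile


# Made-up person: three of four sources record Māori, the top-ranked one does not.
ranking = ["births", "moh", "moe_schools", "moe_tertiary", "acc", "msd"]
person = {
    "births": {"European"},
    "moh": {"European", "Māori"},
    "moe_schools": {"European", "Māori"},
    "acc": {"European", "Māori"},
}
print(sorted(majority_vote(person)))
print(sorted(rank_with_overrule(person, ranking)))
```

Here both rules add Māori to the profile, whereas the plain ranked source method would take the births response alone.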

Statistical models and machine learning methods can effectively create rulesets that are more complex and optimised according to some statistical measure. An example is latent class analysis, which has been used to analyse questionnaire responses where multiple questions try to measure the same concept but appear to be unreliable or inconsistent (Biemer, 2011). Machine learning methods such as classification trees, which are designed for similar problems, could also be investigated.
