Assessing the predictive accuracy of official-statistics registers. The Global Mean Squared Error measure (Invited Talk @ ITACOSM2023)

Abstract

The Italian National Statistical Institute (ISTAT) is undergoing a significant modernization process, moving from a statistical paradigm based solely on independent sample and census surveys to an integrated system of statistical registries (ISSR). The ISSR results from a comprehensive integration of administrative and survey data and is intended to form the basis of all official statistics. For the ISSR to serve as a single informative infrastructure, different statistical techniques (such as record linkage, statistical matching, or imputation/prediction) have been adopted to integrate the sources and reconstruct unit-level data. For example, the attained level of education was reconstructed by means of a sequence of log-linear models that impute the education level by combining the information contained in administrative data (provided by the Ministry of Education, University and Research; MIUR), census data (the 2011 Italian census), and sample survey data. The reconstructed variable is thus the result of statistical processes subject to different sources of statistical uncertainty. These include, among others, the sampling errors due to sample surveys and the errors introduced by the statistical models used to impute or predict missing data at the unit level. The aim of this work is to evaluate, compare and propose novel methods for assessing the quality of data derived for specific cases of the ISSR. More specifically, our goal is to establish computationally feasible statistical measures of the accuracy of the provided estimates in the specific context of the attained level of education. In this work, we focus on the estimates of population totals, and allow for specific user-defined domains, e.g., the number of illiterate individuals in the province of Bologna. We build on the recent proposal described in Alleva et al. (2021: J Off Stat, 37, 481-503), where a new global measure of the estimation error, the Global Mean Squared Error (GMSE), is developed and assessed by simulations in the case of a logistic model. Accounting for two types of uncertainty, i.e., the sampling and the modeling uncertainty, we generalize the GMSE to multinomial response variables to reflect the categorical nature of the education level (8 categories overall). The underlying strategy is based on three linearization steps, and it only involves the first two moments of the distribution. We evaluate the proposed method on a subsample of the Base Register of Individuals (BRI) of the ISSR and validate it with a bootstrap-based approach carried out on the same data. The GMSE proves to be a reliable, interpretable, computationally feasible, and flexible approach, which may allow different users to evaluate the accuracy of their own statistics produced from the ISSR. With this work, we aim to support the advancement of current practices in official statistics, facilitating a flexible yet correct use of registry data by complementing the production of estimates with measures of their quality.
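
The bootstrap-based validation mentioned in the abstract can be illustrated with a toy sketch. The code below is not the GMSE linearization and not the procedure used on the BRI; it is a minimal illustration under stated assumptions: all data are simulated, the multinomial model is a simple saturated one (smoothed category proportions within hypothetical covariate strata), and every function and variable name is made up for this example. It predicts a domain total for one education category and uses a bootstrap of the labelled sample to approximate the squared error induced by refitting the model. The unit-level prediction variability is deliberately left out.

```python
import numpy as np

def fit_cell_probs(strata, labels, n_strata, n_cat, alpha=0.5):
    """Smoothed within-stratum category proportions (a saturated multinomial model)."""
    counts = np.zeros((n_strata, n_cat))
    np.add.at(counts, (strata, labels), 1.0)   # cross-tabulate stratum by category
    counts += alpha                            # additive smoothing avoids zero cells
    return counts / counts.sum(axis=1, keepdims=True)

def predicted_total(probs, register_strata, in_domain, category):
    """Sum of predicted probabilities of `category` over register units in the domain."""
    return probs[register_strata[in_domain], category].sum()

def bootstrap_mse(sample_strata, sample_labels, register_strata, in_domain,
                  category, n_strata, n_cat, n_boot=200, seed=0):
    """Point prediction and bootstrap mean squared deviation around it."""
    rng = np.random.default_rng(seed)
    probs = fit_cell_probs(sample_strata, sample_labels, n_strata, n_cat)
    point = predicted_total(probs, register_strata, in_domain, category)
    n = len(sample_labels)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)       # resample labelled units with replacement
        probs_b = fit_cell_probs(sample_strata[idx], sample_labels[idx], n_strata, n_cat)
        reps[b] = predicted_total(probs_b, register_strata, in_domain, category)
    return point, np.mean((reps - point) ** 2)

# Simulated inputs (assumptions for illustration only).
rng = np.random.default_rng(42)
n_strata, n_cat = 4, 8                         # 8 categories, as for the education level
register_strata = rng.integers(0, n_strata, size=5000)
in_domain = rng.random(5000) < 0.3             # hypothetical user-defined domain
true_probs = rng.dirichlet(np.ones(n_cat), size=n_strata)
sample_strata = rng.integers(0, n_strata, size=500)
sample_labels = np.array([rng.choice(n_cat, p=true_probs[s]) for s in sample_strata])

point, mse = bootstrap_mse(sample_strata, sample_labels, register_strata,
                           in_domain, category=0, n_strata=n_strata, n_cat=n_cat)
print(f"predicted domain total: {point:.1f}, bootstrap RMSE: {np.sqrt(mse):.1f}")
```

In a comparison of the kind described in the abstract, a bootstrap quantity like this would serve as the benchmark against which an analytic, linearization-based measure is checked.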

Date
Jun 8, 2023 12:00 AM
Location
University of Calabria