Important empirical information on household behavior and finances is obtained from surveys, and these data are used heavily by researchers, central banks, and policy consultants. However, various interdependent factors that can be controlled only to a limited extent lead to unit and item nonresponse, and missing data on certain items are a frequent source of difficulties in statistical practice. It is therefore more important than ever to explore techniques for the imputation of large survey data sets. This paper presents the theoretical underpinnings of a Markov chain Monte Carlo (MCMC) multiple imputation procedure and outlines important technical aspects of applying MCMC-type algorithms to large socio-economic data sets. In an illustrative application, we find that MCMC algorithms have good convergence properties even on large data sets with complex patterns of missingness, and that using a rich set of covariates in the imputation models has a substantial effect on the distributions of key financial variables.
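To make the imputation idea concrete, here is a minimal sketch (not the abstract's procedure) of "proper" multiple imputation in a normal linear model: model parameters are drawn from their posterior, then missing values are drawn from the resulting predictive distribution, repeated to produce several completed data sets. All settings (the helper `draw_params`, the 30% missingness rate, the flat prior) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2000
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)
miss = rng.random(n) < 0.3        # 30% of y missing completely at random
y_obs = y.copy()
y_obs[miss] = np.nan

X = np.column_stack([np.ones(n), x])
obs = ~miss

def draw_params(Xo, yo, rng):
    """One posterior draw of (beta, sigma) in a normal linear model with
    a flat prior -- the 'proper' parameter-draw step of multiple imputation."""
    beta_hat, *_ = np.linalg.lstsq(Xo, yo, rcond=None)
    resid = yo - Xo @ beta_hat
    df = len(yo) - Xo.shape[1]
    sigma2 = resid @ resid / rng.chisquare(df)
    beta = rng.multivariate_normal(beta_hat, sigma2 * np.linalg.inv(Xo.T @ Xo))
    return beta, np.sqrt(sigma2)

imputations = []
for m in range(5):                # five completed data sets
    beta, sigma = draw_params(X[obs], y_obs[obs], rng)
    y_imp = y_obs.copy()
    y_imp[miss] = X[miss] @ beta + rng.normal(scale=sigma, size=miss.sum())
    imputations.append(y_imp)
```

Because the covariate is fully observed here (a monotone missingness pattern), a single draw-and-impute pass per data set is already proper; general patterns require iterating such steps, which is where MCMC-type algorithms come in.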

In this work, we consider a hierarchical spatio-temporal model for particulate matter (PM) concentration in the North-Italian region Piemonte. The model involves a Gaussian Field (GF), affected by a measurement error, and a state process characterized by a first-order autoregressive dynamic and spatially correlated innovations. This kind of model is well discussed and widely used in the air quality literature thanks to its flexibility in modelling the effect of relevant covariates (i.e. meteorological and geographical variables) as well as time and space dependence. However, Bayesian inference through Markov chain Monte Carlo (MCMC) techniques can be a challenge due to convergence problems and heavy computational loads. In particular, the computational issue is the infeasibility of linear algebra operations involving the large, dense covariance matrices that arise with large spatio-temporal datasets. The main goal of this work is to present an effective estimation and spatial prediction strategy for the considered spatio-temporal model. The proposal consists of representing a GF with Matérn covariance function as a Gaussian Markov Random Field (GMRF) through the Stochastic Partial Differential Equations (SPDE) approach. The main advantage of moving from a GF to a GMRF stems from the good computational properties that the latter enjoys. In fact, GMRFs are defined by sparse matrices that allow for computationally effective numerical methods. Moreover, when dealing with Bayesian inference for GMRFs, it is possible to adopt the Integrated Nested Laplace Approximation (INLA) algorithm as an alternative to MCMC methods, giving rise to additional computational advantages. The implementation of the SPDE approach through the R library INLA (www.r-inla.org) is illustrated with reference to the Piemonte PM data.
In particular, providing step-by-step R code, we show how easy it is to obtain prediction and exceedance probability maps in a reasonable computing time.

The analysis of time series of counts is an emerging field of research. To obtain an ARMA-like autocorrelation structure, many models make use of thinning operations to adapt the ARMA recursion to the integer-valued case. Most popular among these probabilistic operations is the concept of binomial thinning, leading to the class of INARMA models. These models have proved useful, especially for processes of Poisson counts, but may lead to difficulties in the case of other count distributions. Therefore, several alternative thinning concepts have been developed. This article reviews such thinning operations and shows how they can be successfully applied to define integer-valued ARMA models.
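The binomial thinning operation is easy to state concretely: $\alpha \circ X$ is a Binomial($X$, $\alpha$) draw. A minimal simulation sketch of a Poisson INAR(1) process, $X_t = \alpha \circ X_{t-1} + \varepsilon_t$ (function name and parameter values are illustrative):

```python
import numpy as np

def simulate_inar1(alpha, lam, n, seed=None):
    """Simulate an INAR(1) process X_t = alpha ∘ X_{t-1} + eps_t, where
    'alpha ∘ X' is binomial thinning, i.e. a Binomial(X, alpha) draw,
    and eps_t are i.i.d. Poisson(lam) innovations."""
    rng = np.random.default_rng(seed)
    x = np.empty(n, dtype=int)
    x[0] = rng.poisson(lam / (1.0 - alpha))  # stationary Poisson marginal
    for t in range(1, n):
        survivors = rng.binomial(x[t - 1], alpha)  # thinned previous count
        x[t] = survivors + rng.poisson(lam)        # plus new arrivals
    return x

series = simulate_inar1(alpha=0.5, lam=2.0, n=20000, seed=42)
```

For this model the lag-one autocorrelation equals $\alpha$ (here 0.5) and the stationary marginal is Poisson($\lambda/(1-\alpha)$), mimicking the AR(1) recursion while keeping the process integer-valued.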

On the one hand, kernel density estimation has become a common tool for empirical studies in virtually any research area. This goes hand in hand with the fact that this kind of estimator is now provided by many software packages. On the other hand, the discussion on bandwidth selection has been going on for about three decades. Although a good part of that discussion concerns nonparametric regression, this parameter choice is by no means less problematic for density estimation. This becomes obvious when reading empirical studies in which practitioners have made use of kernel densities. New contributions typically provide simulations only to show that their own selector outperforms some of the existing methods. We review existing methods and compare them on a set of designs that exhibit few bumps and exponentially falling tails. We concentrate on small and moderate sample sizes because, for large ones, the differences between consistent methods are often negligible, at least for practitioners. As a byproduct, we find that a mixture of simple plug-in and cross-validation methods produces bandwidths with quite stable performance.
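As a rough illustration of the two selector families compared in such studies, the following sketch implements Silverman's rule-of-thumb plug-in and a grid-search least-squares cross-validation (LSCV) criterion for a Gaussian kernel. This is a didactic sketch, not any of the reviewed selectors; function names and the grid are illustrative.

```python
import numpy as np

def silverman_bandwidth(x):
    """Silverman's rule-of-thumb plug-in bandwidth for a Gaussian kernel."""
    n = len(x)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    scale = min(np.std(x, ddof=1), iqr / 1.34)
    return 0.9 * scale * n ** (-1 / 5)

def lscv_bandwidth(x, grid):
    """Least-squares cross-validation: pick h minimising an unbiased
    estimate of the integrated squared error (Gaussian kernel)."""
    x = np.asarray(x)
    n = len(x)
    d = x[:, None] - x[None, :]
    best_h, best_score = None, np.inf
    for h in grid:
        u = d / h
        # Integral of fhat^2 uses the kernel convolution K*K = N(0, 2);
        # the cross term is the leave-one-out density at each data point.
        conv = np.exp(-u ** 2 / 4) / (2 * np.sqrt(np.pi))
        kern = np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)
        int_f2 = conv.sum() / (n ** 2 * h)
        loo = (kern.sum() - n * kern[0, 0]) / (n * (n - 1) * h)
        score = int_f2 - 2 * loo
        if score < best_score:
            best_h, best_score = h, score
    return best_h

rng = np.random.default_rng(0)
x = rng.normal(size=400)
h_pi = silverman_bandwidth(x)
h_cv = lscv_bandwidth(x, np.linspace(0.05, 1.0, 40))
```

A crude "mixture" in the spirit of the byproduct mentioned above could, for instance, combine the two bandwidths; the selectors actually compared in the paper are more refined.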

With the influx of complex and detailed tracking data gathered from electronic tracking devices, the analysis of animal movement data has recently emerged as a cottage industry among biostatisticians. New approaches of ever greater complexity continue to be added to the literature. In this paper, we review what we believe to be some of the most popular and most useful classes of statistical models used to analyse individual animal movement data. Specifically, we consider discrete-time hidden Markov models, more general state-space models, and diffusion processes. We argue that these models should be core components in the toolbox for quantitative researchers working on stochastic modelling of individual animal movement. The paper concludes by offering some general observations on the direction of statistical analysis of animal movement. There is a trend in movement ecology towards what are arguably overly complex modelling approaches that are inaccessible to ecologists, unwieldy with large data sets, or not based on mainstream statistical practice. Additionally, some analysis methods developed within the ecological community ignore fundamental properties of movement data, potentially leading to misleading conclusions about animal movement. Corresponding approaches, e.g. based on Lévy walk-type models, continue to be popular despite having been largely discredited. We contend that there is a need for an appropriate balance between the extremes of being either overly complex or overly simplistic, whereby the discipline relies on models of intermediate complexity that are usable by general ecologists, grounded in well-developed statistical practice, and efficient to fit to large data sets.
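As a concrete example of the first model class above, here is a minimal sketch of the forward algorithm for a two-state hidden Markov model with normal emissions (a stand-in for, e.g., log step lengths). The state labels, parameter values, and function names are illustrative assumptions, not a published animal-movement model.

```python
import numpy as np

def norm_logpdf(x, mu, sd):
    return -0.5 * np.log(2 * np.pi * sd ** 2) - (x - mu) ** 2 / (2 * sd ** 2)

def hmm_loglik(log_emis, trans, init):
    """Log-likelihood of a discrete-time HMM via the forward algorithm,
    rescaling at each step to avoid numerical underflow.
    log_emis[t, k] is the log-density of observation t under state k."""
    alpha = init * np.exp(log_emis[0])
    ll = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for t in range(1, log_emis.shape[0]):
        alpha = (alpha @ trans) * np.exp(log_emis[t])
        ll += np.log(alpha.sum())
        alpha = alpha / alpha.sum()
    return ll

rng = np.random.default_rng(5)
steps = rng.normal(size=200)  # stand-in for, e.g., log step lengths
# Two hypothetical behavioural states ('resting' vs 'travelling'):
log_emis = np.column_stack([norm_logpdf(steps, 0.0, 1.0),
                            norm_logpdf(steps, 3.0, 1.0)])
trans = np.array([[0.9, 0.1], [0.2, 0.8]])  # persistent state dynamics
init = np.array([0.5, 0.5])
ll = hmm_loglik(log_emis, trans, init)
```

The forward recursion evaluates the likelihood in time linear in the series length, which is a key reason discrete-time HMMs remain efficient to fit to large tracking data sets.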

Composite marginal likelihoods are pseudolikelihoods constructed by compounding marginal densities. In several applications, they are convenient surrogates for the ordinary likelihood when it is too cumbersome or impractical to compute. This paper presents an overview of the topic with emphasis on applications.
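A standard instance of the idea is the pairwise likelihood, which sums bivariate log-densities over all pairs of variables in place of the full joint density. A minimal sketch for an equicorrelated multivariate normal (grid search instead of a proper optimizer, for transparency; all settings illustrative):

```python
import numpy as np

def pairwise_loglik(data, rho):
    """Pairwise composite log-likelihood for a standard multivariate normal
    with a common correlation rho: sum the bivariate normal log-densities
    over all pairs of variables instead of evaluating the joint density."""
    n, p = data.shape
    ll = 0.0
    for i in range(p):
        for j in range(i + 1, p):
            u, v = data[:, i], data[:, j]
            q = (u ** 2 - 2 * rho * u * v + v ** 2) / (1 - rho ** 2)
            ll += np.sum(-np.log(2 * np.pi)
                         - 0.5 * np.log(1 - rho ** 2) - 0.5 * q)
    return ll

rng = np.random.default_rng(1)
p, rho_true = 4, 0.5
cov = np.full((p, p), rho_true) + (1 - rho_true) * np.eye(p)
data = rng.multivariate_normal(np.zeros(p), cov, size=2000)

grid = np.linspace(-0.9, 0.9, 181)
rho_hat = grid[np.argmax([pairwise_loglik(data, r) for r in grid])]
```

Here the full likelihood would be cheap too, but the same construction applies unchanged when the joint density is intractable, which is the typical motivation for composite likelihoods.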

In this paper, we consider the single-index measurement error model with mismeasured covariates in the nonparametric part. To solve the problem, we develop a simulation-extrapolation (SIMEX) algorithm based on the local linear smoother and an estimating equation. The proposed SIMEX estimation does not require assuming a distribution for the unobserved covariate. Using the constraint $\Vert \beta \Vert = 1$, we transform the boundary of a unit ball in $\mathbb{R}^p$ to the interior of a unit ball in $\mathbb{R}^{p-1}$. The proposed SIMEX estimator of the index parameter is shown to be asymptotically normal under some regularity conditions. We also derive the asymptotic bias and variance of the estimator of the unknown link function. Finally, the performance of the proposed method is examined by simulation studies and illustrated by a real data example.
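The generic SIMEX idea (not the single-index estimator above) can be sketched in a few lines for a linear model with classical measurement error: add extra noise at increasing levels $\lambda$, fit the naive estimator at each level, and extrapolate the trend back to $\lambda = -1$, where no measurement error remains. All parameter values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, beta, sig_u = 5000, 1.0, np.sqrt(0.5)
x = rng.normal(size=n)                      # true covariate (unobserved)
w = x + rng.normal(scale=sig_u, size=n)     # mismeasured version of x
y = beta * x + rng.normal(scale=0.5, size=n)

def slope(w, y):
    c = np.cov(w, y)
    return c[0, 1] / c[0, 0]                # naive OLS slope (attenuated)

# SIMEX: add extra noise with variance lam * sig_u^2, average the naive
# slope over B replicates per lam, then extrapolate the trend to lam = -1.
lams = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
B = 50
means = []
for lam in lams:
    reps = [slope(w + rng.normal(scale=np.sqrt(lam) * sig_u, size=n), y)
            for _ in range(B)]
    means.append(np.mean(reps))

coef = np.polyfit(lams, means, deg=2)       # quadratic extrapolant
simex_slope = np.polyval(coef, -1.0)
naive_slope = means[0]
```

The quadratic extrapolation removes most, though not all, of the attenuation bias; the choice of extrapolant is part of the method's tuning.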

A unified testing framework is presented for large-dimensional mean vectors of one or several populations, which may be non-normal with unequal covariance matrices. Beginning with the one-sample case, the construction of tests, the underlying assumptions, and the asymptotic theory are systematically extended to the multi-sample case. Tests are defined in terms of U-statistics-based consistent estimators, and their limits are derived under a few mild assumptions. The accuracy of the tests is shown through simulations. Real data applications, including a five-sample unbalanced MANOVA analysis of count data, are also given.

The subjective assessment of quality of life, personal skills, and agreement with a certain opinion are common issues in clinical, social, behavioral, and marketing research, giving rise to a wide range of surveys providing ordinal data. Besides such variables, other common surveys generate responses on a continuous scale, where the variable's actual value cannot be observed because the data are grouped. This paper introduces a re-formalization of the recent "Monotonic Dependence Coefficient" (MDC) suitable for all frameworks in which, given two variables, the independent variable is expressed in ordinal categories and the dependent variable is grouped. We denote this novel coefficient by $\mathrm{MDC}\mathrm{go}$. Its behavior, and the scenarios in which it performs better than alternative correlation/association measures such as Spearman's $r_\mathrm{S}$, Kendall's $\tau_b$, and Somers' $\varDelta$ coefficients, are explored through a Monte Carlo simulation study. Finally, to shed light on the usefulness of the proposal in real surveys, an application to drug-expenditure data is considered.

The complexity of longitudinal data lies in the inherent dependence among measurements from the same subject over different time points. For multiple longitudinal responses, the problem is challenging due to inter-trait and intra-trait dependence. While linear mixed models are widely used for analysing such data, appropriate inference on the shape of the population cannot be drawn for non-normal data sets. We propose a linear mixed model for joint quantile regression of multiple longitudinal responses. We consider an asymmetric Laplace distribution for quantile regression and estimate the model parameters by a Monte Carlo EM algorithm. A nonparametric bootstrap resampling method is used to estimate confidence intervals of the parameter estimates. Through extensive simulation studies, we investigate the operating characteristics of our proposed model and compare its performance with other traditional quantile regression models. We apply the proposed model to data from a nutrition education programme for hypercholesterolemic children in the USA.
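The link between the asymmetric Laplace distribution and quantile regression is that maximizing an asymmetric Laplace likelihood is equivalent to minimizing the check (pinball) loss. A minimal sketch for a single quantile, using a grid search over an intercept-only model (all settings illustrative):

```python
import numpy as np

def check_loss(u, tau):
    """Check (pinball) loss; minimising its sum over a location parameter
    yields the tau-th sample quantile, which is why quantile regression
    can be cast as maximum likelihood under an asymmetric Laplace model."""
    return u * (tau - (u < 0))

rng = np.random.default_rng(3)
y = rng.exponential(size=5000)   # skewed data, tau = 0.75 quantile sought
tau = 0.75

grid = np.linspace(0.0, 5.0, 2001)
losses = [check_loss(y - c, tau).sum() for c in grid]
q_hat = grid[np.argmin(losses)]
```

For Exp(1) data the 0.75 quantile is $\ln 4 \approx 1.386$; replacing the constant with a linear predictor gives quantile regression, and the mixed-model extension above adds subject-level random effects on top.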

Standard Poisson and negative binomial truncated regression models for count data include the regressors in the mean of the non-truncated distribution. In this paper, a new approach is proposed in which the explanatory variables determine the truncated mean directly. The main advantage is that the regression coefficients in the new models have a straightforward interpretation as the effect of a change in a covariate on the mean of the response variable. A simulation study has been carried out to compare the performance of the proposed truncated regression models with the standard ones, showing that the coefficient estimates are more precise in the sense that their standard errors are consistently lower. The simulation study also indicates that the estimates obtained with the standard models are biased. An application to real data illustrates the utility of the introduced truncated models in a hurdle model. Although in the example there are slight differences in the results between the two approaches, the proposed one provides a clear interpretation of the coefficient estimates.
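The distinction between the two parameterizations can be illustrated with the zero-truncated Poisson, whose truncated mean is $E[Y \mid Y > 0] = \lambda/(1 - e^{-\lambda})$. Placing covariates directly on the truncated mean requires inverting this relation; a bisection sketch (function names and bracket values are illustrative):

```python
import numpy as np

def truncated_mean(lam):
    """Mean of a zero-truncated Poisson: E[Y | Y > 0] = lam / (1 - e^-lam)."""
    return lam / (1.0 - np.exp(-lam))

def lam_from_truncated_mean(mu, lo=1e-8, hi=50.0, iters=80):
    """Invert the (strictly increasing) truncated mean by bisection, so that
    covariates can be placed directly on the truncated mean. Requires mu > 1,
    since a zero-truncated Poisson mean always exceeds 1."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if truncated_mean(mid) < mu:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

lam = lam_from_truncated_mean(2.5)   # lambda giving truncated mean 2.5
```

In the standard parameterization a coefficient acts on $\lambda$, so its effect on the observed (truncated) mean is filtered through this nonlinear map; parameterizing the truncated mean directly avoids that indirection.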

We consider the problem of allocation of a sample in two- and three-stage sampling. We seek an allocation that is both multi-domain and population efficient. Choudhry et al. (Survey Methodology 38(1):23–29, 2012) recently considered such a problem for one-stage stratified simple random sampling without replacement in domains. Their approach was through minimization of the sample size under constraints on relative variances in all domains and on the overall relative variance. To attain this goal, they used nonlinear programming. Alternatively, we minimize here the relative variances in all domains (controlling them through given priority weights), as well as the overall relative variance, under constraints imposed on the total (expected) cost. We consider several two- and three-stage sampling schemes. Our aim is to shed some light on the analytic structure of the solutions rather than to derive a purely numerical tool for sample allocation. To this end, we develop the eigenproblem methodology introduced for optimal allocation problems in Niemiro and Wesołowski (Appl Math 28:73–82, 2001) and recently updated in Wesołowski and Wieczorkowski (Commun Stat Theory Methods 46(5):2212–2231, 2017), by taking into account several new sampling schemes and, more importantly, the (single) total expected variable cost constraint. This approach allows for solutions that are a direct generalization of the Neyman-type allocation. The structure of the solution is deciphered from the explicit allocation formulas given in terms of an eigenvector $\underline{v}^*$ of a population-based matrix $\mathbf{D}$. The solution we provide can be viewed as a multi-domain version of the Neyman-type allocation in multistage stratified SRSWOR schemes.
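For reference, the classical single-stage Neyman allocation that such solutions generalize can be written in a few lines: under a fixed total sample size, $n_h \propto N_h S_h$, where $N_h$ and $S_h$ are the size and standard deviation of stratum $h$. The stratum values below are illustrative.

```python
import numpy as np

def neyman_allocation(N_h, S_h, n):
    """Classical Neyman allocation in stratified SRSWOR: allocate the total
    sample size n proportionally to N_h * S_h across strata. (The abstract's
    eigenproblem solution generalizes this to multi-domain, multistage
    designs under a cost constraint.)"""
    w = np.asarray(N_h, dtype=float) * np.asarray(S_h, dtype=float)
    return n * w / w.sum()

# Illustrative strata: sizes and standard deviations
alloc = neyman_allocation(N_h=[1000, 3000, 6000], S_h=[10.0, 5.0, 2.0], n=500)
```

Note that a large stratum with a small standard deviation can receive fewer units than a smaller but more variable stratum; in practice the real-valued allocation is rounded to integers.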

We propose a new procedure for estimating the unknown parameters and function in partial functional linear regression. The asymptotic distribution of the estimator of the vector of slope parameters is derived, and the global convergence rate of the estimator of the unknown slope function is established under a suitable norm. The convergence rate of the mean squared prediction error for the proposed estimators is also established. Based on the proposed estimation procedure, we further construct penalized regression estimators and establish their variable selection consistency and oracle properties. Finite sample properties of our procedures are studied through Monte Carlo simulations. A real estate data set is used to illustrate the proposed methodology.
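A common way to make such a model estimable is a basis expansion of the slope function, which reduces the functional term $\int x_i(t)\beta(t)\,dt$ to a finite-dimensional regression alongside the scalar covariates. A minimal sketch with a polynomial basis and simulated curves (all choices illustrative, not the estimator proposed above):

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 500, 101
t = np.linspace(0, 1, m)
dt = t[1] - t[0]

# Functional covariates: random polynomial curves; z is a scalar covariate.
poly = np.vstack([t ** k for k in range(5)])       # curve-generating basis
x = rng.normal(size=(n, 5)) @ poly                 # n observed curves on grid t
z = rng.normal(size=n)
beta_t = 1 + 2 * t - 3 * t ** 2                    # true slope function
gamma = 0.8                                        # true scalar coefficient
y = x @ beta_t * dt + gamma * z + 0.1 * rng.normal(size=n)

# Expand beta(t) in a cubic polynomial basis and integrate numerically,
# turning the functional term into ordinary regression columns.
basis = np.vstack([t ** k for k in range(4)])      # 4 x m
F = x @ basis.T * dt                               # integrated functional design
design = np.column_stack([z, F])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
gamma_hat, beta_hat_t = coef[0], coef[1:] @ basis
```

In practice splines or functional principal components replace the raw polynomial basis, and the basis dimension is a smoothing parameter that drives the convergence rates studied in the paper.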

Ordinary variable inspection plans rely on the normality of the underlying populations. However, this assumption is often questionable or simply not satisfied, and ordinary variable sampling plans are sensitive to deviations from the distributional assumption. Nonconforming items occur in the tails of the distribution, which can be approximated by a generalized Pareto distribution (GPD). We investigate several estimators of its parameters with regard to their usefulness not only for the GPD but also for arbitrary continuous distributions. The likelihood moment estimates (LMEs) of Zhang (Aust N Z J Stat 49:69–77, 2007) and the Bayesian estimate (ZSE) of Zhang and Stephens (Technometrics 51:316–325, 2009) turn out to be the best for our purpose. We then use these parameter estimates to estimate the fraction defective. The asymptotic normality of the LME (cf. Zhang 2007) and that of the fraction defective are used to construct the sampling plan. The difference from the sampling plans constructed in Kössler (Allg Stat Arch 83:416–433, 1999; in: Steland, Rafajlowicz, Szajowski (eds) Stochastic models, statistics, and their applications, Springer, Heidelberg, pp 93–100, 2015) is that we now use the new parameter estimates. Moreover, in contrast to the aforementioned papers, we now also consider two-sided specification limits. An industrial example illustrates the method.
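Once the GPD parameters above a threshold $u$ are estimated, the fraction defective beyond a specification limit $L$ follows from the standard peaks-over-threshold tail formula $p = p_u\,(1 + \xi (L-u)/\sigma)^{-1/\xi}$, where $p_u = P(X > u)$. A minimal sketch (parameter values illustrative):

```python
import numpy as np

def gpd_tail_prob(limit, u, p_u, sigma, xi):
    """Fraction defective beyond a specification limit, using a GPD
    approximation of the tail above threshold u, where p_u = P(X > u).
    The xi = 0 case reduces to an exponential tail."""
    z = (limit - u) / sigma
    if xi == 0.0:
        return p_u * np.exp(-z)
    return p_u * (1.0 + xi * z) ** (-1.0 / xi)

# Sanity check against an exact exponential tail (a GPD with xi = 0):
# for Exp(1), P(X > 5) = e^-5 = P(X > 3) * e^-(5-3).
p = gpd_tail_prob(limit=5.0, u=3.0, p_u=np.exp(-3.0), sigma=1.0, xi=0.0)
```

In an inspection plan, $\sigma$ and $\xi$ would be replaced by estimates such as the LME or ZSE, and for two-sided specification limits the two tail fractions are combined.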

We introduce the novel notion of the integrated characteristic function and its empirical counterpart. Some basic properties of these new objects are presented and then used to construct new procedures for testing goodness of fit to parametric distributions, for testing symmetry and homogeneity, and for testing independence. Asymptotic results are obtained, and corresponding Monte Carlo results on the finite-sample behavior of the procedures are included.
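The empirical characteristic function underlying such procedures, $\varphi_n(t) = n^{-1}\sum_j e^{\mathrm{i} t x_j}$, is simple to compute; a minimal sketch (names illustrative):

```python
import numpy as np

def ecf(t, x):
    """Empirical characteristic function phi_n(t) = (1/n) sum_j exp(i t x_j),
    evaluated at each point of the vector t."""
    t = np.atleast_1d(t)
    return np.exp(1j * np.outer(t, x)).mean(axis=1)

rng = np.random.default_rng(4)
x = rng.normal(size=20000)
vals = ecf(np.array([0.0, 1.0]), x)
# phi_n(0) = 1 always; for N(0,1) data, phi(t) = exp(-t^2 / 2).
```

Tests based on characteristic functions typically compare $\varphi_n$ with its null counterpart through a weighted integral over $t$, which is where the integrated version introduced above comes in.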

A scalar-response functional model describes the association between a scalar response and a set of functional covariates. An important problem in the functional data literature is to test the nullity or linearity of the effect of the functional covariate in the context of scalar-on-function regression. This article provides an overview of existing methods for testing both the null hypothesis that there is no relationship and the null hypothesis that there is a linear relationship between the functional covariate and the scalar response, together with a comprehensive numerical comparison of their performance. The methods are compared under a variety of realistic scenarios: the functional covariate observed on dense or sparse grids, with and without measurement noise. Finally, the methods are illustrated on the Tecator data set.