All variables are generated via the appropriate transformation of standard normal variables, as described below. See the documentation for `corrvar`

and `corrvar2`

for more details about all inputs. The parameter inputs should be checked first using `validpar`

. Summaries of simulated variables can be obtained using `summary_var`

. Some code has been modified from the **SimMultiCorrData** package (Fialkowski 2018).

`method`

= “Fleishman”) or Headrick (2002)’s fifth-order (`method`

= “Polynomial”) power method transformation (PMT). This is a computationally efficient algorithm that simulates continuous distributions through the method of moments. It works by matching standardized cumulants – the first four (mean, variance, skew, and standardized kurtosis) for Fleishman’s third-order method, or the first six (mean, variance, skew, standardized kurtosis, and standardized fifth and sixth cumulants) for Headrick’s fifth-order method. The transformation is expressed as follows:\[\begin{equation} Y = c_{0} + c_{1} * Z + c_{2} * Z^2 + c_{3} * Z^3 + c_{4} * Z^4 + c_{5} * Z^5,\ Z \sim iid\ N(0,1), \tag{1} \end{equation}\]

where \(c_{4}\) and \(c_{5}\) both equal \(0\) for Fleishman’s method. The real constants are calculated by `SimMultiCorrData::find_constants`

, which solves the system of equations given in `SimMultiCorrData::fleish`

or `SimMultiCorrData::poly`

. All variables are simulated with mean \(0\) and variance \(1\), and then transformed to the specified mean and variance at the end.

\[\begin{equation} h_{Y}(y) = \sum_{i=1}^{k} \pi_{i} f_{Y_{i}}(y),\ \ \sum_{i=1}^{k} \pi_{i} = 1. \tag{2} \end{equation}\]

The \(\pi_{i}\) are mixing parameters which determine the weight of each component distribution \(f_{Y_{i}}(y)\) in the overall probability distribution. As long as each component has a valid PDF, the overall distribution \(h_{Y}(y)\) has a valid PDF. The main assumption is statistical independence between the process of randomly selecting the component distribution and the distributions themselves: Assume there is a random selection process that first generates the numbers \(1,\ ...,\ k\) with probabilities \(\pi_{1},\ ...,\ \pi_{k}\). After selecting number \(i\), where \(1 \leq i \leq k\), a random variable \(y\) is drawn from component distribution \(f_{Y_{i}}(y)\).

Mixture distributions provide a useful way for describing heterogeneity in a population, especially when an outcome is a composite response from multiple sources. They may also be used to model distributions with outliers. For example, the\[\begin{equation} h_{Y}(y) = \pi \phi(0,\ 9\sigma^2)(y) + (1-\pi) \phi(0,\ \sigma^2)(y),\ \ -\infty < y < \infty. \tag{3} \end{equation}\]

Here, \(\phi(\mu,\ \sigma^2)(y)\) is the PDF of the normal distribution (Davenport, Bezder, and Hathaway 1988; Schork, Allison, and Thiel 1996; Everitt 1996; Pearson 2011).

Continuous mixture variables are generated at the component level. Each component variable is created using the PMT, and the mixture variable is generated from these based on a random multinomial variable described by the mixing probabilities. Correlations are also controlled at the component level. Users specify the target pairwise correlations between the components across continuous mixture variables and between components and other types of variables.

The function `rho_M1M2`

approximates the expected correlation between two mixture variables \(M1\) and \(M2\) based on the component distributions and the correlations between components. The function `rho_M1Y`

approximates the expected correlation between a mixture variable \(M1\) and another random variable \(Y\), that may have an ordinal, continuous, or count distribution. If the user desires a specific correlation between a mixture variable \(M1\) and another simulated variable \(M2\) or \(Y\), various component level correlations can be tried in these 2 functions to see if the desired correlation can be achieved. See the Expected Cumulants and Correlations for Continuous Mixture Variables vignette for more details on continuous mixture variables.

The

**required parameters**for simulating continuous variables include: skewness, standardized kurtosis (kurtosis - 3), and standardized fifth and sixth cumulants (for the fifth-order method). If the goal is to simulate a theoretical distribution (i.e. Gamma, Beta, Logistic, etc.), these values can be obtained using`SimMultiCorrData::calc_theory`

. If the goal is to mimic an empirical data set, these values can be found using`SimMultiCorrData::calc_moments`

(using the method of moments) or`SimMultiCorrData::calc_fisherk`

(using Fisher’s k-statistics). If the standardized cumulants are obtained from`calc_theory`

, the user may need to use rounded values as inputs (i.e.`skews = round(skews, 8)`

). For example, in order to ensure that skew is exactly \(0\) for symmetric distributions. Due to the nature of the integration involved in`calc_theory`

, the results are approximations. Greater accuracy can be achieved by increasing the number of subdivisions (`sub`

) used in the integration process.For mixture variables, the parameters are specified at the component level by the inputs

`mix_skews`

,`mix_skurts`

,`mix_fifths`

,`mix_sixths`

, and`mix_Six`

. The mixing probabilities, means and standard deviations of the component variables are given by`mix_pis`

,`mix_mus`

and`mix_sigmas`

.The means and variances of all continuous variables are specified by

`means`

and`vars`

. These are at the variable level, i.e., they refer to the continous non-mixture and mixture variables themselves. The function`calc_mixmoments`

calculates the expected mean, standard deviation, and standardized cumulants for mixture variables based on the component distributions.For some sets of cumulants, it is either not possible to find power method constants or the calculated constants do not generate valid power method PDFs. In these situations, adding a value to the sixth cumulant may provide solutions (see

`find_constants`

). When using Headrick’s fifth-order approximation, if simulation results indicate that a continuous variable does not generate a valid PDF, the user can try`find_constants`

with various sixth cumulant correction vectors to determine if a valid PDF can be found. These sixth cumulant corrections are specified in the simulation functions using`Six`

or`mix_Six`

.**Choice of Fleishman’s or Headrick’s Method:**Using the fifth-order approximation allows additional control over the fifth and sixth moments of the generated distribution, improving accuracy. In addition, the range of feasible standardized kurtosis (\(\gamma_{2}\)) values, given skew (\(\gamma_{1}\)) and standardized fifth (\(\gamma_{3}\)) and sixth (\(\gamma_{4}\)) cumulants, is larger than with the third-order approximation. For example, Fleishman’s method can not be used to generate a non-normal distribution with a ratio of \(\gamma_{1}^2/\gamma_{2} > 9/14\) (Headrick and Kowalchuk 2007). This eliminates the family of distributions, which has a constant ratio of \(\gamma_{1}^2/\gamma_{2} = 2/3\). The fifth-order method also generates more distributions with valid PDF’s. However, if the fifth and sixth cumulants do not exist, the Fleishman approximation should be used.

**2) Ordinal Variables:** Ordinal variables (\(r \ge 2\) categories) are generated by discretizing the standard normal variables at quantiles. These quantiles are determined by evaluating the inverse standard normal CDF at the cumulative probabilities defined by each variable’s marginal distribution. If the support for variable \({Y}_{i}\) is not provided, the default is to use \(\{1,\ 2,\ ...,\ r_{i}\}\).

The

**required parameters**for simulating ordinal variables include: the cumulative marginal probabilities and support values (if desired). The probabilities should be combined into a list of length equal to the number of ordinal variables. The \(i^{th}\) element is a vector of the cumulative probabilities defining the marginal distribution of the \(i^{th}\) variable. If the variable can take \(r\) values, the vector will contain \(r - 1\) probabilities (the \(r^{th}\) is assumed to be \(1\)).For

**binary variables**, the user-supplied probability should be the probability of the \(1^{st}\) (lower) support value. This would ordinarily be considered the probability of*failure*(\(q\)), while the probability of the \(2^{nd}\) (upper) support value would be considered the probability of*success*(\(p = 1 - q\)). The support values should be combined into a separate list. The \(i^{th}\) element is a vector containing the \(r\) ordered support values. If not provided, the default is for the \(i^{th}\) element to be the vector \(1, ..., r\).

**3) Poisson and Negative Binomial Variables:** Count variables are generated using the inverse CDF method. The cumulative distribution function of a standard normal variable has a uniform distribution. The appropriate quantile function \(F_{Y}^{-1}\) is applied to this uniform variable with the designated parameters to generate the count variable: \(Y = F_{y}^{-1}(\Phi(Z))\).

\[\begin{equation} P(Y_{ZI} = 0) = \phi + (1 - \phi) * P(Y = 0),\ \ 0 < \phi < 1. \tag{4} \end{equation}\]

Therefore, \(\phi\) is the probability of a structural zero, and setting \(\phi = 0\) reduces \(Y_{ZI}\) to the variable \(Y\). In **SimCorrMix**, \(Y\) can have either a Poisson distribution (\(Y_P\)) or a NB distribution (\(Y_{NB}\)) (Ismail and Zamani 2013; Lambert 1992; Zhang, Mallick, and Yi 2016).

**Zero-inflated Poisson (ZIP):**The model for \(Y_{ZIP} \sim ZIP(\lambda,\ \phi),\ \lambda>0,\ 0 < \phi < 1\) is:

The mean of \(Y_{ZIP}\) is \((1 - \phi) * \lambda\), and the variance is \(\lambda + \lambda^2 * \phi/(1 - \phi)\) (see `VGAM::dzipois`

).

The zero-deflated Poisson distribution may be obtained by setting \(\phi \in (-1/(exp(\lambda) - 1),\ 0)\), so that the probability of a zero count is less than the nominal Poisson value. In this case, \(\phi\) no longer represents a probability.

When \(\phi = -1/(exp(\lambda) - 1)\), the random variable has a positive-Poisson distribution (see `VGAM::dpospois`

). The probability of a zero response is \(0\), and the other probabilities are scaled to sum to \(1\).

The parameters \(\lambda\) and \(\phi\) are specified through the inputs `lam`

and `p_zip`

. Setting `p_zip = 0`

(default setting) generates a regular Poisson variable. Parameters for zero-inflated Poisson variables should follow those for regular Poisson variables in all function inputs. For **Correlation Method 2**, a vector of total cumulative probability truncation values should be given in `pois_eps`

. These values represent the amount removed from the total cumulative probability when creating finite supports. The value may differ by variable, but the default value is \(0.0001\) (suggested by Barbiero and Ferrari (2015)). For example, `pois_eps = 0.0001`

means that the support values removed have a total probability of \(0.0001\) of occurring in the distribution of that variable. The effect is to remove improbable values, which may be of concern if the user wishes to replicate a distribution with outliers.

**Zero-inflated Negative Binomial (ZINB):**The model for \(Y_{ZINB} \sim ZINB(size,\ p,\ \phi),\ size>0,\ 0<p\leq1,\ \ 0 < \phi < 1\) is:

In this formulation, the Negative Binomial component \(Y_{NB}\) represents the number of failures which occur in a sequence of independent Bernoulli trials before a target number of successes (\(size\)) is reached. The probability of success in each trial is \(p\). \(Y_{NB}\) may also be parametrized by the mean \(\mu\) and dispersion parameter \(size\) so that \(p = size/(size + \mu)\) or \(\mu = size * (1-p)/p\) (see `stats::dnbinom`

).

The mean of \(Y_{ZINB}\) is \((1 - \phi) * \mu\), and the variance is \((1 - \phi) * \mu * (1 + \mu * (\phi + 1/size))\) (see `VGAM::dzinegbin`

).

The zero-deflated NB distribution may be obtained by setting \(\phi \in (-p^{size}/(1 - p^{size}),\ 0)\), so that the probability of a zero count is less than the nominal NB value. Again, in this case, \(\phi\) no longer represents a probability.

The positive-NB distribution can be obtained with \(\phi = -p^{size}/(1 - p^{size})\) (see `VGAM::dposnegbin`

). The probability of a zero response is \(0\), and the other probabilities are scaled to sum to \(1\).

The parameters \(size\), \(p\), \(\mu\), and \(\phi\) are specified through the inputs `size`

, `prob`

, `mu`

, and `p_zinb`

. Either `prob`

or `mu`

should be given for all NB and ZINB variables. Setting `p_zinb = 0`

(default setting) generates a regular NB variable. Parameters for zero-inflated NB variables should follow those for regular NB variables in all function inputs. For **Correlation Method 2**, a vector of total cumulative probability truncation values should be given in `nb_eps`

. The default value in the simulation functions is \(0.0001\).

The distributions functions are taken from the **VGAM** package (Yee 2018).

The **error loop** may be used to correct the final pairwise correlation of simulated variables to be within a user-specified precision value (`epsilon`

) of the target correlation. It updates the pairwise intermediate MVN correlation iteratively in a loop until either the maximum error is less than `epsilon`

or the number of iterations exceeds the maximum number set by the user (`maxit`

). Some code has been modified from the **SimMultiCorrData** package (Fialkowski 2018). Below is a description of the algorithm, which is executed within each variable pair across all `k = k_cat + k_cont + k_comp + k_pois + k_nb`

variables (`k_comp`

is the number of component distributions for the continuous mixture variables). `rho_calc`

is the calculated final correlation matrix updated in each iteration, `rho`

is the target final correlation, `Sigmaold`

is the intermediate correlation from the previous iteration, `it`

is the iteration number, `q`

is the row number, `r`

is the column number, and `maxerr`

is the absolute pairwise correlation error between `rho_calc[q, r]`

and `rho[q, r]`

.

If

`rho[q, r] = 0`

, set`Sigma[q, r] = 0`

.- While
`maxerr`

is greater than`epsilon`

and`it`

is less than`maxit`

:

- Calculate new intermediate correlation:
- If
`rho[q, r] * (rho[q, r]/rho_calc[q, r]) <= -1`

, then:`Sigma[q, r] = Sigmaold[q, r] * (1 + 0.1 * (1 - Sigmaold[q, r]) * -sign(rho[q, r] - rho_old[q, r]))`

- If
`rho[q, r] * (rho[q, r]/rho_calc[q, r]) >= 1`

, then:`Sigma[q, r] = Sigmaold[q, r] * (1 + 0.1 * (1 - Sigmaold[q, r]) * sign(rho[q, r] - rho_old[q, r]))`

- Else,
`Sigma[q, r] = Sigmaold[q, r] * (rho[q, r]/rho_calc[q, r])`

.

- If
- Eigenvalue decomposition is done on the
`kxk`

`Sigma`

correlation matrix. If`Sigma`

is not positive-definite, the negative eigenvalues are replaced with \(0\).

- Generate new normal variables \(X_{i}\), \(1 \leq i \leq\)
`k`

with correlation matrix`Sigma`

.

- Generate new \(Y_{i}\), \(1 \leq i \leq\)
`k`

using the appropriate transformations on the \(X_{i}\).

- The correlation
`rho_calc`

of all the \(\bm{Y}\) variables is calculated.

- Set
`Sigmaold = Sigma`

and increase`it`

by 1.

- Store the number of iterations in the matrix
`niter`

.

The error loop does increase simulation time, but it can improve accuracy in most situations. It may be unsuccessful in more difficult to obtain correlation structures. Some cases utilizing negative correlations will have results similar to those without the error loop. Trying different values of `epsilon`

(i.e., \(0.01\) instead of \(0.001\)) can help in these cases. For a given row (`q`

= 1, …, `nrow(Sigma)`

), the error loop progresses through the intermediate correlation matrix `Sigma`

by increasing column index (`r`

= 2, …, `ncol(Sigma)`

, `r`

not equal to `q`

). Each time a new pairwise correlation `Sigma[q, r]`

is calculated, the new `Sigma`

matrix is imposed on the intermediate normal variables `X`

, the appropriate transformations are applied to get `Y`

, and the final correlation matrix `rho_calc`

is found. Even though the intermediate correlations from previous `q, r`

combinations are not changed, the final correlations are affected. The fewer iterations for a given `q, r`

combination, the less `rho_calc[q, r]`

changes. Since larger values of `epsilon`

require fewer iterations, using `epsilon = 0.01`

may give better results than `epsilon = 0.001`

.

Each simulation pathway in **SimCorrMix** has its own function to calculate correlation boundaries, given specified distributional parameters, and to check if a target correlation matrix `rho`

falls within these boundaries. Correlation method 1 uses `validcorr`

and correlation method 2 uses `validcorr2`

. The parameter inputs should be checked first using `validpar`

. Some code has been modified from the **SimMultiCorrData** package (Fialkowski 2018). The distribution functions for Poisson and Negative Binomial variables are taken from the **VGAM** package (Yee 2018).

The GSC algorithm is a flexible method for determining empirical correlation bounds when the theoretical bounds are unknown. The steps are as follows:

Generate independent random samples from the desired distributions using a large number of observations (i.e. \(n = 100,000\)).

**Lower Bound:**Sort the two variables in opposite directions (i.e., one increasing and one decreasing) and find the sample correlation.**Upper Bound:**Sort the two variables in the same direction and find the sample correlation.

Demirtas and Hedeker (2011) showed that the empirical bounds computed from the GSC method are similar to the theoretical bounds (when they are known).

\[\begin{equation} \Big\{cor\left(F_1^{-1}(U), F_2^{-1}(1 - U)\right),\ cor\left(F_1^{-1}(U), F_2^{-1}(U)\right)\Big\}. \tag{1} \end{equation}\]

First, the calculations which are equivalent in the two pathways will be discussed by variable type.

**Ordinal Variables:**

*Binary pairs:*The correlation bounds are determined as in Demirtas, Hedeker, and Mermelstein (2012), who used the method of Emrich and Piedmonte (1991). The joint distribution is determined by “borrowing” the moments of a multivariate normal distribution. For two binary variables \(Y_1\) and \(Y_2\), with success probabilities \(p_1\) and \(p_2\), the boundaries are given by:

\[\begin{equation} \Big\{max\left(-\sqrt{(p_1p_2)/(q_1q_2)},\ -\sqrt{(q_1q_2)/(p_1p_2)}\right),\ \ \ min\left(\sqrt{(p_1q_2)/(q_1p_2)},\ \sqrt{(q_1p_2)/(p_1q_2)}\right)\Big\}, \tag{2} \end{equation}\]where \(q_1 = 1 - p_1\) and \(q_2 = 1 - p_2\).

*Binary-Ordinal or Ordinal-Ordinal pairs:*Randomly generated variables with the given marginal distributions are used in the GSC algorithm to find the correlation bounds.

**Continuous Variables:** Continuous variables are randomly generated using constants from `SimMultiCorrData::find_constants`

and a vector of sixth cumulant correction values (if provided). The GSC algorithm is used to find the lower and upper bounds.

**Continuous - Ordinal Pairs:** Randomly generated ordinal variables with the given marginal distributions and the previously generated continuous variables are used in the GSC algorithm to find the correlation bounds.

The GSC bounds are used for all variable pair types except Poisson and Negative Binomial variables. The Frechet-Hoeffding bounds are used for Poisson-Poisson, NB-NB, and Poisson-NB variable pairs.

In correlation method 2, count variables (regular or zero-inflated) are treated as “ordinal” by truncating their infinite supports. The maximum support values for the Poisson variables, given the cumulative probability truncation values (`pois_eps`

), means (`lam`

), and probabilities of structural zeros (`p_zip`

for zero-inflated Poisson), are calculated using `maxcount_support`

. The finite supports are used to determine marginal distributions for each Poisson variable. The maximum support values for the Negative Binomial variables, given the cumulative probability truncation values (`nb_eps`

), sizes (`size`

), success probabilities (`prob`

) or means (`mu`

), and probabilities of structural zeros (`p_zinb`

for zero-inflated NB), are calculated using `maxcount_support`

. The finite supports are used to determine marginal distributions for each NB variable.

The GSC bounds are used for all variable pair types.

Barbiero, A, and P A Ferrari. 2015. “Simulation of Correlated Poisson Variables.” *Applied Stochastic Models in Business and Industry* 31: 669–80. https://doi.org/10.1002/asmb.2072.

Davenport, JW, JC Bezder, and RJ Hathaway. 1988. “Parameter Estimation for Finite Mixture Distributions.” *Computers & Mathematics with Applications* 15 (10): 819–28.

Demirtas, H, and D Hedeker. 2011. “A Practical Way for Computing Approximate Lower and Upper Correlation Bounds.” *The American Statistician* 65 (2): 104–9. https://doi.org/10.1198/tast.2011.10090.

Demirtas, H, D Hedeker, and R J Mermelstein. 2012. “Simulation of Massive Public Health Data by Power Polynomials.” *Statistics in Medicine* 31 (27): 3337–46. https://doi.org/10.1002/sim.5362.

Emrich, L J, and M R Piedmonte. 1991. “A Method for Generating High-Dimensional Multivariate Binary Variates.” *The American Statistician* 45: 302–4. https://doi.org/10.1080/00031305.1991.10475828.

Everitt, B S. 1996. “An Introduction to Finite Mixture Distributions.” *Statistical Methods in Medical Research* 5 (2): 107–27. https://doi.org/10.1177/096228029600500202.

Fialkowski, A C. 2018. *SimMultiCorrData: Simulation of Correlated Data with Multiple Variable Types*. https://CRAN.R-project.org/package=SimMultiCorrData.

Fleishman, A I. 1978. “A Method for Simulating Non-Normal Distributions.” *Psychometrika* 43: 521–32. https://doi.org/10.1007/BF02293811.

Fréchet, M. 1951. “Sur Les Tableaux de Corrélation Dont Les Marges Sont Données.” *Annales de L’Université de Lyon. Section A: Sciences Mathématiques et Astronomie* 9: 53–77.

Headrick, T C. 2002. “Fast Fifth-Order Polynomial Transforms for Generating Univariate and Multivariate Non-Normal Distributions.” *Computational Statistics and Data Analysis* 40 (4): 685–711. https://doi.org/10.1016/S0167-9473(02)00072-5.

Headrick, T C, and R K Kowalchuk. 2007. “The Power Method Transformation: Its Probability Density Function, Distribution Function, and Its Further Use for Fitting Data.” *Journal of Statistical Computation and Simulation* 77: 229–49. https://doi.org/10.1080/10629360600605065.

Hoeffding, W. 1994. “Scale-Invariant Correlation Theory.” In *The Collected Works of Wassily Hoeffding*, edited by N I Fisher and P K Sen, 57–107. Springer Series in Statistics (Perspectives in Statistics). New York: Springer-Verlag. https://doi.org/10.1007/978-1-4612-0865-5_4.

Ismail, N, and H Zamani. 2013. “Estimation of Claim Count Data Using Negative Binomial, Generalized Poisson, Zero-Inflated Negative Binomial and Zero-Inflated Generalized Poisson Regression Models.” *Casualty Actuarial Society E-Forum* 41 (20): 1–28.

Lambert, D. 1992. “Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing.” *Technometrics* 34 (1): 1–14.

Pearson, RK. 2011. “Exploring Data in Engineering, the Sciences, and Medicine.” In. New York: Oxford University Press.

Schork, N J, D B Allison, and B Thiel. 1996. “Mixture Distributions in Human Genetics Research.” *Statistical Methods in Medical Research* 5: 155–78. https://doi.org/10.1177/096228029600500204.

Yee, Thomas W. 2018. *VGAM: Vector Generalized Linear and Additive Models*. https://CRAN.R-project.org/package=VGAM.

Zhang, X, H Mallick, and N Yi. 2016. “Zero-Inflated Negative Binomial Regression for Differential Abundance Testing in Microbiome Studies.” *Journal of Bioinformatics and Genomics* 2 (2): 1–9. https://doi.org/10.18454/jbg.2016.2.2.1.