adelie.data.snp_unphased

adelie.data.snp_unphased(n: int, p: int, *, K: int = 1, glm: str = 'gaussian', sparsity: float = 0.95, missing_ratio: float = 0.1, one_ratio: float = 0.25, two_ratio: float = 0.05, zero_penalty: float = 0, snr: float = 1, seed: int = 0)

Creates a SNP unphased dataset.

This dataset is intended only for the lasso, so each feature forms its own group: groups lists the individual features and group_sizes is a vector of ones.
The calldata matrix X has sparsity ratio 1 - one_ratio - two_ratio: a one_ratio proportion of the entries is randomly set to 1, a two_ratio proportion is randomly set to 2, and the remaining entries are 0. The user only sees a masked version of X in which a missing_ratio proportion of the entries is set to -9.
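The sampling of X described here can be sketched with plain NumPy. This is a minimal illustration only, not adelie's implementation: the helper name `sketch_calldata`, the use of `Generator.choice`, and the independent per-entry masking are all assumptions made for the example.

```python
import numpy as np


def sketch_calldata(n, p, one_ratio=0.25, two_ratio=0.05, missing_ratio=0.1, seed=0):
    """Hypothetical helper: draw an unphased calldata matrix and a masked copy."""
    rng = np.random.default_rng(seed)
    zero_ratio = 1.0 - one_ratio - two_ratio
    # Each entry is 0, 1, or 2 with the requested probabilities.
    X = rng.choice(
        np.array([0, 1, 2], dtype=np.int8),
        size=(n, p),
        p=[zero_ratio, one_ratio, two_ratio],
    )
    # Assumption: each entry is masked independently with probability
    # missing_ratio, using -9 as the missing code.
    X_masked = X.copy()
    X_masked[rng.random((n, p)) < missing_ratio] = -9
    return X, X_masked


X, X_masked = sketch_calldata(n=200, p=50)
print("zero fraction:", np.mean(X == 0))
print("missing fraction:", np.mean(X_masked == -9))
```

With the default ratios, the zero fraction should land near 0.7 and the missing fraction near 0.1, up to sampling noise.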
The true coefficient vector \(\beta\) has a sparsity proportion of its entries set to \(0\).
The response y is generated from the GLM specified by glm.
The penalty factors are set to np.sqrt(group_sizes) by default. If zero_penalty > 0, a random subset of the penalties is set to zero and penalty is then rescaled so that its squared \(\ell_2\) norm is p.
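The penalty rule can be sketched as follows. This is a hedged illustration rather than the library's code: the helper `sketch_penalty` is hypothetical, and the way the zeroed entries are chosen (a rounded count, sampled without replacement) is an assumption.

```python
import numpy as np


def sketch_penalty(group_sizes, zero_penalty=0.0, seed=0):
    """Hypothetical helper following the penalty description above."""
    rng = np.random.default_rng(seed)
    # Default penalty factors: square root of each group size.
    penalty = np.sqrt(np.asarray(group_sizes, dtype=float))
    if zero_penalty > 0:
        # In the lasso setting every group has size one,
        # so the number of groups equals p.
        p = penalty.size
        # Assumption: zero out round(zero_penalty * p) randomly chosen entries.
        n_zero = int(round(zero_penalty * p))
        zero_idx = rng.choice(p, size=n_zero, replace=False)
        penalty[zero_idx] = 0.0
        # Rescale so that the squared l2 norm equals p.
        norm = np.linalg.norm(penalty)
        if norm > 0:
            penalty *= np.sqrt(p) / norm
    return penalty


penalty = sketch_penalty(np.ones(20), zero_penalty=0.25)
print(np.sum(penalty == 0), np.sum(penalty ** 2))
```

The rescaling keeps the overall penalty magnitude comparable to the default case, where a vector of p ones also has squared \(\ell_2\) norm p.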

Parameters:

n : int

Number of data points.

p : int

Number of SNPs.

K : int, optional

Number of classes for multi-response GLMs. Default is 1.

glm : str, optional

GLM name. It must be one of the following:

"binomial"

"cox"

"gaussian"

"multigaussian"

"multinomial"

"poisson"

Default is "gaussian".

sparsity : float, optional

Proportion of \(\beta\) entries to be zeroed out. Default is 0.95.

missing_ratio : float, optional

Proportion of the entries of X that is set to -9 (missing). Default is 0.1.

one_ratio : float, optional

Proportion of the entries of X that is set to 1. Default is 0.25.

two_ratio : float, optional

Proportion of the entries of X that is set to 2. Default is 0.05.

zero_penalty : float, optional

Proportion of penalty entries to be zeroed out. Default is 0.

snr : float, optional

Signal-to-noise ratio. Default is 1.

seed : int, optional

Random seed. Default is 0.

Returns:

data : dict

A dictionary containing the generated data:

"X": feature matrix.

"y": response vector.

"groups": mapping of group index to the starting column index of X.

"group_sizes": mapping of group index to the group size.

"penalty": penalty factor for each group index.
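To make the "groups" and "group_sizes" entries concrete, the sketch below shows how a group index maps to a column block of X. It uses hand-built arrays in place of the returned dictionary, and the helper `group_columns` is hypothetical, introduced only for illustration.

```python
import numpy as np

p = 5
# In the lasso setting described above, each feature is its own group.
groups = np.arange(p)                 # starting column index of each group
group_sizes = np.ones(p, dtype=int)   # every group has size one


def group_columns(X, groups, group_sizes, i):
    """Hypothetical helper: return the columns of X belonging to group i."""
    start = groups[i]
    return X[:, start:start + group_sizes[i]]


X = np.arange(3 * p).reshape(3, p)
block = group_columns(X, groups, group_sizes, 2)
print(block.shape)  # (3, 1)
```

Because every group has size one here, each block is a single column; the same slicing would apply unchanged to larger groups.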