adelie.io.snp_phased_ancestry

class adelie.io.snp_phased_ancestry(filename: str, read_mode: str = 'file')

IO handler for SNP phased, ancestry matrix.

A SNP phased, ancestry matrix is a matrix that combines phased calldata with local ancestry information.

Let \(X \in \mathbb{R}^{n \times sA}\) denote such a matrix where \(n\) is the number of samples, \(s\) is the number of SNPs, and \(A\) is the number of ancestries. Each group of \(A\) contiguous columns is called an ancestry block. It corresponds to a single SNP, and every block has the same structure, described next. Let \(H \in \mathbb{R}^{n \times A}\) denote the ancestry block for some SNP. Then, \(H = H_0 + H_1\) where \(H_k \in \mathbb{R}^{n \times A}\) represents the phased calldata for haplotype \(k \in \{0,1\}\) of that SNP, marked by the ancestry indicator, that is,

\[ H_k = \begin{bmatrix} \unicode{x2014} & \delta^k_1 \cdot e_{a^k_1}^\top & \unicode{x2014} \\ \unicode{x2014} & \delta^k_2 \cdot e_{a^k_2}^\top & \unicode{x2014} \\ \vdots & \vdots & \vdots \\ \unicode{x2014} & \delta^k_n \cdot e_{a^k_n}^\top & \unicode{x2014} \\ \end{bmatrix} \]

where for each individual \(i\) and haplotype \(k\), \(\delta^k_i \in \{0,1\}\) is \(1\) if and only if there is a mutation and \(a^k_i \in \{1,\ldots,A\}\) is the ancestry label. Here, \(e_j \in \mathbb{R}^A\) is the \(j\)-th standard basis vector. Note that the labels are 1-based in this definition, whereas the write method expects 0-based labels in \(\{0,\ldots,A-1\}\).
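The definition of a single ancestry block can be sketched in NumPy as follows. The function name `ancestry_block` and the example values are hypothetical, and the labels are taken to be 0-based (subtract 1 from the labels used in the formula above). This only illustrates the definition and is not how the library builds the matrix.

```python
import numpy as np

def ancestry_block(delta0, a0, delta1, a1, A):
    """Build H = H_0 + H_1 for one SNP (illustrative helper, not part of adelie).

    delta_k: (n,) 0/1 mutation indicators for haplotype k.
    a_k:     (n,) 0-based ancestry labels in {0, ..., A-1} for haplotype k.
    """
    n = len(delta0)
    H = np.zeros((n, A), dtype=np.int8)
    rows = np.arange(n)
    for delta, a in ((delta0, a0), (delta1, a1)):
        # Row i of H_k is delta[i] * e_{a[i]}: add delta[i] at column a[i].
        np.add.at(H, (rows, np.asarray(a)), np.asarray(delta, dtype=np.int8))
    return H

# Three individuals, A = 2 ancestries.
H = ancestry_block(
    delta0=[1, 0, 1], a0=[0, 1, 1],
    delta1=[1, 1, 0], a1=[0, 0, 1],
    A=2,
)
print(H)
```

Each row of the block sums to the number of mutated haplotypes for that individual, so entries lie in \(\{0,1,2\}\).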

Parameters:
filename : str

File name containing the SNP data in .snpdat format.

read_mode : str, optional

Reading mode of the SNP data. It must be one of the following:

  • "file": reads the file using standard file IO. This is the most general and portable method, but it is the slowest option for large files.

  • "mmap": reads the file using mmap. This method is only supported on Linux and macOS. It is the most efficient way to read large files.

Default is "file".
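One way to choose between the two modes is to check the platform at runtime. The sketch below is a minimal example under that assumption. The helper `pick_read_mode` and the file name `data.snpdat` are hypothetical and not part of adelie.

```python
import sys

def pick_read_mode(platform: str = sys.platform) -> str:
    """Hypothetical helper: choose a read_mode string for the given platform."""
    # "mmap" is documented as supported only on Linux and macOS.
    if platform.startswith("linux") or platform == "darwin":
        return "mmap"
    # Fall back to the portable default everywhere else.
    return "file"

read_mode = pick_read_mode()
print(read_mode)

# The result would then be passed to the handler, for example:
# handler = adelie.io.snp_phased_ancestry("data.snpdat", read_mode=read_mode)
```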

Methods

__init__(self, filename, read_mode)

read(self)

Reads and loads the matrix from file.

to_dense(self[, n_threads])

Creates a dense SNP phased, ancestry matrix from the file.

write(calldata, ancestries, A[, n_threads])

Writes a dense SNP phased, ancestry matrix to the file in .snpdat format.

Attributes

ancestries

Number of ancestries.

cols

Number of columns.

endian

Endianness used in the file.

is_read

True if the IO handler has read the file content, and False otherwise.

nnz0

Number of non-zero entries for each column for haplotype 0.

nnz1

Number of non-zero entries for each column for haplotype 1.

rows

Number of rows.

snps

Number of SNPs.

__init__(self: adelie.adelie_core.io.IOSNPPhasedAncestry, filename: str, read_mode: str) -> None

read(self: adelie.adelie_core.io.IOSNPBase) -> int

Reads and loads the matrix from file.

Returns:
total_bytes : int

Number of bytes read.

to_dense(self: adelie.adelie_core.io.IOSNPPhasedAncestry, n_threads: int = 1) -> numpy.ndarray[numpy.int8[m, n]]

Creates a dense SNP phased, ancestry matrix from the file.

Parameters:
n_threads : int, optional

Number of threads. Default is 1.

Returns:
dense : (n, s*A) ndarray

Dense SNP phased, ancestry matrix.
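The column layout of the dense matrix can be explored with plain NumPy. In the sketch below, a small hand-written array stands in for the output of to_dense (no file is read), and the sizes are hypothetical. It shows that SNP \(j\) occupies columns \(jA\) through \((j+1)A - 1\) and that summing within a block recovers the per-SNP mutation count.

```python
import numpy as np

n, s, A = 2, 3, 2
# Stand-in for the (n, s*A) int8 array that to_dense() returns.
dense = np.array(
    [[2, 0, 0, 1, 0, 0],
     [1, 1, 0, 0, 0, 2]],
    dtype=np.int8,
)

# View the columns as s ancestry blocks of width A.
blocks = dense.reshape(n, s, A)

# The ancestry block of SNP j is the column slice j*A .. (j+1)*A - 1.
j = 1
assert np.array_equal(blocks[:, j, :], dense[:, j * A:(j + 1) * A])

# Summing within each block gives the per-SNP mutation count (0, 1, or 2).
genotype = blocks.sum(axis=2)
print(genotype)
```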

write(calldata: ndarray, ancestries: ndarray, A: int, n_threads: int = 1)

Writes a dense SNP phased, ancestry matrix to the file in .snpdat format.

Note

The calldata and ancestries matrices must not contain any missing values.

Parameters:
calldata : (n, 2*s) ndarray

SNP phased calldata in dense format. calldata[i, 2*j+k] is the data for individual i, SNP j, and haplotype k. It must only contain values in \(\{0,1\}\).

ancestries : (n, 2*s) ndarray

Local ancestry information in dense format. ancestries[i, 2*j+k] is the ancestry for individual i, SNP j, and haplotype k. It must only contain values in \(\{0,\ldots, A-1\}\).

A : int

Number of ancestries.

n_threads : int, optional

Number of threads. Default is 1.

Returns:
total_bytes : int

Number of bytes written.

benchmark : dict

Dictionary of benchmark timings for each step of the serializer.
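The input layout described above can be checked against a pure NumPy reference before writing. The sketch below validates the two input arrays and builds the \((n, sA)\) dense matrix they describe. The function `dense_reference` is a hypothetical helper for testing and is not the library's serializer. It assumes integer-valued inputs with at least one row.

```python
import numpy as np

def dense_reference(calldata, ancestries, A):
    """NumPy reference for the (n, s*A) matrix described on this page."""
    calldata = np.asarray(calldata)
    ancestries = np.asarray(ancestries)
    if calldata.shape != ancestries.shape or calldata.shape[1] % 2 != 0:
        raise ValueError("expected two (n, 2*s) arrays of equal shape")
    if not np.isin(calldata, (0, 1)).all():
        raise ValueError("calldata must only contain 0 or 1")
    if ancestries.min() < 0 or ancestries.max() >= A:
        raise ValueError("ancestries must lie in {0, ..., A-1}")

    n, two_s = calldata.shape
    s = two_s // 2
    dense = np.zeros((n, s * A), dtype=np.int8)
    rows = np.arange(n)
    for j in range(s):
        for k in range(2):
            # Haplotype k of SNP j adds its calldata at column j*A + ancestry.
            cols = j * A + ancestries[:, 2 * j + k]
            dense[rows, cols] += calldata[:, 2 * j + k].astype(np.int8)
    return dense

# n = 2 individuals, s = 2 SNPs, A = 2 ancestries.
calldata = [[1, 1, 0, 1],
            [1, 0, 0, 0]]
ancestries = [[0, 0, 1, 1],
              [1, 0, 0, 1]]
X = dense_reference(calldata, ancestries, A=2)
print(X)
```

Comparing such a reference with the result of to_dense on a small file is one way to confirm that the inputs were laid out as intended.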

ancestries

Number of ancestries.

cols

Number of columns.

endian

Endianness used in the file. It is "big" if the system is big-endian, and "little" otherwise.

Note

We recommend reading and writing the file on the same machine. The .snpdat format depends on the endianness of the machine, so reading a file that was generated on a machine with a different endianness is undefined behavior.
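The following generic sketch shows why byte order matters for any binary format. It uses only the Python standard library and does not reflect the actual .snpdat layout. The value 258 is an arbitrary example.

```python
import struct
import sys

value = 258  # 0x00000102
little = struct.pack("<i", value)  # least significant byte first
big = struct.pack(">i", value)     # most significant byte first
native = struct.pack("=i", value)  # this machine's byte order

# Decoding little-endian bytes as big-endian silently yields a wrong number.
misread = struct.unpack(">i", little)[0]

print(sys.byteorder, little.hex(), big.hex(), misread)
```

`sys.byteorder` reports the byte order of the current machine as either "little" or "big", which can be recorded alongside a file when it has to be moved between machines.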

is_read

True if the IO handler has read the file content, and False otherwise.

nnz0

Number of non-zero entries for each column for haplotype 0.

nnz1

Number of non-zero entries for each column for haplotype 1.
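As a rough illustration, the per-haplotype counts could be reproduced from dense inputs as below. This assumes that entry \(jA + a\) counts the individuals whose haplotype \(k\) at SNP \(j\) carries a mutation with ancestry \(a\). That reading is an interpretation of the description above and is not confirmed by it. The helper `nnz_reference` is hypothetical.

```python
import numpy as np

def nnz_reference(calldata, ancestries, A, k):
    """Count non-zeros per column of the (n, s*A) matrix for haplotype k."""
    calldata = np.asarray(calldata)
    ancestries = np.asarray(ancestries)
    s = calldata.shape[1] // 2
    nnz = np.zeros(s * A, dtype=np.int64)
    for j in range(s):
        c = calldata[:, 2 * j + k]
        a = ancestries[:, 2 * j + k]
        # Among individuals with a mutation, count how many have each ancestry.
        nnz[j * A:(j + 1) * A] = np.bincount(a[c != 0], minlength=A)
    return nnz

# Same layout as write(): n = 2 individuals, s = 2 SNPs, A = 2 ancestries.
calldata = [[1, 1, 0, 1],
            [1, 0, 0, 0]]
ancestries = [[0, 0, 1, 1],
              [1, 0, 0, 1]]
nnz0 = nnz_reference(calldata, ancestries, A=2, k=0)
nnz1 = nnz_reference(calldata, ancestries, A=2, k=1)
print(nnz0, nnz1)
```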

rows

Number of rows.

snps

Number of SNPs.