adelie.io.snp_phased_ancestry
- class adelie.io.snp_phased_ancestry(filename: str, read_mode: str = 'file')
IO handler for a SNP phased, ancestry matrix.
A SNP phased, ancestry matrix is a matrix that combines phased calldata with local ancestry information.
Let \(X \in \mathbb{R}^{n \times sA}\) denote such a matrix, where \(n\) is the number of samples, \(s\) is the number of SNPs, and \(A\) is the number of ancestries. Each group of \(A\) contiguous columns is called an ancestry block and corresponds to a single SNP; every block has the same structure, described next. Let \(H \in \mathbb{R}^{n \times A}\) denote the ancestry block for some SNP. Then \(H = H_0 + H_1\), where \(H_k \in \mathbb{R}^{n \times A}\) represents the phased calldata marked by the ancestry indicator for haplotype \(k \in \{0,1\}\) of that SNP, that is,
\[\begin{split}\begin{align*} H_k = \begin{bmatrix} \unicode{x2014} & \delta^k_1 \cdot e_{a^k_1}^\top & \unicode{x2014} \\ \unicode{x2014} & \delta^k_2 \cdot e_{a^k_2}^\top & \unicode{x2014} \\ \vdots & \vdots & \vdots \\ \unicode{x2014} & \delta^k_n \cdot e_{a^k_n}^\top & \unicode{x2014} \\ \end{bmatrix} \end{align*}\end{split}\]
where, for each individual \(i\) and haplotype \(k\), \(\delta^k_i \in \{0,1\}\) is \(1\) if and only if there is a mutation, and \(a^k_i \in \{1,\ldots,A\}\) is the ancestry label. Here, \(e_j \in \mathbb{R}^A\) is the \(j\)-th standard basis vector.
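The definition above can be sketched directly in NumPy. This is only an illustration of the matrix structure, not adelie's implementation: the helper name `dense_phased_ancestry` is hypothetical, the input layout (`calldata[i, 2*j+k]`, `ancestries[i, 2*j+k]`) is the one documented for `write`, and ancestry labels are taken to be 0-based as in `write` rather than 1-based as in the formula.

```python
import numpy as np


def dense_phased_ancestry(calldata, ancestries, A):
    """Build X of shape (n, s*A) from calldata and ancestries of shape (n, 2*s).

    Hypothetical helper for illustration; ancestry labels are 0-based.
    """
    calldata = np.asarray(calldata, dtype=np.int8)
    ancestries = np.asarray(ancestries, dtype=np.intp)
    n, two_s = calldata.shape
    s = two_s // 2
    X = np.zeros((n, s * A), dtype=np.int8)
    rows = np.arange(n)
    for j in range(s):          # one ancestry block per SNP
        for k in range(2):      # H = H_0 + H_1
            delta = calldata[:, 2 * j + k]   # mutation indicator per individual
            a = ancestries[:, 2 * j + k]     # ancestry label per individual
            # Row i of H_k is delta_i * e_{a_i}; rows are distinct, so the
            # fancy-indexed in-place add touches each entry at most once.
            X[rows, j * A + a] += delta
    return X


calldata = [[1, 1, 0, 1],
            [1, 0, 1, 1]]
ancestries = [[0, 0, 1, 1],
              [1, 0, 0, 1]]
X = dense_phased_ancestry(calldata, ancestries, A=2)
print(X)
```

An entry equals 2 when both haplotypes of an individual carry a mutation with the same ancestry at that SNP, as in the first row and first column of this example.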
- Parameters:
- filename : str
File name containing the SNP data in .snpdat format.
- read_mode : str, optional
Reading mode of the SNP data. It must be one of the following:
- "file": reads the file using standard file IO. This method is the most general and portable; however, with large files, it is the slowest option.
- "mmap": reads the file using mmap. This method is only supported on Linux and MacOS. It is the most efficient way to read large files.
Default is "file".
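To make the difference between the two reading modes concrete, here is a minimal standard-library sketch of reading a file with ordinary file IO and with a memory map. It only illustrates the general technique; how adelie implements `"file"` and `"mmap"` internally is not described here, and the helper names and the demo file are hypothetical.

```python
import mmap
import os
import tempfile


def read_with_file_io(path):
    """Read the whole file into memory through a regular file handle."""
    with open(path, "rb") as f:
        return f.read()


def read_with_mmap(path):
    """Map the file into memory; pages are loaded by the OS on access."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Copy to bytes only for this demo; real code would usually
            # slice the mapping and avoid copying the whole file.
            return bytes(mm)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.bin")
    with open(path, "wb") as f:
        f.write(bytes(range(16)))
    via_file = read_with_file_io(path)
    via_mmap = read_with_mmap(path)

print(via_file == via_mmap)
```

Both approaches yield the same bytes; they differ in when and how the data is brought into memory.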
Methods
- __init__(self, filename, read_mode)
- read(self): Reads and loads the matrix from file.
- to_dense(self[, n_threads]): Creates a dense SNP phased, ancestry matrix from the file.
- write(calldata, ancestries, A[, n_threads]): Writes a dense SNP phased, ancestry matrix to the file in .snpdat format.
Attributes
- ancestries: Number of ancestries.
- cols: Number of columns.
- endian: Endianness used in the file.
- is_read: True if the IO handler has read the file content and False otherwise.
- nnz0: Number of non-zero entries for each column for haplotype 0.
- nnz1: Number of non-zero entries for each column for haplotype 1.
- rows: Number of rows.
- snps: Number of SNPs.
- __init__(self: adelie.adelie_core.io.IOSNPPhasedAncestry, filename: str, read_mode: str) -> None
- read(self: adelie.adelie_core.io.IOSNPBase) -> int
Reads and loads the matrix from file.
- Returns:
- total_bytes : int
Number of bytes read.
- to_dense(self: adelie.adelie_core.io.IOSNPPhasedAncestry, n_threads: int = 1) -> numpy.ndarray[numpy.int8[m, n]]
Creates a dense SNP phased, ancestry matrix from the file.
- Parameters:
- n_threads : int, optional
Number of threads. Default is 1.
- Returns:
- dense : (n, s*A) ndarray
Dense SNP phased, ancestry matrix.
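A possible round trip through the handler (write, then read, then to_dense) is sketched below, using only the methods and signatures listed on this page. Several details are assumptions rather than documented facts: that the inputs should be `int8` arrays, that `read` must be called before `to_dense`, and the file name `demo.snpdat`. The adelie calls are skipped when the package is not installed.

```python
import os
import tempfile

import numpy as np

n, s, A = 4, 3, 2
rng = np.random.default_rng(0)
# Assumption: int8 inputs, mirroring the int8 dtype returned by to_dense.
calldata = rng.integers(0, 2, size=(n, 2 * s)).astype(np.int8)
ancestries = rng.integers(0, A, size=(n, 2 * s)).astype(np.int8)

try:
    import adelie as ad
except ImportError:
    ad = None  # adelie is not installed; skip the round trip

dense = None
if ad is not None:
    with tempfile.TemporaryDirectory() as tmp:
        # "demo.snpdat" is a hypothetical file name.
        handler = ad.io.snp_phased_ancestry(os.path.join(tmp, "demo.snpdat"))
        handler.write(calldata, ancestries, A)  # serialize to .snpdat
        handler.read()                          # load the file content
        dense = handler.to_dense()              # expected shape: (n, s*A)
    print(dense.shape)
```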
- write(calldata: ndarray, ancestries: ndarray, A: int, n_threads: int = 1)
Writes a dense SNP phased, ancestry matrix to the file in .snpdat format.
Note
The calldata and ancestries matrices must not contain any missing values.
- Parameters:
- calldata : (n, 2*s) ndarray
SNP phased calldata in dense format. calldata[i, 2*j+k] is the data for individual i, SNP j, and haplotype k. It must only contain values in \(\{0,1\}\).
- ancestries : (n, 2*s) ndarray
Local ancestry information in dense format. ancestries[i, 2*j+k] is the ancestry for individual i, SNP j, and haplotype k. It must only contain values in \(\{0,\ldots, A-1\}\).
- A : int
Number of ancestries.
- n_threads : int, optional
Number of threads. Default is 1.
- Returns:
- total_bytes : int
Number of bytes written.
- benchmark : dict
Dictionary of benchmark timings for each step of the serializer.
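Since `write` requires inputs without missing values and with restricted value ranges, it can be useful to check them beforehand. The following is a minimal sketch of such a check based on the constraints stated above; `check_write_inputs` is a hypothetical helper and is not part of adelie.

```python
import numpy as np


def check_write_inputs(calldata, ancestries, A):
    """Validate the documented constraints and return (n, s).

    Hypothetical helper, not part of adelie.
    """
    calldata = np.asarray(calldata)
    ancestries = np.asarray(ancestries)
    if calldata.ndim != 2 or calldata.shape[1] % 2 != 0:
        raise ValueError("calldata must have shape (n, 2*s)")
    if ancestries.shape != calldata.shape:
        raise ValueError("ancestries must have the same shape as calldata")
    # NaN fails both membership tests below, so missing values are rejected.
    if not np.isin(calldata, (0, 1)).all():
        raise ValueError("calldata must only contain values in {0, 1}")
    if not ((ancestries >= 0) & (ancestries < A)).all():
        raise ValueError("ancestries must only contain values in {0, ..., A-1}")
    n, two_s = calldata.shape
    return n, two_s // 2


calldata = np.array([[1, 1, 0, 1],
                     [1, 0, 1, 1]])
ancestries = np.array([[0, 0, 1, 1],
                       [1, 0, 0, 1]])
n, s = check_write_inputs(calldata, ancestries, A=2)
print(n, s)
```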
- ancestries
Number of ancestries.
- cols
Number of columns.
- endian
Endianness used in the file. It is "big" if the system is big-endian and "little" otherwise.
Note
We recommend reading and writing the file on the same machine. The .snpdat format depends on the endianness of the machine, so reading a file that was generated on a machine with a different endianness is undefined behavior.
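The reason endianness matters can be seen with the standard library alone: the same integer is stored as different byte sequences depending on byte order. The sketch below also shows one way to compare a recorded endianness string against the current machine; `same_endianness` is a hypothetical helper, and the exact byte layout of `.snpdat` is not assumed.

```python
import struct
import sys


def same_endianness(file_endian):
    """Return True if `file_endian` ("little" or "big") matches this machine.

    Hypothetical helper, not part of adelie.
    """
    if file_endian not in ("little", "big"):
        raise ValueError("file_endian must be 'little' or 'big'")
    return file_endian == sys.byteorder


little = struct.pack("<i", 1)   # 4-byte integer 1, little-endian
big = struct.pack(">i", 1)      # 4-byte integer 1, big-endian
native = struct.pack("=i", 1)   # 4-byte integer 1, this machine's byte order

# Interpreting little-endian bytes as big-endian silently gives a wrong value.
misread = int.from_bytes(little, "big")
print(sys.byteorder, misread)
```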
- is_read
True if the IO handler has read the file content and False otherwise.
- nnz0
Number of non-zero entries for each column for haplotype 0.
- nnz1
Number of non-zero entries for each column for haplotype 1.
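One plausible reading of these two attributes is that they count, for each of the \(sA\) columns, the non-zero entries contributed by \(H_0\) and \(H_1\) respectively. Under that interpretation (an assumption, not confirmed by this page), the counts can be reproduced from dense inputs as follows; `haplotype_nnz` is a hypothetical helper and ancestry labels are 0-based.

```python
import numpy as np


def haplotype_nnz(calldata, ancestries, A, k):
    """Per-column non-zero counts of haplotype k, as a length s*A vector.

    Hypothetical helper based on an assumed reading of nnz0/nnz1.
    """
    calldata = np.asarray(calldata)
    ancestries = np.asarray(ancestries)
    n, two_s = calldata.shape
    s = two_s // 2
    nnz = np.zeros(s * A, dtype=np.int64)
    for j in range(s):
        delta = calldata[:, 2 * j + k]
        a = ancestries[:, 2 * j + k]
        # Count mutated individuals per ancestry within this SNP's block.
        nnz[j * A:(j + 1) * A] = np.bincount(a[delta == 1], minlength=A)
    return nnz


calldata = np.array([[1, 1, 0, 1],
                     [1, 0, 1, 1]])
ancestries = np.array([[0, 0, 1, 1],
                       [1, 0, 0, 1]])
nnz0 = haplotype_nnz(calldata, ancestries, A=2, k=0)
nnz1 = haplotype_nnz(calldata, ancestries, A=2, k=1)
print(nnz0, nnz1)
```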
- rows
Number of rows.
- snps
Number of SNPs.