# Introduction

The genio (GenIO = Genetics I/O) package aims to facilitate reading and writing genetics data. The focus of this vignette is processing plink BED/BIM/FAM files.

There are some limited alternatives for reading and/or writing BED files in R, which are slower and harder to use, which motivated me to write this package. Here we make direct comparisons to those packages, to illustrate the advantages of genio.

# Memory estimation

One drawback of genio and other related approaches is high memory consumption (see more on that at the end of this vignette). It is entirely possible that an attempt to parse a genotype matrix into memory will fail with an “out of memory” error message. Let’s estimate memory usage here.

Genotypes are stored as integers by genio, which in R on a modern machine (64 bit architecture) consumes 4 bytes! So an $$n \times m$$ matrix takes up at least $$4 n m$$ bytes (an R matrix contains an additional constant overhead, which does not depend on $$n,m$$ and is relatively small for large matrices).

To get an idea of how much this is, let’s assume a standard genotyping array, where $$m$$ is about half a million. In this scenario, we get a simple rule of thumb that every 1000 individuals consume a little less than 2G:

# Constants
bytes_per_genotype <- 4
bytes_per_gb <- 1024 ^ 3
# Example data dimensions
num_ind <- 1000
num_loci <- 500000
# Gigabytes per 1000 individuals for a typical genotyping array
bytes_per_genotype * num_ind * num_loci / bytes_per_gb
#> [1] 1.862645

This is the amount of free memory required just to load the matrix. Several common matrix operations for genotypes consume even more memory, so more free memory will be needed to accomplish anything.

# Comparison to other packages

There are several R packages that provide BED readers, and to a lesser extent some of the other functionality of genio. Here we compare directly to BEDMatrix, snpStats, and lfa. Each of these packages has different goals so they are optimized for their use cases. The genio parser is optimized to read the entire genotype data into a native R matrix, so that it is easy to inspect and manipulate for R beginners. Here we demonstrate that genio is not only the fastest at this task, but also the easiest to obtain in terms of requiring the least amount of coding.

## BEDMatrix

The BEDMatrix package allows access to genotypes from a BED file without loading the entire matrix in memory. This is very convenient for large datasets, as long as only small portions of the data are pulled at once. However, there are some disadvantages:

• The BEDMatrix return object is not a regular R matrix, which confuses some users.
• The BEDMatrix package provides no way to access the full annotation tables (BIM and FAM files).
• From BIM, only the locus ID and reference allele are returned, merged into a vector of strings and placed as column names.
• From FAM, only the family and individual IDs are returned, merged into a vector of strings and placed as row names.
• BEDMatrix does not have any write functions.

Here is an example of that usage and how the data compares. Note in particular that the BEDMatrix representation of the genotype matrix is transposed compared to the genio matrix:

Therefore, for very large datasets, if it suffices to access the genotype data in small portions and the user is willing to deal with the limitations of the object, and no detailed annotation table information is required, then BEDMatrix is a better solution than the read_bed function provided in the genio package.

To compare the two genotype reading functions on an equal footing, let us assume that the entire genotype matrix is required in memory and stored in a regular R matrix.

## snpStats

The snpStats package is the most fully-featured alternative to genio. One of its advantages is its memory efficiency, encoding the entire data in less memory than a native R matrix. However, this memory efficiency comes at the cost of making the data harder to access, especially for inexperienced R users. Like genio, and in contrast to the other packages, snpStats also reads the BIM and FAM tables fully, and provides a BED writer, but it is considerably harder to use.

First we illustrate data parsing. The annotation tables are similar but column names are different, and certain missing values (zero in text files) are converted to NA instead.

As for BEDMatrix, assuming we ultimately desire to convert the entire data into a regular R matrix, an extra step is required:

snpStats is the only package other than genio to provide a BED writer. Here we demonstrate how hard it is to use it to write our data.

# Let's write this to another file

# Copy objects to not change originals
X_snpStats <- X
bim_snpStats <- as.data.frame(bim) # to use rownames
fam_snpStats <- as.data.frame(fam) # ditto

# All data requires matching row and/or column names.
# These first two were already done above.
#rownames(X_snpStats) <- bim$id #colnames(X_snpStats) <- fam$id
# Row names here are redundant but required.
rownames(bim_snpStats) <- bim$id rownames(fam_snpStats) <- fam$id

# We shall time several required steps in order to write genotypes in a standard R matrix,
# and the related annotation tables, to BED.
time_write_snpStats <- system.time({
# Transpose and convert our genotypes to SnpMatrix object.
# We flip 0s and 2s before converting
X_snpStats <- as(2 - t(X_snpStats), 'SnpMatrix')

# This complicated command is required to write the data.
# Although X, fam, and bim are passed as for genio's write_plink,
# here the name of every column must be specified (there are no reasonable defaults).
# Interestingly, the parameter names of snpStats' write.plink
snps = X_snpStats,
subject.data = fam_snpStats,
pedigree = fam,
id = id,
father = pat,
mother = mat,
sex = sex,
phenotype = pheno,
snp.data = bim_snpStats,
chromosome = chr,
genetic.distance = posg,
position = pos,
allele.1 = ref,
allele.2 = alt
)
})
#> Writing FAM file to /tmp/Rtmp6m33wC/vignette-random-data-copy750038649398.fam
#> Writing extended MAP file to /tmp/Rtmp6m33wC/vignette-random-data-copy750038649398.bim
#> Writing BED file to /tmp/Rtmp6m33wC/vignette-random-data-copy750038649398.bed (SNP-major mode)

# remove the new file, no longer need it

## LFA

The lfa package also provides a minimal BED parser, ignoring the BIM and FAM files except to count loci and individuals. Its genotype matrix is exactly like genio’s: same class (R matrix) and orientation (loci along rows). The package does not provide file writers or other such utilities.

## Overall time comparisons

Note that reading performance varies on different machines depending on the balance between hard drive access and processor speeds (where the relative bottleneck is). That being said, the genio reader is consistently the fastest, if not a close second (see below for tests on your machine), for several reasons. Genotypes are read very efficiently using Rcpp code, and stored directly into an R matrix, which is most efficient if that is the end goal. The annotation tables are also read most efficiently, using the readr package internally.

The BEDMatrix reader is also written in Rcpp, but its main goal is to provide efficient random access to genotypes stored in a file, which makes obtaining an R matrix more awkward, though surprisingly without paying a time penalty for it. The annotation tables are read with data.table, which is actually the fastest, though the difference in performance is small compared to readr for reasonably-sized files. However, BEDMatrix does not process or return full annotation tables, which gives it an unfair advantage compared to genio, as BEDMatrix takes shortcuts to only read the columns it needs. If the annotation tables are needed, reading them will incur an additional (and in some cases considerable) time penalty.

The snpStats reader is written in C, so it is also very fast for its initial step, but converting the genotypes from the snpStats format into a native R matrix proves too expensive. The annotation tables are read with the base function read.table, which is also the slowest. In terms of completeness of output (full genotypes and annotations tables), only this package matches genio, which makes it the fairest comparison.

The lfa reader is written in pure R, which explains why it is the slowest in this comparison. The annotation tables are read with read.table, as in snpStats, but are not returned at all.

Now we compare the writers. Only snpStats and genio have a BED writer. Not only was the genio writer much easier to use, it is also considerably faster.

## Memory comparison

High memory consumption is the main sacrifice for the ease of use of native R matrices, and in this respect genio consumes the most memory (second only to lfa).

We use the pryr package to compare the memory usage of each native genotype object.

The BEDMatrix package is ideal for very large files, since data is loaded into memory only as needed. So the main object does not actually hold any genotypes in memory. It is up to the user to decide how much data to access at once to trade-off speed and memory consumption.

snpStats manages memory better than genio, but strictly by a factor of 4. Each genotype is encoded as an integer in R (in general) and genio (in particular), which uses 4 bytes (this is actually platform dependent). On the other hand, snpStats uses strictly one byte per genotype.

lfa uses 8 bytes per genotype because its genotype matrix treats is elements as doubles rather than integers, so it uses twice as much memory as genio.

Overall, there is a narrow window in data sizes in which a genotype matrix is too large for a native R matrix but small enough for a snpStats object. That, and the much more complex interface of snpStats, makes it not very worthwhile in my opinion, unless there is need for functions provided only by that package. I prefer to use BEDMatrix for large datasets.

# Cleanup

This handy function removes the three BED/BIM/FAM files we generated at the beginning of this vignette.

Let’s close by showing the package versions used when this vignette was last built, as the implementations compared here could change in the future.

sessionInfo()
#> R version 3.5.3 (2019-03-11)
#> Platform: x86_64-redhat-linux-gnu (64-bit)
#> Running under: Fedora 30 (Workstation Edition)
#>
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
#>
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> other attached packages:
#> [1] pryr_0.1.4      lfa_1.12.0      snpStats_1.32.0 Matrix_1.2-15
#> [5] survival_2.43-3 BEDMatrix_1.5.0 genio_1.0.10
#>
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.1          compiler_3.5.3      pillar_1.4.1
#>  [4] tools_3.5.3         zlibbioc_1.28.0     crochet_2.1.0
#>  [7] zeallot_0.1.0       digest_0.6.19       evaluate_0.13
#> [10] tibble_2.1.1        lattice_0.20-38     pkgconfig_2.0.2
#> [13] rlang_0.3.4         cli_1.1.0           yaml_2.2.0
#> [16] parallel_3.5.3      xfun_0.7            stringr_1.4.0
#> [19] knitr_1.23          hms_0.4.2           vctrs_0.1.0
#> [22] grid_3.5.3          R6_2.4.0            fansi_0.4.0