Handling Genome-Wide Association study data in Julia

Objectives

The dimensions of data on DNA variation such as single nucleotide polymorphisms or SNPs can be very large, involving thousands or millions of SNPs, measured on potentially thousands of individuals. Typical genotyping platforms may examine from 50K(K=thousand) to 2.5M (M= millions) SNPs. Some platforms could be even denser. There are 2 nucleotides (A, C, G or T) at each position (one on each chromosome). If the genotyping read is not sufficiently good, a missing value could be recorded in one or both chromosomes for that position/SNP. A frequently used re-codification of the nocleotide data is to replace the characters (i.e. alleles) by the count of the allele with the lower frequency in the sample, or according to a pre-specified allele as determined in the genotyping platform and software. Thus, instead of storing a pair of nucleotides (e.g., AA, AG, GG), researchers store the individual’s genotype as either 0,1,2, or NA. In this coding scheme, 0 and 2 indicate homozygotes (e.g. AA, GG), and1 is the heterozygote (e.g. AG) for that SNP. Another possible re-coding could be -1,0,1 where 0 is the heterozygote.

The documentation for the plink software describes some common formats in which this information can be presented.

Because such a study can represent an enormous amount of information, it is important to store and manipulate it in a compact format if possible. For example, the BED file format collapses the data on each SNP/individual to 2 bits.

The BGLR package for R provides functions for reading a BED-format file in R and provides a moderate sized result in the mouse dataset.

> library(BGLR)
Package 'BGLR', 1.0 (2013-05-17). 
Type 'help(BGLR)' for summary information
> data("mouse")
> ls()
[1] "mouse.A"     "mouse.pheno" "mouse.X"    
> str(mouse.X)
 num [1:1825, 1:10346] 1 1 2 1 2 1 1 1 1 2 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:1825] "A048005080" "A048006063" "A048006555" "A048007096" ...
  ..$ : chr [1:10346] "rs3683945_G" "rs3707673_G" "rs6269442_G" "rs6336442_G" ...
> object.size(mouse.X)
151843784 bytes

This numeric (i.e. double-precision floating point) matrix provides the 0,1,2 encoding of 10,346 SNPs on 1825 mice. Although the numeric format is expensive in terms of storage, it provides easy access to numeric manipulations. In this gist we will show calculation of the 1825 by 1825 positive-semidefinite symmetric matrix formed by standardizing each column to have mean zero and unit standard deviation then forming what is called in R the transposed cross-product matrix and in genetics the genomic relationship matrix.

> str(XsXst <- tcrossprod(scale(mouse.X, TRUE, TRUE)))
 num [1:1825, 1:1825] 9595 -626 191 425 1097 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:1825] "A048005080" "A048006063" "A048006555" "A048007096" ...
  ..$ : chr [1:1825] "A048005080" "A048006063" "A048006555" "A048007096" ...
> system.time(XsXst <- tcrossprod(scale(mouse.X, TRUE, TRUE)))
   user  system elapsed 
  8.044   0.728   5.824

This is reasonable performance but the size of the arrays will become prohibitive with larger studies. In Julia it is possible to maintain compact representations and to perform such computations reasonably quickly.

Moving the data to Julia

It is possible to read saved R data sets in Julia but for this illustration we will write the data in comma-separated value (csv) format to a file and read it into Julia. First, write the file from R

> unique(as.vector(mouse.X))
[1] 1 2 0
> any(is.na(mouse.X))
[1] FALSE
> write.table(mouse.X, file="/tmp/mouse.csv", row.names=FALSE, col.names=FALSE, sep=',')

then read it into Julia.

julia> x = readcsv("/tmp/mouse.csv",Uint8)
1825x10346 Uint8 Array:
 0x01  0x01  0x01  0x01  0x00  0x01  0x00  0x01  …  0x00  0x00  0x00  0x00  0x00  0x00  0x01  0x01
 0x01  0x01  0x02  0x01  0x01  0x01  0x01  0x02     0x00  0x00  0x00  0x00  0x00  0x00  0x02  0x00
 0x02  0x00  0x02  0x02  0x00  0x02  0x00  0x02     0x00  0x00  0x00  0x00  0x00  0x00  0x02  0x02
 0x01  0x01  0x01  0x01  0x01  0x01  0x01  0x02     0x00  0x00  0x00  0x00  0x00  0x00  0x02  0x00
 0x02  0x00  0x02  0x02  0x00  0x02  0x00  0x02     0x01  0x01  0x01  0x01  0x01  0x01  0x01  0x01
 0x01  0x01  0x01  0x01  0x00  0x01  0x00  0x01  …  0x00  0x00  0x00  0x00  0x00  0x00  0x02  0x02
 0x01  0x01  0x01  0x01  0x00  0x01  0x00  0x01     0x00  0x00  0x00  0x00  0x00  0x00  0x02  0x02
 0x01  0x01  0x02  0x01  0x01  0x01  0x01  0x02     0x00  0x00  0x00  0x00  0x00  0x00  0x02  0x01
 0x01  0x01  0x02  0x01  0x01  0x01  0x01  0x02     0x00  0x00  0x00  0x00  0x00  0x00  0x02  0x00
 0x02  0x00  0x02  0x02  0x00  0x02  0x00  0x02     0x01  0x01  0x01  0x01  0x01  0x00  0x01  0x01
    ⋮                             ⋮              ⋱                 ⋮                             ⋮
 0x02  0x00  0x02  0x02  0x00  0x02  0x00  0x02  …  0x00  0x00  0x00  0x00  0x00  0x00  0x02  0x02
 0x01  0x01  0x01  0x01  0x01  0x01  0x01  0x02     0x00  0x00  0x00  0x00  0x00  0x00  0x02  0x01
 0x02  0x00  0x02  0x02  0x00  0x02  0x00  0x02     0x00  0x00  0x00  0x00  0x00  0x00  0x02  0x02
 0x01  0x01  0x01  0x01  0x00  0x01  0x00  0x01     0x00  0x00  0x00  0x00  0x00  0x00  0x02  0x00
 0x02  0x00  0x02  0x02  0x00  0x02  0x00  0x02     0x00  0x00  0x00  0x00  0x00  0x00  0x02  0x01
 0x01  0x01  0x01  0x01  0x00  0x01  0x00  0x01  …  0x00  0x00  0x00  0x00  0x00  0x00  0x02  0x02
 0x01  0x01  0x02  0x01  0x01  0x01  0x01  0x02     0x00  0x00  0x00  0x00  0x00  0x00  0x02  0x02
 0x01  0x01  0x02  0x01  0x01  0x01  0x01  0x02     0x00  0x00  0x00  0x00  0x00  0x00  0x02  0x02
 0x01  0x01  0x01  0x01  0x01  0x01  0x01  0x02     0x00  0x01  0x00  0x00  0x00  0x02  0x02  0x00
 0x00  0x02  0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x02  0x02  0x00

We have checked that all the values are 0, 1 or 2 so we can store this information as a matrix of unsigned 8-bit integers, represented here in hexadecimal notation. Later we will show a more compact representation of the data as a 2 by 1825 by 10346 bit-array. A further enhancement is to use memory-mapped arrays so that we can work with arrays larger than can fit comfortably in memory.

Manipulating the data

We want to work with standardized columns but without generating the entire matrix of standardized values. Because we start off with only 3 possible distinct values in a column we will only get 3 distinct values after standardization so we can keep the information in the column stored as 0, 1 and 2 and simply calculate and store the three distinct values that will occur in a column.

For each column we count the number of zeros, ones and twos and use those to evaluate the mean, the centered values, the standard deviation and the standardized values.

First counting the incidence of zeros, ones and twos. Those familiar with R or Matlab/octave programming know you should try to operate on the whole object if possible. Looping over individual elements of an array can be horribly slow. In Julia you do not need to fear the loop. My first attempt at this was

function colcounts(x::Matrix)
    m,n = size(x)
    counts = zeros(Int, 3, n)
    for j in 1:n, i in 1:m
        xij = x[i,j]
        if xij == 0 counts[1,j] += 1
        elseif xij == 1 counts[2,j] += 1
        elseif xij == 2 counts[3,j] += 1
        else error("x[$i,$j] = $xij should be 0, 1 or 2")
        end
    end
end

and that worked reasonably well.

julia> colcounts(x)
3x10346 Int64 Array:
 344  545  242  345  1373  345  1339  181  347  …  1511  1515  1508  1507  1509  1400    92  637
 935  935  837  935   420  935   454  746  933      256   251   258   260   256   295   304  516
 546  345  746  545    32  545    32  898  545       58    59    59    58    60   130  1429  672

julia> @elapsed colcounts(x)
0.115006816

Notice that the += operator for incrementing is available in Julia. Also, we declare the type of the argument x to be a Matrix as it must be. This will define a method for the colcounts function.

We can be more specific and define the method for the type Matrix{Uint8} but we may find we want to use other integer representations. We can define a templated class of methods by changing the declaration to

function colcounts{T <: Integer}(x::Matrix{T})

(If you have ever done template programming in a language like C++ this will look ridiculously easy in comparison.)

As far as the algorithm itself goes, using if, elseif, etc. on the values of the array elements is a bit wasteful because you can calculate the row index in counts from the value of xij. At this point we realize that we don't need the temporary value xij and can shorten the loop to

function colcounts{T <: Integer}(x::Matrix{T})
    m,n = size(x)
    counts = zeros(Int, 3, n)
    for j in 1:n, i in 1:m counts[x[i,j]+1,j] += 1 end
    counts
end

giving us

julia> countcols(x)
3x10346 Int64 Array:
 344  545  242  345  1373  345  1339  181  347  …  1511  1515  1508  1507  1509  1400    92  637
 935  935  837  935   420  935   454  746  933      256   251   258   260   256   295   304  516
 546  345  746  545    32  545    32  898  545       58    59    59    58    60   130  1429  672

julia> @elapsed countcols(x)
0.044554878

There is one further simplification we can make here. Indices in Julia are 1-based which means that the three rows in the counts matrix are numbered 1, 2 and 3 whereas the values in the array x are 0, 1 and 2. Instead of adding 1 to the matrix element every time we wish to use it, we can change the coding in x to 1, 2 and 3 and leave the value 0 to represent missing data.

julia> x += 1
1825x10346 Uint8 Array:
 0x02  0x02  0x02  0x02  0x01  0x02  0x01  0x02  …  0x01  0x01  0x01  0x01  0x01  0x01  0x02  0x02
 0x02  0x02  0x03  0x02  0x02  0x02  0x02  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x01
 0x03  0x01  0x03  0x03  0x01  0x03  0x01  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x03
 0x02  0x02  0x02  0x02  0x02  0x02  0x02  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x01
 0x03  0x01  0x03  0x03  0x01  0x03  0x01  0x03     0x02  0x02  0x02  0x02  0x02  0x02  0x02  0x02
 0x02  0x02  0x02  0x02  0x01  0x02  0x01  0x02  …  0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x03
 0x02  0x02  0x02  0x02  0x01  0x02  0x01  0x02     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x03
 0x02  0x02  0x03  0x02  0x02  0x02  0x02  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x02
 0x02  0x02  0x03  0x02  0x02  0x02  0x02  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x01
 0x03  0x01  0x03  0x03  0x01  0x03  0x01  0x03     0x02  0x02  0x02  0x02  0x02  0x01  0x02  0x02
    ⋮                             ⋮              ⋱                 ⋮                             ⋮
 0x03  0x01  0x03  0x03  0x01  0x03  0x01  0x03  …  0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x03
 0x02  0x02  0x02  0x02  0x02  0x02  0x02  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x02
 0x03  0x01  0x03  0x03  0x01  0x03  0x01  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x03
 0x02  0x02  0x02  0x02  0x01  0x02  0x01  0x02     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x01
 0x03  0x01  0x03  0x03  0x01  0x03  0x01  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x02
 0x02  0x02  0x02  0x02  0x01  0x02  0x01  0x02  …  0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x03
 0x02  0x02  0x03  0x02  0x02  0x02  0x02  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x03
 0x02  0x02  0x03  0x02  0x02  0x02  0x02  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x03
 0x02  0x02  0x02  0x02  0x02  0x02  0x02  0x03     0x01  0x02  0x01  0x01  0x01  0x03  0x03  0x01
 0x01  0x03  0x01  0x01  0x01  0x01  0x01  0x01     0x01  0x01  0x01  0x01  0x01  0x03  0x03  0x01

In general we should be careful about the types of constants that we combine with array elements because we could end up changing the storage representation (although that didn't happen here). The safer expression would be

x += one(eltype(x))

The loop in countcols should now be

    for j in 1:n, i in 1:m counts[x[i,j],j] += 1 end

From here we can calculate the standardized values but this is about the time we should consider a user-defined type to keep this information organized.

Creating a type to store the information

Like methods, types in Julia can be templated. A type to store the array values, the counts, the centered and scaled values and the possible row and column names is

    type SNP{T<:Integer}
        mat::Matrix{T}
        counts::Matrix{Int}
        vals::Matrix{Float64}
        colnames::Vector{ASCIIString}
        rownames::Vector{ASCIIString}
    end

We will do the work of counting the incidence of (now) 1's, 2's and 3's in each column in an external constructor for the type

    function SNP{T<:Integer}(x::Matrix{T})
        m,n = size(x)
        counts = zeros(Int, (3, n))
        vals = Array(Float64, (3, n))
        for j in 1:n
            for i in 1:m counts[x[i,j],j] += 1 end
            mn = (counts[2,j] + 2*counts[3,j])/m
            dev = [0.:2.] - mn
            stddev = sqrt(sum(counts[:,j] .* dev .* dev)/float(m-1))
            vals[:,j] = dev/stddev
        end
        SNP{T}(x, counts, vals, ASCIIString[], ASCIIString[])
    end

For calculating the column mean, mn, and standard deviation, stddev, we revert to the 0,1,2 coding because it is slightly easier. The standardized values will be invariant under linear transformation of the original values.

Applying this to x produces

julia> ss = SNP(x)
SNP{Uint8}(1825x10346 Uint8 Array:
 0x02  0x02  0x02  0x02  0x01  0x02  0x01  0x02  …  0x01  0x01  0x01  0x01  0x01  0x01  0x02  0x02
 0x02  0x02  0x03  0x02  0x02  0x02  0x02  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x01
 0x03  0x01  0x03  0x03  0x01  0x03  0x01  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x03
 0x02  0x02  0x02  0x02  0x02  0x02  0x02  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x01
 0x03  0x01  0x03  0x03  0x01  0x03  0x01  0x03     0x02  0x02  0x02  0x02  0x02  0x02  0x02  0x02
 0x02  0x02  0x02  0x02  0x01  0x02  0x01  0x02  …  0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x03
 0x02  0x02  0x02  0x02  0x01  0x02  0x01  0x02     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x03
 0x02  0x02  0x03  0x02  0x02  0x02  0x02  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x02
 0x02  0x02  0x03  0x02  0x02  0x02  0x02  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x01
 0x03  0x01  0x03  0x03  0x01  0x03  0x01  0x03     0x02  0x02  0x02  0x02  0x02  0x01  0x02  0x02
    ⋮                             ⋮              ⋱                 ⋮                             ⋮
 0x03  0x01  0x03  0x03  0x01  0x03  0x01  0x03  …  0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x03
 0x02  0x02  0x02  0x02  0x02  0x02  0x02  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x02
 0x03  0x01  0x03  0x03  0x01  0x03  0x01  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x03
 0x02  0x02  0x02  0x02  0x01  0x02  0x01  0x02     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x01
 0x03  0x01  0x03  0x03  0x01  0x03  0x01  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x02
 0x02  0x02  0x02  0x02  0x01  0x02  0x01  0x02  …  0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x03
 0x02  0x02  0x03  0x02  0x02  0x02  0x02  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x03
 0x02  0x02  0x03  0x02  0x02  0x02  0x02  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x03
 0x02  0x02  0x02  0x02  0x02  0x02  0x02  0x03     0x01  0x02  0x01  0x01  0x01  0x03  0x03  0x01
 0x01  0x03  0x01  0x01  0x01  0x01  0x01  0x01     0x01  0x01  0x01  0x01  0x01  0x03  0x03  0x01,3x10346 Int64 Array:
 344  545  242  345  1373  345  1339  181  347  …  1511  1515  1508  1507  1509  1400    92  637
 935  935  837  935   420  935   454  746  933      256   251   258   260   256   295   304  516
 546  345  746  545    32  545    32  898  545       58    59    59    58    60   130  1429  672,3x10346 Float64 Array:
 -1.6104    -1.29069   -1.87074   -1.6084    …  -0.430103  -0.51092  -3.17986   -1.20338  
 -0.160484   0.158854  -0.404831  -0.158854      1.6575     1.16913  -1.34455   -0.0226443
  1.28943    1.6084     1.06108    1.29069       3.7451     2.84918   0.490757   1.15809  ,[],[])

julia> @elapsed ss = SNP(x)
0.431200836

We can check with R that the counts and standardized values are as expected

> xtabs(~mouse.X[,1])
mouse.X[, 1]
  0   1   2 
344 935 546 
> Xs <- scale(mouse.X)
> unique(Xs[,1])
[1] -0.1604836  1.2894303 -1.6103976

Saving and loading the data

Reading in a very large matrix from a text file and converting it to the Uint8 representation can often take longer than the actual operation on the data. Especially if the array is very large it would be advantageous to create a disk image of the array and memory-map the image. A memory-mapped array must be homogeneous so we only store the matrix of index values 1, 2 and 3 with 0 representing missing data.

First we open a file, declare the mmap_array of the appropriate dimension and type, copy the contents, synchronize the array and close the stream.

julia> s = open("/var/tmp/mouse.dat", "w+")
IOStream(<file /var/tmp/mouse.dat>)

julia> mm = mmap_array(Uint8, size(x), s)
1825x10346 Uint8 Array:
 0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00  …  0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00  …  0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00
    ⋮                             ⋮              ⋱                 ⋮                             ⋮
 0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00  …  0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00  …  0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00

julia> for j in 1:size(x,2), i in 1:size(x,1) mm[i,j] = x[i,j] end

julia> msync(mm)

julia> close(s)

julia> run(`ls -l /var/tmp/mouse.dat`)
-rw-r--r-- 1 bates bates 18881450 May 30 15:22 /var/tmp/mouse.dat

Now, on restarting julia we simply create the memory-mapped array from a read-only stream.

julia> s = open("/var/tmp/mouse.dat")
IOStream(<file /var/tmp/mouse.dat>)

julia> ss = SNP(mmap_array(Uint8, (1825,10346), s))
SNP{Uint8}(1825x10346 Uint8 Array:
 0x02  0x02  0x02  0x02  0x01  0x02  0x01  0x02  …  0x01  0x01  0x01  0x01  0x01  0x01  0x02  0x02
 0x02  0x02  0x03  0x02  0x02  0x02  0x02  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x01
 0x03  0x01  0x03  0x03  0x01  0x03  0x01  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x03
 0x02  0x02  0x02  0x02  0x02  0x02  0x02  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x01
 0x03  0x01  0x03  0x03  0x01  0x03  0x01  0x03     0x02  0x02  0x02  0x02  0x02  0x02  0x02  0x02
 0x02  0x02  0x02  0x02  0x01  0x02  0x01  0x02  …  0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x03
 0x02  0x02  0x02  0x02  0x01  0x02  0x01  0x02     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x03
 0x02  0x02  0x03  0x02  0x02  0x02  0x02  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x02
 0x02  0x02  0x03  0x02  0x02  0x02  0x02  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x01
 0x03  0x01  0x03  0x03  0x01  0x03  0x01  0x03     0x02  0x02  0x02  0x02  0x02  0x01  0x02  0x02
    ⋮                             ⋮              ⋱                 ⋮                             ⋮
 0x03  0x01  0x03  0x03  0x01  0x03  0x01  0x03  …  0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x03
 0x02  0x02  0x02  0x02  0x02  0x02  0x02  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x02
 0x03  0x01  0x03  0x03  0x01  0x03  0x01  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x03
 0x02  0x02  0x02  0x02  0x01  0x02  0x01  0x02     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x01
 0x03  0x01  0x03  0x03  0x01  0x03  0x01  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x02
 0x02  0x02  0x02  0x02  0x01  0x02  0x01  0x02  …  0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x03
 0x02  0x02  0x03  0x02  0x02  0x02  0x02  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x03
 0x02  0x02  0x03  0x02  0x02  0x02  0x02  0x03     0x01  0x01  0x01  0x01  0x01  0x01  0x03  0x03
 0x02  0x02  0x02  0x02  0x02  0x02  0x02  0x03     0x01  0x02  0x01  0x01  0x01  0x03  0x03  0x01
 0x01  0x03  0x01  0x01  0x01  0x01  0x01  0x01     0x01  0x01  0x01  0x01  0x01  0x03  0x03  0x01,3x10346 Int64 Array:
 344  545  242  345  1373  345  1339  181  347  …  1511  1515  1508  1507  1509  1400    92  637
 935  935  837  935   420  935   454  746  933      256   251   258   260   256   295   304  516
 546  345  746  545    32  545    32  898  545       58    59    59    58    60   130  1429  672,3x10346 Float64 Array:
 -1.6104    -1.29069   -1.87074   -1.6084    …  -0.430103  -0.51092  -3.17986   -1.20338  
 -0.160484   0.158854  -0.404831  -0.158854      1.6575     1.16913  -1.34455   -0.0226443
  1.28943    1.6084     1.06108    1.29069       3.7451     2.84918   0.490757   1.15809  ,[],[])

An alternative representation, providing more compact storage, would be to save the values as a BitArray of size (2, 1825, 10346).

julia> mm = ss.mat;

julia> m,n = size(mm)
(1825,10346)

julia> bb = BitArray(2,m,n);

julia> for j in 1:n, i in 1:m mij = mm[i,j]; bb[1,i,j] = bool(mij&0x01); bb[2,i,j]=bool(mij&0x02) end

julia> bb
2x1825x10346 BitArray:
[:, :, 1] =
 false  false  true  false  true  false  false  …  false  true  false  false  false  false   true
  true   true  true   true  true   true   true      true  true   true   true   true   true  false

[:, :, 2] =
 false  false   true  false   true  false  false  …  false   true  false  false  false  false  true
  true   true  false   true  false   true   true      true  false   true   true   true   true  true

[:, :, 3] =
 false  true  true  false  true  false  false  true  …  false  true  false  true  true  false   true
  true  true  true   true  true   true   true  true      true  true   true  true  true   true  false

...

[:, :, 10344] =
  true   true   true   true  false   true   true  …   true   true   true   true   true  true  true
 false  false  false  false   true  false  false     false  false  false  false  false  true  true

[:, :, 10345] =
 false  true  true  true  false  true  true  true  …  true  true  true  true  true  true  true  true
  true  true  true  true   true  true  true  true     true  true  true  true  true  true  true  true

[:, :, 10346] =
 false   true  true   true  false  true  true  …   true  false  true  true  true   true   true
  true  false  true  false   true  true  true     false   true  true  true  true  false  false

To reconstruct the indices for column 1 we use

julia> int(bb[1,:,1]) + 2bb[2,:,1]
1x1825 Int64 Array:
 2  2  3  2  3  2  2  2  2  3  3  2  2  2  2  3  …  1  3  3  3  3  3  3  2  3  2  3  2  2  2  2  1

After converting bb to a memory-mapped BitArray we have a representation that is essentially as concise as a .bed file.

dmbates/casestudy.md

Objectives

Moving the data to Julia

Manipulating the data

Creating a type to store the information

Saving and loading the data