The dimensions of data on DNA variation such as single nucleotide polymorphisms or SNPs can be very large, involving thousands or millions of SNPs, measured on potentially thousands of individuals. Typical genotyping platforms may examine from 50K(K=thousand) to 2.5M (M= millions) SNPs. Some platforms could be even denser. There are 2 nucleotides (A, C, G or T) at each position (one on each chromosome). If the genotyping read is not sufficiently good, a missing value could be recorded in one or both chromosomes for that position/SNP. A frequently used re-codification of the nocleotide data is to replace the characters (i.e. alleles) by the count of the allele with the lower frequency in the sample, or according to a pre-specified allele as determined in the genotyping platform and software. Thus, instead of storing a pair of nucleotides (e.g., AA, AG, GG), researchers store the individual’s genotype as either 0,1,2, or NA. In this coding scheme, 0 and 2 indicate homozygotes (e.g. AA, GG), and1 is the heterozygote (e.g. AG) for that SNP. Another possible re-coding could be -1,0,1 where 0 is the heterozygote.
The documentation for the plink software describes some common formats in which this information can be presented.
Because such a study can represent an enormous amount of information, it is important to store and manipulate it in a compact format if possible. For example, the BED file format collapses the data on each SNP/individual to 2 bits.
The BGLR package for
R provides functions for reading a
BED-format file in R and provides a moderate sized result in the
mouse
dataset.
> library(BGLR)
Package 'BGLR', 1.0 (2013-05-17).
Type 'help(BGLR)' for summary information
> data("mouse")
> ls()
[1] "mouse.A" "mouse.pheno" "mouse.X"
> str(mouse.X)
num [1:1825, 1:10346] 1 1 2 1 2 1 1 1 1 2 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:1825] "A048005080" "A048006063" "A048006555" "A048007096" ...
..$ : chr [1:10346] "rs3683945_G" "rs3707673_G" "rs6269442_G" "rs6336442_G" ...
> object.size(mouse.X)
151843784 bytes
This numeric (i.e. double-precision floating point) matrix provides
the 0,1,2 encoding of 10,346 SNPs on 1825 mice. Although the numeric format is
expensive in terms of storage, it provides easy access to numeric
manipulations. In this gist we will show calculation of the
1825 by 1825 positive-semidefinite symmetric matrix formed by
standardizing each column to have mean zero and unit standard
deviation then forming what is called in R
the transposed
cross-product matrix and in genetics the genomic relationship matrix.
> str(XsXst <- tcrossprod(scale(mouse.X, TRUE, TRUE)))
num [1:1825, 1:1825] 9595 -626 191 425 1097 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:1825] "A048005080" "A048006063" "A048006555" "A048007096" ...
..$ : chr [1:1825] "A048005080" "A048006063" "A048006555" "A048007096" ...
> system.time(XsXst <- tcrossprod(scale(mouse.X, TRUE, TRUE)))
user system elapsed
8.044 0.728 5.824
This is reasonable performance but the size of the arrays will become prohibitive with larger studies. In Julia it is possible to maintain compact representations and to perform such computations reasonably quickly.
It is possible to read saved R
data sets in Julia
but for this
illustration we will write the data in comma-separated value (csv)
format to a file and read it into Julia. First, write the file from R
> unique(as.vector(mouse.X))
[1] 1 2 0
> any(is.na(mouse.X))
[1] FALSE
> write.table(mouse.X, file="/tmp/mouse.csv", row.names=FALSE, col.names=FALSE, sep=',')
then read it into Julia
.
julia> x = readcsv("/tmp/mouse.csv",Uint8)
1825x10346 Uint8 Array:
0x01 0x01 0x01 0x01 0x00 0x01 0x00 0x01 … 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x01
0x01 0x01 0x02 0x01 0x01 0x01 0x01 0x02 0x00 0x00 0x00 0x00 0x00 0x00 0x02 0x00
0x02 0x00 0x02 0x02 0x00 0x02 0x00 0x02 0x00 0x00 0x00 0x00 0x00 0x00 0x02 0x02
0x01 0x01 0x01 0x01 0x01 0x01 0x01 0x02 0x00 0x00 0x00 0x00 0x00 0x00 0x02 0x00
0x02 0x00 0x02 0x02 0x00 0x02 0x00 0x02 0x01 0x01 0x01 0x01 0x01 0x01 0x01 0x01
0x01 0x01 0x01 0x01 0x00 0x01 0x00 0x01 … 0x00 0x00 0x00 0x00 0x00 0x00 0x02 0x02
0x01 0x01 0x01 0x01 0x00 0x01 0x00 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x02 0x02
0x01 0x01 0x02 0x01 0x01 0x01 0x01 0x02 0x00 0x00 0x00 0x00 0x00 0x00 0x02 0x01
0x01 0x01 0x02 0x01 0x01 0x01 0x01 0x02 0x00 0x00 0x00 0x00 0x00 0x00 0x02 0x00
0x02 0x00 0x02 0x02 0x00 0x02 0x00 0x02 0x01 0x01 0x01 0x01 0x01 0x00 0x01 0x01
⋮ ⋮ ⋱ ⋮ ⋮
0x02 0x00 0x02 0x02 0x00 0x02 0x00 0x02 … 0x00 0x00 0x00 0x00 0x00 0x00 0x02 0x02
0x01 0x01 0x01 0x01 0x01 0x01 0x01 0x02 0x00 0x00 0x00 0x00 0x00 0x00 0x02 0x01
0x02 0x00 0x02 0x02 0x00 0x02 0x00 0x02 0x00 0x00 0x00 0x00 0x00 0x00 0x02 0x02
0x01 0x01 0x01 0x01 0x00 0x01 0x00 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x02 0x00
0x02 0x00 0x02 0x02 0x00 0x02 0x00 0x02 0x00 0x00 0x00 0x00 0x00 0x00 0x02 0x01
0x01 0x01 0x01 0x01 0x00 0x01 0x00 0x01 … 0x00 0x00 0x00 0x00 0x00 0x00 0x02 0x02
0x01 0x01 0x02 0x01 0x01 0x01 0x01 0x02 0x00 0x00 0x00 0x00 0x00 0x00 0x02 0x02
0x01 0x01 0x02 0x01 0x01 0x01 0x01 0x02 0x00 0x00 0x00 0x00 0x00 0x00 0x02 0x02
0x01 0x01 0x01 0x01 0x01 0x01 0x01 0x02 0x00 0x01 0x00 0x00 0x00 0x02 0x02 0x00
0x00 0x02 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x02 0x02 0x00
We have checked that all the values are 0, 1 or 2 so we can store this information as a matrix of unsigned 8-bit integers, represented here in hexadecimal notation. Later we will show a more compact representation of the data as a 2 by 1825 by 10346 bit-array. A further enhancement is to use memory-mapped arrays so that we can work with arrays larger than can fit comfortably in memory.
We want to work with standardized columns but without generating the entire matrix of standardized values. Because we start off with only 3 possible distinct values in a column we will only get 3 distinct values after standardization so we can keep the information in the column stored as 0, 1 and 2 and simply calculate and store the three distinct values that will occur in a column.
For each column we count the number of zeros, ones and twos and use those to evaluate the mean, the centered values, the standard deviation and the standardized values.
First counting the incidence of zeros, ones and twos. Those familiar
with R
or Matlab/octave
programming know you should try to operate
on the whole object if possible. Looping over individual elements of
an array can be horribly slow. In Julia you do not need to fear the
loop. My first attempt at this was
function colcounts(x::Matrix)
m,n = size(x)
counts = zeros(Int, 3, n)
for j in 1:n, i in 1:m
xij = x[i,j]
if xij == 0 counts[1,j] += 1
elseif xij == 1 counts[2,j] += 1
elseif xij == 2 counts[3,j] += 1
else error("x[$i,$j] = $xij should be 0, 1 or 2")
end
end
end
and that worked reasonably well.
julia> colcounts(x)
3x10346 Int64 Array:
344 545 242 345 1373 345 1339 181 347 … 1511 1515 1508 1507 1509 1400 92 637
935 935 837 935 420 935 454 746 933 256 251 258 260 256 295 304 516
546 345 746 545 32 545 32 898 545 58 59 59 58 60 130 1429 672
julia> @elapsed colcounts(x)
0.115006816
Notice that the +=
operator for incrementing is available in
Julia
. Also, we declare the type of the argument x
to be a
Matrix
as it must be. This will define a method for the colcounts
function.
We can be more specific and define the method for the type
Matrix{Uint8}
but we may find we want to use other integer
representations. We can define a templated class of methods by
changing the declaration to
function colcounts{T <: Integer}(x::Matrix{T})
(If you have ever done template programming in a language like C++ this will look ridiculously easy in comparison.)
As far as the algorithm itself goes, using if
, elseif
, etc. on the
values of the array elements is a bit wasteful because you can
calculate the row index in counts
from the value of xij
. At this
point we realize that we don't need the temporary value xij
and can
shorten the loop to
function colcounts{T <: Integer}(x::Matrix{T})
m,n = size(x)
counts = zeros(Int, 3, n)
for j in 1:n, i in 1:m counts[x[i,j]+1,j] += 1 end
counts
end
giving us
julia> countcols(x)
3x10346 Int64 Array:
344 545 242 345 1373 345 1339 181 347 … 1511 1515 1508 1507 1509 1400 92 637
935 935 837 935 420 935 454 746 933 256 251 258 260 256 295 304 516
546 345 746 545 32 545 32 898 545 58 59 59 58 60 130 1429 672
julia> @elapsed countcols(x)
0.044554878
There is one further simplification we can make here. Indices in
Julia
are 1-based which means that the three rows in the counts
matrix are numbered 1, 2 and 3 whereas the values in the array x
are
0, 1 and 2. Instead of adding 1 to the matrix element every time we
wish to use it, we can change the coding in x
to 1, 2 and 3 and
leave the value 0 to represent missing data.
julia> x += 1
1825x10346 Uint8 Array:
0x02 0x02 0x02 0x02 0x01 0x02 0x01 0x02 … 0x01 0x01 0x01 0x01 0x01 0x01 0x02 0x02
0x02 0x02 0x03 0x02 0x02 0x02 0x02 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x01
0x03 0x01 0x03 0x03 0x01 0x03 0x01 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x03
0x02 0x02 0x02 0x02 0x02 0x02 0x02 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x01
0x03 0x01 0x03 0x03 0x01 0x03 0x01 0x03 0x02 0x02 0x02 0x02 0x02 0x02 0x02 0x02
0x02 0x02 0x02 0x02 0x01 0x02 0x01 0x02 … 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x03
0x02 0x02 0x02 0x02 0x01 0x02 0x01 0x02 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x03
0x02 0x02 0x03 0x02 0x02 0x02 0x02 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x02
0x02 0x02 0x03 0x02 0x02 0x02 0x02 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x01
0x03 0x01 0x03 0x03 0x01 0x03 0x01 0x03 0x02 0x02 0x02 0x02 0x02 0x01 0x02 0x02
⋮ ⋮ ⋱ ⋮ ⋮
0x03 0x01 0x03 0x03 0x01 0x03 0x01 0x03 … 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x03
0x02 0x02 0x02 0x02 0x02 0x02 0x02 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x02
0x03 0x01 0x03 0x03 0x01 0x03 0x01 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x03
0x02 0x02 0x02 0x02 0x01 0x02 0x01 0x02 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x01
0x03 0x01 0x03 0x03 0x01 0x03 0x01 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x02
0x02 0x02 0x02 0x02 0x01 0x02 0x01 0x02 … 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x03
0x02 0x02 0x03 0x02 0x02 0x02 0x02 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x03
0x02 0x02 0x03 0x02 0x02 0x02 0x02 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x03
0x02 0x02 0x02 0x02 0x02 0x02 0x02 0x03 0x01 0x02 0x01 0x01 0x01 0x03 0x03 0x01
0x01 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x03 0x01
In general we should be careful about the types of constants that we combine with array elements because we could end up changing the storage representation (although that didn't happen here). The safer expression would be
x += one(eltype(x))
The loop in countcols
should now be
for j in 1:n, i in 1:m counts[x[i,j],j] += 1 end
From here we can calculate the standardized values but this is about the time we should consider a user-defined type to keep this information organized.
Like methods, types in Julia
can be templated. A type to store the
array values, the counts, the centered and scaled values and the
possible row and column names is
type SNP{T<:Integer}
mat::Matrix{T}
counts::Matrix{Int}
vals::Matrix{Float64}
colnames::Vector{ASCIIString}
rownames::Vector{ASCIIString}
end
We will do the work of counting the incidence of (now) 1's, 2's and 3's in each column in an external constructor for the type
function SNP{T<:Integer}(x::Matrix{T})
m,n = size(x)
counts = zeros(Int, (3, n))
vals = Array(Float64, (3, n))
for j in 1:n
for i in 1:m counts[x[i,j],j] += 1 end
mn = (counts[2,j] + 2*counts[3,j])/m
dev = [0.:2.] - mn
stddev = sqrt(sum(counts[:,j] .* dev .* dev)/float(m-1))
vals[:,j] = dev/stddev
end
SNP{T}(x, counts, vals, ASCIIString[], ASCIIString[])
end
For calculating the column mean, mn
, and standard deviation, stddev
, we
revert to the 0,1,2 coding because it is slightly easier. The
standardized values will be invariant under linear transformation of
the original values.
Applying this to x
produces
julia> ss = SNP(x)
SNP{Uint8}(1825x10346 Uint8 Array:
0x02 0x02 0x02 0x02 0x01 0x02 0x01 0x02 … 0x01 0x01 0x01 0x01 0x01 0x01 0x02 0x02
0x02 0x02 0x03 0x02 0x02 0x02 0x02 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x01
0x03 0x01 0x03 0x03 0x01 0x03 0x01 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x03
0x02 0x02 0x02 0x02 0x02 0x02 0x02 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x01
0x03 0x01 0x03 0x03 0x01 0x03 0x01 0x03 0x02 0x02 0x02 0x02 0x02 0x02 0x02 0x02
0x02 0x02 0x02 0x02 0x01 0x02 0x01 0x02 … 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x03
0x02 0x02 0x02 0x02 0x01 0x02 0x01 0x02 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x03
0x02 0x02 0x03 0x02 0x02 0x02 0x02 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x02
0x02 0x02 0x03 0x02 0x02 0x02 0x02 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x01
0x03 0x01 0x03 0x03 0x01 0x03 0x01 0x03 0x02 0x02 0x02 0x02 0x02 0x01 0x02 0x02
⋮ ⋮ ⋱ ⋮ ⋮
0x03 0x01 0x03 0x03 0x01 0x03 0x01 0x03 … 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x03
0x02 0x02 0x02 0x02 0x02 0x02 0x02 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x02
0x03 0x01 0x03 0x03 0x01 0x03 0x01 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x03
0x02 0x02 0x02 0x02 0x01 0x02 0x01 0x02 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x01
0x03 0x01 0x03 0x03 0x01 0x03 0x01 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x02
0x02 0x02 0x02 0x02 0x01 0x02 0x01 0x02 … 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x03
0x02 0x02 0x03 0x02 0x02 0x02 0x02 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x03
0x02 0x02 0x03 0x02 0x02 0x02 0x02 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x03
0x02 0x02 0x02 0x02 0x02 0x02 0x02 0x03 0x01 0x02 0x01 0x01 0x01 0x03 0x03 0x01
0x01 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x03 0x01,3x10346 Int64 Array:
344 545 242 345 1373 345 1339 181 347 … 1511 1515 1508 1507 1509 1400 92 637
935 935 837 935 420 935 454 746 933 256 251 258 260 256 295 304 516
546 345 746 545 32 545 32 898 545 58 59 59 58 60 130 1429 672,3x10346 Float64 Array:
-1.6104 -1.29069 -1.87074 -1.6084 … -0.430103 -0.51092 -3.17986 -1.20338
-0.160484 0.158854 -0.404831 -0.158854 1.6575 1.16913 -1.34455 -0.0226443
1.28943 1.6084 1.06108 1.29069 3.7451 2.84918 0.490757 1.15809 ,[],[])
julia> @elapsed ss = SNP(x)
0.431200836
We can check with R
that the counts and standardized values are as expected
> xtabs(~mouse.X[,1])
mouse.X[, 1]
0 1 2
344 935 546
> Xs <- scale(mouse.X)
> unique(Xs[,1])
[1] -0.1604836 1.2894303 -1.6103976
Reading in a very large matrix from a text file and converting it to
the Uint8
representation can often take longer than the actual
operation on the data. Especially if the array is very large it would
be advantageous to create a disk image of the array and memory-map the
image. A memory-mapped array must be homogeneous so we only store the
matrix of index values 1, 2 and 3 with 0 representing missing data.
First we open a file, declare the mmap_array of the appropriate dimension and type, copy the contents, synchronize the array and close the stream.
julia> s = open("/var/tmp/mouse.dat", "w+")
IOStream(<file /var/tmp/mouse.dat>)
julia> mm = mmap_array(Uint8, size(x), s)
1825x10346 Uint8 Array:
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 … 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 … 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
⋮ ⋮ ⋱ ⋮ ⋮
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 … 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 … 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
julia> for j in 1:size(x,2), i in 1:size(x,1) mm[i,j] = x[i,j] end
julia> msync(mm)
julia> close(s)
julia> run(`ls -l /var/tmp/mouse.dat`)
-rw-r--r-- 1 bates bates 18881450 May 30 15:22 /var/tmp/mouse.dat
Now, on restarting julia
we simply create the memory-mapped array
from a read-only stream.
julia> s = open("/var/tmp/mouse.dat")
IOStream(<file /var/tmp/mouse.dat>)
julia> ss = SNP(mmap_array(Uint8, (1825,10346), s))
SNP{Uint8}(1825x10346 Uint8 Array:
0x02 0x02 0x02 0x02 0x01 0x02 0x01 0x02 … 0x01 0x01 0x01 0x01 0x01 0x01 0x02 0x02
0x02 0x02 0x03 0x02 0x02 0x02 0x02 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x01
0x03 0x01 0x03 0x03 0x01 0x03 0x01 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x03
0x02 0x02 0x02 0x02 0x02 0x02 0x02 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x01
0x03 0x01 0x03 0x03 0x01 0x03 0x01 0x03 0x02 0x02 0x02 0x02 0x02 0x02 0x02 0x02
0x02 0x02 0x02 0x02 0x01 0x02 0x01 0x02 … 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x03
0x02 0x02 0x02 0x02 0x01 0x02 0x01 0x02 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x03
0x02 0x02 0x03 0x02 0x02 0x02 0x02 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x02
0x02 0x02 0x03 0x02 0x02 0x02 0x02 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x01
0x03 0x01 0x03 0x03 0x01 0x03 0x01 0x03 0x02 0x02 0x02 0x02 0x02 0x01 0x02 0x02
⋮ ⋮ ⋱ ⋮ ⋮
0x03 0x01 0x03 0x03 0x01 0x03 0x01 0x03 … 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x03
0x02 0x02 0x02 0x02 0x02 0x02 0x02 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x02
0x03 0x01 0x03 0x03 0x01 0x03 0x01 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x03
0x02 0x02 0x02 0x02 0x01 0x02 0x01 0x02 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x01
0x03 0x01 0x03 0x03 0x01 0x03 0x01 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x02
0x02 0x02 0x02 0x02 0x01 0x02 0x01 0x02 … 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x03
0x02 0x02 0x03 0x02 0x02 0x02 0x02 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x03
0x02 0x02 0x03 0x02 0x02 0x02 0x02 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x03
0x02 0x02 0x02 0x02 0x02 0x02 0x02 0x03 0x01 0x02 0x01 0x01 0x01 0x03 0x03 0x01
0x01 0x03 0x01 0x01 0x01 0x01 0x01 0x01 0x01 0x01 0x01 0x01 0x01 0x03 0x03 0x01,3x10346 Int64 Array:
344 545 242 345 1373 345 1339 181 347 … 1511 1515 1508 1507 1509 1400 92 637
935 935 837 935 420 935 454 746 933 256 251 258 260 256 295 304 516
546 345 746 545 32 545 32 898 545 58 59 59 58 60 130 1429 672,3x10346 Float64 Array:
-1.6104 -1.29069 -1.87074 -1.6084 … -0.430103 -0.51092 -3.17986 -1.20338
-0.160484 0.158854 -0.404831 -0.158854 1.6575 1.16913 -1.34455 -0.0226443
1.28943 1.6084 1.06108 1.29069 3.7451 2.84918 0.490757 1.15809 ,[],[])
An alternative representation, providing more compact storage, would
be to save the values as a BitArray
of size (2, 1825, 10346)
.
julia> mm = ss.mat;
julia> m,n = size(mm)
(1825,10346)
julia> bb = BitArray(2,m,n);
julia> for j in 1:n, i in 1:m mij = mm[i,j]; bb[1,i,j] = bool(mij&0x01); bb[2,i,j]=bool(mij&0x02) end
julia> bb
2x1825x10346 BitArray:
[:, :, 1] =
false false true false true false false … false true false false false false true
true true true true true true true true true true true true true false
[:, :, 2] =
false false true false true false false … false true false false false false true
true true false true false true true true false true true true true true
[:, :, 3] =
false true true false true false false true … false true false true true false true
true true true true true true true true true true true true true true false
...
[:, :, 10344] =
true true true true false true true … true true true true true true true
false false false false true false false false false false false false true true
[:, :, 10345] =
false true true true false true true true … true true true true true true true true
true true true true true true true true true true true true true true true true
[:, :, 10346] =
false true true true false true true … true false true true true true true
true false true false true true true false true true true true false false
To reconstruct the indices for column 1 we use
julia> int(bb[1,:,1]) + 2bb[2,:,1]
1x1825 Int64 Array:
2 2 3 2 3 2 2 2 2 3 3 2 2 2 2 3 … 1 3 3 3 3 3 3 2 3 2 3 2 2 2 2 1
After converting bb
to a memory-mapped BitArray we have a
representation that is essentially as concise as a .bed
file.