In this manuscript, we explain how to apply OnlinePCA.jl to the 1.3 million (1.3M) single-cell dataset of 10X Genomics (https://community.10xgenomics.com/t5/10x-Blog/Our-1-3-million-single-cell-dataset-is-ready-to-download/ba-p/276), which is the largest single-cell RNA-Seq (scRNA-Seq) dataset at this time.
The current version of OnlinePCA.jl assumes that the input data are in CSV format, so that it can be applied universally across a wide variety of research fields. Since the 1.3M data are saved in an HDF5 format defined by 10X Genomics, we first convert the HDF5 file to CSV (c.f. Saving the HDF5 file of 10X Genomics as CSV format). We are aware of several attempts to unify such ultra-large scRNA-Seq data formats, such as beachmat, Loom (LoomExperiment, Loompy), TENxGenomics, scanpy, Seurat, and 10X-HDF5. According to users' needs, we plan to implement preprocess commands to directly convert such specific binary files to the Julia binary file used below.
In this manuscript, users are assumed to have properly installed the following tools on their machines.
- Julia : v-1.0.0 or higher
- OnlinePCA.jl (Julia Package)
- R : v-3.5.0 or higher
- RColorBrewer : v-1.1-2 or higher (R Package)
- scales : v-1.0.0 or higher (R Package)
- Python : v-3.6.4 or higher
- umap-learn : v-0.3.6 or higher (Python Package)
- sklearn : v-0.20.0 or higher (Python Package)
- numpy : v-1.15.2 or higher (Python Package)
- matplotlib : v-3.0.0 or higher (Python Package)
- Python : v-2.7.9 (for bhtsne)
- bhtsne : v-0.1.9 (Python Package)
For the details of the installation of the OnlinePCA.jl package, see the README.md of OnlinePCA.jl. After the installation, the package is stored in a directory like the one below.
ls ~/.julia/packages
# ArgParse Mux
# Arpack NamedTuples
# AssetRegistry Nullables
# BinDeps Observables
# BinaryProvider OnlinePCA
# ...
Hereafter, we will use the latest Julia (v-1.0.1, 2018/10/18):
alias julia="~/julia-1.0.1/bin/julia"
The CSV file converted from the 1.3M data is assumed to be saved in the directory 1M_neurons (a sketch of the HDF5-to-CSV conversion follows the ls output below).
ls 1M_neurons/*.csv
# ... 58G ... 1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.csv
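For reference, the conversion can be sketched in Python with h5py and scipy. This is a minimal illustration assuming the standard 10X gene/cell-matrix HDF5 layout (a genome group such as mm10 holding the CSC arrays data, indices, indptr, and shape) and the file name distributed by 10X Genomics; it is not the exact code of the planned preprocess command.
import h5py
import numpy as np
from scipy.sparse import csc_matrix

# Load the sparse genes x cells matrix from the 10X-HDF5 file
with h5py.File("1M_neurons_filtered_gene_bc_matrices_h5.h5", "r") as f:
    g = f["mm10"]  # the genome group name depends on the dataset
    mat = csc_matrix((g["data"][:], g["indices"][:], g["indptr"][:]),
                     shape=g["shape"][:])

# Write one gene (row) at a time to avoid densifying the whole matrix
mat = mat.tocsr()
with open("1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.csv", "w") as out:
    for i in range(mat.shape[0]):
        row = mat[i].toarray().ravel()
        out.write(",".join(map(str, row)) + "\n")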
After converting the 10X-HDF5 file to CSV, we binarize the data to compress it and to accelerate data I/O. The csv2bin command converts the CSV file to the Zstandard (Zstd) format, a compression format developed by Facebook and known for its high I/O speed (c.f. https://facebook.github.io/zstd/).
julia ~/.julia/packages/OnlinePCA/*/bin/csv2bin \
--csvfile 1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.csv \
--binfile 1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.zst
ls -lth 1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.*
# ... 58G ... 1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.csv
# ... 3.2G ... 1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.zst
After the data compression, the large CSV file (58 GB) has been efficiently compressed into the Zstd format (3.2 GB).
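Zstd itself is easy to try from Python with the zstandard package; the following toy snippet only illustrates the format in general and says nothing about OnlinePCA.jl's internal binary layout.
import zstandard as zstd

data = b"0,1,2,0,0,3\n" * 1000000  # toy CSV-like bytes
compressed = zstd.ZstdCompressor(level=3).compress(data)
assert zstd.ZstdDecompressor().decompress(compressed) == data
print(len(data), "->", len(compressed), "bytes")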
After the binarization, the sumr command extracts the row-wise and column-wise summary statistics of the data matrix.
julia ~/.julia/packages/OnlinePCA/*/bin/sumr \
--binfile 1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.zst \
--outdir 1M_neurons
ls -lth 1M_neurons/{Feature,Sample}*
# ... 119K ... 1M_neurons/Feature_NoZeros.csv : Row-wise number of zeros vector
# ... 427K ... 1M_neurons/Feature_CV2s.csv : Row-wise CV2 value vector
# ... 478K ... 1M_neurons/Feature_FTTVars.csv : Row-wise variance vector (with FTT)
# ... 493K ... 1M_neurons/Feature_LogVars.csv : Row-wise variance vector (with log10)
# ... 479K ... 1M_neurons/Feature_Vars.csv : Row-wise variance vector
# ... 434K ... 1M_neurons/Feature_FTTMeans.csv : Row-wise mean vector (with FTT)
# ... 488K ... 1M_neurons/Feature_LogMeans.csv : Row-wise mean vector (with log10)
# ... 481K ... 1M_neurons/Feature_Means.csv : Row-wise mean vector
# ... 6.3M ... 1M_neurons/Sample_NoCounts.csv : Column-wise number of counts vector
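Here, FTT denotes the Freeman-Tukey transform, a variance-stabilizing transformation for count data; in Python notation (our own illustration, not code from the package):
import numpy as np

def ftt(x):
    # Freeman-Tukey transform: sqrt(x) + sqrt(x + 1)
    return np.sqrt(x) + np.sqrt(x + 1.0)

print(ftt(np.array([0.0, 1.0, 4.0])))  # [1.0, 2.414..., 4.236...]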
The hvg command finds highly variable genes (HVGs), a well-known feature selection method for scRNA-Seq datasets (c.f. Identifying highly variable genes).
julia ~/.julia/packages/OnlinePCA/*/bin/hvg \
--binfile 1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.zst \
--rowmeanlist 1M_neurons/Feature_Means.csv \
--rowvarlist 1M_neurons/Feature_Vars.csv \
--rowcv2list 1M_neurons/Feature_CV2s.csv \
--outdir 1M_neurons
ls -lth 1M_neurons/HVG_*
# .. 19B .. 1M_neurons/HVG_a0.csv : Parameters of HVG
# .. 19B .. 1M_neurons/HVG_a1.csv : Parameters of HVG
# .. 430K .. 1M_neurons/HVG_afit.csv : Parameters of HVG
# .. 8B .. 1M_neurons/HVG_df.csv : Parameters of HVG
# .. 97K .. 1M_neurons/HVG_pval.csv : P-value of HVG
# .. 65K .. 1M_neurons/HVG_useForFit.csv : Parameters of HVG
# .. 438K .. 1M_neurons/HVG_varFitRatio.csv : Parameters of HVG
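The output file names (a0, a1, afit, df, pval, varFitRatio) suggest a Brennecke et al. (2013)-style fit of the squared coefficient of variation (CV2) against the mean. The relationship can be sketched as follows; this is an illustrative least-squares approximation of that model, not the exact estimator used by hvg.
import numpy as np
from scipy import stats

means = np.loadtxt("1M_neurons/Feature_Means.csv")
cv2 = np.loadtxt("1M_neurons/Feature_CV2s.csv")
df = np.loadtxt("1M_neurons/HVG_df.csv")  # degrees of freedom written by hvg (our reading of the file name)

# Fit CV2 ~ a1/mean + a0 (ordinary least squares as a rough stand-in)
ok = (means > 0) & np.isfinite(cv2)
A = np.column_stack([np.ones(ok.sum()), 1.0 / means[ok]])
a0, a1 = np.linalg.lstsq(A, cv2[ok], rcond=None)[0]

# Genes whose variability exceeds the fit; chi-square test on the ratio
afit = a1 / means[ok] + a0
var_fit_ratio = cv2[ok] / afit
pval = stats.chi2.sf(var_fit_ratio * df, df)
print("genes with p <= 1e-5:", np.sum(pval <= 1e-5))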
The results of sumr and hvg are useful for quality control (QC) of the data matrix; that is, low-quality genes and cells can be trimmed according to these values. This step avoids mistakenly regarding false-positive artifacts as biologically meaningful signals. Since the summary statistics and the p-values of the HVGs are small vectors, any other programming language, such as R or Python, can load them. Here, we inspect the values with R.
# Data Loading
m <- unlist(read.csv("1M_neurons/Feature_Means.csv", header=FALSE))
v <- unlist(read.csv("1M_neurons/Feature_Vars.csv", header=FALSE))
cv2 <- unlist(read.csv("1M_neurons/Feature_CV2s.csv", header=FALSE)) # avoid masking base::c
p <- unlist(read.csv("1M_neurons/HVG_pval.csv", header=FALSE))
s <- unlist(read.csv("1M_neurons/Sample_NoCounts.csv", header=FALSE))
# Gene-wise QC
highcv <- rep("black", length(p))
highcv[which(p <= 1E-5)] <- rgb(1,0,0,0.5)
png(file="1M_neurons/GeneQC.png") # open the device before par(), so the settings apply to it
par(mar=c(5,5,4,2))
par(ps=30)
plot(log10(m), log10(cv2), col=highcv, xlab="log10(Mean)", ylab="log10(CV2)", pch=16)
dev.off()
# Cell-wise QC
quantile(log10(s))
png(file="1M_neurons/CellQC.png") # open the device before par(), so the settings apply to it
par(mar=c(5,5,4,2))
par(ps=30)
plot(1:length(s), log10(s), type="l", xlab="Cell", ylab="log10(# Counts)")
dev.off()
Here, we extract only the 1,841 HVGs whose p-values are lower than 1E-5. The filtering command extracts the corresponding row vectors and saves them as a separate Zstd file.
mkdir -p 1M_neurons/filtered
/usr/bin/time -v ~/julia-1.0.1/bin/julia ~/.julia/packages/OnlinePCA/*/bin/filtering \
-i 1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.zst \
--featurelist 1M_neurons/HVG_pval.csv \
--thr1 0.00001 \
--direct1 - \
-o 1M_neurons/filtered/filtered.zst
# Summary
julia ~/.julia/packages/OnlinePCA/*/bin/sumr \
--binfile 1M_neurons/filtered/filtered.zst \
--outdir 1M_neurons/filtered
ls -lth 1M_neurons/filtered/{Feature,Sample}*
# .. 33K .. Feature_CV2s.csv
# .. 34K .. Feature_FTTMeans.csv
# .. 36K .. Feature_FTTVars.csv
# .. 38K .. Feature_LogMeans.csv
# .. 38K .. Feature_LogVars.csv
# .. 37K .. Feature_Means.csv
# .. 14K .. Feature_NoZeros.csv
# .. 36K .. Feature_Vars.csv
# .. 5.0M .. Sample_NoCounts.csv
# .. 250M .. filtered.zst : Small binary data containing only HVGs
Finally, we perform online PCA with the oja command. The algorithm is based on stochastic gradient descent (SGD), and its convergence is sensitive to the learning rate (step size) parameter. To check the convergence, we try multiple stepsize options. Using Sun Grid Engine (SGE), this step can easily be performed in parallel.
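To illustrate why the stepsize matters, the following toy numpy sketch mimics Oja-style SGD updates of the eigenvectors (our own illustration on synthetic data, not the implementation inside OnlinePCA.jl):
import numpy as np

rng = np.random.default_rng(1234)
X = rng.standard_normal((2000, 20)) * np.linspace(5, 1, 20)  # toy samples x features
X -= X.mean(axis=0)

dim, stepsize = 12, 1e-3
W, _ = np.linalg.qr(rng.standard_normal((X.shape[1], dim)))  # random orthonormal start
for epoch in range(5):
    for x in X:
        W += stepsize * np.outer(x, x @ W)  # Oja's update toward the leading eigenvectors
        W, _ = np.linalg.qr(W)              # re-orthonormalize
    ev = 100 * np.sum((X @ W) ** 2) / np.sum(X ** 2)
    print("epoch %d: explained variance = %.2f%%" % (epoch + 1, ev))
Too small a stepsize barely moves the eigenvectors, while too large a stepsize makes the updates noisy; this is the behavior we check in the table below.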
# Array job of SGE
qsub ./ArrayJob.sh
We write the source code of ArrayJob.sh as follows.
#!/bin/bash
#$ -l nc=4
#$ -t 1-6
#$ -q node.q
i=$(expr $SGE_TASK_ID - 1)
stepsizes=(0.1 1 10 100 1000 10000)
stepsize=${stepsizes[$i]}
mkdir -p 1M_neurons/filtered/Step${stepsizes[$i]}
mkdir -p 1M_neurons/filtered/log/Step${stepsizes[$i]}
export PATH=~/julia-1.0.1/bin:$PATH # use Julia v-1.0.1; aliases are not expanded in non-interactive scripts
julia ~/.julia/packages/OnlinePCA/*/bin/oja \
--scale ftt \
--input 1M_neurons/filtered/filtered.zst \
--outdir 1M_neurons/filtered/Step${stepsizes[$i]} \
--numepoch 5 \
--rowmeanlist 1M_neurons/filtered/Feature_FTTMeans.csv \
--dim 12 \
--evalfreq 10000000000 \
--stepsize $stepsize \
--logdir 1M_neurons/filtered/log/Step${stepsizes[$i]}
The log file for each stepsize and each epoch is stored in the directory 1M_neurons/filtered/log/Step${stepsizes[$i]}, and the explained variance (%) can be confirmed as follows:
Stepsize | Epoch1 | Epoch2 | Epoch3 | Epoch4 | Epoch5 |
---|---|---|---|---|---|
0.1 | 7.24e-4 | 8.14e-4 | 8.76e-4 | 9.25e-4 | 9.67e-4 |
1 | 24.04 | 32.22 | 38.7 | 42.6 | 44.4 |
10 | 39.86 | 55.46 | 59.9 | 61.1 | 61.5 |
100 | 38.82 | 58.53 | 60.4 | 61.5 | 62.3 |
1000 | 30.12 | 39.93 | 45.3 | 48.4 | 50.9 |
10000 | 18.61 | 15.62 | 22.6 | 22.0 | 25.0 |
For this dataset, the calculation with stepsize 100 seems to have converged. Next, we inspect the cellular distribution with a pair plot of the principal components (PCs) calculated by online PCA. In the R console, we type the following source code:
# Package Loading
library("RColorBrewer")
library("scales")
# cellular label
label <- read.csv("1M_neurons/analysis/clustering/graphclust/clusters.csv", header=TRUE)[,2]
names(label) <- label
label[] <- c(brewer.pal(11, "Spectral"),
brewer.pal(12, "Set3"), brewer.pal(8, "Dark2"),
brewer.pal(8, "Set2"), brewer.pal(9, "Set1"),
brewer.pal(8, "Pastel1"), brewer.pal(8, "Accent"))[label]
# Loading
PCs <- read.delim("1M_neurons/filtered/Step100/Eigen_vectors.csv", sep=",", header=FALSE)
colnames(PCs) <- NULL
png(file="1M_neurons/PairsPCA.png", width=2000, height=2000)
pairs(PCs, labels=paste0("PC", 1:ncol(PCs)), col=alpha(label, 0.5), pch=16, cex=0.5)
dev.off()
The higher PCs, such as PC9-12, seem to be influenced by outlier cells.
We also perform the permutation-based method, which randomly shuffles the data matrix, calculates only the first eigenvalue, and thereby constructs a null distribution of eigenvalues (a toy sketch of this idea follows the script below).
In the same way as ArrayJob.sh, we write ArrayJob2.sh to perform the permutation-based method 1,000 times (one shuffle per SGE task).
# Array job of SGE
qsub ./ArrayJob2.sh
We write the source code of ArrayJob2.sh as follows.
#!/bin/bash
#$ -l nc=4
#$ -t 1-1000
#$ -q node.q
mkdir -p 1M_neurons/filtered/Permutation
mkdir -p 1M_neurons/filtered/Permutation/$SGE_TASK_ID
export PATH=~/julia-1.0.1/bin:$PATH # use Julia v-1.0.1; aliases are not expanded in non-interactive scripts
julia ~/.julia/packages/OnlinePCA/*/bin/oja \
--perm true \
--scale ftt \
--input 1M_neurons/filtered/filtered.zst \
--outdir 1M_neurons/filtered/Permutation/$SGE_TASK_ID \
--numepoch 1 \
--rowmeanlist 1M_neurons/filtered/Feature_FTTMeans.csv \
--dim 1 \
--evalfreq 10000000000 \
--stepsize 100 \
--logdir 1M_neurons/filtered/Permutation/$SGE_TASK_ID
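The idea behind the permutation scheme can be sketched on a toy matrix as follows (our own numpy illustration; here each gene row is shuffled independently, one common variant, to break the correlation structure):
import numpy as np

rng = np.random.default_rng(1234)
X = rng.standard_normal((200, 1000)) * np.linspace(5, 1, 200)[:, None]  # toy genes x cells

def first_eigenvalue(M):
    # Largest eigenvalue of the sample covariance, via SVD
    M = M - M.mean(axis=1, keepdims=True)
    return np.linalg.svd(M, compute_uv=False)[0] ** 2 / (M.shape[1] - 1)

# Null distribution: shuffle the cells within each gene row
null = []
for _ in range(100):
    idx = rng.random(X.shape).argsort(axis=1)
    null.append(first_eigenvalue(np.take_along_axis(X, idx, axis=1)))

print("real first eigenvalue :", first_eigenvalue(X))
print("null 95th percentile  :", np.percentile(null, 95))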
Next, we compare the actual eigenvalues calculated from the real data with the null distribution of eigenvalues obtained by the permutation-based method.
cat 1M_neurons/filtered/Permutation/*/Eigen_values.csv > 1M_neurons/filtered/Permutation/Permutation.csv
permval <- unlist(read.delim("1M_neurons/filtered/Permutation/Permutation.csv", header=FALSE))
realval <- unlist(read.delim("1M_neurons/filtered/Step100/Eigen_values.csv", header=FALSE))
mx <- max(permval, realval)
mi <- min(permval, realval)
# Plot
png(file="1M_neurons/Permutation.png", width=2500, height=1000)
# Setting
par(mar=c(4.5,4.5,3,10))
par(ps = 23)
# Histogram (Permutation)
hist(log10(permval), breaks=100, xlim=log10(c(mi, mx)),
xlab="log10(Eigenvalues)", ylab="", main="")
abline(v=log10(min(permval)), col="black", lty=2)
abline(v=log10(max(permval)), col="black", lty=2)
# Histogram (Real Data; eigenvalues replicated 10 times so that the bars are visible)
hist(rep(log10(realval), 10), breaks=100, xlim=log10(c(mi, mx)),
col="blue", add=TRUE)
dev.off()
This figure suggests that at least the eigenvalue of PC12 is very small and hard to distinguish from random noise.
Finally, we save only PC1 to PC11 for the next analysis.
# Saving
write.csv(PCs[, 1:11], file="1M_neurons/PC1_11.csv", row.names=FALSE)
Using the result of online PCA, we perform UMAP (Uniform Manifold Approximation and Projection), which is a sophisticated non-linear dimensionality reduction technique.
# Package Loading
import umap
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
# original function
def generate_cmap(colors):
    values = range(len(colors))
    vmax = np.ceil(np.max(values))
    color_list = []
    for v, c in zip(values, colors):
        color_list.append((v / vmax, c))
    return LinearSegmentedColormap.from_list('custom_cmap', color_list)
# Original Color map
cm = generate_cmap(["#9E0142", "#D53E4F", "#F46D43", "#FDAE61", "#FEE08B",
"#FFFFBF", "#E6F598", "#ABDDA4", "#66C2A5", "#3288BD",
"#5E4FA2", "#8DD3C7", "#FFFFB3", "#BEBADA", "#FB8072",
"#80B1D3", "#FDB462", "#B3DE69", "#FCCDE5", "#D9D9D9",
"#BC80BD", "#CCEBC5", "#FFED6F", "#1B9E77", "#D95F02",
"#7570B3", "#E7298A", "#66A61E", "#E6AB02", "#A6761D",
"#666666", "#66C2A5", "#FC8D62", "#8DA0CB", "#E78AC3",
"#A6D854", "#FFD92F", "#E5C494", "#B3B3B3", "#E41A1C",
"#377EB8", "#4DAF4A", "#984EA3", "#FF7F00", "#FFFF33",
"#A65628", "#F781BF", "#999999", "#FBB4AE", "#B3CDE3",
"#CCEBC5", "#DECBE4", "#FED9A6", "#FFFFCC", "#E5D8BD",
"#FDDAEC", "#7FC97F", "#BEAED4", "#FDC086", "#FFFF99"])
# Data Loading
pcs = np.loadtxt("1M_neurons/PC1_11.csv", delimiter=",")
label = np.loadtxt("1M_neurons/analysis/clustering/graphclust/clusters.csv", delimiter=",", skiprows=1, usecols=1)
# UMAP
embedding = umap.UMAP(metric='euclidean', local_connectivity=10.0, random_state=1234, verbose=True).fit_transform(pcs)
# Plot
plt.scatter(embedding[:,0], embedding[:,1], c=label, cmap=cm, s=1)
plt.colorbar()
plt.savefig('1M_neurons/umap.png')
ls -lth 1M_neurons/umap.png
This confirms that the UMAP plot has been generated.
To compare with the result of UMAP, we also perform t-SNE, which is a very popular non-linear dimensionality reduction method in scRNA-Seq analysis.
Note that the following source code is executed with Python v-2.7.9.
# Package Loading
import bhtsne
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
# original function
def generate_cmap(colors):
    values = range(len(colors))
    vmax = np.ceil(np.max(values))
    color_list = []
    for v, c in zip(values, colors):
        color_list.append((v / vmax, c))
    return LinearSegmentedColormap.from_list('custom_cmap', color_list)
# original color map
cm = generate_cmap(["#9E0142", "#D53E4F", "#F46D43", "#FDAE61", "#FEE08B",
"#FFFFBF", "#E6F598", "#ABDDA4", "#66C2A5", "#3288BD",
"#5E4FA2", "#8DD3C7", "#FFFFB3", "#BEBADA", "#FB8072",
"#80B1D3", "#FDB462", "#B3DE69", "#FCCDE5", "#D9D9D9",
"#BC80BD", "#CCEBC5", "#FFED6F", "#1B9E77", "#D95F02",
"#7570B3", "#E7298A", "#66A61E", "#E6AB02", "#A6761D",
"#666666", "#66C2A5", "#FC8D62", "#8DA0CB", "#E78AC3",
"#A6D854", "#FFD92F", "#E5C494", "#B3B3B3", "#E41A1C",
"#377EB8", "#4DAF4A", "#984EA3", "#FF7F00", "#FFFF33",
"#A65628", "#F781BF", "#999999", "#FBB4AE", "#B3CDE3",
"#CCEBC5", "#DECBE4", "#FED9A6", "#FFFFCC", "#E5D8BD",
"#FDDAEC", "#7FC97F", "#BEAED4", "#FDC086", "#FFFF99"])
# Data Loading
pcs = np.loadtxt("1M_neurons/PC1_11.csv", delimiter=",")
label = np.loadtxt("1M_neurons/analysis/clustering/graphclust/clusters.csv", delimiter=",", skiprows=1, usecols=1)
# t-SNE
tsne = bhtsne.tsne(pcs, perplexity=50, rand_seed=123)
# Plot
plt.scatter(tsne[:,0], tsne[:,1], c=label, cmap=cm, s=1)
plt.colorbar()
plt.savefig('1M_neurons/tsne.png')
ls -lth 1M_neurons/tsne.png
This confirms that the t-SNE plot has also been generated.
Although both methods captured some cluster structures, the calculation times of UMAP and t-SNE were about 46 minutes and 42 hours, respectively; that is, UMAP is much faster.
Owing to OnlinePCA.jl, the large data matrix is compressed into small PC vectors without requiring a high-spec machine, and the compressed form of the data accelerates the downstream analyses. In our computational environment, the total calculation time and the maximum memory usage were 9.5 hours and 15.8 GB, respectively.
Command | Elapsed (wall clock) time (h:mm:ss or m:ss) | Maximum resident set size (kbytes) |
---|---|---|
csv2bin | 2:05:56 | 376844 |
sumr (before filtering) | 24:28.68 | 342320 |
hvg | 0:39.99 | 352156 |
filtering | 2:23.98 | 399196 |
sumr (after filtering) | 2:12.70 | 340800 |
oja (5 epochs, average) | 49:33.33 | 661292 |
oja (Permutation, 1 epoch, average) | 3:06.20 | 427104 |
Pair plot (R) | 17:04.09 | 1804880 |
UMAP and plot (Python v-3.6.4) | 45:55.12 | 15896664 |
t-SNE and plot (Python v-2.7.9) | 41:46:05 | 5950504 |
- OS : CentOS Linux release 7.2.1511 (Core)
- CPU : Intel Xeon E5-2670 v3 (2.30GHz)
- Memory : 96 GB
- HDD : 1.9TB
Koki Tsuyuzaki <koki.tsuyuzaki [at] gmail.com>
2018/10/30