In this manuscript, we explain how to extract the gene × cell matrix from the HDF5 file provided by 10X Genomics and save the data in CSV format.
First, we download the HDF5 file from the 10X Genomics site. The data is hosted on Amazon AWS and can be downloaded easily with the wget command, as shown below.
wget https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.h5
This file contains 1,306,127 (1.3 M) mouse cells. Despite the huge number of cells, the file is very compact, at about 4 GB, because the data is stored in a sparse matrix format. However, this format is not easy to use directly for data analysis, so here we convert the data to a dense matrix. 10X Genomics provides two ways to process the HDF5 file: cellrangerRkit (an R package) and cellranger (Python command-line tools). For the 1.3 M dataset, the R package could not load the HDF5 file properly. This may be because the H5Fopen function of the rhdf5 package does not handle 64-bit integer data.
# This code does not work with the 1.3M data...
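To make the sparse-versus-dense distinction concrete, here is a minimal sketch, assuming a compressed sparse column (CSC) layout described by `data`, `indices`, and `indptr` arrays. The helper name `csc_to_dense` and the toy values are made up for illustration; the real matrix is far too large to expand in one step like this.

```python
def csc_to_dense(data, indices, indptr, shape):
    """Expand a CSC-encoded matrix into a dense list of rows."""
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for col in range(n_cols):
        # Non-zero entries of column `col` are data[indptr[col]:indptr[col + 1]]
        for k in range(indptr[col], indptr[col + 1]):
            dense[indices[k]][col] = data[k]
    return dense


# Toy 3 genes x 4 cells matrix with 4 non-zero counts (hypothetical values)
data = [5, 1, 2, 7]
indices = [0, 2, 1, 0]      # row (gene) index of each stored value
indptr = [0, 2, 2, 3, 4]    # column (cell) boundaries into data/indices
dense = csc_to_dense(data, indices, indptr, (3, 4))
for row in dense:
    print(",".join(str(v) for v in row))
```

The sparse form stores only the 4 non-zero values plus their index arrays, while the dense form writes out all 12 entries. The gap grows with the fraction of zeros, which is why the dense CSV is much larger than the HDF5 file.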
source("http://s3-us-west-2.amazonaws.com/10x.files/code/rkit-install-1.1.0.R")
library(cellrangerRkit)
neuron <- get_matrix_from_h5("1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.h5")
Hereafter, the following examples use cellranger.
Next, we download and install Cell Ranger with wget. Version 2.1 was the latest as of 2018/4/28, but the command below fetches version 1.3.0.
wget --no-check-certificate -O cellranger-1.3.0.tar.gz "https://s3-us-west-2.amazonaws.com/10x.downloads/cellranger-1.3.0.tar.gz?AWSAccessKeyId=AKIAJAZONYDS6QUPQVBA&Expires=1487446357&Signature=Yt%2BqSTuJdJ8zqdAXzoV8fisZFXo%3D"
We also add the paths of the cellranger Python libraries to PYTHONPATH.
export PYTHONPATH=./cellranger-1.3.0/cellranger-cs/1.3.0/lib/python:$PYTHONPATH
export PYTHONPATH=./cellranger-1.3.0/cellranger-cs/1.3.0/tenkit/lib/python:$PYTHONPATH
export PYTHONPATH=./cellranger-1.3.0/anaconda-cr-cs/2.2.0-anaconda-cr-cs-c7/lib/python2.7/site-packages/:$PYTHONPATH
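The three exports above follow one pattern that depends only on the Cell Ranger version and the name of the bundled conda directory. As a rough sketch, the hypothetical helper below (`cellranger_pythonpath` is not part of Cell Ranger) rebuilds the same three directories, which may help when adapting the paths to another release.

```python
import os


def cellranger_pythonpath(root, version, conda_dir):
    """Return the three library directories joined as a PYTHONPATH value."""
    base = os.path.join(root, "cellranger-" + version)
    parts = [
        os.path.join(base, "cellranger-cs", version, "lib", "python"),
        os.path.join(base, "cellranger-cs", version, "tenkit", "lib", "python"),
        os.path.join(base, conda_dir, "lib", "python2.7", "site-packages"),
    ]
    return os.pathsep.join(parts)


# Directory names copied from the export lines above
conda_dir = os.path.join("anaconda-cr-cs", "2.2.0-anaconda-cr-cs-c7")
value = cellranger_pythonpath(".", "1.3.0", conda_dir)
print("export PYTHONPATH=" + value + ":$PYTHONPATH")
```

The conda directory name changes between releases, so it is passed in rather than guessed.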
Finally, we start the Python REPL and execute the script below. In addition to cellranger, we also install other Python packages such as h5py, numpy, scipy, and scikit-learn with pip (subprocess and os are part of the standard library and need no installation). Because of the data size, we process the matrix in chunks of 100 rows (step=100) and save each chunk incrementally in append mode.
# Python Version : 2.7
# coding:utf-8
import cellranger.matrix as cr_matrix
import h5py
import numpy
import subprocess
import os
from sklearn import preprocessing
from scipy.sparse import csr_matrix
# Setting
step=100
orgname="mm10"
hdf5file="1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.h5"
# Data Loading from HDF5
matdata = cr_matrix.GeneBCMatrices.load_h5(hdf5file)
matdata = matdata.get_matrix(orgname)
# Remove ERCC spikein
# (note: the substring test 'Ercc' also matches mouse genes such as Ercc1)
erccpos = []
for i in range(matdata.m.shape[0]):
    genename = matdata.genes[i][1]
    if 'Ercc' in genename:
        erccpos.append(i)
# sorted() keeps the remaining rows in their original order
target = sorted(set(range(matdata.m.shape[0])) - set(erccpos))
matdata.m = matdata.m[target, ]
# Remove Variance zero genes
# Row variance is computed as E[x^2] - (E[x])^2
zvpos = []
term1 = (matdata.m.multiply(matdata.m)).mean(axis=1)
term2 = matdata.m.mean(axis=1)
# term2 is a dense numpy matrix, so square it element-wise with numpy.multiply
term2 = numpy.multiply(term2, term2)
rowvar = numpy.asarray(term1 - term2).ravel()
for i in range(matdata.m.shape[0]):
    rv = rowvar[i]
    if rv == 0:
        zvpos.append(i)
target = sorted(set(range(matdata.m.shape[0])) - set(zvpos))
matdata.m = matdata.m[target, ]
# Data Saving as CSV
csvfile="1M_neurons/Data.csv"
nrow = matdata.m.shape[0]
for i, start in enumerate(range(0, nrow, step)):
    print(i)
    # Half-open chunk [start, end); the last chunk may be shorter than step
    end = min(start + step, nrow)
    with open(csvfile, "a") as f_handle:
        tmp = csr_matrix(matdata.m[start:end], dtype=numpy.int64).todense()
        numpy.savetxt(f_handle, tmp, fmt="%i", delimiter=",")
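The zero-variance filter in the script above relies on the identity Var(x) = E[x^2] - (E[x])^2, which needs only two row means and so never requires a dense copy of the matrix. A minimal pure-Python sketch of the same idea, using a hypothetical helper `row_variance` and toy counts:

```python
def row_variance(row):
    """Population variance via E[x^2] - (E[x])^2."""
    n = float(len(row))
    mean_sq = sum(v * v for v in row) / n
    mean = sum(row) / n
    return mean_sq - mean * mean


counts = [
    [0, 0, 0, 0],   # never detected: variance 0, would be dropped
    [2, 2, 2, 2],   # constant non-zero: variance 0, would be dropped
    [0, 3, 0, 1],   # varies across cells: kept
]
variances = [row_variance(r) for r in counts]
keep = [i for i, v in enumerate(variances) if v != 0]
print(variances, keep)
```

With floating-point rounding, a constant row can produce a tiny non-zero value instead of exactly 0, so comparing against a small tolerance may be safer than `== 0` on real data.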
We can confirm that the corresponding CSV file has been generated.
ls -lth 1M_neurons/Data.csv
We also generated log-transformed, scaled, and transposed versions of the matrix.
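Listing the file shows that it exists, but not that every row was written. The sketch below is one way to check this on a toy matrix: it writes chunks in append mode, as the loop above does, and then counts rows and columns by streaming the file. All function names here are hypothetical and only the standard library is used.

```python
import csv
import os
import tempfile


def chunk_bounds(n_rows, step):
    """Yield half-open (start, end) pairs that cover 0..n_rows exactly once."""
    for start in range(0, n_rows, step):
        yield start, min(start + step, n_rows)


def append_chunks(rows, path, step):
    """Write rows to path chunk by chunk, reopening in append mode each time."""
    for start, end in chunk_bounds(len(rows), step):
        with open(path, "a") as handle:
            writer = csv.writer(handle, lineterminator="\n")
            writer.writerows(rows[start:end])


def csv_shape(path):
    """Count rows and columns without loading the whole file."""
    n_rows, n_cols = 0, 0
    with open(path) as handle:
        for record in csv.reader(handle):
            n_rows += 1
            n_cols = len(record)
    return n_rows, n_cols


rows = [[g * 10 + c for c in range(4)] for g in range(7)]  # toy 7 x 4 matrix
path = os.path.join(tempfile.mkdtemp(), "toy.csv")
append_chunks(rows, path, step=3)
bounds = list(chunk_bounds(len(rows), 3))
shape = csv_shape(path)
print(bounds, shape)
```

Because the file is opened in append mode, running the export twice doubles the row count, so it helps to delete any old file first and compare the counted rows with the matrix shape.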
libsize = True
cper = 1E4
log = True
center = True
transpose = True
def tenxh52csv(matdata, csvfile, step, libsize, cper, log, center, transpose, verbose):
    # Start from an empty file because chunks are written in append mode
    if os.path.exists(csvfile):
        os.remove(csvfile)
    if libsize:
        # Per-cell total counts, shape (1, n_cells)
        sumvec = numpy.asarray(numpy.sum(matdata.m, axis=0))
    if transpose:
        mat = matdata.m.T
    else:
        mat = matdata.m
    N = mat.shape[0]
    # Any transformation produces fractional values, so write floats
    if libsize or log or center:
        fmt = "%.3e"
    else:
        fmt = "%i"
    for i, start in enumerate(range(0, N, step)):
        if verbose:
            print(i)
        # Half-open chunk [start, end); the last chunk may be shorter than step
        end = min(start + step, N)
        with open(csvfile, "a") as f:
            tmp = csr_matrix(mat[start:end], dtype=numpy.float64).toarray()
            if libsize and not transpose:
                # Rows are genes: sumvec (1, n_cells) broadcasts across the rows
                tmp = (tmp / sumvec) * cper
            if libsize and transpose:
                # Rows are cells: divide each row by that cell's own total
                tmp = (tmp / sumvec[:, start:end].T) * cper
            if log:
                tmp = numpy.log10(tmp + 1)
            # Note: centering is applied within each chunk
            if center and not transpose:
                tmp = preprocessing.scale(tmp, axis=0, with_mean=True, with_std=False)
            if center and transpose:
                tmp = preprocessing.scale(tmp, axis=1, with_mean=True, with_std=False)
            numpy.savetxt(f, tmp, fmt=fmt, delimiter=",")
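To show what the library-size and log options compute, here is a small worked example in pure Python on a toy 2 genes × 2 cells matrix. The function name `normalize` and the counts are illustrative only; `cper` plays the same role as in the function above.

```python
import math


def normalize(counts, cper=1e4, log=True):
    """Scale each cell (column) to `cper` total counts, then optionally log10(x + 1)."""
    n_genes, n_cells = len(counts), len(counts[0])
    # Library size = total counts per cell (column sums)
    libsize = [sum(counts[g][c] for g in range(n_genes)) for c in range(n_cells)]
    out = []
    for g in range(n_genes):
        row = [counts[g][c] / float(libsize[c]) * cper for c in range(n_cells)]
        if log:
            row = [math.log10(v + 1) for v in row]
        out.append(row)
    return out


counts = [
    [1, 0],
    [3, 10],
]  # toy 2 genes x 2 cells
cp10k = normalize(counts, cper=1e4, log=False)
logcp10k = normalize(counts, cper=1e4, log=True)
print(cp10k)
```

After scaling, every column sums to `cper`, so cells with different sequencing depth become comparable. Passing 1E6 gives CPM, and passing the median library size gives the median-scaled variant.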
# Data Saving as CSV
csvfile="1M_neurons/LogData.csv"
nrow = matdata.m.shape[0]
for i, start in enumerate(range(0, nrow, step)):
    print(i)
    end = min(start + step, nrow)  # half-open chunk [start, end)
    with open(csvfile, "a") as f_handle:
        tmp = csr_matrix(matdata.m[start:end], dtype=numpy.int64).todense()
        tmp = numpy.log10(tmp + 1)
        # log10 values are fractional, so write floats rather than "%i"
        numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
# Per-cell total counts (library sizes), shape (1, n_cells)
sumvec = numpy.sum(matdata.m, axis=0)
csvfile="1M_neurons/CPM.csv"
nrow = matdata.m.shape[0]
for i, start in enumerate(range(0, nrow, step)):
    print(i)
    end = min(start + step, nrow)  # half-open chunk [start, end)
    with open(csvfile, "a") as f_handle:
        tmp = (1.0 * csr_matrix(matdata.m[start:end], dtype=numpy.float64).todense() / sumvec) * 1E6
        numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/LogCPM.csv"
nrow = matdata.m.shape[0]
for i, start in enumerate(range(0, nrow, step)):
    print(i)
    end = min(start + step, nrow)  # half-open chunk [start, end)
    with open(csvfile, "a") as f_handle:
        tmp = (1.0 * csr_matrix(matdata.m[start:end], dtype=numpy.float64).todense() / sumvec) * 1E6
        tmp = numpy.log10(tmp + 1)
        numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/CP10K.csv"
nrow = matdata.m.shape[0]
for i, start in enumerate(range(0, nrow, step)):
    print(i)
    end = min(start + step, nrow)  # half-open chunk [start, end)
    with open(csvfile, "a") as f_handle:
        tmp = (1.0 * csr_matrix(matdata.m[start:end], dtype=numpy.float64).todense() / sumvec) * 1E4
        numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/LogCP10K.csv"
nrow = matdata.m.shape[0]
for i, start in enumerate(range(0, nrow, step)):
    print(i)
    end = min(start + step, nrow)  # half-open chunk [start, end)
    with open(csvfile, "a") as f_handle:
        tmp = (1.0 * csr_matrix(matdata.m[start:end], dtype=numpy.float64).todense() / sumvec) * 1E4
        tmp = numpy.log10(tmp + 1)
        numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
# Median library size across cells
med = numpy.median(numpy.asarray(sumvec))
csvfile="1M_neurons/CPMED.csv"
nrow = matdata.m.shape[0]
for i, start in enumerate(range(0, nrow, step)):
    print(i)
    end = min(start + step, nrow)  # half-open chunk [start, end)
    with open(csvfile, "a") as f_handle:
        tmp = (1.0 * csr_matrix(matdata.m[start:end], dtype=numpy.float64).todense() / sumvec) * med
        numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/LogCPMED.csv"
nrow = matdata.m.shape[0]
for i, start in enumerate(range(0, nrow, step)):
    print(i)
    end = min(start + step, nrow)  # half-open chunk [start, end)
    with open(csvfile, "a") as f_handle:
        tmp = (1.0 * csr_matrix(matdata.m[start:end], dtype=numpy.float64).todense() / sumvec) * med
        tmp = numpy.log10(tmp + 1)
        numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/CenteredData.csv"
nrow = matdata.m.shape[0]
for i, start in enumerate(range(0, nrow, step)):
    print(i)
    end = min(start + step, nrow)  # half-open chunk [start, end)
    with open(csvfile, "a") as f_handle:
        tmp = csr_matrix(matdata.m[start:end], dtype=numpy.int64).todense()
        # Note: axis=0 centers each column (cell) over the rows of this chunk only
        tmp = preprocessing.scale(numpy.asarray(tmp), axis=0, with_mean=True, with_std=False)
        # Centered values are fractional, so write floats rather than "%i"
        numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/CenteredLogData.csv"
nrow = matdata.m.shape[0]
for i, start in enumerate(range(0, nrow, step)):
    print(i)
    end = min(start + step, nrow)  # half-open chunk [start, end)
    with open(csvfile, "a") as f_handle:
        tmp = csr_matrix(matdata.m[start:end], dtype=numpy.int64).todense()
        tmp = numpy.log10(tmp + 1)
        tmp = preprocessing.scale(numpy.asarray(tmp), axis=0, with_mean=True, with_std=False)
        # Log-transformed, centered values are fractional, so write floats
        numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
# Per-cell total counts (library sizes), shape (1, n_cells)
sumvec = numpy.sum(matdata.m, axis=0)
csvfile="1M_neurons/CenteredCPM.csv"
nrow = matdata.m.shape[0]
for i, start in enumerate(range(0, nrow, step)):
    print(i)
    end = min(start + step, nrow)  # half-open chunk [start, end)
    with open(csvfile, "a") as f_handle:
        tmp = (1.0 * csr_matrix(matdata.m[start:end], dtype=numpy.float64).todense() / sumvec) * 1E6
        tmp = preprocessing.scale(numpy.asarray(tmp), axis=0, with_mean=True, with_std=False)
        numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/CenteredLogCPM.csv"
nrow = matdata.m.shape[0]
for i, start in enumerate(range(0, nrow, step)):
    print(i)
    end = min(start + step, nrow)  # half-open chunk [start, end)
    with open(csvfile, "a") as f_handle:
        tmp = (1.0 * csr_matrix(matdata.m[start:end], dtype=numpy.float64).todense() / sumvec) * 1E6
        tmp = numpy.log10(tmp + 1)
        tmp = preprocessing.scale(numpy.asarray(tmp), axis=0, with_mean=True, with_std=False)
        numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/CenteredCP10K.csv"
nrow = matdata.m.shape[0]
for i, start in enumerate(range(0, nrow, step)):
    print(i)
    end = min(start + step, nrow)  # half-open chunk [start, end)
    with open(csvfile, "a") as f_handle:
        tmp = (1.0 * csr_matrix(matdata.m[start:end], dtype=numpy.float64).todense() / sumvec) * 1E4
        tmp = preprocessing.scale(numpy.asarray(tmp), axis=0, with_mean=True, with_std=False)
        numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/CenteredLogCP10K.csv"
nrow = matdata.m.shape[0]
for i, start in enumerate(range(0, nrow, step)):
    print(i)
    end = min(start + step, nrow)  # half-open chunk [start, end)
    with open(csvfile, "a") as f_handle:
        tmp = (1.0 * csr_matrix(matdata.m[start:end], dtype=numpy.float64).todense() / sumvec) * 1E4
        tmp = numpy.log10(tmp + 1)
        tmp = preprocessing.scale(numpy.asarray(tmp), axis=0, with_mean=True, with_std=False)
        numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
# Median library size across cells
med = numpy.median(numpy.asarray(sumvec))
csvfile="1M_neurons/CenteredCPMED.csv"
nrow = matdata.m.shape[0]
for i, start in enumerate(range(0, nrow, step)):
    print(i)
    end = min(start + step, nrow)  # half-open chunk [start, end)
    with open(csvfile, "a") as f_handle:
        tmp = (1.0 * csr_matrix(matdata.m[start:end], dtype=numpy.float64).todense() / sumvec) * med
        tmp = preprocessing.scale(numpy.asarray(tmp), axis=0, with_mean=True, with_std=False)
        numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/CenteredLogCPMED.csv"
nrow = matdata.m.shape[0]
for i, start in enumerate(range(0, nrow, step)):
    print(i)
    end = min(start + step, nrow)  # half-open chunk [start, end)
    with open(csvfile, "a") as f_handle:
        tmp = (1.0 * csr_matrix(matdata.m[start:end], dtype=numpy.float64).todense() / sumvec) * med
        tmp = numpy.log10(tmp + 1)
        tmp = preprocessing.scale(numpy.asarray(tmp), axis=0, with_mean=True, with_std=False)
        numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
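The centering blocks above call preprocessing.scale with axis=0 on each chunk of rows, so it is worth checking which statistic that produces. The pure-Python sketch below (hypothetical helpers, toy values) compares chunk-wise and whole-matrix centering in both directions: subtracting row means gives the same result either way, whereas subtracting column means within each chunk does not match column centering of the whole matrix.

```python
def center_rows(block):
    """Subtract each row's own mean (per-gene centering)."""
    return [[v - sum(row) / float(len(row)) for v in row] for row in block]


def center_cols(block):
    """Subtract each column's mean computed over the rows in `block` only."""
    n = float(len(block))
    means = [sum(col) / n for col in zip(*block)]
    return [[v - m for v, m in zip(row, means)] for row in block]


matrix = [[1.0, 3.0], [2.0, 6.0], [4.0, 8.0], [5.0, 7.0]]  # toy 4 genes x 2 cells
chunks = [matrix[0:2], matrix[2:4]]                         # two chunks of 2 rows

rows_chunked = [r for c in chunks for r in center_rows(c)]
cols_chunked = [r for c in chunks for r in center_cols(c)]

print(rows_chunked == center_rows(matrix))
print(cols_chunked == center_cols(matrix))
```

If per-gene centering is the goal for a genes × cells matrix written in row chunks, axis=1 is the chunk-safe choice. If per-cell centering over all genes is intended, the column means would have to be computed over the whole matrix first.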
# Transposed matrix
t_matdata = matdata.m.T
# Data Saving as CSV
csvfile="1M_neurons/t_Data.csv"
# t_matdata is already a sparse matrix (cells x genes), so it has no .m attribute
nrow = t_matdata.shape[0]
for i, start in enumerate(range(0, nrow, step)):
    print(i)
    end = min(start + step, nrow)  # half-open chunk [start, end)
    with open(csvfile, "a") as f_handle:
        tmp = csr_matrix(t_matdata[start:end], dtype=numpy.int64).todense()
        numpy.savetxt(f_handle, tmp, fmt="%i", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/t_LogData.csv"
nrow = t_matdata.shape[0]
for i, start in enumerate(range(0, nrow, step)):
    print(i)
    end = min(start + step, nrow)  # half-open chunk [start, end)
    with open(csvfile, "a") as f_handle:
        tmp = csr_matrix(t_matdata[start:end], dtype=numpy.int64).todense()
        tmp = numpy.log10(tmp + 1)
        # log10 values are fractional, so write floats rather than "%i"
        numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/t_CPM.csv"
nrow = t_matdata.shape[0]
for i, start in enumerate(range(0, nrow, step)):
    print(i)
    end = min(start + step, nrow)  # half-open chunk [start, end)
    with open(csvfile, "a") as f_handle:
        tmp = csr_matrix(t_matdata[start:end], dtype=numpy.float64).toarray()
        # Rows are cells here, so each cell's library size is its row sum
        tmp = tmp / tmp.sum(axis=1, keepdims=True) * 1E6
        numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/t_LogCPM.csv"
nrow = t_matdata.shape[0]
for i, start in enumerate(range(0, nrow, step)):
    print(i)
    end = min(start + step, nrow)  # half-open chunk [start, end)
    with open(csvfile, "a") as f_handle:
        tmp = csr_matrix(t_matdata[start:end], dtype=numpy.float64).toarray()
        # Rows are cells here, so each cell's library size is its row sum
        tmp = tmp / tmp.sum(axis=1, keepdims=True) * 1E6
        tmp = numpy.log10(tmp + 1)
        numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/t_CP10K.csv"
nrow = t_matdata.shape[0]
for i, start in enumerate(range(0, nrow, step)):
    print(i)
    end = min(start + step, nrow)  # half-open chunk [start, end)
    with open(csvfile, "a") as f_handle:
        tmp = csr_matrix(t_matdata[start:end], dtype=numpy.float64).toarray()
        # Rows are cells here, so each cell's library size is its row sum
        tmp = tmp / tmp.sum(axis=1, keepdims=True) * 1E4
        numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/t_LogCP10K.csv"
nrow = t_matdata.shape[0]
for i, start in enumerate(range(0, nrow, step)):
    print(i)
    end = min(start + step, nrow)  # half-open chunk [start, end)
    with open(csvfile, "a") as f_handle:
        tmp = csr_matrix(t_matdata[start:end], dtype=numpy.float64).toarray()
        # Rows are cells here, so each cell's library size is its row sum
        tmp = tmp / tmp.sum(axis=1, keepdims=True) * 1E4
        tmp = numpy.log10(tmp + 1)
        numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
# Median library size across cells (row sums of the cells x genes matrix)
med = numpy.median(numpy.asarray(t_matdata.sum(axis=1)))
csvfile="1M_neurons/t_CPMED.csv"
nrow = t_matdata.shape[0]
for i, start in enumerate(range(0, nrow, step)):
    print(i)
    end = min(start + step, nrow)  # half-open chunk [start, end)
    with open(csvfile, "a") as f_handle:
        tmp = csr_matrix(t_matdata[start:end], dtype=numpy.float64).toarray()
        # Rows are cells here, so each cell's library size is its row sum
        tmp = tmp / tmp.sum(axis=1, keepdims=True) * med
        numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
# Data Saving as CSV
csvfile="1M_neurons/t_LogCPMED.csv"
nrow = t_matdata.shape[0]
for i, start in enumerate(range(0, nrow, step)):
    print(i)
    end = min(start + step, nrow)  # half-open chunk [start, end)
    with open(csvfile, "a") as f_handle:
        tmp = csr_matrix(t_matdata[start:end], dtype=numpy.float64).toarray()
        # Rows are cells here, so each cell's library size is its row sum
        tmp = tmp / tmp.sum(axis=1, keepdims=True) * med
        tmp = numpy.log10(tmp + 1)
        numpy.savetxt(f_handle, tmp, fmt="%.3e", delimiter=",")
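In the transposed (cells × genes) layout, every chunk of rows holds complete cells, so a cell's library size is simply the sum of its row. The sketch below checks on toy data that normalizing by row sums after transposing gives the same values as normalizing by column sums before transposing. The function names are hypothetical and the counts are made up.

```python
def cpm_genes_by_cells(counts, cper=1e6):
    """Normalize a genes x cells matrix by column (cell) totals."""
    totals = [sum(col) for col in zip(*counts)]
    return [[v / float(t) * cper for v, t in zip(row, totals)] for row in counts]


def cpm_cells_by_genes(t_counts, cper=1e6):
    """Normalize a cells x genes matrix by row (cell) totals."""
    return [[v / float(sum(row)) * cper for v in row] for row in t_counts]


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


counts = [[1, 0, 2], [3, 5, 2]]  # toy 2 genes x 3 cells
a = transpose(cpm_genes_by_cells(counts))
b = cpm_cells_by_genes(transpose(counts))
print(a == b)
```

Summing the transposed matrix along axis=0 would instead give per-gene totals, which is a different normalization from counts per million per cell.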
- https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.3.0/1M_neurons
- https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger
- https://support.10xgenomics.com/single-cell/software/pipelines/latest/rkit
- https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/python
- https://stackoverflow.com/questions/12169611/how-do-i-compute-the-variance-of-a-column-of-a-sparse-matrix-in-scipy
Koki Tsuyuzaki <koki.tsuyuzaki [at] gmail.com>
2019/10/1
Hello, this approach would be useful for my research, so I am attempting to replicate it on my local system.
Have you tried this approach with Cellranger 3.0.2? Step 2 worked for me, only the specific paths had to be changed:
export PYTHONPATH=./cellranger-3.0.2/cellranger-cs/3.0.2/lib/python/:$PYTHONPATH
export PYTHONPATH=./cellranger-3.0.2/cellranger-cs/3.0.2/tenkit/lib/python/:$PYTHONPATH
export PYTHONPATH=./cellranger-3.0.2/miniconda-cr-cs/4.3.21-miniconda-cr-cs-c10/lib/python2.7/site-packages/:$PYTHONPATH
Afterwards, Cellranger could be imported to Python, while importing matrix resulted in this error:
Edit: Spelling