kokitsuyuzaki

This article is the day-two entry of the Workflow Advent Calendar 2020.

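I have recently been writing my workflows with Snakemake. Before that I tried various approaches, moving from shell scripts to make and then to Rake, but for now I have settled on Snakemake because it makes environment setup, reproducibility, and distributed execution easy to handle. The technologies behind code reproducibility in Snakemake are Anaconda and Docker (Singularity). Here I argue that each has its strengths and weaknesses, and that both leave some needs unmet.

Note that I do data analysis for a living and use Snakemake in the course of exploratory data analysis (EDA), where I preprocess, analyze, and visualize data while adding and removing packages as needed. My situation may therefore differ considerably from that of someone whose processing steps are fixed in advance and who only needs to turn them into a workflow.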

Advantages of Anaconda

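Using Anaconda from within Snakemake is simple: in the Snakefile that Snakemake needs in order to run, add a conda directive to the rule you want to execute. Through the conda directive you describe the shared libraries and the R and Python packages required to build the conda environment. https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#integrated-package-management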

After that, all you need to do is add the --use-conda option when running the snakemake command.
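The setup described above can be sketched as follows. This is only a minimal illustration, not the author's actual workflow: the rule name `plot`, the file paths, and the packages listed in the environment file are all hypothetical. The Python script simply writes a small Snakefile whose rule carries a `conda` directive pointing at an environment file, which is resolved relative to the Snakefile.

```python
from pathlib import Path

# Hypothetical conda environment file: the packages are placeholders.
ENV_YAML = """\
channels:
  - conda-forge
  - bioconda
dependencies:
  - r-base
  - r-ggplot2
"""

# Hypothetical rule: the conda directive points at the environment file above.
SNAKEFILE = """\
rule plot:
    input:
        "data/input.csv"
    output:
        "plot/output.png"
    conda:
        "envs/plot.yaml"
    script:
        "src/plot.R"
"""


def write_workflow(root):
    """Write the example Snakefile and environment file under `root`."""
    root = Path(root)
    (root / "envs").mkdir(parents=True, exist_ok=True)
    (root / "envs" / "plot.yaml").write_text(ENV_YAML)
    snakefile = root / "Snakefile"
    snakefile.write_text(SNAKEFILE)
    return snakefile
```

With these files in place, running `snakemake --use-conda --cores 1` in that directory should let Snakemake create the environment for the rule before executing it (recent Snakemake versions require the `--cores` option).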

Best Practices for OnlinePCA.jl against 1.3M Mouse Brain Data

In this manuscript, we explain how to run OnlinePCA.jl on the 1.3 million (1.3M) single-cell dataset from 10X Genomics ( https://community.10xgenomics.com/t5/10x-Blog/Our-1-3-million-single-cell-dataset-is-ready-to-download/ba-p/276 ), which is the largest single-cell RNA-Seq (scRNA-Seq) dataset at this time.

Step.1 : Prepare the dataset

The current version of OnlinePCA.jl assumes that the input data is in CSV format, so that it can be applied across a wide variety of research fields. Because the 1.3M data is stored in the HDF5 format defined by 10X Genomics, we first convert the HDF5 file to CSV (cf. Saving the HDF5 file of 10X Genomics as CSV format). There are several efforts to standardize the handling of such ultra-large scRNA-Seq data, such as beachmat, Loom (LoomExperiment, Loompy), TENxGenomics, scanpy, Seurat, and 10X-HDF5. According to user's

Converting the HDF5 file of 10X Genomics as CSV format

In this manuscript, we explain how to extract the gene × cell matrix from the HDF5 file provided by 10X Genomics and save it in CSV format.
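As a rough illustration of the conversion idea (not the procedure used in this gist), the sketch below assumes that the matrix has already been read from the HDF5 file into the compressed sparse column arrays `data`, `indices`, `indptr`, and `shape`, which is how 10X Genomics HDF5 files store the gene × cell matrix. The function name `csc_to_csv` and the toy values are hypothetical, and the function expands everything into memory, so for the real 1.3M dataset the cells would have to be processed in chunks.

```python
import csv
import io


def csc_to_csv(data, indices, indptr, shape, gene_names, out):
    """Write a gene x cell matrix stored in CSC form as CSV rows."""
    n_genes, n_cells = shape
    # Dense gene x cell table, filled one column (cell) at a time.
    dense = [[0] * n_cells for _ in range(n_genes)]
    for cell in range(n_cells):
        # Non-zero entries of this cell live in data[indptr[cell]:indptr[cell + 1]].
        for k in range(indptr[cell], indptr[cell + 1]):
            dense[indices[k]][cell] = data[k]
    writer = csv.writer(out, lineterminator="\n")
    for name, row in zip(gene_names, dense):
        writer.writerow([name, *row])


# Toy example: 3 genes x 2 cells (values are made up).
buf = io.StringIO()
csc_to_csv(
    data=[1, 3, 2, 4],
    indices=[0, 2, 1, 2],
    indptr=[0, 2, 4],
    shape=(3, 2),
    gene_names=["A", "B", "C"],
    out=buf,
)
print(buf.getvalue())
```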

Step.1 : Download the HDF5 file from the website of 10X Genomics

First, we download the HDF5 file from the 10X Genomics site. The data is hosted on Amazon AWS and can easily be downloaded with the wget command, as shown below.

Level1_3_R.markdown

Source code used in "Level1 [3] Rの使い方" (How to Use R) of the book 次世代シークエンサーDRY解析教本 (細胞工学別冊)

How to Use the MeSH ORA Framework

This gist is a Japanese translation of the vignette (the document describing how to use a package) of the Bioconductor meshr package.

1. Introduction

This gist explains how to use the following MeSH-related packages.

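MeSH.db : a package that provides MeSH information
MeSH.AOR.db : a package that provides the ancestor-offspring relationships of MeSH
MeSH.PCR.db : a package that provides the parent-child relationships of MeSH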

Submission guidelines common to BMC-series papers

=======

Submission guidelines for BMC Bioinformatics

=======

Analysis of 83534 SNPs in 89 Asian individuals (45 Chinese, 44 Japanese) registered in HapMap

PLINK Web site : http://pngu.mgh.harvard.edu/~purcell/plink/
PLINK Tutorial : http://pngu.mgh.harvard.edu/~purcell/plink/tutorial.shtml

=======

Things to prepare in advance

An environment where R and Linux commands (more, sort, head, and so on) can be used from the console
Download and install PLINK (from the PLINK Web site)
Download hapmap1.zip (from the PLINK Tutorial)
