@hermidalc
Last active September 3, 2024 11:48
Using sklearn-extensions to perform edgeR TMM-TPM normalization in your Python ML code

Installation

Download and install Mambaforge

On Linux:

wget https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh
bash Mambaforge-Linux-x86_64.sh

On macOS:

wget https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-MacOSX-x86_64.sh
bash Mambaforge-MacOSX-x86_64.sh

Or on macOS via Homebrew:

brew install --cask mambaforge

Run the following after installation:

mamba init
mamba config --set auto_activate_base false

Close the shell and reopen a new one to load the mamba environment.

To update the mamba base environment in the future:

mamba deactivate
mamba update --all

Creating a mamba/conda environment for your Git project

In your Git project, make an envs directory and create an envs/{git repository name}.yaml file, where by convention the file name is the exact (case-sensitive) name of your Git project, e.g.:

mkdir envs
cat > envs/deeppt.yaml <<EOF
name: deeppt
channels:
  - conda-forge
  - bioconda
dependencies:
  - bioconductor-edger
  - joblib
  - libopenblas
  - numpy
  - pandas
  - python=3.8
  - r-base=3.6
  - r-data.table
  - r-statmod
  - rpy2=3.1
  - scikit-learn=0.22.2
EOF

Then create the environment:

mamba env create -f envs/deeppt.yaml

And activate:

mamba activate deeppt

To deactivate the active environment:

mamba deactivate

To update the environment, for example to bump dependencies to their latest available versions or after adding a new dependency to the yaml file:

mamba env update -f envs/deeppt.yaml

Adding sklearn-extensions to your project

Even though we could add sklearn-extensions as a Git submodule, it's easiest for now to simply clone it into the top level of your code repository:

git clone https://github.com/hermidalc/sklearn-extensions.git sklearn_extensions

And if you need to update it:

cd sklearn_extensions
git pull
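
The example in the next section fits the normalization on the training fold only and then transforms both folds with the fitted parameters, so no test-fold information leaks into the fitted statistics. Here is that pattern in miniature, with a stock scikit-learn transformer and a toy matrix standing in for EdgeRTMMTPM and the count data (both stand-ins are illustrative, not part of this gist's pipeline):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler  # illustrative stand-in

X = np.arange(20, dtype=float).reshape(10, 2)  # toy data matrix
cv = KFold(n_splits=5, shuffle=True, random_state=777)
for train_idx, test_idx in cv.split(X):
    # Fit on the training fold only
    scaler = StandardScaler().fit(X[train_idx])
    X_train = scaler.transform(X[train_idx])
    # Transform the held-out fold with the same fitted parameters
    X_test = scaler.transform(X[test_idx])
```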

Example: TMM-TPM normalization in a cross-validation loop

import warnings

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

warnings.filterwarnings('ignore', category=FutureWarning,
                        module='rpy2.robjects.pandas2ri')

from sklearn_extensions.preprocessing import EdgeRTMMTPM

random_seed = 777

# Load the expressed gene IDs, their GENCODE annotations (which include
# the gene lengths), the count matrix, and the slide metadata
gene_ids = np.loadtxt(
    'tcga_brca_slide_tissue_htseq_counts_expr_genes.tsv', dtype=str)
gene_annots = pd.read_csv(
    'gencode_v22_gene_annots.tsv', sep='\t', index_col=0).loc[gene_ids]
counts = pd.read_csv(
    'tcga_brca_slide_tissue_htseq_counts.tsv', sep='\t',
    usecols=np.append(gene_ids, 'tissue_submitter_id'),
    index_col='tissue_submitter_id')[gene_ids]
slide_meta = pd.read_csv(
    'tcga_brca_slide_htseq_counts.tsv', sep='\t', usecols=range(0, 7),
    index_col='slide_submitter_id')

tmm_tpm = EdgeRTMMTPM(log=True, prior_count=2, gene_length_col='Length')
cv = KFold(n_splits=5, shuffle=True, random_state=random_seed)
for train_idx, test_idx in cv.split(counts):
    counts_train = counts.iloc[train_idx]
    counts_test = counts.iloc[test_idx]
    # Fit TMM normalization factors on the training fold only, then apply
    # them to both folds to avoid test-fold information leakage
    tmm_tpm.fit(counts_train)
    tmm_tpm_train = pd.DataFrame(
        tmm_tpm.transform(counts_train, feature_meta=gene_annots),
        columns=counts_train.columns, index=counts_train.index)
    tmm_tpm_test = pd.DataFrame(
        tmm_tpm.transform(counts_test, feature_meta=gene_annots),
        columns=counts_test.columns, index=counts_test.index)
    # Attach slide-level metadata to the normalized expression matrices
    slide_tmm_tpm_train = slide_meta.join(
        tmm_tpm_train, on='tissue_submitter_id', how='inner')
    slide_tmm_tpm_test = slide_meta.join(
        tmm_tpm_test, on='tissue_submitter_id', how='inner')
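
For intuition, the TPM half of the transform can be sketched in plain NumPy. This is only an illustration of the arithmetic (the log2_tpm function and the toy numbers are mine, not part of sklearn-extensions), and it omits the TMM step: EdgeRTMMTPM additionally rescales each sample's library size by an edgeR TMM normalization factor before computing TPM values.

```python
import numpy as np

def log2_tpm(counts, gene_lengths, prior_count=2):
    # counts: (n_samples, n_genes) raw counts; gene_lengths: (n_genes,) in bp
    counts = np.asarray(counts, dtype=float)
    # Length-normalize counts to per-base read rates
    rate = counts / np.asarray(gene_lengths, dtype=float)
    # Rescale so each sample (row) sums to one million
    tpm = rate / rate.sum(axis=1, keepdims=True) * 1e6
    # The prior count keeps log2 finite for zero-count genes
    return np.log2(tpm + prior_count)

toy_counts = np.array([[10, 20, 70],
                       [5, 45, 50]])
toy_lengths = np.array([1000, 2000, 3500])
toy_log_tpm = log2_tpm(toy_counts, toy_lengths)
```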