Leandro Hermida hermidalc

  • Pittsburgh, PA
@hermidalc
hermidalc / tcga_impute_sandbox.R
Created September 6, 2024 16:06
Testing imputation methods on GDC TCGA clinical data
library(missForest)
library(mice)
library(ggplot2)
library(ggmice)
input_df <- gdc_case_meta[
    c("project_id", "gender", "age_at_diagnosis", "tumor_stage")
]
input_df$project_id <- factor(input_df$project_id)
hermidalc / ml_tmm_tpm.md
Last active September 3, 2024 11:48
Using sklearn-extensions to perform edgeR TMM-TPM normalization in your Python ML code

Installation

Download and install Mambaforge

On Linux:

wget https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh
bash Mambaforge-Linux-x86_64.sh
hermidalc / column_selector.py
Last active September 3, 2024 11:48
scikit-learn compatible ColumnSelector class
# Selecting columns by feature name requires X to be a pandas DataFrame.
# In a Pipeline, name-based selection must therefore be the first step.
# Column indices can still be used in a later step, provided no feature
# selection step comes before it.
import warnings
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.utils import check_X_y
from sklearn.feature_selection import SelectorMixin
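
The rule stated in the comments at the top of this preview can be sketched without scikit-learn. The helper below is hypothetical (`resolve_columns` and its parameters do not come from the gist) and only illustrates the logic: name-based selection needs labelled columns, such as those of a pandas DataFrame, while integer indices work on any array.

```python
def resolve_columns(cols, n_features, feature_names=None):
    """Return a boolean mask over n_features columns.

    Hypothetical helper, not part of the gist's ColumnSelector.
    feature_names stands in for the columns of a pandas DataFrame;
    pass None to mimic a plain array that carries no names.
    """
    if cols is None:
        # No selection requested: keep every column.
        return [True] * n_features
    cols = list(cols)
    if cols and all(isinstance(c, str) for c in cols):
        # Names can only be resolved when the input carries column labels.
        if feature_names is None:
            raise ValueError("name-based selection needs named columns")
        wanted = set(cols)
        missing = wanted - set(feature_names)
        if missing:
            raise KeyError(f"unknown columns: {sorted(missing)}")
        return [name in wanted for name in feature_names]
    # Integer indices work whether or not the columns are named.
    mask = [False] * n_features
    for idx in cols:
        mask[idx] = True
    return mask


names = ["gene_a", "gene_b", "gene_c"]
mask_by_name = resolve_columns(["gene_a", "gene_c"], 3, feature_names=names)
mask_by_index = resolve_columns([0, 2], 3)
```

Both calls select the first and third columns. The name-based call fails with a `ValueError` when `feature_names` is omitted, which mirrors why such a selector would have to sit at the start of a Pipeline, before the data loses its column labels.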
hermidalc / analyze_rna_seq_ml_nested_cv.py
Last active September 3, 2024 11:48
Building and evaluating ML models of RNA-seq count data using nested CV
import atexit
import os
import re
import sys
from argparse import ArgumentParser
from decimal import Decimal
from glob import glob
from pprint import pprint
from shutil import rmtree
from tempfile import mkdtemp, gettempdir
hermidalc / analyze_rna_seq_gsea_preranked.R
Created May 16, 2021 22:01
RNA-seq differential gene expression GSEA preranked analysis
suppressPackageStartupMessages({
    library(Biobase)
    library(data.table)
    library(edgeR)
    library(fgsea)
    library(msigdbr)
    library(ggplot2)
})
set.seed(777)
hermidalc / analyze_rna_seq_batch_effects.R
Last active September 3, 2024 11:48
RNA-seq batch effect analysis, plotting, and batch effect removal with DESeq2, edgeR, limma
suppressPackageStartupMessages({
    library(Biobase)
    library(DESeq2)
    library(EDASeq)
    library(edgeR)
    library(limma)
    library(RColorBrewer)
})
fig_dim <- 5
hermidalc / analyze_rna_seq_diff_expr.R
Last active September 3, 2024 11:49
RNA-seq normalization, differential expression, transformation, volcano plotting with DESeq2, edgeR, limma-voom
suppressPackageStartupMessages({
    library(Biobase)
    library(DESeq2)
    library(edgeR)
    library(EnhancedVolcano)
    library(limma)
})
fc <- 1.0
lfc <- log2(fc)
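
For reference, the last two lines of this preview convert a linear fold-change cutoff to the log2 scale. A small Python sketch of the same arithmetic follows; the function name is hypothetical, and how the script goes on to use `lfc` is not shown in the preview.

```python
import math


def log2_fold_change_cutoff(fc):
    """Convert a linear fold-change cutoff to the log2 scale."""
    if fc <= 0:
        raise ValueError("fold change must be positive")
    return math.log2(fc)


lfc_default = log2_fold_change_cutoff(1.0)  # 0.0, the value implied by fc <- 1.0
lfc_twofold = log2_fold_change_cutoff(2.0)  # 1.0, a two-fold change
```

A cutoff of 1.0 maps to a log2 threshold of 0. If that threshold is applied as a minimum absolute log2 fold change (an assumption here), it imposes no fold-change filter.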

Keybase proof

I hereby claim:

  • I am hermidalc on github.
  • I am hermidalc (https://keybase.io/hermidalc) on keybase.
  • I have a public key ASBQphzIglqGvd7hq1XgFO-0f89wIWfo1xXbb8gKtfyAFQo

To claim this, I am signing this object:

hermidalc / transform_feature_meta.py
Last active September 3, 2024 11:47
Inspect any scikit-learn fitted Pipeline to transform a feature metadata pandas DataFrame through the Pipeline and add model interpretation.
def transform_feature_meta(pipe, feature_meta):
    transformed_feature_meta = None
    for estimator in pipe:
        if isinstance(estimator, ColumnTransformer):
            for _, trf_pipe, trf_columns in estimator.transformers_:
                if isinstance(trf_pipe, str) and trf_pipe == 'drop':
                    trf_feature_meta = feature_meta.iloc[
                        ~feature_meta.index.isin(trf_columns)]
                elif ((isinstance(trf_columns, slice)
                       and (isinstance(trf_columns.start, str)