Thomas Sandmann tomsing1
@tomsing1
tomsing1 / reactome_dplyr.R
Created April 6, 2022 23:10
Extracting gene sets from the Bioconductor reactome.db R package's SQLite backend with dplyr
#' Retrieve Reactome sets of Entrez identifiers for a selected species
#'
#' @param species Scalar character, the species of interest, e.g. `Homo sapiens`
#' @importFrom dplyr tbl right_join select collect mutate check_dbplyr
#' @importFrom glue glue_sql glue
#' @importFrom checkmate assert_choice
#' @export
#' @return A named list of Entrez identifiers
#' @examples
#' ReactomeSets("Mycobacterium tuberculosis")
tomsing1 / kegg_and_go_gene_sets.R
Created March 24, 2022 23:34
Listifying KEGG and GO gene sets
library(limma)
library(AnnotationDbi)
library(GO.db)
library(org.Hs.eg.db)
# KEGG
kegg.names <- getKEGGPathwayNames("hsa", remove.qualifier = TRUE)
kegg.sets <- getGeneKEGGLinks("hsa", convert = TRUE)
kegg.gsc <- with(kegg.sets, split(GeneID, PathwayID))
names(kegg.gsc) <- kegg.names[
tomsing1 / ena_rest.R
Last active March 10, 2022 22:56
Accessing ENA's REST APIs from R
library(checkmate)
library(dplyr)
library(glue)
library(htmltidy)
library(httr)
library(purrr)
library(xml2)
# https://ena-docs.readthedocs.io/en/latest/submit/general-guide/accessions.html
identify_accession_type <- function(accessions) {
tomsing1 / ena_apis.md
Created March 10, 2022 02:03
Notes on ENA's REST APIs for computational retrieval of NGS metadata

ENA APIs

The ENA has multiple APIs. The most important ones are:

  1. ENA Portal API: search ENA's databases using (potentially complex) queries.
  2. ENA Browser API: retrieve entire records programmatically.
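
As a rough sketch of how the two APIs differ, the helpers below only build request URLs and never contact the network. The function names `ena_portal_search_url` and `ena_browser_xml_url` are hypothetical, and the base URLs, the `read_run` result type, and the `study_accession` query field are assumptions taken from ENA's public documentation that should be checked against the current docs before use.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed base URLs of the two ENA APIs (verify against the ENA docs).
ENA_PORTAL="https://www.ebi.ac.uk/ena/portal/api"
ENA_BROWSER="https://www.ebi.ac.uk/ena/browser/api"

# Hypothetical helper: Portal API search URL listing the runs of a study.
# $1 = study accession, $2 = comma-separated fields (optional).
ena_portal_search_url() {
  local accession="$1"
  local fields="${2:-run_accession,fastq_ftp}"
  # The query study_accession="<accession>" is percent-encoded by hand:
  # %3D is "=" and %22 is a double quote.
  printf '%s/search?result=read_run&query=study_accession%%3D%%22%s%%22&fields=%s&format=tsv\n' \
    "${ENA_PORTAL}" "${accession}" "${fields}"
}

# Hypothetical helper: Browser API URL returning one full record as XML.
# $1 = accession of the record to retrieve.
ena_browser_xml_url() {
  local accession="$1"
  printf '%s/xml/%s\n' "${ENA_BROWSER}" "${accession}"
}

# Example requests (commented out so the sketch makes no network calls):
# curl -fsSL "$(ena_portal_search_url PRJEB1234)"
# curl -fsSL "$(ena_browser_xml_url ERR000001)"
```

The Portal API URL carries a query and a field list, so it suits searching, while the Browser API URL names a single accession and returns the whole record.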

In addition, quick summaries of metadata and file retrieval locations can be retrieved

tomsing1 / read_write_sqlite_with_dm.R
Created February 7, 2022 02:28
Reading and writing a SQLite database in R with the dm R package
library(DiagrammeR) # must be v 1.0.6.1 https://github.com/cynkra/dm/issues/823
library(dm)
library(RSQLite)
# download and decompress the chinook example SQLite database
zip_file <- tempfile(fileext = ".zip")
download.file(
  "https://www.sqlitetutorial.net/wp-content/uploads/2018/03/chinook.zip",
  destfile = zip_file)
chinook <- unzip(zip_file, exdir = tempdir())
tomsing1 / mapping_read_with_subread_or_sublong.R
Created January 11, 2022 01:11
Mapping long reads to a genome in R with the Rsubread Bioconductor package
library(BSgenome.Hsapiens.UCSC.hg38)
library(Biostrings)
library(Rsubread)
library(GenomicRanges)
library(parallel)
library(GenomicAlignments)
kCores <- parallel::detectCores() - 1L
kQuery <- GRanges(seqnames = "chr12", IRanges(40263807, 40264221))
tomsing1 / read_write_h5ad.R
Created November 5, 2021 19:12
Writing and reading h5ad files from R using the zellkonverter Bioconductor R package
libraries <- c("zellkonverter", "SingleCellExperiment")
for (lib in libraries) {
  suppressPackageStartupMessages(library(lib, character.only = TRUE,
                                         quietly = TRUE))
}
# dummy SingleCellExperiment
ncells <- 100
u <- matrix(rpois(20000, 5), ncol = ncells)
v <- log2(u + 1)
tomsing1 / python_and_r_via_basilisk.R
Created November 5, 2021 19:04
Using the basilisk Bioconductor package to execute Python code in a controlled environment
libraries <- c("basilisk", "mypackage", "zellkonverter",
               "SingleCellExperiment")
for (lib in libraries) {
  suppressPackageStartupMessages(library(lib, character.only = TRUE,
                                         quietly = TRUE))
}
# create dummy SingleCellExperiment
ncells <- 100
u <- matrix(rpois(20000, 5), ncol = ncells)
tomsing1 / sra_download_via_s3.sh
Last active October 1, 2021 19:55
Quickly download SRA archives using the AWS CLI and then extract the file using fasterq-dump
#!/usr/bin/env bash
set -e
set -x
set -o pipefail
# This bash script retrieves SRA archives for the runs listed in the
# SRR_Acc_List.txt file (available from the SRA Run Selector).
#
# It requires:
tomsing1 / stitch_and_align.sh
Created September 23, 2021 04:04
Bash script to stitch & align paired end reads, using dockerized fastp, minimap2 and multiqc tools
#!/usr/bin/env bash
set -e
set -x
set -o pipefail
declare -r CORES=$(getconf _NPROCESSORS_ONLN)
declare -r REFERENCE="s3://your-reference-bucket/gencode/release_30/GRCh38_p12/"
declare -r S3_FASTQ="s3://your-bucket/Fastq/"
declare -r MIN_OVERLAP=10 # the larger the better