Cyriac Kandoth ckandoth

Purpose

Prototype Illumina's Nirvana on Dragen Pay-As-You-Go (PAYG) VMs on Azure.

Prerequisites

Sign up for an Azure subscription at this link if you don't already have one.
Follow these instructions to register resource providers Microsoft.Network, Microsoft.Storage, and Microsoft.Compute. Strictly speaking, the steps in this guide don't need Microsoft.Storage, but in a production environment you should use blob storage for inputs and outputs. You should also upload the Nirvana annotation source files into your own blob storage account so that you can quickly deploy them onto multiple ephemeral VMs, each running Nirvana on one sample in parallel.
Visit this page, log in if needed, and increase your Quota for `To

Purpose

Reproduce the `double free or corruption (fasttop)` error from Dragen 4.3.6 on its CentOS 7.9 image on Azure NP10 VMs.

Update (Mar 2025)

This error was linked to Dragen's code for streaming FASTQs from Azure blob storage. As a workaround, we now use azcopy to copy FASTQs onto VM local storage before alignment, which significantly reduces runtime too. The errors also disappear in AlmaLinux 8, which all newer Dragen VM images are based on.
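
The workaround above can be sketched as a small bash helper. This is a minimal sketch, not the exact commands used in production: the function name `stage_fastqs` and its arguments are hypothetical, and it assumes azcopy is installed and already authenticated (for example via `azcopy login` or a SAS token appended to each URL).

```shell
# Hypothetical helper: copy FASTQs from blob storage onto VM local storage
# so that alignment reads local files instead of streaming from blob storage.
# Usage: stage_fastqs <local_dir> <blob_url> [<blob_url> ...]
stage_fastqs() {
  local dest_dir="$1"
  shift
  mkdir -p "$dest_dir"
  local url
  for url in "$@"; do
    # "azcopy copy <source> <destination>" downloads one blob per call here
    azcopy copy "$url" "$dest_dir/"
  done
}
```

For example, `stage_fastqs /mnt/fastq "$R1_URL" "$R2_URL"` would place both files under `/mnt/fastq`, where `$R1_URL` and `$R2_URL` are placeholders for your own blob URLs. Dragen can then be pointed at the local copies.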

Prerequisites

Purpose

Build an Azure stack to operate NP-series VMs with Dragen's pay-as-you-go (PAYG) license.

Prerequisites

Sign up for an Azure subscription at this link if you don't already have one.
Follow these instructions to register resource providers Microsoft.Network, Microsoft.Storage, and Microsoft.Compute.
Visit this page, log in if needed, and ensure that Status is set to Enable for the Azure subscription you intend to use. T
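
The resource provider registration in the prerequisites above can also be done from the Azure CLI instead of the portal. The following is a minimal sketch, assuming the Azure CLI is installed, `az login` has been run, and the intended subscription is selected; the function name `register_providers` is hypothetical.

```shell
# Hypothetical helper: register each resource provider namespace passed as an argument.
# Usage: register_providers Microsoft.Network Microsoft.Storage Microsoft.Compute
register_providers() {
  local ns
  for ns in "$@"; do
    # Registration is asynchronous; the command returns before it completes
    az provider register --namespace "$ns"
  done
}
```

Calling `register_providers Microsoft.Network Microsoft.Storage Microsoft.Compute` covers the three providers named above. Progress can be checked afterwards with `az provider show --namespace Microsoft.Compute --query registrationState`.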

Purpose

A proof-of-concept high-performance server for primary and secondary NGS analyses with reasonable cost and TAT.

Hardware and OS

Acquired a Dell Precision 5820 tower workstation in mid-2018 with the following specs. At a minimum, you want fast single-thread performance, at least 64GB of RAM (preferably ECC), and very fast disks. To run Nvidia's Parabricks, you also need a GPU with at least 16GB of VRAM for v4.4, or at least 12GB for v3.8.

Intel Xeon W-2145 (supports ECC memory and AVX-512; decent single-thread performance)
208GB DDR4-2666 ECC Memory (ECC reduces odds of data corruption)

Ensembl's VEP (Variant Effect Predictor) is popular for how it picks a single effect per gene as detailed here, its CLIA-compliant HGVS variant format, and Sequence Ontology nomenclature for variant effects.

The official instructions to install VEP have never worked well from the United States because of the flaky network connection to their FTP servers in the UK. So, we will instead use conda to install VEP and its dependencies and then manually download VEP caches and reference genomes using rsync.
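
The approach described above can be sketched with two bash helpers. This is only an illustration under stated assumptions: conda is already on your `$PATH`, the function names `install_vep` and `fetch_vep_cache` are hypothetical, the release number is whatever you pass in, and the cache path follows the layout of Ensembl's public rsync server for GRCh38.

```shell
# Hypothetical helper: install a given VEP release into its own conda environment.
# Usage: install_vep <release>
install_vep() {
  local release="$1"
  conda create -y -n vep -c conda-forge -c bioconda "ensembl-vep=${release}"
}

# Hypothetical helper: download the matching indexed GRCh38 cache with rsync,
# which can resume a partial transfer if the connection drops.
# Usage: fetch_vep_cache <release> [<dest_dir>]
fetch_vep_cache() {
  local release="$1"
  local dest="${2:-$HOME/.vep}"
  mkdir -p "$dest"
  rsync -av --partial \
    "rsync://ftp.ensembl.org/ensembl/pub/release-${release}/variation/indexed_vep_cache/homo_sapiens_vep_${release}_GRCh38.tar.gz" \
    "$dest/"
}
```

Use the same release number for both calls, since a VEP cache should match the installed VEP version. The downloaded tarball still needs to be extracted (for example with `tar -zxf`) before VEP can use it.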

If you don't already have conda, download and install it into $HOME/miniconda3:

curl -sL https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o miniconda.sh
bash miniconda.sh -bup $HOME/minic

WSL2 is a Windows feature that allows near-native execution of Linux ELF binaries such as bash and GNOME apps. Recent versions of Windows support GPU acceleration and mounting ext4 disks, which turns a powerful gaming PC into a powerful Linux workstation without dual-booting. The guide below helps you set up WSL2, install TensorFlow with GPU acceleration, and then test SpliceAI on a VCF.

Make sure you have the latest version of Windows 10 or 11 and administrative rights.
Hit the Windows key, type cmd, right-click on Command Prompt, Run as Administrator, and click Yes if prompted by User Account Control.
Follow these instructions to install WSL2. It should be as simple as wsl --install -d Ubuntu and reboot. But

Ensembl's VEP (Variant Effect Predictor) is popular for how it picks a single effect per gene as detailed here, its CLIA-compliant HGVS variant format, and Sequence Ontology nomenclature for variant effects.

Instead of the official instructions, we will use mamba (conda, but faster) to install VEP and its dependencies. If you don't already have mamba, use these steps to download and install it into $HOME/mambaforge, then run a script that adds it to your $PATH:

curl -L https://github.com/conda-forge/miniforge/releases/download/4.12.0-0/Mambaforge-Linux-x86_64.sh -o /tmp/mambaforge.sh
sh /tmp/mambaforge.sh -bfp $HOME/mambaforge && rm -f /tmp/mambaforge.sh
. $HOME/mambaforge

	# GOAL: Prototype a bioinformatics pipeline to calculate Polygenic Risk Scores (PRS) using WGS gVCFs.

	# Let's use pgsc_calc (https://github.com/pgscatalog/pgsc_calc), a nextflow pipeline that calculates PRS given PRS weights
	# and a multi-sample VCF. Variant allele weights can be specified as PGS Catalog IDs (--pgs_id), as custom scoring files
	# (--scorefile), or both. pgsc_calc can also perform liftover if scoring files use a different reference genome build
	# than the input VCFs (--liftover --target_build).

	# Steps below were tested on Ubuntu 24.04 but should work on any Linux server with bash.

	# ----- #

	# Fetch the WGS gnomAD 3.1.2 per-chrom VCFs (the large size is mostly due to INFO fields):
	mkdir gnomad
	gsutil -m cp gs://gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr*.vcf.bgz gnomad
	gsutil -m cp gs://gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr*.vcf.bgz.tbi gnomad

	# Shortlist INFO fields we want to keep when merging these into a single VCF of reduced file size:
	bcftools view -h gnomad/gnomad.genomes.v3.1.2.sites.chr21.vcf.bgz | grep ^##INFO | cut -f3- -d= | grep -Ev "controls|non_cancer|non_neuro|non_topmed|non_v2|vep" | sort | less -S

	cadd_phred
	cadd_raw_score

	#!/usr/bin/env python

	# Prereqs: Run "az login" to get a refresh token at "~/.azure/msal_token_cache.json" which expires only if unused for 90 days
	# Depends: pip install azure-identity azure-storage-blob
	# Sources: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/storage/azure-storage-blob/samples/blob_samples_containers.py

	STORAGE_ACCOUNT_URL = "https://blahdiblahdiblah.blob.core.windows.net"
	CONTAINER_NAME = "mdlhot"

	# Use the MSAL refresh token to get a temporary access token for use with blob storage libraries