Hirak Sarkar hiraksarkar

Useful Pandas Snippets

A personal diary of DataFrame munging over the years.

Convert Series datatype to numeric (will error if column has non-numeric values)
(h/t @makmanalp)
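The snippet body for this entry is not shown here, so the following is only a minimal sketch of one way to do the conversion, assuming `pd.to_numeric` is the intended call. The DataFrame and the column names `price` and `mixed` are made up for illustration.

```python
import pandas as pd

# Hypothetical example data: one clean column, one with a non-numeric value.
df = pd.DataFrame({"price": ["1", "2.5", "3"], "mixed": ["1", "x", "3"]})

# Convert a column of numeric strings to a numeric dtype.
df["price"] = pd.to_numeric(df["price"])

# With the default errors="raise", a non-numeric value raises ValueError.
try:
    pd.to_numeric(df["mixed"])
    raised = False
except ValueError:
    raised = True

# errors="coerce" turns unparseable values into NaN instead of raising.
coerced = pd.to_numeric(df["mixed"], errors="coerce")
```

Use `errors="coerce"` when a silent NaN is preferable to an exception; the default is the safer choice when bad values should stop the pipeline.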

SAM and BAM filtering one-liners

Please extend with additional/faster/better solutions via a pull request!

BWA mapping (using piping for minimal disk I/O)

	SPC s c remove highlight

	**** Files manipulations key bindings
	Files manipulation commands (start with ~f~):

	| Key Binding | Description                                                    |
	|-------------+----------------------------------------------------------------|
	| ~SPC f c~   | copy current file to a different location                      |
	| ~SPC f C d~ | convert file from unix to dos encoding                         |
	| ~SPC f C u~ | convert file from dos to unix encoding                         |

	# Variation of information (VI)
	#
	# Meila, M. (2007). Comparing clusterings-an information
	# based distance. Journal of Multivariate Analysis, 98,
	# 873-895. doi:10.1016/j.jmva.2006.11.013
	#
	# https://en.wikipedia.org/wiki/Variation_of_information

	from math import log

	# Prior to the tutorial make sure that the script below runs without error on your R installation.
	# What you need is a working installation of Stan: http://mc-stan.org/ .
	# For installation instructions, see here:
	# https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started

	# After installation you should be able to run this script which should output
	# some summary statistics and some pretty plots, :)

	# Generating some fake data
	set.seed(123)

	import sys

	##########################################################
	#requires snakemake, python3, pyfasta to be installed
	#save this file and provide all the binaries and their path
	#in variables below.
	#to run flux pipeline:
	#snakemake run_flux_pipeline
	#to run rsem pipeline:
	#snakemake run_rsem_pipeline

	## RNA-seq analysis with DESeq2
	## Stephen Turner, @genetics_blog

	# RNA-seq data from GSE52202
	# http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse52202. All patients with
	# ALS, 4 with C9 expansion ("exp"), 4 controls without expansion ("ctl")

	# Import & pre-process ----------------------------------------------------

	# Import data from featureCounts

	#!/bin/bash
	# Usage: interleave_fastq.sh f.fastq r.fastq > interleaved.fastq
	#
	# Interleaves the reads of two FASTQ files specified on the
	# command line and outputs a single FASTQ file to STDOUT.
	#
	# Can interleave 100 million paired reads (200 million total
	# reads; two 22 Gbyte files), in memory (/dev/shm), in 6m54s (414s)
	#
	# Latest code: https://gist.github.com/4544979

	# the following two lines give a two-line status, with the current window highlighted
	hardstatus alwayslastline
	hardstatus string '%{= kG}[%{G}%H%? %1`%?%{g}][%= %{= kw}%-w%{+b yk} %n*%t%?(%u)%? %{-}%+w %=%{g}][%{B}%m/%d %{W}%C%A%{g}]'

	# huge scrollback buffer
	defscrollback 5000

	# no welcome message
	startup_message off