radaniba’s gists

radaniba / firstIntronLength.awk

Created November 29, 2012 17:18

Get all UCSC refseq and calculate first intron length and normalized first intron length

	BEGIN {
	FS="\t";
	}

	{
	split($10,exonStarts,",");
	split($11,exonEnds,",");
	geneSize=1.0*int($6)-int($5);
	exonCount=int($9);
	if(exonCount<2)

radaniba / rpkmforgenes.py

Created November 29, 2012 17:17

The python script rpkmforgenes.py is written for calculating gene expression for RNA-Seq data. It was used for a study published in PLoS Computational Biology (Ramsköld D, Wang ET, Burge CB, Sandberg R. An abundance of ubiquitously expressed genes reveale

	Sequence data input:
	A common input file format is bed file. In the simplest form, it has the format
	chromosome -tab- startposition -tab- endposition
	where each line corresponds to one sequence read
	However this format can also include strand information:
	chromosome -tab- startposition -tab- endposition -tab- name -tab- score -tab- +/-
	for example:
	chr13 54005048 54005081 . 1 -
	The score field is ignored.
	Files are allowed by have an optional first line starting with "track", this line is then ignored

radaniba / nucpattern.py

Created November 29, 2012 17:15

The code snippet below will populate the store dictionary keyed by the nucleotide patterns and values as lists that contain the occupancy for each index.

	from itertools import count

	def pattern_update(sequence, width=2, store={}):
	"""
	Accumulates nucleotide patterns of a certain width with
	position counts at each index.
	"""

	# open intervals need a padding at end for proper slicing
	size = len(sequence) + 1

radaniba / fastq-to-fasta.py

Created November 29, 2012 17:14

Extract all entries from a FASTQ file with names not present in a FASTA file

	from Bio.SeqIO.QualityIO import FastqGeneralIterator

	corrected_fn = "my_input_fasta.fas"
	uncorrected_fn = "my_input_fastq.ftq"
	output_fn = "differences_fastq.ftq"

	corrected_names = set() # Use a set instead of a list
	for line in open(corrected_fn):
	if line[0] == ">":
	read_name = line.split()[0][1:]

radaniba / GOenrichment.R

Created November 29, 2012 17:13

GO term enrichment analysis with the GeneAnswers package, use the bioconductor annotation packages for the ID mapping. The GeneAnswers documentation has improved since the package came out. And once you get the hang of the annotation packages, they seem f

	library("GeneAnswers")
	library("org.Dm.eg.db")
	library("GO.db")

	# get named vector of entrez ids
	fb.entrez <- unlist(as.list(org.Dm.egFLYBASE2EG))

	# for a data frame x with flybase ids (column 1) and data values (column 2)
	# match the flybase names against the vector of entrez ids
	iv <- match(x[,1], names(fb.entrez))

radaniba / SNP-Flank.R

Created November 29, 2012 17:13

This script takes as input a list of SNP (see first line for format) and get the flanking region and output it as fasta file. Dependencies : Bioconductor : BSgenome and BSgenome.Hsapiens.UCSC.hg18 (replace with the genome you want to use) to have an idea

	#rs10012775 chr4 122896328 0.45819 T C 0.358

	#biocLite("BSgenome.Hsapiens.UCSC.hg18")
	library("BSgenome.Hsapiens.UCSC.hg18")

	con <- file("SNPS_FILE")
	Lines <- readLines(con)
	N <- length(Lines)
	for (i in 1:N){

radaniba / enrichment.R

Last active October 13, 2015 08:48

Here is a script for gene set enrichment over KEGG, the perl script serves as automation for the analysis of several gene set files inside a directory, the R script does the analysis itself using bioconductor GOStats package. Look inside the perl script

	sink("log",append=FALSE,type=c("output","message"),split=FALSE)
	library(GOstats)
	library(EMA)
	library(fdrtool)
	library(org.Pt.eg.db)

	Args <- commandArgs();

	UNIV <- Args[6]
	#GL <- Args[5]

radaniba / gbseq.py

Created November 29, 2012 17:10

The following ready-to-run script reads a genbank file, which is probably a genomic or chromosomal one. It uses the CDS feature to discover the 5' and 3' ends of ORFs. Yes, ORFs are not exactly synonymous with genes, but this is the way we did it. Also, y

	#!/usr/bin/env python
	import sys
	import Bio
	from Bio import SeqIO, SeqFeature
	from Bio.SeqRecord import SeqRecord
	import os

	# Copyright(C) 2009 Iddo Friedberg & Ian MC Fleming
	# Released under Biopython license. http://www.biopython.org/DIST/LICENSE
	# Do not remove this comment

radaniba / query-ensembl.pl

Created November 29, 2012 17:09

If you want to automate your queries to Ensembl using BioMart perl API you will face the problem of server overloading, thus an alternative is to use webservice, this script solved my problem, hope it helps, Cheers.

	# an example script demonstrating the use of BioMart webservice
	use strict;
	use LWP::UserAgent;

	open (FH,$ARGV[0]) \|\| die ("\nUsage: perl getFromEnsembl.pl Query.xml\n\n");

	my $xml;
	while (<FH>){
	$xml .= $_;
	}

radaniba / handleargs.pl

Created November 29, 2012 17:09

I've seen a lot of question of perl programmers asking how to handle arguments for their script, here is an example snippet you can customize for your needs, cheers

	#--------------------------------------------------------------------------
	# handle the command line arguments
	#
	if ( $#ARGV == -1)
	{
	&usage;
	exit 0;
	}

	while (@ARGV)

Radhouane Aniba radaniba