Skip to content

Instantly share code, notes, and snippets.

@radaniba
radaniba / firstIntronLength.awk
Created November 29, 2012 17:18
Get all UCSC refseq and calculate first intron length and normalized first intron length
BEGIN {
FS="\t";
}
{
split($10,exonStarts,",");
split($11,exonEnds,",");
geneSize=1.0*int($6)-int($5);
exonCount=int($9);
if(exonCount<2)
@radaniba
radaniba / rpkmforgenes.py
Created November 29, 2012 17:17
The python script rpkmforgenes.py is written for calculating gene expression for RNA-Seq data. It was used for a study published in PLoS Computational Biology (Ramsköld D, Wang ET, Burge CB, Sandberg R. An abundance of ubiquitously expressed genes reveale
Sequence data input:
A common input file format is bed file. In the simplest form, it has the format
chromosome -tab- startposition -tab- endposition
where each line corresponds to one sequence read
However this format can also include strand information:
chromosome -tab- startposition -tab- endposition -tab- name -tab- score -tab- +/-
for example:
chr13 54005048 54005081 . 1 -
The score field is ignored.
Files are allowed by have an optional first line starting with "track", this line is then ignored
@radaniba
radaniba / nucpattern.py
Created November 29, 2012 17:15
The code snippet below will populate the store dictionary keyed by the nucleotide patterns and values as lists that contain the occupancy for each index.
from itertools import count
def pattern_update(sequence, width=2, store={}):
"""
Accumulates nucleotide patterns of a certain width with
position counts at each index.
"""
# open intervals need a padding at end for proper slicing
size = len(sequence) + 1
@radaniba
radaniba / fastq-to-fasta.py
Created November 29, 2012 17:14
Extract all entries from a FASTQ file with names not present in a FASTA file
from Bio.SeqIO.QualityIO import FastqGeneralIterator
corrected_fn = "my_input_fasta.fas"
uncorrected_fn = "my_input_fastq.ftq"
output_fn = "differences_fastq.ftq"
corrected_names = set() # Use a set instead of a list
for line in open(corrected_fn):
if line[0] == ">":
read_name = line.split()[0][1:]
@radaniba
radaniba / GOenrichment.R
Created November 29, 2012 17:13
GO term enrichment analysis with the GeneAnswers package, use the bioconductor annotation packages for the ID mapping. The GeneAnswers documentation has improved since the package came out. And once you get the hang of the annotation packages, they seem f
library("GeneAnswers")
library("org.Dm.eg.db")
library("GO.db")
# get named vector of entrez ids
fb.entrez <- unlist(as.list(org.Dm.egFLYBASE2EG))
# for a data frame x with flybase ids (column 1) and data values (column 2)
# match the flybase names against the vector of entrez ids
iv <- match(x[,1], names(fb.entrez))
@radaniba
radaniba / SNP-Flank.R
Created November 29, 2012 17:13
This script takes as input a list of SNP (see first line for format) and get the flanking region and output it as fasta file. Dependencies : Bioconductor : BSgenome and BSgenome.Hsapiens.UCSC.hg18 (replace with the genome you want to use) to have an idea
#rs10012775 chr4 122896328 0.45819 T C 0.358
#biocLite("BSgenome.Hsapiens.UCSC.hg18")
library("BSgenome.Hsapiens.UCSC.hg18")
con <- file("SNPS_FILE")
Lines <- readLines(con)
N <- length(Lines)
for (i in 1:N){
@radaniba
radaniba / enrichment.R
Last active October 13, 2015 08:48
Here is a script for gene set enrichment over KEGG, the perl script serves as automation for the analysis of several gene set files inside a directory, the R script does the analysis itself using bioconductor GOStats package. Look inside the perl script
sink("log",append=FALSE,type=c("output","message"),split=FALSE)
library(GOstats)
library(EMA)
library(fdrtool)
library(org.Pt.eg.db)
Args <- commandArgs();
UNIV <- Args[6]
#GL <- Args[5]
@radaniba
radaniba / gbseq.py
Created November 29, 2012 17:10
The following ready-to-run script reads a genbank file, which is probably a genomic or chromosomal one. It uses the CDS feature to discover the 5' and 3' ends of ORFs. Yes, ORFs are not exactly synonymous with genes, but this is the way we did it. Also, y
#!/usr/bin/env python
import sys
import Bio
from Bio import SeqIO, SeqFeature
from Bio.SeqRecord import SeqRecord
import os
# Copyright(C) 2009 Iddo Friedberg & Ian MC Fleming
# Released under Biopython license. http://www.biopython.org/DIST/LICENSE
# Do not remove this comment
@radaniba
radaniba / query-ensembl.pl
Created November 29, 2012 17:09
If you want to automate your queries to Ensembl using BioMart perl API you will face the problem of server overloading, thus an alternative is to use webservice, this script solved my problem, hope it helps, Cheers.
# an example script demonstrating the use of BioMart webservice
use strict;
use LWP::UserAgent;
open (FH,$ARGV[0]) || die ("\nUsage: perl getFromEnsembl.pl Query.xml\n\n");
my $xml;
while (<FH>){
$xml .= $_;
}
@radaniba
radaniba / handleargs.pl
Created November 29, 2012 17:09
I've seen a lot of question of perl programmers asking how to handle arguments for their script, here is an example snippet you can customize for your needs, cheers
#--------------------------------------------------------------------------
# handle the command line arguments
#
if ( $#ARGV == -1)
{
&usage;
exit 0;
}
while (@ARGV)