Skip to content

Instantly share code, notes, and snippets.

View coleoguy's full-sized avatar

Heath Blackmon coleoguy

View GitHub Profile
foo <- gsub("T", "U", as.character(read.table("rosalind_rna.txt")[[1]]))
paste(foo)
foo <- unlist(strsplit(as.character(read.table("rosalind_revc.txt")[[1]]),""))
DNA <- c("A", "C", "G", "T")
rcDNA <- c("T", "G", "C", "A")
foo2 <- vector(length = length(foo))
for(i in 1:4){
foo2[grep(DNA[i], unlist(foo))] <- rcDNA[i]
}
paste(foo2[length(foo2):1], collapse = "", sep = "")
data <- as.numeric(read.table("rosalind_iprb.txt"))
k<-data[1]
m<-data[2]
n<-data[3]
t<-sum(data)
# Sire is AA Dam is anything
t.prob <- k/(t)
# Sire is Aa Dam is AA, Aa, aa
t.prob <- t.prob + (((m/(t)) * (k/(t-1))) + #Aa x AA
((m/(t)) * ((m-1)/(t-1)) * .75) + #Aa x Aa
@coleoguy
coleoguy / gist:6335185
Created August 25, 2013 17:35
calculate hamming distance between to aligned sequences
dna <- read.table("rosalind_hamm.txt")
dna1 <- unlist(strsplit(as.character(dna[1,1]),""))
dna2 <- unlist(strsplit(as.character(dna[2,1]),""))
ham.dist <- 0
for(i in 1:length(dna1)){
if(dna1[i] != dna2[i]){
ham.dist <- ham.dist +1
}
}
cat(ham.dist)
lifespan <- 18
months <- 85
pop <- vector(mode="numeric", length=months) # this will hold the population each month as we build it up
pop[1:2] <- 1 # rabbits take a month to mature so we go ahead and fill in month 1 and 2
for(i in 3:months){ # this loop will just build our population month by month
if(i-(lifespan+1) < 0){ # if we are early on none have died yet this if statement takes care of that
pop[i] <- pop[i-1]+pop[i-2]-0
}
if(i-(lifespan+1) == 0){ # this one takes care of the first rabbit dying
pop[i] <- pop[i-1]+pop[i-2]-1
@coleoguy
coleoguy / dstat.R
Last active June 22, 2018 20:53
This file contains functions implementing ABBA BABA type tests for gene flow between populations. The three functions included are CalcD CalcPartD and CalcPopD. CalcD: This is the original version of the ABBA BABA test and calculates the D-statistic as found in Durand et al 2011 page 2240 CalcPopD: This version of the test is able to use many se…
## ABBA BABA tests
## Heath Blackmon
## [email protected]
## 28 July 2013
## This file has algorithms found in:
## Durand, Eric Y., et al. "Testing for ancient admixture between
## closely related populations." Molecular biology and evolution
## 28.8 (2011): 2239-2252.
@coleoguy
coleoguy / retro-gff3parser.py
Last active January 3, 2016 20:29
Parse the castaneum gff file to extract all genes with multiple exons. I'd like to make this robust at some point to repeat this process with any genome gff3 file however in the interest of finishing this project and getting it written up this semester the current version is only tested on the castaneum genome (version 3) gff3 file.
# the gff3 file that was used as input for this script was downloaded from:
# ftp://ftp.ensemblgenomes.org/pub/metazoa/release-21/gff3/tribolium_castaneum
# on January 10, 2014
import re
## This script is just to parse out the part of the gff file that we
## are interested in, (i.e. lines describing exons or protein coding genes
## Here we specify the name of the gff3
@coleoguy
coleoguy / retro-EEfinder.py
Last active January 3, 2016 22:09
This python script takes a list of exons from multiple exon genes as well as fast files for each chromosome in a genome and it constructs a fasta file where each sequence is 60bp in length (last 30bp of one exon and the first 30bp of the next). These can then be used to search the genome for retroduplication events of genes. By limiting our selv…
from Bio import SeqIO # to deal with the fast files
# lets pull in our exon table
datafile = open('exons.csv', 'r')
data = []
for row in datafile:
data.append(row.strip().split(','))
# now lets start the process of creating all of our exon-exon sequences
# by first creating a list that has the name we want to assign and the LG and
@coleoguy
coleoguy / retro-lgexonFinder.R
Last active January 4, 2016 09:28
creates a fasta file with 300bp chunks from the start of all exons larger than 300 bp
setwd("~/Desktop/retrogenes/data/")
gff <- read.csv("exons.csv", header = F, as.is=T) #This is a parsed GFF3 file that contains the lines for exons
# lets get big ones >300 bp
gff <- gff[gff[, 3] - gff[, 2] > 300, ]
setwd("~/Desktop/retrogenes/data/chromosomes") #This is the directory that contains the chromosomes of the genome
library(ape)
seq <- name <- list()
@coleoguy
coleoguy / retro-blast.commands
Created January 24, 2014 22:45
these are the command line arguments to create a local blast database from the new genome assembly (in our case T. confusum), and then blast the large exon file to create what we will call a look up table.
# to construct database from an assembly in the file
# conf.scaf.fa run this in the terminal
# it will create a blast database call confscaf
makeblastdb -in conf.scaf.fa -dbtype nucl -out confscaf
# next we want to blast our large exons against this db
blastn -query=/outputs/300+bpExons.fa -db=/blastDB/confscaf -outfmt='6 qseqid qstart sseqid sstart qlen length pident' -max_target_seqs=2 -out=info
# this produces the lookup table that we can then use to analyze the degree of synteny conservation