Created
April 20, 2012 17:14
-
-
Save cbare/2430361 to your computer and use it in GitHub Desktop.
Extract gene expression data from a cmonkey output file
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# Extract expression data | |
# ======================= | |
# | |
# Gene expression data comes from the cmonkey output RData file: | |
# ~/Documents/work/projects/network_portal/cm_session.RData | |
# | |
# ...in a variable: env$ratios$ratios, a 3491 by 739 matrix | |
conditions <- c('X130.1', 'X127.1', 'X435.8', 'X435.7', 'X435.6', 'X435.1', 'X435.2', 'X435.3', 'X435.4', 'X435.10', 'X435.9', 'X435.5', 'X435.11', 'X449.1', 'X450.1', 'X450.2', 'X450.3', 'X450.4', 'X450.5', 'X1271.1', 'X1271.2', 'X1271.3', 'X1271.4', 'X1271.5', 'X1271.7', 'X1271.9', 'X1271.10', 'X1709.2', 'X1709.3', 'X1709.4', 'X1709.5', 'X1709.6', 'X1709.7', 'X1709.8', 'X1709.9', 'X1709.10', 'X1709.12', 'X1709.13', 'X1709.14', 'X1709.15', 'X1713.1', 'X1713.2', 'X1713.3', 'X1713.4', 'X1713.5', 'X1736.5', 'X1736.10', 'X1736.17', 'X1736.19', 'X1736.21') | |
# dvu gene names are the second entry in the synonyms table | |
trans <- sapply(env$genome.info$synonyms, `[`, 2) | |
# the feature table is keyed by YP_ identifiers | |
# we want to use DVU gene names | |
ft = env$genome.info$feature.tab | |
dvu.features.1 <- ft[trans[rownames(env$ratios$ratios)],c('contig','strand','start_pos','end_pos')] | |
trans[rownames(env$ratios$ratios)] | |
# are all gene names in translation table? | |
all(rownames(env$ratios$ratios) %in% names(trans)) | |
# are any mapped to NAs? | |
trans[ is.na(trans) ] | |
# 'DVU0490','DVU0557','DVU0699','DVU1831','DVU2001','DVU2049','DVU2950','DVU3126','DVU3280','DVU3304' | |
# missing.genes = read.delim(file='missing-genes.txt', header=F, col.names=c('name','sequence','strand','start','end'), stringsAsFactors=F) | |
missing.genes.txt <- "DVU0490 chromosome - 558218 559487 | |
DVU0557 chromosome + 624872 625726 | |
DVU0699 chromosome + 773972 775650 | |
DVU1831 chromosome + 1896865 1897691 | |
DVU2001 chromosome + 2082778 2084039 | |
DVU2049 chromosome + 2126385 2126851 | |
DVU2950 chromosome - 3054878 3056595 | |
DVU3126 chromosome + 3273316 3274547 | |
DVU3280 chromosome + 3455425 3456410 | |
DVU3304 chromosome + 3481262 3482740 | |
" | |
# read the string above into a data.frame | |
missing.genes <- read.delim(file=textConnection(missing.genes.txt, open="r"), header=F, col.names=c('name','sequence','strand','start','end')) | |
dvu.features <- data.frame( | |
name=rownames(env$ratios$ratios), | |
sequence=sapply(dvu.features.1$contig, function(x) | |
if (is.na(x)) NA | |
else if (x=='NC_002937.3') 'chr' | |
else if (x=='NC_005863.1') 'megaplasmid' | |
else x), | |
strand=sapply(dvu.features.1$strand, function(x) | |
if (is.na(x)) NA | |
else if (x=='R') {'-'} | |
else if (x=='D') {'+'} | |
else {x}), | |
start=dvu.features.1$start_pos, | |
end=dvu.features.1$end_pos, | |
stringsAsFactors=F) | |
rownames(dvu.features) <- rownames(env$ratios$ratios) | |
# fix entries that either can't be translated or are missing from the features table | |
dvu.features[missing.genes$name,] <- missing.genes | |
dvu.expression <- cbind( dvu.features, env$ratios$ratios[, conditions] ) |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Create a heatmap track of gene expression in the Gaggle Genome Browser, given expression data in a cmonkey output file for the organism Desulfovibrio vulgaris Hildenborough. This is not really a complete script. Some manual intervention is required.
The first complication is that the rows in the ratios matrix are keyed by DVU gene identifiers, while the features table that has location on the genome is keyed by some other identifiers which look like this: "YP_009868.1". We create a lookup table called trans to fix this. This covers most genes. We add the missing genes (in the ratios matrix, but not in the features table) manually.
Next, create dvu.features, translating sequence names and strand along the way.
Finally, extract the desired conditions and cbind the chromosomal locations to the expression data. The resulting data.frame can be written out with write.table and read into SQLite with .import
Ugly? Yes! But, it worked.