@danielecook
Created January 15, 2014 17:43
This script produces 'kegg_merged.txt', a file listing genes and their respective KEGG pathways. It downloads several tables from the UCSC Genome Browser (hg19) and merges them.
# Download KEGG Data (Pathways)
#==============================#
# Download select files from UCSC (hg19)
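for var in keggPathway keggMapDesc knownGene kgXref
do
    # Double quotes so that $var is expanded; save into kegg/, where gunzip looks for the file
    wget --timestamping --directory-prefix kegg "ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/$var.txt.gz"
    # -f overwrites an existing .txt when the script is re-run
    gunzip -f kegg/$var.txt.gz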
done
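# Join KEGG pathway descriptions with pathway IDs; keggMapDesc is already sorted; kgXref has 82,960 lines.
# Sort on the tab-delimited join field only, so the order matches what join expects.
sort -t $'\t' -k3,3 kegg/keggPathway.txt | join -1 3 -2 1 -t $'\t' - kegg/keggMapDesc.txt | cut -f 1,2,4 | sort -t $'\t' -k2,2 > kegg/kegg_tmp.txt # 58,073 lines.
# Join gene annotations (kgXref) with the pathway table on the UCSC known gene ID.
sort -t $'\t' -k1,1 kegg/kgXref.txt | join -1 1 -2 2 -t $'\t' - kegg/kegg_tmp.txt > kegg/kegg_merged.txt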
rm kegg/kegg_tmp.txt
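
To see what the two joins do without downloading anything, here is a small sketch on toy data. The IDs, gene symbols and pathway names are made up, and the kgXref stand-in has only two columns (the real table has more). The column layouts assumed for keggPathway (gene ID, locus ID, map ID) and keggMapDesc (map ID, description) are inferred from the join fields used in the script. A `tab` variable is used as a portable stand-in for `$'\t'`.

```shell
#!/bin/sh
# Toy illustration of the two-step join; all IDs and names below are made up.
set -eu
tmp=$(mktemp -d)
tab=$(printf '\t')

# Assumed keggPathway layout: kgID <TAB> locusID <TAB> mapID
printf 'uc001\t100\thsa00020\nuc002\t200\thsa00010\n' > "$tmp/keggPathway.txt"
# Assumed keggMapDesc layout: mapID <TAB> description (already sorted by mapID)
printf 'hsa00010\tPathway A\nhsa00020\tPathway B\n' > "$tmp/keggMapDesc.txt"
# kgXref stand-in: kgID <TAB> gene symbol
printf 'uc002\tGENE2\nuc001\tGENE1\n' > "$tmp/kgXref.txt"

# Step 1: attach descriptions by mapID, keep mapID/kgID/description,
# then re-sort by kgID for the next join.
sort -t "$tab" -k3,3 "$tmp/keggPathway.txt" \
  | join -1 3 -2 1 -t "$tab" - "$tmp/keggMapDesc.txt" \
  | cut -f 1,2,4 \
  | sort -t "$tab" -k2,2 > "$tmp/kegg_tmp.txt"

# Step 2: attach the gene annotation by kgID.
merged=$(sort -t "$tab" -k1,1 "$tmp/kgXref.txt" \
  | join -1 1 -2 2 -t "$tab" - "$tmp/kegg_tmp.txt")

printf '%s\n' "$merged"
rm -r "$tmp"
```

Each output row starts with the gene ID, followed by the kgXref columns, the pathway ID and the pathway description. Genes that appear in only one of the inputs are dropped, since join emits only matching rows by default.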