@sgsfak
Created December 8, 2015 14:17
Downloads the human (Homo sapiens) UniGene data from NCBI and parses it with awk to extract, for each GenBank sequence, the UniGene cluster ID, the Entrez Gene identifier, and the gene symbol. The result is written to Hs.data.csv.
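#!/bin/sh
# Download and decompress the human UniGene flat file from NCBI.
wget ftp://ftp.ncbi.nih.gov/repository/UniGene/Homo_sapiens/Hs.data.gz
gunzip Hs.data.gz
# Emit one CSV row per SEQUENCE line: cluster ID, Entrez Gene ID, gene symbol, GenBank accession.
<Hs.data awk -v OFS=',' 'BEGIN { print "unigene_id", "gene_id", "symbol", "gb_acc" }
# New cluster: store its ID and clear the gene fields so values from the previous cluster do not carry over.
/^ID /       { ug = $2; eg = ""; s = "" }
/^GENE_ID /  { eg = $2 }
/^GENE /     { s = $2 }
# The second field of a SEQUENCE line is "ACC=<accession>;": strip the prefix and the trailing semicolon.
/^SEQUENCE / {
    sub("ACC=", "", $2)
    sub(";", "", $2)
    print ug, eg, s, $2
}' > Hs.data.csv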
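
To see what the awk rules produce without downloading the full file, here is a minimal sketch that runs the same kind of rules on a tiny sample. The two records are invented for illustration (the IDs, symbol, and accessions are not real UniGene data), and the file names sample.data and sample.csv are hypothetical. The layout is only what the script's patterns imply: a keyword at the start of the line, whitespace, then the value, with `//` closing each record. Clearing the gene fields at each ID line is an assumption about the desired behavior for clusters that have no GENE or GENE_ID line.

```shell
#!/bin/sh
# Invented sample: the second cluster has no GENE or GENE_ID line.
cat > sample.data <<'EOF'
ID          Hs.999999
TITLE       Invented gene one
GENE        FAKE1
GENE_ID     12345
SEQUENCE    ACC=AB000001.1; NID=g0000001; SEQTYPE=mRNA
SEQUENCE    ACC=AB000002.1; NID=g0000002; SEQTYPE=EST
//
ID          Hs.999998
TITLE       Invented cluster without gene annotation
SEQUENCE    ACC=AB000003.1; NID=g0000003; SEQTYPE=EST
//
EOF

# One CSV row per SEQUENCE line, tagged with the fields of the enclosing cluster.
<sample.data awk -v OFS=',' 'BEGIN { print "unigene_id", "gene_id", "symbol", "gb_acc" }
/^ID /       { ug = $2; eg = ""; s = "" }   # new cluster: reset gene fields
/^GENE_ID /  { eg = $2 }
/^GENE /     { s = $2 }
/^SEQUENCE / {
    sub("ACC=", "", $2)                     # "ACC=AB000001.1;" -> "AB000001.1;"
    sub(";", "", $2)                        # "AB000001.1;"     -> "AB000001.1"
    print ug, eg, s, $2
}' > sample.csv

cat sample.csv
```

This prints a header followed by three rows: two for Hs.999999 (both carrying 12345 and FAKE1) and one for Hs.999998 with empty gene_id and symbol fields.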