@brentp
Last active January 9, 2019 08:42
Create a BED12 file from a UCSC database.
ORG=$1
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -NA -D $ORG -e \
"select K.chrom,chromStart,chromEnd,X.geneSymbol,G.exonCount,strand from knownCanonical as K, kgXref as X, knownGene as G where
X.kgId=K.transcript and G.name=X.kgID;"
ORG=$1
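mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D "$ORG" -P 3306 -e "select chrom,txStart,txEnd,K.name,X.geneSymbol,strand,exonStarts,exonEnds from knownGene as K,kgXref as X where X.kgId=K.name;" > tmp.notbed
# Drop the header row, then convert each transcript row to a BED12 line.
grep -v txStart tmp.notbed | awk '
BEGIN { OFS = "\t"; FS = "\t" }
{
    # exonStarts ($7) and exonEnds ($8) are comma-separated lists with a
    # trailing comma; split() resets the arrays and returns the element count.
    n = split($7, astarts, /,/)
    split($8, aends, /,/)
    starts = ""
    sizes = ""
    exonCount = 0
    for (i = 1; i <= n; i++) {
        # Skip only the empty element after the trailing comma.
        # (A plain truth test would also drop an exon that starts at 0.)
        if (astarts[i] == "") continue
        sizes = sizes""(aends[i] - astarts[i])","
        # BED12 block starts are relative to txStart ($2).
        starts = starts""(astarts[i] - $2)","
        exonCount = exonCount + 1
    }
    # name is "geneSymbol,transcript"; thickStart/thickEnd reuse txStart/txEnd.
    print $1,$2,$3,$5","$4,1,$6,$2,$3,"0",exonCount,sizes,starts
}' | sort -k1,1 -k2,2n > "knownGene.$ORG.bed12"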

brentp commented May 26, 2011

In ~/.bash_aliases, put:

alias bed12="wget --quiet --no-check-certificate  -O - https://gist.github.com/raw/993337/ucsc-bed12.sh | sh -s"

Then use it like:

$ bed12 mm8

or

$ bed12 hg19

A sorted file knownGene.$ORG.bed12 (for example, knownGene.hg19.bed12) will then be created in the current directory, next to the intermediate query dump tmp.notbed. The BED file contains one row per transcript in the UCSC knownGene table.
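
To see what one of those rows looks like, here is a minimal sketch of the same awk conversion applied to a single made-up input row. The function name to_bed12, the transcript ID uc001aaa.1 and the symbol GENE1 are hypothetical and only there for illustration; the column order matches the query in the script (chrom, txStart, txEnd, name, geneSymbol, strand, exonStarts, exonEnds).

```shell
# Sketch only: wraps the awk step of the script in a function (hypothetical name).
to_bed12() {
    awk 'BEGIN { OFS = "\t"; FS = "\t" }
    {
        n = split($7, astarts, /,/)
        split($8, aends, /,/)
        starts = ""; sizes = ""; exonCount = 0
        for (i = 1; i <= n; i++) {
            if (astarts[i] == "") continue          # empty field after trailing comma
            sizes = sizes""(aends[i] - astarts[i])","
            starts = starts""(astarts[i] - $2)","   # relative to txStart
            exonCount = exonCount + 1
        }
        print $1,$2,$3,$5","$4,1,$6,$2,$3,"0",exonCount,sizes,starts
    }'
}

# Made-up transcript with two exons: 1000-1500 and 3000-5000.
printf 'chr1\t1000\t5000\tuc001aaa.1\tGENE1\t+\t1000,3000,\t1500,5000,\n' | to_bed12
```

For this row the block sizes come out as 500,2000, and the block starts as 0,2000, so the printed line has 12 tab-separated fields: chr1, 1000, 5000, GENE1,uc001aaa.1, 1, +, 1000, 5000, 0, 2, 500,2000, and 0,2000,.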


brentp commented Aug 1, 2011

For refGene, this looks like:

mysql -D $org -e "SELECT chrom,txStart,txEnd,name2,strand,exonStarts,exonEnds from refGene;" | awk 'BEGIN{ OFS="\t"; FS="\t" }(NR > 1){
                # NR > 1 skips the header row that mysql prints.
                # split() fills the arrays from index 1 and returns the count.
                n = split($6, astarts, /,/);
                split($7, aends, /,/);
                starts=""
                sizes=""
                exonCount=0
                for(i=1; i <= n; i++){
                    # Skip only the empty element after the trailing comma.
                    if (astarts[i] == "") continue
                    sizes=sizes""(aends[i] - astarts[i])","
                    # Block starts are relative to txStart ($2).
                    starts=starts""(astarts[i] - $2)","
                    exonCount=exonCount + 1
                }
                print $1,$2,$3,$4,1,$5,$2,$3,"0",exonCount,sizes,starts
}' | sort -k 1,1 -k 2,2n > refGene.$org.bed12
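
Either output file can be sanity-checked before use. The sketch below is an assumption-laden helper, not part of the gist: check_bed12 is a hypothetical name. It tests a few BED12 block rules as I understand them from the UCSC format description: 12 fields per row, blockCount equal to the number of sizes and starts, a first block start of 0, and a last block that ends exactly at chromEnd.

```shell
# Sketch only: prints offending rows and returns non-zero if any are found.
check_bed12() {
    awk 'BEGIN { FS = "\t" }
    {
        ns = split($11, sizes, /,/)
        nb = split($12, starts, /,/)
        # Ignore the empty element produced by a trailing comma.
        if (ns > 0 && sizes[ns] == "") ns--
        if (nb > 0 && starts[nb] == "") nb--
        if (NF != 12 || ns != $10 || nb != $10 ||
            starts[1] != 0 || $2 + starts[nb] + sizes[nb] != $3) {
            print "line " NR ": " $0
            bad++
        }
    }
    END { exit (bad > 0) }' "$@"
}
```

Use it like

$ check_bed12 refGene.hg19.bed12 && echo ok

and any row that breaks one of the rules is printed with its line number.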
