I hereby claim:
- I am sujaikumar on github.
- I am sujai (https://keybase.io/sujai) on keybase.
- I have a public key ASCyGc7HYwAEytv2WA0UK2tyl-xK4E0khlopKkyYaj7HYQo
To claim this, I am signing this object:
# heading1
Para
## heading2
- bullet1
- bullet2
[Heligmosomoides bakeri][Heligmosomoides bakeri]
Download the UniRef90 XML file first (warning: this is ~15 GB and will take a while).
A single command to get the list of UNC's own 'best-hit' annotations for their 15 longest scaffolds:
curl http://weatherby.genetics.utah.edu/seq_transf/tg.default.final.gff.gz \
| zgrep -P "\tmRNA\t" | sort -k2,2gr -t 'e' \
| sort -k 1V \
| awk '{print; if(/scaffold15/){exit}}' \
| perl -plne 's/maker\tmRNA\t//; s/\.\t.*?\(/(/;' \
NCBI blastp seems to have a bug where it reports different top hits when -max_target_seqs is changed. This is a serious problem because the first 20 hits (for example) should be the same whether -max_target_seqs 100 or -max_target_seqs 500 is used.
The bug is reproducible on the command line when searching NCBI's nr BLAST database (dated 25-Nov-2015) with NCBI BLAST+ 2.2.28+, 2.2.30+ and 2.2.31+.
At first I thought it had something to do with my local executable or BLAST database, but the same problem also appears on the NCBI blastp web interface (as of 30-Nov-2015).
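One way to check this yourself is to run the same query twice and compare the first N distinct subject IDs. The sketch below is only an illustration: the file names (`query.faa`, `hits_100.tsv`, `hits_500.tsv`) and the helper names `top_subjects` and `same_top_hits` are hypothetical. It assumes tabular output (`-outfmt 6`) from a single query sequence, and the two blastp runs appear as comments because they need a local BLAST+ install and a copy of nr.

```shell
#!/bin/sh
# The two runs to compare (hypothetical file names):
#   blastp -query query.faa -db nr -outfmt 6 -max_target_seqs 100 -out hits_100.tsv
#   blastp -query query.faa -db nr -outfmt 6 -max_target_seqs 500 -out hits_500.tsv

# Print the first N distinct subject IDs (column 2 of -outfmt 6), in file order.
# $1 = tabular BLAST output, $2 = number of subjects to keep
top_subjects() {
    awk -F '\t' '!seen[$2]++ { print $2 }' "$1" | head -n "$2"
}

# Succeed (exit 0) only if both files list the same first N subjects in the same order.
# $1, $2 = tabular BLAST outputs, $3 = number of subjects to compare
same_top_hits() {
    a=$(top_subjects "$1" "$3")
    b=$(top_subjects "$2" "$3")
    [ "$a" = "$b" ]
}
```

With both result files in place, `same_top_hits hits_100.tsv hits_500.tsv 20 && echo same || echo differ` would report whether the top 20 subjects agree.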
How to parallelise UCSC BLAT with GNU parallel
==============================================
I spent a long time working out how to parallelise UCSC's blat with GNU parallel. Most tricks for specifying the query file didn't work (e.g. "-", "</dev/stdin"), so I am posting what did work for me:
cat cdna.fa | parallel --pipe --recstart ">" "blat -noHead genome.fa stdin >(cat) >/dev/null" >out.psl
If you leave out the >/dev/null, blat's stdout messages such as "Loaded X letters in Y sequences. Searched A bases in B sequences" end up in your output.
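If an output file already has those messages mixed in, a small filter can remove them after the fact. This is a minimal sketch, not part of blat: the function name `strip_blat_messages` and the file names in the usage note are hypothetical. It assumes each message sits on its own line starting with "Loaded " or "Searched ", and that the PSL was written with -noHead so every data row starts with a number.

```shell
#!/bin/sh
# Drop blat's progress messages from a PSL stream (stdin -> stdout).
# Assumption: messages occupy whole lines beginning with "Loaded " or "Searched ";
# headerless PSL rows begin with a match count, so they pass through untouched.
strip_blat_messages() {
    grep -v -e '^Loaded ' -e '^Searched '
}
```

For example, `strip_blat_messages < out_with_messages.psl > out.psl` would write a cleaned copy.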