TODO Colorize distances using greengenes.lbl.gov and etetoolkit.org
- Presenting the filtering algorithm using code was a bad idea. Better to illustrate it using sample data.
- Ignored the “from Aeromonas” transfers since the data was weak.
- Committed changes so far to GitHub.
- Completed filtering of data using Peter’s new guidelines.
- Thiberio confirmed and explained the guidelines.
- Learned the pandas idioms for broadcasting along an index using `reindex()` and for filtering grouped data with a lambda function using `filter()`.
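Both idioms can be sketched on toy data (the column and group names below are made up, not the project's):

```python
import pandas as pd

# Toy table standing in for the real blast output; names are hypothetical.
df = pd.DataFrame({
    "group": ["g1", "g1", "g2", "g2", "g3"],
    "bitscore": [50.0, 45.0, 80.0, 75.0, 30.0],
})

# reindex() broadcasts a per-group value back onto every row of that group:
best = df.groupby("group")["bitscore"].max()
df["group_best"] = best.reindex(df["group"]).to_numpy()

# filter() keeps or drops whole groups via a lambda over each sub-frame:
strong = df.groupby("group").filter(lambda g: g["bitscore"].max() >= 50)
print(sorted(strong["group"].unique()))  # ['g1', 'g2']
```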
- Deleted previous output data and reran scripts from 03- to 08-.
- There is a bug in 06- where mafft tries to align `all.faa`, so after watching it in `top` one has to ^C kill 06-.
- Aliased `sshfs` with default options to reconnect to giardia to be able to visualize trees with `seaview`.
- Visualized paraphyletic trees with `seaview` for cases of HGT:
  - To Aeromonas: 103962 160838 246298* 320665 320673 320678* 70966
  - From Aeromonas: 299986 320664 320669 320674 320680 320681 320684 320687* 321038 7135*
- Met with collaborators to show T3SS work so far.
- Needed to relax tree branch selection.
- Analyzed non-monophyletic trees for horizontal gene transfer.
- Needed to root the tree; used `tree.root_at_midpoint()` but it’s not a great way to do it.
- Globbing is not great practice; replaced with `os.listdir` and `os.path.join`.
- Incorporated species into tree names.
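The glob replacement is the usual listdir/join pattern; a self-contained sketch (the directory and filenames here are made up):

```python
import os
import tempfile

# Hypothetical stand-in for the tree directory; the real script listed
# newick files with os.listdir + os.path.join instead of glob.
tree_dir = tempfile.mkdtemp()
for name in ("a.nwk", "b.nwk", "notes.txt"):
    open(os.path.join(tree_dir, name), "w").close()

# Keep only the newick files, with full paths:
trees = sorted(
    os.path.join(tree_dir, f)
    for f in os.listdir(tree_dir)
    if f.endswith(".nwk")
)
print([os.path.basename(p) for p in trees])  # ['a.nwk', 'b.nwk']
```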
- Using Bio.Phylo of biopython to read the newick trees and test for monophyly:
  - Must only check the Aeromonas branches for monophyly.
  - Created new Python script to test for monophyly.
  - Found only 18 of 128 trees are monophyletic.
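The monophyly test itself is simple; a stdlib sketch of the idea (the real script used Bio.Phylo, and the tree and taxon names here are toys):

```python
def leaves(node):
    """Collect leaf names under a node (a string leaf or a list of children)."""
    if isinstance(node, str):
        return {node}
    names = set()
    for child in node:
        names |= leaves(child)
    return names

def is_monophyletic(tree, group):
    """True if some clade's leaf set equals `group` exactly."""
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)

# Toy rooted tree: ((Aeromonas_1, Aeromonas_2), (Escherichia, Vibrio))
tree = [["Aeromonas_1", "Aeromonas_2"], ["Escherichia", "Vibrio"]]
print(is_monophyletic(tree, {"Aeromonas_1", "Aeromonas_2"}))  # True
print(is_monophyletic(tree, {"Aeromonas_1", "Escherichia"}))  # False
```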
- mafft:
  - Talked to Thiberio about all the `mafft` runs completing in less than a second and whether argument options should be added on to `mafft`. He said that we would need `--reorder` and `--auto` and drew on the whiteboard explaining why.
  - Reran `mafft` with new options.
- Tree building:
  - Tried to install FastTree using `homebrew`, but running it fails with:

    ```
    $ FastTree
    dyld: Library not loaded: /Users/omsai/apps/homebrew/lib/gcc/4.9/libgomp.1.dylib
      Referenced from: /Users/omsai/apps/homebrew/bin/FastTree
      Reason: image not found
    ```

  - Thiberio said he installed FastTree from the website instead of homebrew.
    - Mac install requires compilation of a single C file.
    - Created Makefile in `./apps/fastree/` to download and compile it.
    - Mac doesn’t even have `wget`!
  - Created and ran parallel FastTree script. Longest tree was built in ~5 seconds.
- Viewing trees:
  - Trees are saved in newick format. Checked Gentoo Portage for available commandline viewers. Installed `newick-utils` from homebrew and used `nw_display`.
  - Original notes from project briefing indicate using BioPython to view the tree. Installed from the virtual env:

    ```
    source ~/apps/venv-t3ss/bin/activate
    pip search biopython
    pip install biopython
    ```
- Refactored blast+ manipulation into new blastplus library and BlastProcessor class.
- Wrote script to concatenate sequences using pre-blast original filenames, `05-concatenate_homologous_and_top_hit_sequence_files.py`.
- Discovered bug in NR sequence retrieval script `04-get_sequences_of_top_hits.py` of not adding the comment line.
  - Fixed and reran NR sequence retrieval. Completed in 13 minutes.
  - Now adds the comment line. Verified using diff:

    ```
    for f in `ls ~/data/blastp-input/*.faa`
    do
        diff -u $f ~/data/mafft-input/$(basename $f | sed -e 's#\.faa#\.cat\.faa#')
    done
    ```
- Wrote GNU Parallel script to run mafft.
  - Needed to pipe files to stdin to GNU Parallel instead of using the `:::` format.
  - One has to end the command with double-quotes for GNU Parallel to process the replacement strings (the `{}`, etc).
- `mafft` is already installed:

  ```
  mafft --version
  v7.221 (2014/04/16)
  ```

- Thiberio explained what alignment actually does, the holy trinity of aligners, that tree building can be done with the alignment, etc.
- Committed yesterday’s changes to GitHub.
- Alignment program:
- Input will be fasta files with all the homologous sequences and their blast match sequences in a single file.
- Output will be aligned sequences with dashes.
- Run with `parallel`. Maybe 3 threads and 4 sequences at a time.
- Samtools failing to find sequences in nr database:
- Thiberio confirmed the fault with samtools failing after a few tries. He suspects this is due to hardware issues, and one could iteratively try each sequence till completion.
- Tried to just use `parallel` with one run per file, since it supports automatic retries, etc, but it fails even with a single thread.
- Observing `htop` indicates `samtools faidx` eats up all available memory and starts swapping, which probably leads to a crash and a false error message.
- Thiberio suggested reading the nr database directly with Python’s `xreadlines()`, which iterates over the file without loading it all into memory. This is a replacement for using samtools.
- Was not able to retrieve 8 sequences using just the first identifier in the comment line.
- Using all sequences in the comment line with regex and sets retrieved all 383 sequences of interest.
- Run completed in about 15 minutes.
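A minimal sketch of that streaming approach, using a small in-memory stand-in for the 38 GB nr text file (the header layout, identifiers, and sequences below are invented for illustration):

```python
import io
import re

# Stand-in for iterating the huge nr FASTA file line by line; a comment
# line may carry several identifiers, so collect them all with a regex
# and intersect against the set of wanted IDs.
nr = io.StringIO(
    ">gi|111|ref|A1 protein A gi|222|ref|A2 same protein\n"
    "MKV\n"
    ">gi|333|ref|B1 protein B\n"
    "MAA\n"
)

wanted = {"111", "333"}
found = {}
header_gis = set()
for line in nr:
    line = line.rstrip("\n")
    if line.startswith(">"):
        header_gis = set(re.findall(r"gi\|(\d+)", line)) & wanted
    elif header_gis:
        for gi in header_gis:
            found[gi] = found.get(gi, "") + line
print(sorted(found))  # ['111', '333']
```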
- Talked to Thiberio about coding best practices:
- Personally he reviews the README files and script help output.
- Doesn’t actually push data through them.
- Some journal reviewers will run code parametrically if it’s easy; e.g. web app.
- Found paper (DOI: 10.1371/journal.pcbi.1000424) mentioned by Bill Mills’ “Bioinformatics” lesson tags here: https://mozillascience.github.io/studyGroupHandbook/lessons.html#lessons
- Spoke to Prof. Gogarten about possible PhD projects.
- Read Dan Needleman’s papers on evolution overlapping with cell biology.
- Hard to find a niche area since many people are exploring cell biology this way.
- Our value proposition is thoroughness of work and insights.
- Committed previous weeks’ code to GitHub.
- Samtools only seems to process 5 parameters at a time and cuts off.
- Could it be an OSX commandline length limitation? No, since `getconf ARG_MAX` returns 262144 and `wc` returns only 924 characters total for the `subject_ids.ref` reference file.
- Fault seems to be that `samtools` dies at the first sign of trouble. It choked on not finding the 5th region parameter.
- Tried using the redirection operator instead of piping file output, but that makes it ignore all parameters, causing it to reindex the nr database in 7 minutes wall time.
- Hmm.. could do some dirty hacks of dropping parameters, or actually fix `samtools faidx`. GitHub seems fairly active, so the chances of a fix being reviewed and accepted are high and worthwhile.
- `blastp` output was filled with just one column of output.
  - Turns out one has to both specify the numeric format and each of the columns.
  - Reran blast, watching output with `tail -f /path/to/all.fasta.blastp`.
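The fix amounts to putting the format number and the wanted column names inside a single quoted `-outfmt` argument; a sketch of the invocation (the file paths and the particular column subset here are illustrative, not the project's exact command):

```shell
blastp -query all.faa -db nr \
    -outfmt "6 qseqid sseqid pident length evalue bitscore" \
    -out all.fasta.blastp
```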
- Read review papers about Inteins to help understand Shannon’s talk
tomorrow.
- Read about Homing Endonuclease (HEN). Now understand the “homing cycle” reference from the practice talk.
- Talked to Sheila about forming her research interest, forming a committee, and submitting a proposal. One can start with the PI’s project and then find one’s own interest.
- Thiberio explained that to compare sequence length one has to retrieve the compared sequence of the “Subject ID” using another utility, `samtools faidx`:

  ```
  brew search samtools
  brew install homebrew/science/samtools
  ```

  - `samtools` only takes a text input file. Thiberio clarified that our NR “database” is a decompressed text file, which is why it’s 38 GB.
- Matt helped explain the broad scope of the problems the lab solves.
- `samtools faidx` keeps complaining of not being able to find any queries with “[fai_fetch] Warning - Reference not found in FASTA file, returning empty sequence”. This is because the NR database was not indexed. Indexed using `samtools faidx nr`. Took about 15-20 minutes.
- Told Thiberio that from the 128 homologous groups containing 3118 genes, I got 1344 unique subject ID hits.
  - He thought this was odd since one would expect a set of homologous groups to map to one or two other organisms at most.
  - Need to write a script to find which homologous groups are producing so many results.
- Wrote unique ‘subject id’ hits to a file to run through `samtools faidx`.
- Thiberio figured out that the problem was that some of the hits were Archaea. Our original negative_gilist we got from NCBI was incomplete. Thiberio chose all the databases this time and confirmed the new gilist filters out most of the previous results.
- `blastp` kept complaining about not being able to write to the output file:

  ```
  Error: NCBI C++ Exception:
      T0 "/private/tmp/blast20151014-14721-16up1aj/ncbi-blast-2.2.31+-src/c++/src/corelib/ncbifile.cpp", line 5416: Error: ncbi::CMemoryFileMap::CMemoryFileMap() - To be memory mapped the file must exist:
  ```

  - This was because it’s unable to parse the double quotes around the filename.
- Setup new blastp to run on the corrected `sequence.gi` exclusion list.
- All blastp output is only 113M, so it should be fine to read all of it at once for analysis:

  ```
  $ du -h ~/data/blastp-output/
  113M    /Users/omsai/data/blastp-output/
  ```

- Concatenated data into a single file to make reading with pandas easier, using new shell script `concatenate-blastp.sh`.
- The final step of selecting the maximum values of the pandas DataFrame fails with: “IndexError: positional indexers are out-of-bounds”.
- This was because pandas preserves the original indices when filtering.
- Typically one would reindex, but there’s no need: since the dataset is only about 100 MB, one may as well keep all the data in memory and reference it.
- While the “evalue” filter strips 14% of all the data, all the “bit score” rows are anyway within its threshold; so the bit-score filter has no effect.
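The IndexError is easy to reproduce in miniature: boolean filtering preserves the original row labels, so positional (`iloc`) lookups past the filtered length fail, while label-based (`loc`) access still works (the column name and values here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"evalue": [1e-30, 0.5, 1e-10, 0.9]})
kept = df[df["evalue"] < 1e-5]   # keeps rows 0 and 2; index stays [0, 2]

# kept.iloc[2] would raise IndexError: only 2 positional rows remain.
# Label-based access works because the original index is preserved:
row = kept.loc[2]
print(kept.index.tolist())  # [0, 2]
```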
- Not sure about next step. Posted notes from initial briefing online for clarification: https://public.etherpad-mozilla.org/p/G9HSaDsHNa
- Thiberio did some profiling and found that using `-negative_gilist` is intensive.
- In any case, it would have been quicker to cat all the input files `*.faa` and then run blastp once on them, albeit in a single thread.
- All blast processes finished in the early afternoon.
- Too tired in the evening, so chatted with Thiberio and saw his 6 vs 9 tree comparison.
- Completed shell script to generate blastp output.
- Ran script overnight.
- Discussed issue of filtering GI number with Thiberio.
- Confirmed approach of not doing filtering with blastp and to do it later with Python.
- But actually had a look at the output and saw it flooded with Aeromonas. Need to filter by GI list. Thiberio fetched it from the NIH website via GenBank only. Saved GI list file to `data/blastp-input/sequence.gi.txt`.
- Applying the `-negative_gilist` option fails with error:

  ```
  BLAST Database error: GI list specified but no ISAM file found for GI in /Volumes/Macintosh HD 2/thiberio/blast_dbs/nr.00
  ```

- Thiberio fixed it by rebuilding the nr file with `makeblastdb` using the `-parse_seqids` option. Normally we don’t do this since it adds about 30 minutes to the build.
- Create Pandas table from `blastp` output:
  - While local NR is rebuilding, downloaded blastp “Hit Table (text)” output from the NIH website. Copied to `data/nih-output/`.
- Pushed Python script and lab notes to GitHub. Cloning using git:// works, but push does not. Changed git config to use https instead of git.
- Setup BLAST+ per NIH standalone setup:
  - Fetched installation files into `Downloads/`.
  - One has to use the `ftp` program per the documentation, as the direct FTP link never seems to resolve.
  - Installed into the `apps/` folder and added to the `PATH`:

    ```
    cd ~/Downloads/
    md5 -q ncbi-blast-2.2.31+-universal-macosx.tar.gz && cat ncbi-blast-2.2.31+-universal-macosx.tar.gz.md5
    cd ~/apps/
    tar zxvpf ../Downloads/ncbi-blast-2.2.31+-universal-macosx.tar.gz
    echo 'PATH=${PATH}:~/apps/ncbi-blast-2.2.31+/bin/' >> ~/.bash_profile
    source ~/.bash_profile
    ```
- Downloaded NR database as described in BLAST help:
  - Created `data/` folder and downloaded the NR database. This took about 20 minutes!

    ```
    mkdir ~/data
    cd ~/data/
    update_blastdb.pl nr
    ```
- Talked to Thiberio:
- Advised using homebrew for blast instead of installing directly from NIH.
  - We already have NR here which we should use: `/Volumes/Macintosh HD 2/thiberio/blast_dbs`
  - The 128 FASTA files are stored at: `/Volumes/Macintosh HD 2/thiberio/t3ss_fasta_files`
  - Showed me how to use `write` to communicate between users in the terminal. But we really need a better messaging app. Decided on Telegram.
- Installed homebrew per the `brew` man page and online instructions:

  ```
  mkdir homebrew && curl -L https://github.com/Homebrew/homebrew/tarball/master | tar xz --strip 1 -C homebrew
  echo 'PATH=~/apps/homebrew/bin:${PATH}' >> ~/.bash_profile
  echo 'export HOMEBREW_CACHE=~/apps/homebrew/Library/Cache/' >> ~/.bash_profile
  source ~/.bash_profile
  mkdir -p ~/apps/homebrew/Library/Cache/
  ```
- Installed blast from homebrew: `brew install homebrew/science/blast`
- Setup Python virtualenv:

  ```
  mkdir -p ~/apps/venv-t3ss
  virtualenv ~/apps/venv-t3ss
  source ~/apps/venv-t3ss/bin/activate
  pip install --upgrade pip
  pip install ipython
  pip install pandas
  ```
- Read BLAST user manual.
- Tested BLAST on web page:
  - Used file `10381_35.2008.faa`
  - Excluded results from “Aeromonas allosaccharophila (taxid:656)”
- This gave 100% match to another Aeromonas :)
- Changed exclusion group to “Aeromonas (taxid:642)”
- Works now.
- Need to automate this single run locally with Python.
- Ran locally from commandline. Output is not tabulated. Also run time was 40x longer than running web query from NCBI!
- Problem may be that nr database is 38 GB. Talked to Thiberio. He said to just run overnight and not worry about timing for just 128.