TODO Colorize distances using greengenes.lbl.gov and etetoolkit.org
- Presenting the filtering algorithm using code was a bad idea. Better to illustrate it using sample data.
- Ignored the “from Aeromonas” transfers since the data was weak.
- Committed changes so far to GitHub.
- Completed filtering of data using Peter’s new guidelines.
- Thiberio confirmed and explained the guidelines.
- Learned the pandas idioms for broadcasting along an index using `reindex()` and for filtering grouped data with a lambda function using `filter()`.
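Both idioms can be sketched on toy data (the column and group names below are made up, not the project's):

```python
import pandas as pd

# Toy table standing in for the real blast output; names are hypothetical.
df = pd.DataFrame({
    "group": ["g1", "g1", "g2", "g2", "g3"],
    "bitscore": [50.0, 45.0, 80.0, 75.0, 30.0],
})

# reindex() broadcasts a per-group value back onto every row of that group:
best = df.groupby("group")["bitscore"].max()
df["group_best"] = best.reindex(df["group"]).to_numpy()

# filter() keeps or drops whole groups via a lambda over each sub-frame:
strong = df.groupby("group").filter(lambda g: g["bitscore"].max() >= 50)
print(sorted(strong["group"].unique()))  # ['g1', 'g2']
```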
- Deleted previous output data and reran scripts from 03- to 08-.
- There is a bug in 06- where mafft tries to align `all.faa`, so after watching it in `top` one has to ^C kill 06-.
- Aliased `sshfs` with default options to reconnect to giardia to be able to visualize trees with `seaview`.
- Visualized paraphyletic trees with `seaview` for cases of HGT:
  - To Aeromonas: 103962 160838 246298* 320665 320673 320678* 70966
  - From Aeromonas: 299986 320664 320669 320674 320680 320681 320684 320687* 321038 7135*
- Met with collaborators to show T3SS work so far.
- Needed to relax tree branch selection.
- Analyzed non-monophyletic trees for horizontal gene transfer.
- Needed to root the tree; used `tree.root_at_midpoint()` but it’s not a great way to do it.
- Globbing is not great practice; replaced with `os.listdir` and `os.path.join`.
- Incorporated species into tree names.
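The glob replacement is the usual listdir/join pattern; a self-contained sketch (the directory and filenames here are made up):

```python
import os
import tempfile

# Hypothetical stand-in for the tree directory; the real script listed
# newick files with os.listdir + os.path.join instead of glob.
tree_dir = tempfile.mkdtemp()
for name in ("a.nwk", "b.nwk", "notes.txt"):
    open(os.path.join(tree_dir, name), "w").close()

# Keep only the newick files, with full paths:
trees = sorted(
    os.path.join(tree_dir, f)
    for f in os.listdir(tree_dir)
    if f.endswith(".nwk")
)
print([os.path.basename(p) for p in trees])  # ['a.nwk', 'b.nwk']
```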
- Using Bio.Phylo of biopython to read the newick trees and test for monophyly:
  - Must only check the Aeromonas branches for monophyly.
  - Created new Python script to test for monophyly.
  - Found only 18 of 128 trees are monophyletic.
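The monophyly test itself is simple; a stdlib sketch of the idea (the real script used Bio.Phylo, and the tree and taxon names here are toys):

```python
def leaves(node):
    """Collect leaf names under a node (a string leaf or a list of children)."""
    if isinstance(node, str):
        return {node}
    names = set()
    for child in node:
        names |= leaves(child)
    return names

def is_monophyletic(tree, group):
    """True if some clade's leaf set equals `group` exactly."""
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)

# Toy rooted tree: ((Aeromonas_1, Aeromonas_2), (Escherichia, Vibrio))
tree = [["Aeromonas_1", "Aeromonas_2"], ["Escherichia", "Vibrio"]]
print(is_monophyletic(tree, {"Aeromonas_1", "Aeromonas_2"}))  # True
print(is_monophyletic(tree, {"Aeromonas_1", "Escherichia"}))  # False
```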
- mafft:
  - Talked to Thiberio about all the `mafft` runs completing in less than a second and whether argument options should be added on to `mafft`. He said that we would need `--reorder` and `--auto` and drew on the whiteboard explaining why.
  - Reran `mafft` with new options.
- Tree building:
  - Tried to install FastTree using `homebrew`, but running it fails with:

    ```
    $ FastTree
    dyld: Library not loaded: /Users/omsai/apps/homebrew/lib/gcc/4.9/libgomp.1.dylib
      Referenced from: /Users/omsai/apps/homebrew/bin/FastTree
      Reason: image not found
    ```

  - Thiberio said he installed FastTree from the website instead of homebrew.
    - Mac install requires compilation of a single C file.
    - Created Makefile in `./apps/fastree/` to download and compile it.
    - Mac doesn’t even have `wget`!
  - Created and ran parallel FastTree script. Longest tree was built in ~5 seconds.
- Viewing trees:
  - Trees are saved in newick format. Checked Gentoo Portage for available commandline viewers. Installed `newick-utils` from homebrew and used `nw_display`.
  - Original notes from project briefing indicate using BioPython to view the tree. Installed from the virtual env:

    ```
    source ~/apps/venv-t3ss/bin/activate
    pip search biopython
    pip install biopython
    ```
- Refactored blast+ manipulation into new blastplus library and BlastProcessor class.
- Wrote script to concatenate sequences using pre-blast original filenames, `05-concatenate_homologous_and_top_hit_sequence_files.py`.
- Discovered bug in NR sequence retrieval script `04-get_sequences_of_top_hits.py` of not adding the comment line.
  - Fixed and reran NR sequence retrieval. Completed in 13 minutes.
  - Now adds the comment line. Verified using diff:

    ```
    for f in `ls ~/data/blastp-input/*.faa`
    do
        diff -u $f ~/data/mafft-input/$(basename $f | sed -e 's#\.faa#\.cat\.faa#')
    done
    ```
- Wrote GNU Parallel script to run mafft.
  - Needed to pipe files to stdin to GNU Parallel instead of using the `:::` format.
  - One has to end the command with double-quotes for GNU Parallel to process the replacement strings (the `{}`, etc).
- `mafft` is already installed:

  ```
  mafft --version
  v7.221 (2014/04/16)
  ```

- Thiberio explained what alignment actually does, the holy trinity of aligners, that tree building can be done with the alignment, etc.
- Committed yesterday’s changes to GitHub.
- Alignment program:
- Input will be fasta files with all the homologous sequences and their blast match sequences in a single file.
- Output will be aligned sequences with dashes.
- Run with `parallel`. Maybe 3 threads and 4 sequences at a time.
- Samtools failing to find sequences in nr database:
- Thiberio confirmed the fault with samtools failing after a few tries. He suspects this is due to hardware issues, and one could iteratively try each sequence till completion.
- Tried to just use `parallel` with one run per file, since it supports automatic retries, etc, but it fails even with a single thread.
- Observing `htop` indicates `samtools faidx` eats up all available memory and starts swapping, which probably leads to a crash and a false error message.
- Thiberio suggested reading the nr database directly with Python’s `xreadlines()`, which iterates over the file without loading it all into memory. This is a replacement for using samtools.
- Was not able to retrieve 8 sequences using just the first identifier in the comment line.
- Using all sequences in the comment line with regex and sets retrieved all 383 sequences of interest.
- Run completed in about 15 minutes.
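A minimal sketch of that streaming approach, using a small in-memory stand-in for the 38 GB nr text file (the header layout, identifiers, and sequences below are invented for illustration):

```python
import io
import re

# Stand-in for iterating the huge nr FASTA file line by line; a comment
# line may carry several identifiers, so collect them all with a regex
# and intersect against the set of wanted IDs.
nr = io.StringIO(
    ">gi|111|ref|A1 protein A gi|222|ref|A2 same protein\n"
    "MKV\n"
    ">gi|333|ref|B1 protein B\n"
    "MAA\n"
)

wanted = {"111", "333"}
found = {}
header_gis = set()
for line in nr:
    line = line.rstrip("\n")
    if line.startswith(">"):
        header_gis = set(re.findall(r"gi\|(\d+)", line)) & wanted
    elif header_gis:
        for gi in header_gis:
            found[gi] = found.get(gi, "") + line
print(sorted(found))  # ['111', '333']
```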
- Talked to Thiberio about coding best practices:
- Personally he reviews the README files and script help output.
- Doesn’t actually push data through them.
- Some journal reviewers will run code parametrically if it’s easy; e.g. web app.
- Found paper (DOI: 10.1371/journal.pcbi.1000424) mentioned by Bill Mills’ “Bioinformatics” lesson tags here: https://mozillascience.github.io/studyGroupHandbook/lessons.html#lessons
- Spoke to Prof. Gogarten about possible PhD projects.
- Read Dan Needleman’s papers on evolution overlapping with cell biology.
- Hard to find a niche area since many people are exploring cell biology this way.
- Our value proposition is thoroughness of work and insights.
- Committed previous weeks’ code to GitHub.
- Samtools only seems to process 5 parameters at a time and cuts off.
- Could it be an OSX commandline length limitation? No, since `getconf ARG_MAX` returns 262144 and `wc` returns only 924 characters total for the `subject_ids.ref` reference file.
- Fault seems to be that `samtools` dies at the first sign of trouble. It choked on not finding the 5th region parameter.
- Tried using the redirection operator instead of piping file output, but that makes it ignore all parameters, causing it to reindex the nr database in 7 minutes wall time.
- Hmm.. could do some dirty hacks of dropping parameters, or actually fix `samtools faidx`. GitHub seems fairly active, so the chances of a fix being reviewed and accepted are high and worthwhile.
- `blastp` output was filled with just one column of output.
  - Turns out one has to both specify the numeric format and each of the columns.
  - Reran blast, watching output with `tail -f /path/to/all.fasta.blastp`.
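The fix amounts to putting the format number and the wanted column names inside a single quoted `-outfmt` argument; a sketch of the invocation (the file paths and the particular column subset here are illustrative, not the project's exact command):

```shell
blastp -query all.faa -db nr \
    -outfmt "6 qseqid sseqid pident length evalue bitscore" \
    -out all.fasta.blastp
```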
- Read review papers about Inteins to help understand Shannon’s talk
tomorrow.
- Read about Homing Endonuclease (HEN). Now understand the “homing cycle” reference from the practice talk.
- Talked to Sheila about forming her research interest, forming a committee, and submitting a proposal. One can start with the PI’s project and then find one’s own interest.
- Thiberio explained that to compare sequence length one has to retrieve the compared sequence of the “Subject ID” using another utility, `samtools faidx`:

  ```
  brew search samtools
  brew install homebrew/science/samtools
  ```

  - `samtools` only takes a text input file. Thiberio clarified that our NR “database” is a decompressed text file, which is why it’s 38 GB.
- Matt helped explain the broad scope of the problems the lab solves.
- `samtools faidx` keeps complaining of not being able to find any queries with “[fai_fetch] Warning - Reference not found in FASTA file, returning empty sequence”. This is because the NR database was not indexed. Indexed using `samtools faidx nr`. Took about 15-20 minutes.
- Told Thiberio that from the 128 homologous groups containing 3118 genes, I got 1344 unique subject ID hits.
  - He thought this was odd since one would expect a set of homologous groups to map to one or two other organisms at most.
  - Need to write a script to find which homologous groups are producing so many results.
- Wrote unique ‘subject id’ hits to a file to run through `samtools faidx`.
- Thiberio figured out that the problem was that some of the hits were Archaea. Our original negative_gilist we got from NCBI was incomplete. Thiberio chose all the databases this time and confirmed the new gilist filters out most of the previous results.
- `blastp` kept complaining about not being able to write to the output file:

  ```
  Error: NCBI C++ Exception:
      T0 "/private/tmp/blast20151014-14721-16up1aj/ncbi-blast-2.2.31+-src/c++/src/corelib/ncbifile.cpp", line 5416: Error: ncbi::CMemoryFileMap::CMemoryFileMap() - To be memory mapped the file must exist:
  ```

  - This was because it’s unable to parse the double quotes around the filename.
- Setup new blastp to run on the corrected `sequence.gi` exclusion list.
- All blastp output is only 113M, so it should be fine to read all of it at once for analysis:

  ```
  $ du -h ~/data/blastp-output/
  113M    /Users/omsai/data/blastp-output/
  ```

- Concatenated data into a single file to make reading with pandas easier, using new shell script `concatenate-blastp.sh`.
- The final step of selecting the maximum values of the pandas DataFrame fails with: “IndexError: positional indexers are out-of-bounds”.
- This was because pandas preserves the original indices when filtering.
- Typically one would reindex, but there’s no need: since the dataset is only about 100 MB, one may as well keep all the data in memory and reference it.
- While the “evalue” filter strips 14% of all the data, all the “bit score” rows are anyway within its threshold; so the bit-score filter has no effect.
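The IndexError is easy to reproduce in miniature: boolean filtering preserves the original row labels, so positional (`iloc`) lookups past the filtered length fail, while label-based (`loc`) access still works (the column name and values here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"evalue": [1e-30, 0.5, 1e-10, 0.9]})
kept = df[df["evalue"] < 1e-5]   # keeps rows 0 and 2; index stays [0, 2]

# kept.iloc[2] would raise IndexError: only 2 positional rows remain.
# Label-based access works because the original index is preserved:
row = kept.loc[2]
print(kept.index.tolist())  # [0, 2]
```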
- Not sure about next step. Posted notes from initial briefing online for clarification: https://public.etherpad-mozilla.org/p/G9HSaDsHNa
- Thiberio did some profiling and found that using `-negative_gilist` is intensive.
- In any case, it would have been quicker to cat all the input files `*.faa` and then run blastp once on them, albeit in a single thread.
- All blast processes finished in the early afternoon.
- Too tired in the evening, so chatted with Thiberio and saw his 6 vs 9 tree comparison.
- Completed shell script to generate blastp output.
- Ran script overnight.
- Discussed issue of filtering GI number with Thiberio.
- Confirmed approach of not doing filtering with blastp and to do it later with Python.
- But actually had a look at the output and saw it flooded with Aeromonas. Need to filter by GI list. Thiberio fetched it from the NIH website via GenBank only. Saved GI list file to `data/blastp-input/sequence.gi.txt`.
- Applying the `-negative_gilist` option fails with error:

  ```
  BLAST Database error: GI list specified but no ISAM file found for GI in /Volumes/Macintosh HD 2/thiberio/blast_dbs/nr.00
  ```

- Thiberio fixed it by rebuilding the nr file with `makeblastdb` using the `-parse_seqids` option. Normally we don’t do this since it adds about 30 minutes to the build.
- Create Pandas table from `blastp` output:
  - While local NR is rebuilding, downloaded blastp “Hit Table (text)” output from the NIH website. Copied to `data/nih-output/`.
- Pushed Python script and lab notes to GitHub. Cloning using git:// works, but push does not. Changed git config to use https instead of git.
- Setup BLAST+ per NIH standalone setup:
  - Fetched installation files into `Downloads/`.
  - One has to use the `ftp` program per the documentation, as the direct FTP link never seems to resolve.
  - Installed into the `apps/` folder and added to the `PATH`:

    ```
    cd ~/Downloads/
    md5 -q ncbi-blast-2.2.31+-universal-macosx.tar.gz && cat ncbi-blast-2.2.31+-universal-macosx.tar.gz.md5
    cd ~/apps/
    tar zxvpf ../Downloads/ncbi-blast-2.2.31+-universal-macosx.tar.gz
    echo 'PATH=${PATH}:~/apps/ncbi-blast-2.2.31+/bin/' >> ~/.bash_profile
    source ~/.bash_profile
    ```
- Downloaded NR database as described in BLAST help:
  - Created `data/` folder and downloaded the NR database. This took about 20 minutes!

    ```
    mkdir ~/data
    cd ~/data/
    update_blastdb.pl nr
    ```
- Talked to Thiberio:
- Advised using homebrew for blast instead of installing directly from NIH.
  - We already have NR here which we should use: `/Volumes/Macintosh HD 2/thiberio/blast_dbs`
  - The 128 FASTA files are stored at: `/Volumes/Macintosh HD 2/thiberio/t3ss_fasta_files`
  - Showed me how to use `write` to communicate between users in the terminal. But we really need a better messaging app. Decided on Telegram.
- Installed homebrew per the `brew` man page and online instructions:

  ```
  mkdir homebrew && curl -L https://github.com/Homebrew/homebrew/tarball/master | tar xz --strip 1 -C homebrew
  echo 'PATH=~/apps/homebrew/bin:${PATH}' >> ~/.bash_profile
  echo 'export HOMEBREW_CACHE=~/apps/homebrew/Library/Cache/' >> ~/.bash_profile
  source ~/.bash_profile
  mkdir -p ~/apps/homebrew/Library/Cache/
  ```
- Installed blast from homebrew: `brew install homebrew/science/blast`
- Setup Python virtualenv:

  ```
  mkdir -p ~/apps/venv-t3ss
  virtualenv ~/apps/venv-t3ss
  source ~/apps/venv-t3ss/bin/activate
  pip install --upgrade pip
  pip install ipython
  pip install pandas
  ```
- Read BLAST user manual.
- Tested BLAST on web page:
  - Used file `10381_35.2008.faa`
  - Excluded results from “Aeromonas allosaccharophila (taxid:656)”
- This gave 100% match to another Aeromonas :)
- Changed exclusion group to “Aeromonas (taxid:642)”
- Works now.
- Need to automate this single run locally with Python.
- Ran locally from commandline. Output is not tabulated. Also run time was 40x longer than running web query from NCBI!
- Problem may be that nr database is 38 GB. Talked to Thiberio. He said to just run overnight and not worry about timing for just 128.