Skip to content

Instantly share code, notes, and snippets.

@omsai
Last active November 20, 2015 03:06
Show Gist options
  • Select an option

  • Save omsai/6627d40828f38940ef47 to your computer and use it in GitHub Desktop.

Select an option

Save omsai/6627d40828f38940ef47 to your computer and use it in GitHub Desktop.

Lab notebook

TODO Colorize distances using greengenes.lbl.gov and eletoolkit.org

<2015-11-19 Thu> Presentation

  • Presenting filtering algorithm using code was a bad idea. Better to illustrate using sample data.
  • Ignored the “from Aeromonas” transfters since the data was weak.
  • Committed changes so far to GitHub.

<2015-11-18 Wed> Tree HGT analysis

  • Completed filtering of data using Peter’s new guidelines.
    • Thiberio confirmed and explained the guidelines.
    • Learned Python pandas function to broadcast index using reindex() and filtering grouped data with a lambda function using filter().
  • Deleted previous output data and reran scripts from 03- to 08-.
    • There is a bug in 06- where mafft tries to align all.faa, so after looking at top one has to have to ^C kill 06-.
  • Aliased sshfs with default options to reconnect to giardia to be able to visualize trees with seaview.
  • Visualized paraphyletic trees with seaview for cases of HGT.
    • To Aeromonas: 103962 160838 246298* 320665 320673 320678* 70966
    • From Aeromonas: 299986 320664 320669 320674 320680 320681 320684 320687* 321038 7135*

<2015-11-12 Thu> Collaborator meeting

  • Met with collaborators to show T3SS work so far.
  • Needed to relax tree branch selection.

<2015-11-11 Wed> Review and tree annotation

  • Analyzed non-monophylic trees for horizontal gene transfer.
  • Needed to root the tree; used tree.root_at_midpoint() but it’s not great way to do it.
  • Globbing is not great practise; replaced with os.listdir and os.path.join.
  • Incorporated species to tree names.

<2015-11-10 Tue> Tree monophily test

  • Using Bio.Phylo of biopython to read the newick trees and test for monophily:
    • Must only check the Aeromonas branches for monophily.
    • Created new Python script to test for monophily.
    • Found only 18 of 128 trees are monophylic.

<2015-11-05 Thu> FastTree build

  • mafft:
    • Talked to Thiberio about all the mafft runs completing in less than a second and whether argument options should be added on to mafft. He said that we would need --reorder and --auto and drew on the whiteboard explaining why.
    • Reran mafft with new options.
  • Tree building:
    • Tried to install FastTree using homebrew, but running it fails with:
      $ FastTree
      dyld: Library not loaded: /Users/omsai/apps/homebrew/lib/gcc/4.9/libgomp.1.dylib
      Referenced from: /Users/omsai/apps/homebrew/bin/FastTree
      Reason: image not found
              
    • Thiberio said he installed FastTree from the website instead of homebrew.
      • Mac install requires compilation of a single C file.
      • Created Makefile in ./apps/fastree/ to download and compile it.
        • Mac doesn’t even have wget!
    • Created and ran parallel FastTree script. Longest tree was built in ~5 seconds.
  • Viewing trees:
    • Trees are saved in newick format. Checked Gentoo Portage for available commandline viewers. Installed newick-utils from homebrew and used nw_display.
    • Original notes from project briefing indicate using BioPython to view tree. Installed from virtual env:
      source ~/apps/venv-t3ss/bin/activate
      pip search biopython
      pip install biopython
              

<2015-11-04 Wed> Sequence alignment with mafft

  • Refactored blast+ manipulation into new blastplus library and BlastProcessor class.
  • Wrote script to concatenate sequence using pre-blast original filenames, 05-concatenate_homologous_and_top_hit_sequence_files.py.
    • Discovered bug in NR sequence retrieval script 04-get_sequences_of_top_hits.py of not adding the comment line.
    • Fixed and reran NR sequence retrieval. Completed in 13 minutes.
    • Now adds the comment line. Verified using diff:
      for f in `ls ~/data/blastp-input/*.faa`
      do diff -u $f ~/data/mafft-input/$(basename $f | sed -e 's#\.faa#\.cat\.faa#')
      done
              
  • Wrote GNU Parallel script to run mafft.
    • Needed to pipe files to stdin to GNU Parallel instead of using ::: format.
    • One has to end command with double-quotes to processes GNU Parallel replacement strings (the {}, etc).

<2015-11-03 Tue> Sequence alignment with mafft

  • mafft is already installed:
    mafft --version
    v7.221 (2014/04/16)
        
  • Thiberio explained what alignment actually does, the holy trinity of aligners, that tree building can be done with alignment, etc.
  • Committed yesterday’s changes to GitHub.
  • Alignment program:
    • Input will be fasta files with all the homologous sequences and their blast match sequences in a single file.
    • Output will be aligned sequences with dashes.
    • Run with parallel. Maybe 3 threads and 4 sequences at a time.

<2015-11-02 Mon> Replacing problematic samtools faidx

  • Samtools failing to find sequences in nr database:
    • Thiberio confirmed the fault with samtools failing after a few tries. He suspects this is due to hardware issues, and one could iteratively try each sequence till completion.
    • Tried to just use parallel with one run per file, since it supports automatic retries, etc, but fails with a single thread.
    • Observing htop indicates samtools faidx eats up all available memory and starts swapping which probably leads to a crash, and false error message.
  • Thiberio suggested read the nr database directly with Python’s xreadlines() which is iterates the file without loading it all into memory. This ios a replacement for using samtools.
    • Was not able to retrieve 8 sequences using just the first identifier in the comment line.
    • Using all sequences in the comment line with regex and sets retrieved all 383 sequences of interest.
    • Run completed in about 15 minutes.

<2015-10-26 Mon> Samtools faidx

  • Talked to Thiberio about coding best practices:
    • Personally he reviews the README files and script help output.
    • Doesn’t actually push data through them.
    • Some journal reviewers will run code parametrically if it’s easy; e.g. web app.
    • Found paper (DOI: 10.1371/journal.pcbi.1000424) mentioned by Bill Mills’ “Bioinformatics” lesson tags here: https://mozillascience.github.io/studyGroupHandbook/lessons.html#lessons
  • Spoke to Prof. Gogarten about possible PhD projects.
    • Read papers Dan Needleman’s papers of Evolution overlaping will cell biology.
    • Hard to a niche area since many people exporing cell biology this way.
    • Our value proposition is thoroughness of work and insights.
  • Committed previous weeks’ code to GitHub.
  • Samtools only seems to process 5 parameters at a time and cuts off.
    • Could it be OSX commandline length limitation? No, since getconf ARG_MAX returns 262144 and wc returns only 924 characters total for the subject_ids.ref reference file.
    • Fault seems to be that samtools dies at the first sign of trouble. It choked on not finding the 5th region parameter.
    • Tried using redirection operator instead of piping file output, but that makes it ignore all parameters, causing it to reindex the nr database in 7 minutes wall time.
    • Hmm.. could do some dirty hacks of dropping parameters, or actually fix samtools faidx. GitHub seems fairly active, so chances of fix being reviewed and accepted is high and worthwhile.

<2015-10-22 Thu> Rerun of BlastP

  • blastp output was filled with just one column output.
    • Turns out one has to both, specify the numeric format and each of the columns.
    • Reran blast watching output with tail -f /path/to/all.fasta.blastp
  • Read review papers about Inteins to help understand Shannon’s talk tomorrow.
    • Read about Homing Endonuclease (HEN). Thus now understand the “homing cycle” reference from the practise talk.
  • Talked to Sheila about forming her research interest, forming a committee, and submitting a proposal. One can start with PI project and then find one’s own interest.

<2015-10-21 Wed> Grepping sequences with samtools faidx

  • Thiberio explained that to compare sequence length one has to retrieve sequence the compared sequence of the “Subject ID” using another utility, samtools faidx.
    brew search samtools
    brew install homebrew/science/samtools
        
  • samtools only takes text input file. Thiberio clarified that our NR “database” is a decomressed text file, which is why it’s 38 GB.
  • Matt helped explain the broad scope of the problems the lab solves.
  • samtools faidx keeps complaining of not being able to any queries with “[fai_fetch] Warning - Reference not found in FASTA file, returning empty sequence”. This is because the NR database was not indexed. Indexed using samtools faidx nr. Took about 15-20 minutes.
  • Told Thiberio that there of the 128 homologous groups containing 3118 genes, I got 1344 unique subject ID hits.
    • He thought this was odd since one would expect a set of homologous groups to map to one or two other organisms at most.
    • Need to write a script to find which homologous groups are producing so many results.
  • Wrote unique ‘subject id’ hits to file to run through samtools faidx.
  • Thiberio figured out that the problem was some of the hits were Archeae. Our original negative_gilist we got from NCBI was incomplete. Thiberio chose all the databases this time and confirmed using the new gilist filters out most of the previous results.
  • blastp kept complaining about not being able to write to output file:
    Error: NCBI C++ Exception:
        T0 "/private/tmp/blast20151014-14721-16up1aj/ncbi-blast-2.2.31+-src/c++/src/corelib/ncbifile.cpp", line 5416: Error: ncbi::CMemoryFileMap::CMemoryFileMap() - To be memory mapped the file must exist:
        
    • This was because it’s unable to parse the double quotes around the filename.
  • Setup new blastp to run on corrected sequence.gi exclusion list.

<2015-10-20 Tue> Blastp analysis

  • All blasp output is only 113M, so it should be fine to read all of it at once for analysis:
    $ du -h ~/data/blastp-output/
    113M    /Users/omsai/data/blastp-output/
        
  • Concatenated data into single file to make reading with pandas easier using new shell script concatenate-blastp.sh
  • The final step of selecting the maximum values to the pandas DataFrame fails with: “Applying IndexError: positional indexers are out-of-bounds”.
    • This was because pandas preserves the original indices when filtering.
    • Typically one would reindex, but there’s no need since the dataset is only about 100 MB one may as well keep all the data in memory and reference it.
  • While the “evalue” filter strips 14% of all the data, all the “bit score” rows are anyway within the filter; so the filter has no effect.
  • Not sure about next step. Posted notes from initial briefing online for clarification: https://public.etherpad-mozilla.org/p/G9HSaDsHNa

<2015-10-19 Mon> Blastp still running

  • Thiberio did some profiling and found that using -negative_gilist is intensive
  • In any case, it would have been quicker to cat all the input files *.faa and then run blastp once on them albeit in a single thread.
  • All blast processes finished in the early afternoon.
  • Too tired in the evening, so chatted with Thiberio and saw his 6 vs 9 tree comparison.

<2015-10-18 Sun> Run blastp

  • Completed shell script to generate blastp output.
  • Ran script overnight.

<2015-10-15 Thu> Exclusion list for BlastP

  • Discussed issue of filtering GI number with Thiberio.
    • Confirmed approach of not doing filtering with blastp and to do it later with Python.
    • But actually had a look at output and saw it flooded with Aeromonas. Need to filter by GI list. Thiberio fetched from NIH website via GenBank only. Saved GI list file to data/blastp-input/sequence.gi.txt
    • Applying option for -negative_gilist fails with error:
      BLAST Database error: GI list specified but no ISAM file found for GI in /Volumes/Macintosh HD 2/thiberio/blast_dbs/nr.00
              
    • Thiberio fixed by rebuiling nr file with makeblastdb using -parse_seqids option. Normally we don’t do this since it adds on about 30 minutes to the build.
  • Create Pandas table from blastp output:
    • While local NR is rebuilding, downloaded blastp “Hit Table(text)” output from NIH website. Copied to data/nih-output/
    • Pushed Python script and lab notes to GitHub. Cloning using git:// works, but push does not. Changed git config to use https instead of git.

<2015-10-14 Wed> Setup Homebrew and blastp

  • Setup BLAST+ per NIH standalone setup:
    • Fetched installation files into Downloads/
    • One has to use the ftp program per the documentation, as the direct FTP link never seems to resolve.
    • Installed into apps/ folder and added to the PATH:
      cd ~/Downloads/
      md5 -q ncbi-blast-2.2.31+-universal-macosx.tar.gz && cat ncbi-blast-2.2.31+-universal-macosx.tar.gz.md5
      cd ~/apps/
      tar zxvpf ../Downloads/ncbi-blast-2.2.31+-universal-macosx.tar.gz
      echo 'PATH=${PATH}:~/apps/ncbi-blast-2.2.31+/bin/' >> ~/.bash_profile
      source ~/.bash_profile
              
  • Downloaded NR database as described in BLAST help:
    • Created data/ folder and downloaded the NR database. This took about 20 minutes!
      mkdir ~/data
      cd ~/data/
      update_blastdb.pl nr
              
  • Talked to Thiberio:
    • Advised using homebrew for blast instead of installing directly from NIH.
    • We already have NR alread here which we should use: /Volumes/Macintosh HD 2/thiberio/blast_dbs
    • The 128 FASTA files are stored at: /Volumes/Macintosh HD 2/thiberio/t3ss_fasta_files
    • Showed me how to use write to communicate between users in terminal. But we really need a a better messaging app. Decided on Telegram.
  • Installed homebrew per brew man page and online instructions.
    mkdir homebrew && curl -L https://github.com/Homebrew/homebrew/tarball/master | tar xz --strip 1 -C homebrew
    echo 'PATH=~/apps/homebrew/bin:${PATH}' >> ~/.bash_profile
    echo 'export HOMEBREW_CACHE=~/apps/homebrew/Library/Cache/' >> ~/.bash_profile
    source ~/.bash_profile
    mkdir -p ~/apps/homebrew/Library/Cache/
        
  • Installed blast from homebrew:
    brew install homebrew/science/blast
        
  • Setup Python virtualenv
    mkdir -p ~/apps/venv-t3ss
    virtualenv ~/apps/venv-t3ss
    source ~/apps/venv-t3ss/bin/activate
    pip install --upgrade pip
    pip install ipython
    pip install pandas
        
  • Read BLAST user manual.
  • Tested BLAST on web page:
    • Used file 10381_35.2008.faa
    • Excluded results from “Aeromonas allosaccharophila (taxid:656)”
      • This gave 100% match to another Aeromonas :)
    • Changed exclusion group to “Aeromonas (taxid:642)”
      • Works now.
      • Need to automate this single run locally with Python.
      • Ran locally from commandline. Output is not tabulated. Also run time was 40x longer than running web query from NCBI!
      • Problem may be that nr database is 38 GB. Talked to Thiberio. He said to just run overnight and not worry about timing for just 128.
commit :
git add labnotes*.org
git add Makefile
git commit -m "ENH: Record up to $(shell date +%Y-%m-%dT%H:%M)"
git push
.phony : commit
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment