ElDeveloper · September 16, 2014 13:20
diff --git a/hu-cfar.ipynb b/hu-cfar.ipynb
 {
 "metadata": {
  "name": "",
  "signature": "sha256:c62c1c7e3da5a30d5b95e957c548c6f3dc6e358dceed50f64349431a06c70717"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## QIIME Tutorial with the IPython Notebook\n",
      "\n",
      "We have deployed seven AWS (Amazon Web Services) EC2 (Elastic Compute Cloud) instances for the purpose of this tutorial. If you are reading this, you will have connected to one of these instances through your laptop's browser. The commands you issue will be executed by the instance you are connected to, and all computation and visualization will be done through the browser. You will not be downloading any files to your local machine for this tutotial.\n",
      "\n",
      "We utilize the IPython notebook for our tutorial because it is significantly easier for people who are unfamiliar with QIIME or the command line to use. When you have installed QIIME on your local machine or cluster, you will not need to use IPython to interact with it (and most people do not), although you are welcome to do so, and the full functionality is available. For more information on using QIIME with IPython, see [our recent paper](http://www.ncbi.nlm.nih.gov/pubmed/23096404). You can find more information on the IPython Notebook [here](http://ipython.org/ipython-doc/stable/interactive/htmlnotebook.html), and the nbviewer tool (which we use to display the notebook) [here](http://nbviewer.ipython.org/).\n",
      "\n",
      "## Notes/tips for using IPython\n",
      "\n",
      "IPython acts like a hybrid `python/bash` environment. Commands prefixed by a `'!'` character are issued to the shell (bash in this case). Commands not prefixed with `'!'` are issued to the python interpreter, and behave as they normally would in python. Each 'cell' of the notebook (cells with commands in them are surrounded by grey boxes) is executable.  Shift+Enter is the way you execute (or re-execute) the commands in a given cell. You must click in the cell to gain focus in that cell, and then type Shift+Enter or hit the ```play``` button above. Hitting Enter alone will just add  an additional line. Try executing the command below.\n",
      "\n",
      "**Important: Don't edit the contents of this first cell as it sets up key variables for the multiuser environment.**"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from os import chdir, mkdir, makedirs, path\n",
      "from tempfile import mkdtemp\n",
      "\n",
      "from IPython.display import FileLinks as ipFileLinks, FileLink as ipFileLink\n",
      "\n",
      "# if this directory already exists we should be good to go thus the -p argument\n",
      "!cd\n",
      "!mkdir -p tmp\n",
      "\n",
      "# to support running in a multi-user environment, each user will work in\n",
      "# a temporary working directory with a randomly generated name\n",
      "basedir = \"tmp\"\n",
      "\n",
      "working_dir = mkdtemp(prefix='stamps2014_', dir=basedir)\n",
      "\n",
      "otu_base = \"/home/ubuntu/qiime_software/gg_otus-13_8-release/\"\n",
      "reference_seqs = path.join(otu_base,\"rep_set/97_otus.fasta\")\n",
      "reference_tree = path.join(otu_base,\"trees/97_otus.tree\")\n",
      "reference_tax = path.join(otu_base,\"taxonomy/97_otu_taxonomy.txt\")\n",
      "\n",
      "print \"Your working directory is %s\" % working_dir\n",
      "chdir(working_dir)\n",
      "\n",
      "!wget https://s3.amazonaws.com/s3-qiime_tutorial_files/moving_pictures_tutorial-1.8.0.tgz\n",
      "!tar -xzf moving_pictures_tutorial-1.8.0.tgz\n",
      "\n",
      "# To use FileLink(s), but link to files in the user's working directory\n",
      "# we wrap the call to FileLink(s) to append the working_dir to the \n",
      "# url_prefix. NOTE: This is not something that you'll generally need to\n",
      "# do - it's only important as we're working with multiple users in the \n",
      "# IPython Notebook, which is currently only a single-user environment.\n",
      "def FileLinks(path):\n",
      "    return ipFileLinks(path,url_prefix='files/%s/' % working_dir)\n",
      "\n",
      "def FileLink(path):\n",
      "    return ipFileLink(path,url_prefix='files/%s/' % working_dir)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "To view output files, you will use the commands FileLink and FileLinks. Calling FileLink('some_file.txt') produces a standard html-like link to that file which you can click on. Clicking on the link will bring up a new browser tab with the contents of 'some_file.txt' displayed. Just to practice, try executing the commands in the following cell. You should an output of a blue html-link. Click this link."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!echo 'A test txt file.' > ./practice_filelink_2.txt\n",
      "FileLink('practice_filelink_2.txt')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Illumina Overview Tutorial: Moving Pictures of the Human Microbiome\n",
      "\n",
      "This tutorial covers a full QIIME workflow using Illumina sequencing data. This tutorial is is intended to be quick to run, and as such, uses only a subset of a full Illumina Genome Analyzer II (GAIIx) run. We'll make use of the ``13_8`` release of the [Greengenes](http://www.ncbi.nlm.nih.gov/pubmed/22134646) reference OTUs. You can always find a link to the latest version of the reference OTUs on the [QIIME resources page](http://qiime.org/home_static/dataFiles.html).\n",
      "\n",
      "The data used in this tutorial is derived from the [Moving Pictures of the Human Microbiome](http://www.ncbi.nlm.nih.gov/pubmed/21624126) study, where two human subjects collected daily samples from four body sites: the tongue, the palm of the left hand, the palm of the right hand, and the gut (via fecal samples obtained by swapping used toilet paper). This data was sequenced across six lanes of an Illumina GAIIx, using the barcoding amplicon sequencing protocol described in [Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample](http://www.ncbi.nlm.nih.gov/pubmed/20534432). A more recent version of this protocol that can be used with the Illumina HiSeq 2000 and MiSeq can be found [here](http://www.ncbi.nlm.nih.gov/pubmed/22402401). \n",
      "    \n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Throughout this tutorial we make use of a reference sequence collection, tree, and taxonomy derived from the Greengenes database. We defined the paths to these files in our first executed cell with the following code:\n",
      "\n",
      "```python\n",
      "otu_base = \"/home/ubuntu/qiime_software/gg_otus-13_8-release/\"\n",
      "reference_seqs = path.join(otu_base,\"rep_set/97_otus.fasta\")\n",
      "reference_tree = path.join(otu_base,\"trees/97_otus.tree\")\n",
      "reference_tax = path.join(otu_base,\"taxonomy/97_otu_taxonomy.txt\")\n",
      "```"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Starting out\n",
      "Start by seeing what files are in our tutorial directory. We can do this using `ls` as we would on the command line, but in this case we prefix with an `!` to tell IPython that we're issuing a `bash` (i.e., command line) command, rather than a python command."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!ls moving_pictures_tutorial-1.8.0"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can use the FileLinks formatting discussed above to view whats in the tutorial folder."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "FileLinks('moving_pictures_tutorial-1.8.0')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Check our mapping file for errors\n",
      "\n",
      "The QIIME mapping file contains all of the per-sample metadata, including technical information such as primers and barcodes that were used for each sample, and information about the samples. In this data set we're looking at human microbiome samples from four sites on the bodies of two individuals at mutliple time points. The metadata in this case therefore includes a subject identifier, a timepoint, and a body site for each sample. You can review the ``filtered_mapping_l1.txt`` at the link in the previous cell to see an example of the data.\n",
      "\n",
      "In this step, we run ``validate_mapping_file.py`` to review the mapping file to confirm that it's in QIIME-compatible format. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!validate_mapping_file.py \\\n",
      "-o moving_pictures_tutorial-1.8.0/illumina/cid_l1/ \\\n",
      "-m moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l1.txt"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In this case there were no errors, but if there were we would review the resulting html summary to find out what errors are present. You could then fix those in a spreadsheet program or text editor. To view that html file, call ``FileLinks`` on the output directory from the previous step and click the link to the ``html`` file."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "FileLinks('moving_pictures_tutorial-1.8.0/illumina/cid_l1/')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
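    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As a rough picture of what a mapping file holds, the following minimal sketch parses a tiny, made-up mapping file in plain Python. It assumes the usual QIIME layout (tab-separated text, a header line starting with `#SampleID`, and `Description` as the last column). The sample IDs, barcodes, and primer are invented for illustration, and the two checks are only examples of the kind of problem a validator can look for, not the logic of ``validate_mapping_file.py`` itself.\n",
      "\n",
      "```python\n",
      "# made-up rows; a real mapping file stores these as tab-separated lines\n",
      "rows = [\n",
      "    ['#SampleID', 'BarcodeSequence', 'LinkerPrimerSequence', 'SampleType', 'Description'],\n",
      "    ['sample.1', 'AAAACCCCGGGG', 'GTGCCAGC', 'gut', 'example gut sample'],\n",
      "    ['sample.2', 'TTTTGGGGCCCC', 'GTGCCAGC', 'tongue', 'example tongue sample'],\n",
      "]\n",
      "mapping_text = '\\n'.join('\\t'.join(row) for row in rows)\n",
      "\n",
      "# parse the text back into a header and one metadata dict per sample\n",
      "lines = mapping_text.split('\\n')\n",
      "header = lines[0].lstrip('#').split('\\t')\n",
      "samples = {}\n",
      "for line in lines[1:]:\n",
      "    fields = line.split('\\t')\n",
      "    samples[fields[0]] = dict(zip(header[1:], fields[1:]))\n",
      "\n",
      "# two example checks: unique barcodes, and Description as the last column\n",
      "barcodes = [md['BarcodeSequence'] for md in samples.values()]\n",
      "print(len(set(barcodes)) == len(barcodes))\n",
      "print(header[-1] == 'Description')\n",
      "```"
     ]
    },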
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For the sake of illustrating what errors in a mapping file might look like, we've created a bad mapping file (``filtered_mapping_l1.bad.txt``) and provided that as an example. Call ``validate_mapping_file.py`` on the file ``filtered_mapping_l1.bad.txt``, and then view the html output. What are the issues with that mapping file?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!validate_mapping_file.py \\\n",
      "-o moving_pictures_tutorial-1.8.0/illumina/bad_output/ \\\n",
      "-m moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l1.bad.txt"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "FileLinks('moving_pictures_tutorial-1.8.0/illumina/bad_output/')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Demultiplexing and quality filtering seqeunces\n",
      "\n",
      "We next need to demultiplex our sequences (i.e. assigning barcoded reads to the samples they are derived from). In general, you should get seperate fastq files for your sequence and barcode reads. On the multiple-lane Illumina platforms, we typically reuse barcodes across lanes, so we must demultiplex each lane independently. To do that, run the following command (*will run for a few minutes*).\n",
      "\n",
      "This is a big command, but it's relatively straight-forward. We're telling QIIME that we have six lanes of sequence data (specified as a comma-separated list of files passed as ``-i``), six lanes of barcode data (specified as a comma-separated list of files passed as ``-b``), and a metadata mapping file corresponding to each lane (specified as a comma-separated list of files passed as ``-m``). The metadata mapping file contains the sample-to-barcode mapping that we need for demultiplexing. \n",
      "\n",
      "**Important**: The order of files passed for ``-m``, ``-b``, and ``-i`` must be consistent, so if you pass the lane 1 sequence data first for ``-i``, you must pass the lane 1 barcode data first for ``-b``, and the lane 1 metadata mapping file first as ``-m``. The only other parameter here is the output directory, which we'll call ``slout``, for *split libraries output*."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!split_libraries_fastq.py \\\n",
      "-o moving_pictures_tutorial-1.8.0/illumina/slout/ \\\n",
      "-i moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_1_sequence.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_2_sequence.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_3_sequence.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_4_sequence.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_5_sequence.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_6_sequence.fastq -b moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_1_sequence_barcodes.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_2_sequence_barcodes.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_3_sequence_barcodes.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_4_sequence_barcodes.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_5_sequence_barcodes.fastq,moving_pictures_tutorial-1.8.0/illumina/raw/subsampled_s_6_sequence_barcodes.fastq -m moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l1.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l2.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l3.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l4.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l5.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l6.txt"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We often want to see the results of running a command. Here we can do that by calling our output formatter again, this time passing the output directory from the previous step."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "FileLinks('moving_pictures_tutorial-1.8.0/illumina/slout/')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!count_seqs.py -i moving_pictures_tutorial-1.8.0/illumina/slout/seqs.fna"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
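    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "To make the idea of per-lane demultiplexing concrete, here is a minimal conceptual sketch in plain Python. It is not how ``split_libraries_fastq.py`` is implemented (there is no quality filtering or barcode error correction here), and the barcodes, reads, and sample names are all made up. It only illustrates why the lane order of the sequence, barcode, and mapping inputs has to agree: the same barcode means a different sample in each lane.\n",
      "\n",
      "```python\n",
      "# hypothetical barcode-to-sample maps, one per lane (barcodes are reused)\n",
      "lane_maps = [\n",
      "    {'AAAA': 'L1.gut', 'CCCC': 'L1.tongue'},  # from the lane 1 mapping file\n",
      "    {'AAAA': 'L2.gut', 'CCCC': 'L2.palm'},    # from the lane 2 mapping file\n",
      "]\n",
      "# hypothetical (barcode, sequence) reads per lane, in the same lane order\n",
      "lane_reads = [\n",
      "    [('AAAA', 'ACGT'), ('CCCC', 'GGTA'), ('TTTT', 'ACCA')],\n",
      "    [('AAAA', 'TTGA'), ('CCCC', 'CAGT')],\n",
      "]\n",
      "\n",
      "assigned = {}\n",
      "unassigned = 0\n",
      "# zip pairs lane i's reads with lane i's mapping, so the order must match\n",
      "for barcode_to_sample, reads in zip(lane_maps, lane_reads):\n",
      "    for barcode, seq in reads:\n",
      "        sample = barcode_to_sample.get(barcode)\n",
      "        if sample is None:\n",
      "            unassigned += 1  # barcode not listed for this lane\n",
      "        else:\n",
      "            assigned.setdefault(sample, []).append(seq)\n",
      "\n",
      "print(sorted(assigned))\n",
      "print(unassigned)\n",
      "```"
     ]
    },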
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "OTU picking: using an open-reference OTU picking protocol by searching reads against the Greengenes database."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now that we have demultiplexed sequences, we're ready to cluster these sequences into OTUs. There are three high-level ways to do this in QIIME. We can use a *de novo*, *closed-reference*, or *open-reference OTU picking*. Open-reference OTU picking is currently our preferred method. Discussion of these methods can be found in [QIIME's OTU picking document](http://www.qiime.org/tutorials/otu_picking.html). Additionally, we recently wrote a [paper](https://peerj.com/preprints/411/)  (in press) describing a variation of open-reference OTU picking that allows the protocol to scale to very large datasets (billions of sequences).\n",
      "\n",
      "Here we apply open-reference OTU picking, which can require about 10 minutes to run on this data set. \n",
      "\n",
      "Note that this command takes the ``seqs.fna`` file that was generated in the previous step, as well as the reference fasta file (``$reference_seqs`` here). We're also taking a shortcut here for the sake of reduced run time: we're using the *fast uclust* parameters. To allow this to run in a just a few of minutes, we're using parameters that are optimized for reduced runtime at the expense of accuracy. These correspond to ``uclust``'s default parameters. QIIME uses slightly more stringent parameter settings by default. These parameters are specified the the *parameters file* which is passed as ``-p``. You can find information on defining parameters files [here](http://www.qiime.org/documentation/file_formats.html#qiime-parameters)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!pick_open_reference_otus.py \\\n",
      "-o moving_pictures_tutorial-1.8.0/illumina/otus/ \\\n",
      "-i moving_pictures_tutorial-1.8.0/illumina/slout/seqs.fna \\\n",
      "-r $reference_seqs -p moving_pictures_tutorial-1.8.0/uc_fast_params.txt"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "If you wanted to use a different OTU picking strategy (closed reference or de-novo) you would have to specify a different command at this step. Below is an example for closed reference picking. We aren't going to run this command (we generally don't recommend this in practice because it can discard important novel sequences), but it can be useful for integrating the results of multiple studies in meta-analyses, so it is below for your reference.\n",
      "\n",
      "```bash\n",
      "pick_closed_reference_otus.py -o moving_pictures_tutorial-1.8.0/illumina/otus/ -i moving_pictures_tutorial-1.8.0/illumina/slout/seqs.fna -r $reference_seqs -t $reference_tax -p moving_pictures_tutorial-1.8.0/uc_fast_params.txt\n",
      "```\n",
      "\n",
      "**Important:** if you run closed reference picking, your reference tree for alpha and beta diversity analysis will be the Greengenes tree (since you are discarding any sequences which don't map to the Greengenes OTUs whose phylogeny has already been computed)."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The primary output from the picking command is the *OTU table*, a table that contains the number of times each operational taxonomic unit (OTU) is observed in each sample. QIIME uses the Genomics Standards Consortium *standard* Biological Observation Matrix (BIOM) format for representing these files. You can find additional information on the BIOM format [here](http://www.biom-format.org), and information on converting these files to tab-separated text that can be view in spreadsheet programs [here](http://biom-format.org/documentation/biom_conversion.html). The ``rep_set.tre`` file is also essential for downstream phylogenetic diversity calculations.\n",
      "\n",
      "To view the output of this command, call ``FileLinks`` on the output directory."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "FileLinks('moving_pictures_tutorial-1.8.0/illumina/otus/')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "To see some summary statistics of the OTU table we can run the following command.\n",
      "\n",
      "One key piece of information you need to pull from this output is the depth of sequencing that should be used in diversity analyses. Many of the analyses that follow require that there are an equal number of sequences in each sample, so you need to review the *Counts/sample detail* and decide what depth you'd like. Any samples that don't have at least that many sequences will not be included in the analyses, so this is always a trade-off between the number of sequences you can include in your analysis and the number of samples you have to throw away to control for sequencing depth. For some perspective on this, see [Kuczynski 2010](http://www.ncbi.nlm.nih.gov/pubmed/20441597)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!biom summarize-table \\\n",
      "-i moving_pictures_tutorial-1.8.0/illumina/otus/otu_table_mc2_w_tax_no_pynast_failures.biom \\\n",
      "-o moving_pictures_tutorial-1.8.0/illumina/otus/otu_table_mc2_w_tax_no_pynast_failures.biom.stats\n",
      "\n",
      "!cat moving_pictures_tutorial-1.8.0/illumina/otus/otu_table_mc2_w_tax_no_pynast_failures.biom.stats"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
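    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The trade-off between sequencing depth and the number of samples retained can be explored with a few lines of Python. This is only a sketch: the per-sample counts below are invented, and in practice you would read them from the *Counts/sample detail* section of the summary. The helper name `retained` is hypothetical.\n",
      "\n",
      "```python\n",
      "# invented sequences-per-sample counts\n",
      "counts = {'s1': 146, 's2': 1114, 's3': 1500, 's4': 2450, 's5': 3100}\n",
      "\n",
      "def retained(counts, depth):\n",
      "    # samples with fewer than `depth` sequences would be dropped\n",
      "    return sorted(s for s, n in counts.items() if n >= depth)\n",
      "\n",
      "for depth in (100, 1000, 2000):\n",
      "    kept = retained(counts, depth)\n",
      "    # every retained sample contributes exactly `depth` sequences\n",
      "    print('depth=%d samples=%d sequences=%d' % (depth, len(kept), depth * len(kept)))\n",
      "```\n",
      "\n",
      "Raising the depth keeps more sequences per sample but fewer samples, which is the compromise you have to weigh for your own data."
     ]
    },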
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Create a single mapping file from the per-lane mapping files."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We started with six lanes of data but have now summarized these in a single OTU table. However, we still need to merge the per-lane mapping files into a single *combined* mapping file that represents all six lanes, and therefore all of our data. Note that we will have duplicated barcodes in our mapping file, but that's OK as we've already demultiplexed our reads. After demultiplexing and quality filtering, we don't use the barcodes again. We can merge the six mapping files as follows. From this point on, we'll work with ``combined_mapping_file.txt``."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!merge_mapping_files.py \\\n",
      "-o moving_pictures_tutorial-1.8.0/illumina/combined_mapping_file.txt \\\n",
      "-m moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l1.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l2.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l3.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l4.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l5.txt,moving_pictures_tutorial-1.8.0/illumina/raw/filtered_mapping_l6.txt"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's review the resulting mapping file. To view a single file (rather than a directory) we use the ``FileLink`` function instead of the ``FileLinks`` function."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "FileLink('moving_pictures_tutorial-1.8.0/illumina/combined_mapping_file.txt')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Beta Diversity"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Using the OTU table that we previously generated and with our full metadata, we can now proceed with the workflow and take a look at this dataset's beta diversity. To do this we will be using the [UniFrac](http://aem.asm.org/content/71/12/8228.abstract) distance metric in its weighted and unweighted form.\n",
      "\n",
      "QIIME provides an interface to compute the beta diversity distance matrix is called `beta_diversity.py`. Make sure you are passing in a phylogenetic tree because we want to compute a phylogeney based distance."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!beta_diversity.py \\\n",
      "-i moving_pictures_tutorial-1.8.0/illumina/otus/otu_table_mc2_w_tax_no_pynast_failures.biom \\\n",
      "-t moving_pictures_tutorial-1.8.0/illumina/otus/rep_set.tre \\\n",
      "-o beta"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
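    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "To give some intuition for why a tree is needed, here is a minimal sketch of the idea behind unweighted UniFrac: the fraction of observed branch length that leads to OTUs found in only one of two samples. This is a simplified illustration on a made-up toy tree, not QIIME's implementation. The `branches` structure, the OTU names, and the function name are assumptions made for this example, and the weighted variant (which also uses abundances) is not shown.\n",
      "\n",
      "```python\n",
      "# toy tree: each branch is (length, set of OTU tips that descend from it)\n",
      "branches = [\n",
      "    (1.0, {'otu1'}),\n",
      "    (1.0, {'otu2'}),\n",
      "    (0.5, {'otu1', 'otu2'}),\n",
      "    (2.0, {'otu3'}),\n",
      "]\n",
      "\n",
      "def unweighted_unifrac(otus_a, otus_b):\n",
      "    # assumes at least one branch is observed in one of the samples\n",
      "    unique = shared = 0.0\n",
      "    for length, tips in branches:\n",
      "        in_a = bool(tips & otus_a)\n",
      "        in_b = bool(tips & otus_b)\n",
      "        if in_a and in_b:\n",
      "            shared += length   # branch leads to both samples\n",
      "        elif in_a or in_b:\n",
      "            unique += length   # branch leads to only one sample\n",
      "    return unique / (unique + shared)\n",
      "\n",
      "print(unweighted_unifrac({'otu1', 'otu2'}, {'otu2', 'otu3'}))\n",
      "```"
     ]
    },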
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "However, distance matrices are not amenable to convenient visualization of this number of samples, so let's calculate the principal coordinates belonging to this dataset and make a PCoA plot (`pcoa` suffixed files are the principal coordinates files) using Emperor."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!principal_coordinates.py -i beta -o beta"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "FileLinks('beta/')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Unweighted UniFrac\n",
      "!make_emperor.py \\\n",
      "-i beta/pcoa_unweighted_unifrac_otu_table_mc2_w_tax_no_pynast_failures.txt \\\n",
      "-m moving_pictures_tutorial-1.8.0/illumina/combined_mapping_file.txt \\\n",
      "-o beta/unweighted_3d\n",
      "\n",
      "# Weighted UniFrac\n",
      "!make_emperor.py \\\n",
      "-i beta/pcoa_weighted_unifrac_otu_table_mc2_w_tax_no_pynast_failures.txt \\\n",
      "-m moving_pictures_tutorial-1.8.0/illumina/combined_mapping_file.txt \\\n",
      "-o beta/weighted_3d"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "FileLink('beta/unweighted_3d/index.html')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "FileLink('beta/weighted_3d/index.html')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Note** that Emperor will create two outputs `index.html` and a folder called `emperor_required_resources`. This file and directory depend on each other to be useful and/or informative. **If you ever plan on sharing an Emperor visualization with someone make sure you send both of these things.**"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Using the metadata\n",
      "\n",
      "Now let's insert some of the metadata into the plot. The *days_since_epoch* column in the mapping file contains only numeric values so we can easily sort the samples using that metadata column.\n",
      "We can take advantage of this information to draw a line between the samples of each body site for each of the two subjects that we are looking at in this dataset.\n",
      "Note that because we don't have any category in the mapping file that differentiates sample types within subjects, we will simply ask Emperor to concatenate the *subject* and the *SampleType* columns."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# connecting the samples by using lines\n",
      "!make_emperor.py -i beta/pcoa_unweighted_unifrac_otu_table_mc2_w_tax_no_pynast_failures.txt \\\n",
      "-m moving_pictures_tutorial-1.8.0/illumina/combined_mapping_file.txt \\\n",
      "-o beta/vectors_unweighted_3d \\\n",
      "--add_vectors \"subject&&SampleType,days_since_epoch\""
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "FileLink('beta/vectors_unweighted_3d/index.html')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
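    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Conceptually, the ``--add_vectors`` option above groups the samples by the combined *subject* and *SampleType* values and orders each group by *days_since_epoch*. The following sketch mimics that grouping and sorting on made-up rows. It is not Emperor's code, and the exact way the group labels are formed here (simple string concatenation) is an assumption.\n",
      "\n",
      "```python\n",
      "from collections import defaultdict\n",
      "\n",
      "# made-up rows: (sample id, subject, SampleType, days_since_epoch)\n",
      "rows = [\n",
      "    ('s1', '1', 'gut', '14300'),\n",
      "    ('s2', '1', 'gut', '14290'),\n",
      "    ('s3', '2', 'gut', '14295'),\n",
      "    ('s4', '1', 'tongue', '14291'),\n",
      "]\n",
      "\n",
      "trajectories = defaultdict(list)\n",
      "for sample_id, subject, sample_type, day in rows:\n",
      "    # combining both columns gives one group per subject and body site\n",
      "    trajectories[subject + sample_type].append((int(day), sample_id))\n",
      "\n",
      "# within each group, samples are connected in order of days_since_epoch\n",
      "ordered = {}\n",
      "for group, members in trajectories.items():\n",
      "    ordered[group] = [sample_id for day, sample_id in sorted(members)]\n",
      "\n",
      "print(ordered['1gut'])\n",
      "```"
     ]
    },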
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# using an explicit axis\n",
      "!make_emperor.py -i beta/pcoa_unweighted_unifrac_otu_table_mc2_w_tax_no_pynast_failures.txt \\\n",
      "-m moving_pictures_tutorial-1.8.0/illumina/combined_mapping_file.txt \\\n",
      "-o beta/explicit_axis_unweighted_3d \\\n",
      "-a \"days_since_epoch\"\n",
      "\n",
      "FileLink('beta/explicit_axis_unweighted_3d/index.html')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Summarize Taxa"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A helpful way to visualize the membership of your communities is to produce taxa summary plots. QIIME easily allows you to look at the breakdown of different taxonomic levels (kingdom to species level resolution, although it should be noted that the accuracy of finer grained levels of resolution suffers with short reads) of each of your samples using the summarize_taxa_through_plots.py workflow script. In the cell below, we bring up the help information for the script by passing ```-h```"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!summarize_taxa_through_plots.py -h"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "After bringing up the help description, you'll see that summarize_taxa_through_plots.py has two required parameters: an input OTU table in biom format, and a specified output directory in which to write the summarized tables and taxonomy HTML plots. In the cell below, run summarize_taxa_through_plots.py by passing these two required options and using default arguments for all the other parameters.\n",
      "\n",
      "Remember that our OTU table is located here:\n",
      "\n",
      "`moving_pictures_tutorial-1.8.0/illumina/otus/otu_table_mc2_w_tax_no_pynast_failures.biom`"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!summarize_taxa_through_plots.py \\\n",
      "-i moving_pictures_tutorial-1.8.0/illumina/otus/otu_table_mc2_w_tax_no_pynast_failures.biom \\\n",
      "-o qiime_rocks"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!ls qiime_rocks"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now that you've executed the command successfully, use the ```FileLink``` command on your specified output directory, and click on the ```bar_charts.html``` file that is within the newly created `taxa_summary_plots` directory. Take a few minutes to look through the taxa summaries at different taxonomic levels.\n",
      "\n",
      "The `summarize_taxa_through_plots.py` script can also summarize the taxa by a specific category in your mapping file by passing in a ```-m mapping_file``` and a ```-c mapping_category```. So, look at the mapping file and find a category header (i.e., the headers in the first line of the mapping file) that you'd like to summarize by. Then write the full command below:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Once again, use FileLink to open the output directory, and open the taxonomy plots:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "FileLinks('qiime_rocks')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
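    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The core of a taxa summary is collapsing OTU counts by their taxonomy down to a chosen level and converting the totals to relative abundances. The sketch below shows that idea for a single made-up sample with Greengenes-style taxonomy strings. It is a simplified illustration rather than the code behind ``summarize_taxa_through_plots.py``, and the function name `summarize` and the counts are hypothetical.\n",
      "\n",
      "```python\n",
      "from collections import defaultdict\n",
      "\n",
      "# made-up OTUs: (taxonomy string, count in one sample)\n",
      "otus = [\n",
      "    ('k__Bacteria; p__Firmicutes; c__Clostridia', 30),\n",
      "    ('k__Bacteria; p__Firmicutes; c__Bacilli', 10),\n",
      "    ('k__Bacteria; p__Bacteroidetes; c__Bacteroidia', 60),\n",
      "]\n",
      "\n",
      "def summarize(otus, level):\n",
      "    # keep the first `level` ranks (1 = kingdom, 2 = phylum, 3 = class)\n",
      "    totals = defaultdict(int)\n",
      "    for taxonomy, count in otus:\n",
      "        ranks = [rank.strip() for rank in taxonomy.split(';')[:level]]\n",
      "        totals[';'.join(ranks)] += count\n",
      "    grand_total = float(sum(totals.values()))\n",
      "    return dict((taxon, n / grand_total) for taxon, n in totals.items())\n",
      "\n",
      "phyla = summarize(otus, 2)\n",
      "print(phyla['k__Bacteria;p__Firmicutes'])\n",
      "```"
     ]
    },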
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#### Creating and visualizing BiPlots in Emperor\n",
      "\n",
      "Emperor can create a biplot using a summarized OTU table itself and adding spheres where each of the spheres represents an OTU, they are located between the samples that are differentiated by each sphere and the size of the sphere represents the relative abundance of the summarized OTU."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!make_emperor.py -i beta/pcoa_unweighted_unifrac_otu_table_mc2_w_tax_no_pynast_failures.txt \\\n",
      "-m moving_pictures_tutorial-1.8.0/illumina/combined_mapping_file.txt \\\n",
      "-o beta/biplot_unweighted_3d \\\n",
      "-t qiime_rocks/otu_table_mc2_w_tax_no_pynast_failures_L3.txt\n",
      "FileLink('beta/biplot_unweighted_3d/index.html')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Supervised Learning\n",
      "Supervised learning is a branch of machine learning that seeks to find patterns that differentiate classes of samples or features in a dataset. The 'supervised' part means that the algorithm first learns from labeled 'training' data (i.e., we supervise its learning about the world) and is then let loose on 'test' data. There are also 'unsupervised' learning techniques; PCoA (also known as classical MDS), used earlier in the tutorial, is one example. An excellent introduction to the subject and its application to microbial ecology can be found [here](http://onlinelibrary.wiley.com/doi/10.1111/j.1574-6976.2010.00251.x/abstract). The QIIME website also provides a [tutorial](http://qiime.org/tutorials/running_supervised_learning.html) that explains parts of our implementation of Random Forests for supervised learning.\n",
      "\n",
      "Here we are going to run an example of supervised learning using the metadata field ``SampleType``. Try the following command:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!supervised_learning.py \\\n",
      "-i moving_pictures_tutorial-1.8.0/illumina/otus/otu_table_mc2_w_tax_no_pynast_failures.biom \\\n",
      "-m moving_pictures_tutorial-1.8.0/illumina/combined_mapping_file.txt \\\n",
      "-c SampleType \\\n",
      "-e cv5 \\\n",
      "-o sl_SampleType"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Once that has finished, let's take a look at the output."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "FileLinks('sl_SampleType/')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The most important information is contained in ``summary.txt``. Let's open it by clicking on the ``summary.txt`` link, and then discuss it."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now let's look at the confusion matrix. Click on the ``confusion_matrix.txt`` link."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Finally, we are going to look at the feature importance scores. Click on the ``feature_importance_scores.txt`` link."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Group differences/category significances\n",
      "\n",
      "A traditional strategy for finding differences between categorical groups of samples is to compare the abundances of data features ('OTUs') using a statistical test. Traditional tests you might be familiar with are ANOVA and the t-test. Here we will use ``group_significance.py``. Its help documentation is extensive, and you should read it over before using the script on your own data. For this tutorial we are just developing familiarity with the commands, so we will skip that step for now."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!group_significance.py \\\n",
      "-i moving_pictures_tutorial-1.8.0/illumina/otus/otu_table_mc2_w_tax_no_pynast_failures.biom \\\n",
      "-m moving_pictures_tutorial-1.8.0/illumina/combined_mapping_file.txt \\\n",
      "-c SampleType \\\n",
      "-o gs_SampleType.txt"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's look at the output. The command below reads the results file and displays it as a table."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from IPython.display import HTML\n",
      "\n",
      "# read the output table written by group_significance.py\n",
      "with open('gs_SampleType.txt', 'r') as fd:\n",
      "    lines = fd.readlines()\n",
      "\n",
      "# build an HTML table: the first line holds the headers, the rest are data rows\n",
      "output_string = '<table border=\"1\">'\n",
      "for index, element in enumerate(lines):\n",
      "    # the header row uses <th> cells, every other row uses <td> cells\n",
      "    cell = '<th>%s</th>' if index == 0 else '<td nowrap>%s</td>'\n",
      "    fields = element.rstrip('\\n').split('\\t')\n",
      "    output_string += '<tr>%s</tr>' % ''.join([cell % t for t in fields])\n",
      "output_string += '</table>'\n",
      "\n",
      "# display it within the IPython notebook using the HTML function\n",
      "HTML(output_string)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Next steps\n",
      "\n",
      "This tutorial illustrated some of the basic features of QIIME, but there are many more places you can go from here. If you're interested in seeing additional visualizations, you should check out the [QIIME overview tutorial](http://www.qiime.org/tutorials/tutorial.html). The [Procrustes analysis tutorial](http://www.qiime.org/tutorials/procrustes_analysis.html) illustrates a really cool analysis, allowing you to continue with the same data used here and compare it against the samples sequenced on 454 (rather than Illumina, as in this analysis). If you're interested in some possibilities for statistical analyses, you can try the [distance matrix comparison](http://www.qiime.org/tutorials/distance_matrix_comparison.html) tutorial, which can be adapted to use data generated in this tutorial."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This tutorial was adapted from work by Greg Caporaso, Antonio Gonzalez, Will Van Treuren, Yoshiki Vazquez Baeza, Luke Ursell, and Adam Robbins Pianka."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Advanced notes\n",
      "\n",
      "These are some additional notes that we included for more experienced users. This is also a good place to keep your own notes.\n",
      "\n",
      "The IPython 'magic' functions that mirror common bash commands (e.g. ls, cd, pwd) are available without the '!' prefix.\n",
      "\n",
      "Commands such as less, grep, and vi can be called with '!', but their performance is drastically reduced and they can freeze the notebook."
     ]
    }
   ],
   "metadata": {}
  }
 ]
 }