Created
June 14, 2016 16:35
Gist: dwinter/9a13124eef0b3f49a591561b548c3cc7
Parse codeml results with Biopython
{ | |
"metadata": { | |
"name": "", | |
"signature": "sha256:970b8f1630c9347c2ef1dda47d3877aa72f80361044da71ba165eabe6ad6d078" | |
}, | |
"nbformat": 3, | |
"nbformat_minor": 0, | |
"worksheets": [ | |
{ | |
"cells": [ | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"from collections import namedtuple\n", | |
"\n", | |
"from Bio.Phylo.PAML import codeml\n", | |
"from scipy.stats import chi2" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 1 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#Getting data from a codeml results file\n", | |
"\n", | |
"`codeml` produces.... very verbose results files. Most of the informaton in one of those files is not very helpful for any given analylsis so it's helpful to know how to get just the information you want. \n", | |
"\n", | |
"This directoty has two results files, creating by fitting a null model (in which all branches evolve with a single best-fit $\\omega$, and another model in which one branch is free to have its own rate (in this case that branch leads to us). " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"! ls" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"codeml_branch.txt codeml_null.txt Parse_codeml.ipynb\r\n" | |
] | |
} | |
], | |
"prompt_number": 2 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We can use [`Biopython`'s `codeml` module](http://biopython.org/wiki/PAML) to parse these results. `codeml.read` returns a dictionary\"" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"branch_result = codeml.read(\"codeml_branch.txt\")\n", | |
"branch_result.keys()" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 3, | |
"text": [ | |
"['codon model', 'model', 'version', 'NSsites']" | |
] | |
} | |
], | |
"prompt_number": 3 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Most of the values are simple descriptions of the model used to geenerate the results:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"[branch_result[k] for k in ['version', 'codon model', 'model']]" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 4, | |
"text": [ | |
"['4.8', 'F3x4', 'several dN/dS ratios for branches for branches, ']" | |
] | |
} | |
], | |
"prompt_number": 4 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The good stuff is in `NSsites` , which contains another dictionary, with keys corresponding to the _number_ of the models described in the results file. In this case each file only contains results from one model, so there is a single result in this file, so just one key" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"branch_result['NSsites'].keys()" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 5, | |
"text": [ | |
"[0]" | |
] | |
} | |
], | |
"prompt_number": 5 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The model itself is... another dictionary. Again most of the keys are mapped to simple and self-explanatory paramters. Probably the most important of these simple one are `lnL` -- the log likelihood of the model and `omega tree` which is the tree with $\\omega$ values added" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"branch_data = branch_result['NSsites'][0]\n", | |
"branch_data[\"lnL\"], branch_data[\"omega tree\"]" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 6, | |
"text": [ | |
"(-2464.038185,\n", | |
" '(pan_troglodytes #0.4236 , (homo_sapiens #0.9774 , (macaca_mulatta #0.4236 , pongo_abelii #0.4236 ) #0.4236 ) #0.4236 , gorilla_gorilla #0.4236 );')" | |
] | |
} | |
], | |
"prompt_number": 6 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"It's probably possible to parse the omega result out from this tree. But the results are indluded in _another_ nested dictionary. The `omega` element here has the estimated $\\omega$ values for the background and then any foreground branches" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"branch_data[\"parameters\"][\"omega\"]" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 7, | |
"text": [ | |
"[0.42362, 0.97743]" | |
] | |
} | |
], | |
"prompt_number": 7 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"(The other thing you might want to get as is the table data with one line per branch. This is _yet another dictionary_ in the results object!)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"branch_data = branch_result[\"NSsites\"][0]\n", | |
"branch_data.keys()" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 8, | |
"text": [ | |
"['description',\n", | |
" 'parameters',\n", | |
" 'omega tree',\n", | |
" 'tree',\n", | |
" 'dN tree',\n", | |
" 'lnL',\n", | |
" 'tree length',\n", | |
" 'dS tree']" | |
] | |
} | |
], | |
"prompt_number": 8 | |
}, | |
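{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The per-branch table from the results file sits under `parameters`. In the Biopython versions I've used it appears under a `\"branches\"` key, with one dictionary entry per branch (treat the key and field names as version-dependent, and check `branch_data[\"parameters\"].keys()` on your install):" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"branch_data[\"parameters\"][\"branches\"]" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |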
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## A functon to extract the interesting bits\n", | |
"\n", | |
"OK, so now we know what the results dictionary has, what inforation can we gather for our two files. To compare the null and branch models we need a likelihood, so we need `lnL` from each file. We are also interested in the mean $\\omega$ across all taxa, and in particular in the foreground branch.\n", | |
"\n", | |
"(I'm also going to used a named tuple to create a new class to represnt the result -- we could just as easily used a `list`, but I like the way these self-document)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"CodemlSummary = namedtuple(\"CodemlSummary\", \n", | |
" [\"lnL_null\", \"lnL_branch\", \"omega_null\", \"omega_background\", \"omega_foreground\"])\n", | |
"\n", | |
"def extract_codeml_results(null_file, branch_file):\n", | |
" \"\"\"Summarise the results of two runs of codeML on the same data\n", | |
" \n", | |
" Takes two files, one represnting a null model and another with a single \"foreground\" branch\n", | |
" (i.e. a branch with its own omega value) and extracts relvant information from each\n", | |
" \n", | |
" Returns a CodemlSummary object, with attributes\n", | |
" \"lnL_null\", \"lnL_branch\", \"omega_null\", \"omega_foreground\", \"omega_background\"\n", | |
" \"\"\"\n", | |
" null_result = codeml.read(null_file)[\"NSsites\"][0]\n", | |
" branch_result = codeml.read(branch_file)[\"NSsites\"][0]\n", | |
" result = CodemlSummary(null_result[\"lnL\"], \n", | |
" branch_result[\"lnL\"], \n", | |
" null_result[\"parameters\"][\"omega\"], \n", | |
" *branch_result[\"parameters\"][\"omega\"])#unpack a list of two\n", | |
" return(result)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 13 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"res = extract_codeml_results(null_file = \"codeml_null.txt\", branch_file = \"codeml_branch.txt\")\n", | |
"res" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 17, | |
"text": [ | |
"CodemlSummary(lnL_null=-2464.534131, lnL_branch=-2464.038185, omega_null=0.45265, omega_background=0.42362, omega_foreground=0.97743)" | |
] | |
} | |
], | |
"prompt_number": 17 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"##Use the results \n", | |
"Because a named tuple is just a tuple, we can convert it to a list if we just want to print all of its contents" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"list(res)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 15, | |
"text": [ | |
"[-2464.534131, -2464.038185, 0.45265, 0.42362, 0.97743]" | |
] | |
} | |
], | |
"prompt_number": 15 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"But we can also refer to elemnts by name, for instance we coul do the likelihood ratio test with one degree fo freedom" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"D = 2 * (res.lnL_branch - res.lnL_null)\n", | |
"pval = 1.0 - chi2.cdf(D, 1)\n", | |
"pval" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 16, | |
"text": [ | |
"0.31928039243135342" | |
] | |
} | |
], | |
"prompt_number": 16 | |
} | |
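, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"(`scipy.stats` distributions also expose a survival function, `sf`, which computes the upper-tail probability directly; `chi2.sf(D, 1)` should match `1.0 - chi2.cdf(D, 1)` while avoiding precision loss when p-values are very small:)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"chi2.sf(D, 1)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
} | |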
], | |
"metadata": {} | |
} | |
] | |
} |