@jdfreder
Last active December 18, 2015 17:08
My senior project paper.
{
"metadata": {
"name": "NbconvertRefactor"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Introduction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"IPython is an interactive *Python* computing environment[1]. It provides an enhanced interactive Python shell. The IPython Notebook is a browser based interface distributed with IPython. It enables the creation of richly formatted notebooks that contain embedded IPython code.\n",
"\n",
"Notebooks are a collection of cells. Each cell is assigned a type, either by the user or notebook backend. The type of the cell determines how its contents will be handled. Code cells can be executed one at a time or in batch. If a code cell produces output, the notebook backend will automatically add output cell(s) upon execution of the code cell. Output cells are always inserted immediately after their parent code cell.\n",
"\n",
"Text and heading cell types are included. If additional formatting is needed, Markdown, LaTeX, and HTML can be used. When saved, the notebook is written as a *JSON* text file.[2] Cells containing binary data (i.e. output cells with figures) are *base-64* encoded[3]. The IPython API contains functions that allow one to read and write from the notebook file type.\n",
"\n",
"To export a notebook to something other than JSON two options exist. The first is to \u201cprint\u201d the notebook to a PDF using the web browser[4]. The second option is to use **nbconvert**. With nbconvert, notebooks can be exported to various formats including, but not limited to, LaTeX, reveal.js, RST, and HTML. This is important for users that want to be able to share their work outside of IPython. nbconvert can be customized by the user to export to formats that are not supported by default. This senior project is an addition of a Sphinx Latex output format and a **refactor** of the nbconvert source code. Where *\"**Refactoring** is a disciplined technique for restructuring an existing body of code, altering its internal structure without changing its external behavior[5].\"*"
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Motivation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Nbconvert started out as experiment created by Fernando Perez[6]. Overtime contributions were made by various authors to enable the export of additional formats. Since nbconvert started as an experiment, it lacked a solid foundation. As the codebase continued to grow, it became apparent that a nbconvert needed a major re-engineering. As contributors pushed the existing nbconvert architecture to its limits, nbconvert\u2019s core classes needed to be extended. As the project became increasingly popular, the number of requests to *merge* changes into the *master branch* became unmanageable[7]. The IPython core development team agreed to implement *Jinja* as a template engine in attempt to mitigate the number of merge requests[8].\n",
"\n",
"The original nbconvert idea was to define a base exporter classes and then subclasses corresponding to each export format. The new template powered nbconvert was implemented using the existing nbconvert structure. The template engine was implemented as a subclass of the base exporter. The result was a confusing codebase that was both a mix of the original nbconvert and a template powered nbconvert. This senior project fixed this by separating the original nbconvert code from the templating nbconvert code. In addition, the new code template powered exporter was completely refactored.\n",
"\n",
"The LaTeX template included with nbconvert was capable of producing simple LaTeX documents. Many users were already interested in the sophisticated LaTeX output that Sphinx could produce. At the time, Sphinx documents could be produced by exporting notebooks to RST and importing that into Sphinx. This senior project added a Sphinx LaTeX template, which builds off of the existing Sphinx LaTeX output. The new Sphinx LaTeX template allows Sphinx LaTeX to be exported directly from nbconvert without exporting to RST first."
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Details"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Pre-refactor"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Nbconvert's parent level directory structure prior to the refactor is seen below"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"ls ../old_nbconvert/ -1 | grep -v \"[.][a-z]\""
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"converters\n",
"css\n",
"js\n",
"profile\n",
"reveal\n",
"rst2ipynblib\n",
"templates\n",
"tests\n"
]
}
],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Where\n",
"\n",
"- **converters** exporters, exporter base, Jinja filters, and notebook transformers. The last two are specific to the new template based exporter.\n",
"- **css** static style sheet for HTML exporter and style sheet for reveal.js exporter\n",
"- **js** mathjax javascript\n",
"- **profile** configuration files for template engine based exporter. Each configuration file corresponds to an output template\n",
"- **reveal** empty\n",
"- **rst2ipynblib** empty\n",
"- **templates** Jinja templates for new template based exporter\n",
"- **tests** Nose tests for testing the orignal nbconvert exporters\n",
"\n",
"There were many unsorted files in the top level directory, some of which were depracated, as seen below"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"ls ../old_nbconvert/ -1 | grep \".py\""
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"custom_converter.py\n",
"dollarmath.py\n",
"__init__.py\n",
"nbconvert2.py\n",
"nbconvert.py\n",
"nbstripout.py\n",
"notebook_sphinxext.py\n",
"rst2ipynblib\n",
"rst2ipynb.py\n"
]
}
],
"prompt_number": 2
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Refactor"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Nbconvert should serve two main functions:\n",
"\n",
"1. Commandline notebook conversion utility.\n",
"2. Rich API for exporting and importing notebooks.\n",
"\n",
"The old nbconvert code was moved into \"*/nbconvert1*\". The new nbconvert code was moved into \"*/nbconvert*\". The new folder structure is seen below."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"ls -1 | grep -v \"[.][a-z]\""
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"nbconvert\n",
"nbconvert1\n"
]
}
],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The directory structure of the refractored nbconvert was designed to expose as much of the inner workings of the template exporter as possible (as seen below.)"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"ls ./nbconvert/ -1 | grep -v \"[.][a-z]\""
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"exporters\n",
"filters\n",
"templates\n",
"transformers\n",
"utils\n"
]
}
],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Where\n",
"\n",
"- **exporters** base exporter class and light subclasses that define default options for each export format\n",
"- **filters** Jinja filters that are accessible within the templates\n",
"- **templates** templates for each export format\n",
"- **transformers** notebook transformers\n",
"- **utils** collection of utility functions\n",
"\n",
"The nbconvert export process is a multistep process:\n",
"\n",
"1. Load notebook file using IPython API\n",
"2. Preprocess the notebook using **Transformer(s)**. A Transformer is a class that acts on the notebook as a whole or on a cell-by-cell basis prior to export.\n",
"3. The **filters** are passed into the Jinja templating engine. A **Filter** (Jinja specific) is a function that takes one or more arguments and returns a string. The filters are passed into Jinja so they are accessible to the templates in the next conversion step.\n",
"4. Notebook is converted using Jinja\n",
"5. *(optional)* Conversion results and exported figures are written to the user's hard disk.\n",
"\n",
"The first implementation of the template based nbconvert used *profiles* to configure a single Jinja exporter class to export to different formats. This is not how the IPython config system was originally designed to be used. It was decided to replace the the *profile* design with lightweight subclasses.\n",
"\n",
"The success of the refactor cannot be measured; however, code metrics can give an idea of how much the code was changed. Code lines can be broken up into three categories, blank, comments, and SLOC (source lines of code). By counting the number of blank lines and comment lines we know the remaining lines are SLOC. $$ SLOC = Total - (blanks + comments) $$\n",
"\n",
"In the following python block the blank and comment lines of the pre-refactored nbconvert, refactored nbconvert, and *achived* nbconvert files are counted. The *archived* nbconvert files (*/nbconvert1/*) are the original nbconvert class/subclass design. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import os \n",
"\n",
"def directory_traverser(path, endswith=\".py\"):\n",
" for root, dirs, files in os.walk(path):\n",
" for name in files:\n",
" if name.endswith(endswith):\n",
" yield os.path.join(root, name)\n",
"\n",
"def python_file_metrics(filename):\n",
" file = open(filename, \"r\")\n",
" \n",
" total_count = 0\n",
" comment_count = 0\n",
" blank_count = 0\n",
" \n",
" multiline_string_delim = \"\\\"\" * 3\n",
" in_docstring = False\n",
" \n",
" for line in file:\n",
" total_count+=1\n",
" line = line.strip()\n",
" if in_docstring:\n",
" comment_count += 1\n",
" if multiline_string_delim in line:\n",
" in_docstring = False\n",
" else:\n",
" if line.startswith(multiline_string_delim) and not (len(line) > 3 and line[3] == \")\"):\n",
" comment_count += 1\n",
" \n",
" #only start doc-string block if terminator isn't in same line\n",
" if not multiline_string_delim in line[3:]:\n",
" in_docstring = True\n",
" \n",
" elif line.startswith(\"#\"):\n",
" comment_count += 1\n",
" elif line==\"\":\n",
" blank_count+=1\n",
" \n",
" file.close()\n",
" return (total_count, comment_count, blank_count)\n",
"\n",
"def python_project_metrics(filenames):\n",
" total_count = 0\n",
" comment_count = 0\n",
" blank_count = 0\n",
" \n",
" for filename in filenames:\n",
" (total,comments,blank) = python_file_metrics(filename)\n",
" total_count += total\n",
" comment_count += comments\n",
" blank_count += blank\n",
" \n",
" comments_and_blanks = blank_count + comment_count\n",
" print(\"\"\"\n",
" SLOC : {0}\n",
" Comments : {1}\n",
" Blanks : {2}\n",
" Total: {3}\n",
" \n",
" Ratio of Comments per SLOC: {4:0.3}\n",
" \"\"\".format(total_count - comments_and_blanks, comment_count, blank_count, total_count, float(comment_count) / float(total_count - comments_and_blanks)))\n",
"\n",
"print(\"1. Pre-refactored\")\n",
"python_project_metrics(directory_traverser(\"../old_nbconvert/\"))\n",
"print(\"\\n2. Refactored\")\n",
"python_project_metrics(directory_traverser(\"./nbconvert/\"))\n",
"print(\"\\n3. Archived\")\n",
"python_project_metrics(directory_traverser(\"./nbconvert1/\"))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1. Pre-refactored\n",
"\n",
" SLOC : 2741\n",
" Comments : 2787\n",
" Blanks : 1199\n",
" Total: 6727\n",
" \n",
" Ratio of Comments per SLOC: 1.02\n",
" \n",
"\n",
"2. Refactored\n",
"\n",
" SLOC : 896\n",
" Comments : 1304\n",
" Blanks : 496\n",
" Total: 2696\n",
" \n",
" Ratio of Comments per SLOC: 1.46\n",
" \n",
"\n",
"3. Archived\n",
"\n",
" SLOC : 2143\n",
" Comments : 2324\n",
" Blanks : 962\n",
" Total: 5429\n",
" \n",
" Ratio of Comments per SLOC: 1.08\n",
" \n"
]
}
],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The code metric above does not recognize line continuations. The larger comment to code ratio for both the archived and refactored content means the pre-refactored template based exporter actually had a comment to code ratio less than 1.02. One an a half comments for every line of code may sound daunting, but a large portion of the comments are due to the IPython coding stardard doc string format."
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Sphinx LaTeX"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Sphinx LaTeX template was designed provide the best possible output with minimal configuration. Instead of hardcoding the Sphinx material into the template, the template references the Sphinx installation on the user's machine. Because of this, the Sphinx LaTeX template is only 356 SLOC[9]. If the user has Sphinx installed on his or her machine, a PDF can be created from the nbconvert output using PdfLatex.\n",
"\n",
"The Sphinx Latex template provides two output document styles\n",
"\n",
"1. **HowTo** For short documents\n",
"2. **Manual** For longer documents, notebook Heading 1s are treated as chapters. Each chapter starts on a new page.\n",
"\n",
"It also allows the user to choose how IPython code cells are rendered\n",
"\n",
"1. **Simple** A thin horizontal break is used to separate code from text. At the top left of the horizontal break, in small font, the input/output prompt is visible.\n",
"2. **Notebook** Code cells are rendered like they are in the notebook. The are rendered in light gray tables with slightly rounded corners."
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Conclusion"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In conclusion, the refactor improved the code's readability by increasing the comment to code ratio and separating the files that were no longer in use from the code base. The refactored nbconvert and the orignal pull request can be found on GitHub[10]. The Sphinx LaTeX template added the ability to export beautiful LaTeX documents directly from nbconvert. It is in the master nbconvert repository and can be used with no modification to nbconvert. The Sphinx LaTeX template was used to export this notebook (senior project paper.) The source code for the template can be viewed on GitHub[9]. This paper is also available on GitHub as a gist[11]."
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"References"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. What is Python? Executive Summary, http://www.python.org/doc/essays/blurb.html\n",
"2. Introducing JSON, http://www.json.org/\n",
"3. Base64 - Online sample of a base64 poperty, http://www.motobit.com/util/base64-decoder-encoder.asp\n",
"4. What tools are available to export an ipython notebook to a PDF file?, http://stackoverflow.com/questions/14132213/what-tools-are-available-to-export-an-ipython-notebook-to-a-pdf-file\n",
"5. Refactoring Home Page, http://www.refactoring.com/\n",
"6. Fernando Perez, IPython PI, https://github.com/fperez\n",
"7. Wikipedia, Revision control, http://en.wikipedia.org/wiki/Revision_control\n",
"8. Jinja homepage, http://jinja.pocoo.org/\n",
"9. sphinx template source with SLOC count, on GitHub, https://github.com/ipython/nbconvert/blob/master/nbconvert/templates/latex/sphinx_base.tplx\n",
"10. nbconvert refactor, original pull request on GitHub, https://github.com/ipython/nbconvert/pull/137\n",
"11. gist for this paper, on GitHub, https://gist.github.com/jdfreder/5816002. Can also be view in nbviewer at http://nbviewer.ipython.org/5816002"
]
}
],
"metadata": {}
}
]
}