simonbyrne · April 16, 2015 09:45
diff --git a/jlug.ipynb b/jlug.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Inside Julia\n",
    "\n",
    "### Simon Byrne\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "# About Me\n",
    "\n",
    "* I'm a researcher in statistics at University College London.\n",
    "* Focus on a class of algorithms known as Markov chain Monte Carlo (MCMC).\n",
    " * Largely theoretical analysis of computational techniques.\n",
    " * but occasionally I need to write some code.\n",
    " * Sequential algorithms: can't be vectorised\n",
    "\n",
    "# Why Julia?\n",
    "\n",
    "* Quick to write: easy to try out ideas.\n",
    "* Reasonably performant, but representative of the speed of the algorithm.\n",
    "* Easy to understand: if I need to look at the code 6 months later.\n",
    "* Able to peek under the hood.\n",
    "\n",
    "# How I met Julia\n",
    "\n",
    "> Just to never write another .mex file, I'll very seriously consider Julia for new projects\n",
    "> \n",
    "> &ndash; <cite>Justin Domke  (September 2012)</cite>\n",
    "\n",
    "I had Matlab code full of:\n",
    "```\n",
    "R = chol(A) \n",
    "y = R \\ ( R' \\ x)\n",
    "```\n",
    "\n",
    "I really liked being able to write Julia code:\n",
    "```\n",
    "C = cholfact(A) \n",
    "y = C \\ x\n",
    "```\n",
    "(and it was faster).\n",
    "\n",
    "I filed my first pull request 3 weeks later."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# My Contributions\n",
    "\n",
    "* Base Julia: mostly numeric related\n",
    " * Numeric comparisons\n",
    " * Linear algebra\n",
    " * Various mathematical functions\n",
    " \n",
    "* Stats\n",
    " * StatsBase.jl: basic statistics functionality\n",
    " * Distributions.jl: probability distributions\n",
    " * KernelDensity.jl: kernel density estimators\n",
    "* Other\n",
    " * AppleAccelerate.jl: an interface to Apple's Accelerate library\n",
    " * InplaceOps.jl: avoids unnecessary allocation of arrays\n",
    " * DoubleDouble.jl: extended precision arithmetic without BigFloats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Numeric comparisons\n",
    "or, why does:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "true"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "0.1 > 1//10"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.100000000000000005551115123125782702118158340454101562500000"
     ]
    }
   ],
   "source": [
    "@printf \"%.60f\" 0.1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.30000000000000004"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "0.1+0.2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Few other languages let you compare numbers of different types.\n",
    "\n",
    "The rough idea:\n",
    "* every non-NaN `Real` value corresponds to exactly 1 real number, or an infinity.\n",
    "* all non-NaN `Real` values can be compared correctly."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## Float vs Integer comparisons"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "function <(x::Float64, y::Int64)\n",
    "    fy = Float64(y)\n",
    "    (x < fy) | ((x == fy) & (fy < Float64(typemax(Int64))) & (unsafe_trunc(Int64,fy) < y))\n",
    "end"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Why is this so complicated?\n",
    "\n",
    "Neither type is a subset of the other.\n",
    "\n",
    "* `Float64` can only represent all integers up to 2<sup>53</sup>.\n",
    "* Any integers above this may be rounded when converted."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "9007199254740993"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "2^53+1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "9007199254740992"
     ]
    }
   ],
   "source": [
    "@printf \"%i\" Float64(2^53+1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So the rough idea is:\n",
    "1. Convert to `Float64`, and compare the values\n",
    "2. If they are not equal, then the comparison is correct.\n",
    "3. If they are equal, then `y` must be an integer, so can be compared as integers\n",
    "  * `unsafe_trunc` is fast: it doesn't throw exceptions (same as a C cast).\n",
    "4. Extra check is for integers that can't be round-tripped (ones that get rounded to 2<sup>63</sup>)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "9223372036854775807"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y = typemax(Int64) # 2^63-1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "9223372036854775808"
     ]
    }
   ],
   "source": [
    "fy = Float64(y)\n",
    "@printf \"%i\" fy # rounded to 2^64"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "ename": "LoadError",
     "evalue": "LoadError: InexactError()\nwhile loading In[32], in expression starting on line 1",
     "output_type": "error",
     "traceback": [
      "LoadError: InexactError()\nwhile loading In[32], in expression starting on line 1",
      "",
      " in call at base.jl:36"
     ]
    }
   ],
   "source": [
    "Int64(fy)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-9223372036854775808"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "unsafe_trunc(Int64,fy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## So what happens to your functions?\n",
    "\n",
    "You can see the underlying Julia code using `@less`. But what happens after?\n",
    "\n",
    "### 0. Parsing\n",
    "* Syntax is parsed (`parse` function): you can also see what this looks like by using `quote`, and playing around with the `Expr` object.\n",
    "* Macros are expanded: try out `macroexpand` on a quoted macro.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       ":(1.0 < 2)"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "e = parse(\"1.0 < 2\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       ":comparison"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "e.head"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3-element Array{Any,1}:\n",
       " 1.0\n",
       "  :<\n",
       " 2  "
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "e.args"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "quote \n",
       "    #308#out = Base.Printf.STDOUT\n",
       "    #309###x#2686 = 0.1\n",
       "    local #314#neg, #313#pt, #312#len, #307#exp, #310#do_out, #311#args\n",
       "    if Base.Printf.isfinite(#309###x#2686)\n",
       "        (#310#do_out,#311#args) = Base.Printf.fix_dec(#308#out,#309###x#2686,\"\",0,60,'f')\n",
       "        if #310#do_out\n",
       "            (#312#len,#313#pt,#314#neg) = #311#args\n",
       "            #314#neg && Base.Printf.write(#308#out,'-')\n",
       "            Base.Printf.print_fixed(#308#out,60,#313#pt,#312#len)\n",
       "        end\n",
       "    else \n",
       "        Base.Printf.write(#308#out,begin  # printf.jl, line 143:\n",
       "                if Base.Printf.isnan(#309###x#2686)\n",
       "                    \"NaN\"\n",
       "                else \n",
       "                    if (#309###x#2686 Base.Printf.< 0)\n",
       "                        \"-Inf\"\n",
       "                    else \n",
       "                        \"Inf\"\n",
       "                    end\n",
       "                end\n",
       "            end)\n",
       "    end\n",
       "    Base.Printf.nothing\n",
       "end"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "macroexpand(:(@printf \"%.60f\" 0.1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. `@code_lowered` \n",
    "This is a proto-Julia syntax: a couple of things have been done:\n",
    "* Syntax desugaring, e.g. \n",
    "  * expand `a+=b` to `a=a+b`\n",
    "  * substitute `a'*b` with `Ac_mul_B(a,b)`\n",
    "* Control flow are replaced by simple \"goto\"-style statements.\n",
    "* Resolve scope of functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1-element Array{Any,1}:\n",
       " :($(Expr(:lambda, Any[:x,:y], Any[Any[:fy],Any[Any[:x,:Any,0],Any[:y,:Any,0],Any[:fy,:Any,18]],Any[],0], :(begin  # In[3], line 2:\n",
       "        fy = Float64(y)::ANY # line 3:\n",
       "        return (x < fy)::ANY | ((x == fy)::ANY & (fy < Float64(typemax(Int64)::ANY)::ANY)::ANY)::ANY & (unsafe_trunc(Int64,fy)::ANY < y)::ANY::ANY::ANY\n",
       "    end::ANY))))"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@code_lowered 1.0 < 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. `@code_typed`\n",
    "\n",
    "The Julia magic happens\n",
    "* type inference\n",
    "* method sorting\n",
    "* inlining, and other optimisations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1-element Array{Any,1}:\n",
       " :($(Expr(:lambda, Any[:x,:y], Any[Any[:fy],Any[Any[:x,Float64,0],Any[:y,Int64,0],Any[:fy,Float64,18]],Any[],Any[]], :(begin  # In[3], line 2:\n",
       "        fy = (top(box))(Float64,(top(sitofp))(Float64,y::Int64)) # line 3:\n",
       "        return (top(box))(Bool,(top(or_int))((top(lt_float))(x::Float64,fy::Float64)::Bool,(top(box))(Bool,(top(and_int))((top(box))(Bool,(top(and_int))((top(eq_float))(x::Float64,fy::Float64)::Bool,(top(lt_float))(fy::Float64,(top(box))(Float64,(top(sitofp))(Float64,9223372036854775807)))::Bool)),(top(slt_int))((top(box))(Int64,(top(fptosi))(Int64,fy::Float64)),y::Int64)::Bool))))\n",
       "    end::Bool))))"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@code_typed 1.0 < 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. `@code_llvm`\n",
    "\n",
    "`@code_llvm` gives the LLVM IR (intermediate representation). Think of it as a cross-platform, assembly-like language.\n",
    "\n",
    "The code is shown after various optimsation passes (constant propagation, dead code elimination, etc.).\n",
    "\n",
    "If you want to see what each pass does, you can start `julia-debug` with `JULIA_LLVM_ARGS = -print-after-all`. Beware, this will print a *LOT* of text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "define i1 @\"julia_<_43379\"(double, i64) {\n",
      "top:\n",
      "  %2 = sitofp i64 %1 to double, !dbg !3108\n",
      "  %3 = fcmp ogt double %2, %0, !dbg !3110\n",
      "  %4 = fcmp oeq double %2, %0, !dbg !3110\n",
      "  %5 = fcmp olt double %2, 0x43E0000000000000, !dbg !3110\n",
      "  %6 = and i1 %4, %5, !dbg !3110\n",
      "  %7 = fptosi double %2 to i64, !dbg !3110\n",
      "  %8 = icmp slt i64 %7, %1, !dbg !3110\n",
      "  %9 = and i1 %6, %8, !dbg !3110\n",
      "  %10 = or i1 %3, %9, !dbg !3110\n",
      "  ret i1 %10, !dbg !3110\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "@code_llvm 1.0 < 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. `@code_native`\n",
    "Further platform-dependent optimisations, generate the assembly instructions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\t.section\t__TEXT,__text,regular,pure_instructions\n",
      "Filename: In[3]\n",
      "Source line: 2\n",
      "\tpushq\t%rbp\n",
      "\tmovq\t%rsp, %rbp\n",
      "Source line: 2\n",
      "\tvcvtsi2sdq\t%rdi, %xmm0, %xmm1\n",
      "\tvucomisd\t%xmm0, %xmm1\n",
      "Source line: 3\n",
      "\tseta\t%cl\n",
      "\tsetnp\t%dl\n",
      "\tsete\t%al\n",
      "\tandb\t%dl, %al\n",
      "\tmovabsq\t$13336766176, %rdx      ## imm = 0x31AEEE6E0\n",
      "\tvmovsd\t(%rdx), %xmm0\n",
      "\tvucomisd\t%xmm1, %xmm0\n",
      "\tseta\t%dl\n",
      "\tandb\t%al, %dl\n",
      "\tvcvttsd2si\t%xmm1, %rax\n",
      "\tcmpq\t%rdi, %rax\n",
      "\tsetl\t%al\n",
      "\tandb\t%dl, %al\n",
      "\torb\t%cl, %al\n",
      "\tpopq\t%rbp\n",
      "\tret\n"
     ]
    }
   ],
   "source": [
    "@code_native 1.0 < 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Does this mean that comparing integers and floats is computationally expensive?\n",
    "\n",
    "Not always."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "isneg (generic function with 1 method)"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "isneg(x) = x < 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1-element Array{Any,1}:\n",
       " :($(Expr(:lambda, Any[:x], Any[Any[],Any[Any[:x,:Any,0]],Any[],0], :(begin  # In[42], line 1:\n",
       "        return x < 0::ANY\n",
       "    end::ANY))))"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@code_lowered isneg(1.0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "define i1 @julia_isneg_43442(float) {\n",
      "top:\n",
      "  %1 = fcmp olt float %0, 0.000000e+00, !dbg !3138\n",
      "  ret i1 %1, !dbg !3138\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "@code_llvm isneg(1f0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Float vs MathConst comparisons"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "π = 3.1415926535897..."
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pi"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "MathConst{:π}"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "typeof(pi)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3.141592653589793238462643383279502884197169399375105820974944592307816406286198e+00 with 256 bits of precision"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "big(pi)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "true"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x = Float64(pi)\n",
    "x < pi < nextfloat(x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "<(x::MathConst, y::Float64) = Float64(x,RoundUp) <= y\n",
    "<(x::Float64, y::MathConst) = x <= Float64(y,RoundDown)\n",
    "\n",
    "# this is called by Float64(c,r)\n",
    "stagedfunction call{T<:Union(Float32,Float64),s}(t::Type{T},c::MathConst{s},r::RoundingMode)\n",
    "    f = T(big(c()),r())\n",
    "    :($f)\n",
    "end"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## So what is a `stagedfunction`?\n",
    "\n",
    "> `magicfunction`\n",
    ">\n",
    "> Defines a magical function that shares similarities with warp drives\n",
    "and matter replicators. Also, the arguments are the types, not values.\n",
    ">\n",
    "> No it is not a macro.\n",
    ">\n",
    "> Also, we don't use words like \"homoiconic\" in this household young man.\n",
    ">\n",
    "> &ndash; <cite>Iain Dunning</cite>\n",
    "\n",
    "A `stagedfunction` consists of 2 stages:\n",
    "* at compile time (i.e. when first called), it is evaluated with its arguments as types, and returning an expression.\n",
    "* this is expression becomes the function for those types."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1-element Array{Any,1}:\n",
       " :($(Expr(:lambda, Any[:(t::Any::ANY),:(c::Any::ANY),:(r::Any::ANY)], Any[Any[],Any[Any[:t,Type{Float64},0],Any[:c,MathConst{:π},0],Any[:r,Base.Rounding.RoundingMode{:Down},0]],Any[],Any[]], :(begin \n",
       "        return 3.141592653589793\n",
       "    end::Float64))))"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@code_typed Float64(pi,RoundDown)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1-element Array{Any,1}:\n",
       " :($(Expr(:lambda, Any[:(t::Any::ANY),:(c::Any::ANY),:(r::Any::ANY)], Any[Any[],Any[Any[:t,Type{Float64},0],Any[:c,MathConst{:π},0],Any[:r,Base.Rounding.RoundingMode{:Up},0]],Any[],Any[]], :(begin \n",
       "        return 3.1415926535897936\n",
       "    end::Float64))))"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@code_typed Float64(pi,RoundUp)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Particularly useful for:\n",
    "* When functions behave slighlty differently for different parametric types: can avoid lots of \n",
    "```\n",
    "for T in (:Type1,:Type2)\n",
    "    @eval foo(x::$T) = ...\n",
    "end\n",
    "```\n",
    "* Variable agurments, e.g. multidimensional indexing:\n",
    "```\n",
    "getindex(A::MyArray, inds...)\n",
    "```\n",
    "* Used in Keno's [Cxx.jl](https://github.com/Keno/Cxx.jl): allows calling C++ straight from Julia (no wrappers, no precompilation)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Special functions\n",
    "\n",
    "Do you know what your `sin` function does?\n",
    "\n",
    "* Recent libraries on modern CPUs (support SSE instructions, i.e. the last 5 years or so) don't use the x87 assembly routines (e.g. FSIN)\n",
    "\n",
    "In [openlibm](https://github.com/JuliaLang/openlibm) (the default math library used by Julia), there are two steps:\n",
    "\n",
    "1. Range reduction: find an integer $n$ and $y \\in [-\\frac{\\pi}{4},\\frac{\\pi}{4}]$ such that\n",
    " $$ x = n \\frac{\\pi}{2} + y $$\n",
    " * this is tricky (as $\\pi$ does not correspond to a floating point number):\n",
    "   - Cody&ndash;Waite for small values\n",
    "   - Payne&ndash;Hanek for large values\n",
    "   \n",
    " * Calculate $y$ with extra precision: typically stored as a pair of floating point numbers `y_hi + y_lo`\n",
    "   \n",
    "2. Based on $n$, compute the appropriately signed `sin`/`cos` of $y$.\n",
    " * a minimax polynomial approximation (similar, but distinct from, a Taylor's series truncation).\n",
    " * some other libraries incorporate a further range reduction (e.g. $[-\\pi/128,\\pi/128]$), combined with a table-lookup and a shorter polynomial."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However [a](http://stackoverflow.com/questions/14429902/abssinpi-for-mitigating-systematic-rounding-error) [LOT](http://stackoverflow.com/questions/2670999/how-can-i-work-around-the-fact-that-in-c-sinm-pi-is-not-0) [of](http://stackoverflow.com/questions/22439827/why-is-sinpi-4-different-from-cospi-4-in-python) [people](http://stackoverflow.com/questions/7561698/javascripts-geometrical-methods-zero-isnt-exactly-zero-is-it) [want](http://stackoverflow.com/questions/6566512/value-of-sine-180-is-coming-out-as-1-22465e-16) [to](http://stackoverflow.com/questions/27643267/storing-cos-and-sin-into-a-vector-and-getting-weird-values) [calculate](http://stackoverflow.com/questions/9780057/sin-function-not-giving-correct-answer)\n",
    "\n",
    "$$ \\sin \\frac{n \\pi}{N}$$\n",
    "\n",
    "for $n = 0,1,\\ldots,N$, e.g.\n",
    "* Simple discretisations of the circle,\n",
    "* Fast Fourier transforms (FFTs)\n",
    "* Bessel functions, complex numbers, etc.\n",
    "\n",
    "Although technically correct (according to the IEEE 754 floating point standard), it does seem silly to give answers like:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.2246467991473532e-16"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sin(10*pi/10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.49999999999999994"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sin(pi/6)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But we can do better&hellip;"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sinpi(10/10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.5"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sinpi(1/6)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This function is basically\n",
    "1. A range reduction modulo 1/2\n",
    " * *can* be done exactly using a standard `rem` operation (or even faster using some other tricks).\n",
    "2. Multiply by $\\pi$\n",
    " * Do this using \"double-double\" arithmetic to get `y_hi + y_lo`\n",
    "3. Call the \"kernel\" `sin`/`cos`\n",
    "\n",
    "\n",
    "* More accurate: to within 1 unit in the last place,\n",
    "* Is actually *faster* then `sin(pi*x)` (we avoid the expensive range-reduction),\n",
    "* Is implemented *all* in native Julia."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Base.Math.DoubleFloat64(0.5235987755982988,2.835264368075643e-17)"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a = Base.Math.mulpi_ext(1/6)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.5"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Base.Math.sin_kernel(a)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Mathematical functions\n",
    "\n",
    "Slowly, a lot of the mathematical functions are being implented in Julia.\n",
    "\n",
    "* `reinterpret` makes it easy to implement \"bit-twiddling\" functions\n",
    " * e.g. `significand`, `exponent`, `frexp`.\n",
    " * much clearer than type-punning in C"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.2"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# find y = x * 2^k ∈ [1,2)\n",
    "reinterpret(Float64,\n",
    "    (reinterpret(UInt64, 0.3) & 0x800f_ffff_ffff_ffff)| 0x3ff0000000000000\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* `@label`/`@goto` makes it very easy to translate Fortran code\n",
    " * See Mike Nolta's [SpecialFunctions.jl](https://github.com/nolta/SpecialFunctions.jl): a translation of openspecfun (AMOS and Fadeeva)\n",
    "\n",
    "Will hopefully allow users to easily understand what their code is doing (and how to improve it)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lower-level functions\n",
    "\n",
    "`llvmcall` allows you to call LLVM IR directly. This allows you to directly call LLVM intrinsics without C code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "getmxcsr (generic function with 1 method)"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# accesses the SSE floating point control status register (mxcsr)\n",
    "\n",
    "function getmxcsr()\n",
    "    Base.llvmcall(\"\"\"\n",
    "        %ptr = alloca i32\n",
    "        call void @llvm.x86.sse.stmxcsr(i32 * %ptr)\n",
    "        %curval = load i32 * %ptr\n",
    "        ret i32 %curval\"\"\", UInt32, ())\n",
    "end"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0x00001fa1"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "getmxcsr()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Flags:     100001\n",
      "Den = 0:   0\n",
      "Masks:     111111\n",
      "Rounding:  00\n",
      "Flush den: 0\n"
     ]
    }
   ],
   "source": [
    "function decmx(u::UInt32)\n",
    "    b = bits(u)\n",
    "    println(\"Flags:     \",b[27:32])\n",
    "    println(\"Den = 0:   \",b[26])\n",
    "    println(\"Masks:     \",b[20:25])\n",
    "    println(\"Rounding:  \",b[18:19])\n",
    "    println(\"Flush den: \",b[17])\n",
    "end\n",
    "decmx(getmxcsr())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Inf"
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x = 1.0 / 0.0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Flags:     100101\n",
      "Den = 0:   0\n",
      "Masks:     111111\n",
      "Rounding:  00\n",
      "Flush den: 0\n"
     ]
    }
   ],
   "source": [
    "decmx(getmxcsr())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "function setmxcsr(u::UInt32)\n",
    "    Base.llvmcall(\"\"\"%ptr = alloca i32\n",
    "                  store i32 %0, i32 * %ptr\n",
    "                  call void @llvm.x86.sse.ldmxcsr(i32 * %ptr)\n",
    "                  ret void\"\"\", Void, (UInt32,),u)\n",
    "end"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "FE_MASK_DIVBYZERO = 0b0_0_00_000100_0_000000\n",
    "u = getmxcsr()\n",
    "setmxcsr(u & ~FE_MASK_DIVBYZERO)\n",
    "decmx(getmxcsr())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "1.0/0.0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "setmxcsr(u)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In future, I'm hoping to be able to use this to allow for a sane approach to floating point exceptions.\n",
    " * Ideally, integrate this into the future debugger."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Even lower\n",
    "\n",
    "We can also use this for inline assembly (requires LLVM 3.4+)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "asm_pi (generic function with 1 method)"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "function asm_pi()\n",
    "    Base.llvmcall(\"\"\"\n",
    "        %pi = call double asm \"fldpi\", \"={st},~{dirflag},~{fpsr},~{flags}\"() #0\n",
    "        ret double %pi\"\"\",\n",
    "        Float64, ())\n",
    "end"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3.141592653589793"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "asm_pi()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "error: failed to compute relocation: X86_64_RELOC_UNSIGNED\n",
      "error: failed to compute relocation: X86_64_RELOC_UNSIGNED\n",
      "error: failed to compute relocation: X86_64_RELOC_UNSIGNED\n",
      "error: failed to compute relocation: X86_64_RELOC_UNSIGNED\n",
      "error: failed to compute relocation: X86_64_RELOC_UNSIGNED\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\t.section\t__TEXT,__text,regular,pure_instructions\n",
      "Filename: In[21]\n",
      "Source line: 0\n",
      "\tpushq\t%rbp\n",
      "\tmovq\t%rsp, %rbp\n",
      "Source line: 2\n",
      "\tfldpi\n",
      "\tfstpl\t-8(%rbp)\n",
      "\tvmovsd\t-8(%rbp), %xmm0\n",
      "\tpopq\t%rbp\n",
      "\tretq\n"
     ]
    }
   ],
   "source": [
    "@code_native asm_pi()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Other cool stuff in 0.4: meta tags\n",
    "\n",
    "These are \"extra\" instructions to control various optimisations.\n",
    "\n",
    "## `@inbounds`\n",
    "\n",
    "This disables boundary checking of array access and assignment (i.e. standard C behaviour)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "second_element (generic function with 1 method)"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "second_element(x) = x[2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "define i64 @julia_second_element_43258(%jl_value_t*) {\n",
      "top:\n",
      "  %1 = getelementptr inbounds %jl_value_t* %0, i64 2, !dbg !2566\n",
      "  %2 = bitcast %jl_value_t* %1 to i64*, !dbg !2566\n",
      "  %3 = load i64* %2, align 8, !dbg !2566, !tbaa %jtbaa_arraylen\n",
      "  %4 = icmp ugt i64 %3, 1, !dbg !2566\n",
      "  br i1 %4, label %idxend, label %oob, !dbg !2566\n",
      "\n",
      "oob:                                              ; preds = %top\n",
      "  %5 = alloca i64, align 8, !dbg !2566\n",
      "  store i64 2, i64* %5, align 8, !dbg !2566\n",
      "  call void @jl_bounds_error_ints(%jl_value_t* %0, i64* %5, i64 1), !dbg !2566\n",
      "  unreachable, !dbg !2566\n",
      "\n",
      "idxend:                                           ; preds = %top\n",
      "  %6 = getelementptr inbounds %jl_value_t* %0, i64 1, !dbg !2566\n",
      "  %7 = bitcast %jl_value_t* %6 to i8**, !dbg !2566\n",
      "  %8 = load i8** %7, align 8, !dbg !2566, !tbaa %jtbaa_arrayptr\n",
      "  %9 = getelementptr i8* %8, i64 8, !dbg !2566\n",
      "  %10 = bitcast i8* %9 to i64*, !dbg !2566\n",
      "  %11 = load i64* %10, align 8, !dbg !2566, !tbaa %jtbaa_user\n",
      "  ret i64 %11, !dbg !2566\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "@code_llvm second_element([1,2,3])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "second_element (generic function with 1 method)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "second_element(x) = @inbounds x[2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "define %jl_value_t* @julia_second_element_43259(%jl_value_t*, %jl_value_t**, i32) {\n",
      "top:\n",
      "  ret %jl_value_t* inttoptr (i64 4371070976 to %jl_value_t*), !dbg !2570\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "@code_llvm second_element([1,2,3])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "\n",
    "## `@inline`\n",
    "\n",
    "Function inlining replaces a function call with the actual code from the function. Julia will automatically inline certain functions based on certain rules. \n",
    "\n",
    "Some additional control is provided by:\n",
    "\n",
    "* `@inline` macro marks functions at declaration which should be inlined  (similar to C `inline` statement).\n",
    "\n",
    "* `@noinline` marks functions which should NOT be inlined.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `@fastmath`\n",
    "\n",
    "This enables certain floating point optimisations that aren't strictly valid according to the IEEE-754 floating point standard:\n",
    "* allows reordering of instructions\n",
    "* allows substitution of fused multiply-add (FMA) instructions\n",
    "* assumes all values are finite (no Inf or NaNs)\n",
    "* treats signed zeros are the the same.\n",
    "* allows reciprocal approximations to division. \n",
    "\n",
    "Similar to the C `-ffast-math` compiler flag, but applied only locally."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "define double @julia_third_43558(double) {\n",
      "top:\n",
      "  %1 = fdiv double %0, 3.000000e+00, !dbg !3506\n",
      "  ret double %1, !dbg !3506\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "third(x) = x/3\n",
    "@code_llvm third(1.0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "define double @julia_third_43559(double) {\n",
      "top:\n",
      "  %1 = fmul double %0, 0x3FD5555555555555, !dbg !3510\n",
      "  ret double %1, !dbg !3510\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "third(x) = @fastmath x/3\n",
    "@code_llvm third(1.0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "In future could be used for faster, but less accurate, special functions (`sin`, `log`, etc.)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## `@simd`\n",
    "\n",
    "Allows for limited reordering of loops so as to be able to take advantage of SIMD (single instruction, multiple data) operations on modern processors. \n",
    "\n",
    "* Particularly useful for `Int32`/`Float32` operands, though there is typically still some speedup on `Float64` as well."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# Conclusions\n",
    "\n",
    "* `@less` is your friend.\n",
    "* `@code_*` can tell you lots of useful stuff.\n",
    "* Meta tags are going to allow more power, and avoid the need for compiler flags"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Julia 0.4.0-dev",
   "language": "julia",
   "name": "julia 0.4"
  },
  "language_info": {
   "name": "julia",
   "version": "0.4.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
 }