lgarrison · July 24, 2020 12:52
diff --git a/RR_autocorrelations.ipynb b/RR_autocorrelations.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The RR term in autocorrelations\n",
    "Author: Lehman Garrison (http://lgarrison.github.io)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$\\newcommand{\\RR}{\\mathrm{RR}}$\n",
    "$\\newcommand{\\DD}{\\mathrm{DD}}$\n",
    "When computing a two-point correlation function estimator like\n",
    "$\\xi(r) = \\frac{\\DD}{\\RR} - 1$,\n",
    "the $\\RR$ term can be computed analytically if the domain is a periodic box.\n",
    "In 2D, this is often done as\n",
    "$$\\begin{align}\n",
    "\\RR_i &= N A_i \\bar\\rho \\\\\n",
    "&= N A_i \\frac{N}{L^2}\n",
    "\\end{align}$$\n",
    "where $\\RR_i$ is the expected number of random-random pairs in bin $i$, $N$ is the total number of points, $A_i$ is the area (or volume if 3D) of bin $i$, $L$ is the box size, and $\\bar\\rho$ is the average density in the box.\n",
    "\n",
    "However, using $\\bar\\rho = \\frac{N}{L^2}$ is not quite right.  When sitting on a particle, only $N-1$ particles are available to be in a bin some distance away.  The other particle is the particle you're sitting on, which is always at distance $0$.  Thus, the correct expression is\n",
    "$$\\RR_i = N A_i \\frac{N-1}{L^2}.$$\n",
    "\n",
    "The following notebook empirically demonstrates that using $N-1$ is correct, and that using $N$ introduces bias of order $\\frac{1}{N}$ into the estimator.  This is a tiny correction for large $N$ problems, but important for small $N$.\n",
    "\n",
    "Cross-correlations of two different particle sets don't suffer from this problem; the particle you're sitting on is never part of the set of particles under consideration for pair-making.\n",
    "\n",
    "This $N-1$ correction is implemented in the [Corrfunc](https://github.com/manodeep/Corrfunc) code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Generates a set of N uniform random points in boxsize L\n",
    "# and returns binned pair counts\n",
    "# Pair counts are doubled, and self-pairs are counted\n",
    "def rand_DD(N, L, bins):\n",
    "    # Make a 2D random set of N particles in [0,L)\n",
    "    x = np.random.random((2,N))*L\n",
    "    \n",
    "    # Form the periodic distance array\n",
    "    dist = x[:,None] - x[...,None]\n",
    "    dist -= np.round(dist/L)*L\n",
    "    dist = (dist**2).sum(axis=0)**.5\n",
    "    \n",
    "    # Count pairs in 2 bins\n",
    "    DD = np.histogram(dist, bins=bins)[0]\n",
    "    \n",
    "    return DD"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "DD/RR using density (N-1) = [ 1.00022627  0.99934544  0.99956887  1.00006867]\n",
      "DD/RR using density (N)   = [ 0.96817988  0.8994109   0.89961199  0.9000618 ]\n"
     ]
    }
   ],
   "source": [
    "# DD parameters\n",
    "N = 10\n",
    "L = 1.\n",
    "bins = np.linspace(0.0,0.49, 5)\n",
    "n_iter = 100000  # number of times to repeat the experiment\n",
    "seed = 42\n",
    "np.random.seed(seed)\n",
    "\n",
    "# First, calculate RR analytically\n",
    "dA = (bins[1:]**2 - bins[:-1]**2)*np.pi  # The area of each 2D annular bin\n",
    "\n",
    "RR = np.empty((2, len(dA)))\n",
    "RR[:] = N/L**2*dA\n",
    "\n",
    "# Calculate RR using N-1 and N\n",
    "RR[0] *= N - 1\n",
    "RR[1] *= N\n",
    "\n",
    "# If the first bin includes 0, the random term must include the self-pairs\n",
    "if bins[0] == 0.:\n",
    "    RR[:, 0] += N\n",
    "\n",
    "xi_mean = np.zeros((2, len(bins)-1))\n",
    "for i in xrange(n_iter):\n",
    "    DD = rand_DD(N, L, bins)\n",
    "    xi_mean += DD/RR\n",
    "    \n",
    "xi_mean /= n_iter\n",
    "print 'DD/RR using density (N-1) = {}'.format(xi_mean[0])\n",
    "print 'DD/RR using density (N)   = {}'.format(xi_mean[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we expected, using $N$ in the density leads to a $\\frac{1}{N} = 10\\%$ bias in $\\frac{\\DD}{\\RR}$, when averaging over many realizations of $\\DD$.\n",
    "\n",
    "The first bin has a smaller bias because we chose to include a bin that starts at $0$ separation, and thus includes self-pairs, which are indepdendent of the $N$ or $N-1$ density term."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [Root]",
   "language": "python",
   "name": "Python [Root]"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
 }
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# The RR term in autocorrelations\n",
	"Author: Lehman Garrison (http://lgarrison.github.io)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"$\\newcommand{\\RR}{\\mathrm{RR}}$\n",
	"$\\newcommand{\\DD}{\\mathrm{DD}}$\n",
	"When computing a two-point correlation function estimator like\n",
	"$\\xi(r) = \\frac{\\DD}{\\RR} - 1$,\n",
	"the $\\RR$ term can be computed analytically if the domain is a periodic box.\n",
	"In 2D, this is often done as\n",
	"$$\\begin{align}\n",
	"\\RR_i &= N A_i \\bar\\rho \\\\\n",
	"&= N A_i \\frac{N}{L^2}\n",
	"\\end{align}$$\n",
	"where $\\RR_i$ is the expected number of random-random pairs in bin $i$, $N$ is the total number of points, $A_i$ is the area (or volume if 3D) of bin $i$, $L$ is the box size, and $\\bar\\rho$ is the average density in the box.\n",
	"\n",
	"However, using $\\bar\\rho = \\frac{N}{L^2}$ is not quite right. When sitting on a particle, only $N-1$ particles are available to be in a bin some distance away. The other particle is the particle you're sitting on, which is always at distance $0$. Thus, the correct expression is\n",
	"$$\\RR_i = N A_i \\frac{N-1}{L^2}.$$\n",
	"\n",
	"The following notebook empirically demonstrates that using $N-1$ is correct, and that using $N$ introduces bias of order $\\frac{1}{N}$ into the estimator. This is a tiny correction for large $N$ problems, but important for small $N$.\n",
	"\n",
	"Cross-correlations of two different particle sets don't suffer from this problem; the particle you're sitting on is never part of the set of particles under consideration for pair-making.\n",
	"\n",
	"This $N-1$ correction is implemented in the [Corrfunc](https://github.com/manodeep/Corrfunc) code."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 1,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"import numpy as np"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 2,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"# Generates a set of N uniform random points in boxsize L\n",
	"# and returns binned pair counts\n",
	"# Pair counts are doubled, and self-pairs are counted\n",
	"def rand_DD(N, L, bins):\n",
	" # Make a 2D random set of N particles in [0,L)\n",
	" x = np.random.random((2,N))*L\n",
	" \n",
	" # Form the periodic distance array\n",
	" dist = x[:,None] - x[...,None]\n",
	" dist -= np.round(dist/L)*L\n",
	" dist = (dist2).sum(axis=0).5\n",
	" \n",
	" # Count pairs in 2 bins\n",
	" DD = np.histogram(dist, bins=bins)[0]\n",
	" \n",
	" return DD"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 3,
	"metadata": {
	"collapsed": false
	},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"DD/RR using density (N-1) = [ 1.00022627 0.99934544 0.99956887 1.00006867]\n",
	"DD/RR using density (N) = [ 0.96817988 0.8994109 0.89961199 0.9000618 ]\n"
	]
	}
	],
	"source": [
	"# DD parameters\n",
	"N = 10\n",
	"L = 1.\n",
	"bins = np.linspace(0.0,0.49, 5)\n",
	"n_iter = 100000 # number of times to repeat the experiment\n",
	"seed = 42\n",
	"np.random.seed(seed)\n",
	"\n",
	"# First, calculate RR analytically\n",
	"dA = (bins[1:]2 - bins[:-1]2)*np.pi # The area of each 2D annular bin\n",
	"\n",
	"RR = np.empty((2, len(dA)))\n",
	"RR[:] = N/L*2dA\n",
	"\n",
	"# Calculate RR using N-1 and N\n",
	"RR[0] *= N - 1\n",
	"RR[1] *= N\n",
	"\n",
	"# If the first bin includes 0, the random term must include the self-pairs\n",
	"if bins[0] == 0.:\n",
	" RR[:, 0] += N\n",
	"\n",
	"xi_mean = np.zeros((2, len(bins)-1))\n",
	"for i in xrange(n_iter):\n",
	" DD = rand_DD(N, L, bins)\n",
	" xi_mean += DD/RR\n",
	" \n",
	"xi_mean /= n_iter\n",
	"print 'DD/RR using density (N-1) = {}'.format(xi_mean[0])\n",
	"print 'DD/RR using density (N) = {}'.format(xi_mean[1])"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"As we expected, using $N$ in the density leads to a $\\frac{1}{N} = 10\\%$ bias in $\\frac{\\DD}{\\RR}$, when averaging over many realizations of $\\DD$.\n",
	"\n",
	"The first bin has a smaller bias because we chose to include a bin that starts at $0$ separation, and thus includes self-pairs, which are indepdendent of the $N$ or $N-1$ density term."
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python [Root]",
	"language": "python",
	"name": "Python [Root]"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 2
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython2",
	"version": "2.7.12"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 0
	}