{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# A quick comparison of serialization formats\n",
"\n",
"* Tab-separated values or [PostgreSQL Text format](https://www.postgresql.org/docs/current/static/sql-copy.html#AEN74386)\n",
"* Tab-separated JSONs format proposed by DeepDive\n",
"* One JSON object per line format\n",
"\n",
"This was motivated by [DeepDive Pull Request #565](https://github.com/HazyResearch/deepdive/pull/565).\n",
"\n",
"You'll need `pv`, `timeout` in GNU coreutils, `curl`, and `wc` to reproduce these outputs."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[4;33mWarning\u001b[0m: pv-1.6.0 already installed\r\n",
"\u001b[4;33mWarning\u001b[0m: coreutils-8.25 already installed\r\n"
]
}
],
"source": [
"!brew install pv coreutils"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Prep\n",
"Let's grab a small dataset in each format. DeepDive spouse example has a dump of its sentences table that contains 1000 sentences with their NLP markups at token level." | |
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" % Total % Received % Xferd Average Speed Time Time Time Current\n",
" Dload Upload Total Spent Left Speed\n",
"100 186 100 186 0 0 204 0 --:--:-- --:--:-- --:--:-- 204\n",
"100 5580k 100 5580k 0 0 1562k 0 0:00:03 0:00:03 --:--:-- 2607k\n",
" % Total % Received % Xferd Average Speed Time Time Time Current\n",
" Dload Upload Total Spent Left Speed\n",
"100 186 100 186 0 0 240 0 --:--:-- --:--:-- --:--:-- 240\n",
"100 5730k 100 5730k 0 0 1834k 0 0:00:03 0:00:03 --:--:-- 3807k\n",
" % Total % Received % Xferd Average Speed Time Time Time Current\n",
" Dload Upload Total Spent Left Speed\n",
"100 5809k 100 5809k 0 0 3754k 0 0:00:01 0:00:01 --:--:-- 3753k\n"
]
}
],
"source": [
"!curl -fSL https://github.com/HazyResearch/deepdive/raw/tab-separated-jsons/examples/spouse/input/sentences-1000.tsv.bz2 | bunzip2 >s.tsv\n", | |
"!curl -fSL https://github.com/HazyResearch/deepdive/raw/tab-separated-jsons/examples/spouse/input/sentences-1000.tsj.bz2 | bunzip2 >s.tsj\n", | |
"!curl -fSL http://i.stanford.edu/~netj/data/sentences-1000.jsonl.bz2 | bunzip2 >s.jsonl" | |
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Getting equivalent JSON lines (jsonl) is a bit more tedious as DeepDive doesn't support it off-the-shelf. The following script should work if you have a running PostgreSQL:" | |
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"! [[ -e s.jsonl ]] || exit # XXX comment out this line to actually run the steps below\n",
"\n",
"export DEEPDIVE_DB_URL=postgresql:///tsj-tsv-jsonl\n",
"deepdive db init\n",
"deepdive create table sentences \\\n",
" doc_id:TEXT \\\n",
" sentence_index:INT \\\n",
" sentence_text:TEXT \\\n",
" tokens:TEXT[] \\\n",
" lemmas:TEXT[] \\\n",
" pos_tags:TEXT[] \\\n",
" ner_tags:TEXT[] \\\n",
" doc_offsets:INT[] \\\n",
" dep_types:TEXT[] \\\n",
" dep_tokens:INT[] \\\n",
" #\n",
"deepdive load sentences s.tsv\n",
"deepdive sql eval \"SELECT * FROM sentences\" format=json >s.jsonl"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Size"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 26975 836437 31206139 s.tsj\n",
" 26975 836434 24822041 s.tsv\n",
" 26975 593662 34524064 s.jsonl\n",
" 80925 2266533 90552244 total\n"
]
}
],
"source": [
"!wc s.{tsj,tsv,jsonl}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All ~27k rows and ~30MiB. tsj is 1.257x the size of tsv and jsonl is 1.391x.\n", | |
"\n", | |
"Interestingly jsonl has the least number of word count, which suggests it uses almost no space, and it must be in the most compact form." | |
] | |
}, | |
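{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check on those ratios, here's a minimal sketch that recomputes the byte counts with `wc -c` and divides each by the tsv size (nothing beyond the files already downloaded above is assumed):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"!wc -c s.{tsv,tsj,jsonl} | awk '$2 != \"total\" { size[$2] = $1 } END { for (f in size) printf \"%-8s %9d bytes  %.3fx the size of tsv\\n\", f, size[f], size[f] / size[\"s.tsv\"] }'"
]
},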
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Throughput"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Shorthands\n",
"First, let's define shorthands for repeating any of the 30MiB data 1000 times to use it as a source for such format and a dummy sink that just reads and throws away input for 10 seconds." | |
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Writing datasource\n"
]
}
],
"source": [
"%%file datasource\n",
"ext=$1\n",
"echo >&2 \"# Using s.$ext\"\n",
"exec timeout 10s cat $(for i in $(seq 1000); do echo s.$ext; done)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Writing datasink\n"
]
}
],
"source": [
"%%file datasink\n",
"exec timeout 10s cat >/dev/null"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"!chmod +x datasource datasink"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let's make sure the data source/sink aren't bottlenecks themselves."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# Using s.tsj\n",
"10.3GiB 0:00:10 [1.03GiB/s]\n",
"cat: write error: Broken pipe\n",
"# Using s.tsv\n",
"10.3GiB 0:00:10 [1.03GiB/s]\n",
"# Using s.jsonl\n",
"10.4GiB 0:00:10 [1.04GiB/s]\n"
]
}
],
"source": [
"!./datasource tsj | pv --bytes --timer --average-rate | ./datasink\n",
"!./datasource tsv | pv --bytes --timer --average-rate | ./datasink\n",
"!./datasource jsonl | pv --bytes --timer --average-rate | ./datasink"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Good. all show ~1GiB/s." | |
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Scan workload\n",
"\n",
"Let's then see how many lines are actually being processed."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# Using s.tsj\n",
"5.73M 0:00:10 [ 573k/s]\n",
"# Using s.tsv\n",
"7.11M 0:00:10 [ 710k/s]\n",
"# Using s.jsonl\n",
"5.13M 0:00:10 [ 512k/s]\n"
]
}
],
"source": [
"!./datasource tsj | pv --line-mode --bytes --timer --average-rate | ./datasink\n",
"!./datasource tsv | pv --line-mode --bytes --timer --average-rate | ./datasink\n",
"!./datasource jsonl | pv --line-mode --bytes --timer --average-rate | ./datasink"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These goodput number reflects the size overhead.\n", | |
"\n", | |
"* tsv is ~1.2x faster than tsj\n", | |
"* tsv is ~1.4x faster than jsonl.\n", | |
"* tsj is ~1.1x faster than jsonl." | |
] | |
}, | |
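{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is a trivial arithmetic check; the line rates (710k/s, 573k/s, 512k/s) are copied by hand from the `pv` outputs above, so re-running the measurements may shift the ratios slightly:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"!awk 'BEGIN { tsv = 710; tsj = 573; jsonl = 512; printf \"tsv/tsj   = %.2fx\\ntsv/jsonl = %.2fx\\ntsj/jsonl = %.2fx\\n\", tsv/tsj, tsv/jsonl, tsj/jsonl }'"
]
},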
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Projection workload\n",
"\n",
"Now, let's see what happens when we actually want to look at the data and take some fields out."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Spoiler: On average, jsonl has to pay the cost of becoming 20-25x slower to take a few columns out from the original data, whereas other two formats permit this without any parsing." | |
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Projecting just the second and eighth `sentence_index` and `doc_offsets` columns:" | |
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# Using s.tsj\n",
"3.61M 0:00:10 [ 360k/s]\n",
"# Using s.tsv\n",
"4.54M 0:00:10 [ 453k/s]\n",
"# Using s.jsonl\n",
" 187k 0:00:10 [18.7k/s]\n"
]
}
],
"source": [
"!./datasource tsj | pv --line-mode --bytes --timer --average-rate | cut -f2,8 2>/dev/null | ./datasink\n",
"!./datasource tsv | pv --line-mode --bytes --timer --average-rate | cut -f2,8 2>/dev/null | ./datasink\n",
"!./datasource jsonl | pv --line-mode --bytes --timer --average-rate | jq -c '{sentence_index:.sentence_index, doc_offsets:.doc_offsets}' 2>/dev/null | ./datasink"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Projecting just the first `doc_id` column:" | |
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# Using s.tsj\n",
"4.03M 0:00:10 [ 402k/s]\n",
"# Using s.tsv\n",
" 5.1M 0:00:10 [ 509k/s]\n",
"# Using s.jsonl\n",
" 214k 0:00:10 [21.4k/s]\n"
]
}
],
"source": [
"!./datasource tsj | pv --line-mode --bytes --timer --average-rate | cut -f1 2>/dev/null | ./datasink\n",
"!./datasource tsv | pv --line-mode --bytes --timer --average-rate | cut -f1 2>/dev/null | ./datasink\n",
"!./datasource jsonl | pv --line-mode --bytes --timer --average-rate | jq -c '{doc_id:.doc_id}' 2>/dev/null | ./datasink"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Projecting just the `sentence_index` column:" | |
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# Using s.tsj\n",
"4.06M 0:00:10 [ 405k/s]\n",
"# Using s.tsv\n",
"5.12M 0:00:10 [ 511k/s]\n",
"# Using s.jsonl\n",
" 207k 0:00:10 [20.7k/s]\n"
]
}
],
"source": [
"!./datasource tsj | pv --line-mode --bytes --timer --average-rate | cut -f1 2>/dev/null | ./datasink\n", | |
"!./datasource tsv | pv --line-mode --bytes --timer --average-rate | cut -f1 2>/dev/null | ./datasink\n", | |
"!./datasource jsonl | pv --line-mode --bytes --timer --average-rate | jq -c '{sentence_index:.sentence_index}' 2>/dev/null | ./datasink" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Now, let's observe how the length of field names used in jsonl affects the throughput. You can see the throughput increases a few percent every time we shorten or remove the redundant information from every line." | |
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# Using s.jsonl\n",
" 215k 0:00:10 [21.5k/s]\n",
"# Using s.jsonl\n",
" 222k 0:00:10 [22.2k/s]\n",
"# Using s.jsonl\n",
" 226k 0:00:10 [22.6k/s]\n",
"# Using s.jsonl\n",
" 224k 0:00:10 [22.4k/s]\n"
]
}
],
"source": [
"!./datasource jsonl | pv --line-mode --bytes --timer --average-rate | jq -c '{sentence_index:.sentence_index}' 2>/dev/null | ./datasink\n",
"!./datasource jsonl | pv --line-mode --bytes --timer --average-rate | jq -c '{x:.sentence_index}' 2>/dev/null | ./datasink\n",
"!./datasource jsonl | pv --line-mode --bytes --timer --average-rate | jq -c '.sentence_index' 2>/dev/null | ./datasink\n",
"!./datasource jsonl | pv --line-mode --bytes --timer --average-rate | jq -c '[.sentence_index]' 2>/dev/null | ./datasink"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tentative Conclusion\n",
"\n",
"tsj certainly shares the JSON overhead jsonl has, and ~1.3x slower than tsv when the data holds a lot of structured values, e.g., text arrays.\n", | |
"The workloads tested in this notebook is too simplistic, and we must see how the PG TSV parsing compares to a typical JSON parser to get a more realistic number.\n", | |
"However, one thing very clear from these runs is that tsj is a superior serialization format for large datasets with fixed schema than jsonl.\n", | |
"jsonl may still be very useful for storing less structured data for its schemalessness." | |
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TODOs\n",
"\n",
"* Compare a workload involving array element access, which requires full PG TSV parsing." | |
]
},
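{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a rough, untested sketch of such a workload: pulling the first element of the `tokens` array (the 4th column per the table definition above). In tsj the column is itself a JSON array, so `cut` plus `jq` suffices; in jsonl, `jq` must parse the whole object first. The tsv pipeline below is only a naive approximation that strips the braces of the PG array text and splits on commas, so it breaks on quoted or escaped elements; that shortcoming is exactly why a fair comparison needs full PG TSV parsing:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"!./datasource tsj | pv --line-mode --bytes --timer --average-rate | cut -f4 2>/dev/null | jq -c '.[0]' 2>/dev/null | ./datasink\n",
"!./datasource tsv | pv --line-mode --bytes --timer --average-rate | cut -f4 2>/dev/null | sed -e 's/^{//' -e 's/}$//' -e 's/,.*$//' | ./datasink # naive: breaks on quoted/escaped array elements\n",
"!./datasource jsonl | pv --line-mode --bytes --timer --average-rate | jq -c '.tokens[0]' 2>/dev/null | ./datasink"
]
},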
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 0
} |