{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# A quick comparison of serialization formats\n",
"\n",
"* Tab-separated values or [PostgreSQL Text format](https://www.postgresql.org/docs/current/static/sql-copy.html#AEN74386)\n",
"* Tab-separated JSONs format proposed by DeepDive\n",
"* One JSON object per line format\n",
"\n",
"This was motivated by [DeepDive Pull Request #565](https://github.com/HazyResearch/deepdive/pull/565).\n",
"\n",
"You'll need `pv`, `timeout` in GNU coreutils, `curl`, and `wc` to reproduce these outputs."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[4;33mWarning\u001b[0m: pv-1.6.0 already installed\r\n",
"\u001b[4;33mWarning\u001b[0m: coreutils-8.25 already installed\r\n"
]
}
],
"source": [
"!brew install pv coreutils"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Prep\n",
"Let's grab a small dataset in each format. DeepDive spouse example has a dump of its sentences table that contains 1000 sentences with their NLP markups at token level." | |
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" % Total % Received % Xferd Average Speed Time Time Time Current\n",
" Dload Upload Total Spent Left Speed\n",
"100 186 100 186 0 0 204 0 --:--:-- --:--:-- --:--:-- 204\n",
"100 5580k 100 5580k 0 0 1562k 0 0:00:03 0:00:03 --:--:-- 2607k\n",
" % Total % Received % Xferd Average Speed Time Time Time Current\n",
" Dload Upload Total Spent Left Speed\n",
"100 186 100 186 0 0 240 0 --:--:-- --:--:-- --:--:-- 240\n",
"100 5730k 100 5730k 0 0 1834k 0 0:00:03 0:00:03 --:--:-- 3807k\n",
" % Total % Received % Xferd Average Speed Time Time Time Current\n",
" Dload Upload Total Spent Left Speed\n",
"100 5809k 100 5809k 0 0 3754k 0 0:00:01 0:00:01 --:--:-- 3753k\n"
]
}
],
"source": [
"!curl -fSL https://github.com/HazyResearch/deepdive/raw/tab-separated-jsons/examples/spouse/input/sentences-1000.tsv.bz2 | bunzip2 >s.tsv\n", | |
"!curl -fSL https://github.com/HazyResearch/deepdive/raw/tab-separated-jsons/examples/spouse/input/sentences-1000.tsj.bz2 | bunzip2 >s.tsj\n", | |
"!curl -fSL http://i.stanford.edu/~netj/data/sentences-1000.jsonl.bz2 | bunzip2 >s.jsonl" | |
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Getting equivalent JSON lines (jsonl) is a bit more tedious as DeepDive doesn't support it off-the-shelf. The following script should work if you have a running PostgreSQL:" | |
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"! [[ -e s.jsonl ]] || exit # XXX comment out this line to actually run the steps below\n",
"\n",
"export DEEPDIVE_DB_URL=postgresql:///tsj-tsv-jsonl\n",
"deepdive db init\n",
"deepdive create table sentences \\\n",
" doc_id:TEXT \\\n",
" sentence_index:INT \\\n",
" sentence_text:TEXT \\\n",
" tokens:TEXT[] \\\n",
" lemmas:TEXT[] \\\n",
" pos_tags:TEXT[] \\\n",
" ner_tags:TEXT[] \\\n",
" doc_offsets:INT[] \\\n",
" dep_types:TEXT[] \\\n",
" dep_tokens:INT[] \\\n",
" #\n",
"deepdive load sentences s.tsv\n",
"deepdive sql eval \"SELECT * FROM sentences\" format=json >s.jsonl"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Size"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 26975 836437 31206139 s.tsj\n",
" 26975 836434 24822041 s.tsv\n",
" 26975 593662 34524064 s.jsonl\n",
" 80925 2266533 90552244 total\n"
]
}
],
"source": [
"!wc s.{tsj,tsv,jsonl}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All ~27k rows and ~30MiB. tsj is 1.257x the size of tsv and jsonl is 1.391x.\n", | |
"\n", | |
"Interestingly jsonl has the least number of word count, which suggests it uses almost no space, and it must be in the most compact form." | |
] | |
}, | |
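{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check on those ratios, here's a minimal sketch that recomputes the byte counts with `wc -c` and divides each by the tsv size (nothing beyond the files already downloaded above is assumed):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"!wc -c s.{tsv,tsj,jsonl} | awk '$2 != \"total\" { size[$2] = $1 } END { for (f in size) printf \"%-8s %9d bytes  %.3fx the size of tsv\\n\", f, size[f], size[f] / size[\"s.tsv\"] }'"
]
},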
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Throughput"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Shorthands\n",
"First, let's define shorthands for repeating any of the 30MiB data 1000 times to use it as a source for such format and a dummy sink that just reads and throws away input for 10 seconds." | |
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Writing datasource\n"
]
}
],
"source": [
"%%file datasource\n",
"ext=$1\n",
"echo >&2 \"# Using s.$ext\"\n",
"exec timeout 10s cat $(for i in $(seq 1000); do echo s.$ext; done)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Writing datasink\n"
]
}
],
"source": [
"%%file datasink\n",
"exec timeout 10s cat >/dev/null"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"!chmod +x datasource datasink"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let's make sure the data source/sink aren't bottlenecks themselves."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# Using s.tsj\n",
"10.3GiB 0:00:10 [1.03GiB/s]\n",
"cat: write error: Broken pipe\n",
"# Using s.tsv\n",
"10.3GiB 0:00:10 [1.03GiB/s]\n",
"# Using s.jsonl\n",
"10.4GiB 0:00:10 [1.04GiB/s]\n"
]
}
],
"source": [
"!./datasource tsj | pv --bytes --timer --average-rate | ./datasink\n",
"!./datasource tsv | pv --bytes --timer --average-rate | ./datasink\n",
"!./datasource jsonl | pv --bytes --timer --average-rate | ./datasink"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Good. all show ~1GiB/s." | |
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Scan workload\n",
"\n",
"Let's then see how many lines are actually being processed."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# Using s.tsj\n",
"5.73M 0:00:10 [ 573k/s]\n",
"# Using s.tsv\n",
"7.11M 0:00:10 [ 710k/s]\n",
"# Using s.jsonl\n",
"5.13M 0:00:10 [ 512k/s]\n"
]
}
],
"source": [
"!./datasource tsj | pv --line-mode --bytes --timer --average-rate | ./datasink\n",
"!./datasource tsv | pv --line-mode --bytes --timer --average-rate | ./datasink\n",
"!./datasource jsonl | pv --line-mode --bytes --timer --average-rate | ./datasink"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These goodput number reflects the size overhead.\n", | |
"\n", | |
"* tsv is ~1.2x faster than tsj\n", | |
"* tsv is ~1.4x faster than jsonl.\n", | |
"* tsj is ~1.1x faster than jsonl." | |
] | |
}, | |
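{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is a trivial arithmetic check; the line rates (710k/s, 573k/s, 512k/s) are copied by hand from the `pv` outputs above, so re-running the measurements may shift the ratios slightly:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"!awk 'BEGIN { tsv = 710; tsj = 573; jsonl = 512; printf \"tsv/tsj   = %.2fx\\ntsv/jsonl = %.2fx\\ntsj/jsonl = %.2fx\\n\", tsv/tsj, tsv/jsonl, tsj/jsonl }'"
]
},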
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Projection workload\n",
"\n",
"Now, let's see what happens when we actually want to look at the data and take some fields out."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Spoiler: On average, jsonl has to pay the cost of becoming 20-25x slower to take a few columns out from the original data, whereas other two formats permit this without any parsing." | |
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Projecting just the second and eighth `sentence_index` and `doc_offsets` columns:" | |
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# Using s.tsj\n",
"3.61M 0:00:10 [ 360k/s]\n",
"# Using s.tsv\n",
"4.54M 0:00:10 [ 453k/s]\n",
"# Using s.jsonl\n",
" 187k 0:00:10 [18.7k/s]\n"
]
}
],
"source": [
"!./datasource tsj | pv --line-mode --bytes --timer --average-rate | cut -f2,8 2>/dev/null | ./datasink\n",
"!./datasource tsv | pv --line-mode --bytes --timer --average-rate | cut -f2,8 2>/dev/null | ./datasink\n",
"!./datasource jsonl | pv --line-mode --bytes --timer --average-rate | jq -c '{sentence_index:.sentence_index, doc_offsets:.doc_offsets}' 2>/dev/null | ./datasink"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Projecting just the first `doc_id` column:" | |
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# Using s.tsj\n",
"4.03M 0:00:10 [ 402k/s]\n",
"# Using s.tsv\n",
" 5.1M 0:00:10 [ 509k/s]\n",
"# Using s.jsonl\n",
" 214k 0:00:10 [21.4k/s]\n"
]
}
],
"source": [
"!./datasource tsj | pv --line-mode --bytes --timer --average-rate | cut -f1 2>/dev/null | ./datasink\n",
"!./datasource tsv | pv --line-mode --bytes --timer --average-rate | cut -f1 2>/dev/null | ./datasink\n",
"!./datasource jsonl | pv --line-mode --bytes --timer --average-rate | jq -c '{doc_id:.doc_id}' 2>/dev/null | ./datasink"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Projecting just the `sentence_index` column:" | |
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# Using s.tsj\n",
"4.06M 0:00:10 [ 405k/s]\n",
"# Using s.tsv\n",
"5.12M 0:00:10 [ 511k/s]\n",
"# Using s.jsonl\n",
" 207k 0:00:10 [20.7k/s]\n"
]
}
],
"source": [
"!./datasource tsj | pv --line-mode --bytes --timer --average-rate | cut -f1 2>/dev/null | ./datasink\n", | |
"!./datasource tsv | pv --line-mode --bytes --timer --average-rate | cut -f1 2>/dev/null | ./datasink\n", | |
"!./datasource jsonl | pv --line-mode --bytes --timer --average-rate | jq -c '{sentence_index:.sentence_index}' 2>/dev/null | ./datasink" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Now, let's observe how the length of field names used in jsonl affects the throughput. You can see the throughput increases a few percent every time we shorten or remove the redundant information from every line." | |
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# Using s.jsonl\n",
" 215k 0:00:10 [21.5k/s]\n",
"# Using s.jsonl\n",
" 222k 0:00:10 [22.2k/s]\n",
"# Using s.jsonl\n",
" 226k 0:00:10 [22.6k/s]\n",
"# Using s.jsonl\n",
" 224k 0:00:10 [22.4k/s]\n"
]
}
],
"source": [
"!./datasource jsonl | pv --line-mode --bytes --timer --average-rate | jq -c '{sentence_index:.sentence_index}' 2>/dev/null | ./datasink\n",
"!./datasource jsonl | pv --line-mode --bytes --timer --average-rate | jq -c '{x:.sentence_index}' 2>/dev/null | ./datasink\n",
"!./datasource jsonl | pv --line-mode --bytes --timer --average-rate | jq -c '.sentence_index' 2>/dev/null | ./datasink\n",
"!./datasource jsonl | pv --line-mode --bytes --timer --average-rate | jq -c '[.sentence_index]' 2>/dev/null | ./datasink"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tentative Conclusion\n",
"\n",
"tsj certainly shares the JSON overhead jsonl has, and ~1.3x slower than tsv when the data holds a lot of structured values, e.g., text arrays.\n", | |
"The workloads tested in this notebook is too simplistic, and we must see how the PG TSV parsing compares to a typical JSON parser to get a more realistic number.\n", | |
"However, one thing very clear from these runs is that tsj is a superior serialization format for large datasets with fixed schema than jsonl.\n", | |
"jsonl may still be very useful for storing less structured data for its schemalessness." | |
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TODOs\n",
"\n",
"* Compare a workload involving array element access, which requires full PG TSV parsing." | |
]
},
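{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a rough, untested sketch of such a workload: pulling the first element of the `tokens` array (the 4th column per the table definition above). In tsj the column is itself a JSON array, so `cut` plus `jq` suffices; in jsonl, `jq` must parse the whole object first. The tsv pipeline below is only a naive approximation that strips the braces of the PG array text and splits on commas, so it breaks on quoted or escaped elements; that shortcoming is exactly why a fair comparison needs full PG TSV parsing:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"!./datasource tsj | pv --line-mode --bytes --timer --average-rate | cut -f4 2>/dev/null | jq -c '.[0]' 2>/dev/null | ./datasink\n",
"!./datasource tsv | pv --line-mode --bytes --timer --average-rate | cut -f4 2>/dev/null | sed -e 's/^{//' -e 's/}$//' -e 's/,.*$//' | ./datasink # naive: breaks on quoted/escaped array elements\n",
"!./datasource jsonl | pv --line-mode --bytes --timer --average-rate | jq -c '.tokens[0]' 2>/dev/null | ./datasink"
]
},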
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 0
} |