@mrocklin
Last active October 22, 2023 13:57
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "f93cd8a9-4fb9-45a4-9096-455b1120c049",
"metadata": {},
"outputs": [],
"source": [
"import pyarrow as pa\n",
"import pyarrow.parquet as pq\n",
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"id": "27d172d4-0aeb-456b-bd01-4b21960721a6",
"metadata": {},
"source": [
"### Create Dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fb0e955e-baf8-4274-8298-3872ba33b2e0",
"metadata": {},
"outputs": [],
"source": [
"# 1,000,000 rows x 100 integer columns (about 800 MB in memory if the dtype is int64)\n",
"x = np.random.randint(0, 100000, size=(1000000, 100))\n",
"df = pd.DataFrame(x)\n",
"t = pa.Table.from_pandas(df)"
]
},
{
"cell_type": "markdown",
"id": "1f636cf9-3604-4b01-bba2-96a4aa5e4cb8",
"metadata": {},
"source": [
"### Write to parquet locally"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1dac9b6f-6318-4aa8-afd4-ad37d8ad4ac6",
"metadata": {},
"outputs": [],
"source": [
"pq.write_table(t, \"foo.parquet\")"
]
},
{
"cell_type": "markdown",
"id": "08e03332-f664-49aa-abea-31f1ae528eff",
"metadata": {},
"source": [
"### Time Disk Speeds"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "442c9c4b-6dfe-4abb-be78-001b225f4a7a",
"metadata": {},
"outputs": [],
"source": [
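"import time\n",
"\n",
"# Time a raw read of the whole file as a baseline for disk bandwidth.\n",
"start = time.perf_counter()\n",
"with open(\"foo.parquet\", mode=\"rb\") as f:\n",
"    bytes = f.read()  # note: shadows the builtin; reused by the in-memory cell\n",
"    nbytes = len(bytes)\n",
"\n",
"stop = time.perf_counter()\n",
"\n",
"print(\"Disk Bandwidth:\", int(nbytes / (stop - start) / 2**20), \"MiB/s\")"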
]
},
{
"cell_type": "markdown",
"id": "d11c4c98-b970-4aa9-9b23-937c0c39fb09",
"metadata": {},
"source": [
"### Time Arrow Parquet Speeds"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0e974144-7119-4a99-b02d-113b3750407a",
"metadata": {},
"outputs": [],
"source": [
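"# Time a full Parquet read and decode; nbytes is the file size from the disk-timing cell.\n",
"start = time.perf_counter()\n",
"_ = pq.read_table(\"foo.parquet\")\n",
"stop = time.perf_counter()\n",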
"\n",
"print(\"PyArrow Read Bandwidth:\", int(nbytes / (stop - start) / 2**20), \"MiB/s\")"
]
},
{
"cell_type": "markdown",
"id": "843fa006-f0da-4d8f-a492-d8607ad19173",
"metadata": {},
"source": [
"### Pure-in-memory reading"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2d6af90f-953b-4842-b87f-ca7b36804675",
"metadata": {},
"outputs": [],
"source": [
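"import io\n",
"\n",
"# Decode from bytes already held in memory, which takes disk I/O out of the measurement.\n",
"start = time.perf_counter()\n",
"_ = pq.read_table(io.BytesIO(bytes))\n",
"stop = time.perf_counter()\n",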
"\n",
"print(\"PyArrow In-Memory Bandwidth:\", int(nbytes / (stop - start) / 2**20), \"MiB/s\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:test-env]",
"language": "python",
"name": "conda-env-test-env-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
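
The cells above all report bandwidth as `nbytes / elapsed / 2**20` from a single timed run. As a rough sketch of how that measurement could be made a little more repeatable, the block below factors the formula into a helper and keeps the best of several runs. The names `bandwidth_mib_s`, `best_time`, and `read_file` are hypothetical and not part of the notebook, and the payload is random bytes in a temporary file rather than `foo.parquet`, so the printed figures say nothing about Parquet decoding itself.

```python
import io
import os
import tempfile
import time


def bandwidth_mib_s(nbytes, seconds):
    """Convert a byte count and an elapsed time into MiB/s."""
    return nbytes / seconds / 2**20


def best_time(fn, repeats=3):
    """Call fn() `repeats` times and return the shortest wall-clock time."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def read_file(path):
    """Read a whole file into memory, as the disk-timing cell does."""
    with open(path, mode="rb") as f:
        return f.read()


# Stand-in payload (an assumption): 8 MiB of random bytes, not the Parquet file.
payload = os.urandom(8 * 2**20)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    disk_s = best_time(lambda: read_file(path))
    memory_s = best_time(lambda: io.BytesIO(payload).read())
finally:
    os.remove(path)

nbytes = len(payload)
print("Disk Bandwidth:", int(bandwidth_mib_s(nbytes, disk_s)), "MiB/s")
print("In-Memory Bandwidth:", int(bandwidth_mib_s(nbytes, memory_s)), "MiB/s")
```

In the notebook, the same helper could presumably wrap `lambda: pq.read_table("foo.parquet")`. One caveat applies to both this sketch and the cells above: a file that was just written is likely still in the operating system's page cache, so the "disk" figure may reflect cached reads rather than the storage device.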