Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Save nickdelgrosso/73c95d037856f1b2972723d1cc97fdff to your computer and use it in GitHub Desktop.
Save nickdelgrosso/73c95d037856f1b2972723d1cc97fdff to your computer and use it in GitHub Desktop.
Demo for a few ways to Save DataFrames to HDF5
{
"cells": [
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import h5py"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>time</th>\n",
" <th>channel</th>\n",
" <th>temp</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Chan1</td>\n",
" <td>18.2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>Chan1</td>\n",
" <td>19.1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>Chan1</td>\n",
" <td>19.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>Chan2</td>\n",
" <td>14.3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" time channel temp\n",
"0 1 Chan1 18.2\n",
"1 2 Chan1 19.1\n",
"2 3 Chan1 19.4\n",
"3 1 Chan2 14.3"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame({'time': [1, 2, 3, 1], 'channel': [\"Chan1\", \"Chan1\", \"Chan1\", \"Chan2\"], \"temp\": [18.2, 19.1, 19.4, 14.3]})\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Writing with Pandas/PyTables\n",
"\n",
"https://myhdf5.hdfgroup.org/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### `pd.to_hdf()`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.to_hdf(\"data/pd_to_hdf.h5\", key=\"/temp\", mode=\"w\") # Make a new file in PyTables format, with the table in the \"/temp\" group\n",
"df.to_hdf(\"data/pd_to_hdf.h5\", key=\"/temp2\", mode=\"a\", format='fixed') # Append to the file in \"Fixed\", putting a table in th \"/temp2\" group"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reading it back in in Pandas:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>time</th>\n",
" <th>channel</th>\n",
" <th>temp</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Chan1</td>\n",
" <td>18.2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>Chan1</td>\n",
" <td>19.1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>Chan1</td>\n",
" <td>19.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>Chan2</td>\n",
" <td>14.3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" time channel temp\n",
"0 1 Chan1 18.2\n",
"1 2 Chan1 19.1\n",
"2 3 Chan1 19.4\n",
"3 1 Chan2 14.3"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_hdf(\"data/pd_to_hdf.h5\", key=\"/temp\")"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>time</th>\n",
" <th>channel</th>\n",
" <th>temp</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Chan1</td>\n",
" <td>18.2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>Chan1</td>\n",
" <td>19.1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>Chan1</td>\n",
" <td>19.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>Chan2</td>\n",
" <td>14.3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" time channel temp\n",
"0 1 Chan1 18.2\n",
"1 2 Chan1 19.1\n",
"2 3 Chan1 19.4\n",
"3 1 Chan2 14.3"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_hdf(\"data/pd_to_hdf.h5\", key=\"/temp2\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pandas also has an `HDFStore` class, which can give you a *little* more control. Here's the docs: https://pandas.pydata.org/docs/reference/io.html#hdfstore-pytables-hdf5"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Writing with Numpy & H5Py\n",
"\n",
"This is what I'd recommend as a general way to approach storing tables. The advantage is that it stores the data in a very simple way (a 2D array, all in one dataet), which keeps the data very easy to use no matter what tool you are using to analyze it (as opposed to the PyTables format, which can get quite complex). \n",
"\n",
"The secret is to convert the dataframe to a NumPy records array first, before saving (shown below)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Helpful tip 1: `dt.to_records()`"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"rec.array([(0, 1, 'Chan1', 18.2), (1, 2, 'Chan1', 19.1),\n",
" (2, 3, 'Chan1', 19.4), (3, 1, 'Chan2', 14.3)],\n",
" dtype=[('index', '<i8'), ('time', '<i8'), ('channel', 'O'), ('temp', '<f8')])"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.to_records()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Helpful tip 2: because HDF5 doesn't natively support unicode, one simple approach is to convert strings to `bytes`"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>time</th>\n",
" <th>channel</th>\n",
" <th>temp</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>b'Chan1'</td>\n",
" <td>18.2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>b'Chan1'</td>\n",
" <td>19.1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>b'Chan1'</td>\n",
" <td>19.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>b'Chan2'</td>\n",
" <td>14.3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" time channel temp\n",
"0 1 b'Chan1' 18.2\n",
"1 2 b'Chan1' 19.1\n",
"2 3 b'Chan1' 19.4\n",
"3 1 b'Chan2' 14.3"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.astype({'channel': bytes})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All of it together:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"with h5py.File(\"data/np_to_records.h5\", 'w') as f:\n",
" f['/temp'] = df.astype({'channel': bytes}).to_records(index=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>time</th>\n",
" <th>channel</th>\n",
" <th>temp</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Chan1</td>\n",
" <td>18.2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>Chan1</td>\n",
" <td>19.1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>Chan1</td>\n",
" <td>19.4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>Chan2</td>\n",
" <td>14.3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" time channel temp\n",
"0 1 Chan1 18.2\n",
"1 2 Chan1 19.1\n",
"2 3 Chan1 19.4\n",
"3 1 Chan2 14.3"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_hdf(\"data/np_to_records.h5\", \"/temp\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "default",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment