Last active
April 20, 2022 02:09
Argo-Awkward Array demo
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"id": "b7d1be8e-7d8a-4b9e-a4cb-37c25a9cf03e", | |
"metadata": {}, | |
"source": [ | |
"Install Awkward Array with\n", | |
"\n", | |
"```bash\n", | |
"pip install 'awkward>=1.9.0rc2'\n", | |
"```\n", | |
"\n", | |
"We'll be using the development version, 2.0, which ships as the `awkward._v2` submodule of 1.9.0rc2." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"id": "2fe4578e-ee28-414c-8ba7-0bccff23fa5e", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import numpy as np\n", | |
"import awkward._v2 as ak" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "d0c27481-c95f-4e20-9b90-d579b829e9d8", | |
"metadata": {}, | |
"source": [ | |
"This file has all of the Argo data from 1997 through 2021 inclusive.\n", | |
"\n", | |
" * It is 7.0 GB.\n", | |
" * The equivalent NetCDF is 135 GB.\n", | |
" * Uncompressed, the expert-level data is 118 GB, and the subset that is standard-level data is 42 GB.\n", | |
"\n", | |
"We'll be opening just the first row group (10000 sets of levels) to explore the data." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"id": "447a4186-e863-47bc-8ff7-e6d611dc9b46", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"expert_fields = ak.from_parquet(\"s3://pivarski-princeton/argo-floats-expert.parquet\", row_groups=[0])" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "a5f05065-faf0-43c1-9845-f40002f6a53a", | |
"metadata": {}, | |
"source": [ | |
"The data consist of nested record structures and variable-length lists.\n", | |
"\n", | |
"`show` prints as much as will fit into 20 lines and 80 characters." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"id": "65d4fb8f-dfa7-4dca-a7c7-b216df82fc73", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[{latitude: -0.126, longitude: -11.9, time: 1997-07-28T20:26:20.000000, ...},\n", | |
" {latitude: 0.267, longitude: -16, time: 1997-07-29T20:03:00.000000, ...},\n", | |
" {latitude: 0.236, longitude: -19.7, time: 1997-07-30T14:45:11.000000, ...},\n", | |
" {latitude: -0.402, longitude: -27, time: 1997-08-01T07:59:00.000000, ...},\n", | |
" {latitude: 0.429, longitude: -34.2, time: 1997-08-02T15:53:47.000000, ...},\n", | |
" {latitude: 0.497, longitude: -37.1, time: 1997-08-03T17:23:13.000000, ...},\n", | |
" {latitude: -0.511, longitude: -36.7, time: 1997-08-03T03:01:01.000000, ...},\n", | |
" {latitude: 0.4, longitude: -40.2, time: 1997-08-04T06:10:56.000000, ...},\n", | |
" {latitude: 0.072, longitude: -17.7, time: 1997-08-09T19:21:12.000000, ...},\n", | |
" {latitude: -0.035, longitude: -13.8, time: 1997-08-09T01:52:41.000000, ...},\n", | |
" ...,\n", | |
" {latitude: 4.72, longitude: -41.3, time: 2002-07-06T21:56:43.000000, ...},\n", | |
" {latitude: 43.5, longitude: -45.8, time: 2002-07-06T21:45:24.000000, ...},\n", | |
" {latitude: 6.33, longitude: -27.4, time: 2002-07-06T20:25:41.000000, ...},\n", | |
" {latitude: 2.68, longitude: -34.2, time: 2002-07-06T18:04:48.000000, ...},\n", | |
" {latitude: 26.6, longitude: -68.8, time: 2002-07-06T15:53:00.000000, ...},\n", | |
" {latitude: 3.81, longitude: -31.2, time: 2002-07-06T15:05:40.000000, ...},\n", | |
" {latitude: 35, longitude: -32, time: 2002-07-06T14:09:28.000000, ...},\n", | |
" {latitude: 24.1, longitude: -49.7, time: 2002-07-06T09:57:21.000000, ...},\n", | |
" {latitude: 32.1, longitude: -19.6, time: 2002-07-06T08:14:23.000000, ...}]\n" | |
] | |
} | |
], | |
"source": [ | |
"expert_fields.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "4278ae80-9b69-484e-9ebf-34e5393f19f5", | |
"metadata": {}, | |
"source": [ | |
"The data's `type` also has a `show`, which reveals the structure without any values.\n", | |
"\n", | |
"The syntax is [datashape](https://datashape.readthedocs.io/en/latest/); the `10000 *` means an array of fixed length and the `var *` means lists of variable lengths." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"id": "391ff890-1530-4da9-93c7-03d339ea53b2", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"10000 * {\n", | |
" latitude: float64,\n", | |
" longitude: float64,\n", | |
" time: datetime64[us],\n", | |
" levels: var * {\n", | |
" pres: float32,\n", | |
" pres_adjusted: float32,\n", | |
" pres_adjusted_error: float32,\n", | |
" pres_adjusted_qc: string,\n", | |
" pres_qc: string,\n", | |
" psal: float32,\n", | |
" psal_adjusted: float32,\n", | |
" psal_adjusted_error: float32,\n", | |
" psal_adjusted_qc: string,\n", | |
" psal_qc: string,\n", | |
" temp: float32,\n", | |
" temp_adjusted: float32,\n", | |
" temp_adjusted_error: float32,\n", | |
" temp_adjusted_qc: string,\n", | |
" temp_qc: string\n", | |
" },\n", | |
" config_mission_number: int32,\n", | |
" cycle_number: int32,\n", | |
" data_centre: string,\n", | |
" data_mode: string,\n", | |
" data_state_indicator: string,\n", | |
" dc_reference: string,\n", | |
" direction: string,\n", | |
" firmware_version: string,\n", | |
" float_serial_no: string,\n", | |
" pi_name: string,\n", | |
" platform_number: int32,\n", | |
" platform_type: string,\n", | |
" positioning_system: string,\n", | |
" position_qc: string,\n", | |
" profile_pres_qc: string,\n", | |
" profile_psal_qc: string,\n", | |
" profile_temp_qc: string,\n", | |
" project_name: string,\n", | |
" time_location: datetime64[us],\n", | |
" time_qc: string,\n", | |
" vertical_sampling_scheme: string,\n", | |
" wmo_inst_type: int16\n", | |
"}\n" | |
] | |
} | |
], | |
"source": [ | |
"expert_fields.type.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "2b694b9e-092e-498b-b9ab-10e6cdd16997", | |
"metadata": {}, | |
"source": [ | |
"Find a few sets of levels that are small enough to print out in full." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"id": "a1722a9c-a58a-45b3-b584-7010884a8422", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(array([573, 632]),)" | |
] | |
}, | |
"execution_count": 5, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"np.nonzero(ak.num(expert_fields.levels) == 2)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"id": "78714285-558c-48b0-9ade-fb545175cd12", | |
"metadata": { | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"{'latitude': 43.134,\n", | |
" 'longitude': -32.227,\n", | |
" 'time': datetime.datetime(1998, 8, 9, 2, 26, 20),\n", | |
" 'levels': [{'pres': 1343.199951171875,\n", | |
" 'pres_adjusted': 1343.199951171875,\n", | |
" 'pres_adjusted_error': 2.4000000953674316,\n", | |
" 'pres_adjusted_qc': '1',\n", | |
" 'pres_qc': '1',\n", | |
" 'psal': 34.94169998168945,\n", | |
" 'psal_adjusted': nan,\n", | |
" 'psal_adjusted_error': nan,\n", | |
" 'psal_adjusted_qc': '4',\n", | |
" 'psal_qc': '4',\n", | |
" 'temp': 4.125,\n", | |
" 'temp_adjusted': nan,\n", | |
" 'temp_adjusted_error': nan,\n", | |
" 'temp_adjusted_qc': '4',\n", | |
" 'temp_qc': '4'},\n", | |
" {'pres': 1358.0999755859375,\n", | |
" 'pres_adjusted': 1358.0999755859375,\n", | |
" 'pres_adjusted_error': 2.4000000953674316,\n", | |
" 'pres_adjusted_qc': '1',\n", | |
" 'pres_qc': '1',\n", | |
" 'psal': 35.01839828491211,\n", | |
" 'psal_adjusted': nan,\n", | |
" 'psal_adjusted_error': nan,\n", | |
" 'psal_adjusted_qc': '4',\n", | |
" 'psal_qc': '4',\n", | |
" 'temp': 3.7119998931884766,\n", | |
" 'temp_adjusted': nan,\n", | |
" 'temp_adjusted_error': nan,\n", | |
" 'temp_adjusted_qc': '4',\n", | |
" 'temp_qc': '4'}],\n", | |
" 'config_mission_number': 1,\n", | |
" 'cycle_number': 6,\n", | |
" 'data_centre': 'IF',\n", | |
" 'data_mode': 'D',\n", | |
" 'data_state_indicator': '2C ',\n", | |
" 'dc_reference': 'fl0173.006',\n", | |
" 'direction': 'A',\n", | |
" 'firmware_version': 'n/a',\n", | |
" 'float_serial_no': '144',\n", | |
" 'pi_name': 'Klaus-Peter KOLTERMANN',\n", | |
" 'platform_number': 69018,\n", | |
" 'platform_type': 'APEX',\n", | |
" 'positioning_system': 'S ',\n", | |
" 'position_qc': '1',\n", | |
" 'profile_pres_qc': 'A',\n", | |
" 'profile_psal_qc': 'F',\n", | |
" 'profile_temp_qc': 'F',\n", | |
" 'project_name': 'Euro-Argo',\n", | |
" 'time_location': datetime.datetime(1998, 8, 9, 5, 5, 59),\n", | |
" 'time_qc': '1',\n", | |
" 'vertical_sampling_scheme': 'Primary sampling: discrete []',\n", | |
" 'wmo_inst_type': 846}" | |
] | |
}, | |
"execution_count": 6, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"expert_fields[573].tolist()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "aa7eda88-f82c-4de5-bf92-5abf1092f144", | |
"metadata": {}, | |
"source": [ | |
"Pull out a few fields and look at them individually." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"id": "438bed35-fb99-45a7-b700-4d9aa8e54c4b", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(<Array [-11.9, -16, -19.7, -27, ..., -32, -49.7, -19.6] type='10000 * float64'>,\n", | |
" <Array [-0.126, 0.267, 0.236, ..., 35, 24.1, 32.1] type='10000 * float64'>,\n", | |
" <Array [1997-07-28T20:26:20.000000, ...] type='10000 * datetime64[us]'>)" | |
] | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"expert_fields.longitude, expert_fields[\"latitude\"], expert_fields.time" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "f65d61da-c758-4224-a0cf-aed3ffbe78e0", | |
"metadata": {}, | |
"source": [ | |
"Look at just the temperature and its quality control.\n", | |
"\n", | |
" * Numbers, slice objects, arrays, etc. slice rows.\n", | |
" * Strings slice columns, even nested columns." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"id": "57b5e018-4c18-4b7e-8f4f-72fa22e50d0f", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[[{temp: 21.8, temp_qc: '1'}, {...}, ..., {...}, {temp: 4.45, temp_qc: '1'}],\n", | |
" [{temp: 22.2, temp_qc: '1'}, {...}, ..., {...}, {temp: 4.45, temp_qc: '1'}],\n", | |
" [{temp: 22.8, temp_qc: '1'}, {...}, ..., {...}, {temp: 5.17, temp_qc: '1'}],\n", | |
" [{temp: 25.1, temp_qc: '1'}, {...}, ..., {...}, {temp: 4.66, temp_qc: '1'}],\n", | |
" [{temp: 26.1, temp_qc: '1'}, {...}, ..., {...}, {temp: 4.54, temp_qc: '1'}],\n", | |
" [{temp: 26.8, temp_qc: '4'}, {temp: ..., ...}, ..., {temp: nan, temp_qc: ' '}],\n", | |
" [{temp: 26.5, temp_qc: '1'}, {...}, ..., {...}, {temp: 4.52, temp_qc: '1'}],\n", | |
" [{temp: 26.4, temp_qc: '1'}, {...}, ..., {...}, {temp: 4.65, temp_qc: '1'}],\n", | |
" [{temp: 23.3, temp_qc: '1'}, {...}, ..., {...}, {temp: 4.47, temp_qc: '1'}],\n", | |
" [{temp: 22.6, temp_qc: '1'}, {temp: ..., ...}, ..., {temp: nan, temp_qc: ' '}],\n", | |
" ...,\n", | |
" [{temp: 28.3, temp_qc: '1'}, {temp: ..., ...}, ..., {temp: nan, temp_qc: ' '}],\n", | |
" [{temp: 15, temp_qc: '1'}, {temp: 15, ...}, ..., {temp: nan, temp_qc: ' '}],\n", | |
" [{temp: 28, temp_qc: '1'}, {temp: 28, ...}, ..., {temp: nan, temp_qc: ' '}],\n", | |
" [{temp: 27.6, temp_qc: '1'}, {temp: ..., ...}, ..., {temp: nan, temp_qc: ' '}],\n", | |
" [{temp: 28.1, temp_qc: '2'}, {temp: 28, ...}, ..., {temp: 5.5, temp_qc: '2'}],\n", | |
" [{temp: 27.8, temp_qc: '1'}, {temp: ..., ...}, ..., {temp: nan, temp_qc: ' '}],\n", | |
" [{temp: 22.3, temp_qc: '1'}, {temp: ..., ...}, ..., {temp: nan, temp_qc: ' '}],\n", | |
" [{temp: 26.4, temp_qc: '1'}, {temp: ..., ...}, ..., {temp: nan, temp_qc: ' '}],\n", | |
" [{temp: 21.3, temp_qc: '1'}, {temp: ..., ...}, ..., {temp: nan, temp_qc: ' '}]]\n" | |
] | |
} | |
], | |
"source": [ | |
"expert_fields[\"levels\", [\"temp\", \"temp_qc\"]].show()" | |
] | |
}, | |
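As a rough mental model of the nested column slice above, here is a plain-Python sketch. It is not how Awkward Array implements slicing; the `profiles` data and the `project` helper are made up for illustration. Projecting nested columns keeps the list structure and drops every other field.

```python
# Hypothetical toy data shaped like the Argo records: one dict per profile,
# each holding a variable-length list of level records.
profiles = [
    {"latitude": -0.126, "levels": [
        {"pres": 15.5, "temp": 21.8, "temp_qc": "1"},
        {"pres": 21.1, "temp": 21.79, "temp_qc": "1"},
    ]},
    {"latitude": 0.267, "levels": [
        {"pres": 14.9, "temp": 22.2, "temp_qc": "4"},
    ]},
]

def project(records, list_field, columns):
    """Keep only `columns` inside each record of `list_field` (illustrative helper)."""
    return [
        [{name: level[name] for name in columns} for level in record[list_field]]
        for record in records
    ]

# Same intent as expert_fields["levels", ["temp", "temp_qc"]], on toy data.
projected = project(profiles, "levels", ["temp", "temp_qc"])
print(projected)
```

The outer list still has one entry per profile and the inner lists keep their original lengths; only the fields inside each level record change.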
{ | |
"cell_type": "markdown", | |
"id": "eec50460-e734-4f4a-9fba-ec26f0ad5f74", | |
"metadata": {}, | |
"source": [ | |
"How many levels are in each set?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"id": "50ea0ec1-a6ed-41df-91d7-1aac8673c8c5", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<Array [102, 112, 109, 105, 107, ..., 369, 369, 369, 369] type='10000 * int64'>" | |
] | |
}, | |
"execution_count": 9, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"ak.num(expert_fields.levels)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "7c517391-b4e7-4ac2-996b-76752656d8b2", | |
"metadata": {}, | |
"source": [ | |
"Which pressures are above 40?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"id": "e5413900-b355-4fbf-935b-89c1b5a777b2", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<Array [[False, False, False, ..., True, True], ...] type='10000 * var * bool'>" | |
] | |
}, | |
"execution_count": 10, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"expert_fields.levels.pres > 40" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "b551da2c-2421-4ec3-96ee-30dfd1929170", | |
"metadata": {}, | |
"source": [ | |
"Which sets of levels have at least one over 40?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"id": "df3eb8ca-b699-40f1-942c-5805a3b07dbb", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<Array [True, True, True, True, ..., True, True, True] type='10000 * bool'>" | |
] | |
}, | |
"execution_count": 11, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"ak.any(expert_fields.levels.pres > 40, axis=-1)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "4ab95885-a3c1-4847-968f-b44d5655e7f4", | |
"metadata": {}, | |
"source": [ | |
"Which salinities pass quality control?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"id": "13813f22-d7ff-483d-b7a3-f4ed6ddcd4b4", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<Array [[False, False, ..., False, False], ...] type='10000 * var * bool'>" | |
] | |
}, | |
"execution_count": 12, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"expert_fields.levels.psal_qc == \"1\"" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "f3b29af7-5a85-4dcc-ae8d-af99243bd452", | |
"metadata": {}, | |
"source": [ | |
"For which sets of levels do all pressures (`pres_qc`) pass quality control?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"id": "d9c5e4b9-7c51-4b59-8d98-d943126b7d6e", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<Array [True, True, True, True, ..., False, False, False] type='10000 * bool'>" | |
] | |
}, | |
"execution_count": 13, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"ak.all(expert_fields.levels.pres_qc == \"1\", axis=-1)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "74872c1e-2fa8-496f-bbc8-35aee6ad029f", | |
"metadata": {}, | |
"source": [ | |
"What's the mean pressure in each set?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"id": "0f7c839f-00b8-45da-8428-de3658c1e575", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<Array [492, 499, 501, 510, ..., nan, nan, nan, nan] type='10000 * ?float64'>" | |
] | |
}, | |
"execution_count": 14, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"ak.mean(expert_fields.levels.pres, axis=-1)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "a09f24ff-21e2-464b-b2a4-b9202ec157c3", | |
"metadata": {}, | |
"source": [ | |
"Disregarding not-a-number values?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"id": "370ed2e9-6f7d-41f4-9030-71a682ab5c21", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<Array [492, 499, 501, 510, ..., 347, 315, 348, 572] type='10000 * ?float64'>" | |
] | |
}, | |
"execution_count": 15, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"ak.nanmean(expert_fields.levels.pres, axis=-1)" | |
] | |
}, | |
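The difference between the two reductions above can be seen on a small ragged list in plain Python. This is only a sketch of the idea, using made-up `pres` values and hypothetical helper names. Returning NaN for an empty list is a choice made here for simplicity, not a statement about how Awkward Array treats empty lists.

```python
import math

# Hypothetical ragged pressure data; the second list contains a NaN.
pres = [[10.0, 20.0, 30.0], [5.0, float("nan"), 15.0], []]

def mean(values):
    """Plain mean: a single NaN in the list makes the whole result NaN."""
    return sum(values) / len(values) if values else float("nan")

def nanmean(values):
    """Mean of the non-NaN values only."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else float("nan")

means = [mean(v) for v in pres]
nanmeans = [nanmean(v) for v in pres]
print(means)
print(nanmeans)
```

Only the second list differs between the two results: its plain mean is NaN, while its NaN-ignoring mean averages the two remaining values.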
{ | |
"cell_type": "markdown", | |
"id": "55e81472-f2ed-4fe5-9817-93061f0b206e", | |
"metadata": {}, | |
"source": [ | |
"All of the above should be reminiscent of exploring data with NumPy, except that the data have more complex structures." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "54b473e1-0d2c-4b6a-87db-f340c1fb0e13", | |
"metadata": {}, | |
"source": [ | |
"Just as Numba can iterate over NumPy arrays, it can iterate over Awkward Arrays.\n", | |
"\n", | |
"The following calculates means, disregarding not-a-number, just like `nanmean`." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"id": "87b854d6-2666-4aa6-8b72-0047f8c9d0a7", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import numba as nb" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"id": "6d262bc0-ea6d-4895-890c-b5a59e253ffe", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([491.65784372, 499.49642835, 501.15871513, ..., 314.63414634,\n", | |
" 347.6744186 , 571.81818182])" | |
] | |
}, | |
"execution_count": 17, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"@nb.njit\n", | |
"def walk_over_array(array):\n", | |
" out = np.empty(len(array))\n", | |
" for i, levels in enumerate(array.levels):\n", | |
" numer = 0.0\n", | |
" denom = 0.0\n", | |
" for pres in levels.pres:\n", | |
" if not np.isnan(pres):\n", | |
" numer += pres\n", | |
" denom += 1.0\n", | |
"\n", | |
" if denom != 0.0:\n", | |
" out[i] = numer / denom\n", | |
" else:\n", | |
" out[i] = np.nan\n", | |
"\n", | |
" return out\n", | |
"\n", | |
"walk_over_array(expert_fields)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "3e63a98d-a74e-4c3d-93c9-03b2d4c92ff7", | |
"metadata": {}, | |
"source": [ | |
"Now let's use this feature to make a complex cut: suppose we want only data that are a certain distance from shore." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"id": "de5592bc-8584-454b-acd1-e9497f954305", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import matplotlib.image" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"id": "1a674121-edb7-49c7-ae2c-804a994cad91", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"land = matplotlib.image.imread(\"land.png\")[:, :, 0]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 20, | |
"id": "4482fd99-f065-4427-a268-8defee205272", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"@nb.njit\n", | |
"def delta_longitude_latitude(at_latitude):\n", | |
" a, e2 = 6378.137, 0.00669437999014 # WGS84\n", | |
" radians = np.deg2rad(at_latitude)\n", | |
" partial = (1.0 - e2*np.sin(radians)**2)\n", | |
" delta_longitude = 180.0 * partial**0.5 / (np.pi*a*np.cos(radians))\n", | |
" delta_latitude = 180.0 * partial**1.5 / (np.pi*a*(1.0 - e2))\n", | |
" return delta_longitude, delta_latitude" | |
] | |
}, | |
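As a sanity check on the formulas in `delta_longitude_latitude`, the same arithmetic can be run without Numba, using only the standard `math` module. The function returns degrees per kilometer, so inverting the results should give the familiar kilometers-per-degree figures. The variable names below are illustrative.

```python
import math

def delta_longitude_latitude(at_latitude):
    """Degrees of longitude and latitude spanned by 1 km (same formulas, no Numba)."""
    a, e2 = 6378.137, 0.00669437999014  # WGS84 semi-major axis (km), eccentricity squared
    radians = math.radians(at_latitude)
    partial = 1.0 - e2 * math.sin(radians) ** 2
    delta_longitude = 180.0 * partial**0.5 / (math.pi * a * math.cos(radians))
    delta_latitude = 180.0 * partial**1.5 / (math.pi * a * (1.0 - e2))
    return delta_longitude, delta_latitude

dlon_equator, dlat_equator = delta_longitude_latitude(0.0)
dlon_60, dlat_60 = delta_longitude_latitude(60.0)

# Invert degrees-per-km to get km-per-degree.
km_per_degree_lon_equator = 1.0 / dlon_equator
km_per_degree_lat_equator = 1.0 / dlat_equator
km_per_degree_lon_60 = 1.0 / dlon_60
print(km_per_degree_lon_equator, km_per_degree_lat_equator, km_per_degree_lon_60)
```

At the equator this gives roughly 111.3 km per degree of longitude and 110.6 km per degree of latitude. At 60° latitude a degree of longitude shrinks to about half its equatorial size, which is why the ball in `fraction_in_land` is wider in longitude pixels at high latitudes.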
{ | |
"cell_type": "code", | |
"execution_count": 21, | |
"id": "ae99b53e-aa9d-4018-b2c2-540c57a46803", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"@nb.njit\n", | |
"def fraction_in_land(longitude, latitude, ball_kilometers):\n", | |
" delta_longitude, delta_latitude = delta_longitude_latitude(latitude)\n", | |
" inv_longitude, inv_latitude = 1.0 / delta_longitude, 1.0 / delta_latitude\n", | |
"\n", | |
" pixels_per_degree = 1.0\n", | |
" degrees_per_pixel = 1.0 / pixels_per_degree\n", | |
" min_horizontal = max(0, int(np.floor(pixels_per_degree * (180 + longitude - ball_kilometers * delta_longitude))))\n", | |
" max_horizontal = min(3600, int(np.ceil(pixels_per_degree * (180 + longitude + ball_kilometers * delta_longitude))))\n", | |
" min_vertical = max(1, int(np.floor(pixels_per_degree * (90 + latitude - ball_kilometers * delta_latitude))))\n", | |
" max_vertical = min(1801, int(np.floor(pixels_per_degree * (90 + latitude + ball_kilometers * delta_latitude))))\n", | |
"\n", | |
" num_land = 0.0\n", | |
" num_pixels = 0.0\n", | |
" for horizontal in range(min_horizontal, max_horizontal + 1):\n", | |
" for vertical in range(min_vertical, max_vertical + 1):\n", | |
" kilometers_east = inv_longitude * (degrees_per_pixel * (horizontal + 0.5) - 180.0 - longitude)\n", | |
" kilometers_north = inv_latitude * (degrees_per_pixel * (vertical - 0.5) - 90.0 - latitude)\n", | |
" if np.sqrt(kilometers_east**2 + kilometers_north**2) < ball_kilometers:\n", | |
" num_land += land[-vertical, horizontal]\n", | |
" num_pixels += 1.0\n", | |
"\n", | |
" if num_pixels == 0.0:\n", | |
" return 0.0\n", | |
" else:\n", | |
" return num_land / num_pixels" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "46e99555-cd68-42d5-a057-3566c02fdd4d", | |
"metadata": {}, | |
"source": [ | |
"Wrap `fraction_in_land` as a NumPy ufunc with `nb.vectorize`.\n", | |
"\n", | |
"Awkward Arrays can be used with `@nb.njit` functions and any ufunc, including new ones made by Numba." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 22, | |
"id": "8c0139d3-0077-4e51-a4d0-677283022ee3", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"@nb.vectorize([nb.float64(nb.float64, nb.float64, nb.float64)])\n", | |
"def fraction_in_land_ufunc(longitude, latitude, ball_kilometers):\n", | |
" return fraction_in_land(longitude, latitude, ball_kilometers)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "1a1d6369-d45d-4d48-b892-f1e28dc6a455", | |
"metadata": {}, | |
"source": [ | |
"Now get the full dataset, but only a few columns." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"id": "6b62abff-64f8-44af-8458-a7f26fd97e1a", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"all_coordinates = ak.from_parquet(\n", | |
" \"s3://pivarski-princeton/argo-floats-expert.parquet\",\n", | |
" columns=[\"longitude\", \"latitude\", \"time\"],\n", | |
")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 24, | |
"id": "509b78f6-7143-4309-974a-85a1be8f7bd0", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(10000, 2534567)" | |
] | |
}, | |
"execution_count": 24, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"len(expert_fields), len(all_coordinates)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "f1015ef4-05cd-43a2-9da0-869f12ff48ee", | |
"metadata": {}, | |
"source": [ | |
"Create a selection as an array of booleans and apply it as a slice (like NumPy)." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 25, | |
"id": "9bbfe47d-f81e-4042-9849-9822a41f05d4", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"selection = (\n", | |
" (fraction_in_land_ufunc(all_coordinates.longitude, all_coordinates.latitude, 1100.0) > 0.1) &\n", | |
" (fraction_in_land_ufunc(all_coordinates.longitude, all_coordinates.latitude, 990.0) < 0.1)\n", | |
")\n", | |
"\n", | |
"selected = all_coordinates[selection]" | |
] | |
}, | |
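The mechanics of the selection above are: two element-wise comparisons, combined with a logical AND, then used as a mask. A plain-Python sketch on made-up numbers follows; the `fraction_1100` and `fraction_990` lists are hypothetical stand-ins for the ufunc outputs.

```python
# Hypothetical land fractions for five positions, at two ball radii.
longitude = [-30.0, -20.0, -10.0, 0.0, 10.0]
fraction_1100 = [0.00, 0.05, 0.15, 0.40, 0.90]
fraction_990 = [0.00, 0.00, 0.04, 0.30, 0.85]

# Element-wise AND of two comparisons, like `(a > 0.1) & (b < 0.1)` on arrays.
selection = [
    big > 0.1 and small < 0.1
    for big, small in zip(fraction_1100, fraction_990)
]

# Applying the mask keeps only the entries where it is True.
selected = [lon for lon, keep in zip(longitude, selection) if keep]
print(selection, selected)
```

Only the middle position survives: it has enough land within the larger ball but not within the smaller one, which is what confines the selection to a band at a fixed distance from shore.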
{ | |
"cell_type": "code", | |
"execution_count": 26, | |
"id": "1dc70471-cba8-4de9-99e8-77fd4b9482a7", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.02544339920783313" | |
] | |
}, | |
"execution_count": 26, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"len(selected) / len(all_coordinates)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "989de32d-4f2a-40f8-888a-797d54bbb230", | |
"metadata": {}, | |
"source": [ | |
"The selected 2.5% consists only of measurements taken roughly 1000 km from shore." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 27, | |
"id": "3a479380-b61e-451e-8c3c-ddc99ddc2fa0", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<matplotlib.collections.PathCollection at 0x7f51c3479a60>" | |
] | |
}, | |
"execution_count": 27, | |
"metadata": {}, | |
"output_type": "execute_result" | |
}, | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<Figure size 864x576 with 1 Axes>" | |
] | |
}, | |
"metadata": { | |
"needs_background": "light" | |
}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"import matplotlib.pyplot as plt\n", | |
"\n", | |
"fig, ax = plt.subplots(figsize=(12, 8))\n", | |
"\n", | |
"ax.imshow(land, extent=(-180, 180, -90, 90), cmap=\"gray\")\n", | |
"ax.scatter(selected.longitude, selected.latitude, s=1)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "cf3eb975-8ece-4487-b7a9-05ca03caf9b5", | |
"metadata": {}, | |
"source": [ | |
"Eventually, data analysts will want Pandas DataFrames and xarray datasets.\n", | |
"\n", | |
"Awkward Arrays could be used just for the selection step, with the selected data then projected onto standard rectilinear formats." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 28, | |
"id": "f3a7c948-f953-4bf3-9d58-b997eeacbf28", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def make_pandas(row_groups, selection_function):\n", | |
" awkward_array = ak.from_parquet(\n", | |
" \"/home/jpivarski/storage/data/argo-floats/argo-floats-expert.parquet\", # use my local file instead of S3\n", | |
" row_groups=row_groups,\n", | |
" columns=[\n", | |
" \"latitude\",\n", | |
" \"longitude\",\n", | |
" \"time\",\n", | |
" \"levels.pres\",\n", | |
" \"levels.pres_qc\",\n", | |
" \"levels.psal\",\n", | |
" \"levels.psal_qc\",\n", | |
" \"levels.temp\",\n", | |
" \"levels.temp_qc\",\n", | |
" \"config_mission_number\",\n", | |
" \"cycle_number\",\n", | |
" \"data_mode\",\n", | |
" \"direction\",\n", | |
" \"platform_number\",\n", | |
" \"position_qc\",\n", | |
" \"time_qc\",\n", | |
" ],\n", | |
" )\n", | |
" # Treat any missing lists as empty lists.\n", | |
" awkward_array[\"levels\"] = ak.fill_none(awkward_array[\"levels\"], [], axis=1)\n", | |
" \n", | |
" # Apply the selection function.\n", | |
" selected = awkward_array[selection_function(awkward_array)]\n", | |
" \n", | |
" # Turn the quantities in levels into non-nested columns.\n", | |
" for name in selected[\"levels\"].fields:\n", | |
" selected[name] = selected[\"levels\", name]\n", | |
" \n", | |
" # Remove the original, nested version.\n", | |
" del selected[\"levels\"]\n", | |
" \n", | |
" # Now this projects into a Pandas DataFrame.\n", | |
" return ak.to_pandas(selected)" | |
] | |
}, | |
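The last step of `make_pandas` turns records holding equal-length lists into one flat row per list element. The plain-Python sketch below illustrates that shape change only; it does not reproduce what `ak.to_pandas` does internally, and the `selected` toy data and `to_rows` helper are hypothetical. The `(entry, subentry)` keys mirror the two index levels of the resulting DataFrame.

```python
# Hypothetical records after the nested fields were lifted to the top level:
# scalars (latitude) sit next to equal-length lists (pres, temp).
selected = [
    {"latitude": -0.126, "pres": [15.5, 21.1], "temp": [21.8, 21.79]},
    {"latitude": 7.386, "pres": [4.0], "temp": [28.1]},
]

def to_rows(records, list_fields):
    """Build one flat row per list element, keyed by (entry, subentry)."""
    rows = {}
    for entry, record in enumerate(records):
        length = len(record[list_fields[0]])
        for subentry in range(length):
            # Scalar fields are repeated on every row of the same entry.
            row = {k: v for k, v in record.items() if k not in list_fields}
            for name in list_fields:
                row[name] = record[name][subentry]
            rows[(entry, subentry)] = row
    return rows

rows = to_rows(selected, ["pres", "temp"])
for index, row in rows.items():
    print(index, row)
```

Per-profile values such as `latitude` are duplicated across all rows of a profile. That duplication is the price of a rectilinear table.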
{ | |
"cell_type": "code", | |
"execution_count": 29, | |
"id": "b6236300-d36e-4502-977a-e5ac2bd34766", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def selection_function(array):\n", | |
" return (\n", | |
" (fraction_in_land_ufunc(array.longitude, array.latitude, 1100.0) > 0.1) &\n", | |
" (fraction_in_land_ufunc(array.longitude, array.latitude, 990.0) < 0.1)\n", | |
" )" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "c0d208d8-e12c-4390-825e-9c977a9ac415", | |
"metadata": {}, | |
"source": [ | |
"Here are the first three row groups." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 30, | |
"id": "6189829c-8347-4633-aa59-51bedae586f8", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th>latitude</th>\n", | |
" <th>longitude</th>\n", | |
" <th>time</th>\n", | |
" <th>config_mission_number</th>\n", | |
" <th>cycle_number</th>\n", | |
" <th>data_mode</th>\n", | |
" <th>direction</th>\n", | |
" <th>platform_number</th>\n", | |
" <th>position_qc</th>\n", | |
" <th>time_qc</th>\n", | |
" <th>pres</th>\n", | |
" <th>pres_qc</th>\n", | |
" <th>psal</th>\n", | |
" <th>psal_qc</th>\n", | |
" <th>temp</th>\n", | |
" <th>temp_qc</th>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>entry</th>\n", | |
" <th>subentry</th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" <th></th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th rowspan=\"5\" valign=\"top\">0</th>\n", | |
" <th>0</th>\n", | |
" <td>-0.126</td>\n", | |
" <td>-11.863</td>\n", | |
" <td>1997-07-28 20:26:20</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>R</td>\n", | |
" <td>A</td>\n", | |
" <td>13858</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>15.500000</td>\n", | |
" <td>1</td>\n", | |
" <td>NaN</td>\n", | |
" <td></td>\n", | |
" <td>21.804001</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>-0.126</td>\n", | |
" <td>-11.863</td>\n", | |
" <td>1997-07-28 20:26:20</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>R</td>\n", | |
" <td>A</td>\n", | |
" <td>13858</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>21.100000</td>\n", | |
" <td>1</td>\n", | |
" <td>NaN</td>\n", | |
" <td></td>\n", | |
" <td>21.788000</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>-0.126</td>\n", | |
" <td>-11.863</td>\n", | |
" <td>1997-07-28 20:26:20</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>R</td>\n", | |
" <td>A</td>\n", | |
" <td>13858</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>26.600000</td>\n", | |
" <td>1</td>\n", | |
" <td>NaN</td>\n", | |
" <td></td>\n", | |
" <td>21.521000</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>-0.126</td>\n", | |
" <td>-11.863</td>\n", | |
" <td>1997-07-28 20:26:20</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>R</td>\n", | |
" <td>A</td>\n", | |
" <td>13858</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>32.200001</td>\n", | |
" <td>1</td>\n", | |
" <td>NaN</td>\n", | |
" <td></td>\n", | |
" <td>20.671000</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>-0.126</td>\n", | |
" <td>-11.863</td>\n", | |
" <td>1997-07-28 20:26:20</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>R</td>\n", | |
" <td>A</td>\n", | |
" <td>13858</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>37.799999</td>\n", | |
" <td>1</td>\n", | |
" <td>NaN</td>\n", | |
" <td></td>\n", | |
" <td>19.691000</td>\n", | |
" <td>1</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>...</th>\n", | |
" <th>...</th>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" <td>...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th rowspan=\"5\" valign=\"top\">1162</th>\n", | |
" <th>369</th>\n", | |
" <td>7.386</td>\n", | |
" <td>-45.771</td>\n", | |
" <td>2004-06-25 01:12:30</td>\n", | |
" <td>1</td>\n", | |
" <td>151</td>\n", | |
" <td>R</td>\n", | |
" <td>A</td>\n", | |
" <td>39007</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>NaN</td>\n", | |
" <td></td>\n", | |
" <td>NaN</td>\n", | |
" <td></td>\n", | |
" <td>NaN</td>\n", | |
" <td></td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>370</th>\n", | |
" <td>7.386</td>\n", | |
" <td>-45.771</td>\n", | |
" <td>2004-06-25 01:12:30</td>\n", | |
" <td>1</td>\n", | |
" <td>151</td>\n", | |
" <td>R</td>\n", | |
" <td>A</td>\n", | |
" <td>39007</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>NaN</td>\n", | |
" <td></td>\n", | |
" <td>NaN</td>\n", | |
" <td></td>\n", | |
" <td>NaN</td>\n", | |
" <td></td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>371</th>\n", | |
" <td>7.386</td>\n", | |
" <td>-45.771</td>\n", | |
" <td>2004-06-25 01:12:30</td>\n", | |
" <td>1</td>\n", | |
" <td>151</td>\n", | |
" <td>R</td>\n", | |
" <td>A</td>\n", | |
" <td>39007</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>NaN</td>\n", | |
" <td></td>\n", | |
" <td>NaN</td>\n", | |
" <td></td>\n", | |
" <td>NaN</td>\n", | |
" <td></td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>372</th>\n", | |
" <td>7.386</td>\n", | |
" <td>-45.771</td>\n", | |
" <td>2004-06-25 01:12:30</td>\n", | |
" <td>1</td>\n", | |
" <td>151</td>\n", | |
" <td>R</td>\n", | |
" <td>A</td>\n", | |
" <td>39007</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>NaN</td>\n", | |
" <td></td>\n", | |
" <td>NaN</td>\n", | |
" <td></td>\n", | |
" <td>NaN</td>\n", | |
" <td></td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>373</th>\n", | |
" <td>7.386</td>\n", | |
" <td>-45.771</td>\n", | |
" <td>2004-06-25 01:12:30</td>\n", | |
" <td>1</td>\n", | |
" <td>151</td>\n", | |
" <td>R</td>\n", | |
" <td>A</td>\n", | |
" <td>39007</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>NaN</td>\n", | |
" <td></td>\n", | |
" <td>NaN</td>\n", | |
" <td></td>\n", | |
" <td>NaN</td>\n", | |
" <td></td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"<p>316608 rows × 16 columns</p>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" latitude longitude time \\\n", | |
"entry subentry \n", | |
"0 0 -0.126 -11.863 1997-07-28 20:26:20 \n", | |
" 1 -0.126 -11.863 1997-07-28 20:26:20 \n", | |
" 2 -0.126 -11.863 1997-07-28 20:26:20 \n", | |
" 3 -0.126 -11.863 1997-07-28 20:26:20 \n", | |
" 4 -0.126 -11.863 1997-07-28 20:26:20 \n", | |
"... ... ... ... \n", | |
"1162 369 7.386 -45.771 2004-06-25 01:12:30 \n", | |
" 370 7.386 -45.771 2004-06-25 01:12:30 \n", | |
" 371 7.386 -45.771 2004-06-25 01:12:30 \n", | |
" 372 7.386 -45.771 2004-06-25 01:12:30 \n", | |
" 373 7.386 -45.771 2004-06-25 01:12:30 \n", | |
"\n", | |
" config_mission_number cycle_number data_mode direction \\\n", | |
"entry subentry \n", | |
"0 0 1 1 R A \n", | |
" 1 1 1 R A \n", | |
" 2 1 1 R A \n", | |
" 3 1 1 R A \n", | |
" 4 1 1 R A \n", | |
"... ... ... ... ... \n", | |
"1162 369 1 151 R A \n", | |
" 370 1 151 R A \n", | |
" 371 1 151 R A \n", | |
" 372 1 151 R A \n", | |
" 373 1 151 R A \n", | |
"\n", | |
" platform_number position_qc time_qc pres pres_qc psal \\\n", | |
"entry subentry \n", | |
"0 0 13858 1 1 15.500000 1 NaN \n", | |
" 1 13858 1 1 21.100000 1 NaN \n", | |
" 2 13858 1 1 26.600000 1 NaN \n", | |
" 3 13858 1 1 32.200001 1 NaN \n", | |
" 4 13858 1 1 37.799999 1 NaN \n", | |
"... ... ... ... ... ... ... \n", | |
"1162 369 39007 1 1 NaN NaN \n", | |
" 370 39007 1 1 NaN NaN \n", | |
" 371 39007 1 1 NaN NaN \n", | |
" 372 39007 1 1 NaN NaN \n", | |
" 373 39007 1 1 NaN NaN \n", | |
"\n", | |
" psal_qc temp temp_qc \n", | |
"entry subentry \n", | |
"0 0 21.804001 1 \n", | |
" 1 21.788000 1 \n", | |
" 2 21.521000 1 \n", | |
" 3 20.671000 1 \n", | |
" 4 19.691000 1 \n", | |
"... ... ... ... \n", | |
"1162 369 NaN \n", | |
" 370 NaN \n", | |
" 371 NaN \n", | |
" 372 NaN \n", | |
" 373 NaN \n", | |
"\n", | |
"[316608 rows x 16 columns]" | |
] | |
}, | |
"execution_count": 30, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"make_pandas([0, 1, 2], selection_function)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 31, | |
"id": "56afb02a-cb6d-42be-b6e3-22f8a31d7e04", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"254" | |
] | |
}, | |
"execution_count": 31, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"num_row_groups = ak.metadata_from_parquet(\n", | |
" \"s3://pivarski-princeton/argo-floats-expert.parquet\"\n", | |
").metadata.num_row_groups\n", | |
"num_row_groups" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "18a47c45-fdad-413a-ac56-d7d1781c0bc6", | |
"metadata": {}, | |
"source": [ | |
"There are 254 row groups in this dataset. How long does it take to convert all of them to Xarray?\n", | |
"\n", | |
"It depends heavily on network bandwidth. Downloading the 7.0 GB file first and working on a machine with tens of GB of RAM can make the conversion considerably quicker.\n", | |
"\n", | |
"Dask can spread this work across multiple threads or across computers in a cluster, and [that's under development](https://github.com/ContinuumIO/dask-awkward/)." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 32, | |
"id": "5fa921a1-8dfd-4bda-ab0c-82d1e9ccf9df", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"/home/jpivarski/mambaforge/lib/python3.9/site-packages/awkward/_v2/_connect/numpy.py:193: RuntimeWarning: invalid value encountered in fraction_in_land_ufunc\n", | |
" result = getattr(ufunc, method)(*args, **kwargs)\n", | |
"/home/jpivarski/mambaforge/lib/python3.9/site-packages/awkward/_v2/_connect/numpy.py:193: RuntimeWarning: invalid value encountered in fraction_in_land_ufunc\n", | |
" result = getattr(ufunc, method)(*args, **kwargs)\n", | |
"/home/jpivarski/mambaforge/lib/python3.9/site-packages/awkward/_v2/_connect/numpy.py:193: RuntimeWarning: invalid value encountered in fraction_in_land_ufunc\n", | |
" result = getattr(ufunc, method)(*args, **kwargs)\n", | |
"/home/jpivarski/mambaforge/lib/python3.9/site-packages/awkward/_v2/_connect/numpy.py:193: RuntimeWarning: invalid value encountered in fraction_in_land_ufunc\n", | |
" result = getattr(ufunc, method)(*args, **kwargs)\n", | |
"/home/jpivarski/mambaforge/lib/python3.9/site-packages/awkward/_v2/_connect/numpy.py:193: RuntimeWarning: invalid value encountered in fraction_in_land_ufunc\n", | |
" result = getattr(ufunc, method)(*args, **kwargs)\n", | |
"/home/jpivarski/mambaforge/lib/python3.9/site-packages/awkward/_v2/_connect/numpy.py:193: RuntimeWarning: invalid value encountered in fraction_in_land_ufunc\n", | |
" result = getattr(ufunc, method)(*args, **kwargs)\n", | |
"/home/jpivarski/mambaforge/lib/python3.9/site-packages/awkward/_v2/_connect/numpy.py:193: RuntimeWarning: invalid value encountered in fraction_in_land_ufunc\n", | |
" result = getattr(ufunc, method)(*args, **kwargs)\n", | |
"/home/jpivarski/mambaforge/lib/python3.9/site-packages/awkward/_v2/_connect/numpy.py:193: RuntimeWarning: invalid value encountered in fraction_in_land_ufunc\n", | |
" result = getattr(ufunc, method)(*args, **kwargs)\n", | |
"/home/jpivarski/mambaforge/lib/python3.9/site-packages/awkward/_v2/_connect/numpy.py:193: RuntimeWarning: invalid value encountered in fraction_in_land_ufunc\n", | |
" result = getattr(ufunc, method)(*args, **kwargs)\n", | |
"/home/jpivarski/mambaforge/lib/python3.9/site-packages/awkward/_v2/_connect/numpy.py:193: RuntimeWarning: invalid value encountered in fraction_in_land_ufunc\n", | |
" result = getattr(ufunc, method)(*args, **kwargs)\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"6min 57s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n" | |
] | |
} | |
], | |
"source": [ | |
"%%timeit -r 1 -n 1\n", | |
"\n", | |
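"# Convert the row groups in batches of 5, one Xarray conversion per batch.\n", | |
"for start in range(0, num_row_groups, 5):\n", | |
"    batch = range(start, min(start + 5, num_row_groups))\n", | |
"    make_pandas(batch, selection_function).reset_index().to_xarray()" | |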
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3 (ipykernel)", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.9.12" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 5 | |
} |