jorisvandenbossche · June 5, 2020 19:52
diff --git a/tokenize-issue.ipynb b/tokenize-issue.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Dask-GeoPandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import geopandas\n",
    "\n",
    "import dask.dataframe as dd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a series of points, and convert into a dask series:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "N = 100_000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "points = geopandas.GeoSeries(geopandas.points_from_xy(np.random.randn(N),np.random.randn(N)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dask Series Structure:\n",
       "npartitions=16\n",
       "0        geometry\n",
       "6250          ...\n",
       "           ...   \n",
       "93750         ...\n",
       "99999         ...\n",
       "dtype: geometry\n",
       "Dask Name: from_pandas, 16 tasks"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dd.from_pandas(points, npartitions=16)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**RESTART KERNEL**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import geopandas\n",
    "\n",
    "import dask.dataframe as dd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "N = 100_000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "points = geopandas.GeoSeries(geopandas.points_from_xy(np.random.randn(N),np.random.randn(N)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now first do some other computation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = geopandas.read_file(geopandas.datasets.get_path(\"naturalearth_lowres\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "ddf = dd.from_pandas(df, npartitions=4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "normalize EA of len 0\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "Africa                     51\n",
       "Asia                       47\n",
       "Europe                     39\n",
       "North America              18\n",
       "South America              13\n",
       "Oceania                     7\n",
       "Seven seas (open ocean)     1\n",
       "Antarctica                  1\n",
       "Name: continent, dtype: int64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddf.continent.value_counts().compute()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And then convert the Series of points to a dask series:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "normalize EA of len 100000\n",
      "normalize EA of len 100000\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "Dask Series Structure:\n",
       "npartitions=16\n",
       "0        geometry\n",
       "6250          ...\n",
       "           ...   \n",
       "93750         ...\n",
       "99999         ...\n",
       "dtype: geometry\n",
       "Dask Name: from_pandas, 16 tasks"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dd.from_pandas(points, npartitions=16)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I patched `normalize_extension_array` to print the message shown above, and the output indicates that it is tokenizing the full series of points, which is expensive. The strange thing is that this did not happen when this was the first thing done with dask (see above)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python (geo-dev)",
   "language": "python",
   "name": "geo-dev"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
 }
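
The instrumentation described in the last notebook cell can be illustrated without Dask. The sketch below is a stand-in, not Dask's implementation: `normalize`, `tokenize`, and `calls` are hypothetical names built on `functools.singledispatch` and `hashlib`. It only shows how a print statement inside a type-dispatched normalizer reveals whether the full data is walked when a token is computed.

```python
import hashlib
from functools import singledispatch

# Records (type name, length) every time the instrumented normalizer runs.
calls = []


@singledispatch
def normalize(obj):
    # Fallback for types without a dedicated normalizer: use the repr.
    return repr(obj)


@normalize.register(list)
def _(values):
    # Instrumentation: report how much data is being normalized.
    calls.append(("list", len(values)))
    print(f"normalize list of len {len(values)}")
    # Normalizing touches every element, which is what makes it expensive.
    return tuple(normalize(v) for v in values)


def tokenize(*args):
    # Hash the normalized form of the arguments into a deterministic token.
    payload = repr(tuple(normalize(a) for a in args)).encode()
    return hashlib.sha256(payload).hexdigest()


token = tokenize(list(range(1000)))
```

If the printed length equals the full input size, as with the 100000 reported in the notebook, the whole object is being hashed rather than a cheap identifier.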