@martindurant
Created May 13, 2020 18:33
fail cases
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"sys.path.append(\"/Users/mdurant/code/awkward-1.0/\")\n",
"import awkward1 as ak"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"import json\n",
"\n",
"def make_chunk(n, fn='out.json'):\n",
"    \n",
"    with open(fn, 'w') as f:\n",
"        data = []\n",
"        for _ in range(n):\n",
"            row = {\"data\": {'f': random.random(), \n",
"                            'i': random.randrange(0, 10), \n",
"                            'str': random.choice(['fred', 'wilma', 'barney', 'betty']),\n",
"                            'li': \n",
"                            [random.choice(list('abcdef')) for _ in range(random.randrange(1, 10))],\n",
"                            'map': {random.choice(list('abcdef')): random.choice(['fred', 'wilma', 'barney', 'betty'])\n",
"                                    for _ in range(random.randrange(1, 10))},\n",
"                           }}\n",
"            # row = [random.choice(list('abcdef')) for _ in range(random.randrange(1, 10))]\n",
"            data.append(row)\n",
"        json.dump(data, f)\n",
"    \n",
"    "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"make_chunk(100)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Array [{data: {f: 0.45, i: 6, ... a: None}}}] type='100 * {\"data\": {\"f\": float6...'>"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"arr = ak.fromjson('out.json')\n",
"arr"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"import numba\n",
"\n",
"@numba.njit\n",
"def len_lists(inarr):\n",
"    count = 0\n",
"    for d in inarr:\n",
"        val = d['data']['map']['a']\n",
"        if val:\n",
"            count += 1\n",
"    return count"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"ename": "TypingError",
"evalue": "Failed in nopython mode pipeline (step: nopython mode backend)\n\u001b[1m\u001b[1mInvalid use of Function(<class 'bool'>) with argument(s) of type(s): (awkward1.ArrayView(awkward1.NumpyArrayType(array(uint8, 1d, C), none, {\"__array__\": \"char\"}), None, ()))\n * parameterized\n\u001b[1mIn definition 0:\u001b[0m\n\u001b[1m All templates rejected with literals.\u001b[0m\n\u001b[1mIn definition 1:\u001b[0m\n\u001b[1m All templates rejected without literals.\u001b[0m\n\u001b[1mIn definition 2:\u001b[0m\n\u001b[1m TypingError: \u001b[1mInvalid use of Function(<built-in function truth>) with argument(s) of type(s): (awkward1.ArrayView(awkward1.NumpyArrayType(array(uint8, 1d, C), none, {\"__array__\": \"char\"}), None, ()))\n * parameterized\n\u001b[1mIn definition 0:\u001b[0m\n\u001b[1m All templates rejected with literals.\u001b[0m\n\u001b[1mIn definition 1:\u001b[0m\n\u001b[1m All templates rejected without literals.\u001b[0m\n\u001b[1mIn definition 2:\u001b[0m\n\u001b[1m All templates rejected with literals.\u001b[0m\n\u001b[1mIn definition 3:\u001b[0m\n\u001b[1m All templates rejected without literals.\u001b[0m\n\u001b[1mIn definition 4:\u001b[0m\n\u001b[1m UnboundLocalError: local variable 'left' referenced before assignment\u001b[0m\n raised from /Users/mdurant/code/awkward-1.0/awkward1/_connect/_numba/arrayview.py:493\n\u001b[1mIn definition 5:\u001b[0m\n\u001b[1m UnboundLocalError: local variable 'left' referenced before assignment\u001b[0m\n raised from /Users/mdurant/code/awkward-1.0/awkward1/_connect/_numba/arrayview.py:493\n\u001b[1mIn definition 6:\u001b[0m\n\u001b[1m All templates rejected with literals.\u001b[0m\n\u001b[1mIn definition 7:\u001b[0m\n\u001b[1m All templates rejected without literals.\u001b[0m\n\u001b[1mThis error is usually caused by passing an argument of a type that is unsupported by the named function.\u001b[0m\u001b[0m\u001b[0m\n raised from 
/Users/mdurant/conda/envs/py37/lib/python3.7/site-packages/numba/types/functions.py:79\n\u001b[1mIn definition 3:\u001b[0m\n\u001b[1m TypingError: \u001b[1mInvalid use of Function(<built-in function truth>) with argument(s) of type(s): (awkward1.ArrayView(awkward1.NumpyArrayType(array(uint8, 1d, C), none, {\"__array__\": \"char\"}), None, ()))\n * parameterized\n\u001b[1mIn definition 0:\u001b[0m\n\u001b[1m All templates rejected with literals.\u001b[0m\n\u001b[1mIn definition 1:\u001b[0m\n\u001b[1m All templates rejected without literals.\u001b[0m\n\u001b[1mIn definition 2:\u001b[0m\n\u001b[1m All templates rejected with literals.\u001b[0m\n\u001b[1mIn definition 3:\u001b[0m\n\u001b[1m All templates rejected without literals.\u001b[0m\n\u001b[1mIn definition 4:\u001b[0m\n\u001b[1m UnboundLocalError: local variable 'left' referenced before assignment\u001b[0m\n raised from /Users/mdurant/code/awkward-1.0/awkward1/_connect/_numba/arrayview.py:493\n\u001b[1mIn definition 5:\u001b[0m\n\u001b[1m UnboundLocalError: local variable 'left' referenced before assignment\u001b[0m\n raised from /Users/mdurant/code/awkward-1.0/awkward1/_connect/_numba/arrayview.py:493\n\u001b[1mIn definition 6:\u001b[0m\n\u001b[1m All templates rejected with literals.\u001b[0m\n\u001b[1mIn definition 7:\u001b[0m\n\u001b[1m All templates rejected without literals.\u001b[0m\n\u001b[1mThis error is usually caused by passing an argument of a type that is unsupported by the named function.\u001b[0m\u001b[0m\u001b[0m\n raised from /Users/mdurant/conda/envs/py37/lib/python3.7/site-packages/numba/types/functions.py:79\n\u001b[1mThis error is usually caused by passing an argument of a type that is unsupported by the named function.\u001b[0m\u001b[0m\n\u001b[0m\u001b[1m[1] During: lowering \"branch val, 36, 49\" at <ipython-input-11-fbcfedc8e316> (8)\u001b[0m",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mTypingError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-12-2c038d8aae72>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mlen_lists\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m~/conda/envs/py37/lib/python3.7/site-packages/numba/dispatcher.py\u001b[0m in \u001b[0;36m_compile_for_args\u001b[0;34m(self, *args, **kws)\u001b[0m\n\u001b[1;32m 399\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpatch_message\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmsg\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 400\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 401\u001b[0;31m \u001b[0merror_rewrite\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0me\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'typing'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 402\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0merrors\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mUnsupportedError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 403\u001b[0m \u001b[0;31m# Something unsupported is present in the user code, add help info\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/conda/envs/py37/lib/python3.7/site-packages/numba/dispatcher.py\u001b[0m in \u001b[0;36merror_rewrite\u001b[0;34m(e, issue_type)\u001b[0m\n\u001b[1;32m 342\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 343\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 344\u001b[0;31m \u001b[0mreraise\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0me\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 345\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 346\u001b[0m \u001b[0margtypes\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/conda/envs/py37/lib/python3.7/site-packages/numba/six.py\u001b[0m in \u001b[0;36mreraise\u001b[0;34m(tp, value, tb)\u001b[0m\n\u001b[1;32m 666\u001b[0m \u001b[0mvalue\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtp\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 667\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__traceback__\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mtb\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 668\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwith_traceback\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtb\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 669\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 670\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mTypingError\u001b[0m: Failed in nopython mode pipeline (step: nopython mode backend)\n\u001b[1m\u001b[1mInvalid use of Function(<class 'bool'>) with argument(s) of type(s): (awkward1.ArrayView(awkward1.NumpyArrayType(array(uint8, 1d, C), none, {\"__array__\": \"char\"}), None, ()))\n * parameterized\n\u001b[1mIn definition 0:\u001b[0m\n\u001b[1m All templates rejected with literals.\u001b[0m\n\u001b[1mIn definition 1:\u001b[0m\n\u001b[1m All templates rejected without literals.\u001b[0m\n\u001b[1mIn definition 2:\u001b[0m\n\u001b[1m TypingError: \u001b[1mInvalid use of Function(<built-in function truth>) with argument(s) of type(s): (awkward1.ArrayView(awkward1.NumpyArrayType(array(uint8, 1d, C), none, {\"__array__\": \"char\"}), None, ()))\n * parameterized\n\u001b[1mIn definition 0:\u001b[0m\n\u001b[1m All templates rejected with literals.\u001b[0m\n\u001b[1mIn definition 1:\u001b[0m\n\u001b[1m All templates rejected without literals.\u001b[0m\n\u001b[1mIn definition 2:\u001b[0m\n\u001b[1m All templates rejected with literals.\u001b[0m\n\u001b[1mIn definition 3:\u001b[0m\n\u001b[1m All templates rejected without literals.\u001b[0m\n\u001b[1mIn definition 4:\u001b[0m\n\u001b[1m UnboundLocalError: local variable 'left' referenced before assignment\u001b[0m\n raised from /Users/mdurant/code/awkward-1.0/awkward1/_connect/_numba/arrayview.py:493\n\u001b[1mIn definition 5:\u001b[0m\n\u001b[1m UnboundLocalError: local variable 'left' referenced before assignment\u001b[0m\n raised from /Users/mdurant/code/awkward-1.0/awkward1/_connect/_numba/arrayview.py:493\n\u001b[1mIn definition 6:\u001b[0m\n\u001b[1m All templates rejected with literals.\u001b[0m\n\u001b[1mIn definition 7:\u001b[0m\n\u001b[1m All templates rejected without literals.\u001b[0m\n\u001b[1mThis error is usually caused by passing an argument of a type that is unsupported by the named function.\u001b[0m\u001b[0m\u001b[0m\n raised from 
/Users/mdurant/conda/envs/py37/lib/python3.7/site-packages/numba/types/functions.py:79\n\u001b[1mIn definition 3:\u001b[0m\n\u001b[1m TypingError: \u001b[1mInvalid use of Function(<built-in function truth>) with argument(s) of type(s): (awkward1.ArrayView(awkward1.NumpyArrayType(array(uint8, 1d, C), none, {\"__array__\": \"char\"}), None, ()))\n * parameterized\n\u001b[1mIn definition 0:\u001b[0m\n\u001b[1m All templates rejected with literals.\u001b[0m\n\u001b[1mIn definition 1:\u001b[0m\n\u001b[1m All templates rejected without literals.\u001b[0m\n\u001b[1mIn definition 2:\u001b[0m\n\u001b[1m All templates rejected with literals.\u001b[0m\n\u001b[1mIn definition 3:\u001b[0m\n\u001b[1m All templates rejected without literals.\u001b[0m\n\u001b[1mIn definition 4:\u001b[0m\n\u001b[1m UnboundLocalError: local variable 'left' referenced before assignment\u001b[0m\n raised from /Users/mdurant/code/awkward-1.0/awkward1/_connect/_numba/arrayview.py:493\n\u001b[1mIn definition 5:\u001b[0m\n\u001b[1m UnboundLocalError: local variable 'left' referenced before assignment\u001b[0m\n raised from /Users/mdurant/code/awkward-1.0/awkward1/_connect/_numba/arrayview.py:493\n\u001b[1mIn definition 6:\u001b[0m\n\u001b[1m All templates rejected with literals.\u001b[0m\n\u001b[1mIn definition 7:\u001b[0m\n\u001b[1m All templates rejected without literals.\u001b[0m\n\u001b[1mThis error is usually caused by passing an argument of a type that is unsupported by the named function.\u001b[0m\u001b[0m\u001b[0m\n raised from /Users/mdurant/conda/envs/py37/lib/python3.7/site-packages/numba/types/functions.py:79\n\u001b[1mThis error is usually caused by passing an argument of a type that is unsupported by the named function.\u001b[0m\u001b[0m\n\u001b[0m\u001b[1m[1] During: lowering \"branch val, 36, 49\" at <ipython-input-11-fbcfedc8e316> (8)\u001b[0m"
]
}
],
"source": [
"len_lists(arr)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
import awkward1 as ak
import pandas as pd
import numpy as np
data = [[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]]*100000
array = ak.Array(data)
s = pd.Series(array)
%timeit s.map(len)
11.4 s ± 138 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit list(map(len, data))
9.15 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
s = pd.Series(data)
%timeit s.map(len)
63.9 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.diff(array.layout.offsets)
386 µs ± 36.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
@jpivarski

For the Numba error (perhaps these should have been different threads), the operator that hasn't been overloaded is <built-in function truth>; this line:

if val:

I agree that makes sense; it's an oversight. When these are arrays, no truth value is defined (same reason as NumPy), but it looks like val ought to be a string in that case (['fred', 'wilma', 'barney', 'betty']).

Another thing, though: is that array a union-type? (Print ak.type(arr).) Unions are the one type I couldn't find a way to implement in Numba, unless we want to restrict unions to only allow operations that would be legal for all possible variants of the union. If your setup code constructs a record with a lot of optional fields, then it's fine.
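
The NumPy behaviour referred to above can be reproduced in a few lines. This is a minimal sketch in plain NumPy, not the Awkward/Numba code path: `has_truth_value` is a hypothetical helper, and the `uint8` array only imitates how a string such as `'fred'` is stored as characters.

```python
import numpy as np

def has_truth_value(x):
    """Return True if bool(x) is defined, False if NumPy refuses to decide."""
    try:
        bool(x)
    except ValueError:
        # NumPy raises for arrays with more than one element.
        return False
    return True

# The bytes of "fred" as a uint8 array, similar in spirit to a "char" array.
chars = np.frombuffer(b"fred", dtype=np.uint8)

print(has_truth_value(chars))  # bool(chars) is ambiguous for a 4-element array
print(chars.size > 0)          # an explicit emptiness test is always well defined
```

An explicit test like the last line says what `if val:` was meant to check. Whether the equivalent expression compiles inside `@numba.njit` for this version of Awkward is not verified here.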

@martindurant
Author

What about this is a "fail?"

That you had to call a specialist ak function to do it, rather than being able to access such functionality through standard Python overloading. Obviously the data is there and already in the right array format to do this quickly. You should bet that people accessing variable-length arrays as a pandas Series will certainly want to map len over them.

@martindurant
Author

perhaps these should have been different threads

Agreed, we probably don't want to interleave them.

@jpivarski

What about this is a "fail?"

That you had to call a specialist ak function to do it, rather than being able to access such functionality through standard Python overloading. Obviously the data is there and already in the right array format to do this quickly. You should bet that people accessing variable-length arrays as a pandas Series will certainly want to map len over them.

Is it possible to override map? If so, we could check for cases in which the particular function len is mapped over an ak.Array and switch that for ak.num, but that's a very particular special case. The same argument could be leveled against NumPy, which has specialty functions like the np.diff that you used above. (Along the same lines, we'd want to identify list comprehensions or loops in the mapped function to know when to run ak.combinations or ak.cartesian.)

ak.num is a NumPy-like function with an axis parameter; the general-purpose way to get lengths at any level of depth. It's similar to NumPy having functions like np.sum for performing summations at any depth, except that NumPy doesn't need/never invented a "num" due to the fact that all NumPy arrays are rectangular (such a thing would just pick out a shape element and broadcast it).
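
To make the "lengths at a given depth" idea concrete, here is a minimal sketch of the offsets representation using only NumPy. It assumes the jagged array is stored as one flat `content` buffer plus an `offsets` array (the names are chosen here for illustration), which is what makes the `np.diff(array.layout.offsets)` timing above so fast.

```python
import numpy as np

# The same four-list pattern used in the timing session, without the repetition.
data = [[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]]

# Flat buffer holding every value back to back.
content = np.array([x for sub in data for x in sub])

# offsets[i] and offsets[i + 1] bound the i-th list inside `content`.
offsets = np.zeros(len(data) + 1, dtype=np.int64)
offsets[1:] = np.cumsum([len(sub) for sub in data])

# List lengths are the differences between neighbouring offsets,
# so no Python-level loop over the rows is needed.
lengths = np.diff(offsets)

print(offsets)
print(lengths)
```

The row-by-row `s.map(len)` has to materialise each list as a Python object, while the offsets difference is a single vectorised subtraction.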

In general, Pandas users' first inclination would probably be to run arbitrary Python functions against the rows of a DataFrame. That would never be the efficient way to do it, whether backed by NumPy or Awkward, but without Awkward, there wasn't an efficient way to do it. (Even Numba would have to unbox a lot of Python, which would be its stumbling block for that case.) Pandas users have already been convinced to use array-at-a-time functions when possible; the idea is that this extends what array-at-a-time functions are possible by adding the missing cases for non-rectilinear data.

Maybe the right thing to do would be for ak.num to recognize when a Pandas object is passed as an argument, unwrap the DataFrame/Series to get at the ak.Array, apply the array-at-a-time function, then wrap the result back up as the Pandas type? That would be like the pattern established by NumPy's NEP 13 and 18, but for the array-at-a-time functions that extend beyond rectilinear arrays.
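
A rough sketch of that unwrap/apply/rewrap pattern follows. It is illustrative only: `num` and `num_lengths` are hypothetical names, the Series holds plain Python lists rather than an ak.Array, and the Python loop in `num_lengths` merely stands in for the array-at-a-time kernel that ak.num would provide.

```python
import numpy as np
import pandas as pd

def num_lengths(values):
    """Stand-in kernel: length of each sub-list (ak.num would go here)."""
    return np.fromiter((len(v) for v in values), dtype=np.int64, count=len(values))

def num(obj):
    """Hypothetical wrapper that accepts a pandas Series or a plain sequence."""
    if isinstance(obj, pd.Series):
        # Unwrap, apply the kernel, then rewrap with the original index and name.
        result = num_lengths(obj.to_numpy())
        return pd.Series(result, index=obj.index, name=obj.name)
    return num_lengths(obj)

s = pd.Series([[1.1, 2.2, 3.3], [], [4.4, 5.5]], index=["a", "b", "c"], name="x")
out = num(s)
print(out)
```

The caller gets back the same pandas type with its labels preserved, at the cost of the function having to know about every wrapper type it might receive.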

@martindurant
Author

Is it possible to override map?

I daresay that the extension type (i.e., the definition of this sort of Series) can.

The same argument could be leveled against NumPy, which has specialty functions like the np.diff

Certainly! Unfortunately, history is against you, because these have already been folded into pandas, so that df.diff does exactly what you would expect if you had learned it from numpy.

In general, Pandas users' first inclination would probably be to run arbitrary Python functions against the rows of a DataFrame.

I don't think so; they would usually run very few methods, like map. It is generally known that map and aggregate (and their groupby variants) are expected to run vectorised code, but apply and direct iteration are likely to be slow.

Maybe the right thing to do would be for ak.num to recognize when a Pandas object is passed as an argument

I really feel this is the wrong way around. Then you have to recognise a dask-dataframe-containing-some-awkwards. If, instead, you can run the pandas methods (like map on a jitted function) on an array and dispatch at the extension implementation, then you get the dask/distributed hooks for free.

@martindurant
Author

Here is where map is dispatched to extension types: https://github.com/pandas-dev/pandas/blob/master/pandas/core/base.py#L1139
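
The idea discussed in this thread, letting the array type itself handle `map` and answering `len` from the offsets, might look roughly like the sketch below. `JaggedColumn` is a hypothetical toy class written for illustration; it is neither pandas' ExtensionArray interface nor Awkward's implementation.

```python
class JaggedColumn:
    """Toy column of variable-length lists stored as content + offsets."""

    def __init__(self, lists):
        self.content = [x for sub in lists for x in sub]
        self.offsets = [0]
        for sub in lists:
            self.offsets.append(self.offsets[-1] + len(sub))

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, i):
        # Only non-negative integer positions are handled in this sketch.
        return self.content[self.offsets[i]:self.offsets[i + 1]]

    def map(self, func):
        if func is len:
            # Fast path: lengths come straight from the offsets,
            # without building any of the sub-lists.
            return [stop - start
                    for start, stop in zip(self.offsets, self.offsets[1:])]
        # General path: apply the function row by row.
        return [func(self[i]) for i in range(len(self))]


col = JaggedColumn([[1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]])
print(col.map(len))
print(col.map(lambda row: len(row) * 2))
```

A container that calls the column's own `map` when one exists would pick up the fast path without the caller changing any code, which is the appeal of dispatching at the extension level.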
