Numpy is great, in fact it’s one of the things that pulls people to Python. But can it be better?
Common Lisp is great, in fact it’s one of the things that pulls people to Common Lisp. But can it be better? Indeed, Python can’t be better than Common Lisp without becoming another Lisp. The closest we have is Julia. And while Julia gets some things right, it lacks certain features, and their absence limits how good a numerical computing library can be.
Below, I highlight some of the features I wish a numerical computing library or ecosystem had, and I invite readers to share their own inputs on how things can be even better. The goal throughout is solely to keep things numpy-like. I do not intend to, nor have the background to, make a DSL like April or Petalisp.
While I take some interest in performance and numerical computing, I have minimal professional experience with it. So this is intended more as a discussion with an invitation for inputs, rather than an announcement that “this is the best way to do things”.
| Feature                                    | Numpy | Julia | numcl | magicl | dense-numericals |
|--------------------------------------------+-------+-------+-------+--------+------------------|
| Abstract Arrays                            | ---   | +++   | ---   | +++    | +++              |
| Frontend-backend separation                | ---   | +     | ---   | ---    | +                |
| Multidimensional strides/offsets           | +++   | +++   | ---   | ---    | +++              |
| Efficient for small arrays                 | ---   | +++   | -     | +      | +++              |
| OUT parameter                              | +     | +     | +     | +      | +                |
| Dynamic Scoped Parameters                  | ---   | ---   | +     | +      | +++              |
| Restarts                                   | ---   | ---   | -     | -      | +                |
| Good / Easily Configurable Array Printing  | +     | +     | -     | +      | +++              |
| (Optional) Broadcasting                    | +     | +++   | +     | -      | +++              |
| Multithreading                             | -     | +     | -     | -      | +++              |
| AOT, JIT-JAOT, Linter, Debugger            | -     | -     | +     | -      | +                |

--- : (Nearly) impossible to implement
-   : Either not implemented or not a big part of the library/ecosystem
+   : Implemented but could be better
+++ : Implemented fairly well
- Abstract Arrays as separate from Dense and Sparse Arrays
- Separation between backend and frontend
- Multidimensional offsets and strides
- Small inlinable static-dispatchable functions for small simple arrays
- Provision of an OUT parameter to avoid allocation if the user so wishes
- Using dynamic scoping (global variables done right) where relevant
- Using restarts where relevant
- Easily configurable array printing with good defaults
- Optional Broadcasting
- Optional Multithreading
- AOT/Linter vs JIT/JAOT, compiler notes, debugging
Credit for this idea goes to Julia, which has an AbstractArray type, from which its stdlib defines Array and SparseArray. The proposal, then, is that the library/ecosystem should separate out the default concrete dense array type from a more abstract array type. This way, a generic interface defined for the abstract array type can be used both for dense arrays and for additional array types like sparse arrays.
What could such a generic interface be like? Off the top of my head, such an interface would provide functions for
- reference and set-reference (basically, indexing)
- total-size
- dimensions
- rank
- element-type
- storage: a method to get the underlying object that actually stores the data; for example, this could be a 1D array acting as storage for a multidimensional array
Perhaps also an iterator interface, which I’m yet to think through.
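To make this concrete, here is a minimal sketch of what such a protocol could look like in Common Lisp. All the names below are hypothetical, written for illustration rather than taken from any existing library:

;; Hypothetical sketch of the generic interface; none of these names
;; are from an actual library.
(defclass abstract-array () ()
  (:documentation "Root class for all array-like objects."))

(defgeneric reference (array &rest subscripts)
  (:documentation "Return the element of ARRAY at SUBSCRIPTS."))
(defgeneric (setf reference) (new-value array &rest subscripts)
  (:documentation "Set the element of ARRAY at SUBSCRIPTS."))
(defgeneric total-size (array)
  (:documentation "Total number of elements in ARRAY."))
(defgeneric dimensions (array)
  (:documentation "A list of the lengths of each axis of ARRAY."))
(defgeneric rank (array)
  (:documentation "Number of axes of ARRAY."))
(defgeneric element-type (array)
  (:documentation "Type of the elements of ARRAY."))
(defgeneric storage (array)
  (:documentation "The underlying object that actually stores the
data, e.g. a 1D vector backing a multidimensional array."))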
Even for a single dense array type, one can have arrays implemented over non-static storage vectors, or arrays implemented over static storage to provide certain guarantees or optimizations while operating over them. Besides these, one could have arrays whose storage vectors live in shared memory, or on a GPU, or perhaps even distributed across multiple devices. Imagination and use cases abound. And yet, we would like to provide users with as common an interface as possible to operate over all of them.
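As a rough sketch of how that separation might look, again with hypothetical names: the frontend array type stays agnostic to where its storage lives, and each backend specializes a small set of generic functions:

;; Hypothetical sketch: one frontend array type, multiple storage
;; backends. The frontend never touches memory directly.
(defclass dense-array ()
  ((storage    :initarg :storage    :reader storage)
   (dimensions :initarg :dimensions :reader dimensions)))

(defgeneric storage-ref (storage index)
  (:documentation "Read the element at linear INDEX from STORAGE."))

;; One backend: storage is an ordinary lisp vector.
(defclass lisp-vector-storage ()
  ((vector :initarg :vector :reader backing-vector)))

(defmethod storage-ref ((storage lisp-vector-storage) index)
  (aref (backing-vector storage) index))

;; A GPU, shared-memory, or distributed backend would each add its
;; own STORAGE-REF (and friends); code written against the frontend
;; remains unchanged.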
I’m aware of this thanks to numpy. It is easy to see how to interpret a one-dimensional storage vector as a multidimensional array. For instance, a one-dimensional vector
#(0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23)
can be interpreted as the following 2x3x4 multidimensional array
#3A(((0 1 2 3) (4 5 6 7) (8 9 10 11))
((12 13 14 15) (16 17 18 19) (20 21 22 23)))
In the simplest case, each axis length acts as a guide indicating how much one step along that axis amounts to along the one-dimensional storage vector. In this example, 1 step along the 0th dimension is 3x4 = 12 steps along the one-dimensional storage vector; 1 step along the 1st dimension is 4 steps; while 1 step along the 2nd dimension is just 1 step.
Numpy arrays take this two steps further, incorporating strides and offsets for each of the axes. Strides indicate the number of steps to move along the one-dimensional vector in order to move one step along a particular axis, and these can differ from the values derived from the axis lengths. Offsets indicate the number of steps to move before the first element or sub-array along that axis even begins.
Above, one would have (12 4 1) and (0 0 0) as the strides and offsets for that particular interpretation of the one-dimensional vector. However, with strides (12 4 2), one can have the following interpretation of the same one-dimensional vector
#3A(((0 2) (4 6) (8 10)) ((12 14) (16 18) (20 22)))
Similarly, one could vary the offsets to (0 0 1), with the strides remaining (12 4 2), to obtain yet another interpretation of that same one-dimensional vector
#3A(((1 3) (5 7) (9 11)) ((13 15) (17 19) (21 23)))
For all these different interpretations, we are still using the same one-dimensional vector, and this is the point. By incorporating multidimensional strides and offsets, one can obtain different “views” of the same one-dimensional vector without any additional copying. Avoiding a separate copy of the array is the biggest advantage of having multidimensional strides and offsets.
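The index arithmetic underlying all of this is simple. Below is a hypothetical helper, not from any of the libraries above, that maps subscripts to a position in the storage vector:

;; Hypothetical helper: the storage position of an element is the
;; sum, over all axes, of the offset plus stride times subscript.
(defun linear-index (strides offsets subscripts)
  (loop :for stride :in strides
        :for offset :in offsets
        :for subscript :in subscripts
        :summing (+ offset (* stride subscript))))

;; With strides (12 4 2) and offsets (0 0 1), subscripts (1 2 0) map
;; to (0 + 12*1) + (0 + 4*2) + (1 + 2*0) = 21, matching the last
;; interpretation above.
(linear-index '(12 4 2) '(0 0 1) '(1 2 0)) ;=> 21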
If all one needs to operate on are medium and large arrays (usually in excess of 1000 elements), one is good to go with dynamic dispatch. However, when arrays become smaller, the cost of even a simple function call, let alone dynamic dispatch, becomes significant. Not only would we like to avoid dynamic dispatch, we would like to avoid function calls themselves.
However, while inlining in an attempt to avoid function calls, one would also like to avoid code bloat. Operating on arrays with multidimensional strides and offsets involves a fairly complicated mess of code. But in the simplest case of “simple arrays”, aka arrays with zero offsets and the “usual” strides, so that the array is actually representable by a contiguous block of memory, one can avoid this complicated mess.
Thus, we have the following cases of decreasing complexity:
- Multidimensional strides and offsets with broadcasting: Here, inlining won’t even be of much help since the code involved in broadcasting would pretty much be the bottleneck.
- No strides and offsets, with broadcasting: Here again, the mere presence of broadcasting takes away much of the benefit of not having to account for strides and offsets and of treating the arrays as one-dimensional vectors.
- No strides and offsets, and no broadcasting: It is in this case that the benefits of inlining along with static dispatch can become apparent. However, one needs to keep the code for this case as simple and small as possible; larger code can overturn the benefits of inlining (a sketch of such a fast path follows this list).
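A minimal sketch of that third-case fast path, assuming single-float simple arrays; the name and exact declarations are illustrative only:

;; Simple arrays, statically known element type, no broadcasting, no
;; strides/offsets: the loop compiles to a tight kernel and the
;; function stays small enough to inline at call sites.
(declaim (inline single-float-add!))
(defun single-float-add! (out a b)
  (declare (type (simple-array single-float (*)) out a b)
           (optimize (speed 3) (safety 0)))
  (loop :for i :of-type fixnum :below (length out)
        :do (setf (aref out i) (+ (aref a i) (aref b i))))
  out)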
Avoiding unnecessary memory allocation is a simple way to boost performance: a user may want to reuse already-allocated memory instead of allocating afresh. However, this is only possible if the library/ecosystem makes provision for it. Many, but not all, numpy functions make this provision through the out parameter.
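A sketch of what such a provision could look like, using a hypothetical ADD that writes into OUT when supplied and allocates a fresh array otherwise:

;; Hypothetical ADD with an OUT parameter: results go into OUT when
;; it is supplied, so callers can reuse memory across calls.
(defun add (a b &key out)
  (let ((out (or out (make-array (array-dimensions a)
                                 :element-type (array-element-type a)))))
    (dotimes (i (array-total-size a) out)
      (setf (row-major-aref out i)
            (+ (row-major-aref a i) (row-major-aref b i))))))

;; Reusing the same buffer avoids an allocation on every call:
(let ((buffer (make-array '(2 2) :element-type 'single-float)))
  (add #2A((1.0 2.0) (3.0 4.0))
       #2A((0.5 0.5) (0.5 0.5))
       :out buffer))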
Dynamic scoping by default is evil. But static scoping by default, along with optional dynamic scoping, is global variables done right. Some use cases relevant to numerical computing libraries/ecosystems involve the use of dynamic scoping to indicate
- array-element-type
- whether to broadcast
This saves the user from specifying the “type” and “broadcast” (and more?) parameters to every function. A downside, indeed, is that this is useful only when dynamic dispatch is employed; static dispatch needs to know these parameters at compile time itself.
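In Common Lisp, special variables provide exactly this. Here is a sketch with hypothetical names, including a hypothetical ZEROS constructor that consults the defaults:

;; Hypothetical special variables supplying defaults to every
;; operation in their dynamic extent.
(defvar *array-element-type* 'single-float
  "Default element type for newly created arrays.")
(defvar *broadcast* t
  "Whether operations broadcast their arguments by default.")

(defun zeros (&rest dimensions)
  "Hypothetical constructor consulting *ARRAY-ELEMENT-TYPE*."
  (make-array dimensions
              :element-type *array-element-type*
              :initial-element (coerce 0 *array-element-type*)))

;; Rebinding changes the defaults for this extent only; no need to
;; pass :element-type or :broadcast to every call.
(let ((*array-element-type* 'double-float)
      (*broadcast* nil))
  (zeros 3 3)) ;=> a 3x3 array of 0.0d0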
Imagine running a computation for a day, or even a week, and then suddenly getting an error about a dimension mismatch, or perhaps some other error. Common Lisp, with its restart system, provides a way to recover from such errors. However, it is only as useful as the programmers using it make it.
While it is never a good idea to run the main computation without first doing a pilot run, restarts can be put to use to recover even from those accidents in which a bug isn’t discovered until the main run. However, care needs to be taken to ensure that the use of restarts does not affect performance.
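As a sketch of the idea, a hypothetical CHECKED-ADD can offer a restart at the point of the mismatch instead of unconditionally unwinding:

;; Hypothetical: signal the error with a USE-VALUE restart so a
;; long-running computation can be resumed with a corrected argument.
(defun checked-add (a b)
  (restart-case
      (progn
        (unless (= (length a) (length b))
          (error "Length mismatch: ~D vs ~D" (length a) (length b)))
        (map 'vector #'+ a b))
    (use-value (new-b)
      :report "Supply a replacement for the second argument."
      (checked-add a new-b))))

;; (checked-add #(1 2 3) #(1 2)) signals an error; from the debugger
;; (or a handler) one can invoke USE-VALUE with, say, #(1 2 0), and
;; the computation continues instead of being lost.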
What are some of the parameters involved in array printing?
- number of levels to print
- number of elements/sub-arrays to print at each level
- how exactly to print an element
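For the first two of these, Common Lisp already standardizes dynamic variables, *PRINT-LEVEL* and *PRINT-LENGTH*, which a library’s own printing knobs could mirror:

;; *PRINT-LENGTH* limits the elements/sub-arrays printed per level;
;; *PRINT-LEVEL* limits the nesting depth.
(let ((*print-length* 3))
  (print #2A((1 2 3 4) (5 6 7 8) (9 10 11 12) (13 14 15 16))))
;; prints #2A((1 2 3 ...) (5 6 7 ...) (9 10 11 ...) ...)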
What information is important while working with arrays?
- element-type
- dimensions
- whether the array is a contiguous array, aka a simple array with 0 offsets and default strides, or whether it is a view into another array
This information should be available at a quick glance without requiring the user to dig around or inspect the given object. What else?
Array broadcasting allows binary, or more generally n-ary, operations to be carried out over arrays of different shapes, or over arrays and scalars. The dimensions are extended “suitably”; how exactly the dimensions are extended can vary from one ecosystem to another. The case of numpy is fairly well-documented.
Thus, for instance, an array can always be operated with a scalar. A (100 3) shaped array can be operated with another array of shape (3), or with (1 3) or (100 1), or a number of other compatible shapes. The arrays do not need to be of the same shape to be operated over. Coupled with multidimensional strides and offsets, either implicit or explicit, this allows operations to be carried out without copying over the entire smaller array.
However, at least as a beginner, this can lead to bugs. For instance, one may wish to add a 100-length vector to a 100-by-100 matrix, adding it to every column rather than to every row, with rows being elements along the 1st axis and columns being elements along the 0th axis. A correct shape for broadcasting would then be (100 1). However, as per the numpy broadcasting rules, the (100) shaped vector would be broadcast as if it were (1 100), resulting in addition to every row rather than every column.
;; Adding the following two arrays with broadcasting
#2a((1 2 3) (4 5 6) (7 8 9))
#(-1 0 1)
;; Intended result
#2a((0 1 2) (4 5 6) (8 9 10))
;; Obtained result
#2a((0 2 4) (3 5 7) (6 8 10))
It therefore seems useful, and perhaps not too difficult, to be able to turn off broadcasting.
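Continuing the earlier special-variable sketch, a hypothetical ADD that consults *BROADCAST* could make such shape mismatches loud:

;; Hypothetical: with *BROADCAST* bound to NIL, operating over a
;; (3 3) array and a (3) vector signals an error instead of silently
;; adding the vector to every row.
(let ((*broadcast* nil))
  (add #2A((1 2 3) (4 5 6) (7 8 9))
       #(-1 0 1)))
;; => error: shapes (3 3) and (3) are incompatible without broadcasting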
Like inlining, multithreading too is largely useful for medium-to-large arrays, and less useful, in fact performance-wise harmful, for smaller arrays. Even for larger arrays, the benefits of multithreading depend on whether the memory or caches are shared amongst multiple cores or are separate. Nor do all functions benefit from multithreading: while trigonometric operations or exponentiation may benefit, simple addition or subtraction might not, since for larger arrays the cost of memory access itself exceeds the computational cost of the addition or subtraction.
Thus, multithreading needs to be configurable in two ways, as sketched after this list:
- the array size beyond which multithreading is employed
- the functions that might not benefit from multithreading at all
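A sketch of how those two knobs could be exposed, with hypothetical names throughout:

;; Hypothetical configuration for multithreading.
(defvar *multithreading-threshold* 80000
  "Array size beyond which operations run multithreaded.")
(defvar *single-threaded-operations* '(add subtract)
  "Operations that do not benefit from multithreading.")

(defun use-threads-p (operation array-size)
  "Decide whether OPERATION should run multithreaded for ARRAY-SIZE."
  (and (>= array-size *multithreading-threshold*)
       (not (member operation *single-threaded-operations*))))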
In the initial moments of development, you want to focus on “unused variables”, “variables used before definition”, “functions called with incorrect arguments”, “undefined functions”, and maybe even dry-running your numerical code… and not on performance! All of this entails fast compilation, rather than compilation for fast runs. Only once your code is syntactically correct do you want to focus on the performance aspects; it is only here that JIT/JAOT becomes important. A number of these earlier issues can be discovered either through AOT, or through a fast and good, aka non-buggy, linter.
And even if JIT/JAOT is great, until we get an almighty compiler we still want the user to understand what it is that is causing performance issues; we want helpful compiler notes. Here, SBCL clearly wins hands down. But even code built over SBCL almost always ignores the provision of compiler notes, leaving the user puzzled about what the performance issues are.
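For instance, compiling untyped arithmetic under high optimization settings makes SBCL explain exactly what blocked the fast path:

;; Real SBCL behaviour: with (speed 3), compiling this prompts
;; efficiency notes along the lines of "forced to do GENERIC-+ ...
;; unable to do inline fixnum arithmetic", pointing at the missing
;; type information for ARRAY's elements and X.
(defun add-offset (array i x)
  (declare (optimize (speed 3)))
  (+ (aref array i) x))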