Skip to content

Instantly share code, notes, and snippets.

@JeffBezanson
Last active July 15, 2022 19:11
Show Gist options
  • Save JeffBezanson/24b9e2820262cdeb74f96b81534a4d1f to your computer and use it in GitHub Desktop.
Save JeffBezanson/24b9e2820262cdeb74f96b81534a4d1f to your computer and use it in GitHub Desktop.

An array is a structure mapping indices to data values, where the indices have a particular structure.

An array can be thought of as a pair of an index set and a data set.

Two arrays are equal if they map the same indices to the same values.

The data

An array is iterable, and when iterated generates exactly the values of the data set. Therefore arrays support "lifting", i.e. given any value x, you can construct an array that generates exactly x, once. Put differently, arrays support map. This is a crucial characteristic, perhaps the primary feature that makes arrays so powerful and intuitive. If you understand "x", you can immediately understand "array of x".

The indices

The index set I of an array A is also iterable, and generates values i_n such that A[i_n] gives the data values of A in the same order as iterating A itself. The generated index objects must be unique.

I has an associated total order over its elements, order(I). Given two elements of I, i_a and i_b, order(I)(i_a,i_b) is true iff I generates i_a before i_b.

I is constructed from N "dimensions", and each index i generated by I has N components. Loosely speaking, a "dimension" is a vector-like object describing the possible values of a component of an index. (Given the syntax A[i_1,...,i_n], we imagine that the is are immediately combined into a single index object in a manner dictated by A.)

I contains a mapping from dimension labels to dimensions. To the extent the syntax A[i_1,...,i_n] is supported by A, 1:n are accepted as dimension labels, though other labels for the dimensions might also exist.

In summary, an index set consists of:

  • an integer N
  • a mapping from N labels to N dimension objects
  • a way of generating index objects from the dimension objects, such that each index has N components, and each component is described by its corresponding dimension object in some way.
  • a total order on the index objects

Example for Core.Array:

  • integer N: type parameter
  • labels are integers 1:N, each dimension is a OneTo(m)
  • index set is the cartesian product of the dimensions
  • order is colexicographic order on tuples

The index set object is implicit in Array and not a separate object. That's of course fine as long as the index set can be obtained on demand.

Example for NDSparse:

  • integer N: length(a.index.columns), or ndims(a)
  • labels are keys(a.index.columns)
  • index set is the ordered zip of the dimensions
  • order is lexicographic order on tuples

It is common to talk about the storage order of an array, but this is a bit confusing. In this model, the data is just there, and the ordering is controlled by the index object.

It is recommended that every aspect of an array except the contents of the data and dimensions be represented in the type system:

  • the types of the dimension objects (implying N)
  • the labels (keys) for the dimension objects
  • the index order
  • the type of the data representation

Example of what an array type and constructor might look like with all of these features exposed:

Array{T,N} <: AbstractArray{ColumnMajorProduct{NTuple{N,OneTo{Int}}}, Buffer{T}}

A = array(ColumnMajorProduct(x=-1:.1:1, y=-1:.1:1), repeated(0))

A isa AbstractArray{ColumnMajorProduct{NamedTuple{(:x,:y),NTuple{2,FloatRange{Float64}}}}, Repeated{Int}}

In theory, there could be just a single concrete array type parameterized this way, with methods currently specialized for Array instead specialized for e.g. UrArray{ColumnMajorProduct{NTuple{N,OneTo{Int}}}}.

Implications

indices currently returns a tuple of dimensions, and the fact that the actual index space is a product of these is implicit. I think indices should return an index iterator that exposes (1) the same information as indices currently does, but also (2) the order. So it would return e.g. ColumnMajor{Tuple{OneTo{Int},OneTo{Int}}}.

The current behavior of indices could be called something like dimensions, idea being you just get the container of dimensions. For plain arrays you'd get a tuple, and for arrays with named dimensions you'd get a NamedTuple or dictionary.

We should have a function order(::(array or index set)) that returns the total order predicate for indices. We could define e.g. ColumnMajor:

abstract Order <: Function
immutable ColumnMajorOrder <: Order end
(::ColumnMajorOrder)(x::Tuple, y::Tuple) = isless(reverse(x), reverse(y))

Currently this is handled by having CartesianIndex implement isless. This is valid; an index set can have its own element type that implements isless in the right way, and then order(I) can just return isless. However it seems worthwhile to be able to reuse index types, e.g. Tuples, for multiple kinds of arrays and let them instead specify a different order predicate.

We should perhaps try to use tuples instead of CartesianIndex and see if the order(I) interface works well enough.

This model also implies that dimension names or labels are not intrinsic to dimensions themselves, but are keys used to look them up. This is because the keys used to look up dimensions must be unique, and uniqueness can only be enforced by the container.

A SparseMatrixCSC has two important index sets: one for the matrix interface it implements, and one for just the nonzeros. For example you could efficiently copy just the nonzeros from a sparse array to another array using dest[nonzero_indices(sp)] = nonzeros(sp). nonzero_indices(sp) would return an object with similar characteristics to indices(::NDSparse) (with a different implementation of course).

API summary

  • indices(A) - get index set
  • dimensions(A | indices(A)) - dimension container
  • keys(dimensions(A)) - dimension labels
  • order(indices(A)) - index order predicate
  • ndims(A) = length(dimensions(A))
  • ArrayType(IndexSet, Data) - constructor
  • storage(A) - get the data part of A, which implements some TBD storage API that might be slightly different from the array API. This function can return A itself if A implements that API and there is nothing else sensible to return.

Differences from Dict

  • Lifting semantics: an array wraps data values unrelated to its indices. A dict is a set of key=>value pairs, not just values.
  • Dict keys only need to support equality, not a total order. While ordered dicts exist, the ordering part is clearly optional and not central to what it means to be a dict.
  • A dict can be constructed from a sequence of key=>value pairs. An array can't really do this, since you need the structure around the indices and not just the indices themselves.
@timholy
Copy link

timholy commented Aug 13, 2016

Really fantastic. A few comments:

  • Your suggestion that indices return information about storage order is really interesting. I suspect that's a huge win. Note we currently have storage order encoded in PermutedDimsArray, and that's necessary, but not sufficient for what you're describing.
  • I like the comparison making use of the order type very much. As you point out, containers with arbitrary storage order make hash of any isless on indices unless you know the context.
  • I've always thought that dimensions is a little confusing: we speak about "the dimensions of the room" as well as "physics in 4 dimensions." If we adopt dimensions, we might want to couple it with a terminology change throughout the stdlib, mean(A, dims) -> mean(A, axes) or something. In other words, use dimensions the way you're proposing but rename other meanings. (The alternative is to use shape or domain for what you're calling dimensions.)

See also http://julialang.org/blog/2016/03/arrays-iteration, specifically the section titled "Formalizing AbstractArray," for related thoughts. I think the only thing of consequence that's not already here is the notion of "index synonyms," which I think could be important for certain algorithms. The simplest example of such synonyms today are linear indexing and cartesian indexing. A different kind of example is https://github.com/timholy/ArrayIteration.jl/blob/40817afaa9fb534fe3d10b0bafc42d3b801884d0/src/sparse.jl#L12-L16, which is a column index (aka, like an Int) which can be used for densely visiting each entry in a column of a SparseMatrixCSC but in a way that provides more efficient access (you hoist out the search). There are times when performance might suggest one type of iterator but generality another. For example, copying elements of a RowMajorArray to a ColumnMajorArray, you have to decide whether linear indexing always means "column major" or whether it means "memory order."

@JeffBezanson
Copy link
Author

JeffBezanson commented Aug 13, 2016

Thanks for the comments.

I agree dimensions is an overly-overloaded word. Existing uses like mean(A, dims) seem to correspond to what I call "dimension labels" above. I'd be fine with continuing to use "dimension" this way, and use "axis" for what I was calling "dimension object". You could say e.g. "the axis for dimension 1 is -10:10", which reads pretty well to me. So we'd have axes(A) instead of dimensions(A). Perhaps dimensions(A) would then give the dimension labels, keys(axes(A)).

I think index synonyms are fine here; you can just return an alternate index set, but of course only one can be canonical (for example for doing joins between wildly different kinds of arrays). One thing I like about my model here is that it makes it clear that for two equal arrays you'll get the same set of index, value pairs, but you don't know what order they'll be iterated in (without consulting further information).

@ViralBShah
Copy link

So, a dense array and a sparse array can be equal by this definition. Presumably other information would then be necessary to distinguish.

@timholy
Copy link

timholy commented Aug 15, 2016

I officially sign off on this entire plan. If forced to make a choice, I have a slight preference for axes over dimensions.

Would be good to get @mbauman's insights, too.

@timholy
Copy link

timholy commented Aug 15, 2016

It is worth noting that the plan is somewhat incomplete, as you note esp. with regards to storage. And I do think we are likely to want some generic way of saying "I want to iterate densely" (the default) vs "I want to iterate over just the stored values."

@ihnorton
Copy link

Might want to email matt directly, iirc gist pings don't do anything.

@StefanKarpinski
Copy link

I need to read this over several more times and contemplate it, but 👍. @mbauman should probably read this if he hasn't.

@jiahao
Copy link

jiahao commented Aug 23, 2016

@jiahao
Copy link

jiahao commented Aug 23, 2016

The notion of "dimensions" sounds like that of a dope vector

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment