
@rohithreddykota
Created March 29, 2025 12:09

Apache Arrow Physical Memory Layout (Rust Implementation Focus)

Apache Arrow defines a standardized columnar in-memory format to enable high-performance analytics across languages. It emphasizes contiguous, aligned memory and minimal metadata, allowing zero-copy sharing and efficient vectorized processing (Arrow Columnar Format — Apache Arrow v19.0.1). This answer breaks down Arrow’s physical memory layout fields (validity bitmap, offsets, data buffers, type IDs, etc.), and explains how these map to the Rust implementation (using the arrow and arrow-array crates). We’ll cover buffer structures (Buffer 0/1/2), validity bitmaps, alignment/padding rules, nested types (List, Struct, Union), offset buffers for variable-length types, dictionary encoding, and run-end encoding. Finally, we highlight how Arrow’s Rust crates represent these concepts with structs, enums, and safe memory management. Diagrams and code examples are included for clarity.

Overview of Arrow’s Memory Layout Fields and Buffers

Each Arrow Array (column) is stored as a set of one or more memory buffers plus some metadata (length, null count, etc.) (Internal structure of Arrow objects • Arrow R Package) (Introduction — Apache Arrow v19.0.1). The exact buffers present depend on the array’s data type, but they generally fall into these categories:

  • Validity Bitmap (Null Bitmap), Buffer 0: a bitmask indicating which entries are not null (valid).
  • Data Buffer, Buffer 1 (and possibly Buffer 2): contiguous memory holding the actual values (or parts of values). Fixed-size types have one data buffer; variable-size types use an offsets buffer and a data buffer.
  • Offset Buffer, Buffer 1 for strings/lists: an integer array delineating the start/end positions of each variable-length value in the data buffer.
  • Type ID Buffer, Buffer 0 for Union types: an 8-bit integer array indicating which child type each element of a Union array has. (Dense unions add a second buffer for offsets into child arrays.)

Most arrays have at most 3 predefined buffers (numbered 0, 1, 2). More complex layouts (unions, run-end encoding) use additional child arrays rather than extra buffers (Arrow Columnar Format — Apache Arrow v19.0.1). The table below summarizes the buffer usage for each layout type:

(image) Summary of Arrow’s physical layouts and their buffers (Arrow Columnar Format — Apache Arrow v19.0.1).
Each row shows which buffers (Bitmap, Offsets, Data, etc.) a given array type uses, and whether it has child arrays.

As shown, Buffer 0 is typically the validity bitmap for all types except union (which has no parent validity) (Arrow Columnar Format — Apache Arrow v19.0.1) (Arrow Columnar Format — Apache Arrow v19.0.1). Buffer 1 and 2 hold data or offsets depending on the type: e.g. a primitive array uses Buffer 1 for data values, whereas a string (Binary) array uses Buffer 1 for offsets and Buffer 2 for the actual byte data (Arrow Columnar Format — Apache Arrow v19.0.1) (Arrow Columnar Format — Apache Arrow v19.0.1). Nested types (List, Struct, etc.) have one or more child arrays rather than a flat data buffer, so their buffers are just the bitmap (and possibly offsets) and the actual element values live in the child array(s) (Arrow Columnar Format — Apache Arrow v19.0.1) (Arrow Columnar Format — Apache Arrow v19.0.1). We will now examine each kind of buffer/field in detail.

Validity Bitmaps (Null Bitmaps)

Almost every Arrow array can have missing values, represented by a validity bitmap (also called a null bitmap or bitmap buffer). This is a dedicated buffer (usually Buffer 0) of bits, with 1 bit per array element indicating whether the value is present (1) or null (0) (Introduction — Apache Arrow v19.0.1). For example, an Int32 array with values [1, null, 2, 4, 8] has validity bits 1, 0, 1, 1, 1 for indices 0 through 4 (Internal structure of Arrow objects • Arrow R Package). These bits are packed densely (8 elements per byte). If the number of elements is not a multiple of 8, the last byte is padded with zeros for the unused bits (Internal structure of Arrow objects • Arrow R Package). The bitmap buffer’s length is rounded up to a whole number of bytes and further padded to a multiple of 8 or 64 bytes for alignment (discussed later) (Internal structure of Arrow objects • Arrow R Package).

Notably, Arrow uses little-endian bit numbering: within a byte, the least significant bit corresponds to the first element (index 0) (Introduction — Apache Arrow v19.0.1). This means that if we write the bits in human-readable order, we list them from least significant to most (effectively reversing the bit order you might normally write) (Internal structure of Arrow objects • Arrow R Package). Arrow’s libraries take care of this detail when reading/writing bits. The key point is that the bitmap provides O(1) null checks and is memory-efficient (only 1 bit per value).

If an array has no nulls, Arrow can omit the bitmap buffer entirely to save space (Introduction — Apache Arrow v19.0.1). In that case, the array’s “null count” metadata is 0 and consumers assume all values are valid. But whenever any nulls are present, the bitmap buffer must exist and have at least ceil(length/8) bytes (Arrow Columnar Format — Apache Arrow v19.0.1). (Union arrays are the exception: they do not use a parent bitmap – nulls in unions are indicated by nulls in the child arrays, as we’ll see later (Arrow Columnar Format — Apache Arrow v19.0.1).)

Example: In our [1, null, 2, 4, 8] Int32 array, the validity bitmap for 5 elements is 10111 (in abstract bit form). In memory, this becomes one byte 00011101 (padded with three 0 bits and written in least-significant-bit-first order) (Internal structure of Arrow objects • Arrow R Package) (Internal structure of Arrow objects • Arrow R Package). Below is a diagram of this array’s layout in memory, showing the 64-byte aligned bitmap and data buffers:
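To make the packing concrete, here is a minimal sketch in plain Rust of how validity flags become a bitmap byte. The `pack_validity` helper is hypothetical (the arrow crates handle this internally via `NullBuffer`); it only illustrates the LSB-first bit order described above:

```rust
// Hypothetical helper (not part of the arrow crates): pack validity flags
// into bytes least-significant-bit first, as the Arrow format specifies.
fn pack_validity(valid: &[bool]) -> Vec<u8> {
    let mut bytes = vec![0u8; (valid.len() + 7) / 8]; // ceil(len / 8), padded with 0 bits
    for (i, &v) in valid.iter().enumerate() {
        if v {
            bytes[i / 8] |= 1 << (i % 8); // bit (i % 8) of byte (i / 8)
        }
    }
    bytes
}

fn main() {
    // [1, null, 2, 4, 8] -> validity flags [true, false, true, true, true]
    let bitmap = pack_validity(&[true, false, true, true, true]);
    assert_eq!(bitmap, vec![0b0001_1101]); // the single byte 0x1D from the example
    // Checking element i: (byte >> (i % 8)) & 1
    assert_eq!((bitmap[0] >> 1) & 1, 0); // index 1 is null
}
```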

(Internal structure of Arrow objects • Arrow R Package) Diagram of a primitive Int32 array with a validity bitmap and data buffer (Internal structure of Arrow objects • Arrow R Package) (Internal structure of Arrow objects • Arrow R Package).
Gray = metadata (length and null count), blue = buffers (dotted box shows contents). Here 1, NA, 2, 4, 8 are stored in the data buffer (each 4 bytes), and the bitmap 00011101 (binary) marks NA as null. Unused bytes are “unspecified” (padding) (Internal structure of Arrow objects • Arrow R Package).

Under the hood, Arrow’s Rust implementation represents the validity bitmap using a NullBuffer type (or sometimes a BooleanBuffer for bitmaps). This NullBuffer essentially wraps a Buffer<u8> plus an offset and length in bits (BooleanArray in arrow::array - Rust) (BooleanArray in arrow::array - Rust). If an Arrow ArrayData in Rust has an Option<NullBuffer> set to None, it means there are no nulls (no bitmap allocated) (ArrayData in arrow_data - Rust). The Arrow spec allows the memory for null entries in the data buffer to be left undefined (garbage) since the bitmap is the source of truth on nullness (In-Memory Analytics with Apache Arrow | Packt G.R. Jenkin & Associates) (Internal structure of Arrow objects • Arrow R Package). Many implementations (including Rust) choose to zero-initialize null data for predictability, but it’s not required by the format (In-Memory Analytics with Apache Arrow | Packt G.R. Jenkin & Associates).

Fixed-Size Primitive Arrays (Data Buffers)

Primitive arrays are those with a fixed number of bytes per element – e.g. integers, floats, fixed-size decimals, and even booleans (booleans use 1 bit but are treated as a bit-packed fixed layout) (Introduction — Apache Arrow v19.0.1) (Introduction — Apache Arrow v19.0.1). In Arrow’s layout, a primitive array has two buffers: Buffer 0: validity bitmap (unless no nulls) and Buffer 1: data buffer (Arrow Columnar Format — Apache Arrow v19.0.1) (Arrow Columnar Format — Apache Arrow v19.0.1). The data buffer is a contiguous memory region that stores each value in sequence, using the native endianness (Arrow is typically little-endian) (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316). For example, an Int32Array of length N has a data buffer of length N × 4 bytes (plus padding to 8/64 bytes). The i-th element’s bits are at offset i * 4 in this buffer, allowing constant-time random access.

For all primitive types except Boolean, each value occupies a full byte-width slot. Boolean arrays are handled specially: Arrow packs boolean values into bits (still using a “primitive” layout) (Introduction — Apache Arrow v19.0.1). A BooleanArray therefore has Buffer 1 as a values bitmap buffer (with 1 bit per boolean element, similar to a validity bitmap) instead of a byte-per-value buffer. It may also have a validity bitmap (Buffer 0) if it contains nulls (Introduction — Apache Arrow v19.0.1). In effect, booleans use two bitmaps – one for values and one for validity – when nulls are present. The Rust BooleanArray type reflects this: it stores a BooleanBuffer for the values and an optional NullBuffer for nulls (BooleanArray in arrow::array - Rust) (BooleanArray in arrow::array - Rust). This design maximizes space efficiency for booleans, at the cost of minor bit-level operations to read/write them. All other fixed-width types (int8…int64, float32/64, etc.) use whole bytes for values.

The memory alignment of primitive values is important. Arrow recommends 8 or 64-byte alignment for buffers (Arrow Columnar Format — Apache Arrow v19.0.1) (Arrow Columnar Format — Apache Arrow v19.0.1) so that CPU can perform aligned loads/stores and SIMD operations. In practice, Arrow’s IPC format requires each buffer to start at a multiple of 8 bytes (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316), and many implementations allocate memory at 64-byte boundaries for performance (Arrow Columnar Format — Apache Arrow v19.0.1). This means the data buffer for a primitive array might have unused padding bytes at the end to round its size up to a multiple of 8 or 64. In our Int32 example, 5 values (20 bytes) are padded to 64 bytes total (Internal structure of Arrow objects • Arrow R Package) (Internal structure of Arrow objects • Arrow R Package). Alignment ensures that even the last value (byte 16-19) can be loaded efficiently, and that arrays can be placed in shared memory without worrying about misalignment in a consuming process (Arrow Columnar Format — Apache Arrow v19.0.1).

Variable-Size Binary Types (Offsets and Data Buffers)

For variable-length data like strings or binary blobs, Arrow uses a two-buffer layout: an offset buffer and a data buffer, plus a bitmap for nulls. This applies to types Binary, Utf8 (UTF-8 strings), and similar (as well as List which is analogous for nested sequences). The offset buffer (Buffer 1) is an array of integers (32-bit for Utf8/Binary, 64-bit for LargeUtf8/LargeBinary) that delimits the start and end of each element’s data within the data buffer (Arrow Columnar Format — Apache Arrow v19.0.1) (In-Memory Analytics with Apache Arrow | Packt G.R. Jenkin & Associates). It has length array_length + 1, with the convention that offsets[i] is the starting byte offset of element i in the data buffer, and offsets[array_length] is the total length of the data. Each element’s value bytes thus occupy offsets[i+1] - offsets[i] bytes in the data buffer.

The data buffer (Buffer 2) is a contiguous blob of all the element values concatenated back-to-back (In-Memory Analytics with Apache Arrow | Packt G.R. Jenkin & Associates). There are no delimiters in this buffer; it’s just raw bytes. To get element i, one reads from data[ offsets[i] ... offsets[i+1] ). Because the offset buffer provides constant-time random access to any element’s boundaries, this layout avoids storing pointers for each string and keeps all string bytes adjacent in memory (In-Memory Analytics with Apache Arrow | Packt G.R. Jenkin & Associates). This improves cache locality and SIMD scan potential when processing many strings sequentially, compared to an array of pointers to separate string buffers (common in conventional in-memory representations) (In-Memory Analytics with Apache Arrow | Packt G.R. Jenkin & Associates).

Example: Consider an Arrow StringArray containing ["hello", "amazing", "and", "cruel", "world"]. The data buffer would contain the UTF-8 bytes of "helloamazingandcruelworld" (all strings concatenated) (Internal structure of Arrow objects • Arrow R Package) (Internal structure of Arrow objects • Arrow R Package). The offset buffer (length 6 for 5 strings) might be [0, 5, 12, 15, 20, 25] (Internal structure of Arrow objects • Arrow R Package). This means: string0 spans bytes [0:5) ("hello"), string1 is [5:12) ("amazing"), string2 [12:15) ("and"), string3 [15:20) ("cruel"), string4 [20:25) ("world"). The offset values are cumulative end positions. If an element is null, Arrow still sets offsets as if it were an empty string (or empty list), and relies on the validity bitmap to mark it null.
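The offsets-into-data lookup can be modeled with plain slices. This is a sketch of the layout only, not the arrow crate API; the `utf8_value` helper is hypothetical:

```rust
// Illustrative only: the raw Utf8 layout modeled with plain slices.
// Element i occupies data[offsets[i] .. offsets[i+1]].
fn utf8_value<'a>(data: &'a [u8], offsets: &[i32], i: usize) -> &'a str {
    std::str::from_utf8(&data[offsets[i] as usize..offsets[i + 1] as usize]).unwrap()
}

fn main() {
    let data: &[u8] = b"helloamazingandcruelworld"; // Buffer 2: concatenated bytes
    let offsets = [0, 5, 12, 15, 20, 25];          // Buffer 1: length + 1 entries
    assert_eq!(utf8_value(data, &offsets, 0), "hello");
    assert_eq!(utf8_value(data, &offsets, 2), "and");
    assert_eq!(utf8_value(data, &offsets, 4), "world");
    assert_eq!(data.len() as i32, offsets[5]); // total data length equals the final offset
}
```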

Since each offset is just a number, the offset buffer itself is fixed-width (4 or 8 bytes per entry). It is also padded to 8 or 64 bytes alignment. The data buffer is padded to ensure alignment if it’s referenced directly. For instance, after the last string’s bytes at position 24, there may be one padding byte to reach a multiple of 8. The validity bitmap works as usual to indicate any null strings. If an entire string array has no nulls, Buffer 0 is omitted and only offsets+data buffers are present (Arrow Columnar Format — Apache Arrow v19.0.1) (Arrow Columnar Format — Apache Arrow v19.0.1).

(Internal structure of Arrow objects • Arrow R Package) Diagram of a string array with offsets and data buffers (Internal structure of Arrow objects • Arrow R Package) (Internal structure of Arrow objects • Arrow R Package).
Here, all strings are non-null (the validity buffer is absent, shown as validity: null in the diagram). The offset buffer holds [0,5,12,15,20,25] and the data buffer holds the concatenated text "helloamazingandcruelworld". Each string is a slice into the data buffer based on offsets.

In Rust’s Arrow implementation, variable-length arrays are represented by types like StringArray or BinaryArray. Internally, a StringArray contains an OffsetBuffer<i32> and a values Buffer<u8>, plus an optional NullBuffer (arrow_array - Rust) (arrow_array - Rust). The OffsetBuffer<i32> is a thin wrapper around a Buffer<i32> that provides methods to get offsets; similarly the values buffer is a Buffer<u8> (often just called Buffer). The Arrow Rust API ensures that the offsets and data buffers are consistent (e.g., it will validate that offsets[length] == data_length and that offsets are non-decreasing) (ArrayData in arrow_data - Rust) (ArrayData in arrow_data - Rust). When constructing a StringArray in Rust, you typically provide a vector of offsets and a single contiguous byte buffer for all strings, as shown below:

use arrow_array::{Array, Int32Array, StringArray};
use arrow_buffer::OffsetBuffer;

// Construct an Int32 array from a Vec (no copy: Arrow takes ownership of the allocation)
let int_array = Int32Array::new(vec![1, 2, 3].into(), None);
assert_eq!(int_array.values(), &[1, 2, 3]);  // values() exposes the data buffer
assert_eq!(int_array.null_count(), 0);

// Construct a String array from raw parts: an offset buffer and a values buffer
let offsets = OffsetBuffer::new(vec![0, 5, 10].into());  // offsets [0, 5, 10] for two strings
let data = b"helloworld".to_vec().into();                // values buffer with bytes "helloworld"
let str_array = StringArray::new(offsets, data, None);
assert_eq!(str_array.value(0), "hello");  // bytes [0, 5)
assert_eq!(str_array.value(1), "world");  // bytes [5, 10)

Code: Building Arrow arrays from raw buffers without copying (arrow_array - Rust) (arrow_array - Rust). The Int32Array uses a single data buffer, while the StringArray uses an OffsetBuffer and values buffer. Both have None for the validity bitmap since there are no nulls.

Arrow chooses this offsets-and-data design to enable zero-copy substring access and efficient slicing. For example, if you take a subarray of a StringArray, it can simply slice the offset buffer and reuse the same data buffer (since the string bytes for that range are contiguous) (Internal structure of Arrow objects • Arrow R Package) (Internal structure of Arrow objects • Arrow R Package). This avoids duplicating string data. It’s one of the performance trade-offs: Arrow does more work up front (constructing offsets) and uses more memory for offsets, in order to gain O(1) element access and contiguity of data. The result is that algorithms can often scan the data buffer sequentially (maximizing cache hits) and compute on whole chunks of bytes. This layout is also friendly to SIMD: e.g., scanning for a byte pattern in all strings can be done with vectorized memory operations on the unified data buffer.
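The zero-copy slicing idea can be sketched with plain slices: a subarray just narrows the offsets window while the data buffer stays shared. The `value_at` helper below is hypothetical, shown only to illustrate the mechanism:

```rust
// Sketch of zero-copy slicing: a subarray narrows the offsets window,
// reusing the same data buffer without copying any string bytes.
fn value_at<'a>(data: &'a [u8], offsets: &[i32], i: usize) -> &'a [u8] {
    &data[offsets[i] as usize..offsets[i + 1] as usize]
}

fn main() {
    let data: &[u8] = b"helloamazingandcruelworld";
    let offsets: &[i32] = &[0, 5, 12, 15, 20, 25];
    // "Slicing" elements 1..4 only requires offsets[1..=4]; data is untouched
    let sub_offsets = &offsets[1..5];
    assert_eq!(value_at(data, sub_offsets, 0), &b"amazing"[..]);
    assert_eq!(value_at(data, sub_offsets, 2), &b"cruel"[..]);
}
```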

Lists and Nested Types (List, Fixed-Size List, Struct)

List arrays in Arrow allow nesting one array inside another (analogous to an array of vectors). There are two main forms: variable-size lists (each list element can have a different length) and fixed-size lists (each has the same length). Both are considered nested types because they have a child array representing the elements.

A variable-size List<T> is physically similar to a Binary/String array: it has a validity bitmap (Buffer 0), an offset buffer (Buffer 1) of length length+1 indicating element boundaries in the child data, and a child array of type T which holds all the list elements concatenated (Arrow Columnar Format — Apache Arrow v19.0.1). For example, a List<Int32> with values [ [1,2], null, [3] ] would have offsets [0, 2, 2, 3] pointing into a child Int32 array [1, 2, 3]. Offsets 0 to 2 cover the first list’s two elements; 2 to 2 covers the second list (null, hence zero-length, so the offsets do not increase); and 2 to 3 covers the third list’s one element. The child array has length 3 and contains no entries at all for the null list – a null list is simply an empty range, marked null by the parent’s validity bitmap. The Arrow spec dictates that the offsets buffer for lists works just like for strings, with the child array storing all the elements in order (Arrow Columnar Format — Apache Arrow v19.0.1). If a list is null, the offsets for that slot will be equal (indicating zero length) and the actual values are irrelevant.
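The list-boundary arithmetic can be sketched the same way as strings, again with a hypothetical helper over plain slices rather than the arrow crate API:

```rust
// List<Int32> layout modeled with plain slices (illustrative only).
fn list_at<'a>(child: &'a [i32], offsets: &[i32], i: usize) -> &'a [i32] {
    &child[offsets[i] as usize..offsets[i + 1] as usize]
}

fn main() {
    // [[1, 2], null, [3]]: the child array holds [1, 2, 3]
    let child = [1, 2, 3];
    let offsets = [0, 2, 2, 3];
    let validity = [true, false, true]; // nullness lives in the parent bitmap
    let empty: &[i32] = &[];
    assert_eq!(list_at(&child, &offsets, 0), &[1, 2][..]);
    assert_eq!(list_at(&child, &offsets, 1), empty); // null slot: zero-length range
    assert_eq!(list_at(&child, &offsets, 2), &[3][..]);
    assert!(!validity[1]);
}
```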

A fixed-size list has no offsets buffer – because each sub-list is a constant length (a parameter of the type). For example, a FixedSizeList<3, Int32> means each list contains exactly 3 Int32s. In memory, a FixedSizeList just has a validity bitmap (Buffer 0) and the child array of type Int32 containing all the values for all lists (Arrow Columnar Format — Apache Arrow v19.0.1). The length of the child array is parent_length * 3 in this case, and you can compute element positions by simple arithmetic (index * 3, etc.), so no explicit offsets are needed. If a fixed-size list slot is null, typically the child values for that slot are still allocated but considered “don’t care” (often left as zero) (In-Memory Analytics with Apache Arrow | Packt G.R. Jenkin & Associates).
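The no-offsets arithmetic for fixed-size lists reduces to an index computation. A minimal sketch (the `fixed_list_get` helper is hypothetical):

```rust
// FixedSizeList<3, Int32>: element j of list i is plain arithmetic, no offsets buffer.
fn fixed_list_get(child: &[i32], list_size: usize, i: usize, j: usize) -> i32 {
    child[i * list_size + j]
}

fn main() {
    // Two lists of 3 values each: [[1,2,3], [4,5,6]] stored as one child of length 6
    let child = [1, 2, 3, 4, 5, 6];
    assert_eq!(fixed_list_get(&child, 3, 0, 2), 3);
    assert_eq!(fixed_list_get(&child, 3, 1, 0), 4);
}
```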

Struct arrays represent a column with multiple named sub-columns (fields) – analogous to a table row type. A Struct in Arrow is stored as parallel child arrays for each field, plus a single shared validity bitmap (Buffer 0) for the struct as a whole (Arrow Columnar Format — Apache Arrow v19.0.1). All child field arrays have the same length as the struct. If the struct has a null in some position, none of the field values at that index should be considered valid (even though the child arrays might still have a value in memory for that index) (Arrow Columnar Format — Apache Arrow v19.0.1). Each field’s array can also have its own null bitmap for its values, but those nulls are independent of the top-level struct’s nullness. In other words, a struct value is only non-null if all its fields have a value (or at least the struct says it’s valid); however, a struct can be valid while some field has a null – meaning “the struct is present, but that particular field is null” (Arrow Columnar Format — Apache Arrow v19.0.1). To avoid confusion: Arrow treats struct’s own nullness separately – the top-level bitmap says whether the entire struct is null or not, and each field’s bitmap says if that field is null within a non-null struct. (If a struct is null, typically its field values at that index can be left undefined, similar to how a null list or null primitive can have garbage in the data buffer.)

In Rust, a StructArray is basically a container of Arc<dyn Array> for each field plus an optional NullBuffer for the struct’s validity. The Arrow Rust API ensures that all child arrays have the same length and manages the struct’s null count. You can retrieve each child by field name or index. Internally, StructArray doesn’t use any extra data buffers beyond the child arrays; Buffer 0 is the struct’s null bitmap if any (Arrow Columnar Format — Apache Arrow v19.0.1).
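The struct/field nullness interaction can be summarized in one predicate. A sketch with plain bool slices (the `field_is_valid` helper is hypothetical, not a StructArray method):

```rust
// A field value is logically valid only when both the struct's top-level
// bitmap and the field's own bitmap mark that index as valid.
fn field_is_valid(struct_valid: &[bool], field_valid: &[bool], i: usize) -> bool {
    struct_valid[i] && field_valid[i]
}

fn main() {
    // Struct<a: Int32> of length 3
    let struct_valid = [true, false, true]; // top-level bitmap (Buffer 0)
    let a_valid = [true, true, false];      // field "a"'s own bitmap
    assert!(field_is_valid(&struct_valid, &a_valid, 0));
    assert!(!field_is_valid(&struct_valid, &a_valid, 1)); // whole struct is null here
    assert!(!field_is_valid(&struct_valid, &a_valid, 2)); // struct valid, field "a" null
}
```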

Union Types (Type IDs and Offsets for Mixed Types)

Arrow’s Union type allows elements of multiple different types to coexist in one array. Each slot can be a value of one of several child types. Arrow defines two physical layouts for unions: dense union and sparse union (Arrow Columnar Format — Apache Arrow v19.0.1) (Arrow Columnar Format — Apache Arrow v19.0.1). Both use a type IDs buffer (and no validity bitmap at the parent level) but differ in how they handle offsets and child storage.

A key point: Unions do not have a validity bitmap of their own (Arrow Columnar Format — Apache Arrow v19.0.1). If an element is null, it must be represented by the child array’s value being null (for a dense union, the child’s validity bit at the given offset is 0; for a sparse union, the child at that index is null). For example, a dense union could encode a null at index 1 by giving that slot a type ID (say 0, for a Float child) and an offset (say 1) pointing to a position in the Float child whose value is null (Arrow Columnar Format — Apache Arrow v19.0.1). Thus union nulls piggyback on child nulls. The union’s null_count can be derived by summing nulls across children for dense (with care) or by looking at the active child at each index for sparse (only one child is active at a time).

In Rust, union arrays are less commonly used and may not have full stable API support at the moment, but conceptually one would have a UnionArray with fields like type_ids: Buffer<i8>, an optional offsets: Buffer<i32> (for dense), and a Vec<ArrayRef> for children. The DataType::Union in Rust’s Arrow schema will carry metadata about whether it’s dense or sparse and the mapping of type IDs to child fields. The arrow crate ensures that if you construct a union array, the buffers and child arrays meet the spec (e.g., sparse children all same length as parent; dense offsets within child bounds, etc.). Accessing a value would involve checking the type_id at index and downcasting to the appropriate child array type.
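The type_ids/offsets lookup for a dense union can be sketched in plain Rust. The `resolve` helper and the two-child layout here are hypothetical, meant only to illustrate the access pattern described above:

```rust
// Dense union access sketch: type_ids picks the child, offsets indexes into it.
fn resolve(type_ids: &[i8], offsets: &[i32], i: usize) -> (i8, usize) {
    (type_ids[i], offsets[i] as usize)
}

fn main() {
    // Dense union of Int32 (type id 0) and Utf8 (type id 1), length 4
    let type_ids: [i8; 4] = [0, 1, 0, 1];
    let offsets: [i32; 4] = [0, 0, 1, 1]; // position within the chosen child
    let ints = [7, 9];
    let strs = ["a", "b"];
    let describe = |i: usize| match resolve(&type_ids, &offsets, i) {
        (0, off) => ints[off].to_string(),
        (_, off) => strs[off].to_string(),
    };
    assert_eq!(describe(0), "7");
    assert_eq!(describe(3), "b");
}
```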

Dictionary-Encoded Arrays

Dictionary encoding is a compression technique Arrow supports at the array level. A dictionary-encoded array replaces actual values with integer codes (indices) pointing into a separate dictionary array of unique values (Arrow Columnar Format — Apache Arrow v19.0.1) (In-Memory Analytics with Apache Arrow | Packt G.R. Jenkin & Associates). For example, an array ["foo","bar","foo","bar",null,"baz"] could be dictionary-encoded by extracting the unique values ["foo","bar","baz"] into a dictionary, and representing the original array as indices [0,1,0,1,null,2] referencing that dictionary (In-Memory Analytics with Apache Arrow | Packt G.R. Jenkin & Associates). This can save memory and speed up comparisons if there are many repeats.
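The encoding step can be done by hand to see how the dictionary and indices arise. This is a sketch using std collections, not the arrow crates’ dictionary builders; `dict_encode` is a hypothetical helper:

```rust
use std::collections::HashMap;

// Hand-rolled dictionary encoding: extract unique values and integer codes.
fn dict_encode<'a>(input: &[Option<&'a str>]) -> (Vec<&'a str>, Vec<Option<i32>>) {
    let mut dict = Vec::new();
    let mut seen: HashMap<&str, i32> = HashMap::new();
    let keys = input
        .iter()
        .map(|v| {
            v.map(|s| {
                *seen.entry(s).or_insert_with(|| {
                    dict.push(s);
                    (dict.len() - 1) as i32
                })
            })
        })
        .collect();
    (dict, keys)
}

fn main() {
    let input = [Some("foo"), Some("bar"), Some("foo"), Some("bar"), None, Some("baz")];
    let (dict, keys) = dict_encode(&input);
    assert_eq!(dict, ["foo", "bar", "baz"]);
    assert_eq!(keys, [Some(0), Some(1), Some(0), Some(1), None, Some(2)]);
}
```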

Physically, a dictionary-encoded array looks like a primitive integer array (of type Int8, Int16, Int32, etc., as needed to index the dictionary) plus an associated dictionary mapping. In the Arrow layout tables, it is shown as having Buffer 0: validity bitmap and Buffer 1: data (indices) (Arrow Columnar Format — Apache Arrow v19.0.1). There is no Buffer 2 because the actual values live in the separate dictionary array. The dictionary itself can be any Arrow array (even nested) and is typically stored elsewhere (in IPC, dictionaries are transmitted in special DictionaryBatch messages). In-memory, an ArrayData for a dictionary array in Rust will have a data_type = Dictionary(index_type, value_type) and usually holds the index buffer and a reference to the dictionary values. In Rust’s high-level API, DictionaryArray<K> is a struct that contains: a PrimitiveArray<K> for keys (the indices) and an Arc<dyn Array> for the dictionary values, plus an is_ordered flag (and its own DataType) (DictionaryArray in arrow::array - Rust) (DictionaryArray in arrow::array - Rust). For example, DictionaryArray<Int32> might contain an Int32Array of codes and a StringArray of values.

When you access a value from a dictionary array (e.g., dict_array.value(i)), the library takes the code at keys[i] and looks up the dictionary at that position. This indirection is usually handled transparently. The dictionary’s nulls are independent of the indices’ nulls. The Arrow spec allows the dictionary itself to contain nulls or duplicates, though typically dictionaries hold unique values (In-Memory Analytics with Apache Arrow | Packt G.R. Jenkin & Associates). The null count of a dictionary array is determined solely by the indices’ validity bitmap (a null index means the whole value is null) and ignores whether the dictionary value at that index is null (In-Memory Analytics with Apache Arrow | Packt G.R. Jenkin & Associates). In other words, a valid index that points to a null dictionary entry still yields a logically null value – unusual, but allowed. Usually, one avoids nulls inside dictionaries for simplicity.

From a memory layout perspective, dictionary encoding’s benefit is that the index buffer is much smaller than storing full values, especially when values are long strings or highly repeated. Arrow’s format allows any array to be dictionary-encoded (it’s a logical wrapper around the physical index buffer). The dictionary values are stored as a separate Arrow array so they can themselves leverage Arrow’s efficient format (contiguous, possibly further compressed, etc.). Many Arrow algorithms can operate on the indices without expanding them (for instance, sorting a dictionary array can sort the indices then permute the dictionary if needed).

In Rust, creating a dictionary array might involve first building the dictionary value array and an index array, then constructing DictionaryArray<K> with them. The API provides safe constructors that check that index values are within bounds of the dictionary length. You can also start with a StringArray and call .unify_dictionary or similar to factor it into a dictionary representation. Once created, the dictionary array behaves like any other array (with its own DataType indicating dictionary), and you can extract the .keys() (indices) or .values() (the dictionary array) if needed (DictionaryArray in arrow::array - Rust) (DictionaryArray in arrow::array - Rust).

Run-End Encoded Arrays (Run-Length Encoding Variation)

Run-End Encoding (REE) is a columnar run-length encoding introduced in Arrow format v1.3 (Arrow Columnar Format — Apache Arrow v19.0.1). It is useful when an array has consecutive runs of the same value. Instead of storing each value for each index, REE stores runs as (value, run_end_index) pairs. Arrow’s layout for a run-end encoded array uses two child arrays: one for run end indices and one for values (Arrow Columnar Format — Apache Arrow v19.0.1) (Arrow Columnar Format — Apache Arrow v19.0.1). There are no buffers in the parent (no validity bitmap or data buffer directly in the parent array) (Arrow Columnar Format — Apache Arrow v19.0.1). The parent’s length is conceptually the last run end, and null count is always 0 at the parent (since nulls can be represented as a value in the runs if needed) (Arrow Columnar Format — Apache Arrow v19.0.1).

The first child, conventionally named "run_ends", is an Int16/Int32/Int64 array giving the exclusive end index of each run, i.e., one past the run’s last logical position (Arrow Columnar Format — Apache Arrow v19.0.1). The second child, "values", holds the value for each run and has the same length (the number of runs). For example, consider an array of length 7: [1.0, 1.0, 1.0, 1.0, null, null, 2.0]. In REE form it becomes run_ends = [4, 6, 7] and values = [1.0, null, 2.0] (Arrow Columnar Format — Apache Arrow v19.0.1). This means: indices 0 to 3 hold 1.0 (the run ends at 4, i.e., index 3 inclusive), indices 4 to 5 hold null (run ends at 6), and index 6 holds 2.0 (run ends at 7). Each run’s length is the current run_end minus the previous one (or 0 at the start). To find the value at a specific index, you binary-search run_ends for the run covering that index (Arrow Columnar Format — Apache Arrow v19.0.1).
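The binary search over run ends is a `partition_point` in Rust. A self-contained sketch (the `ree_get` helper is hypothetical, not an arrow crate API):

```rust
// Locate the value at logical index i: the first run whose end exceeds i.
fn ree_get(run_ends: &[i64], values: &[Option<f64>], i: i64) -> Option<f64> {
    let run = run_ends.partition_point(|&end| end <= i); // binary search
    values[run]
}

fn main() {
    // REE of [1.0, 1.0, 1.0, 1.0, null, null, 2.0] (logical length 7)
    let run_ends = [4i64, 6, 7];
    let values = [Some(1.0), None, Some(2.0)];
    assert_eq!(ree_get(&run_ends, &values, 0), Some(1.0));
    assert_eq!(ree_get(&run_ends, &values, 4), None);      // inside the null run
    assert_eq!(ree_get(&run_ends, &values, 6), Some(2.0)); // last element
}
```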

Arrow’s REE design decision was to use child arrays instead of buffers for run ends and values, even though run ends are fixed-width integers, in order to keep the parent array’s length separate from the physical length of the runs (Arrow Columnar Format — Apache Arrow v19.0.1). The parent array length is the logical length (e.g., 7 in the example), whereas the run_ends and values children have length equal to the number of runs (3 in the example). This separation is clearer with full child arrays carrying their own metadata than it would be with, say, a run_ends buffer on the parent – that would give the parent a buffer length not directly tied to its logical length, which Arrow’s metadata model avoids (Arrow Columnar Format — Apache Arrow v19.0.1).

In Rust, run-end encoded arrays aren’t yet first-class types at the time of writing, but one can imagine a RunEndEncodedArray<K, ValueType> struct that contains two fields: run_ends: PrimitiveArray<K> and values: ArrayRef (of the value type). The DataType for such an array would indicate run-end encoding with a run-end index type K. The Arrow spec says run_ends cannot have nulls (since a run must end at a valid index) and must be strictly increasing positive integers (Arrow Columnar Format — Apache Arrow v19.0.1). The values array can be any type (including nulls to signify runs of null). Many operations on REE arrays would first need to decode them to a regular array or handle them specially (e.g., scanning or filtering could potentially be done on the runs representation directly). Because this is a newer addition, library support is evolving.

Alignment and Padding Rules in Arrow

Arrow’s buffers are all aligned to 8 bytes, and in practice often to 64 bytes, to satisfy hardware alignment and performance guidelines (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316) (Arrow Columnar Format — Apache Arrow v19.0.1). Moreover, each buffer’s size is padded up to a multiple of 8 bytes (or 64) (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316) (Arrow Columnar Format — Apache Arrow v19.0.1). This means if you have, say, 10 bytes of actual data, the buffer might be allocated as 16 bytes (next multiple of 8). Padding bytes are unused and can be arbitrary values (Arrow Columnar Format — Apache Arrow v19.0.1). The rationale for 64-byte alignment is to match typical cache line sizes and allow efficient SIMD loads (AVX2 is 32 bytes, AVX-512 is 64 bytes wide) (Arrow Columnar Format — Apache Arrow v19.0.1) (In-Memory Analytics with Apache Arrow | Packt G.R. Jenkin & Associates). Aligning to 64 can also reduce false sharing or partial cache line reads on some architectures.
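The rounding rule is simple to state in code. A minimal sketch, assuming a power-of-two alignment (pad_to is a hypothetical helper, not an Arrow API):

```rust
// Sketch: Arrow pads each buffer's allocation up to a multiple of
// 8 (or, recommended, 64) bytes. For a power-of-two alignment this
// is a round-up-and-mask operation.
fn pad_to(len: usize, align: usize) -> usize {
    debug_assert!(align.is_power_of_two());
    (len + align - 1) & !(align - 1)
}

fn main() {
    assert_eq!(pad_to(10, 8), 16);  // the 10-byte example from the text
    assert_eq!(pad_to(10, 64), 64);
    assert_eq!(pad_to(64, 64), 64); // already aligned: no extra padding
    println!("ok");
}
```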

In the Arrow C++ implementation, memory is allocated with 64-byte alignment by default. The Rust implementation similarly aligns allocations. In fact, the arrow-buffer crate defines an alignment constant: on 64-bit architectures it uses 128-byte alignment (1<<7) as a further optimization (alignment.rs - source) (alignment.rs - source). (On 32-bit it uses 64 bytes.) The code comment references Intel’s recommendations for aligning on 128 bytes for certain streaming optimizations (alignment.rs - source) (alignment.rs - source). This is an implementation choice in Arrow Rust to push alignment beyond the minimum – it doesn’t change the format (which only mandates 8-byte multiples), but it may squeeze a bit more performance on modern CPUs. Regardless, when Arrow data is sent via IPC or shared memory, the buffers will be padded to 8 or 64 bytes so that any language reading it can assume that alignment (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316).

The padding to a multiple of 8 bytes ensures that if you append buffers end-to-end (for example, in an IPC message body), each buffer starts at an 8-byte boundary (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316). Arrow’s IPC format uses an 8-byte alignment for the message body and even adds padding bytes after the Flatbuffer metadata to achieve that (Arrow Columnar Format — Apache Arrow v19.0.1). This way, when the message is loaded, the pointers to buffers (often computed via offsets) will naturally be 8-byte aligned. Arrow’s spec explicitly states that when reconstructing arrays from metadata, you can use pointer arithmetic with the known offsets, and the data will be properly aligned for direct use (Arrow Columnar Format — Apache Arrow v19.0.1) (Arrow Columnar Format — Apache Arrow v19.0.1).

In Rust, the Buffer struct abstracts a contiguous memory region and ensures alignment by using Arrow’s custom allocator or by realigning if necessary. The arrow_buffer::Allocation trait and related functions handle this under the hood. The function Buffer::from_vec takes ownership of an existing Vec’s allocation where possible, avoiding a copy (Buffer in arrow::buffer - Rust) (Buffer in arrow::buffer - Rust). When slicing a buffer (zero-copy), Arrow keeps track of an internal offset but preserves the original aligned pointer for safety (Buffer in arrow::buffer - Rust) (Buffer in arrow::buffer - Rust). The Buffer also carries a capacity which is usually a multiple of 64, and a length which may be less; the extra bytes are padding that may remain unused but ensure any subsequent extension stays aligned (Buffer in arrow::buffer - Rust) (Buffer in arrow::buffer - Rust).

To summarize, Arrow’s alignment and padding rules are all about ensuring proper hardware alignment and zero-copy interoperability. They don’t affect the logical content of arrays, only how the buffers are allocated and laid out in memory. Rust’s Arrow implementation adheres to these rules by using aligned allocators and by providing safe APIs that enforce buffer size and alignment invariants (e.g., ArrayData::validate checks that each buffer is large enough for the declared length, offset, and data type). Alignment is one of the key reasons Arrow can be zero-copy: an Arrow array from C++ can be handed to Rust or Python without copying, because the pointers and data align to what each expects on their architecture.

Arrow Rust Implementation: Buffers, ArrayData, and Memory Safety

The Apache Arrow Rust crates (arrow, arrow-array, arrow-buffer, etc.) provide a memory-safe interface to Arrow’s memory layout. Rust’s strict ownership model and type system help catch errors, but Arrow’s design requires some unsafe underpinnings because we are dealing with raw memory buffers. The developers carefully encapsulate unsafe code inside safe abstractions to maintain soundness (arrow - Rust).

Key building blocks in Rust Arrow:

  • Buffer: The arrow_buffer::Buffer struct represents a contiguous memory region. It internally holds an Arc<Allocation> (reference-counted allocation) and pointers/offsets into that allocation (Buffer in arrow::buffer - Rust) (Buffer in arrow::buffer - Rust). When you clone or slice a Buffer, you get a new Buffer struct pointing to the same underlying Arc memory (with possibly different start or length), so the memory is shared. The Allocation knows how to free the memory when the last reference is dropped. This ensures that multiple arrays can reference the same data without double-free or leaks. For example, slicing an array (like array.slice(1, 2)) will create new ArrayData that reuses the parent’s buffers with an adjusted offset, and the underlying Buffer’s Arc count is incremented (arrow_array - Rust) (arrow_array - Rust). The memory won’t be freed until all slices are dropped. This is how Arrow achieves zero-copy slicing in Rust safely. Creating a Buffer from scratch can be done via safe methods like Buffer::from_slice_ref or Buffer::from(vec) which allocate and copy if needed, or through an unsafe from_custom_allocation if you have an already allocated, aligned pointer (Buffer in arrow::buffer - Rust) (Buffer in arrow::buffer - Rust).

  • ArrayData: This struct (in the arrow-data crate) is a low-level description of an Arrow array’s memory. It contains the DataType (logical type), length, offset (in elements) into the buffers, a vector of Buffer objects for the value buffers, an optional NullBuffer for the validity bitmap, and a vector of child ArrayData for nested types (ArrayData in arrow_data - Rust) (ArrayData in arrow_data - Rust). ArrayData is analogous to the C++ ArrayData or the IPC metadata – it’s basically the pieces needed to reconstruct an array. Higher-level array types in arrow-array (like Int32Array, ListArray, etc.) are thin wrappers around ArrayData that provide type-specific methods (and ensure the DataType matches). For example, Int32Array internally holds an ArrayData and just casts Buffer[0] to NullBuffer and Buffer[1] to &[i32] for convenience. ListArray holds an ArrayData whose child_data[0] is the child’s ArrayData, and provides methods to get offsets, etc. The ArrayData struct provides methods to validate that buffer lengths make sense (for example, that the length of the buffers matches the array length times the type width, etc.) (ArrayData in arrow_data - Rust) (ArrayData in arrow_data - Rust). This validation helps catch misconstructed arrays early (important in Rust since misuse of buffers could lead to memory unsafety).

  • Array trait and concrete arrays: The arrow-array crate defines an Array trait that all array types implement, plus concrete structs for each type (PrimitiveArray, BooleanArray, ListArray, StructArray, DictionaryArray, etc.) (arrow_array - Rust) (arrow_array - Rust). These structs provide an idiomatic API (for example, PrimitiveArray<T>::values() returns a slice &[T] of the values (arrow_array - Rust) (arrow_array - Rust), BooleanArray::values() returns a BooleanBuffer of the bits, StringArray::value(i) returns a &str, etc.). They also uphold invariants: for instance, StringArray always has exactly 3 buffers in its ArrayData (bitmap, offsets, values) (arrow_array - Rust). The Rust type system encodes some of these relationships; e.g., PrimitiveArray<T> is generic over ArrowNativeType T and knows that Buffer[1] is of type T. Each array struct usually contains either its ArrayData or the specific buffers directly as fields. For example, BooleanArray has fields values: BooleanBuffer and nulls: Option<NullBuffer> for the validity (BooleanArray in arrow::array - Rust); DictionaryArray<K> has keys: PrimitiveArray<K> and values: Arc<dyn Array> for the dictionary values (DictionaryArray in arrow::array - Rust) (DictionaryArray in arrow::array - Rust). These structs are cheap to clone (they use Arc internally) and are Send + Sync (because the Buffer’s memory is ref-counted and thread-safe).

  • Memory safety and ownership: By using Arc for buffers and careful slice tracking, Arrow’s Rust implementation ensures that even if you have multiple arrays pointing into the same memory, it won’t be freed prematurely. For example, if you take a slice of an array, both the original and the slice share the buffer. They both increment the Arc count, so when one is dropped, the other still holds the buffer alive (arrow_array - Rust) (arrow_array - Rust). This design avoids copying data. It is vital, though, that the buffers are treated as immutable – and indeed, Arrow arrays are immutable by design (no method to mutate an array’s data in place) (In-Memory Analytics with Apache Arrow | Packt G.R. Jenkin & Associates) (Internal structure of Arrow objects • Arrow R Package). This immutability is what allows sharing: you don’t have to worry about someone modifying the buffer while another is reading it, and it aligns with Rust’s general borrow rules. If you need to “modify” an Arrow array, you typically create a new one (or use a mutable builder to build one). The MutableBuffer and builder APIs exist for constructing arrays efficiently, but once an Array is finished, it’s not meant to change.

  • Use of unsafe: The Arrow crate uses unsafe internally in a controlled way, for example when constructing an &[T] from a raw pointer in a Buffer, or when implementing bit operations. These are wrapped in safe interfaces. The documentation states that it endeavors to be sound, meaning no undefined behavior should be possible through safe usage (arrow - Rust). If you, as a user, stick to the provided APIs (like using builders or Array::from methods), you won’t have to directly handle raw pointers. However, if you use low-level functions such as ArrayData::new_unchecked or the unsafe Buffer::from_custom_allocation, you must ensure to follow the invariants (for instance, you must not create an ArrayData with a length that doesn’t match the buffer’s length, etc.). The library provides ArrayData::try_new and PrimitiveArray::try_new which perform validation for you (arrow_array - Rust).

  • Rust and C interoperability: Arrow’s memory format is the same across languages. The Arrow Rust crate provides a C Data Interface integration (through arrow::ffi module) to allow zero-copy interchange with C/C++ or Python (via PyArrow). This works by exchanging pointers to buffers and schema metadata. Because Rust ensures the buffers are aligned and padded as per Arrow spec, it can share them with C++ or Python without copying. One just has to increment the reference counts appropriately (in Rust this is done by turning them into a C data structure with an appropriate release callback that decrements the Arc; PyArrow does similar). This is beyond the scope of this question, but it’s worth noting that all the design choices – bitmaps, contiguous buffers, alignment – come together to make such interop possible. In Rust, enabling the ipc or ffi feature of the arrow crate gives you the ability to import/export Arrow arrays from byte streams or foreign memory.

  • Dictionary and Run-end in Rust: In Rust’s current implementation (Arrow 13 and up), DictionaryArray is fully supported as described. Run-end encoded arrays are supported via RunArray, though compute-kernel coverage for them is still thinner than for other layouts. Arrow’s design allows extension types as well, which Arrow2 (an independent Rust implementation) uses heavily. But focusing on the official arrow-rs: it models the layout concepts directly. For example, to inspect an array’s memory, you can call array.to_data() to get its ArrayData and then examine its buffers() or child_data(). This low-level view will show exactly Buffer 0 = validity, Buffer 1 = offsets, etc., consistent with the spec.
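The Arc-based sharing model described for Buffer above can be sketched in plain Rust, without the arrow crates (SharedBuffer is a hypothetical stand-in for illustration, not the real arrow_buffer::Buffer):

```rust
use std::sync::Arc;

// Sketch of the sharing model: a "buffer" holds an Arc to the
// allocation plus an offset and length; clones and slices share
// the same underlying memory, freed only when the last Arc drops.
#[derive(Clone)]
struct SharedBuffer {
    data: Arc<Vec<u8>>,
    offset: usize,
    len: usize,
}

impl SharedBuffer {
    // Zero-copy slice: bumps the Arc count, adjusts offset/length.
    fn slice(&self, offset: usize, len: usize) -> SharedBuffer {
        assert!(offset + len <= self.len);
        SharedBuffer {
            data: self.data.clone(),
            offset: self.offset + offset,
            len,
        }
    }

    fn as_bytes(&self) -> &[u8] {
        &self.data[self.offset..self.offset + self.len]
    }
}

fn main() {
    let buf = SharedBuffer { data: Arc::new(vec![1, 2, 3, 4, 5]), offset: 0, len: 5 };
    let s = buf.slice(1, 2);
    assert_eq!(s.as_bytes(), &[2, 3]);
    // Original and slice share one allocation: two strong references.
    assert_eq!(Arc::strong_count(&buf.data), 2);
    println!("ok");
}
```

The real Buffer adds aligned allocation and custom deallocation, but the ownership story (clone the Arc, never copy the bytes) is the same.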
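Similarly, the offsets-plus-data lookup behind a call like StringArray::value(i) can be illustrated with plain slices (string_value is a hypothetical helper mirroring what the library does internally):

```rust
// Sketch: a variable-length (string) column keeps one offsets buffer
// and one data buffer; value i spans data[offsets[i]..offsets[i+1]].
fn string_value<'a>(offsets: &[i32], data: &'a [u8], i: usize) -> &'a str {
    let start = offsets[i] as usize;
    let end = offsets[i + 1] as usize;
    std::str::from_utf8(&data[start..end]).unwrap()
}

fn main() {
    // Logical array: ["foo", "ba", "rust"]
    let offsets = [0i32, 3, 5, 9]; // length + 1 entries
    let data = b"foobarust";       // all values concatenated
    assert_eq!(string_value(&offsets, data, 0), "foo");
    assert_eq!(string_value(&offsets, data, 1), "ba");
    assert_eq!(string_value(&offsets, data, 2), "rust");
    println!("ok");
}
```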

Design Trade-offs: Zero-Copy Access and SIMD Readiness

To conclude, it’s helpful to reiterate why Arrow’s layout is the way it is:

  • Zero-Copy and Interprocess Sharing: By using a flat memory layout (just buffers of primitives) with no internal pointers, Arrow arrays can be shared between processes or libraries without serialization. A memory block can be memory-mapped or sent over shared memory, and the receiver can reconstruct the same array just by pointer offsets (no pointer fix-up needed) (Arrow Columnar Format — Apache Arrow v19.0.1) (arrow - Rust). This is in contrast to, say, a Python list of lists, which would be full of pointers that only make sense in one process’s address space. Arrow’s format is self-contained in terms of offsets within the buffers.

  • Columnar & SIMD-friendly: Storing data for one column in contiguous memory means when you iterate over that column, you get excellent cache locality (Introduction — Apache Arrow v19.0.1) (Introduction — Apache Arrow v19.0.1). Modern CPUs can prefetch sequential memory accesses efficiently. Moreover, Arrow’s contiguous buffers enable the use of SIMD instructions – e.g., you can load 8 doubles at once into a 512-bit ZMM register (AVX-512) if they start at a 64-byte boundary, which Arrow’s alignment helps guarantee (Arrow Columnar Format — Apache Arrow v19.0.1). Many of Arrow’s compute kernels (in C++ and some in Rust) utilize this to process data in batches of 8 or 16 values at a time. Even the validity bitmap can be used with bit-parallel operations (e.g., bit masks for filtering, bitwise AND to combine two boolean masks quickly, etc.).

  • Efficient Null Handling: The validity bitmap allows skipping over nulls quickly. A sequence of operations can often operate on whole 64-bit chunks of the bitmap (using popcount to count non-nulls, etc.). Also, when integrating with CPUs that have specialized instructions (like AVX512’s mask registers), having a compact null representation is useful. In Rust, iterating through an array with an iterator will check the bitmap bit for each value to decide if it’s None or Some(value). This is fairly efficient, but for large operations, often a vectorized approach is used instead. Importantly, the bitmap overhead is low (only 1 bit per element) compared to storing a full boolean per entry or using option types that might double memory.
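The chunk-wise null counting mentioned above can be sketched as follows (count_valid is a hypothetical helper; Arrow bitmaps are LSB-first, one bit per element):

```rust
// Sketch: count valid (non-null) entries by popcounting the validity
// bitmap a byte at a time, masking off padding bits in the last byte.
fn count_valid(bitmap: &[u8], len: usize) -> usize {
    let full_bytes = len / 8;
    let mut count: usize = bitmap[..full_bytes]
        .iter()
        .map(|b| b.count_ones() as usize)
        .sum();
    let rem = len % 8;
    if rem > 0 {
        // Only the low `rem` bits of the trailing byte are meaningful.
        let mask = (1u8 << rem) - 1;
        count += (bitmap[full_bytes] & mask).count_ones() as usize;
    }
    count
}

fn main() {
    // 10 elements: bits 0..=3 and 6 valid in byte 0, bit 9 valid in byte 1.
    let bitmap = [0b0100_1111u8, 0b0000_0010u8];
    assert_eq!(count_valid(&bitmap, 10), 6);
    println!("ok");
}
```

Real implementations widen this to 64-bit words (or SIMD lanes), but the masking of trailing padding bits is the same.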

  • Cross-language consistency: The Arrow format is language-agnostic. Rust, C++, Python (via pyarrow), Java, Go, etc., all follow the same spec. Thus, an Arrow file or stream produced in one can be consumed by another. Rust’s arrow crate ensures that if you create an Arrow RecordBatch (a collection of arrays) and write it using the arrow-ipc writer, it will produce a stream identical to what C++ or Python would produce. This is crucial for interoperability in data systems. It also means the concepts described (buffers, offsets, type IDs) are not unique to Rust – they are core to Arrow.

  • Memory usage vs. CPU trade-off: Arrow sometimes uses more memory than an in-memory object representation might (for instance, the offsets buffer adds overhead, and padding wastes some bytes). However, this is a deliberate trade for predictable memory access patterns and the ability to avoid copies. The dictionary encoding helps mitigate memory usage for repeated values, and run-end encoding helps with long runs of identical values. Those are optional – you choose them when beneficial. In Rust, you can choose to compress an array by dictionary-encoding it, or if you know it’s sparse in another way, you might choose a different encoding. But no matter what, once it’s in Arrow format, processing it will leverage the same columnar advantages.

In summary, Arrow’s physical layout – validity bitmap, contiguous value buffers, offset buffers for variable-length data, separate child arrays for nested types, with alignment and padding – is optimized for fast analytical reads and zero-copy sharing. The Rust implementation brings these benefits into a safe, idiomatic Rust environment, allowing you to build systems (like DataFusion or Polars) that operate on Arrow arrays efficiently. By understanding the low-level layout, a Rust developer can also interface with Arrow at the FFI level or optimize kernels (using the raw buffers for SIMD), all while relying on the core crates to maintain safety and correctness in memory management.
