@rohithreddykota
Created March 29, 2025 12:10

Apache Arrow Physical Data Types and Layouts (v17.0.0)

Apache Arrow defines a standardized in-memory columnar format composed of primitive and nested data types. Each Arrow array is backed by one or more contiguous buffers (blocks of memory) and optional metadata such as length and null count (Internal structure of Arrow objects • Arrow R Package) (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316). This report provides a deep dive into all Arrow physical layouts in version 17.0.0, covering their memory structure, Rust implementation, code examples, and performance trade-offs. We will explore primitives, variable-length types, list types, struct, union, dictionary encoding, run-end encoding, and the null type in detail. Along the way, we highlight buffer alignment, padding for SIMD, and best practices for startups concerned with memory and serialization overhead.

Overview of Memory Layouts and Buffers

Arrow arrays store values in one or more buffers, typically with the following roles (Internal structure of Arrow objects • Arrow R Package) (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316):

  • Validity bitmap (Null bitmap) – a bitmask indicating non-null (1) vs null (0) for each element. This buffer is optional if there are no nulls.
  • Offset buffer – for variable-length types (like strings or lists), an integer buffer that marks the start of each element’s data in a contiguous values buffer.
  • Data buffer – the actual values (for primitives, this is fixed-size elements; for variable-length types, a concatenated blob of bytes for all elements).
  • Type buffer – for union types, an 8-bit buffer indicating the data type id of each element.
  • Run ends buffer – for run-end encoded arrays, an integer buffer indicating where consecutive runs of identical values end.

All Arrow buffers are 8-byte (64-bit) aligned and padded to multiples of 8 bytes, and commonly 64 bytes for cache-line alignment (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316). This alignment ensures efficient vectorized access on modern CPUs. Arrow uses little-endian byte order for multi-byte values (Internal structure of Arrow objects • Arrow R Package) (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316). For performance, it is recommended (though not required) that buffers be 64-byte aligned and padded to fit cache lines, minimizing cross-cache-line reads (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316).

(image) Quick summary of physical layouts and buffers for each Arrow type (buffer indices correspond to validity bitmap = 0, offset or type buffers = 1, etc.)

As shown in the summary above, each data type has a predictable physical layout in memory. We will now examine each type in detail, explaining its buffers, memory organization (with diagrams), Rust API, and use cases.

Primitive Types (Integers, Floats, Booleans)

Physical Layout: Primitive arrays consist of a validity bitmap (optional) and a value buffer of fixed-width elements (Internal structure of Arrow objects • Arrow R Package) (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316). Each element occupies a fixed number of bytes (e.g. 1 byte for Int8, 4 bytes for Float32, 8 bytes for Int64, etc.). If an array has nulls, the validity bitmap buffer stores 1 bit per element (bit=1 for non-null, 0 for null) (Internal structure of Arrow objects • Arrow R Package). The value buffer contains the values laid out consecutively. For example, an Int32 array of 5 elements [1, null, 2, 4, 8] would have: a length of 5, a null count of 1, a 5-bit validity bitmap 10111 in element order (packed LSB-first into one byte as 00011101 in memory) and a 5×4-byte data buffer (Internal structure of Arrow objects • Arrow R Package). The data buffer stores 1, (unspecified), 2, 4, 8 in 4-byte slots, where "unspecified" means undefined memory for the null slot (Internal structure of Arrow objects • Arrow R Package). Both buffers are padded up to 64 bytes for alignment (Internal structure of Arrow objects • Arrow R Package). Notably, if the null count is 0 (no nulls), the validity buffer can be omitted entirely to save space (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316). In that case, Arrow arrays assume all values are valid by default.

Booleans are a special case of primitives: they are stored as bit-packed values in the data buffer (1 bit per boolean) rather than 1 byte each (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316). For example, a boolean array [true, false, true, true] would have a data buffer whose first byte holds the bits 1, 0, 1, 1 in element order (packed LSB-first within the byte, i.e. 0b00001101). The validity bitmap (if any) is separate from the boolean data bits. Bit-packing booleans improves memory density at the cost of needing bitwise operations to access values.

Memory Diagram: For the primitive int32 array [1, null, 2, 4, 8] above, the memory layout looks like:

  • Buffer 0: Validity bitmap: one byte 00011101 (bits LSB-first: 1, 0, 1, 1, 1), padded to 64 bytes.
  • Buffer 1: Value buffer (int32): [1, (unspecified), 2, 4, 8] in 4-byte slots (20 bytes, padded to 64 bytes).

Rust Implementation: In Arrow’s Rust crates, primitive arrays are represented by concrete types like Int32Array (an alias for PrimitiveArray<Int32Type>) (Int32Array in arrow::array - Rust). Each PrimitiveArray stores an optional NullBuffer for validity and a contiguous Buffer for values. You can construct primitive arrays in high-level ways. For example:

use arrow_array::Int32Array;
let arr = Int32Array::from(vec![Some(1), None, Some(2)]);

This creates an Int32Array from a Rust Vec<Option<i32>>, automatically building the null bitmap and value buffer (Int32Array in arrow::array - Rust). Similarly, Int32Array::from(vec![1,2,3]) will create a non-nullable array (Int32Array in arrow::array - Rust) (no validity buffer since all values are valid). Under the hood, these constructors use Arrow’s buffer builder APIs. For explicit control, you can use ArrayData::builder with DataType::Int32, then .add_buffer(values_buffer) and .null_bit_buffer(bitmap_buffer) to assemble an array. For instance:

use arrow_array::Int32Array;
use arrow_buffer::Buffer;
use arrow_data::ArrayData;
use arrow_schema::DataType;
let values: [i32; 3] = [10, 20, 30];
let bitmap: [u8; 1] = [0b00000111]; // all 3 values non-null (bits 111)
let data = ArrayData::builder(DataType::Int32)
    .len(3)
    .null_count(0)
    .add_buffer(Buffer::from_slice_ref(&values))        // value buffer
    .null_bit_buffer(Some(Buffer::from_slice_ref(&bitmap))) // validity buffer
    .build().unwrap();
let primitive_array = Int32Array::from(data);

Here we manually construct the buffers and wrap them in an ArrayData. Usually, using higher-level APIs like Int32Array::from or PrimitiveBuilder is more convenient, but ArrayData provides flexibility to build any array if needed (how to create a polars-arrow Array from raw values (&[u8]) - Stack Overflow). Arrow’s Buffer type ensures memory is properly aligned and padded. When no nulls are present, you can omit the null bitmap (or call NullBuffer::new_valid(len) to explicitly create an all-valid buffer) (NullBuffer in arrow::buffer - Rust). If all values are null, Arrow uses a NullArray optimization (discussed later in the Null type section).

Performance: Primitive arrays offer O(1) random access to any element’s value by direct indexing into the value buffer (and a bit check in the bitmap for validity). The fixed-width layout is very friendly to SIMD instructions and CPU caching – values are tightly packed and aligned, allowing vectorized operations across the buffer (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316). For instance, computing a function on all values (ignoring nulls) can be done by loading 128-bit or 256-bit chunks of the data buffer at once. The validity bitmap can also be processed 8 or 64 bits at a time to quickly skip nulls (Internal structure of Arrow objects • Arrow R Package). Boolean data benefits from bit-packing to reduce memory footprint, but each access requires bit masking. In general, primitive arrays are the most efficient in Arrow for computation. The overhead to note is padding: small arrays still allocate buffers rounded up to 64 bytes, which is negligible for large arrays but can proportionally impact very small arrays (a trade-off for alignment). For best performance, prefer primitive types for numeric data, and use booleans for binary flags (the slight overhead of bit unpacking is often offset by the memory savings and cache benefits of fitting 8 booleans per byte).

Variable-Length Binary Types (Binary and Utf8)

Physical Layout: Variable-length binary (byte sequences) and UTF-8 string types in Arrow use three buffers: a validity bitmap (optional), an offsets buffer, and a data buffer (Internal structure of Arrow objects • Arrow R Package). Arrow’s Binary (Utf8) type uses 32-bit offsets (supporting up to $2^{31}-1$ bytes of data) (DataType in arrow::datatypes - Rust), whereas LargeBinary (LargeUtf8) uses 64-bit offsets for extremely large data (up to $2^{63}-1$ bytes) (DataType in arrow::datatypes - Rust). The offsets buffer has length = number of elements + 1, with the first entry 0 and the last entry = total bytes in the data buffer (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316). Each offsets entry indicates the starting byte index of the corresponding element in the data buffer. The length of element i can be computed as offsets[i+1] - offsets[i] (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316). The data buffer is a contiguous blob of all the binary values concatenated together (Internal structure of Arrow objects • Arrow R Package), typically with no delimiter (the offsets provide the segmentation).

For example, consider a Utf8 (string) array of length 5: ["hello", "amazing", "and", "cruel", "world"]. The offsets buffer might contain [0, 5, 12, 15, 20, 25] (in bytes) – meaning: element0 starts at 0, element1 at 5, element2 at 12, etc., and the last value ends at byte 25. The data buffer would store the actual UTF-8 bytes for "helloamazingandcruelworld" concatenated (Internal structure of Arrow objects • Arrow R Package). If any element is null, the validity bitmap bit is 0, and typically the offsets for that null element repeat (offset[i] == offset[i+1]), indicating an empty span in the data buffer (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316). The Arrow specification dictates that even null elements must have a valid offsets entry (usually the same as the next offset) to maintain consistency (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316). As with all buffers, the offsets and data buffers are 8-byte aligned; the data buffer is often padded to a multiple of 8 bytes with unused bytes at the end for alignment (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316).

Memory Diagram: For the above string array:

  • Buffer 0: Validity bitmap for 5 elements (1 byte, if any nulls).
  • Buffer 1: Offsets (int32) of length 6. In memory (little-endian 4 bytes each) it would look like: [0, 5, 12, 15, 20, 25] representing starting indices.
  • Buffer 2: Data buffer of 25 bytes, containing the UTF-8 bytes of all strings back-to-back: "helloamazingandcruelworld". If padded to 32 bytes (next multiple of 8), the last 7 bytes are unused padding.

If element 1 ("amazing") were null instead, the bitmap bit1 = 0 and offsets would be [0, 5, 5, 15, 20, 25] – notice offset[1] and offset[2] both 5, indicating that element1 occupies zero bytes (null) (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316).

Rust Implementation: Arrow’s Rust API provides BinaryArray/StringArray (for Utf8) and their “large” 64-bit variants. You can use builders like StringBuilder or simply collect from a vector of Option<String>. For example:

use arrow_array::StringArray;
let arr = StringArray::from(vec![
    Some("hello"),
    None,
    Some("world")
]);

This creates a StringArray with values "hello", null, "world". Internally, this sets the offsets buffer to [0, 5, 5, 10] and the data buffer to "helloworld" (how to create a polars-arrow Array from raw values (&[u8]) - Stack Overflow). The second element’s offsets (5 to 5) denote an empty string for the null. If you need to build arrays manually, you can use ArrayData::builder(DataType::Utf8) and supply two buffers: first the offsets (as a Buffer of i32 or i64), then the data buffer of u8 (how to create a polars-arrow Array from raw values (&[u8]) - Stack Overflow). The example below demonstrates manual construction equivalent to the above:

use arrow_array::StringArray;
use arrow_buffer::Buffer;
use arrow_data::ArrayData;
use arrow_schema::DataType;
let offsets: [i32; 4] = [0, 5, 5, 10];
let data: &[u8] = b"helloworld";
let bitmap: [u8; 1] = [0b00000101]; // bits (LSB-first): 1 (hello), 0 (null), 1 (world)
let array_data = ArrayData::builder(DataType::Utf8)
    .len(3)
    .null_count(1)
    .add_buffer(Buffer::from_slice_ref(&offsets))
    .add_buffer(Buffer::from_slice_ref(data))
    .null_bit_buffer(Some(Buffer::from_slice_ref(&bitmap)))
    .build().unwrap();
let string_array = StringArray::from(array_data);

Here we explicitly provide the offsets and data. Arrow will validate that offsets.len() == array.length + 1 and that the last offset equals data length (how to create a polars-arrow Array from raw values (&[u8]) - Stack Overflow). For convenience, one can also use LargeStringArray for 64-bit offsets, or BinaryArray for arbitrary bytes (the StringArray variant validates UTF-8 encoding of the data).

Performance: Accessing an element in a Binary/String array requires two memory lookups: one into the offsets buffer (to get start and end positions) and then into the data buffer to retrieve the bytes. This indirection means a random access is O(1) to get the slice boundaries, but the actual data size varies. When iterating sequentially, the data buffer is iterated linearly which is cache-friendly, but the offsets buffer access can incur some overhead. Overall, string data is less SIMD-friendly than fixed-width data, but Arrow’s format still provides good locality by storing all characters contiguously (Introduction — Apache Arrow v19.0.1) (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316). The use of 32-bit offsets keeps the index small (saves memory vs 64-bit if data < 2GB). There is a trade-off: if you frequently slice or filter string arrays, updating the offset buffer is cheap (just adjust pointers), but if you frequently random-seek, you’ll be jumping between the offset and data buffers.

For startups dealing with text data, Arrow’s string layout is efficient for encoding and transferring large amounts of text (especially when converting to Parquet or IPC, as it can zero-copy slice the buffers). If many strings repeat, consider dictionary encoding them (see Dictionary section) to compress memory. In general, use the normal Binary/Utf8 layout when you want maximum data compactness and sequential read speed. The contiguous data buffer also makes it fast to serialize or hash all values at once.

BinaryView and StringView (Variable-length View Types)

Physical Layout: BinaryView and Utf8View are alternate layouts for variable-length data introduced to improve certain operations (Introduction — Apache Arrow v19.0.1). Logically, they represent the same type of data as Binary/Utf8, but the memory layout differs. Instead of a single offsets buffer and contiguous data buffer, the “view” layout uses a fixed-width view buffer where each element occupies a constant number of bytes (16 bytes) containing a header. This header stores the length of the binary/string and either inline data or a reference (offset) to data stored elsewhere (DataType in arrow::datatypes - Rust). In Arrow’s design, if the value is small enough, it is stored entirely in the view buffer (inline); if it is larger, the view buffer stores (a) a prefix of the data (the first 4 bytes) and (b) an index pointing to an external data buffer where the full value resides (Introduction — Apache Arrow v19.0.1). Because each slot has a fixed footprint, there is no separate offsets buffer; instead, there may be one or multiple values buffers that hold the out-of-line data for large values.

Concretely, each BinaryView/Utf8View element is a 16-byte struct of the form { length: u32, data: ... } where the data portion is either the entire byte sequence (if the length fits inline) or a 4-byte prefix plus a buffer index and offset locating the full value in another buffer. The Arrow spec notes that this allows out-of-order writing of elements and can store values non-contiguously (Introduction — Apache Arrow v19.0.1). It essentially trades some extra space per element for potentially faster element-wise access and comparison. For instance, storing the first 4 bytes in the view provides a quick way to compare strings by prefix (useful because many string inequalities are decided in the first few characters) (Introduction — Apache Arrow v19.0.1).
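The 16-byte view header can be modeled in plain Rust. This is an illustrative sketch of the layout described in the Arrow columnar spec, not the arrow-rs API; per the spec, values of at most 12 bytes are stored inline, and longer values keep a 4-byte prefix plus a buffer index and offset:

```rust
/// Illustrative model of a 16-byte Utf8View/BinaryView slot (not the arrow-rs API).
enum View {
    /// length <= 12: the bytes live entirely inside the view slot.
    Inline { len: u32, data: [u8; 12] },
    /// length > 12: a 4-byte prefix plus a reference into an external buffer.
    Reference { len: u32, prefix: [u8; 4], buffer_index: u32, offset: u32 },
}

fn make_view(value: &[u8], buffer_index: u32, offset: u32) -> View {
    if value.len() <= 12 {
        let mut data = [0u8; 12];
        data[..value.len()].copy_from_slice(value);
        View::Inline { len: value.len() as u32, data }
    } else {
        let mut prefix = [0u8; 4];
        prefix.copy_from_slice(&value[..4]);
        View::Reference { len: value.len() as u32, prefix, buffer_index, offset }
    }
}
```

For example, a 3-byte value like "cat" produces an Inline view, while anything longer than 12 bytes produces a Reference view carrying only its prefix.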

Memory Diagram: Imagine a StringView array with two values: ["cat", "hippopotamus river"]. Each slot in the view buffer is a fixed 16 bytes. For "cat" (length 3), the view stores {len:3, data:"cat", pad...} entirely inline (trailing bytes unused or zeroed). For "hippopotamus river" (length 18, too long to inline), the view stores {len:18, prefix:"hipp", ref:(offset 0 in data_buffer)} – i.e., the first 4 letters plus an index pointing into a separate data buffer, which contains the full "hippopotamus river" bytes. The validity bitmap works as usual (1 bit per entry). There is no offsets buffer; instead each slot in the fixed-size view buffer self-describes where its data is.

This layout means the view buffer is larger (e.g., 16 bytes * number of elements), and possibly multiple data buffers if values are written out-of-order. By not forcing all bytes into one big buffer, you can append large strings without reallocating a huge contiguous buffer or you can gather subsets of strings easily by copying references instead of substrings (Introduction — Apache Arrow v19.0.1).

Rust Implementation: In Arrow Rust, DataType::BinaryView and DataType::Utf8View are defined (DataType in arrow::datatypes - Rust), but as of 17.0.0 they are not fully stabilized for use. The documentation flags them as “NOT YET FULLY SUPPORTED” and warns that using them may panic (DataType in arrow::datatypes - Rust). This means there isn’t a public BinaryViewArray with high-level methods at this time (the feature was introduced but marked experimental). However, the intent is that they would be represented by an ArrayData with one fixed-size buffer (the view buffer) and one or more data buffers for out-of-line storage. The DataTypeLayout in the Rust crate indicates that for view types, the number of buffers is variadic (not strictly 2) (DataTypeLayout - arrow::array). In practice, constructing a view array would involve using an ArrayData builder with DataType::BinaryView and supplying a view buffer (packing each element’s length and prefix) plus a data buffer of concatenated overflow values. Since direct support is limited, there is no dedicated builder in arrow-rs v17 for BinaryView; one would have to construct the buffers manually following the layout rules.

Performance: The BinaryView layout is designed for locality of reference and efficient element-wise operations. Since each element’s metadata and possibly data prefix are stored contiguously in a fixed-size slot, random access to the value is faster – you don’t need to index into a separate offsets buffer and then jump to data; you can often read a small string directly from the view buffer. Comparisons can often short-circuit: the stored prefix (e.g., first 4 bytes) allows quick inequality checks (Introduction — Apache Arrow v19.0.1). Small values (up to a certain number of bytes) incur no extra pointer chase because they’re inline.

However, this comes at a cost: the overall memory usage can be higher. For example, a single-character string takes 16 bytes in the view buffer, whereas in the normal Utf8 layout it would take 4 bytes for offset + 1 byte in data buffer (plus a few padding bytes). Large strings also still require storing their content in a secondary buffer, plus the prefix in the view, and each element carries the overhead of a length field. So for large values, you might be storing a few duplicate bytes (prefix in view plus full value in data). This is a trade-off: BinaryView favors compute speed and selective access over total memory compactness (DataType in arrow::datatypes - Rust). If your workload involves lots of random lookups, comparisons, or constructing strings out-of-order (e.g., building a table by gathering values from many sources), the view layout can be advantageous. It avoids needing to copy all bytes into one contiguous buffer on write and can improve CPU cache usage for certain operations (since the fixed-size header can be vectorized or at least predictably accessed).

For a startup use-case, a BinaryView might be considered when you have many short strings and performance-critical comparisons (like keys in a join or dictionary) or if you need to append strings without contiguous reallocation. In most other cases, the standard Binary/Utf8 layout is preferred due to its lower memory overhead and maturity. As of Arrow 17, given the experimental status of view types in Rust, one might stick to regular Binary and rely on dictionary encoding or other methods for performance until BinaryView is fully supported.

List and FixedSizeList

Physical Layout (List): The List type in Arrow represents sequences (arrays) of a sub-type. It is conceptually similar to an array of arrays (e.g., List<Int32> could be thought of as Vec<Option<Vec<i32>>>). The physical layout of a List is analogous to a string: it uses a validity bitmap for the top-level list entries, an offsets buffer (32-bit for List, 64-bit for LargeList), and a child data buffer which is actually another Arrow array for all the list values concatenated (Introduction — Apache Arrow v19.0.1) (GenericListArray in arrow::array::array - Rust). Each offsets entry indicates the starting index in the child array for the corresponding list element, and the length of the list element is the difference to the next offset (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316). The child array’s length equals the sum of lengths of all individual lists.

For example, consider a List<Int32> array with 4 elements: [[A, B, C], [], NULL, [D]] (here we use letters as placeholder int values for simplicity). The offsets buffer has length 5 (one more than the number of lists) and would contain [0, 3, 3, 3, 4]: list 0 spans child indices 0..3, lists 1 and 2 span nothing (empty and null, respectively), and list 3 spans child indices 3..4.

The validity bitmap for this list array has a bit vector 1101 in element order (the third element is null, bit=0) (GenericListArray in arrow::array::array - Rust). The child values array in this case is an Int32Array of length 4 storing [A, B, C, D] along with its own validity bitmap (in our example those ints are all non-null, though inner values could themselves be null). The list’s null count is independent of the child’s null count – a null list means the entire sublist is null, regardless of child data.

FixedSizeList: A FixedSizeList is a specialized list where each list has the same length k, which is known at schema time. This removes the need for an offsets buffer – given the index of a list element, you can compute its start position by index * k (and the length is always k) (Introduction — Apache Arrow v19.0.1). The physical layout is just a validity bitmap and the child values array (the child array length = number of lists * k). No offsets buffer is stored because it’s redundant; however, Arrow still represents the FixedSizeList type with a parameter list_size. If a particular list slot is null, by convention its child values in that range are considered null/ignored (similar to variable lists).

Memory Diagram: For the variable List example above:

  • Buffer 0: Validity bitmap for 4 lists: bits 1101 in element order (list 2 is null).
  • Buffer 1: Offsets (int32) of length 5: [0, 3, 3, 3, 4].
  • Child array: an Int32Array of length 4 storing [A, B, C, D].

For a FixedSizeList of size k (say FixedSizeList<Int32>(k=3) with 2 elements), memory would be:

  • Buffer 0: Validity for 2 lists.
  • No offsets buffer.
  • Child array: an Int32 array of length 2*3 = 6 values. The first 3 belong to list0, next 3 belong to list1. If list1 is null, its 3 values may be left unspecified or default, but typically one still allocates space for them.

Rust Implementation: Arrow Rust provides ListArray and FixedSizeListArray (in general, GenericListArray<OffsetType> covers both List and LargeList) (GenericListArray in arrow::array::array - Rust) (DataType in arrow::datatypes - Rust). You can create a ListArray using ListBuilder: for example, ListBuilder<Int32Builder> will let you append values for each sublist and end each list as needed. Another approach is to construct via ArrayData. A ListArray’s ArrayData expects two buffers: the validity and the offsets buffer (just like a string) (GenericListArray in arrow::array::array - Rust), plus it holds a pointer to the child array’s ArrayData. In Rust, GenericListArray has fields for value_offsets: OffsetBuffer and an Arc<dyn Array> for the child values (GenericListArray in arrow::array::array - Rust). After constructing, you can obtain the child data via .values() and the offsets via .value_offsets().

For example, using high-level APIs:

use arrow_array::builder::{Int32Builder, ListBuilder};
let int_builder = Int32Builder::new();
let mut list_builder = ListBuilder::new(int_builder);

// Build first list [1,2,3]
list_builder.values().append_value(1);
list_builder.values().append_value(2);
list_builder.values().append_value(3);
list_builder.append(true);  // finish first list (non-null)

// Build second list []
list_builder.append(true);  // an empty list, no values appended in between

// Build third list null
list_builder.append(false); // append a null list (no values)

// Build fourth list [4]
list_builder.values().append_value(4);
list_builder.append(true);

let list_array = list_builder.finish();

This would produce a ListArray with the structure we described ([[1,2,3], [], null, [4]]). If you inspect list_array.value_offsets() it would give [0,3,3,3,4], and list_array.value_length(0) == 3, etc. The ListBuilder takes care of managing the offsets buffer and child builder internally.

For FixedSizeList, Arrow Rust has FixedSizeListBuilder or you can construct by providing a child array and length. The builder ensures each added sublist has the correct fixed length. For instance, for FixedSizeList<UInt8>[3] (triplets of bytes), you would push values in groups of 3 and call append. Under the hood, a FixedSizeListArray stores a NullBuffer for validity and the child Array. The offset for element i is simply i * 3 (not stored, computed).

Performance: Lists carry a similar performance profile to strings: there is an indirection through the offsets buffer to get to the data. In a List, however, the payload is another Arrow array, which could itself be nested or primitive. Accessing an element of a list (i.e., a sublist) involves reading two offsets to know the range, and then potentially processing that many elements in the child array. If you iterate through all elements of all lists, you effectively traverse the offsets buffer and then the entire child array.

The advantage of Arrow’s list layout is that the child data is stored contiguously across all lists, which is cache-efficient when processing the whole array. For example, filtering all integers from all lists can be done by scanning the child Int32Array in one pass. On the other hand, to retrieve a specific list by index, you do a quick O(1) lookup in offsets, but then you may have to copy a segment of the child array or handle a slice.

List vs FixedSizeList: If each list has a known constant length, FixedSizeList avoids the memory and time overhead of the offsets buffer (Introduction — Apache Arrow v19.0.1). Indexing into a fixed-size list element is trivial arithmetic. This can improve performance and reduce memory usage, especially for large numbers of very small lists. For example, a list of 3 floats (a 3D point) can be stored as a FixedSizeList of length 3, which is essentially the same as storing a struct with 3 floats but labeled as a list for schema flexibility. The downside is inflexibility: if any list can have a different length, you must use the generic List. Also, some algorithms might be simpler with explicit offsets.

List vs LargeList: Use LargeList (64-bit offsets) only if you truly expect more than $2^{31}-1$ elements in the child data. Most use cases (even millions of elements) are fine with the standard 32-bit List, which saves half the offset buffer size and is plenty for up to ~2 billion elements in total.

List vs ListView: Arrow also defines a ListView layout (experimental) which adds a size buffer alongside offsets (Introduction — Apache Arrow v19.0.1). In a ListView, offsets need not be monotonically increasing by fixed differences; each element’s length is explicitly stored in a separate buffer. This allows out-of-order offsets and could let the child data buffer be laid out differently (even out-of-order). The trade-off is an extra buffer (the sizes). The benefit is that splitting or slicing lists might be easier (since you can shuffle offsets without recomputing lengths) and certain vectorized operations (like direct gather) can be done by using the sizes as a quick index. In Rust Arrow, DataType::ListView exists and expects an offsets and a sizes buffer (DataType in arrow::datatypes - Rust), but like BinaryView it is not fully supported. Typically, you won’t use ListView unless you have a very specific use-case (like zero-copy filtering where you want to mark some lists as empty by setting size=0 rather than moving offsets).

Best Practices: Use List for variable-length sequences and FixedSizeList when all lists are uniform length. The contiguous storage of child data means that operations on all sub-elements (like aggregations) are efficient. However, if you need to frequently access or modify individual sublists, be mindful of the cost to slice out ranges. For mostly read-heavy pipelines (e.g., analytics), Arrow lists are extremely efficient. If your data model has deeply nested lists, Arrow can handle it (child of a list can itself be a list or struct), but deeply nested data may be less cache-friendly.

From a startup perspective, Arrow’s list layout is helpful for JSON-like data or variable-length records, enabling efficient analytics on them (especially when converting to Parquet, which also supports lists). Just remember that the total memory is the sum of the child array plus the offsets. If memory becomes an issue, consider compressing large list elements via run-end encoding or dictionary if applicable (though for list of numbers, dictionary usually doesn’t apply).

Struct (Composite Types)

Physical Layout: A Struct in Arrow is a nested type that groups multiple fields (columns) together, analogous to a struct or record with named fields. For example, a struct type might be Struct<{x: Int32, y: Int32}> representing points with two coordinates. The physical layout of a struct array is quite straightforward: it consists of an optional validity bitmap for the struct as a whole, and no value buffer of its own (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316). Instead, it has child arrays, one for each field in the struct (Introduction — Apache Arrow v19.0.1). All child arrays are of the same length as the struct. If the struct array has nulls, there is a validity bit for each struct slot indicating if the entire struct (all fields) is null at that index (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316). If a struct value is null, typically all child values at that index can be ignored (they may be set to null or zero, but are not considered meaningful) (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316). If a struct value is non-null, each child’s value at that index is considered part of the struct.

In summary: Struct = container of parallel child arrays + optional null bitmap. There is no offsets buffer because a struct does not vary in length per element – it always has one value (a tuple of fields) per index.

Memory Diagram: Consider Struct<{a: Int32, b: Utf8}> with 3 elements:

Index:  0         1         2
Struct: {a:1,b:"x"}   null      {a:3,b:"yz"}

Memory:

  • Buffer 0: Validity bitmap for struct: e.g. bits 101 (element1 is null) (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316).
  • Child 0 (for field a): an Int32Array of length 3 storing [1, ?, 3] (where ? is undefined or could be any value because the struct at index1 is null – Arrow doesn’t require child[1] to be null, but many implementations set it null for clarity) (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316). It will have its own validity bitmap if a itself has nulls independent of the struct’s null (if a field is required non-null, it might not have a bitmap).
  • Child 1 (for field b): a StringArray of length 3 for ["x", ?, "yz"]. It has its own offsets and data buffers (for "x", "", "yz" perhaps) and validity. If the struct is null at index1, b’s value at 1 might be set to null as well for consistency.

It’s important: the presence of a struct-level null does not automatically propagate to child bitmaps – instead, Arrow expects the consumer to consult the parent (struct) bitmap first. “The parent null bitmap is authoritative” (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316). In other words, even if a child field has a non-null value at a position, the struct can still be null, and the child’s value should be ignored. This avoids duplicate null indicators and allows children to have nulls independently of parent nulls too (though that situation is unusual – typically a null struct means you won’t interpret the children’s values).
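The "parent bitmap is authoritative" rule can be sketched in plain Rust, independent of the arrow crates (bool slices stand in for the validity bitmaps here purely for illustration):

```rust
// Model a Struct<{a: Int32}> slot lookup: the struct-level validity is
// consulted first; the child's value only matters if the parent bit is set.
fn struct_field_value(
    parent_valid: &[bool],  // struct-level validity bitmap
    child_valid: &[bool],   // field-level validity bitmap
    child_values: &[i32],   // field data buffer
    idx: usize,
) -> Option<i32> {
    if !parent_valid[idx] {
        // Null struct: the child's value at idx is ignored, whatever it holds.
        return None;
    }
    if !child_valid[idx] {
        // Present struct, but this particular field is null.
        return None;
    }
    Some(child_values[idx])
}

fn main() {
    let parent_valid = [true, false, true];
    let child_valid = [true, true, true];
    let child_values = [1, 2, 3]; // child still stores 2 at index 1
    // Parent null wins even though the child holds a value there:
    assert_eq!(struct_field_value(&parent_valid, &child_valid, &child_values, 1), None);
    assert_eq!(struct_field_value(&parent_valid, &child_valid, &child_values, 0), Some(1));
}
```

Note how the child buffer still contains a value at the null struct's index; consumers simply never look at it once the parent bitmap says null.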

Rust Implementation: In arrow-rs, a struct array is represented by StructArray. It contains an optional NullBuffer for the struct’s validity and a Vec<ArrayRef> for child arrays. The StructArray can be constructed by providing the child arrays and (optionally) a validity mask. For example:

use std::sync::Arc;
use arrow_array::{Array, ArrayRef, Int32Array, StringArray, StructArray};
use arrow_buffer::{BooleanBufferBuilder, NullBuffer}; // for a custom bitmap
use arrow_schema::{DataType, Field, Fields};

let field_a = Int32Array::from(vec![Some(1), Some(2), Some(3)]);
let field_b = StringArray::from(vec![Some("foo"), Some("bar"), Some("baz")]);
// Suppose we want the second struct to be null:
let mut bitmap_builder = BooleanBufferBuilder::new(3);
bitmap_builder.append(true);   // index 0 not null
bitmap_builder.append(false);  // index 1 is null
bitmap_builder.append(true);   // index 2 not null
let struct_validity = NullBuffer::new(bitmap_builder.finish());

let struct_array = StructArray::try_new(
    Fields::from(vec![
        Field::new("a", DataType::Int32, true),
        Field::new("b", DataType::Utf8, true),
    ]),
    vec![Arc::new(field_a) as ArrayRef, Arc::new(field_b) as ArrayRef],
    Some(struct_validity),   // provide the null bitmap
).unwrap();

Here we manually created a validity bitmap where element 1 is null. StructArray::try_new takes the field definitions (Fields), the child arrays (as ArrayRef), and an Option<NullBuffer> for the validity bitmap. After construction, struct_array.is_null(1) is true, and struct_array.column(0) gives the ArrayRef for field a (which can be downcast to Int32Array). Notice we didn’t alter the child arrays’ values at index 1 – field_a has 2 there and field_b has "bar". But because the struct’s bitmap says null, those values are semantically ignored. In practice, one might set them to None when building to avoid confusion.

Each child array in a struct can itself have its own nulls. Arrow doesn’t forbid a non-null struct from containing null field values. For example, you could have a struct {a, b} where the struct isn’t null but field a is null at a position. This would mean the struct is present, but one of its fields is null. That is allowed: you'd mark the bit in the struct's validity as 1 (non-null struct) and the bit in child a's validity as 0 (null field value). This fine-grained nullability can represent cases like SQL nullable columns in a struct.

Performance: Struct arrays are essentially a batch of parallel arrays. Accessing a field of a struct is very efficient: it’s just accessing the corresponding child array. There is no additional offset or indirection; the struct doesn’t own a data buffer beyond child pointers. Checking if a struct element is null requires consulting one bitmap (the struct’s). If struct-level nulls are rare, you mostly pay the cost of per-field null checks.

One thing to note is memory alignment: since each child is a separate array, each child’s buffers are individually padded to 64 bytes. There isn’t a single combined memory region for a struct row; rather, field data are stored in separate buffers. But since operations on one field typically focus only on that field’s array, this separation doesn’t harm performance. In fact, it can help cache locality when you only need one field (you don’t have to skip over other fields’ data). This is a key benefit of Arrow’s design: even though data may be logically nested, physically each field is a contiguous column.

For analytics, struct arrays are useful for grouping related columns (for example, a struct could represent a composite key with multiple parts). The cost to iterate a struct’s values is essentially the cost to iterate all child arrays in parallel. If you need to reconstruct the full struct object for each row, you would check the struct bitmap and then gather each child’s value – this is a bit more overhead than a single array, but not by much.

Best Practices: Use struct when you want to keep multiple columns together logically. It's also how Arrow represents a Map type (which is a list of struct key/value pairs under the hood). Because a struct does not require an offset or assembly, it’s very cheap to add or remove fields from a struct – they are just additional child arrays. For serialization (IPC or Parquet), struct types are handled by writing each child column, so it aligns well with columnar storage.

For startup scenarios, struct can model JSON objects or nested schemas without losing columnar properties. For instance, a JSON with an object can be mapped to a struct with fields for each key. This allows efficient filtering on those fields. Just remember to handle the null semantics correctly: if an entire struct is null, none of its fields should be considered valid even if present. When converting to formats like Parquet, struct fields become separate Parquet columns grouped under a schema group.

Union (Dense and Sparse)

Physical Layout: A Union type in Arrow is a collection where each element can be one of several types (a bit like a variant or sum type). Arrow defines two flavors: Dense Union and Sparse Union (Introduction — Apache Arrow v19.0.1) (Introduction — Apache Arrow v19.0.1). In both, the union array has multiple child arrays (one for each possible type variant) and a buffer of 8-bit type IDs indicating which child type each element belongs to (Introduction — Apache Arrow v19.0.1). The key difference lies in how values are aligned between these children:

  • Dense Union: In a Dense Union, the children arrays each contain only the values for the union elements of that type. The union array has:

    • Types buffer: an 8-bit value for each element indicating the child type index (Introduction — Apache Arrow v19.0.1).
    • Offsets buffer: a 32-bit integer for each element giving the index within the corresponding child array where that element’s value is found (Introduction — Apache Arrow v19.0.1).
    • Child arrays: one array per type, each of variable length (could be shorter than the union length). For example, if you have a union of {Int32, Float64} types and most elements are Int32, the Int32 child array might be length N (close to union length) while the Float64 child is shorter.

    Every union index i is resolved by looking at type_id = types[i], then taking offset = offsets[i], and then looking up child_array[type_id][offset] as the value. So the offsets buffer "densely" maps union slots to positions in each child.

  • Sparse Union: In a Sparse Union, the children arrays are all length equal to the union array length (Introduction — Apache Arrow v19.0.1). Each child array has placeholders (often null or dummy values) in positions where the union’s actual type was different. In this layout:

    • Types buffer: same as dense – type ID per element (Introduction — Apache Arrow v19.0.1).
    • No Offsets buffer: it’s omitted (Introduction — Apache Arrow v19.0.1) because the position in the child equals the position in the union (i.e., index i in union corresponds to index i in the chosen child array).
    • Child arrays: each length = union length. Only one child array has a "real" value at index i (the one whose type matches types[i]); all other child arrays should have a null or default at that index.

    Example: If element 0 is an Int32 and element 1 is a Float64 in a sparse union, child0 (Int32 array) will have a valid value at index0 and maybe null at index1, whereas child1 (Float64 array) will have null at index0 and a valid value at index1, etc.

Unions do not use a validity bitmap of their own (Introduction — Apache Arrow v19.0.1). If an element is null, it must be represented as a null value in the appropriate child. In dense union, that means the child’s own bitmap has a 0 at that offset; in sparse, similarly the child has null at that position. The spec explicitly notes that union nullness is determined by the children (Introduction — Apache Arrow v19.0.1).

Memory Diagram: To illustrate, consider a Dense Union of types {Int32 (type_id=0), Utf8 (type_id=1)} with values: [42, "hello", "world", 99]. Suppose the sequence of types is Int32, Utf8, Utf8, Int32. Then:

  • Types buffer (Buffer 0): bytes [0, 1, 1, 0] indicating type of each slot (Introduction — Apache Arrow v19.0.1).
  • Offsets buffer (Buffer 1): if child0 (Int32) has its elements stored in order and child1 (Utf8) similarly, we map union index to child index:
    • Union[0] is type0 Int32 -> maybe offset 0.
    • Union[1] is type1 Utf8 -> offset 0.
    • Union[2] is type1 Utf8 -> offset 1.
    • Union[3] is type0 Int32 -> offset 1. So Offsets buffer would be [0, 0, 1, 1] as 32-bit ints.
  • Child0 (Int32 array): contains the actual Int32 values in the order they appear – [42, 99].
  • Child1 (Utf8 array): contains ["hello", "world"] with its own offsets and data buffers.

Thus, union index 2 (type1, offset1) means look at child1 index1 -> "world".
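The two-step lookup just described can be sketched in plain Rust (vectors stand in for the Arrow buffers and child arrays; this is illustrative, not the arrow-rs API):

```rust
// Dense union resolution: types[i] picks the child array,
// offsets[i] picks the slot inside that child.
#[derive(Debug, PartialEq)]
enum Value {
    Int(i32),
    Str(&'static str),
}

fn union_get(
    types: &[i8],             // type ids buffer
    offsets: &[i32],          // offsets buffer
    ints: &[i32],             // child 0 (Int32 values, densely packed)
    strings: &[&'static str], // child 1 (Utf8 values, densely packed)
    i: usize,
) -> Value {
    let off = offsets[i] as usize;
    match types[i] {
        0 => Value::Int(ints[off]),
        1 => Value::Str(strings[off]),
        t => panic!("unknown type id {t}"),
    }
}

fn main() {
    let types = [0_i8, 1, 1, 0];
    let offsets = [0_i32, 0, 1, 1];
    let ints = [42, 99];
    let strings = ["hello", "world"];
    // Union index 2 has type id 1 and offset 1 -> child1[1] == "world".
    assert_eq!(union_get(&types, &offsets, &ints, &strings, 2), Value::Str("world"));
    assert_eq!(union_get(&types, &offsets, &ints, &strings, 3), Value::Int(99));
}
```

A sparse union lookup would drop the offsets step and index the chosen child at position i directly.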

For a Sparse Union with the same logical content:

  • Types buffer: [0, 1, 1, 0] (same).
  • No offsets buffer (since union length = 4, children length = 4).
  • Child0 Int32 array length 4: values might be [42, null, null, 99] (with a validity bitmap marking index1 and 2 as null).
  • Child1 Utf8 array length 4: values ["", "hello", "world", ""] perhaps, with index0 and 3 null (or empty) and index1,2 valid.

The sparse layout uses more space (two child entries per union element, in worst case one real and one null for each element), but avoids the offsets indirection.

[Image: Illustration of Dense vs Sparse Union layout. Above: Dense Union with a type ids buffer and an offsets buffer mapping to positions in each child array (each child array's length equals the count of union elements of that type). Below: Sparse Union with no offsets – child arrays are full length, aligning with union indices (unused slots marked as null).]

Caption: Dense union stores an offset for each value to find it in the respective child array (here, TypeIDs 0 and 1 with Offsets 0,1,2...), whereas Sparse union does not use offsets, and each child array (Field-0 and Field-1) has the same length as the union with nulls filling non-used positions (Introduction — Apache Arrow v19.0.1) (Introduction — Apache Arrow v19.0.1).

Rust Implementation: The Rust UnionArray in arrow-rs can represent both dense and sparse unions. It holds an ArrayData with DataType Union(fields, mode) where mode is dense or sparse and an optional offsets buffer. In arrow-rs 17, constructing a UnionArray is a bit lower-level (there isn’t a convenient builder for union types as of that version). One can use ArrayData::try_new to create it. For example:

use arrow_array::{Array, Int32Array, StringArray, UnionArray};
use arrow_buffer::Buffer;
use arrow_data::ArrayData;
use arrow_schema::{DataType, Field, UnionFields, UnionMode};

// Prepare child arrays
let ints = Int32Array::from(vec![42, 99]);
let strings = StringArray::from(vec!["hello", "world"]);
// Type IDs buffer (one i8 per union slot)
let type_ids = Buffer::from_slice_ref(&[0_i8, 1, 1, 0]);
// Offsets buffer for dense union (position within the matching child)
let offsets = Buffer::from_slice_ref(&[0_i32, 0, 1, 1]);

let union_type = DataType::Union(
    UnionFields::new(
        vec![0, 1], // type ids, in child order
        vec![
            Field::new("ints", DataType::Int32, true),
            Field::new("strings", DataType::Utf8, true),
        ],
    ),
    UnionMode::Dense,
);

let union_data = ArrayData::try_new(
    union_type,
    4,                       // length
    None,                    // no validity buffer (union nullness lives in the children)
    0,                       // offset into this array data
    vec![type_ids, offsets], // buffers: type ids, then offsets
    vec![ints.to_data(), strings.to_data()], // child data
).unwrap();
let union_array = UnionArray::from(union_data);

In this snippet, we create a dense union array. The type_ids buffer [0,1,1,0] and offsets [0,0,1,1] correspond to the example above. We pass child ArrayData for each variant. The Union data type in Rust takes a UnionFields, which pairs each child Field with its 8-bit type id (here 0 -> ints, 1 -> strings), plus a UnionMode. We specify UnionMode::Dense. For a sparse union, we would choose UnionMode::Sparse and omit the offsets buffer (passing only the type_ids buffer, and child arrays of full length). Indeed, UnionArray::from will panic if offsets are present for a sparse union or missing for a dense union, so they must match (UnionArray in arrow::array - Rust - Docs.rs).

Performance:

  • Dense Union: Access requires looking up two buffers (type and offset) and then indexing into one of the child arrays. This is an O(1) operation, but the constant factors are a bit higher than a straightforward array access. If you iterate through a dense union array, you might incur branch mispredictions because each element could lead down a different code path depending on its type. However, the data for each type is densely packed in its own child buffer, which is good for operations that can be done per type in bulk.
  • Sparse Union: Access is slightly simpler (just type buffer then index child at same index), but the memory footprint is larger, as each child is allocated to full length. This can be wasteful if one type dominates. On iteration, you still have a branch on type id, but children can be accessed directly by index. Sparse union may be faster for random element access due to skipping the offset lookup, but slower overall if memory bandwidth is an issue (since you're dragging around potentially a lot of nulls in memory).

Dense vs Sparse Trade-offs: Dense unions use less memory when the distribution of types is skewed (e.g., if only a small fraction of elements are of type1, the type1 child array will be small). They do, however, incur the cost of the offsets buffer and a double indirection. Sparse unions are conceptually simpler and might be quicker to construct (no offsets to compute; you just align all child arrays by index, padding with nulls) (Introduction — Apache Arrow v19.0.1). But they allocate space for every element in every child array, which can be huge if you have many types but each element uses only one. In sparse union, the child arrays’ null count could be high (lots of unused slots).

For operations that process the entire array, one common strategy is to separate the union into batches per type. For example, you might filter or transform the Int32 values in one go and the Utf8 values in one go. Dense union makes that easy: the child arrays already isolate those values. Sparse union would require scanning through each child with its validity to pick out actual entries (unless you already know which indices belong to which type from the type buffer). In other words, dense union packs same-type values together physically, whereas sparse union interleaves them padded with nulls.

When to use Union: Unions are less common in typical analytics pipelines, but they appear when dealing with JSON, mixed-type columns, or cases like storing different subtypes in one column. For instance, if an array can hold either an integer or a string per slot, a union is appropriate. If one cares about memory, dense union is usually preferred (since Arrow IPC messages also expect dense unions typically). Sparse unions might be used in streaming scenarios to avoid computing offsets at the cost of bandwidth.

From a startup perspective, if you need to handle dynamic or polymorphic data (like a JSON with values that could be int or string), Arrow’s union can represent it without serializing to a generic type like string. But this complexity should be weighed: operations on union types require handling each possible type. If possible, converting data to a common type (or using Option types in struct) might be simpler. However, when preserving type information is necessary, Arrow Unions are the way to go. They allow you to store different types in one column while still keeping zero-copy compatibility with Arrow-based processing (e.g., no need to convert everything to string or struct).

Dictionary Encoded Arrays

Physical Layout: Dictionary encoding in Arrow allows one to represent repeated values more efficiently by splitting the data into a dictionary of unique values and an array of indices that reference this dictionary (Introduction — Apache Arrow v19.0.1). This is analogous to a lookup table or categorical encoding. A Dictionary Array has:

  • An indices buffer (value buffer of an integer type, often Int8, Int16, or Int32 depending on dictionary size) of the same length as the array. Each index is an integer that points to a value in the dictionary.
  • A dictionary values array which holds each unique value exactly once. This dictionary is not considered a child array in the same way (in IPC format it’s part of the schema), but in memory one can think of it as a separate Arrow array or vector of values.
  • A validity bitmap (optional) for the indices (if the array itself has nulls distinct from dictionary values). Often, if nulls appear, Arrow can use a special index (like -1 or a reserved value) to indicate null, but in practice the simplest representation is to mark nulls via the bitmap.

The Arrow spec and implementations treat the dictionary as metadata: multiple arrays can reference the same dictionary by ID. But logically, you can imagine each dictionary-encoded array carries its dictionary. The dictionary values array can be of any type (primitive, binary, etc.), but it cannot itself have nulls in the current Arrow spec (nulls in the data are indicated by the indices or bitmap, not by having a null entry in the dictionary).

Memory Diagram: Suppose we have a DictionaryArray<Int8, Utf8> (8-bit indices, string values) representing ["red", "blue", "red", "green", "blue"]. The unique values are ["blue", "green", "red"] for example. We can assign dictionary indices: "blue"->0, "green"->1, "red"->2. The dictionary array would store:

  • Buffer 0: Validity bitmap for 5 entries (if no nulls here, we might omit this).
  • Buffer 1: Indices buffer (Int8 values): e.g. [2, 0, 2, 1, 0].
  • Dictionary (not a buffer of the array but an associated array): an Utf8Array of length 3 with values ["blue","green","red"].

In memory, the indices buffer is just like a primitive array of type Int8 (with values 2,0,2,1,0). The dictionary values are somewhere else in memory (possibly a separate Arrow Array object). The array length is 5, null count maybe 0 here. If one of the entries were null (say second element is null instead of "blue"), the indices buffer might contain something (some implementations use -1 or 0 for null index) but the official Arrow approach is to use the validity bitmap: mark index1 as 0 (null) in the bitmap, and the actual index value at position 1 is undefined or irrelevant. The dictionary remains the same.

Arrow’s metadata would record the index type (Int8) and the dictionary value type (Utf8), linking them.

Rust Implementation: In Rust, dictionary arrays are represented by DictionaryArray<K> where K is the index type (must implement ArrowDictionaryKeyType, typically a signed integer type). For example, DictionaryArray<Int32Type> with values of type StringArray. You can construct dictionary arrays in a high-level way using an iterator of values (which will behind the scenes build a dictionary of unique values and an index array), or by providing the indices and separately a dictionary array.

A convenient method is DictionaryArray::from_iter which takes an iterator of Rust values (or Options) and returns a dictionary-encoded Arrow array automatically (TypedDictionaryArray in arrow::array - Rust). Example:

use arrow_array::DictionaryArray;
use arrow_array::types::Int8Type;

let data = vec!["red", "blue", "red", "green", "blue"];
let dict_array: DictionaryArray<Int8Type> = data.into_iter().collect();

This will produce a DictionaryArray with Int8 keys and the dictionary values inferred from the data (likely in order of first occurrence: ["red","blue","green"] or similar). The indices might be [0,1,0,2,1] if "red" got index0, "blue" index1, "green" index2. The exact ordering of dictionary entries can vary (commonly it's order of appearance or sorted if specified).
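The encoding step itself – building the dictionary in order of first occurrence while emitting indices – is easy to sketch in plain Rust. This mirrors what collecting into a DictionaryArray does conceptually, not its actual implementation:

```rust
use std::collections::HashMap;

// Dictionary-encode values in order of first occurrence:
// returns (dictionary of unique values, one index per input element).
fn dictionary_encode<'a>(data: &[&'a str]) -> (Vec<&'a str>, Vec<i8>) {
    let mut dict: Vec<&str> = Vec::new();
    let mut seen: HashMap<&str, i8> = HashMap::new();
    let mut indices = Vec::with_capacity(data.len());
    for &v in data {
        // First sighting appends to the dictionary; repeats reuse the index.
        let idx = *seen.entry(v).or_insert_with(|| {
            dict.push(v);
            (dict.len() - 1) as i8
        });
        indices.push(idx);
    }
    (dict, indices)
}

fn main() {
    let (dict, indices) = dictionary_encode(&["red", "blue", "red", "green", "blue"]);
    assert_eq!(dict, vec!["red", "blue", "green"]);
    assert_eq!(indices, vec![0, 1, 0, 2, 1]);
}
```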

Internally, DictionaryArray stores an ArrayData for the indices and an Arc<dyn Array> for the dictionary values. You can obtain the dictionary values by dict_array.values() which gives an ArrayRef (e.g., can downcast to StringArray) (TypedDictionaryArray in arrow::array - Rust). The indices can be fetched with dict_array.keys() which returns a PrimitiveArray<K> of the indices (TypedDictionaryArray in arrow::array - Rust).

You can also manually construct it:

use std::sync::Arc;
use arrow_array::{types::Int8Type, DictionaryArray, Int8Array, StringArray};

let indices = Int8Array::from(vec![2_i8, 0, 2, 1, 0]);
let dict = StringArray::from(vec!["blue", "green", "red"]);
let dict_array = DictionaryArray::<Int8Type>::try_new(indices, Arc::new(dict)).unwrap();

This creates the same example as above. We use try_new to ensure the index values are within bounds of the dictionary and that the index type matches the provided generic (Int8Type here). The dictionary array length is the length of the indices (5). The null count would come from the indices’ null bitmap if any, or you could supply a separate bitmap.

Performance: Dictionary encoding shines when there are many repeated values. The memory footprint of the indices can be much smaller than storing full values, especially for strings or large primitives. In the above example, instead of storing 5 strings (19 bytes of character data, plus a 32-bit offset per element), we store 5 single-byte indices plus the 3 unique strings. If you have highly repetitive data (e.g., categorical data with few categories), this can drastically reduce memory and improve cache usage (Introduction — Apache Arrow v19.0.1).

Accessing an element’s value involves an extra step: first fetch the index from the indices buffer (fast, as it’s just an integer in a buffer), then use that to index into the dictionary values array. This is an O(1) operation, but effectively two array lookups instead of one. However, if you are scanning through the data performing operations that don’t require materializing the actual value (for instance, grouping by category, counting frequencies), you can often just work with the indices which is very fast (operating on small integers). The dictionary values might only need to be accessed at the end or when presenting results.
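For example, counting category frequencies only needs the indices; the dictionary is consulted once at the end. A plain-Rust sketch, with vectors standing in for the keys buffer and dictionary array:

```rust
// Count occurrences per dictionary index without touching the string values;
// the dictionary is only resolved when reporting results.
fn category_counts(indices: &[i8], dict_len: usize) -> Vec<usize> {
    let mut counts = vec![0usize; dict_len];
    for &i in indices {
        counts[i as usize] += 1; // small-integer work only
    }
    counts
}

fn main() {
    let indices: [i8; 5] = [2, 0, 2, 1, 0];
    let dict = ["blue", "green", "red"];
    let counts = category_counts(&indices, dict.len());
    // Materialize index -> value only for the final report.
    let report: Vec<(&str, usize)> = dict.iter().copied().zip(counts).collect();
    assert_eq!(report, vec![("blue", 2), ("green", 1), ("red", 2)]);
}
```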

One must remember that a dictionary array’s logical nulls can be represented either by a null index or a special index. Arrow’s current approach is to use the validity bitmap for nulls (the indices buffer and dictionary contain no representation for "null" as a value) (NullBuffer in arrow::buffer - Rust) (NullBuffer in arrow::buffer - Rust). Therefore, processing nulls in dictionary arrays is the same as any array: check the bitmap. The dictionary values array itself typically has no nulls (ensuring each index corresponds to a valid value).

When to use: Dictionary encoding is ideal for low-cardinality columns – i.e., columns where the number of distinct values is much smaller than the number of total values. Examples: country codes, status strings ("OPEN", "CLOSED", ...), categorical features in ML, etc. It's also commonly used in Arrow IPC and Parquet under the hood to compress data. In a startup context, using dictionary encoding can significantly reduce memory pressure for such categorical data. The Arrow Rust implementation makes it easy to convert a StringArray to a DictionaryArray if needed (the cast kernel in the compute module can cast to a dictionary type).

One trade-off is that if the dictionary itself is large (say tens of thousands of unique values), random access may not be as cache-friendly since the dictionary values are separate. But often the dictionary is small enough to fit in cache, and reuses of the same value mean better locality. Also, dictionary arrays add complexity if you need to mutate the data (as you’d have to update both dictionary and indices). Arrow arrays are immutable though, so that is usually handled by rebuilding the array.

Finally, dictionary encoding is great for serialization: Arrow IPC can send the dictionary once and then send batches of indices, avoiding sending repetitive data multiple times. Parquet similarly will dictionary-encode data chunks to save file size. So, if your pipeline is Arrow -> Parquet or Arrow -> Arrow Flight, using dictionary arrays can yield the same benefits when writing out.

Run-End Encoded Arrays (Run-Length Encoding)

Physical Layout: Run-End encoding is a compression technique for data with consecutive repeated values (runs). Arrow’s Run-End Encoded (REE) array is a nested type that consists of two child arrays (Introduction — Apache Arrow v19.0.1):

  • Run ends array: a monotonic integer array marking the end index of each run (in the logical expanded data).
  • Values array: an array of the values for each run.

Unlike dictionary encoding which compresses by unique values globally, run-end encoding compresses by consecutive repetition. For example, a logical array [A, A, B, B, B, C] has runs (A length2, B length3, C length1). The run-end encoded form would have run_ends = [2, 5, 6] (assuming 0-indexing, run 1 ends at index2 exclusive, run 2 ends at index5, run 3 ends at index6 which is the array length) and values = [A, B, C] (RunArray in arrow_array::array - Rust) (RunArray in arrow_array::array - Rust). The length of the REE array (the parent) is defined as the last run end (in this example, 6). Each position in 0..5 logically maps to one of the runs.

Important: Arrow’s run-end encoded array has no buffers of its own (no validity or data buffer at the parent level) (Introduction — Apache Arrow v19.0.1). Nulls in the data are represented as runs of a null value in the values array. So if the data has nulls, the values array’s type will be the same as the logical type (say Int32) but with null values for runs that are null. Essentially, the values array carries any nulls as well; the REE parent doesn’t use a separate validity bitmap (Introduction — Apache Arrow v19.0.1).

The Run ends array can be Int16, Int32, or Int64 depending on the maximum length (Arrow chooses the smallest integer type that can represent the length). It has the same number of elements as the values array (each run has a corresponding end index) (DataType in arrow - Docs.rs).

Memory Diagram: Using the above [A, A, B, B, B, C] example:

  • Run ends (child0): [2, 5, 6] – indicating run1 ends at index 2, run2 at 5, run3 at 6 (the length). Typically stored as Int32 here (assuming length is small).
  • Values (child1): [A, B, C] – each corresponding to runs in order. If B had been null for that entire run in the original, then the value array entry for B would be null (with its own validity bitmap marking it).

If we index into logical position 4 (0-based), we determine which run covers that index by finding the first run_end strictly greater than 4: that is run_end 5, i.e. run 2, so the value is at values index 1 (B). Index 0 or 1 finds run_end 2 (covering indices 0 and 1) -> value index 0 (A). Index 5 finds run_end 6 -> value index 2 (C). So random access involves a binary search or linear scan in the run_ends array to locate the run containing the index (RunArray in arrow_array::array - Rust) (since run_ends is sorted, binary search can be used).
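That lookup maps directly onto slice::partition_point in Rust; a sketch over plain vectors (not the arrow-rs API):

```rust
// Find the run covering a logical index: the first run_end strictly greater
// than the index. partition_point returns the number of leading elements
// for which the predicate holds, which is exactly that run's position.
fn run_for_index(run_ends: &[i32], logical_index: i32) -> usize {
    run_ends.partition_point(|&end| end <= logical_index)
}

fn main() {
    let run_ends = [2, 5, 6]; // runs: A x2, B x3, C x1
    assert_eq!(run_for_index(&run_ends, 0), 0); // A
    assert_eq!(run_for_index(&run_ends, 1), 0); // A
    assert_eq!(run_for_index(&run_ends, 4), 1); // B
    assert_eq!(run_for_index(&run_ends, 5), 2); // C
}
```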

Rust Implementation: In arrow-rs, run-end encoded arrays are represented by the generic struct RunArray<R> where R is the RunEndIndexType (e.g., Int16Type, Int32Type) (RunArray in arrow_array::array - Rust). The RunArray struct has run_ends: RunEndBuffer<R> and values: ArrayRef as fields (RunArray in arrow_array::array - Rust). The DataType for such an array is DataType::RunEndEncoded(run_end_type, Box<Field>) where the field describes the value type of the runs.

Creating a RunArray can be done via RunArray::try_new(run_ends, values) provided the lengths match (run_ends and values must have equal length) and the last run_end equals the logical length. arrow-rs also provides run-array builders (e.g. PrimitiveRunBuilder and StringRunBuilder) that accept the expanded values and compress consecutive repeats for you; here we produce the run_ends and values by hand.

Example:

use arrow_array::{Array, Int32Array, RunArray};
use arrow_array::types::Int32Type;

// Logical array: [7,7,7,  0,0,  null,null,null,null]
// Runs: 7 x3, 0 x2, null x4 (length 9)
let run_ends = Int32Array::from(vec![3, 5, 9]);
let values = Int32Array::from(vec![Some(7), Some(0), None]);
let run_array = RunArray::<Int32Type>::try_new(&run_ends, &values).unwrap();
assert_eq!(run_array.len(), 9); // logical length is the last run end

Here we manually encoded an array of 9 Int32s with three runs (three 7’s, two 0’s, four nulls). The run_array behaves logically like an Int32 array of length 9 with those values. We can map a logical index back to its run with run_array.get_physical_index (e.g., logical index 0 resolves to physical index 0, the run holding 7). Under the hood, run_array.run_ends() returns the RunEndBuffer (holding [3,5,9]) and run_array.values() returns the values array [7,0,null].

Because run-end encoding uses the values array to carry nulls, run_array.null_count() will count how many logical nulls by looking at the values array entries (and multiplying by their run lengths). In our example, the values array has one null (at position 2), but that null corresponds to 4 logical nulls (from index5 to 8). The RunArray likely computes logical null count accordingly (arrow-rs provides logical_nulls() to get a Bitmap of logical length) (NullArray in arrow::array::array - Rust) (NullArray in arrow::array::array - Rust).
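Deriving the logical null count from the two child arrays amounts to summing the lengths of runs whose value entry is null; a plain-Rust sketch of that computation (not the arrow-rs internals):

```rust
// Logical null count of a run-end encoded array: sum the lengths of runs
// whose value is null. Run length = run_end - previous run_end.
fn logical_null_count(run_ends: &[i32], values: &[Option<i32>]) -> i32 {
    let mut nulls = 0;
    let mut prev_end = 0;
    for (&end, value) in run_ends.iter().zip(values) {
        if value.is_none() {
            nulls += end - prev_end;
        }
        prev_end = end;
    }
    nulls
}

fn main() {
    // Logical array: [7,7,7, 0,0, null,null,null,null]
    let run_ends = [3, 5, 9];
    let values = [Some(7), Some(0), None];
    // One physical null in values, but four logical nulls (indices 5..=8).
    assert_eq!(logical_null_count(&run_ends, &values), 4);
}
```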

Performance: Run-end encoding is very efficient in memory for data with long runs. In the extreme case of an array of constant value, it compresses N elements into 1 value + 1 run end. It’s also efficient for alternating patterns if there are still long stretches.

However, random access is slower: to get the element at position i, one must find which run covers that index, which is typically a binary search in the run_ends array of length R (number of runs). That’s O(log R). If R is much smaller than N (as it is in well-compressed data), this is still fast. But in the worst case (no compression, each run length 1), R ~ N, so each random access costs O(log N) and the encoding occupies more memory than the plain array – strictly worse than an uncompressed array. That worst case is rare in practice because if every value differs, you probably wouldn’t use run-length encoding at all.

Iterating through a run-end encoded array sequentially can be done by iterating runs: you get a value and a run length and produce that value repeatedly. This can be very fast if the consumer can handle runs at a time (e.g., if summing, you add value * run_length for runs of numeric data, etc.). If a consumer needs each element individually, then there’s overhead in expanding it.
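A run-at-a-time aggregation can be sketched in plain Rust (a model, not an arrow-rs compute kernel): one multiply per run instead of one add per element.

```rust
// Sketch: summing a run-end encoded numeric array run-at-a-time,
// adding value * run_length instead of expanding every element.
fn sum_runs(run_ends: &[i64], values: &[Option<i64>]) -> i64 {
    let mut sum = 0;
    let mut prev_end = 0;
    for (&end, value) in run_ends.iter().zip(values) {
        if let Some(v) = value {
            sum += v * (end - prev_end); // one multiply per run
        }
        prev_end = end;
    }
    sum
}

fn main() {
    // 9 logical elements: three 7s, two 0s, four nulls -> sum is 21.
    assert_eq!(sum_runs(&[3, 5, 9], &[Some(7), Some(0), None]), 21);
}
```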

One typical use of RLE (run-length encoding) is to compress null patterns. Arrow could compress a long sequence of nulls effectively. Indeed, NullArray of length N could be represented as run_ends [N] and values [null] (one run). But currently NullArray is separate; RLE is more useful for cases like sensor readings that stay constant for a while or large sorted groupings.

Dictionary vs Run-End: Both are compression techniques, but for different patterns:

  • Dictionary is best for high global repetition (even if not consecutive). It allows random access with one indirection and works even if identical values are scattered.
  • Run-end is best for long consecutive repeats. It compresses sequences but if the same value appears in separate places non-consecutively, run-end will list separate runs, whereas dictionary would capture it once in dictionary but still repeat the index for each occurrence.

They can even be combined conceptually, but Arrow doesn’t directly support a dictionary of runs or vice versa in a single array (though you could have a RunEndEncoded of dictionary indices or something if needed).
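The contrast can be made concrete with a small plain-Rust sketch (an illustrative model, not Arrow code) that counts the entries each scheme would have to store for two repetition patterns:

```rust
// Sketch: rough entry counts for dictionary vs run-end encoding of the
// same data, illustrating which repetition pattern each scheme rewards.
fn dict_size(data: &[i32]) -> (usize, usize) {
    // (unique values stored once in the dictionary, one index per element)
    let mut uniq: Vec<i32> = data.to_vec();
    uniq.sort_unstable();
    uniq.dedup();
    (uniq.len(), data.len())
}

fn run_size(data: &[i32]) -> usize {
    // one (run_end, value) pair per maximal run
    let mut runs = 0;
    for i in 0..data.len() {
        if i == 0 || data[i] != data[i - 1] {
            runs += 1;
        }
    }
    runs
}

fn main() {
    let consecutive = [1, 1, 1, 1, 2, 2, 2, 2];
    let scattered = [1, 2, 1, 2, 1, 2, 1, 2];
    assert_eq!(run_size(&consecutive), 2); // run-end compresses well here
    assert_eq!(run_size(&scattered), 8);   // no compression at all
    assert_eq!(dict_size(&scattered).0, 2); // dictionary still stores 2 uniques
}
```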

From a startup viewpoint, if you have time-series or sorted data with repeated spans (like many identical values in a row, including long stretches of null), Run-End encoding can save memory big time. But it might not be worth using unless those patterns are significant, given the added complexity of handling an extra array for run ends and binary searching. Many Arrow compute kernels may not yet fully support run-end encoded arrays seamlessly, so one might need to convert to flat arrays for certain operations. As Arrow evolves, run-end encoding support will improve for operations like scanning or filtering (which can operate on runs).

The Null Type and Null Arrays

Null Type: Arrow’s Null type is a degenerate type that has no physical storage for values – it represents a column of all nulls (NullArray in arrow::array::array - Rust). A Null array of length N logically has N null values. According to the Arrow spec, a Null array may omit the validity bitmap entirely because it’s redundant – we know all values are null, so a consumer can treat any index as null by definition. In practice, implementations often store just a length and a null count, with no buffers.

In Arrow Rust, a NullArray is implemented with just a length field (NullArray in arrow::array::array - Rust). The NullArray::new(N) constructor creates an array of the given length. If you check null_array.null_count(), interestingly, arrow-rs returns 0 for physical null count (since it doesn’t allocate a bitmap) but logical_null_count() returns N (NullArray in arrow::array::array - Rust) (NullArray in arrow::array::array - Rust). This is an implementation detail: they treat it as a special case where the absence of a bitmap means “all null”. When exported or used in computation, it is understood that every index is null.

Buffers: A Null type array technically doesn’t have any data or validity buffers. Some languages may allocate a validity bitmap of all 0 bits for convenience, but Arrow allows it to be omitted to save memory (NullBuffer in arrow::buffer - Rust). For example, in C++ Arrow, MakeArrayOfNull will allocate no buffers and just set null_count = length.

Behavior in Nested Structures: When a Null array is a child of a struct or list, its effect is that wherever that value is used, it contributes nulls. If you had a List of length m with some lists lengths, the offsets would be there but the child “values” array would be Null type (which conceptually means every element in every sublist is null). But more commonly, Null appears as a standalone column type.

When reading or writing a Null type field in Arrow IPC or Parquet, it’s represented by just the field metadata and a length & null count. For example, Parquet has a concept of all-null column chunk which is handled similarly.

Use cases: Null type can be handy to represent columns that are missing or to allocate a placeholder column. It uses minimal memory. For instance, if you want to add a column to a schema that currently has no values, you can create a NullArray of the appropriate length.

Null Buffers in general: Aside from the Null type, we should clarify the rules around null buffers (validity bitmaps) for any array:

  • If null_count == 0 (no nulls), the validity buffer can be None (absent). Consumers should treat the array as all valid (NullBuffer in arrow::buffer - Rust). This saves memory and overhead.
  • If null_count == length (all nulls), typically a validity buffer is still provided (all 0 bits) except in the case of a NullType array. But it's not strictly required if the consumer knows from context (for normal arrays, null_count == length doesn’t automatically mean we can drop the bitmap, because we wouldn’t know a priori it’s all null unless we check each value or the metadata).
  • For nested types, nulls in parent vs child:
    • If a parent (like struct or list) is null at index i, the values in children at index i should be considered null too (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316). Arrow relies on the parent’s null bitmap to override children. So children might even have a "non-null" value at that position, but it’s ignored. The recommended practice is often to also set the child’s value to null for clarity, but not required by the spec (the parent null is enough).
    • If a parent is non-null but a child value is null, that is allowed: for a struct, the struct exists but one of its fields is null. For a list, nulls in the child array simply represent null elements inside a sublist (e.g., a list of ints that can contain null ints); this is distinct from the sublist itself being null, which is recorded in the list’s own validity bitmap rather than the child’s.
    • When serializing nested structures, Arrow will propagate nulls accordingly. E.g., if a list contains null values in its child, that’s fine; if a list element is null (the sublist itself is null), the offsets for that element usually repeat (offset[i] == offset[i+1]) and no actual values are stored for it (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316) (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316).
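The parent-overrides-child rule above can be sketched in plain Rust (a simplified model, not an arrow-rs API): resolving a struct field’s logical value checks the parent’s validity before consulting the child.

```rust
// Sketch: resolving the logical value of a struct field, where a null
// parent overrides whatever the child array holds at that slot.
fn logical_field_value(
    parent_valid: &[bool],
    child: &[Option<i32>],
    i: usize,
) -> Option<i32> {
    if !parent_valid[i] {
        None // parent null: child data at i is ignored entirely
    } else {
        child[i] // parent valid: the child's own null (or value) applies
    }
}

fn main() {
    let parent_valid = [true, false, true];
    let child = [Some(10), Some(99), None]; // 99 is dead data under a null parent
    assert_eq!(logical_field_value(&parent_valid, &child, 0), Some(10));
    assert_eq!(logical_field_value(&parent_valid, &child, 1), None);
    assert_eq!(logical_field_value(&parent_valid, &child, 2), None);
}
```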

Rust specifics: In arrow-rs, the NullArray type’s data() will have no buffers. If you convert a NullArray to ArrayData, you’ll see buffers.len() == 0 and null_bitmap possibly None. The length is stored and null_count equals 0 (physical) but logical_null_count = length via an override. They provide a NullBuffer structure to represent validity bitmaps in general. If you had, say, an Int32 array with no nulls, arrow-rs might simply set the nulls: Option<NullBuffer> to None. If you had some nulls, NullBuffer holds the actual bitmap Buffer. Arrow-rs’s NullBuffer::new_valid(len) can create an optimized representation of all-valid without allocating a full bitmap (NullBuffer in arrow::buffer - Rust).

Practical considerations: The Null type is mostly used as a placeholder. In pandas or R, this might correspond to a column of NA of a certain length without any actual data. It’s also used in some file schemas to represent columns that were dropped or not present. Because it carries no data, operations on a Null array are trivial (e.g., sum is null, any lookup is null). Some compute kernels bypass processing Null arrays aside from outputting a Null result of appropriate length.

From a memory standpoint, a Null array is extremely cheap (just an integer for length, essentially). So if you have columns that are completely missing, using Null type is better than using, say, a Binary of empty or something, which would allocate buffers.

Interoperability: Arrow’s C data interface marks Null type with a specific ArrowType and no buffers. If you send a Null array to another system, they will create their own representation (like PyArrow makes an NullArray of that length). Conversion to JSON typically would output nulls for each element.

Summary of Null Handling:

  • Validity buffers can be omitted for all-valid. This improves startup performance (less memory allocation) and Arrow handles this seamlessly (Rust uses null_count: 0 and a None buffer to indicate all valid).
  • All-null arrays use either Null type or a normal array with all bits 0 in validity. Arrow’s spec explicitly allows dropping the buffer if all values are valid, but not if all are null (in that case you still need some indicator of length and nulls; Null type solves that by type).
  • Nested nulls: If an outer structure is null, inner data is not accessed. So, one should always check the outermost validity before drilling down. This cascading null behavior means you can have an inner child array with fewer nulls than the outer parent effectively has (because outer null covers some).

For a startup developer, understanding Arrow’s null handling is important when mapping to other systems. For example, converting Arrow to JSON, you need to output actual null values for any position where either the value is null or the parent struct is null. Arrow conveniently centralizes null representation in the validity bitmaps, which is memory-efficient (1 bit per null). This contrasts with having a special marker in the data buffer which would cost more space or complexity.


Performance and Best Practices Summary

Having detailed each layout, we can summarize some trade-offs and usage tips:

  • Primitive vs Binary: Use primitive types for fixed-size data (numbers, timestamps). They yield the best performance for computation (vectorizable, aligned) (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316). Use binary/string for variable data; be mindful that string processing is inherently slower due to memory accesses, but Arrow’s contiguous storage and offset scheme is optimized for throughput on large data scans (Internal structure of Arrow objects • Arrow R Package). If your strings are short and heavily used in random access or comparisons, consider Utf8View to speed up those operations at the cost of extra memory overhead (Introduction — Apache Arrow v19.0.1).

  • List vs FixedSizeList vs ListView: If each list has the same length, always prefer FixedSizeList – it saves one buffer and simplifies indexing (Introduction — Apache Arrow v19.0.1). Use List for genuinely variable-length sequences; it’s memory-efficient and integrates with Parquet/IPC smoothly. ListView is an advanced feature for specialized scenarios (like zero-copy filtering), and not fully supported as of Arrow 17 (DataType in arrow::datatypes - Rust), so unless you have a clear performance need and the support in your processing pipeline, you might avoid it for now. From a performance angle, ListView could help avoid rewriting offset buffers on slicing, but typical operations on lists (like flattening or aggregating) are well-served by the standard layout.

  • Binary vs BinaryView: Similarly, use Binary/Utf8 for most cases – they keep memory footprint lower by not allocating a fixed slot per value (DataType in arrow::datatypes - Rust). If profiling shows a lot of time spent on string comparisons or random lookups, and memory isn’t a bottleneck, then BinaryView/Utf8View might be worth exploring. They improve locality (each element’s metadata is in one place) and can accelerate prefix comparisons (Introduction — Apache Arrow v19.0.1). But remember the view types in Arrow 17 are experimental in Rust, so the safer route is to use dictionary encoding for strings if repeating, or just accept the offset indirection. Locality vs Indirection is the key trade-off: BinaryView provides locality at the element level (at cost of a pointer to actual data), whereas Binary provides locality at the column level (all data contiguous, but needs an offset for each element).

  • Dense vs Sparse Union: Dense Union is usually preferable because it doesn’t allocate full-length arrays for each type, which is important if only a fraction of elements are of a certain type (Introduction — Apache Arrow v19.0.1) (Introduction — Apache Arrow v19.0.1). The overhead is 4 bytes per element for the offset buffer, which is negligible if your types are larger or if memory saving from skipping nulls is significant. Sparse Union might be useful if you have a relatively balanced usage of types and you want simpler logic (no offsets). It can also be slightly faster to access because you index directly. But memory usage can blow up if one type is rare (you’d still allocate space for it at every index). In practice, sparse unions are seldom used unless memory is abundant and you prioritize absolute access speed or simplicity. Additionally, many Arrow consumers assume dense union (since it was the first implemented), so dense union ensures better interoperability.

  • Dictionary Encoding: This is a powerful tool for reducing memory. Use it for columns with repetition (categorical data) – e.g., converting a StringArray to a DictionaryArray can drastically cut memory if there are only, say, 100 unique strings out of millions of rows (Introduction — Apache Arrow v19.0.1). Dictionary encoding not only saves memory but can speed up comparisons and grouping (comparing small integers is faster than comparing long strings). It’s also startup-friendly because it reduces memory pressure and IPC payload sizes – sending indices + dictionary once is cheaper than sending full strings for each record. The overhead is the indirection and the need to maintain the dictionary. In Arrow, dictionary arrays can be unified (e.g., if two arrays have separate dictionaries, you may need to unify them to compare globally). But within a single array, that’s not an issue. Use the smallest index type that fits the dictionary (Arrow will often default to Int32, but if you know the dictionary is <256, Int8 can be used to save space). Arrow’s Rust automatically picks the index type based on the type parameter you choose (Int8Type, etc.).

  • Run-End Encoding: Use run-end encoding if your data has significant runs of identical values, especially if those runs are long. A classic example is time-series data where a reading stays constant for hours, or a sorted column where identical values are grouped. It shines for compressing long null runs too – e.g., a column that is mostly null except a few values. The memory savings can be enormous. However, run-end encoded arrays are relatively new and not all algorithms might handle them natively. Often you might compress for storage or transmission, then expand for processing. If your use-case is scanning or summarizing data (rather than random access), you can operate on runs effectively (e.g., if a value stays X across a run of 1000 elements, add 1000 to its count in one step). Keep in mind random access and element-wise operations will need to search the runs, adding overhead (RunArray in arrow_array::array - Rust). Also, run-length encoding only helps if there is locality of repetition – if identical values are scattered, dictionary might be better; if identical values come in large contiguous chunks, run-length is superior.

  • Nulls and Padding: Arrow’s design ensures that the presence of nulls does not significantly degrade performance of non-null values. The validity bitmap is separate and bit-packed, so processing non-null values can often ignore the bitmap until needed, and it uses very little memory (1 bit per element) (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316). Best practice is to avoid materializing nulls as special marker values in the data buffer (like sentinel values); always rely on the bitmap. This keeps the data buffer clean for SIMD. Also, if you have a choice, try to mark columns that have no nulls as non-nullable in the schema so that some pipelines might skip null handling altogether.
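The dictionary-encoding point above is straightforward to picture with a plain-Rust sketch (a model of what a DictionaryArray stores, not the arrow-rs builder API): a column of strings becomes a small dictionary of uniques plus one integer index per row.

```rust
use std::collections::HashMap;

// Sketch: dictionary-encoding a string column into (indices, dictionary),
// the same shape Arrow's dictionary layout stores (plain Rust model).
fn dictionary_encode(values: &[&str]) -> (Vec<u32>, Vec<String>) {
    let mut dict: Vec<String> = Vec::new();
    let mut lookup: HashMap<&str, u32> = HashMap::new();
    let mut indices = Vec::with_capacity(values.len());
    for &v in values {
        let idx = *lookup.entry(v).or_insert_with(|| {
            dict.push(v.to_string());
            (dict.len() - 1) as u32
        });
        indices.push(idx);
    }
    (indices, dict)
}

fn main() {
    let column = ["us", "eu", "us", "us", "eu"];
    let (indices, dict) = dictionary_encode(&column);
    assert_eq!(dict, vec!["us".to_string(), "eu".to_string()]);
    assert_eq!(indices, vec![0, 1, 0, 0, 1]);
}
```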

On alignment, Arrow already 64-byte aligns buffers, so you typically don’t need to worry about misalignment. One subtlety is slicing: Array.slice(offset) does not copy data; it records the offset and keeps referencing the original buffers, tracking a bit-level offset into validity bitmaps rather than reallocating them. Each buffer’s starting address therefore remains aligned, since it comes from the original aligned allocation. As a consumer, just follow Arrow’s alignment and padding rules when constructing ArrayData (the builders in Rust generally handle this for you). Unaligned data can cause failures in some vectorized operations, so it’s an important aspect of Arrow.
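The bit-packed validity bitmap mentioned in the nulls bullet above follows the Arrow spec’s least-significant-bit numbering within each byte; a minimal plain-Rust sketch:

```rust
// Sketch: a bit-packed validity bitmap, 1 bit per element, with
// LSB-first bit numbering within each byte as the Arrow spec defines.
fn is_valid(bitmap: &[u8], i: usize) -> bool {
    bitmap[i / 8] & (1 << (i % 8)) != 0
}

fn set_valid(bitmap: &mut [u8], i: usize) {
    bitmap[i / 8] |= 1 << (i % 8);
}

fn main() {
    let mut bitmap = vec![0u8; 2]; // room for 16 elements, all null initially
    set_valid(&mut bitmap, 0);
    set_valid(&mut bitmap, 3);
    set_valid(&mut bitmap, 9);
    assert!(is_valid(&bitmap, 0));
    assert!(!is_valid(&bitmap, 1));
    assert!(is_valid(&bitmap, 9));
    assert_eq!(bitmap[0], 0b0000_1001); // bits 0 and 3 set
}
```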

Serialization considerations: Arrow is designed for zero-copy or low-copy serialization. If you need to send data over the network frequently, using dictionary or run-length encoding within Arrow arrays can reduce the data size to send. Arrow Flight (gRPC for Arrow) or Parquet both benefit from these encodings. JSON, however, doesn’t know about these – if you convert Arrow to JSON, you’ll be expanding dictionary indices back to full values and writing nulls explicitly. That can be slow for large data. If part of your pipeline involves JSON (for example, web responses or logs), be mindful that an Arrow optimization like dictionary will need to be undone for JSON output. Parquet, on the other hand, will often preserve dictionary encoding (Parquet has its own dictionary and RLE encoding schemes that Arrow can utilize during writes). So a best practice: use Arrow/Parquet for internal data interchange for speed, and only convert to JSON at the final point if needed for human or external system consumption – and do so by streaming if possible to handle the conversion cost.

In summary, Apache Arrow’s physical types offer a toolbox of memory layouts. By choosing the right layout for the right data shape (flat vs nested, repeated vs unique, fixed vs variable), you can achieve an optimal balance of speed and memory usage. Arrow’s Rust implementation provides builders and types to construct these layouts safely, and understanding the under-the-hood representation helps in writing efficient data processing pipelines. Whether you are a startup optimizing an analytics engine or building a data service, leveraging Arrow’s columnar format will allow you to scale with large data volumes while keeping performance high (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316) (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316). With proper use of alignment, buffering, and encoding techniques like dictionary and run-length encoding, you can minimize memory overhead and maximize throughput, all while maintaining interoperability through Arrow’s standardized format.

References:

  1. Apache Arrow Format and Specification – Memory Layout and Buffers (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316) (Physical memory layout — Apache Arrow v0.12.1.dev425+g828b4377f.d20190316)
  2. Arrow R Package Developer Documentation – Internal Data Object Layout (Internal structure of Arrow objects • Arrow R Package) (Internal structure of Arrow objects • Arrow R Package)
  3. Apache Arrow Rust Documentation – DataType definitions (DataType in arrow::datatypes - Rust) (DataType in arrow::datatypes - Rust)
  4. Apache Arrow Rust Documentation – ArrayData, Buffer, NullBuffer usage (how to create a polars-arrow Array from raw values (&[u8]) - Stack Overflow) (NullBuffer in arrow::buffer - Rust)
  5. Apache Arrow Format Documentation – Layout for variable-length, list, struct, union, etc. (Introduction — Apache Arrow v19.0.1) (Introduction — Apache Arrow v19.0.1)
  6. Apache Arrow Rust Examples – Constructing arrays and using builders (Int32Array in arrow::array - Rust) (TypedDictionaryArray in arrow::array - Rust)
  7. Apache Arrow Format Documentation – Dictionary and Run-End Encoded layout (Introduction — Apache Arrow v19.0.1) (Introduction — Apache Arrow v19.0.1)
  8. Apache Arrow Rust Internals – RunArray and UnionArray implementation (RunArray in arrow_array::array - Rust) (Introduction — Apache Arrow v19.0.1)