This is a set of examples demonstrating whether various Python and R dataframe libraries and OLAP query engines preserve (or do not preserve) the original order of the records in the data.
The examples all use this dataset describing the 28 times when a person walked on the moon:
year | mission | name | minutes |
---|---|---|---|
1969 | Apollo 11 | Neil Armstrong | 151 |
1969 | Apollo 11 | Buzz Aldrin | 151 |
1969 | Apollo 12 | Pete Conrad | 236 |
1969 | Apollo 12 | Alan Bean | 236 |
1969 | Apollo 12 | Pete Conrad | 229 |
1969 | Apollo 12 | Alan Bean | 229 |
1971 | Apollo 14 | Alan Shepard | 287 |
1971 | Apollo 14 | Edgar Mitchell | 287 |
1971 | Apollo 14 | Alan Shepard | 274 |
1971 | Apollo 14 | Edgar Mitchell | 274 |
1971 | Apollo 15 | David Scott | 392 |
1971 | Apollo 15 | James Irwin | 392 |
1971 | Apollo 15 | David Scott | 432 |
1971 | Apollo 15 | James Irwin | 432 |
1971 | Apollo 15 | David Scott | 289 |
1971 | Apollo 15 | James Irwin | 289 |
1972 | Apollo 16 | John Young | 431 |
1972 | Apollo 16 | Charles Duke | 431 |
1972 | Apollo 16 | John Young | 443 |
1972 | Apollo 16 | Charles Duke | 443 |
1972 | Apollo 16 | John Young | 340 |
1972 | Apollo 16 | Charles Duke | 340 |
1972 | Apollo 17 | Gene Cernan | 431 |
1972 | Apollo 17 | Harrison Schmitt | 431 |
1972 | Apollo 17 | Gene Cernan | 456 |
1972 | Apollo 17 | Harrison Schmitt | 456 |
1972 | Apollo 17 | Gene Cernan | 435 |
1972 | Apollo 17 | Harrison Schmitt | 435 |
The rows of this data are implicitly ordered first by the Apollo mission (in chronological order from top to bottom), second by the order of moonwalks during the mission (earliest at the top), and third by the order in which the astronauts egressed from the lunar module during the first or only moonwalk of the mission (first at the top). See the included file moonwalks.csv
which contains this data.
import pandas as pd
moonwalks = pd.read_csv("moonwalks.csv")
moonwalks[["mission"]].drop_duplicates()
mission
0 Apollo 11
2 Apollo 12
6 Apollo 14
10 Apollo 15
16 Apollo 16
22 Apollo 17
pandas preserves the original row order.
import polars as pl
moonwalks = pl.read_csv("moonwalks.csv")
moonwalks.select("mission").unique(maintain_order=True)
shape: (6, 1)
┌───────────┐
│ mission │
│ --- │
│ str │
╞═══════════╡
│ Apollo 11 │
│ Apollo 12 │
│ Apollo 14 │
│ Apollo 15 │
│ Apollo 16 │
│ Apollo 17 │
└───────────┘
import polars as pl
moonwalks = pl.read_csv("moonwalks.csv")
moonwalks.select("mission").unique(maintain_order=False)
shape: (6, 1)
┌───────────┐
│ mission │
│ --- │
│ str │
╞═══════════╡
│ Apollo 17 │
│ Apollo 14 │
│ Apollo 12 │
│ Apollo 16 │
│ Apollo 11 │
│ Apollo 15 │
└───────────┘
Polars preserves the original row order only when maintain_order
is set to True
.
import dask.dataframe as dd
moonwalks = dd.read_csv("moonwalks.csv")
moonwalks[["mission"]].drop_duplicates().compute()
mission
0 Apollo 11
2 Apollo 12
6 Apollo 14
10 Apollo 15
16 Apollo 16
22 Apollo 17
When the input index is monotonically increasing (as in this example), Dask DataFrame preserves the original row order.
import modin.pandas as pd
moonwalks = pd.read_csv("moonwalks.csv")
moonwalks[["mission"]].drop_duplicates()
mission
0 Apollo 11
2 Apollo 12
6 Apollo 14
10 Apollo 15
16 Apollo 16
22 Apollo 17
Modin preserves the original row order.
library(dplyr)
library(readr)
moonwalks <- read_csv("moonwalks.csv")
moonwalks |> distinct(mission)
# A tibble: 6 × 1
mission
<chr>
1 Apollo 11
2 Apollo 12
3 Apollo 14
4 Apollo 15
5 Apollo 16
6 Apollo 17
dplyr preserves the original row order.
library(data.table)
moonwalks <- fread("moonwalks.csv")
unique(moonwalks[, .(mission)])
mission
<char>
1: Apollo 11
2: Apollo 12
3: Apollo 14
4: Apollo 15
5: Apollo 16
6: Apollo 17
data.table preserves the original row order.
SET preserve_insertion_order = true;
SELECT DISTINCT(mission) FROM "moonwalks.csv";
┌───────────┐
│ mission │
│ varchar │
├───────────┤
│ Apollo 16 │
│ Apollo 15 │
│ Apollo 17 │
│ Apollo 11 │
│ Apollo 12 │
│ Apollo 14 │
└───────────┘
SET preserve_insertion_order = false;
SELECT DISTINCT(mission) FROM "moonwalks.csv";
┌───────────┐
│ mission │
│ varchar │
├───────────┤
│ Apollo 16 │
│ Apollo 11 │
│ Apollo 12 │
│ Apollo 15 │
│ Apollo 17 │
│ Apollo 14 │
└───────────┘
DuckDB does not preserve the original row order, regardless of the value of preserve_insertion_order
.
CREATE EXTERNAL TABLE moonwalks
STORED AS CSV
LOCATION "moonwalks.csv"
OPTIONS ("has_header" "true");
SELECT DISTINCT(mission) FROM moonwalks;
+-----------+
| mission |
+-----------+
| Apollo 14 |
| Apollo 11 |
| Apollo 12 |
| Apollo 16 |
| Apollo 15 |
| Apollo 17 |
+-----------+
DataFusion does not preserve the original row order.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
moonwalks = spark.read.csv("moonwalks.csv", header=True, inferSchema=True)
moonwalks.select("mission").distinct().show()
+---------+
| mission|
+---------+
|Apollo 11|
|Apollo 16|
|Apollo 12|
|Apollo 14|
|Apollo 15|
|Apollo 17|
+---------+
Spark does not preserve the original row order.