
@mchav
Created March 27, 2026 01:10
# Dealing with nulls as the schema evolves

Oftentimes your data is split into a training set and a test set. Columns with missing values in the training set might not have missing values in the test set (and vice versa). So a function written to clean your training data might crash on your test data. This is a problem: the software is naive to the fact that missingness is ubiquitous and should be handled gracefully. There are a number of ways to solve this problem.
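To make the failure mode concrete, here is a minimal, self-contained sketch in plain Haskell (not the dataframe API; the names are illustrative): a cell parser written against the training set, where the column never has nulls, crashes on the test set's empty cell, while a `Maybe`-returning parser absorbs it.

```haskell
import Text.Read (readMaybe)
import Data.Maybe (fromMaybe)

-- A parser written against the training data, where "x" is never
-- missing, bakes in the assumption that every cell parses:
parseStrict :: String -> Double
parseStrict cell =
  fromMaybe (error ("unexpected null: " ++ show cell)) (readMaybe cell)

-- The same column in the test set has an empty cell, so
-- `map parseStrict testColumn` would crash at run time.
testColumn :: [String]
testColumn = ["1.0", "", "2.5"]

-- Treating missingness as ubiquitous handles both sets gracefully:
parseSafe :: String -> Maybe Double
parseSafe = readMaybe

main :: IO ()
main = print (map parseSafe testColumn)
-- prints [Just 1.0,Nothing,Just 2.5]
```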

You can see the code smell in the `Synthesis.hs` example: we create an uber dataframe from train and test just to approximate the final schema, then split them up again.

## Read everything as `Maybe a`

When users read CSVs (or any file format, for that matter) we can default to assuming there is missingness everywhere. The user can then either work with `Maybe a` as-is (never unwrapping to concrete types) or explicitly deal with missingness up front.

### a) Leaving everything as `Maybe a`

```haskell
-- Schema: x :: Expr (Maybe Double)
--         y :: Expr (Maybe Double)
df <- D.readCsv D.csvOptions {safeRead = True} "./data/test.csv"

let df' = D.derive "area" (F.col @(Maybe Double) "x" .* F.col @(Maybe Double) "y") df

D.writeCsv "./data/test_aug.csv" df'
```

### b) Cleaning up `Maybe a`
```haskell
-- Schema: x :: Expr (Maybe Double)
--         y :: Expr (Maybe Double)

clean :: TypedDataFrame '[Column "x" (Maybe Double), Column "y" (Maybe Double)] -> TypedDataFrame '[Column "x" Double, Column "y" Double]
clean = undefined -- stub implementation

raw <- D.readCsv D.csvOptions {safeRead = True} "./data/test.csv"

let tdf = either error id (DT.freezeWithError @'[Column "x" (Maybe Double), Column "y" (Maybe Double)] raw)
    tdf' = DT.derive "area" (DT.col @"x" * DT.col @"y") (clean tdf)
```
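As a sketch of what a `clean` like the stub above might do, here are the two obvious strategies over plain Haskell rows rather than the typed-dataframe API (the row and function names are illustrative assumptions):

```haskell
import Data.Maybe (fromMaybe, mapMaybe)

-- One raw row: x and y may each be missing.
type RawRow   = (Maybe Double, Maybe Double)
type CleanRow = (Double, Double)

-- Strategy 1: drop any row with a missing value. The Applicative
-- instance of Maybe makes the whole row Nothing if either cell is.
dropNulls :: [RawRow] -> [CleanRow]
dropNulls = mapMaybe (\(mx, my) -> (,) <$> mx <*> my)

-- Strategy 2: impute a default into each missing cell.
impute :: Double -> [RawRow] -> [CleanRow]
impute d = map (\(mx, my) -> (fromMaybe d mx, fromMaybe d my))

main :: IO ()
main = do
  let rows = [(Just 2, Just 3), (Nothing, Just 4), (Just 5, Nothing)]
  print (dropNulls rows)  -- [(2.0,3.0)]
  print (impute 0 rows)   -- [(2.0,3.0),(0.0,4.0),(5.0,0.0)]
```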

## Use the `cast` operator

We have a really unprincipled `cast` operator. We could ask the user to always use it first so they have full control over types (at the expense of some initial run-time failures).

```haskell
raw <- D.readCsv "./data/test.csv"

-- If you can/want to impute
let df = D.derive "area" (F.castWithDefault @Double 0 "x" * F.castWithDefault @Double 0 "y") raw

-- If you cannot/don't want to impute
let df = D.derive "area" (F.cast @(Maybe Double) "x" .* F.cast @(Maybe Double) "y") raw
```

Now you can be sure that your types are what you say/think they are, and any weirdness surfaces as an early run-time error.

`cast` would be an extremely unpopular default to point users towards, but in some cases it's the best solution.
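The shape of such a cast can be sketched in plain Haskell over raw text cells (the real operator works on dataframe columns; `castCell` and `castCellWithDefault` are hypothetical names for this illustration):

```haskell
import Data.Maybe (fromMaybe)
import Text.Read (readMaybe)

-- Cast a raw text cell: nulls and parse failures both become Nothing,
-- forcing the caller to decide what to do with them.
castCell :: Read a => String -> Maybe a
castCell = readMaybe

-- Cast with imputation, in the spirit of castWithDefault: any cell
-- that fails to parse is replaced by the supplied default.
castCellWithDefault :: Read a => a -> String -> a
castCellWithDefault d = fromMaybe d . castCell

main :: IO ()
main = do
  print (castCell "3.5" :: Maybe Double)      -- Just 3.5
  print (castCellWithDefault 0 "" :: Double)  -- 0.0
```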

## Looser inference rules

My least favourite option: this would mean the return types are all fake and we're effectively lying to the user.
