
@mchav
Created March 27, 2026 01:10
# Dealing with nulls as the schema evolves

Oftentimes your data is split into a training set and a test set. Columns with missing values in the training set might not have missing values in the test set (and vice versa). So a function written to clean your training data might crash on your test data. This is a problem: the software is naive to the fact that missingness is ubiquitous and should be handled gracefully. There are a number of ways to solve this problem.
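To make the failure mode concrete, here is a minimal, self-contained sketch in plain Haskell (not the dataframe API; the names are illustrative): a cell parser written against the training set, where the column never has nulls, crashes on the test set's empty cell, while a `Maybe`-returning parser absorbs it.

```haskell
import Text.Read (readMaybe)
import Data.Maybe (fromMaybe)

-- A parser written against the training data, where "x" is never
-- missing, bakes in the assumption that every cell parses:
parseStrict :: String -> Double
parseStrict cell =
  fromMaybe (error ("unexpected null: " ++ show cell)) (readMaybe cell)

-- The same column in the test set has an empty cell, so
-- `map parseStrict testColumn` would crash at run time.
testColumn :: [String]
testColumn = ["1.0", "", "2.5"]

-- Treating missingness as ubiquitous handles both sets gracefully:
parseSafe :: String -> Maybe Double
parseSafe = readMaybe

main :: IO ()
main = print (map parseSafe testColumn)
-- prints [Just 1.0,Nothing,Just 2.5]
```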

You can see the code smell in the `Synthesis.hs` example: we create an uber dataframe from train and test just to approximate the final schema, then split them up again.

## Read everything as `Maybe a`

When users read CSVs (or any file format, for that matter) we can default to assuming there is missingness everywhere. The user can then either work with `Maybe a` as-is (never unwrapping to concrete types) or explicitly deal with missingness up front.

### a) Leaving everything as `Maybe a`

```haskell
-- Schema: x :: Expr (Maybe Double)
--         y :: Expr (Maybe Double)
df <- D.readCsv D.csvOptions {safeRead = True} "./data/test.csv"

let df' = D.derive "area" (F.col @(Maybe Double) "x" .* F.col @(Maybe Double) "y") df

D.writeCsv "./data/test_aug.csv" df'
```

### b) Cleaning up `Maybe a`
```haskell
-- Schema: x :: Expr (Maybe Double)
--         y :: Expr (Maybe Double)

clean :: TypedDataFrame '[Column "x" (Maybe Double), Column "y" (Maybe Double)] -> TypedDataFrame '[Column "x" Double, Column "y" Double]
clean = undefined -- stub implementation

raw <- D.readCsv D.csvOptions {safeRead = True} "./data/test.csv"

let tdf = either error id (DT.freezeWithError @'[Column "x" (Maybe Double), Column "y" (Maybe Double)] raw)
    tdf' = DT.derive "area" (DT.col @"x" * DT.col @"y") (clean tdf)
```
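As a sketch of what a `clean` like the stub above might do, here are the two obvious strategies over plain Haskell rows rather than the typed-dataframe API (the row and function names are illustrative assumptions):

```haskell
import Data.Maybe (fromMaybe, mapMaybe)

-- One raw row: x and y may each be missing.
type RawRow   = (Maybe Double, Maybe Double)
type CleanRow = (Double, Double)

-- Strategy 1: drop any row with a missing value. The Applicative
-- instance of Maybe makes the whole row Nothing if either cell is.
dropNulls :: [RawRow] -> [CleanRow]
dropNulls = mapMaybe (\(mx, my) -> (,) <$> mx <*> my)

-- Strategy 2: impute a default into each missing cell.
impute :: Double -> [RawRow] -> [CleanRow]
impute d = map (\(mx, my) -> (fromMaybe d mx, fromMaybe d my))

main :: IO ()
main = do
  let rows = [(Just 2, Just 3), (Nothing, Just 4), (Just 5, Nothing)]
  print (dropNulls rows)  -- [(2.0,3.0)]
  print (impute 0 rows)   -- [(2.0,3.0),(0.0,4.0),(5.0,0.0)]
```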

## Use the `cast` operator

We have a really unprincipled `cast` operator. We could ask the user to always use it first so they have full control over types (at the expense of some initial run-time failures).

```haskell
raw <- D.readCsv "./data/test.csv"

-- If you can/want to impute
let df = D.derive "area" (F.castWithDefault @Double 0 "x" * F.castWithDefault @Double 0 "y") raw

-- If you cannot/don't want to impute
let df = D.derive "area" (F.cast @(Maybe Double) "x" .* F.cast @(Maybe Double) "y") raw
```

Now you can be sure that your types are what you say/think they are, and any weirdness surfaces as an early run-time error.

`cast` would be an extremely unpopular default to point users towards, but in some cases it's the best solution.
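The shape of such a cast can be sketched in plain Haskell over raw text cells (the real operator works on dataframe columns; `castCell` and `castCellWithDefault` are hypothetical names for this illustration):

```haskell
import Data.Maybe (fromMaybe)
import Text.Read (readMaybe)

-- Cast a raw text cell: nulls and parse failures both become Nothing,
-- forcing the caller to decide what to do with them.
castCell :: Read a => String -> Maybe a
castCell = readMaybe

-- Cast with imputation, in the spirit of castWithDefault: any cell
-- that fails to parse is replaced by the supplied default.
castCellWithDefault :: Read a => a -> String -> a
castCellWithDefault d = fromMaybe d . castCell

main :: IO ()
main = do
  print (castCell "3.5" :: Maybe Double)      -- Just 3.5
  print (castCellWithDefault 0 "" :: Double)  -- 0.0
```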

## Looser inference rules

My least favourite option: this would mean the return types are all fake and we're effectively lying to the user.
