Often your data is split into a training set and a test set. A column with missing values in the training set might have none in the test set (and vice versa), so the two files can end up with different column types. A function written to clean your training data might then crash on your test data. The software is naive to the fact that missingness is ubiquitous and should be dealt with gracefully. There are a number of ways to solve this problem.
You can see the code smell in the `Synthesis.hs` example. We create one combined dataframe from train and test just to approximate the final schema, then split it up again.
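The mismatch can be reproduced without any dataframe library. The sketch below is a toy model, not the library's implementation: `Column`, `inferColumn`, and `fillWithZero` are hypothetical names, and columns are plain lists whose type is inferred per file.

```haskell
import Data.Maybe (fromMaybe)
import Text.Read (readMaybe)

-- | Toy column (hypothetical): its type depends on the cells of one file.
data Column
  = Doubles [Double]          -- no missing cells were seen
  | OptDoubles [Maybe Double] -- at least one missing cell was seen
  deriving (Show, Eq)

-- | Infer a column from raw cells. Empty or unparseable cells count as missing.
inferColumn :: [String] -> Column
inferColumn cells =
  case sequence parsed of
    Just xs -> Doubles xs
    Nothing -> OptDoubles parsed
  where
    parsed :: [Maybe Double]
    parsed = map readMaybe cells

-- | A cleaner written against the training schema, where the column had gaps.
fillWithZero :: Column -> Either String [Double]
fillWithZero (OptDoubles xs) = Right (map (fromMaybe 0) xs)
fillWithZero (Doubles _)     = Left "expected Maybe Double, found Double"

main :: IO ()
main = do
  print (fillWithZero (inferColumn ["1.5", "", "3"]))  -- train: Right [1.5,0.0,3.0]
  print (fillWithZero (inferColumn ["1.5", "2", "3"])) -- test: the Left branch
```

The cleaner fails on the second file only because that file happens to be complete, which is exactly the situation the combined-dataframe workaround tries to avoid.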
When users read CSVs (or any other file format, for that matter), we can default to assuming there is missingness everywhere. The user can then either work with `Maybe a` as is (never unwrapping to concrete types) or always deal with missingness up front.
### a) Working with `Maybe a` as is
```haskell
-- Schema: x :: Expr (Maybe Double)
--         y :: Expr (Maybe Double)
df <- D.readCsv D.csvOptions {safeRead = True} "./data/test.csv"
let df' = D.derive "area" (F.col @(Maybe Double) "x" .* F.col @(Maybe Double) "y") df
D.writeCsv "./data/test_aug.csv" df'
```
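To make the "never unwrap" style concrete, here is a minimal sketch of how a missingness-aware multiply could behave. It assumes the `.*` operator propagates `Nothing`. The definition below is a stand-in on plain `Maybe` values, not the library's own operator, and `area` is a hypothetical helper.

```haskell
import Control.Applicative (liftA2)

-- | Stand-in for a missingness-aware multiply (an assumption, not the
-- library's definition): the result is missing if either operand is.
(.*) :: Num a => Maybe a -> Maybe a -> Maybe a
(.*) = liftA2 (*)
infixl 7 .*

-- | Row-wise area of two nullable columns, modelled as plain lists.
area :: [Maybe Double] -> [Maybe Double] -> [Maybe Double]
area = zipWith (.*)

main :: IO ()
main = print (area [Just 2, Nothing, Just 3] [Just 4, Just 5, Nothing])
-- prints [Just 8.0,Nothing,Nothing]
```

The same expression works whether or not a given file actually contains gaps, at the cost of every downstream value staying wrapped in `Maybe`.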
### b) Cleaning up `Maybe a`
```haskell
-- Schema: x :: Expr (Maybe Double)
--         y :: Expr (Maybe Double)
clean
  :: TypedDataFrame '[Column "x" (Maybe Double), Column "y" (Maybe Double)]
  -> TypedDataFrame '[Column "x" Double, Column "y" Double]
clean = undefined -- stub implementation

raw <- D.readCsv D.csvOptions {safeRead = True} "./data/test.csv"
let tdf = either error id (DT.freezeWithError @'[Column "x" (Maybe Double), Column "y" (Maybe Double)] raw)
    -- Clean first so that plain (*) applies to concrete Doubles.
    tdf' = DT.derive "area" (DT.col @"x" * DT.col @"y") (clean tdf)
```
We have a really unprincipled cast operator. We could ask users to always apply it first so that they have full control over types (at the expense of some initial runtime failures).
```haskell
raw <- D.readCsv "./data/test.csv"
-- If you can/want to impute
let df = D.derive "area" (F.castWithDefault @Double 0 "x" * F.castWithDefault @Double 0 "y") raw
-- If you cannot/don't want to impute
let df = D.derive "area" (F.cast @(Maybe Double) "x" .* F.cast @(Maybe Double) "y") raw
```
Now you can be sure that your types are what you say/think they are, and any weirdness surfaces as an early runtime error.
Cast would be an extremely unpopular default to point users towards but in some cases it's the best solution.
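The trade-offs between the cast flavours are easier to see on plain lists of raw cells. The following is a sketch under assumptions, not the library's cast implementation: it reuses the name `castWithDefault` only by analogy, while `castMaybe` and `castStrict` are hypothetical names for the "keep missingness" and "fail early" behaviours.

```haskell
import Data.Maybe (fromMaybe)
import Text.Read (readMaybe)

-- | Impute: empty or unparseable cells become the default value.
castWithDefault :: Read a => a -> [String] -> [a]
castWithDefault def = map (fromMaybe def . readMaybe)

-- | Keep missingness: empty or unparseable cells become Nothing.
castMaybe :: Read a => [String] -> [Maybe a]
castMaybe = map readMaybe

-- | Strict: fail early and report the first offending row (0-indexed).
castStrict :: Read a => [String] -> Either String [a]
castStrict = traverse parse . zip [0 :: Int ..]
  where
    parse (i, cell) =
      maybe (Left ("row " ++ show i ++ ": cannot parse " ++ show cell))
            Right
            (readMaybe cell)

main :: IO ()
main = do
  let xs = ["2", "", "3"]
      ys = ["4", "5", "oops"]
  -- Imputing with 0 always succeeds: prints [8.0,0.0,0.0]
  print (zipWith (*) (castWithDefault 0 xs) (castWithDefault 0 ys :: [Double]))
  -- Keeping missingness: prints [Just 2.0,Nothing,Just 3.0]
  print (castMaybe xs :: [Maybe Double])
  -- Strict cast: a Left naming row 2 of ys
  print (castStrict ys :: Either String [Double])
```

Note that the imputing variant silently turns a malformed cell into the default, whereas the strict variant is the one that produces the early failure.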
My least favourite option. This means return types are all fake and we're effectively lying to the user.