Created
April 11, 2017 16:50
-
-
Save tonyfischetti/30868b54538559fde93b032f4f6a5d27 to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
<small>*Version 2.0 of my data set validation package `assertr` hit CRAN just this weekend. It has some pretty great improvements over version 1. For those new to the package, what follows is a short and new introduction. For those who are already using `assertr`, the text below will point out the improvements.*</small> | |
I can (and have) go on and on about the treachery of messy/bad datasets. Though its substantially less exciting than… pretty much everything else, I believe (proportional to the heartache and stress it causes) we don’t spend enough time talking about it or building solutions around it. No matter how new and fancy your ML algorithm is, it’s success is predicated upon a properly sanitized dataset. If you are using bad data, your approach will fail—either flagrantly (best case), or unnoticeably (considerably more probable and considerably more pernicious). | |
`assertr` is a R package to help you identify common dataset errors. More specifically, it helps you easily spell out your assumptions about how the data should look and alert you of any deviation from those assumptions. | |
I’ll return to this point later in the post when we have more background, but I want to be up front about the goals of the package; `assertr` is not (and can never be) a “one-stop shop” for all of your data validation needs. The specific kind of checks individuals or teams have to perform any particular dataset are often far too idiosyncratic to ever be exhaustively addressed by a single package (although, the `assertive` meta-package may come very close!) But all of these checks will reuse motifs and follow the same patterns. So, instead, I’m trying to sell `assertr` as a way of thinking about dataset validations—a set of common dataset validation *actions*. If we think of these actions as *verbs*, you could say that `assertr` attempts to impose a grammar of error checking for datasets. | |
In my experience, the overwhelming majority of data validation tasks fall into only five different patterns: | |
* *For every element in a column*, you want to make sure it fits certain criteria. Examples of this strain of error checking would be to make sure every element is a valid credit card number, or fits a certain regex pattern, or represents a date between two other dates. `assertr` calls this verb `assert`. | |
* *For every element in a column*, you want to make sure certain criteria are met **but the criteria can only be decided only *after* looking at the entire column as a whole**. For example, testing whether each element is within *n* standard deviations of the mean of the elements requires computation on the elements prior to inform the criteria to check for. `assertr` calls this verb `insist`. | |
* *For every row of a dataset*, you want to make sure certain assumptions hold. Examples include ensuring that no row has more than *n* number of missing values, or that a group of columns are jointly unique and never duplicated. `assertr` calls this verb `assert_rows`. | |
* *For every row of a dataset*, you want to make sure certain assumptions hold **but the criteria can only be decided only *after* looking at the entire column as a whole**. This closely mirrors the distinction between `assert` and `insist`, but for entire rows (not individual elements). An example of using this would be checking to make sure that the [Mahalanobis distance](https://en.wikipedia.org/wiki/Mahalanobis_distance) between each row and all other rows are within *n* number of standard deviations of the mean distance. `assertr` calls this verb `insist_rows`. | |
* You want to check some property of the dataset as a whole object. Examples include making sure the dataset has more than *n* columns, making sure the dataset has some specified column names, etc… `assertr` calls this last verb `verify`. | |
Some of this might sound a little complicated, but I promise this is a worthwhile way to look at dataset validation. Now we can begin with an example of what can be achieved with these verbs. The following example is borrowed from the package vignette and README… | |
Pretend that, before finding the average miles per gallon for each number of engine cylinders in the `mtcars` dataset, we wanted to confirm the following dataset assumptions… | |
<small> | |
* that it has the columns `mpg`, `vs`, and `am` | |
* that the dataset contains more than 10 observations | |
* that the column for 'miles per gallon' (mpg) is a positive number | |
* that the column for ‘miles per gallon’ (mpg) does not contain a datum that is outside 4 standard deviations from its mean | |
* that the `am` and `vs` columns (automatic/manual and v/straight engine, respectively) contain 0s and 1s only | |
* each row contains at most 2 NAs | |
* each row is unique jointly between the `mpg`, `am`, and `wt` columns | |
* each row's mahalanobis distance is within 10 median absolute deviations of all the distances (for outlier detection) | |
</small> | |
```{r results='hide', warning=FALSE, comment=FALSE, warning=FALSE, message=FALSE} | |
library(assertr) | |
library(magrittr) | |
library(dplyr) | |
mtcars %>% | |
verify(has_all_names("mpg", "vs", "am", "wt")) %>% | |
verify(nrow(.) > 10) %>% | |
verify(mpg > 0) %>% | |
insist(within_n_sds(4), mpg) %>% | |
assert(in_set(0,1), am, vs) %>% | |
assert_rows(num_row_NAs, within_bounds(0,2), everything()) %>% | |
assert_rows(col_concat, is_uniq, mpg, am, wt) %>% | |
insist_rows(maha_dist, within_n_mads(10), everything()) %>% | |
group_by(cyl) %>% | |
summarise(avg.mpg=mean(mpg)) | |
``` | |
Before `assertr` version 2, the pipeline would immediately terminate at the first failure. Sometimes this is a good thing. However, sometimes we’d like to run a dataset through our entire suite of checks and record all failures. The latest version includes the `chain_start` and `chain_end` functions; all assumptions within a chain (below a call to `chain_start` and above `chain_end`) will run from beginning to end and accumulate errors along the way. At the end of the chain, a specific action can be taken but the default is to halt execution and display a comprehensive report of what failed including line numbers and the offending datum, where applicable. | |
Another major improvement since the last version of `assertr` of CRAN is that `assertr` errors are now S3 classes (instead of dumb strings). Additionally, the behavior of each assertion statement on success (no error) and failure can now be flexibly customized. For example, you can now tell `assertr` to just return TRUE and FALSE instead of returning the data passed in or halting execution, respectively. Alternatively, you can instruct `assertr` to just give a warning instead of throwing a fatal error. For more information on this, see `help("success_and_error_functions")` | |
### Beyond these examples | |
Since the package was initially published on CRAN (almost exactly two years ago) many people have asked me how they should go about using `assertr` to test a particular assumption (and I’m very happy to help if you have one of your own, dear reader!) In every single one of these cases, I’ve been able to express it as an incantation using one of these 5 verbs. It also underscored, to me, that creating specialized functions for every need is a pipe dream. There is, however, two good pieces of news. | |
The first is that there is another package, [assertive](https://CRAN.R-project.org/package=assertive) that greatly enhances the `assertr` experience. The predicates (functions that start with “is_”) from this (meta)package can be used in `assertr` pipelines just as easily as `assertr`’s own predicates. And `assertive` has an enormous amount of them! Some specialized and exciting examples include `is_hex_color`, `is_ip_address`, and `is_isbn_code`! | |
The second is if `assertive` doesn’t have what you’re looking for, with just a little bit of studying the `assertr` grammar, you can whip up your own predicates with relative ease. Using some these basic constructs and a little effort, I’m confident that the grammar is expressive enough to completely adapt to your needs. | |
*If this package interests you, I urge you to read the most recent package vignette [here](https://cran.r-project.org/package=assertr/vignettes/assertr.html). If you're a `assertr` old-timer, I point you to [this NEWS file](https://cran.r-project.org/package=assertr/NEWS) that list the changes from the previous version.* |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment