Last active
May 31, 2016 13:00
"Why Purity Matters" blog post.
Purity is a useful construct that forces programmers to consider that the | |
environment that they are operating in is unclean, and this introduces | |
barriers to formally defining the behaviour of the system comprising | |
the program and its environment. | |
This file is a [literate Haskell | |
file](https://gist.github.com/kisom/3863f17636d99b4f8401) run through | |
[pandoc](http://johnmacfarlane.net/pandoc/) to produce a Markdown | |
post. There might be a few glitches as I'm still developing a workflow | |
in this style. | |
> import System.IO | |
> import System.IO.Error | |
I've spent the majority of my career thus far as an embedded Linux | |
engineer writing primarily C targeting our boxes. In this post, this | |
type of environment is what I have in mind for the target system, but it | |
really applies to most Linux systems as well. I can't speak to the others. | |
Consider the following program fragment in C; this is a common pattern | |
I've encountered working as a systems engineer. | |
```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

typedef struct {
    uint8_t *data;
    size_t   length;
} FileData;

FileData *
read_file(const char *path)
{
    struct stat  sb;
    FileData    *fdata = NULL;
    FILE        *file = NULL;

    /* Ask the filesystem how large the file is. */
    if (stat(path, &sb) == -1) {
        return NULL;
    }

    if (NULL == (fdata = calloc(1, sizeof(FileData)))) {
        return NULL;
    }

    /* Size the buffer from the earlier stat(2) result. */
    fdata->length = (size_t)sb.st_size;
    fdata->data = calloc(fdata->length, 1);
    if (NULL == fdata->data) {
        free(fdata);
        return NULL;
    }

    file = fopen(path, "r");
    if (NULL == file) {
        free(fdata->data);
        free(fdata);
        return NULL;
    }

    /* Anything short of the expected length is treated as a failure. */
    if (fdata->length != fread(fdata->data, 1, fdata->length, file)) {
        fclose(file);
        free(fdata->data);
        free(fdata);
        return NULL;
    }

    fclose(file);
    return fdata;
}
```
In what ways can `read_file` fail? | |
#### The obvious | |
The most obvious ways this can fail are:
0. The file doesn't exist (which is picked up in the `stat(2)` call). | |
0. The program doesn't have permissions to read the file (picked up in | |
the `stat(2)` or `fopen(3)` call). | |
0. The program cannot allocate memory, either kernel memory for the | |
call to `stat(2)` or user memory for the calls to `calloc(3)`. | |
0. The program cannot read the entire file into memory. | |
I've noticed that my first tendency is to think of this function as
running on a "snapshot" of the system: that is, the state of the | |
system remains consistent throughout this function. Functions are | |
fast, right? And this shouldn't take so long to run that the world can | |
change, right? | |
It turns out the answer is more subtle than this. | |
#### Abandon every hope | |
In order to understand the complexities of what is actually going on, | |
let's consider how the previous four failure modes occur. We also need | |
to understand that the scheduler can put this process to sleep at any time
and give another process control.
##### ENOENT | |
In the case of `ENOENT`, it turns out that this can occur in both places
where the file is accessed. That is, if the process yields control between
the `stat(2)` and the `fopen(3)`, the file may no longer exist. The caller
sees the same result as for a permissions failure: a `NULL` return.
This might occur, for example, during a log rotation: | |
* Process A runs a check through the system logs and determines that | |
"server.log" needs to be rotated. | |
* The scheduler puts A to sleep and wakes up process B. | |
* Process B enters `read_file` for "server.log". | |
* Process B calls `stat(2)` inside `read_file` and determines that | |
"server.log" is `L` bytes. | |
* The scheduler puts B to sleep and wakes up A. | |
* Process A renames "server.log" to "server.log.1". | |
* The scheduler puts A to sleep and wakes up process B. | |
* Process B allocates `L` bytes and attempts to read "server.log". | |
* "server.log" no longer exists, and `read_file` fails. The allocated | |
memory is returned back to the system. | |
This can put some churn on the memory allocator, which might lead to | |
performance problems. | |
The parent directories can also be removed or renamed.
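One way to narrow this particular window is to open the file first and then size the buffer with `fstat(2)` on the open descriptor, so that both calls refer to the same file even if the path is renamed in between. The following is only a sketch under that assumption; `read_file_fd` is a hypothetical name, not part of the original fragment.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

typedef struct {
    uint8_t *data;
    size_t   length;
} FileData;

/*
 * read_file_fd is a hypothetical variant of read_file. It opens the
 * path once and sizes the buffer with fstat(2) on that descriptor, so
 * a rename between the two calls cannot make them refer to different
 * files.
 */
FileData *
read_file_fd(const char *path)
{
    struct stat  sb;
    FileData    *fdata = NULL;
    size_t       off = 0;
    ssize_t      n;
    int          fd;

    if (-1 == (fd = open(path, O_RDONLY))) {
        return NULL;
    }

    if (-1 == fstat(fd, &sb)) {
        close(fd);
        return NULL;
    }

    if (NULL == (fdata = calloc(1, sizeof(FileData)))) {
        close(fd);
        return NULL;
    }

    /* One spare byte so an empty file still gets a real allocation. */
    fdata->length = (size_t)sb.st_size;
    fdata->data = calloc(fdata->length + 1, 1);
    if (NULL == fdata->data) {
        free(fdata);
        close(fd);
        return NULL;
    }

    /* read(2) may return fewer bytes than requested, so loop. */
    while (off < fdata->length) {
        n = read(fd, fdata->data + off, fdata->length - off);
        if (n <= 0) {
            break;  /* error, or the file shrank after fstat(2) */
        }
        off += (size_t)n;
    }
    close(fd);

    if (off != fdata->length) {
        free(fdata->data);
        free(fdata);
        return NULL;
    }
    return fdata;
}
```

This does not make the environment any cleaner: another process can still truncate or append to the file after the `fstat(2)`. It only removes the case where the two calls resolve the path to different files.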
##### EACCES | |
A file's (or its parent directory's) permissions can change during the
course of its lifetime; this has been rarer in my experience, but it can
make for an interesting debugging session.
##### ENOMEM | |
Linux malloc [can never fail](http://scvalex.net/posts/6/), except | |
when it can. Usually, it won't be the malloc itself that fails, but | |
the effects of the memory pressure will be felt elsewhere in the system | |
causing turbulence like heavy paging or OOM kills. Memory pressure can | |
also affect scheduling. | |
##### EOF/FERROR | |
The most common case where the entire file can't be read into memory is
when it has been truncated. In the example for `ENOENT`, imagine that
process A manages to create a new "server.log" before process B resumes
execution. In this case, process B expects to read `L` bytes, but
"server.log" is now `L'` bytes. The `fread(3)` doesn't know to expect a
smaller file.
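A caller that would rather accept a smaller file than fail can look at how many bytes `fread(3)` actually returned and use `ferror(3)` to tell a short read at end-of-file from a real stream error. A minimal sketch, assuming that policy is acceptable; `read_up_to` is a hypothetical helper name.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * read_up_to (hypothetical helper) reads at most `expected` bytes and
 * stores the number actually read in *got. It returns 0 on a full read
 * or on a short read that ended at end-of-file, and -1 if the stream
 * reported an error.
 */
int
read_up_to(FILE *file, uint8_t *buf, size_t expected, size_t *got)
{
    *got = fread(buf, 1, expected, file);
    if (*got < expected && ferror(file)) {
        return -1;
    }
    return 0;
}
```

With something like this, `read_file` could set `fdata->length` to the number of bytes actually read instead of giving up when the file turns out to be `L'` bytes long.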
#### Other failure modes | |
There are other ways the disk can fail: hardware failures or filesystem
corruption, for example. If the filesystem is a network filesystem,
all the failure modes of a network enter the mix as well. Some of
these calls will fail if the path name is too long (`ENAMETOOLONG`), or if
the file's size does not fit in the program's `off_t` (`EOVERFLOW`), as can
happen when a 32-bit build compiled without `-D_FILE_OFFSET_BITS=64` calls
`stat(2)` on a file larger than 2 GiB.
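The original `read_file` collapses all of these into a `NULL` return, but the individual causes are visible in `errno` right after the failing call. As a small illustration (the helper name `stat_errno` is hypothetical), this reports which error `stat(2)` failed with:

```c
#include <errno.h>
#include <sys/stat.h>

/*
 * stat_errno (hypothetical helper) returns 0 if stat(2) succeeds on
 * path, or the errno value it failed with, such as ENOENT, EACCES,
 * ENAMETOOLONG, or EOVERFLOW.
 */
int
stat_errno(const char *path)
{
    struct stat sb;

    if (-1 == stat(path, &sb)) {
        return errno;
    }
    return 0;
}
```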
#### Purity | |
Pure functions are those that do not rely on the outside world for
their answer; they have no side effects and do not depend on hidden state. In Haskell,
pure functions are the default, and impure functions (such as those that | |
access the disk) must be handled distinctly. The main function of every | |
program is wrapped in an IO pipeline; pure functions can split off from | |
this and operate on data, but they must always return to the IO | |
pipeline. This means that interactions with the outside world are always | |
marked as impure, and require special handling. There are ways to | |
circumvent this, but they require explicitly doing so and are frowned | |
upon. Furthermore, adding type annotations to mark where it's appropriate
to handle impure interactions and explicitly marking the pure code paths
arguably makes it easier to reason about the behaviour of the code.
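C has no way to enforce this split, but the same discipline can be followed by convention: keep the code that touches the environment thin, and hand the data to functions that depend only on their arguments. A sketch of such a function, using the hypothetical name `count_lines`, that could operate on the buffer held in a `FileData`:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * count_lines (hypothetical example) is pure in the sense used above:
 * its result depends only on its arguments, and it touches no files,
 * globals, or errno.
 */
size_t
count_lines(const uint8_t *data, size_t length)
{
    size_t lines = 0;
    size_t i;

    for (i = 0; i < length; i++) {
        if ('\n' == data[i]) {
            lines++;
        }
    }
    return lines;
}
```

Because nothing here can be changed by the scheduler or another process, none of the failure modes discussed for `read_file` apply to it; the compiler just won't stop anyone from adding a side effect later.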
#### Haskell example | |
The following code sample actually uses two layers of monads to mark | |
the code paths. | |
The function operates on a data structure similar to the `FileData` | |
structure in the C fragment above. | |
> data FileData = FileData String Integer | |
The data structure will be showable in the REPL, but I'd rather not see
all the file's contents whenever the value is printed.
> instance Show FileData where | |
> show (FileData _ l) = "file of " ++ (show l) ++ " bytes" | |
Here is the Haskell `read_file`: it takes a file path and returns
an `IO (Either IOErrorType FileData)`. What is an `IO (Either IOErrorType
FileData)`? The `IO` part marks the output as being part of the `IO`
monad; it is in a pipeline of impure code that interacts with the outside
world. Any function that operates on the result of this function must
be prepared to handle such code. The `Either IOErrorType FileData` monad
inside the `IO` pipeline means that the result of this function is a
value that might be either an `IOErrorType` or a `FileData`. Functions
that handle the contents of the `IO` pipeline should be prepared to
handle both the error case and actual data.
> read_file :: FilePath -> IO (Either IOErrorType FileData)
> read_file path = catchIOError (hf path) exHandler
>   where
>     hf p = withFile p ReadMode $ \handle -> do
>       fileSize <- hFileSize handle
>       fileData <- hGetContents handle
>       -- hGetContents is lazy, so force the whole string before
>       -- withFile closes the handle.
>       length fileData `seq` return (Right (FileData fileData fileSize))
>     exHandler e = return $ Left (ioeGetErrorType e)
Unlike the C version, the error information is returned directly as part
of the result, instead of being extracted from `errno` after a failure
has been detected (which is idiomatic in C).
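For comparison, the same shape can be approximated in C by returning the error alongside the value. This is only a sketch: `ReadResult` and `read_file_result` are hypothetical names, and nothing forces the caller to check `err` before using the data, which is exactly the guarantee the `Either` type provides.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

typedef struct {
    uint8_t *data;
    size_t   length;
} FileData;

/*
 * ReadResult is a hypothetical stand-in for Either: err is 0 on
 * success (and fdata is valid), or an errno value on failure.
 */
typedef struct {
    int      err;
    FileData fdata;
} ReadResult;

ReadResult
read_file_result(const char *path)
{
    ReadResult   res = { 0, { NULL, 0 } };
    struct stat  sb;
    FILE        *file;

    if (NULL == (file = fopen(path, "r"))) {
        res.err = errno;
        return res;
    }

    /* Capture errno before fclose(3) has a chance to overwrite it. */
    if (-1 == fstat(fileno(file), &sb)) {
        res.err = errno;
        fclose(file);
        return res;
    }

    res.fdata.length = (size_t)sb.st_size;
    res.fdata.data = calloc(res.fdata.length + 1, 1);
    if (NULL == res.fdata.data) {
        res.err = ENOMEM;
        res.fdata.length = 0;
        fclose(file);
        return res;
    }

    if (res.fdata.length != fread(res.fdata.data, 1, res.fdata.length, file)) {
        /* EIO is an arbitrary choice here for "short or failed read". */
        res.err = EIO;
        free(res.fdata.data);
        res.fdata.data = NULL;
        res.fdata.length = 0;
    }

    fclose(file);
    return res;
}
```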
Coming from this embedded C background, I'm growing to like this
explicitness about the world my programs operate in.