kisom · May 31, 2016 13:00
diff --git a/Purity.lhs b/Purity.lhs
 Purity is a useful construct that forces programmers to consider that the
 environment that they are operating in is unclean, and this introduces
 barriers to formally defining the behaviour of the system comprising
 the program and its environment.

 This file is a [literate Haskell
 file](https://gist.github.com/kisom/3863f17636d99b4f8401) run through
 [pandoc](http://johnmacfarlane.net/pandoc/) to produce a Markdown
 post. There might be a few glitches as I'm still developing a workflow
 in this style.

 > import System.IO
 > import System.IO.Error

 I've spent the majority of my career thus far as an embedded Linux
 engineer writing primarily C targeting our boxes. In this post, this
 type of environment is what I have in mind for the target system, but it
 really applies to most Linux systems as well. I can't speak to the others.

 Consider the following program fragment in C; this is a common pattern
 I've encountered working as a systems engineer.

 ```c
 #include <stdint.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <sys/stat.h>

 typedef struct {
 	uint8_t *data;
 	size_t   length;
 } FileData;

 FileData *
 read_file(const char *path)
 {
 	struct stat	 sb;
 	FileData	*fdata = NULL;
 	FILE        *file = NULL;

 	if (stat(path, &sb) == -1) {
 		return NULL;
 	}

 	if (NULL == (fdata = calloc(1, sizeof(FileData)))) {
 		return NULL;
 	}

 	fdata->length = (size_t)sb.st_size;
 	fdata->data = calloc(fdata->length, 1);
 	if (NULL == fdata->data) {
 		free(fdata);
 		return NULL;
 	}

 	file = fopen(path, "r");
 	if (NULL == file) {
 		free(fdata->data);
 		free(fdata);
 		return NULL;
 	}

 	if (fdata->length != fread(fdata->data, 1, fdata->length, file)) {
 		/* Short read: release the stream as well as the buffers. */
 		fclose(file);
 		free(fdata->data);
 		free(fdata);
 		return NULL;
 	}

 	fclose(file);
 	return fdata;
 }
 ```

 In what ways can `read_file` fail?

 #### The obvious

 The most obvious ways this can fail are:

 0. The file doesn't exist (which is picked up in the `stat(2)` call).
 0. The program doesn't have permissions to read the file (picked up in
   the `stat(2)` or `fopen(3)` call).
 0. The program cannot allocate memory, either kernel memory for the
   call to `stat(2)` or user memory for the calls to `calloc(3)`.
 0. The program cannot read the entire file into memory.

 I've noticed that my first tendency is to think of this function as
 running on a "snapshot" of the system: that is, the state of the
 system remains consistent throughout this function. Functions are
 fast, right? And this shouldn't take so long to run that the world can
 change, right?

 It turns out the answer is more subtle than this.

 #### Abandon every hope

 In order to understand the complexities of what is actually going on,
 let's consider how the previous four failure modes occur. We also need
 to understand that the scheduler can sleep this process at any time and
 give another process control.

 ##### ENOENT

 In the case of `ENOENT`, it turns out that this can occur in both places
 the file is accessed. That is, if the process yields control between the
 `stat(2)` and `fopen(3)`, the file may not exist any more. This will
 result in the same behaviour as the case of a permissions failure.

 This might occur, for example, during a log rotation:

 * Process A runs a check through the system logs and determines that
  "server.log" needs to be rotated.
 * The scheduler puts A to sleep and wakes up process B.
 * Process B enters `read_file` for "server.log".
 * Process B calls `stat(2)` inside `read_file` and determines that
  "server.log" is `L` bytes.
 * The scheduler puts B to sleep and wakes up A.
 * Process A renames "server.log" to "server.log.1".
 * The scheduler puts A to sleep and wakes up process B.
 * Process B allocates `L` bytes and attempts to read "server.log".
 * "server.log" no longer exists, and `read_file` fails. The allocated
  memory is returned back to the system.

 This can put some churn on the memory allocator, which might lead to
 performance problems.

 The parent directories can also be removed or renamed.
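
 One way to narrow this particular window, sketched here under some
 assumptions, is to open the file first and ask for its size with
 `fstat(2)` on the open descriptor, so a rename between the two steps
 no longer matters. The names `read_file_fd` and `free_file_data` are
 hypothetical and not part of the original fragment, and the file can
 still be truncated while it is open; this only removes the dependence
 on the path staying put.

 ```c
 #include <fcntl.h>
 #include <stdint.h>
 #include <stdlib.h>
 #include <sys/stat.h>
 #include <unistd.h>

 typedef struct {
 	uint8_t *data;
 	size_t   length;
 } FileData;

 /* Hypothetical helper: release a FileData and its buffer. */
 void
 free_file_data(FileData *fdata)
 {
 	if (NULL != fdata) {
 		free(fdata->data);
 		free(fdata);
 	}
 }

 /*
  * Hypothetical variant of read_file: open(2) first, then fstat(2) the
  * descriptor, so the size and the bytes come from the same open file.
  */
 FileData *
 read_file_fd(const char *path)
 {
 	struct stat	 sb;
 	FileData	*fdata = NULL;
 	ssize_t		 n;
 	int		 fd;

 	if (-1 == (fd = open(path, O_RDONLY))) {
 		return NULL;
 	}

 	if (-1 == fstat(fd, &sb) ||
 	    NULL == (fdata = calloc(1, sizeof(FileData)))) {
 		close(fd);
 		return NULL;
 	}

 	fdata->length = (size_t)sb.st_size;
 	/* One spare byte so an empty file never asks for calloc(0, 1). */
 	fdata->data = calloc(fdata->length + 1, 1);
 	if (NULL == fdata->data) {
 		close(fd);
 		free(fdata);
 		return NULL;
 	}

 	/* Like the original, treat anything but a full read as a failure. */
 	n = read(fd, fdata->data, fdata->length);
 	close(fd);
 	if (n < 0 || (size_t)n != fdata->length) {
 		free_file_data(fdata);
 		return NULL;
 	}

 	return fdata;
 }
 ```

 In the log rotation scenario, process B would then read the renamed
 "server.log.1" rather than fail, since the descriptor follows the
 file and not its name.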

 ##### EACCES

 A file's (or its parent's) permissions can change during the course of
 its lifetime; while it's been rarer, in my experience, it might make
 for an interesting debugging session.

 ##### ENOMEM

 Linux malloc [can never fail](http://scvalex.net/posts/6/), except
 when it can. Usually, it won't be the malloc itself that fails, but
 the effects of the memory pressure will be felt elsewhere in the system
 causing turbulence like heavy paging or OOM kills. Memory pressure can
 also affect scheduling.

 ##### EOF/FERROR

 The most common case where the entire file can't be read into memory is
 if it has been truncated. In the example for `ENOENT`, imagine that
 process A manages to create a new "server.log" before process B resumes
 execution. In this case, it expects to read `L` bytes, but "server.log"
 is now `L'` bytes. The `fread(3)` doesn't know to expect a smaller file.
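
 A reader that does not trust an earlier size at all is one way around
 this. The following is a minimal sketch, assuming a hypothetical helper
 named `read_stream` (not part of the original fragment) that reads
 until `fread(3)` comes up short and then uses `ferror(3)` to tell a
 real error apart from end of file.

 ```c
 #include <stdint.h>
 #include <stdio.h>
 #include <stdlib.h>

 /*
  * Hypothetical helper: read a stream to EOF, growing the buffer as
  * needed instead of relying on a size obtained earlier from stat(2).
  * Returns a malloc'd buffer and stores its length in *length, or
  * returns NULL on a read or allocation error.
  */
 uint8_t *
 read_stream(FILE *file, size_t *length)
 {
 	size_t	 cap = 4096;
 	size_t	 used = 0;
 	uint8_t	*buf = malloc(cap);
 	uint8_t	*tmp = NULL;

 	if (NULL == buf) {
 		return NULL;
 	}

 	for (;;) {
 		used += fread(buf + used, 1, cap - used, file);

 		if (used < cap) {
 			/* Short read: either end of file or an error. */
 			if (ferror(file)) {
 				free(buf);
 				return NULL;
 			}
 			break;
 		}

 		/* Buffer is full: double it and keep reading. */
 		if (cap > SIZE_MAX / 2 ||
 		    NULL == (tmp = realloc(buf, cap * 2))) {
 			free(buf);
 			return NULL;
 		}
 		buf = tmp;
 		cap *= 2;
 	}

 	*length = used;
 	return buf;
 }
 ```

 The caller learns how many bytes were actually there when the read
 happened, which may be `L`, `L'`, or something else entirely.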

 #### Other failure modes

 There are other ways the disk can fail: hardware failures or filesystem
 corruption, for example. If the filesystem is a network filesystem,
 all the failure modes of a network enter the mix as well. Some of
 these calls will fail with `ENAMETOOLONG` if the path name is too long,
 or with `EOVERFLOW` if the program expects 32-bit file offsets
 (i.e. `-D_FILE_OFFSET_BITS=32`) and the file is too large for them to
 describe.

 #### Purity

 Pure functions are those that do not rely on the outside world for
 their answer; they do not rely on side effects or some state. In Haskell,
 pure functions are the default, and impure functions (such as those that
 access the disk) must be handled distinctly. The main function of every
 program is wrapped in an IO pipeline; pure functions can split off from
 this and operate on data, but they must always return to the IO
 pipeline. This means that interactions with the outside world are always
 marked as impure, and require special handling. There are ways to
 circumvent this, but they require explicitly doing so and are frowned
 upon. Furthermore, adding type annotations to mark where it's appropriate
 to handle impure interactions and explicitly marking the pure code paths
 allows one to arguably better reason about the behaviour of their code.

 #### Haskell example

 The following code sample actually uses two layers of monads to mark
 the code paths. 

 The function operates on a data structure similar to the `FileData`
 structure in the C fragment above.

 > data FileData = FileData String Integer

 The data structure will be showable in the REPL, but I'd rather not see
 all the file's contents when I see the file.

 > instance Show FileData where
 >     show (FileData _ l) = "file of " ++ (show l) ++ " bytes"

 Here is the Haskell `read_file`: it takes a file path and returns
 an `IO (Either IOErrorType FileData)`. What is an `IO (Either IOErrorType
 FileData)`? The `IO` part marks the output as being part of the `IO`
 monad; it is in a pipeline of impure code that interacts with the outside
 world. Any function that operates on the result of this function must
 be prepared to handle such code. The `Either IOErrorType FileData` monad
 inside the `IO` pipeline means that the result of this function is a
 value that might be either an `IOErrorType` or a `FileData`. Functions
 that handle the contents of the `IO` pipeline should be prepared to
 handle the error case as well as actual data.

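 > read_file :: FilePath -> IO (Either IOErrorType FileData)
 > read_file path =
 >     catchIOError (hf path) exHandler
 >     where hf p = do
 >               -- First open: ask for the size, mirroring stat(2).
 >               sizeHandle <- openFile p ReadMode
 >               fileSize   <- hFileSize sizeHandle
 >               hClose sizeHandle
 >               -- Second open: read the contents, mirroring fopen(3).
 >               dataHandle <- openFile p ReadMode
 >               fileData   <- hGetContents dataHandle
 >               -- hGetContents is lazy; force the whole string before
 >               -- closing, or the unread remainder would be lost.
 >               length fileData `seq` hClose dataHandle
 >               return $ Right (FileData fileData fileSize)
 >           exHandler e = return $ Left (ioeGetErrorType e)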

 Unlike the C version, the error information is returned directly as part
 of the result, rather than being extracted from `errno` after receiving
 a failure (which is idiomatic in C).
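
 For comparison, the `errno` idiom on the C side looks roughly like the
 sketch below. The wrapper name `open_checked` is hypothetical; the
 point it illustrates is that `errno` has to be captured right after
 the failing call, because later library calls are free to overwrite it.

 ```c
 #include <errno.h>
 #include <stdio.h>

 /*
  * Hypothetical wrapper: returns the opened stream, or NULL with the
  * errno value from fopen(3) saved in *err before any later call has
  * a chance to change it.
  */
 FILE *
 open_checked(const char *path, int *err)
 {
 	FILE	*file;

 	errno = 0;
 	file = fopen(path, "r");
 	*err = (NULL == file) ? errno : 0;
 	return file;
 }
 ```

 Nothing in the C type system makes the caller look at `err`, whereas
 the `Either` in the Haskell signature has to be unpacked before the
 `FileData` can be used.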

 With this embedded C background, I'm coming to like this
 explicitness about the world my programs operate in.