duglin · October 7, 2015 14:29
diff --git a/Errors Explained b/Errors Explained
 tl;dr:
 - Each unique error message that flows on the wire should use the error processing package.
 - This means each unique error message will have a unique ID associated with it.
 - Errors generated by libraries, non-Docker-specific code or that are not seen on the wire
  can/should continue to use the standard go-lang error mechanisms.
  Meaning: strings, types, etc... e.g. https://golang.org/src/os/file.go - line 84
 - Each error, defined using the error processing package, will be self-describing. Which includes:
    - unique ID
    - error text
    - descriptive text
    - HTTP status code (optional if default is ok)


 Issues I'm trying to address:
 - trying to define patterns for handling error messages - for both internal and external.
 - strcmp to determine which error we have is bad - for example:
  https://github.com/docker/docker/blob/master/api/server/httputils/httputils.go#L147
 - strsub to determine which error we have is bad - for example:
  https://github.com/docker/docker/pull/14012/files#diff-07773483549b2d3c003c53614e1e2ab2R45
 - no programmatic (consistent) way (either in docker or by user) to determine what error we have.
 - mechanism by which we determine the type of HTTP status code for errors is brittle due to above.
 - we have no mechanism to differentiate between daemon issues/errors and user errors. By this I mean,
  when we see ERRORs in the deamon logs they're not really daemon errors, most are due to bad user
  input/configuration which can be logged but should be logged differently than daemon errors. We
  should only show ERRORs in the daemon log for things that an admin (or Docker dev) needs to
  investigate.
 - use patterns in the code base that work for auto-generation of the documentation for our errors
  along with the APIs.

 Guiding thoughts:
 - this is mainly an internal change so the UX should not change as a result of simply using a
  new error processing package.
 - if the UX does change it needs to be because its better for the end-user not for some
  internal processing reasons.
 - I'm trying to make life as easy a possible on the end-user, the API user, the Docker dev and
  Docker doc dev. If so-called "go best practices" conflict with those goals, then UX wins
  because our choice of programming language should not impact the UX (for the most part).

 Parts of solution:
 - associate a unique ID per unique error message/condition
 - associate decriptive text for each error condition - for doc generation
 - remove strcmp/substr of error messages text - which is brittle
  - https://github.com/docker/docker/blob/master/api/server/httputils/httputils.go#L138
  - eventually on cli too:
    https://github.com/docker/docker/pull/14012/files#diff-07773483549b2d3c003c53614e1e2ab2R48
 - be able to associate an HTTP status code on a per error message/condition basis for error
  messages returned over HTTP

 My Design Thoughts (insight into my thought process - scary!):
 The current engine code, not in all cases but most cases, does something minimal like:
    func doFoo() error {
      ...
      return fmt.Errorf( "Can't do XXX because YYY" )
    }
 and this `error` is bubbled up to the http layer.  I think this model is actually very nice for
 a couple of reasons:
 1 - the amount of code the docker dev needs to write is minimal
 2 - the dev, for the most part, only need to be concerned with the logic of determining if there's
    an error. Once they've done that they just return the error and everything else above
    them in the stack acts appropriately based on how they want to process the thrown error.
 3 - it uses standard Go error.Error() logic to get the error text. This means that almost all of
    the code in the engine doesn't need to do anything special with these errors - and normal
    tooling (e.g. logrus) just works. The only exception to this is that the http layer will need
    to have some smarts to serialize the errors onto the wire but ideally that should be written
    in a generic way and not require specialized code each time a new error is created. I think
    asking the Docker dev to touch the http layer (or any other layer) each time a new error is
    created is: 1) asking too much, 2) asking them to forget to modify the other layers and things
    will get missed and go wrong. They should have a model as close to: 1) detect the error,
    2) return the error. Nothing else.

 By default we return HTTP 500 in error cases. We, however, have special logic to return something
 else in cetain cases, for example:
 - https://github.com/docker/docker/blob/master/api/server/httputils/httputils.go#L138

 Doing strcmp is very brittle as the error text may change over time. A better solution is to check
 for some kind of error ID. This means that each error condition should be uniquely identified
 (have its own ID).

 In order to associate a unique ID with each error message/condition doing `fmt.Errorf` isn't
 sufficient, we need to define each error along with some ID - e.g.:
    struct {
      string ID
      string errorMessage
    }

 This means that instead of:
    return fmt.Errorf( "Can't do XXX because YYY" )
 the docker dev would write something like:
    return CANT_XXX_YYY
 where CANT_XXX_YYY is defined using the above struct.  While this is a little more work than the
 simple fmt.Errorf(), its not that burdensome given the issues we're trying to address. Plus,
 its basically the same thing as having well defined error types.

 While this change so far seems trivial, its then worth explaining why there are other bits of
 data needed in the error's struct:
 - Description: Often the error text isn't really descriptive enough to help newbies. Experts will
  can get by with our terse error messages, but often (and for docs) we need more. I think its 
  useful to include some text that can go into more detail about the error message and error
  condition under which this error is generated. Keeping it with the error definition will help
  ensure that its always in-sync and will help avoid the case where we document an error that
  has been removed from the code.
  
 - HTTP Status Code: Having the status code in the same struct as the error itself has some
  benefits. It allows the error to be self-describing. This means that as devs add new errors
  they don't need to "remember" to modify the HTTP layer with additional logic to ensure the
  error returns the correct status code. 
  
  Additionally, it keeps a clean separation between the HTTP layer and the engine. Some have
  suggested that its better to keep HTTP logic out of the engine, and in general I agree.
  However, there needs to be a balance. If we let each error be self-describing then we don't
  need to make code modifications in two locations for each error - we just need to do it at
  the location where the error happens. Also, it means that the HTTP layer can be reused more
  easily. There's work underway to make the HTTP layer reusable between the engine and swarm.
  Unless we want the HTTP layer to be forced to know about all possible engine and swarm errors
  we need to have a way for the HTTP layer to return the appropriate status code w/o error
  specific logic. The best way to do this is to have the error itself contain that knowledge. 
  
  Nor do I think it would be right, or even possible, to define all possible error conditions
  as part of the interface between the HTTP layer and the implementation of the APIs. The set
  of errors returned by swarm will probably very different (or possibly a super-set) of what
  the engine would return. Each needs the freedom to define their own set of errors based on
  their (and their end-user's) needs w/o having to change a common component each time.

 So, that then leaves us with something like:
    struct {
      string ID
      string errorMessage
      string description
      int    statusCode
    }

 Bigger than just a fmt.Errorf() but statusCode could be optional, defaulting to whatever we want,
 and description is just good hygene - like adding comments to the code. Something we that our devs
 should do anyway - and I'm sure our doc folks would appreciate.

 Other Topics:
 =============
 - Some have suggested that we should let each engine package define their own errors and then the
  caller should wrapper, or map, those into another error. The problem with this is that you then
  lose the ability to uniquely identify each error condition. If more than one error is mapped or
  wrapped into a more general error then there's no way to know which exact error condition we 
  have w/o the use of substr and/or strcmp - which means we're back to where we started. 
  For example, see:
  https://github.com/docker/docker/pull/14012/files#diff-07773483549b2d3c003c53614e1e2ab2R45
  where they first use substr to strip off the "wrapper" part and then are forced to use
  strcmp on the rest to know which exact error they had.

  Aside from the need to use strcmp/substr, this wrappering adds little, to no, value.
  Take that case, again, as an example. Adding "Error response from daemon:" to the
  error message doesn't really help anyone at all. Its the stuff that comes next that's
  going to help the most as its the most specific.  So, if we want to allow for programmatic
  discovery of the error conditions then we need to let the unique error ID for each condition
  bubble all the up to the code that needs it (either docker itself, or the CLI/API user).

 - No wrapper at all? Not true. I do think we need to wrap errors from libraries that we have
  no control over - or that we expect to be used by other projects (like the ones in our
  'pkg' dir, or that a truly generic like perhaps in our 'utils' dir). In those cases since
  the error we get back will not be Docker specific it
  makes sense to wrapper those errors with Docker specific text and context. For example,
  while removing an image if the OS can't find a file and we get back os.ErrNotExist from
  a golang library call, then we should wrap that with a graph/Docker specific error
  so that the error message is more meaningful than just "file does not exist".

 - Why do we need to programmatically determine the error? Some cases:
  - Programmatic error recovery. Its not clear how many cases fall into this category
    but that's not for us to decide. We should allow smart devs to try if they want.
    But there are some cases that I know do exist. For example, when you try to delete
    a container but its still running we'll return:
         Conflict, You cannot remove a running container. Stop the container before
         attempting removal or use -f
    This is fully automatable by some CLI UI then prompt the user if they really want
    to delete it and then it can try again with the -f flag.
  - We've seen cases (mentioned in this page) where the docker code itself needs to
    take some action based on the type of error. I think:
    https://github.com/docker/docker/pull/14012/files#diff-07773483549b2d3c003c53614e1e2ab2R45
    is a good example. 

 - Why are we putting all errors into /errors instaed of into each engine package?
  This can easily be changed and we can certainly define each error within the package
  that generates it.  However, there were two reasons for why they were originally put into
  one dir (but each package having its own file):
  1 - it makes it easier for to import errors.
      If we spread them around the engine then not only will CLIs that want access to
      our error definitions need to have lots of imports, but they'll have to change them
      each time we move around our packaging. I don't see why we should force them to change
      their code just because we do an internal restructuring.
  2 - it makes life easier for the docker doc devs. If all of the descriptive text for the
      errors are in one spot then they can more easily review and manage them. Like the previous
      bullet, if things are spread out across the entire code base then while we might be able
      to write tooling to help, I think it becomes less of a burden if they're all in one spot.
  But this is a minor point and we can move the errors to any place in the tree that people want.

 - Ealier I mentioned that the daemon's log shows too many user-generated errors. Once we have
  all of this in place we should be able make sure that all errors that are due to bad user
  input returns 4xx status codes and are not logged in the daemon's logs - or at least aren't
  shown as "ERROR". Then we can reserve 5xx status codes for cases where the daemon got into
  a bad state and the admin (or a docker dev) really should investigate. These 5xx cases should
  be logged as "ERROR".

 - Eventually I do think we should return the error ID back to the CLI. Ideally, all error
  messages from the daemon should be in some json format, that's a major API change. In the
  short term we can simply encode the error ID in an HTTP header - that should be backwards
  compatible but still make the data available for those that want to leverage it.

 - The Distrubtion errocode package wasn't designed for all of this.
  True and originally I had an engine specific error code package but in the interest in trying
  to reuse code we tried to merge them. If the conceptual differences between what the registry
  wants to do and what the engine needs are too different then let's not try to reuse the code.
  But, in all honesty, there isn't much of a difference and I think the biggest issue is more 
  of a matter of perception than anything else - and accepting the usecases listed in this doc.

 - There have been some disagreement over modifying the existing errcode package to better suit
  the engine's needs. I'm listing the proposed changes here with some explainations for why I
  think they are needed:
  1 - substitution args need to be kept separate from the message.
      This will allow for code to easily extract those values w/o trying to parse the error
      message itself - which is risky and error prone. If the variables are simply kept in a
      slice then they can be easily retrieved via APIs.  However, this is an advance API and
      the err.Error() methods should always just return the error message post substitution.
      
  2 - err.Error() needs to always return the same thing. We have cases now where based on the 
      type of error you have err.Error() will return very differnt things. Sometimes it'll
      return the text of error, sometime it'll return just the ID of the error and other times
      it returns the ID + the text.  As mentioned above, err.Error() should behave as it does
      today for all errors so that we can avoid doing hacks like:
        https://github.com/docker/docker/blob/master/daemon/start.go#L48
      where we have to call GetErrorMessage() - it should just be: err.Error() in all cases.
      Some users of the error package may want different output from Error() which is
      why I think the return value of Error() should be configurable based on what the
      code using the package needs (not on a per error basis).
  
  - Some people have expressed concern about the number of errors we're defining. Well, yes
    there are a lot but that's because we have a lot of errors in the code today. If we assume
    no UX change then yes we'll have one error object for each unique error text we have today.
    I don't see the issue with that. If people think that's too many error objects then that
    implies there are too many error messages and they should issue PRs to reduce the number
    of errors generated, but that's totally independent of this exercise. All this is doing is
    formalizing the errors we're generating, not changing the number of them.
	tl;dr:
	- Each unique error message that flows on the wire should use the error processing package.
	- This means each unique error message will have a unique ID associated with it.
	- Errors generated by libraries, non-Docker-specific code or that are not seen on the wire
	can/should continue to use the standard go-lang error mechanisms.
	Meaning: strings, types, etc... e.g. https://golang.org/src/os/file.go - line 84
	- Each error, defined using the error processing package, will be self-describing. Which includes:
	- unique ID
	- error text
	- descriptive text
	- HTTP status code (optional if default is ok)


	Issues I'm trying to address:
	- trying to define patterns for handling error messages - for both internal and external.
	- strcmp to determine which error we have is bad - for example:
	https://github.com/docker/docker/blob/master/api/server/httputils/httputils.go#L147
	- strsub to determine which error we have is bad - for example:
	https://github.com/docker/docker/pull/14012/files#diff-07773483549b2d3c003c53614e1e2ab2R45
	- no programmatic (consistent) way (either in docker or by user) to determine what error we have.
	- mechanism by which we determine the type of HTTP status code for errors is brittle due to above.
	- we have no mechanism to differentiate between daemon issues/errors and user errors. By this I mean,
	when we see ERRORs in the deamon logs they're not really daemon errors, most are due to bad user
	input/configuration which can be logged but should be logged differently than daemon errors. We
	should only show ERRORs in the daemon log for things that an admin (or Docker dev) needs to
	investigate.
	- use patterns in the code base that work for auto-generation of the documentation for our errors
	along with the APIs.

	Guiding thoughts:
	- this is mainly an internal change so the UX should not change as a result of simply using a
	new error processing package.
	- if the UX does change it needs to be because its better for the end-user not for some
	internal processing reasons.
	- I'm trying to make life as easy a possible on the end-user, the API user, the Docker dev and
	Docker doc dev. If so-called "go best practices" conflict with those goals, then UX wins
	because our choice of programming language should not impact the UX (for the most part).

	Parts of solution:
	- associate a unique ID per unique error message/condition
	- associate decriptive text for each error condition - for doc generation
	- remove strcmp/substr of error messages text - which is brittle
	- https://github.com/docker/docker/blob/master/api/server/httputils/httputils.go#L138
	- eventually on cli too:
	https://github.com/docker/docker/pull/14012/files#diff-07773483549b2d3c003c53614e1e2ab2R48
	- be able to associate an HTTP status code on a per error message/condition basis for error
	messages returned over HTTP

	My Design Thoughts (insight into my thought process - scary!):
	The current engine code, not in all cases but most cases, does something minimal like:
	func doFoo() error {
	...
	return fmt.Errorf( "Can't do XXX because YYY" )
	}
	and this `error` is bubbled up to the http layer. I think this model is actually very nice for
	a couple of reasons:
	1 - the amount of code the docker dev needs to write is minimal
	2 - the dev, for the most part, only need to be concerned with the logic of determining if there's
	an error. Once they've done that they just return the error and everything else above
	them in the stack acts appropriately based on how they want to process the thrown error.
	3 - it uses standard Go error.Error() logic to get the error text. This means that almost all of
	the code in the engine doesn't need to do anything special with these errors - and normal
	tooling (e.g. logrus) just works. The only exception to this is that the http layer will need
	to have some smarts to serialize the errors onto the wire but ideally that should be written
	in a generic way and not require specialized code each time a new error is created. I think
	asking the Docker dev to touch the http layer (or any other layer) each time a new error is
	created is: 1) asking too much, 2) asking them to forget to modify the other layers and things
	will get missed and go wrong. They should have a model as close to: 1) detect the error,
	2) return the error. Nothing else.

	By default we return HTTP 500 in error cases. We, however, have special logic to return something
	else in cetain cases, for example:
	- https://github.com/docker/docker/blob/master/api/server/httputils/httputils.go#L138

	Doing strcmp is very brittle as the error text may change over time. A better solution is to check
	for some kind of error ID. This means that each error condition should be uniquely identified
	(have its own ID).

	In order to associate a unique ID with each error message/condition doing `fmt.Errorf` isn't
	sufficient, we need to define each error along with some ID - e.g.:
	struct {
	string ID
	string errorMessage
	}

	This means that instead of:
	return fmt.Errorf( "Can't do XXX because YYY" )
	the docker dev would write something like:
	return CANT_XXX_YYY
	where CANT_XXX_YYY is defined using the above struct. While this is a little more work than the
	simple fmt.Errorf(), its not that burdensome given the issues we're trying to address. Plus,
	its basically the same thing as having well defined error types.

	While this change so far seems trivial, its then worth explaining why there are other bits of
	data needed in the error's struct:
	- Description: Often the error text isn't really descriptive enough to help newbies. Experts will
	can get by with our terse error messages, but often (and for docs) we need more. I think its
	useful to include some text that can go into more detail about the error message and error
	condition under which this error is generated. Keeping it with the error definition will help
	ensure that its always in-sync and will help avoid the case where we document an error that
	has been removed from the code.

	- HTTP Status Code: Having the status code in the same struct as the error itself has some
	benefits. It allows the error to be self-describing. This means that as devs add new errors
	they don't need to "remember" to modify the HTTP layer with additional logic to ensure the
	error returns the correct status code.

	Additionally, it keeps a clean separation between the HTTP layer and the engine. Some have
	suggested that its better to keep HTTP logic out of the engine, and in general I agree.
	However, there needs to be a balance. If we let each error be self-describing then we don't
	need to make code modifications in two locations for each error - we just need to do it at
	the location where the error happens. Also, it means that the HTTP layer can be reused more
	easily. There's work underway to make the HTTP layer reusable between the engine and swarm.
	Unless we want the HTTP layer to be forced to know about all possible engine and swarm errors
	we need to have a way for the HTTP layer to return the appropriate status code w/o error
	specific logic. The best way to do this is to have the error itself contain that knowledge.

	Nor do I think it would be right, or even possible, to define all possible error conditions
	as part of the interface between the HTTP layer and the implementation of the APIs. The set
	of errors returned by swarm will probably very different (or possibly a super-set) of what
	the engine would return. Each needs the freedom to define their own set of errors based on
	their (and their end-user's) needs w/o having to change a common component each time.

	So, that then leaves us with something like:
	struct {
	string ID
	string errorMessage
	string description
	int statusCode
	}

	Bigger than just a fmt.Errorf() but statusCode could be optional, defaulting to whatever we want,
	and description is just good hygene - like adding comments to the code. Something we that our devs
	should do anyway - and I'm sure our doc folks would appreciate.

	Other Topics:
	=============
	- Some have suggested that we should let each engine package define their own errors and then the
	caller should wrapper, or map, those into another error. The problem with this is that you then
	lose the ability to uniquely identify each error condition. If more than one error is mapped or
	wrapped into a more general error then there's no way to know which exact error condition we
	have w/o the use of substr and/or strcmp - which means we're back to where we started.
	For example, see:
	https://github.com/docker/docker/pull/14012/files#diff-07773483549b2d3c003c53614e1e2ab2R45
	where they first use substr to strip off the "wrapper" part and then are forced to use
	strcmp on the rest to know which exact error they had.

	Aside from the need to use strcmp/substr, this wrappering adds little, to no, value.
	Take that case, again, as an example. Adding "Error response from daemon:" to the
	error message doesn't really help anyone at all. Its the stuff that comes next that's
	going to help the most as its the most specific. So, if we want to allow for programmatic
	discovery of the error conditions then we need to let the unique error ID for each condition
	bubble all the up to the code that needs it (either docker itself, or the CLI/API user).

	- No wrapper at all? Not true. I do think we need to wrap errors from libraries that we have
	no control over - or that we expect to be used by other projects (like the ones in our
	'pkg' dir, or that a truly generic like perhaps in our 'utils' dir). In those cases since
	the error we get back will not be Docker specific it
	makes sense to wrapper those errors with Docker specific text and context. For example,
	while removing an image if the OS can't find a file and we get back os.ErrNotExist from
	a golang library call, then we should wrap that with a graph/Docker specific error
	so that the error message is more meaningful than just "file does not exist".

	- Why do we need to programmatically determine the error? Some cases:
	- Programmatic error recovery. Its not clear how many cases fall into this category
	but that's not for us to decide. We should allow smart devs to try if they want.
	But there are some cases that I know do exist. For example, when you try to delete
	a container but its still running we'll return:
	Conflict, You cannot remove a running container. Stop the container before
	attempting removal or use -f
	This is fully automatable by some CLI UI then prompt the user if they really want
	to delete it and then it can try again with the -f flag.
	- We've seen cases (mentioned in this page) where the docker code itself needs to
	take some action based on the type of error. I think:
	https://github.com/docker/docker/pull/14012/files#diff-07773483549b2d3c003c53614e1e2ab2R45
	is a good example.

	- Why are we putting all errors into /errors instaed of into each engine package?
	This can easily be changed and we can certainly define each error within the package
	that generates it. However, there were two reasons for why they were originally put into
	one dir (but each package having its own file):
	1 - it makes it easier for to import errors.
	If we spread them around the engine then not only will CLIs that want access to
	our error definitions need to have lots of imports, but they'll have to change them
	each time we move around our packaging. I don't see why we should force them to change
	their code just because we do an internal restructuring.
	2 - it makes life easier for the docker doc devs. If all of the descriptive text for the
	errors are in one spot then they can more easily review and manage them. Like the previous
	bullet, if things are spread out across the entire code base then while we might be able
	to write tooling to help, I think it becomes less of a burden if they're all in one spot.
	But this is a minor point and we can move the errors to any place in the tree that people want.

	- Ealier I mentioned that the daemon's log shows too many user-generated errors. Once we have
	all of this in place we should be able make sure that all errors that are due to bad user
	input returns 4xx status codes and are not logged in the daemon's logs - or at least aren't
	shown as "ERROR". Then we can reserve 5xx status codes for cases where the daemon got into
	a bad state and the admin (or a docker dev) really should investigate. These 5xx cases should
	be logged as "ERROR".

	- Eventually I do think we should return the error ID back to the CLI. Ideally, all error
	messages from the daemon should be in some json format, that's a major API change. In the
	short term we can simply encode the error ID in an HTTP header - that should be backwards
	compatible but still make the data available for those that want to leverage it.

	- The Distrubtion errocode package wasn't designed for all of this.
	True and originally I had an engine specific error code package but in the interest in trying
	to reuse code we tried to merge them. If the conceptual differences between what the registry
	wants to do and what the engine needs are too different then let's not try to reuse the code.
	But, in all honesty, there isn't much of a difference and I think the biggest issue is more
	of a matter of perception than anything else - and accepting the usecases listed in this doc.

	- There have been some disagreement over modifying the existing errcode package to better suit
	the engine's needs. I'm listing the proposed changes here with some explainations for why I
	think they are needed:
	1 - substitution args need to be kept separate from the message.
	This will allow for code to easily extract those values w/o trying to parse the error
	message itself - which is risky and error prone. If the variables are simply kept in a
	slice then they can be easily retrieved via APIs. However, this is an advance API and
	the err.Error() methods should always just return the error message post substitution.

	2 - err.Error() needs to always return the same thing. We have cases now where based on the
	type of error you have err.Error() will return very differnt things. Sometimes it'll
	return the text of error, sometime it'll return just the ID of the error and other times
	it returns the ID + the text. As mentioned above, err.Error() should behave as it does
	today for all errors so that we can avoid doing hacks like:
	https://github.com/docker/docker/blob/master/daemon/start.go#L48
	where we have to call GetErrorMessage() - it should just be: err.Error() in all cases.
	Some users of the error package may want different output from Error() which is
	why I think the return value of Error() should be configurable based on what the
	code using the package needs (not on a per error basis).

	- Some people have expressed concern about the number of errors we're defining. Well, yes
	there are a lot but that's because we have a lot of errors in the code today. If we assume
	no UX change then yes we'll have one error object for each unique error text we have today.
	I don't see the issue with that. If people think that's too many error objects then that
	implies there are too many error messages and they should issue PRs to reduce the number
	of errors generated, but that's totally independent of this exercise. All this is doing is
	formalizing the errors we're generating, not changing the number of them.