Last active
October 7, 2015 14:29
-
-
Save duglin/d802724a4ff011f1df8d to your computer and use it in GitHub Desktop.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
tl;dr: | |
- Each unique error message that flows on the wire should use the error processing package. | |
- This means each unique error message will have a unique ID associated with it. | |
- Errors generated by libraries, non-Docker-specific code or that are not seen on the wire | |
can/should continue to use the standard go-lang error mechanisms. | |
Meaning: strings, types, etc... e.g. https://golang.org/src/os/file.go - line 84 | |
- Each error, defined using the error processing package, will be self-describing. Which includes: | |
- unique ID | |
- error text | |
- descriptive text | |
- HTTP status code (optional if default is ok) | |
Issues I'm trying to address: | |
- trying to define patterns for handling error messages - for both internal and external. | |
- strcmp to determine which error we have is bad - for example: | |
https://github.com/docker/docker/blob/master/api/server/httputils/httputils.go#L147 | |
- strsub to determine which error we have is bad - for example: | |
https://github.com/docker/docker/pull/14012/files#diff-07773483549b2d3c003c53614e1e2ab2R45 | |
- no programmatic (consistent) way (either in docker or by user) to determine what error we have. | |
- mechanism by which we determine the type of HTTP status code for errors is brittle due to above. | |
- we have no mechanism to differentiate between daemon issues/errors and user errors. By this I mean, | |
when we see ERRORs in the deamon logs they're not really daemon errors, most are due to bad user | |
input/configuration which can be logged but should be logged differently than daemon errors. We | |
should only show ERRORs in the daemon log for things that an admin (or Docker dev) needs to | |
investigate. | |
- use patterns in the code base that work for auto-generation of the documentation for our errors | |
along with the APIs. | |
Guiding thoughts: | |
- this is mainly an internal change so the UX should not change as a result of simply using a | |
new error processing package. | |
- if the UX does change it needs to be because its better for the end-user not for some | |
internal processing reasons. | |
- I'm trying to make life as easy a possible on the end-user, the API user, the Docker dev and | |
Docker doc dev. If so-called "go best practices" conflict with those goals, then UX wins | |
because our choice of programming language should not impact the UX (for the most part). | |
Parts of solution: | |
- associate a unique ID per unique error message/condition | |
- associate decriptive text for each error condition - for doc generation | |
- remove strcmp/substr of error messages text - which is brittle | |
- https://github.com/docker/docker/blob/master/api/server/httputils/httputils.go#L138 | |
- eventually on cli too: | |
https://github.com/docker/docker/pull/14012/files#diff-07773483549b2d3c003c53614e1e2ab2R48 | |
- be able to associate an HTTP status code on a per error message/condition basis for error | |
messages returned over HTTP | |
My Design Thoughts (insight into my thought process - scary!): | |
The current engine code, not in all cases but most cases, does something minimal like: | |
func doFoo() error { | |
... | |
return fmt.Errorf( "Can't do XXX because YYY" ) | |
} | |
and this `error` is bubbled up to the http layer. I think this model is actually very nice for | |
a couple of reasons: | |
1 - the amount of code the docker dev needs to write is minimal | |
2 - the dev, for the most part, only need to be concerned with the logic of determining if there's | |
an error. Once they've done that they just return the error and everything else above | |
them in the stack acts appropriately based on how they want to process the thrown error. | |
3 - it uses standard Go error.Error() logic to get the error text. This means that almost all of | |
the code in the engine doesn't need to do anything special with these errors - and normal | |
tooling (e.g. logrus) just works. The only exception to this is that the http layer will need | |
to have some smarts to serialize the errors onto the wire but ideally that should be written | |
in a generic way and not require specialized code each time a new error is created. I think | |
asking the Docker dev to touch the http layer (or any other layer) each time a new error is | |
created is: 1) asking too much, 2) asking them to forget to modify the other layers and things | |
will get missed and go wrong. They should have a model as close to: 1) detect the error, | |
2) return the error. Nothing else. | |
By default we return HTTP 500 in error cases. We, however, have special logic to return something | |
else in cetain cases, for example: | |
- https://github.com/docker/docker/blob/master/api/server/httputils/httputils.go#L138 | |
Doing strcmp is very brittle as the error text may change over time. A better solution is to check | |
for some kind of error ID. This means that each error condition should be uniquely identified | |
(have its own ID). | |
In order to associate a unique ID with each error message/condition doing `fmt.Errorf` isn't | |
sufficient, we need to define each error along with some ID - e.g.: | |
struct { | |
string ID | |
string errorMessage | |
} | |
This means that instead of: | |
return fmt.Errorf( "Can't do XXX because YYY" ) | |
the docker dev would write something like: | |
return CANT_XXX_YYY | |
where CANT_XXX_YYY is defined using the above struct. While this is a little more work than the | |
simple fmt.Errorf(), its not that burdensome given the issues we're trying to address. Plus, | |
its basically the same thing as having well defined error types. | |
While this change so far seems trivial, its then worth explaining why there are other bits of | |
data needed in the error's struct: | |
- Description: Often the error text isn't really descriptive enough to help newbies. Experts will | |
can get by with our terse error messages, but often (and for docs) we need more. I think its | |
useful to include some text that can go into more detail about the error message and error | |
condition under which this error is generated. Keeping it with the error definition will help | |
ensure that its always in-sync and will help avoid the case where we document an error that | |
has been removed from the code. | |
- HTTP Status Code: Having the status code in the same struct as the error itself has some | |
benefits. It allows the error to be self-describing. This means that as devs add new errors | |
they don't need to "remember" to modify the HTTP layer with additional logic to ensure the | |
error returns the correct status code. | |
Additionally, it keeps a clean separation between the HTTP layer and the engine. Some have | |
suggested that its better to keep HTTP logic out of the engine, and in general I agree. | |
However, there needs to be a balance. If we let each error be self-describing then we don't | |
need to make code modifications in two locations for each error - we just need to do it at | |
the location where the error happens. Also, it means that the HTTP layer can be reused more | |
easily. There's work underway to make the HTTP layer reusable between the engine and swarm. | |
Unless we want the HTTP layer to be forced to know about all possible engine and swarm errors | |
we need to have a way for the HTTP layer to return the appropriate status code w/o error | |
specific logic. The best way to do this is to have the error itself contain that knowledge. | |
Nor do I think it would be right, or even possible, to define all possible error conditions | |
as part of the interface between the HTTP layer and the implementation of the APIs. The set | |
of errors returned by swarm will probably very different (or possibly a super-set) of what | |
the engine would return. Each needs the freedom to define their own set of errors based on | |
their (and their end-user's) needs w/o having to change a common component each time. | |
So, that then leaves us with something like: | |
struct { | |
string ID | |
string errorMessage | |
string description | |
int statusCode | |
} | |
Bigger than just a fmt.Errorf() but statusCode could be optional, defaulting to whatever we want, | |
and description is just good hygene - like adding comments to the code. Something we that our devs | |
should do anyway - and I'm sure our doc folks would appreciate. | |
Other Topics: | |
============= | |
- Some have suggested that we should let each engine package define their own errors and then the | |
caller should wrapper, or map, those into another error. The problem with this is that you then | |
lose the ability to uniquely identify each error condition. If more than one error is mapped or | |
wrapped into a more general error then there's no way to know which exact error condition we | |
have w/o the use of substr and/or strcmp - which means we're back to where we started. | |
For example, see: | |
https://github.com/docker/docker/pull/14012/files#diff-07773483549b2d3c003c53614e1e2ab2R45 | |
where they first use substr to strip off the "wrapper" part and then are forced to use | |
strcmp on the rest to know which exact error they had. | |
Aside from the need to use strcmp/substr, this wrappering adds little, to no, value. | |
Take that case, again, as an example. Adding "Error response from daemon:" to the | |
error message doesn't really help anyone at all. Its the stuff that comes next that's | |
going to help the most as its the most specific. So, if we want to allow for programmatic | |
discovery of the error conditions then we need to let the unique error ID for each condition | |
bubble all the up to the code that needs it (either docker itself, or the CLI/API user). | |
- No wrapper at all? Not true. I do think we need to wrap errors from libraries that we have | |
no control over - or that we expect to be used by other projects (like the ones in our | |
'pkg' dir, or that a truly generic like perhaps in our 'utils' dir). In those cases since | |
the error we get back will not be Docker specific it | |
makes sense to wrapper those errors with Docker specific text and context. For example, | |
while removing an image if the OS can't find a file and we get back os.ErrNotExist from | |
a golang library call, then we should wrap that with a graph/Docker specific error | |
so that the error message is more meaningful than just "file does not exist". | |
- Why do we need to programmatically determine the error? Some cases: | |
- Programmatic error recovery. Its not clear how many cases fall into this category | |
but that's not for us to decide. We should allow smart devs to try if they want. | |
But there are some cases that I know do exist. For example, when you try to delete | |
a container but its still running we'll return: | |
Conflict, You cannot remove a running container. Stop the container before | |
attempting removal or use -f | |
This is fully automatable by some CLI UI then prompt the user if they really want | |
to delete it and then it can try again with the -f flag. | |
- We've seen cases (mentioned in this page) where the docker code itself needs to | |
take some action based on the type of error. I think: | |
https://github.com/docker/docker/pull/14012/files#diff-07773483549b2d3c003c53614e1e2ab2R45 | |
is a good example. | |
- Why are we putting all errors into /errors instaed of into each engine package? | |
This can easily be changed and we can certainly define each error within the package | |
that generates it. However, there were two reasons for why they were originally put into | |
one dir (but each package having its own file): | |
1 - it makes it easier for to import errors. | |
If we spread them around the engine then not only will CLIs that want access to | |
our error definitions need to have lots of imports, but they'll have to change them | |
each time we move around our packaging. I don't see why we should force them to change | |
their code just because we do an internal restructuring. | |
2 - it makes life easier for the docker doc devs. If all of the descriptive text for the | |
errors are in one spot then they can more easily review and manage them. Like the previous | |
bullet, if things are spread out across the entire code base then while we might be able | |
to write tooling to help, I think it becomes less of a burden if they're all in one spot. | |
But this is a minor point and we can move the errors to any place in the tree that people want. | |
- Ealier I mentioned that the daemon's log shows too many user-generated errors. Once we have | |
all of this in place we should be able make sure that all errors that are due to bad user | |
input returns 4xx status codes and are not logged in the daemon's logs - or at least aren't | |
shown as "ERROR". Then we can reserve 5xx status codes for cases where the daemon got into | |
a bad state and the admin (or a docker dev) really should investigate. These 5xx cases should | |
be logged as "ERROR". | |
- Eventually I do think we should return the error ID back to the CLI. Ideally, all error | |
messages from the daemon should be in some json format, that's a major API change. In the | |
short term we can simply encode the error ID in an HTTP header - that should be backwards | |
compatible but still make the data available for those that want to leverage it. | |
- The Distrubtion errocode package wasn't designed for all of this. | |
True and originally I had an engine specific error code package but in the interest in trying | |
to reuse code we tried to merge them. If the conceptual differences between what the registry | |
wants to do and what the engine needs are too different then let's not try to reuse the code. | |
But, in all honesty, there isn't much of a difference and I think the biggest issue is more | |
of a matter of perception than anything else - and accepting the usecases listed in this doc. | |
- There have been some disagreement over modifying the existing errcode package to better suit | |
the engine's needs. I'm listing the proposed changes here with some explainations for why I | |
think they are needed: | |
1 - substitution args need to be kept separate from the message. | |
This will allow for code to easily extract those values w/o trying to parse the error | |
message itself - which is risky and error prone. If the variables are simply kept in a | |
slice then they can be easily retrieved via APIs. However, this is an advance API and | |
the err.Error() methods should always just return the error message post substitution. | |
2 - err.Error() needs to always return the same thing. We have cases now where based on the | |
type of error you have err.Error() will return very differnt things. Sometimes it'll | |
return the text of error, sometime it'll return just the ID of the error and other times | |
it returns the ID + the text. As mentioned above, err.Error() should behave as it does | |
today for all errors so that we can avoid doing hacks like: | |
https://github.com/docker/docker/blob/master/daemon/start.go#L48 | |
where we have to call GetErrorMessage() - it should just be: err.Error() in all cases. | |
Some users of the error package may want different output from Error() which is | |
why I think the return value of Error() should be configurable based on what the | |
code using the package needs (not on a per error basis). | |
- Some people have expressed concern about the number of errors we're defining. Well, yes | |
there are a lot but that's because we have a lot of errors in the code today. If we assume | |
no UX change then yes we'll have one error object for each unique error text we have today. | |
I don't see the issue with that. If people think that's too many error objects then that | |
implies there are too many error messages and they should issue PRs to reduce the number | |
of errors generated, but that's totally independent of this exercise. All this is doing is | |
formalizing the errors we're generating, not changing the number of them. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment