
@dannycoates
Last active August 29, 2015 14:04
FxA Log Format

version 0.0.1-draft-1

Goals

A standard base log format that enables log messages to be easily routed and filtered, while allowing applications maximum flexibility in the data they are able to record.

  • record format
    • few required fields (just enough for coarse routing/filtering)
    • "easy" serialization and deserialization
    • easily shipped through Heka
    • support arbitrary application data
  • file (stream) format
    • newline delimited
    • "grepable" on disk

Non-Goals

  • minimizing serialized size
  • processing speed over format flexibility
  • comprehensive schemas for all applications

Open Questions

  • How close to the Heka message format should we try to stay?
    • Heka "Fields" are less flexible than JSON.
    • Heka natively uses protobufs, which are not flexible.
  • How big of a concern is routing performance?
    • Is a strict field order important?
    • Should we expect a router to parse the entire message?
  • Is it important for the message to use a single encoding?
    • For example, what if the envelope were protobuf and the payload were JSON?
    • I lean toward yes, a single encoding would be ideal.
  • How important is nested data in a log message?
    • I think it's very important.
    • Without nesting, ad hoc "flattening" is inevitable.

Record Format

Log messages are objects (in JSON terminology) containing a small list of required fields. Additional fields belong in the payload field, which may be of any JSON-compatible type.

Top Level Fields

The purpose of the top level fields is to uniquely identify each log message by application, location, time, and importance.

  • version
    • string
    • log format version (semver)
  • severity
    • number
  • type
    • string
    • designates the "type" of the payload field
    • optionally a JSON Schema URL
  • logger
    • string
    • application name
  • hostname
    • string
  • pid
    • number
    • Process ID of the logging process
  • timestamp
    • number
    • microseconds since UNIX epoch (UTC)
  • payload
    • object
  • _*
    • optional
    • string
    • additional metadata fields

These fields are strictly ordered from least to most specific. A process may choose to leave any value unparsed that it doesn't require. For example, a router may only need to parse up to the logger field in order to send the message to the correct log aggregator. It can leave the rest of the message unparsed and forward it verbatim.
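For illustration, a rough sketch of a router that exploits this ordering (assuming JSON serialization; routeLine, sinks, and the 256-byte scan window are invented for the example, not part of the spec):

const LOGGER_RE = /"logger"\s*:\s*"([^"]*)"/;

// sinks maps logger names to writable streams, plus a default.
function routeLine(line, sinks) {
  // The envelope fields come first, so a bounded scan of the line's
  // prefix is enough to find the logger value.
  const match = LOGGER_RE.exec(line.slice(0, 256));
  const logger = match ? match[1] : "unknown";
  const sink = sinks[logger] || sinks.default;
  sink.write(line + "\n"); // forward the raw line untouched
}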

The type value may be any string signaling additional context for the payload value, for applications that emit structured log messages. It may be simply "object", or it may be "FMLError:v1.0.2", or a JSON Schema URL.

_* fields are for metadata about the payload that doesn't fit into the other fields or the payload itself. Examples include _appVersion, _threadId, and _awsRegion.

Application Data Types

Application-specific payload types may be hierarchical (this is even encouraged), but their schemas are beyond the scope of this document.

Example

{
  "version":"0.0.1-draft-1",
  "severity":6,
  "type":"http://schemas.accounts.firefox.com/v1/request_summary.json",
  "logger":"fxa-auth-server",
  "hostname":"apollo",
  "timestamp": 1405915343869000,
  "payload": {
    "code":400,
    "errno":103,
    "rid":"1405915831047-56717-80",
    "path":"/v1/password/change/start",
    "lang":"en-US,en",
    "remoteAddressChain":["127.0.0.1"],
    "t":1,
    "email":"[email protected]"
  },
  "_appVersion": "0.18.0"
}

Serialization

TO BE DETERMINED

  • JSON
  • TNetStrings
  • MsgPack

Lines may be compressed. (???)

Human Readability

Serialized messages do not need to be directly human readable. A pretty printer application may be used to display messages in a human optimized format.
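For example, a minimal pretty-printer sketch (assuming newline-delimited JSON on stdin; the rendering itself is illustrative, not part of the format):

const readline = require("readline");

readline.createInterface({ input: process.stdin }).on("line", (line) => {
  let msg;
  try {
    msg = JSON.parse(line);
  } catch (e) {
    console.log(line); // not a log record; pass through
    return;
  }
  const time = new Date(msg.timestamp / 1000).toISOString(); // µs -> ms
  console.log(`${time} [${msg.severity}] ${msg.logger}@${msg.hostname}`,
    JSON.stringify(msg.payload));
});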

Grepability

Raw log files should be easily searchable from a *nix shell, somehow. (Of the candidate serializations, newline-delimited JSON is the most directly grepable.)

Potential Problems

  • arbitrarily nested objects
    • parser skipping unknown fields (+1 for tnetstrings)

Prior Art

References

@chilts

chilts commented Jul 21, 2014

My 12factor-config package pretty much does all this already, except type is a string representation of the level rather than describing the additional data. The additional data is always an object. I suspect it's not worth re-using it since this is slightly different, but there may be something we can steal from there. :)

@fmarier

fmarier commented Jul 21, 2014

Do we want microseconds or milliseconds for the timestamps? Microseconds seem like overkill to me, and everything else, like JavaScript, seems to use milliseconds since the "beginning of time".

@fmarier

fmarier commented Jul 21, 2014

Should host just be the hostname, or should it be the fully-qualified hostname?

@dannycoates
Author

Do we want microseconds or milliseconds for the timestamps?

Ha, Heka uses NANOseconds! I know JavaScript only goes to milliseconds, but I thought we might want to leave room for other super-fast languages to use them if they want. There have been a couple of rare times in my career where microsecond precision would have helped me debug something. If you think it's a foot-gun we could go back to milliseconds. Or we could put the burden on the parser to figure out the precision automatically.
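A rough sketch of that auto-detection idea (toMilliseconds is just an illustration): inferring the unit from the magnitude is unambiguous for timestamps between roughly 1973 and the year 5000, which is plenty for logs.

function toMilliseconds(ts) {
  if (ts < 1e11) return ts * 1000; // seconds
  if (ts < 1e14) return ts;        // milliseconds
  if (ts < 1e17) return ts / 1e3;  // microseconds
  return ts / 1e6;                 // nanoseconds
}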

@seanmonstar

Ohh, logging. I like logging.

level

Various logging frameworks have different ideas on levels. Should we standardize on levels, or should each application just say what level numbers they're interested in when they want data?

name

Is this the name of the application? Can't that be inferred from where the log is coming from? Some frameworks share a single logger throughout the app, and others have the concept of a named hierarchy; for instance, I name a new logger in each module, and that uses the name field on the record, such as fxa.routes.verify or fxa.db.memory.

data

It's quite common when logging something to just log a message, and possibly a couple arguments. I'm not sure how to map that to a key-value data object... Should we just never ever do that kind of logging, and always log some object instead?

@dannycoates
Author

Should we standardize on levels, or should each application just say what level numbers they're interested in when they want data?

We should pick a scheme. We ought to be able to easily query for all logs at a certain level across all apps.

Is this the name of application?

Yes, it identifies the code that is running. @chilts had a good suggestion of adding the app version also.

Can't that be inferred by where the log is coming from?

Yes, I think all of these top level fields should be set by the logger, not by the caller. These top level fields are more like an envelope for routing. The data that we as programmers set with a logger API will go into data.
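A rough sketch of what that could look like, using the pre-rename field names from the examples below (makeLogger is just an illustration, not a proposed API):

const os = require("os");

function makeLogger(name) {
  return function log(level, type, data) {
    // The caller supplies only level, type, and data;
    // the logger fills in the rest of the envelope itself.
    process.stdout.write(JSON.stringify({
      version: "0.0.1-draft-1",
      level: level,
      name: name,
      type: type,
      host: os.hostname(),
      pid: process.pid,
      timestamp: Date.now() * 1000, // microseconds
      data: data
    }) + "\n");
  };
}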

It's quite common when logging something to just log a message, and possibly a couple arguments.

Indeed. I expect the logger library would output something like this:

(these are just rough ideas)

{
  "version":"0.0.1-draft-1",
  "level":3,
  "name":"fxa-auth-server",
  "type":"string",
  "host":"apollo",
  "pid":2532,
  "timestamp": 1405915343869000,
  "data": "I got here!"
}

for this:

logger.info("I got here!")

Or maybe:

{
  "version":"0.0.1-draft-1",
  "level":5,
  "name":"fxa-auth-server",
  "type":"array",
  "host":"apollo",
  "pid":2532,
  "timestamp": 1405915343869000,
  "data": ["the foo count was %s", 7]
}

for:

logger.warn("the foo count was %d", 7)

However, now the semantics of what 7 means are harder for a machine to determine (and query later). An object with a printf string for display may be better. I like the way your intel library handles print formatting because it allows us to pretty-print without losing semantics. Maybe something like:

{
  "version":"0.0.1-draft-1",
  "level":5,
  "name":"fxa-auth-server",
  "type":"object",
  "host":"apollo",
  "pid":2532,
  "timestamp": 1405915343869000,
  "data": {
    "printf":"the foo count was %(fooCount)d",
    "fooCount":7
  }
}

for:

logger.warn("the foo count was %(fooCount)d", {fooCount: 7})

Now we can query fooCount later and also pretty print it in a log stream.
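A rough sketch of what a consumer could do with that record (the render helper is invented for the example):

function render(data) {
  // Substitute %(key)d / %(key)s placeholders from the record's own fields.
  return data.printf.replace(/%\((\w+)\)[ds]/g, (_, key) => data[key]);
}

const data = { printf: "the foo count was %(fooCount)d", fooCount: 7 };
render(data);   // "the foo count was 7" -- display form
data.fooCount;  // 7, still a number     -- query form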

named hierarchy

I imagine this would fit more into type, but I'm not sure. Logging hierarchy is also commonly used to "curry" extra data into a log function for convenience, so one doesn't have to repeatedly add requestId (for example) on every log call. If not type, then perhaps anything that does this sort of thing will implicitly be an object type, and each level would add something to that object somehow 👋 ✨.

@dannycoates
Author

I wonder if having type limited to a single string is too restrictive? My thinking was that a log router that doesn't care to fully parse data could use type for routing decisions, but there is almost no value in a "string" or "object" type for this. Maybe an array of tags would be better. Tags could be set manually by the programmer or automatically by the logger. For example, if I call logger.info(foo), the logger could add a tag for typeof(foo), or I could manually tag a special debug line, logger.debugWithTag("alertme", foo), that the router could send directly to my email. Anyway, the logger API isn't too important at this point. The question is: is an array of strings more valuable for routing than a single string, and is the additional complexity worth it?
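To make the tags idea concrete, a rough sketch (tagged and the tag names are invented for the example):

function tagged(payload, extraTags) {
  // The logger adds a type tag automatically; the caller may add more.
  const tags = [Array.isArray(payload) ? "array" : typeof payload];
  return { tags: tags.concat(extraTags || []), payload: payload };
}

tagged({ code: 400 });       // { tags: ["object"], payload: { code: 400 } }
tagged("boom", ["alertme"]); // { tags: ["string", "alertme"], payload: "boom" }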

@dannycoates
Author

The question is, is an array of strings more valuable for routing than a single string, and is the additional complexity worth it?

Or do we not do either and force the router to parse data if it wants to make intelligent routing decisions?

@dannycoates
Author

I've renamed some of the fields to be closer to Heka's native message naming.

@seanmonstar

Yes, it identifies the code that is running. @chilts had a good suggestion of adding the app version also.

Can't that be inferred by where the log is coming from?

Yes, I think all of these top level fields should be set by the logger, not by the caller.

I meant that the actual log pipe (fxa-auth-server.log, or whatever) would be able to infer the app name.

However, now the semantics of what 7 means is harder for a machine to determine (and query later)

Yea, I struggled a lot in intel trying to figure out how to serialize strings like that. Currently, intel's Record.toJSON does json.args = args, and figures that a machine can put them back together if desired.

Or do we not do either and force the router to parse data if it wants to make intelligent routing decisions?

I'm not sure. That's what I ended up expecting in intel. I see that hapi uses tags for server and request logging. I catch "uncaught" errors from hapi by checking tags.error && tags.implementation. Though, coming from Python, what I built into intel was the idea of routing on logger.name and level. It felt like that information was sufficient for getting the right log messages wherever they needed to go.

Since we're talking about logging here, and I happen to work on a logging library, I'd be interested in adjusting intel to solve all these problems easily.

@seanmonstar

For the app name + version, how about app?

@dannycoates
Author

From the #heka IRC channel:

11:10 (dcoates) so, I'd really like to be able to encode a tree structure in a heka message, and I don't see a straightforward way to do that looking at message.proto. Is there a recommended way to do it?
11:12 (trink) dcoates: the only way to do it is to flatten the structure and put the hierarchy in the key name
11:13 (trink) dcoates: util.table_to_fields
11:14 (RaFromBRC) https://github.com/mozilla-services/lua_sandbox/blob/dev/modules/util.lua#L13
11:15 (dcoates) trink: yeah, thats what i was afraid of. It means I can't just use heka messages as a "native" format and have to kind of "double encode" my message in a sense.
11:16 (dcoates) I think it would be *really nice* to make Field recursive
11:20 (trink) dcoates: that still wouldn't do what you want (since there is no concept of a map)
11:24 (dcoates) well, you got my intent ;) I'm not trying to offer a solution
11:31 (dcoates) Is supporting encoding trees in messages without flattening something y'all would be willing to consider?
11:32 (trink) dcoates: the what is clear, the intent [why] is unclear (i.e. the need for a highly structured/nested log to be used in real time analysis)
11:36 (dcoates) trees are natural structures for a lot of things. I may want to trigger on something without having to push it up to the top scope. It'd also be nice to not have to transform my data for each system it goes through.
11:41 (chance) just wondering, what kinda data in this case might be in the tree format
11:42 (chance) maybe git commits? :D
11:42 (chance) you could serialize to json, and unserialize back
11:43 (chance) not in the message fields itself, but nonetheless its there.
11:44 (dcoates) chance: indeed, but now I've got 2 encodings in one message when I could have 1, and as a string payload its harder to use in heka filters.
11:45 (chance) i dont disagree. but supporting a tree like structure in pure heka messages would likely make it harder to use *everything* in heka, not even just filters.
11:46 (chance) but its hard to say.
11:46 (chance) at least protobuf supports it im reading, so its a possibility.
11:47 (chance) probably not on the top of things we'll be working on until we get other stuff done.
11:48 (RaFromBRC) dcoates: the double encoding issue is valid, and i know json sucks in a lot of ways, but in lua filters it'd actually be easier to deal w/ tree structured data in json format than it would embedded in the message directly
11:48 (dcoates) chance: im interested in what would be harder for heka internally so that I can understand what I should and shouldn't expect heka to do for me
11:49 (RaFromBRC) `cjson.decode(your_json)` will give you a lua table (like a javascript object) all ready for action
11:49 (chance) it's hard to say, because i haven't investigated. we use protobuf, protobuf supports this kind of data. however that isn't to say doing such a thing would be as simple as changing protobuf
11:50 (RaFromBRC) extracting data from a bunch of different message fields would be multiple `read_message(FieldName)` calls, and constructing the data structure by hand
11:54 (dcoates) RaFromBRC: awesome, that really helps!
11:55 (chance) yeah, its a good point that its probably easier for you as a user to just get the data from the payload for that specific rason
11:55 (chance) reason, even
11:55 (RaFromBRC) :)
11:55 RaFromBRC likes being helpful
11:56 (dcoates) chance: yes, that is what I was trying to understand, which would be better, all native protobuf, or something else in the payload. payload wins!
11:59 (chance) definitely. sometimes i find myself unsure as well, but im still new to heka :)
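For illustration, a rough JavaScript analogue of the flattening trink describes (Heka's util.table_to_fields): the hierarchy is pushed into dotted key names.

function flatten(obj, prefix, out) {
  out = out || {};
  for (const key of Object.keys(obj)) {
    const name = prefix ? prefix + "." + key : key;
    if (obj[key] !== null && typeof obj[key] === "object" && !Array.isArray(obj[key])) {
      flatten(obj[key], name, out); // recurse into nested objects
    } else {
      out[name] = obj[key];
    }
  }
  return out;
}

flatten({ rusage: { maxrss: 1024, utime: { sec: 1 } } });
// => { "rusage.maxrss": 1024, "rusage.utime.sec": 1 }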

@trink

trink commented Jul 22, 2014

Non-Goals - comprehensive schemas for all applications

This non-goal is already causing problems in Ops. If the output isn't documented it should be identifiable as such and not sent to Heka or any data warehouse. Everything else should be fully documented and versioned to prevent breaking downstream consumers. There have been several ad-hoc FxA logging changes over the past few months: removing memory statistics that we actually alert on, an error message that is now sometimes an object instead of a string, and sporadically missing required fields.

@dannycoates
Author

@trink I'm not saying having documented types is not a goal. I am saying it shouldn't be a prerequisite for getting any logs flowing. The purpose of this document is to get a format in place that will not fail to parse in Heka for any message going through. There is a type field that should be strict enough that we can write Heka filters to process messages if we want, but I think it is important to allow untyped dynamic payload data to at least not get dropped on the floor, and to end up somewhere a developer can at least look at it to decide how fucked up it is. Strict schemas for everything make it unnecessarily hard to discover what is important, especially for unexpected errors. Our current setup needs to improve; this is working towards that.

@rfk

rfk commented Jul 22, 2014

Suggestion: could the _* fields be cleaner if put into another sub-object, e.g. meta to go alongside payload? It's not clear what advantage we get from having them in the top-level fields.

@dannycoates
Author

could the _* fields be cleaner if put into another sub-object, e.g. meta to go alongside payload?

Yep, I took that convention from GELF, but a meta object might be nicer. If it were named fields it would map directly to the Heka message but be less clear in its purpose.
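For illustration, a record using a meta sub-object instead of _* fields might look like this (values borrowed from the examples above):

{
  "version":"0.0.1-draft-1",
  "severity":6,
  "type":"object",
  "logger":"fxa-auth-server",
  "hostname":"apollo",
  "pid":2532,
  "timestamp": 1405915343869000,
  "payload": { "msg":"I got here!" },
  "meta": { "appVersion":"0.18.0" }
}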
