A conversation about encodings transcribed into a gist

Anyway, about encodings: the most useful thing I find to remember is that we can only exchange any information at all because we have a previous agreement about what things mean. This applies as much to this conversation and the language in which we are having it as it does to encodings; encodings are just rather more precise agreements.

So.

If I send you something, it’s just a stream of bytes. If I POST something to your form, it’s just a stream of bytes. If I send something via TinyTDS to your database, it's just a stream of bytes.
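To make the "just a stream of bytes" point concrete, here is a minimal Ruby sketch. It takes the same two bytes and reads them under two different agreements; the variable names are only for illustration.

```ruby
# The same two bytes, read under two different agreements.
raw = "\xC3\xA9".b # binary: two bytes with no text meaning attached

as_utf8   = raw.dup.force_encoding(Encoding::UTF_8)
as_latin1 = raw.dup.force_encoding(Encoding::ISO_8859_1)

# force_encoding only changes the label; the bytes are untouched.
puts as_utf8.bytes == as_latin1.bytes  # true
puts as_utf8.length                    # 1 character: "é"
puts as_latin1.length                  # 2 characters: "Ã©"
puts as_latin1.encode(Encoding::UTF_8) # what a Latin-1 reader would display
```

Nothing in the bytes themselves says which reading is the right one; only the label does.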

The thing that makes it something other than a stream of bytes (which is itself just an agreement about bits at a different level) is that at every stage of transfer we say what it is:

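the browser warrants that the textarea is producing UTF-8
the charset parameter of the Content-Type header warrants that the thing we’re POSTing from the textarea is UTF-8 (Content-Encoding is a different agreement: it describes compression such as gzip, not the character set)
TinyTDS warrants that the thing we are sending to the database connection is UTF-8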
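One of those hand-offs can be sketched in plain Ruby: taking bytes off the wire and labelling them with whatever a Content-Type header declares. Both helpers (`charset_from`, `decode_body`) are hypothetical names invented for this sketch, and it assumes the sender actually includes a charset parameter, which not every client does.

```ruby
# Hypothetical helper: pull the charset label out of a Content-Type header value.
def charset_from(content_type, default: "UTF-8")
  match = content_type.to_s.match(/charset\s*=\s*"?([^";\s]+)"?/i)
  match ? match[1] : default
end

# Hypothetical helper: label raw body bytes with the encoding the header declared,
# and refuse them if they are not valid under that agreement.
def decode_body(raw_bytes, content_type)
  text = raw_bytes.dup.force_encoding(Encoding.find(charset_from(content_type)))
  raise ArgumentError, "body is not valid #{text.encoding}" unless text.valid_encoding?
  text
end

header = "text/plain; charset=UTF-8"
body   = "mum \u{1F923}".b # what arrives off the wire: bytes only

text = decode_body(body, header)
puts text.encoding # UTF-8
puts text.length   # 5 characters, although body.bytesize is 8
```

Checking `valid_encoding?` at the boundary is a design choice for the sketch: it surfaces a broken agreement where it happens rather than several layers later.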

Everything must agree all the way down: if any stage processes this stream of bytes under a different agreement, it may or may not produce errors, depending on how far the two agreements overlap.

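In practice we have an almost complete degree of overlap between ASCII and UTF-8: every ASCII character is encoded as the same single byte in UTF-8, and formal communication in English rarely strays outside that range, emoji least of all.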
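The overlap is easy to check in Ruby. This is only a sketch with a made-up sample string, but it shows why text in the ASCII range survives being read under either agreement.

```ruby
plain = "Dear Sir or Madam"

ascii_bytes = plain.encode(Encoding::US_ASCII).bytes
utf8_bytes  = plain.encode(Encoding::UTF_8).bytes

puts ascii_bytes == utf8_bytes      # true: identical bytes under either agreement
puts plain.ascii_only?              # true
puts plain.bytesize == plain.length # true: one byte per character
```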

But we get these outliers, these people who use emoji in formal forms instead of just Snapchat, and they point up the flaws in our agreements. In our case, we said we're sending a byte per character with a restricted bit range, but in fact we're sending four bytes for that rofl face.
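Here is what that rofl face (U+1F923) looks like from Ruby's side, as a small sketch: one character, but four bytes in UTF-8, and no ASCII representation at all.

```ruby
rofl = "\u{1F923}" # ROLLING ON THE FLOOR LAUGHING

puts rofl.length                                        # 1 character
puts rofl.bytesize                                      # 4 bytes in UTF-8
puts rofl.bytes.map { |b| format("%02X", b) }.join(" ") # F0 9F A4 A3

ascii_failed = false
begin
  rofl.encode(Encoding::US_ASCII)
rescue Encoding::UndefinedConversionError => e
  ascii_failed = true
  puts "cannot be ASCII: #{e.message}"
end
```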

TL;DR

We send streams of bytes, and only accompanying out-of-band information about the encodings enables us to make any sense of the results.

If I sent you UTF-8 but told you it was UTF-16 you’d start mindlessly mashing two bytes at a time into characters, creating an unholy binary mess that resembles what happens when you cat a compiled artefact into a shell.
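A small Ruby sketch of that mislabelling, using four harmless ASCII-range characters. UTF-16LE is picked here as an assumption; big-endian would mash the same bytes into different nonsense.

```ruby
sent = "mum!" # four characters, four bytes in UTF-8

# Same bytes, wrong label: every pair of bytes is now read as one UTF-16 code unit.
misread = sent.b.force_encoding(Encoding::UTF_16LE)

puts misread.length                                                 # 2
puts misread.codepoints.map { |cp| format("U+%04X", cp) }.join(" ") # U+756D U+216D
puts misread.encode(Encoding::UTF_8)                                # two unrelated characters
```

No error is raised here, which is the dangerous part: the bytes happen to be valid under the wrong agreement, so the garbage passes straight through.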

We're sending UTF-8 and saying that it's UTF-8, but the database doesn't agree at the final step and bubbles that error back through TinyTDS as one of:

TinyTds::Error: Unclosed quotation mark after the character string 'mum '.
Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ASCII-8BIT
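The second of those errors can be reproduced in plain Ruby with no database involved. This is a sketch of how that class of error arises, not a trace of what TinyTDS does internally: two strings that both contain non-ASCII bytes, carrying labels that don't agree.

```ruby
utf8   = "mum \u{1F923}"      # UTF-8 text containing a non-ASCII character
binary = "\xF0\x9F\xA4\xA3".b # the same emoji bytes, labelled ASCII-8BIT (binary)

error = nil
begin
  utf8 + binary
rescue Encoding::CompatibilityError => e
  error = e
  puts e.class # Encoding::CompatibilityError
end

# With only ASCII content on one side, Ruby treats the labels as compatible,
# which is why the problem stays hidden until someone sends an emoji.
quiet = "mum " + binary
puts quiet.encoding
```

The exact wording of the message varies between Ruby versions, so the sketch prints only the error class.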

rgarner/encodings.md
