Anyway, about encodings: the most useful thing I find to remember is that we can only exchange any information at all because we have a previous agreement about what things mean. This applies as much to this conversation and the language in which we are having it as it does to encodings; it’s just that encodings are rather more precise agreements.
So.
If I send you something, it’s just a stream of bytes.
If I POST something to your form, it’s just a stream of bytes.
If I send something via TinyTDS to your database, it's just a stream of bytes.
The thing that makes it something other than a stream of bytes (which is itself just an agreement about bits at a different level) is that at every stage of transfer we say what it is:
- the browser warrants that that textarea is producing UTF-8
- the Content-Type header warrants, via its charset, that the thing that we’re POSTing from the textarea is UTF-8
- TinyTDS warrants that the thing that we are sending to the database connection is UTF-8
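The idea that bytes mean nothing until someone says what they are can be seen directly in Ruby, where every string carries an encoding label alongside its bytes. This is only a small sketch: the three bytes and the two encodings are chosen purely for illustration and are not taken from the form or database discussed here.

```ruby
# The same three bytes, read under two different agreements.
raw = [0xE2, 0x82, 0xAC].pack("C*")   # a binary string: bytes with no meaning attached yet

as_utf8   = raw.dup.force_encoding("UTF-8")        # agreement: these bytes are UTF-8
as_latin1 = raw.dup.force_encoding("ISO-8859-1")   # agreement: these bytes are Latin-1

puts as_utf8.length                     # 1 character (the euro sign) under UTF-8
puts as_latin1.length                   # 3 characters under Latin-1
puts as_utf8.bytes == as_latin1.bytes   # true: the bytes themselves never changed
```

Note that force_encoding only changes the label; it never touches the bytes, which is exactly why a wrong label goes unnoticed until something tries to interpret them.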
Everything must agree all the way down, because processing this stream of bytes under a different agreement at any stage may or may not produce errors, depending on the degree of overlap between the agreements.
In practice we have an almost complete degree of overlap between ASCII and UTF-8, because UTF-8 encodes every ASCII character as the same single byte and because of the infrequency of emoji in formal forms of communication.
But we get these outliers, these people who use emoji in formal forms instead of just Snapchat, and they point up the flaws in our agreements - in our case that we said we're sending a byte per character with a restricted bit range, but in fact we're sending four bytes for that rofl face.
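Byte counts like these are easy to check in Ruby. The sketch below assumes the rofl face is U+1F923 (the original does not say which emoji it was) and uses "mum" as a stand-in for ordinary ASCII text.

```ruby
plain = "mum"
emoji = "\u{1F923}"   # assumption: the "rofl face" is U+1F923

# ASCII text is byte-for-byte identical in UTF-8: one byte per character.
puts plain.encode("US-ASCII").bytes == plain.encode("UTF-8").bytes   # true
puts plain.bytesize      # 3 bytes for 3 characters

# The emoji is a single character, but UTF-8 needs four bytes for it,
# and every one of them has the high bit set, so none is valid ASCII.
puts emoji.length        # 1
puts emoji.bytesize      # 4
puts emoji.ascii_only?   # false
```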
We send streams of bytes, and only accompanying out-of-band information about the encodings enables us to make any sense of the results.
If I sent you UTF-8 but told you it was UTF-16 you’d start mindlessly mashing two bytes at a time into characters, creating an unholy binary mess that resembles what happens when you cat a compiled artefact into a shell.
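A rough illustration of that mislabelling, assuming little-endian UTF-16 and a sample string of my own choosing:

```ruby
utf8_bytes = "hello!".encode("UTF-8")   # six bytes, one per character

# Wrong agreement: claim that the very same bytes are UTF-16 (little-endian).
mislabelled = utf8_bytes.dup.force_encoding("UTF-16LE")

puts utf8_bytes.length             # 6
puts mislabelled.length            # 3: the bytes were mashed together two at a time
puts mislabelled.encode("UTF-8")   # three unrelated characters, not "hello!"
```

Nothing raises here, which is the unpleasant part: the bytes happen to be valid under the wrong agreement, so the result is silent nonsense rather than an error.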
We're sending UTF-8 and saying that it's UTF-8, but the database doesn't agree at the final step and bubbles that error back through TinyTDS as either
TinyTds::Error: Unclosed quotation mark after the character string 'mum '.
or
Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ASCII-8BIT
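The second of these errors can be reproduced in plain Ruby without a database. This is not necessarily how it arises inside TinyTDS; it is only a minimal sketch of the same exception class, with made-up strings, showing what happens when two pieces of text carry different encoding labels and both contain non-ASCII bytes.

```ruby
text   = "mum \u{1F923}"      # labelled UTF-8, contains a non-ASCII character
binary = "caf\xC3\xA9".b      # valid UTF-8 bytes, but labelled ASCII-8BIT (binary)

message = nil
begin
  text + binary               # the two labels disagree, so Ruby refuses to join them
rescue Encoding::CompatibilityError => e
  message = e.message
end
puts message

# Once both sides carry the same label, the same bytes join without complaint.
relabelled = binary.dup.force_encoding("UTF-8")
combined   = text + relabelled
puts combined.encoding        # UTF-8
```

The exact wording of the message varies a little between Ruby versions, but it names both encodings, which tells you which two agreements are in conflict.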