@OhMeadhbh
Last active August 29, 2015 14:03
Dysfunction : a transfer syntax intended to irritate everyone
; dysfunction.dys
;
; Dysfunction is a new specification for communicating typed, structured,
; dynamic data between computing systems (or processes on the same system.)
; It defines a transfer syntax and processing expectations for systems which
; produce and consume formatted data. This file is a brief introduction to
; Dysfunction, along with a few examples.
;
; Dysfunction is used in the same way JSON or XML might be used to store or
; communicate human readable representations of structured data. It is designed
; to be consumable by the small 8 and 16 bit systems currently popular with
; users of "smart" RFID and low power sensors.
;
; Based on analysis initially performed for the IETF's VWRAP working group,
; Dysfunction defines a technique for denoting six scalar types (null,
; boolean, integer, floating point, date and URI) and four vector types
; (character string, array, dictionary and octet string.) It describes rules
; for building a parser to unambiguously identify and denote data of each of
; these types. It also describes three optional type systems, each describing
; rules for converting data into octet strings suitable for use by a computing
; system.
;
; The motivation for developing a new data communication format was primarily
; the desire to overcome perceived issues with existing formats and transfer
; syntaxes:
;
; * XML - Though there is broad support for XML across a wide number of
; operating environments, its flexibility generally imposes processing
; costs inappropriate for small systems. While it is certainly possible
; to build an XML parser on an 8 bit system, it is resource intensive
; enough that many designers prefer to use custom, less flexible parsers.
;
; * JSON - JSON is an excellent choice for representing dynamic structured
; data, but it does not distinguish between integral and floating
; point numbers. And its lack of comments also annoys many developers.
;
; * ASN.1/BER - Abstract Syntax Notation One and its Basic Encoding Rules form
; the basis of a powerful system for denoting arbitrarily complex
; structured data. But BER parsers are even more complex than XML
; parsers, making their use on small systems problematic.
;
; Another motivating factor for producing a new transfer syntax is to
; experiment with the idea of representing structured data as a computer
; program in a very limited virtual machine where a side-effect of executing
; the program is the creation of the data structure described in the
; dysfunction "program."
;
; But this file is less a specification and more a practical example.
;
; You might have noticed the semi-colons at the beginning of most of the lines
; in this file. LISP and assembler programmers will recognize this.
; These are "to end of line" comments. Or "2EoL" if you want to abbreviate
; things. I sometimes try to pronounce "2EoL" and it comes out as "tool."
( I appreciate FORTH, so parens demark begin and end of bounded comments. )
( Parens are nestable (so the first close paren doesn't end all your
comments.) This is because when I'm writing, I like to put things in
parens. It also allows me to write code generators without having to worry
about global scope in comments.
But it does require humans to balance the parens they create.
Oh, and you might have noticed bounded comments can span lines.
)
; Dysfunction separates the transfer syntax from type semantics. This means
; it's perfectly valid to have a bazillion digit number and still expect it
; to be recognized as a number by the parser. But arbitrary length data types
; aren't needed in most cases, so we define three type models:
;
; small - Designed for 8-bit embedded CPUs with a 16 bit address space, the
; small type model assumes integers are in the range of -32768 to
; 32767. Floating point numbers are assumed to be IEEE 754-2008
; BINARY32 or DECIMAL32 (single precision.) Strings and octet strings
; are at most 2^16 - 1 (65535) octets long. 'utf16le' and 'utf16be'
; keywords are explicitly not supported in this model.
;
; medium - The medium type model is intended for modern 32 bit systems.
; Integers are assumed to be between -2^31 and 2^31 - 1. Floating
; point numbers are assumed to be double precision BINARY64 or
; DECIMAL64. Strings are at most 2^32 - 1 octets long.
;
; large - Intended for 64 bit systems, this type model assumes integers will
; fall between -2^63 and 2^63 - 1. Floating points are assumed to be
; BINARY128 or DECIMAL128. Strings are at most 2^64 - 1 octets long.
;
; The keywords 'small', 'medium' and 'large' are used to inform the data
; consumer that the producer promises not to include values outside the range
; described above. In the absence of one of these "type model contract" keywords,
; the consumer should be prepared to accept data of arbitrary length.
;
; Note that these "type model contract" keywords make no representation about
; the data structures used by either the sender or receiver of the information.
; They only indicate the sender's assertion that integral and string types will
; be within the valid value range for that type model. It's perfectly
; acceptable to use the small type model keyword when sending data between
; systems with 32 or 64 bit integers.
;
; Type model contract keywords are intended to give the consumer of data a
; "heads up" about the data they'll be receiving. Specifically, it's so small
; 8 and 16 bit microcontrollers can decide to reject an entire message before
; parsing it.
;
; Here's an example where we send a string:
small
"I am a string that's no more than 65535 octets long."
; See? Simple. The first line tells the receiver that the string being sent won't
; take up more than 65535 octets in memory. This is probably a good time to
; mention UTF-8 support. By default, dysfunction assumes things are UTF-8. The
; practical result of UTF-8 support is strings (and keywords) _may_ take more
; octets in memory than there are apparent characters on the screen when you
; type them out.
;
; So if you go hog wild with Unicode, don't freak out if a 65534 codepoint
; string is rejected by a receiver for being too long.
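;
; A quick worked example of that arithmetic (these are standard UTF-8 encoding
; lengths, nothing specific to dysfunction):
;
;   "e"   U+0065   1 octet
;   "é"   U+00E9   2 octets
;   "€"   U+20AC   3 octets
;
; So a string of 65534 euro signs needs 65534 * 3 = 196602 octets, well over
; the 65535 octet limit of the small type model, while 65534 plain ASCII
; letters need exactly 65534 octets and fit.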
;
; I'm about to use the 'reset' keyword. Don't freak out, I'll explain it later.
; Here's another example of sending a string:
;
reset
small
"This is a string that contains newlines.
Because it's annoying when people think you can't just embed a newline in a
string. It makes it a little nicer if you want to put formatted HTML in a
string:
<html>
<body>
<h1>w00t!</h1>
<p>I am some formatted text! w00t!</p>
</body>
</html>
And let's add a semicolon to the party; just to explain what has higher
precedence: comments or string delimiters. The answer is string delimiters
have higher precedence. So the following text is included in the string. It
isn't removed from the string 'cause the parser thinks it's a comment:
; This would have been a comment if it had been outside the double-quote string
; delimiter. But it's not, so it isn't."
; Watch out. If you edit a lot of dysfunction data by hand it can be easy to
; get confused if you're in a string or not. But what if you want to put a
; double quote in a string? You have two options. First, just escape the
; double quote:
reset
small
"I am a string with a couple of \"double quotes\" in it."
; and the other escapes you have in strings are \uXXXX for unicode and \xXX for
; an 8 bit character.
reset
small
"My favorite unicode character is the lower case glagolitic spidery ha: \u2c52
and my favorite ascii character is \x64"
; See? That easy. But what if you're lazy and want to copy and paste something
; into a file without searching for " and replacing it with \"?
; You can use the BASH inspired heredoc. In this example we create a string
; that starts immediately after the EOF" and ends when the parser encounters
; the sequence E-O-F:
reset
small
EOF"This is an example of the dysfunctional heredoc construction. It allows
me to create a string with double quotes (") or semi-colons (;) or even
parentheses without the parser thinking they denote the end of a string or
a comment. the one constraint is i can't use the characters E, followed by
O, followed by F. Doing so would terminate the string.
EOF
; Adding heredoc complicates the parser a little bit, but I'm lazy and I like
; heredocs.
; Okay, let's fast forward to arrays and dictionaries. Arrays are vectors of
; things addressed by a number. Dictionaries (aka Associative Arrays or Maps)
; are arrays of things addressed by a string. If we wanted to denote an array
; that was similar to the JSON:
; [ 9, 0, 1, 2, 5 ]
; we would do this:
reset
small
array
9 ! 0 ! 1 ! 2 ! 5 !
; That looks pretty crazy, right? It makes it seem like the exclamation point
; is used like a comma. No. It's not. Remember, this is a program on a stack
; based virtual machine. The 'array' keyword creates a new array and pushes it
; on the stack. The number 9 pushes a 9 onto the stack. The '!' keyword removes
; the thing on the top of the stack and appends it to the next thing on the
; stack (which is the array.) I'm not a purist, so the '!' keyword leaves
; the array on the top of the stack.
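;
; To make that concrete, here is a hand trace of the first two elements of the
; program above. The stack is written bottom to top, and the bracket notation
; is only for illustration:
;
;   array    stack: [ ]
;   9        stack: [ ]  9
;   !        stack: [ 9 ]
;   0        stack: [ 9 ]  0
;   !        stack: [ 9, 0 ]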
;
; Here's how you would replicate the following JSON array of arrays:
; [ "i am a string", [ "this", "is", "an", "array", "in", "an", "array" ] ]
;
reset
small
array
"i am a string" !
array
"this" ! "is" ! "an" ! "array" ! "in" ! "an" ! "array" !
!
; Note that the parser doesn't care about newlines. You could just as easily have done
; this for the same result:
reset
small
array
"i am a string" !
array
"this" ! "is" ! "an" ! "array" !
"in" ! "an" ! "array" !
!
; But wow, that's a lot of exclamation points. So there's a shortcut:
; the '!' keyword scans down the stack until it finds an array and appends
; everything above it to that array, bottom-most item first (the reverse of
; the order the items would be popped in). So another way to write the
; previous example would be:
reset
small
array
"i am a string" !
array
"this" "is" "an" "array" "in" "an" "array" !
!
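; By the same rule (and assuming the shortcut works for numbers exactly as it
; does for strings), the very first array example, [ 9, 0, 1, 2, 5 ], could
; also be written as:
reset
small
array
9 0 1 2 5 !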
; So what about dictionaries (aka maps)? Dictionaries are arrays that contain
; associations. You create an association with the '%' character. Here's an
; example:
reset
medium
dict
"username" "msh" %
"password" <16>'6d31fb8e43bdfec0726b2b76f72616f9779d4577' %
"uid" 1000 %
"groups" array
"users" ! "admin" ! "video" ! %
"phone" "650.283.1234" %
; So... execution of this program goes like...
; 1. create an empty dictionary, push it to the stack.
; 2. push "username" on the stack
; 3. push "msh" on the stack
; 4. remove "msh" and "username" from the stack, create an association from
; "username" to "msh" and add it to the dictionary that's left on the stack
; and then repeat steps 2 through 4 for password, uid, groups and phone.
; While we're here, look at the password value. It's an octet string of 20
; bytes. The '<16>' ahead of the tick mark says it's represented in "base 16"
; or plain ol' hexadecimal. By default, octet strings are represented using
; base 64. (Or you could put a '<64>' before the tick mark. That's actually
; valid.)
;
; Here's an array of octet strings, each element represents the same 16 byte
; value:
reset
small
array
<16>'0708d3c304696b553293009aa81ba607' !
<16>'07 08 d3 c3 04 69 6b 55 32 93 00 9a a8 1b A6 07' !
'BwjTwwRpa1UykwCaqBumBw==' !
<64>'BwjTwwRpa1UykwCaqBumBw==' !
EOF'
BwjTwwRpa1Uy
kwCaqBumBw==EOF !
EOF<16>'
07 08 d3 c3 04 69 6b 55
32 93 00 9a a8 1b A6 07
EOF !
EOF<64>'
BwjTwwRpa1Uy
kwCaqBumBw==EOF !
; So let's talk about other types. Here's a map with a couple different
; values:
reset
small
dict
"null value" nil %
"boolean true" true %
"boolean false" false %
"another boolean false" FALSE %
"integer" -17 %
"a hexadecimal integer" <16>-80 %
"an unsigned integer" 65535u %
"an unsigned hex integer" <16>FFFFu %
"an octal value for the octal freaks" <8>371 %
"and this integer has the same 16 bit pattern as the two unsigned ones" -1 %
"floating point" -3.14159 %
"another floating point" 6.02E23 %
"date" {2014-06-25} %
"time" {13:15:30} %
"datetime" {2014-06-25T13:15:30Z} %
"in the past" {1776-07-04} %
"a url" [http://sm5.us/] %
"a mailto url" [mailto:[email protected]] %
"a uuid" [urn:uuid:6e8bc430-9c3a-11d9-9669-0800200c9a66] %
; So what you're supposed to get from this example is:
; * the null object is a real thing, represented by the keyword 'nil'
; * booleans are represented by the case insensitive keywords 'true' & 'false'
; * integers are a sequence of digits without a decimal point. you can prefix
; them with a hyphen to make them negative. if you put a 'u' (or a 'U') after
; it, it's unsigned, so we let you get away with numbers between
; 2^(whatever - 1) and 2^whatever - 1 (where "whatever" is the integer width
; of the type model: 16, 32 or 64 bits.)
; * if you put a decimal number in angle brackets in front of a number, it
; changes the base of that number. valid bases for integers are binary <2>,
; octal <8>, decimal <10> and hexadecimal <16>.
; * floating point numbers can be in the form of X.XX...EYYY.
; * dates are enclosed in curly braces.
; * URIs, URLs & URNs are enclosed in square brackets.
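; As a small illustration of the base prefix, every element of this array is
; the integer ten. (The <2> and <10> prefixes are listed above but not
; otherwise shown in this file, so treat this as a sketch of the rule as
; described.)
reset
small
array
<2>1010 !
<8>12 !
10 !
<10>10 !
<16>A !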
; So let's get back to the 'reset' keyword. Remember I said dysfunction files
; are programs in a small stack based virtual machine? And remember the goal
; of these non-turing complete programs is to produce a data structure as a
; side effect? Well... we assume that any parser is going to have an internal
; state where it stores the stack. The 'reset' keyword tells the parser to
; re-initialize that state.
;
; So, you could do something like this:
reset
small
-15
reset
large
<16>DEADBEEFDEADBEEFu
; And it would be perfectly valid. I don't know why you would want to do it,
; but you certainly can.
;
; It might be a good time to mention dysfunction makes no attempt to store
; how long a message is inside the message. You have to do that yourself, or
; use a transport that already does this (like HTTP w/ content-length headers.)
; I bet you're wondering about UTF-16 and UTF-32 support. Yes, we support
; UTF-16 because of all those old Windows devices that used UCS-2. The 'utf16le'
; keyword tells the parser we're looking at UTF-16 little endian text, likely
; produced by a windows programmer who couldn't find the documentation about
; how to convert UCS-2 or UTF-16 into UTF-8. The 'utf16be' keyword tells the
; parser to expect text from a unix programmer who believes windows
; programmers have actually read RFC2781.
;
; We support UTF-32 by maintaining a list of support groups for people who
; believe UTF-32 is something that should be supported.
;
; The 'utf8' keyword is used to return to sanity after using the 'utf16le' or
; 'utf16be' keywords. UTF-8 is the default, and the 'reset' keyword should
; reset the parser to UTF-8 mode if we shifted into UTF-16 mode previously.
; A quick word about dates. Unix people like to believe the universe began in
; 1970 (or 1969 in some time zones.) JavaScript people believe the fundamental
; quantum of time is the millisecond. And don't get me started about VMS dates,
; Julian days and 18 bits of SAO time of day.
;
; By default, dysfunction makes no requirements about dates and times other
; than they have to be representable by an RFC 3339 date-time. This means you can
; represent dates between January 1st, 0000 and December 31st, 9999 AD. And it
; means the resolution of your clock can be arbitrarily small. So I'm adding the
; time contract keyword 'unixtime'. If the sender includes it, it means they
; promise to only send times that are representable by the 32 bit "seconds
; since 1970" thing.
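;
; A sketch of how that might look. The placement of 'unixtime' is an assumption
; here: it is written before the data, the same way the type model contract
; keywords are.
reset
small
unixtime
{2014-06-25T13:15:30Z}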