TLDR version: People who were obsessed* with simplifying things created a coding pattern that is overly complex to do a chore that is brain-dead simple.
* Developers drowning in object taxonomies, mainly.
So, anyway...
One thing that really grinds my gears while writing safexz is the ByteReader antipattern. IIRC this pattern first appeared in Smalltalk or Objective-C, then found its way over to Java as (what I call) a Nastypattern™. Consider this:
import java.io.IOException;

public interface ByteReader {
    byte[] readBytes() throws IOException;
}

public class SimpleByteReader implements ByteReader {
    private byte[] source;
    private int currentPosition = 0;

    public SimpleByteReader(byte[] source) {
        this.source = source;
    }

    @Override
    public byte[] readBytes() throws IOException {
        if (currentPosition >= source.length) {
            return null; // EOF
        }
        // Clamp to whatever is left so we don't run off the end of the source.
        int bytesToRead = Math.min(1024, source.length - currentPosition); // 1024, or any other buffer size
        byte[] bytesRead = new byte[bytesToRead];
        System.arraycopy(source, currentPosition, bytesRead, 0, bytesToRead);
        currentPosition += bytesToRead;
        return bytesRead;
    }
}
This is just maddening to understand at first glance. And what about this thing is "simple"? For starters: the line calling Math.min, which is hunting for the safe amount to read (realistically, it's a write) into the buffer. I really like how Lasse Collin (of liblzma fame) approaches this pattern. He gives you a C macro, LZMA_STREAM_INIT, that sets up a data structure for you to work with, which you can call like this:
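#include <lzma.h>

// LZMA_STREAM_INIT zero-fills the structure, which is all liblzma
// needs to treat it as a brand-new stream.
lzma_stream strm = LZMA_STREAM_INIT;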
What's in lzma_stream? This (abridged from lzma.h, with my own comments):
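typedef struct {
    const uint8_t *next_in; /* next input byte to read */
    size_t avail_in;        /* how many bytes are waiting at next_in */
    uint64_t total_in;      /* running total of bytes read */

    uint8_t *next_out;      /* where the next output byte will be written */
    size_t avail_out;       /* free space remaining at next_out */
    uint64_t total_out;     /* running total of bytes written */

    /* ...internal fields follow... */
} lzma_stream;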
Granted, there are more internal fields on this structure (and that's where the triple-indirect pointer to the XZ backdoor was located, placed there during _init), but these simple fields at the top are all that you use to move a stream through the compress/decompress cycle. It's much easier for me to understand.
In a stream processor you generally have some block of working storage to do "work" in, and you have an input tray and an output tray, like a desk. Things come into the IN tray, you work on them, and you put the results in the OUT tray. The only things that matter are the pointers: one (in the case of byte buffers, just an index indicating how much free space is left inside the input tray) directing whoever is feeding the IN tray whether it's safe to add more work, and similarly a pointer (index) given to whoever is taking things from the OUT tray.
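Concretely, the desk loop against liblzma looks roughly like this. It's a sketch modeled on the example programs that ship with xz, assuming the stream was already initialized (LZMA_STREAM_INIT plus a decoder), with error handling mostly elided:

#include <lzma.h>
#include <stdio.h>

void pump(lzma_stream *strm, FILE *in, FILE *out) {
    uint8_t inbuf[4096], outbuf[4096];
    lzma_action action = LZMA_RUN;
    lzma_ret ret = LZMA_OK;

    strm->next_out = outbuf;
    strm->avail_out = sizeof(outbuf);

    while (ret == LZMA_OK) {
        if (strm->avail_in == 0 && !feof(in)) {
            // IN tray is empty: load more work into it.
            strm->next_in = inbuf;
            strm->avail_in = fread(inbuf, 1, sizeof(inbuf), in);
            if (feof(in))
                action = LZMA_FINISH; // the input tape ran out
        }

        ret = lzma_code(strm, action); // do the work on the desk

        if (strm->avail_out == 0 || ret == LZMA_STREAM_END) {
            // OUT tray is full (or we're finished): empty it.
            fwrite(outbuf, 1, sizeof(outbuf) - strm->avail_out, out);
            strm->next_out = outbuf;
            strm->avail_out = sizeof(outbuf);
        }
    }
}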
You don't even need the total_in and total_out fields either. Those are just there for convenience in case you're running this code on something weird that doesn't give you enough registers to track your progress.
Think of it like this... how did big institutions like the IRS manage to process millions of taxpayer records with almost no database technology, on computers with less than 2MB of RAM?
The data structure Lasse Collin wrote gives you an idea how. It wasn't until the very late 1970s that there was enough memory inside computers to store a significant amount of state. The only code people wrote was stream processing jobs. You would collect data from the environment with sensors or humans tapping away at punch card machines, and magtape was the only medium with significant storage capacity. A 6250-bpi reel of tape, when formatted, could hold about 180MB of data.
Since XML and JSON did not exist yet, files were written with positional records. Here's an example:
05200MARY JOHNSTON 2159091212 041677
061001234 MARKET ST PHILA PA 19107
06325DIST A BAL 00030025
06326LSTR 111488 CR 00001645
The first 5 bytes denote a record type. Somewhere, someone decided how many bytes {FIRSTNAME} gets and its position on the 05200 - CUSTOMER NAME PHONE BDAY record. Each one of these lines used to be punched onto cards, and once video terminals became cheap enough to adopt in the 1970s, clerks no longer had to consult a record list or set up a punch card machine to place the data at the correct position in a record. Convenient data entry forms were made to key the information into the files in a speedy way.
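Working with one of those records is just slicing at agreed-upon offsets. Here's a quick sketch in Go; the field widths are made up for illustration, since the real ones lived in whatever record spec the installation agreed on:

package main

import (
	"fmt"
	"strings"
)

// parse05200 splits a CUSTOMER NAME PHONE BDAY record at hypothetical
// fixed offsets. Real installations got these widths from a spec sheet.
func parse05200(rec string) (name, phone, bday string) {
	name = strings.TrimSpace(rec[5:25])   // bytes 5-24: customer name
	phone = strings.TrimSpace(rec[25:36]) // bytes 25-35: phone
	bday = strings.TrimSpace(rec[36:42])  // bytes 36-41: birth date
	return
}

func main() {
	rec := "05200MARY JOHNSTON       2159091212 041677"
	name, phone, bday := parse05200(rec)
	fmt.Println(name, "/", phone, "/", bday)
}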
The streamer program you might be writing is going to upsert into these records, mixed together like this. This ancient video from 1981 describes how that was done.
Computers shared data by reading from copies of these tapes mailed all over the globe. It was already known that tapes were mostly storing 00, so to create tapes where most of the bytes had significant values, they would create packed and unpacked versions. Packing is just this:
05200MARY*JOHNSTON*2159091212*041677
061001234 MARKET ST*PHILA*PA*19107
06325DIST*A**BAL 00030025
06326LSTR*111488*CR***00001645
But data like CITYNAME and STATENAME would still repeat like mad throughout the dataset, along with any other significant data that was repetitive in nature. (Fun fact: this exact style of packing flat files is what EDI is, which is still used today; it's just pushed over TLS connections instead of huge reels of tape.)

That's where the origins of DEFLATE up through LZMA come from. Positional recordkeeping generates tons of whitespace and repetitive data, lending itself very well to compaction. Instead of mailing a hefty 14 tapes of bank account information from one bank to another, you could feed all the master record tapes through an LZ77 job to pack everything onto one tape and mail it for less than $5.
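You can watch this effect with nothing more than a stock compression library. A throwaway Go demonstration (DEFLATE here standing in for the LZ77 family; exact numbers will vary):

package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"strings"
)

func main() {
	// A padded positional record, repeated like a master file full of them.
	record := "05200MARY JOHNSTON       2159091212 041677\n"
	master := strings.Repeat(record, 10000)

	var packed bytes.Buffer
	w, _ := flate.NewWriter(&packed, flate.BestCompression)
	w.Write([]byte(master))
	w.Close()

	// All that whitespace and repetition squeezes down to almost nothing.
	fmt.Printf("raw: %d bytes, deflated: %d bytes\n", len(master), packed.Len())
}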
People like xz in particular because it's great at finding any arbitrary binary patterns (think about all the 7-bit ASCII text out there being stored on computers today as double-byte characters, leaving huge amounts of 00 in the data! or 32-bit machine code that leaves out lots of high opcodes, hence more 00), and text is only a subset of binary encoding, so programs and machine data that are repetitive also get found and packed, not just the whitespace.
Let's take a look at what I did to call lzma directly:
go func(receive chan []byte, sender chan []byte) {
	decompressChanStream(receive, sender)
}(input, output)
and here is how decompressChanStream is defined:
func decompressChanStream(in <-chan []byte, out chan<- []byte) {
Essentially, the in channel can be thought of as a computer tape that's loaded full of LZMA-compressed data (in reality this source can be from anywhere). When I start the job, decompressed data will begin to emerge from the out channel. When the input tape runs out, close()ing the input channel is the signal Lasse's liblzma needs to know the input is complete and there will be no more data coming, so it's time to wrap it up and finish the last writes.
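Feeding the tape looks something like this. The file name and chunk size are my own stand-ins; the part that matters is the close() at the end:

package main

import (
	"log"
	"os"
)

func main() {
	input := make(chan []byte)
	output := make(chan []byte)

	go decompressChanStream(input, output)

	go func() {
		f, err := os.Open("master-records.xz") // or any other source
		if err != nil {
			log.Fatal(err)
		}
		defer f.Close()

		buf := make([]byte, 4096)
		for {
			n, err := f.Read(buf)
			if n > 0 {
				chunk := make([]byte, n)
				copy(chunk, buf[:n]) // copy, since buf is reused next pass
				input <- chunk
			}
			if err != nil { // io.EOF: the tape ran out
				close(input) // tell liblzma no more data is coming
				return
			}
		}
	}()

	// ...drain the output channel here (see the range loop below)...
}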
With this simple pattern you can then do whatever you'd like with the output stream. Maybe you need to filter and disperse individual records, or broadcast the data, or put it on the screen or into a file. I don't have to care what you do with what comes out of it.
Data will continue to come out of the out channel until LZMA's work is done, then it will close the out channel.
By ranging over the out channel like this (outfile being wherever I want the bytes to land):
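for chunk := range output {
	// outfile can be any io.Writer: a file, a socket, os.Stdout.
	if _, err := outfile.Write(chunk); err != nil {
		log.Fatal(err)
	}
}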
I can now copy all the bytes coming out into a file (or to anywhere). I can come back to this code 30 years later and still understand it without needing any comments, because the physical act of connecting inputs and outputs is a basic human understanding, and it's a foundational pattern of stream processing. A foundation that's existed since the ENIAC, which was also a stream-processing computer.
I hate the way Java does this, and C# is similarly nasty. The only thing ByteReader and ByteWriter patterns do is encourage top-level coders to write surfaces over them to hide them from users.