@cablehead
small tools everywhere

What would it look like if we just used small tools, everywhere?

Original revision: Sep 6, 2018

Most developers are familiar with, and proponents of, the Unix philosophy (Unix philosophy - Wikipedia), particularly: write programs that do one thing and do it well. In practice though, the tooling just doesn’t exist to build useful network services which follow this approach.

Let’s take a lightweight WebSocket service. In 2018 we have no shortage of languages and frameworks to create the service - however, they are largely incompatible with each other.

I’m most familiar with the Python world, so I can break out the different frameworks there that you could use: twisted, eventlet, gevent, tornado, asyncio, sanic - and even though these share the same base language, a library designed for one of these frameworks would likely be difficult to use with another. And then there is a further myriad of options in Java, Golang, Erlang, Rust.

I think it’s telling that when an interesting innovation happens in one of these ecosystems, people who prefer a different one begin porting it (or requesting ports).

I’ve been messing a bunch recently with https://pptr.dev, a Node.js library to easily automate Chrome (it’s great). It’d be nice to process data scraped with Puppeteer seamlessly using libraries I’m familiar with in Python. As near as I can tell, Puppeteer was first made publicly available around Aug 18, 2017. Come Aug 28, 2017: "Are there any plans to port puppeteer to Python?" · Issue #575 · GoogleChrome/puppeteer · GitHub

There is now what looks to be a useful port: GitHub - miyakogi/pyppeteer: Headless chrome/chromium automation library (unofficial port of puppeteer)

I run (badly) a few open source projects, and I feel exhausted just thinking about the ongoing effort that’ll be needed to maintain this unofficial port.

Bootstrapping an entirely new method of development, a new language, or a new approach to concurrency is even more daunting. Unless you are able to attract sufficient volunteers to flesh out your ecosystem with the essential batteries included, realistically, even if your approach has significant novel advantages, it won’t be usable for real work.

A quick brain dump of batteries an ecosystem could really use:

  • Protocols: json / msgpack / thrift / grpc
  • Ability to read / write document formats: csv, xls, pdf
  • Network: TCP, HTTP, HTTP/2, WebSockets
  • Bindings for AWS, Kafka, Redis, MySQL, SQLite, Mongo
  • Rich date handling
  • DNS resolution
  • Sane primitives to coordinate async
  • Template rendering
  • Package management
  • Cryptography, TLS, SSH
  • Science and math libraries
  • Heck: even just slugify-ing a URL using industry best practices (see the sketch after this list)
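To make that last bullet concrete, a minimal sketch, assuming "industry best practices" means at least Unicode normalization, lowercasing, and hyphen-separated ASCII (real slugify libraries handle far more edge cases):

import re
import unicodedata

def slugify(text: str) -> str:
    # strip accents: NFKD decomposition, then drop non-ASCII code points
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    # collapse anything that isn't alphanumeric into single hyphens
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

print(slugify("Héllo, Wörld!"))  # hello-world

Simple enough, but getting it right (and keeping it right) in every ecosystem is exactly the duplicated effort being described.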

Pony Lang is a new language that has a lot of interesting qualities. The project lists "batteries required" under reasons not to use it yet (Discover - Pony).

So what would it look like if we constructed our systems with small tools that can communicate easily with each other?

The first thing, I think(?), is that this is largely not possible currently. The suite of small tools needed doesn’t exist.

This is a shot at an HTTP server that takes a JSON payload with two keys, a and b, and returns their sum.

$ s6-tcpserver 127.0.0.1 8080 sh -c '
	http2json | \
	jq .body | jq "fromjson | {res: (.a + .b)}" | \
	json2http'

$ jo a=3 b=4 | curl -d @- localhost:8080
{"res": 7}

Some more thoughts looking at this snippet:

  • Bash quoting is prohibitive to building complex systems on the command line.
  • s6’s use of "Bernstein chaining" (Chain loading - Wikipedia) has a lot of advantages, but isn’t as natural as piping.

s6-tcpserver is a TCP socket server: it binds to a port, spawns a process for each connection, and maps the connection’s socket read side to the process’s stdin and the process’s stdout back to the socket’s write side.
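That model is small enough to sketch. A rough, hypothetical Python equivalent of the accept/fork/rewire loop (s6-tcpserver itself is a C program and does considerably more):

#!/usr/bin/env python3
# rough sketch of the s6-tcpserver model: accept a connection, fork,
# wire the socket to the child's stdin/stdout, then exec the handler
import os
import signal
import socket
import sys

signal.signal(signal.SIGCHLD, signal.SIG_IGN)  # auto-reap children

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 8080))
srv.listen()

while True:
    conn, _ = srv.accept()
    if os.fork() == 0:                        # child: one process per connection
        os.dup2(conn.fileno(), 0)             # socket -> stdin
        os.dup2(conn.fileno(), 1)             # stdout -> socket
        conn.close()
        os.execvp(sys.argv[1], sys.argv[1:])  # e.g. ./tcpserver.py sh -c '...'
    conn.close()                              # parent: keep accepting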

In this case a small shell script is spawned that uses an imaginary binary, http2json, which parses an HTTP request from its stdin and translates it into a JSON document, perhaps in the form:

{
	"method": "POST",
	"path": "/",
	"headers": {...},
	"body": "{\"a\":3,\"b\":4}"
}

This is then piped to an instance of jq to extract the body of the request, then to a second invocation of jq which parses the body as JSON and sums fields a and b, and finally the result is piped to json2http, another imaginary binary that takes the JSON payload and turns it into an HTTP response.
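json2http is equally imaginary; a minimal sketch, assuming every payload becomes a 200 response:

#!/usr/bin/env python3
# json2http (hypothetical): read a JSON document from stdin,
# write it back out as an HTTP response on stdout
import json
import sys

body = json.dumps(json.load(sys.stdin))
sys.stdout.write(
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: application/json\r\n"
    f"Content-Length: {len(body)}\r\n"
    "\r\n"
    + body
)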

  • Constantly serializing / deserializing data to pass between tools is an issue:
    • obviously, it's really inefficient
    • it kinda defeats the purpose of small tools, as each tool would also need to bundle its own serializer / deserializer
    • this is a bigger deal than it may seem on the surface. one of the symptoms of monolithic ecosystems that need to provide batteries included is that they often provide low-quality solutions, e.g. streaming JSON parsing vs. reading everything into memory as a string and then JSON decoding
    • alternatives? shared memory? environment variables?
    • it'd be great to get Jeremy's input
@kalamay commented Sep 12, 2018

Sorry for the delay getting some of these thoughts out. I was away this weekend, and I've been trying to cobble together a few projects that I've had on the back burner. I'm kind of just dumping a few ideas, and I'll post more thoughts as they come.

I think what you are touching on is really about the ability to compose small, well written, units. The UNIX philosophy really shines because of the ability to compose these units together. You can write things that do one thing well because they don't have to be concerned too much with what goes on either side (stdin and stdout). This restriction is really the catch-22 though. By being so restricted, you give up considering how you might improve throughput because there is this unavoidable barrier. But that very barrier is such a powerful unit of composability. As soon as you ask the question of performance, you have to either accept the limitations or entirely break the model.

But, it is fun to write simple tools that utilize the standard POSIX APIs. It's like they were designed for that or something. :)

It does beg the question of a hypothetical system that doesn't suffer the limitation while retaining the flexibility, but I'm not sure if that is really feasible. Microsoft had tried with PowerShell, if I recall correctly, but that was still an attempt at improving management tools, not really a service composition system.

@kalamay commented Sep 14, 2018

Just a few more thoughts that I managed to miss. Overall, I love the idea of composable small tools. I've liked both using the s6 ones as well as writing additional ones when my needs weren't met. It's gotten me thinking a bit more about the performance angle.

This was the strategy I had employed when trying to shore up the logging collection, and it came out pretty good. The s6-tcpserver had one little "annoyance" in that it really wanted stdout to be the reply socket. Generally (as in, 99% of the time) this is what is wanted, so it is entirely reasonable. But for the logging collector I had written a simple server that only set stdin to the accepted socket and allowed all the forked processes to share stdout. Each process would parse the input and spit it out as a JSON line, as long as it was <= 4096 bytes. (I think after running for a good couple of days we never had a line rejected for that reason.) But then, we could pipe those into some log files.

This ended up using very little system resources despite the parsing overhead of stdio, and ended up being quite a bit faster than the system it replaced. I'm not quite sure what the moral of this is, but I think text IO doesn't need to be such a burden from a performance perspective.
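A sketch of that variant, assuming (this is a guess) the 4096-byte limit is there to stay within Linux's PIPE_BUF, at or below which writes to a pipe are atomic, so lines from concurrent children never interleave:

#!/usr/bin/env python3
# sketch of the variant described above: only stdin is rewired to the
# accepted socket; every child shares the server's stdout, so their
# JSON lines all funnel into one place (e.g. a pipe into a log writer)
import json
import os
import signal
import socket
import sys

PIPE_BUF = 4096  # Linux's PIPE_BUF; pipe writes this size or smaller are atomic

signal.signal(signal.SIGCHLD, signal.SIG_IGN)  # auto-reap children

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 9000))
srv.listen()

while True:
    conn, _ = srv.accept()
    if os.fork() == 0:
        os.dup2(conn.fileno(), 0)      # stdin <- socket; stdout left shared
        conn.close()
        for raw in sys.stdin:          # "parse" each incoming line
            line = json.dumps({"msg": raw.rstrip("\n")}) + "\n"
            if len(line) <= PIPE_BUF:  # reject over-long lines
                sys.stdout.write(line)
                sys.stdout.flush()     # one flush per line keeps each write whole
        os._exit(0)
    conn.close()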

@cablehead (Author) commented

batteries-required
