The Open Game Engine is our flagship open-source software for computational game theory. It is an embedded domain-specific language that extends the functional programming language Haskell with a powerful syntax for defining games compositionally, and then running computational analyses on them.
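To make this concrete, here is a plain-Haskell sketch of the kind of analysis the engine automates (this is not the engine's own syntax, just a self-contained illustration): brute-force checking the pure Nash equilibria of the Prisoner's Dilemma.

```haskell
-- Illustrative only: a 2x2 game and a brute-force pure-Nash check.
data Move = Cooperate | Defect deriving (Eq, Show, Enum, Bounded)

-- Prisoner's Dilemma payoffs: (row player, column player).
payoff :: Move -> Move -> (Double, Double)
payoff Cooperate Cooperate = (2, 2)
payoff Cooperate Defect    = (0, 3)
payoff Defect    Cooperate = (3, 0)
payoff Defect    Defect    = (1, 1)

-- A profile is a pure Nash equilibrium if no player gains by a
-- unilateral deviation.
isNash :: (Move, Move) -> Bool
isNash (a, b) =
     all (\a' -> fst (payoff a' b) <= fst (payoff a b)) moves
  && all (\b' -> snd (payoff a b') <= snd (payoff a b)) moves
  where moves = [minBound .. maxBound]

main :: IO ()
main = print [p | p <- (,) <$> moves <*> moves, isNash p]
  where moves = [minBound .. maxBound]
-- prints [(Defect,Defect)]
```

The engine's contribution is to make such analyses compositional: small games are defined once and plugged together into larger ones, instead of every model being written out monolithically as above.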
Supported by a grant from the Ethereum Foundation, we are developing applications of the Open Game Engine to game-theoretic analysis of smart contracts, for example for token auctions and DeFi contracts. You can read about this project here. We are also working on integration with the dapptools project to partially automate this process.
We are investigating the possible collusive behaviour of pricing algorithms in realistic training environments, using a reinforcement learning framework built on top of the Open Game Engine. You can read more about this project in this paper.
A team of academic collaborators based mainly at the MSP Group in Glasgow are studying the category-theoretic structure of general cybernetic systems, and the ways in which different real-world methods and problems map onto these structures. This includes deep learning, dynamic programming and reinforcement learning, Bayesian and variational learning, and economic game theory. You can read more in this paper.

We are an EU-based nonprofit organisation working on pure and applied research and open source software development. We are interested in all aspects of complex systems, optimization, control and intelligence, but our primary focus is on machine learning and economic theory.
We are neither academic nor startup, but sit at a nexus between the two. Our work is motivated by both curiosity and urgent social need.
The popular understanding of the word “cybernetics” has drifted over time to mean biotechnology, but its original meaning, coined in the 1940s and flourishing through the 1960s, was the interdisciplinary study of the control of complex systems, in fields such as ecology, economics and computer science. Due to the urgent need for an interdisciplinary approach to 21st-century problems, and the emergence of category theory as a new mathematical foundation, we believe the time is now right to try to reclaim the original dream of cybernetics, and its name.
Category theory is a field of mathematics that arose out of topology and geometry. Applied category theory (ACT) is a recently-emerged interdisciplinary field that applies methods of category theory to practical problems, achieving industrial success in fields such as quantum computing and natural language processing.
Categorical cybernetics, or CyberCat for short, is a sub-field of ACT focussing on optimization and control, using a small set of category-theoretic tools such as optics and open games.

Cross-posted from Oliver’s EconPatterns blog
The last EconPatterns post traced the history of economic design, focusing on the operations research group at Stanford’s business school and its role in developing auction design and market design. In this post I want to take this a bit further and describe the overlapping roles of operations research and economic design in more detail, anchoring on typical “operations research” domains, and how they quickly cross over into “economic” domains.
But a short definition of terms first: operations research is an applied branch of mathematics, mostly focusing on optimization (or “programming” in the original sense of linear programming, dynamic programming, combinatorial programming, etc.): orchestrating inputs to optimize (minimize or maximize) an output in the form of an objective variable.
The canonical formulation is a constrained programming or optimization problem: one objective function plus any number of inequalities expressing constraints or limits.
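In symbols, the canonical (linear) instance reads:

$$
\min_{x \in \mathbb{R}^n} \; c^{\top} x
\quad \text{subject to} \quad
A x \le b, \;\; x \ge 0
$$

where the vector $x$ collects the inputs to be orchestrated, $c$ their costs, and each row of $Ax \le b$ one of the limits stressed below: stocks, capacities, throughput ceilings.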
In the first post on the cybernetic economy I already stressed the role of these limits: available stocks can run out or warehouses can overflow, machines can only transform so many pieces in an hour, and pipelines, roads, and conveyor belts can reach their capacity and get congested.
“Orchestration” means putting tasks into their correct order, balancing loads and flows, minimizing stocks without risking stockouts, avoiding congestions, disruptions, or volatility in the flow of goods, people, information, targeting objectives like fastest time to completion, minimal slack times, or lowest cost of inventory.
Operations research is the formal mathematical tool used by industrial engineers to design production plants, supply chains, workforce deployment plans, transport schedules and sundry other things that require juggling many parts under tight and often volatile conditions.
Even if it’s steeped in industrial (and military) lore, it’s also used in areas like microchip design, financial engineering, and all over the place in the digital economy. It’s pretty much everywhere in those parts of the economy that typically remain invisible to the casual observer: the engine room of a modern economy.
The hallmark of operations research is that it’s set up to serve one principal, focusing mostly on operations within an organization. This distinguishes it from economics proper, which focuses on exchange between multiple principals and the resulting tension between their objectives, motivations, and desires.
Operations research cuts over to (mathematical) economics at the same juncture where decision theory crosses over to game theory: when the diverging interests of the participants move to the forefront of the analysis.
EconPatterns deliberately straddles the boundary between the disciplines for a number of reasons: operations research has much closer ties to both computer science and industrial design, offers a much richer toolset to aggregate and disaggregate processes within a hierarchical structure, has a closer connection between theory and practice, is a much better design paradigm to model complex longitudinal interactions with many specialized components, and ultimately has more tangible and straightforward objectives, typically those that can be measured with a stopwatch or a yardstick, rather than abstractions such as the idea of an equilibrium as a stable state where conflicts are resolved.
On the other hand operations research works from a paradigm of central planning, a paradigm that is losing analytical heft the more the connected process under scrutiny — the value chain — involves interaction, goal and resource conflicts between principals rather than between machines, tools, parts, information, and labor.
So roughly speaking, as soon as the tension between principals becomes the driving factor, we cross over to economics. As soon as the need to concatenate activities or to disaggregate higher-level processes into tasks and subtasks dominates, we’ll lean more on operations research.
But the core message is still this: from the EconPatterns vantage point, where the value chain is the analytical starting point for any design endeavor, all but the most trivial value chains have multiple crossings not only between machines but also between organizations, jurisdictions, and even belief systems; and since not only efficiency but also accountability is relevant to the integrity of that value chain, the formal aspects of economic design will inevitably sit on the cusp between the disciplines.
Let’s put this to use in two examples.
Machine replacement is one of the core problems in industrial engineering. In its simplest form, it means finding the ideal time to take an existing machine or process out of use and replace it with another, presumably superior, one.
The calculus, easy enough for bachelor-level exams, requires comparing the cost of the new machine (minus scrap value of the old machine) to the performance differential, most likely in a net present value calculation. If the performance benefit is higher than the cost of replacement, it’s a go.
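A minimal sketch of that calculus in Haskell (the discount rate, costs, and per-period gains in the example are illustrative placeholders):

```haskell
-- Net present value of a stream of cash flows at discount rate r.
npv :: Double -> [Double] -> Double
npv r cashflows = sum [cf / (1 + r) ^ t | (t, cf) <- zip [1 :: Int ..] cashflows]

-- Replace if the discounted performance gain exceeds the net cost
-- (purchase cost minus scrap value of the old machine).
shouldReplace :: Double   -- discount rate
              -> Double   -- purchase cost of the new machine
              -> Double   -- scrap value of the old machine
              -> [Double] -- per-period performance gain (new minus old)
              -> Bool
shouldReplace r cost scrap gains = npv r gains > cost - scrap

-- e.g. shouldReplace 0.05 10000 1000 (replicate 5 2500) ==> True
```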
From this starting point we can make the problem arbitrarily complex. What if the performances of the machines are not constant over time? What if the old one becomes gradually less efficient, with lower throughput and more frequent outages, or the new machine needs time to ramp up? What if the replacement itself doesn’t only include the purchasing cost of the new machine but also a work stoppage? What if the machine is part of a production facility? Does the whole production line have to be closed down, or are other similar machines on a different line able to take on the shortfall? What if the new machine is only able to perform better in conjunction with other replacements? What if uncertainty is involved?
Even if it’s useful to think of industrial engineering in terms of real industrial machines for milling, turning, or drilling in an industrial machine shop, these “machines” could be pretty much anything. If a bank considers a new process for checking creditworthiness, or if a college department contemplates restructuring its degree curriculum, they encounter similar planning and orchestration problems. The introduction of video assist in sports is an example of machine replacement.
Today, in most cases the “machine” is simply a computer. More abstractly, a “machine” is simply any workplace where a defined transformation is taking place.
All of this happens, if nothing goes wrong, according to a meticulously planned program, and if something goes wrong, hopefully according to a meticulously crafted contingency plan — the hallmark of central planning.
If we remove that requirement, and allow two new stakeholders in — competitors and customers — we get something we might call coordinated machine replacement, or in a more succinct and better known term: innovation.
Innovation in its most technical definition is the increase in total factor productivity, or aggregate output produced by aggregate input factors (in economics, famously labor, capital, and land, but I’ll devote another post to that). In other words: the collective replacement of machines, processes, and activities to make resource use more efficient, in a (more or less) competitive economy.
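For concreteness, the standard growth-accounting rendering of that definition (assuming a Cobb-Douglas technology in capital $K$ and labor $L$, and abstracting from land) is:

$$
Y = A\,K^{\alpha} L^{1-\alpha}
\qquad \Rightarrow \qquad
\frac{\dot{A}}{A} = \frac{\dot{Y}}{Y} - \alpha\,\frac{\dot{K}}{K} - (1-\alpha)\,\frac{\dot{L}}{L}
$$

so innovation, in this reading, is whatever output growth the growth of measured inputs cannot account for.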
In a “textbook” model of the economy, where firms are seen as singular and solitary production functions, replacement happens by Schumpeterian competition: companies which improve their efficiency by optimizing their production will gain a competitive advantage which lets them capture value, “Schumpeterian rents” in the innovation economics nomenclature, for as long as that efficiency advantage persists.
This economic pressure (technologically disadvantaged competitors see their margins evaporate until they either catch up or give up and leave the industry) is the driver of economic innovation and, in turn, of economic growth.
So much for the textbook treatment.
With the era of Henry Ford shutting down production for five months to replace the Model T with the Model A decisively over, the problem of keeping interdependencies uninterrupted while interrupting a single step in a complex value chain moves to the forefront.
Research and development for new car generations now starts long before the existing car generation gets taken off the market. There is no more reason to lay off workers, cancel orders for parts, keep dealerships waiting for new vehicles, or hope that customers are willing to wait a few months rather than wandering off to the competition.
Some of this still falls under a competitive bracket: laid-off workers and stranded dealers might also defect to the competition. In other cases, outright coordination might become necessary, such as in the adoption of shared technology standards or auditing rules. Value chains might become reintegrated, as when electric vehicle manufacturers, recognizing that market competition does not supply enough charging stations, reluctantly enter the market for charging infrastructure.
The less we think about the technological disruption of a value chain as a purely competitive event between isolated actors, the more we need to reach into the toolbox of operations research methods.
Machine scheduling is at the heart of operations research, and even if one of its synonyms, “job shop scheduling”, betrays its origin on the shop floors of the industrial era, it’s still at the heart of most algorithmic processes that try to direct inputs toward productive outputs.
The underlying idea is that jobs have to be allocated to machines on which they can be performed. In its simplest form, these jobs consist of a sequence of steps, similar to Adam Smith’s pin factory, where a prior step has to be finished before a subsequent one can be started.
This setup can be made more complicated in many ways. Machines (and their operators) might be specialized to perform only certain tasks. Jobs might require setup times which either have to wait until prior jobs are finished or can be started while the prior job is still running. Uncertainty can come into play in many ways.
The objective is typically to minimize time to completion, maximize machine utilization, or some related measure.
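As a minimal sketch of the simplest variant (identical parallel machines, no precedence constraints), here is the classic “longest processing time first” greedy heuristic in Haskell; the job lengths in the example are made up, and since exact makespan minimization is NP-hard, this is a heuristic, not a solver:

```haskell
import Data.List (foldl', sortBy)
import Data.Ord (Down (..), comparing)

-- Greedy LPT scheduling on m identical machines: sort jobs by
-- decreasing length, always assign the next job to the currently
-- least-loaded machine, and report the resulting makespan.
makespan :: Int -> [Double] -> Double
makespan m jobs = maximum (foldl' assign (replicate m 0) sorted)
  where
    sorted = sortBy (comparing Down) jobs
    assign loads job =
      let l = minimum loads
          (before, _ : after) = break (== l) loads
      in  before ++ (l + job) : after

-- e.g. makespan 3 [2,14,4,16,6,5,3] ==> 17.0
-- (which here happens to hit the lower bound: max 16 (ceiling (50/3)))
```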
Machine scheduling has successfully crossed over from the shop floor to the digital economy, especially when it comes to platform operations where the “machines” can be vehicles: taxicabs, scooters, coaches, and the “jobs” can be passengers trying to get from A to B in a timely, cheap, and secure manner.
This is again a scenario where the worlds of economics and operations research intersect. We can think of a platform as a central conductor trying to move people from A to B, which inevitably requires operations research knowledge, but we also have passengers (and in some cases, drivers) as participants with diverging interests, which requires economic and especially game theoretic knowledge.
The boundary is blurry, and the scale might tip whenever we realize that we’re better off assigning a modicum of autonomy to the many interlocking parts: that the machines might find a better solution if we let them compete for scarce resources and avoid congestion rather than insisting that coordination requires central control.
But it also helps to think of operations research as the discipline that operates bottom-up, assembling economic engines from universal elementary operations, while economics tends to operate top-down, from a highly aggregated macroeconomic perspective down to individual microperspectives. It also helps to remember that operations research moved from the shop floor into academia, while economics is still trying to move in the opposite direction.
Design is ultimately about breaking complex problems down into their constituent parts, solving them in isolation and reassembling them in the hope that the partial solutions fit together. This requires, almost regardless of the application domain, that we start with a rough outline of the potential solution and decide step by step which partial problems require particular attention to detail.
This can be done in a methodical or in a haphazard fashion. In particular, the opposing risks of not enough attention to detail or too much attention to detail loom large over failed design projects. This is certainly not restricted to economic design, but economics as a discipline suffers from a lack of conceptual rigor and increasingly an overflow of formal rigor.
This isn’t only the case for the part of the design process where we go from a “rough outline” (a conceptual understanding of the overall problem) to a fully fleshed-out formal model; once we understand that we need to apply a formal toolset, there is also often a lack of understanding of which toolset applies to the problem at hand.
In this post, we’re in the latter part of that process. Both operations research and mathematical economics are highly formalized frameworks which share a common history in the evolution of constrained optimization but which for at least two generations (roughly from the inception of the Econ Nobel and the deliberate choice by the Nobel committee to reward the economists but not the operations researchers working on the same problem) barely talked to each other.
Over the last ten years or so, we’ve seen a gradual rapprochement between the disciplines, in large part because the new players of the digital economy started to realize that their machinery is often economic in nature — auctions, matching markets, information and risk aggregators — even if they deal in abstract information goods rather than in physical objects assembled on the shop floors of the industrial economy.
In the process they’ve also recognized that the academic paper exercises which constitute the main output of modern economics aren’t sufficient to assemble production-ready economic engines. For this you also need scalability, modularity, interoperability, and an understanding of human interaction that bows as much to drab realism as it does to formal aesthetics.
To offer a simple example: in the push to succeed in the global coordinated machine replacement problem known as the transition from fossil to renewable sources of energy, we can’t just assume that we’re fine and markets clear if aggregate supply matches aggregate demand.
We also have to take into account that energy is rarely ever produced where and when it’s needed, neither in place nor in time. So we have to apply a model of an energy economy that pays attention to stocks and flows — in other words, a cybernetic economy.

Recently we held a workshop in Edinburgh titled Mathematics for Governance Design, consisting of a roughly 50/50 split between social scientists and category theorists.
The workshop was organised by Philipp and myself from the CyberCat Institute together with Seth Frey, Saba Siddiki and Josh Tan from (in an overlapping way) Metagov, the Institutional Grammar Research Initiative and the Computational Institutional Science lab. It was funded and hosted by the International Centre for Mathematical Sciences as part of their Mathematics for Humanity programme of events.
We designed the workshop to have as little scheduled time as possible and as much unstructured working group time as possible - inspired by our past experience running a workshop at Wytham Abbey and originally inspired by Dagstuhl. And it was a resounding success: the theme of the week felt like watching famous people interact whom we would never have expected to interact with each other. The danger of running a workshop like this is that the two different groups would form cliques and only interact with each other under duress, but the exact opposite happened.
Probably my personal highlight was being able to meet Matilde Marcolli and talk about our shared interest in the very hard question of how to surpass the well-known scalability barrier for human self-organisation (for example written about extensively by Elinor Ostrom). We agreed that compositional game theory and related categorical cybernetics methods could plausibly have a role to play in building models of social situations consisting, for example, of groups of groups arranged in an approximate hierarchy. (When taken as revolutionary this is an aspect of anarcho-communism, although my personal interest is a bit too theoretical to call it that.) In fact I should write a blog post on the general topic of hierarchies of lenses, which is something I’ve talked extensively about with several people, most notably Toby Smithe in the context of modelling the human cortex. I talked about it quite a bit in high-level terms in this blog post, but a more technical post might be in order.
Another highlight was being able to finally engage with institutional grammar and think seriously about how it could relate to open games. There is some past work (this and this, plus a paper I wrote earlier this year with Vincent Wang-Maścianica that isn’t released yet) on connections between open games and natural language, but to me the limiting factor has always been that the open game semantics of individual words must be hand-crafted, which will not scale beyond the tiniest of toy examples, and there has never been a plan for it besides “maybe hand-craft enough examples to fine-tune an LLM and hope for the best”. My immediate thought now is that the type of natural language texts we would like to apply this to will often factor through an institutional grammar representation, in a way that is likely to be extremely useful for game-theoretic analysis. I learned last week that institutional grammar has already been connected to agent-based models, and if it is possible to go to an agent-based model then it should not be significantly harder to go to a game-theoretic model. The benefit, of course, is that Nash equilibrium is a very standard perspective for thinking about the theoretical nature of institutions.
Of course we understand as well as anybody that interdisciplinary research is extremely hard, and we should not expect immediate technical results after bringing together two such different groups. But we felt real excitement in the room, we have every reason to expect multiple new collaborations to form, and we are already planning a successor workshop next year.
Cross-posted from Oliver’s Substack blog, EconPatterns
The fundamental economic exchange is surprises for eyeballs.
Modern economics is built around understanding the mechanics of market exchange, but it hasn’t always been that way. The etymological root of economics, the Greek oikonomia, points toward household management, or husbandry of the (largely self-sufficient) estate, the oikos. Today we would call it home economics.
After discussing the fundamental grid of the economy in the last post, it makes sense to lay out the underlying assumptions of human behavior within that economy in some detail — and both the title and the introductory statement (possibly the first pattern introduced) should make it clear that these assumptions differ somewhat from the traditional textbook treatment of economic agents.
But they also differ from the various attempts to bound the rationality assumptions of textbook economics in some way, be it in the Carnegie “satisficing” or in the Berkeley “behavioral” tradition. It nevertheless incorporates both, in addition to a variety of other behavioral quirks which we might not associate with the economic realm.
The major reason to tweak our behavioral assumptions is that, to design economic structures, we need a framework that lets us apply a varying set of behavioral assumptions across a variety of settings while still staying coherent.
So it’s not so much a behavioral assumption but a template for developing context-specific behavioral assumptions — or in other words, a design pattern. Humans behave differently in different social settings, and we should be able to pick the right model for the right circumstances, but still be able to treat it as a special instantiation of a shared underlying pattern.
This explicitly includes using the assumption of perfect rationality wherever it is warranted.
So let’s grab our opening statement and take it apart.
“Eyeballs” is marketing vernacular for attention. The term can be taken quite literally — there are devices that track eyeball movement to find out how much screentime is spent staring at ads. But for the most part I will use it metaphorically as the cognitive effort devoted to a task.
It is perfectly fine to assume away cognitive limitations in a wide variety of circumstances. It simplifies our model significantly. It deflects accusations that a given policy claim is the outcome of an opportunistically chosen (boundedly rational) behavioral model rather than an underlying economic force. And in many scenarios it creates good-enough predictions for the task at hand.
Assumptions are simplifications that ideally give us more gain in parsimony than loss in predictive accuracy. As long as that’s what they do, they do their job.
But there are also situations where such a simplifying assumption produces results that stray too far from observable reality, and we need to have a plan for how we want to adjust the behavioral model in those situations.
A fair starting assumption is that the economic actor will allocate cognitive resources economically, devoting the most attention to those tasks where she expects the most bang for the buck. And that brings us to the other part of the statement.
The economic expression for “the most bang for the buck” is “maximum expected utility”, but this requires a lot of foreknowledge, and we can’t simply assume under all circumstances that our economic actor already possesses it. Every time you see an economics paper assuming that our actor knows something about the distribution of a random variable, you know we’re on shaky ground.
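In symbols, the textbook formulation is

$$
a^{*} = \arg\max_{a \in A} \; \mathbb{E}_{x \sim p}\left[\, u(a, x) \,\right]
$$

and the distribution $p$ over the unknowns $x$ is precisely the foreknowledge in question.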
So the next level is to assume that our actor will venture to find out, acquiring this knowledge step by step in what we can call a process of discovery — which usually means a sequence of failures that terminates either with a moment of success or with the decision to call it off. In econspeak, this discovery process is known as tâtonnement.
But we shouldn’t assume that our agent just wanders around the desert aimlessly hoping to find an oasis — a stark example of such a discovery process with a life-or-death ending. Rather, there should be a plan behind those wanderings.
That plan is usually to devote the existing resources, cognitive and physical, in a way that maximizes the knowledge gained about the terrain. In our desert scenario this might translate to climbing to the top of a ridge to survey the territory, or alternatively to staying near the valley floor to limit exposure to sunlight.
We can describe this process in two ways: uncovering secrets — where a secret is anything that wasn’t known before but is known after — or hunting for surprises.
Surprise expresses the same thing — some difference between what was known before vs what is known after — but it also gives us the opportunity to express it in two ways: positive surprise and negative surprise.
Loosely translated, positive surprise is beneficial — something worth seeking out — and negative surprise is harmful — something to be avoided. On this single dimension we can build a (surprisingly) wide range of behavioral models, including differentiating individuals by their propensity to seek out positive surprise and accept negative surprise in the process, in other words by their affinity for disorder.
This has clear connections to the behavioral assumption of risk preference, and the connection definitely warrants further attention — risk is a transferable economic commodity — but it also gives us the additional angle that planning is a vehicle to mitigate negative surprise for individual actors, and contracting is a vehicle to mitigate negative surprise for collective action, including the canonical form of collective action: the organization (which will be at the center of next week’s post).
A lot of this will be fleshed out in the weeks to come, and some of the jumping-off points should already be apparent. Surprise gives us the opportunity to invoke both information entropy and ultimately thermodynamic entropy. But as already mentioned, this series will only use these ideas conceptually, and point towards formal treatments in their respective literatures.
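For readers who want the anchor anyway, the standard information-theoretic rendering is: the surprisal of an outcome $x$ under beliefs $p$, and entropy as expected surprise,

$$
s(x) = -\log p(x),
\qquad
H(p) = \mathbb{E}_{x \sim p}\left[\, s(x) \,\right] = -\sum_{x} p(x) \log p(x).
$$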
Design is a guided trial-and-error process where judgment calls have to be made about the structure of the problem, about splitting it into its constituent parts and putting the parts back together in the hope that no unwanted interaction effects emerge, about taking requirements and putting them in an order, about defining and resolving contingencies and dependencies, about the level of detail at which a problem needs to be resolved, at which precision, and how far into the future.
For this we need a flexible model of behavioral assumptions that can be adjusted to fit the task at hand, that can be experimented with. “Surprises for eyeballs”, or in other words, “secrets for attention”, gives us exactly that.
There’s an obvious objection to this treatment, and it’s a fair one. “Surprise for eyeballs” is most obviously suited to the information economy, or maybe more aptly: the attention economy, and in the trad economy we might be better off dealing with the canonical exchange of supply vs demand in its trad form of an effort (a product or service) vs a payment.
Let me use George Akerlof’s famous essay on the market for lemons to show why even a world of one-off transfers of a physical object against a simultaneous transfer of its monetary equivalent is still a special case of an attention economy full of surprises.
Akerlof’s paper kicked off the field of information economics, and is most widely associated with introducing the concept of asymmetric information. But as the second half of its title suggests, it’s actually about quality competition (a “lemon” being a colloquial term for a used car of poor quality), and the information angle is about the inability of conveying this quality — especially about the inability of an owner of a high-quality car to establish that his car is not a lemon.
But how do we find out if a car is a lemon? And how do we insure ourselves against the risk of acquiring a lemon? By finding out.
In the same sense of the stranded-in-the-desert example above, the process of finding out is a discovery process except with opposite signs. It’s a sequence of successes terminated by a failure — which is true for all machines: they run until they break down.
But there’s an inevitable random element to this process, and even if we can assume that lemon-ness correlates negatively with longevity, that relationship is far from deterministic. We cannot conclude with certainty from the time of failure whether the car was a lemon — even if the prior owner knew about its lemon-ness.
This simple recognition has a wide array of ramifications worth taking apart in detail, because most of them are central to economic design — not only of economic engines like markets, auctions, recommenders or reputation engines, but also of economic institutions. Notoriously, the business model of the Roman Catholic Church is that of a certifier of good conduct: a good old-fashioned reputation engine.
The tl;dr of this excursion is that almost all goods are experience goods in that their value only becomes apparent when they are consumed, and the consumption harbors the possibility for surprise, positive or negative.
Whether this happens over a longer time span, as with driving a car, immediately, as with eating ice cream, or with belated effects, as when immediate consumption triggers a later toothache, depends on the circumstances.
But the canonical economic trade of a perfectly substitutable commodity of perfectly equal quality is a simplifying assumption resting on a lot of institutional underpinnings. Almost all trades, in the trad economy or the digital economy, contain an element of surprise, and in turn engage our propensity to shield ourselves from it, or to embrace it.

Cross-posted from Oliver’s Substack blog, EconPatterns
In the first four posts, I tried to map out an economy structured around the need to find out. This didn’t happen by accident, but is the result of spending a couple of decades in a realm where academic economic knowledge is held in little regard, in no small part because its gatekeepers like to give off an air of having it all figured out already, even when the circumstances make clear that this is rarely the case.
It doesn’t match my own opinion, but I perfectly understand when, say, the founders of a three-person startup bid adieu to their knowledge of academic economics when they learn that there is no such thing as a demand curve unless they put in the effort to assemble it piece by piece, transaction by transaction, price change by price change.
Most of them stop at this point and direct their attention to other, more pressing concerns, and I can’t blame them for it. The “need to find out” gets short shrift in most economics classes since economic instruction at universities generally starts from a vantage point where the groundwork has already been laid by wizards behind the curtain, and all that’s needed for mere mortals is to fine-tune the preconceived machinery.
That’s also a major reason why economists find employment within large machineries (government, big banks, and, increasingly, publicly listed tech firms), but are rarely ever in demand as one of the three founders of a recently minted startup with more enthusiasm than cash — or data.
This series tries to remedy that situation, and I could have subtitled it either “economics for startuppers” or “startup thinking for economists”, except the intended scope — and my intended audience — is a bit wider than that.
The underlying idea of “finding out” pursued in EconPatterns derives ultimately from Adam Smith’s gains from specialization, which drive the division of labor and which ultimately influenced another key contribution to economic lore, Friedrich Hayek’s Use of Knowledge in Society.
Hayek’s point was that there’s no point trying to steer the whole economy from a central vantage point because there is always someone somewhere closer to the ground, steeped in operational detail, who knows better, and can put that knowledge to better use than the central planner.
This idea that there is always local knowledge that is more detailed than the aggregated knowledge on the macro level, that there is knowledge that is nested, and that all participants have a mental map of the economy that is most detailed in their own vicinity and that degrades in detail, certainty, or precision, that resorts to using coarse-grained models, aggregates, or even stereotypes, the further one moves away from one’s own location, is deeply embedded in EconPatterns.
And this isn’t only true for the physical dimensions, it’s also true for the temporal dimension. Both the past and the future get hazy very quickly, and we resort to increasingly coarse-grained knowledge the further we go in each direction: Hayek’s “knowledge of the particular circumstances of time and place”.
There is an inevitable urge to remedy this shortcoming with the magic potion of “more transparency”. Every time we hear news about another supply chain pile-up, there is the inevitable stratum of pundits opining that this (the negative surprise, that is) could have been avoided if we just magically gave every participant a detailed map of the whole economy, or at least the whole chain of events — network really — leading to the participant’s problems stemming from the unexpected supply chain outage.
This is illusory of course to anyone attuned to the operational details of supply chains, not only because these pundits habitually underestimate by several orders of magnitude just how much operational raw data is out there, most of which is of no use to anyone but the data owner, but also because the countervailing demands of privacy and transparency (usually leading to the conundrum of each side demanding transparency from the other party while insisting on privacy for itself) will inevitably lead to privacy winning out, except in those cases where the more powerful actor can compel less powerful actors to disclose their secrets.
Designing a mechanism that orchestrates the conflicting information needs of the participants in a value chain or its mapping into the physical realm, the supply chain, is still a holy grail in operations and in economic trade, but in no small part because the reasons why such a governance mechanism is hard to come by are still poorly understood.
Finding this holy grail, and mapping out the path to its discovery, is of course the goal of this series. A starting point is to arrive at a better understanding of how knowledge disseminates thru an economy; where, when, and why it forms clusters (especially belief clusters); and how to interfere in that flow in a structured, goal-oriented way.
Just to offer a simple example, novices in the field of supply chain are often surprised to learn that the bill of lading, one of the crucial documents ensuring integrity of a product thruout its transit along ports, flights, shipments, loading and unloading, handovers and often rough handling, is still legally required to be in paper form, sent by courier from station to station.
A simple impulse is to blame an overbearing bureaucracy or an industry staunchly resistant to organizational change and technological progress, but an alternative and more plausible explanation is that paper solves a few integrity requirements that electronic communication still has a hard time solving.
When handovers and handshakes are still the literal thing involving actual hands, and signatures are still done by hand in the presence of the counterparty, we are solving a few problems about identity that turn out to be quite hard once we try to shift them online, into the digital domain, where ascertaining that an individual is who they claim to be can be exceedingly tricky.
It turns out this simple example repeats all over the place, in all kinds of domains and scenarios, with idiosyncratic details added but the underlying pattern staying the same.
This is why I will come back to that example again and again. Because that is what EconPatterns is about.

Categorical cybernetics, or CyberCat to its friends, is – no surprise – the application of methods of (applied) category theory to cybernetics. The “category theory” part is clear enough, but the term “cybernetics” is notoriously fluid, and throughout history has meant more or less whatever the writer wanted it to mean. So, let’s lay down some boundaries.
I first proposed CyberCat, both as a field and as a term, in this 2019 blog post (for which this one is partly an update). There I fixed a definition that I still like: cybernetics is the control theory of complex systems. That is, cybernetics is the interaction of control theory and systems theory.
We add to this applied category theory, which has some generic benefits. Most importantly, we get compositionality by default, and a more precise way of talking about it than in fields like machine learning, where it is present but informal. Compositionality also gets us halfway to computer implementation by default, by making our models similar to programs. Finally, category theory gives us a disciplined way to talk about interaction between models in different fields.
It turns out - and this fact is at the heart of CyberCat - that the category-theoretic study of control has a huge amount of overlap with things like learning and strategic analysis. Those were also historically part of cybernetics, and can be seen as aspects of control theory with a certain amount of squinting, so we also include them.
On top of that definition, a cultural aspect of the historical cybernetics movement that we want to retain is that cybernetics is inherently interdisciplinary. Cybernetics is not just the theory but the practice: in engineering, artificial intelligence, economics, ecology, political science, and anywhere else it might be useful. (Part of the reason we created the Institute – more on that in a future post – is to make this cross-cutting collaboration easier than in a university.)
Cybernetics has been an academic dirty word for many decades now: in the 60s and 70s it went through a hype cycle, things were over-claimed, and the field eventually fell apart. As founders of the CyberCat Institute we believe that the time is right to reclaim the word cybernetics. Apart from anything else, the word is just too cool not to use. More importantly, the objects of study – and the interdisciplinary approach to studying them – are even more important now than they were 50 years ago.
Having laid out what CyberCat could potentially be, I will now narrow the scope. At the Institute we are focussing not on just any applications of category theory to cybernetics, but on a small set of very closely interrelated tools. These are, roughly, things that have a family resemblance to open games.
This post isn’t the place to go into technical details, but what these things have in common is that they model bidirectional processes: they are processes (that is, they have an extent in time) in which some information appears to flow backwards (I described the idea in more detail in this post). The best known of these is backpropagation, whose backward pass is exactly such a backwards information flow. A key technical idea behind CyberCat is the observation that many other important processes in cybernetics have a lot in common with backprop, once you take the right perspective. The category-theoretic tool used to model these processes is optics.
Besides backprop, the things we have put on a uniform mathematical foundation using optics are value iteration, Bayesian inference, filtering, and the unnamed process that is the secret sauce of compositional game theory.
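To make the shape concrete, here is the simplest such gadget, a concrete lens, in plain Haskell. The framework itself uses more general optics, so take this as an illustration of the composition pattern rather than the actual definitions we use:

```haskell
-- The simplest bidirectional gadget: a forward ("get") pass and a
-- backward ("put") pass that may use the forward context.
data Lens s t a b = Lens
  { view   :: s -> a        -- forward pass
  , update :: s -> b -> t   -- backward pass
  }

-- Sequential composition: forward passes chain left-to-right,
-- backward passes chain right-to-left through the saved context.
compose :: Lens s t a b -> Lens a b x y -> Lens s t x y
compose outer inner = Lens
  { view   = view inner . view outer
  , update = \s y -> update outer s (update inner (view outer s) y)
  }
```

In backpropagation, view is the forward evaluation and update carries gradients backwards; in an open game, the backward pass carries payoff information instead. Same shape, different payload.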
This is the academic foundation that we start from. The question that comes next is, so what? How can this knowledge be exploited to solve actual problems? This is where the CyberCat Institute comes in, but I want to leave that for a future post. In the meantime, you can look at our projects page to see the kinds of things we are working on right now.

I thank Oliver Beige for many helpful comments.
Economic theory formulates thoughts via what we call “models.” The word model sounds more scientific than the word fable or tale, but I think we are talking about the same thing. (Ariel Rubinstein) 1
Are economic models useful for making decisions? One might expect that there is a clear answer to this simple question. But in fact opinions on the usefulness or non-usefulness of models as well as what exactly makes models useful vary widely - within the economic profession and of course even more so beyond. Sometimes the question feels like a Rorschach test - telling more about the person than about the subject.
In this post, I want to explore the question of usefulness. Even more so, I want to explore how the usefulness ties into the modelling process. The reason for doing so is simple: Part of our efforts at CyberCat is to build software tools to improve and accelerate the modelling process.
The importance of this is also evident: if models are useful, and we improve the process generating useful models, we improve our decision-making. And insofar as these improvements tie into computing technology, as in our opinion they do, the gains could be significant.
My question, "are economic models useful", is quite lofty. So, let's first do some unpacking.
What do I mean by economic model? A mathematical, formal model which relates to the domain of decision-making at hand. A prototypical example is a model that tells us how to bid in an auction. Such models are often classified as applied economic models.2
Why do I emphasize "economic"? If my question was: Are mathematical models useful for decision-making, the answer would be a simple yes and we could call it a day. Operations research models are in production for a multitude of tasks (job scheduling, inventory management, revenue management etc.). In fact, many of these models are so pervasive that it is easy to forget them. Just think about the business models that have been built on the navigation and prediction functionalities of Google maps.
The distinction between operations research and economics is obviously blurry, and more due to artificial academic barriers than to fundamental differences (check out Oliver's post on this). I am making the crude distinction that economic models are about several agents interacting - most often strategically - whereas traditional operations research models are focused on single decision-makers.
Now, this is crude because operations research by now also includes auctions and other models that are interactive in this way. Moreover, as Oliver pointed out in another post, several leading economists who advanced the practical use of economic models (to which we will come) have an operations research background.
It is, I think, also no coincidence that operations research has moved into the realm of interacting agents: due to globalization and in particular the internet, companies have become more interconnected and also have much more technical leverage. 50 years ago, the idea that a regular company could be designing its own market would have been quite a thing. Today, it is part of the standard startup toolkit.
Technology and interconnectedness are driving the need for models that help us decide in such a world, as well as design the frameworks and protocols in which decisions take place. Economic models are the natural candidate for this task.
Let's turn to the central part of my question. What do I mean by useful? Opinions on this vary widely. According to Rubinstein, the question of how a model can be useful is already ill-posed: models are not useful. Models might carry a lesson and can transform our thinking. But they are of little value for concrete decisions.
In economics, Rubinstein's position is an extreme point. On the other side of the extreme, economists and even more importantly computer scientists are working on market design and mechanism design models.3 Models in this spirit are "very" practical: they do affect decisions in a concrete sense - they get implemented in the form of algorithms and are embedded in software systems.
We can think of fables and algorithms as two ends of a spectrum - from basically irrelevant to decisive for a choice we have to make. While it is hard to precisely locate a given model on this "usefulness" line, we can consider how a model becomes more useful as it moves along the spectrum. Of course, what constitutes value, and who benefits how from a model, changes along this path as well. The usefulness of a model is a matter of degree, not an absolute.
Let's begin at the fable end and start moving inward. How can a model produce value? If we are faced with infinitely many ways to think about a situation, even a simplistic model can be valuable. It helps to focus and to select a few key dimensions. This aspect becomes even more important in an organizational context, where people have to collaborate and it is very easy to get lost in a myriad of possibilities and different interpretations.
Many classic games (in the game theory sense) like the Battle of the Sexes, Matching Pennies, and of course the Prisoners' Dilemma help to focus on key issues - for instance the interdependency between actions and their consequences. To be clear, the mapping from such a model to a concrete decision is very loose in this case, and the value of the model lies in the eyes of the analyst.
These games often focus on a few actions ("defect" or "cooperate"). Moreover, agents have perfect information about the consequences of their actions and the actions of others. In many situations, e.g. in business contexts, choices are more fine-grained and information is not perfect. Models in Industrial Organization routinely incorporate these aspects, for instance when analyzing competition between companies. From a practical perspective, these models often resemble the following pattern: if we had information X, the model would help us make a decision. Consider strategic pricing: it is standard in these models to assume demand to be known, or at least drawn from a known distribution. The demand curve will then typically be a smooth, mathematically well-behaved object. Such models can produce insights - no doubt about it.
But they rarely help to make a concrete decision, e.g. what prices to charge. There are many reasons for this but let me just give an obvious one as a co-founder of a startup: I would love to maximize a demand curve and price our services accordingly. But the reality is: I do not have a curve. Hell, if I am lucky I observe a handful of points (price-quantity combinations). But these points might not even be on any actual demand curve in the model's sense. So, while useful for structuring discussions around pricing, in the actual decision to set prices, the model is only one (possibly small) input. And this is very typical. Such models provide insights and do help to inform decisions. But they are only part of a collage of inputs into a decision.
There are economic models which do play a more dominant role in shaping decisions. Consider auctions. There is a rich theory that helps to choose a specific auction format to solve a given allocation problem. Still, even in this case, there are gaps between the model and the actual implementation, for instance when it comes to multi-unit auctions.
The examples I gave are obviously not meant to be exhaustive; there are other ways in which a model can be useful. But this is not so important. The main point is that all along the usefulness line, economic models can produce value. The question is not whether a model produces a choice but whether, at the margin, it helps us make better decisions. And this can happen all along the spectrum. Moreover, ceteris paribus, the further we move towards the algorithm end, the more influence the economic model gains relative to other inputs into a decision, and the more value it produces.
If we accept this, then an immediate question comes up: How can we push models from the fable side more towards the algorithm side? Let's explore this.
I first need to discuss how models get located on a specific point on the usefulness line in the first place. But this requires digging into the actual modelling process. Note again that I am only interested in "instrumental" modelling - models that are useful for a specific decision at hand. My exposition will be simplistic and subjective. I will neither cover the full range of opinions nor be grounded in any philosophical discussions of economics. This is just me describing how I see this (and also how I have used models in my work at 20squares).
Applied models in economics are a mixture of mathematical formalism and interpretative mapping connecting the internals of the model to the outside world. Mappings are not exclusive: The same formal structure can be mapped to different domains. The Prisoner's dilemma is such an example. It has various interpretations from two prisoners in separate cells to nuclear powers facing each other.
The formal, inner workings of models are "closed" objects. What do I mean by that? Each model describes a typically isolated mechanism, e.g. connecting a specific market design with some desirable properties. The formal model has no interfaces to the outside world. And therefore it cannot be connected to other models at the formal level. In that sense a model is a self-contained story.
Let me contrast this with a completely different domain: If one thinks about functional programming, then everything is about the composability of functions (modulo types). The whole point of programming is that one program (which is a function) can be composed with another program (which is a function) to produce a new program (which is a function).4
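In miniature, and with function names made up for the example: two independent programs fuse into a third by ordinary composition, provided the types line up.

```haskell
-- Two small, independent programs...
normalize :: [Double] -> [Double]
normalize xs = map (/ sum xs) xs

expectation :: [Double] -> Double
expectation ps = sum (zipWith (*) ps [0 ..])

-- ...and a new program obtained by composing them (modulo types).
meanIndex :: [Double] -> Double
meanIndex = expectation . normalize
```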
Back to economic models. When it comes to applications, the "right" model is not god-given. So, what does the process of modelling real-world phenomena look like?
As observed by Dani Rodrik5, the evolution of applied models in economics is different from the evolution of theories in physics. In physics one theory regularly supersedes another theory. In economics, the same rarely happens. The practice of modelling is rather about developing new models, like new stories, that then get added to the canon.
One can compare this to a library where each book stands for a model that was added at some point. Applied modelling then means mapping a concrete problem onto a model from the existing stock or, if something is missing, developing a new model and adding it to the canon.
Inherent in this process is the positioning of a model at a specific point on the spectrum between fables and algorithms. Models mostly take on a fixed position on the line and stay there. There are exogenous factors that influence the positioning, and they can change over time. For instance, the domain matters: if you build a model of an intergalactic trading institution, it is safe to assume that this model will not be directly useful. Of course, this might change.
Like stories, certain models get less fashionable over time, others become prominent for a while, and a select few stay evergreens. Economists studying financial crises in 2006 were not really standing in the spotlight of attention. That changed radically one year later.6
Let me emphasize another aspect. I depicted applied models as packages of internal, formal structure and interpretative maps connecting the internals with some outside phenomenon. This interpretative mapping is subjective. And indeed discussions in economic policy often do not focus on the internal consistency of models but instead are more about the adequateness of the model's mapping (and its assumptions) for the question at hand. Ultimately, this discourse is verbal and it is structurally not that different from deciding which story in the bible (or piece of literature, or movie) is the best representation of a specific decision problem.
The more a model leans towards the fable side, the more it will be just one piece in a larger puzzle, and the more other sources of information a decision-maker will seek. These might include other economic models, but of course also sources outside economics. Different models and other sources of information need to be integrated.
As a consequence, whatever powers we gain through the formal model, much is lost the moment we move beyond the model's inner workings and need to compare and select between different models, as well as integrate them with other sources. A synthesis at the formal level is not feasible.
Let me summarize so far: A model's position on the spectrum of fable to algorithm is mostly given. There is not much we can do to push a single model along. Moreover, we have no systematic way of synthesizing different models - which would be another possibility to advance along the spectrum.
We have been mostly concerned with the type of output the modelling process generates. Let's also briefly turn to the inputs. Modelling today is, by and large, not that different from 50 years ago. Sure, co-authorships have increased, computers are used, and papers circulate online. But in the end, the modelling process is still a slow, labor-intensive craft that demands a lot from the modeller. He or she needs knowledge of the domain, must be familiar with the canon of models, needs judgment to balance the tradeoffs involved in different models, etc.
This makes the modelling process costly. And it means we cannot brute force our way to push models from fable to algorithm. In fact, in the context of policy questions many economists like Dani Rodrik7 criticize the fact that discussions focus on a single model whereas a discussion would be more robust if it could be grounded in a collage of different models. But generating an adequate model is just very costly.8
Taken together, the nature of the model-generating process and its cost function are bottlenecks that we need to overcome if we want to transform the modelling process.
Let's go back to our (functional) programming domain to see an alternative paradigm. Here, we also rely on libraries. But the process of using them is markedly different. Sure, one can simply choose a program from a library and apply it. But one can also compose programs to form new, more powerful programs. One can synthesize different programs; and one can find better abstractions through the patterns of multiple programs which do similar things. Lastly, one can refine a program by adding details. And of course, if you consider statistical modelling, this modularity is already present in many software packages.
It is modularity which gives computing scalability. And it is this missing modularity which severely limits the scalability of economic modelling.
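To make the contrast concrete, here is a minimal sketch in plain Haskell (toy functions invented for illustration, not our engine's API) of the operations listed above: choosing library components, composing them into a new program, and abstracting their shared pattern.

parseSales :: String -> [Double]   -- a small "library" program
parseSales = map read . words

average :: [Double] -> Double      -- another library program
average xs = sum xs / fromIntegral (length xs)

forecast :: String -> Double       -- composition: a new program from the two
forecast = average . parseSales

-- abstraction: the shared "parse, then analyse" pattern, reusable and
-- refinable by swapping either component
pipeline :: (String -> a) -> (a -> b) -> (String -> b)
pipeline parse analyse = analyse . parse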
Consider the startup pricing example I gave before. Say I want to use a pricing model to compute prices but lack the demand information. What am I supposed to do? Right now, I am most likely forced to abandon the model altogether and choose a different framework instead.
What I would like to do instead is to have my model in a modular shape, so that I can add a "demand" module - maybe a sampling procedure or even just a heuristic - and combine it with my pricing optimization. The feature I want is a coherent path from low to high resolution.
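A hedged sketch of that modular shape (hypothetical names, not our engine's API): the optimisation is written against an abstract demand module, so a crude heuristic can later be swapped for a demand curve estimated from data without touching the pricing code.

type Demand = Double -> Double   -- price -> expected quantity

profit :: Demand -> Double -> Double -> Double
profit demand cost p = (p - cost) * demand p

-- the pricing optimisation, independent of which demand module is plugged in
bestPrice :: Demand -> Double -> Double
bestPrice demand cost =
  snd (maximum [ (profit demand cost p, p) | p <- [0, 0.1 .. 100] ])

heuristic :: Demand              -- low resolution: just a guess
heuristic p = max 0 (50 - p)

estimated :: [(Double, Double)] -> Demand   -- higher resolution: fit to observed data
estimated obs p =
  let near = [ q | (p', q) <- obs, abs (p' - p) < 5 ]
  in  if null near then 0 else sum near / fromIntegral (length near)

The path from low to high resolution is then just the move from bestPrice heuristic to bestPrice (estimated observations), with the optimiser untouched.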
The goal behind our research and engineering efforts is to lift economic modelling to this paradigm. Yet, we do not just want to compose software packages. We want an actual composition of economic models AND the software built on top.
In short, we want to turn the manual modelling process, which mostly relies on craft, experience and judgement, into a software engineering process. But not only that: we are aiming for a framework of synthesis in which formal mathematical models can be composed.
How should we go about this? At first sight it is totally unclear! Even more, the question does not even seem to make sense. It is a bit like asking how to multiply a story by Hemingway with a story by Marquez.9
Similarly, models in economics are independent and closed objects and generally do not compose. It is here where the "Cat" in CyberCat comes in. Category theory gives us a way to consider open systems and model them by default relative to an environment. It is this feature which allows us to even consider the composition of models - for instance the composition of game theoretic models we developed.
Another central feature that is enabled through category theory is the following paradigm:
model == code
That is, the formalism can be seamlessly translated back and forth between the model and an actual (software) implementation. Thereby, instead of modelling with pen and paper, modelling itself becomes programming. It is important to note that we do not just want to translate mathematical models into simulations; the code actually symbolically represents mathematical statements.
To summarize, category theory gives us a formal language of composable economic models which can be directly implemented.
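As a toy illustration (a hypothetical Cournot duopoly in plain Haskell, deliberately simplistic and not written in our engine), the value below is the model: the same symbolic object both states the economics and is executed to solve it.

-- a symmetric Cournot duopoly with inverse demand p(Q) = a - Q,
-- where a is the intercept, and constant unit cost c
data Duopoly = Duopoly { unitCost :: Double, intercept :: Double }

-- best response, read off symbolically from the first-order condition
-- of the profit (a - qi - qOther - c) * qi
bestResponse :: Duopoly -> Double -> Double
bestResponse m qOther = max 0 ((intercept m - unitCost m - qOther) / 2)

-- analysing the same object: the symmetric Nash equilibrium quantity,
-- computed by iterating best responses to a fixed point
equilibrium :: Duopoly -> Double
equilibrium m = go 0
  where
    go q = let q' = bestResponse m q
           in if abs (q' - q) < 1e-9 then q' else go q'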
Equipped with this foundation, we can approach the programming language design task of turning the modelling process into a process of software engineering.
Modelling as programming enables the iterative refinement of models. Whereas traditional models are not only closed but also dead wood (written on paper), under this paradigm models are more like living objects which can be (automatically) updated over time.
Instead of building a library of books, in our case the models become part of a software library, which means the overall environment becomes far more powerful over time as the ecosystem grows.
Composition also means division of labor. We can build models where parts are treated superficially at first and details get filled in later. This can mean more complexity, but most importantly it means that we can build consistent models that are extended, refined, and updated over time.
These aspects resemble similar attempts in mathematics and the use of proof assistants and verification systems more generally. Here is Terence Tao on these efforts10:
One thing that changed is the development of standard math libraries. Lean, in particular, has this massive project called mathlib. All the basic theorems of undergraduate mathematics, such as calculus and topology, and so forth, have one by one been put in this library. So people have already put in the work to get from the axioms to a reasonably high level. And the dream is to actually get [the libraries] to a graduate level of education. Then it will be much easier to formalize new fields [of mathematics]. There are also better ways to search because if you want to prove something, you have to be able to find the things that it already has confirmed to be true. So also the development of really smart search engines has been a major new development.
It also means different forms of collaboration between field experts and across traditional boundaries. Need a financial component in that traditional IO model? No problem, get a finance expert to write this part - a modern pin factory equivalent. See again Terence Tao11:
With formalization projects, what we’ve noticed is that you can collaborate with people who don’t understand the entire mathematics of the entire project, but they understand one tiny little piece. It’s like any modern device. No single person can build a computer on their own, mine all the metals and refine them, and then create the hardware and the software. We have all these specialists, and we have a big logistics supply chain, and eventually we can create a smartphone or whatever. Right now, in a mathematical collaboration, everyone has to know pretty much all the mathematics, and that is a stumbling block, as [Scholze] mentioned. But with these formalizations, it is possible to compartmentalize and contribute to a project only knowing a piece of it.
Lastly, the current developments in ML and AI favor the setup of our system. We can leverage the rapid development of ML and AI to improve the tooling on both ends of the pipeline: users are supported in setting up models, and solving or analysing models becomes easier.
The common thread behind all of our efforts is to boost the modelling process. The traditional process is manual, slow, and limited by domain expertise - in other words very expensive.
Our goal is to turn manual work into mass customizable production.
What I described so far is narrowly limited to economic modelling. Where is the "Cybernetics"?
First, I focused on the composability of economic models. But the principles of the categorical approach extend beyond this domain. This includes understanding how apparently distinct approaches share common structure (e.g. game theory and learning) and how different structures can be composed (building game-theoretic models on top of some underlying structure like networks). In short, we work towards a whole "theory stack".
Second, the software engineering process depicted above focuses very narrowly on extending the economic modelling process itself. But the same approach will mirror the theory stack, with software enabling analyses at each level.
Third, once we are operating in software, we gain the ability to leverage other software to support the modelling process. This follows pragmatic needs and can range from data analytics to LLMs.
A general challenge for decision-making is the hyper-specialization of expert knowledge. But as decisions become more and more interconnected, what is lacking is the ability to synthesize this knowledge. Just consider the decision-making of governments during the COVID-19 pandemic. For instance, in the decision to close schools, one cannot simply rely on a single group of domain experts (say, physicians). One needs to synthesize the outcomes of different models following different methodologies from different domains. We want to develop frameworks in which these tradeoffs can be articulated.
- Ariel Rubinstein, Economic Fables, Open Book Publishers, 2012, p. 16. ↩
- I will focus on micro-economic models. They are simply closest to my home base and relevant for my daily work. ↩
- The view on what economists do there is markedly different from Rubinstein's. Prominently Al Roth: The Economist as Engineer: Game Theory, Experimentation, and Computation as Tools for Design Economics. ↩
- And probably most importantly, functions themselves can be input to other functions. ↩
- Dani Rodrik, Economics Rules: The Rights and Wrongs of the Dismal Science, New York: W.W. Norton, 2015. ↩
- Of course, the classification of practical and non-practical is not exclusive to economics. Mathematics is full of examples of domains that were initially seen as without any practical use and turned out to be important later on. ↩
- Ibid. ↩
- In addition, if the modelling falls to academics, then their incentives also kick in. The chances of publishing a model on a subject that has already been tackled by a prominent model can be very low, in particular in the case of a null result. ↩
- We might of course come up with a way in which these two stories can be combined or compared. But this requires extra work; there is no generic operation to achieve it. These days we might ask an LLM to do so, and indeed this might be a useful future direction to support the process. ↩
- Quoted from this interview. ↩
- Ibid. ↩

Cross-posted from Oliver’s EconPatterns blog
There’s a non-zero chance that sometime in the not-so-far future we will think of the “Bayesian Revolution” in the same way we think of the “Marginal Revolution”.
Bayesian beliefs give us the opportunity to think of expectations as an attribute linked to the observer rather than to the observed object, in the same way utilities gave us an opportunity to accept that value is not an attribute intrinsic to an object, but exists only in the eye of the beholder.
Much of modern economics rests on the recognition that differences in valuation create opportunities for mutually beneficial — and thus voluntary — exchange.
We’re still a few steps away from translating that recognition to differences in expectations, in no small part because most of the current effort seems to go into shoehorning Bayesian statistics into the domains dominated by trad statistics (aka frequentism). But on one dimension we are inching towards a better understanding of subjective probabilities: it is slowly dawning on us that there might be perfectly legitimate reasons why different individuals might attach different likelihoods to the same event — and even more, why their estimates might diverge over time.
The operative word being “legitimate” here, since if we start from the idea that whichever event we’re only partially observing was suddenly revealed to us in full, a persistent divergence in beliefs about this event surely means that at least one party must be dead wrong.
And even if we’re allowing that there might be multiple paths to the ultimate objective truth, if one party is wrong, it must surely be shown up in the future.
Economists love to use sports metaphors when teaching statistics, simply because sport is a realm where the outcome is recorded right at the conclusion of the contest, from a single authoritative source, without ambiguity.
As a teaching device this is perfectly understandable, in that it takes away a distraction which seems peripheral to the topic, but ultimately every practitioner will run into the problem that such an unfailing, immediate, and impartial single source of truth is quite rare in the real world.
(Indeed once we’re getting into the nitty-gritty of how the outcomes of sports events are tallied, it’s quickly becoming obvious that the “unfailing, immediate, and impartial” umpire is mostly a figment of our imagination. From photo finishes to disqualifications, not to mention accusations of favoritism, the idea that the true value of a random variable can suddenly be disclosed with finality is a popular ploy in economics that doesn’t even have a realistic foil in sports. Economic theory might get away with such a foreshortening of reality, but economic design doesn’t.)
We can go one step further and claim that for most uncertain events, meaning all events that are only partially observable, there might not be an underlying true value, no ground truth, at all. We’re eternally in limbo about what the underlying “true value” is, simply because there is no moment of truth.
In most scenarios the truth remains elusive at the end of a lengthy, costly, meandering, and conflicted discovery process, and we tend to swap in “truth” in the computer science meaning of “(single) source of truth” (that which is held to be true at any stage of the discovery process) as a proxy for the “ground truth” (that which would be held to be true at the end of a fully exhaustive discovery process).
In either form, whether there’s no unimpeachable source of truth or no underlying ground truth, once we make that leap to accepting that the truth is elusive, we suddenly gain access to a far richer world of behaviors, especially collective behaviors.
As a pattern, it means to start from the assumption that observed values are never perfectly true, true values are never perfectly observed, and truthfinding is an asymptotic reconciliation process between conflicting beliefs that’s not guaranteed to terminate in finite time.
I have previously called organizations tectonic plates shaped around shared beliefs broken up by fault lines where beliefs — especially beliefs about the future and the likely outcomes of actions taken — are no longer reconcilable.
In this series I am going to be a bit more precise in working out how belief convergence or divergence shapes the coordinated sequence of activities, within and across organizations, that we call a value chain. In particular, how these concepts can be applied across domains: social, political, cultural, religious, with the economic realm just a special case.
One of the first lessons of organization theory is that organizations are paradoxes: they can, under the best conditions, produce more collectively than the sum of the individual efforts of all members, but this requires that individual efforts be constrained away from what the individuals would do if left to their own devices. In other words, to achieve an organizational goal (almost) everyone will have to compromise.
The other lesson for organizations is that they take two steps: first, the aggregation of efforts, and second, the disaggregation of gains. Both have to work in the eyes of the beholders making up the organization for it to work, especially when efforts are exerted in the present but rewards are only collected and distributed in the future. If the shared belief that individual contributions are rewarded beyond what participants could expect outside the organization does not hold, the organization will either disintegrate or have to be propped up by force.
Nothing up to this point has limited what type of organization we’re talking about here. Simply put, we’ll find this type of pattern in any kind of organization: economic, religious, or political, flat or hierarchical, although we expect the kinds of rewards to differ.
Along with the three underlying sources of goal conflict (moving forward vs staying together, moving forward vs staying put, moving in which direction), this gives us a grid to think about the sources of conflict between the individual and the group, and later between ingroup and outgroup; to think about why organizations are shaped the way they are (flat or hierarchical, open or closed…); and why particular forms of organization are prevalent in particular environments: why firms are organized differently than political parties, sports clubs, or religious congregations.
So far the discussion has looked at shared beliefs from the vantage point of the (formally incorporated) organization. In my previous post, I looked at the question from the perspective of value chains. But what about starting with the individual?
Rather than asking how organizations fail due to a breakdown of shared beliefs, we might as well ask how organizations emerge as a consequence of shared beliefs.
Luckily, most people have heard of filter bubbles, the subgroups on social media channels created (and algorithmically amplified) by positive interaction with like-minded people, and negative interaction with other-minded people.
This is a fundamental economic and social mechanism (formalized in my dissertation before social media or filter bubbles were a thing): not only do we socialize with like-minded people, we especially filter incoming information based on whether the sender holds shared beliefs on prior issues, and we adjust the credibility we assign a sender based on how much the newly transmitted information matches our priors.
If you take this filtering-by-affinity mechanism and impose it on the network structure of the cybernetic economy, you get an information flow pattern known as a “gossip network”.
It is such a ubiquitous mechanism that we find it everywhere, on all levels of social and economic organization, but it also comes across as slightly unevolutionary. If we want to fully understand the hazards of our environment, we should take in all information from all sides and not just amplify the information that confirms what we think we already know. But we don’t. We filter — which makes sense — but we filter out what we don’t want to hear, not what we shouldn’t hear — which doesn’t make sense.
The interesting thing is that we know much more about these mechanisms now thanks to the internet, and especially thanks to recommender systems: Amazon uses them to guess what else we want to buy, Spotify offers new music to discover, and Tinder believes matching social preferences translate into a good romantic match.
The reason why recommender systems emerged in the early days of the internet is also tightly connected to Amazon, and later to the resurgence of machine learning thanks to the Netflix prize: the recognition that there is an unsurveyably large number of choice alternatives, and the inevitable corollary that our notion of consumers having well-informed preferences between them — an axiom that undergirds modern microeconomics — is no longer tenable.
Recommender systems have become one of the fundamental economic design templates, and in the process have reshaped economics (and in the case of matchmaking platforms, also our social lives), not only because they provide the raw data for much more fine-tuned consumer choice, but also because they give us a deeper insight into the evolution of preferences.
But there is also a deeper insight which comes from the information good that is being transferred. Preferences are in our economic understanding unimpeachable. They express individual tastes and are as such not rankable by social desirability, as much as we might want to impose our own, undoubtedly (in our minds) more sophisticated, tastes on others, or construct an argument why some tastes are more in tune with a more or less well-defined social good than others.
At the other extreme of the spectrum are mutually agreed-upon ground truths, better known as facts. In between these extremes of preferences (to which no truth value can be assigned) and facts (with an undisputed common truth value aligned with ground truth) lies the wide world of counterfactuals: states of the world to which we assign (and, as new knowledge emerges, adjust) truth values between zero and one. This world of counterfactuals inevitably involves the future.
This is a pattern we find everywhere, not only in the economic realm and its various subrealms like entrepreneurship (high individual belief in the success of a venture opportunity countering widespread low beliefs), but also in the political and social realms. We can connect this to Thomas Kuhn’s model of scientific inquiry and Ludwik Fleck’s harmony of illusions, or to the geographic expansion of religion, language, pottery design, agriculture, or any idea based on shared counterfactuals.
The search pattern: avoidance of negative surprise, I’ve already discussed in a previous newsletter. The canonical conflict resolution mechanisms: markets or hierarchies in the economic realm, wars or elections in the political realm, are typically tied to the fundamental pattern of exchange in that realm, as I will discuss in a future newsletter.

tl;dr - Advanced AI making economic decisions in supply chains and markets creates poorly-understood risks, especially by undermining the fundamental concept of individuality of agents. We propose to research these risks by building and simulating models.
For many years, AI has been routinely used for economic decision making. Two major roles it has traditionally played are high-frequency trading and algorithmic pricing. Traditionally these algorithms are quite simple, at the level of tabular Q-learning agents. Even these comparatively simple algorithms can behave in unexpected ways due to emergent interactions in an economic environment. Probably the most infamous of these events was the 2010 flash crash, for which algorithmic high-speed trading was a major contributing cause. Much less well known is the subtle issue of implicit collusion in pricing algorithms, which are ubiquitous in several markets such as airline tickets and Amazon: a widely cited 2020 paper found that even very simple tabular Q-learning will converge to prices higher than the Nash equilibrium price - but our research found that this depends sensitively on the exact method of training, and the effect vanishes when the algorithms are trained independently in simulated markets.
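For reference, “tabular” means the agent maintains an explicit table of action values Q(s, a), where the state s can encode the competitors’ last posted prices and the action a is the agent’s own next price; after each period the table is updated by the standard Q-learning rule

Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)

where r is the realised profit, \alpha the learning rate and \gamma the discount factor. (The pricing interpretation of s, a and r is our framing; the update rule itself is standard.)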
Besides markets, AI is also already used for making decisions in supply chains (see for example [1 2 3 4]), and surely will be more so in the future. Contemporary supply chains are extraordinarily complex. A typical modern technology product can have hundreds of thousands of components sourced from ten thousand suppliers across half a dozen tiers, which need to be shipped across the globe to the final assembly. A single five-dollar part can stop an assembly line, which in industries like automotive can cost millions per hour of downtime. The worst type of inventory a company can carry is a 99.9% finished product it cannot sell.

Over time, supply chains have been hyper-optimised at the expense of integrity, so that a metaphorical perfect storm in the shape of an Icelandic volcano named Eyjafjallajökull erupting, or a container ship named Ever Given getting stuck in the Suez Canal, causes massive disruption that inevitably leads to delayed goods, spoiled perishables, lawsuits and contested insurance claims easily in the ten digits. The COVID-19 pandemic was a business school case for all the types of havoc supply chain disruptions can wreak, oscillating wildly from not enough containers to too many containers in port, obstructing the handling of cargo, and from COVID-related work shutdowns in China to sudden shifts in consumer behavior in Western countries, leading to layoffs in hospitality industries and labour shortages in production and transportation.

Beyond these knock-on effects, which can explode planning horizons for procurement and shift the delicate power balance from buyer to supplier, another major problem in supply chains is the knock-off effect: fashion brands and pharmaceutical companies alike fight the problem of counterfeit products being introduced into the supply chain when no one is looking, leading to multi-million dollar losses along with reputational damage and, especially in pharmaceuticals, posing a hazard to health and life for many. Supply chain integrity crucially depends on transparency across a multitude of participants who are typically less than eager to share confidential data.
Moving forward from these events, the delicate tradeoff between efficiency and integrity is a perfect use-case for the integrated and inter-connected decision-making that is afforded by AI.
This brings us to the issue of economic decisions being deferred to large language models such as GPT-4. The well-known examples are not “natively economic”, but many people are adapting transformer architectures to operate on various types of data besides linguistic data, and it is only a matter of time before there are “economics LLMs”. In the meantime, GPT is entirely capable of making economic decisions with the right prompting - although virtually nothing is known about its performance on these types of tasks. We do not recommend using GPT to make investment decisions for you, but we expect it to become widespread anyway, if it isn’t already. Similarly, we expect large parts of complex supply chains to be almost entirely deferred to AI, extending the existing automation and its associated benefits and risks.
The traditional (tabular Q-learning) and contemporary (LLMs) situations are very different in many ways, but they have a subtle and crucial point in common. This is that decisions that look independent are secretly connected. There are two ways this could happen: one is that human decision-makers defer to off-the-shelf software that comes from the same upstream supplier - as is the case for algorithmic pricing in the airline industry for example. The other is that there really is a single instance of the AI system in the world and everybody is calling into it - as is the case with GPT.
For off-the-shelf implementations of tabular Q-learning for algorithmic pricing, there is some evidence that having a single upstream supplier has a significant impact on the behaviour of the market, and this is something that regulators are actively investigating. For LLMs virtually nothing is known, but we expect that the situation is worse. At the very least, the situation will certainly be more unpredictable, and we expect the compounding of implicit biases to be worse as these systems become ubiquitous and deeply embedded into decision-making. We plan to research this, by building economic simulations where decisions are made by advanced AIs and studying their behaviour.
A further possibility is more hypothetical, but we expect it to become a reality within the next few years. Right now the technology behind large language models - generative transformers - mainly operates on textual data, but it is actively being adapted for other types of data, and for other tasks besides text generation. Making economic decisions is very similar to playing games, and so there is an obvious analogy to the wildly successful application of deep reinforcement learning to strategically complex game playing tasks such as Go and StarCraft 2 by DeepMind. Combining this with generative transformer architectures could be immensely powerful, and it is not hard to believe such a system could surpass human performance on the task of economic decision-making.
Compositional game theory - a technology that we developed and implemented - is currently the state of the art for implementing complex meso-scale microeconomic models. The way things are traditionally done, models are written first in mathematics and later converted into computational models in general-purpose languages (traditionally Fortran, but increasingly in modern languages such as Python), a process that is very slow and very prone to introducing hard-to-detect errors. We use a model-is-code paradigm, where both the mathematical and computational languages are modified to bring them very close to each other - most commonly we build our models directly in code, with a clean separation of concerns between the economic and computational parts. Our models are not inherently more accurate, but they are 2 orders of magnitude faster and cheaper to build, and this unlocks our secret weapon: rapid prototyping of models. By iterating quickly, and continuously obtaining feedback from data and stakeholders, we reach a better model than could be built monolithically.
Why do we want to build these models? The bigger picture is, we want to inform the discussion about regulation of AI. This discussion is already widespread at the highest level of governments around the world, but is currently heavily lacking in evidence one way or the other. There’s a good reason for this: the domain of LLMs is language, and it is extremely difficult to make convincing predictions about the possible harms that can happen mediated by linguistic communication. More restricted domains, such as the behaviours of API bots, are easier to reason about. We have identified the general realm of economic decision-making as a critically under-explored part of the general AI safety question, which our tools are well-placed to explore through modelling and simulations.
Our implementation of compositional game theory allows modularly switching the algorithm that each player uses for making decisions. Normally when doing applied game theory we use a Monte Carlo optimiser for every player. But we also have a version that calls a Python implementation of Q-learning over a web socket. We could also easily switch it to calls to an open source LLM, or API calls to a GPT API bot or similar.
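A hedged sketch of what this pluggability can look like (hypothetical types, not the engine’s actual interface): each player carries a decision procedure, and switching from the built-in optimiser to an external learner or an LLM changes a single constructor.

type Payoff act = act -> IO Double

data Decider obs act
  = Optimiser ([act] -> Payoff act -> IO act)  -- e.g. a Monte Carlo search
  | External  (obs -> IO act)                  -- e.g. a Q-learner over a web socket, or an LLM API call

decide :: Decider obs act -> obs -> [act] -> Payoff act -> IO act
decide (Optimiser opt) _   acts payoff = opt acts payoff
decide (External ask)  obs _    _      = ask obs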
What’s more, this is emphatically not a mere hack that we bolt on top of game theory. At the core of our whole approach is our discovery, as seen in this paper, that the foundations of compositional game theory and several branches of machine learning are extremely closely related - this common foundation is what we call categorical cybernetics. It is what guides us and tells us that what we are doing is really meaningful. More than that, though, it opens a realistic possibility that we can know qualitative things about the behaviour of AIs making economic decisions, a much higher level of confidence than making inferences from simulation results. And when it comes to informing the discussion on regulation when the stakes are as high as they are, more certainty is always better.
So far we have focussed on the negative accidental impacts AI is likely to have on markets and supply chains, where AIs perform their intended purpose locally but interact in unforeseen ways. This is already concerning, but there is another side to the issue. What if decisions that should be independent are made by a single AI that has “gone rogue”, i.e. has a goal that is not the intended one? Depending on your personal assessment of the likelihood of this situation, you could read this section as a fun thought experiment or as a warning.
Being handed direct control of markets and supply chains gives perhaps the most powerful leverage over the physical world that an AI could have. Since it can collude with itself, it can easily create behaviours that would never be possible when decisions are made by agents that are independent and at least somewhat rational.
By far the most straightforward outcome of this situation is chaos. Markets and supply chains are so deeply interconnected that it would take very little intentional damage to create a recession deep enough to bring society to its knees. However, by virtually destroying the institutions that it controls, the AI makes this a one-time event which, while extremely bad, would be recoverable for humanity as a whole.
Much worse would be the ability of a rogue AI to subtly direct real-world resources towards a secret goal of its own over a long period of time. It isn’t hypothetical that complex supply chains can easily hide parts of themselves: consider how widespread modern slavery is in the supply chains of consumer electronics, or how the US government secretly procured the resources needed to build the first nuclear weapons at a time when supply chains were much simpler.
Exactly how extensive the risks of allowing AIs to interact with economic systems are is debatable, and the scenarios described in the previous section are hypothetical. However, it is undeniable that some serious risks do exist, including already-observed events such as flash crashes and implicit collusion. We have identified that the specific factor of decision-makers using the same upstream provider of decision-making software leads to poorly-understood emergent behaviours of supply chains and markets.
Our theoretical framework, compositional game theory, and our implementation of it, the open game engine, are the perfect tools for building and simulating models of economic situations with AI decision-makers. The goal of creating these models is to produce evidence leading to a better-informed debate on issues around the regulation of AI.

It’s been a busy few weeks in the world of category theory for deep learning. First of all came the preprint Categorical Deep Learning: An Algebraic Theory of Architectures from authors at Symbolica and DeepMind, including our friend Bruno. And then, hot on the heels of the paper, Symbolica raised a big investment round based largely on applications of the ideas in the paper.
The paper is about structured learning, and it proposes a big generalisation of geometric deep learning, which is itself a big generalisation of convolutional networks. The general idea is that the data processed by a neural network is not just random data but the vectorisation of data coming from some real-world domain. If your vectors encode an image, then there is implicit geometry inherited from the physical world. Geometric deep learning is all about designing architectures that encode geometric invariants of data, specifically in the form of invariant group actions a la Klein’s Erlangen programme.
What the paper points out is that the whole of geometric deep learning can be massively generalised from group actions to arbitrary (co)algebras of functors and (co)monads. From there you can easily re-specialise for specific applications. For example, if your training data is vectorisation of source code of a programming language, you can encode the structure of that language’s source grammar into your architecture in a virtually mechanical way.
Suffice it to say, I’m very excited about this idea. This could be a watershed moment for applied category theory in general, and it happens to be something that’s right next door to us - the paper heavily uses categories of parametrised morphisms, one of the two building blocks of categorical cybernetics.
The first thought I had when I read the paper was invariant preferences. A real AI system is not something that exists in isolation but is something that interacts in some way with the world around it. Even if it’s not a direct “intentional” action such as a robot actuator, the information flow from the AI to the outside world is some kind of action, making the AI an agent. For example, ChatGPT is an agent that acts by responding to user prompts.
Intelligent agents who act can have preferences, the most fundamental structure of decision theory and perhaps also microeconomics. In full generality, “having preferences” means selecting actions in order to bring about certain states of the world and avoid others. Philosophical intention is not strictly required: preferences could have been imposed by the system’s designer or user, one extreme case being a thermostat. AI systems that act on an external world are the general topic of reinforcement learning (although some definitions of RL are too strict for our purposes here).
This gave me a future vision of AI safety where neural network architectures have been designed upfront to statically guarantee (ie. in a way that can be mathematically proven) that the learned system will act in a way that conforms to preferences chosen by the system designer. This is in contrast to, and in practice complements, most approaches to AI safety that involve supervision, interpretation, or “dynamic constraint” of a deployed system - making it the very first line of an overall defense in depth strategy.
A system whose architecture has invariant preferences will act in a way to bring about or avoid certain states of the world, no matter what it learns. A lot of people have already put a lot of thought into the issue of “good and bad world-states”, including very gnarly issues of how to agree on what they should be - what I’m proposing is a technological missing link, how to bridge from that level of abstraction to low-level neural network architectures.
This post is essentially a pitch for this research project, which as of right now we don’t have funding to do. We would have to begin with a deep study of the relationship between preference (the thing that actions optimise) and loss (the thing that machine learning optimises). This is a crossover that already exists: for example in the connection between softmax and Boltzmann distributions, where thermodynamics and entropy enter the picture uninvited yet again. But going forward I expect that categorical cybernetics, which has already built multiple new bridges between all of the involved fields (see this picture that I sketched a year ago), is going to have a lot to say about this, and we’re going to listen carefully to it.
There are a few category-theoretic things I already have to say, but this post isn’t the best place for them. To give a hint: I suspect that preferences should be coalgebraic rather than algebraic according to the structural invariant learning machinery, because they describe the output of a neural network, as opposed to things like geometric invariants, which describe the input.
The thing that will stop this being easy is that in a world of incomplete information, such as the real world, agents with preferences can only act with respect to their internal model of the outside world. If we’re relying on invariant preferences for safety, they can only be as safe as the agent’s internal model is accurate. We would also have to worry about things like the agent systematically deceiving itself for long-term gain, as many humans do. The good news is that practitioners of RL have spent a long time working on the exact issue of accurately learning world-models, the first step being off-policy algorithms that decouple exploration (ie. world-model learning) from exploitation (ie. optimisation of rewards).
There is also an alternative possibility of manually imposing a human-engineered world-model rather than allowing the agent to learn it. This would be an absolutely monumental task of industrial-scale ontology, but it’s a big part of what Davidad’s project at the UK’s new ARIA agency aims to do. Personally I’m more bullish on learning world-models by provably-accurate RL at the required scale, but your mileage may vary, and in any case invariant preferences will be needed either way.
To wrap up: this is a project we’re thinking about and pursuing funding to actively work on. The “Algebraic Theory of Architecture” paper only dropped a few weeks ago as I’m writing this and opens up a whole world of new possibilities, of which invariant preferences is only one, and we want to strike while the iron is still hot.

# Archive of category 'functional programming'
- Apr 15, 2024
•
Building a Neural Network from First Principles using Free Categories and Para(Optic)

# Archive of category 'reinforcement learning'
- Feb 6, 2024
•
Passive Inference is Compositional, Active Inference is Emergent

# Archive of category 'open games'
- Apr 22, 2024
•
The Build Your Own Open Games Engine Bootcamp — Part I: Lenses

# Archive of category 'AI safety'
- Dec 11, 2023
•
AI Safety Meets Value Chain Integrity

# Institute for Categorical Cybernetics
Our mission is to develop theory and software for governing systems that learn and make decisions, for the benefit of their users and of humanity.
-
Dependent lenses are useful for general-purpose programming, but in which way exactly? This post demonstrates the use of dependent lenses as input/output-conversion processes, using parsing and error location reporting as a driving example.
-
In which we discuss how knowledge travels through the economy, and how, when and where it forms clusters.
-
In Towards Foundations of Categorical Cybernetics we built a category whose objects are selection functions and whose morphisms are lenses. It was a key step in how we justified open games in that paper: they're just parametrised lenses weighted by selection functions. In this post I'll show that by adding dependent types and stirring, we can get a nicer category that does the same job but has all colimits, and comes extremely close to having all limits. Fair warning: this post assumes quite a bit of category-theoretic background.
-
In which we describe organization and organizations as tectonic plates shaped by clashing beliefs.
-
A system whose architecture has invariant preferences will act in a way to bring about or avoid certain states of the world, no matter what it learns. A lot of people have already put a lot of thought into the issue of good and bad world-states, including very gnarly issues of how to agree on what they should be - what I'm proposing is a technological missing link, how to bridge from that level of abstraction to low-level neural network architectures.
-
In which we establish an underlying model for human behavior and claim that all economies are just a variation of the attention economy.
-
An Economic Pattern Language (@econpatterns for short) takes the economy and disassembles it into its constituent parts. But first, this blog post describes the economy as a whole.
-
In this post I'll describe the theory of how to add iteration to categories of optics. Iteration is required for almost all applications of categorical cybernetics beyond game theory, and is something we've been handling only semi-formally for some time. The only tool we need is already one we have inside the categorical cybernetics framework: parametrisation weighted by a lax monoidal functor. I'll end with a conjecture that this is an instance of a general procedure to force states in a symmetric monoidal category.
-
This post is a writeup of a talk I gave at the Causal Cognition in Humans and Machines workshop in Oxford, about some work in progress I have with Toby Smithe. To a large extent this is my take on the theoretical work in Toby's PhD thesis, with the emphasis shifted from category theory and neuroscience to numerical computation and AI. In the last section I will outline my proposal for how to build AGI.
-
Suppose your name is x and you have a very important state machine that you cherish with all your heart. Because you love this state machine so much, you don't want it to malfunction and you have a subset which you consider to be safe. If your state machine ever leaves this safe space you are in big trouble so you ask the following question.
-
Advanced AI making economic decisions in supply chains and markets creates poorly-understood risks, especially by undermining the fundamental concept of individuality of agents. We propose to research these risks by building and simulating models.
-
This is a short summary of the post. It is meant to explain how to write for our blog.
-
Some time ago, in a previous blog post, we introduced our software engine for game theoretic modelling. In this post, we expand more on how to apply the engine to use cases relevant for the Ethereum ecosystem. We will consider an analysis of a simplified staking protocol. Our focus will be on compositionality – what this means from the perspective of representing protocols and from the perspective of analyzing protocols.
-
Categorical cybernetics, or CyberCat to its friends, is – no surprise – the application of methods of (applied) category theory to cybernetics. The "category theory" part is clear enough, but the term "cybernetics" is notoriously fluid, and throughout history has meant more or less whatever the writer wanted it to mean. So, let’s lay down some boundaries.
In this post I’ll describe the theory of how to add iteration to categories of optics. Iteration is required for almost all applications of categorical cybernetics beyond game theory, and is something we’ve been handling only semi-formally for some time. The only tool we need is already one we have inside the categorical cybernetics framework: parametrisation weighted by a lax monoidal functor. I’ll end with a conjecture that this is an instance of a general procedure to force states in a symmetric monoidal category.
This post is strongly inspired by the account of Moore machines in David Jaz Myers’ book Categorical Systems Theory, and Matteo’s enthusiasm for it. There’s probably a big connection to things like Delayed trace categories, but I don’t understand it yet.
The diagrams in this post are made with Quiver and Tangle.
For the purposes of this post, we’ll be working with a symmetric monoidal category \mathcal C, and the category \mathbf{Optic} (\mathcal C) of monoidal optics over it. Objects of \mathbf{Optic} (\mathcal C) are pairs of objects of \mathcal C, and morphisms are given by the coend formula
\mathbf{Optic} (\mathcal C) \left( \binom{X}{X'}, \binom{Y}{Y'} \right) = \int_{M : \mathcal C} \mathcal C (X, M \otimes Y) \times \mathcal C (M \otimes Y', X')
which amounts to saying that an optic \binom{X}{X'} \to \binom{Y}{Y'} is an equivalence class of triples
(M : \mathcal C, f : X \to M \otimes Y, f' : M \otimes Y' \to X')
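For readers who prefer code, this has the familiar existential encoding in Haskell (a standard sketch, specialised to \mathcal C = \mathbf{Set} with the cartesian product as tensor):

{-# LANGUAGE ExistentialQuantification #-}

-- an optic is a residual m together with a forwards and a backwards pass,
-- considered up to sliding morphisms along m
data ConcreteOptic x x' y y' = forall m. ConcreteOptic (x -> (m, y)) ((m, y') -> x')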
I’m pretty sure everything in this post works for other categories of bidirectional processes such as mixed optics and dependent lenses; this is just a setting to write it down which is both convenient and not at all obvious.
The iteration functor is a functor \mathrm{Iter} : \mathbf{Optic} (\mathcal C) \to \mathbf{Set} defined on objects by
\mathrm{Iter} \binom{X}{X'} = \int_{M : \mathcal C} \mathcal C (I, M \otimes X) \times \mathcal C (M \otimes X', M \otimes X)
We refer to elements of \mathrm{Iter} \binom{X}{X'} as iteration data for \binom{X}{X'}. We call the object M the state space, the morphism x_0 : I \to M \otimes X the initial state and the morphism i : M \otimes X' \to M \otimes X the iterator.
Note that in the common case that \mathcal C is cartesian monoidal, we can eliminate the coend to obtain a simpler characterisation:
\mathrm{Iter} \binom{X}{X'} = \mathcal C (1, X) \times \mathcal C (X', X)
Given an optic f : \binom{X}{X'} \to \binom{Y}{Y'} given by f = (N, f : X \to N \otimes Y, f' : N \otimes Y' \to X'), we get a function
\mathrm{Iter} (f) : \mathrm{Iter} \binom{X}{X'} \to \mathrm{Iter} \binom{Y}{Y'}
Namely, the state space is M \otimes N, the initial state is
I \overset{x_0}\longrightarrow M \otimes X \xrightarrow{M \otimes f} M \otimes N \otimes Y
and the iterator is
M \otimes N \otimes Y' \xrightarrow{M \otimes f'} M \otimes X' \overset{i}\longrightarrow M \otimes X \xrightarrow{M \otimes f} M \otimes N \otimes Y
This is evidently functorial. Funnily enough, although the action of \mathrm{Iter} on objects when \mathcal C is cartesian is easier to understand, its action on morphisms is less obvious and is not evidently functorial, instead demanding a small proof.
We have an existing functor K : \mathbf{Optic} (\mathcal C)^{\mathrm{op}} \to \mathbf{Set}, given on objects by K \binom{X}{X'} = \mathcal C (X, X'). This is the continuation functor, and it is the contravariant functor represented by the monoidal unit \binom{I}{I}. (This functor first appeared in Morphisms of Open Games.)
For the remainder of this section I’ll specialise to the case \mathcal C = \mathbf{Set}, in which case an optic \binom{X}{X'} \to \binom{Y}{Y'} is determined by a pair of functions f : X \to Y and f' : X \times Y' \to X', and iteration data i : \mathrm{Iter} \binom{X}{X'} is determined by an initial value x_0 : X and a function i : X' \to X.
Given iteration data and a continuation that agree on their common boundary, we know enough to run the iteration and produce an infinite stream of values:
\left< - | - \right> : \mathrm{Iter} \binom{X}{X'} \times K \binom{X}{X'} \to X^\omega
Namely, this stream is defined corecursively by
\left< x_0, i | k \right> = x_0 : \left< i (k (x_0)), i | k \right>
This operation is natural (technically, dinatural): for any iteration data i : \mathrm{Iter} \binom{X}{X'}, optic f : \binom{X}{X'} \to \binom{Y}{Y'} and continuation k : K \binom{Y}{Y'}, we have
\left< i | K (f) (k) \right> = f^\omega \left( \left< \mathrm{Iter} (f) (i) | k \right> \right)
where f^\omega (-) : X^\omega \to Y^\omega means applying the forwards pass of f to every element of the stream.
Here’s a tiny implementation of the iteration functor and the pairing operator in Haskell:
{-# LANGUAGE RankNTypes #-}
import Control.Lens

-- iteration data in the cartesian case: an initial state and an iterator
data Iterator s t = Iterator {
  initialState :: s,
  updateState :: t -> s
}

-- the action of Iter on morphisms: push iteration data forwards along a lens
mapIterator :: Lens s t a b -> Iterator s t -> Iterator a b
mapIterator l (Iterator s f) = Iterator (s ^# l) (\b -> (f (s & l .~ b)) ^# l)

-- pair with a continuation to produce the infinite stream of states
runIterator :: Iterator s t -> Lens s t () () -> [s]
runIterator (Iterator s f) l = s : runIterator (Iterator (f (s & l .~ ())) f) l
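For instance (a hypothetical usage example, with the same imports in scope), a counter whose boundary lens simply discards the observation:

-- iteration data with state space Int: start at 0, increment each step
counter :: Iterator Int Int
counter = Iterator { initialState = 0, updateState = (+ 1) }

-- the trivial boundary lens: observe (), write back () without change
trivialK :: Lens Int Int () ()
trivialK k s = s <$ k ()

-- take 5 (runIterator counter trivialK) == [0,1,2,3,4]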
The next step is to form the category of elements \int \mathrm{Iter}, also known as the discrete Grothendieck construction. This is a category whose objects are tuples \left( \binom{X}{X'}, i \right) of an object \binom{X}{X'} of \mathbf{Optic} (\mathcal C) and a choice of iteration data i : \mathrm{Iter} \binom{X}{X'}. A morphism \left( \binom{X}{X'}, i \right) \to \left( \binom{Y}{Y'}, j \right) is an optic f : \binom{X}{X'} \to \binom{Y}{Y'} with the property that \mathrm{Iter} (f) (i) = j, that is to say, the iteration data on the left and right boundary have to agree.
The functor \mathrm{Iter} : \mathbf{Optic} (\mathcal C) \to \mathbf{Set} is lax monoidal: there is an evident natural way to combine pairs of iteration data into iteration data for pairs:
\nabla : \mathrm{Iter} \binom{X}{X'} \times \mathrm{Iter} \binom{Y}{Y'} \to \mathrm{Iter} \binom{X \otimes Y}{X' \otimes Y'}
This means that the tensor product of \mathbf{Optic} (\mathcal C) lifts to \int \mathrm{Iter}, by
\left( \binom{X}{X'}, i \right) \otimes \left( \binom{Y}{Y'}, j \right) = \left( \binom{X \otimes Y}{X' \otimes Y'}, i \nabla j \right)
The category \int \mathrm{Iter} can essentially already describe iteration with optics, although in a slightly awkward way. Suppose we draw a string diagram that not coincidentally resembles a control loop:
Here, f and f' denote some morphisms f : X \to Y and f' : Y \to X in our underlying category, and x_0 represents an initial state x_0 : I \to X.
Normally string diagrams denote morphisms of a monoidal category, but we make a cut just to the right of the backwards-to-forwards turning point, and consider that everything left of that is describing a boundary object. Namely in this case, we have the object \left( \binom{X}{X}, i \right) where the iteration data i : \mathrm{Iter} \binom{X}{X} is given by the state space I, the initial state x_0 : I \to I \otimes X and the iterator \mathrm{id} : I \otimes X \to I \otimes X.
The remainder of the string diagram to the right of the cut denotes an ordinary optic f : \binom{X}{X} \to \binom{I}{I}, namely the one given by f = (Y, f, f'), with forwards pass f : X \to Y \otimes I and backwards pass f' : Y \otimes I \to X. This boils down to describing the composite morphism f; f' : X \to X.
Overall, we can read this diagram as denoting a morphism f in \int \mathrm{Iter} of type f : \left( \binom{X}{X}, i \right) \to \left( \binom{I}{I}, \mathrm{Iter} (f) (i) \right). The iteration data on the right boundary is \mathrm{Iter} (f) (i) : \mathrm{Iter} \binom{I}{I}, which concretely has state space Y, the initial state x_0; f : I \to Y and iterator f'; f : Y \to Y.
This works in principle, but splitting the diagram between denoting an object and denoting a morphism is very non-standard. So far, this amounts to doing for the iteration functor what we did for the selection functions functor in section 6 of Towards Foundations of Categorical Cybernetics.
Now we take the final step to fix the slight clunkiness of using \int \mathrm{Iter} as a model of iteration. This continues the firmly established pattern that categorical cybernetics contains only two ideas that get combined in more and more intricate ways: optics and parametrisation.
There is a strong monoidal functor \pi : \int \mathrm{Iter} \to \mathbf{Optic} (\mathcal C) that forgets the iteration data, namely the discrete fibration \pi \left( \binom{X}{X'}, i \right) = \binom{X}{X'}. This functor generates an action of the monoidal category \int \mathrm{Iter} on \mathbf{Optic} (\mathcal C), namely
\left( \binom{X}{X'}, i \right) \bullet \binom{Y}{Y'} = \binom{X \otimes Y}{X' \otimes Y'}
See section 5.5 of Actegories for the Working Amthematician for far too much information about actegories of this form.
We now take the category \mathbf{Para}_{\int \mathrm{Iter}} (\mathbf{Optic} (\mathcal C)) of parametrised morphisms generated by this action. We also refer to this kind of thing (parametrisation for the action generated by a discrete fibration) as the Para construction weighted by \mathrm{Iter}, \mathbf{Para}^\mathrm{Iter} (\mathbf{Optic} (\mathcal C)) - the name comes from it being a kind of weighted limit and I think the reference for this is Bruno’s PhD thesis, which is sadly unreleased as I’m writing this.
Working things through: an object of \mathbf{Para}^\mathrm{Iter} (\mathbf{Optic} (\mathcal C)) is still a pair \binom{X}{X'}, but a morphism \binom{X}{X'} \to \binom{Y}{Y'} consists of three things: another pair of objects \binom{Z}{Z'}, iteration data i : \mathrm{Iter} \binom{Z}{Z'}, and an optic \binom{X \otimes Z}{X' \otimes Z'} \to \binom{Y}{Y'}.
Now suppose we have a diagram of an open control loop, that is to say, a control loop that is open-as-in-systems (not to be confused with an open loop controller, which is unrelated):
Here the primitive morphisms in the diagram are f : A \otimes X \to B \otimes Y, f' : B' \otimes Y \to A' \otimes X, and an initial state x_0 : I \to X. The idea is that f is the forwards pass, f' is the backwards pass, and after the backwards pass comes another forwards pass, but one time step in the future.
To make formal sense of this diagram, we imagine that we deform the backwards-to-forwards bend upwards, treating the state as a parameter, and then cut the diagram as we did before:
Now we can read this off as a morphism \binom{X}{X'} \to \binom{Y}{Y'} in \mathbf{Para}^\mathrm{Iter} (\mathbf{Optic} (\mathcal C)). The (weighted) Para construction makes everything go smoothly, so this is an entirely standard string diagram with no funny stuff.
Technically categories of parametrised morphisms are always bicategories (or better, double categories), and I think this is a rare case where we actually want to quotient out all morphisms in the vertical direction, i.e. identify \left( f : \binom{X \otimes Z}{X' \otimes Z'} \to \binom{Y}{Y'}, i : \mathrm{Iter} \binom{Z}{Z'} \right) with \left( g : \binom{X \otimes W}{X' \otimes W'} \to \binom{Y}{Y'}, j : \mathrm{Iter} \binom{W}{W'} \right) whenever there is any optic h : \binom{Z}{Z'} \to \binom{W}{W'} making \mathrm{Iter} (h) (i) = j and commuting with f and g. Coming back to our earlier picture of cutting a string diagram, this exactly says that we identify all of the different ways we could make the cut. In order to do this we change the base of enrichment along the functor \pi_0 : \mathbf{Cat} \to \mathbf{Set} taking each category to its set of connected components.
One final note: Almost everything in this post used nothing but the fact that \mathrm{Iter} is a lax monoidal functor \mathbf{Optic} (\mathcal C) \to \mathbf{Set}. With minimal translation, I think the entire thing works as a story about “forcing states in a symmetric monoidal category”: given any symmetric monoidal category \mathcal C and a lax monoidal functor F : \mathcal C \to \mathbf{Set}, the category \mathbf{Para}^F (\mathcal C) is equivalently described as \mathcal C freely extended with a morphism x : I \to X for every x : F (X). I’ll leave this as a conjecture for somebody else to prove.

# Institute for Categorical Cybernetics
Our mission is to develop theory and software for governing systems that learn and make decisions, for the benefit of their users and of humanity.
-
Recently we held a workshop in Edinburgh titled Mathematics for Governance Design, consisting of a roughly 50/50 split between social scientists and category theorists.
-
In which we connect the physics Nobel Prize to machine learning and economic design.
-
In this post we will make probably the single most important step from a generic type theory to one specialised to bidirectional programming.
-
In this post we'll begin designing a kernel language in which all programs are optics. What I mean by a "kernel language" is that it will serve as a compiler intermediate representation, with a surface language compiling down to it. I intend the surface language to be imperative style like the current Open Game Engine (with an approximately Python-like syntax), but the kernel language will reflect the category theory as closely as possible. I plan the kernel language to be well typed by construction, something that seems like overkill until I think about the problem of figuring out how pattern matching should work in a bidirectional language.
-
In which we learn why "flat earth" is a perfectly sound scientific proposition and why being wrong two thirds of the time can actually be quite lucrative.
-
This is the first post in a new series documenting my work developing a bidirectional programming language, in which all programs are interpreted as optics. This is something I've been thinking about for a long time, and eventually I became convinced that there were enough subtle issues that I should take things extremely slowly and actually learn some programming language theory. As a result, this post will not be about categorical cybernetics at all, but is a foundation to a huge tower of categorical cybernetics machinery that I will build later.
-
In which we try to capture all the ways how beliefs can shape social and economic interaction.
-
Are economic models useful for making decisions? One might expect that there is clear answer to this simple question. But in fact opinions on the usefulness or non-usefulness of models as well as what exactly makes models useful vary widely. In this post, I want to explore the question of usefulness. Even more so, I want to explore how the usefulness ties into the modelling process. The reason for doing so is simple: Part of our efforts at CyberCat is to build software tools to improve and accelerate the modelling process.
-
Suppose we have some category, whose morphisms are some kind of processes or systems that we care about. We would like to be able to talk about contexts (or environments) in which these processes or systems can be located.
-
This is an overview of the 'RL lens', a construction that we recently introduced to understand some reinforcement learning algorithms like Q-learning
-
In which we bring back together the estranged fraternal disciplines of economics and operations research and map out how we can combine them to design cybernetic economies.
-
I explore the effect of players following their best response dynamics in large random normal form games.
-
The first installment of a multi-part series demistifying the underlying mechanics of the open games engine in a simple manner.
-
In this post we will look at how dependent types can allow us to effortlessly implement the category theory of machine learning directly, opening up a path to new generalisations.
-
I'm going to record something that I think is known to everyone doing research on categorical cybernetics, but I don't think has been written down somewhere: an even more general version of mixed optics that replaces the backwards actegory with an enrichment. With it, I'll make sense of a curious definition appearing in The Compiler Forest.
Cross-posted from Oliver’s EconPatterns blog
Despite its bad reputation, “flat earth” is a perfectly scientific, mathematically grounded, and highly useful model of reality.
Indeed, it might be the perfect example to illustrate George Box’s famous aphorism about how all models are wrong, but some are useful.
We confirm its usefulness every time we open a map, on paper or on a screen, and manage to get from A to B thanks to its guidance, nevermind that it (and we) completely ignored the curvature of the earth all along the way.
Of course we could’ve consulted a globe, but for most everyday pairings of A and B, a travel-sized globe won’t help us much in navigating the route, and a globe that’s big enough to provide us with sufficient detail about our particular route would be much too big to carry around.
Indeed, we can push this further and consider a hierarchy of simplifying abstractions:
- earth is flat
- earth is spherical (a ball)
- earth is a rotational oblate ellipsoid (a pill)
- earth is a rotational oblate ellipsoid with a variance in surface elevation never exceeding 0.2% of its diameter
The last one — hills and valleys — matters much more in everyday life than the knowledge that earth is spherical or even an ellipsoid.
That doesn’t mean it’s entirely useless knowledge. Nobelist Ken Arrow’s very first academic publication, about optimizing flight paths just after World War 2, pushed the envelope, so to speak, from a planar world to a spherical world.
“All the literature assumed that the world was flat, that everything was on a plane, which may be germane if you’re flying a hundred miles. But we were already flying planes across the Atlantic, from Newfoundland to Scotland. It turned out to be an interesting mathematical problem to change these results to be applicable to the sphere — and that was my contribution.” — Kenneth Arrow, 1995.
Indeed today, Arrow’s contribution is used in any long-distance flight planning software, and its effect is visible every time we fly from Frankfurt to New York and are surprised looking out the window to find ourselves above Greenland.
But we shouldn’t be led into thinking that before Arrow scientists believed that the earth is flat. They just recognized that for their task it didn’t matter that it wasn’t, and at a time when “computers” were still human professionals rather than electronic devices, simplifying the calculations mattered.
One reason why “flat earth” is such a great example for proper modeling is that it gets the point about scope across, simply because modeling scope matches geographic scope: for short hops, flat earth is perfectly fine, but for transatlantic flights, you’re bound to run into trouble. Somewhere in between is a fuzzy boundary where the simple model gradually fails and complication becomes necessary.
Another reason is that it’s a perfect little ploy to expose a particular type of academic conceit, simply because it goes against the Pavlovian reflex by certain academics to roll out “flat earth” as the synecdoche for conclusively disproven pseudoscience.
But there is a critical difference between claiming earth is flat (an empirical hypothesis without support) and proposing a deliberate counterfactual of a flat earth (a modeling design choice), and it strikes at the heart of George Box’s aphorism, which we could amend to say that models are useful because they are wrong.
A map is useful because it’s not the territory.
This distinction is crucial, because so many people, including and maybe especially academics, get it wrong: a counterfactual is not the same as an unobservable.
Unobservables are hidden truths. Counterfactuals are openly expressed falsehoods.
In model design, counterfactuals — things we invoke even if we know they’re wrong — play an important role in the form of assumptions, which is why making assumptions explicit is a crucial but oft-ignored exercise in model design.
Graduate students in econometrics often get the advice (or at least used to) to pick up Peter Kennedy’s A Guide to Econometrics in addition to whichever weighty and forbidding tome is assigned as the official textbook (typically Greene’s doorstop), with the undertone that Kennedy’s book might provide a cheat code to unlock the arcane secrets encloistered in its more voluminous peer.
Kennedy succeeds in doing this not because he dilutes the mathematical heft of the official textbooks, but because he offers a very succinct exposé of how we should approach ordinary least squares (OLS), the workhorse of econometric modeling. Step by step:
- This is the original OLS setup
- Here are the five major assumptions that undergird it
- Because of these assumptions, the scope of OLS in its original form is quite limited
- But we can, one by one, test if each assumption is violated and implement a fix
- And if everything fails, here are a few alternatives…
This is so lucid that it’s surprising (and somewhat disheartening) to see it rarely ever expressed in this succinct — and compositional — form.
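For readers who want the formulas behind that list, the classical setup is (standard notation; this summary is mine, not Kennedy’s wording):
y = X \beta + \varepsilon, \qquad \mathbb{E} [\varepsilon] = 0, \qquad \mathrm{Var} (\varepsilon) = \sigma^2 I, \qquad X \text{ fixed with full column rank},
with the OLS estimator \hat{\beta} = (X^\top X)^{-1} X^\top y. Each clause bundles assumptions that can be tested and, if violated, relaxed in turn: linearity, zero-mean disturbances, homoskedasticity and no autocorrelation, and exogenous, non-collinear regressors.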
Assumptions are counterfactuals that limit the scope, and to expand the scope we have to investigate and potentially drop some of the assumptions, but this always involves a tradeoff between parsimony and universality.
Models are by design complexity-reducing conceits. But for this to be successful the modeler has to be willing to start by ruthlessly reducing complexity to expose the underlying mechanism, and academia isn’t always an environment where Occam’s razor is sharp.
Stereotypically speaking, academia is incentivized to err on the side of giving convoluted answers, including giving convoluted answers to simple questions, or even the worst of all worlds: convoluted wrong answers to simple questions.
Pretty much everyone in academia who laughed about large language models getting very simple questions horribly wrong (“9.11 > 9.9”, “how to ferry a goat”, “countries with ‘mania’”) should’ve felt a pang of recognition. Trying to get the big questions right often comes at the cost of getting the simple questions wrong. That might be an acceptable tradeoff in academia; in design it can be fatal.
“Since all models are wrong the scientist cannot obtain a “correct” one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity.” — George Box, 1976.
The point Box tries to drive home is not only that there are decreasing returns to upping model complexity and that pursuing correctness is an elusive goal, but that the returns are indeed quite often negative.
The scene-setting assumption of EconPatterns, expressed in the very first post, is that we operate in a cybernetic economy where macroeconomic aggregates are often deceptive.
Economic engines — take for example a dynamic pricing model for an airline or an energy provider — are elaborate beasts by necessity. If we want to capture time, portfolio, geography, bundling, and other consumer preferences in price finding, we are quickly confronted with a staggering number of variables we have to juggle to produce anything coherent, nevermind accurate, which seems to run counter to Box’s remonstrances about beauty in simplicity.
But the seeming contradiction is easily resolved. Even the newsletter emphasized that “economics usually skips this operational layer for the sake of expositional expediency, and for the most part it does ok doing so.”
The skill in modeling rests first and foremost in the ability to ruthlessly pursue parsimony, but also in the ability to recognize when and where parsimony fails us.
Translated into a design strategy, this means both to have a mental model of the ultimate end product when starting with the first sketches, but also to recognize the underlying fundamental mechanisms — the primitives — in the end product.
The only way to resolve this is to modularize the design process, not only to make it composable (we can assemble the system from the modules), but also compositional (we can infer system behavior from module behavior).
Anyone who has ever spent any time in the engine rooms of capitalism knows how ubiquitous quantitative modeling is for predicting anything at all. Even smalltown bakeries predict sales to decide how much flour to buy and how many loaves of bread to bake. Insurers run massive predictive operations staffed by actuaries. Even the microchips in our smartphones use prediction to allocate tasks.
We are surrounded by model-based predictions in our everyday lives; indeed, one might claim our livelihoods depend on them. We just choose to ignore them because they mostly operate quite well, until we’re confronted with the consequences of them breaking down — the negative surprise that gets our attention.
Undergrad microeconomic classes at business schools teach the expected value of imperfect information (EVII) as a simple “managerial” framework to explain Bayesian updating from a decision-theoretic perspective.
If you as a decision maker have a 20% chance of being right in your predictions, how much would you pay someone who has a track record of being right 30% of the time for a particular prediction? Not much, you’d think.
Narrowly speaking, that’s a pretty good mental model of how the consulting industry works, but a bit more philosophically speaking, the idea that models have to be perfect to be useful is a (very common) fallacy — usually expressed as “we don’t have enough data to model anything” or “the model made a wrong prediction so the model must be wrong”. Indeed models can often be especially useful even if they are far from accurate.
Strictly economically speaking, models are useful if they shift the probability of being right about the future upward, even if it’s only by a small delta. We only have to compare the salaries of baseball players who hit a .200 batting average (the proverbial Mendoza line) with those who hit at a .300 clip. Getting it wrong 70 percent of the time can be a pretty lucrative skill under the right circumstances.
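A back-of-the-envelope version of that logic (the numbers are mine, purely illustrative): if a correct call is worth V and a wrong one worth nothing, then the most the expert’s prediction can be worth to you is
\mathrm{EVII} = 0.3 V - 0.2 V = 0.1 V,
one tenth of the value of being right: modest, but strictly positive, even though the expert is wrong 70 percent of the time.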
The purpose of modeling is to reduce the propensity of negative surprise, which is why we usually only notice models when they do the opposite.
To update Max Planck’s famous dictum that science progresses one funeral at a time, formal modeling helps us to speed up science so that it progresses one public embarrassment at a time — which happens every time a confidently made prediction is stumped by reality.1
To up the ante, managerially speaking, models are important decision tools even if they don’t improve chances of being right at all, simply because they act as tiebreaker tools to overcome decision paralysis, especially in scenarios where “we don’t have enough data to model anything”.
It’s a simple explanation why soothsaying and bird divining existed throughout history, and why they’re still around. Sometimes people need a device — any device — to prune the branches of their personal decision trees, to overcome the form of decision paralysis we know as “procrastination”.
Good models, beyond improving predictive accuracy, also help simply by providing a formal grid to map out the structure of the decision scenario. This is what makes game (and decision) theory relevant: it’s a formal tool to map out the interrelatedness of scenarios around the decisions the participants face.
Popular opinions about modeling range from “a modeled prediction establishes a scientific fact” to “models can’t predict anything at all since (prominent example where a model went wrong)”. Strangely enough, both mental models seem to be especially popular in the natural sciences, sometimes even proposed by the same person.
Neither of these extreme positions has any grounding in reality, and their popularity is likely more the result of ambiguity intolerance and conceptual problems with the idea that improvement can come in the form of an upshift in likelihoods.
As Milton Friedman put it in the context of positive economics as a scientific endeavor:
“Its task is to provide a system of generalizations that can be used to make correct predictions about the consequences of any change in circumstances. Its performance is to be judged by the precision, scope, and conformity with experience of the predictions it yields.” — Milton Friedman, 1953.
But this is only one side of a coin. We can devise models that are purely predictive (the internal causal mechanism remains opaque to us) or models that are purely causal (they make no claim about predictive accuracy or might even be deliberately wrong in the expectation that how and where they go wrong reveals something about the internal causal mechanics) — like we do with pretty much every financial plan ever.
Most models end up somewhere in-between on that spectrum. The important thing is to be upfront about what the design objective is.
I’ve written about the role of modeling in science, the social sciences, and economics before, but it remains a contested issue, so it felt like devoting a whole post to it might be worth the effort.
My take is ultimately shaped by my own experience in industry, and in turn shapes what I am trying to achieve with EconPatterns.
The short version is that formal modeling is a relevant part of economic practice, especially the unobserved part (the “engine room”) of economic practice, and that a sound understanding of formal modeling tools is necessary for anyone within economics (even if the need for mathematical rigor varies widely between fields).
The economy is also a data-rich environment, and we have enough experience to know that certain things in the economic realm are bound to follow discernible and generalizable patterns.
But formal modeling has to rest on a sound conceptual understanding, and economic endeavors, especially those that include economic design, should spend enough time on the conceptual architecture to not end up building complicated models that fail at the simple answers.
On the same topic, see also Philipp Zahn’s perspective.
- It should be noted here that Planck’s dictum itself is an astute observation about belief systems and the glacial progress of belief propagation in academia. This might be worth its own newsletter. ↩
Cross-posted from Oliver’s Substack blog, EconPatterns
On a certain level of abstraction, an economy can be described as a network of stocks, flows, and transformations. Let’s call this level the cybernetic economy.
Stocks and flows are two fundamental forms of displacement: in time and space respectively, and they are typically restricted by upper and lower capacity constraints: overstock vs stockout, overflow vs desiccation.
Transformation in the usual sense of industrial production means the recombination of inputs to produce new outputs, but we can also include creation and consumption as starting and endpoints of network flow. In the case of natural resources, creation often takes the form of extraction.
The stocks and flows usually come in the form of information, materials, effort, payments, equipment, and on a more abstract level, risks, beliefs, rights, and commitments. Risk is just as much an economic good as any physical material: it can be transformed, bundled, disassembled, and transported.
Most of these objects should sound familiar from economic textbooks, especially macroeconomic textbooks. The cybernetic economy differs from this textbook treatment mostly by explicitly highlighting the network of interactions, and by stressing the global ramifications of local interactions.
This network view of the economy on the other hand should be familiar to anyone with a background in industrial production, where orchestrating multi-step processes on shop floors densely packed with machines, pathways, buffers, and assembly stations is a major part of the job description, and where stockouts of five-dollar parts can stop ten-million-an-hour assembly lines — as can pathways congested by improvised material buffer overflows.
Economics, especially macroeconomics, usually skips this operational layer for the sake of expositional expediency, and for the most part it does ok doing so. As long as the operational friction stays within bounds, no stocks and flows pushing against their upper or lower capacity limits, no production schedules foiled by unobtainable five-dollar components, we can safely assume a frictionless world and focus on the established gears and levers central to macroeconomic inquiry.
In other words, as long as there is only a modicum of disorder in the economy, it’s perfectly fine to assume a well-ordered economy.
Which underlines a key principle: the right level of aggregation matters. A map is not the territory, but we might need different maps to do different things within the territory. In the same sense we can drop operational details and aggregate activity on a high level as long as we can be sure that the loss of realism — the loss of predictability — is inconsequential for the task at hand.
But we should have a more fine-grained map at the ready just in case our survey map fails to capture the finer points.
The economy we’re looking at is an economy that can be disaggregated and disassembled to the individual component, the individual participant, the individual activity, just as needed whenever it is needed.
I’m resurrecting the somewhat outmoded term “cybernetic” for it because it conveys the focus on flows, on routing, buffering, concatenating, on orchestrating activities and resources.
Routing, network flow, buffering, job shop scheduling, machine replacement models are all standard tools of the trade in operations research. They are no longer, or not yet again, standard tools in economics, but in order to describe the economic activities as intended, and to couch them in a wider social and political context, they should become economic tools again.
EconPatterns intends to bring them back together under the same motivation that it intends to bring mathematical, statistical and computational tools together: to build up a toolset which we can use to design economic objects.
But, and this is the conjurer’s trick, it’ll do so almost entirely without resorting to formal modeling or even mathematical notation. This is not out of nostalgia for an era where political economy was a branch of the philosophical faculties. The economy is as data rich as any field of inquiry and we seem to have just enough recognizable, repeating and generalizable patterns to give the scientific method a try.
But the point of the exercise is to develop an economic design language, to establish a conceptual foundation, rather than to rephrase current economic knowledge. This is why it invokes the famous Bauhaus Vorkurs, the foundational course that gave the Bauhaus students a starting point from which to branch out into their respective workshops.
The things for which economics, mathematics, statistics, operations research, computer science, and other fields have developed very intricate formal mechanisms will pop up mostly as pointers. The question which sorting, filtering, or separating algorithm to use is relevant and often decisive to the success of an economic activity, but it is secondary to the question when to sort, filter or separate — and what.
Instead it will take very close looks — some might think unreasonably close looks but my hope is the reasons for doing so will reveal themselves in due time — at existing economic artifices and their constituent parts. One of the motivations is to show that the Grand Bazaar in Istanbul and an online e-commerce platform have surprisingly many things in common, and there’s a reason for it.
To this end, EconPatterns — and I believe this is the defining novelty — will borrow liberally from design theory and practice, as well as from architecture. The chosen container for this endeavor is Christopher Alexander’s design pattern. There are many reasons for this choice, not the least of which is that design patterns have successfully been translated from architecture to software design.
The in-depth discussion of “why design patterns?” surely deserves its own article, but it also introduces an interesting tension. As design philosophies go, Alexander and the Bauhaus stalwarts are certainly at opposing ends of the spectrum, A to B, organic to geometric, habitable spaces to machines for living.
I’m hoping to put this tension to good use. Designing economic contraptions poses relevant questions beyond their productivity and efficiency. Which is a major reason why I am not trying to resolve that conflict or take sides.
Admittedly, the whole endeavor is open-ended, and the crucial question of whether the patterns sketched out so far will ultimately come together as a coherent whole is still unresolved. This is why the blog format is the right one at this juncture: to put the question out in the open while I present the first pieces of the puzzle.
EconPatterns will inevitably be shaped by my own background and my own particular interests, which is one reason why economic organization will be the initial focus. The fundamental model of the economy is different, as is the underlying concept of human behavior (as next week’s entry will show). I’m somewhat inclined to say that there are not that many people out there with a background both in design and economics, so I’m quite comfortable in claiming that the exercise should offer sufficient novelty.
I’m also very clear that I don’t hold exclusive rights to the very concept of design patterns — if anything I might be the first practitioner to apply them to economic design problems — but the ultimate defining characteristic of design patterns, the one that sets them apart from economic laws, is that they’re entirely voluntary. They are simply proposals of how to look at, structure, and solve a certain design problem, and the ultimate arbiter of their success is whether enough practitioners find them useful enough to apply them to express their ideas.
Which in itself should hopefully take much of the pedantry out of economic debates.
> “All of this will lead to theories [of computation] which are much less rigidly of an all-or-none nature than past and present logic. They will be of a much less combinatorial, and much more analytical, character. In fact, there are numerous indications to make us believe that this new system of formal logic will move closer to another discipline which has been little linked in the past with logic. This is thermodynamics, primarily in the form it was received from Boltzmann, and is in part theoretical physics which comes nearest in some of its aspects to manipulating and measuring information.”
– John Von Neumann, The General and Logical Theory of Automata, 1948.
Allow me a quick excursion from the regular programming to celebrate the 2024 physics Nobel Prize awarded to John Hopfield, inventor of the eponymous Hopfield network, and Geoffrey Hinton, co-inventor (with Terrence Sejnowski) of the Boltzmann machine.
Since this is an economic design series, the question why a physics Nobel, and especially a Nobel Prize awarded for a contribution to machine learning, should be of interest is a fair one.
The long answer is that, having spent a few long years translating the underlying mechanisms of both networks into economic game theory, and in turn into the emergence of consensus (or its opposite, partisanship) in social groups, I think I can offer a fairly unique perspective to discuss the impact of this prize on economics.
The short answer is that these two networks also shape the whole economic outlook presented in EconPatterns.
To recapitulate.
In the first post, I established the economy as a network of a small set of fundamental activities: stocks, flows, transformations, which have to be orchestrated to produce desirable outputs.
This orchestration requires agreement on beliefs among participants, first that these activities do indeed lead to these outcomes, and second that these outcomes are indeed desirable.
This framing mapped a network of economic activities onto a belief network, with the underlying assumption that unless all participants have perfectly homogenous beliefs, goal conflict within the network becomes inevitable as the network becomes larger, until ultimately the network has to crumble into smaller subnetworks (aka clusters) that can hold shared beliefs.
Expressed in the first law of organization: the objective of organization is to resolve the conflict between moving forward (orchestrate activities that produce a desirable output) and staying together (hold the shared belief that these activities do indeed lead to the proposed desirable output).
Where this conflict cannot be resolved within an organization, competition emerges.
Competition is the starting point of economic inquiry, which typically treats it as exogenous. In other words, competition has to happen, and by virtue of simply happening (and by drawing attention to surpluses and scarcity in the economic network via price signals) it helps steer the economy in the right direction.
What it skips is the question where exactly the beliefs diverge sufficiently that orchestration within the same organization is no longer possible so that rifts open up and competition emerges.1
This is a question that economics of organization tries to tackle in the form of the make-or-buy decision, but finding an appropriate formalization has been elusive. And this is where Hopfield, Hinton, and Sejnowski come in.
To make that leap we first have to divest ourselves of any expectation that our formalization expresses any kind of tangible economic activity, and accept that we go down to bare-bones expressions of individual beliefs, where the main activity is to be influenced by, and to try to influence, the beliefs of our peers.
In other words, for any proposition, participants express their beliefs in the simplest possible way as subjective expectations: in simple Boolean logic, zero for “I believe it’s false”, one for “I believe it’s true”, or in a stochastic setting, any value between zero and one expressing how probable they consider the proposition to be. (Alternatively we can consider a wider range from −1 to +1 to express opposing beliefs, which is especially useful in political settings.)
Hopfield’s first paper, published in the midst of the first “AI winter” in 1982, astounds in its brevity. It is only five pages long.
Up to this juncture, including the emerging connectionist revolution that led to Rumelhart & McClelland’s famous two-volume work in 1986 (which also included Hinton & Sejnowski’s paper), neural networks were almost exclusively conceived as feedforward networks (information flows from input to output) with backpropagation (feedback flows from output to input) as the learning mechanism.
Hopfield’s insight was to fold the network onto itself: all network nodes can send and receive signals to and from all others, and the designation as input or output nodes is arbitrary.
In isolation this wouldn’t be particularly interesting, but the marvel of neural networks in general, and Hopfield networks in particular, is that the behaviors of individual nodes are connected, and that this connectivity can be expressed in a weight (or covariate) matrix, where high positive weights translate as “shared beliefs” and high negative weights as “opposing beliefs”.
Neural networks function in two modes: training mode (weights are flexible) and execution mode (weights are fixed). Training in this case translates into finding out which nodes hold correlating beliefs, and setting the weights accordingly.
Hopfield’s question is what happens when a connected network with a given set of (symmetric) weights plus a vector of isolated beliefs (aka biases) per node is allowed to converge from a given starting state (the input) to a stable state (the output), when each node tries to agree with all connected neighboring nodes with shared beliefs and disagree with neighboring nodes with opposing beliefs.
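In symbols (standard textbook notation rather than the paper’s own): with states s_i \in \{0, 1\}, symmetric weights w_{ij}, and biases b_i, each node asynchronously applies the update
s_i \leftarrow 1 \text{ if } \sum_j w_{ij} s_j + b_i > 0, \text{ else } 0,
and every such update weakly decreases the energy E (s) = - \tfrac{1}{2} \sum_{i \neq j} w_{ij} s_i s_j - \sum_i b_i s_i, so the dynamics must converge to a local minimum: the stable states that play the role of stored patterns.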
Hopfield’s first paper from 1982 tackles this question with a Boolean choice of zero and one for all nodes, and a second paper from 1984, also five pages long, expands this to allow uncertainty in the form of probabilistic belief values between zero and one, plus a sigmoid function to connect inputs and outputs.
His conclusion, in the shortest possible form, is that the network exhibits memory, in a form that makes it a “content-addressable memory”.
In other words, the network converges from an input pattern to the nearest pattern it has been trained on — an important feature in pattern recognition with an obvious early application in detecting handwriting. If the input pattern is something that vaguely looks like a 7, the output ideally should identify this as a 7 and not a 3.
Under the right conditions, if the training data set contains a number of shapes that are vaguely 7-ish looking, the network should memorize this as a distinct pattern and when activated, recognize this.
In somewhat more technical language, the network should contain local optima representing the trained-on patterns and basins of attraction that capture all the trained variants (and their interpolations).
As a physics-inspired mathematical construct, this is extremely neat and its translation into belief-driven collective action expands beyond the metaphorical similarity. Implemented as a feedback network, it has quite a few drawbacks which curbed its widespread adoption in favor of less finicky backpropagation architectures.
One major drawback, in an analogy to what I like to call the “bicycle repair cooperative on Shattuck Ave”, is that it doesn’t scale particularly well.
Shattuck Avenue is in downtown Berkeley and the bicycle cooperative prided itself on its strong collectivist ethos, where all topics are discussed and decided together. This might work if the collective is small and beliefs are highly aligned, but runs into trouble when the collective gets bigger (adding one new member adds N new connections) and beliefs diverge.
Which is why “fully connected consensus” never translates as a template for large companies.
The other problem is that it produces a whole lot of local optima which don’t map to trained patterns, so the network is always at risk of producing meaningless output — a problem that also increases with network size.
One remedy for this comes from Hinton & Sejnowski’s Boltzmann machine, which introduces “vanishing noise” as a means to avoid local minima.
Vanishing noise just means that as a node is called upon to update its belief (aka state), we introduce a small likelihood that the node accepts a new state even if it is unfavorable, and that this likelihood becomes smaller over time.
This is very much an analogy for shaking up the system, implemented as “distributed simulated annealing” — annealing being the metallurgical procedure to add short spurts of heat in a cooling process to avoid getting trapped in imperfect lattice structures.
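Concretely (the standard formulation, in the same notation as above): at temperature T a node adopts the favorable state only probabilistically,
P (s_i = 1) = \frac{1}{1 + e^{- (\sum_j w_{ij} s_j + b_i) / T}},
so at high T the update is close to a coin flip, and as T is annealed towards zero it collapses to the deterministic Hopfield update: exactly the short spurts of heat that shake the system out of shallow, meaningless minima.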
The connection to thermal annealing not only creates a connection to physics proper, it also opens another batch of neat features.
For one, we suddenly have a system in which, even though behavior happens on the individual level — each node updates its belief individually and only according to its own interests — we can still express the behavior of the whole system in a single macro equation.
The economic equivalent of this is a potential game (introduced by Dov Monderer and another Nobelist, Lloyd Shapley) where changes in individual utilities can be captured in a single equation for the whole game.
The other intriguing feature is that under vanishing noise, we can characterize the equilibria the system reaches using the eponymous Boltzmann distribution, which tightly connects the model to statistical mechanics, and in turn to entropy, free energy, and — importantly for us — surprise.
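For reference, that equilibrium characterization is the Boltzmann distribution itself: in its standard form,
P (s) = \frac{e^{- E (s) / T}}{\sum_{s'} e^{- E (s') / T}},
so low-energy, high-consensus states are exponentially more probable, and quantities like entropy, free energy, and surprise (the surprisal - \log P (s)) can be read off directly.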
This might answer the question why the physics Nobel committee deemed their work worthy of recognition.
In the early 1980s, at a time when artificial intelligence in general and perceptrons in particular were seemingly going nowhere, a bunch of researchers centered around Princeton and Carnegie Mellon put computation on a physical footing, just as John Von Neumann had predicted.
But the goal of this post is not to resolve the befuddlement that befell some physicists at the news that the physics prize was seemingly awarded to a discovery in computer science, but to map out why this matters to economics, and in particular to economic design.
Starting out in this endeavor, I had at best a vague notion of statistical mechanics. I mostly considered this kind of feedback network a bare-bones metaphorical model of what would happen in a social group that faces a simple binary choice and whose members influence each other in their choices, couched in the then headline-making problem of technology standard competition. The major claim to novelty was that if you consider heterogeneous network effects, you get more interesting results — especially that you can get interesting partisan dynamics that go against the then agreed-upon paradigm that positive network effects inevitably lead to monopolization.
The findings were mostly met with indifference at the time, but economics has evolved significantly since then. Graph theory has become a recognized tool, in large part because of the emergence of social networks and the “network science” revolution in sociology. Machine learning arrived in economics some ten years ago and is currently in a state that can only be described as a feeding frenzy.
There is a recognition that the landscape of products is unsurveyable for the consumer, leading to attention dynamics and herding behavior (then of little interest outside finance), and to the introduction of two novel economic engines that found widespread adoption in the internet age: recommender engines and reputation engines.
There is an emerging understanding that preferences are not inscrutable and outside the scope of economic inquiry, but that they are tractable and that they evolve in predictable ways.
There is now far less discomfort dealing with scenarios that have more than one equilibrium, so existence proofs have gradually given way to convergence and evolutionary dynamics.
Bayesian inference, including Bayesian games of incomplete information, is slowly making inroads, giving us a richer toolset to think about belief propagation and the evolution of norms.
And most importantly, very rich interaction data has become available, pushing the “not very interesting” findings of ideological polarization as an outcome of network heterogeneity to the forefront of the academic (and non-academic) debate.
My own position also evolved over time.
For one, I no longer consider it merely a metaphorical model of human behavior, useful as an illustrative but empirically intractable shorthand for what happens in a social group when traditional preferences and peer influence intersect.
The evidence that statistical mechanics plays a role not only in human behavior, that thing we call rationality (or even bounded rationality), but also in the goal-directed behavior of organisms we don’t generally consider rational, keeps mounting, as I mapped out in the post on surprises for eyeballs as the fundamental exchange of (not only) the attention economy.
Mostly outside of economics, much progress has been made at the intersection of statistical mechanics, Bayesian inference, and information theory, and it slowly trickles into economics proper via the translation into (evolutionary) game theory.
Cognitive effort is a scarce resource which needs to be allocated towards the activities that promise the highest return. The mechanism by which this allocation happens is attention, a term that, other than a common acceptance that we now live in the attention economy, has gained little traction in economics, even if the administration of scarce resources is at the core of economic theory.
In economic design this is somewhat different, since it ultimately deals with conceiving structures that facilitate mutually beneficial exchange. So the orthodoxy that befits theory is of little help, as the primary goal is to make the engines work, to make sure they fulfill their designed function.
This inevitably requires a wider vocabulary, including learning agents, including agents that learn from each other, including agents that form belief clusters as the subset of peers they’re willing to learn from. That includes machine learning as a facsimile for learning agents.
This also includes investigating rationally erratic behavior, mutation, or “physiological” annealing, deviation from self-interested behavior that gets more frequent as the temperature increases.
It includes developing an understanding of the emergence of norms as decentralized constraints of individual behavior.
Once we start incorporating these tools into our design handbook, it quickly becomes apparent that they go a long way towards explaining common behaviors, including behavior we don’t necessarily consider “economic”.
It also fuels the prospect that we will conclude in due time that the mechanism Hopfield, Hinton, and their peers described is embedded in far more biological systems than we thought, and maybe humans are indeed lightning calculators of pleasure and pain, but not in the way we assumed they were.
- John J. Hopfield, “Neural networks and physical systems with emergent collective computational abilities”, PNAS, 1982.
- John J. Hopfield, “Neurons with graded response have collective computational properties like those of two-state neurons”, PNAS, 1984.
- Geoffrey E. Hinton, Terrence J. Sejnowski, David H. Ackley, “Boltzmann machines: constraint satisfaction networks that learn”, Tech report, CMU, 1984.
- Geoffrey E. Hinton, Terrence J. Sejnowski, “Learning and relearning in Boltzmann machines”, in Rumelhart & McClelland, 1986.
- This is of course extremely truncated. The famous economic calculation debate centered on the question of central planning, where “central” was defined as the administrative state also orchestrating economic activity. The perspective EconPatterns takes is of course a different one, as laid out in the newsletters on organizations as tectonic plates and the blurry boundary between economics and operations research. ↩
See part I of this series
In this post we’ll begin designing a kernel language in which all programs are optics. What I mean by a “kernel language” is that it will serve as a compiler intermediate representation, with a surface language compiling down to it. I intend the surface language to be imperative style like the current Open Game Engine (with an approximately Python-like syntax), but the kernel language will reflect the category theory as closely as possible. I plan the kernel language to be well typed by construction, something that seems like overkill until I think about the problem of figuring out how pattern matching should work in a bidirectional language.
My first design choice is that object language types denote pairs of metalanguage types, with one denoting the forward part (sometimes I might call it the covariant denotation) and the other denoting the backward part (the contravariant denotation).
Cov : Ty -> Type
Con : Ty -> Type
The interpretation of a term will be a lens:
eval : Term xs y -> All Cov xs -> (Cov y, Con y -> All Con xs)
Here All is an Idris standard library function that combines a type-level map with cartesian product:
data All : (a -> Type) -> List a -> Type where
Nil : All p []
(::) : p x -> All p xs -> All p (x :: xs)
(Idris has what in Haskell would be called rebindable syntax switched on by default, so we can use the usual syntactic sugar for lists to refer to elements of All.)
eval is a well typed interpreter, and is a placeholder while we prototype: much later, this is where the backend of the compiler will begin.
My second design choice is that I want the basic product former to be interpreted as the tensor product of lenses, which is pairwise cartesian product on the covariant and contravariant parts. This is an uncontroversial choice, but importantly this product is symmetric monoidal but not cartesian, so it means that we are designing some kind of linear type theory.
My third design choice is that I want negation to be interpreted as swapping the covariant and contravariant parts. This sounds uncontroversial at first - many well known semantic categories do the same thing - until we notice that it is not functorial. That is to say, if we have a lens (X,X′)→(Y,Y′) then in general we can’t make a lens between (X′,X) and (Y′,Y) in either direction. Years ago I wrote a paper called Coherence for lenses and open games in which this non-functorial pair swapping featured heavily, and I still stake everything on the claim that it is the right way to go.
At this point we have enough to build a term language and its interpretation. I will add a single ground type which will be interpreted as purely covariant.
data Ty : Type where
Unit : Ty
Ground : Ty
Not : Ty -> Ty
Tensor : Ty -> Ty -> Ty
mutual
Cov : Ty -> Type
Cov Unit = Unit
Cov Ground = Nat
Cov (Not x) = Con x
Cov (Tensor x y) = (Cov x, Cov y)
Con : Ty -> Type
Con Unit = Unit
Con Ground = Unit
Con (Not x) = Cov x
Con (Tensor x y) = (Con x, Con y)
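For instance, unfolding these equations on a small type makes the behaviour of negation visible (this worked example is mine, but it follows directly from the definitions above):
Cov (Not Ground) = Con Ground = Unit
Con (Not Ground) = Cov Ground = Nat
Cov (Not (Not x)) = Con (Not x) = Cov x
Con (Not (Not x)) = Cov (Not x) = Con x
In particular Not (Not x) has exactly the same denotation as x, which is the involutivity we will lean on below.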
Let’s think through the consequences of these choices. We think of Tensor as linear conjunction, so its neutral element Unit is linear truth. The interpretation of Unit is the pair (1,1), and so Not Unit - which we would think of as linear falsity - has the same interpretation. So we have a linear logic where falsity and truth coincide semantically. Similarly, the de Morgan dual of Tensor, which we would call linear disjunction, coincides with it semantically. So we have an inconsistent interpretation of linear logic. This is nowhere near as bad as it sounds, since many reasonable semantic categories do the same, but we need to keep it in mind.
Since Tensor is a perfectly cromulent symmetric monoidal product, its introduction and elimination rules will be exactly the same as the ones in my previous post. But the negation rules are going to be quite a puzzle.
Our interpretation of negation is strictly involutive - swapping twice is a no-op - something we can call a classical linear negation. This means our semantics validates the principles of double negation introduction and double negation elimination: both of them are interpreted as an identity lens.
The principle of explosion says that p and ¬p together entail falsity, for every proposition p. Since falsity and truth coincide, for us the principle of explosion says that p and ¬p together entail truth. This is a valid principle in our semantics. Suppose p is interpreted as (X,X′), then p⊗¬p is interpreted as (X×X′,X′×X). There is indeed a canonical lens (X×X′,X′×X)→(1,1), namely the “turnaround” lens, which I normally call a counit.
In Idris, suppose we have
explosion : {a : Ty} -> Term [a, Not a] Unit
Then we must have
eval explosion : {a : Ty} -> All Cov [a, Not a] -> (Unit, Unit -> All Con [a, Not a])
which up to isomorphism reduces to
eval explosion : {a : Ty} -> (Cov a, Con a) -> (Con a, Cov a)
Of course, we want to implement eval so that this gives us the swap function.
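As a sanity check, the function we are aiming for is nothing more exotic than the following (a standalone sketch of the intended behaviour, not the actual interpreter clause):
swapExplosion : {a : Ty} -> (Cov a, Con a) -> (Con a, Cov a)
swapExplosion (x, x') = (x', x)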
The de Morgan dual of the principle of explosion is the principle of excluded middle, which says that truth entails p or ¬p. Remembering that our conjunction and disjunction coincide, if p has interpretation (X,X′) then excluded middle would denote a lens (1,1)→(X×X′,X′×X). In general there is no lens of this type, so our semantics does not validate excluded middle.
In Idris, suppose we had
lem : {a : Ty} -> Term [] (Tensor a (Not a))
Then we must have
eval lem : {a : Ty} -> All Cov [] -> ((Cov a, Con a), (Con a, Cov a) -> All Con [])
which up to isomorphism reduces to
eval lem : {a : Ty} -> (Cov a, Con a)
which is impossible as soon as we introduce any types that are not pointed.
This suggests a logic puzzle: can we design a proof system for negation that validates double negation introduction, double negation elimination and the principle of explosion, but does not validate excluded middle?
After some tinkering I did indeed invent a system with these properties. Sadly it turned out to be a red herring, since it ended up proving these principles that are valid for lenses in terms of more primitive principles that are not valid for lenses. But I still think it’s an interesting enough sideline to report here.
The system I designed was a 2-sided hybrid of a natural deduction calculus and a sequent calculus, with general right-elimination, and both left-elimination and right-introduction restricted to empty sequents on the right. In standard proof theory syntax I would write it like this:
\frac{\Gamma, \varphi \vdash}{\Gamma \vdash \neg \varphi} (\mathrm{RI}) \qquad \frac{\Gamma, \neg \varphi \vdash}{\Gamma \vdash \varphi} (\mathrm{LE}) \qquad \frac{\Gamma \vdash \varphi, \Delta \quad \Gamma' \vdash \neg \varphi, \Delta'}{\Gamma, \Gamma' \vdash \Delta, \Delta'} (\mathrm{RE})
In Idris:
data Term : List Ty -> List Ty -> Type where
Var : Term [x] [x]
LAct : Symmetric xs' xs -> Term xs ys -> Term xs' ys
RAct : Term xs ys -> Symmetric ys ys' -> Term xs ys'
NotIntroR : Term (x :: xs) [] -> Term xs [Not x]
NotElimL : Term (Not x :: xs) [] -> Term xs [x]
NotElimR : Simplex xs1 xs2 xs3 -> Simplex ys1 ys2 ys3
-> Term xs1 (y :: ys1) -> Term xs2 (Not y :: ys2) -> Term xs3 ys3
Symmetric is the structure for permutations that I introduced in the previous post.
Here are what our principles look like, together with some non-proofs that are ruled out by the restrictions on right-introduction and left-elimination:
dni : {a : Ty} -> Term [a] [Not (Not a)]
dni = NotIntroR (LAct (Insert (There Here) (Insert Here Empty))
(NotElimR (Left Right) Right Var Var))
dne : {a : Ty} -> Term [Not (Not a)] [a]
dne = NotElimL (NotElimR (Left Right) Right Var Var)
-- ruled out by NotIntroR restriction
-- dne = NotElimR Right (Left Right) (NotIntroR Var) Var
explosion : {a : Ty} -> Term [a, Not a] []
explosion = NotElimR (Left Right) Right Var Var
lem : {a : Ty} -> Term [] [Not a, a]
-- ruled out by NotIntroR restriction
-- lem = NotIntroR Var
-- ruled out by NotElimL restriction
-- lem = NotElimL (NotElimL (NotElimR (Left Right) Right Var Var))
Unfortunately, although my restricted left-elimination and right-introduction rules can be used to prove the semantically valid principles of double negation introduction and elimination, they are themselves not semantically valid. The problems start to appear once we add back in the rules for tensor, which in this 2-sided calculus are
TensorIntro : Simplex xs1 xs2 xs3 -> Simplex ys1 ys2 ys3
-> Term xs1 (y1 :: ys1) -> Term xs2 (y2 :: ys2)
-> Term xs3 (Tensor y1 y2 :: ys3)
TensorElim : Simplex xs1 xs2 xs3 -> Simplex ys1 ys2 ys3
-> Term xs1 (Tensor x y :: ys1) -> Term (x :: y :: xs2) ys2
-> Term xs3 ys3
Now we can write a bad term:
bad : {a : Ty} -> Term [] [Not (Tensor a (Not a))]
bad = NotIntroR (TensorElim (Left Right) Right Var explosion)
Although these rules don’t seem to be strong enough to prove the distributive law between tensor and negation, semantically this is the same shape as excluded middle. I think it would be possible to restrict left-elimination and right-introduction differently to rule out this kind of thing, but only at the expense of leaving us with unprovable instances of double negation introduction and elimination.
Although I would love to come up with a calculus that fulfills my requirements using pure logic, I currently believe that it’s impossible. So instead I will bring out the big guns, and use a Structure. The methodology I introduced in the previous post yields a clean conceptual separation into syntax and logic. If we want to say that two things are syntactically identical, for example permutations of contexts, we use a Structure to encode that. So what we are about to do is to encode a principle that p and ¬¬p are not merely logically equivalent but syntactically identical.
This is how we do it:
data Parity : Ty -> Ty -> Type where
Id : Parity x x
Raise : Parity x y -> Parity x (Not (Not y))
Lower : Parity x y -> Parity (Not (Not x)) y
data Involutive : Structure Ty where
Empty : Involutive [] []
Insert : Parity x y -> Insertion y ys zs -> Involutive xs ys -> Involutive (x :: xs) zs
An element of Involutive xs ys is a witness that ys is a permutation of xs but with an arbitrary number of double negatives inserted or removed.
With double negation introduction and elimination taken care of, all we have to do is to make a logic that validates the principle of explosion and not excluded middle, which is easy: it’s an ordinary 1-sided natural deduction calculus with the negation introduction rule omitted.
data Term : List Ty -> Ty -> Type where
Var : Term [x] x
Act : Involutive xs ys -> Term ys t -> Term xs t
NotElim : {xs, ys : List Ty} -> {default (simplex xs ys) prf : _}
-> Term xs t -> Term ys (Not t) -> Term prf.fst Unit
TensorIntro : {xs, ys : List Ty} -> {default (simplex xs ys) prf : _}
-> Term xs t -> Term ys t' -> Term prf.fst (Tensor t t')
TensorElim : {xs, ys : List Ty} -> {default (simplex xs ys) prf : _}
-> Term xs (Tensor x y) -> Term (x :: y :: ys) z -> Term prf.fst z
We can write exactly the terms that we want to:
dni : {a : Ty} -> Term [a] (Not (Not a))
dni = Act (Insert (Raise Id) Here Empty) Var
dne : {a : Ty} -> Term [Not (Not a)] a
dne = Act (Insert (Lower Id) Here Empty) Var
explosion : {a : Ty} -> Term [a, Not a] Unit
explosion = NotElim Var Var
lem : {a : Ty} -> Term [] (Tensor a (Not a))
-- impossible
And it’s now possible to write a well typed interpreter for this definition of terms, although I’ll skip it here because it involves several pages of mostly tedious boilerplate code. In the next post we’ll add the missing scoping rules to our language, so that by the time we come back to the well typed interpreter in post number 4, we’ll be able to use it to do a little bit of differentiable programming.
In Towards Foundations of Categorical Cybernetics we built a category whose objects are selection functions and whose morphisms are lenses. It was a key step in how we justified open games in that paper: they’re just parametrised lenses “weighted” by selection functions. In this post I’ll show that by adding dependent types and stirring, we can get a nicer category that does the same job but has all colimits, and comes extremely close to having all limits. Fair warning: this post assumes quite a bit of category-theoretic background.
Besides being a nice thing to do in itself, we have a very specific motivation for this. The recently released paper Categorical deep learning: An algebraic theory of architectures proposed using initial algebras and final coalgebras in categories of parametrised morphisms to build neural networks with learning invariants designed to operate on complex data structures, in a huge generalisation of geometric deep learning. This post is the first step to replicating the same structure in compositional game theory, and is probably the first case where a class of deep learning architectures has a game-theoretic analogue right from the beginning (ok, the first other than GANs) - something that is absolutely key to our vision of AI safety, as I described in this previous post.
In this post I’m going to work over the category of sets, to make my life easy. A container (also known as a polynomial functor) is a pair \binom{X}{X’} where X is a set and X’ is an X-indexed family of sets.
Given a pair of containers, a dependent lens f : \binom{X}{X’} \to \binom{Y}{Y’} is a pair of a function f : X \to Y and a function f’ : (x : X) \times Y’ (f (x)) \to X’ (x). There’s a category \mathbf{DLens} whose objects are containers and whose morphisms are dependent lenses (also known as the category of containers \mathbf{Cont} and the category of polynomial functors \mathbf{Poly} by different authors).
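As a sanity check (a standard observation, not specific to this post): when X’ is the constant family x \mapsto X’ and likewise for Y’, the dependency disappears and a dependent lens is exactly a simply-typed lens
f : X \to Y, \qquad f’ : X \times Y’ \to X’,
which is how the category of simply-typed lenses mentioned below sits inside \mathbf{DLens}.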
The category \mathbf{DLens} has all limits and colimits, distinguishing it from the category of simply-typed lenses which is missing many of both (see my old paper Morphisms of Open Games). In this post I want to just take that as a given fact, because calculating them is not always so easy. The slick way to prove it is by constructing \mathbf{DLens} as a fibration \int_{X : \mathbf{Set}} \left( \mathbf{Set} / X \right)^\mathrm{op}, and using the fact that a fibred category has all co/limits if every fibre does and reindexing preserves them (a fact that we’ll be seeing again later).
Write I for the tensor unit of dependent lenses: it’s made of the set 1 = \{ * \} and the 1-indexed set * \mapsto 1. A dependent lens I \to \binom{X}{X’} is an element of X, and a dependent lens \binom{X}{X’} \to I is a section of the container: a function k : (x : X) \to X’ (x). For shorthand I’ll write H = \mathbf{DLens} (I, -) : \mathbf{DLens} \to \mathbf{Set} and K = \mathbf{DLens} (-, I) : \mathbf{DLens}^\mathrm{op} \to \mathbf{Set} for these representable functors.
By analogy to what happens in the simply-typed case, a dependent selection function for a container \binom{X}{X’} should be a function \varepsilon : K \binom{X}{X’} \to H \binom{X}{X’} - that is, a thing that turns costates into states.
But I think we’re going to need things to be multi-valued in order to get all colimits (and we need it to do much game theory anyway), so let’s immediately forget that and define a dependent multi-valued selection function of type \binom{X}{X’} to be a binary relation \varepsilon \subseteq H \binom{X}{X’} \times K \binom{X}{X’}.
To be honest, I don’t really have any serious examples of these things to hand; I think they’ll arise from taking colimits of things that are simply-typed. For game theory the main one we care about is still \arg\max, which is a “dependent” multi-valued selection function but only in a boring way that doesn’t use the dependent types - it’s a binary relation \arg\max \subseteq H \binom{X}{\mathbb R} \times K \binom{X}{\mathbb R}, where \mathbb R here means the X-indexed set that is constantly the real numbers.
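To make this concrete, here is \arg\max written as a relation in a minimal Haskell sketch, restricted to the simply-typed finite case (the names are mine): a state x is related to a costate k exactly when x maximises k.

-- (x, k) is in the argmax relation iff x maximises the costate k over a
-- finite enumeration of the state space.
argMaxRel :: Ord r => [x] -> x -> (x -> r) -> Bool
argMaxRel xs x k = all (\x' -> k x' <= k x) xs

For example, argMaxRel [1..10] (10 :: Int) id evaluates to True.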
For each container \binom{X}{X’}, write E \binom{X}{X’} = \mathcal P \left( H \binom{X}{X’} \times K \binom{X}{X’} \right) for the set of multi-valued selection functions for it. Since it’s a powerset it inherits a posetal structure from subset inclusion, which is a boolean algebra. That means that as a thin category, it has all limits and colimits, something that will come in useful later.
Given \varepsilon \subseteq H \binom{X}{X’} \times K \binom{X}{X’} and a dependent lens f : \binom{X}{X’} \to \binom{Y}{Y’} we can define a “pushforward” selection function f_* (\varepsilon) \subseteq H \binom{Y}{Y’} \times K \binom{Y}{Y’} by f_* (\varepsilon) = \{ (hf, k) \mid (h, fk) \in \varepsilon \}. Defining it this way means we get functoriality for free, and it’s also monotone, so we have a functor E : \mathbf{DLens} \to \mathbf{Pos}.
The fact that we could just as easily have defined a contravariant action on dependent lenses means that the fibration we’re about to get is a bifibration, something that will definitely come in useful one day, but not today.
The next thing we do is take the category of elements of E. Objects of \int E are pairs \left( \binom{X}{X’}, \varepsilon \right) of a container and a selection function for it. A morphism f : \left( \binom{X}{X’}, \varepsilon \right) \to \left( \binom{Y}{Y’}, \delta \right) is a dependent lens f : \binom{X}{X’} \to \binom{Y}{Y’} with the property that f_* (\varepsilon) \leq \delta - which is to say, any h : H \binom{X}{X’} and k : K \binom{Y}{Y’} satisfying (h, fk) \in \varepsilon must also satisfy (hf, k) \in \delta.
So, \int E is a category whose objects are dependent multi-valued selection functions and whose morphisms are dependent lenses. The only difference from the original category of selection functions in Towards Foundations is that we replaced simply typed lenses with dependent lenses. This is enough to get all colimits, and I’d call \int E a “nice category of selection functions”.
The good way to prove that a fibred category has all co/limits (see this paper) is to show that (1) the base category has all co/limits, (2) every fibre has all co/limits, and (3) reindexing preserves co/limits. We already know (1) and (2) (remember the fibres are all boolean algebras), so we just need to prove (3). Since limits and colimits in the fibres are unions and intersections, this should not be too hard.
For some container \binom{X}{X’}, suppose we have some family \varepsilon_i \in E \binom{X}{X’} indexed by i : I. We can define the meet \bigwedge_{i : I} \varepsilon_i and join \bigvee_{i : I} \varepsilon_i : E \binom{X}{X’} by intersection and union. To get all colimits in \int E, what we need to prove is that for any dependent lens f : \binom{X}{X’} \to \binom{Y}{Y’}, f_* \left( \bigvee_{i : I} \varepsilon_i \right) = \bigvee_{i : I} f_* (\varepsilon_i). Let’s do it:
Going forwards, suppose (h, k) \in f_* \left( \bigvee_i \varepsilon_i \right), so by definition of f_* there must be h’ such that h = h’f and (h’, fk) \in \bigvee_i \varepsilon_i. So there is some i : I such that (h’, fk) \in \varepsilon_i, so (h’f, k) = (h, k) \in f_* (\varepsilon_i), therefore (h, k) \in \bigvee_i f_* (\varepsilon_i).
In the other direction, suppose (h, k) \in \bigvee_i f_* (\varepsilon_i), so (h, k) \in f_* (\varepsilon_i) for some i : I. So we must have h’ such that h = h’f and (h’, fk) \in \varepsilon_i. So (h’, fk) \in \bigvee_i \varepsilon_i, therefore (h’f, k) = (h, k) \in f_* \left( \bigvee_i \varepsilon_i \right).
Note, this is intentionally a pure existence proof. Actually calculating these things can be quite a pain, and I’m going to put it off until later, specifically until a paper we’re cooking up on branching open games.
If we also had f_* \left( \bigwedge_{i : I} \varepsilon_i \right) = \bigwedge_{i : I} f_* (\varepsilon_i) then \int E would also have all limits, but sadly in general the best we can do is f_* \left( \bigwedge_{i : I} \varepsilon_i \right) \subseteq \bigwedge_{i : I} f_* (\varepsilon_i). I’d guess this probably means that \int E has some kind of lax limits or something, but I’ll deal with that another day.
It’s instructive to look at what goes wrong. If (h, k) \in \bigwedge_i f_* (\varepsilon_i), then for all i : I we have (h, k) \in f_* (\varepsilon_i). So, for every i we have h’_i such that h = h’_i f and (h’_i, fk) \in \varepsilon_i. We can make progress if f is a monomorphism, in which case all of the h’_i are equal because h’_i f = h = h’_j f implies h’_i = h’_j. In fact, while I don’t know what general monomorphisms in \mathbf{DLens} look like, in this case it’s enough that the forwards pass of f is an injective function. This probably gives us a decent subcategory of \int E that has all limits as well as all colimits, but I don’t know whether that category will be useful for anything.
Cross-posted from Jade’s blog: parts 1, 2, 3
Suppose your name is x and you have a very important state machine N_x : S_x \times \Sigma \to \mathcal P(S_x) that you cherish with all your heart. Because you love this state machine so much, you don’t want it to malfunction and you have a subset P \subseteq S_x which you consider to be safe. If your state machine ever leaves this safe space you are in big trouble so you ask the following question. If you start in some subset I \subseteq P will your state machine N_x ever leave P? In math, you ask if
\mu(\blacksquare(-) \cup I) \subseteq P
where \mu is the least fixed point and \blacksquare(-) indicates the next-time operator of the cherished state machine. What is the next-time operator?
Definition: For a function N : X \times \Sigma \to \mathcal P(Y) there is a monotone function \blacksquare_N : \mathcal P(X) \to \mathcal P(Y) given by
\blacksquare_N(A) = \bigcup_{a \in A} \bigcup_{s \in \Sigma} N(a, s)
In layspeak, the next-time operator sends a set of states to the set of all possible successors of those states.
In a perfect world you could use these definitions to ensure safety using the formula
\mu(\blacksquare(-) \cup I) = \bigcup_{n=0}^{\infty} (\blacksquare(-) \cup I)^n
or at least check safety up to an arbitrary time-step n by computing this infinite union one step at a time.
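Before seeing why this breaks, it may help to see how small the single-machine story is in code. Here is a hedged Haskell sketch, assuming a finite state set and a finite alphabet; all the names (Machine, nextTime, reachable, safe) are mine.

import           Data.Set (Set)
import qualified Data.Set as Set

-- A nondeterministic state machine N : S x Sigma -> P(S).
type Machine s sigma = s -> sigma -> Set s

-- The next-time operator: all possible successors of a set of states.
nextTime :: Ord s => Machine s sigma -> [sigma] -> Set s -> Set s
nextTime n sigmas a = Set.unions [ n s t | s <- Set.toList a, t <- sigmas ]

-- Kleene iteration of A |-> nextTime(A) `union` I, starting from I;
-- terminates whenever the state space is finite.
reachable :: Ord s => Machine s sigma -> [sigma] -> Set s -> Set s
reachable n sigmas i = go i
  where
    go a =
      let a' = Set.union (nextTime n sigmas a) i
      in if a' == a then a else go a'

-- Safety: the least fixed point stays inside the safe set P.
safe :: Ord s => Machine s sigma -> [sigma] -> Set s -> Set s -> Bool
safe n sigmas i p = reachable n sigmas i `Set.isSubsetOf` p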
Unfortunately there is a big problem with this method! Your state machine does not exist in isolation. You have a friend whose name is y, with their own state machine N_y : S_y \times \Sigma \to \mathcal P(S_y). y has the personal freedom to run their state machine how they like, but there are functions
N_{xy} : S_x \times \Sigma \to \mathcal P(S_y)
and
N_{yx} : S_y \times \Sigma \to \mathcal P(S_x)
which allow states of your friend’s machine to change the states of your own and vice versa. Making matters worse, there is a whole graph G whose vertices are your friends and whose edges indicate that the corresponding state machines may affect each other. How can you be expected to ensure safety under these conditions?
But don’t worry, category theory comes to the rescue. In the next sections I will:
- State my model of the world and the local-to-global safety problem for this model (Part II)
- Propose a solution to the local-to-global safety problem based on an enriched version of the Grothendieck construction (Part III)
Suppose we have a directed graph G = (V(G), E(G)) representing our world. The vertices of this graph are the different agents in our world and an edge represents a connection between these agents. The semantics of this graph will be the following:
Definition: Let \mathbf{Mach} be the directed graph whose objects are sets and where there is an edge e : X \to Y for every function
e : X \times \Sigma \to \mathcal P(Y)
A world is a morphism of directed graphs W : G \to \mathbf{Mach}.
A world has a set S_x for each vertex x, called the local state over x, and for each edge e : x \to y a function W(e) : S_x \times \Sigma_e \to \mathcal P(S_y) representing the state machine connecting the local state over x to the local state over y. Note that self edges are ordinary state machines from a local state to itself. An example world may be drawn as follows:

Definition: Given a world W : G \to \mathbf{Mach}, the total machine of W is the state machine \int W : \sum_{x \in V(G)} S_x \times \sum_{e \in E(G)} \Sigma_e \to \mathcal P\left(\sum_{x \in V(G)} S_x\right)
given by
((s, x), (\tau, d)) \mapsto \bigcup_{e : x \to y} W(e)(s, \tau)
The notation ∫ is used based on the belief that this is some version of the Grothendieck construction. Exactly which flavor will be left to future work. The transition function of this state machine comes from unioning the transition functions of all the state machines associated to edges originating in a vertex.
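Here is a rough Haskell rendering of the total machine, with the simplifying (and unfaithful) assumption that every vertex carries the same type of local state; all names are mine.

import           Data.Set (Set)
import qualified Data.Set as Set

type Vertex = Int

-- One edge of a world: a state machine from the local state over src
-- to the local state over tgt.
data Edge s sigma = Edge
  { src   :: Vertex
  , tgt   :: Vertex
  , trans :: s -> sigma -> Set s
  }

-- The total machine unions the transitions of all edges leaving a vertex.
totalMachine :: Ord s => [Edge s sigma] -> (Vertex, s) -> sigma -> Set (Vertex, s)
totalMachine es (x, s) t =
  Set.unions [ Set.map (\s' -> (tgt e, s')) (trans e s t)
             | e <- es, src e == x ]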
Definition: Given a world W : G \to \mathbf{Mach}, a vertex x \in V(G), and subsets I, P \subseteq S_x, we say that I is locally safe in a global context if
\mu(\blacksquare_{\int W}(-) \cup I) \subseteq P
where \blacksquare_{\int W} is the next-time operator of the state machine \int W.
The state machine \int W may be large enough to make computing this least fixed point by brute force impractical. Therefore, we must leverage the compositional structure of W. We will see how to do this in the next post.
In this section we will give a compositional solution to the local safety problem in a global context in two steps:
- First by turning the world into a functor \hat W : FG \to \mathbf{Poset}
- Then by gluing this functor into a single poset \int \hat W whose inequalities solve the problem of interest.
First we define the functor.
Given a world W : G \to \mathbf{Mach}, there is a functor
\hat W : FG \to \mathbf{Poset}
where
- FG is the free category on the graph G,
- \mathbf{Poset} is the category whose objects are posets and whose morphisms are monotone functions.
Functors from a free category are uniquely defined by their image on vertices and generating edges.
- For a vertex x \in V(G), \hat W(x) = \mathcal P(S_x),
- for an edge e : x \to y, we define \hat W(e) : \mathcal P(S_x) \to \mathcal P(S_y) by A \mapsto \blacksquare_{W(e)}(A)
Now for step two.
Given a functor \hat W : FG \to \mathbf{Poset} defined from a world W, the global safety poset is a poset \int \hat W where
- elements are pairs (x \in V(G), A \subseteq S_x),
- (x, A) \leq (y, B) \iff \bigwedge_{f : x \to y \in FG} \hat W(f)(A) \subseteq B
Given a world W : G \to \mathbf{Mach}, a vertex x \in V(G), and subsets I, P \subseteq S_x, then I is locally safe in a global context if and only if there is an inequality (x, I) \leq (x, P) in the global safety poset \int \hat W.
My half-completed proof of this theorem involves a square of functors

Going right and then down, the first path uses a Grothendieck construction to turn a world into a total state machine, and then turns that state machine into its global safety poset. Going down and then right follows the construction detailed in the last two sections. The commutativity of this diagram should verify correctness. I will explain all of this in more detail later. Thanks for tuning in today!
This post is a writeup of a talk I gave at the Causal Cognition in Humans and Machines workshop in Oxford, about some work in progress I have with Toby Smithe. To a large extent this is my take on the theoretical work in Toby’s PhD thesis, with the emphasis shifted from category theory and neuroscience to numerical computation and AI. In the last section I will outline my proposal for how to build AGI.
The starting point is the concept of a Markov kernel, which is a synonym for conditional probability distribution that sounds unnecessarily fancy but, crucially, contains only 30% as many syllables. If X and Y are some sets then a Markov kernel \varphi from X to Y is a conditional probability distribution P_\varphi[y \mid x]. Most of this post will be agnostic to what exactly “probability distribution” can mean, but in practice it will probably eventually mean “Gaussian”, in order to go brrr, by which I mean effective in practice at the expense of theoretical compromise. (I blatantly stole this usage of that meme from Bruno.)
There are two different perspectives on how Markov kernels can be implemented. They could be exact, for example, they could be represented as a stochastic matrix (in the finite support case) or as a tensor containing a mean vector and covariance matrix for each input (in the Gaussian case). Alternatively they could be Monte Carlo, that is, implemented as a function from X to Y that may call a pseudorandom number generator. If we send the same input repeatedly then the outputs are samples from the distribution we want. Importantly these functions satisfy the Markov property: the distribution on the output depends only on the current input and not on any internal state.
An important fact about Markov kernels is that they can be composed. Given a Markov kernel P_\varphi[y \mid x] and another P_\psi[z \mid y], there is a composite kernel P_{\varphi;\psi}[z \mid x] obtained by integrating out y:
P_{\varphi;\psi}[z \mid x] = \int P_\varphi[y \mid x] \cdot P_\psi[z \mid y] \, dy
This formula is sometimes given the unnecessarily fancy name Chapman-Kolmogorov equation. If we represent kernels by stochastic matrices, then this is exactly matrix multiplication; if they are Gaussian tensors, then it’s a similar but slightly more complicated operation. Doing exact probability for anything more complicated is extremely hard in practice because of the curse of dimensionality.
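To make this concrete, here is a hedged Haskell sketch of the finite-support case, where the integral becomes a finite sum; Dist and Kernel are my own throwaway names rather than any real library.

-- Finite-support distributions as weighted lists.
newtype Dist a = Dist { runDist :: [(a, Double)] }

-- A Markov kernel is a function into distributions.
type Kernel x y = x -> Dist y

-- Composition sums out the intermediate variable (Chapman-Kolmogorov).
composeK :: Kernel x y -> Kernel y z -> Kernel x z
composeK phi psi x =
  Dist [ (z, py * pz)
       | (y, py) <- runDist (phi x)
       , (z, pz) <- runDist (psi y) ]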
If we represent kernels by Monte Carlo functions, then composition is literally just function composition, which is extremely convenient. That is, we can just send particles through a chain of functions and they’ll come out with the right distribution - this fact is basically what the term “Monte Carlo” actually means.
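The Monte Carlo version of the same sketch makes the point directly: a kernel is a sampler, and composition is Kleisli composition.

-- A Monte Carlo kernel: a sampler that may consult a random number
-- generator hidden inside IO. The Markov property holds as long as the
-- sampler keeps no hidden state between calls.
type MC x y = x -> IO y

composeMC :: MC x y -> MC y z -> MC x z
composeMC phi psi x = phi x >>= psi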
A special case of this is an ordinary (non-conditional) probability distribution, which can be usefully thought of as a Markov kernel whose domain is a single point. Given a distribution P_\pi[x] and a kernel P_\varphi[y \mid x] we can obtain a distribution \pi;\varphi on y, known as the pushforward distribution, by integrating out x:
P_{\pi;\varphi}[y] = \int P_\pi[x] \cdot P_\varphi[y \mid x] \, dx
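In the finite-support sketch from above, the pushforward is one comprehension:

-- Pushforward of a prior along a kernel (Dist and Kernel as above).
push :: Dist x -> Kernel x y -> Dist y
push prior phi =
  Dist [ (y, px * p)
       | (x, px) <- runDist prior
       , (y, p)  <- runDist (phi x) ]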
Suppose we have a Markov kernel P_\varphi[y \mid x] and we are shown a sample of its output, but we can’t see what the input was. What can we say about the input? To do this, we must start from some initial belief about how the input was distributed: a prior P_\pi[x]. After observing y, Bayes’ law tells us how we should modify our belief to a posterior distribution that accounts for the new evidence. The formula is
P[x \mid y] = \frac{P_\varphi[y \mid x] \cdot P_\pi[x]}{P_{\pi;\varphi}[y]}
The problem of computing posterior distributions in practice is called Bayesian inference, and is very hard and very well studied.
If we fix \pi, it turns out that the previous formula for P[x \mid y] defines a Markov kernel from Y to X, giving the posterior distribution for each possible observation. We call this the Bayesian inverse of \varphi with respect to \pi, and write P_{\varphi^\dagger_\pi}[x \mid y].
The reason we can have y as the input of the kernel but we had to pull out \pi as a parameter is that the formula for Bayes’ law is linear in y but nonlinear in \pi. This nonlinearity is really the thing that makes Bayesian inference hard.
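Continuing the finite-support sketch, exact Bayesian inversion is short to write down, though note it is partial: it divides by zero on observations of probability zero (an issue that comes up again below).

-- Bayesian inversion in the finite-support case: weight each x by prior
-- times likelihood of the observed y, then renormalise.
-- Partial: undefined if y has probability zero under the pushforward.
invert :: Eq y => Dist x -> Kernel x y -> Kernel y x
invert prior phi y =
  let weighted = [ (x, px * p)
                 | (x, px) <- runDist prior
                 , (y', p) <- runDist (phi x)
                 , y' == y ]
      total = sum (map snd weighted)
  in Dist [ (x, w / total) | (x, w) <- weighted ]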
Technically, Bayes’ law only considers sharp evidence, that is, we observe a particular point y. Considering inverse Markov kernels also gives us a way of handling noisy evidence, such as stochastic uncertainty in a measurement, by pushing forward a distribution P_\rho[y] to obtain P_{\rho;\varphi^\dagger_\pi}[x]. This way of handling noisy evidence is sometimes called a Jeffreys update, and contrasted with a different formula called a Pearl update - see this paper by Bart Jacobs. Pearl updates have very different properties and I don’t know how they fit into this story, if at all. Provisionally, I consider the story of this post as evidence that Jeffreys updates are “right” in some sense.
So far we’ve introduced two operations on Markov kernels: composition and Bayesian inversion. Are they related to each other? The answer is a resounding yes: they are related by the formula
(\varphi;\psi)^\dagger_\pi = \psi^\dagger_{\pi;\varphi} ; \varphi^\dagger_\pi
We call this the chain rule for Bayesian inversion, because of its extremely close resemblance to the chain rule for transpose Jacobians that underlies backpropagation in neural networks and differentiable programming:
J^\top_x(f;g) = J^\top_{f(x)}(g) \cdot J^\top_x(f)
The Bayesian chain rule is extremely folkloric. I conjectured it in 2019 while talking to Toby, and he proved it a few months later, writing it down in his unpublished preprint Bayesian Updates Compose Optically. It’s definitely not new - some people already know this fact - but extremely few, and we failed to find it written down in a single place. (I feel like it should have been known by the 1950s at the latest, when things like dynamic programming were being worked out. Perhaps it’s one of the things that was well known in the Soviet Union but wasn’t discovered in the West until much later.) The first place Toby published this fact was in The Compositional Structure of Bayesian Inference with Dylan Braithwaite and me, which fixed a minor problem to do with zero-probability observations in a nice way.
What this formula tells us is that if we have a Markov kernel with a known factorisation, we can compute Bayesian posteriors efficiently if we already know the Bayesian inverse of each factor. Since this is exactly the same form as differentiable programming, we have good evidence that it can go brrr. At first I thought it was completely obvious that this must be how compilers for probabilistic programming languages work, but it turns out this is not the case at all: probabilistic programming languages are monolithic. I’ve given this general methodology for computing posteriors compositionally the catchy name deep inference, because of its very close structural resemblance to deep learning.
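In the running finite-support sketch the chain rule is directly executable: we invert a composite by composing the factor inverses, with the pushforward prior in the middle.

-- (phi ; psi)†_pi  =  psi†_{pi;phi} ; phi†_pi
-- (composeK, invert and push as in the earlier sketches)
invertComposite :: (Eq y, Eq z)
                => Dist x -> Kernel x y -> Kernel y z -> Kernel z x
invertComposite prior phi psi =
  composeK (invert (push prior phi) psi) (invert prior phi)

Of course the point of the formula is precisely that the right-hand side can be much cheaper than inverting the composite directly.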
I wrote “we can compute Bayesian posteriors efficiently if we already know the Bayesian inverse of each factor”, but this is still a big if: computing posteriors even of simple functions is still hard if the dimensionality is high. Numerical methods are used in practice to approximate the posterior, and we would like to make use of these while still exploiting compositional structure.
The usual way of approximating a Bayesian inverse \varphi^\dagger_\pi is to cook up a functional form \varphi'_\pi(p) that depends on some parameters p \in \mathbb R^N. Then we find a loss function on the parameters with the property that minimising it causes the approximate inverse to converge to the exact inverse, ie. \varphi'_\pi(p) \longrightarrow \varphi^\dagger_\pi. This is called variational inference.
There are many ways to do this. Probably the most common loss function in practice is KL divergence (aka relative entropy),
\mathrm{KL}\left(\varphi^\dagger_\pi, \varphi'_\pi(p)\right) = \int P_{\varphi^\dagger_\pi}[x \mid y] \log \frac{P_{\varphi^\dagger_\pi}[x \mid y]}{P_{\varphi'_\pi(p)}[x \mid y]} \, dx
This expression is a function of y, which can optionally also be integrated over (but the next paragraph reveals a better way to use it). A closely related alternative is variational free energy, which despite being more complicated to define is more computationally tractable.
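For what it’s worth, here is the KL loss in the same finite-support sketch, under the simplifying assumption that the two distributions share a support (a real implementation would be more careful):

-- KL divergence between finite-support distributions with matching
-- supports; terms with p = 0 contribute nothing.
kl :: Eq a => Dist a -> Dist a -> Double
kl (Dist ps) (Dist qs) =
  sum [ p * log (p / q)
      | (x, p)  <- ps
      , p > 0
      , (x', q) <- qs
      , x == x' ]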
Ideally, we would like to use a functional form for which we can derive an analytic formula that tells us exactly how we should update our parameters to decrease the loss, given (possibly batched) Monte Carlo samples that are assumed to be drawn from a distribution in a certain class, such as Gaussians.
Of course in 2024 if you are serious then the functional form you use is a deep neural network, and you replace your favourite loss function by its derivative. I refer to this version as deep variational inference. There is no fundamental difference in theory, but in practice deep variational inference is necessary in order to go brrr.
Now, suppose we have two Markov kernels P_\varphi[y \mid x] and P_\psi[z \mid y] which we compose. Suppose we have a prior P_\pi[x] for \varphi, which pushes forward to a prior P_{\pi;\varphi}[y] for \psi. We pick a functional form for approximating each Bayesian inverse, which we call P_{\varphi'_\pi(p)}[x \mid y] and P_{\psi'_{\pi;\varphi}(q)}[y \mid z].
Doing this requires a major generalisation of our loss function. This was found by Toby Smithe in Compositional active inference 1. The method he developed comes straight from compositional game theory, and this appearance of virtually identical structure in game theory and Bayesian inference is absolutely one of the most core ideas of categorical cybernetics as I envision it.
The idea is to define the loss of an approximate inverse to a kernel \varphi : X \to Y in a context that includes not only a prior distribution on X, but also a (generally nonlinear) function k called the continuation, that transforms probability distributions on Y. The continuation is a black box that describes how predictions transform into observations. Then when y appears free in the expressions for KL divergence and variational free energy, we integrate it over the distribution k(\pi;\varphi).
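In types, a context for approximating the inverse of a kernel X \to Y is then a prior together with a continuation; a minimal sketch in the same Haskell notation as above (again, my names):

-- The continuation is a black box transforming distributions on the
-- output: it describes how predictions turn into observations.
type Continuation y = Dist y -> Dist y

data Context x y = Context
  { prior        :: Dist x
  , continuation :: Continuation y
  }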
So for our composite kernel \varphi;\psi, as well as the prior \pi on X we also have a continuation k that transforms distributions on Z. In order to optimise the parameters (p, q) in this context, we divide the problem into two sub-problems:
- Optimise the parameters p for \varphi in the context given by the prior \pi on X and the continuation k' on Y given by k'(\sigma) = k(\sigma;\psi) ; \psi'_\sigma(q)
- Optimise the parameters q for \psi in the context given by the prior \pi;\varphi on Y and the continuation k on Z
Notice that the optimisation step for p involves the current value of q, but not vice versa. It is easy to prove that this method correctly converges to the total Bayesian inverse by a dynamic programming argument, if we first optimise q to convergence and then optimise p. However, Toby and I conjecture that this procedure also converges if p and q are optimised asynchronously, which means the procedure can be parallelised.
The convergence conjecture in the previous section crucially relies on the fact that the prediction kernels φ and ψ are fixed, and we are only trying to approximate their Bayesian inverses. That is why I referred to it as passive inference. The term active inference means several different things (more on this in the next section) but one thing it should mean is that we simultaneously learn to do both prediction and inference.
Toby and I think that if we do this, the compositionality result breaks. In particular, if we also have a parametrised family of prediction kernels \varphi(p) which converge to our original kernel \varphi, it is not the case that
\psi'_{\pi;\varphi(p)}(q) ; \varphi'_\pi(p) \longrightarrow (\varphi;\psi)^\dagger_\pi
Specifically, we think that the nonlinear dependency of \psi'_{\pi;\varphi(p)}(q) on \varphi(p) causes things to go wrong.
One way of saying this negative conjecture is: compositional active inference can fail to converge to true beliefs, even in a stationary environment. The main reason you’d want to do this anyway, even at the expense of getting the wrong answer, is that it might go brrr - but whether this is really true remains to be seen.
We can, however, put a positive spin on this negative result. I am known for the idea that the opposite of compositionality is emergence, from this blog post. A compositional active inference system does not behave like the sum of its parts. The interaction between components can prevent them from learning true beliefs, but can it do anything positive for us? So far we know nothing about how this emergent learning dynamics behaves, but our optimistic hope is that it could be responsible for what is normally called things like intelligence and creativity - on the basis that there aren’t many other places that they could be hiding.
Boosted by the last paragraph, we now fully depart the realm of mathematical conjecture and enter the outer wilds of hot takes, increasing in temperature towards the end.
So far I’ve talked about active inference but not mentioned what is probably the most important thing in the cloud of ideas around the term: conflating prediction and control. Ordinarily, we would think of P_{\pi;\varphi}[y] as prediction and P_{\varphi^\dagger_\pi}[x \mid y] as inference. However it has been proposed (I believe the idea is due to Karl Friston) that in the end P_{\pi;\varphi}[y] is interpreted as a command: at the end of a chain of prediction-inference devices comes an actuator designed to act on the external environment in order to (try to) make the prediction true. That is, a prediction like “my arm will rise” is the same thing as the command “lift my arm” when connected to my arm muscles.
This lets us add one more piece to the puzzle, namely reinforcement learning. A deep active inference system can interact with an environment (either the real world or a simulated environment), by interpreting its ultimate predictions as commands, effecting those commands into the environment, and responding with fresh observations. Over time, the system should learn to predict the response of the environment, that is to say, it will learn an internal model of its environment. If several different active inference systems interact with the same environment, then we should consider the environment of each to contain the others, and expect each to learn a model of the others, recursively.
I am not a neuroscientist, but I understand it is at least plausible that the compositional structure of the mammalian cortex exactly reflects the compositional structure of deep active inference. The cortex is shaped (in the sense of connectivity) approximately like a pyramid, with both sensory and motor areas at the bottom. In particular, the brain is not a series of tubes with sensory signals going in at one end and motor signals coming out at the other end. Obviously the basic pyramid shape must be modified with endless ad-hoc modifications at every scale developed by evolution for various tasks. So following Hofstadter’s Ant Fugue, I claim the cortex is shaped like an anthill.
The idea is that the hierarchical structure is roughly an abstraction hierarchy. Predictions (aka commands) P_\varphi[y \mid x] travel down the hierarchy (towards sensorimotor areas), transforming predictions at a higher level of abstraction P_\pi[x] into predictions at a lower level of abstraction P_{\pi;\varphi}[y]. Inferences P_{\varphi^\dagger_\pi}[x \mid y] travel up the hierarchy (away from sensorimotor areas), transforming observations at a lower level of abstraction P_\rho[y] into observations at a higher level of abstraction P_{\rho;\varphi^\dagger_\pi}[x].
Given this circularity, with observations depending on predictions recursively through many layers, I expect that the system will learn to predict sequences of inputs (as any recurrent neural network does, and notably transformers do extremely successfully) - and also sequences of sequences and so on. I predict that stability will increase up the hierarchy - that is, updates will usually be smaller at higher levels - so that at least conceptually, higher levels run on a slower timescale than lower levels. This comes back to ideas I first read almost 15 years ago in the book On Intelligence by Jeff Hawkins and Sandra Blakeslee.
Conceptually, this is exactly the same idea I wrote about in chapter 9 of The Road to General Intelligence - the main difference is that now I think I have a good idea how to actually compute commands and observations in practice, whereas back then I hand-crafted a toy proof of concept.
If both sensory and motor areas are at the bottom of the hierarchy, this raises the obvious question of what is at the top. It probably has something to do with long term memory formation, but it is almost impossible to not be thinking about consciousness at this point. I’m going to step back from this so that the hot takes in this post don’t reach their ignition temperature before the next paragraph.
The single hottest take that I genuinely believe is that deep variational reinforcement learning is all you need, and is the only conceptually plausible route to what is sometimes sloppily called “AGI” and what I refer to in private as “true intelligence”.
I should mention that none of my collaborators is as optimistic as me that deep variational reinforcement sequence learning is all you need. Uniquely among my collaborators, I am a hardcore connectionist and I believe good old fashioned symbolic methods have no essential role to play. Time will tell.
My long term goal is obviously to build this, if it works. My short term goal is to build some baby prototypes starting with passive inference, to verify and demonstrate that what works in theory also works in practice. So watch this space, because the future might be wild…

A big part of programming language design is in feedback delivery. One aspect of feedback is parse errors. Parsing is a very large area of research, and there are new developments from industry that make it easier and faster than ever to parse files. This post is about an application of dependent lenses that facilitates the job of reporting error locations from a parsing pipeline.
A simple parser could be seen as a function with the signature
parse : String -> Maybe output
where output is a parsed value. In that context, an error is represented with a value of Nothing, and a successful value is represented with Just. However, in the error case, we don’t have enough information to create a helpful diagnostic: we can only say “parse failed”, but we cannot say why or where the error came from. One way to help with that is to make the type aware of its context and carry the error location in the type:
parseLoc : String -> Either Loc output
where Loc holds the file, line, and column of the state of the parser.
This is a very successful implementation of a parser with locations and many languages deployed today use a similar architecture where the parser, and its error-reporting mechanism, keep track of the context in which they are parsing files and use it to produce helpful diagnostics.
I believe that there is a better way, one that does not require a tight integration between the error-generating process (here parsing) and the error-reporting process (here, location tracking). For this, I will be using container morphisms, or dependent lenses, to represent parsing and error reporting.
Dependent lenses are a generalisation of lenses where the backward part makes use of dependent types to keep track of the origin and destination of arguments. For reference, the type of a lens Lens a a' b b' is given by the two functions:
get : a -> b
set : a -> b' -> a'
Dependent lenses follow the same pattern, but their types are indexed:
record DLens : (a : Type) -> (a' : a -> Type) -> (b : Type) -> (b' : b -> Type) where
  get : a -> b
  set : (x : a) -> b' (get x) -> a' x
The biggest difference with lenses is the second argument of set: b' (get x). It means that we always get a b' that is indexed over the result of get; for this to typecheck, we must know the result of get.
This change in types allows a change in perspective. Instead of treating lenses as ways to convert between data types, we use lenses to convert between query/response APIs.
On each side, A and B are queries and A' and B' are corresponding responses. The two functions defining the lens have type get : A -> B and set : (x : A) -> B' (get x) -> A' x, that is, a way to translate queries, and a way to rebuild responses given a query. A lens is therefore a mechanism to map from one API to another.
If the goal is to find on what line an error occurs, then what the get function can do is split our string into multiple lines, each of which will be parsed separately.
splitLines : String -> List String
Once we have a list of strings, we can call a parser on each line; this will be a function like the one above, parseLine : String -> Maybe output. By composing those two functions we have the signature String -> List (Maybe output). This gives us a hint as to what the response for splitLines should be: it should be a list of potential outputs. If we draw our lens again we have the following types:
We are using (String, String) on the left to represent “files as inputs” and “messages as outputs”, both of which are plain strings.
There is a slight problem with this: given a List (Maybe output) we actually have no way to know which of the values refers to which line. For example, if the outputs are numbers and we know the input is the file
23

24
3
and we are given the output [Nothing, Nothing, Just 3], we have no clue how to interpret the Nothing values and how they relate to the result of splitting the lines; they’re not even the same size. We can “guess” some behaviors but that’s really flimsy reasoning; ideally the API translation system should keep track of that so that we don’t have to guess what the correct behavior is. And really, it should be telling us what the relationship is; we shouldn’t even be thinking about this.
So instead of using plain lists, we are going to keep the information in the type by using dependent types. The following type keeps track of an “origin” list and its constructors store values that fulfill a predicate in the origin list along with their position in the list:
data Some : (a -> Type) -> List a -> Type where
  None : Some p xs                            -- stop: no witnesses from here on
  This : p x -> Some p xs -> Some p (x :: xs) -- a witness for the head, plus the rest
  Skip : Some p xs -> Some p (x :: xs)        -- no witness for the head
We can now write the above situation with the type Some (const Unit) ["23", "", "24", "3"], which is inhabited by the value Skip $ Skip $ Skip $ This () None to represent the fact that only the last element is relevant to us. This ensures that the response always matches the query. Once we are given a value like the above, we can convert our response into a string that says "only 3 parsed correctly".
Equipped with dependent lenses, and a type to keep track of partial errors, we can start writing a parsing pipeline that keeps track of locations without interfering with the actual parsing. For this, we start with a simple parsing function:
containsEven : String -> Maybe Int
containsEven str = parseInteger str >>= (\i : Int => toMaybe (even i) i)
This will return a number if it’s even, otherwise it will fail. From this we want to write a parser that will parse an entire file, and return errors where the file does not parse. We do this by writing a lens that will split a file into lines and then rebuild responses into a string such that the string contains the line number.
splitFile : (String :- String) =%> SomeC (String :- output)
splitFile = MkMorphism lines printErrors
  where
    printError : (orig : List String) -> (i : Fin (length orig)) -> String
    printError orig i = "At line \{show (cast {to = Nat} i)}: Could not parse \"\{index' orig i}\""

    printErrors : (input : String) -> Some (const error) (lines input) -> String
    printErrors input x = unlines (map (printError (lines input)) (getMissing x))
Some notation: =%> is the binary operator for dependent lenses, and :- is the binary operator for non-dependent boundaries. Later, !> will be used for dependent boundaries.
printErrors builds an error message by collecting the line numbers that failed. We use the missing values from Some as failed parses. Equipped with this program, we should be able to generate an error message that looks like this:
At line 3: could not parse "test"
At line 10: could not parse "-0.012"
At line 12: could not parse ""
The only thing left is to put together the parser and the line splitter. We do this by composing them into a larger lens via lens composition and then extracting the procedure from the larger lens. First we need to convert our parser into a lens.
Any function a -> b can also be written as a -> () -> b, and any function of that type can be embedded in a lens (a :- b) =%> (() :- ()). That’s what we do with our parser, and we end up with this lens:
parserLens : (String :- Maybe Int) =%> CUnit -- this is the unit boundary () :- ()
parserLens = embed parser
We can lift any lens with a failable result into one that keeps track of the origin of the failure:
lineParser : SomeC (String :- Int) =%> CUnit
lineParser = someToAll |> AllListMap parserLens |> close
We can now compose this lens with the one above that adjusts the error message using the line number:
composedParser : (String :- String) =%> CUnit
composedParser = splitFile |> lineParser
Knowing that a function a -> b can be converted into a lens (a :- b) =%> CUnit, we can do the opposite: we can convert any lens with a unit codomain into a simple function, which gives us a very simple String -> String program:
mainProgram : String -> String
mainProgram = extract composedParser
We can run it as part of a command-line program:
main : IO ()
main = do putStrLn "give me a file name"
          fn <- getLine
          Right fileContent <- readFile fn
            | Left err => printLn err
          let output = mainProgram fileContent
          putStrLn output
          main
And given the file:
0
2

-3
20
04
1.2
We see:
At line 2: Could not parse ""
At line 3: Could not parse "-3"
At line 6: Could not parse "1.2"
The program we’ve seen is great, but it’s not super clear why we would bother with such a level of complexity if we just want to keep track of line numbers. That is why I will now show how to use the same approach to keep track of file origin without touching the existing program.
To achieve that, we need a lens that will take a list of files, and their content, and keep track of where errors emerged using the same infrastructure as above.
First, we define a filesystem as a mapping of file names to a file content:
Filename = String
Content = String
Filesystem = List (Filename * Content)
A lens that splits problems into files and rebuilds errors from them will have the following type:
handleFiles : Interpolation error =>
              (Filesystem :- String) =%> SomeC (String :- error)
handleFiles = MkMorphism (map π2) matchErrors
  where
    matchErrors : (files : List (String * String)) ->
                  Some (const error) (map π2 files) ->
                  String
    matchErrors files x = unlines (map (\(path && err) => "In file \{path}:\n\{err}") (zipWithPath files x))
This time I’m representing failures with the presence of a value in Some rather than its absence. The rest of the logic is similar: we reconstruct the data from the values we get back in the backward part and return a flat String as our error message.
Combining this lens with the previous parser is as easy as before:
filesystemParser : (Filesystem :- String) =%> CUnit
filesystemParser = handleFiles |> map splitFile |> join {a = String :- Int} |> lineParser
fsProgram : Filesystem -> String
fsProgram = extract filesystemParser
We can now write a new main function that will take a list of files and return the errors for each file:
main2 : IO ()
main2 = do files <- askList []
           filesAndContent <- traverse (\fn => map (fn &&) <$> readFile fn) (reverse files)
           let Right contents = sequence filesAndContent
             | Left err => printLn err
           let result = fsProgram contents
           putStrLn result
We can now write two files. file1:
0
2

-3
20
04
1.2
file2:
7
77
8
And obtain the error message:
In file 'file1':
At line 2: Could not parse ""
At line 3: Could not parse "-3"
At line 6: Could not parse "1.2"
In file 'file2':
At line 0: Could not parse "7"
At line 1: Could not parse "77"
All that without touching our original parser, or our line tracking system.
We’ve only touched the surface of what dependent lenses can do for software engineering with this toy example. Yet the example is simple enough to be introduced and resolved in one post, while also showing a solution to a complex problem that affects parsers and compilers across the spectrum of programming languages. In truth, dependent lenses can do much more than what is presented here: they can deal with effects, non-deterministic systems, machine learning, and more. One of the biggest barriers to mainstream adoption is the availability of dependent types in programming languages. The above was written in Idris, a language with dependent types, but if your language of choice adopts dependent types one day, then you should be able to write the same program as we did just now, but for large-scale production software.
The program is available on gitlab.
Some time ago, in a previous blog post, we introduced our software engine for game theoretic modelling. In this post, we expand more on how to apply the engine to use cases relevant for the Ethereum ecosystem. We will consider an analysis of a simplified staking protocol. Our focus will be on compositionality – what this means from the perspective of representing protocols and from the perspective of analyzing protocols.
We end with an outlook on the further development of the engine, what its current limitations are and how we work on overcoming them.
The codebase of the example discussed can be found here. If you have never seen the engine before, we advise you to go back to our earlier post. Also note that there exists a basic tutorial that explains how the engine works. Lastly, here is a recent presentation Philipp gave at the Ethconomics workshop at DevConnect Amsterdam.
Consider a simplified model of a staking protocol. The staking protocol is motivated by Ethereum proof of stake. The model we introduce is relevant because, even though it is simple, it shines a light on how a previous version of the staking protocol was subject to reorg attacks, as discussed in this paper. We thank Barnabé Monnot for pointing us to the problem in the first place and helping us with the specification and modelling.
In what follows, we give a short verbal summary of the protocol.
To begin with, we model a chain as a (compositional) relation. The chain contains blocks with unique identifiers as well as voting weights. The weights correspond to votes by validators on the specific blocks contained in the chain. Here is an example of such a chain in the case of two validators:
The staking protocol consists of episodes. Within each episode, which lasts for several time steps, a proposer decides whether to extend the chain by a further block. If the proposer extends the chain, he chooses on which block to build. Consider the following example, in which the proposer extends the above chain:
The new block he generates will initially have no votes attesting to this block being the legitimate successor. This assessment is conducted by two validators.
These two validators observe the last stage of the chain before their episode starts and they observe a possible change to the chain made by the proposer within their episode. The validators can then vote on the block which they view as the legitimate successor. Here is the continued example from above:
Both the proposer’s as well as the validators’ choices will be evaluated in the next episode. If the decisions they made, i.e. the building on a specific block by the proposer as well as the voting by the validators, are on the path to the longest weighted chain, they will receive a reward.
From a modelling perspective, this is an important feature. The agents’ remuneration in episode t will be determined in episode (t+1). We will come back to this feature.
So far, the setup seems simple enough. However, the picture is complicated by possible network issues. Messages may be delayed. For instance, the two validators might not observe a message by the proposer in their episode simply due to the network being partitioned.
Hence, in this specific case, when a message does not reach the validators, they cannot be sure whether the message was actually not sent or whether they just have not received it yet.
Real world network issues like delay complicate the incentives. They also open avenues for malicious agents. Modelling the arising incentive problems in game-theoretic terms is a formidable challenge, as the timing of moves and information is itself affected by the moves of players. For instance, in the reorg attack mentioned in the beginning, a malicious proposer might want to delay sending information until the next episode has started. In that way he might draw validators away from the honest proposer of that episode and instead have them vote on the block that he created late.
The practical modelling of such interactions is not obvious (and in fact motivated a new research project on our end). Here, we dramatically simplify the problem. We get rid of time completely. Instead, we leverage a key feature of our approach: games are defined as open systems, open to their environment and waiting for information.
Through the environment we can feed in specific information we want. Concretely, we can expose the proposer and validators in a given episode exactly to the kind of reorg scenario mentioned above: Proposer and validators are facing differing information regarding the state of the chain.
Besides simplifying the model, proceeding in this way has a further advantage. The analysis of optimal moves is static and only relative to the context. It thereby becomes much simpler.
In order to construct a game-theoretic model of the protocol, we will build up the protocol from the bottom up using building blocks.
We begin with the boring but necessary parts that describe the mechanics of the protocol. These components are mostly functions lifted into games as computations. In order not to introduce too much clutter in this post, we focus on the open games representations and hide the details of the auxiliary function implementations. These functions are straightforward, and it should hopefully be clear from the context what they do.
Given a chain, determineHeadOfChain produces the head of the current chain:
determineHeadOfChain = [opengame|
    inputs    : chain ;
    feedback  : ;

    :-----:
    inputs    : chain ;
    feedback  : ;
    operation : forwardFunction $ determineHead ;
    outputs   : head ;
    returns   : ;
    :-----:

    outputs   : head ;
    returns   : ;
  |]
Given the old chain from (t−1) and the head of the chain from (t−2), oldProposerAddedBlock determines whether the proposer actually did send a new block in (t−1). It also outputs the head of the chain for period (t−1) - as this is needed in the next period.
oldProposerAddedBlock = [opengame|

    inputs    : chainOld, headOfChainIdT2 ;
    feedback  : ;

    :-----:
    inputs    : chainOld, headOfChainIdT2 ;
    feedback  : ;
    operation : forwardFunction $ uncurry wasBlockSent ;
    outputs   : correctSent, headOfChainIdT1 ;
    returns   : ;
    :-----:

    outputs   : correctSent, headOfChainIdT1 ;
    returns   : ;
  |]
Given the decision by the proposer to either wait or to send a head, addBlock creates a new chain: either the old chain is copied as before, or the chain is extended by a new block.
addBlock = [opengame|

    inputs    : chainOld, chosenIdOrWait ;
    feedback  : ;

    :-----:
    inputs    : chainOld, chosenIdOrWait ;
    feedback  : ;
    operation : forwardFunction $ uncurry addToChainWait ;
    outputs   : chainNew ;
    returns   : ;

    :-----:

    outputs   : chainNew ;
    returns   : ;
  |]
The following diagram summarizes the information flow in these building blocks.
Given the old chain from (t−1), the proposer decides whether to append a new block to a node or not. Conditional on that decision, a new chain is created.
proposer name = [opengame|
    inputs    : chainOld ;
    feedback  : ;

    :-----:
    inputs    : chainOld ;
    feedback  : ;
    operation : dependentDecision name alternativesProposer ;
    outputs   : decisionProposer ;
    returns   : 0 ;

    inputs    : chainOld, decisionProposer ;
    feedback  : ;
    operation : addBlock ;
    outputs   : chainNew ;
    returns   : ;

    :-----:

    outputs   : chainNew ;
    returns   : ;
  |]
Given a newly proposed chain and the old chain from (t−1), the validator then decides which node to attest as the head.
validator name = [opengame|

    inputs    : chainNew, chainOld ;
    feedback  : ;

    :-----:
    inputs    : chainNew, chainOld ;
    feedback  : ;
    operation : dependentDecision name (\(chainNew, chainOld) -> [1, vertexCount chainNew]) ;
    outputs   : attestedIndex ;
    returns   : 0 ;
    // ^ NOTE the payoff for the validator comes from the next period

    :-----:

    outputs   : attestedIndex ;
    returns   : ;
  |]
This open game is parameterized by a specific player (name). The information flow of the decision open games is depicted in the next diagram:
The central aspect of the protocol is how the payoffs of the different players are determined. For both proposers and validators we split the payoff components into two parts. First, we create open games which are mere accounting devices, i.e. they just update a player’s payoff.
updatePayoffValidator:
- determines the value that a validator should receive conditional on his action being assessed as correct, and
- updates the value for a specific validator. This open game is parameterized by a specific player (name).
updatePayoffValidator name fee = [opengame|
    inputs    : bool ;
    feedback  : ;

    :-----:
    inputs    : bool ;
    feedback  : ;
    operation : forwardFunction $ validatorPayoff fee ;
    outputs   : value ;
    returns   : ;
    // ^ Determines the value

    inputs    : value ;
    feedback  : ;
    operation : addPayoffs name ;
    outputs   : ;
    returns   : ;
    :-----:

    outputs   : ;
    returns   : ;
  |]
updatePayoffProposer works analogously to the validators’: first, determine the value the proposer should receive depending on his action; second, do the book-keeping and add the payoff to name’s account.
updatePayoffProposer name reward = [opengame|
    inputs    : bool ;
    feedback  : ;

    :-----:
    inputs    : bool ;
    feedback  : ;
    operation : forwardFunction $ proposerPayoff reward ;
    outputs   : value ;
    returns   : ;
    // ^ Determines the value

    inputs    : value ;
    feedback  : ;
    operation : addPayoffs name ;
    outputs   : ;
    returns   : ;
    :-----:

    outputs   : ;
    returns   : ;
  |]
proposerPayment embeds updatePayoffProposer into a larger game where the first stage includes a function, proposedCorrect, lifted into the open game. That function does what its name suggests: given the latest chain and a Boolean value indicating whether the proposer actually added a block, it determines whether the proposer proposed correctly, according to the protocol.
proposerPayment name reward = [opengame|

    inputs    : blockAddedInT1, chainNew ;
    feedback  : ;

    :-----:
    inputs    : blockAddedInT1, chainNew ;
    feedback  : ;
    operation : forwardFunction $ uncurry proposedCorrect ;
    outputs   : correctSent ;
    returns   : ;
    // ^ This determines whether the proposer was correct in period (t-1)

    inputs    : correctSent ;
    feedback  : ;
    operation : updatePayoffProposer name reward ;
    outputs   : ;
    returns   : ;
    // ^ Updates the payoff of the proposer given decision in period (t-1)

    :-----:

    outputs   : ;
    returns   : ;
  |]
This last game already showcases a pattern that we will now see repeatedly: using the primitive components, we go on to build up larger games. All the needed “molding” parts have been put on the table; all that follows will be about composing those elements.
Let us consider another example for composition.
validatorsPayment groups the payments for the validators included, here two, into one game.
validatorsPayment name1 name2 fee = [opengame|

    inputs    : validatorHashMap, chainNew, headId ;
    feedback  : ;

    :-----:
    inputs    : validatorHashMap, chainNew, headId ;
    feedback  : ;
    operation : forwardFunction $ uncurry3 $ attestedCorrect name1 ;
    outputs   : correctAttested1 ;
    returns   : ;
    // ^ This determines whether validator 1 was correct in period (t-1) using the latest hash and the old information

    inputs    : validatorHashMap, chainNew, headId ;
    feedback  : ;
    operation : forwardFunction $ uncurry3 $ attestedCorrect name2 ;
    outputs   : correctAttested2 ;
    returns   : ;
    // ^ This determines whether validator 2 was correct in period (t-1)

    inputs    : correctAttested1 ;
    feedback  : ;
    operation : updatePayoffValidator name1 fee ;
    outputs   : ;
    returns   : ;
    // ^ Updates the payoff of validator 1 given decision in period (t-1)

    inputs    : correctAttested2 ;
    feedback  : ;
    operation : updatePayoffValidator name2 fee ;
    outputs   : ;
    returns   : ;
    // ^ Updates the payoff of validator 2 given decision in period (t-1)

    :-----:

    outputs   : ;
    returns   : ;
  |]
This concludes the blocks for generating payments. The information flow of these components is depicted in the following diagram:
validatorsGroupDecision groups all the validators’ decisions considered into one game. The output of this game is a map (in the programming sense) connecting the name of each validator with her/his decision.
validatorsGroupDecision name1 name2 = [opengame|

    inputs    : chainNew, chainOld, validatorsHashMapOld ;
    feedback  : ;

    :-----:

    inputs    : chainNew, chainOld ;
    feedback  : ;
    operation : validator name1 ;
    outputs   : attested1 ;
    returns   : ;
    // ^ Validator1 makes a decision

    inputs    : chainNew, chainOld ;
    feedback  : ;
    operation : validator name2 ;
    outputs   : attested2 ;
    returns   : ;
    // ^ Validator2 makes a decision

    inputs    : [(name1,attested1),(name2,attested2)], validatorsHashMapOld ;
    feedback  : ;
    operation : forwardFunction $ uncurry newValidatorMap ;
    outputs   : validatorHashMap ;
    returns   : ;
    // ^ Creates a map of which validator voted for which index

    inputs    : chainNew, [attested1,attested2] ;
    feedback  : ;
    operation : forwardFunction $ uncurry updateVotes ;
    outputs   : chainNewUpdated ;
    returns   : ;
    // ^ Updates the chain with the relevant votes

    :-----:

    outputs   : validatorHashMap, chainNewUpdated ;
    returns   : ;
  |]
The group of validators together is not really anything exciting but it serves to illustrate a general point. The nesting of smaller games into larger games is mostly about establishing clear interfaces. As long as we do not change the interfaces, we can change the internal behavior. When we build our model in several steps and refine it over time, this is very helpful. Here, for instance, the payment for an individual validator might change. But such a change is only required in one place - assuming the interaction with the outside world does not change - and will not affect the wider construction of the game. In other words, it reduces efforts in rewriting games.
Similarly, we chose the output type of the grouped validators with the intention that it would be easy to add more validators while keeping the interface, the mapping of validators to their decisions, intact.
The next diagram illustrates the composition of components and the information flow.
Having assembled all the necessary components, we can now turn to a model of an episode of the complete protocol.
Given the previous chain (t−1), the block which was the head of the chain in (t−2), and the voting decisions of the previous validators, this game puts all the decisions together.
oneEpisode p0 p1 a10 a20 a11 a21 reward fee = [opengame|

    inputs    : chainOld, headOfChainIdT2, validatorsHashMapOld ;
    // ^ chainOld is the old hash
    feedback  : ;

    :-----:
    inputs    : chainOld ;
    feedback  : ;
    operation : proposer p1 ;
    outputs   : chainNew ;
    returns   : ;
    // ^ Proposer makes a decision, a new hash is proposed

    inputs    : chainNew, chainOld, validatorsHashMapOld ;
    feedback  : ;
    operation : validatorsGroupDecision a11 a21 ;
    outputs   : validatorHashMapNew, chainNewUpdated ;
    returns   : ;
    // ^ Validators make a decision

    inputs    : chainNewUpdated ;
    feedback  : ;
    operation : determineHeadOfChain ;
    outputs   : headOfChainId ;
    returns   : ;
    // ^ Determines the head of the chain

    inputs    : validatorsHashMapOld, chainNewUpdated, headOfChainId ;
    feedback  : ;
    operation : validatorsPayment a10 a20 fee ;
    outputs   : ;
    returns   : ;
    // ^ Determines whether validators from period (t-1) were correct and get rewarded

    inputs    : chainOld, headOfChainIdT2 ;
    feedback  : ;
    operation : oldProposerAddedBlock ;
    outputs   : blockAddedInT1, headOfChainIdT1 ;
    returns   : ;
    // ^ This determines whether the proposer from period (t-1) did actually add a block or not

    inputs    : blockAddedInT1, chainNewUpdated ;
    feedback  : ;
    operation : proposerPayment p0 reward ;
    outputs   : ;
    returns   : ;
    // ^ This determines whether the proposer from period (t-1) was correct and triggers payments accordingly

    :-----:

    outputs   : chainNewUpdated, headOfChainIdT1, validatorHashMapNew ;
    returns   : ;
  |]
For clarity, the diagram below illustrates the interacting of the different components and their information flow.
One important thing to note is that this game representation has no inherent dynamics. This is due to a general principle behind the theory of open games, which does not have a notion of time.
This is a limitation in the sense that we cannot see the dynamics unfold. It also has advantages though: the incentive analysis has no side-effects; in functional programming terms, it acts like a pure function and is referentially transparent relative to some state of the game.
Once we have represented the one-episode model, we have choices. We can directly work with that model. And we will do that in the next section. But we can also construct “larger models”: Either by manually combining several episodes into a new multi-episode model or by embedding the single episode into a Markov game structure.
We do not cover the construction proper or the analysis of the Markov game in this post. But the idea is simple: the stage game is a state in a Markov game where the state is fully captured by the inputs to the stage game. A Markov strategy then determines the move in the stage game. This, in turn, allows us to derive the next state of the Markov game. To analyze such a game, we can approximate the future payoff from the point of view of a single player under the assumption that the other players keep playing their strategy. In that way we can also assess unilateral deviations for the player in focus.
After having established a model of the protocol, let us turn to its analysis. The whole point is of course not to represent the games but to learn something about the incentives of the agents involved.
It is important to note here that the model we arrived at above is just one possible way to represent the situation. Obviously, the engine cannot guarantee that you end up with a useful model. What it should guarantee is that you can adapt the model quickly and iterate through a range of models. The “one true model” rarely exists; being able to adapt and consider many different scenarios is the default.
We will illustrate two analyses. The first shows that the protocol works as intended: agents who follow it truthfully end up in an equilibrium. The second shows that, in its current form, the protocol runs into problems if a proposer chooses to delay his message strategically, which makes it susceptible to attacks.
We will first illustrate that the protocol works as intended if all agents involved are honest. They all observe the current head of the chain; proposers build a new block on top of that head, and validators attest that head. The analysis can be found in HonestBehavior.hs.
We use the oneEpisode model. That is, we slice the protocol into one period and supply both the initial information at which that round begins and a continuation for how the game proceeds in the next round. Recall that the rewards for proposer and validators in period t are determined in period (t+1). This information, the initialization and the continuation, is fed in through initialContextLinear, where “linear” signifies that we consider a non-forked chain.
initialContextLinear p a1 a2 reward successFee =
StochasticStatefulContext
(pure ((),(initialChainLinear, 3, initialMap)))
(\_ x -> feedPayoffs p a1 a2 reward successFee x)
This expression looks more complicated than it actually is. The first part, pure ((),(initialChainLinear, 3, initialMap)), determines the starting conditions of the situation we consider. That is, we provide the input parameters which oneEpisode expects from us. Among other things, this contains the initial chain we start with, replicated from above as a reminder:
The second part, (\_ x -> feedPayoffs p a1 a2 reward successFee x), describes a function which computes the payoff from the current action in the next period. Details of how this payoff is determined can be found in the implementation of feedPayoffs.
Again, the way we approach this problem is by exploiting the key feature of open games: the one-episode model is like a pipe expecting some inflows and outflows. Once we have them defined, we can analyze what is going on inside of that “pipe”.
The last elements needed for our analysis are the strategies. We define “honest” strategies for both proposer and validators: strategyProposer and strategyValidator.
Both types of agents observe previous information, for instance the past chain; the proposer then builds on the head of the chain, and the validators attest it.
Note that the strategies include a condition dealing with the scenario in which there is no unique head of the chain. In the analysis we focus on here, where everyone behaves honestly, we never reach this case. However, once not all agents are honest, there may be scenarios in which the head is not unique. This will be important in the second case we analyze below.
Once we have defined the strategies, there is only one thing left to do: initialize the game with some parameters, specifically the reward for the proposer and the fee for the validators.
In the file HonestBehavior.hs you can find one such parameterization, analyzeScenario:
analyzeScenario = eqOneEpisodeGame "p0" "p1" "a10" "a20" "a11" "a21" 2 2 strategyOneEpisode (initialContextLinear "p1" "a11" "a21" 2 2)
This game employs the honest strategies. If we query it, we see that neither the proposer nor the validators have an incentive to deviate. These strategies form an equilibrium, as intended in the design of the protocol.
Let us turn to a second analysis, which can be found in Attacker.hs.
In that episode we continue to consider the behavior of honest agents. However, these agents start out on a chain that we assume has been intentionally delayed by the proposer in the episode before. This is achieved by adding an additional input, chainManipulated, to oneEpisodeAttack, which is otherwise equivalent to oneEpisode: we as analysts can manipulate the chain that the proposer and the validators see.
oneEpisodeAttack p0 p1 a10 a20 a11 a21 reward fee = [opengame|

    inputs    : chainOld, headOfChainIdT2, validatorsHashMapOld, chainManipulated ;
    // ^ chainOld is the old hash
    feedback  : ;

    :-----:
    inputs    : chainOld ;
    feedback  : ;
    operation : proposer p1 ;
    outputs   : chainNew ;
    returns   : ;
    // ^ Proposer makes a decision, a new hash is proposed

    inputs    : chainNew, chainManipulated ;
    feedback  : ;
    operation : mergeChain ;
    outputs   : mergedChain ;
    returns   : ;
    // ^ Merges the two chains into a new chain for the validators

    inputs    : mergedChain, chainOld, validatorsHashMapOld ;
    feedback  : ;
    operation : validatorsGroupDecision a11 a21 ;
    outputs   : validatorHashMapNew, chainNewUpdated ;
    returns   : ;
    // ^ Validators make a decision

    inputs    : chainNewUpdated ;
    feedback  : ;
    operation : determineHeadOfChain ;
    outputs   : headOfChainId ;
    returns   : ;
    // ^ Determines the head of the chain

    inputs    : validatorsHashMapOld, chainNewUpdated, headOfChainId ;
    feedback  : ;
    operation : validatorsPayment a10 a20 fee ;
    outputs   : ;
    returns   : ;
    // ^ Determines whether validators from period (t-1) were correct and get rewarded

    inputs    : chainOld, headOfChainIdT2 ;
    feedback  : ;
    operation : oldProposerAddedBlock ;
    outputs   : blockAddedInT1, headOfChainIdT1 ;
    returns   : ;
    // ^ This determines whether the proposer from period (t-1) did actually add a block or not

    inputs    : blockAddedInT1, chainNewUpdated ;
    feedback  : ;
    operation : proposerPayment p0 reward ;
    outputs   : ;
    returns   : ;
    // ^ This determines whether the proposer from period (t-1) was correct and triggers payments accordingly

    :-----:

    outputs   : chainNewUpdated, headOfChainIdT1, validatorHashMapNew ;
    returns   : ;
  |]
This simulates the situation where the malicious proposer from the previous episode sends a block after the honest proposer from this episode has added his own block. As a result there are now two nodes in the chain with 0 votes on them; in other words, there are two contenders for the head of the chain. The chain at this point in time looks like this:
The next steps are analogous to the previous analysis: we define the inputs and how the game continues. Lastly, we need to define strategies.
We consider two strategies for the validators, adapted to this specific scenario: either they vote with the honest proposer, i.e. vote for node 4 (strategyValidator4), or they vote with the attacker’s block, i.e. vote for node 5 (strategyValidator5). We assume the proposer behaves honestly as before.
If we run the equilibrium check on these two scenarios, analyzeScenario4 and analyzeScenario5, we see that both constitute an equilibrium. That is, in both cases none of the players has an incentive to deviate. Obviously, the scenario where the validators vote for the malicious proposer is not an equilibrium we want from the design perspective of the protocol.
We can shed further light on what is going on here. So far we assumed that the validators coordinate on one node: they either both choose node 4 or both choose node 5. The key issue is that they observe two candidate nodes for the new head of the chain. We can also consider the case where the validators randomize when facing a tie (analyzeScenarioRandom). In that case, the result is not an equilibrium: both validators would profit from voting for another block. The reason is simple: they are not coordinated. When each randomly draws one of the two heads, the validators may output mutually contradictory information, in which case they are not rewarded.
The development of the engine is ongoing. Protocols which involve a timing choice, for instance a proposer waiting to send information and thereby potentially learning something about the validators’ behavior in the meantime, pose a challenge for the current implementation. One should add that they also pose a challenge for classical game representations such as the extensive form. As we have shown, it is still entirely possible to represent such games in the engine, but this modelling puts the burden on the modeller to make reasonable choices. It would be nicer to start with an actual protocol and extract a game-theoretic model from it. Extending the underlying theory and the engine to better accommodate such scenarios is at the top of our to-do list.
-
This is not the only way to model the protocol in the current implementation. It is also possible to consider a timer explicitly as a state variable. This branch contains such a model. ↩
-
We should be more precise: In the current theory of open games there is always a clear notion of causality - who moves when, and what is observed when and by whom. The relevant “events” can be organized in a relation. This follows the overall categorical structure in which open games are embedded. We are working on a version of the theory where time - or other underlying structures like networks - are what open games are based on. ↩
Cross-posted from Riu’s blog.
In modelling disciplines, one often faces the challenge of balancing three often conflicting aspects: representational elegance, the breadth of examples to capture, and the depth or specificity in capturing those examples of interest. In the context of reinforcement learning theory, this raises the question: what is an adequate ontology for the techniques involved in agents learning from interaction with an environment?
Here we take a structural approach to the above dilemma, both in the sense of structural realism and of stuff, structure, property. The characteristics of RL algorithms that we capture are their modularity and their specification via typed interfaces.
To keep this exposition grounded in something practical, we will follow an example, Q-learning, which from this point of view captures the essence of reinforcement learning. It is an algorithm that finds an optimal policy in an MDP by keeping an estimate of the value of taking a certain action in a certain state, encoded as a table Q : S×A → ℝ, and updating it both from previous estimates (bootstrapping) and from samples obtained by interacting with an environment. This is the content of the following equation (we’ll give its precise type later):
Q(s,a) \gets (1-\alpha)\,Q(s,a) + \alpha\left[r + \gamma \max_{a':A} Q(s',a')\right] \tag{1}
One also has a policy derived from the Q-table, usually an ε-greedy policy that, in state s, selects with probability 1−ε the action maximizing the estimated value, \arg\max_{a:A} Q(s,a), and a uniformly sampled action with probability ε. This choice helps to balance exploration against exploitation.
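As a minimal, generic sketch of these two ingredients in Python (the tabular representation and the names rng, Q, s are our assumptions for illustration, not code from the framework discussed here):
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    # explore uniformly with probability epsilon, otherwise exploit
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def q_update(Q, s, a, r, s_next, alpha, gamma):
    # Q(s,a) <- (1-alpha) Q(s,a) + alpha [r + gamma max_a' Q(s',a')]
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q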
Ablating either component produces other existing algorithms, which is reassuring:
- If we remove the bootstrapping component, one recovers a (model-free) one-step Monte Carlo algorithm.
- If we remove the samples, one recovers classical Dynamic Programming methods¹ such as Value Iteration. We’ll come back to these sample-free algorithms later.
Q-learning as we’ve just described it, and other major RL algorithms, can be captured as lenses; the forward map is the policy deployment from the model’s parameters, and the backward map is the update function.
The interface types vary from algorithm to algorithm. In the case of Q-learning, the forward map P has type \mathbb{R}^{S\times A} \to (DA)^S (where D is the distribution monad): it takes the current Q-table Q : S×A → ℝ and outputs a policy S → DA, namely the ε-greedy policy defined earlier. The backward map G has the following type (we define \tilde{Q} in (2)):
\begin{align*} G : \mathbb{R}^{S\times A} \times (S\times A\times R\times S) &\to T^*_{(s,a)}(S\times A) \newline (Q, (s,a,r,s')) &\mapsto \tilde{Q} \end{align*}
The type of model parameter change \Delta(\mathbb{R}^{S\times A}) = T^*_{(s,a)}(S\times A) has as elements cotangent vectors to the base space S\times A (not to \mathbb{R}^{S\times A}). This technicality allows us to define the pointwise update of equation (1) as ((s,a), g), where g = r + \gamma\max_{a':A} Q(s',a') \in \mathbb{R} is the update target of our model. The new Q-function is then defined as:
\tilde{Q}(\tilde{s},\tilde{a}) = \begin{cases} (1-\alpha)\,Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a')\right] & (\tilde{s},\tilde{a}) = (s,a) \newline Q(\tilde{s},\tilde{a}) & \text{o/w} \end{cases} \tag{2}
The quotes in the diagram above reflect the fact that explicitly showing the S and A wires below loses the dependency of the type \mathbb{R}^{S\times A} on them. This is why in the paper we prefer to write the backward map as a single box G with inputs \mathbb{R}^{S\times A} \times (S\times A\times R) and output T^*_{(s,a)}(S\times A).
Writing the change type as a cotangent space allows us to bridge the gap to Deep Learning methods. In our running example, we can do the standard transformation of the Bellman update to a Bellman error to decompose G into two consecutive steps:
-
Backward map:
\begin{align*} G : \mathbb{R}^{S\times A} \times (S\times A\times R\times S') &\to S\times A\times\mathbb{R} \newline (Q, (s,a,r,s')) &\mapsto (s,a,L) \end{align*}
The loss L is defined as the MSE between the current Q-value and the update target g:
L = (Q(s,a) - g)^2 = \left(Q(s,a) - (r + \gamma \max_{a'} \bar{Q}(s',a'))\right)^2
We treat \bar{Q}(s',a') (“Q bar”) as a constant value, so that the (semi-)gradient of L with respect to the Q-matrix is the Bellman Q-update, as we show next.
-
Feedback unit (Bellman update):
\begin{align*} (1 + S\times A\times\mathbb{R}) \times \mathbb{R}^{S\times A} &\to \mathbb{R}^{S\times A} \newline (*, Q) &\mapsto Q \newline ((s,a,L), Q) &\mapsto \tilde{Q} = Q - \frac{\alpha}{2}\frac{\partial L}{\partial Q} \end{align*}
where
\tilde{Q} = \forall(\tilde{s},\tilde{a}).\begin{cases} Q(s,a) - \alpha(Q(s,a) - g) & (\tilde{s},\tilde{a}) = (s,a) \newline Q(\tilde{s},\tilde{a}) & \text{o/w} \end{cases} = \forall(\tilde{s},\tilde{a}).\begin{cases} (1-\alpha)\,Q(s,a) + \alpha g & (\tilde{s},\tilde{a}) = (s,a) \newline Q(\tilde{s},\tilde{a}) & \text{o/w} \end{cases}
This recovers (2), so we can say that the backward map is doing pointwise gradient descent, updating only the (s,a)-indexed Q-value (a small numeric check follows this list).
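A tiny numeric check of that last claim (our own sketch, independent of the paper): one semi-gradient step on the squared Bellman error, with the target g held constant, coincides with the pointwise Bellman update.
import numpy as np

rng = np.random.default_rng(0)
Q = rng.random((5, 3))
s, a, g, alpha = 2, 1, 0.7, 0.1
# dL/dQ(s,a) = 2 (Q(s,a) - g), so Q - (alpha/2) dL/dQ ...
grad_step = Q[s, a] - (alpha / 2) * 2 * (Q[s, a] - g)
# ... equals the Bellman update (1-alpha) Q(s,a) + alpha g
bellman = (1 - alpha) * Q[s, a] + alpha * g
assert np.isclose(grad_step, bellman)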
Focusing now on sample-free algorithms, the model’s parameter update is an operator (X→R)→(X→R) between function spaces. State-value methods, for example, update value functions S→R, whereas state-action value methods update functions S×A→R (the Q-functions). More concretely, the updates of function spaces that appear in RL are known as Bellman operators. It turns out that a certain subclass, which we call linear Bellman operators, can be obtained functorially from lenses as well!
The idea is to employ the continuation functor², which is the following representable functor:
\begin{align*} \mathbb{K} =\mathbf{Lens}(-,1) : \mathbf{Lens}^\mathrm{op} &\to \mathbf{Set} \newline {X\choose R} &\mapsto R^X \newline {X\choose R}\rightleftarrows {X'\choose R'} &\mapsto R'^{X'} \to R^X \end{align*}
The contravariance already hints at the corecursive nature of these operators: they take as input a value function on states in the future, and return a value function on states in the present. The subclass of Bellman operators that we obtain this way is linear in the sense that it uses the domain function in R'^{X'} only once.
An example of this is the value improvement operator from dynamic programming. This operator improves the value function V:S\to R to give a better approximation of the long-term value of a policy \pi:S\to A, and is given by
V(s) \gets \mathbb{E}_{\mkern-14mu\substack{a\sim \pi(s)\newline (s',r)\sim t(s,a)}}[r+\gamma V(s')] = \sum _{a\in A}\pi(a\mid s) \sum _{\substack{s'\in S\newline r\in R}}t(s',r\mid s, a) (r + \gamma V(s'))
This is the image under \mathbb{K} of a lens³ whose forward and backward maps are the transition function \mathrm{pr}_1(t(-,\pi(-))):S \to S under a fixed policy \pi:S\to A, and the update target computation (-)+\gamma\cdot(=):\mathbb{R}\times \mathbb{R}\to \mathbb{R} respectively, as shown below.
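For concreteness, here is a minimal Python sketch of one sweep of this operator in the deterministic setting of the footnote below (the arrays transition and rew are placeholder assumptions: the next state and the reward under the fixed policy):
import numpy as np

num_states, gamma = 4, 0.9
transition = np.array([1, 2, 3, 0])  # s' reached from s under the fixed policy
rew = np.ones(num_states)            # reward collected at s under the fixed policy
V = np.zeros(num_states)
# backward map of the lens: update target r + gamma * V(s'), applied statewise
V = rew + gamma * V[transition]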
If you want to read more about this “optics perspective” on Value Iteration and its relation with problems like the control of an inverted pendulum, the savings problem in economics and more, check out our previous ACT2022 paper.
Once we have transformed the Bellman operator into a function using \mathbb{K}, this embeds into the backward map of the RL lens.
It is then natural to ask what a backward map that does not ignore the sample input might look like; these are what we call parametrised Bellman operators. They are obtained by lifting \mathbb{K} to the (externally parametrised) functor \mathrm{Para}(\mathbb{K}):\mathrm{Para}(\mathrm{Lens}^\mathrm{op})\to\mathrm{Set}, and capture exactly what algorithms like SARSA are doing in terms of usage of both bootstrapping and sampling.
We talked about learning from bootstrapping and from sampling as two distinct processes that fit into the lens structure. While the difference between these two is usually not emphasized enough, we believe that it is useful for understanding the structure of novel algorithms by making the information flow explicit. You can find more details, along with a discussion on Bellman operators, the iteration functor used to model stateful environments, prediction and bandit problems as nice corner cases of our framework, and more on our recent submission to ACT2024.
Moreover, this opens up the study of stateful models: multi-step methods like n-step temporal difference or Monte Carlo Tree Search (used e.g. in AlphaZero), which we will leave for a future post, so stay tuned!
-
This is sometimes called “offline” RL, but one should note that offline methods also include learning from a constant dataset and learning by updating one’s estimates only at the end of episodes. ↩
-
In general, the continuation functor is defined for any optic as \mathbb{K}=\mathbf{Optic}(\mathcal{C})(-,I):\mathbf{Optic}(\mathcal{C})^\mathrm{op}\to\mathbf{Set}, represented by the monoidal unit I. ↩
-
Ok I lied here a little: To be precise, the equation shown arises as the continuation of a mixed optic, where the forwards category is \mathrm{Kl}(D) for a probability monad D, and the backwards category is \mathrm{EM}(D). The value improvement operator that arises from the continuation of a lens is a deterministic version of this, where there’s no expectation taken in the backwards pass because we fix the policy and the transition functions to be deterministic. ↩
Suppose we have some category C, whose morphisms are some kind of processes or systems that we care about. We would like to be able to talk about contexts (or environments) in which these processes or systems can be located.
This post finally writes down part of the lore of categorical cybernetics that I’ve been working out on the back burner for a few years, and that I’ve talked about in front of various audiences a few times. I never thought it was quite compelling enough to write a paper about, but it’s been part of my bag of tricks for a while, for example playing a central role in my lecture series on compositional game theory. In the meantime, similar ideas have been invented a few times in applied category theory, most notably being taken further for talking about quantum supermaps.
Topologically, we draw morphisms of our category as nodes, which have a hole outside but no hole inside (that is to say they are really point-like, despite how we conventionally draw them) - and dually, we draw contexts as diagram elements that have a hole inside but no hole outside.
Being good category theorists, we choose not to say what a context is but how it transforms, which will lead to being able to define them via additional structure we can equip our categories with. If we have a context for morphisms X→Y, and we have morphisms f:X→X′ and g:Y′→Y, we should be able to demote these morphisms into being part of an extended environment for morphisms X′→Y′:
By asking that demoting twice gives the same result as demoting a composite, and that the order of demoting on the domain and codomain doesn’t matter, we end up inventing the following definition: a system of contexts for a category C is a functor ¯C : C×C^op → Set, and a context for morphisms X→Y is an element of ¯C(X,Y).
Things get much more interesting when C is not just a category but a symmetric monoidal category, as is virtually always the case in any applied domain. Our first guess might be to replace the functor ¯C with some kind of monoidal functor. Lax monoidal (for the cartesian monoidal product on Set) turns out to be probably what we want - this says that if we have a context for morphisms X→Y and one for morphisms X′→Y′ we can compose them to get a context for morphisms X⊗X′→Y⊗Y′, but this operation is not necessarily reversible. Topologically this is a bit subtle, and says we can bridge 2 holes with a single morphism:
We probably get away with this because we are assuming everything is symmetric monoidal. I sometimes think of holes as anti-nodes that we can slide around as though they are nodes. This part of the definition has an odd status right now: it seems that we can virtually always get it in practice, and it plays a role in the theory, but I have never actually deployed the lax monoidal structure of contexts while doing any applied work.
In any case, this is not enough to describe contexts in a symmetric monoidal category, so we need to go back to first principles.
Suppose we have a symmetric monoidal category and we have a context for morphisms X⊗X′→Y⊗Y′, and suppose we have a morphism f:X→Y. Similarly to before, we should be able to demote f into the context, obtaining a context for morphisms X′→Y′:
I wrote this definition in section 9 of The Game Semantics of Game Theory. But it turns out this isn’t the best way to write it: it’s enough to be able to demote an identity morphism, with an operation ¯C(Z⊗X,Z⊗Y)→¯C(X,Y):
A category theorist would call this a (monoidal) costrength for ¯C, although I find it useful to think of it as a kind of tensor contraction.
But there’s another way to think about this whole thing. Given a symmetric monoidal category C, a comb in C is a diagram element with 1 hole on the inside and 1 hole on the outside:
(Note, drawing them with this “comb” shape is enough because our ambient category is symmetric. In a planar setting, we would actually have to puncture a box with a hole.)
Concretely, a comb consists of a pair of morphisms coupled through a “residual” wire - but by drawing a box around it, we lose the ability to distinguish combs that differ by sliding a morphism between the front and back along the residual wire:
This turns out to be exactly the definition of an optic in C - I think of combs as one syntactic presentation (among several others) of the semantic concept of an optic in a category. There is a category Optic(C) whose objects are pairs of objects of C, and whose morphisms are combs. Whereas string diagrams in C compose left-to-right, these “comb diagrams” in C compose outside-in, like an operad:
We also get a symmetric monoidal product on Optic(C) that encompasses what I said earlier about sliding holes around. Now we get an alternative definition of context: it’s a generalised state of optics. That is to say, it’s an ultimate outside, which can be transformed by attaching a comb to the inside of the hole:
If we do this, the properties we had to demand of the co-strength map get absorbed into the quotient defining optics.
What is a “generalised state”? A state in a monoidal category C is a morphism from the monoidal unit, and a generalised state is something that transforms like a state: an element of some lax monoidal functor C→Set. That is to say: if we have a generalised state x of X and a morphism f:X→Y, we get a pushforward state f∗(x); and if we have generalised states x of X and y of Y, we get a state x⊗y of X⊗Y.
So now we have 2 different definitions of a system of contexts: as a lax monoidal functor C×C^op→Set equipped with a co-strength map, or as a lax monoidal functor Optic(C)→Set. Fortunately, these definitions turn out to be equivalent: it’s a dual of the profunctor representation theorem. The normal version of this theorem says that Tambara modules - endo-profunctors on C equipped with a strength map - are equivalent to functors Optic(C)^op→Set. It turns out that a Tambara module on C^op is the same thing as a Tambara module, which conveniently frees up the name Tambara co-module to be used for this thing.
(A word of warning: the paper I linked defines “Optic(C)” to be Optic(C)^op, which means they say Optic(C)→Set when they mean Optic(C)^op→Set and vice versa.)
As a personal anecdote, at different points I’ve convinced myself that both of these definitions were the correct definition of “system of contexts”, before realising that they were equivalent by the profunctor representation theorem - this led to me getting some quite good, graphical intuition for this otherwise notoriously abstract theorem.
Some time after working out the last part of this, I learned about the existence of this paper by Hermida and Tennent, which finally backed up my intuition behind my definition of generalised states by formulating a universal construction forcing them to become actual states. Incredibly this construction itself also falls squarely in the small cluster of methods we call categorical cybernetics, which caps off the whole thing very nicely. I touched on this construction in this blog post, and perhaps I’ll have more to say about it later too.
Often we don’t need generalised states, and ordinary states are enough: that’s when we take the representable functor C(I,−):C→Set, which is indeed lax monoidal. (General representable functors on a monoidal category are not lax monoidal in general!)
This leads to what I call the “representable system of contexts” for a symmetric monoidal category C: it’s the one described by Optic(C)(I,−), where the monoidal unit of Optic(C) is (I,I). What this ends up saying is that a context for morphisms X→Y in C is an equivalence class of pairs of a state and a costate in C, coupled through a residual:
This turns out (in a non-trivial way) to be equivalent to the definition of context used for both deterministic and Bayesian open games. In those cases, C is itself a category of optics, making systems of contexts examples of double optics. Iterating the Optic(−) construction can be usefully depicted in 2 different ways: as 1-hole combs in a bidirectional category:
or as 3-hole combs:
Moving back and forth between these equivalent views of the iterated optic construction is a key part of the yoga of contexts as it applies to categorical cybernetics.
An example of a non-representable system of contexts is the “iteration functor” I talked about in this post. It’s closely related to the algebra of Moore machines which plays a major role in David Jaz Myers’ book on categorical systems theory.
But, the actual reason this is a blog post and not a paper is that I don’t have any really compelling examples outside of categorical cybernetics. But I’ll talk more about my struggles with that in part II, where I’ll build a category of “behaviours in context” given a system of contexts, generalising the construction of open games.
I have been playing around with Nash equilibrium search in random normal form games by following best response dynamics - partly because I was curious, and partly as a learning exercise in NumPy.
Here are my preliminary conclusions:
- The key to success is replacing argmax with softmax for a sufficiently high temperature. For low temperatures, nothing I could do would make it converge.
- Provided the temperature is high enough, the learning rate is irrelevant, with a learning rate of 1 (ie. literally overriding each strategy with its softmax best response every stage) converging in just a few steps.
- Normal form games are big. Like, really big. For 9 players and 9 moves, the payoff matrix no longer fits in my VRAM. For 10 players and 10 moves (which I still consider absolutely tiny) the payoff matrix contains exactly 10^11 = 100 billion parameters, making it large by the standards of an LLM at the time of writing.
My understanding of the theory is that this type of iterative method will not in general converge to a Nash equilibrium, but it will converge to an ε-equilibrium for some ε. What I don’t know is how the error ε can depend on the game and the learning algorithm. That’s something I’ll look into in some follow-up work, presumably by comparing my results to what NashPy finds.
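For reference, here is a sketch of how one might measure that error directly (my own sketch, reusing the deviation-payoff incantation that appears later in this post): ε is the most any single player can gain by a unilateral best response.
import numpy as np

def nash_error(payoffMatrix, strategies, numPlayers):
    eps = 0.0
    for player in range(numPlayers):
        # payoff of each own move against the opponents' current strategies
        deviations = payoffMatrix[..., player]
        for opponent in range(player):
            deviations = np.tensordot(strategies[opponent], deviations, (0, 0))
        for opponent in range(player + 1, numPlayers):
            deviations = np.tensordot(strategies[opponent], deviations, (0, 1))
        current = np.dot(strategies[player], deviations)
        eps = max(eps, np.max(deviations) - current)
    return eps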
For a normal form game with N players and M moves, the payoff matrix is an M^N × N tensor, of rank N+1. Each player gets one tensor rank of dimension M for their move, and then there is one more rank of dimension N to assign a payoff to each player.
Here is my incantation for initialising a random payoff tensor, with payoffs drawn uniformly between 0 and 1:
gen = np.random.default_rng()
shape = tuple(numMoves for i in range(numPlayers)) + (numPlayers,)
payoffMatrix = gen.random(shape)
It turns out that with this distribution of payoffs, the law of large numbers kicks in and payoffs for any mixed strategy profile are extremely close to 0.5, and the more players there are the closer to 0.5 they are. Normal form games are defined up to arbitrary positive affine transformations of the payoffs, so I ended up going with a sort-of exponential distribution of payoffs, so that much higher payoffs could sometimes happen. This made very little difference but made me feel happier:
payoffMatrix = gen.exponential(1, shape) * gen.choice((-1, 1), shape)
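As a quick check of the concentration claim (my own sanity check, not from the original experiments): under the uniform payoffs, the expected payoff of a full-support mixed profile is a weighted average of M^N i.i.d. draws, so it should hug 0.5 more tightly as the number of players grows.
import numpy as np

gen = np.random.default_rng(0)
numMoves = 4
for numPlayers in (2, 4, 6):
    shape = tuple(numMoves for i in range(numPlayers)) + (numPlayers,)
    payoffMatrix = gen.random(shape)
    strategies = gen.random((numPlayers, numMoves))
    strategies /= strategies.sum(axis=1, keepdims=True)
    payoffs = payoffMatrix
    for player in range(numPlayers):
        payoffs = np.tensordot(strategies[player], payoffs, (0, 0))
    print(numPlayers, payoffs)  # entries drift towards 0.5 as numPlayers grows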
A mixed strategy profile is an N×M stochastic matrix:
strategies = gen.random((numPlayers, numMoves))
for i in range(numPlayers):
strategies[i] = strategies[i] / sum(strategies[i])
Here is the best incantation I could come up with for computing the resulting payoffs:
payoffs = payoffMatrix
for player in range(numPlayers):
payoffs = np.tensordot(strategies[player], payoffs, (0, 0))
(For 9 players this is already too much for my poor laptop.)
I wish I could find a more declarative way to do this. For a small, fixed number of players this kind of thing works, but I didn’t want to mess with the stringly-typed incantation that would be necessary to do it for N players:
payoffs = np.einsum('abcu,a,b,c', payoffMatrix, strategies[0], strategies[1], strategies[2])
For best response dynamics, the key step is: for each player, compute the payoffs that player can obtain by a unilateral deviation to each of their moves.
Here is the very unpleasant incantation I came up with to do this:
deviations = payoffMatrix[..., player]
for opponent in range(player):
deviations = np.tensordot(strategies[opponent], deviations, (0, 0))
for opponent in range(player + 1, numPlayers):
deviations = np.tensordot(strategies[opponent], deviations, (0, 1))
First we slice the payoff tensor so we only have the payoffs for the player in question. Then for opponents of index lower than the player, we contract their strategy against the lowest tensor dimension. After that loop, the lowest tensor dimension corresponds to the current player’s move. Then for opponents of index higher than the player, we contract their strategy but skipping the first tensor dimension. At the end we’re left with just a vector, giving the payoff of each of the player’s moves when each opponent plays their current strategy.
As a functional programmer, I find all of this in very bad taste.
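One reassuring check (my own, with hypothetical small sizes) is that for a fixed number of players the loop agrees with the einsum formulation from earlier:
import numpy as np

gen = np.random.default_rng(0)
numPlayers, numMoves = 3, 4
shape = tuple(numMoves for i in range(numPlayers)) + (numPlayers,)
payoffMatrix = gen.random(shape)
strategies = gen.random((numPlayers, numMoves))
strategies /= strategies.sum(axis=1, keepdims=True)

player = 1
deviations = payoffMatrix[..., player]
for opponent in range(player):
    deviations = np.tensordot(strategies[opponent], deviations, (0, 0))
for opponent in range(player + 1, numPlayers):
    deviations = np.tensordot(strategies[opponent], deviations, (0, 1))

# same contraction, written declaratively for the 3-player case
expected = np.einsum('abc,a,c->b', payoffMatrix[..., player], strategies[0], strategies[2])
assert np.allclose(deviations, expected)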
The rest is the easy part. We can use softmax:
def softmax(x, temperature):
exps = np.exp(x / temperature)
return exps / sum(exps)
newStrategy = softmax(deviations, temperature)
or, in the zero-temperature limit, take a Dirac distribution at the argmax using this little incantation:
newStrategy = np.identity(numMoves)[np.argmax(deviations)]  # one-hot over the player's moves
Then apply the learning rate:
delta = newStrategy - strategies[player]
strategies[player] = strategies[player] + learningRate*delta
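One numerical caveat that may be worth ruling out in the low-temperature regime (a guess on my part, not a diagnosis of the runs below): np.exp(x / temperature) overflows once x / temperature exceeds roughly 709, silently producing inf and then nan probabilities. The standard max-shift, which leaves the softmax value unchanged, avoids this:
def softmax_stable(x, temperature):
    z = x / temperature
    exps = np.exp(z - np.max(z))  # shifting by the max changes nothing mathematically
    return exps / sum(exps)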
All of this is looped over each player, and then over a number of learning stages, plus logging each player’s payoff and the maximum delta:
for stage in range(numStages):
# Compute expected payoffs
tempPayoffs = payoffMatrix
for player in range(numPlayers):
tempPayoffs = np.tensordot(strategies[player], tempPayoffs, (0, 0))
payoffs[stage] = tempPayoffs
for player in range(numPlayers):
# Compute deviation payoffs
deviations = payoffMatrix[..., player]
for opponent in range(player):
deviations = np.tensordot(strategies[opponent], deviations, (0, 0))
for opponent in range(player + 1, numPlayers):
deviations = np.tensordot(strategies[opponent], deviations, (0, 1))
# Update strategy
newStrategy = softmax(deviations, temperature)
delta = newStrategy - strategies[player]
strategies[player] = strategies[player] + learningRate*delta
# Log errors
errors[stage] = max(errors[stage], max(delta))
Everything in this section is for 8 players and 8 moves, which is the largest that my laptop can handle.
Here is a typical plot of each player’s payoff over 100 stages of learning, with a temperature of 0.01 and a learning rate of 0.1:
With this temperature, the learning rate can be increased all the way to 1, and the dynamics visibly converges in just a few stages:
In fact, this is so robust that it makes me wonder whether there could be a good proof of the constructive Brouwer theorem using statistical physics methods.
If the temperature is decreased further to 0.001, we lose convergence:
Although I haven’t confirmed it, my assumption is that lower temperature will converge to an ε-equilibrium for smaller ε, so we want it to be as low as possible while still converging.
Worst of all, if we decrease the learning rate to compensate we can get a sudden destabilisation after hundreds of stages:
That’s all for now. I’ll come back to this once I’ve figured out how to calculate the Nash error, which is the next thing I’m interested in finding out.
I’m going to record something that I think is known to everyone doing research on categorical cybernetics, but I don’t think has been written down somewhere: an even more general version of mixed optics that replaces the backwards actegory with an enrichment. With it, I’ll make sense of a curious definition appearing in The Compiler Forest.
An actegory consists of a monoidal category M, a category C and a functor ∙:M×C→C that behaves like an “external product”: namely that it’s equipped with coherent isomorphisms I∙X≅X and (M⊗N)∙X≅M∙(N∙X).
An enriched category consists of a category C, a monoidal category M and a functor [−,−]:Cop×C→M that behaves like an “external hom” (I’m not going to write down what this means because it’s more complicated).
There’s a very close relationship between actegories and enrichments, to the point that I consider them different perspectives on the same idea. This is the final form of the famous tensor-hom adjunction, aka. currying. (I learned this incredible fact from Matteo Capucci, and I have no idea where it’s written down, although it’s definitely written down somewhere.)
A tensored enrichment is one where every [Z,−]:C→M has a left adjoint −∙Z:M→C. Allowing Z to vary results in a functor ∙ which (nontrivial theorem) is always an actegory.
A closed actegory is one where every −∙Z:M→C has a right adjoint [Z,−]:C→M. Allowing Z to vary results in a functor [−,−] which (nontrivial theorem) is always an enrichment.
So, closed actegories and tensored enrichments are equivalent ways of defining the same thing, namely a monoidal category M and category C equipped with ∙ and [−,−] related by a tensor-hom adjunction C(Z∙X,Y)≅M(Z,[X,Y]).
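The primordial example, spelled out for concreteness (this instance is standard, not specific to any of the papers discussed here): take M = C = Set, with the action M∙X = M×X and the enrichment [X,Y] = Y^X. The tensor-hom adjunction is then ordinary currying:
\mathbf{Set}(Z \times X, Y) \cong \mathbf{Set}(Z, Y^X)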
Given an actegory, we can define a bicategory ParaM(C), whose objects are objects of C and 1-cells are pairs of M:M and f:C(M∙X,Y). We can also define a bicategory CoparaM(C), whose objects are objects of C and 1-cells are pairs of M:M and f:C(X,M∙Y).
Given an enriched category, we can define a bicategory ParaM(C), whose objects are objects of C and morphisms are pairs of M:M and f:M(M,[X,Y]). If this is a tensored enrichment then the two definitions of ParaM(C) are equivalent.
In all of these cases we are locally fibred over M, and I will write ParaM(C)(X,Y)(M), CoparaM(C)(X,Y)(M) for the set of co/parametrised morphisms with a fixed parameter type.
It’s not possible to define CoparaM(C) for an enrichment. There’s a very slick common generalisation of actegories and enrichments called a locally graded category, which is a category enriched in presheaves with Day convolution. There’s also a very slick definition of Para for a locally graded category. I’d like to know: for exactly which locally graded categories is it possible to define Copara?
If we have two actegories C,D that share the same acting category M then we can define mixed optics, which first appeared in Profunctor Optics: A Categorical Update. This is a 1-category OpticM(C,D) whose objects are pairs \binom{X}{X'} of an object of C and an object of D, and a morphism \binom{X}{X'} \to \binom{Y}{Y'} is an element of the coend
\int^{M:\mathcal M} \mathrm{Copara}_{\mathcal M}(\mathcal C)(X,Y)(M) \times \mathrm{Para}_{\mathcal M}(\mathcal D)(Y',X')(M)
There’s a slightly more general definition called “weighted optics” that appears in Bruno’s thesis and was used very productively there, which replaces M with two monoidal categories related by a Tambara module. I think that it’s an orthogonal generalisation to the one I’m about to do here.
Putting together everything I’ve just said, the next step is clear. If we have categories C,D and a monoidal category M, with M acting on C and D enriched in M, then we can still define OpticM(C,D) in exactly the same way, replacing ParaM(D) with its enriched version. But now, unlike before, we can use the ninja Yoneda lemma to eliminate the coend and get
\mathrm{Optic}_{\mathcal M}(\mathcal C,\mathcal D)\left(\binom{X}{X'},\binom{Y}{Y'}\right) \cong \mathcal C(X, [Y',X'] \bullet Y)
In general I refer to optics that can be defined without type quantification as lenses, and so this is an enriched closed lens. It’s the final form of “linear lenses”, the version of lenses that is defined like Lens s t a b = s -> (a, b -> t).
Section 5 of The Compiler Forest by Budiu, Galenson and Plotkin has a very interesting definition in it. They have a cartesian closed category C (whose internal hom I’ll write as →) and a strong monad T on it, and they define a category whose objects are pairs of objects of C and whose morphisms f : \binom{X}{X'} \to \binom{Y}{Y'} are morphisms f : X → T(Y×(Y′→TX′)) of C.
They also nail an intuition for lenses that I use constantly and I haven’t seen written down anywhere else: problems go forwards, solutions go backwards.
Me and this definition have quite a history. It came to my attention while polishing Bayesian Open Games for submission. For a while, I thought that it was equivalent to optics in the kleisli category of T, and that we’d wasted years of our lives trying to understand optics (this being around 2018, when optics were still a niche idea). Then, for a while I thought that the paper made a mistake and these things don’t compose associatively. Now I’ve made peace: I think their definition is conceptually subtly wrong in a way that makes no difference in practice, and I can say very precisely how it relates to kleisli optics.
There is an action of C on Kl(T) given by M∙X=M⊗X, where ⊗ is the tensor product of Kl(T) which on objects is given by the product × of C. That’s the actegory generated by the strong monoidal embedding C↪Kl(T). There is also an enrichment of Kl(T) in C, given by [X,Y]=X→TY. This action and enrichment are adjoint to each other: Kl(T)(M⊗X,Y)≅C(X,M→TY).
The category defined in Compiler Forest turns out to be equivalent to
\mathrm{Optic}_{\mathcal C}(\mathrm{Kl}(T), \mathrm{Kl}(T))
whose forwards pass is given by the action of C on Kl(T) and whose backwards pass is given by the enrichment of Kl(T) in C. Its hom-sets are given by
\mathrm{Optic}_{\mathcal C}(\mathrm{Kl}(T), \mathrm{Kl}(T))\left(\binom{X}{X'},\binom{Y}{Y'}\right) = \int^{M:\mathcal C} \mathcal C(X, T(M\times Y)) \times \mathcal C(M, Y' \to TX')
which Yoneda-reduces to the definition in the paper.
Even though the action and enrichment are adjoint, this is not the same as optics in the kleisli category:
\mathrm{Optic}_{\mathcal C}(\mathrm{Kl}(T), \mathrm{Kl}(T)) \ncong \mathrm{Optic}_{\mathrm{Kl}(T)}(\mathrm{Kl}(T), \mathrm{Kl}(T))
where the hom-sets of the latter are defined by
\mathrm{Optic}_{\mathrm{Kl} (T)} (\mathrm{Kl} (T), \mathrm{Kl} (T)) \left( \binom{X}{X'}, \binom{Y}{Y'} \right)= \int^{M : \mathrm{Kl} (T)} \mathcal C (X, T (M \times Y)) \times \mathcal C (M \times Y', T X')
This equivalence, between optics whose backwards passes are an adjoint action or enrichment, would be a completely reasonable-looking lemma but it just isn’t true!
The difference between them is extremely subtle, though. The “proper” definition of kleisli optics identifies morphisms that agree up to sliding any kleisli morphism, whereas the definition in Compiler Forest only identifies morphisms that agree up to sliding pure morphisms of \mathcal C. So hom-sets of coend optics are a quotient of the hom-sets defined in Compiler Forest. While writing this up, I realised that most of this conclusion actually appears in section 4.9 of Riley’s original paper.
As long as you don’t care about equality of morphisms - which in practice is never, because they are made of functions - the difference between them can be safely ignored. The only genuine reason to prefer kleisli optics is their better runtime performance.
Cross-posted from Zanzi’s blog
This is the first post in a new series documenting my work developing a bidirectional programming language, in which all programs are interpreted as optics. This is something I’ve been thinking about for a long time, and eventually I became convinced that there were enough subtle issues that I should take things extremely slowly and actually learn some programming language theory. As a result, this post will not be about categorical cybernetics at all, but is a foundation to a huge tower of categorical cybernetics machinery that I will build later.
The first issue is that because the tensor product of optics is not cartesian, a language for optics is necessarily substructural. So this post is about a way to implement substructural languages in a way that avoids hidden pitfalls. The code in this post is Idris2, a dependently typed language.
We begin with a tiny language for types, with monoidal products (a neutral name because later we will be making it behave like different kinds of product), a unit type to be the neutral element of the monoidal product, and a “ground” type that is intended to contain some nontrivial values.
data Ty : Type where
Unit : Ty
Ground : Ty
Tensor : Ty -> Ty -> Ty
Although we have used the name “tensor”, suppose we want to make an ordinary cartesian language where variables can be implicitly copied and discarded. Here is a standard way to do it: it is an intuitionistic natural deduction calculus.
data Term : List Ty -> Ty -> Type where
-- A variable is a term if we can point to it in scope
Var : Elem x xs -> Term xs x
-- Unit is always a term in every scope
UnitIntro : Term xs Unit
-- Pattern matching on Unit, redundant here but kept for comparison to later
UnitElim : Term xs Unit -> Term xs x -> Term xs x
-- A pair is a term if each side is a term
TensorIntro : Term xs x -> Term xs y -> Term xs (Tensor x y)
-- Pattern matching on a pair, adding both sides to the scope
TensorElim : Term xs (Tensor x y) -> Term (x :: y :: xs) z -> Term xs z
The constructor for Var uses Elem, a standard library type that defines pointers into a list:
data Elem : a -> List a -> Type where
Here : Elem x (x :: xs)
There : Elem x xs -> Elem x (x' :: xs)
Here are some examples of programs we can write in this language:
-- \x => ()
delete : CartesianTerm [a] Unit
delete = UnitIntro
-- \(x, y) => x
prjl : CartesianTerm [a, b] a
prjl = Var Here
-- \(x, y) => y
prjr : CartesianTerm [a, b] b
prjr = Var (There Here)
-- \x => (x, x)
copy : CartesianTerm [a] (Tensor a a)
copy = TensorIntro (Var Here) (Var Here)
-- \(x, y) => (y, x)
swap : CartesianTerm [a, b] (Tensor b a)
swap = TensorIntro (Var (There Here)) (Var Here)
The thing that makes this language cartesian and allows us to write these 3 terms is the way that the context xs gets shared by the inputs of the different term constructors. In the next section we will define terms a different way, and then none of these examples will typecheck.
Next let’s go to the opposite extreme and build a fully substructural language, in which we cannot delete or copy or swap. I learned how to do this from Conor McBride and Zanzi. Here is the idea:
data Term : List Ty -> Ty -> Type where
-- A variable is a term only if it is the only thing in scope
Var : Term [x] x
-- Unit is a term only in the empty scope
UnitIntro : Term [] Unit
-- Pattern matching on Unit consumes its scope
UnitElim : Term xs Unit -> Term ys y -> Term (xs ++ ys) y
-- Constructing a pair consumes the scopes of both sides
TensorIntro : Term xs x -> Term ys y -> Term (xs ++ ys) (Tensor x y)
-- Pattern matching on a pair consumes its scope
TensorElim : Term xs (Tensor x y) -> Term (x :: y :: ys) z -> Term (xs ++ ys) z
This is a semantically correct definition of planar terms and it would work if we had a sufficiently smart typechecker, but for the current generation of dependent typecheckers we can’t use this definition because it suffers from what’s called green slime. The problem is that we have types containing terms that involve the recursive function ++, and the typechecker will get stuck when this function tries to pattern match on a free variable. (I have no idea how you learn this if you don’t happen to drink in the same pubs as Conor. Dependently typed programming has a catastrophic lack of books that teach it.)
The fix is that we need to define a datatype that witnesses that the concatenation of two lists is equal to a third list - a witness that the composition of two things is equal to a third thing is called a simplex. The key idea is that this datatype exactly reflects the recursive structure of ++, but as a relation rather than a function:
data Simplex : List a -> List a -> List a -> Type where
Right : Simplex [] ys ys
Left : Simplex xs ys zs -> Simplex (x :: xs) ys (x :: zs)
Now we can write a definition of planar terms that we can work with:
data Term : List Ty -> Ty -> Type where
Var : Term [x] x
UnitIntro : Term [] Unit
UnitElim : Simplex xs ys zs
-> Term xs Unit -> Term ys y -> Term zs y
TensorIntro : Simplex xs ys zs
-> Term xs x -> Term ys y -> Term zs (Tensor x y)
TensorElim : Simplex xs ys zs
-> Term xs (Tensor x y) -> Term (x :: y :: ys) z -> Term zs z
This language is so restricted that it’s hard to show it doing anything, but here is one example of a term we can write:
-- \(x, y) => (x, y)
foo : Term [a, b] (Tensor a b)
foo = TensorIntro (Left Right) Var Var
Manually defining simplices, which cut a context into two halves, is very good as a learning exercise but eventually gets irritating. We can direct Idris to search for the simplex automatically:
data Term : List Ty -> Ty -> Type where
Var : Term [x] x
UnitIntro : Term [] Unit
UnitElim : {auto prf : Simplex xs ys zs}
-> Term xs Unit -> Term ys y -> Term zs y
TensorIntro : {auto prf : Simplex xs ys zs}
-> Term xs x -> Term ys y -> Term zs (Tensor x y)
TensorElim : {auto prf : Simplex xs ys zs}
-> Term xs (Tensor x y) -> Term (x :: y :: ys) z -> Term zs z
foo : Term [a, b] (Tensor a b)
foo = TensorIntro Var Var
This works, but I find that the proof search gets confused easily (although it works fine for the baby examples in this post), so let’s pull out the big guns and write a tactic:
concat : (xs, ys : List a) -> (zs : List a ** Simplex xs ys zs)
concat [] ys = (ys ** Right)
concat (x :: xs) ys = let (zs ** s) = concat xs ys in (x :: zs ** Left s)
This function takes two lists and returns their concatenation together with a simplex that witnesses this fact. Here ** is Idris syntax for both the type former and the term former of dependent pair (aka. Sigma) types.
data Term : List Ty -> Ty -> Type where
Var : Term [x] x
UnitIntro : Term [] Unit
UnitElim : {xs, ys : List Ty}
-> {default (concat xs ys) prf : (zs : List Ty ** Simplex xs ys zs)}
-> Term xs Unit -> Term ys y -> Term prf.fst y
TensorIntro : {xs, ys : List Ty}
-> {default (concat xs ys) prf : (zs : List Ty ** Simplex xs ys zs)}
-> Term xs x -> Term ys y -> Term prf.fst (Tensor x y)
TensorElim : {xs, ys : List Ty}
-> {default (concat xs ys) prf : (zs : List Ty ** Simplex xs ys zs)}
-> Term xs (Tensor x y) -> Term (x :: y :: ys) z -> Term prf.fst z
foo : {a, b : Ty} -> Term [a, b] (Tensor a b)
foo = TensorIntro Var Var
I find Idris’ default syntax to be a bit awkward, but it feels to me like a potentially very powerful tool, and something I wish Haskell had for scripting instance search.
Unfortunately, going from a planar language to a linear one - that is, ruling out copy and delete but allowing swaps - is much harder. I figured out a technique for doing this that turns out to be very powerful and give very fine control over the scoping rules of a language.
The idea is to isolate a category of context morphisms (technically a coloured pro, that is, a strict monoidal category whose monoid of objects is free). Then we will parametrise a planar language by an action of this category. The good news is that this is the final iteration of the definition of Term, and we’ll be working with it for the rest of this blog post.
Structure : Type -> Type
Structure a = List a -> List a -> Type
data Term : Structure Ty -> List Ty -> Ty -> Type where
Var : Term hom [x] x
Act : hom xs ys -> Term hom ys x -> Term hom xs x
UnitIntro : Term hom [] Unit
UnitElim : {xs, ys : List Ty}
-> {default (concat xs ys) prf : (zs : List Ty ** Simplex xs ys zs)}
-> Term hom xs Unit -> Term hom ys y -> Term hom prf.fst y
TensorIntro : {xs, ys : List Ty}
-> {default (concat xs ys) prf : (zs : List Ty ** Simplex xs ys zs)}
-> Term hom xs x -> Term hom ys y -> Term hom prf.fst (Tensor x y)
TensorElim : {xs, ys : List Ty}
-> {default (concat xs ys) prf : (zs : List Ty ** Simplex xs ys zs)}
-> Term hom xs (Tensor x y) -> Term hom (x :: y :: ys) z
-> Term hom prf.fst z
First, let’s recover planar terms. To do this, we want to define a Structure where hom xs ys is a proof that xs = ys:
data Planar : Structure a where
Empty : Planar [] []
Whisker : Planar xs ys -> Planar (x :: xs) (x :: ys)
Now let’s deal with linear terms. For that, we want to define a Structure where hom xs ys is a proof that ys is a permutation of xs. We can do this in two steps:
-- Insertion x xs ys is a witness that ys consists of xs with x inserted somewhere
data Insertion : a -> List a -> List a -> Type where
-- The insertion is at the head of the list
Here : Insertion x xs (x :: xs)
-- The insertion is somewhere in the tail of the list
There : Insertion x xs ys -> Insertion x (y :: xs) (y :: ys)
data Symmetric : Structure a where
-- The empty list has a unique permutation to itself
Empty : Symmetric [] []
-- Extend a permutation by inserting the head element into the permuted tail
Insert : Insertion x ys zs -> Symmetric xs ys -> Symmetric (x :: xs) zs
(Incidentally, this is the point where I realised that although Idris looks like Haskell, programming in it feels a lot closer to programming in Prolog.)
Now we write swap as term:
swap : {a, b : Ty} -> Term Symmetric [a, b] (Tensor b a)
swap = Act (Insert (There Here) (Insert Here Empty)) (TensorIntro Var Var)
Now we can come full circle and redefine cartesian terms in a way that uniformly matches the way we do substructural terms.
data Cartesian : Structure a where
-- Delete everything in scope
Delete : Cartesian xs []
-- Point to a variable in scope and make a copy on top
Copy : Elem y xs -> Cartesian xs ys -> Cartesian xs (y :: ys)
With this, we can rewrite all the terms we started with:
delete : {a : Ty} -> Term Cartesian [a] Unit
delete = Act Delete UnitIntro
prjl : {a, b : Ty} -> Term Cartesian [a, b] a
prjl = Act (Copy Here Delete) Var
prjr : {a, b : Ty} -> Term Cartesian [a, b] b
prjr = Act (Copy (There Here) Delete) Var
copy : {a : Ty} -> Term Cartesian [a] (Tensor a a)
copy = Act (Copy Here (Copy Here Delete)) (TensorIntro Var Var)
swap : {a, b : Ty} -> Term Cartesian [a, b] (Tensor b a)
swap = Act (Copy (There Here) (Copy Here Delete)) (TensorIntro Var Var)
Let’s end with a party trick. What would a cocartesian language look like - one where we can’t delete or copy, but we can spawn and merge?
Co : Structure a -> Structure a
Co hom xs ys = hom ys xs
-- spawn : Void -> a
-- spawn = \case {}
spawn : {a : Ty} -> Term (Co Cartesian) [] a
spawn = Act Delete Var
-- merge : Either a a -> a
-- merge = \case {Left x => x; Right x => x}
merge : {a : Ty} -> Term (Co Cartesian) [a, a] a
merge = Act (Copy Here (Copy Here Delete)) Var
-- injl : a -> Either a b
-- injl = \x => Left x
injl : {a, b : Ty} -> Term (Co Cartesian) [a] (Tensor a b)
injl = Act (Copy Here Delete) (TensorIntro Var Var)
-- injr : b -> Either a b
-- injr = \y => Right y
injr : {a, b : Ty} -> Term (Co Cartesian) [b] (Tensor a b)
injr = Act (Copy (There Here) Delete) (TensorIntro Var Var)
Since at the very beginning we added a single generating type Ground, and the category generated by one object and finite coproducts is finite sets and functions, this language can define exactly the functions between finite sets. For example, there are exactly 4 functions from booleans to booleans:
id, false, true, not : Term (Co Cartesian) [Ground, Ground] (Tensor Ground Ground)
id = Act (Copy Here (Copy (There Here) Delete)) (TensorIntro Var Var)
false = Act (Copy Here (Copy Here Delete)) (TensorIntro Var Var)
true = Act (Copy (There Here) (Copy (There Here) Delete)) (TensorIntro Var Var)
not = Act (Copy (There Here) (Copy Here Delete)) (TensorIntro Var Var)
That’s enough for today, but next time I will continue using this style of term language to start dealing with the difficult issues of building a programming language for optics.Loading [MathJax]/jax/output/HTML-CSS/jax.js
Category theory for machine learning has been a big topic recently, both with Bruno’s thesis dropping, and the paper on using the Para construction for deep learning.
In this post we will look at how dependent types can allow us to almost effortlessly implement the category theory directly, opening up a path to new generalisations.
I will be making heavy use of Tatsuya Hirose’s code that implements the Para(Optic) construction in Haskell. Our goal here is to show that when we make the category theory in the code explicit, it becomes a powerful scaffolding that lets us structure our program.
All in all, our goal is to formulate this: A simple neural network with static types enforcing the parameters and input and output dimensions.
import Data.Fin
import Data.Vect
model : GPath ParaLensTensor [< [4, 2], [4], [0], [2, 4], [2], [0]] [2] [2]
model = [< linear, bias, relu, linear, bias, relu]
The crucial part is the Para construction, which lets us accumulate parameters along the composition of edges. This lets us state the parameters of each edge separately, and then compose them into a larger whole as we go along.
Para forms a graded category, and in order to understand what this is we will start with a graded monoid first.
namespace Monoid
data Env : (par -> Type) -> List par -> Type where
-- Empty list
Nil : Env f []
-- Add an element to the list, and accumulate its parameter
(::) : {f : par -> Type} -> f n -> Env f ns -> Env f (n::ns)
-- Compare this to the standard free monoid
-- data List : Type -> Type where
-- Nil : List a
-- (::) : a -> List a -> List a
I used this datatype in a previous blog post where it is used to represent variable environments.
We can use it for much more, though. For instance, let’s say that we want to aggregate a series of vectors, and later perform some computation on them.
Our free graded monoid lets us accumulate a list of vectors, while keeping their sizes in a type-level list.
Vec : Nat -> Type
Vec n = Fin n -> Double
f1 : Vec 1
f2 : Vec 2
f3 : Vec 3
fins : Env Vec [1, 2, 3]
fins = [f1, f2, f3]
As we will soon see, Para works the same way, but instead of forming a graded monoid, it forms a graded category.
Before we look at free graded categories, let’s first look at how to work with a plain free category. I’ve used them in another previous blog post. A nice trick that I’ve learned from André Videla is that we can use Idris notation for lists with free categories too, we just need to name the constructors appropriately.
Graph : Type -> Type
Graph obj = obj -> obj -> Type
-- The category of types and functions
Set : Graph Type
Set a b = a -> b
namespace Cat
data Path : Graph obj -> Graph obj where
-- Empty path
Nil : Path g a a
-- Add an edge to the path
(::) : g a b -> Path g b c -> Path g a c
While vectors form graded monoids, matrices form categories.
Matrix : Graph Nat
Matrix n m = Fin n -> Fin m -> Double
mat1 : Matrix 2 3
mat2 : Matrix 3 1
matrixPath : Path Matrix 2 1
matrixPath = [mat1, mat2]
-- matrixPath = mat1 :: mat2 :: Nil
Just as we did at the start of the blog post, we are using the inbuilt syntactic sugar to represent a list of edges. We will now generalise from free paths to their parameterised variant!
A free graded category looks not unlike a free category, except now we are accumulating an additional parameter, just as we did with graded monoids:
ParGraph : Type -> Type -> Type
ParGraph par obj = par -> obj -> obj -> Type
-- A free graded path over a parameterised graph
data GPath : ParGraph par obj -> ParGraph (List par) obj where
-- Empty path, with an empty list of grades
Nil : GPath g [] a a
-- Add an edge to the path, and accumulate its parameter
(::) : {g : par -> obj -> obj -> Type} -> {a, b, c : obj}
-> g p a b -> GPath g ps b c -> GPath g (p :: ps) a c
So a graded path takes in a parameterised graph, and gives back a path of edges with an accumulated parameter. Where could we find such parameterised graphs? This is where the Para construction comes in. Para takes a category C together with an action M×C→C of a monoidal category M, and gives us a parameterised category over C.
-- Para over a monoidal category C
Para : (c : Graph obj) -> (act : par -> obj -> obj) -> ParGraph par obj
Para c act p x y = (p `act` x) `c` y
In other words, we have morphisms and an accumulating parameter. A simple example is the graded co-reader comonad, also known as the pair comonad.
ParaSet : ParGraph Type Type
ParaSet p a b = Para Set Pair p a b
-- A function Nat -> Double, parameterised by Nat
pair1 : ParaSet Nat Nat Double
-- A function Double -> Int, parameterised by String
pair2 : ParaSet String Double Int
-- A function Nat -> Int, parameterised by [Nat, String]
ex : GPath ParaSet [Nat, String] Nat Int
ex = [pair1, pair2]
It works a lot like the standard co-reader comonad, but it now accumulates parameters as we compose functions.
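To make the “accumulates parameters” point concrete, here is a small Haskell sketch (mine, under the convention that Para over functions with the pair action is just a function out of a pair):

```haskell
-- Para over ordinary functions, with the pair action: a morphism from a
-- to b with parameter p is a function out of (p, a).
type Para p a b = (p, a) -> b

-- Composition runs the first function, then the second, and accumulates
-- both parameters into a pair.
composePara :: Para p a b -> Para q b c -> Para (p, q) a c
composePara f g ((p, q), a) = g (q, f (p, a))
```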
Functional programmers tend to be familiar with lenses. They are often presented as coalgebras of the costate comonad, and their links to automatic differentiation are now well known.
Monomorphic lenses correspond to the plain costate comonad, and polymorphic lenses correspond to the indexed version.
-- Monomorphic Lens
MLens : Type -> Type -> Type
MLens s a = (s -> a, (s, a) -> s)
-- Polymorphic Lens, Haskell-style
Lens : Type -> Type -> Type -> Type -> Type
Lens s t a b = (s -> a, (s, b) -> t)
Idris allows us to bundle up the arguments for a polymorphic lens into a pair, sometimes called a boundary. This will help us form the category of parametric lenses more cleanly, as well as cut down on the number of types that we need to wrangle.
Boundary : Type
Boundary = (Type, Type)
-- Polymorphic lenses are morphisms of boundaries
Lens : Boundary -> Boundary -> Type
Lens (s, t) (a, b) = (s -> a, (s, b) -> t)
Both monomorphic and polymorphic lenses form categories. But before we look at them, let’s generalise our notion of lens away from Set and towards arbitrary (cartesian) monoidal categories.
In other words, given a cartesian monoidal category C, we want to form the category Lens(C) of lenses over C.
-- take a category C, and a cartesian monoidal product, to give back the category Lens(C)
LensC : (c : Graph obj) -> (ten: obj -> obj -> obj) -> Graph obj
LensC c ten s a = (s `c` a, (s `ten` a) `c` s)
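As a sanity check that lenses really do compose, here is a hedged Haskell sketch of composition in Lens(Set) (my code, not from the project): the forward passes chain left to right, and the backward passes chain right to left.

```haskell
-- Monomorphic lenses over Set: a forward pass and a backward pass.
type MLens s a = (s -> a, (s, a) -> s)

-- Identity lens: read the state, write it back unchanged.
idLens :: MLens s s
idLens = (id, \(_, s') -> s')

-- Composition: get chains forwards; put first reads the midpoint with
-- get1, updates it with put2, then pushes the result back with put1.
composeLens :: MLens s a -> MLens a b -> MLens s b
composeLens (get1, put1) (get2, put2) =
  ( get2 . get1
  , \(s, b) -> put1 (s, put2 (get1 s, b)) )
```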
We then take Para of this construction, giving us the category Para(Lens(C)).
ParaLensSet : ParGraph Type Type
ParaLensSet p s t = Para (LensC Set Pair) Pair p s t
We now have all the theoretical pieces together. At this point, we could simply implement Para(Lens(Set)), which would give us the morphisms of our neural network. However, there is one more trick up our sleeve - rather than working in the category of sets, we would like to work in the category of vector spaces.
This means that we will parameterise the above construction to work over some monoidal functor C→Set.
ParaLensF : (f : k -> Type) -> ParGraph k k
ParaLensF f p m n = ParaLensSet (f p) (f m) (f n)
And now, let us proceed to do machine learning.
First we will introduce the type of tensors of arbitrary rank. Our first instinct would be to do this with a function
Tensor' : List Nat -> Type
Tensor' [] = Double
Tensor' (n :: ns) = Fin n -> Tensor' ns
But unfortunately this will mess up type inference down the line. Dependent types tend to struggle when it comes to inferring types whose codomain contains arbitrary computation. This is what Conor McBride calls “green slime”, and it is one of the major pitfalls that functional programmers encounter when they try to make the jump to dependent types.
For this reason, we will represent our rank-n tensors using a datatype, which will allow Idris to infer the types much more easily. Luckily, tensors are easily represented using an alternative datatype that’s popular in Haskell.
data Tensor : List Nat -> Type where
Scalar : Double -> Tensor Nil
Dim : Vect n (Tensor ns) -> Tensor (n :: ns)
This is essentially a nesting of vectors, which accumulates their sizes.
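The same trick carries over to Haskell almost verbatim; a hedged sketch (mine), with Peano naturals standing in for Idris’ Nat:

```haskell
{-# LANGUAGE DataKinds, GADTs, KindSignatures, TypeOperators #-}

data N = Z | S N

-- Length-indexed vectors.
data Vec (n :: N) a where
  VNil  :: Vec 'Z a
  VCons :: a -> Vec n a -> Vec ('S n) a

-- Shape-indexed tensors as nested vectors: a Tensor '[n, m] is an
-- n-vector of m-vectors of scalars.
data Tensor (ns :: [N]) where
  Scalar :: Double -> Tensor '[]
  Dim    :: Vec n (Tensor ns) -> Tensor (n ': ns)
```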
All together, our datatype of parameterised lenses over tensors becomes
ParaLensTensor : ParGraph (List Nat) (List Nat)
ParaLensTensor pars ms ns = ParaLensF Tensor pars ms ns
We can now start writing neural networks. I’ll be mostly adapting Tatsuya’s code in the following section. The full code for our project can be found here, and I’ll only include the most interesting bits.
Unlike the original code, we will be using a heterogeneous list - rather than nested tuples - to keep track of all of our parameters, which makes the resulting dimensions much easier to track.
linear : {n, m : Nat} -> ParaLensTensor [m, n] [n] [m]
linear = (getter, setter) where
getter : (Tensor [m, n], Tensor [n]) -> Tensor [m]
getter (w, x) = joinM w x
setter : ((Tensor [m, n], Tensor [n]), Tensor [m]) -> (Tensor [m, n], Tensor [n])
setter ((w, x), y) = (outer y x, joinM (dist w) y)
bias : {n : Nat} -> ParaLensTensor [n] [n] [n]
bias = (getter, setter) where
getter : (Tensor [n], Tensor [n]) -> Tensor [n]
getter (b, x) = pointwise (+) x b
setter : ((Tensor [n], Tensor [n]), Tensor [n]) -> (Tensor [n], Tensor [n])
setter ((b, x), y) = (y, y)
relu : ParaLensTensor [0] [n] [n]
relu = (getter, setter) where
getter : (Tensor [0], Tensor [n]) -> Tensor [n]
getter (_, x) = dvmap (max 0.0) x
setter : ((Tensor [0], Tensor [n]), Tensor [n]) -> (Tensor [0], Tensor [n])
setter ((Dim [], x), y) = (Dim [], pointwise (*) y (dvmap step x)) where
step : Double -> Double
step x = if x > 0 then 1 else 0
learningRate : ParaLensTensor [0] [] [0]
learningRate = (const (Dim []), setter) where
setter : ((Tensor [0], Tensor []), Tensor [0]) -> (Tensor [0], Tensor [])
setter ((_, (Scalar loss)), _) = (Dim [], Scalar (-0.2 * loss))
crossEntropyLoss : ParaLensTensor [n] [n] []
crossEntropyLoss = (getter, setter) where
getter : (Tensor [n], Tensor [n]) -> Tensor []
getter (y', y) =
let Scalar dot' = dot y' y in
Scalar (log (sumElem (dvmap exp y)) - dot')
setter : ((Tensor [n], Tensor [n]), Tensor []) -> (Tensor [n], Tensor [n])
setter ((y', y), (Scalar z)) = let
expY = dvmap exp y
sumExpY = sumElem expY in
(dvmap (* (-z)) y,
dvmap (* z) (
((pointwise (-) (dvmap (/sumExpY) expY) y'))))
-- Our final model: parameters source target
model : GPath ParaLensTensor [< [4, 2], [4], [0], [2, 4], [2], [0]] [2] [2]
model = [< linear, bias, relu, linear, bias, relu]
All that remains is to implement an algebra for this structure. Normally we would use the generic recursion schemes machinery to do this, but for now we will implement a one-off fold specialized to graded paths.
-- Evaluate the free graded category over ParaLensTensor
eval : GPath ParaLensTensor ps s t -> ParaLensTensorEnvS ps s t
eval [<] = (\(_, s) => s, \((l, s'), s) => ([<], s))
eval (es :< (fw, bw)) = let (fw', bw') = eval es in
(\((ps :< p), s) => let b = fw' (ps, s) in fw (p, b),
(\(((ps :< p), s), dt) => let
b = fw' (ps, s)
(p', b') = bw ((p, b), dt)
(ps', s') = bw' ((ps, s), b')
in (ps' :< p', s')))
It would actually be possible to write individual algebras for Lens(C) and Para(C) and then compose them into an algebra for Para(Lens(C)), but we can leave that for a future blog post.
Running a neural network in Idris is obviously going to be slow compared to NumPy. However, since we’re working entirely with free categories, we don’t have to actually evaluate our functions in Idris!
What we can do is organise all of our functions into a signature, where each constructor corresponds to a primitive function in the target language. We could then use the FFI to interpret them, allowing us to get both the static guarantees of Idris and the performance of NumPy.
data TensorSig : ParGraph (List Nat) (List Nat) where
Linear : TensorSig [m, n] [n] [m]
Bias : TensorSig [n] [n] [n]
Relu : TensorSig [0] [n] [n]
LearningRate : TensorSig [0] [] [0]
CrossEntropyLoss : TensorSig [n] [n] []
SoftMax : TensorSig [0] [n] [n]
model' : GPath TensorSig [< [4, 2], [4], [0], [2, 4], [2], [0]] [2] [2]
model' = [< Linear, Bias, Relu, Linear, Bias, Relu]
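To sketch the idea (all names here are hypothetical, not from the project): an interpreter can simply fold the path into the sequence of primitive operations, here rendered as the NumPy functions that could back each constructor.

```haskell
-- Hypothetical compilation target: the name of the NumPy primitive that
-- would implement each constructor over an FFI bridge.
data TensorOp = Linear | Bias | Relu | LearningRate | CrossEntropyLoss

numpyName :: TensorOp -> String
numpyName Linear           = "np.matmul"
numpyName Bias             = "np.add"
numpyName Relu             = "np.maximum"    -- against a zero tensor
numpyName LearningRate     = "scale"         -- custom primitive
numpyName CrossEntropyLoss = "cross_entropy" -- custom primitive

-- Compiling a path is then just a fold over its constructors.
compile :: [TensorOp] -> [String]
compile = map numpyName
```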
We’ve also only sketched out the tensor operations, but we could take this a step further and develop a proper tensor library in Idris.
In a future post, we will see how to enhance the above with auto-diff, meaning that the user only needs to supply the getter, and the setter will be derived automatically.
Cross-posted from the 20[ ] blog
Welcome to part I of the Build Your Own Open Games Engine Bootcamp, where we’ll be learning the inner workings of the Open Games Engine and Compositional Game Theory in general, while implementing a super-simple Haskell version of the engine along the way.
In this episode we will learn about Lenses, how to compose them and how they can be implemented in Haskell. But first, let’s set the context for this whole series.
In classical Game Theory, the definitions for (deterministic) Normal-form and Extensive-form games have undoubtedly proved successful as mathematical tools for studying strategic interactions between rational agents. Despite this, the monolithic nature of these definitions becomes apparent over time, eventually leading to a complexity wall in one’s game-theoretic modelling career. This limitation arises as games become more intricate, and the rigid structure of these definitions gets in the way of modelling, similar to how maintaining a large codebase written in x86 assembly quickly becomes a superhuman feat.
Compositional Game Theory solves this exact problem: By turning games into composable open processes, one can build up a library of reusable components and approach the problem compositionally™, in a divide-et-impera fashion. To keep the programming language analogy going: Programming in a high-level language like Haskell or Rust is way easier than programming in straight assembly. The ability to modularize code by breaking it up into modules and functions, which are predictably1 composable and reusable, helps tame the mental overhead of complex programs. It also saves the programmer tons of time and keystrokes that would otherwise be spent re-writing the same chunk of boilerplate code with minor modifications over and over.
The primary goal of this series is to introduce Compositional Game Theory and provide readers with a practical understanding of Open Games. This includes a very simple Haskell implementation of Open Games for readers to play with and test their intuitions against. By the end of this series, you will have the knowledge and tools to start modelling simple deterministic games. Additionally, you’ll be equipped to start exploring the Open Game Engine codebase and see how Open Games are applied in real-world modeling.
In the following posts, we’re going to break down and understand the following definition:
An Open Game is a pair (A,ε), where A is a Parametrized Lens with co/parameters P and Q and ε is a Selection Function on P→Q.
Moreover, we will learn about how Open Games can be composed both sequentially and in parallel, and hopefully some extra cool stuff along the way.
The first and most important component of an Open Game is the arena, i.e. the “playing field” where all the dynamics happen and with which the players interface. The arena is a parametrized lens, a composable typed bidirectional process.
A Parametrized Lens from a pair of sets $(X, S)$ to a pair of sets $(Y, R)$ with parameters $(P, Q)$ is a pair of functions $\mathrm{get} : P \times X \to Y$ and $\mathrm{put} : P \times X \times R \to S \times Q$.
Which can be implemented in the following manner in Haskell by making use of currying:
data ParaLens p q x s y r where
-- get put
MkLens :: (p -> x -> y) -> (p -> x -> r -> (s, q)) -> ParaLens p q x s y r
Diagrammatically speaking, a parametrized lens can be represented as a box with 6 typed wires, which under the lens (pun intended) of compositional game theory are interpreted as the following:
- x is the type of game states that can be observed by the player prior to making a move.
- p is the type of strategies a player can adopt.
- y is the type of game states that can be observed after the player made its move.
- r is the type of utilities/payoffs the player can receive after making its move.
- s is the type of back-propagated utilities a player can send to players that moved before it.
- q is the type of rewards representing the player’s intrinsic utility.
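As a toy example of this reading (my own, ahead of the real constructions below): a single decision point, where the strategy is a function from observed states to moves and the received payoff is passed both backwards and out as the player’s reward.

```haskell
-- A single decision point: p = strategies (functions from states to
-- moves), q = the player's intrinsic reward, s = the back-propagated
-- utility. Both s and q are just the payoff r received from the future.
decision :: ParaLens (x -> y) r x r y r
decision = MkLens
  (\strat x -> strat x)   -- get: play according to the strategy
  (\_ _ r -> (r, r))      -- put: propagate the payoff back and out
```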
With this in mind, we can open the box in the previous diagram and have a look at the internals of a parametrized lens2:
By looking at the internals of a lens and the direction of the arrows, it becomes clear that data flows in two different directions:
- The forward pass, i.e. the `get` function, happens at the time a player observes the state before interacting with the game.
- The backward pass, i.e. the `put` function, happens “in the future”, after all players have made their moves, and represents the stage in which payoffs are computed and passed around.
To limit mental overload, the following definition of a non-parametrized lens will also come in useful later:
A (non-parametrized) Lens is a parametrized lens with parameters $(1, 1)$, where $1$ is the singleton set.
-- A (non-parametrized) `Lens` is a `ParaLens` with trivial parameters
type Lens = ParaLens () ()
-- Non-parametrized Lens constructor
nonPara :: (x -> y) -> (x -> r -> s) -> Lens x s y r
nonPara get put = MkLens (\_ x -> get x) (\_ x r -> (put x r, ()))
Diagrammatically we will represent wires of type `()` (the singleton type) as no wires at all. This will also come in useful later to simplify some definitions and diagrams. For example, here’s a representation of the flow of data in a non-parametrized lens, courtesy of Bruno Gavranović:
What makes Compositional Game Theory compositional is (unsurprisingly) the fact that parametrized lenses are closed under two different kinds of composition operators, one behaving like sequential composition of pure functions and one behaving like parallel execution of programs, or more or less like a tensor product3.
Let’s start with sequential composition: when the right boundary types of $A : (X, S) \to (Y, R)$ match the left boundary types of $B : (Y, R) \to (Z, T)$, we should be able to build another lens out of them, one that amounts to running what happens in $A$ first, and then what happens in $B$, while taking into account the parameters of both lenses:
By trying to code this up in a type-directed way in Haskell, the only sensible definition that can possibly come out is the following:
infixr 4 >>>>
(>>>>) :: ParaLens p q x s y r -> ParaLens p' q' y r z t -> ParaLens (p, p') (q, q') x s z t
(MkLens get put) >>>> (MkLens get' put') =
MkLens
(\(p, p') x -> get' p' (get p x))
(\(p, p') x t ->
let (r, q') = put' p' (get p x) t
(s, q) = put p x r
in (s, (q, q'))
)
From the Haskell implementation we can see that composing two lenses, parametrized or not, isn’t as simple as plugging one end into another, merging the parameter wires and calling it a day4. Something a bit more involved is happening:
Mathematically, this amounts to the following compositions:
- For the `get` part: $P' \times P \times X \xrightarrow{\mathrm{id} \times \mathrm{get}} P' \times Y \xrightarrow{\mathrm{get}'} Z$
- For the `put` part: $P' \times P \times X \times T \xrightarrow{\mathrm{id} \times \Delta_P \times \Delta_X \times \mathrm{id}} P' \times P \times P \times X \times X \times T \xrightarrow{\mathrm{sym} \times \mathrm{get} \times \mathrm{sym}} P \times P' \times Y \times T \times X \xrightarrow{\mathrm{id} \times \mathrm{put}' \times \mathrm{id}} P \times R \times Q' \times X \xrightarrow{\mathrm{rearrange}} P \times X \times R \times Q' \xrightarrow{\mathrm{put} \times \mathrm{id}} S \times Q \times Q'$

Where $\Delta(x) = (x, x)$, $\mathrm{sym}(x, y) = (y, x)$, and $\mathrm{rearrange}$ is a suitable composition of $\mathrm{sym}$s.
Luckily, parallel composition is way easier than the sequential one: in fact, the parallel composition of $A : (X, S) \to (Y, R)$ with parameters $(P, Q)$ and $B : (X', S') \to (Y', R')$ with parameters $(P', Q')$ amounts to a lens $A \times B : (X \times X', S \times S') \to (Y \times Y', R \times R')$ with parameters $(P \times P', Q \times Q')$, such that $\mathrm{put}_{A \times B}$ and $\mathrm{get}_{A \times B}$ are respectively the cartesian products of the `put` and `get` functions from $A$ and $B$, modulo some rearrangement of inputs and outputs.
This is even clearer from the Haskell implementation:
infixr 4 ####
(####) :: ParaLens p q x s y r -> ParaLens p' q' x' s' y' r' -> ParaLens (p, p') (q, q') (x, x') (s, s') (y, y') (r, r')
(MkLens get put) #### (MkLens get' put') =
MkLens
(\(p, p') (x, x') -> (get p x, get' p' x'))
(\(p, p') (x, x') (r, r') ->
let (s, q) = put p x r
(s', q') = put' p' x' r'
in ((s, s'), (q, q'))
)
Diagrammatically, this amounts to just putting the two lenses near each other.
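For instance, reusing the toy `decision` lens I sketched earlier, two independent decision points played side by side are just:

```haskell
-- Two players observing separate states, choosing separate moves, and
-- receiving separate payoffs, with strategies and rewards paired up.
twoPlayers :: ParaLens (x -> y, x' -> y') (r, r')
                       (x, x') (r, r') (y, y') (r, r')
twoPlayers = decision #### decision
```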
Now that we have laid all the groundwork, let’s have a look at a couple of concrete examples of lenses.
Our first source of lenses will be functions: for each function $f : X \to S$ there is a non-parametrized lens $F : (X, S) \to (1, 1)$ such that $\mathrm{get}(\ast, x) = \ast$ and $\mathrm{put}(\ast, x, \ast) = (f(x), \ast)$. Vice versa, we can always extract a unique function from non-parametrized lenses of this kind.
funToCostate :: (x -> s) -> Lens x s () ()
funToCostate f = nonPara (const ()) (\x _ -> f x)
costateToFun :: Lens x s () () -> (x -> s)
costateToFun (MkLens _ f) x = fst $ f () x ()
Similarly, for each function $f : P \to Q$ there is a parametrized lens $\bar{F} : (1, 1) \to (1, 1)$ with parameters $(P, Q)$, such that $\mathrm{get}(\ast, \ast) = \ast$ and $\mathrm{put}(p, \ast, \ast) = (f(p), \ast)$. Likewise, we can always extract a unique function from this kind of parametrized lens.
funToParaState :: (p -> q) -> ParaLens p q () () () ()
funToParaState f = MkLens (\_ _ -> ()) (\p _ _ -> ((), f p))
paraStateTofun :: ParaLens p q () () () () -> (p -> q)
paraStateTofun (MkLens _ coplay) p = snd $ coplay p () ()
For each value $\bar{y} \in Y$ and for any set $R$ we can build a non-parametrized lens $S_{\bar{y}} : (1, 1) \to (Y, R)$ such that $\mathrm{get}(\ast, \ast) = \bar{y}$ and $\mathrm{put}(\ast, \ast, r) = (\ast, \ast)$.
scalarToState :: y -> Lens () () y r
scalarToState y = nonPara (const y) const
stateToScalar :: Lens () () y r -> y
stateToScalar (MkLens get _) = get () ()
The Identity Lens is a non-parametrized lens of type $(X, S) \to (X, S)$ that serves as the identity morphism for parametrized lenses, i.e. pre-/post-composing a lens $A$ with the identity lens gives you back $A$ modulo readjusting the parameters (we will see how to do that in the next post). In Haskell:
idLens :: Lens x s x s
idLens = nonPara id (\_ x -> x)
(Right) Corners are parametrized lenses of type $(1, 1) \to (Y, R)$ with parameters $(Y, R)$ that bend parameter wires into right wires, such that $\mathrm{get}(y, \ast) = y$ and $\mathrm{put}(y, \ast, r) = (\ast, r)$.
corner :: ParaLens y r () () y r
corner = MkLens const (\_ _ r -> ((), r))
As we will see in later posts, corners are an important component of bimatrix games.
Parametrized lenses are not only useful for reasoning about Open Games, but also serve as the basis of a whole categorical framework for reasoning about complex multi-agent systems, which has been applied to gradient-based learning, dynamic programming, reinforcement learning, Bayesian inference and servers, on top of various flavors of game theory (e.g. [2105.06763]). Indeed, this categorical framework is so general and promising that we spawned an entire research institute dedicated to it.
Phew! That’s all for today. I hope that this introduction to the world of parametrized lenses has left you wanting more! I’ll see you in the next post, where we will explore how to handle spurious parameters with reparametrizations and model players and their agency with selection functions.
-
Without side-effects and/or emergent behavior. ↩
-
Sometimes it will be useful to represent certain lenses in their unboxed form and with product-type wires decoupled when reasoning pictorially, luckily this approach to reasoning with lenses is still completely formal. ↩
-
In mathematical lingo, one would say that parametrized lenses can be organized as the morphisms of a somewhat complicated monoidal-category-like structure called a symmetric monoidal bicategory. This is not a 1-category on-the-nose, since there are some issues with the bracketing of parameters after sequential composition that make associativity hold only up to isomorphism. ↩
-
Actually there’s a useful generalization of the (parametrized) lens definition, called (parametrized) optics, which allows this, on top of other operational advantages over the lens definition, and which allows expanding the “classical” definition of Open Games to Bayesian Game Theory and more. ↩

See parts I and II of this series. There is also now a source repo!
In this post we will make probably the single most important step from a generic type theory to one specialised to bidirectional programming.
In a bidirectional language there are 2 kinds of variables: ones travelling forwards in time, and ones travelling backwards in time. I will refer to the types of these variables as covariant and contravariant for short (or at least shorter). Variables of covariant type are the ones we are used to: they behave according to ordinary cartesian scoping rules, which means they can be implicitly deleted (referred to zero times) and implicitly copied (referred to more than once).
Variables of contravariant type are the weird ones. They too can be implicitly copied and deleted, but because they travel backwards, from the forwards perspective this looks like implicit merging and spawning. By spawned I mean they can be bound zero times and still referred to; and by merged I mean they can be bound twice without the later binding shadowing the earlier one. On the other hand, from the forwards perspective they cannot be implicitly copied or deleted. Their scoping rules are cocartesian.
I can now reveal that my party trick in the last section of my first post where I built a cocartesian language was secretly not in fact a party trick after all, but was all along in preparation for this exact moment.
Now suppose we take a tensor product of a covariant type and a contravariant type. We then have a variable referring to a bundle of a cartesian value, which can be deleted and copied but not spawned or merged, and a cocartesian value, which can be spawned and merged but not deleted or copied. Since doing any of these operations to a pair means doing it to both, our variable cannot be deleted or copied or spawned or merged, which means it is a linear variable. I will refer to such a tensor product type as invariant. Secretly, all of the types in my second post were invariant, since that language was linear.
There is a secret fourth thing, which is variables with bicartesian scoping rules, so they can be deleted and copied and spawned and merged. I will refer to these types as bivariant. It would be entirely reasonable to not include these, but I have a secret plan for them that will be revealed in the next post. For now, the only bivariant type will be the monoidal unit.
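To summarise the four kinds and the structural rules their variables admit (my tabulation of the above):

| Kind | Scoping rules | Delete / copy | Spawn / merge |
| --- | --- | --- | --- |
| covariant | cartesian | yes | no |
| contravariant | cocartesian | no | yes |
| invariant | linear | no | no |
| bivariant | bicartesian | yes | yes |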
Just as terms are classified by types, types are classified by kinds, which means I have been talking about kinds the whole time. But these are not kinds as we know them from languages like Haskell. We have the added complication that kinds control the scoping rules of variables of the types they classify; but on the other hand we have the added simplification that there are exactly 4 kinds, rather than an infinite supply of them. I have to thank Michael Arntzenius, who visited Zanzi and me a few weeks ago, for planting the idea of a 4-element lattice of kinds.
Let’s start with the easy parts:
Kind : Type
Kind = (Bool, Bool)
data Ty : Kind -> Type where
Unit : Ty (True, True)
Ground : Ty (True, False)
Not : Ty (cov, con) -> Ty (con, cov)
I implement the 4-element lattice as the product (Bool, Bool)
, where the first flag tracks whether the type is covariant and the second whether it is contravariant. I anticipate that much later, probably when I add type polymorphism, it will become necessary to take seriously that there are subkind relationships here, but I will cross that particular bridge when it is in front of me. Unit
is a bivariant type, Ground
is a covariant type (it is still a placeholder, and in the next post will be replaced with a proper system for base types). The Not
operator acts on the underlying kind by interchanging the covariant and contravariant capabilities; so strictly covariant types become strictly contravariant and vice versa, whereas the bivariant and invariant kinds are both stable under negation.
There is an alternative way of implementing it, but it is much more tedious: instead of representing kinds explicitly, have 4 different versions of Ty
for each of the 4 kinds. That leads to a proliferation of type operators - 4 operators for Not
and 16 for Tensor
(which we’re coming to next), 1 for each combination of kinds of the 2 inputs. I really hope that won’t be necessary, but it’s something I’m keeping in my back pocket just in case I run into an insurmountable problem with this encoding.
Now we come to the hard part: tensor products. A tensor product is covariant if both parts are covariant, and is contravariant if both parts are contravariant. In a world with a sufficiently smart typechecker we would simply write this:
Tensor : Ty (covx, conx) -> Ty (covy, cony)
-> Ty (covx && covy, conx && cony)
But we already know how to handle this situation, it’s the same idea as for list concatenation in the first post. We define the relation corresponding to the boolean conjunction function, and a section of it:
data And : Bool -> Bool -> Bool -> Type where
True : And True b b
False : And False b False
(&&) : (a : Bool) -> (b : Bool) -> (c : Bool ** And a b c)
True && b = (b ** True)
False && b = (False ** False)
(Here I’m having fun with Idris’ ability to disambiguate names based on typing information, something I really wish I could do in Haskell.)
Now we can write the actual definition of tensor products, which is much more complicated looking than it should be, but this is just what we have to deal with until the day when somebody builds a typechecker that can handle trivial equational reasoning:
Tensor : {covx, covy, conx, cony : Bool}
-> {default (covx && covy) cov : _} -> {default (conx && cony) con : _}
-> Ty (covx, conx) -> Ty (covy, cony) -> Ty (cov.fst, con.fst)
Now we come to the main topic of this post: how kinds influence scoping rules. In the first post we saw how to implement context morphisms for planar, linear, cartesian and cocartesian languages. Those definitions were all polymorphic over arbitrary lists. Then in the second post we defined context morphisms that could introduce and eliminate double negations, which was no longer polymorphic but specialised to a particular language of types, and this will continue.
Now that types are indexed by kinds, everything else becomes more complicated: everything we do from now on becomes additionally indexed by kinds, starting with this.
(Note, I decided to reuse the name Structure
, since this should be the last one we ever need.)
Let’s start with the linear rules:
data Structure : All Ty kas -> All Ty kbs -> Type where
Empty : Structure [] []
Insert : {a, b : Ty (cov, con)} -> {as : All Ty kas} -> {bs : All Ty kbs}
-> Parity a b -> IxInsertion a as as' -> Structure as bs -> Structure as' (b :: bs)
Here IxInsertion
is an indexed version of the Insertion
datatype from the previous post, relationally defining insertions into an All
type indexed by an underlying list:
data IxInsertion : {0 x : a} -> p x
-> {0 xs : List a} -> All p xs
-> {0 ys : List a} -> All p ys -> Type where
Z : IxInsertion {x} a {xs} as {ys = x :: xs} (a :: as)
S : IxInsertion {x} a {xs} as {ys} bs
-> IxInsertion {x = x} a {xs = y :: xs} (b :: as) {ys = y :: ys} (b :: bs)
I’m not going to explain all of the syntax and semantics of Idris happening here because it would take us too far afield, and despite appearances this is not intended to be a tutorial series on Idris programming. Suffice to say, defining indexed versions of standard datatypes is significantly harder than defining the originals.
The definition of Parity
, which handles double negations, is unchanged from the previous post besides adding the kind indexes:
data Parity : Ty (cov, con) -> Ty (cov, con) -> Type where
Id : Parity a a
Raise : Parity a b -> Parity a (Not (Not b))
Lower : Parity a b -> Parity (Not (Not a)) b
The Idris typechecker is, at least, smart enough to figure out that double negation leaves the kind unchanged. This is one reason that I write out kinds as pairs of booleans everywhere rather than defining a function swap : Kind -> Kind
, since that causes Idris to get stuck here.
One thing to note is that I reversed the polarity of the Insert
constructor from the previous post, so that it now conses on the codomain and inserts on the domain rather than the other way round. The previous post was “supply driven”, saying where everything in the domain goes in the codomain, whereas this version is “demand driven”, saying where everything in the codomain came from in the domain. This was a late change after I experienced manually defining terms in this core syntax and found that this way makes manual proof search much easier. If it later turns out that the original supply driven version makes writing elaborators easier, it’s an easy change to turn it back.
Next we have the rules for delete and copy, which can be applied only to covariant types. We don’t care whether the type is also contravariant, i.e. we can use it for both strictly covariant types and bivariant types. This is the other reason I define kinds to be pairs of booleans rather than their own dedicated language, because otherwise we would have twice as many constructors here.
Delete : {a : Ty (True, con)} -> Structure as bs -> Structure (a :: as) bs
Copy : {a : Ty (True, con)} -> {as : All Ty xs}
-> IxElem a as -> Structure as bs -> Structure as (a :: bs)
Here IxElem
is, of course, an indexed version of the Elem
datatype, which I won’t bother to write here but was also painful to define. (If you want to you can find it in the IxUtils module of the source repo.)
With that, defining the constructors for spawn and merge is easy, and completes our definition of Structure
once and for all.
Spawn : {b : Ty (cov, True)} -> Structure as bs -> Structure as (b :: bs)
Merge : {b : Ty (cov, True)} -> {bs : All Ty ys}
-> IxElem b bs -> Structure as bs -> Structure (b :: as) bs
We can now put the pieces together and make the first real definition of our kernel language. I hope that nothing here will change, only be added to in the future.
Of course everything is kind indexed:
data Term : All Ty ks -> Ty k -> Type where
Var : Term [x] x
Rename : Structure as bs -> Term bs x -> Term as x
I renamed Act
to Rename
since the last post, since somebody pointed out that’s what it is.
Unit introduction and elimination rules work as they did before:
UnitIntro : Term [] Unit
UnitElim : {as : All Ty kas} -> {bs : All Ty kbs} -> {default (ixSimplex as bs) cs : _}
-> Term as Unit -> Term bs x -> Term (cs.snd.fst) x
Here ixSimplex
is the tactic corresponding to a datatype IxSimplex
which is the relational version of the indexed version of ++
, operating on All
rather than List
. This one is worth writing down because it returns a twice-iterated Sigma type, so we have to extract the part we need with .snd.fst
:
data IxSimplex : {0 xs : List a} -> All p xs
-> {0 ys : List a} -> All p ys
-> {0 zs : List a} -> All p zs -> Type where
Z : IxSimplex [] bs bs
S : IxSimplex {xs} as {ys} bs {zs} cs
-> IxSimplex {xs = x :: xs} (a :: as) {ys = ys} bs {zs = x :: zs} (a :: cs)
ixSimplex : {xs : List a} -> (as : All p xs)
-> {ys : List a} -> (bs : All p ys)
-> (zs : List a ** cs : All p zs ** IxSimplex as bs cs)
ixSimplex {xs = []} [] {ys} bs = (ys ** bs ** Z)
ixSimplex {xs = x :: xs} (a :: as) {ys} bs
= let (zs ** cs ** n) = ixSimplex {xs = xs} as {ys = ys} bs
in (x :: zs ** a :: cs ** S n)
Now we come to the negation rules, and the final twist of this post. In the language we reached at the end of the previous post, which amounts to the fragment of this language for only invariant types, there was no not-introduction rule - correctly so. As I speculated there, there are indeed some valid instances of not-introduction, but they are not sufficient to prove general double negation introduction or elimination - and they can’t even be expressed without the kind system we developed in this post.
It turns out that we have two not-introduction rules, one that is valid for covariant types and one that is valid for contravariant types. This introduces a sort of incompatibility between negation and tensor, since the tensor product of two variables with a valid not-introduction rule can fail to have one.
NotIntroCov : {a : Ty (True, con)} -> Term (a :: as) Unit -> Term as (Not a)
NotIntroCon : {a : Ty (cov, True)} -> Term (a :: as) Unit -> Term as (Not a)
NotElim : {as : All Ty kas} -> {bs : All Ty kbs} -> {default (ixSimplex as bs) cs : _}
-> Term as (Not x) -> Term bs x -> Term (cs.snd.fst) Unit
I would say that this is the first case where my well typed by construction methodology enabled me to do something that I think I would have failed at otherwise. I figured out these rules at the same time as the corresponding cases of the well typed evaluator (which will be the topic of the next post). Although they look like they are among the simplest rules in the language, they are the ones I understand by far the least.
If we have a bivariant type then both of these rules can be applied to produce the same result. It seems tempting to try to roll these two rules into one, which can be applied when either cov
or con
is True
. This is a bit tricky to do, but it turns out to also be a bad idea: although these rules are logically equivalent, they are not type-theoretically equivalent: they have completely different operational semantics! And by that I don’t mean something like different evaluation strategies, I mean that they actually output different results.
Now there’s only the tensor rules, and for once there is nothing to say about them, they are the indexed versions of the tensor rules from the last 2 posts.
TensorIntro : {as : All Ty kas} -> {bs : All Ty kbs} -> {default (ixSimplex as bs) cs : _}
-> Term as x -> Term bs y -> Term (cs.snd.fst) (Tensor x y)
TensorElim : {as : All Ty kas} -> {bs : All Ty kbs} -> {default (ixSimplex as bs) cs : _}
-> Term as (Tensor x y) -> Term (x :: y :: bs) z -> Term (cs.snd.fst) z
And that’s it!
Like I said, in the next post we will build a well typed evaluator, which means we will also write and run our first programs - and we can already do some interesting things, like basic automatic differentiation. The only small thing that will need to be added to the language from this post is a mechanism for adding basic types and basic terms, such as arithmetic operators between doubles.

The space at the beginning of each post is used for acknowledgements.
The Cybercat blog website is based on the Whiteglass theme.
Standard GitHub workflow:
- Clone this repo
- Create a branch
- Write your post
- Make a PR
- Wait for approval
The blog will be automatically rebuilt once your PR is merged.
Since the blog uses Jekyll, you will need to install it, or use the included nix flake devshell (just run `nix develop` with flakes-enabled nix installed), to be able to preview your content. Once the installation is complete, just navigate to the repo folder and run `bundle exec jekyll serve`. Jekyll will spawn a local server (usually at `127.0.0.1:4000`) that will allow you to preview the blog locally.
Posts must be placed in the `_posts` folder. Post filenames follow the convention `yyyy-mm-dd-title.md`. Post assets (such as images) go in the folder `assetsPosts`, where you should create a folder with the same name as the post.
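For example, a hypothetical post and its assets would be laid out as:

```
_posts/2024-01-31-my-new-post.md
assetsPosts/2024-01-31-my-new-post/figure1.png
```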
Each post should start with the following preamble:
---
layout: post
title: the title of your post
author: your name
categories: keyword or a list of keywords [keyword1, keyword2, keyword3]
excerpt: A short summary of your post
image: assetsPosts/yourPostFolder/imageToBeUsedAsThumbnails.png This is optional, but useful if e.g. you share the post on Twitter.
usemathjax: true (omit this line if you don't need to typeset math)
thanks: A short acknowledgement message. It will be shown immediately above the content of your post.
---
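For instance, a (hypothetical) post might start with:

```
---
layout: post
title: My new post
author: Jane Doe
categories: [haskell, model]
excerpt: A short summary of my new post.
usemathjax: true
---
```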
As for the content of the post, it should be typeset in markdown.
- Inline math is shown by using `$ ... $`. Notice that some expressions such as `a_b` typeset correctly, while expressions like `a_{b}` or `a_\command` sometimes do not. I guess this is because MathJax expects `_` to be followed by a literal.
- Display math is shown by using `$$ ... $$`. The problem above doesn’t show up in this case, but you gotta be careful:
text
$$ ... $$
text
does not typeset correctly, whereas:
text
$$
...
$$
text
does. You can also use environments, as in:
$$
\begin{align*}
...
\end{align*}
$$
We provide the following theorem environments: Definition, Proposition, Lemma, Theorem and Corollary. Numbering is automatic. If you need others, just ask. The way these work is as follows:
{% def %}
A *definition* is a blabla, such that: $...$. Furthermore, it is:
$$
...
$$
{% enddef %}
This gets rendered as follows:
A definition is a blabla, such that: ……. Furthermore, it is:
......
Numbering is automatic. Use the tags:
{% def %}
For your definitions
{% enddef %}
{% not %}
For your notations
{% endnot %}
{% ex %}
For your examples
{% endex %}
{% diag %}
For your diagrams
{% enddiag %}
{% prop %}
For your propositions
{% endprop %}
{% lem %}
For your lemmas
{% endlem %}
{% thm %}
For your theorems
{% endthm %}
{% cor %}
For your corollaries
{% endcor %}
If you need to reference results, just append a `{"id":"your_reference_tag"}` after the tag, where `your_reference_tag` works like a LaTeX label. For example:
{% def {"id":"your_reference_tag"} %}
A *definition* is a blabla, such that: $...$. Furthermore, it is:
$$
...
$$
{% enddef %}
Then you can reference this by doing:
As we remarked in [Reference description](#your_reference_tag), we are awesome...
We support two types of diagrams: quiver and TikZ.
You can render quiver diagrams by enclosing quiver-exported iframes between `quiver` tags:
- On quiver, click on
Export: Embed code
- Copy the code
- In the blog, put it between delimiters as follows:
{% quiver %}
<!-- https://q.uiver.app/codecodecode-->
<iframe codecodecode></iframe>
{% endquiver %}
They get rendered as follows:
(The exported iframe renders here as an interactive, embedded quiver diagram.)
Should the picture come out cropped, select `fixed size` when exporting the quiver diagram, and choose some suitable parameters.
You can render tikz diagrams by enclosing tikz code between tikz
tags, as follows:
{% tikz %}
\begin{tikzpicture}
\draw (0,0) circle (1in);
\end{tikzpicture}
{% endtikz %}
Tikz renders as follows:
Notice that at the moment tikz rendering:
- Supports any option you put after `\begin{document}` in a `.tex` file. So you can use this to include any stuff you’d typeset with LaTeX (but we STRONGLY advise against it).
- Does not support anything that should go in the LaTeX preamble, that is, before `\begin{document}`. This includes external tikz libraries such as `calc`, `arrows`, etc., and packages such as `tikz-cd`. Should you need `tikz-cd`, use quiver as explained above. If you need fancier stuff, you’ll have to render the tikz diagrams by yourself and import them as images (see below).
Referencing works also for the quiver and tikz tags, as in:
{% tikz {"id":"your_reference_tag"} %}
...
{% endtikz %}
This automatically creates a numbered ‘Figure’ caption under the figure, as in:
(A quiver diagram renders here, with an automatically numbered ‘Figure’ caption underneath.)
Whenever possible, we encourage you to enclose diagrams into definitions/propositions/etc should you need to reference them.
Images are included via standard markdown syntax:

image_path
can be a remote link. Should you need to upload images to this blog post, do as follows:
- Create a folder in `assetsPosts` with the same title as the blog post file. So if the blog post file is `yyyy-mm-dd-title.md`, create the folder `assetsPosts/yyyy-mm-dd-title`
- Place your images there
- Reference the images by doing:

Whenever possible, we recommend images to be in the `.png` format, `800` pixels in width, with transparent background. Ideally, these should be easily readable on the light gray background of the blog website. You can stray from these guidelines if you have no alternative, but our definition and your definition of ‘I had no alternative’ may be different, and we may complain.
Referencing works exactly as for diagrams:
{% figure {"id":"your_reference_tag"} %}

{% endfigure %}
CyberCat blog offers support for code snippets:
def print_hi(name)
puts "Hi, #{name}"
end
print_hi('Tom')
#=> prints 'Hi, Tom' to STDOUT.
To include a code snippet, just give:
```language the snippet is written in
your code
```
Check out the Jekyll docs for more info on how to get the most out of Jekyll. File all bugs/feature requests at Jekyll’s GitHub repo. If you have questions, you can ask them on Jekyll Talk.