A brief overview of the work involved in testing complex narrative multimedia

Testing Large and Branching Narrative Entertainment

Overview

One of the most significant challenges in producing large works of text-driven multimedia is the work of testing. There are a few basic reasons for that:

Games are designed in-flight, through iterative and playful processes, and so a pre-written set of test plans or user stories will invariably need further labor to keep pace with design.
Testing is easy to ignore in the context of deadlines.
Understanding and visualizing any narrative as a complex tree or directed acyclic graph, rather than a piece of “static” fiction which moves from A-Z (or א to ת, if you prefer), is extremely difficult. If this sounds hyperbolic, ask literally anyone who’s ever contracted with a VC-funded interactive televisual or multimedia startup.
Testing is subjectively different from programming in that nothing, not even the compiler, tells us whether we did it right. There’s no model, even in senior software engineering, which tells you whether your tests are sufficient. Tooling can tell you whether they all pass, theory can tell you the best strategies for test design, and coverage reports can even tell you which lines of code, or which specific logical branches within a line, remain untested, but there’s never a point where one can know “okay, now my codebase is safe.” It’s very hard to motivate oneself to focus on a task one knows upfront might never feel complete.
Overly specific tests break, and earlier expectations encoded as invariants (e.g. “every passage has at least one word or N characters of text” or “there are never more than 5 choices”) can produce more work than having no tests at all, because the codebase is larger and all of it needs steady updates.
Unit tests, or smaller tests based on atoms of functionality, are very prone to false positives and false negatives. Over a long enough time period, one will accumulate tests which fail when they should pass, and tests which pass when their zone of concern actually has bugs in it.

Considerations

Given these difficulties, where can we focus to lessen them?

Text-based games, or multimedia which predominantly centers on text, exist in a small commercial space, and in an interstitial technological space somewhere between literary publishing, indie game studio, and independent film studio. It’s therefore very difficult to take any one prescription or pattern for validating these games and apply it; one runs into more difficulty through stubbornness than through openness.
The work of testing is substantially similar to that of a film editor. In the studio system, it’s vanishingly rare for a director to edit their own work. (The Coen brothers are a notable exception to this.) One of the main reasons is that the top-level dreamer tends to be the least objective, after years of pre-production and a lengthy production phase, about what works, what doesn’t, and how the final product should be molded. The same is true of testing narrative entertainment, particularly given that most projects don’t have anywhere near the development or testing time needed to validate something which amounts to a 10-100K line codebase.
A lot of what should be tested is the game, not the codebase. Games aren’t codebases; they’re material, generally singular products, and in time the appearance of the game drifts entirely away from the appearance of the prose or markup. Even monospaced type, letter height, and paragraph margins can dramatically change the perception of content.

Strategies

Okay, so what can we do about all that stuff? Here are a few major areas of focus used by long-time interactive fiction and professional narrative devs:

Transcripting

Broadly, transcripting refers to making a text printout of the content shown to the player and the choices they made. This doesn’t actually need to be a full readout of the game; that just happens to be the simplest available way for parser authors. Most web games, especially those with a lot of styling, inline links, numerous on-page choices, etc., will probably need to find a more structured way to print transcripts.

In parser fiction, this is so simple even the IFComp website will do it for you. That’s a major advantage to publishing parser fiction to the comp, not least because those responses are generated in real time by registered comp voters, whereas the comment forms are withheld until the end of the competition.

In hypertext fiction, or some sort of browser/executable delivery, this is a bit more complicated. The simplest solution would likely be to add some sort of event listener for player choices, or add an optional flag to pre-existing choice logic, and stream new content to the in-progress transcript. When the game “ends,” whether that’s losing, winning, quitting, or restarting, the transcript is sealed, and one can read or generate it.
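As a rough, engine-agnostic sketch of that idea (not what any particular engine or our own project does), the recorder below appends an event each time content is shown or a choice is made, then refuses further writes once sealed. The `Transcript` class, its method names, and the event fields are all hypothetical; a browser game would do the same thing in its own scripting language from inside its choice handler.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Transcript:
    """Hypothetical in-progress transcript: an ordered list of events."""

    events: list = field(default_factory=list)
    sealed: bool = False
    ending: Optional[str] = None

    def record(self, kind: str, passage: str, text: str) -> None:
        """Append one event; kind is e.g. "content" or "choice"."""
        if self.sealed:
            raise RuntimeError("transcript is sealed")
        self.events.append(
            {"kind": kind, "passage": passage, "text": text, "at": time.time()}
        )

    def seal(self, ending: str) -> None:
        """Close the transcript; ending is e.g. "win", "lose", "quit", "restart"."""
        self.sealed = True
        self.ending = ending

    def to_json(self) -> str:
        """Serialize the whole transcript for later reading or analysis."""
        return json.dumps({"ending": self.ending, "events": self.events}, indent=2)


# Usage: call record() from wherever the game renders content or handles a choice.
transcript = Transcript()
transcript.record("content", "start", "You wake in a dark archive.")
transcript.record("choice", "start", "Go to the hall")
transcript.seal("quit")
```

Keeping each event as structured data, rather than one flat string, is what makes the later analysis steps possible: passage names and timestamps survive alongside the prose.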

Static Analysis

Static analysis is a phrase more often encountered in programming. Put simply, it means analytical operations performed on the codebase, or on a “headless” version of the system, rather than on the running executable (in this case, the live game). This is most useful when combined with one’s development toolchain, where a common name for it is a “linter,” and when applied to transcripts themselves. The former is too specific to engine usage to go into here, but the latter bears a lot more unpacking.

Reading transcripts like text is useful, but not significantly more so than playing through the game. It’s faster, and you can do it a lot of times, but you’re still looking at text which (outside parser) doesn’t appear the same way it does in the game, and it’s still the same text you’re already bored of. A more interesting direction is to analyze the output of a large number of random-walked transcripts, and generate insights from this. Here are a few basic foci for analyzing metadata-adorned transcripts:

How much content was rendered in a whole playthrough? What’s the median, mode, minimum, and maximum?
How much content was rendered in atomic content blocks? Are there sections which are uncommonly short? Are there ones which are uncommonly long? Are there stretches where the long ones are next to one another, and the short ones are next to the short ones? Is that what you want? etc. etc.
How many choices did the player have to make in each path? Are there very short playthroughs? Are some of them interminably long? Do some involve large amounts of rereading?
What are the “hottest” paths, through which the majority of simulations move? What are the “coldest”? Are there “orphans,” which are either never reached or seldom reached? Does that reflect expectations, and could it reflect bugs?
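A minimal sketch of the questions above, assuming the story can be reduced to a simple graph of passages. The `STORY` dictionary, `random_walk`, and `analyze` are hypothetical names invented for illustration; real transcripts would carry far more metadata, and per-passage length checks would follow the same pattern.

```python
import random
import statistics
from collections import Counter

# Hypothetical story: passage name -> (text, names of passages it links to).
STORY = {
    "start": ("You wake in a dark archive.", ["hall", "stairs"]),
    "hall": ("A long hall stretches east.", ["stairs", "end"]),
    "stairs": ("The stairs creak.", ["end"]),
    "attic": ("Dust everywhere.", ["end"]),  # nothing links here: an orphan
    "end": ("Fin.", []),
}


def random_walk(story, start="start", rng=random, max_steps=1000):
    """Return the passages visited in one random playthrough."""
    path, current = [start], start
    # Stop at a passage with no links, or at max_steps to guard against loops.
    while story[current][1] and len(path) < max_steps:
        current = rng.choice(story[current][1])
        path.append(current)
    return path


def analyze(story, walks):
    """Summarize word counts, step counts, and passage visits across walks."""
    words = [sum(len(story[p][0].split()) for p in walk) for walk in walks]
    steps = [len(walk) - 1 for walk in walks]  # transitions taken per walk
    visits = Counter(p for walk in walks for p in walk)
    return {
        "words_min": min(words),
        "words_max": max(words),
        "words_median": statistics.median(words),
        "words_mode": statistics.mode(words),
        "steps_min": min(steps),
        "steps_max": max(steps),
        "hottest": visits.most_common(3),
        "coldest": visits.most_common()[-3:],
        "never_reached": sorted(set(story) - set(visits)),
    }


rng = random.Random(0)  # seeded so a run can be repeated
walks = [random_walk(STORY, rng=rng) for _ in range(200)]
report = analyze(STORY, walks)
print(report["never_reached"])
```

Whether an unreached passage like the hypothetical `attic` is a bug or an intentional rarity is still a judgment call; the report only tells you where to look.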

Real-time Automation

There are many cases where static, flat, transcript-wise analysis won’t work. One is if you have significant amounts of media. That problem compounds exponentially if your game involves a large amount of timing and synchronization, long-running media, complex animations, performance flaws, etc. One possible solution is to produce code which runs inside the game itself and automatically selects choices at random. The algorithm which determines wait time can be tuned on factors such as content character and word length, pixel height, and more, in order to produce a self-executing, fully-visualized approximation of how players might actually perceive your content.

Another advantage to this is that having your game played by someone else (even if it’s a totally random oracle) is a neurologically distinct experience from reading and playing one’s own work. This process decontextualizes the appearance of your work, and for reasons too arcane for anyone not a neuroscientist to say, just plain makes the flaws more obvious. You’ll notice a lot more typos watching your game played on Twitch than you will either by staring at the codebase or playing the game 500 times manually.
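The random-choice autoplayer with a content-sensitive wait time might be sketched like this. Everything here is an assumption for illustration: the function names, the three callbacks the game would supply, and the 200 words-per-minute reading speed and clamping bounds, which are placeholders to tune rather than measured values. An in-browser game would write the equivalent in its own scripting language; the logic is the same.

```python
import random
import time


def reading_delay(text, words_per_minute=200.0, min_seconds=1.0, max_seconds=30.0):
    """Estimate how long to linger on a block of text before choosing.

    Only word count is used here; pixel height or media length could be
    folded in the same way.
    """
    words = len(text.split())
    seconds = words / words_per_minute * 60.0
    return max(min_seconds, min(max_seconds, seconds))


def autoplay(get_text, get_choices, choose, rng=random, sleep=time.sleep, max_steps=500):
    """Play the game by waiting, then picking a random available choice.

    get_text, get_choices, and choose are hypothetical hooks into the game:
    current visible text, current list of choices, and a choice selector.
    Returns the number of choices made.
    """
    steps = 0
    while steps < max_steps:
        choices = get_choices()
        if not choices:  # no choices left: the playthrough has ended
            break
        sleep(reading_delay(get_text()))
        choose(rng.choice(choices))
        steps += 1
    return steps
```

Passing `sleep` and `rng` in as parameters is a deliberate choice: a seeded `random.Random` makes a run repeatable, and a no-op `sleep` lets the same loop run at full speed when you only want transcripts.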

Conclusion

Feel free to leave any questions, comments, criticisms, or thoughts. In particular, I’m happy to expand on anything we did specifically for our project to fulfill these aims, how that might be made generically useful for others, and useful topics for future diaries. Thanks for reading, all :black_heart:

Venus Wormwood and Mother

TAV Institute

https://tav.institute
