Skip to content

Instantly share code, notes, and snippets.

@rikkimax
Last active August 8, 2024 13:52
Show Gist options
  • Save rikkimax/eed86a7061445a93f214e41fb6445e40 to your computer and use it in GitHub Desktop.
Save rikkimax/eed86a7061445a93f214e41fb6445e40 to your computer and use it in GitHub Desktop.

Type State Analysis

Rationale

Unitialized Variable Initialization

In an attempt to work around other bugs or to improve behavior, it may come across as a wonderful idea to have variables be uninitialized, and to initialize them in separate stages.

The following code originates from a shared library by the author, in an attempt to work around a bug where the default initialization symbols was not crossing the shared library threshold.

MyType variable = void;
variable.tupelof[2 .. $] = MyType.init.tupleof[2 .. $];

There is an obvious solution to this problem, use a method called defaults for MyType that is not templated to initialize the variable.

However, could any code like the above that does not do a full initialization of all fields be appropriete for a D programmer to write? No, it is unsafe and is clearly going to result in bugs.

All variable declarations put the variable into the type state reachable. This enables writing to it, but not reading. For @trusted code what we are looking for is the same guarantees as @safe, in that doing a partial initialization either via .tupleof or by field should cause an error.

Error: example.d:2 Variable `variable` is in type state reachable, but statement would partially initialization and skip fields making it remain in reachable type state post initialization.

For @system code, keeping track of which fields have been initialized would be an appropriete solution to enable partial initialization and once all fields have been initialized move into type state initialized. This is sometimes used when working with Windows API.

Skipping Variable Declarations

D has been defined against type state analysis from the earliest days of development. An example of this is an error when you skip variable declarations using goto.

For a forwards goto the check is quite simply, we do a reverse iteration over the AST to find the goto statement from the label. If a variable declaration is found and error is emitted as that variable was skipped.

In a backwards goto instead of checking, we will destroy any variables that are no longer reachable. This is a good reliable method for handling this scenario.

import std.stdio : writeln;
import std.random : uniform01;

void main()
{
Label:
    SomethingWithSideEffects var;
    if (uniform01() > 0.5)
      goto End;
    goto Label;
    
End:
    writeln(var);
}

struct SomethingWithSideEffects {
    ~this() {
     	writeln("Make your D-Man cry :'(");   
    }
}

Will be transformed into:

void main()
{
 Label:
	SomethingWithSideEffects var = 0;
	try
	{
		if (uniform01() > 0.5)
			goto End;
		goto Label;
 End:
		writeln(var);
	}
	finally
		var.~this();
	return 0;
}

Instead of a dedicated check for a forwards goto, we can remove it in favor of a general use semantic pass that is used by multiple language features. This is the Data Flow Analysis (DFA) that is proposed by this DIP.

goto Label;
int variable;
Label:
const variable2 = variable;

Should error with:

Error: example.d:4 Variable `variable` is in type state unreachable, but statement would load it into ``variable2`` during initialization.

When a forword goto is seen, the current context is split into two. The first copy continues on as normal. The second continues on by performing a jump to the label which acts as a point of convergence. This will catch any differences between these two pathways.

At a point of convergence when multiple contexts have been processed, the convergence process is undertaken which unifies them into a single context before continuing to the next point of convergence such as end of scope.

When converging the goal is take the lowest type state of each variable to unify into a single context state for all variables.

The following is an example of this problem, where D has been defined against type state analysis without actually having type state analysis implemented. It is a known bug.

import std.stdio : writeln;

void main(string[] args) {
    if (args.length == 1)
        goto Fail;
    
    Success:
        int var = 22;
        writeln("success ", var);
        return;
    
    Fail:
        writeln("fail ", var); // fail 1061300560
}

Existing Data Flow Analysis

An existing DFA in D is for @live a singular function borrow checker that lacks guarantees cross function. It is a single use DFA that cannot be applied to other memory analysis language features. This is a problem, this proposal focusses on only one aspect of, replacing its DFA to get the most out of a single set of costs.

The DFA as used by @live is a simplified Intermediate Representation (IR) to define a sequential list with control structures. It utilizes tree walkers for expressions within a function scope. This is a simplified and highly specific solution to the DFA required for a borrow checker which would be described as a subset of of a Control Flow Graph (CFG).

By utilizing a CFG, @live matches other borrow checker designs in premise, but it does so with only the ability of a forward pass to construct state or to proof it. For this reason it is not the basis of the DFA being proposed here. Instead it is an example of what could be rewitten to use the new DFA by taking advantage of its IR along with its value tracking state capability which replaces its own specific solution.

Changes

Each variable gains a type state.

The predefined type states are:

  • Unreachable: cannot be read or written to
  • Reachable: may be written to but not read
  • Initialized: may be read or written to
  • Default initialized: has a value that is the default for its type (.init)
  • Non-null: the type is a pointer and is not null

The order of the type states is: unreachable < reachable < initialized < default-initialized < non-null < user-defined.

When two or more type states are seen during analysis the lowest value is chosen.

User defined type states are ordered based upon definition:

typestate {
	empty = initialized,
	nonempty => !empty,
}

In the above empty is below nonempty but above non-null.

Syntax

The first syntax change is to introduce the type state declaration syntax for structs, classes and unions. Each expression must resolve to a boolean expression involving the this pointer.

struct Foo {
	typestate {
		First = initialized,
		Second => !method
	}
	
	bool method();
}
DeclDef:
+   TypeStateDeclaration

+ TypeStateDeclaration:
+	typestate { TypeStateSpecifiations|opt }

+ TypeStateSpecifications:
+    TypeStateSpecifications , TypeStateSpecification
+    TypeStateSpecification ,|opt

+ TypeStateSpecification:
+    Identifier = Identifier
+    Identifier => Expression

The second change is to introduce a method for specifying a known type state for a given parameter. This will always be the first function parameter attribute.

void func(?nonnull Object input);
ParameterAttributes:
+    TypeStateParameterAttribute

TypeStateParameterAttribute:
+	 ? Identifier , Identifier
+    ? Identifier

The third change is to enable the this pointer and return value to be annotated as part of the parameter list.

class AnObject {
	AnotherObject method(?superduperstate,nonnull this, ?unknown return) {
	}
}

The first identifier is used to specify opening type state, the second for after call. If the second is missing it defaults to the former. For return values only one may be specified. These two parameters may go in any place, in any order, they are ignored as they are not truely function parameters.

Parameter:
+   TypeStateParameterAttribute this
+   TypeStateParameterAttribute return

Unfortunately this has a conflict as return is also in ParameterStorageClass. This is why TypeStateParameterAttribute is always the first attribute, it allows differentiating between them.

Is expressions are expanded to check what the type state of a given variable is at that point in time.

IdentityExpression:
+    ShiftExpression is typestate Identifier
+    ShiftExpression is typestate < Identifier
+    ShiftExpression is typestate > Identifier
+    ShiftExpression is typestate <= Identifier
+    ShiftExpression is typestate >= Identifier

Lastly finate control over accepting type states that allow exact comparisons rather than >= only. The following example shows a function whose parameter obj1 can only accept arguments that are in non-null type state and no others, while the second parameter obj2 can accept any user defined type states but no others.

void func(Object obj1, Object obj2)
	@typestate(obj1 == nonnull ; obj2 > nonnull) {
}
AtAttribute:
+    @ typestate ( TypeStateAttributeComparisons )

+ TypeStateAttributeComparisons:
+    TypeStateAttributeComparisons ; TypeStateAttributeComparison
+    TypeStateAttributeComparison 

+ TypeStateAttributeComparison:
+    Identifier == Identifier
+    Identifier < Identifier
+    Identifier > Identifier
+    Identifier <= Identifier
+    Identifier >= Identifier

Type states change over time

As a function executes, each variable changes its type state. This is analyzed with the help of Data Flow Analysis (DFA).

The primary concern of this DFA is looking at loads, stores and any action that alters a type state of a variable by a predefined transformation.

When a variable declaration is seen the type state changes from unreachable to reachable. In the following example the variable var is accessed by two different function calls that are contained from different labels. However only one was ever initialized (and is therefore reachable).

import std.stdio : writeln;

void main(string[] args) {
    if (args.length == 1)
        goto Fail;
    
    Success:
        int var = 22;
        // var is typestate initialized
        writeln("success ", var);
        return;
    
    Fail: // var: [unreachable, initialized] = unreachable < initialized
        writeln("fail ", var); // Error: variable `var` is in type state unrecahable but is being read.
}

The above code is an example to match an existing bug wherein a goto can skip a variable declaration when more than one label is in play.

Input type states for parameters are inferred

The first expression defines what the input type state of variables that are not informed by their initializer or explicitly set.

This is a straight forward forward pass to find an expression that performs an operation on each function parameter or this pointer.

func(new Object);
func(null); // Error: parameter `obj` must have a minimum type state of non-null

void func(/* ?nonnull */ Object obj) { // The following statement gives obj its type state
	obj.method(); // obj: typestate >= non-null
}

Guaranteed minimum type states

All function parameters including this pointer and return values have a minimum type state of initialized.

Object acquire(bool condition) {
	return condition ? new Object : null;
}

Both the return value an function parameter have type states of initialized. You do not need to annotate any code.

This works for the this pointer as well, in the following two methods the this pointer for increment has a type state of non-null whereas doNothing remains at initialized, as it does not touch the this pointer.

class MyClass {
	int field;
	
	void increment() {
		this.field++;
	}

	void doNothing() {
	}
}

Of course if it was synchronized that would place a load to the monitor object and would result in non-null as well.

Output type states for parameters and return are inferred

Previously infering of input type states for function parameters was shown, in doing so it allowed a type state to be determined that is required. However this type state might be lower than possibly required but a type state could still be determine quickly and easily that requires minimal assumptions.

Unfortunately to infer output type states is a lot tricker and would require running the entire DFA during semantic three to enable modifying the type system. This is not acceptable.

So if we want to infer outputs type states we must limit ourselves to what we can do.

  1. It is unlikely that we will know what the type state is if its user-defined, but we will be able to determine if its a predefined one.
  2. As long as a variable is not taken by pointer to a function call (ref, out, &var), any type state that is predefined will not change.
  3. Loops and goto's are out. They alter code flow and require multiple contexts.

This would need to work in reverse iteration from each return, looking for statements that indicate a higher type state than initialized and the prevention of lower than the currently assumed one.

Input states are known for function parameters and the this pointer, this can resolve output states.

Object method(/* ?nonnull return */, /* ?nonnull,nonnull this */) {
	return this;
}

Simpler statements like if statements may be used, with each statement getting a set of values on the compiler's stack (fixed number of contexts) can be evaluated to find the least type state.

Object func(bool condition, /* ?initialized return */) {
	Object obj;

	if (condition) {
		obj = new Object;
		// obj: nonnull
	} else {
		// obj: initialized
	} // obj: [initialized, nonnull]

	// obj: initialized
	return obj;
}

Function calls can also work. If cyclic dependencies are not resolved through some other method, its always possible of resolving them by defaulting back to initialized.

Object func(/* ?somesuperduperstate return */) {
	return (new Object()).method();
}

Object accepter(/* ?somesuperduperstate return */) {
	Object obj = func();
	return obj;
}

But if any method is called, or function takes it in, that user defined type state must be devolved to non-null. The resulting type state is the highest provable predefined type state below the user defined type state.

Object func(/* ?somesuperduperstate return */) {
	return (new Object()).method();
}

void anotherFunction(Object obj) {
	obj.method();
}

Object accepter(/* ?nonnull return */) {
	Object obj = func();
	anotherFunction(obj);
	return obj;
}

As for ref and out, there is no way to know what the type state is beyond initialized unless it has been annotated explicitly or inferred.

void someFunction(ref Object obj) {
}

Object another() {
	Object obj = new Object;
	someFunction(obj);
	return obj;
}

It may appear that the returned object from another would be non-null but this wouldn't be the case. Inferring of input would result the parameter for someFunction to be initialized not non-null. Even though that parameter is unused.

On the other hand if it was to be used, then it will be non-null.

void someFunction(/* ?nonnull,nonnull */ ref Object obj) {
	obj.method();
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment