Authors: George Mitenkov, Nuno Lopes, Juneyoung Lee
Date: 15.10.2021
This May, together with Nuno Lopes and Juneyoung Lee, we made a proposal to add a new type to LLVM IR that represents raw memory. The goals of this type are to make compiler-introduced load type punning correct and to fix the associated bugs reported on the Alive2 dashboard.
While our proposal was accepted for Google Summer of Code 2021 (GSoC), our initial RFC on the mailing list was received somewhat negatively and sparked many questions about the current semantics around ptrtoint, inttoptr, pointer provenance and memory in LLVM. In particular, the following was mentioned:
- The proposed semantics for the byte type seemed obscure, and the explanation of the underlying load type-punning issue was unclear.
- Adding a new type to LLVM IR seemed like a significant amount of work, and many used this as a strong argument against introducing it.
- It was also pointed out that the load type-punning issue is only relevant to C, C++ or Rust, and changing Clang and LLVM IR just for these 3 frontends is not worth the effort.
Nevertheless, we proceeded with implementing a prototype. Our aim was to introduce the new type to LLVM, fix any optimization issues/bugs, and analyse the performance regression. By the end of GSoC, we think that our results are very promising (summary) and can be used as strong evidence in favour of a byte type being part of LLVM IR. To clarify the semantics of the byte type, and to dispel the common belief that the proposed changes break everything or are not needed at all, we decided to write a blog post describing the semantic inconsistencies that LLVM has around pointers and memory, as well as our solution. We hope that these posts will help make the LLVM community aware of the load type punning issue and of why a new type can help to solve it.
Before describing the inconsistencies around pointers, provenance and inttoptr/ptrtoint, let’s first establish whether the memory in LLVM IR is typed or not. According to LangRef:
LLVM IR does not associate types with memory. The result type of a load merely indicates the size and alignment of the memory from which to load, as well as the interpretation of the value. The first operand type of a store similarly only indicates the size and alignment of the store. Because the memory is untyped, any load/store of any type is equivalent to converting the type to an integer (i.e. a bit sequence of some size and alignment) and then loading the integer from memory or storing it to memory. In the original RFC thread, Joshua Cranmer shared 2 examples of equivalent functions under this semantics:
define void @foo(i8** %mem, i8* %foo) {
  store i8* %foo, i8** %mem
  ret void
}

define void @bar(i8** %mem, i8* %foo) {
  ; Assuming pointers are 64-bit
  %asint = ptrtoint i8* %foo to i64
  %mem64 = bitcast i8** %mem to i64*
  store i64 %asint, i64* %mem64
  ret void
}
Moreover, he also pointed out why this semantics is not sound. In LLVM, integers are just a collection of bits, and any iN value can be replaced with any other iN value given they have the same bit pattern. At the same time, two pointers may have the same numerical value (when cast to integers) but cannot be replaced with one another in general. This is because pointers also carry provenance (i.e., information about which objects the pointer can refer to). If the provenance of pointers is different, replacing one pointer with another and then dereferencing it can lead to undefined behaviour.
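To make this concrete, here is a minimal sketch (with hypothetical globals @P and @Q, not taken from the RFC) of two pointers that may have the same numerical value but must not be substituted for one another:

@Q = global [4 x i32] zeroinitializer
@P = global [4 x i32] zeroinitializer

define i32 @provenance_example() {
  ; one-past-the-end pointer of @Q and a pointer to the first element of @P
  %q.end = getelementptr inbounds [4 x i32], [4 x i32]* @Q, i64 0, i64 4
  %p     = getelementptr inbounds [4 x i32], [4 x i32]* @P, i64 0, i64 0
  ; Even if @P happens to be placed directly after @Q, so that ptrtoint of
  ; %q.end and %p yields the same i64, replacing %p with %q.end is unsound:
  ; %q.end carries @Q's provenance, and dereferencing it inside @P's storage
  ; is undefined behaviour.
  %v = load i32, i32* %p
  ret i32 %v
}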
Under the untyped memory model, we need to accept that every load/store has an implicit ptrtoint/inttoptr attached to it. Hence, we lose provenance information every time a pointer is stored to memory, as if we did an inttoptr(ptrtoint x). Thus, the following two functions (again thanks to Joshua Cranmer for the examples) are not semantically equivalent under this model:
define i8* @foo(i8* %in) {
  ret i8* %in
}

define i8* @foo(i8* %in) {
  %mem = alloca i8*
  store i8* %in, i8** %mem
  %out = load i8*, i8** %mem
  ret i8* %out
}
For equivalence, we must ensure that the load of the pointer recovers the provenance data stored by the store of the pointer. But then the integer and the pointer loads must behave differently, which is not true if the memory is untyped.
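A small sketch of the two ways of reloading the stored pointer, which an untyped memory model would have to treat as equivalent (%mem is the alloca from the function above; %mem.i64 is a hypothetical view of the same slot as i64*):

%as_ptr = load i8*, i8** %mem            ; pointer load: should recover provenance
%as_int = load i64, i64* %mem.i64        ; integer load of the same bits...
%again  = inttoptr i64 %as_int to i8*    ; ...followed by inttoptr: provenance is lost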
Having untyped memory also affects the decision whether a pointer in a function escapes or not. Recall that we say that a pointer escapes if it can be accessed outside of the current function. If the pointer is stored to a global, then it is evident that it can now be accessed globally, and therefore we consider it as escaped. However, pointers that live on the stack of the current function and are not stored to global memory may escape too. Consider the following function:
%q = alloca i32*
store i32 0, i32* %p
store i32* %p, i32** %q
%q2 = bitcast i32** %q to i64*
%p_as_int = load i64, i64* %q2
%cmp = icmp eq i64 %p_as_int, 66      ; 66 == 0x42
br i1 %cmp, label %true, label %false

true:
  call void @foo(i64 66)
  br label %false

false:
  %w = load i32, i32* %p
The catch here is that the pointer %p is implicitly cast to an integer and is passed to some other function that may change the value stored at %p without alias analysis noticing. To avoid this problem, we could consider all integer loads as potential pointer carriers that have an implicit ptrtoint instruction attached to them. This way alias analysis is aware of pointer escaping, but it has to be conservative on every integer load, which is dreadful in terms of performance and invalidates certain LLVM optimizations that do not consider loads as escape sites. An alternative is to say that all pointer stores escape, which again has severe performance consequences and again does not align with all LLVM optimizations.
To conclude our small introduction, it is evident that the current untyped memory model is not enforced consistently. What is more disastrous is the ambiguity of integers carrying pointers. In LLVM, certain optimizations assume that integer load/stores have untyped semantics, whereas other optimizations (and frontends) take provenance information into account. But how bad is this inconsistency?
Part 2, where we question the possibility of implementing a memcpy in LLVM IR, as well as describe miscompilations due to the lowering of unsigned char and compiler-introduced type punning
As we have already established, integers do not carry provenance in LLVM but pointers do. Now suppose we have a memcpy implementation (in LLVM terms, a for loop that copies data per byte using i8). If we are to copy a 64-bit pointer as eight 8-bit integers, then what provenance does the result of the copy have? (A sketch of such a per-byte copy in IR is shown after the list below.) To keep the provenance data, one may say that we can allow integers to carry provenance information. But this contradicts the fact that an integer is just a bit pattern. Nicolai Hähnle describes this in greater detail in his blog post, and evaluates how this inconsistency can be fixed. His conclusion, in brief, is that there is a limitation in the current expressiveness of LLVM IR, which has surprising consequences for memcpy (and similar memory-related operations). To deal with the limitation, we have to choose between:
- Both pointers and integers carrying provenance
- Pointers having provenance but not integers
- Nothing having provenance
Unfortunately, there is no free lunch, and any option we pick invalidates some LLVM optimizations.
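Here is a minimal sketch of such a per-byte copy (hypothetical function name, copying the 8 bytes of a 64-bit pointer as i8 values):

define void @byte_copy(i8* %dst, i8* %src) {
entry:
  br label %loop

loop:
  %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
  %s = getelementptr inbounds i8, i8* %src, i64 %i
  %d = getelementptr inbounds i8, i8* %dst, i64 %i
  %byte = load i8, i8* %s      ; if %src holds a pointer, what does this i8 carry?
  store i8 %byte, i8* %d
  %i.next = add nuw nsw i64 %i, 1
  %done = icmp eq i64 %i.next, 8
  br i1 %done, label %exit, label %loop

exit:
  ret void
}

If the 8 bytes at %src are a stored pointer, each loaded i8 is "just a bit pattern" under the current rules, so the value reassembled at %dst has no provenance to attach to.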
So far we only talked about semantic inconsistencies in LLVM, but is this that important for compiling code? The obvious answer is yes. For example, there are miscompilations because integers that carry pointers can be treated as pure integers by certain optimizations. Let us consider a simple optimization:
(p == q) ? p : q -> q
While it is clearly correct for “pure” integers, it is not correct for integers carrying pointers. Suppose we have two arrays, P and Q. If P immediately follows Q in memory, then q = &Q[4] and p = &P[0] have the same address (recall that taking the address of the one-past-the-end element of an array is legal in C/C++, unlike dereferencing it; C99 §6.5.6, p8). Hence, p and q have the same integer representation, but their provenances are different, which makes the optimization incorrect.
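In IR terms, the fold looks roughly like this (a sketch with hypothetical value names, written over pointers for illustration):

%cmp = icmp eq i8* %p, %q
%r   = select i1 %cmp, i8* %p, i8* %q
; the simplification rewrites every use of %r to use %q directly; for integers
; this is fine, but %p and %q may compare equal while carrying different
; provenance, so for pointers they are not interchangeable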
Can we exploit this optimization? Obviously the answer is yes (otherwise there would have been no blog post!), and compiler-introduced type punning can help us with that. If we copy a pointer byte per byte (e.g. with memcpy), the compiler can optimize this into a single load/store pair:
%src8 = bitcast i8** %src to i8*
%dst8 = bitcast i8** %dst to i8*
call void @llvm.memcpy.p0i8.p0i8.i32(i8* %dst8, i8* %src8, i32 8, i1 false)
=>
%src64 = bitcast i8** %src to i64*
%dst64 = bitcast i8** %dst to i64*
%addr = load i64, i64* %src64, align 1
store i64 %addr, i64* %dst64, align 1
The underlying issue is that while both C and C++ define unsigned char (or std::byte) as handles to the raw bytes of objects, LLVM IR does not have a similar type and uses integers for that. This means that compiler-introduced type punning treats raw data copied with a memcpy as integers (e.g. i64 if pointers are 64 bits wide). Since LLVM’s alias analyses do not take integer operations into account, this type punning can lead to a pointer escaping without the compiler realising it, and hence to miscompilation.
Note that the issue is not limited to memcpy: it also affects memmove, memcmp and C++ functions that may be lowered to calls to memcpy or memmove, and subsequently to integer load/store pairs, such as [uninitialized_copy in C++](https://llvm.godbolt.org/z/nGr6K4cnP).
Now that we have roughly outlined how we can get the desired miscompilation, let us describe an end-to-end example from bug report 37469:
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// If we call store_10_to_p(p, q) and p and q have the same address,
// c1 == c2 in the loop, so arr ends up equal to p_as_arr.
// Hence r = p, and *p should be 10. However, when compiled with -O3, *p is 1.
void store_10_to_p(int *p, int *q) {
  unsigned char p_as_arr[8];
  unsigned char q_as_arr[8];
  unsigned char arr[8];
  memcpy(p_as_arr, &p, sizeof(p));
  memcpy(q_as_arr, &q, sizeof(q));

  // Store p to arr.
  for (int i = 0; i < sizeof(q); ++i) {
    int c1 = p_as_arr[i];
    int c2 = q_as_arr[i];
    // Note that c1 == c2 is a comparison between integers (not pointers).
    if (c1 == c2) arr[i] = p_as_arr[i]; else arr[i] = q_as_arr[i];
  }

  // Now arr is equivalent to p_as_arr, which is p.
  int *r;
  memcpy(&r, arr, sizeof(r));
  // Now r is p.
  *p = 1;
  *r = 10;
}
int main() {
int P[4], Q[4];
printf("%p %p\n", P, &Q[4]);
// If P immediately follows Q, store 10 to P[0].
if ((uintptr_t)P == (uintptr_t)&Q[4]) {
store_10_to_p(P, &Q[4]);
printf("%d\n", P[0]);
}
return 0;
}
In this example, we have two arrays, P and Q. If P immediately follows Q in memory, we store 1 and then 10 to P[0] and print the contents of P[0]. Unsurprisingly, compiling with -O0 produces the correct answer of 10. However, compiling the same code with -O3 produces 1! Let’s analyse the example in more detail.

P immediately follows Q, and hence the integer representations of the pointers p = &P[0] and q = &Q[4] are the same. We first transform the 64-bit integer representations of the pointers into arrays of chars by calling memcpy. Then, we proceed by storing the pointer p into the array arr by copying every byte of p. Note that the condition (c1 == c2) is always true, since both p and q have the same integer representation even though they point to different objects. Then, we copy the contents of arr (which is the pointer p) to a new pointer r. Finally, we store 1 to *p, and 10 to *r (where r is p).
So why does this miscompilation happen? There are four main contributing factors (read: LLVM optimizations), some of which we have already covered:
- InstCombine has an optimization that, given two integers x and y, simplifies the expression (x == y) ? x : y into y. Therefore, in this case, the loop body becomes arr[i] = q_as_arr[i];.
- The loop idiom recognizer sees the copy loop and replaces it with memcpy(arr, q_as_arr, 8).
- InstCombine replaces the memcpy calls with i64 load/store pairs.
- InstCombine performs store forwarding of q_as_arr to r (as those are just integers now!). In the end, the optimizer thinks that r is q, so storing 10 to r does not affect p.
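A rough conceptual sketch (hypothetical value names, not the exact pass output) of what the relevant part of the function boils down to after these rewrites, assuming 64-bit pointers:

%q.bits = load i64, i64* %q_as_arr.i64, align 1  ; q's bytes read back as an integer
store i64 %q.bits, i64* %arr.i64, align 1        ; arr now holds "the integer q"
; the later load of r from arr is store-forwarded from %q.bits, so the
; optimizer concludes that r is (based on) q and that *r cannot alias *p
store i32 1, i32* %p
store i32 10, i32* %r                            ; believed not to alias %p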
To emphasize once again, LLVM IR does not have a universal data holder type like C’s unsigned char or C++’s std::byte that can be used to access/copy the raw data in memory. Instead, integers are used, which makes it possible for them to carry pointers. Therefore, we conclude that we have to either:
- Change the semantics and lower unsigned char and std::byte to something other than i8, so that optimizations are aware that raw data (potentially a pointer) is loaded/stored.
- Keep the semantics (i.e., the current lowering to integers) but change (or disable) the unsound optimizations.
Abuse of integers as universal data holders also doesn’t fit the semantics of poison. Suppose that we have a program that copies a struct:
struct S {
  char s;
  // 3-byte padding
  int i;
};

void struct_cpy(struct S *p, struct S *q) {
  memcpy(p, q, sizeof(struct S));
}
The semantics of memcpy specify that the memory is copied as-is in bytes, including the padding bits if necessary. However, if we widen the memcpy of a struct to a big integer load (i64 in this example), the poisonous padding contaminates the copied value and produces poison.
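A sketch of the widened copy (assuming the struct above occupies 8 bytes, with bytes 1-3 being the padding):

%src64 = bitcast %struct.S* %q to i64*
%dst64 = bitcast %struct.S* %p to i64*
%v = load i64, i64* %src64, align 4   ; the padding bytes are poison
store i64 %v, i64* %dst64, align 4    ; under value-based semantics the whole i64,
                                      ; and hence the copied struct, becomes poison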
So far we have only presented C examples; however, this issue is not limited to that language. Ralf Jung came up with a series of examples in Rust which show that optimizations like dead store elimination, integer "substitution" (replacing a by b after an a == b check), and provenance-based alias analysis are in conflict with each other: the current LLVM semantics are inconsistent and can lead to miscompilations. He also stressed that this issue is not just a small bug somewhere; it is a case of the implicit assumptions made by different optimization passes or instructions being mutually incompatible.
We hope that at this point the reader is aware of how serious the issue is and that LLVM optimizations (especially when combined together) are not sound. Surely we need to try to solve this issue, and the fact that the usual counter-argument is “but it does not occur that often” should frighten us!
The first possible solution is to keep memory untyped, to get consistent semantics by treating integers as potentially carrying pointers and to give up unsound optimizations (or fix them). This approach has its own benefits. First, it does not require much engineering effort - at least for disabling optimizations. Secondly, all optimizations would accept the same semantics for integers, and therefore would still be sound when combined together.
However, disabling optimizations can cause significant performance regressions. The first candidate to be given up would be GVN, which is not sound if integers are treated as pointers. As many readers and developers will agree, this optimization is too important for performance to simply disable it.
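As an illustration, here is a sketch (hypothetical value names) of the kind of equality-based replacement GVN performs, which stops being valid once integers may carry provenance:

%eq = icmp eq i64 %i, %j
br i1 %eq, label %then, label %exit

then:
  ; GVN would like to replace %j with %i in this block; if %i and %j are the
  ; integer representations of pointers into different objects, they can be
  ; numerically equal yet have different provenance, so the replacement
  ; becomes unsound under "integers carry provenance" semantics
  %ptr = inttoptr i64 %j to i32*
  store i32 0, i32* %ptr
  br label %exit

exit:
  ret void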
Pros:
- Low engineering effort.
- LLVM IR type system and language constructs are untouched.

Cons:
- More pressure on developers, who have to ensure that the optimizations they write adhere to the semantics of integers escaping pointers.
- Performance regressions due to disabled optimizations like GVN and more conservative AA.
Having examined the first solution, we may ask why we need to disable optimizations and treat integers as pointers if we can simply make alias analysis better. This is a great solution indeed: we keep the memory untyped, we do not give up performance, we do not have to complicate the semantics, and on top of that alias analysis works better than ever!
The single problem we are left with is HOW to make alias analysis better. Let’s try to find a way:
- Easy, just make alias analysis more conservative? Not really. This leads to performance degradation, as we would have more “potential escape” sites in the code and fewer optimizations could kick in; it can only be considered “better” in the sense that it would be sound.
- To avoid performance regressions, be conservative only when needed? This is not a complete solution. Yes, we could explicitly mark potential type punning situations so that alias analysis (and optimizers) can see them and be conservative about them. However, it is not clear what those situations would be, and right now we would need to be conservative in every function that loads an argument pointer.
- Then make alias analysis better without being conservative? That is a great idea, but no. At the moment, there is no clear answer for how to do that (any suggestions are welcome).
Pros:
- Only alias analysis code needs changes.
- LLVM IR type system and language constructs are untouched.
- LLVM optimizations are untouched (although it is debatable whether they would kick in if the analysis is too conservative).
Cons:
- High engineering cost for designing a better alias analysis.
- A better alias analysis would not necessarily catch all corner cases.
So far we have seen that disabling optimizations is undesirable, and developing a better alias analysis is not realistic. So let’s consider an alternative (again, for untyped memory): make integer-to-pointer (and back to integer) conversions explicit in the IR. This way, it is always known that a pointer might escape only through inttoptr/ptrtoint instructions (as mentioned by John McCall on the mailing list).
However, there are cases when the type of the loaded value is not known explicitly to Clang or to LLVM, e.g. when an unsigned char is loaded from memory in C. Since we consider inttoptr/ptrtoint as the only places where pointers may escape, type punning cases are not caught, which breaks alias analysis once again.
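A small sketch (hypothetical names) of the difference: an explicit cast marks the escape, while a byte-wise copy of the same pointer contains no cast at all:

; explicit form: the ptrtoint makes the escape visible to alias analysis
%int = ptrtoint i8* %p to i64
store i64 %int, i64* %slot

; char-by-char copy of the same pointer (e.g. from an unsigned char loop):
; only i8 loads/stores appear, no inttoptr/ptrtoint, so the escape is invisible
%byte0 = load i8, i8* %p.bytes
store i8 %byte0, i8* %dst.bytes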
Pros:
- Not a high engineering effort.
- LLVM IR type system and language constructs are untouched.

Cons:
- Provenance information is lost when we have inttoptr/ptrtoint instructions.
- inttoptr/ptrtoint escape pointers, and using them explicitly can be too conservative (i.e. lead to performance degradation).
- No guarantee that type punning cases are caught.
Having examined the first 3 alternatives, we can see that we want something that:
- does not make us disable optimizations like GVN;
- has a realistic implementation (unlike a better alias analysis);
- is stronger than explicit ptrtoint/inttoptr;
- adheres to the same optimization restrictions as pointers and keeps provenance.
In the initial RFC thread, Madhur Amilkanthwar proposed a solution that seems to match all the necessary conditions we have mentioned. In his proposal, he suggested annotating types with attributes, metadata, or flags which optimizations can use to do a better job. These tags would carry the semantic meaning of the type. For example, lowering memcpy to an integer load/store pair would also add a tag marking these instructions as potential escape sites (since we may be loading/storing a pointer).
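For illustration only, such a tag might look like metadata on the widened load/store (the !raw_data name here is entirely hypothetical, not part of the proposal or of LLVM):

; hypothetical: the i64 pair produced from a memcpy is tagged so that alias
; analysis treats it as a potential pointer escape
%v = load i64, i64* %src64, align 1, !raw_data !0
store i64 %v, i64* %dst64, align 1, !raw_data !0

!0 = !{}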
While this approach is similar to the one we propose later in this post, there is a very important drawback: LLVM optimizers work with the assumption that attributes can be discarded if the optimizer does not know how to handle them. However, under this proposal discarding attributes becomes illegal.
Pros:
- Satisfies the necessary conditions to solve our problem.

Cons:
- Optimizations can drop attributes.
- IR becomes less readable.
- High engineering effort to enforce that attributes are preserved in every transformation and used by analyses.
Having examined the alternatives, we propose to consider LLVM memory as typed, and to add a new type (called a byte type, or a “raw data” type) that represents raw memory in LLVM IR. Let us emphasize the motivation for this type once again:
- LLVM IR needs a universal data holder type. Frontends load data from memory and do not necessarily know the type being loaded. When a char is loaded in C, it can be part of a pointer or of an integer. As we have shown previously, representing it as an integer leads to miscompilations and does not fit the pointer provenance model. Moreover, the byte type enables us to express memcpy in IR as a copy of raw data.
- Optimizations for the universal holder type need to be correct for any LLVM IR type. Since the universal type can hold anything, optimizations that apply to this type must be correct for any data it stores: integers, pointers, etc. Currently, not all of them are correct for such a universal type. An example is GVN, which is not correct for pointers in general, because two pointers comparing equal does not mean that they have the same provenance. As LLVM’s alias analysis uses provenance information, this information must be preserved.
This leads us to our proposal - a new type that represents raw data. While it requires some work, it also allows us to keep all integer optimizations as aggressive as we want, and only throttle down when encountering the byte type. Another benefit is that loading/storing raw data from memory does not escape any pointer, while with the integers as universal type you need to consider all stored pointers as escaped when loading an integer. Introducing the byte type allows us to fix all the integer/pointer cast bugs in LLVM IR.
Pros:
- LLVM IR becomes more expressive.
- The readability of the IR does not change much.
- Backends and the optimizer can differentiate between integers and bytes (integers or pointers) to make integer optimizations more aggressive. This can be significant for some platforms (e.g., CHERI).
- Bytes are only used when frontends need them (e.g., C/C++ code generation for unsigned char or std::byte).
- Alias analysis can be less conservative.
Cons:
- Reasonable engineering effort (but we created a prototype during GSoC 2021).
- Some optimizations need to be fixed to learn about the new type.
- IR gets a new type and a new instruction.
- Possible increase in compilation time.
As we have said, the byte type is a type that represents raw data. We propose to have the same bit widths as used for integers, namely b8, b32, b64, b128, etc. Moreover, we propose a new instruction to cast a byte type to an integer or a pointer type, which we call bytecast. This extra cast is needed because a byte is a universal type holder that can carry any type, including an integer (where the bytecast is simply a no-op or an inttoptr) or a pointer (where the bytecast is a bitcast or a ptrtoint). Moreover, bytes preserve provenance. (A concrete example of how this fixes the earlier compiler-introduced memcpy lowering follows the semantics below.)
We define the semantics as follows.
- The byte type is allowed to be allocated, stored and loaded: alloca, load and store accept the byte type. Examples:

    %p = alloca bN          ; allocate N bits in memory; %p has type bN*
    %w = load bN, bN* %q    ; load N bits of raw memory
    store bN %w, bN* %p     ; store the N bits of %w to %p
- For type conversions, let’s denote the byte type carrying a type T as byte (T), a pointer as ptr, and an integer as int. We only allow conversions between types of the same bit width. We define the semantics for type conversions as follows: converting any “allowed” type to the byte type is a no-op, while converting from a byte type to any “allowed” type may perform a type conversion. In our prototype, the “allowed” types are integers and pointers only (or vectors of these types). To convert any other type (e.g. float or double) to byte, the source type first needs to be cast to an integer.
  - int -> byte, ptr -> byte: Simple conversions that reinterpret an integer or a pointer as a sequence of bits are done with a bitcast, and are no-ops:

      %b8 = bitcast i8 %i8 to b8
      %vb = bitcast <4 x iN> %vi to <4 x bN>
      %bptr = bitcast i8* %ptr to b64   ; assuming pointers are 64-bit wide

  The remaining conversions are done with the new bytecast instruction.
  - byte (int) -> int: If the byte carries an integer, the bytecast is a no-op:

      %i = bytecast bN %b to iN         ; %i is the integer whose bits are stored in %b

  - byte (int) -> ptr: If the byte carries an integer, the bytecast has the same semantics as inttoptr:

      %ptr = bytecast bN %b to i32*     ; equivalent to an inttoptr of the carried integer

  - byte (ptr) -> int: If the byte carries a pointer, the bytecast has the same semantics as a ptrtoint:

      %i = bytecast b64 %b to i64       ; equivalent to a ptrtoint of the carried pointer

  - byte (ptr) -> ptr: If the byte carries a pointer, the bytecast is a no-op:

      %ptr = bytecast b64 %b to i8*     ; the carried pointer, unchanged
- No arithmetic or bitwise operations are allowed on the byte type. The main reasons for this are:
  - We defined the byte type as raw data, and it is not clear what arithmetic/bitwise operations would mean when applied to the raw bits in memory and how they would affect the provenance information.
  - Disallowing arithmetic/bitwise operations is aligned with how std::byte is defined in C++ (which roughly stems from the previous point).
  One might say that if chars are lowered to b8, and no arithmetic is allowed on bytes, performance would degrade when the source code uses char arithmetic. However, that is not true. Currently, at -O0, Clang promotes small bit-width integers to 32 bits to do arithmetic and then truncates the result back. Consider the following C code:

    unsigned char sum(unsigned char a, unsigned char b) { return a + b; }

  The current -O0 and -O3 lowerings produce the following IR:

    ; -O0 version
    define i8 @sum(i8 %a, i8 %b) {
      %1 = alloca i8
      %2 = alloca i8
      store i8 %a, i8* %1
      store i8 %b, i8* %2
      %3 = load i8, i8* %1
      %4 = zext i8 %3 to i32
      %5 = load i8, i8* %2
      %6 = zext i8 %5 to i32
      %7 = add nsw i32 %4, %6
      %8 = trunc i32 %7 to i8
      ret i8 %8
    }

    ; -O3 version
    define i8 @sum(i8 %a, i8 %b) {
      %1 = add i8 %a, %b
      ret i8 %1
    }
  With the byte type, we keep the same promotion pattern but with additional bytecast/bitcast operations. The fact that these operations apply to types of the same bit width is very important. It is common in the LLVM codebase to check whether an instruction is a zext/sext/trunc and then assume that the destination type is an integer. If we allowed zext/sext/trunc to produce bytes, all existing pattern matches would fail and would require fixing. The char addition example therefore becomes:

    ; -O0 version
    define b8 @sum(b8 %a, b8 %b) {
      %1 = alloca b8
      %2 = alloca b8
      store b8 %a, b8* %1
      store b8 %b, b8* %2
      %3 = load b8, b8* %1
      %cast1 = bytecast b8 %3 to i8
      %4 = zext i8 %cast1 to i32
      %5 = load b8, b8* %2
      %cast2 = bytecast b8 %5 to i8
      %6 = zext i8 %cast2 to i32
      %7 = add nsw i32 %4, %6
      %8 = trunc i32 %7 to i8
      %cast3 = bitcast i8 %8 to b8
      ret b8 %cast3
    }

    ; -O3 version
    define b8 @sum(b8 %a, b8 %b) {
      %cast1 = bytecast b8 %a to i8
      %cast2 = bytecast b8 %b to i8
      %1 = add i8 %cast1, %cast2
      %cast3 = bitcast i8 %1 to b8
      ret b8 %cast3
    }
  Since the semantics of zext/sext/trunc are preserved, and arithmetic is done over integer types, the optimizer picks up the pattern and creates an 8-bit-wide addition. The surrounding casts simply reinterpret bits, and if %a and %b carry integers, the casts are no-ops and do not affect performance.
- We allow comparisons, as we may potentially want to compare the ordering of memory instances, check for null, etc. Comparison is also needed since char values are commonly compared. We define the following semantics for byte type comparison:
  - If the two bytes carry the same type, the result of the comparison is the result of comparing the values of the carried type.
  - If the two bytes carry different types, we cast the non-integral carried values to integers and return the result of the integer comparison.
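As promised above, here is a sketch of the compiler-introduced type punning example from earlier, rewritten with the proposed type (assuming 64-bit pointers): the widened copy now uses b64, so the copied value may carry a pointer and provenance is not lost.

%src64 = bitcast i8** %src to b64*
%dst64 = bitcast i8** %dst to b64*
%addr = load b64, b64* %src64, align 1
store b64 %addr, b64* %dst64, align 1

Because the loaded b64 is known to be raw data that may hold a pointer, alias analysis can treat this pair as a potential pointer copy without having to treat every integer load in the program as an escape site.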
We developed a prototype version of LLVM and Clang during this year’s GSoC. We changed about 2000 lines in LLVM and 100 in Clang (tests excluded). In particular:
- Both the byte type and the bytecast instruction were introduced to LLVM IR. Currently, SelectionDAG uses a very basic lowering for them: a byte is mapped to an integer, and bytecast is a no-op. Moreover, code generation in Clang for C was adapted to produce bytes for unsigned char/char types and to add casts where necessary.
- The wrong type punning optimizations of memcmp, memcpy, memmove and memset were fixed. Moreover, all optimizations that were originally incompatible with the byte type (e.g. SROA, LoopVectorize, LoopIdiom, SLPVectorizer, GVN, etc.) were fixed to make sure that C programs can be compiled correctly at any optimization level.
To test and evaluate our prototype, we used the ARM platform and the SPECrate 2017 benchmark suite. In particular, we measured compile and execution times, as well as the size of the binary files. We compared the performance of our prototype against vanilla LLVM, running the experiments multiple times and taking the median value. Moreover, to quantify whether there was a regression or a speedup, we used a ±1% error margin.
| Program | Compile-time speedup, % | Execution-time speedup, % | Binary size increase, % |
|---|---|---|---|
| 500.perlbench_r | 0.38 | -0.88 | -0.98 |
| 502.gcc_r | 0.37 | 0.02 | -2.23 |
| 505.mcf_r | -5.64 | -0.17 | -0.19 |
| 520.omnetpp_r | -0.08 | -0.46 | -1.01 |
| 523.xalancbmk_r | 0.10 | -4.83 | -0.17 |
| 525.x264_r | 0.22 | -0.40 | -0.01 |
| 531.deepsjeng_r | 0.56 | 0.26 | -0.01 |
| 541.leela_r | 0.02 | -0.01 | -0.01 |
| 557.xz_r | 0.19 | -0.91 | 1.84 |
We observed that, in general, there was no compile-time slowdown on the benchmarks; only 1 out of 9 programs showed a significant slowdown in compile time. While the reasons for this slowdown are unknown, we can conclude that adding a new type did not affect compile time considerably.
The size of binaries did not vary significantly, usually staying within ±1% of the original LLVM trunk value. Again, only 1 program out of 9 showed a nearly 2% increase in the size of the object file for unknown reasons.
Most importantly, the change in execution time of the programs mostly stayed within ±1%, and again only 1 out of 9 programs showed a 5% slowdown. We have tracked the regression down to the vectorizer, and are currently looking for solutions.
Currently, there are a number of things that are missing in our prototype and that we hope to address in the near future.
- Clang has numerous test failures due to the new code generation for char and unsigned char. These include simple failures, like tests checking for an i8 instead of a b8, but also much more complicated cases which are hard to fix automatically:
  - Some target-specific intrinsics also use char in their type signature. We need to ensure type compatibility.
  - Currently all tests involving i8 strings are wrong. Addressing the first point would help to avoid that issue, adding byte strings only where needed.
  - Some tests require the insertion of bytecast/bitcast pairs: this makes rewriting FileCheck directives harder.
  We need to come up with a neat way of fixing all tests with the least number of conflicts (some inspiration can be taken from the opaque pointers patches).
- There are still pending performance issues, including the execution time of xalancbmk, the compile time of mcf, and the object file size of xz. We plan to investigate the causes of the regressions in these programs, aiming to bring them down to 0%.
- Introduce new optimizations for bytes and bytecasts to improve performance and make more optimizations byte-aware.
Having started by questioning whether memory in LLVM is typed or untyped, we investigated subtle issues of LLVM and its semantics. We stressed that in C unsigned char is a universal type - otherwise it would not be possible to implement memcpy in C. We then discussed the main problems and challenges associated with the current lowering of unsigned char and char and with the abuse of integers as universal types in LLVM IR, and shared worrying miscompilation examples.
Having shown that the current LLVM IR is not expressive enough to solve type punning problems, we described a possible set of solutions. In particular, we argued that the best (in terms of effort, expressiveness and semantics) would be to consider memory in LLVM to be typed, and to introduce a new type in LLVM IR that represents raw memory data. We highlighted the semantics of the new type and the necessary changes in LLVM IR semantics.
We presented a prototype we have implemented during the Google Summer of Code 2021. Our evaluation showed that our solution does not have high engineering costs, has comparable performance to vanilla LLVM, and solves the type punning problems.
We hope that now more LLVM developers understand our proposal and why LLVM needs to have a byte type! For any questions, comments and suggestions feel free to ping us (George, Nuno, Juneyoung) on the mailing list.