Levi Broderick GrabYourPitchforks

This article has moved to the official .NET Docs site.

See https://docs.microsoft.com/dotnet/standard/base-types/character-encoding-introduction.

Utf8String design overview

Audience and scenarios

Utf8String and related concepts are meant for modern internet-facing applications that need to speak "the language of the web" (or I/O in general). Currently, applications spend time transcoding into formats that aren't particularly useful, which wastes CPU cycles and memory.

A naive way to accomplish this would be to represent UTF-8 data as byte[] / Span<byte>, but this leads to a usability pit of failure. Developers would then become dependent on situational awareness and code hygiene to be able to know whether a particular byte[] instance is meant to represent binary data or UTF-8 textual data, leading to situations where it's very easy to write code like byte[] imageData = ...; imageData.ToUpperInvariant();. This defeats the purpose of using a typed language.
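As a rough sketch of the type-safety argument above, the following uses a hypothetical Utf8Text wrapper. The type name and its members are assumptions made for illustration only; they are not the proposed Utf8String API. The point is simply that text operations hang off the text type and are not available on a raw byte[].

```csharp
using System;
using System.Text;

// Hypothetical illustration only: Utf8Text is NOT the proposed Utf8String type.
// A distinct type keeps text operations away from arbitrary binary data.
public readonly struct Utf8Text
{
    private readonly byte[] _utf8Bytes;

    private Utf8Text(byte[] utf8Bytes) => _utf8Bytes = utf8Bytes;

    // Transcodes a UTF-16 string into UTF-8 once, at the boundary.
    public static Utf8Text FromString(string value) =>
        new Utf8Text(Encoding.UTF8.GetBytes(value));

    // A text-only operation. For brevity this sketch round-trips through
    // UTF-16, which a real UTF-8 string type would want to avoid.
    public Utf8Text ToUpperInvariant() =>
        FromString(ToString().ToUpperInvariant());

    // Exposes the underlying UTF-8 code units for I/O.
    public ReadOnlySpan<byte> AsBytes() => _utf8Bytes;

    // default(Utf8Text) has a null array, so treat it as empty text.
    public override string ToString() =>
        Encoding.UTF8.GetString(_utf8Bytes ?? Array.Empty<byte>());
}
```

With a type like this, Utf8Text.FromString("héllo").ToUpperInvariant() compiles, while calling ToUpperInvariant() on a byte[] holding image data does not.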

We want to expose enough functionality to make the Utf8String type usable and desirable by our developer audience, but it's not intended to serve as a

Motivations and driving principles behind the `Utf8Char` proposal

Utf8Char is analogous to Char: they represent a single UTF-8 code unit and a single UTF-16 code unit, respectively. They are distinct from the integral types Byte and UInt16 in that sequences of the UTF-* code unit types are meant to represent textual data, while sequences of the integral types are meant to represent binary data.
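To make the code unit distinction concrete, here is a minimal sketch that counts the code units the same text needs in each encoding. Utf8Char is only a proposal, so this example counts plain bytes as a stand-in for UTF-8 code units; the CodeUnitCounts class name is an assumption for illustration.

```csharp
using System;
using System.Text;

public static class CodeUnitCounts
{
    // Number of UTF-16 code units (Char values) in the string.
    public static int Utf16CodeUnits(string text) => text.Length;

    // Number of UTF-8 code units (bytes) needed to encode the same text.
    public static int Utf8CodeUnits(string text) => Encoding.UTF8.GetByteCount(text);
}
```

For example, "é" (U+00E9) is one UTF-16 code unit but two UTF-8 code units, and an emoji such as U+1F600 is two UTF-16 code units (a surrogate pair) but four UTF-8 code units.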

Drawing this distinction is important. With UTF-16 data (String, Char[]), this distinction historically hasn't been a source of confusion. Developers are generally aware that, aside from RPC, most I/O involves some kind of transcoding mechanism. Binary data doesn't come in from disk or the network in a format that can be trivially projected as a textual string; it must go through validation, recombining, and substitution. Similarly, when writing a string to disk or the network, a trivial projection is again impossible. The transcoding step must run in reverse to get the text data int
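The validation and substitution steps mentioned above can be observed with the existing System.Text.Encoding APIs. This is a small sketch, not part of the proposal; the TranscodingDemo class and its method names are assumptions for illustration.

```csharp
using System;
using System.Text;

public static class TranscodingDemo
{
    // Lenient decoding: each invalid byte sequence is replaced with U+FFFD.
    public static string DecodeWithSubstitution(byte[] data) =>
        Encoding.UTF8.GetString(data);

    // Strict decoding: invalid input is rejected instead of substituted.
    public static bool TryDecodeStrict(byte[] data, out string text)
    {
        var strict = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            text = strict.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            text = string.Empty;
            return false;
        }
    }
}
```

Given the bytes 0x41 0xFF 0x42, the lenient path yields "A\uFFFDB" because 0xFF can never appear in well-formed UTF-8, while the strict path reports failure. Either way, the incoming bytes are not simply reinterpreted as text.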

This tests the performance of MemoryExtensions.ToUpperInvariant(this ReadOnlySpan<char>, Span<char>), String.GetHashCode(), and String.GetHashCode(StringComparison.OrdinalIgnoreCase).

In the table below:

baseline coreclr = 3.0.0-preview1-26808-05
local build (6) = local build from private dev Utf8String branch, 6th rev.
local build (7) = local build from private dev Utf8String branch, 7th rev.

Method	Toolchain	StringLength	Mean	Error	StdDev	Scaled	ScaledSD
ToUpperInvariant	baseline coreclr	0	27.112 ns	0.7416 ns	1.1763 ns	1.00	0.00
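For readers unfamiliar with the three APIs being measured, the sketch below shows how each one is called. It is not the benchmark harness itself; the MeasuredApis class and its method names are assumptions for illustration.

```csharp
using System;

public static class MeasuredApis
{
    // MemoryExtensions.ToUpperInvariant writes into a caller-supplied buffer
    // and returns the number of chars written (or -1 if the buffer is too small).
    public static string ToUpper(string input)
    {
        Span<char> destination = new char[input.Length];
        int written = input.AsSpan().ToUpperInvariant(destination);
        return new string(destination.Slice(0, written));
    }

    // Default String.GetHashCode(): ordinal, case-sensitive.
    public static int OrdinalHash(string input) => input.GetHashCode();

    // Case-insensitive hashing: strings that differ only by case hash identically.
    public static bool HashesMatchIgnoringCase(string a, string b) =>
        a.GetHashCode(StringComparison.OrdinalIgnoreCase) ==
        b.GetHashCode(StringComparison.OrdinalIgnoreCase);
}
```

Note that String.GetHashCode values are only stable within a single process, so the sketch compares hashes rather than printing them.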

Memory<T> API documentation and samples

This document describes the APIs of Memory<T>, IMemoryOwner<T>, and MemoryManager<T> and their relationships to each other.

See also the Memory<T> usage guidelines document for background information.

First, a brief summary of the basic types

Memory<T> is the basic type that represents a contiguous buffer. This type is a struct, which means that developers cannot subclass it and override the implementation. The basic implementation of the type is aware of contiguous memory buffers backed by T[] and System.String (in the case of ReadOnlyMemory<char>).
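The two built-in backing stores described above can be illustrated with a short sketch. The MemoryBackingDemo class and its method names are assumptions for illustration; the calls themselves (the implicit T[] conversion, AsMemory on a string, and MemoryMarshal.TryGetString) are existing .NET APIs.

```csharp
using System;
using System.Runtime.InteropServices;

public static class MemoryBackingDemo
{
    // Memory<T> over an array: writes through the span are visible in the array.
    public static int[] FillThroughMemory()
    {
        int[] array = new int[4];
        Memory<int> memory = array;   // implicit conversion from T[]
        memory.Span.Fill(7);
        return array;
    }

    // Reports whether a ReadOnlyMemory<char> is backed by a System.String.
    public static bool IsBackedByString(ReadOnlyMemory<char> memory) =>
        MemoryMarshal.TryGetString(memory, out _, out _, out _);
}
```

Here IsBackedByString("hello".AsMemory()) returns true, while the same call on a ReadOnlyMemory<char> created from a char[] returns false.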

	using System;
	using System.IO;
	using System.Runtime.Serialization;
	using System.Runtime.Serialization.Formatters.Binary;

	class Program
	{
	    static void Main(string[] args)
	    {
	        Stream inputStream = GetInputStream();

	using System;
	using System.Runtime.InteropServices;
	using System.Text;

	class Program
	{
	    static void Main(string[] args)
	    {
	        {
	            // the text below is meaningless

	<?xml version="1.0" encoding="utf-8"?>
	<Project ToolsVersion="4.0" DefaultTargets="Build" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
	    <!-- ... -->
	    <Import Project="$(MSBuildToolsPath)\Microsoft.CSharp.targets" />
	    <!-- This task adds a module initializer to {IL}.txt. -->
	    <UsingTask TaskName="InjectModuleInitializer" TaskFactory="CodeTaskFactory" AssemblyFile="$(MSBuildToolsPath)\Microsoft.Build.Tasks.v4.0.dll">
	        <ParameterGroup>
	            <Path ParameterType="System.String" Required="true" />
	            <InitializerMethod ParameterType="System.String" Required="true" />
	        </ParameterGroup>

	// In a loop, try reading a natural word at a time.

	const int CharsPerNuint = sizeof(nuint) / sizeof(char);
	for (; inputLength >= CharsPerNuint; pInputBuffer += CharsPerNuint, inputLength -= CharsPerNuint)
	{
	    nuint utf16Data = Unsafe.ReadUnaligned<nuint>(pInputBuffer);

	    utf16Data &= unchecked((nuint)0xFF80_FF80_FF80_FF80ul);
	    if (utf16Data == 0)
	    {

	/*
	* !! WARNING !!
	*
	* COMPLETELY UNTESTED CODE
	*/

	using Microsoft.Win32.SafeHandles;
	using System.Diagnostics;
	using System.Runtime.CompilerServices;
	using System.Runtime.ConstrainedExecution;