@GrabYourPitchforks
Created May 18, 2021 22:31
OOBing Rune

Problem summary

While working on the System.Text.Encodings.Web refactoring I noticed that we have several duplicates of the System.Text.Rune type (or its backing logic) throughout our projects.

The official implementation exposed by System.Runtime:

A copy used by the Utf8String OOB:

Multiple copies used by System.Text.Encodings.Web:

Duplicated logic in System.Text.Json:

System.Text.Json also has some duplicated logic from what turned into the System.Text.Unicode.Utf8 helper APIs:

Proposal

I propose to create a standalone System.Text.Unicode package which includes the functionality of the Rune type plus select helper methods. The exposed API surface would be as follows.

/*
 * These APIs are type-forwarded into System.Runtime or System.Memory on .NET Core 3.1+.
 */

namespace System.Text
{
    //
    // This is the public API surface of Rune as it existed in .NET Core 3.1.
    //
    public readonly partial struct Rune : System.IComparable<System.Text.Rune>, System.IEquatable<System.Text.Rune>
    {
        private readonly int _dummyPrimitive;
        public Rune(char ch) { throw null; }
        public Rune(char highSurrogate, char lowSurrogate) { throw null; }
        public Rune(int value) { throw null; }
        [System.CLSCompliantAttribute(false)]
        public Rune(uint value) { throw null; }
        public bool IsAscii { get { throw null; } }
        public bool IsBmp { get { throw null; } }
        public int Plane { get { throw null; } }
        public static System.Text.Rune ReplacementChar { get { throw null; } }
        public int Utf16SequenceLength { get { throw null; } }
        public int Utf8SequenceLength { get { throw null; } }
        public int Value { get { throw null; } }
        public int CompareTo(System.Text.Rune other) { throw null; }
        public static System.Buffers.OperationStatus DecodeFromUtf16(System.ReadOnlySpan<char> source, out System.Text.Rune result, out int charsConsumed) { throw null; }
        public static System.Buffers.OperationStatus DecodeFromUtf8(System.ReadOnlySpan<byte> source, out System.Text.Rune result, out int bytesConsumed) { throw null; }
        public static System.Buffers.OperationStatus DecodeLastFromUtf16(System.ReadOnlySpan<char> source, out System.Text.Rune result, out int charsConsumed) { throw null; }
        public static System.Buffers.OperationStatus DecodeLastFromUtf8(System.ReadOnlySpan<byte> source, out System.Text.Rune value, out int bytesConsumed) { throw null; }
        public int EncodeToUtf16(System.Span<char> destination) { throw null; }
        public int EncodeToUtf8(System.Span<byte> destination) { throw null; }
        public override bool Equals(object? obj) { throw null; }
        public bool Equals(System.Text.Rune other) { throw null; }
        public override int GetHashCode() { throw null; }
        public static double GetNumericValue(System.Text.Rune value) { throw null; }
        public static System.Text.Rune GetRuneAt(string input, int index) { throw null; }
        public static System.Globalization.UnicodeCategory GetUnicodeCategory(System.Text.Rune value) { throw null; }
        public static bool IsControl(System.Text.Rune value) { throw null; }
        public static bool IsDigit(System.Text.Rune value) { throw null; }
        public static bool IsLetter(System.Text.Rune value) { throw null; }
        public static bool IsLetterOrDigit(System.Text.Rune value) { throw null; }
        public static bool IsLower(System.Text.Rune value) { throw null; }
        public static bool IsNumber(System.Text.Rune value) { throw null; }
        public static bool IsPunctuation(System.Text.Rune value) { throw null; }
        public static bool IsSeparator(System.Text.Rune value) { throw null; }
        public static bool IsSymbol(System.Text.Rune value) { throw null; }
        public static bool IsUpper(System.Text.Rune value) { throw null; }
        public static bool IsValid(int value) { throw null; }
        [System.CLSCompliantAttribute(false)]
        public static bool IsValid(uint value) { throw null; }
        public static bool IsWhiteSpace(System.Text.Rune value) { throw null; }
        public static bool operator ==(System.Text.Rune left, System.Text.Rune right) { throw null; }
        public static explicit operator System.Text.Rune (char ch) { throw null; }
        public static explicit operator System.Text.Rune (int value) { throw null; }
        [System.CLSCompliantAttribute(false)]
        public static explicit operator System.Text.Rune (uint value) { throw null; }
        public static bool operator >(System.Text.Rune left, System.Text.Rune right) { throw null; }
        public static bool operator >=(System.Text.Rune left, System.Text.Rune right) { throw null; }
        public static bool operator !=(System.Text.Rune left, System.Text.Rune right) { throw null; }
        public static bool operator <(System.Text.Rune left, System.Text.Rune right) { throw null; }
        public static bool operator <=(System.Text.Rune left, System.Text.Rune right) { throw null; }
        public static System.Text.Rune ToLower(System.Text.Rune value, System.Globalization.CultureInfo culture) { throw null; }
        public static System.Text.Rune ToLowerInvariant(System.Text.Rune value) { throw null; }
        public override string ToString() { throw null; }
        public static System.Text.Rune ToUpper(System.Text.Rune value, System.Globalization.CultureInfo culture) { throw null; }
        public static System.Text.Rune ToUpperInvariant(System.Text.Rune value) { throw null; }
        public static bool TryCreate(char highSurrogate, char lowSurrogate, out System.Text.Rune result) { throw null; }
        public static bool TryCreate(char ch, out System.Text.Rune result) { throw null; }
        public static bool TryCreate(int value, out System.Text.Rune result) { throw null; }
        [System.CLSCompliantAttribute(false)]
        public static bool TryCreate(uint value, out System.Text.Rune result) { throw null; }
        public bool TryEncodeToUtf16(System.Span<char> destination, out int charsWritten) { throw null; }
        public bool TryEncodeToUtf8(System.Span<byte> destination, out int bytesWritten) { throw null; }
        public static bool TryGetRuneAt(string input, int index, out System.Text.Rune value) { throw null; }
    }

    public ref partial struct SpanRuneEnumerator
    {
        private object _dummy;
        private int _dummyPrimitive;
        public System.Text.Rune Current { get { throw null; } }
        public System.Text.SpanRuneEnumerator GetEnumerator() { throw null; }
        public bool MoveNext() { throw null; }
    }

    public partial struct StringRuneEnumerator : System.Collections.Generic.IEnumerable<System.Text.Rune>, System.Collections.Generic.IEnumerator<System.Text.Rune>, System.Collections.IEnumerable, System.Collections.IEnumerator, System.IDisposable
    {
        private object _dummy;
        private int _dummyPrimitive;
        public System.Text.Rune Current { get { throw null; } }
        object? System.Collections.IEnumerator.Current { get { throw null; } }
        public System.Text.StringRuneEnumerator GetEnumerator() { throw null; }
        public bool MoveNext() { throw null; }
        System.Collections.Generic.IEnumerator<System.Text.Rune> System.Collections.Generic.IEnumerable<System.Text.Rune>.GetEnumerator() { throw null; }
        System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() { throw null; }
        void System.Collections.IEnumerator.Reset() { }
        void System.IDisposable.Dispose() { }
    }
}

namespace System.Text.Unicode
{
    public static partial class Utf8
    {
        public static System.Buffers.OperationStatus FromUtf16(System.ReadOnlySpan<char> source, System.Span<byte> destination, out int charsRead, out int bytesWritten, bool replaceInvalidSequences = true, bool isFinalBlock = true) { throw null; }
        public static System.Buffers.OperationStatus ToUtf16(System.ReadOnlySpan<byte> source, System.Span<char> destination, out int bytesRead, out int charsWritten, bool replaceInvalidSequences = true, bool isFinalBlock = true) { throw null; }
    }
}

/*
 * These APIs are extension methods to mimic instance APIs added in .NET Core 3.1.
 * The extension methods cannot be type-forwarded, but their 3.1+ implementations can act as
 * thin wrappers around the real framework APIs. Downlevel implementations will have the
 * logic fully contained within these methods.
 */

namespace System
{
    public static partial class MemoryExtensions
    {
        public static System.Text.SpanRuneEnumerator EnumerateRunes(this System.ReadOnlySpan<char> span) { throw null; }
        public static System.Text.SpanRuneEnumerator EnumerateRunes(this System.Span<char> span) { throw null; }
    }

    public static partial class StringExtensions
    {
        public static System.Text.StringRuneEnumerator EnumerateRunes(this string text) { throw null; }
    }
}
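
As a rough illustration of how a consumer might use the surface listed above, here is a minimal sketch. It assumes the code is compiled either against .NET Core 3.0+ (where these APIs are inbox) or against the proposed package on a downlevel target. The `RuneUsageSketch` class and its method names are hypothetical and are not part of the proposal.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper type, for illustration only.
public static class RuneUsageSketch
{
    // Counts Unicode scalar values in a UTF-16 string.
    // A surrogate pair counts once; the enumerator reports an unpaired
    // surrogate as Rune.ReplacementChar (U+FFFD).
    public static int CountScalars(string text)
    {
        int count = 0;
        foreach (Rune rune in text.EnumerateRunes())
        {
            count++;
        }
        return count;
    }

    // Decodes UTF-8 bytes into scalar values, one Rune at a time.
    public static List<int> DecodeUtf8Scalars(ReadOnlySpan<byte> utf8)
    {
        var values = new List<int>();
        while (!utf8.IsEmpty)
        {
            // For invalid or truncated input, 'rune' is Rune.ReplacementChar and
            // 'bytesConsumed' covers the bad bytes, so the loop always advances.
            Rune.DecodeFromUtf8(utf8, out Rune rune, out int bytesConsumed);
            values.Add(rune.Value);
            utf8 = utf8.Slice(bytesConsumed);
        }
        return values;
    }
}
```

The returned `OperationStatus` is ignored here for brevity; a caller that needs to distinguish invalid data from truncated data would inspect it instead.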

The package should target the frameworks net461, netstandard2.0, and netcoreapp3.1. When running on .NET Core 3.1+, the implementations are largely removed and the package type-forwards into the inbox framework assemblies.

Consumers like System.Text.Json and System.Text.Encodings.Web do not need to include references to this package as part of their ref set. However, when compiling for targets before .NET Core 3.1, the implementation assemblies would have a reference to this package.

Q&A

Are MemoryExtensions and StringExtensions a good idea? The proposal above introduces them into the System namespace so that they look like the shipped .NET Core 3.1+ APIs. The type name MemoryExtensions is clearly already taken, but since most callers should invoke the APIs via extension method syntax rather than typical static method invocation syntax, I don't believe this will cause conflicts in practice. The only time it could cause conflicts is if a caller is targeting .NET Core 3.1+ and also has referenced this package explicitly.
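
To make the extension-syntax point concrete, here is a minimal sketch of what a 3.1+ "thin wrapper" could look like. The type name `StringExtensionsSketch` is hypothetical (the proposal would name it `StringExtensions` in the `System` namespace), and the body is only valid on targets where `string` already has an instance `EnumerateRunes` method.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for the proposed System.StringExtensions type.
public static class StringExtensionsSketch
{
    // Thin wrapper for targets where string.EnumerateRunes() is an instance method.
    // C# overload resolution prefers an applicable instance method over an extension
    // method, so the call below binds to the framework API and does not recurse.
    // On a downlevel target this body would instead hold the full enumeration logic.
    public static StringRuneEnumerator EnumerateRunes(this string text)
    {
        return text.EnumerateRunes();
    }
}
```

The same preference for instance methods means that a caller writing `text.EnumerateRunes()` on .NET Core 3.1+ gets the framework's `string` API even if an extension like this one is also in scope. The span overloads are a different case, because the inbox `EnumerateRunes` for spans is itself an extension method on `System.MemoryExtensions`.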

If desired, we can also mark the APIs as [Obsolete] in the .NET Core 3.1+ ref, which signals to library authors that they should only be pulling in this package when compiling for downlevel targets. If they're cross-compiling for multiple targets, they should configure their environment not to include this package reference for the later targets. We can even delete the APIs from the ref but leave them in the implementation, which prevents new code from compiling against them but won't block existing compiled code from binding to them during load.

What about new APIs being introduced? For example, there may be desire for downlevel customers to call APIs introduced in dotnet/runtime#28230, which is slated for introduction in the 6.0 timeframe. We can take these on a case-by-case basis; but generally speaking, there's nothing stopping us from utilizing the same techniques here. Implement the APIs to the best of our ability downlevel - which may involve leaving off some of the API surface - and type-forward on compatible runtimes.

What about the APIs added under System.Globalization? APIs like CompareInfo.IndexOf(ReadOnlySpan<char>, Rune, ...) cannot be implemented out-of-band without taking a significant performance hit. As it is, the OOB implementation (not the inbox implementation) of System.Text.Rune may already allocate for scenarios like Rune.GetUnicodeCategory(new Rune(0x10000)), but we would try to avoid allocations in the common case. For Rune-enlightened APIs on CompareInfo, we would not be able to avoid this allocation under even the most basic of scenarios. I don't think we should attempt to implement such APIs out-of-band.
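
The allocation concern can be illustrated with a minimal sketch of how a downlevel category lookup might be written. This is an assumption about the general shape, not the actual OOB implementation: older frameworks have no `CharUnicodeInfo.GetUnicodeCategory(int)` overload, so a supplementary code point has to be turned into a two-char string first. The `DownlevelCategorySketch` name is hypothetical.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, for illustration only.
public static class DownlevelCategorySketch
{
    public static UnicodeCategory GetUnicodeCategory(int codePoint)
    {
        if ((uint)codePoint > 0x10FFFF)
        {
            throw new ArgumentOutOfRangeException(nameof(codePoint));
        }

        if (codePoint <= 0xFFFF)
        {
            // Common case: a BMP code point fits in one char, so no allocation is needed.
            return CharUnicodeInfo.GetUnicodeCategory((char)codePoint);
        }

        // Supplementary plane: allocate a surrogate-pair string so that the
        // (string, int) overload can be used.
        string pair = char.ConvertFromUtf32(codePoint);
        return CharUnicodeInfo.GetUnicodeCategory(pair, 0);
    }
}
```

In this shape only supplementary-plane lookups such as U+10000 pay for a string allocation, which matches the "avoid allocations in the common case" goal described above.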

@lilith
lilith commented Feb 29, 2024

That's the tip of the iceberg overall - I'm betting there are hundreds of copies of Rune floating around just so people can URL decode unicode properly on .NET Standard 2. https://github.com/search?q=IEquatable%3CRune%3E&type=code

I would love for this to exist.
