@wi7a1ian
Created February 17, 2025 15:33
Passing char offsets between #python Unicode code points and #csharp UTF-16 code units
# In C#, a char is a UTF-16 code unit.
# In Python, len(str) counts Unicode code points, not UTF-16 code units.
with open("test.txt", "r", encoding="utf-8") as file:
    content = file.read()

# Every UTF-16 code unit is 2 bytes, so halving the encoded byte length gives the count.
utf16_code_unit_count = len(content.encode("utf-16-le")) // 2
print(f"Number of characters: {utf16_code_unit_count}")
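The count above covers the whole file, but the same idea should work for a single offset: encode only the prefix that ends at the Python index and halve its byte length. The sketch below is one possible way to do that. The function name `codepoint_to_utf16_offset` and the sample string are illustrative assumptions, not part of the original gist, and the code assumes the text contains no lone surrogates (true for anything decoded from valid UTF-8).

```python
def codepoint_to_utf16_offset(text: str, cp_offset: int) -> int:
    """Translate a Python str index (code points) into a UTF-16 code unit index."""
    # Code points above U+FFFF are stored as a surrogate pair (two code units),
    # so the prefix is measured in UTF-16 rather than counted with len().
    return len(text[:cp_offset].encode("utf-16-le")) // 2


# Hypothetical sample: 'a', one emoji outside the BMP, one CJK character.
sample = "a\U0001F60A\u845B"
print(codepoint_to_utf16_offset(sample, 2))  # 3: 'a' is 1 unit, the emoji is 2
```

The result can be used directly as an index into the same text held in a C# `string`.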
using System;
using System.IO;
using System.Text;
byte[] contentBytes = File.ReadAllBytes("test.txt");
// GetCharCount would be more accurately named GetUtf16CodeUnitCount.
Console.WriteLine($"Number of characters: {Encoding.UTF8.GetCharCount(contentBytes)}");
// Both Encoding.UTF8.GetCharCount() and string.Length count UTF-16 code units (which .NET calls "chars"),
// not Unicode code points (which Python's len(str) counts).
a😊😵‍💫葛
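Offsets that come back from C# are UTF-16 code unit indices, so on the Python side they need the reverse translation. The following is a minimal sketch under stated assumptions: the function name `utf16_to_codepoint_offset` is hypothetical, the sample string is an illustration built from escapes (a letter, an emoji, a ZWJ emoji sequence, and a CJK character), and an offset that lands inside a surrogate pair is rounded up to the next code point, which is a design choice rather than anything the gist prescribes.

```python
def utf16_to_codepoint_offset(text: str, utf16_offset: int) -> int:
    """Translate a UTF-16 code unit index (a C# string index) into a Python str index."""
    units = 0  # UTF-16 code units consumed so far
    for index, char in enumerate(text):
        if units >= utf16_offset:
            return index
        # Code points above U+FFFF occupy two UTF-16 code units (a surrogate pair).
        units += 2 if ord(char) > 0xFFFF else 1
    # Offsets at or past the end map to the end of the string.
    return len(text)


# Hypothetical sample: 6 code points, 9 UTF-16 code units.
sample = "a\U0001F60A\U0001F635\u200D\U0001F4AB\u845B"
print(utf16_to_codepoint_offset(sample, 8))  # 5: the index of the final CJK character
```

A linear scan keeps the sketch easy to verify; for many lookups against the same text, precomputing a table of cumulative code unit counts would likely be faster.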