Created
February 17, 2025 15:33
Passing char offsets between #python Unicode code points and #csharp UTF-16 code units
# In C#, a char is a UTF-16 code unit.
# In Python, len(str) counts Unicode code points, not UTF-16 code units.
# newline="" keeps "\r\n" intact so the count matches the raw bytes C# reads.
with open("test.txt", "r", encoding="utf-8", newline="") as file:
    content = file.read()

# Each UTF-16 code unit is 2 bytes; "utf-16-le" adds no BOM.
utf16_code_unit_count = len(content.encode("utf-16-le")) // 2
print(f"Number of characters: {utf16_code_unit_count}")
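To pass an offset rather than a total count, the same encoding trick can be applied to a prefix of the string. This is a minimal sketch, not part of the original gist: the helper name `codepoint_to_utf16_offset` is hypothetical, and it assumes the text contains no lone surrogates (which `encode` would reject).

```python
def codepoint_to_utf16_offset(text: str, index: int) -> int:
    """Convert a Python str index (code points) to a UTF-16 code unit offset."""
    if not 0 <= index <= len(text):
        raise IndexError("index out of range")
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" emits no BOM.
    return len(text[:index].encode("utf-16-le")) // 2


sample = "a\U0001F600b"  # 'a', one character outside the BMP, 'b'
print(codepoint_to_utf16_offset(sample, 2))  # → 3, the offset of 'b' as C# sees it
```

Encoding a prefix is linear in the prefix length, so for many lookups on one long text it may be worth precomputing a cumulative offset table instead.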
using System;
using System.IO;
using System.Text;

byte[] contentBytes = File.ReadAllBytes("test.txt");
// Despite its name, GetCharCount returns the number of UTF-16 code units the bytes decode to.
Console.WriteLine($"Number of characters: {Encoding.UTF8.GetCharCount(contentBytes)}");
// Both Encoding.UTF8.GetCharCount() and string.Length count UTF-16 code units (what .NET calls "chars"), not Unicode code points (what Python's len(str) counts).
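Going the other way, an offset produced on the C# side has to be mapped back to a Python index. The following is a rough sketch under the same assumptions as the gist; the function name `utf16_offset_to_codepoint` is hypothetical. It walks the string and counts two code units for every code point above U+FFFF, and it rejects offsets that land in the middle of a surrogate pair.

```python
def utf16_offset_to_codepoint(text: str, offset: int) -> int:
    """Convert a UTF-16 code unit offset (as used by C#) to a Python str index."""
    units = 0  # UTF-16 code units consumed so far
    for index, ch in enumerate(text):
        if units == offset:
            return index
        if units > offset:
            break  # the offset pointed inside a surrogate pair
        # Code points above U+FFFF need a surrogate pair (2 code units).
        units += 2 if ord(ch) > 0xFFFF else 1
    if units == offset:
        return len(text)  # offset points just past the last character
    raise ValueError("offset is negative, inside a surrogate pair, or past the end")


sample = "a\U0001F600b"
print(utf16_offset_to_codepoint(sample, 3))  # → 2, the Python index of 'b'
```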
aππ΅βπ«θ |