@milovidov983
Created August 14, 2018 13:49
How to convert a UTF-8 string to UTF-8 with BOM in C#
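// Requires: using System.Linq; using System.Text;
private string ConvertStringToUtf8Bom(string source) {
    // Encode the text, then prepend the UTF-8 preamble (EF BB BF).
    var data = Encoding.UTF8.GetBytes(source);
    var result = Encoding.UTF8.GetPreamble().Concat(data).ToArray();
    // GetString does not strip the preamble, so the returned string starts with U+FEFF.
    var encoder = new UTF8Encoding(true);
    return encoder.GetString(result);
}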
@milovidov983
Author

1

@djonasdev

Thanks! 👍

chucklu commented Apr 12, 2021

@dojo90 @milovidov983 What's the point of the conversion? str and str2 look like the same string afterwards.

    var str = "aîn";
    var str2 = ConvertStringToUtf8Bom(str);
    Console.WriteLine(str2);

milovidov983 commented Apr 12, 2021

  1. When I wrote this function, it seemed to solve my problem: Excel immediately recognized the correct Cyrillic encoding when it opened the resulting CSV file.

  2. By the way, it looks like the strings are not equal. Or am I wrong?
    https://dotnetfiddle.net/mYgvdl

Example

using System;
using System.Text;
using System.Linq;
public class Program {
	public static void Main() {
		var str = "aîn";
		var str2 = ConvertStringToUtf8Bom(str);
		Console.WriteLine(str2);

		Console.WriteLine(str == str2);

		var str3 = "aîn";
		Console.WriteLine(str == str3);
	}
	static string ConvertStringToUtf8Bom(string source) {
		var data = Encoding.UTF8.GetBytes(source);
		var result = Encoding.UTF8.GetPreamble().Concat(data).ToArray();
		var encoder = new UTF8Encoding(true);

		return encoder.GetString(result);
	}

}

// Output:
/*
aîn
False
True
*/

chucklu commented Apr 12, 2021

@milovidov983, thanks. I printed the byte arrays and it makes sense now.

    // GetHexString is my own helper (not shown);
    // BitConverter.ToString(bytes).Replace("-", " ") produces the same format.
    var str = "aîn";
    var str2 = ConvertStringToUtf8Bom(str);
    Console.WriteLine(str2 == str);

    Console.WriteLine($"length of {str} is {str.Length}");
    var bytes1 = Encoding.UTF8.GetBytes(str);
    Console.WriteLine(GetHexString(bytes1));

    Console.WriteLine();

    Console.WriteLine($"length of {str2} is {str2.Length}");
    var bytes2 = Encoding.UTF8.GetBytes(str2);
    Console.WriteLine(GetHexString(bytes2));

False
length of aîn is 3
61 C3 AE 6E

length of aîn is 4
EF BB BF 61 C3 AE 6E
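
The extra character and the three leading bytes above are the byte order mark, U+FEFF, which UTF-8 encodes as EF BB BF. A minimal sketch of what that implies, assuming the gist's function behaves as the output shows: prepending U+FEFF to the string directly should give the same result. `PrependBom` and `BomDemo` are hypothetical names used only for this illustration.

```csharp
using System;
using System.Linq;
using System.Text;

public static class BomDemo
{
    // Same logic as the gist's ConvertStringToUtf8Bom.
    public static string ConvertStringToUtf8Bom(string source)
    {
        var data = Encoding.UTF8.GetBytes(source);
        var result = Encoding.UTF8.GetPreamble().Concat(data).ToArray();
        return new UTF8Encoding(true).GetString(result);
    }

    // Hypothetical shortcut: prepend the BOM character without the byte round trip.
    public static string PrependBom(string source) => "\uFEFF" + source;

    public static void Main()
    {
        var str = "aîn";
        var viaBytes = ConvertStringToUtf8Bom(str);
        var viaChar = PrependBom(str);

        // The first char of the converted string is the BOM character.
        Console.WriteLine(viaBytes[0] == '\uFEFF');
        // Both approaches yield the same string (ordinal comparison).
        Console.WriteLine(viaBytes == viaChar);
        // Trimming the BOM character recovers the original string.
        Console.WriteLine(viaBytes.TrimStart('\uFEFF') == str);
    }
}
```

Because the BOM travels inside the string, any later concatenation or comparison sees it as a real character, which is why `str == str2` is False.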

chucklu commented Apr 12, 2021

By the way, I create new files with a StreamWriter and Encoding.UTF8, and it writes the BOM automatically.
https://github.com/dotnet/runtime/blob/6ef4b2e7aba70c514d85c2b43eac1616216bea55/src/libraries/System.Private.CoreLib/src/System/IO/StreamWriter.cs#L273
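
A rough sketch of that approach, assuming a new file written from the start: the string itself stays BOM-free and the writer emits the preamble. The file name, the sample CSV row, and the `StreamWriterBomDemo` class are made up for illustration.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class StreamWriterBomDemo
{
    public static void Main()
    {
        // Hypothetical output path, used only for this illustration.
        var path = Path.Combine(Path.GetTempPath(), "bom-demo.csv");

        // Encoding.UTF8 carries a preamble, so the writer emits EF BB BF
        // at the start of the new file before any text.
        using (var writer = new StreamWriter(path, false, Encoding.UTF8))
        {
            writer.WriteLine("имя;город");
        }

        // Check that the file starts with the UTF-8 preamble.
        var bytes = File.ReadAllBytes(path);
        var hasBom = bytes.Take(3).SequenceEqual(Encoding.UTF8.GetPreamble());
        Console.WriteLine(hasBom);

        File.Delete(path);
    }
}
```

Note that the encoding is passed explicitly here: a StreamWriter created without an encoding argument uses UTF-8 without a BOM.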

muru82 commented Nov 9, 2022

Does the above code add a carriage return when processing?

luanrem commented Mar 22, 2023

Thanks guys! This helped!
