@bjulius
Last active November 28, 2022 04:40
Test Diacritic Replacement with Undocumented Binary Encoding Functions – Brian Julius
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("XZPBTttAEIZfZcQplXpwQgjOEQIESKAoLlQV4jCxp/aSzQ5aey0lL8Az9FaOPfTUHtte3LxXZ0NiOZws+f88s/v/v+/v99r9/eDIpmQKZXDolNZk5wxRUb1MlSZohe/2Ht6vue5gSXHGueYSZwo/6GT1VcUZXNMjJbr6A62Dmg2PLS6VHlP1y6gEc2gdbrWD4FU7SmiuLLT6tdA9dyZFu4iqF5OwhRHHuZJP2+0aCc8smpguXV7AGZsClZEztve3QK/zOnyI1ioTZwit7lvtDsvqZef9IJOrjokNafDL44yWO8DmYGe6+mmVAHCkp2SLHSbiUlEBt0axuUO9dhQuSjRcvuFihVr5C1BCFjVM6MlNtYqBv8Bnl4rDWCo8sbj6iwYuyc5k9L/n5pTeDdvCpahPXV59nypu+HgYfCKZPpQg0SzkkcBV9dsHK04FNdW9YY0mGdolpWyXMMaCGzEdhnUtrtDKghHNn0jS6G2JsHNRoF7cIGuGCee5aoq9U5Ou54tvMJaYZn5/rfeD9ccR6lL2WoIolhA0xs0h/e6x02K8wnMrhvlasjRu5h1tQBMnu6WPlELkjZ9xUw4HlrEQP7GUTq2+ueZBOkGwacWEDeqE68KL0ttYeKXsOhMYac6prnknaAcb4mPGcyl5bXODuKYiI+udyCUWTQuIDKnH5A0WPUmV5YgqgTvvw454a13qcHGiJCmpvdW+gg2gO5AI5lOFlziXiCac2OpH6nyJ63u2w00g5/JrLGCE/sfx6sN/", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [#"Highest Goal Scorer" = _t]),
// Split wherever a digit is followed by a non-digit
SplitByCharTrans = Table.SplitColumn(Source, "Highest Goal Scorer", Splitter.SplitTextByCharacterTransition({"0".."9"}, (c) => not List.Contains({"0".."9"}, c)), {"Highest Goal Scorer.1", "Highest Goal Scorer.2", "Highest Goal Scorer.3"}),
// Split the middle piece at "("
SplitByDelim = Table.SplitColumn(SplitByCharTrans, "Highest Goal Scorer.2", Splitter.SplitTextByDelimiter("(", QuoteStyle.Csv), {"Highest Goal Scorer.2.1", "Highest Goal Scorer.2.2"}),
// Split wherever a lowercase letter is followed by an uppercase letter
SplitByCharTrans2 = Table.SplitColumn(SplitByDelim, "Highest Goal Scorer.2.1", Splitter.SplitTextByCharacterTransition({"a".."z"}, {"A".."Z"}), {"Highest Goal Scorer.2.1.1", "Highest Goal Scorer.2.1.2"}),
// Keep only the last piece and give it back the original column name
RemoveCols = Table.SelectColumns(SplitByCharTrans2,{"Highest Goal Scorer.2.1.2"}),
Rename = Table.RenameColumns(RemoveCols,{{"Highest Goal Scorer.2.1.2", "Highest Goal Scorer"}}),
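// Encode with code page 1361 (Korean); characters it cannot represent become "?"
ToBinary = Table.AddColumn(Rename, "BinaryConversion", each Text.ToBinary([Highest Goal Scorer], 1361)),
// Decode the bytes back to text with code page 1250
ToText = Table.AddColumn(ToBinary, "NoDiacritics", each Text.FromBinary([BinaryConversion], 1250)),
// Zero-based positions of every "?" left by the round trip
DiacriticPos = Table.AddColumn(ToText, "DiacriticPositions", each List.PositionOf(Text.ToList([NoDiacritics]), "?", Occurrence.All)),
// Number of "?" characters found in each row
CountDiacritics = Table.AddColumn(DiacriticPos, "DiacriticCount", each List.Count([DiacriticPositions]))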
in
CountDiacritics

bjulius commented Nov 28, 2022

In the ToBinary step, experiment with different values for the code page. For example, replace 1361 (Korean) with 1251 (Cyrillic). The former has no mapping for accented characters, so it replaces each one with a "?", whereas the latter replaces each accented character with its unaccented base letter.
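
To compare code pages side by side without touching the main query, a small standalone query along these lines may help. It is only a sketch: the Sample value, the RoundTrip helper, and the record field names are made up for illustration, and because the encoding behavior is undocumented, the results should be checked in your own environment rather than taken for granted.

```powerquery
let
    // Hypothetical sample value; any text containing accented letters will do
    Sample = "Kylian Mbappé",
    // Hypothetical helper: encode with the given code page, then decode with 1250,
    // mirroring the ToBinary and ToText steps of the query above
    RoundTrip = (txt as text, codePage as number) as text =>
        Text.FromBinary(Text.ToBinary(txt, codePage), 1250),
    // One field per code page so the two results can be compared at a glance
    Result = [
        Korean1361 = RoundTrip(Sample, 1361),
        Cyrillic1251 = RoundTrip(Sample, 1251)
    ]
in
    Result
```

Adding more fields to the record is an easy way to try other code pages.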

Using an encoding that does not recognize diacritics lets you calculate, for example, the positions of all diacritics or the number of diacritics in a given text string, as shown in the DiacriticPos and CountDiacritics steps.
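
One thing to watch for: a "?" that is already in the original text is found by the same search. A possible way around that, sketched below, is to compare the round-tripped text with the original character by character and keep only the positions where a "?" is new. The sample value and step names are hypothetical, and the sketch assumes the round trip returns exactly one character for each original character, which may not hold for every input.

```powerquery
let
    // Hypothetical sample containing both an accented letter and a literal "?"
    Sample = "Is José here?",
    // Same round trip as the ToBinary and ToText steps of the query above
    Encoded = Text.FromBinary(Text.ToBinary(Sample, 1361), 1250),
    Original = Text.ToList(Sample),
    Replaced = Text.ToList(Encoded),
    // Keep positions where the round trip produced "?" but the original character was not "?"
    // (assumes both lists have the same length)
    Positions = List.Select(
        List.Positions(Replaced),
        (i) => Replaced{i} = "?" and Original{i} <> "?"
    ),
    Result = [DiacriticPositions = Positions, DiacriticCount = List.Count(Positions)]
in
    Result
```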
