Last active
November 28, 2022 04:40
Test Diacritic Replacement with Undocumented Binary Encoding Functions – Brian Julius
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("XZPBTttAEIZfZcQplXpwQgjOEQIESKAoLlQV4jCxp/aSzQ5aey0lL8Az9FaOPfTUHtte3LxXZ0NiOZws+f88s/v/v+/v99r9/eDIpmQKZXDolNZk5wxRUb1MlSZohe/2Ht6vue5gSXHGueYSZwo/6GT1VcUZXNMjJbr6A62Dmg2PLS6VHlP1y6gEc2gdbrWD4FU7SmiuLLT6tdA9dyZFu4iqF5OwhRHHuZJP2+0aCc8smpguXV7AGZsClZEztve3QK/zOnyI1ioTZwit7lvtDsvqZef9IJOrjokNafDL44yWO8DmYGe6+mmVAHCkp2SLHSbiUlEBt0axuUO9dhQuSjRcvuFihVr5C1BCFjVM6MlNtYqBv8Bnl4rDWCo8sbj6iwYuyc5k9L/n5pTeDdvCpahPXV59nypu+HgYfCKZPpQg0SzkkcBV9dsHK04FNdW9YY0mGdolpWyXMMaCGzEdhnUtrtDKghHNn0jS6G2JsHNRoF7cIGuGCee5aoq9U5Ou54tvMJaYZn5/rfeD9ccR6lL2WoIolhA0xs0h/e6x02K8wnMrhvlasjRu5h1tQBMnu6WPlELkjZ9xUw4HlrEQP7GUTq2+ueZBOkGwacWEDeqE68KL0ttYeKXsOhMYac6prnknaAcb4mPGcyl5bXODuKYiI+udyCUWTQuIDKnH5A0WPUmV5YgqgTvvw454a13qcHGiJCmpvdW+gg2gO5AI5lOFlziXiCac2OpH6nyJ63u2w00g5/JrLGCE/sfx6sN/", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [#"Highest Goal Scorer" = _t]), | |
// Split each row at digit/non-digit boundaries, then on "(" and at lowercase-to-uppercase transitions, to isolate the scorer's name
SplitByCharTrans = Table.SplitColumn(Source, "Highest Goal Scorer", Splitter.SplitTextByCharacterTransition({"0".."9"}, (c) => not List.Contains({"0".."9"}, c)), {"Highest Goal Scorer.1", "Highest Goal Scorer.2", "Highest Goal Scorer.3"}),
SplitByDelim = Table.SplitColumn(SplitByCharTrans, "Highest Goal Scorer.2", Splitter.SplitTextByDelimiter("(", QuoteStyle.Csv), {"Highest Goal Scorer.2.1", "Highest Goal Scorer.2.2"}),
SplitByCharTrans2 = Table.SplitColumn(SplitByDelim, "Highest Goal Scorer.2.1", Splitter.SplitTextByCharacterTransition({"a".."z"}, {"A".."Z"}), {"Highest Goal Scorer.2.1.1", "Highest Goal Scorer.2.1.2"}),
// Keep only the name column and restore its original header
RemoveCols = Table.SelectColumns(SplitByCharTrans2,{"Highest Goal Scorer.2.1.2"}),
Rename = Table.RenameColumns(RemoveCols,{{"Highest Goal Scorer.2.1.2", "Highest Goal Scorer"}}),
// Encode with code page 1361 (Korean), then decode with code page 1250
ToBinary = Table.AddColumn(Rename, "BinaryConversion", each Text.ToBinary([Highest Goal Scorer], 1361 )),
ToText = Table.AddColumn(ToBinary, "NoDiacritics", each Text.FromBinary( [BinaryConversion], 1250 )),
// Occurrence.All returns the zero-based position of every "?" in the converted text
DiacriticPos = Table.AddColumn(ToText, "DiacriticPositions", each List.PositionOf( Text.ToList( [NoDiacritics] ), "?", Occurrence.All)),
// The number of positions found is the number of replaced characters
CountDiacritics = Table.AddColumn( DiacriticPos, "DiacriticCount", each List.Count( [DiacriticPositions]))
in
CountDiacritics
In the ToBinary step, experiment with different code page values. For example, replace 1361 (Korean) with 1251 (Cyrillic). The former does not recognize diacritics, so it replaces every accented character with a "?", whereas the latter replaces each accented character with its unaccented base letter.
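A minimal sketch for running that comparison on a single string rather than the whole table. The sample text and the `Convert` helper are hypothetical and not part of the original query. Passing a raw code page number to `Text.ToBinary` relies on the same undocumented behavior the query itself uses, so results may vary between environments.

```powerquery
let
    // Hypothetical sample text; any string with accented letters will do
    Sample = "Gerd Müller, Robert Lewandowski, Zlatan Ibrahimović",
    // Encode with the given code page, then decode with 1250 as the ToText step does
    Convert = (txt as text, codePage as number) as text =>
        Text.FromBinary(Text.ToBinary(txt, codePage), 1250),
    // One field per code page so the two results can be compared side by side
    Comparison = [
        Korean1361 = Convert(Sample, 1361),
        Cyrillic1251 = Convert(Sample, 1251)
    ]
in
    Comparison
```

Returning a record keeps both results visible in one preview, which makes it easy to try further code page numbers by adding fields.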
Using a code page that does not recognize diacritics lets you calculate, for example, the positions of all diacritics or the number of diacritics in a given text string, as shown in the DiacriticPos and CountDiacritics steps.
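One caveat with counting "?" characters is that a question mark already present in the source text is counted too. The sketch below subtracts those from the total. `CountReplaced` and the sample input are hypothetical, and the sketch assumes that each unrecognized character becomes exactly one "?", which may not hold for every input (for example, text that uses combining marks).

```powerquery
let
    // Hypothetical helper: counts characters that code page 1361 turned into "?"
    CountReplaced = (txt as text) as number =>
        let
            // Same conversion as the ToBinary and ToText steps
            Converted = Text.FromBinary(Text.ToBinary(txt, 1361), 1250),
            // Count the "?" characters in a text value
            CountQuestionMarks = (t as text) as number =>
                List.Count(List.Select(Text.ToList(t), each _ = "?")),
            // Subtract the "?" characters that were already in the input
            Replaced = CountQuestionMarks(Converted) - CountQuestionMarks(txt)
        in
            Replaced
in
    CountReplaced("Who scored, Müller?")
```

The same correction does not apply to positions: a list of "?" positions would still include any literal question marks, so those would need to be filtered against the original text.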