Last active
May 12, 2021 15:48
-
-
Save calebwaldner/9c784ba7952d0b4f7c57d93f19ceaad1 to your computer and use it in GitHub Desktop.
A BigQuery function that accepts an array of similar strings and returns the string that occurs most within array.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
CREATE TEMP FUNCTION GetMainStringValue(arr ARRAY<STRING>) RETURNS STRING AS ( | |
ARRAY( | |
SELECT list FROM ( | |
SELECT list, | |
count(*) count, | |
FROM UNNEST(arr) list | |
GROUP BY list | |
ORDER BY count DESC | |
) | |
)[SAFE_OFFSET(0)] | |
); | |
SELECT GetMainStringValue(["CoolWords", "CoolerWords ", "CoolWord", "CoolWard", "CoolWord", "CoolWord"]) AS string_with_most_occurrences |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
A BigQuery function that accepts an array of similar strings and returns the string that occurs most within array. Intended for use with SOUNDEX() function to facilitate users inputing the similar words but with some possible spelling errors; GetMainStringValue would accept that array of user values and output what seems the be the main intended value (assuming the more users spelled the words correctly than did not).