Created July 13, 2022 00:09
Part 2 of the Fleksy NLP challenge
// First, define some constants
VOCAB = ["_", " ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "à", "ç", "è", "é", "í", "ñ", "ò", "ó", "ú"]

// Then, define our helper functions
function greedy_decoding(nn_output):
    // This function uses a greedy approach to decode a word from the output of a neural network trained with CTC.
    // Here, nn_output is the output of the neural network.
    // In the context of this challenge, it's the content of the CSV file.
    // I assume it's a matrix of size [n, v], where n is the number of character positions (40 in this example) and v is the size of the vocabulary (37 in this example).
    // The greedy decoding algorithm is simple: at each time step, take the character with the highest probability.
    word = ""
    for step in nn_output:
        // Here, step is a vector of size [v]
        vocab_idx = argmax(step)
        word = word + VOCAB[vocab_idx]
    return word
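The greedy step above can be sketched as runnable Python, assuming `nn_output` is a NumPy array of shape `[n, v]` holding per-step probabilities (the input data here is hypothetical):

```python
import numpy as np

VOCAB = ["_", " ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k",
         "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x",
         "y", "z", "à", "ç", "è", "é", "í", "ñ", "ò", "ó", "ú"]

def greedy_decoding(nn_output):
    # At each time step, pick the vocabulary entry with the highest probability.
    return "".join(VOCAB[int(np.argmax(step))] for step in nn_output)
```

For example, a 3-step output whose argmax positions are 9, 9 and 10 decodes to `"hhi"` (indices 9 and 10 are `"h"` and `"i"` in this vocabulary).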
function clean_decode(word):
    // This function cleans the word predicted by the neural network, to account for repetitions introduced by CTC.
    // As described in the blog, this is done in 2 steps:
    // * Removing duplicate characters
    // * Removing blanks
    // First, let's remove the duplicate characters
    prev_c = None
    undup_word = ""
    for c in word:
        // Here we iterate over each character of the word: c is a character
        if c != prev_c:
            undup_word = undup_word + c
        prev_c = c
    // Then, remove the blank character
    clean_word = ""
    for c in undup_word:
        if c != VOCAB[0]:
            clean_word = clean_word + c
    return clean_word
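A minimal Python sketch of the cleanup logic above (assuming the blank is `"_"`, the first entry of the vocabulary):

```python
def clean_decode(word):
    # Step 1: collapse runs of identical characters.
    undup, prev = [], None
    for c in word:
        if c != prev:
            undup.append(c)
        prev = c
    # Step 2: drop the CTC blank token "_".
    return "".join(c for c in undup if c != "_")
```

Note the order matters: a genuine double letter survives only if the network emits a blank between the two occurrences, so `"hel_lo"` cleans to `"hello"`, while `"hello"` would collapse to `"helo"`.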
// Finally, define our main function, the one to be used for this challenge
function ctc_decode(nn_output):
    // Here, nn_output is the output of the neural network.
    // In the context of this challenge, it's the content of the CSV file.
    // I assume it's a matrix of size [n, v], where n is the number of character positions (40 in this example) and v is the size of the vocabulary (37 in this example).
    // Greedy-decode the output of the network into a text
    raw_word = greedy_decoding(nn_output)
    // Clean the word of CTC repetitions and special characters
    word = clean_decode(raw_word)
    return word
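Putting it together, here is a self-contained end-to-end sketch of the pipeline in Python (NumPy assumed; the probability matrix below is synthetic, built only for illustration):

```python
import numpy as np

VOCAB = ["_", " ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k",
         "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x",
         "y", "z", "à", "ç", "è", "é", "í", "ñ", "ò", "ó", "ú"]

def ctc_decode(nn_output):
    # Greedy step: best character per time step.
    raw = "".join(VOCAB[int(np.argmax(step))] for step in nn_output)
    # Cleanup step: collapse repeats, then strip the blank.
    collapsed = "".join(c for i, c in enumerate(raw) if i == 0 or c != raw[i - 1])
    return collapsed.replace(VOCAB[0], "")

# Hypothetical 5-step output; only the argmax of each row matters for decoding.
probs = np.full((5, len(VOCAB)), 0.01)
for t, idx in enumerate([9, 9, 0, 10, 10]):  # "h", "h", "_", "i", "i"
    probs[t, idx] = 1.0
print(ctc_decode(probs))  # prints "hi"
```

The real input would come from parsing the challenge's CSV file into a `[40, 37]` matrix, as described in the comments above.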