@gaphex
Created May 9, 2019 15:11
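RAW_DATA_FPATH = "dataset.txt" #@param {type: "string"}
PRC_DATA_FPATH = "proc_dataset.txt" #@param {type: "string"}

# Apply normalization to the dataset.
# This will take a minute or two.
# count_lines and normalize_text are helpers assumed to be defined in an
# earlier cell; Progbar is presumably the Keras progress bar utility.
total_lines = count_lines(RAW_DATA_FPATH)
bar = Progbar(total_lines)

with open(RAW_DATA_FPATH, encoding="utf-8") as fi:
    with open(PRC_DATA_FPATH, "w", encoding="utf-8") as fo:
        for l in fi:
            # write one normalized line per input line and advance the bar
            fo.write(normalize_text(l) + "\n")
            bar.add(1)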
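
The snippet above calls `count_lines` and `normalize_text` without defining them. The sketch below shows one way such helpers might look. Both bodies are assumptions made for illustration, not the original definitions: the normalization steps here (Unicode NFKC, lowercasing, whitespace collapsing) are only a plausible stand-in.

```python
import re
import unicodedata


def count_lines(path, encoding="utf-8"):
    """Count the lines in a text file without loading it into memory."""
    with open(path, encoding=encoding) as f:
        return sum(1 for _ in f)


def normalize_text(text):
    """Hypothetical normalization: NFKC, lowercase, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    # collapse runs of whitespace (including the trailing newline) to one space
    return re.sub(r"\s+", " ", text).strip()
```

Because this `normalize_text` strips the trailing newline, the loop above can append `"\n"` itself and still emit exactly one output line per input line.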