@StrikingLoo
Created October 25, 2019 04:01
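# file_names is assumed to be a list of text file paths defined earlier.
corpus = ""
for file_name in file_names:
    with open(file_name, 'r') as f:
        corpus += f.read()
# Flatten newlines and tabs so the corpus becomes one long line.
corpus = corpus.replace('\n', ' ')
corpus = corpus.replace('\t', ' ')
# Normalize curly quotes to straight quotes padded with spaces.
corpus = corpus.replace('“', ' " ')
corpus = corpus.replace('”', ' " ')
# Pad punctuation with spaces so each mark stands apart from adjacent words.
for spaced in ['.', '-', ',', '!', '?', '(', '—', ')']:
    corpus = corpus.replace(spaced, ' {0} '.format(spaced))
len(corpus)  # 10510355 characters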
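To see what these replacements do, here is a minimal sketch that applies the same steps to a short sample string and then splits it on whitespace. The helper name `space_out_punctuation`, the sample text, and the final `split()` step are assumptions added for illustration; they are not part of the original gist.

```python
def space_out_punctuation(text):
    """Hypothetical helper: applies the same replacements as the snippet above."""
    text = text.replace('\n', ' ').replace('\t', ' ')
    text = text.replace('“', ' " ').replace('”', ' " ')
    for spaced in ['.', '-', ',', '!', '?', '(', '—', ')']:
        text = text.replace(spaced, ' {0} '.format(spaced))
    return text

# Made-up sample text, not taken from the real corpus.
sample = 'Hello, world!\n“Quoted” (text).'
cleaned = space_out_punctuation(sample)

# split() with no argument collapses the runs of spaces left by the padding.
tokens = cleaned.split()
print(tokens)
# ['Hello', ',', 'world', '!', '"', 'Quoted', '"', '(', 'text', ')', '.']
```

Because every punctuation mark is surrounded by spaces, a plain whitespace split yields words and punctuation as separate tokens.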