Last active
May 22, 2020 21:21
-
-
Save fpaupier/08475906e6bf97035a012a051dbb7b71 to your computer and use it in GitHub Desktop.
Shuffle the paragraphs of a text file with python π
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| #!/usr/bin/python | |
| # -*- coding: utf8 -*- | |
| import codecs | |
| import random | |
| def shuffle_text_paragraphs(file_path, output_file_path): | |
| """ | |
| Args: | |
| file_path (string): absolute path of the file whose paragraph must be shuffled. | |
| output_file_path (string): absolute path of the file to write the paragraphs in. | |
| Returns: | |
| No output, write directly the shuffled paragraph in the output file. | |
| """ | |
| with codecs.open(file_path, mode='r', encoding='utf-8') as f: | |
| data = f.read() | |
| # Split on \n\n | |
| paragraphs = data.split('\n\n') | |
| # Shuffle splits | |
| random.shuffle(paragraphs) | |
| with codecs.open(output_file_path, mode='w', encoding='utf-8') as output: | |
| for paragraph in paragraphs: | |
| output.write(paragraph) | |
| # Add the line break | |
| output.write('\n\n') | |
| if __name__ == '__main__': | |
| shuffle_text_paragraphs('/path/to/input.txt', '/path/to/output.txt' |
Should you perform file closing?
f.close()
and
output.close()
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
What
This snippet proposes a copy/pastable code enabling you to write a new text file in which the paragraph from the original text file are shuffled.
Why
In a machine learning pipeline, shuffling the paragraph of a text dataset can be an interesting step of data augmentation.
Your neural network will learn on different sequences and batches while keeping a meaningful unit of language : a consistent paragraph.
Indeed, shuffling on sentences can lead to non-sense when generating text afterwards.
Learn more on how we used it in our project of rap music lyrics generation using neural networks.
Assumptions
We define a paragraph as a list of lines. Each paragraph in the
input.txtare expected to be separated by two line breakers'\n\n'for the following snippet to work properly (this can be tweaked though). In any cases, you can perform simple regex operations to comply with this.Note
This snippet is python2 and python3 ready