Skip to content

Instantly share code, notes, and snippets.

@fpaupier
Last active May 22, 2020 21:21
Show Gist options
  • Save fpaupier/08475906e6bf97035a012a051dbb7b71 to your computer and use it in GitHub Desktop.
Save fpaupier/08475906e6bf97035a012a051dbb7b71 to your computer and use it in GitHub Desktop.
Shuffle the paragraphs of a text file with python 🐍
#!/usr/bin/python
# -*- coding: utf8 -*-
import codecs
import random
def shuffle_text_paragraphs(file_path, output_file_path):
"""
Args:
file_path (string): absolute path of the file whose paragraph must be shuffled.
output_file_path (string): absolute path of the file to write the paragraphs in.
Returns:
No output, write directly the shuffled paragraph in the output file.
"""
with codecs.open(file_path, mode='r', encoding='utf-8') as f:
data = f.read()
# Split on \n\n
paragraphs = data.split('\n\n')
# Shuffle splits
random.shuffle(paragraphs)
with codecs.open(output_file_path, mode='w', encoding='utf-8') as output:
for paragraph in paragraphs:
output.write(paragraph)
# Add the line break
output.write('\n\n')
if __name__ == '__main__':
shuffle_text_paragraphs('/path/to/input.txt', '/path/to/output.txt'
@fpaupier
Copy link
Author

fpaupier commented Aug 16, 2018

What

This snippet proposes a copy/pastable code enabling you to write a new text file in which the paragraph from the original text file are shuffled.

Why

In a machine learning pipeline, shuffling the paragraph of a text dataset can be an interesting step of data augmentation.
Your neural network will learn on different sequences and batches while keeping a meaningful unit of language : a consistent paragraph.

Indeed, shuffling on sentences can lead to non-sense when generating text afterwards.

Learn more on how we used it in our project of rap music lyrics generation using neural networks.

Assumptions

We define a paragraph as a list of lines. Each paragraph in the input.txt are expected to be separated by two line breakers '\n\n' for the following snippet to work properly (this can be tweaked though). In any cases, you can perform simple regex operations to comply with this.

Note

This snippet is python2 and python3 ready

@melnarte
Copy link

Should you perform file closing?
f.close()
and
output.close()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment