Created June 4, 2016 08:50
Python code to trim raw text of Complete Sherlock Holmes
"""
Original raw text:
http://sherlock-holm.es/ascii/
Trimmed format:
- Each line is a complete paragraph.
- Each line ends with two newline characters ('\n\n'), including the last line.
- The disclaimer at the end of the raw text is not deleted; you need to delete it yourself.
"""
# A line counts as body text when the character at index 5 is neither a
# space nor a newline; shorter lines and lines blank at that index are skipped.
with open("cano.txt", "r") as source, open("cano-trim.txt", "w") as output:
    new_paragraph = True
    for line in source:
        if len(line) > 5 and line[5] not in (' ', '\n'):
            # Body line: join it to the current paragraph with a single space.
            if not new_paragraph:
                output.write(' ')
            new_paragraph = False
            output.write(line.strip())
        elif line == '\n' and not new_paragraph:
            # Blank line after body text: close the current paragraph.
            output.write('\n\n')
            new_paragraph = True
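
Because every paragraph in the trimmed format ends with '\n\n', the output can be loaded back into a list of paragraphs by splitting on that separator. The following is a minimal sketch, not part of the original script; the helper name `split_paragraphs` and the sample text are assumptions made for illustration.

```python
def split_paragraphs(trimmed_text):
    """Split trimmed text into a list of paragraphs (hypothetical helper)."""
    # Each paragraph is terminated by '\n\n', so splitting on that separator
    # leaves an empty string after the last paragraph; it is dropped here.
    return [p for p in trimmed_text.split('\n\n') if p]


# Illustrative sample in the trimmed format, not taken from the real file.
sample = "First paragraph on one line.\n\nSecond paragraph.\n\n"
paragraphs = split_paragraphs(sample)
print(len(paragraphs))
```

To apply it to the real output, pass in the contents of cano-trim.txt, for example the result of calling `read()` on the opened file. The disclaimer paragraphs at the end will still be present unless they were deleted by hand.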