Created
September 9, 2015 05:40
-
-
Save juanshishido/ce1c033856892228f603 to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "# John Stuart Mill" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "## Overview" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "For my text collection, I have chosen to go with the writings of 19th century philosopher John Stuart Mill. Project Gutenberg hosts 11 of his works. Mill primarily writes about social issues and political economy. Combined, the files are 6.8 MB. The text does not contain much markup. There are some Unicode characters. More importantly, there is introductory text at the beginning of each file that is not part of Mill's writing that needs to be removed." | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "## Initial Computation" | |
}, | |
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "%pprint\n\nimport os\nimport re", | |
"execution_count": 1, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "Pretty printing has been turned OFF\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "files = [f for f in os.listdir() if re.search('.txt$', f)]", | |
"execution_count": 2, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "texts = ''\nfor f in files:\n with open (f, 'r', encoding='utf-8') as jsm:\n t = jsm.read().replace('\\n', ' ').replace('\\r', ' ')\n texts = texts + ' ' + t", | |
"execution_count": 3, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "print('The string is', len(texts), 'characters in length.')", | |
"execution_count": 4, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "The string is 6658453 characters in length.\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "text = [w.lower() for w in texts.split(' ') if w != '']", | |
"execution_count": 5, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "print('There are', len(text), 'tokens.')", | |
"execution_count": 6, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "There are 1113445 tokens.\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": false | |
}, | |
"cell_type": "code", | |
"source": "print('There are', len(set(text)), 'unique tokens in this text collection.')", | |
"execution_count": 7, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "There are 52907 unique tokens in this text collection.\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"trusted": true, | |
"collapsed": true | |
}, | |
"cell_type": "code", | |
"source": "", | |
"execution_count": null, | |
"outputs": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3", | |
"language": "python" | |
}, | |
"language_info": { | |
"pygments_lexer": "ipython3", | |
"file_extension": ".py", | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"nbconvert_exporter": "python", | |
"name": "python", | |
"version": "3.4.2", | |
"mimetype": "text/x-python" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment