@juanshishido
Created September 9, 2015 05:40
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "# John Stuart Mill"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Overview"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "For my text collection, I have chosen the writings of the 19th-century philosopher John Stuart Mill. Project Gutenberg hosts 11 of his works, which deal primarily with social issues and political economy. Combined, the files are 6.8 MB. The text contains little markup, though it does include some Unicode characters. More importantly, each file begins with introductory text that is not part of Mill's writing and needs to be removed."
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Initial Computation"
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "%pprint\n\nimport os\nimport re",
"execution_count": 1,
"outputs": [
{
"output_type": "stream",
"text": "Pretty printing has been turned OFF\n",
"name": "stdout"
}
]
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "# Collect the .txt files in the working directory; the dot is escaped so it matches literally\nfiles = [f for f in os.listdir() if re.search(r'\\.txt$', f)]",
"execution_count": 2,
"outputs": []
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "texts = ''\nfor f in files:\n    with open(f, 'r', encoding='utf-8') as jsm:\n        # Replace line breaks with spaces so each file becomes one long line\n        t = jsm.read().replace('\\n', ' ').replace('\\r', ' ')\n        texts = texts + ' ' + t",
"execution_count": 3,
"outputs": []
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "print('The string is', len(texts), 'characters in length.')",
"execution_count": 4,
"outputs": [
{
"output_type": "stream",
"text": "The string is 6658453 characters in length.\n",
"name": "stdout"
}
]
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "text = [w.lower() for w in texts.split(' ') if w != '']",
"execution_count": 5,
"outputs": []
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "print('There are', len(text), 'tokens.')",
"execution_count": 6,
"outputs": [
{
"output_type": "stream",
"text": "There are 1113445 tokens.\n",
"name": "stdout"
}
]
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "print('There are', len(set(text)), 'unique tokens in this text collection.')",
"execution_count": 7,
"outputs": [
{
"output_type": "stream",
"text": "There are 52907 unique tokens in this text collection.\n",
"name": "stdout"
}
]
},
{
"metadata": {
"trusted": true,
"collapsed": true
},
"cell_type": "code",
"source": "",
"execution_count": null,
"outputs": []
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"pygments_lexer": "ipython3",
"file_extension": ".py",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"nbconvert_exporter": "python",
"name": "python",
"version": "3.4.2",
"mimetype": "text/x-python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}