juanshishido · September 9, 2015 05:40
diff --git a/text-collection-explore.ipynb b/text-collection-explore.ipynb
 {
 "cells": [
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "# John Stuart Mill"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "## Overview"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "For my text collection, I have chosen the writings of the 19th-century philosopher John Stuart Mill. Project Gutenberg hosts 11 of his works. Mill writes primarily about social issues and political economy. Combined, the files total 6.8 MB. The text contains little markup, though there are some Unicode characters. More importantly, each file begins with introductory text that is not part of Mill's writing and needs to be removed."
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "## Initial Computation"
  },
  {
   "metadata": {
    "trusted": true,
    "collapsed": false
   },
   "cell_type": "code",
   "source": "%pprint\n\nimport os\nimport re",
   "execution_count": 1,
   "outputs": [
    {
     "output_type": "stream",
     "text": "Pretty printing has been turned OFF\n",
     "name": "stdout"
    }
   ]
  },
  {
   "metadata": {
    "trusted": true,
    "collapsed": false
   },
   "cell_type": "code",
   "source": "# Escape the dot so only names ending in a literal '.txt' match\nfiles = [f for f in os.listdir() if re.search(r'\\.txt$', f)]",
   "execution_count": 2,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "collapsed": false
   },
   "cell_type": "code",
   "source": "texts = ''\nfor f in files:\n    with open(f, 'r', encoding='utf-8') as jsm:\n        # Replace line breaks with spaces so each file becomes one long line\n        t = jsm.read().replace('\\n', ' ').replace('\\r', ' ')\n    texts = texts + ' ' + t",
   "execution_count": 3,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "collapsed": false
   },
   "cell_type": "code",
   "source": "print('The string is', len(texts), 'characters in length.')",
   "execution_count": 4,
   "outputs": [
    {
     "output_type": "stream",
     "text": "The string is 6658453 characters in length.\n",
     "name": "stdout"
    }
   ]
  },
  {
   "metadata": {
    "trusted": true,
    "collapsed": false
   },
   "cell_type": "code",
   "source": "text = [w.lower() for w in texts.split(' ') if w != '']",
   "execution_count": 5,
   "outputs": []
  },
  {
   "metadata": {
    "trusted": true,
    "collapsed": false
   },
   "cell_type": "code",
   "source": "print('There are', len(text), 'tokens.')",
   "execution_count": 6,
   "outputs": [
    {
     "output_type": "stream",
     "text": "There are 1113445 tokens.\n",
     "name": "stdout"
    }
   ]
  },
  {
   "metadata": {
    "trusted": true,
    "collapsed": false
   },
   "cell_type": "code",
   "source": "print('There are', len(set(text)), 'unique tokens in this text collection.')",
   "execution_count": 7,
   "outputs": [
    {
     "output_type": "stream",
     "text": "There are 52907 unique tokens in this text collection.\n",
     "name": "stdout"
    }
   ]
  },
  {
   "metadata": {
    "trusted": true,
    "collapsed": true
   },
   "cell_type": "code",
   "source": "",
   "execution_count": null,
   "outputs": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "name": "python3",
   "display_name": "Python 3",
   "language": "python"
  },
  "language_info": {
   "pygments_lexer": "ipython3",
   "file_extension": ".py",
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "nbconvert_exporter": "python",
   "name": "python",
   "version": "3.4.2",
   "mimetype": "text/x-python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
 }