Skip to content

Instantly share code, notes, and snippets.

@rjweiss
Last active December 28, 2015 23:09
Show Gist options
  • Save rjweiss/7576985 to your computer and use it in GitHub Desktop.
Save rjweiss/7576985 to your computer and use it in GitHub Desktop.
File formats
{
"metadata": {
"name": "fileformats_solutions.ipynb"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# File formats\n",
"\n",
"In the block below, change to the directory the zip file you just downloaded, which contained this ipython notebook."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import os\n",
"#os.chdir('/path/to/extracted/fileformat.zip')\n",
"\n",
"'''\n",
"os.chdir expects a string object as an argument\n",
"the string can be either a path relative to your current working directory or an absolute path\n",
"I'm passing an absolute path below:\n",
"'''\n",
"\n",
"os.chdir('/Users/rweiss/Dropbox/presentations/IRiSS2013/text1/fileformats/')\n",
"\n",
"'''\n",
"os.listdir also expects a string object\n",
"os.getcwd() returns the current working path as a string object\n",
"so, you can pass os.getcwd() as an argument to os.listdir...when the code is interpreted, python will handle the returns for you\n",
"note that os.listdir() returns a list of strings, where each element in the list is a string representing the filename of the\n",
"files in the directory \n",
"'''\n",
"os.listdir(os.getcwd())"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 1,
"text": [
"['.DS_Store',\n",
" 'amazon',\n",
" 'example.xls',\n",
" 'example.xlsx',\n",
" 'fileformats_lecture.ipynb',\n",
" 'jstor.zip',\n",
" 'syllabus.docx']"
]
}
],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## CSV\n",
"\n",
"### Example 1"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"'''\n",
"csv is a module in the python standard library\n",
"you need to import csv in order to gain access to the methods and objects available to you in the csv module\n",
"'''\n",
"\n",
"import csv\n",
"\n",
"'''\n",
"this code will print the first line in the file 'sociology_2010.csv'\n",
"because we are using csv.reader and the file has field names, the first line are field names\n",
"'''\n",
"\n",
"with open('amazon/sociology_2010.csv', 'rb') as csvfile: # this will automatically call csvfile.close() when the code below completes\n",
" amazon_reader = csv.reader(csvfile) #this instantiates a csv.reader object called \"amazon_reader\". \n",
" #csv.reader will preloaded with the values from the file argument \"csvfile\", which is the data from the file \"sociology_2010.csv\" above.\n",
" for row in amazon_reader: # amazon_reader is an iterable object, so you can write the expression \"for VALUE in ITERABLE\" and whatever code you write that\n",
" #affects VALUE will affect every iterable element in the iterable object (in this case, the csv.reader)\n",
" print row #prints whatever is the value of row\n",
" break # this just breaks after one iteration of the loop. the file is big, so i didn't want it to fill the screen"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"['doi', 'title', 'amazon_id', 'review_text']\n"
]
}
],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example 2"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import csv\n",
"with open('amazon/sociology_2010.csv', 'rb') as csvfile:\n",
" amazon_reader = csv.DictReader(csvfile, delimiter=',', quotechar='\\\"') #a DictReader has more methods available to it than a regular csv.reader()\n",
" #though csv.readers also can have delimiters and quotechars specified...this stuff is listed in the python docs online.\n",
" for row in amazon_reader:\n",
" print row #because this is a DictReader, every row in the csv is represented as a dictionary, not as a list (as per the regular csv.reader)\n",
" break"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"{'review_text': \"This book was given to me by my best friend after finishing college. I will always treasure this thoughtful and special gift.<p>_Girlfriends_ is a collection of stories that explore and celebrate female friendship through the eyes, ears, and hearts of everyday women. Some of the women were friends for a lifetime, others for a short time. However, all understood and/or demonstrated the meaning of &quot;true friendship.&quot; For example, the stories included everything from the thankful musings of a once-ill woman about the extraordinaty kindness of her girlfriends to a giggly account of how two eerily-simiar best friends met as assigned roomates their first day of college. (The latter tale struck very close to home in a wonderfully spooky way.)<p>While many of the stories tugged at the heartstrings, I never felt manipulated by the authors. (Note: Part of the reason why I don't like the &quot;Chicken Soup for the Soul&quot; series is that I feel that the authors are just dying to make the reader clutch for the box of tissues.) Rather, I appreciated the &quot;real&quot; tone of the stories, as they read like good conversation shared over a nice pot of Hazlenut coffee.<p>Some readers have commented on the book's simple language and lack of depth. I don't think the goal here was to explore the psychology of friendship, rather I think it was intended to be a simple and beautiful celebration meant to be enjoyed by &quot;Girlfriends&quot; everywhere. Enjoy!\", 'doi': '10.2307/25677732', 'amazon_id': '1885171080', 'title': 'Invisible Girlfriend'}\n"
]
}
],
"prompt_number": 3
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print amazon_reader.fieldnames #DictReaders have a member method \"fieldnames\" which returns a list of strings corresponding to the first row of the file"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"['doi', 'title', 'amazon_id', 'review_text']\n"
]
}
],
"prompt_number": 4
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"with open('amazon/sociology_2010.csv', 'rb') as csvfile:\n",
" amazon_reader = csv.DictReader(csvfile, delimiter=',')\n",
" for row in amazon_reader:\n",
" print row['title'] #because they are dicts, you can refer to the value by the key; in this case, the \"title\" field\n",
" break\n",
" #print [row['title'] for row in amazon_reader]\n",
" "
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Invisible Girlfriend\n"
]
}
],
"prompt_number": 15
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Exercises on your own time\n",
"\n",
"- Create a list of the text from all Amazon reviews\n",
"- Create a list of dictionaries from all Amazon reviews, where each dictionary is a mapping of unique ID to text body (hint, {doi: review_text})"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#solution 1\n",
"with open('amazon/sociology_2010.csv', 'rb') as csvfile:\n",
" amazon_reader = csv.DictReader(csvfile, delimiter=',')\n",
" amazon_reviews = [row['review_text'] for row in amazon_reader]\n",
"\n",
"print len(amazon_reviews)\n",
"\n",
"#solution 2\n",
"with open('amazon/sociology_2010.csv', 'rb') as csvfile:\n",
" amazon_reader = csv.DictReader(csvfile, delimiter=',')\n",
" amazon_review_dicts = [{row['doi']: row['review_text']} for row in amazon_reader]\n",
"\n",
" \n",
" \n",
"#doc_tf_vectors = [tf(term, text) for term, text in amazon_reviews]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"381\n",
"0.366204 <= #1\n",
"0.000000 <= TEXT\n",
"0.000000 <= DOCUMENT\n"
]
}
],
"prompt_number": 79
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Microsoft Excel"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### openpyxl: working with `.xlsx` files\n",
"\n",
"You should be able to install `openpyxl` if it is not on your machine already using either `pip` or `easy_install`. If you are using Canopy, you can install these through the GUI: Tools > Package Manager > Search for \"openpyxl\" and install. Otherwise, enter one of the following in the shell (I prefer `pip`).\n",
"\n",
" sudo pip install openpyxl\n",
" \n",
" sudo easy_install openpyxl\n",
"\n",
"### Example 3"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from openpyxl import load_workbook\n",
"import csv, sys\n",
"\n",
"wb = load_workbook('example.xlsx')\n",
"print type(wb)\n",
"#print dir(wb)\n",
"# create object from xlsx workbook\n",
"wb.get_sheet_names() # print out all sheet names\n",
"print type(wb.get_sheet_names())\n",
"print type(wb.get_sheet_names()[0])\n",
"print type(wb.worksheets)\n",
"for sheet in wb.worksheets: \n",
" print type(sheet)\n",
"#\tprint sheet.title # another way of printing all sheet names\n"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"<class 'openpyxl.workbook.Workbook'>\n",
"<type 'list'>\n",
"<type 'str'>\n",
"<type 'list'>\n",
"<class 'openpyxl.worksheet.Worksheet'>\n",
"<class 'openpyxl.worksheet.Worksheet'>\n"
]
}
],
"prompt_number": 71
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"sheet1 = wb.worksheets[0]\n",
"print sheet1# grab the first sheet (also look up .get_sheet_by_name())\n",
"print sheet1.cell('A1').value # print the text value of cell A1"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"<Worksheet \"First\">\n",
"This is the first cell, A1, sheet \"First\"\n"
]
}
],
"prompt_number": 73
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"row2 = sheet1.rows[1] # create tuple of Cell objects\n",
"for cell in row2:\n",
"\tprint cell.value # print all the text values of every cell in the row"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"THis is the first column, second row A2, sheet \"First\"\n",
"THis is the second column, second row B2, sheet \"First\"\n",
"THis is the third column, second row C2, sheet \"First\"\n"
]
}
],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercises on your own time:\n",
" \n",
"- `.get_sheet_names()` is a member method of `Workbook` objects. What are some other methods for `Workbook` objects? (hint, what does `dir(wb)` return?)\n",
"- Take one of the methods that you discovered in question 1. What kind of objects do they return?\n",
"- You can learn a lot more about `openpyxl` by reading the [documentation](http://pythonhosted.org/openpyxl/usage.html)\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print dir(wb)\n",
"\n",
"print wb.encoding\n",
"print type(wb.encoding)\n",
"print wb.properties\n",
"print dir(wb.properties)\n",
"print wb.properties.creator"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"['_Workbook__optimized_read', '_Workbook__optimized_write', '_Workbook__thread_local_data', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_active_sheet_index', '_local_data', '_named_ranges', '_set_optimized_read', 'add_named_range', 'add_sheet', 'create_named_range', 'create_sheet', 'encoding', 'excel_base_date', 'get_active_sheet', 'get_index', 'get_named_range', 'get_named_ranges', 'get_sheet_by_name', 'get_sheet_names', 'loaded_theme', 'properties', 'remove_named_range', 'remove_sheet', 'save', 'security', 'strings_table_builder', 'style', 'worksheets']\n",
"utf-8\n",
"<type 'str'>\n",
"<openpyxl.workbook.DocumentProperties object at 0x10e4b9510>\n",
"['__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'category', 'company', 'created', 'creator', 'description', 'excel_base_date', 'keywords', 'last_modified_by', 'modified', 'subject', 'title']\n",
"Unknown\n"
]
}
],
"prompt_number": 35
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### xlutils: working with `.xls` files\n",
"\n",
"You should be able to install `xlutils` if it is not on your machine already using either `pip` or `easy_install`. If you are using Canopy, you can install these through the GUI: Tools > Package Manager > Search for \"xlutils\" and install. Otherwise, enter one of the following in the shell (I prefer `pip`).\n",
"\n",
" sudo pip install xlutils\n",
" \n",
" sudo easy_install xlutils\n",
"\n",
"### Example 4"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from xlrd import open_workbook\n",
"\n",
"wb = open_workbook('example.xls') # create object from xls workbook\n",
"print wb.sheets()\n",
"for sheet in wb.sheets():\n",
" print sheet.name # printing all sheet names\n"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[<xlrd.sheet.Sheet object at 0x10e4b92d0>, <xlrd.sheet.Sheet object at 0x101067850>]\n",
"First\n",
"Second\n"
]
}
],
"prompt_number": 36
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"sheet1 = wb.sheets()[0] # grab the first sheet, also look up .sheet_by_index(NUM)\n",
"print sheet1.cell(1,1)# print the text value of cell A1\n",
"row2 = sheet1.row(1) # create tuple of Cell objects\n",
"\n",
"for cell in row2:\n",
" print cell.value # print all the text values of every cell in the row"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"text:u'THis is the second column, second row B2, sheet \"First\"'\n",
"THis is the first column, second row A2, sheet \"First\"\n",
"THis is the second column, second row B2, sheet \"First\"\n",
"THis is the third column, second row C2, sheet \"First\"\n"
]
}
],
"prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercises on your own time:\n",
"\n",
"1. `.sheets()` is actually a method that returns a list. What are other methods available to `Book` objects?\n",
"2. `.cell()`, `.row()` and `.value` are actually methods too. What do they return?\n",
"\n",
"You can learn more about `xlutils` in the [documentation](http://www.python-excel.org/). "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print dir(wb)\n",
"\n",
"print wb.encoding\n",
"print wb.nsheets\n",
"print type(wb.sheets())\n",
"print type(wb.sheets()[0])\n",
"print type(sheet1.row(1))\n",
"print type(row2[0])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"['__class__', '__delattr__', '__dict__', '__doc__', '__enter__', '__exit__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_all_sheets_count', '_all_sheets_map', '_externsheet_info', '_externsheet_type_b57', '_extnsht_count', '_extnsht_name_from_num', '_position', '_repr_these', '_resources_released', '_rich_text_runlist_map', '_sh_abs_posn', '_sharedstrings', '_sheet_list', '_sheet_names', '_sheet_num_from_name', '_sheet_visibility', '_sheethdr_count', '_supbook_addins_inx', '_supbook_count', '_supbook_locals_inx', '_supbook_types', '_xf_epilogue_done', '_xf_index_to_xl_type_map', 'actualfmtcount', 'addin_func_names', 'base', 'biff2_8_load', 'biff_version', 'builtinfmtcount', 'codepage', 'colour_indexes_used', 'colour_map', 'countries', 'datemode', 'derive_encoding', 'dump', 'encoding', 'encoding_override', 'fake_globals_get_sheet', 'filestr', 'font_list', 'format_list', 'format_map', 'formatting_info', 'get2bytes', 'get_record_parts', 'get_record_parts_conditional', 'get_sheet', 'get_sheets', 'getbof', 'handle_boundsheet', 'handle_builtinfmtcount', 'handle_codepage', 'handle_country', 'handle_datemode', 'handle_efont', 'handle_externname', 'handle_externsheet', 'handle_filepass', 'handle_font', 'handle_format', 'handle_name', 'handle_obj', 'handle_palette', 'handle_sheethdr', 'handle_sheetsoffset', 'handle_sst', 'handle_style', 'handle_supbook', 'handle_writeaccess', 'handle_xf', 'initialise_format_info', 'is_date_format_string', 'load_time_stage_1', 'load_time_stage_2', 'logfile', 'mem', 'name_and_scope_map', 'name_map', 'name_obj_list', 'names_epilogue', 'nsheets', 'on_demand', 'palette_epilogue', 'palette_record', 'parse_globals', 'pickleable', 'ragged_rows', 'raw_user_name', 'read', 'release_resources', 'sheet_by_index', 'sheet_by_name', 'sheet_loaded', 'sheet_names', 'sheets', 'stream_len', 'style_name_map', 'unload_sheet', 'use_mmap', 'user_name', 'verbosity', 'xf_epilogue', 'xf_list', 'xfcount']\n",
"utf_16_le\n",
"2\n",
"<type 'list'>\n",
"<class 'xlrd.sheet.Sheet'>\n",
"<type 'list'>\n",
"<class 'xlrd.sheet.Cell'>\n"
]
}
],
"prompt_number": 56
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Microsoft Word\n",
"\n",
"Instead of a pile of Excel files, maybe you have a collection of Word files. Let's say you've got a bunch of student essays, or maybe first person reports, or government reports, etc. \n",
"\n",
"If you are lucky, you will be working with `.docx` files. If you aren't working with `.docx`, you might want to try `wv`, but to be honest, it would be better to try to convert them to `.docx` using a macro or some other process (this could be a very good teach-yourself-python project!). \n",
"\n",
"Why `.docx`? Office 2007 and later use formats called [Office Open XML](http://en.wikipedia.org/wiki/Office_Open_XML). `.docx` and `.xlsx` and all those other formats are actually **zipped xml files**. That's what the x at the end stands for.\n",
"\n",
"That means you can use a third-party python library that can parse XML files (you'll learn more about parsing in the Web Scraping module). Here's some example code using [python-docx](https://github.com/mikemaccana/python-docx)\n",
"\n",
"> **How to install `python-docx`**\n",
"\n",
"> This module is not in `pypi`, which means that you have to install it manually. On the page linked above, you can either `git clone` the file or just download the zipped archive.\n",
"\n",
"> - Unzip the archive somewhere and navigate to that directory in the shell\n",
"> - In the shell, type the following:\n",
"\n",
" ~$ python setup.py install\n",
" \n",
"> - If you aren't using the Enthought Python distribution, you will probably need to install the following third-party libraries which **are** in `pypi`\n",
"\n",
" ~$ sudo pip install lxml\n",
" \n",
" ~$ sudo pip install PIL\n",
" \n",
"or\n",
"\n",
" ~$ sudo easy_install lxml\n",
" \n",
" ~$ sudo easy_install PIL\n",
" \n",
"### Using python-docx\n",
"\n",
"This is actually a pretty easy module to use. Check out the example code.\n",
"### Example 5"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from docx import opendocx, getdocumenttext\n",
"\n",
"document = opendocx('syllabus.docx')\n",
"print type(document)\n",
"paragraphs = getdocumenttext(document)\n",
"print type(paragraphs)\n",
"print len(paragraphs)\n",
"print paragraphs[0]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"<type 'lxml.etree._Element'>\n",
"<type 'list'>\n",
"74\n",
"IRiSS Computational Social Science Summer Workshop Schedule 2013\n"
]
}
],
"prompt_number": 11
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print type(paragraphs[0])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"<type 'unicode'>\n"
]
}
],
"prompt_number": 12
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print type(paragraphs[0].encode('utf-8'))\n",
"print paragraphs[0]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"<type 'str'>\n",
"IRiSS Computational Social Science Summer Workshop Schedule 2013\n"
]
}
],
"prompt_number": 13
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tarfiles and Zipfiles\n",
"\n",
"### Example 6"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import zipfile\n",
"import os\n",
"\n",
"zfile = zipfile.ZipFile(\"jstor.zip\")\n",
"dirname = 'jstor'\n",
"\n",
"print os.listdir(os.getcwd())\n",
"\n",
"#this would be better as try-except, but that's for future learning\n",
"if os.path.isdir(dirname):\n",
" pass\n",
"else:\n",
" os.makedirs(dirname)\n",
"\n",
"print 'Files in directory '+ dirname + ': '+ ', '.join(os.listdir(os.path.join(os.getcwd(), dirname)))\n",
"\n",
"for name in zfile.namelist():\n",
" print 'Extracting file ' + name + '...'\n",
" fd = open(os.path.join(dirname, name), \"w\")\n",
" fd.write(zfile.read(name))\n",
" fd.close()\n",
"\n",
"print 'Files in directory '+ dirname + ': '+ ','.join(os.listdir(os.path.join(os.getcwd(), dirname)))\n"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"['.DS_Store', 'amazon', 'example.xls', 'example.xlsx', 'fileformats_lecture.ipynb', 'jstor.zip', 'syllabus.docx']\n",
"Files in directory jstor: \n",
"Extracting file biologicalsciences-discipline-2010-unigrams.txt...\n",
"Extracting file literature-discipline-2010-unigrams.txt...\n",
"Extracting file sociology-discipline-2010-unigrams.txt..."
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"Files in directory jstor: biologicalsciences-discipline-2010-unigrams.txt,literature-discipline-2010-unigrams.txt,sociology-discipline-2010-unigrams.txt"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n"
]
}
],
"prompt_number": 14
}
],
"metadata": {}
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment