-
-
Save dribnet/cd80fd1d5d113da870dc0aa38aa4b6f5 to your computer and use it in GitHub Desktop.
Quick intro to CSVs
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| { | |
| "cells": [ | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "# Working with CSV files\n", | |
| "\n", | |
| "By [Allison Parrish](http://www.decontextualize.com/)\n", | |
| "\n", | |
| "Humans have represented data in the form of tables (i.e., organized in rows and columns) for [thousands of years](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.565.2159&rep=rep1&type=pdf). In a contemporary computational context, the tool we use for working with tabular data is called a [spreadsheet](https://en.wikipedia.org/wiki/Spreadsheet). Starting with VisiCalc and Lotus-1-2-3 in the late 1970s and early 1980s, spreadsheet software has consistently been among the best-selling software on personal computers. Today, most computer users are familiar with spreadsheet software like Excel or Google Sheets and many use them daily in their day-to-day work, regardless of their familiarity with math, statistics or computer programming.\n", | |
| "\n", | |
| "It makes sense to want to be able to take work that we do in Python and import it into our spreadsheet software, or take data from spreadsheets and work with it in Python. But there's no one obvious way of representing tabular data in computer-readable format, and spreadsheet software from different vendors use different formats internally that aren't necessarily interoperable. (If you save a spreadsheet in Excel, you might not be able to open that spreadsheet in, say, Apple's Numbers software.) What we need is a common, easy-to-use way to format tabular data so we can move it between tools without a lot of trouble.\n", | |
| "\n", | |
| "## Comma-separated values\n", | |
| "\n", | |
| "\"CSV,\" short for \"comma-separated values,\" is just such a format. A file in CSV format represents tabular data as a series of lines in a plain text file, in which values for each column of the table are separated by commas. Take a look at [this table in Google Sheets](https://docs.google.com/spreadsheets/d/16JhXTxkKmKmsCpAAy0e54Ky8zHspRq6bgaUv8bnqfX8/pubhtml?gid=0&single=true) of the top ten lakes by area in the United States ([source data here](https://en.wikipedia.org/wiki/List_of_largest_lakes_of_the_United_States_by_area)). In CSV format, it looks like this:\n", | |
| "\n", | |
| " Name,States/Provinces,Area (sq mi)\n", | |
| " Lake Superior,Michigan-Minnesota-Wisconsin-Ontario,31700\n", | |
| " Lake Michigan,Illinois-Indiana-Michigan-Wisconsin,22300\n", | |
| " Lake Huron,Michigan-Ontario,22300\n", | |
| " Lake Erie,Michigan-New York-Ohio-Ontario-Pennsylvania,9910\n", | |
| " Lake Ontario,New York-Ontario,7340\n", | |
| " Great Salt Lake (salt),Utah,2117\n", | |
| " Lake of the Woods,Manitoba-Minnesota-Ontario,1679\n", | |
| " Iliamna Lake,Alaska,1014\n", | |
| " Lake Oahe (man-made),North Dakota-South Dakota,685\n", | |
| " Lake Okeechobee,Florida,662\n", | |
| " \n", | |
| "This may look incomprehensible at first, but look carefully and you can see the structure. Each row is on a separate line, and each line has the contents of every cell, separated by commas. (The first line has the names of the columns themselves. This row is called the *header*, and may or may not be present in a given CSV file.)\n", | |
| "\n", | |
| "Sometimes the character that separates the cells on each row is something other than a comma. A few alternative you're likely to see are the pipe character (`|`), a colon (`:`) or a tab. But even if a different delimiter other than a comma is used, such files are still called `CSV` files. (Although sometimes files with tab-separated values are called `TSV` files.)\n", | |
| "\n", | |
| "### Exporting CSV files\n", | |
| "\n", | |
| "Most spreadsheet software packages support CSV files in one form or another. Google Sheets, for example, lets you export a spreadsheet as CSV using the \"Download As...\" item in the File menu. In Excel, CSV is one of the supported formats in the \"Save As...\" dialog box. It's important to remember, however, that a CSV file isn't a perfect replica of the spreadsheet that you're exporting! When you save a spreadsheet as CSV, you lose all formatting (like fonts, colors, cell sizes). You also lose any charts or images you may have added to the spreadsheet, along with formulas, etc.\n", | |
| "\n", | |
| "The most important (and vexing) thing to remember about CSV files is that the values in the table *don't have types*. A CSV file doesn't distinguish between numbers, currency amounts, dates, etc. That means that when you're reading a CSV file into Python, or importing a CSV file from Python into Excel, you'll need to find a way to recover the data types for each column. (We'll see an example of this below.)\n", | |
| "\n", | |
| "## From Counter to spreadsheet\n", | |
| "\n", | |
| "Our first task is going to be writing a Python program that writes out a CSV file, which we can then import into a spreadsheet program to draw a pretty graph. Here's a program that uses [TextBlob](https://textblob.readthedocs.io/en/dev/) to count how many times each part of speech occurs in a text.\n", | |
| "\n", | |
| "First, let's make sure that you have a copy of [this file](http://rwet.decontextualize.com/texts/genesis.txt) in the same directory as your Python script.)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 1, | |
| "metadata": { | |
| "collapsed": false | |
| }, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "--2017-07-19 12:28:56-- http://rwet.decontextualize.com/texts/genesis.txt\n", | |
| "Resolving rwet.decontextualize.com... 207.38.94.47\n", | |
| "Connecting to rwet.decontextualize.com|207.38.94.47|:80... connected.\n", | |
| "HTTP request sent, awaiting response... 200 OK\n", | |
| "Length: 4119 (4.0K) [text/plain]\n", | |
| "Saving to: ‘genesis.txt’\n", | |
| "\n", | |
| "genesis.txt 100%[===================>] 4.02K --.-KB/s in 0s \n", | |
| "\n", | |
| "2017-07-19 12:28:57 (187 MB/s) - ‘genesis.txt’ saved [4119/4119]\n", | |
| "\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "!wget http://rwet.decontextualize.com/texts/genesis.txt" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "For a fresh nltk / textblob install, this will also be necessary." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 5, | |
| "metadata": { | |
| "collapsed": false | |
| }, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml\n" | |
| ] | |
| }, | |
| { | |
| "data": { | |
| "text/plain": [ | |
| "True" | |
| ] | |
| }, | |
| "execution_count": 5, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "import nltk\n", | |
| "nltk.download()" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "The script takes this data and then produces a CSV file with each part of speech and its count.\n", | |
| "\n", | |
| "(If you're not familiar with the `Counter` object, [return to this tutorial for a refresher](https://gist.github.com/aparrish/4b096b95bfbd636733b7b9f2636b8cf4). If you're not familiar with TextBlob, [I wrote a tutorial about it here](http://rwet.decontextualize.com/book/textblob/).)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 2, | |
| "metadata": { | |
| "collapsed": true | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "import csv\n", | |
| "from collections import Counter\n", | |
| "from textblob import TextBlob\n", | |
| "\n", | |
| "# read in the contents of genesis.txt\n", | |
| "text = open(\"genesis.txt\").read()\n", | |
| "blob = TextBlob(text)\n", | |
| "\n", | |
| "all_pos = list()\n", | |
| "for word, pos in blob.tags:\n", | |
| " all_pos.append(pos)\n", | |
| " \n", | |
| "pos_count = Counter(all_pos)\n", | |
| "\n", | |
| "# open the file \"genesis_pos.csv\" for writing...\n", | |
| "with open(\"genesis_pos.csv\", \"w\") as csvfile:\n", | |
| " # create a csv \"writer\" object\n", | |
| " writer = csv.writer(csvfile)\n", | |
| " # write the header row\n", | |
| " writer.writerow([\"part of speech\", \"count\"])\n", | |
| " # write out each pair as a line in the CSV\n", | |
| " for item, count in pos_count.most_common():\n", | |
| " writer.writerow([item, count])" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "The `with` clause may look new to you! We won't go over the technical details here ([though you can read a good introduction here](http://effbot.org/zone/python-with-statement.htm)). Essentially, the line starting with the `with` keyword creates a new file object with the filename `genesis_pos.csv` and assigns it to a variable called `csvfile`. \n", | |
| "\n", | |
| "In the indented section beneath, we create a special kind of Python value called a \"CSV writer\" (from the [`csv` module](https://docs.python.org/2/library/csv.html), passing it the file object as a parameter. The writer object has one method of importance to us, `.writerow()`, which takes a list of values. The writer object takes the values passed to `.writerow()`, formats them by converting them to strings and inserting commas between them, and writes them to the file opened in the initial line of the `with` clause. The first call to `.writerow()` writes the *header row*, which our spreadsheet software will interpret as column names. Inside a `for` loop, we visit each key/value pair from the `pos_count` Counter object and write it to the file.\n", | |
| "\n", | |
| "The resulting file (`genesis_pos.csv`) looks like this:\n", | |
| "\n", | |
| "<pre>\n", | |
| "part of speech,count\n", | |
| "NN,158\n", | |
| "DT,124\n", | |
| "CC,97\n", | |
| "IN,97\n", | |
| "VBD,46\n", | |
| "VB,42\n", | |
| "PRP,38\n", | |
| "NNP,38\n", | |
| "JJ,25\n", | |
| "VBN,24\n", | |
| "NNS,24\n", | |
| "RB,18\n", | |
| "VBG,14\n", | |
| "PRP$,13\n", | |
| "TO,11\n", | |
| "WDT,5\n", | |
| "VBZ,5\n", | |
| "EX,5\n", | |
| "VBP,4\n", | |
| "CD,2\n", | |
| "WP$,2\n", | |
| "MD,2\n", | |
| "JJR,2\n", | |
| "WRB,1\n", | |
| "</pre>\n", | |
| "\n", | |
| "You can now import this CSV file into the spreadsheet program of your choice! [I made this spreadsheet and graph](https://docs.google.com/spreadsheets/d/1HtEnMb0JMFWCSo-nD-53Mv94gdXM4lX9LcB4VkiWNIU/edit?usp=sharing) by loading the CSV into Google Sheets." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## From spreadsheet to Python\n", | |
| "\n", | |
| "t's true that your average spreadsheet program is good enough for basic number crunching tasks, bringing your spreadsheet data into Python opens up a world of computational possibilities! We're going to use [this data about dog names in NYC](https://project.wnyc.org/dogs-of-nyc/), which I've compiled in a Google Sheets table [here](https://docs.google.com/spreadsheets/d/1SmxsgSAcNqYahUXa9-XpecTg-fhy_Ko_-RMD3oT2Ukw/edit?usp=sharing).\n", | |
| "\n", | |
| "To export this data as CSV, select `File > Download as... > Comma-separated values`. Copy it into the directory where your Python scripts are running. (I renamed my copy to `dogs-of-nyc.csv`, just to make it easier to type.) The first few lines of the file look like this:\n", | |
| " \n", | |
| " dog_name,gender,breed,birth,dominant_color,secondary_color,third_color,spayed_or_neutered,guard_or_trained,borough,zip_code\n", | |
| " Buddy,M,Afghan Hound,Jan-00,BRINDLE,BLACK,n/a,Yes,No,Manhattan,10003\n", | |
| " Nicole,F,Afghan Hound,Jul-00,BLACK,n/a,n/a,Yes,No,Manhattan,10021\n", | |
| " Abby,F,Afghan Hound,Nov-00,BLACK,TAN,n/a,Yes,No,Manhattan,10034\n", | |
| " Chloe,F,Afghan Hound,1/2/2017,WHITE,BLOND,n/a,Yes,No,Manhattan,10024\n", | |
| " Jazzle,F,Afghan Hound,10/2/2017,BLOND,WHITE,BLACK,Yes,No,Manhattan,10022\n", | |
| " Trouble,M,Afghan Hound,1/3/2017,BLOND,WHITE,BLACK,Yes,No,Bronx,10472\n", | |
| " Grace,F,Afghan Hound,6/3/2017,CREAM,n/a,n/a,Yes,No,Manhattan,10021\n", | |
| " Sisu,M,Afghan Hound,10/4/2017,BLACK,WHITE,GRAY,No,No,Manhattan,10023\n", | |
| " \n", | |
| "Again, if you're not used to reading CSV files, this looks like a pretty big mess at first. But there's a method to the mess! The top row is a header row that describes the data in the table's columns. The rest of the rows are data, with each cell from the row in a separate column.\n", | |
| "\n", | |
| "Here's a program that reads in this data and then prints out five random dog names:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 4, | |
| "metadata": { | |
| "collapsed": false | |
| }, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "Rose, Edie, Ichi, Q, Dixie\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "import csv\n", | |
| "import random\n", | |
| "\n", | |
| "all_names = list()\n", | |
| "\n", | |
| "for row in csv.DictReader(open(\"dogs-of-nyc.csv\")):\n", | |
| " all_names.append(row['dog_name'])\n", | |
| " \n", | |
| "print(\", \".join(random.sample(all_names, 5)))" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "This program is a little tricky! Here's how it works: The call to `csv.DictReader(open(\"dogs-of-nyc.csv\"))` returns a list-like object of *rows* from the CSV with the given filename. Inside the loop, the variable `row` is a dictionary that has a key for every column named in the header row of the CSV file. (So, if there's a column in the header row called `borough`, you can access the value for that column in the current row using the expression `row['borough']`.) This program simply loops through every row in the table and makes a list of every name, then prints out five random names at the end." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "As another quick example, here's a program that prints out the most common dog colors:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 5, | |
| "metadata": { | |
| "collapsed": false | |
| }, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "BLACK 23578\n", | |
| "WHITE 18621\n", | |
| "BROWN 9181\n", | |
| "TAN 8942\n", | |
| "BLOND 4241\n", | |
| "GRAY 2777\n", | |
| "BRINDLE 2627\n", | |
| "RUST 2263\n", | |
| "BLUE 1378\n", | |
| "FAWN 1367\n", | |
| "RED 1338\n", | |
| "CREAM 1186\n", | |
| "GOLD 1015\n", | |
| "n/a 771\n", | |
| "ORANGE 705\n", | |
| "CHOCOLATE 538\n", | |
| "APRICOT 468\n", | |
| "SILVER 353\n", | |
| "BLUE MERLE 103\n", | |
| "CHARCOAL 90\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "import csv\n", | |
| "from collections import Counter\n", | |
| "\n", | |
| "all_colors = list()\n", | |
| "\n", | |
| "for row in csv.DictReader(open(\"dogs-of-nyc.csv\")):\n", | |
| " all_colors.append(row['dominant_color'])\n", | |
| " \n", | |
| "color_count = Counter(all_colors)\n", | |
| "\n", | |
| "for item, count in color_count.most_common():\n", | |
| " print(item, count)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "And here's a program that generates a few random dogs and prints a little sentence about each:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 7, | |
| "metadata": { | |
| "collapsed": false | |
| }, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "Hi! My name is Kuma. My coat is gray and I'm a kinda smelly dog. I live in Manhattan. Ruff-ruff!\n", | |
| "Hi! My name is Luke. My coat is white and I'm a playful doggy. I live in Staten Island. Arf!\n", | |
| "Hi! My name is Princess. My coat is white and I'm a loyal Beagle. I live in Manhattan. Arf!\n", | |
| "Hi! My name is Max. My coat is tan and I'm a clever Pomeranian. I live in Manhattan. Bow-wow!\n", | |
| "Hi! My name is Missy. My coat is rust and I'm a happy Labrador Retriever. I live in Brooklyn. Arf!\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "import csv\n", | |
| "\n", | |
| "all_names = list()\n", | |
| "all_colors = list()\n", | |
| "all_breeds = list()\n", | |
| "all_boroughs = list()\n", | |
| "\n", | |
| "for row in csv.DictReader(open(\"dogs-of-nyc.csv\")):\n", | |
| " if row['dog_name'] != 'n/a':\n", | |
| " all_names.append(row['dog_name'])\n", | |
| " all_colors.append(row['dominant_color'].lower())\n", | |
| " if row['secondary_color'] != 'n/a':\n", | |
| " all_colors.append(row['secondary_color'].lower())\n", | |
| " if row['third_color'] != 'n/a':\n", | |
| " all_colors.append(row['third_color'].lower())\n", | |
| " all_breeds.append(row['breed'].title())\n", | |
| " all_boroughs.append(row['borough'].title())\n", | |
| " \n", | |
| "for i in range(5):\n", | |
| " name = random.choice(all_names)\n", | |
| " color = random.choice(all_colors)\n", | |
| " breed = random.choice(all_breeds)\n", | |
| " borough = random.choice(all_boroughs)\n", | |
| " if borough == \"Bronx\":\n", | |
| " borough = \"the Bronx\"\n", | |
| " if breed == \"Mixed/Other\":\n", | |
| " breed = random.choice([\"pup\", \"puppy\", \"dog\", \"doggo\", \"doggy\"])\n", | |
| " adj = random.choice([\"happy\", \"playful\", \"loyal\", \"fluffy\", \"friendly\", \"clever\", \"clumsy\", \"kinda smelly\"])\n", | |
| " breedtxt = \"I'm a {} {}\".format(adj, breed)\n", | |
| " bark = random.choice([\"Woof\", \"Arf\", \"Bow-wow\", \"Yip\", \"Ruff-ruff\"])\n", | |
| " print(\"Hi! My name is {}. My coat is {} and {}. I live in {}. {}!\".format(name, color, breedtxt, borough, bark))" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Further reading and resources\n", | |
| "\n", | |
| "* [Chapter 14](https://automatetheboringstuff.com/chapter14/) from [Automate the Boring Stuff with Python](https://automatetheboringstuff.com/) is a great overview of the basics of working with CSV files in Python.\n", | |
| "* Hungry for more data? You're in luck! There's *tons* of data out there in CSV format just waiting for you to play around with it. Try [this list of awesome public data sets](https://github.com/caesar0301/awesome-public-datasets) or [the structured archive](https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit) from the [Data Is Plural newsletter](https://tinyletter.com/data-is-plural).\n", | |
| "* CSV files are so useful and popular that there's a whole [conference named after them](https://csvconf.com/)!" | |
| ] | |
| } | |
| ], | |
| "metadata": { | |
| "kernelspec": { | |
| "display_name": "Python 3", | |
| "language": "python", | |
| "name": "python3" | |
| }, | |
| "language_info": { | |
| "codemirror_mode": { | |
| "name": "ipython", | |
| "version": 3 | |
| }, | |
| "file_extension": ".py", | |
| "mimetype": "text/x-python", | |
| "name": "python", | |
| "nbconvert_exporter": "python", | |
| "pygments_lexer": "ipython3", | |
| "version": "3.6.0" | |
| } | |
| }, | |
| "nbformat": 4, | |
| "nbformat_minor": 1 | |
| } |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment