Skip to content

Instantly share code, notes, and snippets.

@embray
Created August 12, 2014 14:27
Show Gist options
  • Save embray/079f1ae87d527e02a80f to your computer and use it in GitHub Desktop.
Save embray/079f1ae87d527e02a80f to your computer and use it in GitHub Desktop.
IPython notebook from https://github.com/swcarpentry/bc/blob/master/novice/python/01-numpy.ipynb converted to Python module with nbconvert
# coding: utf-8
### Analyzing Patient Data
# We are studying inflammation in patients who have been given a new treatment for arthritis,
# and need to analyze the first dozen data sets.
# The data sets are stored in [comma-separated values](../../gloss.html#comma-separeted-values) (CSV) format:
# each row holds information for a single patient,
# and the columns represent successive days.
# The first few rows of our first file look like this:
#
# 0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0
# 0,1,2,1,2,1,3,2,2,6,10,11,5,9,4,4,7,16,8,6,18,4,12,5,12,7,11,5,11,3,3,5,4,4,5,5,1,1,0,1
# 0,1,1,3,3,2,6,2,5,9,5,7,4,5,4,15,5,11,9,10,19,14,12,17,7,12,11,7,4,2,10,5,4,2,2,3,2,2,1,1
# 0,0,2,0,4,2,2,1,6,7,10,7,9,13,8,8,15,10,10,7,17,4,4,7,6,15,6,4,9,11,3,5,6,3,3,4,2,3,2,1
# 0,1,1,3,3,1,3,5,2,4,4,7,6,5,3,10,8,10,6,17,9,14,9,7,13,9,12,6,7,7,9,6,3,2,2,4,2,0,1,1
# We want to:
#
# * load that data into memory,
# * calculate the average inflammation per day across all patients, and
# * plot the result.
#
# To do all that, we'll have to learn a little bit about programming.
# #### Objectives
#
# * Explain what a library is, and what libraries are used for.
# * Load a Python library and use the things it contains.
# * Read tabular data from a file into a program.
# * Assign values to variables.
# * Select individual values and subsections from data.
# * Perform operations on arrays of data.
# * Display simple graphs.
#### Loading Data
# Words are useful,
# but what's more useful are the sentences and stories we use them to build.
# Similarly,
# while a lot of powerful tools are built into languages like Python,
# even more lives in the [libraries](../../gloss.html#library) they are used to build.
#
# In order to load our inflammation data,
# we need to [import](../../gloss.html#import) a library called NumPy
# that knows how to operate on matrices:
# In[1]:
import numpy
# Importing a library is like getting a piece of lab equipment out of a storage locker
# and setting it up on the bench.
# Once it's done,
# we can ask the library to read our data file for us:
# In[2]:
numpy.loadtxt(fname='inflammation-01.csv', delimiter=',')
# The expression `numpy.loadtxt(...)` is a [function call](../../gloss.html#function-call)
# that asks Python to run the function `loadtxt` that belongs to the `numpy` library.
# This [dotted notation](../../gloss.html#dotted-notation) is used everywhere in Python
# to refer to the parts of things as `whole.part`.
#
# `numpy.loadtxt` has two [parameters](../../gloss.html#parameter):
# the name of the file we want to read,
# and the [delimiter](../../gloss.html#delimiter) that separates values on a line.
# These both need to be character strings (or [strings](../../gloss.html#string) for short),
# so we put them in quotes.
#
# When we are finished typing and press Shift+Enter,
# the notebook runs our command.
# Since we haven't told it to do anything else with the function's output,
# the notebook displays it.
# In this case,
# that output is the data we just loaded.
# By default,
# only a few rows and columns are shown
# (with `...` to omit elements when displaying big arrays).
# To save space,
# Python displays numbers as `1.` instead of `1.0`
# when there's nothing interesting after the decimal point.
# Our call to `numpy.loadtxt` read our file,
# but didn't save the data in memory.
# To do that,
# we need to [assign](../../gloss.html#assignment) the array to a [variable](../../gloss.html#variable).
# A variable is just a name for a value,
# such as `x`, `current_temperature`, or `subject_id`.
# We can create a new variable simply by assigning a value to it using `=`:
# In[3]:
weight_kg = 55
# Once a variable has a value, we can print it:
# In[4]:
print weight_kg
# and do arithmetic with it:
# In[5]:
print 'weight in pounds:', 2.2 * weight_kg
# We can also change a variable's value by assigning it a new one:
# In[6]:
weight_kg = 57.5
print 'weight in kilograms is now:', weight_kg
# As the example above shows,
# we can print several things at once by separating them with commas.
#
# If we imagine the variable as a sticky note with a name written on it,
# assignment is like putting the sticky note on a particular value:
# <img src="files/img/python-sticky-note-variables-01.svg" alt="Variables as Sticky Notes" />
# This means that assigning a value to one variable does *not* change the values of other variables.
# For example,
# let's store the subject's weight in pounds in a variable:
# In[7]:
weight_lb = 2.2 * weight_kg
print 'weight in kilograms:', weight_kg, 'and in pounds:', weight_lb
# <img src="files/img/python-sticky-note-variables-02.svg" alt="Creating Another Variable" />
# and then change `weight_kg`:
# In[8]:
weight_kg = 100.0
print 'weight in kilograms is now:', weight_kg, 'and weight in pounds is still:', weight_lb
# <img src="files/img/python-sticky-note-variables-03.svg" alt="Updating a Variable" />
# Since `weight_lb` doesn't "remember" where its value came from,
# it isn't automatically updated when `weight_kg` changes.
# This is different from the way spreadsheets work.
#
# Now that we know how to assign things to variables,
# let's re-run `numpy.loadtxt` and save its result:
# In[9]:
data = numpy.loadtxt(fname='inflammation-01.csv', delimiter=',')
# This statement doesn't produce any output because assignment doesn't display anything.
# If we want to check that our data has been loaded,
# we can print the variable's value:
# In[10]:
print data
# #### Challenges
#
# 1. Draw diagrams showing what variables refer to what values after each statement in the following program:
#
# ~~~python
# mass = 47.5
# age = 122
# mass = mass * 2.0
# age = age - 20
# ~~~
#
# 1. What does the following program print out?
#
# ~~~python
# first, second = 'Grace', 'Hopper'
# third, fourth = second, first
# print third, fourth
# ~~~
#### Manipulating Data
# Now that our data is in memory,
# we can start doing things with it.
# First,
# let's ask what [type](../../gloss.html#data-type) of thing `data` refers to:
# In[11]:
print type(data)
# The output tells us that `data` currently refers to an N-dimensional array created by the NumPy library.
# We can see what its [shape](../../gloss.html#shape) is like this:
# In[12]:
print data.shape
# This tells us that `data` has 60 rows and 40 columns.
# `data.shape` is a [member](../../gloss.html#member) of `data`,
# i.e.,
# a value that is stored as part of a larger value.
# We use the same dotted notation for the members of values
# that we use for the functions in libraries
# because they have the same part-and-whole relationship.
# If we want to get a single value from the matrix,
# we must provide an [index](../../gloss.html#index) in square brackets,
# just as we do in math:
# In[13]:
print 'first value in data:', data[0, 0]
# In[14]:
print 'middle value in data:', data[30, 20]
# The expression `data[30, 20]` may not surprise you,
# but `data[0, 0]` might.
# Programming languages like Fortran and MATLAB start counting at 1,
# because that's what human beings have done for thousands of years.
# Languages in the C family (including C++, Java, Perl, and Python) count from 0
# because that's simpler for computers to do.
# As a result,
# if we have an M&times;N array in Python,
# its indices go from 0 to M-1 on the first axis
# and 0 to N-1 on the second.
# It takes a bit of getting used to,
# but one way to remember the rule is that
# the index is how many steps we have to take from the start to get the item we want.
#
# > #### In the Corner
# >
# > What may also surprise you is that when Python displays an array,
# > it shows the element with index `[0, 0]` in the upper left corner
# > rather than the lower left.
# > This is consistent with the way mathematicians draw matrices,
# > but different from the Cartesian coordinates.
# > The indices are (row, column) instead of (column, row) for the same reason.
# An index like `[30, 20]` selects a single element of an array,
# but we can select whole sections as well.
# For example,
# we can select the first ten days (columns) of values
# for the first four (rows) patients like this:
# In[15]:
print data[0:4, 0:10]
# The [slice](../../gloss.html#slice) `0:4` means,
# "Start at index 0 and go up to, but not including, index 4."
# Again,
# the up-to-but-not-including takes a bit of getting used to,
# but the rule is that the difference between the upper and lower bounds is the number of values in the slice.
#
# We don't have to start slices at 0:
# In[16]:
print data[5:10, 0:10]
# and we don't have to take all the values in the slice---if we provide a [stride](../../gloss.html#stride),
# Python takes values spaced that far apart:
# In[17]:
print data[0:10:3, 0:10:2]
# Here,
# we have taken rows 0, 3, 6, and 9,
# and columns 0, 2, 4, 6, and 8.
# (Again, we always include the lower bound,
# but stop when we reach or cross the upper bound.)
# We also don't have to include the upper and lower bound on the slice.
# If we don't include the lower bound,
# Python uses 0 by default;
# if we don't include the upper,
# the slice runs to the end of the axis,
# and if we don't include either
# (i.e., if we just use ':' on its own),
# the slice includes everything:
# In[18]:
small = data[:3, 36:]
print 'small is:'
print small
# Arrays also know how to perform common mathematical operations on their values.
# If we want to find the average inflammation for all patients on all days,
# for example,
# we can just ask the array for its mean value
# In[19]:
print data.mean()
# `mean` is a [method](../../gloss.html#method) of the array,
# i.e.,
# a function that belongs to it
# in the same way that the member `shape` does.
# If variables are nouns, methods are verbs:
# they are what the thing in question knows how to do.
# This is why `data.shape` doesn't need to be called
# (it's just a thing)
# but `data.mean()` does
# (it's an action).
# It is also why we need empty parentheses for `data.mean()`:
# even when we're not passing in any parameters,
# parentheses are how we tell Python to go and do something for us.
#
# NumPy arrays have lots of useful methods:
# In[20]:
print 'maximum inflammation:', data.max()
print 'minimum inflammation:', data.min()
print 'standard deviation:', data.std()
# When analyzing data,
# though,
# we often want to look at partial statistics,
# such as the maximum value per patient
# or the average value per day.
# One way to do this is to select the data we want to create a new temporary array,
# then ask it to do the calculation:
# In[21]:
patient_0 = data[0, :] # 0 on the first axis, everything on the second
print 'maximum inflammation for patient 0:', patient_0.max()
# We don't actually need to store the row in a variable of its own.
# Instead, we can combine the selection and the method call:
# In[22]:
print 'maximum inflammation for patient 2:', data[2, :].max()
# What if we need the maximum inflammation for *all* patients,
# or the average for each day?
# As the diagram below shows,
# we want to perform the operation across an axis:
# <img src="files/img/python-operations-across-axes.svg" alt="Operations Across Axes" />
# To support this,
# most array methods allow us to specify the axis we want to work on.
# If we ask for the average across axis 0,
# we get:
# In[23]:
print data.mean(axis=0)
# As a quick check,
# we can ask this array what its shape is:
# In[24]:
print data.mean(axis=0).shape
# The expression `(40,)` tells us we have an N&times;1 vector,
# so this is the average inflammation per day for all patients.
# If we average across axis 1, we get:
# In[25]:
print data.mean(axis=1)
# which is the average inflammation per patient across all days.
# #### Challenges
#
# A subsection of an array is called a [slice](../../gloss.html#slice).
# We can take slices of character strings as well:
# In[26]:
element = 'oxygen'
print 'first three characters:', element[0:3]
print 'last three characters:', element[3:6]
# 1. What is the value of `element[:4]`?
# What about `element[4:]`?
# Or `element[:]`?
#
# 1. What is `element[-1]`?
# What is `element[-2]`?
# Given those answers,
# explain what `element[1:-1]` does.
#
# 1. The expression `element[3:3]` produces an [empty string](../../gloss.html#empty-string),
# i.e., a string that contains no characters.
# If `data` holds our array of patient data,
# what does `data[3:3, 4:4]` produce?
# What about `data[3:3, :]`?
#### Plotting
# The mathematician Richard Hamming once said,
# "The purpose of computing is insight, not numbers,"
# and the best way to develop insight is often to visualize data.
# Visualization deserves an entire lecture (or course) of its own,
# but we can explore a few features of Python's `matplotlib` here.
# First,
# let's tell the IPython Notebook that we want our plots displayed inline,
# rather than in a separate viewing window:
# In[27]:
get_ipython().magic(u'matplotlib inline')
# The `%` at the start of the line signals that this is a command for the notebook,
# rather than a statement in Python.
# Next,
# we will import the `pyplot` module from `matplotlib`
# and use two of its functions to create and display a heat map of our data:
# In[28]:
from matplotlib import pyplot
pyplot.imshow(data)
pyplot.show()
# Blue regions in this heat map are low values, while red shows high values.
# As we can see,
# inflammation rises and falls over a 40-day period.
# Let's take a look at the average inflammation over time:
# In[29]:
ave_inflammation = data.mean(axis=0)
pyplot.plot(ave_inflammation)
pyplot.show()
# Here,
# we have put the average per day across all patients in the variable `ave_inflammation`,
# then asked `pyplot` to create and display a line graph of those values.
# The result is roughly a linear rise and fall,
# which is suspicious:
# based on other studies,
# we expect a sharper rise and slower fall.
# Let's have a look at two other statistics:
# In[30]:
print 'maximum inflammation per day'
pyplot.plot(data.max(axis=0))
pyplot.show()
print 'minimum inflammation per day'
pyplot.plot(data.min(axis=0))
pyplot.show()
# The maximum value rises and falls perfectly smoothly,
# while the minimum seems to be a step function.
# Neither result seems particularly likely,
# so either there's a mistake in our calculations
# or something is wrong with our data.
# #### Challenges
#
# 1. Why do all of our plots stop just short of the upper end of our graph?
# Why are the vertical lines in our plot of the minimum inflammation per day not vertical?
#
# 1. Create a plot showing the standard deviation of the inflammation data for each day across all patients.
#### Wrapping Up
# It's very common to create an [alias](../../gloss.html#alias) for a library when importing it
# in order to reduce the amount of typing we have to do.
# Here are our three plots side by side using aliases for `numpy` and `pyplot`:
# In[31]:
import numpy as np
from matplotlib import pyplot as plt
data = np.loadtxt(fname='inflammation-01.csv', delimiter=',')
plt.figure(figsize=(10.0, 3.0))
plt.subplot(1, 3, 1)
plt.ylabel('average')
plt.plot(data.mean(0))
plt.subplot(1, 3, 2)
plt.ylabel('max')
plt.plot(data.max(0))
plt.subplot(1, 3, 3)
plt.ylabel('min')
plt.plot(data.min(0))
plt.tight_layout()
plt.show()
# The first two lines re-load our libraries as `np` and `plt`,
# which are the aliases most Python programmers use.
# The call to `loadtxt` reads our data,
# and the rest of the program tells the plotting library
# how large we want the figure to be,
# that we're creating three sub-plots,
# what to draw for each one,
# and that we want a tight layout.
# (Perversely,
# if we leave out that call to `plt.tight_layout()`,
# the graphs will actually be squeezed together more closely.)
# #### Challenges
#
# 1. Modify the program to display the three plots on top of one another instead of side by side.
# #### Key Points
#
# * Import a library into a program using `import libraryname`.
# * Use the `numpy` library to work with arrays in Python.
# * Use `variable = value` to assign a value to a variable in order to record it in memory.
# * Variables are created on demand whenever a value is assigned to them.
# * Use `print something` to display the value of `something`.
# * The expression `array.shape` gives the shape of an array.
# * Use `array[x, y]` to select a single element from an array.
# * Array indices start at 0, not 1.
# * Use `low:high` to specify a slice that includes the indices from `low` to `high-1`.
# * All the indexing and slicing that works on arrays also works on strings.
# * Use `# some kind of explanation` to add comments to programs.
# * Use `array.mean()`, `array.max()`, and `array.min()` to calculate simple statistics.
# * Use `array.mean(axis=0)` or `array.mean(axis=1)` to calculate statistics across the specified axis.
# * Use the `pyplot` library from `matplotlib` for creating simple visualizations.
# #### Next Steps
#
# Our work so far has convinced us that something's wrong with our first data file.
# We would like to check the other 11 the same way,
# but typing in the same commands repeatedly is tedious and error-prone.
# Since computers don't get bored (that we know of),
# we should create a way to do a complete analysis with a single command,
# and then figure out how to repeat that step once for each file.
# These operations are the subjects of the next two lessons.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment