These notes are adapted from the official Carpentries documentation for Data Analysis and Visualization in Python for Ecologists: Before we start. The material has been reordered, some of it has been expanded, and some Smithsonian-specific details have been added. The original lessons are licensed under CC-BY 4.0 2018–2020 by The Carpentries. These adapted notes are available for use under the same license.
Official lesson link: https://datacarpentry.org/python-ecology-lesson/
Jupyter documentation: https://datacarpentry.org/python-ecology-lesson/jupyter_notebooks/
Amanda's instructor notes for the first part, which are slightly reorganized and modified: https://gist.github.com/amdevine/39b4cab8d12952a3b9ceb9e16496c2cb
MyBinder Jupyter link, if you are having trouble running Jupyter on your computer: https://mybinder.org/v2/gh/SmithsonianWorkshops/binders/python
Portal project data (surveys.csv): https://ndownloader.figshare.com/files/2292172
Import surveys.csv directly from the web into Python:
import pandas as pd
surveys = pd.read_csv('https://ndownloader.figshare.com/files/2292172')
Please add your name and one work/personal project where you could apply OpenRefine or SQL techniques.
Challenge Question: With the person/people next to you, discuss: What is programming? What is coding? What are the differences between these terms, if any?
Programming is the process of writing “programs” that a computer can execute and produce some (useful) output. Programming is a multi-step process that involves the following steps:
1. Identifying the aspects of the real-world problem that can be solved computationally
2. Identifying (the best) computational solution
3. Implementing the solution in a specific computer language
4. Testing, validating, and adjusting the implemented solution
While “Programming” refers to all of the above steps, “Coding” refers to step 3 only: “Implementing the solution in a specific computer language”.
Python is a general-purpose programming language that supports rapid development of data analytics applications. The word “Python” is used to refer to both the programming language and the tool that executes scripts written in the Python language.
So, why do you need Python for data analysis?
- Easy to learn: Python is easier to learn than other programming languages. This is important because lower barriers mean it is easier for new members of the community to get up to speed.
- Reproducibility: Reproducibility is the ability to obtain the same results using the same dataset(s) and analysis pipeline. Data analysis written as a Python script can be reproduced on any platform. Moreover, if you collect more data or correct existing data, you can quickly and easily re-run your analysis! Also, an increasing number of journals and funding agencies expect analyses to be reproducible, so knowing Python will give you an edge with these requirements.
- Versatility: Python is a versatile language that can accomplish many different tasks in many different ways. For example, you can write Python programs that can be run in the command line, or can be incorporated into other programs. You can use Python to generate manuscripts, and if you update your data, your manuscript figures can automatically update. You can use Python to create web applications. You can even run snippets of Python code in OpenRefine to clean data!
- Interdisciplinary and extensible: Python provides a nice framework, or a nice set of common tools, that people can use to run analyses and do work in different fields and areas of expertise. Python has an active development community, and people have written specific Python packages to analyze bioinformatics data, or pull stock prices from the web, or even calculate fantasy sports statistics!
- Python has a large and active community: Thousands of people use Python daily. People are constantly creating new Python packages and libraries and updating and improving existing ones. And if you have questions or run into problems, there are a ton of tutorials and places to get help with your code.
- Free, Open-Source, and Cross-Platform: Anyone can download and use Python for free on any kind of computer. Any work you do can be shared with collaborators, or made available on the web for people who are interested.
The material we cover during this workshop will give you an initial taste of how you can use Python to analyze data for your own research. However, you will need to learn more to do advanced operations such as cleaning your dataset, using statistical methods, or creating beautiful graphics. The best way to become proficient and efficient at Python, as with any other tool, is to use it to address your actual research questions. As a beginner, it can feel daunting to write a script from scratch. Because many people make their code available online, modifying existing code to suit your purpose can be an easier way to get started.
- Python documentation
- Tutorials
- StackOverflow
- Smithsonian resources! (https://github.com/SmithsonianWorkshops/workshop-slides/blob/main/data-carpentry-wrap-up.pdf)
- Carpentries Instructors
- Carpentries Slack
- SI DataScience
- Data Science Listserv
It is a good idea to keep a set of related data, analyses, and text in a single, well-organized folder. When your project is organized and in one place, it makes it easy to locate the files you need, and it makes it easy to refer to data files in your code. It also allows you to share your work with others, and for them to understand your workflow. This also makes it easy for future you, if you need to come back to a project in a few years and remember what you did, and which files were important!
Let's create a folder for the Python work we are doing today and tomorrow. If you already have a folder from this morning, you can use that one if you'd like. Just make sure you know where to find it!
The folder, or directory, that contains your project is called your working directory. Within your working directory, you want to create separate sub-directories, or sub-folders, for different components of your analysis.
Let's create a folder named data. For the sake of reproducibility, you should always keep a copy of your raw data. If you need to clean up data, try to document the changes you make; this could be in the form of a Python script, or an export of the transformations you did in OpenRefine saved as a JSON file. At any rate, it's good practice to keep your raw data and your cleaned data separate - you could create subfolders in your data folder, one named raw and one named cleaned. For this workshop, we'll just stick with a basic data folder.
Some other folders that could be useful are a folder containing documents, like manuscripts or README files, and a folder containing scripts, like your Python analysis scripts or your OpenRefine transformations. We won't create these folders right now, but it is a good practice to do so!
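The folder layout described above can also be created from Python itself. This is only a minimal sketch: the function name make_project_folders and the project name python-workshop are hypothetical, and the sub-folder names simply follow the suggestions in the text (for the workshop, a single data folder is enough). It runs in a temporary directory so that nothing is left behind on your computer.

```python
from pathlib import Path
import tempfile


def make_project_folders(project_dir):
    """Create the suggested sub-folders inside project_dir (hypothetical helper)."""
    project_dir = Path(project_dir)
    # Folder names follow the text above; adjust them to suit your project.
    subfolders = ["data/raw", "data/cleaned", "documents", "scripts"]
    for name in subfolders:
        # parents=True creates "data" before "data/raw";
        # exist_ok=True makes the call safe to run more than once.
        (project_dir / name).mkdir(parents=True, exist_ok=True)
    # Return every folder that now exists, relative to the project folder.
    return sorted(
        p.relative_to(project_dir).as_posix()
        for p in project_dir.rglob("*")
        if p.is_dir()
    )


# Demonstrate in a throwaway directory that is deleted afterwards.
with tempfile.TemporaryDirectory() as tmp:
    created = make_project_folders(Path(tmp) / "python-workshop")
    print(created)
```

To use this for a real project, you would pass the path of your own working directory instead of the temporary one.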
Anaconda is a distribution of Python, or a special way to package Python and make it easily available for people to download. Anaconda is nice because it provides you a copy of Python, then automatically installs many of the most popular Python packages on your computer, saving you the effort of having to install them manually. It also installs several useful tools for interacting with Python and writing code.
One nice tool Anaconda provides is a program called Spyder. Spyder is an IDE, or Integrated Development Environment. It's one Python equivalent to RStudio, if you've used that for R programming before. If we open up Spyder, we can see that it includes multiple useful windows we might need when programming.
- On the left is a window that we can use to edit Python scripts, and a play button we can use to run them.
- On the bottom right is an interactive Python console, which you can use to type Python commands and get instant feedback.
- On the upper right is a window that gives us some help documentation, shows us the files in our working directory, and shows us variables that we have created in our current Python environment.
For this workshop, we won't be using Spyder, but I just wanted to show you that it is one option for writing and running Python code. Instead, we're going to use a different program that comes installed with Anaconda, called Jupyter.
Jupyter, or Jupyter Notebook, is an open-source browser-based application. It allows you to create and share documents, or notebooks, that combine Python code, visualizations, and nicely formatted text.
See official Carpentries documentation: https://datacarpentry.org/python-ecology-lesson/jupyter_notebooks/
Basic Markdown syntax: https://personal.math.ubc.ca/~pwalls/math-python/jupyter/markdown/
You can get output from Python by running simple math expressions.
3 + 5
12 / 7
However, to do useful and interesting things, we need to assign values to variables. We do this by writing the name of the variable we want to create, then using the assignment operator, the equal sign, to assign it a value.
plot_id = 3
species = "DM"
weight_kg = 25.82
extinct = False
To review the value of a variable, we can type the name of the variable into a cell, then press Shift+Return.
species
Every variable that we create in Python has a type. To get the type of a variable, we can use a function type().
type(species)
type(plot_id)
type(weight_kg)
type(extinct)
The variable species is type str, which stands for string. Strings hold sequences of characters, which can be letters, numbers, punctuation or more exotic forms of text (even emoji!). plot_id is an int, or integer. weight_kg is a float, which is another term for a decimal number. extinct is a bool, or boolean, which holds the value True or False.
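As a quick check of the four types, the sketch below redefines the same variables and prints the type of each one. The last two lines use isinstance(), a built-in function that is not covered in these notes; it is included only as one possible way to ask a yes/no question about a type.

```python
# The same variables defined earlier in these notes
plot_id = 3
species = "DM"
weight_kg = 25.82
extinct = False

print(type(plot_id))    # <class 'int'>
print(type(species))    # <class 'str'>
print(type(weight_kg))  # <class 'float'>
print(type(extinct))    # <class 'bool'>

# isinstance() returns True or False instead of the type itself
print(isinstance(weight_kg, float))  # True
print(isinstance(species, int))      # False
```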
We can use another function called print() to print the value of a variable to our output.
print(weight_kg)
This may seem redundant, since we can just type the name of the variable to see its value. However, when you're running a script, just writing the name doesn't work. The only way you can print things to your output is to use the print function.
Now that Python has weight_kg in memory, we can do arithmetic with it. For instance, we may want to convert this weight into pounds (weight in pounds is 2.2 times the weight in kg). We can then save this conversion in a new variable, weight_lb.
weight_lb = 2.2 * weight_kg
We can use standard arithmetic operators to do math in Python: +, -, *, /, ** (power), % (modulo).
We can also perform arithmetic on a variable, and then save the result back into that same variable to update the value.
species_count = 10
species_count
species_count = species_count + 2
species_count
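To make the operators concrete, here is a small sketch that applies each one to the same pair of numbers. The variable names are made up for illustration. The last lines show +=, a shorthand (not introduced above) that should behave the same as writing species_count = species_count + 2.

```python
total = 7 + 2        # addition: 9
difference = 7 - 2   # subtraction: 5
product = 7 * 2      # multiplication: 14
quotient = 7 / 2     # division always gives a float: 3.5
power = 7 ** 2       # 7 to the power of 2: 49
remainder = 7 % 2    # remainder after dividing 7 by 2: 1

print(total, difference, product, quotient, power, remainder)

# Updating a variable in place with the shorthand +=
species_count = 10
species_count += 2   # same result as species_count = species_count + 2
print(species_count)  # 12
```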
We can also use comparison operators like <, >, ==, !=, <=, and >=. These operators compare two variables or values and return a boolean value to us.
weight_lb > 100
weight_lb > weight_kg
weight_lb == weight_kg
weight_lb != weight_kg
Finally, we can use the logical operators and, or, and not. The operators and and or combine two (or more) logical statements and return a boolean: and returns True only if all of the statements are true, while or returns True if any statement is true. not takes a single logical statement and returns its inverse.
higher_lb = weight_lb > weight_kg
higher_lb
extinct
higher_lb and extinct
higher_lb or extinct
higher_lb and not extinct
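The sketch below repeats the expressions above with their results written as comments, so you can compare them against your own notebook. It also illustrates one detail not mentioned above: when and and or appear in the same expression, Python evaluates and first, so parentheses can change the result. The names no_parens and with_parens are made up for this example.

```python
weight_kg = 25.82
weight_lb = 2.2 * weight_kg
extinct = False
higher_lb = weight_lb > weight_kg  # True, because 2.2 * 25.82 is larger than 25.82

print(higher_lb and extinct)       # False: both would have to be True
print(higher_lb or extinct)        # True: at least one is True
print(higher_lb and not extinct)   # True: not extinct is True

# "and" is evaluated before "or" unless parentheses say otherwise
no_parens = True or False and False        # read as True or (False and False)
with_parens = (True or False) and False
print(no_parens, with_parens)              # True False
```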
Challenge question:
We are going out into the field to make some recordings in one of our study plots. The forecasted temperature for the day is 50 degrees F. We want to know how the batteries in our recording equipment may perform in these temperatures.
1. Create a variable, temp_f, containing the forecasted temperature.
2. Convert this temperature to degrees Celsius, and store the result in a variable called temp_c. (To perform this conversion, subtract 32 from degrees Fahrenheit, then divide the result by 1.8.)
3. The ideal temperature range for our batteries is between 15 and 35 degrees C. Create a variable, temp_ideal, that contains a boolean value telling us whether the forecasted temperature falls in this range. (Try to use comparison operators and logical operators to create this variable.)
Lists are a common data structure to hold an ordered sequence of elements. Each element can be accessed by an index, which is its position in the list. Note that Python indexes start with 0.
numbers = [1, 2, 3]
numbers[0]
You can use a for loop to access the elements in a list or other Python data structure one at a time.
for num in numbers:
    print(num)
To add elements to the end of a list, we can use the append method. Methods are commands specific to a particular Python object (a list, for example). We can invoke a method using the dot . followed by the method name and a list of arguments in parentheses. Let’s look at an example using append:
numbers.append(4)
print(numbers)
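Putting the list operations together, the sketch below indexes a list, appends to it, and loops over it. Two things go slightly beyond the text above and are included only as optional extras: negative indexes, which count from the end of the list, and the built-in enumerate(), which yields each position together with its element.

```python
numbers = [1, 2, 3]

first = numbers[0]    # indexes start at 0, so this is 1
last = numbers[-1]    # negative indexes count from the end, so this is 3

numbers.append(4)     # the list is changed in place
length = len(numbers) # 4 elements after appending

# enumerate() pairs each position with the element stored there
for position, num in enumerate(numbers):
    print(position, num)
```

Note that the body of the for loop is indented. Python uses that indentation to decide which lines belong to the loop.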
A tuple is similar to a list in that it’s an ordered sequence of elements. However, tuples cannot be changed once created (they are “immutable”). Tuples are created by placing comma-separated values inside parentheses ().
# Tuples use parentheses
a_tuple = (1, 2, 3)
another_tuple = ('blue', 'green', 'red')
# Note: lists use square brackets
a_list = [1, 2, 3]
Challenge question:
1. What happens when you execute a_list[1] = 5?
2. What happens when you execute a_tuple[2] = 5?
3. What does type(a_tuple) tell you about a_tuple?
A dictionary is a container that holds pairs of objects - keys and values. Dictionaries are created with curly braces. Inside the braces, each key is separated from its value by a colon, and the key and value pairs are separated by commas.
translation = {'one': 'first', 'two': 'second'}
Dictionaries work a lot like lists - you can index them to get values out of them. Unlike lists, however, you provide the key as the index, not the position of the item.
translation['one']
translation[0]  # raises a KeyError, because 0 is not a key in this dictionary
You can think about a key as a name or unique identifier for the value it corresponds to.
To add an item to the dictionary we assign a value to a new key:
translation['three'] = 'third'
translation
You can even use for loops with dictionaries. Because we have two components to each item (the key and the value), we'll use two temporary variables to access each of these components. We are also using the .items() method to access the full key and value pairs of each item.
for key, value in translation.items():
    print(key, "translates to", value)
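Because asking for a key that does not exist raises an error, it can help to check first. The sketch below shows three ways to do that, none of which are covered above, so treat them as optional: the in operator, the .get() method with a fallback value, and a try/except block. The fallback text 'unknown' and the variable names are made up for this example.

```python
translation = {'one': 'first', 'two': 'second'}
translation['three'] = 'third'

# "in" checks whether something is a key of the dictionary
has_one = 'one' in translation    # True
has_zero = 0 in translation       # False: positions are not keys

# .get() returns a fallback value instead of raising an error
missing = translation.get('four', 'unknown')
print(missing)                    # unknown

# Indexing with a key that does not exist raises KeyError
try:
    translation[0]
except KeyError as err:
    message = f"no such key: {err}"
    print(message)                # no such key: 0

# Looping over a dictionary directly gives its keys, in insertion order
keys = list(translation)
print(keys)                       # ['one', 'two', 'three']
```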
Challenge question:
Create a new dictionary:
rev = {'first': 'one', 'second': 'two', 'third': 'three'}
1. Print the value of the rev dictionary to the screen.
2. Reassign the value that corresponds to the key second so that it no longer reads “two” but instead 2.
3. Print the value of rev to the screen again to see if the value has changed.
Functions are “canned scripts” that automate more complicated sets of commands. Many functions are predefined, or can be made available by importing Python packages, which we'll discuss later. A function usually takes one or more inputs called arguments. Functions often (but not always) return a value, which can be assigned to its own variable.
We have been using functions like type() and print() to examine our variables. We can also write our own functions, to make it easier to carry out operations we'd like to do in our analyses.
Defining a section of code as a function in Python is done using the def keyword. For example, a function that takes two arguments and returns their sum can be defined as:
def add_function(a, b):
    result = a + b
    return result
z = add_function(20, 22)
print(z)
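As a second illustration, the kilogram-to-pound conversion from earlier in these notes could be wrapped in a function so it can be reused. This is a sketch, not part of the original lesson: the name kg_to_lb is hypothetical, and the default value factor=2.2 uses a Python feature (default arguments) that is not covered above.

```python
def kg_to_lb(weight_kg, factor=2.2):
    """Convert a weight in kg to pounds.

    The default factor of 2.2 is the approximation used earlier in these notes.
    """
    return factor * weight_kg


weight_lb = kg_to_lb(25.82)
print(round(weight_lb, 2))       # 56.8

# The default can be overridden by naming the argument
print(kg_to_lb(10, factor=2))    # 20
```

The text in triple quotes is a docstring. It does not change what the function does, but it is displayed when someone asks for help on the function.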
Portal project data (surveys.csv): https://ndownloader.figshare.com/files/2292172
Import surveys.csv directly from the web into Python:
import pandas as pd
surveys_df = pd.read_csv('https://ndownloader.figshare.com/files/2292172')
Challenge question (Data Frames):
Using our DataFrame surveys_df, try out the attributes & methods below to see what they return.
1. surveys_df.columns
2. surveys_df.shape Take note of the output of shape - what format does it return the shape of the DataFrame in?
3. surveys_df.head() Also, what does surveys_df.head(15) do?
4. surveys_df.tail()
Challenge question (Statistics):
1. Create a list of unique site IDs (“plot_id”) found in the surveys data. Call it site_names. How many unique sites are there in the data? How many unique species are in the data?
2. What is the difference between len(site_names) and surveys_df['plot_id'].nunique()?
Challenge question (Summary Data):
1. How many recorded individuals are female F and how many male M?
2. What happens when you group by two columns using the following syntax and then grab mean values?
grouped_data2 = surveys_df.groupby(['plot_id','sex'])
grouped_data2.mean()
3. Summarize weight values for each site in your data. HINT: you can use the following syntax to only create summary statistics for one column in your data. by_site['weight'].describe()
Challenge question (Make a list):
What’s another way to create a list of species and associated count of the records in the data? Hint: you can perform count, min, etc functions on groupby DataFrames in the same way you can perform them on regular DataFrames.
Challenge question (Plots):
Create a plot of average weight across all species per site.
Create a plot of total males versus total females for the entire dataset.