Bootcamp Site: https://columbiaswc.github.io/2018-08-27-Columbia-B/
Instructors:
- Cesar Arias – [email protected]
- Marii Nyröp – [email protected] | @mnyrop | marii.info
Helpers:
- Jochen Weber –
- Parixit Davé –
- Yanchen Liu – [email protected]
Time | Day 1 | Day 2 |
---|---|---|
09:00 | Automating tasks with the Unix shell | Building programs with Python |
10:30 | Break | Break |
12:00 | Lunch break | Lunch break |
01:00 | Version control with Git | Building programs with Python, Continued |
02:30 | Break | Break |
04:00 | Wrap-up | Wrap-up & post-workshop survey |
04:30 | END | END |
Setup check in:
- Do you have the necessary software (Bash, Git, and Python) installed?
- Do you have Jupyter notebooks installed? (You can check by opening a Terminal and entering the command `jupyter notebook`. This should open a browser.)
- Do you have the demo data for Bash and Python downloaded?
- Do you have a GitHub account?
- Do you have 2 color post-its?
Setup resources:
- https://columbiaswc.github.io/2018-08-27-Columbia-B/#setup
- http://swcarpentry.github.io/shell-novice/setup.html
- http://dhbox.org/ ~> instructions: https://gist.github.com/mnyrop/4f6ff1b8a3737782c12fcedbb8da8966
- What is a command shell and why would I use one?
- Graphical User Interface (GUI) versus Command Line Interface (CLI)
- Read-evaluate-print-loop (REPL)
- Command, flag, argument
- Flexibility and automation
-
pwd
-
ls -F /
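The command, flag, argument vocabulary can be sketched in Python. This is only an illustration: the helper name `describe_command` is hypothetical, and treating every token that starts with `-` as a flag is a simplifying assumption, not how the shell itself parses input.

```python
import shlex

def describe_command(line):
    """Split a shell-style line into its command, flags, and arguments."""
    tokens = shlex.split(line)  # split on whitespace, respecting quotes
    command, rest = tokens[0], tokens[1:]
    # Assumption for this sketch: anything starting with '-' is a flag
    flags = [token for token in rest if token.startswith('-')]
    arguments = [token for token in rest if not token.startswith('-')]
    return {'command': command, 'flags': flags, 'arguments': arguments}

parts = describe_command('ls -F /')
print(parts)
```

Here `ls` is the command, `-F` is the flag that changes its behavior, and `/` is the argument it acts on.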
- How can I move around on my computer?
- How can I see what files and directories I have?
- How can I specify the location of a file or directory on my computer?
- File system hierarchy
- Current working directory
- Root directory
- Home directory
- `man` and `--help`
- Absolute vs. relative path
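The difference between an absolute and a relative path can also be explored from Python. The sketch below uses `pathlib` from the standard library; the home directory `/Users/nelle` is a made-up example, so substitute your own.

```python
from pathlib import PurePosixPath

home = PurePosixPath('/Users/nelle')             # hypothetical home directory
relative = PurePosixPath('Desktop/data-shell')   # relative: no leading slash
absolute = home / relative                       # absolute: starts at the root

print(relative.is_absolute())   # False
print(absolute.is_absolute())   # True
print(absolute)                 # /Users/nelle/Desktop/data-shell
print(absolute.parent)          # the directory that `cd ..` would move to
```

A relative path only makes sense in combination with the current working directory, while an absolute path means the same thing from anywhere.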
-
ls
-
ls -F
-
ls --help
-
man ls
-
ls -j
-
ls -l
-
ls -R
-
ls -F Desktop
-
pwd
-
`cd desktop` or `cd Desktop`
-
cd data-shell
-
cd data
-
cd ..
-
`cd ~/desktop/data-shell/data` or `cd ~/Desktop/data-shell/data`
-
cd /
-
cd ~
-
cd -
- Move into the `data-shell` directory
- Enter the command `ls north-pacific-gyre/2012-07-03/` using tab completion
- How can I create, copy, and delete files and directories?
- How can I edit files?
- Naming conventions for files and directories
- The Nano text editor
- Deleting with `rm` is forever
- Go back to `data-shell` by checking where you are with `pwd` and using `cd`
- Check what's in `data-shell` with `ls -F`
- Make a new directory called 'thesis' with `mkdir thesis`
- Check what's in the directory again with `ls -F`
- Nothing is in `thesis` because it's brand new. Check with `ls -F thesis`
- Move into `thesis` with `cd thesis`
- Use Nano to add and edit a file with `nano draft.txt`. Add some lines of text, then use Ctrl-X to exit.
- Go to your home directory and make a file using touch: `cd ~` followed by `touch my_file.txt`
- Use `ls -l` to inspect the files. How large is `my_file.txt`? Why?
- Move back into `thesis` in `data-shell` with `cd` and remove the draft file with `rm draft.txt`, using tab completion. But be careful!
- Run `ls` to see if the file is still there.
- Re-add the file and move back into `data-shell` with `nano draft.txt`, `ls`, and `cd ..`
- Try removing `thesis` with `rm thesis`. What happens?
- Type out `rm -r thesis` for removing the directory recursively. But don't hit enter! We're not ready to delete it yet.
- Instead try removing the directory safely with `rm -r -i thesis`. Type `y` for each file to delete.
- Make the `thesis` directory in `data-shell` again by checking where you are with `pwd` and using `mkdir thesis`
- Remake the file `draft.txt` with Nano: `nano thesis/draft.txt`
- Change the filename of `draft.txt` to `quotes.txt` using `mv thesis/draft.txt thesis/quotes.txt`
- Check what happened with `ls thesis`
- Move `quotes.txt` into the current working directory with `mv thesis/quotes.txt .`
- See what's in `thesis` with `ls thesis`
- Find `quotes.txt` in the current working directory with `ls`
- How can I combine existing commands to do new things?
- How can I perform the same actions on many different files?
Full Tutorial: https://swcarpentry.github.io/shell-novice/
- What is version control and why should I use it?
- Version control – keeps track of changes and allows for greater control over them
- Manages collaboration and change conflicts
- How do I get set up to use Git?
- Git is the software, GitHub is a popular service for hosting content that is version controlled by Git
- Your local Git needs to be configured to work with your GitHub account
- Configure Git to use your user name with
git config --global user.name "your-username"
- Configure Git to use your email with
git config --global user.email "YOUR_EMAIL_ADDRESS"
- Configure your Git to use Nano as its text editor with
git config --global core.editor "nano -w"
- Check your Git configuration with
git config --list
- Check possible config commands with
git config -h
- Where does Git store information?
- A repository is where your Git project files and the history of all your project’s commits live
- `git init` initializes a repository (and everything in it!)
- `git status` shows the repository's current status
- The Git repository history and info live in a directory called `.git`
-
cd ~/Desktop
-
mkdir planets
-
cd planets
-
`git init` initializes the `planets` repository
- How do I record changes in Git?
- How do I check the status of my version control repository?
- How do I record notes about what changes I made and why?
- Adding and committing
- Commit messages
- Reading the commit history
- Check current directory with `pwd`
- Make sure you are in `~/Desktop/planets` using `cd`
- Use Nano to make a file with `nano mars.txt` and add the sentence 'Cold and dry, but everything is my favorite color' to it before saving
- Use `ls` to list the directory contents
- Use `cat mars.txt` to print out the file's content
- Try `git status` again
- Tell Git to track our new file with `git add mars.txt`
- Try `git status` again
- Commit the changes and give a message about what the changes are with `git commit -m 'start notes on mars as a base'`
- Try `git status` again
- Look at the commit history with `git log`
- Add some additional info to `mars.txt` with `nano mars.txt`, for example 'The two moons may be a problem for Wolfman'
- Try `git status` again
- Try `git diff`
- Try committing the new changes with `git commit -m 'add concerns about the moons and wolfman'`
- Try `git status` again
- Add the file with `git add mars.txt`, then retry the commit with `git commit -m 'add concerns about the moons and wolfman'`
- Add a third piece of info to the file with `nano mars.txt`, e.g., 'The mummy will appreciate the lack of humidity'
- Print it with `cat mars.txt`
- Try `git diff`
- Add the file with `git add mars.txt`
- Commit the changes with `git commit -m 'add mummy climate concerns'`
- Try `git status` again
- Look at the commit history with `git log`
- Try `git log -1`. What happens?
- Try `git log --oneline`
- Try `git log --oneline --all --decorate`
- How can I tell Git to ignore files I don’t want to track?
- There will often be files you don't want Git to track, for security or efficiency reasons
- Ignored files, directories, and file patterns are listed in a `.gitignore` file
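To get a feel for how ignore patterns select files, here is a rough Python approximation. It is not Git's actual matching algorithm: the helper `is_ignored` is hypothetical and only handles the two simple pattern shapes used in these notes (`*.dat` for a file pattern and `results/` for a directory).

```python
from fnmatch import fnmatch

ignore_patterns = ['*.dat', 'results/']

def is_ignored(path, patterns):
    """Rough approximation of .gitignore matching for simple patterns."""
    for pattern in patterns:
        if pattern.endswith('/'):
            # Directory pattern: ignore anything inside that directory
            if path.startswith(pattern):
                return True
        elif fnmatch(path.split('/')[-1], pattern):
            # File pattern: compare against the file name only
            return True
    return False

paths = ['a.dat', 'b.dat', 'results/a.out', 'mars.txt']
ignored = [path for path in paths if is_ignored(path, ignore_patterns)]
print(ignored)
```

Real `.gitignore` files support more syntax (negation with `!`, `**`, anchoring with a leading `/`), so treat this only as a mental model.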
- Make sure we're still in `~/Desktop/planets` with `pwd`
- Make a new directory with `mkdir results`
- Add a few files with `touch a.dat b.dat results/a.out results/b.out`
- Run `git status`
- Make a `.gitignore` file with `nano .gitignore`, with `*.dat` on the first line and `results/` on the second line.
- Print out the file's content with `cat .gitignore`
- Run `git status` again
- Add and commit the `.gitignore` file with `git add .gitignore` and `git commit -m 'ignore data files and results folder'`
- Run `git status` again
- Try running `git add a.dat`. What happens?
- Run `git status --ignored`
- How do I share my changes with others on the web?
- Repository remotes (origin)
- Git Push
- Git Pull
- Log into GitHub
- Add a new public repository called 'planets'
- Copy the 'quick setup' link (it should be 'https://github.com/YOUR_USERNAME/planets.git')
- And paste it into your terminal for the command `git remote add origin 'https://github.com/YOUR_USERNAME/planets.git'`
- Run `git remote -v`
- Run `git push -u origin master` and enter your GitHub password when prompted
- Try `git pull origin master`
- Refresh your repository page on GitHub. What do you see?
- Collaborator permissions
- Git clone
- How can I use version control to collaborate with other people?
- Pair up with a neighbor and decide who will start as the 'owner' and who will start as the 'collaborator'
- Owner: click on the 'Settings' tab on your `planets` repository in GitHub, navigate to 'Collaborators' and add your partner's GitHub username.
- Collaborator: Go to https://github.com/notifications and accept the owner's request to collaborate on their repository
- Collaborator: clone the owner's `planets` repository onto your computer as `OWNER_USERNAME-planets` with `git clone 'https://github.com/OWNER_USERNAME/planets.git' ~/Desktop/OWNER_USERNAME-planets`
- Collaborator: Change directory into the newly cloned repository with `cd`
- Collaborator: Add a new file with `nano pluto.txt` with the content 'It is so a planet!'
- Collaborator: Run `cat pluto.txt`
- Collaborator: Add the file with `git add pluto.txt`
- Collaborator: Commit the changes and add a commit message with `git commit -m 'add notes about Pluto'`
- Collaborator: Push the changes to the owner's remote repository on GitHub with `git push origin master`
- Owner: Refresh the repository page on GitHub
- Owner: Back in the bash terminal, make sure you are in `planets` with `pwd` and `cd` if necessary
- Owner: Pull in the new changes from the remote repository on GitHub with `git pull origin master`
- Owner and Collaborator: switch roles and add another planet
- What do I do when my changes conflict with someone else’s?
- When working on the same files, collaborators can create content conflicts.
- Version control with Git provides means for managing and reconciling conflicts
- Use `git fetch` and `git pull` often to avoid/preempt conflicts
- Collaborator: Create a conflict by adding another line to `mars.txt` with `nano`: "This is a new line in OWNER_USERNAME's copy"
- Collaborator: Push the change to GitHub with `git add mars.txt`, `git commit -m 'add a line to OWNER_USERNAME copy'`, and `git push origin master`
- Owner: Make a change in your own copy of `mars.txt` with `nano`: 'this is a different line added'
- Owner: Push the change to GitHub with `git add mars.txt`, `git commit -m 'add a line in my copy'`, and `git push origin master`. What happens?
- Owner: Pull in collaborator's changes with `git pull origin master`
- Owner: Look at the conflict with `cat mars.txt`
- Owner: Use `nano mars.txt` to reconcile the conflict.
- Owner: Merge the changes by committing them with `git add mars.txt`, `git status`, `git commit -m 'merge in changes from GitHub'`, and finally `git push origin master`
- Collaborator: Pull in the newly reconciled change with `git fetch` and `git pull origin master`
- Collaborator: Check the results with `cat mars.txt`
- How can version control help me make my work more open?
- What licensing information should I include with my work?
- How can I make my work easier to cite?
- Where should I host my version control repositories?
- How can I identify old versions of files?
- How do I review my changes?
- How can I recover old versions of files?
Full Tutorial: https://swcarpentry.github.io/git-novice/
- How can I process tabular data files in Python?
- Variable assignment `variable = value`
- Int, Float, and String types `1`, `1.0`, `'1'`
- Arrays and N-Arrays `[0, 1, 2]` and `[[1, 0],[0, 1],[1, 2]]`
- Print `print(variable)`
- Python libraries (e.g., numpy) `import numpy`
- CSV file (Comma-separated values)
- Indexing and slicing `data[0, 0]` and `data[:3, 10:]`
- IPython mystery functions (`.function`, and `.function?`)
- Add comments with '#' `# this is what this line does`
- Get stats with `numpy.mean(array)`, `numpy.max(array)`, `numpy.min(array)`
- Get stats for a given axis with `numpy.mean(array, axis=0)` or `numpy.mean(array, axis=1)`
- Plot and visualize with `matplotlib.pyplot`
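Indexing and slicing follow the same start-inclusive, stop-exclusive rule in plain Python as in NumPy. As a rough sketch without NumPy, the nested list `table` below (a made-up stand-in for the inflammation data) is sliced row by row; NumPy's `data[:2, 2:]` does the row and column selection in a single step.

```python
# Hypothetical 3 x 4 table: each inner list is one row
table = [
    [0, 1, 2, 3],
    [10, 11, 12, 13],
    [20, 21, 22, 23],
]

first_value = table[0][0]                      # like data[0, 0]
first_two_rows = table[:2]                     # like data[:2, :]
last_two_columns = [row[2:] for row in table]  # like data[:, 2:]
corner = [row[2:] for row in table[:2]]        # like data[:2, 2:]

print(first_value)
print(first_two_rows)
print(last_two_columns)
print(corner)
```

Leaving out the number before the colon starts at the beginning, and leaving out the number after it runs through the end.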
- Make sure you have the demo data downloaded
- Move into the `data` folder with `cd ~/Desktop/swc-python/data`
- Start a Jupyter notebook with `jupyter notebook` and New > Python 3 (Notebook)
- Enter `3 + 5 * 4` into a cell and press Shift+Enter to run it.
- Set a variable `weight_kg = 60`
- Change it to a float with `weight_kg = 60.0`
- Set a variable `weight_kg_text` to the string 'weight in kilograms:' with `weight_kg_text = 'weight in kilograms:'`
- Print out the weight in kilograms with `print(weight_kg)`
- Print out both the text and the weight with `print(weight_kg_text, weight_kg)`
- Print out the weight in pounds as a sentence with `print('weight in pounds: ', 2.2 * weight_kg)`
- Try `print(weight_kg)`. Did the `weight_kg` variable change?
- Try `weight_kg = 65.0` and `print('weight in kilograms is now: ', weight_kg)`. Did the variable `weight_kg` change?
- Set `weight_lb = 2.2 * weight_kg`, then run `print(weight_kg_text, weight_kg, 'and in pounds:', weight_lb)`
- Reassign `weight_kg = 100.0` and run `print('weight in kilograms is now:', weight_kg, 'and weight in pounds is still:', weight_lb)`. Why isn't `weight_lb` updated?
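The last question comes down to when the right-hand side of an assignment is evaluated. A minimal sketch of the idea, reusing the variable names from the steps above (`weight_lb_updated` is a new name added for illustration):

```python
weight_kg = 65.0
weight_lb = 2.2 * weight_kg   # evaluated once; weight_lb stores the resulting number

weight_kg = 100.0             # reassigning weight_kg does not touch weight_lb
print('weight in pounds is still:', weight_lb)

# To get the new value, the calculation has to be run again
weight_lb_updated = 2.2 * weight_kg
print('recomputed weight in pounds:', weight_lb_updated)
```

A variable holds a value, not a formula, so `weight_lb` keeps the number computed from the old `weight_kg` until it is assigned again.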
- Clear the notebook and at the top, run `import numpy`
- Next, load the first sample data file with `numpy.loadtxt(fname='inflammation-01.csv', delimiter=',')`
- Save the loaded file to a variable called `data` with `data = numpy.loadtxt(fname='inflammation-01.csv', delimiter=',')`
- Print out the data with `print(data)`
- Print out the data type for the variable `data` with `print(type(data))`
- Print out the data's `dtype`, or the data type of the items within `data`, using `print(data.dtype)`
- Get the shape of the data (rows, columns) with `print(data.shape)`
- Get the first value in the data using the index of [0, 0]: `print('the first value in data:', data[0, 0])`
- Print out the middle value of `data` with `print('middle value in data:', data[30, 20])`
- Select the first 10 days (columns) from the first 4 patients (rows) with `print(data[0:4, 0:10])`
- Shift the slice to rows 5 through 9 (a different set of patients, same 10 days) with `print(data[5:10, 0:10])`
- You can drop the first number for a short-hand way to slice from the beginning, and drop the second number to slice through the end. Try `small = data[:3, 36:]`, then `print('small is:')` and `print(small)`. (This selects rows 0-2 and columns 36-end)
- Try

      doubledata = data * 2.0
      print('original:')
      print(data[:3, 36:])
      print('doubledata:')
      print(doubledata[:3, 36:])

- Try

      tripledata = doubledata + data
      print('tripledata')
      print(tripledata[:3, 36:])

- Get the mean of the original data with `print(numpy.mean(data))`
- Try a function without an input: `import time`, then `print(time.ctime())` to get the current time
- Get the maximum value, minimum value, and standard deviation with

      maxval, minval, stdval = numpy.max(data), numpy.min(data), numpy.std(data)
      print('maximum inflammation:', maxval)
      print('minimum inflammation:', minval)
      print('standard deviation', stdval)

- Set patient zero to its own variable and print it out

      patient_0 = data[0, :]
      print('maximum inflammation for patient 0:', patient_0.max())

- Skip storing the row as its own variable and do it in one line, this time for patient 2

      print('maximum inflammation for patient 2:', numpy.max(data[2, :]))

- Get the average for each day by averaging across the 0 axis (rows) with `print(numpy.mean(data, axis=0))`
- Get the shape to double check it's doing what we'd expect with `print(numpy.mean(data, axis=0).shape)`
- Print out the average inflammation per patient across all days with the 1 axis (columns) using `print(numpy.mean(data, axis=1))`
- Get a plot started for the data

      import matplotlib.pyplot
      %matplotlib inline
      image = matplotlib.pyplot.imshow(data)
      matplotlib.pyplot.show()

- Try a plot with the average inflammation over time. Is this expected?

      ave_inflammation = numpy.mean(data, axis=0)
      ave_plot = matplotlib.pyplot.plot(ave_inflammation)
      matplotlib.pyplot.show()

- What about the maximum value along the first axis (0)?

      max_plot = matplotlib.pyplot.plot(numpy.max(data, axis=0))
      matplotlib.pyplot.show()

- What about the minimum value along the first axis (0)?

      min_plot = matplotlib.pyplot.plot(numpy.min(data, axis=0))
      matplotlib.pyplot.show()

- Group the plots together in one figure to compare, and start from scratch at the top of your notebook

      import numpy
      import matplotlib.pyplot
      %matplotlib inline

      data = numpy.loadtxt(fname='inflammation-01.csv', delimiter=',')

      fig = matplotlib.pyplot.figure(figsize=(10.0, 3.0))

      axes1 = fig.add_subplot(1, 3, 1)
      axes2 = fig.add_subplot(1, 3, 2)
      axes3 = fig.add_subplot(1, 3, 3)

      axes1.set_ylabel('average')
      axes1.plot(numpy.mean(data, axis=0))

      axes2.set_ylabel('max')
      axes2.plot(numpy.max(data, axis=0))

      axes3.set_ylabel('min')
      axes3.plot(numpy.min(data, axis=0))

      fig.tight_layout()
      matplotlib.pyplot.show()
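The `axis=0` versus `axis=1` distinction used in these plots is easy to mix up. The following pure-Python sketch shows what each one computes on a tiny made-up table (`small_data` is hypothetical, with rows as patients and columns as days). It mirrors the result of `numpy.mean(data, axis=...)`, not how NumPy computes it internally.

```python
from statistics import mean

# Hypothetical data: 2 patients (rows) x 3 days (columns)
small_data = [
    [0, 2, 4],   # patient 0
    [2, 4, 6],   # patient 1
]

# axis=0: collapse the rows, leaving one average per day (column)
mean_per_day = [mean(column) for column in zip(*small_data)]

# axis=1: collapse the columns, leaving one average per patient (row)
mean_per_patient = [mean(row) for row in small_data]

print(mean_per_day)
print(mean_per_patient)
```

The result for `axis=0` has one entry per column, which is why the plots above show one point per day.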
- How can I do the same operations on many different values?
- Loop with `for variable in collection: do things with variable`
- Body of the loop must be indented
- Use `len(variable)` to get the length of an array or string
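As a small sketch of these three points together, the hypothetical function `count_characters` below counts items with a loop and then compares the result with the built-in `len`.

```python
def count_characters(text):
    """Count the characters in text by visiting each one in a loop."""
    count = 0
    for char in text:        # char takes each character in turn
        count = count + 1    # indented: runs once per character
    return count             # not indented under the loop: runs once, after it

word = 'oxygen'
looped = count_characters(word)
print(looped, len(word))
```

Only the indented line belongs to the loop body, which is why the position of `return` matters.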
- Print out the characters in the word 'lead':

      word = 'lead'
      print(word[0])
      print(word[1])
      print(word[2])
      print(word[3])

- Try with 'tin'. What happens?

      word = 'tin'
      print(word[0])
      print(word[1])
      print(word[2])
      print(word[3])

- Try printing the characters with a loop instead:

      word = 'lead'
      for char in word:
          print(char)

- Switch the word to 'oxygen'. Does it work?

      word = 'oxygen'
      for char in word:
          print(char)

- Swap out the variable name. What happens?

      word = 'oxygen'
      for banana in word:
          print(banana)

- Make a loop that updates a variable:

      length = 0
      for vowel in 'aeiou':
          length = length + 1
      print('There are', length, 'vowels')

- Try another loop. Does it do what you'd expect?

      letter = 'z'
      for letter in 'abc':
          print(letter)
      print('after the loop, letter is', letter)

- Use a shortcut to count the vowels: `print(len('aeiou'))`
- How can I store many values together?
- Mutable vs. immutable data (lists and arrays vs strings and numbers)
- In-place modifications can be tricky
- Lists use bracket notation `list = [item, item2, item3]`, and can be indexed `list[0]` and sliced `list[:2]`
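The mutable versus immutable distinction, and why in-place modification can be tricky, can be sketched as follows. The names `alias`, `independent`, and `fixed` are illustrative only.

```python
odds = [1, 3, 5, 7]
alias = odds               # a second name for the SAME list
independent = list(odds)   # a new list with the same items

alias.append(9)            # modifies the one shared list in place
print(odds)                # the change shows up under both names
print(independent)         # the copy is unaffected

name = 'Darwin'
try:
    name[0] = 'd'          # strings are immutable, so this fails
except TypeError as error:
    message = str(error)
    print('TypeError:', message)

fixed = 'd' + name[1:]     # build a new string instead
print(fixed)
```

Assigning a list to a new name never copies it; use `list()` when you need a copy that can change independently.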
- Make a list

      odds = [1, 3, 5, 7]
      print('odds are', odds)

- Select items from the list by index

      print('first and last', odds[0], odds[-1])

- Loop over items in the list

      for number in odds:
          print(number)

- Change an item in a list (fix a typo)

      names = ['Curie', 'Darwing', 'Turing']
      print('names is originally', names)
      names[1] = 'Darwin'
      print('final value of names:', names)

- Now try changing a character in a string. What happens?

      name = 'Darwin'
      name[0] = 'd'

- Modify a list based on a list. Does it do what you'd expect?

      salsa = ['peppers', 'onions', 'cilantro', 'tomatoes']
      my_salsa = salsa
      salsa[0] = 'hot peppers'
      print('Ingredients in my salsa:', my_salsa)

- Try making an independent copy of a list instead

      salsa = ['peppers', 'onions', 'cilantro', 'tomatoes']
      my_salsa = list(salsa)  # list() makes a copy
      salsa[0] = 'hot peppers'
      print('Ingredients in my salsa:', my_salsa)

- Try making a list of lists

      x = [['pepper', 'zucchini', 'onion'],
           ['cabbage', 'lettuce', 'garlic'],
           ['apple', 'pear', 'banana']]

- Print the first line as a list within a list with `print([x[0]])`
- Print the first line as a list with `print(x[0])`
- Make a heterogeneous list with `sample_ages = [10, 12.5, 'Unknown']`
- Add an item to the list `odds`

      odds.append(11)
      print('odds after adding a value:', odds)

- Remove an item from the list `odds` by index

      del odds[0]
      print('odds after removing the first element', odds)

- Reverse the list

      odds.reverse()
      print('odds after reversing', odds)

- Try modifying a list in place. (Remember, lists are mutable!) What happens?

      odds = [1, 3, 5, 7]
      primes = odds
      primes.append(2)
      print('primes:', primes)
      print('odds:', odds)

- Try again by making a copy with `list()`

      odds = [1, 3, 5, 7]
      primes = list(odds)
      primes.append(2)
      print('primes:', primes)
      print('odds:', odds)
- How can I do the same operations on many different files?
- Use the `glob` library for working with files, directories, and file paths.
- Use `glob.glob(pattern)` to create a list of files that match the pattern
- Use `*` in a pattern to match 0 or more characters (of any kind) and `?` to match a single character
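To see the difference between `*` and `?` without touching the workshop data, the sketch below creates a few empty files in a temporary folder and globs them. The file names and the helper `matches` are made up for the demonstration.

```python
import glob
import os
import tempfile

# Hypothetical file names, created in a throwaway folder for the demo
names = ['inflammation-01.csv', 'inflammation-02.csv',
         'inflammation-10.csv', 'small-01.csv', 'notes.txt']

def matches(folder, pattern):
    """Return the sorted file names in folder that match pattern."""
    paths = glob.glob(os.path.join(folder, pattern))
    return sorted(os.path.basename(path) for path in paths)

with tempfile.TemporaryDirectory() as folder:
    for name in names:
        with open(os.path.join(folder, name), 'w'):
            pass  # create an empty file
    star = matches(folder, 'inflammation*.csv')        # * matches any run of characters
    question = matches(folder, 'inflammation-0?.csv')  # ? matches exactly one character
    every_csv = matches(folder, '*.csv')

print(star)
print(question)
print(every_csv)
```

`glob.glob` returns results in arbitrary order, which is why the helper sorts them before returning.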
- Import the `glob` library with `import glob`
- Use `glob.glob` to list the inflammation data files in the current directory with `print(glob.glob('inflammation*.csv'))`
- Make inline graphs for the first 3 inflammation data files

      import numpy
      import matplotlib.pyplot
      %matplotlib inline

      filenames = sorted(glob.glob('inflammation*.csv'))
      filenames = filenames[0:3]
      for f in filenames:
          print(f)

          data = numpy.loadtxt(fname=f, delimiter=',')

          fig = matplotlib.pyplot.figure(figsize=(10.0, 3.0))

          axes1 = fig.add_subplot(1, 3, 1)
          axes2 = fig.add_subplot(1, 3, 2)
          axes3 = fig.add_subplot(1, 3, 3)

          axes1.set_ylabel('average')
          axes1.plot(numpy.mean(data, axis=0))

          axes2.set_ylabel('max')
          axes2.plot(numpy.max(data, axis=0))

          axes3.set_ylabel('min')
          axes3.plot(numpy.min(data, axis=0))

          fig.tight_layout()
          matplotlib.pyplot.show()
- How can my programs do different things based on data values?
- `if`, `elif`, and `else` are conditionals for control flow
- You can combine conditionals with `and` and `or`: `if (a > 1) and (b == 5): # do something`
- `True` and `False` are booleans
- Try out an `if / else`

      num = 37
      if num > 100:
          print('greater')
      else:
          print('not greater')
      print('done')

- Try without an `else`

      num = 53
      print('before conditional...')
      if num > 100:
          print(num, 'is greater than 100')
      print('... after conditional')

- Chain several tests with `elif`

      num = -3
      if num > 0:
          print(num, 'is positive')
      elif num == 0:
          print(num, 'is zero')
      else:
          print(num, 'is negative')

- Combine tests with `and`

      if (1 > 0) and (-1 < 0):
          print('both parts are true')
      else:
          print('at least one part is false')

- Try with `or`

      if (1 < 0) or (-1 < 0):
          print('at least one test is true')
- How can I define new functions?
- What’s the difference between defining and calling a function?
- What happens when I call a function?
- How does Python report errors?
- How can I handle errors in Python programs?
- How can I make my programs more reliable?
- How can I debug my program?
- How can I write Python programs that will work like Unix command-line tools?