@disulfidebond
Created August 31, 2018 02:59
Etherpad from Software Carpentry 2018 Workshop
Welcome to Software Carpentry Etherpad!
This pad is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents.
Use of this service is restricted to members of the Software Carpentry and Data Carpentry community; this is not for general purpose use (for that, try etherpad.wikimedia.org).
Users are expected to follow our code of conduct: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html
All content is publicly available under the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/
Link to workshop website: https://uw-madison-aci.github.io/2018-08-29-uwmadison-swc/
Unix Shell:
Commands we've learned:
pwd - 'print working directory', prints the location your terminal is looking at in your filesystem
ls - 'list', prints the contents of the directory you are in
options:
-l: long format (see info about files)
-h: human readable (makes the size more readable)
-a: shows hidden directories and files
-F: show indicators for files vs. folders
..: show the contents of the directory above my current directory (this is an argument, not an option, so it takes no dash)
<DIRECTORYNAME>: lists a different directory than you are in
<file name>: lists just that file, if it exists in the current directory
man - gives you the manual for a particular command
if "man" does not work, you can try --help after the command ("man" doesn't work on Windows)
cd - 'change directory', changes which directory your terminal is looking at in your filesystem
cd .. - navigate up one directory
cd ~ - navigate to my home directory
cd - - navigate to the last place I was ('previous channel button')
mkdir - make a new directory
nano 'some file' - open 'some file' within the nano command line text editor
'CTRL-x' to exit, then 'y' to save, then enter to keep the name
'CTRL-o' to save without exiting (stands for 'write out')
rm 'some file' - removes 'some file'. This only works on files.
options:
-r 'some directory' - removes 'some directory' and anything it contains.
-i : interactive, prompts you for each item being removed (use 'y' or 'n' for yes and no)
touch <filename>
create an empty file
mv <file> <new-location>
moves a folder or file to a new location, or gives it a new name
cp <file> <new name or location>
copies a file to a new name or new location
Absolute paths start with a /.
Tab Completion <3<3<3<3<3
Hit "tab" when you are typing a command or location to autocomplete something
CTRL-a : jump to the beginning of a line
CTRL-e : jump to the end of a line
Wildcards
* - asterisks can represent any characters
example: *.pdb will mean any files ending in ".pdb"
cat <file> - shows contents in terminal
wc <path> - 'word count', counts the number of lines, words, and characters in a text file
-l count lines only
-c count characters only (strictly speaking, bytes; -m counts characters)
sort <file> - sorts file contents
-n sort numerically, rather than alphabetically
-k <num> sort on column <num>, instead of the first column
-r reverse the sort order
head <file> - shows the top of a text file (default is the first 10 lines)
-n <num> show first num lines
tail <file> - show the end of a text file (default is the last 10 lines)
-n <num> show the last num lines
redirect ( <any_cmd> > <output_filename>) : direct the output that would've gone to the screen to a text file instead
PIPES ( <cmd1> | <cmd2> | <cmd3> ) : strings multiple commands together, passing the output from the previous command to the next
Loops
one-line syntax: for <loop_variable> in <list of items>; do <cmd1>; <cmd2>; ... ; done
- use the value of the loop variable using $ ($loop_variable)
ex: for filename in basilisk.dat unicorn.dat; do head -n 3 $filename; done
ex: for filename in *.dat; do cp $filename original-$filename; done
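For comparison, the second loop can be sketched in Python. This is only a sketch and not part of the workshop material: the function name backup_with_prefix and its parameters are hypothetical, and it assumes the .dat files sit directly inside the given directory.

```python
import glob
import os
import shutil


def backup_with_prefix(directory, pattern="*.dat", prefix="original-"):
    """Copy every file matching pattern to prefix + name, like the cp loop."""
    copies = []
    # sorted() makes the order predictable; glob does not guarantee an order
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        name = os.path.basename(path)
        target = os.path.join(directory, prefix + name)
        shutil.copy(path, target)
        copies.append(target)
    return copies
```

As with the bash loop, running it a second time would also copy the "original-" files, because they match *.dat too.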
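Regular Expressions
these are really handy ways to filter filenames or text (strictly speaking, the filename patterns below are shell wildcards, also called glob patterns; full regular expressions are what tools such as grep use)
*[AB].txt - "all files which end in either A.txt or B.txt"
^ - the "not" operator; placed right after the opening bracket, it negates or flips the effect of the bracket expression
*[^AB].txt - "all files which end in .txt, except those ending in A.txt or B.txt"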
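The same bracket patterns can be tried out in Python with the standard fnmatch module. This is a small sketch with made-up file names. One difference to be aware of: Python's fnmatch writes negation as [!AB], while bash accepts both [!AB] and [^AB].

```python
from fnmatch import fnmatchcase

# Hypothetical file names, chosen only to illustrate the patterns
names = ["NENE01729A.txt", "NENE01729B.txt", "NENE01736Z.txt"]

# Names whose last character before ".txt" is A or B
ends_in_a_or_b = [n for n in names if fnmatchcase(n, "*[AB].txt")]

# Names whose last character before ".txt" is anything except A or B
everything_else = [n for n in names if fnmatchcase(n, "*[!AB].txt")]

print(ends_in_a_or_b)
print(everything_else)
```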
A "quick" guide to regular expression usage with a few nice examples can be found here: http://marvin.cs.uidaho.edu/Handouts/regex.html
echo <text> - prints the text after the command to the screen, nice for examining the values of variables in bash
to write the value of a bash (or for loop) variable to screen: echo $variable_name
Bash Scripts
- bash scripts typically end with ".sh"
- to run a bash script use either "bash <script_name>.sh" or "./<script_name>.sh" (the second form needs execute permission on the script)
script arguments:
- arguments can be provided to a bash script like so: bash <script_name>.sh arg1 arg2 arg3...
- in the script, the value of an argument can be accessed using $1, $2, $3 where those values would be the values of arg1, arg2, arg3 respectively
- if the script tries to access the value of an argument that doesn't exist, it can result in unexpected behavior!!!
comments:
- any line starting with a "#" symbol will be ignored in the script's execution
- comments are a nice way of leaving notes for collaborators (or your future self) about what the script does and how it is used
Unix Questions(write yours below) and answers from instructors:
when i use "LS" I get coloring, when I use "ls" I get the same info without coloring, is that significant? how? (windows)
interesting, this is probably due to a setup file on your system that tells it that "LS" should be colored. Up to you which you'd like to use.
(colors in 'ls' highlight different types of files, and/or permissions)
Is there a shortcut to jump to previous space or '/' similar to ctrl-a or ctrl-e?
`cd -` allows you to move to the previous folder,
is that what you wanted to know ?
I meant more if I'm typing in the command line something like cd 'folder/folder1/folder2/ and I made a typo. Is there shortcut to jump to the previous / to correct it instead of hitting left arrow a bunch?
From google:
Some useful line editing key bindings provided by the Readline library:
Ctrl-A : go to the beginning of line.
Ctrl-E : go to the end of line.
Alt-B : skip one word backward.
Alt-F : skip one word forward.
Ctrl-U : delete to the beginning of line.
Ctrl-K : delete to the end of line.
Alt-D : delete to the end of word.
(The command with Alt based shortcuts are not working for me, the Ctrl based shortcuts are working on my terminal)
(It looks like these also work in jupyter notebook)
Are 'bash goostats' and './goostats' synonymous?
Sorta. They are the same if goostats has execute permissions.
Question from sticky notes: a way to list # of files in a directory would be helpful
`ls | wc -l`
or if you have specific files (example those ending in .txt) you could do `ls *.txt | wc -l`
I use it all the time to make sure I have the same number of output files as input files after my analyses run - Sarah
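If you would rather do this check from inside a Python script, a rough equivalent is sketched below. The function name count_files is hypothetical, and unlike `ls | wc -l` this version counts regular files only, not sub-directories.

```python
from pathlib import Path


def count_files(directory, pattern="*"):
    """Count regular files in directory whose names match pattern."""
    return sum(1 for p in Path(directory).glob(pattern) if p.is_file())
```

For example, count_files(".", "*.txt") plays the role of `ls *.txt | wc -l` for the current directory.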
(Continuing the 'bash goostats' vs './goostats' answer:) Yes, for bash scripts. The './goostats' form can also be used to run other types of scripts/programs that aren't bash.
Python:
Open jupyter notebook (in terminal command line):
jupyter notebook
Correction: "is" does not compare type and value. It checks identity, i.e. whether two names refer to the very same object. To compare values it is best to use `==` in most cases.
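A short illustration of the difference, using made-up variable names:

```python
a = [1, 2, 3]
b = [1, 2, 3]   # a separate list with the same contents
c = a           # a second name for the very same list

print(a == b)   # True: same values
print(a is b)   # False: two different objects
print(a is c)   # True: one object, two names

# "is" is the usual way to test for None
result = None
print(result is None)   # True
```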
If you are working with multi-dimensional data (not tabular), you may be interested in the "xarray" library (http://xarray.pydata.org/en/stable/).
Python Questions(write yours below) and answers from instructors:
If you have a really long list is there a way to search within that list to figure out where (what position) some item is located?
Yes. You can use the `.index` method. Example:
a = [1, 2, 3]
a.index(2)
# returns 1 (for the item in the second position)
# this will raise an exception if the item doesn't exist in the list
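If the item might be missing, one option is to catch the exception. The helper name find_position below is hypothetical and this is just a sketch:

```python
def find_position(items, target):
    """Return the position of target in items, or None if it is absent."""
    try:
        return items.index(target)
    except ValueError:
        # list.index raises ValueError when the item is not in the list
        return None


print(find_position([1, 2, 3], 2))   # 1
print(find_position([1, 2, 3], 9))   # None
```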
How to remove an item from a list
There are a couple of options, here is a blog post that shows a few examples and explains pros and cons http://gloriadwomoh.me/blog/deleting-an-item-from-a-list-in-python-pop-remove-delete/
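As a quick summary of the three built-in options (remove, pop, and del), here is a small sketch using a made-up list:

```python
letters = ["a", "b", "c", "b", "d"]

letters.remove("b")       # removes the first "b" by value
print(letters)            # ['a', 'c', 'b', 'd']

last = letters.pop()      # removes and returns the last item
print(last, letters)      # d ['a', 'c', 'b']

del letters[0]            # removes the item at position 0
print(letters)            # ['c', 'b']
```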
Regarding number precision (previously asked question):
Python built in float numbers are 64-bit (double) floats. If you are concerned with precision (8, 16, 32 bit integers and 32-bit versus 64-bit floats) you may want to look in to the "numpy" python library. Steve is about to talk about the "pandas" library which uses numpy underneath.
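A quick way to see what 64-bit floats mean in practice, using only the standard library (no numpy needed for this sketch):

```python
import math
import sys

bits = sys.float_info.mant_dig        # bits of precision in a python float
exact = (0.1 + 0.2 == 0.3)            # neither side is stored exactly
close = math.isclose(0.1 + 0.2, 0.3)  # compare with a tolerance instead

print(bits)    # 53
print(exact)   # False
print(close)   # True
```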
Still a bit uncertain about when to use single and double brackets and ( ) and [ ].
( ) indicates that the comma-separated values are a tuple, while the [ ] indicates that the comma-separated values are a list. Does that kinda cover it?
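Both kinds of brackets also have a second job, which may be the source of the confusion: ( ) after a name calls a function, and [ ] after a name picks items out. A small sketch with made-up values follows. The guess that "double brackets" refers to nested indexing, or to pandas selections like data[['a', 'b']] where the inner brackets are simply a list, is an assumption about the question.

```python
point = (3, 4)            # parentheses + commas: a tuple (cannot be changed)
values = [3, 4, 5]        # square brackets: a list (can be changed)

print(len(values))        # parentheses after a name: calling a function
print(values[0])          # square brackets after a name: indexing
print(values[0:2])        # ...or slicing

table = [[1, 2], [3, 4]]  # a list of lists
print(table[1][0])        # two pairs of brackets: row 1, then item 0 -> 3
```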
Also when to use no space vs. space after a character.
Do you mean when defining a string? Like "a" vs "a " Or am I missing the crux of the question? :)
Or do you mean `print( "a" )` versus `print("a")`? In this case it doesn't matter except for code style preferences. A good starting point for python style guidelines is PEP8: https://www.python.org/dev/peps/pep-0008/
For two plots, use subplot()
see https://stackoverflow.com/questions/42818361/how-to-make-two-plots-side-by-side-using-python
Git:
Sarah will let you know when the links below come into play
Socrative:
https://b.socrative.com/login/student/
ROOM: SSTEVENS
Info on different config options (eg, text editors): https://uw-madison-aci.github.io/20170830-git-novice/02-setup/
https://github.com/UW-Madison-ACI/countries
countries:
france
brazil
Senegal
Chile
Australia
Spain
Canada
Sweden
Swaziland
China
Egypt
Madagascar
Germany
Iceland
New Zealand
norway
South Korea
Canada
Colombia
Japan
srilanka
United Kingdom
Notes for git:
$ git config --global user.name "name-in-quotes"
$ git config --global user.email "email-address"
$ git config --global color.ui "auto"
$ git config --global core.editor "nano -w"
git init = initialize a repository in current folder
git status = what is the status of the current repo? what branch? what is ready to commit? what is staged? ...
Untracked files: files that you have not yet told your git repo about but which are in its location
git status --ignored - display which files are being ignored
git add <path/to/file/or/directory> = adds one or more files to staging - can be new files or changes already being tracked
git commit -m "description-of-changes" = commits the staged changes to the repository
When the -m flag is missing, git will launch the configured text editor so you can write the message.
Note: the two separate steps of adding (staging) and committing provide flexibility in managing the repository.
git log = show the history of commits with their details (author, date, message...)
git log -1 = show one commit
git log --oneline = show commits as a single line
git log --graph --all --oneline : gives you a pretty graphical representations of your branches and commits
git diff = display the differences between the staged state of files and their most recent edits
Can also see the differences between specific commits
git diff --staged = what is the difference between the staged version and the last committed version
git diff HEAD <file> - what is the difference between current and most recent commit?
git diff HEAD~1 <file> - what is the difference between current version and one version before (minus) the most recent commit?
git diff <commit-id> <file> - what is the difference between current version and the commit with ID <commit-id>?
git checkout <commit-id> <file> = make the current version of <file> the same as it was when it was committed as <commit-id>
git checkout HEAD <file> = check the version out of the repository that is in HEAD
git checkout <branch-name> - switch to <branch-name>
git checkout -b <branch-name> - create a new branch and check it out in one step
.gitignore = a file that tells a git repository which files should not be tracked
Can list individual files or use wildcards
Can create an entry for an entire directory
The .gitignore file itself should be committed to the repository
git branch - display the branches
git branch <name> - create a new branch called <name>
See git checkout <branch>
git branch -d <name> - delete an unused or merged branch.
git branch -D <name> - capital D force-deletes a branch even if it has changes that are not yet merged (use it when those changes should not be merged and can be thrown away).
git merge <other-branch> - merge the changes from a different branch into the currently checked out branch
git remote - list the configured remotes (git remote -v also shows their URLs)
git remote add origin <url> - add a remote connection named "origin" that points to the URL online
"origin" is the convention for the main remote repository
git push <remote-name> <branch-name>
git push origin master - push all the committed changes not yet in origin up to the origin remote repository
(in other words, push from your computer to github)
git clone <url> - create a local copy of a repository from the remote copy at the URL
Will automatically create an "origin" remote entry
to clone a repo into a directory that has a different name than the repo
git clone <url> <local folder name>
example:
On github, we have repo called "planets", but we already have a "planets" folder on our computer where we want to clone the repo.
git clone <planet-repo-url> planets-2
git pull <remote-name> <branch-name>
git pull origin master - update the local copy with the latest commits from the remote branch.
On Github
create new repositories (new)
"Fork" a repository that someone else already has
will create a copy of another person's repo that you can edit that has all the history of the original repo, but any changes you make on your fork won't automatically go into the original repo.
Make a pull request (PR)
when you make changes on your branch or fork, make a pull request into the original repo where you are basically asking "please pull my changes in"
Helpful blog on writing pretty/consistent commit messages: https://chris.beams.io/posts/git-commit/
Git Questions(write yours below) and answers from instructors:
Can you edit files within a git repository outside of the command line (like in some language environment) and then stage and commit those changes?
Yes, as long as the file you change is located in the folder that is the git repo. The file system you see and edit at the command line is the same file system you see in the Finder or file browser. So if you make a change using a GUI program, you will need to commit those changes in your git repo as well. Thanks.
Question was asked if you can add and commit all in one line. The answer is yes.
git commit -a -m "my message"
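with this one, the -a stages ALL modified or deleted files that are already being tracked (new untracked files and files in the .gitignore are not included), even ones you did not mean to commit yet, so this can be a dangerous command
git commit --only <file> -m "message"
this commits only the changes to <file> (which must already be tracked), leaving anything else that is staged out of the commit
the shorter form does the same thing: git commit <file> -m "my message"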
If you would like to join Github Education (gives free private repos among other stuff), you can sign up here: https://education.github.com/ ('join' at the top of page) - I highly recommend signing up!
Why did I need a username and password and the instructor did not get a prompt?
Sarah has her git/computer set up to automatically accept her username and password
Here is how you can set it up too: https://stackoverflow.com/questions/8588768/how-do-i-avoid-the-specification-of-the-username-and-password-at-every-git-push
- Recommend trying this maybe at lunch if you want to set it up
- I personally do not use it because, yes, it is tedious, but it can be a nice catch if I accidentally start to push to a remote that I don't want to.
Auto tab complete git commands:
https://git-scm.com/book/en/v1/Git-Basics-Tips-and-Tricks
At this website, the top section has a file you can download into your home directory (~). Then you'll have to edit your .bashrc file, which is located in your home directory, to add the line "source ~/git-completion.bash". I recommend just doing the very top part of this site (the other stuff is more complicated and less fixable if something gets messed up).
If I have 3 different versions in the history of a repo, does git save 3 copies of that repo, or only the differences between the versions? Similarly, if I have a branch and master, do I have two copies of everything on my disk, or just the one copy, and the differences from master?
Are you asking if you made 3 different branches in a repo and make different changes on all of them?
If you make a new branch, no, there is not a new copy of the files. A branch is just a lightweight pointer to a commit, so creating one costs almost nothing on disk.
This is similar with the commits. Conceptually each commit records a snapshot of the whole project, but a file that did not change is not stored again; the new commit simply points at the copy that is already stored. Git also compresses its storage (similar versions get packed together as differences), so the history takes far less space than full copies would.
Also, when you switch branches, you only have access to the state of the files that are on that branch.
Did that answer your question?
Yup, that got all of my questions. Thanks!
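One way to picture why extra commits and branches do not duplicate unchanged files: git identifies a file's contents by a hash, so identical contents are stored only once. The toy model below illustrates that idea only and is not how git is actually implemented. The file names and the dictionary "store" are made up, and it assumes the default SHA-1 object format.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the id git assigns to a file's contents (a "blob")."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Two hypothetical commits: only notes.txt changes between them
commit_1 = {"data.csv": b"a,b\n1,2\n", "notes.txt": b"first draft\n"}
commit_2 = {"data.csv": b"a,b\n1,2\n", "notes.txt": b"second draft\n"}

store = {}  # toy object store: id -> contents
for snapshot in (commit_1, commit_2):
    for content in snapshot.values():
        store[git_blob_id(content)] = content

print(len(store))  # 3 stored blobs for 4 file versions
```

The unchanged data.csv hashes to the same id in both commits, so it occupies one slot in the store.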
When you want to make changes in a git repo is it best to create a new branch, work on changes there, and then merge that branch with the master?
yes! that is a great and an encouraged workflow
If I want to use code without making modifications, but want to access any new code from the developer, would I clone, or should I fork anyway?
If you do not plan to change the code, just cloning the original repo without forking is a good idea. But forking and cloning shouldn't hurt it either. In both cases, if the original repo changes, you will need to update your local copy with git pull.
Some questions from the morning that are good to answer but I don't plan on covering them this afternoon:
Does Git work with files stored in google drive?
Yes, there is a "hack-ish" way to do this. You can sync your google drive folder on your computer and then use "git init" and set it up as a git repository. When people make changes on a drive file, your git repo should show that there are changes that you need to commit and push.
Keep in mind that (1) these google files are likely binaries so it is harder to determine the differences, and (2) this process is not seamless. Sometimes the google drive doesn't sync properly and you can end up with two copies of the same file. (So I've been told).
Is there a way to fork and create a PR (pull request) from the command line?
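Yes, although "git fork" is not a built-in git command; it comes from add-on tools such as GitHub's "hub". With such a tool installed, you first clone the repository and then run the fork command from inside it: https://www.quora.com/Git-How-can-I-fork-a-repository-using-only-the-command-line
And yes, you can prepare a pull request from the command line, but you have to know what commit you want to start the PR from. Note that "git request-pull" only generates a summary of your changes to send to the maintainer; it does not open a pull request on GitHub itself: https://git-scm.com/docs/git-request-pull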
Clarifying the difference between the "origin" and "upstream" terms:
These are both names for remote instances of a repository and the names are conventions or best practices
In general, the "origin" remote is one that your user will have permission to push changes directly into while the "upstream" remote is one that you do not have permission to push changes into. Therefore, the typical workflow would involve
1. forking someone else's repo (the upstream one), which creates a personal copy that is still remote (on a server)
2. cloning the personal, forked copy on to the computer where you do your work (e.g., your laptop). this process will automatically set the remote personal copy as the "origin"
3. committing changes locally and then pushing them back up to your personal remote copy, which is "origin"
4. submitting a pull request to merge your changes in your origin remote to the upstream
Step 4 may be approved or rejected by the owner of the upstream repo.
What is the difference between cloning to a remote server with git clone and pulling information from github to your computer?
Git repositories can be hosted on any computer. You could have them on a remote server at your workplace or on github (or other git hosting site). They both act as git remotes and can usually be treated in the same way. Github just happens to have a web interface. You can `git clone` to create a local copy of the remote repository, `git pull` to update your local copy with changes on the remote, and `git push` to push your local changes to the remote if you have permission.
What is the difference between clone, pull, and fetch?
Clone will create the repository locally, add the remote (origin), and pull the current version from the remote (default: master branch)
Pull will get the changes from the remote and merge them in to your current branch
Fetch is typically not needed in a "normal" workflow but acts as the synchronization step to let your local git repository know about the changes available on the remote, but it doesn't affect your working directory. In fact, as described in the `git pull --help` information, git pull is actually a combination of `git fetch` and `git merge`.
Be careful when copying and pasting from nano: a "$" at the edge of the screen means the line continues off-screen, so your copy will only include the text up to that point.
import sys
# import the sys library
print(sys.argv)
# sys.argv is a list of values. This command instructs python to print the entire list
print(sys.argv[1])
# sys.argv is a list of values. This command instructs python to only print the value at position 1
# remember that lists always start at 0, not 1
# the 0th item is always the name of the program, you usually will not use this value
If you use print(), python will automatically convert each value inside the parentheses to a string. The errors usually come from combining values yourself before printing: for example, "a" + 1 fails with a TypeError, because python will not automagically join a string and a number.
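A small sketch of the safe ways to mix text and numbers when printing, plus the one that fails (the variable names are made up):

```python
count = 3

print("count is", count)             # print converts each argument itself

joined = "count is " + str(count)    # convert explicitly before joining with +
formatted = f"count is {count}"      # f-strings also convert for you
print(joined)
print(formatted)

try:
    bad = "count is " + count        # joining str and int with + fails
except TypeError as err:
    bad = None
    print("TypeError:", err)
```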
for filename in *.csv # scan the current directory, select all files that end in ".csv", the * is a wildcard that tells Bash: "match anything" + "that ends in csv"
do
    python gdp_plots.py "$filename"
    # run the command "python gdp_plots.py" using the file that was assigned to the $filename variable
    # (the quotes keep filenames that contain spaces in one piece)
done
# python equivalent (needs "import pandas" at the top of the script):
for filename in file_list:
    # for each item in the list "file_list", assign it to the variable filename, then do the following:
    data = pandas.read_csv(filename, index_col = 'counts')
    # read the values in the csv file and store them in the 'data' variable
    ax = data.plot(title=filename)
    # create a plot of the data, and store it in the 'ax' variable
    ax.set_xlabel('Year')
    # assign the x label inside ax to 'Year'
    ax.set_ylabel('GDP Per Capita')
    # assign the y label inside ax to 'GDP Per Capita'
    ax.set_xticks(range(len(data.index)))
    # this is a bit complicated, but with any python command, start from the inside parenthesis and work your way out, similar to solving a math problem
    # len(data.index) gets the number of data rows read in above
    # range(...) turns that number into the positions 0, 1, 2, ... so there is one tick mark per row
    # set_xticks then places the tick marks on the plot stored in ax
# the time command automagically starts when the command runs, and stops when it ends, basically a computer stopwatch.
# then it outputs what would be on the stopwatch screen
What real, sys, and user time means when using the time command: https://stackoverflow.com/questions/556405/what-do-real-user-and-sys-mean-in-the-output-of-time1
if len(sys.argv) == 1:
    # IMPORTANT: sys.argv will always have one value, which is the name of the python script. If the length is 1, then nothing else was added
    # if the length of the list is 1, do the following:
    print('Not enough arguments have been provided')
    print('Usage: python gdp_plots.py <filename>')
    print('Options: -a plot all gdp data in the current directory')
    sys.exit()
    # or use exit()
    # exit the script, do not continue further
if file_list == []:
    # if file_list is empty, do the following
    pass  # replace with what should happen when the list is empty
# other ways to do the same:
# if len(file_list) == 0:   # if the length of the file_list is 0, i.e. empty
#     do something
# if not file_list:         # an empty list counts as false, so "not file_list" is True when it is empty
#     do something
# careful: "if file_list == False:" does NOT work, because an empty list is not equal to False
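To see how the different emptiness checks behave, each one can be evaluated directly on an empty list (a minimal sketch):

```python
file_list = []

equals_empty = (file_list == [])      # True
length_zero = (len(file_list) == 0)   # True
is_falsy = (not file_list)            # True: an empty list counts as false
equals_false = (file_list == False)   # False: a list never equals the value False

print(equals_empty, length_zero, is_falsy, equals_false)
```

The last check is the one to avoid: it stays False whether or not the list is empty.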
# (needs "import sys" and "import glob" at the top of the script)
if sys.argv[1] == '-a':
    # look at the sys.argv. if the value at position 1 is equal to '-a', then do the following:
    file_list = glob.glob("*gdp*.csv")
    # use the glob library to find any files that match using wildcards, this means
    # "match anything" + "match the string gdp" + "match anything" + "match .csv"
    # example: this will match the file helloworld_gdp_file.csv
    # this will not match the file hello_world.csv
else:
    file_list = sys.argv[1:]
    # take all arguments that you've provided, then add them to file_list
    # this will match anything that you've typed after the filename gdp_plots.py
    # this will cause an error later on if what you've typed is not a filename in the directory!
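Putting the argument checks together, the selection logic could be wrapped in a single function so it is easy to test on its own. This is a sketch under assumptions, not the script from the workshop: the function name select_files and the choice to return None when no arguments are given are both hypothetical.

```python
import glob


def select_files(argv):
    """Return the list of files to plot, or None if no arguments were given."""
    if len(argv) == 1:
        # only the script name is present
        return None
    if argv[1] == "-a":
        # sorted() gives a predictable order; glob does not guarantee one
        return sorted(glob.glob("*gdp*.csv"))
    # otherwise treat everything after the script name as a file name
    return argv[1:]


print(select_files(["gdp_plots.py", "a.csv", "b.csv"]))   # ['a.csv', 'b.csv']
print(select_files(["gdp_plots.py"]))                      # None
```

In the real script you would pass sys.argv to the function, and print the usage message and exit when it returns None.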
Standard python docstring conventions:
https://www.python.org/dev/peps/pep-0257/
"Google style": https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html
"Numpy style": https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_numpy.html#example-numpy
More ACI/SWC helpful resources: https://github.com/UW-Madison-ACI/swcarpentry-workflows-in-practice/blob/master/resources.md
Wrap-up slides: https://docs.google.com/presentation/d/1CjkZkwr5E9VEheujYF9BfLCtBwxAmdOOCQGrvJES4fo/edit?usp=sharing