Update: these notes are now here: https://sovacool.dev/posts/2019/12/urssi-winterschool-notes
Slides & other resources: https://github.com/si2-urssi/winterschool
Contents:
- Whether you know it or not, you’re doing software design. Make those decisions with intent & purpose.
- Characteristics of good design
- Firmness: hard to write bugs accidentally
- Suitable for intended purpose
- Interesting & useful to users
- Principles of design:
- Traceability - easy to understand what the software is supposed to do.
- Minimize intellectual distance - keep the design as close to the real-world problem as possible
- Don’t reinvent the wheel. Re-use good design if it’s already a solved problem.
- Accommodate change.
- Fail gracefully.
- Solve easy problems; defer hard ones until they are easy.
- Zen of Python excerpt: “If the implementation is hard to explain, it's a bad idea. If the implementation is easy to explain, it may be a good idea."
- Think about code before you write it
- Identify use cases
- Define goals from use cases
- Split into small, easy pieces
- Define one piece at a time
- Thinking about use-cases before the goal helps you focus on the small, easy-to-solve pieces (exact problem at hand, limit the scope of the problem) without getting bogged down in any grandiose, big-picture ideas.
- Encapsulation
- Isolate unrelated concerns.
- Hide changing things.
- Python details:
- Use the `@property` & `@var.setter` decorators for getters & setters.
- Use the `@classmethod` decorator for different constructors & other methods that work on the class but not instances of the class.
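A minimal sketch of the decorators above. The `Temperature` class and its names are hypothetical, chosen only for illustration:

```python
class Temperature:
    """Hypothetical example class storing a temperature in kelvin."""

    def __init__(self, kelvin):
        self.kelvin = kelvin  # assignment goes through the setter below

    @property
    def kelvin(self):
        """Getter: read like a plain attribute, no parentheses needed."""
        return self._kelvin

    @kelvin.setter
    def kelvin(self, value):
        """Setter: validate before storing, so bad values fail early."""
        if value < 0:
            raise ValueError("temperature in kelvin cannot be negative")
        self._kelvin = value

    @classmethod
    def from_celsius(cls, celsius):
        """Alternative constructor: works on the class, not an instance."""
        return cls(celsius + 273.15)


t = Temperature.from_celsius(25.0)
t.kelvin = 300.0  # calls the setter, which validates the value
```

Callers keep using `t.kelvin` as an ordinary attribute, so validation can be added later without changing the interface (one way to "hide changing things").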
- Environment variables
- `collections.ChainMap`: use it to prioritize program options.
- `os.environ` to access shell environment variables.
- Defaults = some dict
- `combined = ChainMap(cmdline_args, os.environ, defaults)`
- Equivalent of stringing together `dictionary.update` calls, but in reverse: the first mapping listed has the highest priority.
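A small sketch of that priority order. The `DEMO_*` keys and their values are made up for the example, and the environment variable is set in-process to stand in for a shell export:

```python
import os
from collections import ChainMap

# Hypothetical option names, used only for this example.
defaults = {"DEMO_COLOR": "blue", "DEMO_USER": "guest", "DEMO_SIZE": "10"}
os.environ["DEMO_USER"] = "alice"   # pretend the shell exported this
os.environ.pop("DEMO_COLOR", None)  # make sure these two are not set
os.environ.pop("DEMO_SIZE", None)
cmdline_args = {"DEMO_COLOR": "red"}  # pretend these were parsed from argv

# Lookups search the mappings left to right and stop at the first hit.
combined = ChainMap(cmdline_args, os.environ, defaults)
print(combined["DEMO_COLOR"])  # red   (command line wins)
print(combined["DEMO_USER"])   # alice (environment beats defaults)
print(combined["DEMO_SIZE"])   # 10    (falls through to defaults)

# The same result with dict.update, applied in the reverse order.
merged = dict(defaults)
merged.update(os.environ)
merged.update(cmdline_args)
```

Unlike the `update` version, `ChainMap` does not copy anything, so later changes to the underlying mappings are visible through `combined`.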
- Structuring code for readability: see Trey Hunner's blog post, "Craft Your Python Like Poetry".
- Low barrier to entry. Make your code usable & accessible to lots of people.
- Make a runnable sample
- Keep it short; one command if possible.
- slick example: `curl URL/quickstart.sh | bash`
- Clean up after running
- Run it multiple times in a row & it does the exact same thing every time
- Chain of responsibility
- Common interface to handle requests, but user doesn’t need to know which specific method handles the request.
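One possible sketch of a chain of responsibility. The handler classes are hypothetical; the point is that the caller only ever calls `chain.handle(...)`:

```python
class Handler:
    """Base handler: try to process the request, else pass it along."""

    def __init__(self, successor=None):
        self.successor = successor

    def handle(self, request):
        result = self.process(request)
        if result is not None:
            return result
        if self.successor is not None:
            return self.successor.handle(request)
        raise ValueError(f"no handler for {request!r}")

    def process(self, request):
        """Subclasses return a result, or None to decline the request."""
        return None


class IntHandler(Handler):
    def process(self, request):
        if isinstance(request, int):
            return f"int:{request}"
        return None


class StrHandler(Handler):
    def process(self, request):
        if isinstance(request, str):
            return f"str:{request}"
        return None


# The user talks to the head of the chain and never picks a handler.
chain = IntHandler(StrHandler())
```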
- Creational pattern: Builder
- Create various representations of the same object. Abstract construction steps with different implementations of methods for different object variants.
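A rough sketch of the Builder idea, assuming a made-up report object with Markdown and HTML variants:

```python
class ReportBuilder:
    """Abstract construction steps shared by all report variants."""

    def __init__(self):
        self.parts = []

    def add_title(self, text):
        raise NotImplementedError

    def add_paragraph(self, text):
        raise NotImplementedError

    def build(self):
        return "\n".join(self.parts)


class MarkdownBuilder(ReportBuilder):
    def add_title(self, text):
        self.parts.append(f"# {text}")
        return self  # returning self allows method chaining

    def add_paragraph(self, text):
        self.parts.append(text)
        return self


class HtmlBuilder(ReportBuilder):
    def add_title(self, text):
        self.parts.append(f"<h1>{text}</h1>")
        return self

    def add_paragraph(self, text):
        self.parts.append(f"<p>{text}</p>")
        return self


def make_report(builder):
    """Same construction steps, different representation per builder."""
    return builder.add_title("Results").add_paragraph("All tests pass.").build()
```

`make_report(MarkdownBuilder())` and `make_report(HtmlBuilder())` run identical steps but yield different output formats.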
- Structural pattern: Proxy
- Only load something when you actually need it, if it takes a long time to load or is expensive to create. e.g. when loading a web page, the browser displays the text before the images have finished loading, with a blank placeholder where each image will appear.
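A minimal sketch of a lazy-loading proxy, mirroring the image example. `ExpensiveImage` is a stand-in for anything slow to create; all names here are hypothetical:

```python
class ExpensiveImage:
    """Stand-in for an object that is slow or costly to create."""

    load_count = 0  # counts how many times the real object was built

    def __init__(self, path):
        ExpensiveImage.load_count += 1
        self.path = path
        self.pixels = f"<pixels of {path}>"

    def render(self):
        return self.pixels


class ImageProxy:
    """Same interface as ExpensiveImage, but loads it only on first use."""

    def __init__(self, path):
        self.path = path
        self._real = None

    def placeholder(self):
        return f"[loading {self.path}]"

    def render(self):
        if self._real is None:  # create the real object on demand
            self._real = ExpensiveImage(self.path)
        return self._real.render()


proxy = ImageProxy("plot.png")  # cheap: nothing has been loaded yet
```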
- More resources
- “Gang of four” original book on design patterns
- toptal.com python design patterns book
- Module: any python file that contains definitions & statements.
- Package: a collection of modules in the same directory.
- Must contain the `__init__.py` file. (Except for namespace packages...)
- Often this file is empty.
- Python executes this file before anything else when the package is imported.
- Can contain subdirectories with “submodules” containing more Python files and another `__init__.py` file.
- Tests subdirectory for test files (more on pytest later).
- Lots of different ways to import modules.
- Kyle’s preferred way: explicit relative imports
- Uses dot notation (`.` for current path, `..` for one level up)
- DON’T REINVENT THE WHEEL
- Rely on the standard library, numpy, scipy, etc.
- `main()` & `__main__` (Bryan Weber, writes for RealPython)
- Can use a module both as a module AND a script.
- main() is the entry point to the program.
- Import guard example: realpython.com/python-main-function
- `__main__.py`: special use case to execute your package as a script. e.g. pip.
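A short sketch of the import guard. The `greet` function and its argument handling are invented for the example:

```python
import sys


def greet(name):
    """Importable function: usable from other modules without side effects."""
    return f"Hello, {name}!"


def main(argv=None):
    """Entry point: read arguments, do the work, return an exit code."""
    if argv is None:
        argv = sys.argv[1:]
    for name in argv or ["world"]:
        print(greet(name))
    return 0


# Runs only when the file is executed as a script,
# not when it is imported as a module.
if __name__ == "__main__":
    main()
```

Importing the file gives access to `greet` and `main` without printing anything; running it directly calls `main()`.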
- Package management
- pip to install packages on PyPI or from source. `-e` flag for development version.
- The setup.py file (at same level as source directory) tells pip how to install your package.
- See slides for example use.
- See Kyle’s "better example" slide for cool use of `path.abspath` & `path.join` with a `here` variable (kinda like R’s `here` pkg)
- Changelog: keepachangelog.com
- Semantic versioning: semver.org (PEP 440)
- MAJOR.MINOR.PATCH
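A toy sketch of how the three numbers behave, assuming plain `MAJOR.MINOR.PATCH` strings only (no pre-release or build tags). The helper names are hypothetical:

```python
def parse_version(text):
    """Turn 'MAJOR.MINOR.PATCH' into a tuple of ints for comparison."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def bump(text, level):
    """Return the next version; lower-order numbers reset to zero."""
    major, minor, patch = parse_version(text)
    if level == "major":   # incompatible API change
        return f"{major + 1}.0.0"
    if level == "minor":   # backwards-compatible new functionality
        return f"{major}.{minor + 1}.0"
    if level == "patch":   # backwards-compatible bug fix
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown level: {level!r}")


print(bump("1.4.2", "minor"))  # 1.5.0
# Compare as integer tuples, not strings: "1.10.0" is newer than "1.9.3".
print(parse_version("1.10.0") > parse_version("1.9.3"))  # True
```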
- Problem with setup.py: could have malicious code. PyPA has come up with pyproject.toml & flit to get around that. Also easier than using setup.py.
- Also look into cookiecutter templates.
- Think about this at the very beginning so you don’t have to re-organize everything later.
- Workflows
- Centralized workflow
- Only works for really small projects
- Everyone just commits to master 😬
- Feature branching workflow
- Always work in a feature branch.
- Start a pull request before merging to master.
- Delete branches after they’re merged.
- Forking workflow
- Only reason to fork is if you don’t have write access to someone else’s project / when you’re not a core contributor.
- Create a PR when ready to merge.
- Alias `git` to `hub`. https://hub.github.com/
- Extensions to interface with GitHub from the command line.
- Create a GitHub repo from a local git repo: `git create username/reponame`
- Open up the repo in your browser: `git browse`
- Open a new PR: `git pull-request`
- Compare 2 branches: `git compare master..feature-branch`
- If you clone a repo but realize you wanted to fork it: `git fork`
- On branches:
- A branch is just a pointer to a commit. Commits are linked nodes.
- Use pull requests as much as possible.
- Fosters code review.
- Facilitates discussion.
- Can use continuous integration to run tests automatically.
- Someone else should merge your code into master so two sets of eyeballs review each feature.
- Pick one or two people to do a technical review & an end-user review.
- Instead of creating a merge commit, could use rebase to squash all the commits from that branch into one.
- NEVER SEND A PULL REQUEST FROM MASTER.
- Master branches will become incompatible.
- GitHub now warns you if you attempt to do this.
- Never send a large pull request without notice.
- Read the contributing doc.
- Common practice is to ask whether the maintainers want the feature before you work on it.
- Pull requests should be small, digestible changes.
- Make each unit of code simple enough for someone to review & accept.
- Tips:
- Always `git pull` before you start new work.
- Keep branch names descriptive.
- Generously use branches, but delete them when you’re done.
- Use the `hub` extension to make your life easier.
- Material: https://jameshowison.github.io/peer_production_course/docs/additional_git_exercises.html
- Pull requests are communication; make them digestible.
- Note: any time you edit files, that’s a feature, so you should always do that in a branch.
- Maintainer as developer AND champion of the community.
- Create a welcoming & active environment.
- How long ago the last commit was made is really important: it shows whether the project is active.
- “Turn the music on — make it feel like a party!"
- Even when you’re working with people face-to-face, you should document discussions on GitHub.
- How to split pull requests.
- How do you know your code gives the right answers? …what about after you make changes to the code?
- When: ALWAYS
- Where: external test suite
- e.g. inside the `tests/` subdir in the package repo.
- Some tests are better than no tests. But a rigorous test suite is best!
- Why: make sure our results are trustworthy.
- It’s really easy to make subtle mistakes.
- Helps us know that a PR won’t break anything.
- Unit tests are good examples of how a package works.
- What and how
- Tests compare expected vs observed outputs for known inputs.
- You don’t have to have a function written in order to write a test.
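- Use assertions (e.g. `assert exp == obs`).
- Use `math.isclose()` or `np.allclose()` to get around floating point precision.
- Use the `pytest` package.
- `-s` to keep standard output.
- `-k` to run tests matching a substring.
- `-q` for quieter output. To run a specific test file & test function, pass its node ID (e.g. `pytest tests/test_file.py::test_function`, with your own file & function names).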
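A minimal sketch of expected-vs-observed tests, using a made-up `total` function. pytest would collect the `test_*` functions automatically; they are also plain functions that can be called directly:

```python
import math


def total(values):
    """Function under test (hypothetical): sum a list of numbers."""
    return sum(values)


def test_total_integers():
    exp = 6
    obs = total([1, 2, 3])
    assert exp == obs  # exact comparison is fine for integers


def test_total_floats():
    obs = total([0.1, 0.2])
    # obs is 0.30000000000000004, so `obs == 0.3` would fail;
    # compare within a tolerance instead.
    assert math.isclose(obs, 0.3)
```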
- What cases to test
- Interior: precise values don’t matter (just test one of these).
- Edge: beginning or end of range of inputs (test all of these).
- Corner cases: 2 or more edge cases that intersect.
- Pytest test generators
- Decorator to feed lots of inputs to one test function: `@pytest.mark.parametrize`
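A sketch of a parametrized test covering interior and edge cases. It assumes pytest is installed; `clamp` is a hypothetical function under test:

```python
import pytest


def clamp(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(value, high))


@pytest.mark.parametrize(
    "value, expected",
    [
        (5, 5),    # interior: any value inside the range would do
        (0, 0),    # edge: lower bound
        (10, 10),  # edge: upper bound
        (-1, 0),   # just outside the lower bound
        (11, 10),  # just outside the upper bound
    ],
)
def test_clamp(value, expected):
    assert clamp(value, 0, 10) == expected
```

pytest runs `test_clamp` once per tuple and reports each case separately, so one failing input does not hide the others.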
- Types of tests
- Unit test: test individual functions & methods.
- Have to break up your code into small functions.
- Integration test: verify that multiple pieces of the code work together.
- Regression test: confirm that new results match prior results (which are assumed correct).
- Test-driven development (TDD): write your tests before you implement the functions.
- More tips
- Test for consistency with PEP8.
- e.g. `flake8`: linter & style-checker.
- Plugins for your favorite IDE to run it continuously.
- Test that exceptions are raised: `pytest.raises(ExceptionClass)`
- Mocking
- Replace parts of the system with precisely controllable code to specify return values & throw exceptions.
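A sketch combining `pytest.raises` with mocking via the standard library's `unittest.mock`. It assumes pytest is installed; `fetch_temperature` and its `client` object are hypothetical:

```python
from unittest.mock import Mock

import pytest


def fetch_temperature(client, city):
    """Return the temperature for a city using a (hypothetical) client."""
    response = client.get(city)
    if response is None:
        raise KeyError(city)
    return response["temp"]


def test_returns_value():
    client = Mock()
    client.get.return_value = {"temp": 21.5}  # controlled return value
    assert fetch_temperature(client, "Oslo") == 21.5
    client.get.assert_called_once_with("Oslo")


def test_raises_for_unknown_city():
    client = Mock()
    client.get.return_value = None
    with pytest.raises(KeyError):
        fetch_temperature(client, "Atlantis")


def test_propagates_client_error():
    client = Mock()
    client.get.side_effect = ConnectionError("network down")  # forced failure
    with pytest.raises(ConnectionError):
        fetch_temperature(client, "Oslo")
```

The mock replaces the network-facing part, so the tests run quickly and can exercise error paths that are hard to trigger for real.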
- Test coverage
- Percentage of the code (in number of lines) that is touched by tests.
- 100% test coverage doesn’t guarantee that you catch all potential errors; it means every line of code is run by at least one test.
- `pytest-cov` creates a coverage report.
- codecov.io integrates with GitHub.
- Continuous integration
- Ensure all changes to your project pass tests through automated test & build process.
- Services: GitHub Actions, travis, CircleCI, AppVeyor, Jenkins (used by mothur)
- Add the CI badge to your readme: it signals that your tool is being actively maintained.
- See PyTeCK repo as an example of useful badges.
- Tests in the wild: PyTeCK
- https://learngitbranching.js.org/?NODEMO
- Note: `git cherry-pick` keeps the original author information. 😄
- `git rebase` re-writes history to move the branch point. Obviates merge commits, instead makes them fast-forwards.
- `git rebase -i` in interactive mode is a good idea. Allows you to squash commits and clean things up.
- Code review augments testing, but doesn’t replace testing.
- Efficiency, readability, etc. can’t be tested for.
- The purpose is to make the code better. Everyone makes mistakes. There’s no expectation that you’ll do it exactly right the first time.
- By doing code review, you save time down the road.
- Goals:
- Team cohesion.
- Gain shared understanding of the project.
- Get to know teammates’ skills better.
- Code quality.
- Find problems early.
- Get different perspectives.
- Consistency & readability.
- Makes code easier to maintain.
- Personal learning.
- Reading other people’s code & having your code reviewed.
- Mindset:
- Developer:
- Recognize that a code critique is not a personal attack. You are not your code.
- Be ready & willing to learn new things.
- Expect that there will be changes. Remove the fear of making mistakes.
- Be humble.
- Reviewer:
- Don’t assume that your way is the best.
- Make positive comments, not only negative ones.
- Understand why the developer asked you to review the code.
- Focus on the code, not on the author.
- Pick your battles.
- Techniques
- Prioritize things that humans can spot that automated testing can't.
- Readability
- Algorithms
- How we communicate matters (applies in all types of feedback-giving)
- Ask questions where possible.
- e.g. “Have you considered…” -- Maybe they have and there's a good reason for it.
- No personal attacks. It's about the code, not the person!
- Be as specific as possible about how the code could be improved instead of making general statements.
- Put yourself in others' shoes.
- If you wouldn't want to get the comment, you probably shouldn't give it to someone else.
- Explain why you're making the suggestion.
- Checklist
- Before you ask someone to review your code:
- Write tests.
- Make sure the code runs & passes the tests.
- Write comments & other documentation.
- Document any weird edge cases & work-arounds
- Follow the style guide.
- When you review someone else's code:
- Comments are understandable & appropriate.
- DRY up repetitive code.
- Code runs & passes tests.
- Exceptions are handled appropriately.
- Best practices
- Communicate goals of code review.
- Do it early & often.
- Review a small amount of code.
- If it takes longer than 60 minutes to review, that's too much.
- Establish a process for what to do after reviews.
- Is it a hard gate that you have to make the reviewer happy, or are they just suggestions you could choose not to follow?
- Issues you might identify in code review
- Inconsistent style
- Inefficiency
- Unvalidated inputs
- Lack of exception handling
- Why is code review important for research software specifically?
- Just like peer-reviewing publications, we want to make sure the code underlying the science is sound.
- Science depends on the correctness of your code.
- Help spread best-practices & high-level understanding in the scientific community.
- Results may not always be known. There's not always "ground truth" (e.g. in simulations).
- GitHub-specific tips: using Pull Requests for code review (examples: pr-omethe-us/PyKED) (Kyle Niemeyer)
- Use pull request templates.
- Could enforce check boxes like which issue(s) it resolves, that test cases were added, etc.
- Easily view file diffs & add comments right alongside the code. Facilitates conversation.
- You can leave a comment on multiple lines at once.
- Make suggestions for small, easy changes. There's an "insert suggestion" button! (Don't do this for design changes.)
- Under settings > branches, you can protect branches
- e.g. require that a PR has to be reviewed before merging into master.
- More on code owners: https://help.github.com/en/github/creating-cloning-and-archiving-repositories/about-code-owners
- Tool: octobox.io for managing GitHub notifications.
- TLDR: if you make your code public, pick a license and put a `LICENSE` file in your repo.
- Copyright
- Facts & ideas are not copyrightable.
- Expressions of ideas are copyrightable.
- Right of first publication: goes to the first creator even if not explicitly specified.
- You should include a license with all publicly available software code so people know how they can (or can't) use it.
- Or, you can explicitly put work into the public domain, then it's free for anyone & everyone to use & modify.
- Software Licenses
- Types:
- Proprietary
- Free/open source (FOSS, FLOSS, OSS)
- Permissive: BSD 3-clause, MIT
- Copyleft: GPL (the license is "viral")
- Pick an existing license; don't make your own!
- Resource: https://choosealicense.com
- Open Source Initiative (OSI) Licenses
- To call your work "open-source", you have to release it under one of the OSI licenses.
- Non-software: Creative Commons
- Codes:
- BY: Attribution (similar to permissive)
- SA: ShareAlike (similar to copyleft)
- ND: NoDerivatives
- NC: NonCommercial
- e.g. CC BY, CC BY-SA
- CC0: like the public domain version of creative commons.
- More concepts
- Patents: cover ideas & concepts (which copyright doesn't).
- Trademarks: symbols that represent a business or organization.
- Export control: gov't may forbid transfer of code/data/ideas to another country or foreign national.
- HIPAA: cannot share human patient data.
- Archiving & Citing Software
- Services: Zenodo, figshare, something within your University's library (UMich has one)
- Archives your stuff forever and makes it citable with a DOI.
- figshare: company, for-profit...
- Zenodo: run by CERN. Will be around as long as the EU exists.
- Free! Good file-size limits.
- Connects with GitHub! When you turn on Zenodo for your repo, it creates a new DOI when you cut a new release.
- Without proper citations, your work is not reproducible.
- Academia relies on citations for credit.
- Paper: Software Citation Principles
- Software should be "first-class" citations just like other publications.
- How? name, author(s), DOI or other persistent identifier.
- A GitHub link is not a persistent identifier, but it's better than nothing.
- If there's a paper describing it, cite both the paper & the code DOI.
- How can we make our software easily citable?
- Create a DOI (e.g. via Zenodo)
- Include a `CITATION` file in your GitHub repo.
- Tool in development: https://citeas.org (James Howison)
- Web scraper to find the right citation given a package name or website.
- repro-packs (Kyle Niemeyer)
- Lorena Barba: "reproducibility packages (repro-pack)" -- packages associated with papers shared under CC-BY.
- Produce a single repro-pack for an entire paper, containing:
- Code, results, input data (if small enough)
- Figures (vector format)
- Config file, etc
- Upload to FigShare/Zenodo under CC-BY license.
- Cite using the resulting DOI in the associated papers.
- Benefits
- Improve reproducibility & impact of your work.
- Reviewers love it.
- Lets you reuse your figures without violating a journal copyright.
- When published, the journal (one that isn't open access) owns the paper & everything in it that isn't licensed from somewhere else.
- Can include an appendix with statement about the availability of material. Or put it in the methods section.
- Research compendium: make your paper like a package so it's easily-installable. Uses lightweight packaging structure.
- rOpenSci (Karthik Ram)
- rOpenSci: Scientific software for R. Helping researchers write sustainable software tools.
- software-review: rOpenSci Software Peer Review of community-contributed packages
- JOSS got started when rOpenSci realized the need extends beyond R packages.
- dev-guide: https://devguide.ropensci.org/
- pyOpenSci recently got started as the Python version of rOpenSci. (David Nicholson)
- JOSS: Journal of Open Source Software (Kyle Niemeyer)
- Open, no fees.
- If you've already licensed your code & have good documentation, it should take under an hour to submit to JOSS.
- Very short paper to describe the software.
- All the conversation happens on GitHub.
- Uses same structure as JOSE (Journal of Open Source Education).
- Questions from the audience: when to submit as a package (e.g. to JOSS) versus in a repro-pack (to your society journal)?
- If anyone else would ever use it, it should probably be a package.
- If the code is only used for creating a paper, it should just be in the repro-pack.
- If your goal is to write a methods paper, it probably wouldn't go to JOSS.
- If you have the option to submit to a domain journal, do that first instead of JOSS. (Karthik's take)
- JOSS is meant to fill in the gap for people who don't have a place to publish their software.
- This is for getting research credit. But you still need to cite the specific version you used (e.g. from Zenodo) for reproducibility purposes.
- Sidney Bell at Chan Zuckerberg Initiative
- CZI started funding scientific software.
- foundational packages (e.g. scikit-learn, matplotlib, pandas).
- biology domain-specific packages.
- First cycle of funding awarded. Second round closes in Feb.
- Funding awarded to organizations (e.g. NumFOCUS, universities), not people.
- Value of documentation.
- Increases the value & extent of your work, because it becomes understandable by your colleagues.
- Provides provenance for your scientific process.
- Demonstrates your skill & professionalism.
- "A love-letter that you write to your future self."
- It's easier than you think!
- Types:
- user & developer guides
- `README` file accompanied by `LICENSE`, `CITATION`, `CHANGELOG`, etc.
- code comments
- docstring
- for functions & classes
- available within Python via `help()` & easy to parse by Sphinx.
- in-line
- bad: polluting the code with unnecessary information that's already evident from reading the code.
- good: use sparingly to explain reasons behind choices & complicated sections
- self-documenting code
- name things intelligently, so that the name tells you why the thing exists, what it does, and how it's used.
- write really simple functions that do only one thing.
- "A `function` should have a function, not multiple functions."
- follow consistent style.
- generated API documentation
- Tools
- Sphinx: automatically generate documentation
- Set it up with CI to automatically build your documentation website when you make changes.
- Writing docstrings that are compatible with Sphinx:
- Styles: NumPy, Google, reStructuredText...
- Specify parameters, returns, & include a short description
- Easy to get started quickly. See slides for more details.
```shell
# at top-level of repo, same level as package dir
mkdir docs/
cd docs/
sphinx-quickstart
make html
```
- `doctr`: auto-deploy docs to GitHub pages using TravisCI.
- Read the Docs to host your documentation.
- Example: https://github.com/kyleniemeyer/ME373
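A sketch of a NumPy-style docstring of the kind described above. The function is hypothetical; as far as I know, Sphinx reads this style through its `sphinx.ext.napoleon` extension:

```python
def celsius_to_kelvin(celsius):
    """Convert a temperature from degrees Celsius to kelvin.

    Parameters
    ----------
    celsius : float
        Temperature in degrees Celsius.

    Returns
    -------
    float
        Temperature in kelvin.

    Raises
    ------
    ValueError
        If the temperature is below absolute zero.
    """
    if celsius < -273.15:
        raise ValueError("temperature below absolute zero")
    return celsius + 273.15
```

The same text shows up in `help(celsius_to_kelvin)`, so the docstring serves both interactive users and the generated API documentation.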